The K-Zone: Understanding the Linux boot process
This document explains in moderate detail what happens when
a Linux system starts up. As far as possible, I have tried to
separate features which are specific to the various Linux
distributions from those that are generic. Where this isn't
possible -- because the explanation would be too convoluted --
I have used the RedHat set-up as an example. In addition, I
have tended to focus on the Intel/PC platform, for the same
reason.
To make the process easier to follow, I have divided
it into four stages: the `firmware' stage,
the `bootloader' stage, the `kernel' stage,
and the `init' stage. These are my names, and they aren't necessarily
used by other Linux users. Moreover, it isn't always
easy to separate the `firmware' stage from the initial
operations of the bootloader. On the PC platform, the
firmware is so unintelligent that a separate (software)
bootloader is required. On other platforms, notably
Sparc machines, the firmware is quite sophisticated,
and may be able to load a kernel directly.
Stage 1 (firmware stage)
The purpose of a bootloader is to get at least part
of the operating system kernel into memory and running.
After that, the kernel can take over the process.
However, unless the bootloader is in firmware,
to run the bootloader we must first retrieve
it, from disk or wherever else it is stored. The purpose of
the firmware stage, therefore, is to get a bootloader
into memory and run it.
On the Intel/PC platform, the firmware stage
(which does not depend on the operating system)
is governed by the BIOS.
Most modern PCs (and other types of computer, of course)
can boot from floppy disk, hard disk, or CD-ROM.
It is common for Sparc-based systems to have built-in network
bootloaders in firmware but, at present, this is unusual in the PC
world. The BIOS typically provides a mechanism by which the
operator can choose the devices that will be used to boot,
and it will probably be prepared to try more than one if
necessary. The process is slightly different for
the different media types.
Bootloader on floppy disk or hard disk
This is usually the simplest situation.
On a floppy disk, the first sector is reserved as the boot
sector. It must contain executable program code. The
BIOS loads the boot sector into memory and then runs it.
This process is largely the same whatever the hardware platform.
The situation is similar for PC hard disks, except that
it is conventional to divide the hard disk into partitions,
and to provide a boot sector for each partition. In
the world of DOS, the boot sector was, and remains,
combined with the partition table; the partition table
controls how much space is allocated to each partition.
In addition to the partition boot sectors there is
an overall boot sector/partition table
called the `master boot record' (MBR). When booting from
a hard disk formatted this way, the PC BIOS loads the MBR
and executes it as a boot sector; the code in the MBR
will then find which partition to boot from, and load
and run the boot sector from that partition.
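As a rough illustration of what the code in the MBR has to do, the sketch below parses a classic PC partition table and picks out the partition flagged as bootable. The layout used here (four 16-byte entries at offset 446, and a 0x55 0xAA signature in the last two bytes of the sector) is the conventional DOS layout; the function names and the synthetic disk contents are invented for the example.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # four 16-byte entries start here
BOOT_SIGNATURE = b"\x55\xaa"   # last two bytes of a valid boot sector
ENTRY_FORMAT = "<B3xB3xII"     # status, (CHS start), type, (CHS end), LBA start, sector count


def parse_mbr(sector: bytes):
    """Return the four partition entries of a classic PC master boot record."""
    if len(sector) != SECTOR_SIZE or sector[510:512] != BOOT_SIGNATURE:
        raise ValueError("not a valid boot sector")
    entries = []
    for i in range(4):
        offset = PARTITION_TABLE_OFFSET + 16 * i
        status, ptype, lba_start, count = struct.unpack_from(ENTRY_FORMAT, sector, offset)
        entries.append({"active": status == 0x80, "type": ptype,
                        "lba_start": lba_start, "sectors": count})
    return entries


def active_partition(entries):
    """Index of the first partition flagged bootable, or None if there is none."""
    for index, entry in enumerate(entries):
        if entry["active"]:
            return index
    return None


# Synthetic MBR (hypothetical values): the second entry is a Linux
# partition (type 0x83) marked active, starting at sector 63.
mbr = bytearray(SECTOR_SIZE)
struct.pack_into(ENTRY_FORMAT, mbr, PARTITION_TABLE_OFFSET + 16, 0x80, 0x83, 63, 204800)
mbr[510:512] = BOOT_SIGNATURE

table = parse_mbr(bytes(mbr))
print(active_partition(table), hex(table[1]["type"]))
```

Real MBR code does the same job in a few hundred bytes of machine code, and then loads the sector at the chosen partition's starting address.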
Linux has no need to follow the partitioning
convention that is meaningful to DOS/Windows, but if
the hard disk is to be used with more than one operating
system then it is a good idea to do so.
So, when booting from a hard disk the Linux bootloader can be placed
in the MBR, or in a partition boot sector.
In the latter case, it won't be the BIOS that will load the
Linux bootloader, it will be the bootloader on the master
boot record.
Whether the boot disk is a hard disk or a floppy disk,
the first stage of the boot process finds
a boot sector, which will contain the Linux bootloader,
and runs it.
Bootloader on CD-ROM
The ability to boot from a CDROM has been commonplace on
most platforms for some years. On some platforms a bootable
CDROM has the same structure as a bootable hard disk:
a boot sector followed by a load of data. A structure like this
is unworkable for PCs, owing to limitations
in the BIOS specification.
Most modern PCs are, however, able to
boot from a CDROM formatted according to the
El Torito specification. This
process is far more complex
than it ought to be. Because the BIOS can't cope with a full-sized
bootable Linux filesystem on a CDROM,
El Torito requires that the CDROM be provided
with an additional bootable filesystem. This filesystem
is considered
to be `outside' the normal data area of the CDROM, and won't
be visible if the CDROM is mounted as a filesystem in the
usual way. In fact, although the CDROM itself will normally
be formatted with an ISO9660 filesystem, the El Torito bootable image
can be of any filesystem type. In practice, the bootable image will be
formatted as a floppy disk: a boot sector followed by a
filesystem. When booting from the CDROM, the BIOS finds
the bootable filesystem image, loads the boot sector, and
makes the rest of the image available through BIOS calls just
as it does for a floppy disk. As far as the bootloader is
concerned, therefore, the BIOS treats a bootable CDROM
as an ordinary CDROM with an `embedded' bootable floppy disk.
Booting from CDROM is therefore just like booting from a
floppy disk in practice.
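To make the idea of a boot image `outside' the normal data area a little more concrete, here is a minimal sketch of how a program might detect an El Torito boot record in a CD image. It assumes the layout as I understand it from the El Torito specification: a boot record volume descriptor in sector 17, carrying the identifier string and, at byte offset 0x47, the sector number of the boot catalogue. The function name and the synthetic image are invented for the example.

```python
import struct

CD_SECTOR = 2048                 # ISO9660 logical sector size
BOOT_RECORD_SECTOR = 17          # assumed location of the El Torito boot record
EL_TORITO_ID = b"EL TORITO SPECIFICATION"


def boot_catalog_sector(image: bytes):
    """Return the sector number of the boot catalogue, or None if not bootable."""
    start = BOOT_RECORD_SECTOR * CD_SECTOR
    record = image[start:start + CD_SECTOR]
    if len(record) < CD_SECTOR:
        return None
    # Descriptor type 0 (boot record) followed by the ISO9660 magic string
    if record[0] != 0 or record[1:6] != b"CD001":
        return None
    # 32-byte boot system identifier, padded with zero bytes
    if record[7:39].rstrip(b"\x00") != EL_TORITO_ID:
        return None
    (catalog,) = struct.unpack_from("<I", record, 0x47)
    return catalog


# Synthetic 20-sector image whose boot record points at sector 19
image = bytearray(20 * CD_SECTOR)
base = BOOT_RECORD_SECTOR * CD_SECTOR
image[base + 1:base + 6] = b"CD001"
image[base + 6] = 1
image[base + 7:base + 7 + len(EL_TORITO_ID)] = EL_TORITO_ID
struct.pack_into("<I", image, base + 0x47, 19)

print(boot_catalog_sector(bytes(image)))
```

The boot catalogue in turn tells the BIOS where the embedded floppy image lives; that part is not shown here.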
With Linux, this embedded floppy disk is usually formatted
with an ext2 filesystem. As with a floppy disk,
this filesystem will either become the root filesystem for
the next phase of the boot process, or will supply a
new, compressed filesystem which will be loaded into
memory as a `ramdisk' (see below).
The diagram below shows the structure of a typical Linux bootable
CD-ROM (but this isn't the only way to do it). The areas aren't to
scale, of course: the volume descriptors, etc., are only one sector
in length, but the filesystems will be many thousands of sectors.
Notice that there is a complete ext2 filesystem in
the boot filesystem image, along with the boot sector. The boot
sector will normally contain LILO code (see below). The
filesystem contains the kernel and the initial ramdisk (see below),
and the initial ramdisk in turn contains an ext2
filesystem which will become the root filesystem.
Bootloader retrieved from network
The problem with booting from a network is that
the functionality must be supplied in firmware,
because if there is no hard disk, there is no practical
place to load network-boot software from.
Most PCs do not contain firmware this sophisticated,
although some network adaptors
have this functionality. Sparc-based workstations
generally do have network boot functionality, in
the OpenBoot firmware, and it
is quite comprehensive.
Note that there is nothing
to stop a PC getting a bootloader with network capabilities
from, say, a hard disk or CDROM and then using this
to complete the boot process over the network. However,
this is not network booting in the sense I am describing
here.
To get a bootloader via the network, the workstation must
first of all decide where to get it from. This may
be configurable at the firmware level or, more often, the
workstation will issue a broadcast, and then select
a boot server from the replies. Sun Sparc systems typically
make a RARP request, broadcasting their hardware MAC address
(`Ethernet address'). The reply from the server will contain
the IP number assigned to the workstation, and that
of the server itself. The workstation then uses the server's
IP as the target for a TFTP download. Whether this download
retrieves a network-aware bootloader, or a whole kernel,
varies from one system to another. Some Sparc systems
are able to TFTP a Linux kernel and load it; others
require the retrieval of a network-aware bootloader which
then retrieves the kernel (this is how Linux can be made
to run on the Sun Javastation network appliance, which has
somewhat stunted firmware).
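The TFTP part of this exchange is simple enough to sketch. The snippet below builds the read request that a client sends to the boot server, using the packet format from the TFTP standard (a two-byte opcode, then the filename and the transfer mode, each terminated by a zero byte). The helper that names the file after the client's IP address in upper-case hexadecimal reflects a convention that, as far as I know, Sun firmware follows; treat it, and the addresses used, as assumptions.

```python
import socket
import struct

TFTP_RRQ = 1  # opcode for a read request


def tftp_read_request(filename: str, mode: str = "octet") -> bytes:
    """Build the first packet of a TFTP download."""
    return (struct.pack("!H", TFTP_RRQ)
            + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def hex_ip_name(ip: str) -> str:
    """Upper-case hex form of an IPv4 address, e.g. for naming a boot file."""
    return socket.inet_aton(ip).hex().upper()


# Hypothetical client address, as it might be learned from a RARP reply
client_ip = "192.168.0.10"
packet = tftp_read_request(hex_ip_name(client_ip))
print(packet)
```

The packet would be sent by UDP to port 69 on the server whose address came back in the RARP reply; the transfer itself then proceeds in 512-byte blocks, each acknowledged by the client.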
Stage 2 (bootloader stage)
So we've got a bootloader into memory, from disk or network,
and it can be executed. Its job will be to get the kernel
into memory, again either from disk or network, and
execute it. The bootloader will have to supply various
vital pieces of information to the kernel, crucially
the location of its root filesystem.
There are a number of bootloaders
available for Linux: on the Intel/PC platform we have
LILO and GRUB; on Sparc we have SILO. LILO is probably
the best known, and has existed since the earliest
days of Linux. SILO is essentially the Sparc port of
LILO. GRUB is a much more sophisticated proposition.
LILO
LILO is a very rudimentary, single-stage bootloader. It has little
or no knowledge of Linux, and does not understand
the structure of any filesystem. Instead, it reads from the disk
using BIOS calls, supplying numerical values for the locations
on disk of the files it needs. Where does it get
these values from? It has no way to figure them out at
run-time, so the LILO installer has to supply them in
the form of a `map' file. The LILO installer is a
utility called lilo; this utility reads
a configuration file and builds the map file from
it. The location of the map file is then supplied to
the boot sector that lilo installs.
The bootloading process with LILO thus looks something
like this.
- The firmware loads the LILO bootsector and executes it.
- LILO loads its map file using BIOS calls. Using the
map file it finds the location of the boot message, which it
displays to the console, followed by a prompt.
- The user selects which kernel to boot -- if there's
more than one -- at the prompt
- LILO loads the kernel using BIOS calls, based on information
in the map file it loaded earlier
- (optional) LILO loads the initial ramdisk (see below)
- LILO executes the kernel, indicating where it can find
its root filesystem and (if necessary) initial ramdisk
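The essence of the map-file approach can be shown with a toy model. In the sketch below a `disk' is just a byte string, and the `map' is a list of sector numbers recorded at install time; the loader reads those sectors blindly, with no idea what file they belong to. This is not LILO's real map format, which is more involved; all names and values are invented for the example.

```python
SECTOR = 512


def load_by_map(disk: bytes, sectors, length: int) -> bytes:
    """Read the listed sectors in order and trim to the file's length."""
    data = b"".join(disk[n * SECTOR:(n + 1) * SECTOR] for n in sectors)
    return data[:length]


# A pretend kernel of 1280 bytes (2.5 sectors), stored in non-adjacent sectors
kernel = bytes(range(256)) * 5
kernel_map = [5, 2, 6]           # what the installer would record
disk = bytearray(8 * SECTOR)
for i, n in enumerate(kernel_map):
    chunk = kernel[i * SECTOR:(i + 1) * SECTOR]
    disk[n * SECTOR:n * SECTOR + len(chunk)] = chunk

loaded = load_by_map(bytes(disk), kernel_map, len(kernel))
print(loaded == kernel)

# If the file is later moved on disk, the old map silently points at the wrong data
stale = load_by_map(bytes(disk), [5, 2, 7], len(kernel))
print(stale == kernel)
```

The second case is why the lilo installer has to be re-run whenever a kernel is replaced or moved: the boot sector knows only the numbers, not the filename.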
A problem with LILO is that it can be quite tricky to use
it for creating a boot sector for a system different to
the one running the LILO installer (lilo).
The LILO configuration file (usually /etc/lilo.conf)
takes the names of files and devices as its inputs, but
these names are never passed through to the boot sector being
created. The files and devices referenced are simply analysed
for their numerical offsets. For example, if lilo.conf
contains the line
root=/dev/cdrom
and /dev/cdrom is a symbolic link to the
real device file (perhaps /dev/hdc), it
is important to understand that all lilo
will store is the major and minor device identifiers
of /dev/hdc. It is easy to imagine that
if the bootable filesystem you are building contains
a file called /dev/cdrom, and that is
a link to, say, /dev/hdd, then the root
filesystem will be found on /dev/hdd.
But it won't; LILO does not understand filesystems,
and the names in the configuration file are simply
rendered down to device IDs and file sector locations.
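The rendering-down of a device name to a pair of numbers can be sketched as follows. The function root_device_ids is a hypothetical stand-in for what an installer might do on the build machine; the device numbers in the demonstration are the traditional Linux IDE assignments (major 22 for the second IDE channel, minor 0 for the master and 64 for the slave), constructed directly so that the example doesn't depend on any device files being present.

```python
import os


def device_numbers(rdev: int):
    """Split a raw device ID into its (major, minor) pair."""
    return os.major(rdev), os.minor(rdev)


def root_device_ids(path: str):
    """Follow any symbolic links and return (major, minor) for a device file.

    The result depends entirely on the machine running this code,
    not on the filesystem that will eventually be booted.
    """
    real = os.path.realpath(path)
    return device_numbers(os.stat(real).st_rdev)


# Traditional numbering: /dev/hdc is (22, 0) and /dev/hdd is (22, 64)
hdc = os.makedev(22, 0)
hdd = os.makedev(22, 64)
print(device_numbers(hdc), device_numbers(hdd))
```

Only a pair like (22, 0) ends up in the boot sector, so whatever /dev/cdrom points to on the target system is irrelevant.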
GRUB
GRUB is a very different bootloader from LILO. It has
a two-stage or three-stage operation, and has network
boot capabilities (of course, the network boot facilities
don't give you a way to get GRUB itself loaded: you'll still
need network boot firmware).
The additional sophistication of GRUB means that it
can't easily fit into a single boot sector. It
therefore uses a multiple-stage process to load
successively larger amounts into memory. In so
doing it becomes able to understand filesystems,
so the kernel itself, and the other files GRUB
uses, can be specified dynamically at boot time;
there is no need for explicit numerical maps
such as the ones that LILO uses.
In brief, the GRUB boot process looks like this.
- Stage 1: the firmware loads the GRUB boot sector into memory.
This is a standard (512 byte) boot sector and, thus far,
the process is the same as for lilo. Encoded
in the boot sector are the numerical disk block addresses
of the sectors that make up the implementation of the next stage.
GRUB then loads the blocks that are required for the next
stage using BIOS calls.
- Stage 1.5: this stage loads the code that recognizes
real filesystems. The name reflects the fact that, strictly speaking,
it is optional, because GRUB can be set to use numerical block
offsets just like LILO. With stage 1.5 in place, the code for stage 2
is loaded using BIOS calls, but with knowledge of the filesystem.
Typically this code is in the file /boot/grub/stage2.
On my system this program is about 120 kB in size; clearly we
can offer far more sophisticated functionality in a program
of this size than in the 5000-or-so bytes of LILO. The
fact that GRUB loads its second stage as a file, and
not as a list of disk sectors, is the key to its power; LILO
can't do this, so you can't do much with it at boot time.
- Stage 2: GRUB puts up a menu of defined boot options, and
exposes a command-line to the operator. The command line
can be used to load arbitrary files as kernels and ramdisks
(because stage 2 understands filesystems). Each boot option in
the GRUB configuration
file is expressed in terms of GRUB command-line operations.
- GRUB executes the commands entered by the operator, either
from the configuration file or from the command line prompt.
Typical commands are
kernel, which loads a
kernel into memory, initrd, which loads an
initial ramdisk from a file, and boot.
The functionality offered by GRUB is quite similar to the
OpenBoot firmware in Sun workstations, and includes
the ability to retrieve kernels from a server using
TFTP.
Multiple-boot machines
Because Linux was designed to be able to co-exist with
other operating systems, the bootloader should be able
to boot other operating systems on a hard disk as well as Linux.
In practice this is relatively straightforward, as each
of the other operating systems will have its own boot sector.
All the Linux boot loader has to do is to locate the appropriate
boot sector, and execute it. After that, the process will
be under the control of the other system's bootloader.
LILO, GRUB, and SILO all have this functionality.
Stage 3 (kernel stage)
By the time this stage begins, the bootloader will have loaded
the kernel into memory, configured it with the location of
its root filesystem, and loaded the initial ramdisk, if
supplied. How we proceed from here depends to a large
extent on whether we are using an initial ramdisk or not.
So why is an initial ramdisk such a big deal? Well, the
concept arose from attempts to solve the problem of fitting
a fully bootable Linux system onto a single floppy disk.
The problem is that a Linux system that will boot as far
as giving a shell, and offering a few basic utilities,
needs about 8 MB -- far too much to fit onto a floppy.
However, such a system will in practice compress down to about 2 MB
using gzip compression,
so if the root filesystem could be compressed, we could get
a working system onto two standard floppies, or a single
2.88 MB floppy.
Another problem that had to be solved was that of booting
from a floppy disk and then mounting
a root filesystem from a device other than an IDE drive.
SCSI drives were particularly problematic: if the kernel was
compiled to include all the necessary drivers, it would
not fit onto a floppy disk. However, the initial ramdisk
technique allows the drivers to be supplied as loadable
modules, which can be compressed.
In outline, an initial ramdisk is a root filesystem that
is unpacked from a compressed file. The boot loader
will load the compressed version into memory, then
the kernel uncompresses it and mounts it as the
root filesystem. In this
way we can get an 8 MB root filesystem into a 2.88 MB
file. Initial ramdisks are also useful on bootable CDROMs,
because the bootable part of the CDROM is typically
implemented as an `embedded' floppy disk.
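The unpacking step can be mimicked in a few lines. The sketch below recognizes a gzip-compressed image by its two-byte magic number and expands it, passing anything else through untouched. The pretend filesystem contents are made up, and the compression ratio you get depends entirely on the data; this is an illustration of the idea, not of the kernel's actual code.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def unpack_ramdisk(image: bytes) -> bytes:
    """Return the uncompressed filesystem image."""
    if image[:2] == GZIP_MAGIC:
        return gzip.decompress(image)
    return image  # assume it was stored uncompressed


# A stand-in for a root filesystem: repetitive data plus empty blocks
filesystem = (b"root filesystem block " * 64 + bytes(1024)) * 256
compressed = gzip.compress(filesystem)

print(len(filesystem), len(compressed))
print(unpack_ramdisk(compressed) == filesystem)
```

Filesystem images tend to compress well because much of the space in them is unused and therefore full of zeros.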
Stage 3a (common kernel stage)
Whether or not we are using an initial ramdisk,
the kernel will begin initializing itself and the hardware
devices for which support is compiled in. The process
will typically include the following steps.
- Detect the CPU and its speed, and calibrate the delay
loop
- Initialize the display hardware
- Probe the PCI bus and build a table of attached peripherals
and the resources they have been assigned
- Initialize the virtual memory management system,
including the swapper kswapd
- Initialize all compiled-in peripheral drivers; these
typically include drivers for IDE hard disks, serial
ports, real-time clock, non-volatile RAM, and AGP bus.
Other drivers may be compiled in, but it is increasingly
common to compile as stand-alone modules those drivers
that are not required during this stage of the
boot process. Note that drivers must be compiled in if they
are needed to support the mounting of the root filesystem.
If the root filesystem is an NFS share, for example, then drivers
must be compiled in for NFS, TCP/IP, and low-level networking
hardware
If we aren't using an initial ramdisk, then the next step is to
mount the root filesystem.
The kernel can then run
the first true process from the root filesystem
(strictly speaking, kswapd and its
associates are not processes, they are kernel threads).
Conventionally this process is /sbin/init, although the
choice can be overridden by supplying the init= parameter
to the kernel at boot time. The init process runs with
uid zero (i.e., as root) and will be the parent of all
other processes.
Note that kswapd and the other kernel threads have
process IDs but, even though they start before init,
init still has process ID 1. This is to maintain
the Unix convention that init is the first process.
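The way the kernel chooses its first process can be sketched as a small command-line parser. The kernel parameter that selects the program is init=; the sample command lines below are hypothetical, and the sketch ignores the fallback programs that a real kernel tries if the chosen one cannot be run.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into key/value pairs.

    Words without '=' (such as 'ro') are stored with the value True.
    """
    params = {}
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        params[key] = value if sep else True
    return params


def first_process(cmdline: str, default: str = "/sbin/init") -> str:
    """Return the program that would be run as process 1."""
    value = parse_cmdline(cmdline).get("init", default)
    if isinstance(value, str) and value:
        return value
    return default


print(first_process("root=/dev/hda1 ro"))
print(first_process("root=/dev/hda1 ro init=/bin/sh"))
```

On a running system the command line the kernel was actually given can be read from /proc/cmdline.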
Stage 3b (ramdisk kernel stage)
This stage is only relevant if we are using an initial ramdisk.
In this case, the kernel won't involve init,
but will proceed as follows.
- The kernel unpacks the compressed ramdisk into a normal, mountable
ramdisk
- It then mounts the uncompressed ramdisk as a root
filesystem. The original ramdisk memory is freed. It should
be obvious that the kernel must have drivers compiled in to
support whatever filesystem is in the ramdisk, as it won't
be able to load any modules until the root filesystem is
visible.
- The kernel then runs an initialization process. This
process will, in general, not be the standard Unix
init, but a script that will mount the
real root filesystem and then launch the next stage of the
boot process. Conventionally this script is called /linuxrc
but it can be specified to the kernel using the init
parameter. /linuxrc does whatever it needs to, in order to
make the real root filesystem available, probably
including loading some modules.
It then mounts the new root filesystem over the top
of the ramdisk filesystem.
- Conventionally /linuxrc then spawns the
`real' init process. It will typically do this
using the exec command so that init
ends up as process number 1, rather than 2.
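The reason exec matters in the last step is that it replaces the running program without creating a new process, so the process ID is kept. The sketch below demonstrates this in miniature: a child process prints its ID, then replaces itself with a fresh interpreter that prints its own ID. Nothing here is specific to /linuxrc; the helper name is invented, and the example assumes a Unix-like system.

```python
import subprocess
import sys

# Program run in a child: report the PID, then exec a new program
# image that reports the PID again.
CHILD = r'''
import os, sys
print(os.getpid(), flush=True)
os.execv(sys.executable, [sys.executable, "-c", "import os; print(os.getpid())"])
'''


def pids_before_and_after_exec():
    """Run CHILD and return the PID seen before and after the exec."""
    result = subprocess.run([sys.executable, "-c", CHILD],
                            capture_output=True, text=True, check=True)
    before, after = result.stdout.split()
    return int(before), int(after)


before, after = pids_before_and_after_exec()
print(before == after)
```

Had the child started the second program as a new process instead, the two numbers would differ, which is exactly the situation in which init would end up as process 2.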
/linuxrc need not mount a new root filesystem over
the top of the ramdisk root, nor need it load init.
These activities are simply conventions. For example, in order
to boot a full Linux system from a CDROM, a workable proposition
is to retain the initial ramdisk as the root filesystem, and
have /linuxrc mount the CDROM at, say, /usr.
This allows the root filesystem to be read-write; if we mounted
the CDROM at /, the root filesystem would be read-only,
and we would have
to create a separate ramdisk and have a bunch of symbolic links
from the CDROM to parts of that ramdisk.
Similarly, a `rescue' disk -- floppy or CDROM -- would probably not
want to invoke init, but simply put up a root
shell.
If we are using /linuxrc to prepare a root filesystem,
it is a good idea to minimize the amount of initialization code
in it. This is not because it won't work, but because the correct place
for initialization is in the start-up script spawned by
init. Doing initialization here, and not
in /linuxrc enables us to ensure that the
same initialization code is available whether or
not an initial ramdisk is in use.
Stage 4 (init stage)
By now the kernel is loaded, memory management
is running, some hardware is initialized, and
the root filesystem is in place. All subsequent
operations are invoked -- directly or indirectly --
by init.
This process takes its instructions -- again
by default -- from the file /etc/inittab.
inittab specifies at least three important pieces
of information.
- the `runlevel' to enter at startup
- a command to run to perform basic system initialization
(conventionally this is /etc/rc.sysinit)
- the commands to run on entry to and exit from particular runlevels.
The order of operations is that the initialization command
(rc.sysinit) is run first, then the runlevel
scripts. The division of work between rc.sysinit
and the runlevel scripts is entirely a convention. If you are
building a custom Linux system you don't have to follow
this convention. In fact, you don't even have to run init
if it doesn't do what you need.
Stage 4a (rc.sysinit)
This script or executable is responsible for all the one-off
initialization of the system. Linux distributions differ
in the distribution of work between this script and
the runlevel scripts but, in general, the following
initialization steps are likely to be carried out here.
- Configure the system clock from the hardware clock
- Set up keymappings for the console(s)
- Mount the /proc filesystem
- Set up swap space (if there is any)
- Mount and check `local' (i.e., non-network) filesystems
- Run depmod to initialize the module
dependency tree. This is important because it makes it
possible for modprobe to work. The kernel's
module auto-loader refers to modules by name, not by
filename. It also expects that when it tries to load
a module by name, any modules on which it depends can also
be loaded by name. In a custom boot set-up, you may prefer
to load all your modules by filename, and not compile in
the auto-loader at all. This speeds the boot process
considerably. However, you'll lose the flexibility
of dynamically loading and unloading modules for hot-plug devices.
- Initialize and configure network interfaces. This step
usually has to come after the depmod step
or its equivalent, because the network drivers are likely
to be loaded as modules.
- Load drivers for USB, PCMCIA, sound, etc. Again, these steps
probably load or reference modules.
Stage 4b (runlevel scripts)
Let's assume that we will be entering runlevel 5 which, by
convention, gives us a graphical login prompt under the
X server. A typical inittab will have entries
like this:
l5:5:wait:/etc/rc.d/rc 5
x:5:respawn:/etc/X11/prefdm -nodaemon
The first line says that on entry to runlevel 5, invoke a
script called rc, passing the argument `5'.
The second line says that on entry to runlevel 5, run the
script /etc/X11/prefdm -nodaemon.
This latter script is somewhat beyond the scope of this
article, being in the realm of X display management. In
outline, prefdm is a script inserted by
the RedHat installer. It contains code that will launch
the X display manager selected by the user, either at install
time or using a configuration utility. The reason it works this
way is so that configuration utilities don't have to mess about
with inittab, which is a bad file to mess up if
you want your system to keep working. The X display manager
will typically invoke the X server (i.e., the graphical display)
on the local machine and give you a login prompt.
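Each inittab entry has four colon-separated fields: an identifier, the runlevels it applies to, an action, and the command to run. A minimal parser, as a sketch, might look like the following. The two entries discussed above are taken from this article; the initdefault and sysinit lines are assumed, typical examples added so that the sample is complete.

```python
from typing import NamedTuple


class InittabEntry(NamedTuple):
    id: str
    runlevels: str
    action: str
    process: str


def parse_inittab(text: str):
    """Parse inittab text into entries, skipping blanks and comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        ident, runlevels, action, process = line.split(":", 3)
        entries.append(InittabEntry(ident, runlevels, action, process))
    return entries


def default_runlevel(entries):
    """The runlevel named by the initdefault entry, if any."""
    for entry in entries:
        if entry.action == "initdefault":
            return entry.runlevels
    return None


def entries_for(entries, runlevel: str):
    """Entries that apply when entering the given runlevel."""
    return [e for e in entries
            if runlevel in e.runlevels and e.action != "initdefault"]


SAMPLE = """\
# assumed, typical entries
id:5:initdefault:
si::sysinit:/etc/rc.d/rc.sysinit
# entries from the text
l5:5:wait:/etc/rc.d/rc 5
x:5:respawn:/etc/X11/prefdm -nodaemon
"""

entries = parse_inittab(SAMPLE)
print(default_runlevel(entries))
for entry in entries_for(entries, "5"):
    print(entry.action, entry.process)
```

The action field is what distinguishes the two runlevel 5 entries: `wait' means run the command once and wait for it to finish, while `respawn' means restart it whenever it exits.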
But back to the `real' boot process...
The script rc runs the start
scripts in a directory for the runlevel given in inittab.
Usually, runlevel N will correspond to a directory
/etc/rc.d/rcN.d. As we've decided to enter
runlevel 5, the relevant directory is /etc/rc.d/rc5.d.
This directory will contain a
(possibly large) number of scripts with names beginning with
`S' or `K' followed by two digits, e.g., S12syslog.
The digits denote the order in which the scripts are executed:
The `S' scripts are executed in ascending numerical order
on entry to the runlevel (i.e., at boot), and the `K' scripts
are executed in descending order on exit (usually at shutdown).
rc passes the argument `start' to each script
at startup, and `stop' at shutdown. As a result, we don't really
need both `S' and `K' scripts, because we can use the argument
to determine whether we are starting or stopping. Thus it is
a convention on Linux systems that the K scripts are simply
symbolic links to their corresponding S scripts, and the S
scripts do both startup and shutdown operations.
So, for example, when entering runlevel 5, somewhere near the
beginning of the rc process we will execute
S12syslog start
On shutdown, somewhere towards the end of the shutdown process
we will do
K12syslog stop
which is, in fact, an invocation of
S12syslog stop
Inside the script S12syslog -- and most
of the other scripts in that directory -- you will find both
initialization and finalization code.
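The ordering rules just described can be written down in a few lines. This sketch follows the description in this article (S scripts ascending with `start', K scripts descending with `stop'); a real rc script may differ in detail, and the script names other than those for syslog are invented for the example.

```python
import re

# Two-digit sequence number after an S or K prefix, e.g. S12syslog
SCRIPT_NAME = re.compile(r"^([SK])(\d\d)(.+)$")


def plan(names, entering: bool):
    """Return the commands rc would issue, in order.

    entering=True  -> S scripts, ascending, with 'start'
    entering=False -> K scripts, descending, with 'stop'
    """
    prefix, argument = ("S", "start") if entering else ("K", "stop")
    chosen = []
    for name in names:
        match = SCRIPT_NAME.match(name)
        if match and match.group(1) == prefix:
            chosen.append((int(match.group(2)), name))
    chosen.sort(reverse=not entering)
    return [f"{name} {argument}" for _, name in chosen]


directory = ["S55sshd", "S12syslog", "K55sshd", "K12syslog", "README"]
print(plan(directory, entering=True))
print(plan(directory, entering=False))
```

With this ordering the system logger is among the first services started and among the last stopped, which is what you want if other services are to be able to log messages throughout their lives.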
So what do these scripts do? Well, this depends on the runlevel,
and the distribution, and any customizations you have made.
A typical set of operations will include the following:
- Apply firewall settings to IP network interfaces
- Bring up non-IP networking (e.g., IPX, appletalk)
- Start the system logger
- Start the NFS portmapper, lock daemon, etc., and
mount any NFS shares specified in
/etc/fstab.
- Start the power management daemon
- Initialize the auto-mounter
- Initialize the PCMCIA subsystem, loading drivers and daemons
both for the PCMCIA hardware itself, and any cards that are
currently inserted
- Start up the inet daemon (inetd or
xinetd) which will take care of accepting
incoming network connections
- Start the printer daemon
- Start cron
- Start the X font server
The very last step in the boot process will be to run a script
S99local. This is the conventional place to put
machine-specific initialization.
It is considered bad manners to customize any of the initialization
scripts that are supplied as part of a Linux distribution, simply
because other people who may have to manage the system will
have expectations about what is in them. Making arbitrary changes
here will defeat these expectations. However, everybody expects to
see machine-specific configuration in S99local.
Gotchas
It should be clear that the boot process on a fully-featured
Linux system is fairly complex. You can simplify it a great
deal if you are building a custom Linux system, or if you
just want your machine to start up faster. However, there are a
few things to watch out for when constructing a custom
boot process.
- The various stages of the boot process are quite well
separated, particularly the bootloader stage and the kernel
stage. What does this mean? Well, imagine a situation in which
we download a network bootloader, which loads a kernel from
a file server. The kernel then starts up and wants to initialize
its network settings (IP number, etc). Now, the machine had to have
an IP number during the bootloader stage, didn't it? Otherwise,
how would it have been able to do network operations to fetch
the kernel? So one might expect that the kernel could simply get its
IP number from the bootloader. The problem is that Linux bootloaders
don't know how to supply this information to the kernel. Why should
they? The bootloader designers can't anticipate everything that the
kernel might need to know in advance. Therefore the kernel must then
have the machine
find its IP number, etc., again, independently of what the
bootloader may have done. In practice, it's probably going to get the
same IP number, but that makes no difference. This causes
problems for people who want to build fully diskless installations (like
the Javastation example elsewhere on this site). Your network-boot
firmware probably uses RARP or DHCP to find the machine's network
settings, but that doesn't mean that you don't need to include
the same support in the kernel when you build it: the kernel will
have to do it again. When you come to mounting the root filesystem
as an NFS mount, you need to make sure that you have a way to tell
the kernel where the NFS server is (usually via kernel command-line
parameters, but on some dumb systems you have to hard code them
into the kernel before compilation). The kernel has no way to know
whether the machine that replied to the RARP or DHCP request is
going to be the one to supply the root filesystem.
- Another problem, which appears different but is in fact identical,
is that of booting from SCSI devices. So, you have a PC or workstation
that can boot from a SCSI CDROM drive. The firmware loads the
boot sector, which initializes the bootloader, which loads the
kernel. So far so good. Then the kernel takes over. It tries to
mount the SCSI CDROM as a filesystem, but fails. Why? Because
SCSI drivers aren't included in the kernel. It is wrong to
believe that, because the system can read from a CDROM during boot,
the kernel will be able to read from the CDROM. The kernel
won't use BIOS calls to read the CDROM, which is what the bootloader
will probably do. The kernel will use the standard Linux VFS (virtual
filesystem) infrastructure, which will communicate with the SCSI
infrastructure, which will communicate with the low-level SCSI device
driver, which will communicate with the hardware. To boot a kernel
from a SCSI CDROM, you need to make sure that all of these components
are available to the kernel (in the form of modules), or are
compiled in.
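For the first of these gotchas, the usual remedy is to tell the kernel explicitly, on its command line, how to configure the network and where the NFS root is. The sketch below assembles such a command line using the root=, nfsroot= and ip= parameters as I understand them from the kernel's NFS-root documentation. It assumes a kernel built with IP autoconfiguration and NFS-root support; the server address and export path are hypothetical.

```python
# Autoconfiguration methods accepted here (a subset, as an assumption)
AUTOCONF_METHODS = ("dhcp", "bootp", "rarp")


def nfs_root_cmdline(server_ip: str, export_path: str, autoconf: str = "dhcp") -> str:
    """Build kernel parameters for mounting the root filesystem over NFS."""
    if autoconf not in AUTOCONF_METHODS:
        raise ValueError(f"unsupported autoconfiguration method: {autoconf}")
    return f"root=/dev/nfs nfsroot={server_ip}:{export_path} ip={autoconf}"


# Hypothetical server and export path
print(nfs_root_cmdline("192.168.0.1", "/exports/javastation", "rarp"))
```

Naming the NFS server explicitly means the kernel need not assume that the machine which answered the RARP or DHCP request is also the one holding the root filesystem.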
©1994-2006 Kevin Boone, all rights reserved