The K-Zone: File handling in the Linux kernel: device driver layer
Into the device driver
In the previous article I described how requests to read or write
files ended up as queues of requests to read or write blocks on
a storage device. When the device driver is initialized, it supplies
the address either of a function that can read or write a single block
to the device, or of a function that can read or write queues of
requests. The latter is the more usual. I also pointed out that
the generic block device layer provides various utility functions
that drivers can use to simplify queue management and notification.
It's time to look in more detail at what goes on inside a typical
block driver, using the IDE disk driver as an example. Of course,
the details will vary with the hardware type and, to a certain
extent, the platform, but the principles will be similar in most
cases. Before we do that, we need to stop for a while and think
about hardware interfacing in general: interrupts, ports, and
DMA.
Interrupts in Linux
The problem with hardware access is that it's slow. Dead slow. Even
a trivial operation like moving a disk head a couple of tracks takes
an age in CPU terms. Now, Linux is a multitasking system, and it would
be a shame to make all the processes on the system stop and wait
while a hardware operation completes. There are really only two strategies
for getting around this problem, both of which are supported by the
Linux kernel.
The first strategy is the `poll-and-sleep' approach. The kernel thread that
is interacting with the hardware checks whether the operation has finished;
if it hasn't, it puts itself to sleep for a short time. While it is asleep,
other processes can get a share of the CPU. Poll-and-sleep is easy to
implement but it has a significant problem. The problem is that most
hardware operations don't take predictable times to complete. So when the
disk controller tells the disk to read a block, it may get the results
almost immediately, or it may have to wait while the disk spins up and
repositions the head. The final wait time could be anywhere between
a microsecond and a second. So, how long should the process sleep between polls? If it sleeps for a second, then little CPU time is wasted, but every disk
operation will take at least a second. If it sleeps for a microsecond, then
there will be a faster response, but perhaps up to a million wasted
polls. This is far from ideal. However, polling can work reasonably well
for hardware that responds in a predictable time.
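The trade-off is easy to see in a small user-space model. This is only a sketch: the device is simulated as something that becomes ready at a fixed time, the `sleep' just advances a simulated clock, and all the names (sim_device, poll_and_sleep, and so on) are invented for illustration and are not kernel APIs.

```c
#include <assert.h>

/* Hypothetical simulated device: becomes ready at a given time. */
struct sim_device {
    unsigned long ready_at_us;   /* simulated completion time, microseconds */
};

struct poll_result {
    unsigned long polls;         /* how many times the status was checked */
    unsigned long finished_us;   /* simulated time when completion was noticed */
};

/* Check the device, then "sleep" for interval_us of simulated time
 * between checks. In a real kernel, other processes would run
 * during each sleep. */
struct poll_result poll_and_sleep(const struct sim_device *dev,
                                  unsigned long interval_us)
{
    struct poll_result r = { 0, 0 };
    unsigned long now_us = 0;

    assert(interval_us > 0);     /* a zero interval would never advance */
    for (;;) {
        r.polls++;
        if (now_us >= dev->ready_at_us)
            break;               /* operation has finished */
        now_us += interval_us;   /* sleep until the next poll */
    }
    r.finished_us = now_us;
    return r;
}
```

For a simulated operation that takes 5000 microseconds, a one-microsecond interval notices completion on time but needs 5001 polls, while a one-second interval needs only 2 polls but does not notice until the full second has passed.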
The other strategy, and the one that is most widely used, is to use
hardware interrupts. When the CPU receives an interrupt,
provided interrupts haven't
been disabled or masked out, and there isn't another interrupt of
the same type currently
in service, then the CPU will stop what it's doing, and execute a piece of
code called the interrupt handler. In Unix, interrupts are conceptually
similar to signals, but interrupts typically jump right into the kernel.
In the x86 world, hardware interrupts are generally called IRQs, and
you'll see both the terms `interrupt' and `IRQ' used in the names of
kernel APIs.
When the interrupt handler is finished, execution resumes at the point
at which it was broken off. So, what the driver can do is to tell the hardware to
do a particular operation, and then put itself to sleep until the
hardware generates an interrupt to say that it's finished. The driver can
then finish up the operation and return the data to the caller.
Most hardware devices that are attached to a computer are capable of
generating interrupts. Different architectures support different numbers
of interrupts, and have different sharing capabilities. Until recently
there was a significant chance that you would have more hardware than
you had interrupts for, at least in the PC world. However, Linux
now supports `interrupt sharing', at least on compatible hardware.
On the laptop PC I am using to write this, interrupt 9 is
shared by the ACPI (power management) system, the USB interfaces (two of
them) and the FireWire interface. Interrupt allocations can be found
by doing
% cat /proc/interrupts
So, what a typical harddisk driver will usually do, when asked to
read or write one or more blocks of data, will be to write to the
hardware registers of the controller whatever is needed to start
the operation, then wait for an interrupt.
In Linux, interrupt handlers can usually be written in C. This
works because the real interrupt handler is in the kernel's
IO subsystem - all interrupts actually come to the same place.
The kernel then calls the registered handler for the interrupt.
The kernel takes care of matters like saving the CPU register
contents before calling the handler, so we don't need to do
that stuff in C.
An interrupt handler is defined and registered like this:
void my_handler (int irq, void *data,
struct pt_regs *regs)
{
// Handle the interrupt here
}
int irq = 9; // IRQ number
unsigned long flags = SA_INTERRUPT | SA_SHIRQ; // For example
const char *name = "myhardware"; // Appears in /proc/interrupts
void *data = /* ... any data needed by the handler */;
request_irq(irq, my_handler, flags, name, data);
request_irq takes the IRQ number (0-15 on a traditional x86 PC),
a pointer to the handler, some flags, and a name. The name is
nothing more than the text that appears in /proc/interrupts.
The flags dictate two important things -- whether the interrupt
is `fast' (SA_INTERRUPT, see below) and whether it is shareable (SA_SHIRQ).
If the interrupt is not available, or is available but only to a driver
that supports sharing, then request_irq returns a non-zero
status. The last argument to the function is a pointer to an arbitrary
block of data. This will be made available to the handler when the
interrupt arrives, and is a nice way to supply data to the handler.
However, this is a relatively new thing in Linux, and not all the
existing drivers use it.
The interrupt handler is invoked with the number of the
IRQ, the registers that were saved by the interrupt service routine
in the kernel, and the block of data passed when the handler was
registered.
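The sharing rule is easier to follow with a toy model. What follows is a user-space sketch, not the kernel's implementation: sim_request_irq, sim_raise_irq and SIM_SHIRQ are names I have made up to stand in for request_irq, the kernel's dispatch code and SA_SHIRQ. The model assumes that a line can be shared only if every driver on it asked for sharing, and it returns negative errno-style values on failure.

```c
#include <errno.h>
#include <stddef.h>

#define SIM_NR_IRQS   16
#define SIM_MAX_SHARE 4
#define SIM_SHIRQ     0x1UL   /* hypothetical stand-in for SA_SHIRQ */

typedef void (*sim_handler_t)(int irq, void *dev_id);

struct sim_action {
    sim_handler_t handler;
    unsigned long flags;
    const char *name;         /* what /proc/interrupts would show */
    void *dev_id;             /* passed back to the handler */
};

struct sim_action sim_table[SIM_NR_IRQS][SIM_MAX_SHARE];
int sim_count[SIM_NR_IRQS];

/* Register a handler; returns 0 on success, non-zero on failure. */
int sim_request_irq(int irq, sim_handler_t handler, unsigned long flags,
                    const char *name, void *dev_id)
{
    struct sim_action *a;

    if (irq < 0 || irq >= SIM_NR_IRQS || handler == NULL)
        return -EINVAL;
    if (sim_count[irq] > 0) {
        /* Line already in use: both old and new users must share. */
        if (!(flags & SIM_SHIRQ) || !(sim_table[irq][0].flags & SIM_SHIRQ))
            return -EBUSY;
        if (sim_count[irq] == SIM_MAX_SHARE)
            return -EBUSY;
    }
    a = &sim_table[irq][sim_count[irq]++];
    a->handler = handler;
    a->flags = flags;
    a->name = name;
    a->dev_id = dev_id;
    return 0;
}

/* Simulate the interrupt arriving: call every registered handler.
 * Returns the number of handlers called. */
int sim_raise_irq(int irq)
{
    int i;

    if (irq < 0 || irq >= SIM_NR_IRQS)
        return 0;
    for (i = 0; i < sim_count[irq]; i++)
        sim_table[irq][i].handler(irq, sim_table[irq][i].dev_id);
    return sim_count[irq];
}

/* Example handler: dev_id points to an int that counts interrupts. */
void sim_count_handler(int irq, void *dev_id)
{
    (void)irq;
    ++*(int *)dev_id;
}
```

In this model two drivers that both pass SIM_SHIRQ can sit on IRQ 9 together, and each is called when the interrupt is raised. A third driver that omits the flag is refused, as is any attempt to join a line whose first owner did not ask for sharing.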
An important historical distinction in Linux was
between `fast' and `slow' interrupt handlers, and because this continues
to confuse developers and arouse heated debate, it might merit a brief
mention here. In the early days (1.x kernels), the Linux
architecture supported two types of interrupt handler - a `fast' and
a `slow' handler. A fast handler was invoked without the stack
being fully formatted, and without all the registers preserved.
It was intended for handlers that were genuinely fast, and didn't do much.
They couldn't do much, because they weren't set up to.
In order to avoid the complexity of interacting with the interrupt
controller, which would have been necessary to prevent other instances
of the same interrupt entering the handler re-entrantly and breaking it,
a fast
handler was entered with interrupts completely disabled. This was
a satisfactory approach when the interrupts really had to be fast.
As hardware got faster, the benefit of servicing an interrupt with an
incompletely formatted stack became less obvious. In addition, the
technique was developed of allowing the handler to be constructed of
two parts: a `top half' and a `bottom half'. The `top half' was the part of the
handler that had to complete immediately. The top half would do the minimum
amount of work, then schedule the `bottom half' to execute when the
interrupt was finished. Because the bottom half could be pre-empted, it
did not hold up other processes. The meaning of
a `fast' interrupt handler therefore changed: a fast handler completed
all its work within the main handler, and did not need to schedule
a bottom half.
In modern (2.4.x and later) kernels, all these historical features are
gone. Any interrupt handler can register a bottom half and, if it
does, the bottom half will be scheduled to run in normal time when
the interrupt handler returns. You can use the macro mark_bh
to schedule a bottom half; doing so requires a knowledge of kernel
tasklets, which are beyond the scope of this article
(read the comments in include/linux/interrupt.h in the
first instance). The only thing that the `fast handler' flag
SA_INTERRUPT now does is to cause the handler to be
invoked with interrupts disabled. If the flag is omitted, interrupts
are enabled, but only for IRQs different to the one currently being
serviced. One type of interrupt can still interrupt a different type.
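The split between the two halves can be sketched in user space. This is a model of the idea only, under my own assumptions: the sim_ names are invented, the `pending' mask is a plain variable rather than anything interrupt-safe, and sim_run_bh stands in for whatever the kernel does after the handler returns.

```c
#include <stddef.h>

#define SIM_BH_MAX  8
#define SIM_DISK_BH 0            /* hypothetical bottom-half number */

typedef void (*sim_bh_t)(void);

sim_bh_t sim_bh_table[SIM_BH_MAX];
unsigned int sim_bh_pending;     /* one bit per bottom half */

/* Install a bottom-half routine in slot nr. */
void sim_init_bh(unsigned int nr, sim_bh_t fn)
{
    if (nr < SIM_BH_MAX)
        sim_bh_table[nr] = fn;
}

/* Ask for bottom half nr to be run later (the role mark_bh plays). */
void sim_mark_bh(unsigned int nr)
{
    if (nr < SIM_BH_MAX)
        sim_bh_pending |= 1u << nr;
}

/* Run every pending bottom half once; returns how many ran. */
int sim_run_bh(void)
{
    unsigned int nr;
    int ran = 0;

    for (nr = 0; nr < SIM_BH_MAX; nr++) {
        if ((sim_bh_pending & (1u << nr)) && sim_bh_table[nr] != NULL) {
            sim_bh_pending &= ~(1u << nr);
            sim_bh_table[nr]();
            ran++;
        }
    }
    return ran;
}

int sim_raw_events;              /* bumped by the top half */
int sim_processed;               /* updated by the bottom half */

/* Bottom half: the slow part, run outside the interrupt. */
void sim_disk_bottom_half(void)
{
    sim_processed += sim_raw_events;
    sim_raw_events = 0;
}

/* Top half: do the minimum, then defer the rest. */
void sim_disk_top_half(void)
{
    sim_raw_events++;
    sim_mark_bh(SIM_DISK_BH);
}
```

If the top half runs twice before sim_run_bh is called, the bottom half still runs only once and finds both events waiting for it. That is the point of the design: the urgent part is short, and the deferred part can batch up the work.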
Port IO in Linux
Interrupts allow the hardware to wake up the device driver when it
is ready, but we need something else to send and receive data
to and from the device. In some computer architectures, peripherals
are commonly mapped into the CPU's ordinary address space. In such a
system, reading
and writing a peripheral is identical to reading and writing memory,
except that the region of `memory' is fixed. Most architectures on
which Linux runs do support the `memory mapping' strategy, although
it is no longer widely used.
In a sense, DMA (direct memory access) is perhaps a more subtle way
of achieving the same effect. Most architectures provide separate
address spaces for IO devices (`ports') and for memory, and most
peripherals are constructed to make use of this form of addressing.
Typically the IO address
space is smaller than the memory address space -- 64 kBytes is quite
a common figure. Different CPU instructions are used to read and write
IO ports, compared to memory. To make port IO as portable as possible,
the kernel source code provides macros for use in C that expand into
the appropriate assembler code for the platform. So we have, for
example, outb to output a byte value to a port, and
inl to input a long (32-bit) value from a port.
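Real port IO needs privileged instructions, so the sketch below models the port space as a 64 kByte array in user space. The sim_ functions are invented for illustration, although they keep the kernel's argument order (value first, then port). Treating a 32-bit access as four consecutive byte ports, low byte first, is an assumption of the model; real devices decide for themselves what a wide access means.

```c
#include <assert.h>
#include <stdint.h>

#define SIM_PORT_SPACE 0x10000u  /* 64 kBytes of IO address space */

uint8_t sim_ports[SIM_PORT_SPACE];

/* Output one byte to a port. */
void sim_outb(uint8_t value, uint16_t port)
{
    sim_ports[port] = value;
}

/* Input one byte from a port. */
uint8_t sim_inb(uint16_t port)
{
    return sim_ports[port];
}

/* Output a 32-bit value, low byte at the lowest port number. */
void sim_outl(uint32_t value, uint16_t port)
{
    int i;

    assert((uint32_t)port + 3 < SIM_PORT_SPACE);
    for (i = 0; i < 4; i++)
        sim_ports[port + i] = (uint8_t)(value >> (8 * i));
}

/* Input a 32-bit value assembled from four consecutive byte ports. */
uint32_t sim_inl(uint16_t port)
{
    uint32_t value = 0;
    int i;

    assert((uint32_t)port + 3 < SIM_PORT_SPACE);
    for (i = 3; i >= 0; i--)
        value = (value << 8) | sim_ports[port + i];
    return value;
}
```

Writing 0x12345678 with sim_outl to port 0x300 leaves 0x78 readable with sim_inb at 0x300 and 0x12 at 0x303, and sim_inl at 0x300 returns the original value.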
Device drivers are encouraged to use the function request_region
to reserve a block of IO ports. For example:
if (!request_region (0x300, 8, "mydriver"))
{ printk ("Can't allocate ports\n"); }
This prevents different drivers trying to control the same devices. Ports
allocated this way appear in /proc/ioports. Note that some
architectures allow port numbers to be dynamically allocated at boot time, while
others are largely static. In the PC world, most systems now support
dynamic allocation, which makes drivers somewhat more complicated to
code, but gives users and administrators an easier time.
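The bookkeeping behind a reservation scheme like this amounts to an overlap check. Here is a user-space sketch of that idea; sim_request_region and the fixed-size table are my own inventions and say nothing about how the kernel actually stores its resources. Like the call shown earlier, the function yields a non-NULL pointer on success and NULL on failure.

```c
#include <stddef.h>

#define SIM_MAX_REGIONS 16

struct sim_region {
    unsigned long start;     /* first port in the block */
    unsigned long end;       /* last port in the block (inclusive) */
    const char *name;        /* what /proc/ioports would show */
};

struct sim_region sim_regions[SIM_MAX_REGIONS];
int sim_nregions;

/* Reserve n ports starting at start. Returns the new region,
 * or NULL if the range is empty, the table is full, or any port
 * in the range already belongs to another region. */
const struct sim_region *sim_request_region(unsigned long start,
                                            unsigned long n,
                                            const char *name)
{
    unsigned long end;
    int i;

    if (n == 0 || sim_nregions == SIM_MAX_REGIONS)
        return NULL;
    end = start + n - 1;
    for (i = 0; i < sim_nregions; i++) {
        /* Two inclusive ranges overlap if each starts before
         * the other ends. */
        if (start <= sim_regions[i].end && sim_regions[i].start <= end)
            return NULL;
    }
    sim_regions[sim_nregions].start = start;
    sim_regions[sim_nregions].end = end;
    sim_regions[sim_nregions].name = name;
    return &sim_regions[sim_nregions++];
}
```

Once a driver has reserved the eight ports from 0x1F0, a second request for four ports at 0x1F4 fails, while a request for the block starting at 0x1F8 succeeds.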
DMA in Linux
The use of DMA in Linux is a big subject in its own right, and one whose
details are highly architecture-dependent. I only intend to deal with it
in outline here. In short, DMA (direct memory access) provides a
mechanism by which peripheral devices can read or write main memory
independently of the CPU. DMA is usually much faster than a scheme where
the CPU iterates over the data to be transferred, and moves it byte-by-byte
into memory (`programmed IO').
In the PC world, there are two main forms of DMA. The
earlier form, which has been around since the PC-AT days, uses a dedicated
DMA controller to do the data transfer. The IO device and the memory are
essentially passive. This scheme was considered very fast in the early 1990s,
but worked only over the ISA bus. These days, many peripheral devices
that can take part in DMA use bus mastering. In bus mastering DMA,
it is the peripheral that takes control of the DMA process, and a specific
DMA controller is not required.
A block device driver that uses DMA is usually not very different from
one that does not. DMA, although faster than programmed IO, is unlikely
to be instantaneous. Consequently, the driver will still have to schedule
an operation, then receive an interrupt when it has completed.
The IDE disk driver
We are now in a position to look at what goes on inside the IDE disk
driver. IDE disks are not very smart -- they need a lot of help from
the driver. Each read or write operation will go through various stages,
and these stages are coordinated by the driver. You should also remember
that a disk drive has to do more than simply read and write, but I
won't discuss the other operations here.
When the driver is initialized, it probes the hardware and, for each
controller found, initializes a block device, a request queue
handler, and an interrupt handler
(ide-probe.c). In outline, the
initialization code for the first IDE controller
looks like this:
// Register the device driver with the kernel. For
// first controller, major number is 3. The name
// `ide0' will appear in /proc/devices. ide_fops is a
// structure that contains pointers to the open,
// close, and ioctl functions (see previous article).
devfs_register_blkdev (3, "ide0", ide_fops);
// Initialize the request queue and point it to the
// request handler. Note that we aren't using the
// kernel's default queue this time
request_queue_t *q = /* ... create a queue */;
blk_dev[3].queue = q; // install it in kernel
blk_init_queue(q, do_ide_request);
// Register an interrupt handler
// The real driver works out the IRQ number, but it's
// usually 14 for the first controller on PCs
// ide_intr is the handler
// SA_INTERRUPT means call with interrupts disabled
request_irq(14, &ide_intr, SA_INTERRUPT, "ide0",
/*... some drive-related data */);
// Request the relevant block of IO ports; again
// 0x1F0 is common on PCs.
request_region (0x01F0, 8, "ide0");
The interrupt handler ide_intr()
is quite straightforward, because it delegates the
processing to a function defined by the pointer
handler:
void ide_intr (int irq, void *data, struct pt_regs *regs)
{
// Check that an interrupt is expected
// Various other checks, and eventually...
handler();
}
We will see how handler gets set shortly.
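The delegation pattern itself is simple enough to model in user space. In this sketch the names are invented, and I have assumed the pointer is `one-shot': the dispatcher clears it before calling through it, so an interrupt that arrives when nothing has been armed is counted as unexpected rather than being passed to a stale function.

```c
#include <stddef.h>

typedef void (*sim_intr_fn)(void *drive);

sim_intr_fn sim_handler;     /* armed by the driver before each command */
int sim_unexpected;          /* interrupts that nobody was waiting for */

/* Called by the driver just before it starts a hardware operation. */
void sim_set_handler(sim_intr_fn fn)
{
    sim_handler = fn;
}

/* The registered interrupt handler: checks that an interrupt is
 * expected, then hands over to whatever function was armed. */
void sim_ide_intr(int irq, void *data)
{
    sim_intr_fn fn = sim_handler;

    (void)irq;
    if (fn == NULL) {
        sim_unexpected++;    /* nothing armed: ignore it */
        return;
    }
    sim_handler = NULL;      /* one-shot: must be re-armed next time */
    fn(data);
}

/* Example completion function: data points to a counter. */
void sim_read_done(void *drive)
{
    ++*(int *)drive;
}
```

The benefit of the indirection is that a single registered handler can serve reads, writes and other commands, because the driver chooses the completion function at the moment it starts each operation.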
When requests are delivered to the driver, the method
do_ide_request is invoked. This
determines the type of the request, and whether the
driver is in a position to service the request. If it
is not, it puts itself to sleep for a while. If the
request can be serviced, then
do_ide_request() calls the appropriate
function for that type of request. For a read or write request,
the function is
__do_rw_disk() (in
drivers/ide/ide-disk.c).
__do_rw_disk()
tells the IDE
controller which blocks to read, by calculating the
drive parameters and outputting them to the control
registers. It is a fairly long and complex function,
but the part that is important for
this discussion (rather simplified) looks like this:
int block = // block to read, from the request
outb(block, IDE_SECTOR_REG);
outb(block>>=8, IDE_LCYL_REG);
outb(block>>=8, IDE_HCYL_REG);
// etc., and eventually set the address of the
// function that will be invoked on the next
// interrupt, and schedule the operation on the drive
if (rq->cmd == READ)
{
handler = &read_intr;
outb(WIN_READ, IDE_COMMAND_REG); // start the read
}
The outb function outputs bytes of data
to the control registers IDE_SECTOR_REG, etc.
These are defined in
include/linux/ide.h,
and expand to the IO port addresses of
the control registers for specific IDE disks.
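The repeated shift by 8 in the fragment above spreads the block number across several byte-wide registers. Here is a self-contained sketch of that arithmetic, assuming 28-bit logical block addressing: the low three bytes go to the sector, low-cylinder and high-cylinder registers, and the remaining four bits go in the low nibble of the drive-select register. The structure and function names are invented, and the other bits of the select register are left out.

```c
#include <stdint.h>

/* Hypothetical holder for the bytes written to the registers. */
struct sim_taskfile {
    uint8_t sector;   /* block bits 0-7   -> IDE_SECTOR_REG */
    uint8_t lcyl;     /* block bits 8-15  -> IDE_LCYL_REG */
    uint8_t hcyl;     /* block bits 16-23 -> IDE_HCYL_REG */
    uint8_t select;   /* block bits 24-27, low nibble only */
};

/* Split a 28-bit block number into register-sized pieces. */
struct sim_taskfile sim_split_lba28(uint32_t block)
{
    struct sim_taskfile tf;

    tf.sector = (uint8_t)(block & 0xFF);
    tf.lcyl   = (uint8_t)((block >> 8) & 0xFF);
    tf.hcyl   = (uint8_t)((block >> 16) & 0xFF);
    tf.select = (uint8_t)((block >> 24) & 0x0F);
    return tf;
}

/* Reassemble the block number, to check nothing was lost. */
uint32_t sim_join_lba28(struct sim_taskfile tf)
{
    return ((uint32_t)(tf.select & 0x0F) << 24) |
           ((uint32_t)tf.hcyl << 16) |
           ((uint32_t)tf.lcyl << 8) |
           (uint32_t)tf.sector;
}
```

For block 0x0ABCDEF1 the pieces are 0xF1, 0xDE, 0xBC and 0x0A, and joining them gives back the original number.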
If the IDE controller supports bus-mastering DMA, then
the driver will initialize a DMA channel for it to use.
read_intr is the function that will be
invoked on the next interrupt; its address is stored
in the pointer handler, so it gets
invoked by ide_intr, the registered interrupt handler.
void read_intr(ide_drive_t *drive)
{
// Extract working data from the drive structure
// passed by the interrupt handler
struct request *rq = //... request queue
int nsect = //... number of sectors expected
char *name = //...name of device
// Get the buffer from the first request in the
// queue
char *to = ide_map_buffer(rq, /*...*/);
// in ide-taskfile.c
// And store the data in the buffer
// This will either be done by reading ports, or it
// will already have been done by the DMA transfer
taskfile_input_data(drive, to, nsect * SECTOR_WORDS);
// in ide-taskfile.c
// Now shuffle the request queue, so the next
// request becomes the head of the queue.
// end_that_request_first() returns 0 when
// nothing is left to do
if (!end_that_request_first(rq, 1, name))
{
// All requests done on this queue
// So reset, and wake up anybody who is listening
end_that_request_last (rq);
}
}
The convenience functions
end_that_request_first() and
end_that_request_last() are defined in
drivers/block/ll_rw_blk.c.
end_that_request_first()
shuffles the next request to the head of
the queue, so it is available to be processed, and
then calls b_end_io on the request that
was just finished. It returns 1 if there is still work
outstanding, and 0 when everything has been completed.
int end_that_request_first(struct request *req,
int uptodate, char *name)
{
struct buffer_head *bh = req->bh;
bh->b_end_io (bh, uptodate);
// Adjust buffer to make next request current
if (/* more work remains */) return 1;
return 0;
}
bh->b_end_io points to
end_buffer_io_sync (in
fs/buffer.c) which just marks
the buffer complete, and wakes up any threads that are
sleeping in wait for it to complete.
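The callback arrangement can be modelled in a few lines of user-space C. This is a sketch under my own assumptions: the sim_ structures carry only the fields needed for the illustration, `waking up' a sleeper is reduced to setting a done flag, and sim_complete_head uses its own return convention (the number of buffers still outstanding) rather than the kernel's.

```c
#include <stddef.h>

struct sim_buffer {
    int uptodate;                 /* non-zero if the data is valid */
    int done;                     /* set on completion; a real kernel
                                     would wake sleeping threads here */
    struct sim_buffer *next;      /* next buffer in the same request */
    void (*b_end_io)(struct sim_buffer *bh, int uptodate);
};

struct sim_request {
    struct sim_buffer *bh;        /* buffer currently being serviced */
};

/* Completion callback: mark the buffer finished. */
void sim_end_buffer_io(struct sim_buffer *bh, int uptodate)
{
    bh->uptodate = uptodate;
    bh->done = 1;
}

/* Complete the buffer at the head of the request and make the next
 * one current. Returns the number of buffers still outstanding. */
int sim_complete_head(struct sim_request *rq, int uptodate)
{
    struct sim_buffer *bh = rq->bh;
    int left = 0;

    if (bh == NULL)
        return 0;                 /* nothing to complete */
    rq->bh = bh->next;            /* advance to the next buffer */
    bh->next = NULL;
    bh->b_end_io(bh, uptodate);   /* notify whoever owns the buffer */
    for (bh = rq->bh; bh != NULL; bh = bh->next)
        left++;
    return left;
}
```

Because completion goes through a function pointer stored in the buffer, the block layer does not need to know who is waiting for the data or why; each buffer brings its own notification routine with it.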
Summary
So that's it. We've seen how a file read operation
travels all the way from the application program,
through the standard library, into the kernel's VFS
layer, through the filesystem handler, and into the
block device infrastructure. We've even seen how the
block device interacts with the physical hardware.
Of course, I've left a great deal out in this
discussion. If you look at all the functions I've
mentioned in passing, you'll see that they amount to
about 20,000 lines of code. Probably about half of
that volume is concerned with handling errors and
unexpected situations. All the same, I hope my
description of the basic principles has been helpful.
If you found these articles helpful, please
contact me, and it might inspire me to
post up some more.
Happy hacking!
©1994-2006 Kevin Boone, all rights reserved