©1994-2007 Kevin Boone
File handling in the Linux kernel: device driver layer
Last modified: Fri Aug 3 08:31:50 2007
Into the device driver

In the previous article I described how requests to read or write files ended up as queues of requests to read or write blocks on a storage device. When the device driver is initialized, it supplies the address either of a function that can read or write a single block to the device, or of a function that can read or write queues of requests. The latter is the more usual. I also pointed out that the generic block device layer provides various utility functions that drivers can use to simplify queue management and notification.

It's time to look in more detail at what goes on inside a typical block driver, using the IDE disk driver as an example. Of course, the details will vary with the hardware type and, to a certain extent, the platform, but the principles will be similar in most cases. Before we do that, we need to stop for a while and think about hardware interfacing in general: interrupts, ports, and DMA.

Interrupts in Linux

The problem with hardware access is that it's slow. Dead slow. Even a trivial operation like moving a disk head a couple of tracks takes an age in CPU terms. Now, Linux is a multitasking system, and it would be a shame to make all the processes on the system stop and wait while a hardware operation completes. There are really only two strategies for getting around this problem, both of which are supported by the Linux kernel.

The first strategy is the `poll-and-sleep' approach. The kernel thread that is interacting with the hardware checks whether the operation has finished; if it hasn't, it puts itself to sleep for a short time. While it is asleep, other processes can get a share of the CPU. Poll-and-sleep is easy to implement but it has a significant problem: most hardware operations don't take predictable times to complete. So when the disk controller tells the disk to read a block, it may get the results almost immediately, or it may have to wait while the disk spins up and repositions the head. The final wait time could be anywhere between a microsecond and a second. So, how long should the process sleep between polls? If it sleeps for a second, then little CPU time is wasted, but every disk operation will take at least a second. If it sleeps for a microsecond, then there will be a faster response, but perhaps up to a million wasted polls. This is far from ideal. However, polling can work reasonably well for hardware that responds in a predictable time.

The other strategy, and the one that is most widely used, is to use hardware interrupts. When the CPU receives an interrupt, provided interrupts haven't been disabled or masked out, and there isn't another interrupt of the same type currently in service, then the CPU will stop what it's doing, and execute a piece of code called the interrupt handler. In Unix, interrupts are conceptually similar to signals, but interrupts typically jump right into the kernel. In the x86 world, hardware interrupts are generally called IRQs, and you'll see both the terms `interrupt' and `IRQ' used in the names of kernel APIs.
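To put rough numbers on the poll-and-sleep trade-off described above, here is a small user-space sketch. It is not kernel code: `poll_and_sleep` and `struct poll_result` are hypothetical names, and the model simply assumes one poll at time zero followed by one poll after each sleep.

```c
#include <assert.h>

/* Outcome of polling a device that becomes ready after ready_us microseconds. */
struct poll_result {
    unsigned long polls;      /* number of status checks made */
    unsigned long latency_us; /* time at which readiness was noticed */
};

/*
 * Hypothetical model: poll at t = 0, then sleep interval_us between polls
 * until a poll finds the device ready.
 */
static struct poll_result poll_and_sleep(unsigned long ready_us,
                                         unsigned long interval_us)
{
    struct poll_result r;
    unsigned long sleeps;

    assert(interval_us > 0);
    /* Number of sleeps needed before a poll lands at or after ready_us. */
    sleeps = (ready_us + interval_us - 1) / interval_us;
    r.polls = sleeps + 1;                /* one poll at t = 0, one after each sleep */
    r.latency_us = sleeps * interval_us; /* time of the poll that succeeds */
    return r;
}
```

Under these assumptions, a one-second interval and a device that is ready after 1.5 ms give 2 polls but a full second of latency, while a one-microsecond interval and a device that needs a full second give 1,000,001 polls.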
The interrupts currently registered on a system can be listed like this:

% cat /proc/interrupts

So, what a typical hard disk driver will usually do, when asked to read or write one or more blocks of data, will be to write to the hardware registers of the controller whatever is needed to start the operation, then wait for an interrupt.

In Linux, interrupt handlers can usually be written in C. This works because the real interrupt handler is in the kernel's IO subsystem - all interrupts actually come to the same place. The kernel then calls the registered handler for the interrupt. The kernel takes care of matters like saving the CPU register contents before calling the handler, so we don't need to do that stuff in C. An interrupt handler is defined and registered like this:
void my_handler (int irq, void *data,
                 struct pt_regs *regs)
{
    // Handle the interrupt here
}
int irq = 9; // IRQ number
int flags = SA_INTERRUPT | SA_SHIRQ; // For example
char *name = "myhardware"; // Shown in /proc/interrupts
char *data = // any data needed by the handler
if (request_irq(irq, my_handler, flags, name, data))
{
    // Non-zero status: the IRQ could not be allocated
}
request_irq takes the IRQ number (1-15 on x86),
a pointer to the handler, some flags, and a name. The name is
nothing more than the text that appears in /proc/interrupts.
The flags dictate two important things -- whether the interrupt
is `fast' (SA_INTERRUPT, see below) and whether it is shareable (SA_SHIRQ).
If the interrupt is not available, or is available but only to a driver
that supports sharing, then request_irq returns a non-zero
status. The last argument to the function is a pointer to an arbitrary
block of data. This will be made available to the handler when the
interrupt arrives, and is a nice way to supply data to the handler.
However, this is a relatively new thing in Linux, and not all the
existing drivers use it.
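The registration rules just described (an IRQ line is either exclusive or shared by mutual consent, and each handler gets back its own data pointer) can be modelled in a few lines of user-space C. This is only a sketch: every `toy_` name and `TOY_` constant is hypothetical, `TOY_SHIRQ` merely stands in for SA_SHIRQ, and the real kernel bookkeeping is more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_NR_IRQS   16
#define TOY_MAX_SHARE 4
#define TOY_SHIRQ     0x1   /* stand-in for SA_SHIRQ */

typedef void (*toy_handler_t)(int irq, void *dev_id);

struct toy_action {
    toy_handler_t handler;
    unsigned long flags;
    const char *name;
    void *dev_id;
};

static struct toy_action toy_irq_table[TOY_NR_IRQS][TOY_MAX_SHARE];
static int toy_irq_count[TOY_NR_IRQS];

/* Returns 0 on success, or a negative errno-style value on failure. */
static int toy_request_irq(int irq, toy_handler_t handler, unsigned long flags,
                           const char *name, void *dev_id)
{
    int n;

    if (irq < 0 || irq >= TOY_NR_IRQS || handler == NULL)
        return -EINVAL;
    n = toy_irq_count[irq];
    if (n > 0) {
        /* The line is taken: both sides must agree to share it. */
        if (!(flags & TOY_SHIRQ) || !(toy_irq_table[irq][0].flags & TOY_SHIRQ))
            return -EBUSY;
        if (n == TOY_MAX_SHARE)
            return -EBUSY;
    }
    toy_irq_table[irq][n] = (struct toy_action){ handler, flags, name, dev_id };
    toy_irq_count[irq] = n + 1;
    return 0;
}

/* Deliver one interrupt: every handler on the line sees its own dev_id. */
static void toy_raise_irq(int irq)
{
    assert(irq >= 0 && irq < TOY_NR_IRQS);
    for (int i = 0; i < toy_irq_count[irq]; i++)
        toy_irq_table[irq][i].handler(irq, toy_irq_table[irq][i].dev_id);
}

/* Example handler: counts interrupts in the int that dev_id points to. */
static void toy_count_handler(int irq, void *dev_id)
{
    (void)irq;
    ++*(int *)dev_id;
}
```

In this model, two registrations on the same line that both pass `TOY_SHIRQ` succeed, a third without the flag gets `-EBUSY`, and raising the line bumps the counters of the two sharing drivers only.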
The interrupt handler is invoked with the number of the IRQ, the block of data passed when the handler was registered, and the registers that were saved by the interrupt service routine in the kernel.

An important historical distinction in Linux was between `fast' and `slow' interrupt handlers, and because this continues to confuse developers and arouse heated debate, it might merit a brief mention here.

In the early days (1.x kernels), the Linux architecture supported two types of interrupt handler - a `fast' and a `slow' handler. A fast handler was invoked without the stack being fully formatted, and without all the registers preserved. It was intended for handlers that were genuinely fast, and didn't do much. They couldn't do much, because they weren't set up to. In order to avoid the complexity of interacting with the interrupt controller, which would have been necessary to prevent other instances of the same interrupt entering the handler re-entrantly and breaking it, a fast handler was entered with interrupts completely disabled. This was a satisfactory approach when the interrupts really had to be fast.

As hardware got faster, the benefit of servicing an interrupt with an incompletely formatted stack became less obvious. In addition, the technique was developed of allowing the handler to be constructed of two parts: a `top half' and a `bottom half'. The `top half' was the part of the handler that had to complete immediately. The top half would do the minimum amount of work, then schedule the `bottom half' to execute when the interrupt was finished. Because the bottom half could be pre-empted, it did not hold up other processes. The meaning of a `fast' interrupt handler therefore changed: a fast handler completed all its work within the main handler, and did not need to schedule a bottom half.

In modern (2.4.x and later) kernels, all these historical features are gone. Any interrupt handler can register a bottom half and, if it does, the bottom half will be scheduled to run in normal time when the interrupt handler returns. You can use mark_bh
to schedule a bottom half; doing so requires a knowledge of kernel
tasklets, which are beyond the scope of this article
(read the comments in include/linux/interrupt.h in the
first instance). The only thing that the `fast handler' flag
SA_INTERRUPT now does is to cause the handler to be
invoked with interrupts disabled. If the flag is omitted, interrupts
are enabled, but only for IRQs different to the one currently being
serviced. One type of interrupt can still interrupt a different type.
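The top half / bottom half split can be illustrated with a user-space sketch. Nothing here is a kernel API: the `toy_` names are hypothetical, the pending flag only plays the role that mark_bh plays in the kernel, and "running later" is modelled by an explicit call to `toy_run_pending_bh`.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FIFO_SIZE 8

static unsigned char toy_fifo[TOY_FIFO_SIZE];
static size_t toy_fifo_len;
static int toy_bh_pending;     /* set by the top half, cleared when the bottom half runs */
static unsigned long toy_sum;  /* result of the deferred "slow" processing */

/* Top half: runs at interrupt time, so it only grabs the data and defers the rest. */
static void toy_top_half(unsigned char byte_from_device)
{
    assert(toy_fifo_len < TOY_FIFO_SIZE);
    toy_fifo[toy_fifo_len++] = byte_from_device;
    toy_bh_pending = 1;        /* request that the bottom half be run later */
}

/* Bottom half: does the time-consuming work on everything queued so far. */
static void toy_bottom_half(void)
{
    for (size_t i = 0; i < toy_fifo_len; i++)
        toy_sum += toy_fifo[i];
    toy_fifo_len = 0;
}

/* Called at a safe point after interrupt handling; returns 1 if a bottom half ran. */
static int toy_run_pending_bh(void)
{
    if (!toy_bh_pending)
        return 0;
    toy_bh_pending = 0;
    toy_bottom_half();
    return 1;
}
```

The point of the design is that two interrupts in quick succession cost only two cheap calls to `toy_top_half`; the summing is done once, afterwards, when nothing urgent is waiting.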
Port IO in Linux

Interrupts allow the hardware to wake up the device driver when it is ready, but we need something else to send and receive data to and from the device. In some computer architectures, peripherals are commonly mapped into the CPU's ordinary address space. In such a system, reading and writing a peripheral is identical to reading and writing memory, except that the region of `memory' is fixed. Most architectures on which Linux runs do support the `memory mapping' strategy, although it is no longer widely used. In a sense, DMA (direct memory access) is perhaps a more subtle way of achieving the same effect.

Most architectures provide separate address spaces for IO devices (`ports') and for memory, and most peripherals are constructed to make use of this form of addressing. Typically the IO address space is smaller than the memory address space -- 64 kBytes is quite a common figure. Different CPU instructions are used to read and write IO ports, compared to memory. To make port IO as portable as possible, the kernel source code provides macros for use in C that expand into the appropriate assembler code for the platform. So we have, for example, outb to output a byte value to a port, and
inl to input a long (32-bit) value from a port.
Device drivers are encouraged to use the function request_region
to reserve a block of IO ports. For example:
if (!request_region (0x300, 8, "mydriver"))
{ printk ("Can't allocate ports\n"); }
This prevents different drivers trying to control the same devices. Ports
allocated this way appear in /proc/ioports. Note that some
architectures allow port numbers to be dynamically allocated at boot time, while
others are largely static. In the PC world, most systems now support
dynamic allocation, which makes drivers somewhat more complicated to
code, but gives users and administrators an easier time.
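The bookkeeping behind port reservation amounts to refusing any range that overlaps one already claimed. The following user-space sketch shows that idea only; `toy_request_region` and `struct toy_region` are hypothetical names, and the kernel's own resource tree is organised differently.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_REGIONS 32

struct toy_region {
    unsigned long start;   /* first port in the range */
    unsigned long len;     /* number of ports */
    const char *name;      /* owner, as it might be listed in /proc/ioports */
};

static struct toy_region toy_regions[TOY_MAX_REGIONS];
static size_t toy_nregions;

/* Returns the new region on success, or NULL if any port is already claimed. */
static struct toy_region *toy_request_region(unsigned long start,
                                             unsigned long len,
                                             const char *name)
{
    assert(len > 0);
    for (size_t i = 0; i < toy_nregions; i++) {
        const struct toy_region *r = &toy_regions[i];
        /* Two half-open ranges overlap if each starts before the other ends. */
        if (start < r->start + r->len && r->start < start + len)
            return NULL;
    }
    if (toy_nregions == TOY_MAX_REGIONS)
        return NULL;
    toy_regions[toy_nregions] = (struct toy_region){ start, len, name };
    return &toy_regions[toy_nregions++];
}
```

As with the example above, a caller tests the result for NULL: a second driver asking for ports 0x304-0x307 after 0x300-0x307 has been claimed is turned away, while an adjacent range such as 0x2f8-0x2ff is accepted.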
DMA in Linux

The use of DMA in Linux is a big subject in its own right, and one whose details are highly architecture-dependent. I only intend to deal with it in outline here. In short, DMA (direct memory access) provides a mechanism by which peripheral devices can read or write main memory independently of the CPU. DMA is usually much faster than a scheme where the CPU iterates over the data to be transferred, and moves it byte-by-byte into memory (`programmed IO').

In the PC world, there are two main forms of DMA. The earlier form, which has been around since the PC-AT days, uses a dedicated DMA controller to do the data transfer. The IO device and the memory are essentially passive. This scheme was considered very fast in the early 1990s, but worked only over the ISA bus. These days, many peripheral devices that can take part in DMA use bus mastering. In bus mastering DMA, it is the peripheral that takes control of the DMA process, and a specific DMA controller is not required.

A block device driver that uses DMA is usually not very different from one that does not. DMA, although faster than programmed IO, is unlikely to be instantaneous. Consequently, the driver will still have to schedule an operation, then receive an interrupt when it has completed.

The IDE disk driver

We are now in a position to look at what goes on inside the IDE disk driver. IDE disks are not very smart -- they need a lot of help from the driver. Each read or write operation will go through various stages, and these stages are coordinated by the driver. You should also remember that a disk drive has to do more than simply read and write, but I won't discuss the other operations here.

When the driver is initialized, it probes the hardware and, for each controller found, initializes a block device, a request queue handler, and an interrupt handler (ide-probe.c). In outline, the
initialization code for the first IDE controller
looks like this:
// Register the device driver with the kernel. For the
// first controller, major number is 3. The name
// `ide0' will appear in /proc/devices. ide_fops is a
// structure that contains pointers to the open,
// close, and ioctl functions (see previous article).
devfs_register_blkdev (3, "ide0", ide_fops);
// Initialize the request queue and point it to the
// request handler. Note that we aren't using the
// kernel's default queue this time
request_queue_t *q = // create a queue
blk_dev[3].queue = q; // install it in kernel
blk_init_queue(q, do_ide_request);
// Register an interrupt handler
// The real driver works out the IRQ number, but it's
// usually 14 for the first controller on PCs
// ide_intr is the handler
// SA_INTERRUPT means call with interrupts disabled
request_irq(14, &ide_intr, SA_INTERRUPT, "ide0",
    /*... some drive-related data */);
// Request the relevant block of IO ports; again
// 0x1F0 is common on PCs.
request_region (0x01F0, 8, "ide0");
The interrupt handler ide_intr()
is quite straightforward, because it delegates the
processing to a function defined by the pointer
handler:
void ide_intr (int irq, void *data, struct pt_regs *regs)
{
// Check that an interrupt is expected
// Various other checks, and eventually...
handler();
}
We will see how handler gets set shortly.
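When requests are delivered to the driver, the method
do_ide_request (the function registered with blk_init_queue)
is invoked. For a read, the driver ends up programming
the controller along these lines: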
int block = // block to read, from the request
outb(block, IDE_SECTOR_REG);
outb(block>>=8, IDE_LCYL_REG);
outb(block>>=8, IDE_HCYL_REG);
// etc., and eventually set the address of the
// function that will be invoked on the next
// interrupt, and schedule the operation on the drive
if (rq->cmd == READ)
{
handler = &read_intr;
outb(WIN_READ, IDE_COMMAND_REG); // start the read
}
The outb function outputs bytes of data
to the control registers IDE_SECTOR_REG, etc.
These are defined in
include/linux/ide.h,
and expand to the IO port addresses of
the control registers for specific IDE disks.
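The repeated `block>>=8` in the fragment above peels the block number apart one byte at a time, low byte first. A stand-alone sketch of the same arithmetic follows. `toy_split_lba` and `struct toy_taskfile` are hypothetical names, and the fourth field assumes standard 28-bit LBA addressing, where the top four bits go into the low nibble of the drive/head register (the part the fragment leaves as "etc.").

```c
#include <assert.h>
#include <stdint.h>

/* Values destined for the IDE control registers, assuming LBA28 addressing. */
struct toy_taskfile {
    uint8_t sector;   /* block bits 0-7,   for IDE_SECTOR_REG */
    uint8_t lcyl;     /* block bits 8-15,  for IDE_LCYL_REG */
    uint8_t hcyl;     /* block bits 16-23, for IDE_HCYL_REG */
    uint8_t select;   /* block bits 24-27, low nibble of the drive/head register */
};

static struct toy_taskfile toy_split_lba(uint32_t block)
{
    struct toy_taskfile tf;

    assert(block < (1u << 28));        /* LBA28 can address 2^28 sectors */
    tf.sector = block & 0xff;
    tf.lcyl   = (block >>= 8) & 0xff;  /* shift in place, as the driver does */
    tf.hcyl   = (block >>= 8) & 0xff;
    tf.select = (block >>= 8) & 0x0f;
    return tf;
}
```

For block number 0x0ABCDEF1 this gives 0xF1, 0xDE, 0xBC and 0x0A respectively. In the driver each of those bytes would be handed to outb rather than stored in a structure.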
If the IDE controller supports bus-mastering DMA, then
the driver will initialize a DMA channel for it to use.
read_intr is the function that will be
invoked on the next interrupt; its address is stored
in the pointer handler, so it gets
invoked by ide_intr, the registered interrupt handler.
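The hand-off through the handler pointer can be tried out in user space. The sketch below is a model, not the driver's code: all `toy_` names are hypothetical, and clearing the pointer before calling through it is a design choice of this sketch so that each armed handler fires at most once.

```c
#include <assert.h>
#include <stddef.h>

struct toy_drive {
    int sectors_read;              /* work done by the completion handler */
};

/* The function to run on the next interrupt, or NULL if none is expected. */
static void (*toy_handler)(struct toy_drive *);

/* Completion handler for a read: here it just counts the sector. */
static void toy_read_intr(struct toy_drive *drive)
{
    drive->sectors_read++;
}

/* Request side: arm the handler before starting the (pretend) hardware operation. */
static void toy_start_read(void)
{
    assert(toy_handler == NULL);   /* one operation in flight at a time */
    toy_handler = toy_read_intr;
    /* a real driver would now write the read command to the controller */
}

/* Interrupt side: returns 1 if the interrupt was expected and handled, 0 otherwise. */
static int toy_intr(struct toy_drive *drive)
{
    void (*h)(struct toy_drive *) = toy_handler;

    if (h == NULL)
        return 0;                  /* unexpected interrupt: nothing was armed */
    toy_handler = NULL;            /* each armed handler fires once */
    h(drive);
    return 1;
}
```

An interrupt that arrives before `toy_start_read` has been called is reported as unexpected, which corresponds to the "check that an interrupt is expected" step in the outline of ide_intr.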
void read_intr(ide_drive_t *drive)
{
// Extract working data from the drive structure
// passed by the interrupt handler
struct request *rq = //... request queue
int nsect = //... number of sectors expected
char *name = //...name of device
// Get the buffer from the first request in the
// queue
char *to = ide_map_buffer(rq, /*...*/);
// in ide-taskfile.c
// And store the data in the buffer
// This will either be done by reading ports, or it
// will already have been done by the DMA transfer
taskfile_input_data(drive, to, nsect * SECTOR_WORDS);
// in ide-taskfile.c
// Now shuffle the request queue, so the next
// request becomes the head of the queue.
// end_that_request_first returns non-zero while
// there is still more to do
if (!end_that_request_first(rq, 1, name))
{
// Nothing left to do on this request
// So reset, and wake up anybody who is listening
end_that_request_last (rq);
}
}
The convenience functions
end_that_request_first() and
end_that_request_last() are defined in
drivers/block/ll_rw_blk.c.
end_that_request_first()
shuffles the next request to the head of
the queue, so it is available to be processed, and
then calls b_end_io on the request that
was just finished. It returns non-zero while more remains
to be done, and zero once everything is complete.
int end_that_request_first(struct request *req,
int uptodate, char *name)
{
struct buffer_head *bh = req->bh;
bh->b_end_io (bh, uptodate);
// Adjust buffer to make next request current
if (/* more remains to be done */) return 1;
return 0;
}
bh->b_end_io points to
end_buffer_io_sync (in
fs/buffer.c) which just marks
the buffer complete, and wakes up any threads that are
sleeping in wait for it to complete.
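The completion path (a callback stored in each buffer, called as the buffers of a request are finished one by one) can be sketched in user space as follows. All `toy_` names are hypothetical. Unlike the kernel function, `toy_complete_one` returns the number of buffers still outstanding, a convention chosen here purely to make the example easy to follow.

```c
#include <assert.h>
#include <stddef.h>

struct toy_buffer {
    struct toy_buffer *next;                   /* next buffer in the same request */
    int done;                                  /* set by the completion callback */
    int uptodate;                              /* 1 if the data is valid */
    void (*end_io)(struct toy_buffer *, int);  /* plays the role of b_end_io */
};

struct toy_request {
    struct toy_buffer *bh;                     /* first unfinished buffer */
};

/* Completion callback: marks the buffer finished; a kernel would also wake sleepers. */
static void toy_end_buffer_io(struct toy_buffer *b, int uptodate)
{
    b->uptodate = uptodate;
    b->done = 1;
}

/* Finish the current buffer, make the next one current, return how many remain. */
static size_t toy_complete_one(struct toy_request *rq, int uptodate)
{
    struct toy_buffer *b = rq->bh;
    size_t remaining = 0;

    assert(b != NULL);
    rq->bh = b->next;              /* unlink the finished buffer */
    b->next = NULL;
    b->end_io(b, uptodate);        /* notify whoever owns the buffer */
    for (b = rq->bh; b != NULL; b = b->next)
        remaining++;
    return remaining;
}
```

A request holding two buffers therefore takes two calls to drain: the first returns 1 and marks only the first buffer done, the second returns 0 and leaves the request empty.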
Summary

So that's it. We've seen how a file read operation travels all the way from the application program, through the standard library, into the kernel's VFS layer, through the filesystem handler, and into the block device infrastructure. We've even seen how the block device interacts with the physical hardware.

Of course, I've left a great deal out in this discussion. If you look at all the functions I've mentioned in passing, you'll see that they amount to about 20,000 lines of code. Probably about half of that volume is concerned with handling errors and unexpected situations. All the same, I hope my description of the basic principles has been helpful. If you found these articles helpful, please contact me, and it might inspire me to post up some more. Happy hacking!