©1994-2007 Kevin Boone
File handling in the Linux kernel: generic device layer
Last modified: Fri Aug 3 08:31:42 2007
Into the generic block device layer
In the previous article we traced the flow of execution from the VFS layer, down into the filesystem handler, and from there into the generic filesystem handling code. Before we examine the kernel's generic block device layer, which provides the general functionality that most block devices will require, let's have a quick recap of what goes on in the filesystem layer.
When the application manipulates a file, the kernel's VFS layer finds the file's
Device drivers, special files, and modules
Before we look at what goes on in the block device layer, we need to consider how the VFS layer (or the filesystem handler in some cases) knows how to find the driver implementation that supports a particular block device. After all, the system could be supporting IDE disks, SCSI disks, RAMdisks, loopback devices, and any number of other things. An ext2 filesystem will look much the same whichever of
these things it is installed on, but naturally the hardware operations
will be quite different. Underneath each block device, that is,
each block special file of the form /dev/hdXX, will
be a driver. The same driver can, and often does, support multiple
block devices, and each block device can, in principle, support multiple
hardware units. The driver code may be compiled directly into
the core kernel, or made available as a loadable module.
Each block special file is identified by two
numbers: a major device number and a minor device number. In
practice, a block special file is not a real file, and does not take
any space on disk. The device numbers typically live in
the file's inode; however, this is filesystem-dependent.
Conventionally the major device number identifies either a particular
driver or a particular hardware controller, while the minor number
identifies a particular device attached to that controller.
There have been some significant changes to the way that Linux handles block special files and drivers in the last year or so. One of the problems that these changes attempt to solve is that major and minor numbers are, and always have been, 8-bit integers. If we assume loosely that each specific hardware controller attached to the system has its own major number (and that's a fair approximation) then we could have 200 or so different controllers attached (we have to leave some numbers free for things like /dev/null).
However, the mapping between controller types and major numbers has
traditionally been static. What this means is that
the Linux designers decided in advance what numbers should be assigned to
what controllers. So, on an x86 system, major 3 is the primary IDE
controller (/dev/hda and /dev/hdb), major 22 is the secondary IDE controller (/dev/hdc and /dev/hdd),
major 8 is for the first 16 SCSI hard disks (/dev/sda...),
and so on. In fact, most of the major numbers have been pre-allocated,
so it's hard to find numbers for new devices.
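To make the 8-bit scheme and the static assignments concrete, here is a small user-space sketch. It is not kernel code: the names toy_dev_t, toy_mkdev, toy_major, toy_minor and toy_major_name are all made up for illustration, the packing (major in the high byte, minor in the low byte) is an assumption about the classic 16-bit layout, and the table contains only the three majors mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit device number: 8-bit major, 8-bit minor. */
typedef uint16_t toy_dev_t;

/* Pack a major/minor pair; anything above 8 bits is discarded. */
static toy_dev_t toy_mkdev(unsigned major, unsigned minor)
{
    return (toy_dev_t)(((major & 0xffu) << 8) | (minor & 0xffu));
}

static unsigned toy_major(toy_dev_t dev) { return dev >> 8; }
static unsigned toy_minor(toy_dev_t dev) { return dev & 0xffu; }

/* Static major assignments quoted in the text (x86). */
static const char *toy_major_name(unsigned major)
{
    switch (major) {
    case 3:  return "primary IDE controller";
    case 22: return "secondary IDE controller";
    case 8:  return "first 16 SCSI hard disks";
    default: return NULL;   /* not covered by this sketch */
    }
}
```

With this layout, toy_mkdev(3, 1) packs to 0x0301, and toy_major() and toy_minor() recover 3 and 1 from it. Because only eight bits are available for the major number, the whole table can never hold more than 256 entries.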
In more recent Linux kernels, we have the ability to mount /dev as a filesystem, in much the same way that /proc works.
Under this system, device numbers get allocated dynamically, so we can have
200-odd devices per system, rather than 200-odd for the whole world.
This issue of device number allocation may seem to be off-topic, but I am mentioning it because the system I am about to describe assumes that we are using the old-fashioned (static major numbers) system, and may be out-of-date by the time you read this. However, the basic principles remain the same. You should be aware also that block devices have been with Linux for a long time, and kernel support for driver implementers has developed significantly over the years. In 2.2-series kernels, for example, driver writers typically took advantage of a set of macros defined in kernel header files, to simplify the structure of the driver. For a good example of this style of driver authoring, look at drivers/ide/legacy/hd.c, the PC-AT legacy hard-disk driver.
There are, in consequence, a number of different ways of implementing
even a simple block device driver. In what follows, I will describe only
the technique that seems to be most widely used in the latest 2.4.XX
kernels. As ever, the principles are the same in all versions, but the
mechanics are different.
Finding the device numbers for a filesystem
There's one more thing to consider before we look at how the filesystem layer interacts with the block device layer, and that is how the filesystem layer knows which driver to use for a given filesystem. If you think back to the mount operation described above,
you may remember that sys_mount took the name of
the block special file as an argument; this argument will usually have
come from the command line to the mount command.
sys_mount then descends the filesystem to find the
inode for the block special file:
path_lookup (dev_name, /*...*/, &nd_dev);
and from that inode it extracts the major and minor numbers, among other things, and stores them in the superblock structure for the filesystem. The superblock is then stored in a vfsmount
structure, and the vfsmount attached to the dentry
of the directory on which the filesystem is mounted. So, in short,
as VFS descends the pathname of a requested file, it can determine the
major and minor device numbers from the closest vfsmount above
the desired file. Then, if we have the major number, we can
ask the kernel for the struct block_device_operations
that supports it, which was stored by the kernel when the driver was
registered.
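The idea of "the closest vfsmount above the desired file" can be sketched in user space as follows. Everything here is hypothetical: toy_super, toy_mount and toy_dentry are drastically simplified stand-ins for the kernel structures, and the sketch walks upward from the file towards the root, whereas the kernel keeps track of the current mount as it walks down the pathname. The answer is the same either way.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for kernel structures. */
struct toy_super { unsigned major, minor; };
struct toy_mount { struct toy_super *sb; };
struct toy_dentry {
    const char        *name;
    struct toy_dentry *parent;   /* NULL at the root */
    struct toy_mount  *mounted;  /* non-NULL if a filesystem is mounted here */
};

/* Walk towards the root and return the superblock of the closest mount. */
static struct toy_super *toy_find_super(struct toy_dentry *d)
{
    for (; d != NULL; d = d->parent)
        if (d->mounted != NULL)
            return d->mounted->sb;
    return NULL;   /* nothing mounted anywhere above this dentry */
}
```

If /home is mounted from a device with major 8 and the root from a device with major 3, then a file under /home resolves to major 8, while a file under /etc falls through to the root mount and resolves to major 3.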
Registering the driver
We have seen that each device, or group of devices, is assigned a major device number, which identifies the driver to be invoked when requests are issued on that device. We now need to consider how the kernel knows which driver to invoke, when an IO request is queued for a specific major device number.
It is the responsibility of the block device driver to register itself with the kernel's device manager. It does this by making a call to register_blkdev() (or devfs_register_blkdev()
in modern practice).
This call will usually be in the driver's initialization section,
and therefore be invoked at boot time (if the driver is compiled in)
or when the driver's module is loaded. Let's assume for now that
the filesystem is hosted on an IDE disk partition, and will be handled
by the ide-disk driver. When the IDE subsystem
is initialized it probes for IDE controllers, and for each one it
finds it executes (drivers/ide/ide-probe.c):
devfs_register_blkdev (hwif->major, hwif->name, ide_fops);
ide_fops is a structure of type block_device_operations, which contains pointers to functions implemented in the driver for
doing the various low-level operations. We'll come back to this later.
devfs_register_blkdev adds the driver to the kernel's driver table, assigning
it a particular name, and a particular major number (the code for
this is in fs/block_dev.c, but it's not particularly
interesting). What the call really does is map a major device number
to a block_device_operations structure. This structure
is defined in include/linux/fs.h like this:
struct block_device_operations
{
    int (*open) (struct inode *, struct file *);
    int (*release) (struct inode *, struct file *);
    int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
    // ... and a few more
};
Each of these elements is a pointer to a function defined in the driver.
For example, when the IDE driver is initialized, if its bus probe reveals
that the attached device is a hard disk, then it points the
open function at idedisk_open() (in
drivers/ide/ide-disk.c). All this function does is
signal to the kernel that the driver is now in use and, if the
drive supports removable media, locks the drive door.
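The mapping from major number to operations structure can be pictured as a simple table. The following is a user-space sketch only, under my own assumptions: toy_bdev_ops, toy_blkdevs, toy_register_blkdev, toy_get_ops and toy_disk_open are invented names, the operations structure has been cut down to a single callback, and the sketch simply rejects major 0 rather than doing anything clever with it.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_BLKDEV 256   /* one slot per 8-bit major number */

/* Hypothetical cut-down operations table: one callback only. */
struct toy_bdev_ops {
    int (*open)(unsigned minor);
};

/* The driver table: major number -> name and operations. */
static struct {
    const char                *name;
    const struct toy_bdev_ops *ops;
} toy_blkdevs[TOY_MAX_BLKDEV];

/* Claim a major number for a driver; fails if another driver holds it. */
static int toy_register_blkdev(unsigned major, const char *name,
                               const struct toy_bdev_ops *ops)
{
    if (major == 0 || major >= TOY_MAX_BLKDEV)
        return -EINVAL;
    if (toy_blkdevs[major].ops != NULL && toy_blkdevs[major].ops != ops)
        return -EBUSY;
    toy_blkdevs[major].name = name;
    toy_blkdevs[major].ops  = ops;
    return 0;
}

/* Look up the operations registered for a major number, or NULL. */
static const struct toy_bdev_ops *toy_get_ops(unsigned major)
{
    return major < TOY_MAX_BLKDEV ? toy_blkdevs[major].ops : NULL;
}

/* Example callback a driver might supply. */
static int toy_disk_open(unsigned minor)
{
    (void)minor;   /* a real driver would select the unit here */
    return 0;      /* 0 means success in this sketch */
}

static const struct toy_bdev_ops toy_disk_ops = { toy_disk_open };
```

A driver would call toy_register_blkdev(3, "ide0", &toy_disk_ops) once at initialization; afterwards, anything that knows the major number can reach the driver's open function through toy_get_ops(3)->open.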
In the code extract above there were no read() or
write() functions. That's not because I left them out,
but because they don't exist. Unlike a character device, block devices
don't expose read and write functionality directly to the layer above;
instead they expose a function that handles requests delivered to
a request queue.
Request queue management
We have seen that the filesystem layer builds a queue of requests for blocks to read or write. It then typically submits that queue to the block device layer by a call to submit_bh() (in drivers/block/ll_rw_blk.c).
This function does some checks on
the requests submitted, and then calls the request handling
function registered by the driver (see below for details).
The driver can either
directly specify a request handler in its own code, or make
use of the generic request handler in the block device layer.
The latter is usually preferred, for the following reason.
Most block devices, disk drives in particular, work most efficiently when asked to read or write a contiguous region of data. Suppose that the filesystem handler is asked to provide a list of the physical blocks that comprise a particular file, and that list turns out to be blocks 10, 11, 12, 1, 2, 3, and 4. Now, we could ask the block device driver to load the blocks in that order, but this would involve seven short reads, probably with a repositioning of the disk head between the third and fourth block. It would be more efficient to ask the hardware to load blocks 10-12, which it could do in a continuous read, and then 1-4, which are also contiguous. In addition, it would probably be more efficient to re-order the reads so that blocks 1-4 get done first, then 10-12. These processes are referred to in the kernel documentation as `coalescing' and `sorting'.
Now, coalescing and sorting themselves require a certain amount of CPU time, and not all devices will benefit. In particular, if the block device offers true random access -- a ramdisk, for example -- the overheads of sorting and coalescing may well outweigh the benefits. Consequently, the implementer of a block device driver can choose whether to make use of the request queue management features or not. If the driver should receive requests one-at-a-time as they are delivered from the filesystem layer, it can use the function
void blk_queue_make_request
(request_queue_t *q, make_request_fn *mrf);
This takes two arguments: the queue in the kernel to which requests
are delivered by the filesystem (of which, more later), and the
function to call when each request arrives. An example of the use
of this function might be:
#define MAJOR NNN // Our major number
/*
my_request_fn() will be called whenever a request is ready
to be serviced. Requests are delivered in no particular
order
*/
static int my_request_fn
(request_queue_t *q, int rw, struct buffer_head *rbh)
{
    // read or write the buffer specified in rbh
    // ...
    return 0; // 0: the request has been dealt with here
}
// Initialization section
blk_queue_make_request(BLK_DEFAULT_QUEUE(MAJOR), my_request_fn);
The kernel's block device manager maintains a default queue for
each device, and in this example we have simply attached a
request handler to that default queue.
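For a driver that does want sorting and coalescing, it may help to see what those two steps do to the example block list used earlier (10, 11, 12, 1, 2, 3, 4). The sketch below is plain user-space C under my own assumptions: toy_extent, toy_cmp_block and toy_sort_and_coalesce are hypothetical names, and the whole list is processed in one batch, whereas the kernel merges and orders requests incrementally as they are added to the queue.

```c
#include <stddef.h>
#include <stdlib.h>

/* A run of contiguous blocks: 'count' blocks starting at 'start'. */
struct toy_extent { unsigned long start, count; };

/* qsort comparator: ascending block number. */
static int toy_cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/*
 * Sort 'blocks' in place, then merge neighbours into extents.
 * 'out' must have room for n entries. Returns the number of extents.
 */
static size_t toy_sort_and_coalesce(unsigned long *blocks, size_t n,
                                    struct toy_extent *out)
{
    size_t i, used = 0;

    qsort(blocks, n, sizeof blocks[0], toy_cmp_block);
    for (i = 0; i < n; i++) {
        if (used > 0) {
            unsigned long end = out[used - 1].start + out[used - 1].count;
            if (blocks[i] == end) {      /* extends the previous run */
                out[used - 1].count++;
                continue;
            }
            if (blocks[i] < end)         /* duplicate request, skip it */
                continue;
        }
        out[used].start = blocks[i];     /* start a new run */
        out[used].count = 1;
        used++;
    }
    return used;
}
```

Fed the seven blocks from the example, this produces two extents: blocks 1-4 first, then blocks 10-12. That is two contiguous transfers in ascending order instead of seven short ones.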
If the driver is taking advantage of the kernel's request ordering and coalescing functions, then it registers itself using the function
void blk_init_queue
(request_queue_t * q, request_fn_proc * rfn);
(also defined in drivers/block/ll_rw_blk.c).
The second argument to this function is a pointer to a function
that will be invoked when a sorted queue of requests is available
to be processed. The driver might use this function like this:
/*
my_request_fn() will be called whenever a queue of requests
is ready to be serviced. Requests are delivered ordered and
coalesced
*/
static void my_request_fn
(request_queue_t *q)
{
    // read or write the queue of buffers specified in *q
    // ...
}
// Initialization section
blk_init_queue(BLK_DEFAULT_QUEUE(MAJOR), my_request_fn);
So we have seen how the device registers itself with the generic
block device layer, so that it can accept requests to read or write blocks.
We must now consider
what happens when these requests have been completed.
You may remember that the interface between the filesystem layer and
the block device layer is asynchronous. When the filesystem handler added
the specifications of blocks to load into the buffer_head
structure, it could also write a pointer to the function to call to indicate
that the block had been read. This function was stored in the
field b_end_io. In practice, when the filesystem layer
submits a queue of blocks to read to the submit_bh()
function in the block device layer, submit_bh()
ultimately sets
b_end_io to a generic end-of-block handler.
This is the function end_buffer_io_sync
(in fs/buffer.c). This generic handler simply marks
the buffer complete and unlocks its memory. Just as the interface between
the filesystem layer and the generic block device layer is asynchronous,
so the interface between the generic block device layer and the driver
itself is also asynchronous. The request handling functions
described above (named my_request_fn in the code
snippets) are expected not to block. Instead, these
methods should schedule an IO request on the hardware, then notify
the block device layer by calling b_end_io on each block
when it is completed. In practice, device drivers typically make use
of utility functions in the generic block device layer, which combine
this notification of completion with the manipulation of the queue.
If the driver registers itself using blk_init_queue(),
its request handler can expect to be passed a pointer to a
queue whenever there are requests available to be serviced. It uses
utility functions to iterate through the queue, and to notify the
block device layer every time a block is completed. We will look at
these functions in more detail in the next section.
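The completion-callback pattern described here can be sketched in user space like this. The names are hypothetical throughout: toy_buffer is a minimal stand-in for the buffer_head structure, its end_io field plays the part of b_end_io, and toy_request_fn stands in for a driver's request handler. The important simplification is that the sketch completes each buffer immediately, whereas a real driver would return at once and signal completion later, typically from an interrupt handler.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct buffer_head. */
struct toy_buffer {
    unsigned long      block;      /* block number to transfer */
    int                uptodate;   /* set by the completion handler */
    void             (*end_io)(struct toy_buffer *b, int uptodate);
    struct toy_buffer *next;       /* next buffer in the queue */
};

/* Generic completion handler: record the result on the buffer. */
static void toy_end_io_sync(struct toy_buffer *b, int uptodate)
{
    b->uptodate = uptodate;
}

/*
 * Request handler: "services" every buffer on the list and reports each
 * completion through the buffer's own callback.
 * Returns the number of buffers processed.
 */
static int toy_request_fn(struct toy_buffer *head)
{
    int done = 0;
    struct toy_buffer *b;

    for (b = head; b != NULL; b = b->next) {
        /* ... the hardware transfer for b->block would happen here ... */
        if (b->end_io != NULL)
            b->end_io(b, 1);       /* 1 = data is valid */
        done++;
    }
    return done;
}
```

The layer that queued the buffers never calls the driver back to ask whether it has finished; it simply waits for end_io to be invoked on each buffer.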
Device driver interface
So, in summary, a specific block device driver has two interfaces with the generic block device layer. First, it provides functions to open, close, and manage the device, and registers them by calling register_blkdev(). Second, it provides a function that
handles incoming requests, or incoming request queues, and registers
that function by the appropriate kernel API call:
blk_queue_make_request()
or blk_init_queue(). Having registered a queue handler, the
device driver typically uses utility functions in the generic block device
layer to retrieve requests from the queue, and issue notifications when
hardware operations are complete.
In concept, then, a block device driver is relatively simple. Most of the work will be done in its request handling method, which will schedule hardware operations, then call notification functions when these operations complete. In reality, hardware device drivers have to contend with the complexities of interrupts, DMA, spinlocks, and IO, and are consequently much more complex than the simple interface between the device driver and the kernel would suggest. In the next, and final, installment, we will consider some of these low-level issues, using the IDE disk driver as an example.