Logo ©1994-2007 Kevin Boone

File handling in the Linux kernel: generic device layer

Last modified: Fri Aug 3 08:31:42 2007

Into the generic block device layer

In the previous article we traced the flow of execution from the VFS layer, down into the filesystem handler, and from there into the generic filesystem handling code. Before we examine the kernel's generic block device layer, which provides the general functionality that most block devices will require, let's have a quick recap of what goes on in the filesystem layer.

When the application manipulates a file, the kernel's VFS layer finds the file's dentry structure in memory, and calls the file operations open(), read(), etc., through the file's inode structure. The inode contains pointers into the filesystem handler for the particular filesystem on which the file is located. The handler delegates most of its work to the generic filesystem support code, which is split between the VFS subsystem and the memory management subsystem. The generic VFS code attempts to find the requested blocks in the buffer cache but, if it can't, it calls back into the filesystem handler to determine the physical blocks that correspond to the logical blocks that VFS has asked it to read or write. The filesystem handler populates buffer_head structures containing the block numbers and device numbers, which are then built into a queue of IO requests. The queue of requests is then passed to the generic block device layer.

Device drivers, special files, and modules

Before we look at what goes on in the block device layer, we need to consider how the VFS layer (or the filesystem handler in some cases) knows how to find the driver implementation that supports a particular block device. After all, the system could be supporting IDE disks, SCSI disks, RAMdisks, loopback devices, and any number of other things. An ext2 filesystem will look much the same whichever of these things it is installed on, but naturally the hardware operations will be quite different. Underneath each block device (that is, each block special file of the form /dev/hdXX) will be a driver. The same driver can, and often does, support multiple block devices, and each block device can, in principle, support multiple hardware units. The driver code may be compiled directly into the core kernel, or made available as a loadable module. Each block special file is identified by two numbers: a major device number and a minor device number. In practice, a block special file is not a real file, and does not take any space on disk. The device numbers typically live in the file's inode; however, this is filesystem-dependent. Conventionally the major device number identifies either a particular driver or a particular hardware controller, while the minor number identifies a particular device attached to that controller.
      There have been some significant changes to the way that Linux handles block special files and drivers in the last year or so. One of the problems that these changes attempt to solve is that, in the 2.4-series kernels described here, major and minor numbers are 8-bit integers. If we assume loosely that each specific hardware controller attached to the system has its own major number (and that's a fair approximation) then we could have 200 or so different controllers attached (we have to leave some numbers free for things like /dev/null). However, the mapping between controller types and major numbers has traditionally always been static. What this means is that the Linux designers decided in advance what numbers should be assigned to what controllers. So, on an x86 system, major 3 is the primary IDE controller (/dev/hda and /dev/hdb), major 22 is the secondary IDE controller (/dev/hdc and /dev/hdd), major 8 is for the first 16 SCSI hard disks (/dev/sda...), and so on. In fact, most of the major numbers have been pre-allocated, so it's hard to find numbers for new devices.
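The way a major and a minor number share a single device number can be sketched in user-space C. This is only an illustration, assuming the traditional layout in which the top 8 bits of a 16-bit value hold the major number and the bottom 8 bits hold the minor number. The names sketch_dev_t, sketch_mkdev, sketch_major, and sketch_minor are invented for this example and are not the kernel's own.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; assumes 8 bits each for major and minor. */
#define SKETCH_MINOR_BITS 8
#define SKETCH_MINOR_MASK ((1u << SKETCH_MINOR_BITS) - 1)

typedef uint16_t sketch_dev_t;

/* Packs a major and a minor number into one device number. */
static sketch_dev_t sketch_mkdev(unsigned major, unsigned minor)
{
    assert(major <= 0xff && minor <= 0xff);
    return (sketch_dev_t)((major << SKETCH_MINOR_BITS) | minor);
}

/* Extracts the major number (the driver or controller). */
static unsigned sketch_major(sketch_dev_t dev)
{
    return dev >> SKETCH_MINOR_BITS;
}

/* Extracts the minor number (the unit on that controller). */
static unsigned sketch_minor(sketch_dev_t dev)
{
    return dev & SKETCH_MINOR_MASK;
}
```

With this layout, a device on the primary IDE controller (major 3) with minor number 1 would be packed as sketch_mkdev(3, 1), which is 0x0301. The 8-bit limit on each field is what restricts the system to 256 major numbers.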
      In more recent Linux kernels, we have the ability to mount /dev as a filesystem, in much the same way that /proc works. Under this system, device numbers get allocated dynamically, so we can have 200-odd devices per system, rather than 200-odd for the whole world.
      This issue of device number allocation may seem to be off-topic, but I am mentioning it because the system I am about to describe assumes that we are using the old-fashioned (static major numbers) system, and may be out-of-date by the time you read this. However, the basic principles remain the same.
      You should be aware also that block devices have been with Linux for a long time, and kernel support for driver implementers has developed significantly over the years. In 2.2-series kernels, for example, driver writers typically took advantage of a set of macros defined in kernel header files, to simplify the structure of the driver. For a good example of this style of driver authoring, look at drivers/ide/legacy/hd.c, the PC-AT legacy hard-disk driver. There are, in consequence, a number of different ways of implementing even a simple block device driver. In what follows, I will describe only the technique that seems to be most widely used in the latest 2.4.XX kernels. As ever, the principles are the same in all versions, but the mechanics are different.

Finding the device numbers for a filesystem

There's one more thing to consider before we look at how the filesystem layer interacts with the block device layer, and that is how the filesystem layer knows which driver to use for a given filesystem. If you think back to the mount operation described above, you may remember that sys_mount took the name of the block special file as an argument; this argument will usually have come from the command-line to the mount command. sys_mount then descends the filesystem to find the inode for the block special file:
path_lookup (dev_name, /*...*/, &nd_dev);
and from that inode it extracts the major and minor numbers, among other things, and stores them in the superblock structure for the filesystem. The superblock is then stored in a vfsmount structure, and the vfsmount attached to the dentry of the directory on which the filesystem is mounted. So, in short, as VFS descends the pathname of a requested file, it can determine the major and minor device numbers from the closest vfsmount above the desired file. Then, if we have the major number, we can ask the kernel for the struct block_device_operations that supports it, which was stored by the kernel when the driver was registered.

Registering the driver

We have seen that each device, or group of devices, is assigned a major device number, which identifies the driver to be invoked when requests are issued on that device. We now need to consider how the kernel knows which driver to invoke, when an IO request is queued for a specific major device number.
      It is the responsibility of the block device driver to register itself with the kernel's device manager. It does this by making a call to register_blkdev() (or devfs_register_blkdev() in modern practice). This call will usually be in the driver's initialization section, and therefore be invoked at boot time (if the driver is compiled in) or when the driver's module is loaded. Let's assume for now that the filesystem is hosted on an IDE disk partition, and will be handled by the ide-disk driver. When the IDE subsystem is initialized it probes for IDE controllers, and for each one it finds it executes (drivers/ide/ide-probe.c):
devfs_register_blkdev (hwif->major, hwif->name, ide_fops);
ide_fops is a structure of type block_device_operations, which contains pointers to functions implemented in the driver for doing the various low-level operations. We'll come back to this later. devfs_register_blkdev adds the driver to the kernel's driver table, assigning it a particular name, and a particular major number (the code for this is in fs/block_dev.c, but it's not particularly interesting). What the call really does is map a major device number to a block_device_operations structure. This structure is defined in include/linux/fs.h like this:
struct block_device_operations
  {
  int (*open) (struct inode *, struct file *);
  int (*release) (struct inode *, struct file *);
  int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
  // ... and a few more
  };
Each of these elements is a pointer to a function defined in the driver. For example, when the IDE driver is initialized, if its bus probe reveals that the attached device is a hard disk, then it points the open function at idedisk_open() (in drivers/ide/ide-disk.c). All this function does is signal to the kernel that the driver is now in use and, if the drive supports removable media, lock the drive door.
      In the structure definition above there were no read() or write() functions. That's not because I left them out, but because they don't exist. Unlike character devices, block devices don't expose read and write functionality directly to the layer above; instead they expose a function that handles requests delivered to a request queue.
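The mapping that registration sets up, from a major number to a table of function pointers, can be illustrated with a small user-space sketch. Everything in it is a simplified stand-in rather than kernel code: struct sketch_bdev_ops, sketch_register_blkdev, sketch_get_ops, and the toy driver are hypothetical names, and the inode and file arguments of the real operations are reduced to a bare minor number.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_MAJOR 256   /* assumes 8-bit major numbers */

/* Cut-down stand-in for block_device_operations. */
struct sketch_bdev_ops {
    int (*open)(int minor);
    int (*release)(int minor);
};

struct sketch_entry {
    const char *name;
    const struct sketch_bdev_ops *ops;
};

/* One slot per major number; an empty slot has ops == NULL. */
static struct sketch_entry sketch_table[SKETCH_MAX_MAJOR];

/* Returns 0 on success, -1 if the major is out of range or already taken. */
static int sketch_register_blkdev(unsigned major, const char *name,
                                  const struct sketch_bdev_ops *ops)
{
    assert(name != NULL && ops != NULL);
    if (major >= SKETCH_MAX_MAJOR || sketch_table[major].ops != NULL)
        return -1;
    sketch_table[major].name = name;
    sketch_table[major].ops = ops;
    return 0;
}

/* Returns the operations registered for a major number, or NULL. */
static const struct sketch_bdev_ops *sketch_get_ops(unsigned major)
{
    if (major >= SKETCH_MAX_MAJOR)
        return NULL;
    return sketch_table[major].ops;
}

/* A toy driver whose open and release just keep a usage count. */
static int toy_open_count;
static int toy_open(int minor)    { (void)minor; toy_open_count++; return 0; }
static int toy_release(int minor) { (void)minor; toy_open_count--; return 0; }
static const struct sketch_bdev_ops toy_ops = { toy_open, toy_release };
```

A caller that holds only a major number can then reach the driver with something like sketch_get_ops(3)->open(0), once the toy driver has been registered under major 3. Note that the table contains no read or write entry, which mirrors the point made above.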

Request queue management

We have seen that the filesystem layer builds a queue of requests for blocks to read or write. It then typically submits that queue to the block device layer by a call to submit_bh() (in drivers/block/ll_rw_blk.c). This function does some checks on the requests submitted, and then calls the request handling function registered by the driver (see below for details). The driver can either directly specify a request handler in its own code, or make use of the generic request handler in the block device layer. The latter is usually preferred, for the following reason.
      Most block devices, disk drives in particular, work most efficiently when asked to read or write a contiguous region of data. Suppose that the filesystem handler is asked to provide a list of the physical blocks that comprise a particular file, and that list turns out to be blocks 10, 11, 12, 1, 2, 3, and 4. Now, we could ask the block device driver to load the blocks in that order, but this would involve seven short reads, probably with a repositioning of the disk head between the third and fourth block. It would be more efficient to ask the hardware to load blocks 10-12, which it could do in a continuous read, and then 1-4, which are also contiguous. In addition, it would probably be more efficient to re-order the reads so that blocks 1-4 get done first, then 10-12. These processes are referred to in the kernel documentation as `coalescing' and `sorting'.
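The idea of sorting and coalescing can be shown with a short user-space sketch. It is not the kernel's algorithm, which merges requests into the queue as they arrive; it simply sorts a complete list of block numbers and then folds neighbouring numbers into runs. The names struct sketch_run and sketch_sort_and_coalesce are invented for the example, and the list is assumed to contain no duplicate block numbers.

```c
#include <assert.h>
#include <stdlib.h>

/* A contiguous range of blocks that could be read in one operation. */
struct sketch_run {
    unsigned long start;   /* first block in the run */
    unsigned long count;   /* number of contiguous blocks */
};

/* qsort comparator for unsigned long block numbers, ascending. */
static int cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/*
 * Sorts blocks[] in place, then merges adjacent block numbers into runs.
 * runs[] must have room for n entries. Returns the number of runs written.
 */
static size_t sketch_sort_and_coalesce(unsigned long *blocks, size_t n,
                                       struct sketch_run *runs)
{
    size_t nruns = 0;
    size_t i;

    assert(blocks != NULL && runs != NULL);
    qsort(blocks, n, sizeof blocks[0], cmp_block);

    for (i = 0; i < n; i++) {
        if (nruns > 0 &&
            blocks[i] == runs[nruns - 1].start + runs[nruns - 1].count) {
            runs[nruns - 1].count++;          /* extends the current run */
        } else {
            runs[nruns].start = blocks[i];    /* starts a new run */
            runs[nruns].count = 1;
            nruns++;
        }
    }
    return nruns;
}
```

For the block list used in the text (10, 11, 12, 1, 2, 3, 4) this produces two runs: four blocks starting at block 1, then three blocks starting at block 10. Seven short reads become two longer ones, issued in ascending order.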
      Now, coalescing and sorting themselves require a certain amount of CPU time, and not all devices will benefit. In particular, if the block device offers true random access -- a ramdisk, for example -- the overheads of sorting and coalescing may well outweigh the benefits. Consequently, the implementer of a block device driver can choose whether to make use of the request queue management features or not. If the driver should receive requests one-at-a-time as they are delivered from the filesystem layer, it can use the function
void blk_queue_make_request
      (request_queue_t *q, make_request_fn *mrf);
This takes two arguments: the queue in the kernel to which requests are delivered by the filesystem (of which, more later), and the function to call when each request arrives. An example of the use of this function might be:
#define MAJOR NNN // Our major number (NNN is a placeholder)

/*
my_request_fn() will be called whenever a request is ready 
to be serviced. Requests are delivered in no particular
order
*/
static int my_request_fn 
    (request_queue_t *q, int rw, struct buffer_head *rbh)
  {
  // read or write the buffer specified in rbh
  // ...
  return 0; // request has been dealt with; nothing to resubmit
  }

// Initialization section
blk_queue_make_request(BLK_DEFAULT_QUEUE(MAJOR), my_request_fn);
The kernel's block device manager maintains a default queue for each device, and in this example we have simply attached a request handler to that default queue.

If the driver is taking advantage of the kernel's request ordering and coalescing functions, then it registers itself using the function

void blk_init_queue
      (request_queue_t * q, request_fn_proc * rfn);
(also defined in drivers/block/ll_rw_blk.c). The second argument to this function is a pointer to a function that will be invoked when a sorted queue of requests is available to be processed. The driver might use this function like this:
/*
my_request_fn() will be called whenever a queue of requests 
is ready to be serviced. Requests are delivered ordered and 
coalesced
*/
static void my_request_fn 
    (request_queue_t *q)
  {
  // read or write the queue of buffers specified in *q 
  // ...
  }

// Initialization section
blk_init_queue(BLK_DEFAULT_QUEUE(MAJOR), my_request_fn);
So we have seen how the device registers itself with the generic block device layer, so that it can accept requests to read or write blocks. We must now consider what happens when these requests have been completed. You may remember that the interface between the filesystem layer and the block device layer is asynchronous. When the filesystem handler added the specifications of blocks to load into the buffer_head structure, it could also write a pointer to the function to call to indicate that the block had been read. This function was stored in the field b_end_io. In practice, when the filesystem layer submits a queue of blocks to read to the submit_bh() function in the block device layer, submit_bh() ultimately sets b_end_io to a generic end-of-block handler. This is the function end_buffer_io_sync (in fs/buffer.c). This generic handler simply marks the buffer complete and unlocks its memory.
      Just as the interface between the filesystem layer and the generic block device layer is asynchronous, so is the interface between the generic block device layer and the driver itself. The request handling functions described above (named my_request_fn in the code snippets) are expected not to block. Instead, these functions should schedule an IO request on the hardware, then notify the block device layer by calling b_end_io on each block when it is completed.
      In practice, device drivers typically make use of utility functions in the generic block device layer, which combine this notification of completion with the manipulation of the queue. If the driver registers itself using blk_init_queue(), its request handler can expect to be passed a pointer to a queue whenever there are requests available to be serviced. It uses utility functions to iterate through the queue, and to notify the block device layer every time a block is completed. We will look at these functions in more detail in the next section.
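The asynchronous pattern described here (accept the request, return at once, and call b_end_io later when the hardware has finished) can be sketched in user-space C. This is a simplified model under stated assumptions, not driver code: struct sketch_bh is a cut-down stand-in for buffer_head, the pending list stands in for work handed to the hardware, and sketch_hw_interrupt is a hypothetical function playing the part of the interrupt handler.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct buffer_head. */
struct sketch_bh {
    unsigned long b_blocknr;       /* block to transfer */
    int done;                      /* set by the completion handler */
    void (*b_end_io)(struct sketch_bh *bh, int uptodate);
    struct sketch_bh *next;        /* link in the pending list */
};

/* Transfers that have been scheduled but have not yet completed. */
static struct sketch_bh *pending_head;
static struct sketch_bh *pending_tail;

/* Generic completion handler: just records that the buffer is complete. */
static void sketch_end_io(struct sketch_bh *bh, int uptodate)
{
    bh->done = uptodate;
}

/* Request handler: schedules the transfer and returns without waiting. */
static int sketch_request_fn(struct sketch_bh *bh)
{
    assert(bh != NULL && bh->b_end_io != NULL);
    bh->next = NULL;
    if (pending_tail != NULL)
        pending_tail->next = bh;
    else
        pending_head = bh;
    pending_tail = bh;
    return 0;
}

/*
 * Stands in for the hardware interrupt: completes the oldest pending
 * transfer and notifies its owner through b_end_io.
 * Returns 1 if a buffer was completed, 0 if nothing was pending.
 */
static int sketch_hw_interrupt(void)
{
    struct sketch_bh *bh = pending_head;

    if (bh == NULL)
        return 0;
    pending_head = bh->next;
    if (pending_head == NULL)
        pending_tail = NULL;
    bh->b_end_io(bh, 1);
    return 1;
}
```

The important property is that sketch_request_fn never calls b_end_io itself. A buffer's done flag stays at zero after submission and changes only when the simulated interrupt fires, which is the sense in which the request handler does not block.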

Device driver interface

So, in summary, a specific block device driver has two interfaces with the generic block device layer. First, it provides functions to open, close, and manage the device, and registers them by calling register_blkdev(). Second, it provides a function that handles incoming requests, or incoming request queues, and registers that function by the appropriate kernel API call: blk_queue_make_request() or blk_init_queue(). Having registered a queue handler, the device driver typically uses utility functions in the generic block device layer to retrieve requests from the queue, and issue notifications when hardware operations are complete.
      In concept, then, a block device driver is relatively simple. Most of the work will be done in its request handling method, which will schedule hardware operations, then call notification functions when these operations complete.
      In reality, hardware device drivers have to contend with the complexities of interrupts, DMA, spinlocks, and IO, and are consequently much more complex than the simple interface between the device driver and the kernel would suggest. In the next, and final, installment, we will consider some of these low-level issues, using the IDE disk driver as an example.
