The K-Zone: File handling in the Linux kernel: filesystem layer

Into the filesystem layer

In the previous article in this series, we traced the execution from the entry point, through the VFS layer, to the inode structure for a particular file. This was a long and convoluted process, and before we describe the filesystem layer, it might be worth recapping it briefly. The purpose of the filesystem layer is, in outline, to convert operations on files into operations on disk blocks. It is the way in which file operations are converted to block operations that distinguishes one filesystem type from another. Since we have to use some sort of example, in the following I will concentrate on the ext2 filesystem type that we have all come to know and love.

We have seen how the VFS layer calls open through the inode which was created by the filesystem handler for the requested file. As well as open(), a large number of other operations is exposed in the inode structure. Looking at the definition of struct inode (in include/linux/fs.h), we have:

struct inode 
  {
  unsigned long           i_ino;
  umode_t                 i_mode;
  nlink_t                 i_nlink;
  uid_t                   i_uid;
  gid_t                   i_gid;
  // ... many other data fields

  struct inode_operations *i_op;
  struct file_operations  *i_fop; 
  };
The interface for manipulating the file is provided by the i_op and i_fop structures. These structures contain the pointers to the functions that do the real work; these functions are provided by the filesystem handler. For example, struct file_operations contains the following pointers:
struct file_operations 
  {
  int (*open) (struct inode *, struct file *);
  ssize_t (*read) (struct file *, char *, size_t, loff_t *);
  ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
  int (*release) (struct inode *, struct file *);
  // ... and many more
  };
You can see that the interface is clean -- there are no references to lower-level structures, such as disk block lists, and there is nothing in the interface that presupposes a particular type of low-level hardware. This clean interface should, in principle, make it easy to understand the interaction between the VFS layer and the filesystem handlers. Conceptually, for example, a file read operation ought to look like this: VFS asks the filesystem handler for the data, the handler converts the requested file region into disk blocks, and the block device reads those blocks. No doubt in some very simple filesystems, the sequence of operations that makes up a disk read is just like this. But in most cases the interaction between VFS and the filesystem is far from straightforward. To understand why, we need to consider in more detail what goes on at the filesystem level.
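
To make the idea of a table of function pointers concrete, here is a minimal user-space sketch. It is not kernel code: struct toy_file, struct toy_file_operations, ramfile_read() and toy_vfs_read() are all names I have made up for illustration, and the "filesystem" is just a buffer in memory. The only thing it borrows from the kernel is the shape of the interface, in which the caller dispatches through a pointer and has no idea where the bytes come from.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical user-space stand-ins for the kernel structures */
struct toy_file;

struct toy_file_operations
  {
  ssize_t (*read) (struct toy_file *, char *, size_t, long *);
  };

struct toy_file
  {
  const struct toy_file_operations *f_op; /* supplied by the "handler" */
  const char *data;                        /* handler-private contents */
  size_t size;                             /* length of data in bytes */
  };

/* A trivial "filesystem handler": the file lives in a memory buffer */
ssize_t ramfile_read (struct toy_file *f, char *buf, size_t count, long *ppos)
  {
  if (*ppos < 0 || (size_t) *ppos >= f->size)
    return 0;                              /* at or past end of file */
  size_t avail = f->size - (size_t) *ppos;
  if (count > avail)
    count = avail;
  memcpy (buf, f->data + *ppos, count);
  *ppos += (long) count;                   /* advance the file position */
  return (ssize_t) count;
  }

const struct toy_file_operations ramfile_ops = { .read = ramfile_read };

/* The "VFS" side: it only knows about the table of operations */
ssize_t toy_vfs_read (struct toy_file *f, char *buf, size_t count, long *ppos)
  {
  assert (f != NULL && buf != NULL && ppos != NULL);
  if (f->f_op == NULL || f->f_op->read == NULL)
    return -1;                             /* handler does not support read */
  return f->f_op->read (f, buf, count, ppos);
  }
```

A caller would fill in a struct toy_file with &ramfile_ops and some data, and then call toy_vfs_read() with a buffer and a position. Swapping in a different handler means supplying a different table; toy_vfs_read() does not change.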

The pointers in struct file_operations and struct inode_operations are hooks into the filesystem handler, and we can expect each handler to implement them slightly differently. Or can we? It's worth thinking, for example, about exactly what happens in the filesystem layer when an application opens or reads a file. Consider the `open' operation first. What exactly is meant by `opening' a file at the filesystem level? To the application programmer, `opening' has connotations of checking that the file exists, checking that the requested access mode is allowed, creating the file if necessary, and marking it as open. At the filesystem layer, by the time we call open() in struct file_operations, all of this has been done. It was done on the cached dentry for the file, and if the file did not exist, or had the wrong permissions, we would have found out before now. So the open() operation is more-or-less a no-brainer on most filesystem types.
      What about the read() operation? This operation will involve doing some real work, won't it? Well, possibly not. If we're lucky, the requested file region will have been read already in the not-too-distant past, and will be in a cache somewhere. If we're very lucky, then it will be available in a cache even if it hasn't been read recently. This is because disk drives work most effectively when they can read in a continuous stream. If we have just read physical block 123, for example, from the disk, there is an excellent chance that the application will need block 124 shortly. So the kernel will try to `read ahead' and load disk blocks into cache before they are actually requested. Similar considerations apply to disk write operations: writes will normally be performed on memory buffers, which will be flushed periodically to the hardware.
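
The effect of caching with read-ahead is easy to see in a toy model. The sketch below is entirely hypothetical and bears no relation to the kernel's real data structures: the "disk" is an array of small blocks, the cache has four slots with round-robin replacement, and a miss on block n also loads block n+1. The device_reads counter shows how often the "hardware" is really touched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CACHE_BLOCK_SIZE 16
#define TOY_DISK_BLOCKS       8
#define TOY_CACHE_SLOTS       4

struct toy_disk
  {
  char blocks[TOY_DISK_BLOCKS][TOY_CACHE_BLOCK_SIZE];
  int device_reads;              /* times we really touched the "disk" */
  };

struct toy_cache
  {
  int block_nr[TOY_CACHE_SLOTS]; /* -1 means the slot is empty */
  char data[TOY_CACHE_SLOTS][TOY_CACHE_BLOCK_SIZE];
  int next_victim;               /* simple round-robin replacement */
  };

void toy_cache_init (struct toy_cache *c)
  {
  for (int i = 0; i < TOY_CACHE_SLOTS; i++)
    c->block_nr[i] = -1;
  c->next_victim = 0;
  }

/* Return the slot holding block nr, or -1 if it is not cached */
static int toy_cache_find (const struct toy_cache *c, int nr)
  {
  for (int i = 0; i < TOY_CACHE_SLOTS; i++)
    if (c->block_nr[i] == nr)
      return i;
  return -1;
  }

/* Copy one block from the "disk" into the cache; return its slot */
static int toy_cache_load (struct toy_cache *c, struct toy_disk *d, int nr)
  {
  int slot = c->next_victim;
  c->next_victim = (c->next_victim + 1) % TOY_CACHE_SLOTS;
  memcpy (c->data[slot], d->blocks[nr], TOY_CACHE_BLOCK_SIZE);
  c->block_nr[slot] = nr;
  d->device_reads++;
  return slot;
  }

/* Return the cached copy of block nr, or NULL if nr is out of range.
   On a miss, also read ahead by one block. */
const char *toy_read_block (struct toy_cache *c, struct toy_disk *d, int nr)
  {
  assert (c != NULL && d != NULL);
  if (nr < 0 || nr >= TOY_DISK_BLOCKS)
    return NULL;
  int slot = toy_cache_find (c, nr);
  if (slot < 0)
    {
    slot = toy_cache_load (c, d, nr);
    /* The next block is likely to be wanted soon, so fetch it now */
    if (nr + 1 < TOY_DISK_BLOCKS && toy_cache_find (c, nr + 1) < 0)
      toy_cache_load (c, d, nr + 1);
    }
  return c->data[slot];
  }
```

Reading block 3 from an empty cache costs two device reads (blocks 3 and 4); a subsequent read of block 4 costs none. A real implementation would adapt the read-ahead window to the access pattern, which this sketch makes no attempt to do.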
      Now, this disk caching, buffering, and read-ahead support is, for the most part, filesystem-independent. Of the huge amount of work that goes on when the application reads data from a file, the only part that is filesystem-specific is the mapping of logical file offsets to physical disk blocks. Everything else is generic. Now, if it is generic, it can be considered part of VFS, along with all the other generic file operations, right? Well, no actually. I would suggest that conceptually the generic filesystem stuff forms a separate architectural layer, sitting between the individual filesystem handlers and the block devices. Whatever the merits of this argument, the Linux kernel is not structured like this. You'll see, in fact, that the code is split between two subsystems: the VFS subsystem (in the fs directory of the kernel source), and the memory management subsystem (in the mm directory). There is a reason for this, but it is rather convoluted, and you may not need to understand it to make sense of the rest of the disk access procedure which I will describe later. But, if you are interested, it's like this.

A digression: memory mapped files

Recent Linux kernels make use of the concept of `memory mapped files' for abstracting away low-level file operations, even within the kernel. To use a memory-mapped file, the kernel maps a contiguous region of virtual memory to a file. Suppose, for example, the kernel was manipulating a file 100 megabytes long. The kernel sets up 100 megabytes of virtual memory, at some particular point in its address space. Then, in order to read from some particular offset within the file, it reads from the corresponding offset into virtual memory. On some occasions, the requested data will be in physical memory, having been read from disk. On others, the data will not be there when it is read. After all, we aren't really going to read a hundred megabyte file all at once, and then find we only need ten bytes of it. When the kernel tries to read from a file region that does not exist in physical memory, a page fault is generated, which traps into the virtual memory management system. This system then allocates physical memory, then schedules a file read to bring the data into memory.
      You may be wondering what advantage this memory mapping offers over the simplistic view of disk access I described above, where VFS asks for the data, the filesystem converts the file region into blocks, and the block device reads those blocks. Well, quite apart from memory mapping being a convenient abstraction, the kernel will have a memory-mapped file infrastructure anyway. It must have, even if it doesn't use that particular term. The ability to swap physical memory with backing store (disk, usually), when particular regions of virtual memory are requested, is a fundamental part of memory management on all modern operating systems. If we couldn't do this, the total virtual memory available to the system would be limited to the size of physical memory. There could be no demand paging. So, the argument goes, if we have to have a memory-mapped file concept, with all the complex infrastructure that entails, we may as well use it to support ordinary files, as well as paging to and from a swap file. Consequently, most (all?) file operations carried out in the Linux kernel make use of the memory-mapped file infrastructure.
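
The same idea is visible from user space through the POSIX mmap() call, and that is the easiest place to try it out. The sketch below is a user-space analogy only, not what the kernel does internally; byte_at_offset() is a name I have invented. It maps a whole file, touches a single byte, and leaves it to the virtual memory system to fetch whichever page that byte lives in.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Return the byte at the given offset of a file, or -1 on error.
   The data is fetched through a memory mapping, not through read(). */
int byte_at_offset (const char *path, off_t offset)
  {
  assert (path != NULL);
  int result = -1;
  int fd = open (path, O_RDONLY);
  if (fd < 0)
    return -1;

  struct stat st;
  if (fstat (fd, &st) == 0 && offset >= 0 && offset < st.st_size)
    {
    /* Map the whole file. Setting up the mapping does not, in itself,
       require any file data to be read. */
    unsigned char *p = mmap (NULL, (size_t) st.st_size, PROT_READ,
        MAP_PRIVATE, fd, 0);
    if (p != MAP_FAILED)
      {
      /* Touching the byte may cause a page fault, which the kernel
         satisfies by bringing in the page that contains it */
      result = p[offset];
      munmap (p, (size_t) st.st_size);
      }
    }
  close (fd);
  return result;
  }
```

However large the file is, only the page containing the requested byte needs to be resident for the access to succeed, which is exactly the property the argument above relies on.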

The ext2 filesystem handler

In the following, I am using ext2 as an example of a filesystem handler but, as should be clear by now, we don't lose a lot of generality. Most of the filesystem infrastructure is generic anyway. I ought to point out, for the sake of completeness, that none of what follows is mandatory for a filesystem handler. So long as the handler implements the functions defined in struct file_operations and struct inode_operations, the VFS layer couldn't care less what goes on inside the handler. In practice, most filesystem type handlers do work the way I am about to describe, with minor variations.

Let's dispose of the open() operation first, as this is trivial (remember that VFS has done all the hard work by the time the filesystem handler is invoked). VFS calls open() in the struct file_operations provided by the filesystem handler. In the ext2 handler, this structure is initialized like this (fs/ext2/file.c):

struct file_operations ext2_file_operations = 
  {
  llseek:         generic_file_llseek,
  read:           generic_file_read,
  write:          generic_file_write,
  ioctl:          ext2_ioctl,
  mmap:           generic_file_mmap,
  open:           generic_file_open,
  release:        ext2_release_file,
  fsync:          ext2_sync_file,
  };
Notice that most of the file operations are simply delegated to the generic filesystem infrastructure. open() maps onto generic_file_open(), which is defined in fs/open.c:
int generic_file_open
  (struct inode * inode, struct file * filp)
  {
  if (!(filp->f_flags & O_LARGEFILE) && 
       inode->i_size > MAX_NON_LFS)
    return -EFBIG;
  return 0;
  }
Not very interesting, is it? All this function does is refuse to open a file that is larger than MAX_NON_LFS unless the caller has asked for large file support by setting O_LARGEFILE. All the hard work has already been done by this point.
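
The logic of the check can be exercised in user space. In the sketch below, TOY_O_LARGEFILE and TOY_MAX_NON_LFS are stand-ins whose values I have assumed for illustration (the real definitions are in the kernel headers and the flag value varies by architecture), and toy_generic_file_open() takes the flags and file size directly rather than digging them out of kernel structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed values, for illustration only */
#define TOY_O_LARGEFILE  0x8000
#define TOY_MAX_NON_LFS  ((((int64_t) 1) << 31) - 1)

/* Mirror of the check: 0 if the open may proceed, -EFBIG otherwise */
int toy_generic_file_open (int f_flags, int64_t i_size)
  {
  assert (i_size >= 0);
  /* Only callers that did NOT ask for large file support are refused,
     and only when the file is too big for them to handle */
  if (!(f_flags & TOY_O_LARGEFILE) && i_size > TOY_MAX_NON_LFS)
    return -EFBIG;
  return 0;
  }
```

So toy_generic_file_open(0, 1024) succeeds, as does any call with TOY_O_LARGEFILE set; only a large file opened without the flag is rejected.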

The read() operation is more interesting. This function results in a call on generic_file_read(), which is defined in mm/filemap.c (remember, file reads are part of the memory management infrastructure!). The logic is fairly complex, but for our purposes -- talking about file management, not memory management -- can be distilled down to something like this:

/*
arguments to generic_file_read:
filp - the file structure from VFS
buf - buffer to read into
count - number of bytes to read
ppos - offset into file at which to read
*/
ssize_t generic_file_read (struct file * filp, 
    char * buf, size_t count, loff_t *ppos)
  {
  struct address_space *mapping = 
      filp->f_dentry->d_inode->i_mapping;
   
  // Use the inode to convert the file offset and byte
  //  count into logical disk blocks. Work out the number of 
  //  memory pages this corresponds to. Then loop until we
  //  have all the required pages 
  while (more_to_read)
    {
    if (is_page_in_cache)
      {
      // add cached page to buffer
      } 
    else
      {
      struct page *page = page_cache_alloc(mapping);
      // Ask the filesystem handler for the logical page.
      // This operation is non-blocking
      mapping->a_ops->readpage(filp, page);

      // Schedule a look-ahead read if possible
      generic_file_readahead(...);

      // Wait for request page to be delivered from the IO subsystem
      wait_on_page(page);

      // Add page to buffer
      // Mark new page as clean in cache
      }
    } 
  }
In this code we can see (in outline) the caching and read-ahead logic. It's important to remember that because the generic_file_read code is part of the memory management subsystem, its operations are expressed in terms of (virtual memory) pages, not disk blocks. Ultimately we will be reading disk blocks, but not here. In practice, disk blocks will often be 1kB, and pages 4kB. So we will have four block reads for each page read. generic_file_read can't get real data from a real filesystem, either in blocks or in pages, because only the filesystem knows where the logical blocks in the file are located on the disk. So, for this discussion, the most important feature of the above code is the call:
mapping->a_ops->readpage(filp, page);
This is a call through the inode of the file, back into the filesystem handler. It is expected to schedule the read of a page of data, which may encompass multiple disk blocks. In reality, reading a page is also a generic operation -- it is only reading blocks that is filesystem-specific. A page read just works out the number of blocks that constitute a page, and then calls another function to read each block. So, in the ext2 filesystem example, the readpage function pointer points to ext2_readpage() (in fs/ext2/inode.c), which simply calls back into the generic VFS layer like this:
static int ext2_readpage
      (struct file *file, struct page *page)
  {
  return block_read_full_page(page,ext2_get_block);
  }
block_read_full_page() (in fs/buffer.c) calls the ext2_get_block() function once for each block in the page. This function does not do any IO itself, or even delegate it to the block device. Instead, it determines the location on disk of the requested logical block, and returns this information in a buffer_head structure (of which, more later). The ext2 handler does know the block device (because this information is stored in the inode object that has been passed all the way down from the VFS layer). So it could quite happily ask the device to do a read. It doesn't, and the reason for this is quite subtle. Disk devices generally work most efficiently when they are reading or writing continuously. They don't work so well if the disk head is constantly switching tracks. So for best performance, we want to try to arrange the disk blocks to be read sequentially, even if that means that they are not read or written in the order they are requested. The code to do this is likely to be the same for most, if not all, disk devices. So disk drivers typically make use of the generic request management code in the block device layer, rather than scheduling IO operations themselves.
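
The division of labour between the generic code and the handler's block-mapping callback can be sketched in user space as follows. Everything here is hypothetical: struct toy_buffer_head is a drastically cut-down stand-in, toy_map_full_page() only imitates the per-block loop and is not the kernel's block_read_full_page(), and toy_contiguous_get_block() pretends that every file is laid out contiguously from disk block 1000. I have assumed 4kB pages and 1kB blocks, as in the text.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE        4096
#define TOY_BLOCK_SIZE       1024
#define TOY_BLOCKS_PER_PAGE  (TOY_PAGE_SIZE / TOY_BLOCK_SIZE)

/* Hypothetical cut-down buffer head: just the result of the mapping */
struct toy_buffer_head
  {
  unsigned long b_blocknr;   /* physical block on the device */
  int mapped;                /* non-zero once the handler has filled it in */
  };

/* Handler-supplied callback: map a logical file block to a disk block.
   Returns 0 on success, negative if the block cannot be mapped. */
typedef int (toy_get_block_t) (unsigned long logical,
    struct toy_buffer_head *bh);

/* Generic code: ask the handler to map every block of one page.
   No IO happens here. Returns the number of blocks mapped. */
int toy_map_full_page (unsigned long page_index, toy_get_block_t *get_block,
    struct toy_buffer_head bhs[TOY_BLOCKS_PER_PAGE])
  {
  assert (get_block != NULL && bhs != NULL);
  /* First logical block of the file that falls inside this page */
  unsigned long first = page_index * TOY_BLOCKS_PER_PAGE;
  int mapped = 0;
  for (int i = 0; i < TOY_BLOCKS_PER_PAGE; i++)
    {
    bhs[i].mapped = 0;
    if (get_block (first + i, &bhs[i]) == 0)
      {
      bhs[i].mapped = 1;
      mapped++;
      }
    }
  return mapped;
  }

/* A made-up handler: files start at disk block 1000 and are contiguous.
   A real handler would consult its on-disk block tables here. */
int toy_contiguous_get_block (unsigned long logical,
    struct toy_buffer_head *bh)
  {
  bh->b_blocknr = 1000 + logical;
  return 0;
  }
```

Mapping page 2 of a file with this handler yields disk blocks 1008 to 1011. The generic loop never needs to know how the handler arrived at those numbers, and the handler never needs to know about pages.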
      So, in short, the filesystem handler does not do any IO; it merely fills in a buffer_head structure for each block required. The salient parts of the structure are:
struct buffer_head 
  {
  struct buffer_head *b_next;     /* Next buffer in list */
  unsigned long b_blocknr;        /* Block number */
  kdev_t b_dev;                   /* Device */
  struct page *b_page;            /* Memory this block is mapped to */
  void (*b_end_io)(struct buffer_head *bh, int uptodate); 
  // ... and lots more
  };
You can see that the structure contains a block number, an identifier for the device (b_dev), and a reference to the memory into which the disk contents should be read. kdev_t is an integer containing the major and minor device numbers packed together. buffer_head therefore contains everything the IO subsystem needs to do the real read. It also defines a function called b_end_io that the block device layer will call when it has loaded the requested block (remember this operation is asynchronous). However, the VFS generic filesystem infrastructure does not hand off this structure to the IO subsystem as soon as it is returned from the filesystem handler. Instead, as the filesystem handler populates buffer_head objects, VFS builds them into a queue (a linked list), and then submits the whole queue to the block device layer. A filesystem handler can implement its own b_end_io function or, more commonly, make use of the generic end-of-block processing found in the generic block device layer, which we will consider next.
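
Here is a rough user-space model of why it pays to submit a whole queue rather than one block at a time. It is a sketch under several assumptions: struct toy_bh, toy_end_io() and toy_submit_queue() are invented names, the queue is an array of pointers rather than a linked list, the "device" services requests synchronously, and the reordering is a plain sort by block number rather than anything resembling the kernel's real request management.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cut-down buffer head */
struct toy_bh
  {
  unsigned long b_blocknr;   /* block to read */
  int uptodate;              /* set by the completion callback */
  void (*b_end_io) (struct toy_bh *bh, int uptodate);
  };

/* Generic completion handler: just record the outcome */
void toy_end_io (struct toy_bh *bh, int uptodate)
  {
  bh->uptodate = uptodate;
  }

/* qsort comparator: order queue entries by ascending block number */
static int toy_cmp_blocknr (const void *a, const void *b)
  {
  const struct toy_bh *x = *(struct toy_bh * const *) a;
  const struct toy_bh *y = *(struct toy_bh * const *) b;
  return (x->b_blocknr > y->b_blocknr) - (x->b_blocknr < y->b_blocknr);
  }

/* Stand-in for the block device layer: service the whole queue in
   ascending block order, calling b_end_io as each block "arrives".
   The order of service is recorded in order[], which needs n entries. */
void toy_submit_queue (struct toy_bh **queue, size_t n, unsigned long *order)
  {
  assert (queue != NULL && order != NULL);
  qsort (queue, n, sizeof queue[0], toy_cmp_blocknr);
  for (size_t i = 0; i < n; i++)
    {
    order[i] = queue[i]->b_blocknr;      /* pretend to read the block */
    if (queue[i]->b_end_io != NULL)
      queue[i]->b_end_io (queue[i], 1);  /* report successful completion */
    }
  }
```

If blocks 124, 7 and 123 are queued in that order, they are serviced as 7, 123, 124, so the two adjacent blocks are read back to back. That reordering is only possible because the requests were collected before being handed over.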

©1994-2006 Kevin Boone, all rights reserved