The K-Zone: File handling in the Linux kernel: VFS layer

In the previous article in this series, I explained how application code entered the kernel by means of a system call. In this article we will examine the VFS (`virtual filesystem') layer, which provides the most general implementation of these system calls.

The VFS layer

The sys_open function, and the corresponding sys_read, sys_write, and so on, are entry points into the kernel's VFS layer. These functions are largely architecture-independent, for reasons that will become clear later. The `virtual' part of `virtual filesystem' reflects the fact that the operations carried out on this layer are independent of lower-level implementation details. In particular, VFS hides two things from the application programmer: the type of filesystem that holds the file, and the kind of storage device on which that filesystem resides. These two lower-level layers are somewhat independent of one another. We can, for example, put an ext2 filesystem on a SCSI disk, or a UFS filesystem on an IDE disk. The system administrator has a fair degree of discretion, but not total freedom. You can't, for example, put an NFS filesystem on a hard disk of any type, because NFS is, by definition, a network filesystem.
      The VFS layer deals with the file open operation we are currently discussing, and other primitive operations on files such as reads and writes. It also deals with mounting filesystems and attaching block device drivers to mount points, as we shall see later.

In outline, sys_open looks like this:

int sys_open (const char *name, int flags, int mode)
  {
  int fd = get_unused_fd();
  struct file *f = filp_open(name, flags, mode);
  fd_install(fd, f);
  return fd;
  }
The function get_unused_fd() attempts to find an empty slot in the current process's file descriptor table. This operation can fail, of course, if the process has too many file descriptors open. If it succeeds, the function filp_open() is called to do the real open; if filp_open() succeeds, the file structure it returns is assigned to the allocated slot in the file descriptor table by fd_install().
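To make the slot allocation concrete, here is a small user-space model of a descriptor table. It is only a sketch: every name beginning with toy_ and the limit TOY_MAX_FDS are my own inventions, not kernel identifiers, and the failure of the open step is modelled simply by passing NULL. What it does show is the error handling that the outline above leaves out, in particular that a slot reserved by the first step must be given back if the open itself fails.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MAX_FDS 8   /* hypothetical limit, far smaller than a real one */

struct toy_file { int flags; };             /* stand-in for struct file */

struct toy_files {                          /* stand-in for a process's table */
    struct toy_file *fd[TOY_MAX_FDS];
    unsigned char reserved[TOY_MAX_FDS];    /* 1 = slot has been handed out */
};

/* Reserve the lowest free slot, or fail with -EMFILE if none is left. */
int toy_get_unused_fd(struct toy_files *files)
{
    for (int i = 0; i < TOY_MAX_FDS; i++) {
        if (!files->reserved[i]) {
            files->reserved[i] = 1;
            return i;
        }
    }
    return -EMFILE;
}

/* Publish an opened file in a slot that was reserved earlier. */
void toy_fd_install(struct toy_files *files, int fd, struct toy_file *f)
{
    files->fd[fd] = f;
}

/* Give a slot back, as close() or a failed open would. */
void toy_put_fd(struct toy_files *files, int fd)
{
    files->fd[fd] = NULL;
    files->reserved[fd] = 0;
}

/* The three steps of the outline, plus the error paths it omits.
 * `opened` plays the part of filp_open()'s result; NULL means failure. */
int toy_sys_open(struct toy_files *files, struct toy_file *opened)
{
    int fd = toy_get_unused_fd(files);
    if (fd < 0)
        return fd;                  /* table is full */
    if (opened == NULL) {
        toy_put_fd(files, fd);      /* do not leak the reserved slot */
        return -ENOENT;
    }
    toy_fd_install(files, fd, opened);
    return fd;
}
```

Because the search always starts at slot zero, a descriptor number released by one file is the first to be reused by the next open, which is the behaviour applications see from the real open() call.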

In summary, all sys_open does is to call filp_open() and then assign its output to a free file descriptor. filp_open() is also defined in open.c, and looks in outline like this:

struct file *filp_open 
      (const char *name, int flags, int mode)
  {
  struct nameidata nd;
  open_namei (name, flags, mode, &nd);
  return dentry_open(nd.dentry, nd.mnt, flags);
  }                                                                              
The filp_open function has, in essence, two steps. First, it uses open_namei (in fs/namei.c) to generate a nameidata structure. This structure provides, among other things, a link to the file's inode (of which, more later). The second step calls dentry_open(), passing the salient information from the nameidata structure. It is this latter step which will typically do the `open' operation, by delegating it to a handler for the appropriate filesystem.
      In order for either of these steps to make sense, we will have to make a brief digression into the world of inodes and dentries. In particular, we need to distinguish between the purposes of the inode, dentry, and file structures.

The concept of an `inode' is an ancient one in the Unix world. An inode is a block of data that stores system-level information about a single file, such as its access modes, size, and references to its location on the physical storage device. In Unix, different filenames may point to the same file on disk (this is what links are for), but each file has exactly one inode. A directory, therefore, is nothing more than a mapping between filenames and inodes. A directory is itself a file, and consequently has an inode of its own. A number of processes may have the same file open, subject to locking restrictions, but each process `sees' the same inode, because each `sees' the same file. An inode is partly exposed to the application program by a stat structure. stat does not expose the low-level information about the file, but it does expose the owner, group, size, datestamps, and so on.
      Now, although many processes can see the same file, and therefore the same inode, these various processes have different rights over that file. For example, if one process has a file open for writing, very likely it will seek to limit the rights of other processes. It should be obvious that we need a way for each process to maintain its own private view of the file, in addition to the shared view provided by the inode. In Linux, the private view is provided by the file structure. The file contains information about the file as seen by a particular process. Notice that the filp_open function shown above returns a file, not an inode. However, the file contains (indirectly) a reference to its inode.
      In practical Linux systems, operations on inodes are a significant part of the work of the system. Consequently these operations need to be optimized if possible. The Linux VFS layer attempts to cache inodes in memory; this cache is maintained by a linked list of dentry structures. Each dentry contains a reference to the inode, and the name of the file.
      So, in summary, an inode is a representation of a file, shared by all users of that file; a file structure is the view of a file (and its inode) seen by a particular process; a dentry is a structure that caches an inode in memory along with the filename that locates it.
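The relationship between the three structures may be easier to see in miniature. The following is a user-space sketch, not kernel code: the toy_ structures and their fields are hypothetical, and I have chosen the current read position as the example of per-opener state. Two names refer to one inode, and each opener gets a private view that reaches the shared inode by way of a dentry.

```c
#include <stddef.h>

struct toy_inode  { unsigned long ino; long size; int open_count; };
struct toy_dentry { const char *name; struct toy_inode *inode; };
struct toy_file   { struct toy_dentry *dentry; int flags; long pos; };

/* Give one opener its own private view of the file named by `d`. */
struct toy_file toy_open_view(struct toy_dentry *d, int flags)
{
    struct toy_file f = { d, flags, 0 };
    d->inode->open_count++;         /* shared state: visible to everyone */
    return f;
}

/* Advance one opener's position by up to `nbytes`, stopping at the end
 * of the file. Other openers of the same inode are unaffected. */
long toy_advance(struct toy_file *f, long nbytes)
{
    long left = f->dentry->inode->size - f->pos;
    if (nbytes > left)
        nbytes = left;
    if (nbytes < 0)
        nbytes = 0;
    f->pos += nbytes;               /* private state: this view only */
    return nbytes;
}
```

If two dentries named, say, `report' and `report-link' both point at the same toy_inode, advancing a view opened through one name leaves the position of a view opened through the other at zero, while the size and open_count stored in the inode are common to both.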

So, let's return to filp_open(). This function calls open_namei(), which finds the dentry for a particular filename. It does this by descending the requested pathname one component at a time, relative either to the root (/), or to the process's working directory, depending on whether the pathname begins with a '/' or not. It descends until it gets to the requested file, or realizes that it is never going to get there. If each component of the pathname has already been cached as a dentry, this is a fast process, as all operations are carried out in memory. If the file has not been the subject of an access recently, its inode will probably not be cached. In such a case, the inode will have to be read from the filesystem layer (see below), and then cached. But we'll deal with that process later. The open_namei() function populates a nameidata, the salient elements of which look like this:

struct nameidata 
  {
  struct dentry *dentry;
  struct vfsmount *mnt;
  //...
  }
The dentry contains a reference to the cached inode of the file, while the vfsmount references the filesystem on which the file is located. As we shall see, the mount utility has the effect of causing vfsmount elements to be registered with the kernel.
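The component-by-component descent that open_namei() performs can be sketched in user space as follows. This is a toy under stated assumptions: the toy_dentry tree lives entirely in memory, so it models only the fast, fully cached case, and it ignores `.', `..', symbolic links and mount points. None of the toy_ names exist in the kernel.

```c
#include <stddef.h>
#include <string.h>

struct toy_dentry {
    const char *name;
    struct toy_dentry *parent;
    struct toy_dentry *child;     /* first entry in this directory */
    struct toy_dentry *sibling;   /* next entry in the same directory */
};

/* Attach `d` under `parent` (names are assumed unique per directory). */
void toy_d_add(struct toy_dentry *parent, struct toy_dentry *d)
{
    d->parent = parent;
    d->sibling = parent->child;
    parent->child = d;
}

/* Look up one component of length `len` among the entries of `dir`. */
static struct toy_dentry *toy_lookup_one(struct toy_dentry *dir,
                                         const char *name, size_t len)
{
    for (struct toy_dentry *d = dir->child; d != NULL; d = d->sibling)
        if (strlen(d->name) == len && memcmp(d->name, name, len) == 0)
            return d;
    return NULL;
}

/* Walk `path` one component at a time, starting at `root` if the path
 * begins with '/', otherwise at `cwd`. Returns NULL if any component
 * is missing. */
struct toy_dentry *toy_path_walk(struct toy_dentry *root,
                                 struct toy_dentry *cwd, const char *path)
{
    struct toy_dentry *cur = (path[0] == '/') ? root : cwd;

    while (cur != NULL && *path != '\0') {
        while (*path == '/')              /* skip separators */
            path++;
        size_t len = strcspn(path, "/");
        if (len == 0)
            break;                        /* trailing slash, or just "/" */
        cur = toy_lookup_one(cur, path, len);
        path += len;
    }
    return cur;
}
```

In the real kernel a miss in the cache is not a failure: it is the point at which the filesystem handler is asked to read the directory from disk. The toy simply gives up and returns NULL.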

So, the filp_open function now knows the dentry, and therefore the inode (if it exists) for the requested file, and a vfsmount structure which references the block special file for the filesystem. The open() code does not itself refer to the vfsmount, as this code gets access to the lower-level functions for the file through pointers in the inode structure itself. However, the vfsmount is stored in the file structure for future use. The filp_open function now calls dentry_open(). This function looks like this:

struct file *dentry_open
     (struct dentry *dentry, struct vfsmount *mnt, int flags)
  {
  struct file * f = // get an empty file structure;

   f->f_flags = flags;
   // ...populate other file structure elements

   // Get the inode for the file
   struct inode *inode = dentry->d_inode;

   // Call open() through the inode 
   f->f_op = fops_get(inode->i_fop);
   f->f_op->open(inode, f);
   return f;
  }
This function finds the address of a function in the filesystem layer that can handle the open operation and calls it, passing the inode and the file structure. Where does this function address come from? It will be found either in the inode itself, or in one of the inode's parent inodes. In order to understand where these inodes come from, we need to see how filesystem handlers are registered, and individual filesystems mounted. We will return to the sys_open() method later.
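The call through f_op is ordinary C function-pointer dispatch, and it can be tried out in user space. In the sketch below the two `handlers' and all the toy_ names are hypothetical; the only thing borrowed from the kernel outline is the shape of the call, in which the generic code copies a table of operations from the inode and calls whatever open routine it finds there.

```c
#include <stddef.h>

struct toy_inode;
struct toy_file;

struct toy_file_operations {
    int (*open)(struct toy_inode *, struct toy_file *);
};

struct toy_inode { const struct toy_file_operations *i_fop; };
struct toy_file  { const struct toy_file_operations *f_op; int opened_by; };

/* Two hypothetical filesystem handlers, each with its own open routine.
 * They record a marker so that we can see which one ran. */
static int toy_ext2_open(struct toy_inode *inode, struct toy_file *f)
{ (void)inode; f->opened_by = 2; return 0; }

static int toy_proc_open(struct toy_inode *inode, struct toy_file *f)
{ (void)inode; f->opened_by = 9; return 0; }

const struct toy_file_operations toy_ext2_fops    = { toy_ext2_open };
const struct toy_file_operations toy_proc_fops    = { toy_proc_open };
const struct toy_file_operations toy_no_open_fops = { NULL };

/* Generic layer: copy the table from the inode and call through it.
 * A handler that needs no work at open time leaves the pointer NULL. */
int toy_dentry_open(struct toy_inode *inode, struct toy_file *f)
{
    f->f_op = inode->i_fop;
    if (f->f_op != NULL && f->f_op->open != NULL)
        return f->f_op->open(inode, f);
    return 0;
}
```

The generic function never names a filesystem. Which open routine runs is decided entirely by the table that was stored in the inode when the inode was set up, and that is the question the rest of this article sets out to answer.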

mount() support in the VFS layer

We have seen that application-level file operations like open() and read() are made, via a thin layer in the C standard library, on the kernel's VFS layer. The key feature of the VFS layer is that it is filesystem-independent, and device-independent. An open() call will work on any kind of filesystem, on any kind of physical hardware, in more-or-less the same way. However, underneath the VFS layer will be a filesystem layer, which contains support for the various filesystems known to the kernel. Underneath the filesystem layer will be the device layer, which interacts with the hardware. The mounting process provides the bridge between the VFS layer and the lower-level operations of the filesystem and device.
      There is a potential ambiguity in terminology here. The word `filesystem' is sometimes used to mean a particular type of filesystem (e.g., ext3, ufs), and sometimes to mean a particular mounted instance of that filesystem type (e.g., the ext3 filesystem mounted on /usr). In the following, I will use the term `filesystem type' to refer to a particular type of filesystem, and `mounted filesystem' to refer to a specific instance. But don't expect the general Linux documentation to be this consistent.

You can get a list of supported filesystem types by doing

cat /proc/filesystems
A typical Linux system will support at least ext2 and iso9660 (CDROM) filesystem types, and some systems will be configured to support many more. Some of the filesystem types will be implemented in code directly compiled into the core kernel. The proc filesystem type itself is likely to be of this type. The proc type supports the /proc directory which, as you probably know, allows applications to find out about, and interact with, the kernel and device drivers. The proc filesystem type is interesting in another respect -- it demonstrates that in Linux a filesystem need not correspond to any real, physical hardware. /proc exists entirely in memory, and is generated dynamically.
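A program can read the same list that cat displays. In /proc/filesystems each line names one filesystem type, and the line begins with the word nodev if the type does not need a block device (proc is one such type). The function below is a minimal sketch: its name is my own, and it assumes that line format and that no line is longer than the buffer.

```c
#include <stdio.h>
#include <string.h>

/* Count the filesystem types listed in `path`, and store in *nodev the
 * number of them flagged "nodev". Returns -1 if the file cannot be
 * opened, in which case *nodev is left untouched. */
int count_fs_types(const char *path, int *nodev)
{
    char line[256];
    int total = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    *nodev = 0;
    while (fgets(line, sizeof line, fp) != NULL) {
        if (line[0] == '\n')
            continue;                     /* ignore blank lines */
        if (strncmp(line, "nodev", 5) == 0)
            (*nodev)++;
        total++;
    }
    fclose(fp);
    return total;
}
```

Calling count_fs_types("/proc/filesystems", &n) on a running system gives the number of filesystem types the kernel currently knows about; the figure may grow as modules are loaded.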
      Some filesystem types will more likely be supported through loadable modules, particularly if they are not used all that often. For example, if you mount DOS or Windows disks only occasionally, you may have configured the handlers for FAT and VFAT filesystems as loadable modules.
      Regardless of whether a filesystem handler is compiled into the core kernel, or implemented in a loadable module, the handler must register itself with the kernel. It will usually do this in its initialization code, by making a call on the kernel's register_filesystem(), which is defined in fs/super.c. This function looks like this:
int register_filesystem(struct file_system_type *fs)
  {
  struct file_system_type ** p = 
    find_filesystem(fs->name);
  if (*p)
    return -EBUSY;
  else
    { *p = fs; return 0; }
  }
All this simple function does is to check whether the specified filesystem type is already registered and, if not, store the supplied file_system_type structure in the kernel's filesystem table. The filesystem handler's initialization code initializes a struct file_system_type, which looks like this:
struct file_system_type 
  {
  const char *name;
  struct super_block *(*read_super) 
    (struct super_block *, void *, int);
  //...
  }
The structure contains the name of the filesystem type (e.g., ext3), and the address of a function called read_super. This function is provided by the filesystem handler, and will be called by the kernel when a filesystem of this type is mounted. The read_super function will have the task of initializing a struct super_block, the contents of which will very likely be derived from the superblock on the physical disk. The superblock (of which there may be more than one on a real disk or partition) contains fundamental information about the structure of the filesystem, the maximum supported file size, and the filesystem type. The super_block struct contains a memory image of this information, and also pointers to the block device operations needed to operate on this filesystem. We will discuss superblock operations in more detail later.
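The registration outline returns a pointer to a pointer from find_filesystem(), which can look odd at first sight. Here is a user-space sketch of how that idiom works, on the assumption that the filesystem table is a singly-linked list. The toy_ names, and the simplified read_super signature, are mine. The finder returns the address of the link that either points at the matching entry or holds the terminating NULL, so the caller can test for a duplicate and append with the same pointer.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    int (*read_super)(void *sb, void *data, int silent);  /* simplified */
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;    /* head of the list */

/* Return the address of the link that points at `name`, or of the
 * terminating NULL link if `name` is not registered. */
static struct toy_fs_type **toy_find_filesystem(const char *name)
{
    struct toy_fs_type **p;

    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Register a type, refusing a second registration of the same name. */
int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p = toy_find_filesystem(fs->name);

    if (*p != NULL)
        return -EBUSY;
    fs->next = NULL;
    *p = fs;                /* append by overwriting the NULL link */
    return 0;
}

/* Look a type up by name, as the mount code will need to do. */
struct toy_fs_type *toy_get_fs_type(const char *name)
{
    return *toy_find_filesystem(name);
}
```

The attraction of the idiom is that an empty list, a list with entries, and a duplicate name are all handled without a special case for the head of the list.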

At the user level, to make a particular instance of a filesystem available, we typically use the mount utility. The mount utility makes a call on the mount() function, which is defined in libc-XXX.so. Execution ultimately arrives at the VFS function sys_mount(), defined in fs/namespace.c, by the same method as described above for the open() call. sys_mount() does a fair amount of work, because it has to check and manipulate the mount tables, carry out error checking, and manipulate memory in the kernel, as well as perform the mount itself. But, in essence, it looks like the following snippet. Please note that I've collapsed a number of discrete functions into one here, so that the flow is clearer.

/*
sys_mount arguments:
dev_name - name of the block special file, e.g., /dev/hda1
dir_name - name of the mount point, e.g., /usr
fstype - name of the filesystem type, e.g., ext3
flags - mount flags, e.g., read-only
data - filesystem-specific data
*/
long sys_mount
      (char *dev_name, char *dir_name, char *fstype, 
       int flags, void *data)
  {
  // Get a dentry for the mount point directory 
  struct nameidata nd_dir;
  path_lookup (dir_name, /*...*/, &nd_dir);

  // Get a dentry from the block special file that
  //  represents the disk hardware (e.g., /dev/hda)
  struct nameidata nd_dev;
  path_lookup (dev_name, /*...*/, &nd_dev);
  
  // Get the block device structure which was allocated 
  // when loading the dentry for the block special file.
  // This contains the major and minor device numbers 
  struct block_device *bdev = nd_dev.dentry->d_inode->i_bdev;

  // Get these numbers into a packed kdev_t (see later)
  kdev_t dev = to_kdev_t(bdev->bd_dev);

  // Get the file_system_type struct for the given
  //  filesystem type name
  struct file_system_type *type = get_fs_type(fstype);

  struct super_block *sb = // allocate space
  
  // Store the block device information in the sb 
  sb->s_dev = dev;
  // ... populate other generic sb fields
 
  // Ask the filesystem type handler to populate the
  //  rest of the superblock structure
  type->read_super(sb, data, flags & MS_VERBOSE); 

  // Now populate a vfsmount structure from the superblock
  struct vfsmount *mnt = // allocate space 
  mnt->mnt_sb = sb;
  //... Initialize other vfsmount elements from sb

  // Finally, attach the vfsmount structure to the 
  //  mount point's dentry (in `nd_dir')
  graft_tree (mnt, &nd_dir);
  return 0;
  }
I should point out that the code above is a considerable simplification of the real implementation because, apart from omitting all the error handling code, it doesn't reflect the fact that filesystem types are not really generic. For example, some filesystem types do not have a block device associated with them (the proc filesystem is of this type). Some filesystem types can serve multiple mount points, while others can't. And so on. The code above only shows the operation of a mount that associates a block special file with a mount point, which is usually the most important case.
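Seen from the application side, the arguments listed in the comment at the top of the snippet are the arguments of the mount() library call, which on Linux is declared in sys/mount.h. The wrapper below is a minimal sketch (the function name is my own) that asks for a read-only mount and reports the error number if the kernel refuses. The call normally needs root privileges, so for an ordinary user it can be expected to fail.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/mount.h>

/* Ask the kernel to mount `dev` on `dir`, read-only, as a filesystem of
 * type `fstype`. Returns 0 on success, or the errno value reported by
 * the failed system call. No filesystem-specific data is passed. */
int try_mount_readonly(const char *dev, const char *dir, const char *fstype)
{
    if (mount(dev, dir, fstype, MS_RDONLY, NULL) == 0)
        return 0;
    return errno;
}
```

A call such as try_mount_readonly("/dev/hda1", "/usr", "ext3") corresponds to the example values given in the comment; the device and directory names are, of course, only illustrations.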
      The block special file is just a particular kind of file, as far as Linux is concerned, and therefore has a dentry of its own. In the dentry will be an inode, and that inode will itself have been obtained by a disk read at some point. So you might be wondering: if we need a block special file, and that needs a disk read to fetch its inode, how can the kernel do the very first disk read at boot time? Is this not a chicken-and-egg situation? In order to resolve this infinite regress, the Linux kernel recognizes a particular kind of disk filesystem called rootfs. This models the root filesystem, and is initialized at boot time, not from the inode of a block device (which cannot yet exist), but from physical device properties passed in from the boot loader. During the boot sequence, the root filesystem will normally be remounted as an ordinary filesystem, which is why, if you run the command
% mount
you will see something like:
/dev/hda1 on / type ext2 (rw)
rather than `type rootfs'.

The place where the filesystem is to be mounted is itself a directory, and therefore has its own dentry, and its own inode.
      The sys_mount() function allocates an empty superblock structure, and stores in it the address of the block device structure, among other things. It then finds the filesystem handler for the specified filesystem type. With luck, you will remember that the filesystem handler code registered itself with the kernel by calling register_filesystem(), specifying the name of the filesystem type and the address of the read_super() function. So, when the sys_mount() function has set up the generic elements in the super_block structure, it calls read_super() in the filesystem handler. What happens then depends on the specific filesystem. If it is a disk filesystem, very likely the filesystem handler will attempt to read the physical superblock from the disk, and check that it is of the expected type. In all filesystem types, however, read_super will initialize a super_operations structure, and insert a pointer to it into the s_op field in the super_block structure. super_operations contains pointers to a set of functions that carry out basic file operations, such as creating a new file or deleting a file (strictly speaking, we create and delete inodes, not files, of course).
      When the super_block structure has successfully been initialized, the mount code populates a vfsmount structure with the data from the superblock, then attaches that vfsmount to the dentry for the mount point directory. This process is the mechanism that Linux uses for indicating that a particular directory is a mount point, not a directory in its own right. We will return to this mechanism later.
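The effect of attaching a vfsmount to a mount point can be modelled in a few lines. This sketch follows the description above rather than the kernel's actual bookkeeping, which is more involved; the toy_ structures are hypothetical, and the rule that a mount point can carry only one mount is a simplification of my own. The idea is that a directory with something attached is never entered directly: pathname descent steps sideways to the top-level directory of the mounted filesystem instead.

```c
#include <stddef.h>

struct toy_dentry;

struct toy_vfsmount {
    struct toy_dentry *mnt_root;    /* top-level directory of the mount */
    const char *fstype;
};

struct toy_dentry {
    const char *name;
    struct toy_vfsmount *mounted;   /* non-NULL if this is a mount point */
};

/* Attach `mnt` to a mount point directory. Returns -1 if something is
 * already mounted there (a toy rule, see the note above). */
int toy_graft(struct toy_vfsmount *mnt, struct toy_dentry *mountpoint)
{
    if (mountpoint->mounted != NULL)
        return -1;
    mountpoint->mounted = mnt;
    return 0;
}

/* Used during pathname descent: if `d` is a mount point, continue from
 * the root of whatever is mounted on it; otherwise stay where we are. */
struct toy_dentry *toy_follow_mount(struct toy_dentry *d)
{
    while (d->mounted != NULL)
        d = d->mounted->mnt_root;
    return d;
}
```

Before the graft, following the dentry for /usr yields /usr itself, with whatever contents it has on the parent filesystem. After the graft it yields the root directory of the newly mounted filesystem, and the original contents are hidden until the filesystem is unmounted.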

Opening a file... continued

You may remember that we broke off our discussion of opening a file, in the dentry_open() function, at the point where it called the open function on the file's inode structure. We should now be in a position to see where that inode comes from.
      When the application requests that a file be opened, the sys_open() code and the functions it calls will attempt to find a cached inode for that file. They do this by looking in the list of dentry structures in memory. If the inode is not in the dentry cache, then the file open code walks down the requested pathname, from '/' if necessary, opening a dentry for each directory, until it reaches the requested file.
      You should keep in mind, without brooding on it too much, that these pathname descent operations are themselves filesystem operations, and themselves have to go through the filesystem handler(s), and perhaps the associated block device(s). This means that at the very least, the top level directory of any mounted filesystem must be locatable on disk without needing a disk read. If this were not the case, no other file below the top level directory would be locatable.
      The location of the top-level directory is made possible by insisting that it always have the same inode number (2). So, to get the root directory ('/') into memory, the VFS code simply asks the filesystem handler for the root directory to read inode number 2. It can then descend the pathname of the file to be opened by opening and reading each directory component in the path, and getting the inode number that corresponds to the directory name at each level of the tree. Of course, in doing this, the descent may cross filesystem boundaries. Suppose, for example, the application is opening the file `/home/fred/test'. The directory '/' may be on one physical disk, and the directory '/home' a different filesystem type on a different physical disk. The VFS implementation also has to contend with symbolic links, which may themselves cross filesystems, but we won't go into that technicality here. The general principle is that, as each inode is loaded and cached, it inherits the elements of its ancestor in the directory tree, unless the directory is a mount point. If it is a mount point, the VFS code finds the vfsmount structure that was stored in the dentry for the mount point when the filesystem was mounted, gets the super_operations structure that was initialized by the filesystem handler, then calls the function to open the inode for the top level directory. And so the process continues, until it arrives at the requested file. The cached inode structure for this file will contain pointers to functions to carry out file operations, inherited from the inode for the top-level directory, which in turn obtained them from the super_block structure, which itself got them from the filesystem handler. So, when the file open procedure in the VFS layer executes the following code:
   // Call open() through the inode 
   f->f_op = fops_get(inode->i_fop);
   f->f_op->open(inode, f);
it is calling a function to open a file that was provided by the handler for the filesystem on which the file is located.

©1994-2006 Kevin Boone, all rights reserved