sys_open function, and the corresponding sys_read,
sys_write, and so on, are entry points into the kernel's
VFS layer.
These functions are largely architecture-independent,
for reasons that will become clear later. The `virtual' part of
`virtual filesystem' reflects the fact that the operations carried out on this
layer are independent of lower-level implementation details. In particular,
VFS abstracts from the application programmer:
In outline, sys_open
looks like this:
int sys_open (const char *name, int flags, int mode)
{
int fd = get_unused_fd();
struct file *f = filp_open(name, flags, mode);
fd_install(fd, f);
return fd;
}
The function get_unused_fd() attempts to find an
empty slot in the current process's file descriptor table.
This operation can fail, of course, if the process has too many
file descriptors open. If it succeeds, the function
filp_open() is called to do the real open; if
filp_open() succeeds, the file structure
it returns is assigned to the allocated slot in the file descriptor
table by fd_install().
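The slot-finding and installing steps are easy to picture with a toy model. What follows is only a user-space sketch, not kernel code: the names toy_fd_table, toy_get_unused_fd, toy_fd_install and toy_open are all invented for illustration, and the table is made deliberately tiny so that the "too many open files" case is easy to reach.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FDS 4   /* deliberately tiny; a real table is much larger */

/* Stand-in for the kernel's file structure (hypothetical). */
struct toy_file {
    const char *name;
    int flags;
};

/* One table per process in a real kernel; a single global in this toy. */
static struct toy_file *toy_fd_table[TOY_MAX_FDS];

/* Return the lowest free descriptor, or -1 if the table is full.
   Note: this toy does not reserve the slot it returns, which a real
   implementation would need to do to be safe against concurrent opens. */
static int toy_get_unused_fd(void)
{
    for (int fd = 0; fd < TOY_MAX_FDS; fd++)
        if (toy_fd_table[fd] == NULL)
            return fd;
    return -1;
}

/* Bind an open file object to a previously found descriptor. */
static void toy_fd_install(int fd, struct toy_file *f)
{
    assert(fd >= 0 && fd < TOY_MAX_FDS);
    toy_fd_table[fd] = f;
}

/* Same shape as the sys_open outline: find a slot, then fill it. */
static int toy_open(struct toy_file *f)
{
    int fd = toy_get_unused_fd();
    if (fd < 0)
        return -1;      /* table full: too many open files */
    toy_fd_install(fd, f);
    return fd;
}
```

Calling toy_open() repeatedly hands out descriptors 0, 1, 2, 3 and then fails; clearing a slot makes that number available again, which is roughly why a newly opened file usually gets the lowest free descriptor.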
In summary, all sys_open does is to call filp_open()
and then assign its output to a free file descriptor. filp_open()
is also defined in open.c, and looks in outline like this:
struct file *filp_open
(const char *name, int flags, int mode)
{
struct nameidata nd;
open_namei (name, flags, mode, &nd);
return dentry_open(nd.dentry, nd.mnt, flags);
}
The filp_open function has, in essence, two steps.
First, it uses open_namei (in fs/namei.c) to
generate a nameidata structure. This structure provides,
among other things, a link to the file's inode (of which, more later).
The second step calls dentry_open(), passing the
salient information from the nameidata structure. It is
this latter step which will typically do the `open' operation,
by delegating it to a handler for the appropriate filesystem.
inode, dentry, and file
structures.
The concept of an `inode' is an ancient one in the Unix world. An
inode is a block of data that stores system-level information about
a single file, such as its access modes, size, and references to its
location on the physical storage device. In Unix, different
filenames may point to the same file on disk (this is what
links are for), but each file has exactly one inode. A directory,
therefore, is nothing more than a mapping between filenames and
inodes. A directory is itself a file, and consequently has an inode
of its own. A number of processes may have the same file open,
subject to locking restrictions, but each process `sees' the same
inode, because each `sees' the same file. An inode is partly
exposed to the application program by a stat structure.
stat does not expose the low-level information about
the file, but it does expose the owner, group, size, timestamps,
and so on.
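You can see the "many names, one inode" rule from an ordinary program using the standard POSIX calls stat() and link(). The helper names below (make_hard_link_pair, same_inode, link_count) are my own, and this is just a sketch with minimal error handling: it gives a file a second name and then compares the inode numbers that stat() reports for the two names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create an empty file at `path` and give it a second name, `alias`.
   Returns 0 on success, -1 on failure. */
static int make_hard_link_pair(const char *path, const char *alias)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    close(fd);
    unlink(alias);              /* a stale alias would make link() fail */
    return link(path, alias);
}

/* Returns 1 if both names resolve to the same inode on the same device,
   0 if they do not, and -1 if either name cannot be examined. */
static int same_inode(const char *a, const char *b)
{
    struct stat sa, sb;
    assert(a != NULL && b != NULL);
    if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
        return -1;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Number of names (directory entries) that refer to the file's inode,
   or -1 if the file cannot be examined. */
static long link_count(const char *path)
{
    struct stat s;
    return stat(path, &s) == 0 ? (long)s.st_nlink : -1;
}
```

After make_hard_link_pair("a", "b"), same_inode("a", "b") should report 1 and link_count("a") should be 2; removing one of the names drops the count back to 1 without touching the file's contents.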
Now, although many processes can see the same file, and therefore the
same inode, these various processes have different rights over that
file. For example, if one process has a file open for writing, very
likely it will seek to limit the rights of other processes. It should be
obvious that we need a way for each process to maintain its own
private view of the file, in addition to the shared view provided by the inode.
In Linux, the private view is provided by the file
structure. The file contains information about the
file as seen by a particular process. Notice that the filp_open
function shown above returns a file, not an inode. However,
the file contains (indirectly) a reference to its inode.
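The split between the shared inode and the private view shows up even in user space. In the sketch below (the function name independent_offsets is mine), the same file is opened twice. Each open() yields its own open-file state, so reading through one descriptor moves only that descriptor's position, while fstat() reports the same inode for both. I use two descriptors in one process here simply because it is easier to demonstrate than two processes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Open `path` twice, read up to four bytes through the first descriptor,
   and report the resulting file positions of both descriptors.
   Returns 1 if both descriptors refer to the same inode, 0 if they do
   not, and -1 on error. */
static int independent_offsets(const char *path, off_t *off1, off_t *off2)
{
    char buf[4];
    struct stat s1, s2;
    int result = -1;

    assert(off1 != NULL && off2 != NULL);
    int fd1 = open(path, O_RDONLY);
    int fd2 = open(path, O_RDONLY);
    if (fd1 < 0 || fd2 < 0)
        goto out;
    if (read(fd1, buf, sizeof buf) < 0)     /* advances fd1 only */
        goto out;
    *off1 = lseek(fd1, 0, SEEK_CUR);        /* private to the first open  */
    *off2 = lseek(fd2, 0, SEEK_CUR);        /* private to the second open */
    if (fstat(fd1, &s1) != 0 || fstat(fd2, &s2) != 0)
        goto out;
    result = (s1.st_dev == s2.st_dev && s1.st_ino == s2.st_ino);
out:
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return result;
}
```

On a file of at least four bytes, the first position comes back as 4 and the second as 0, even though both descriptors lead to the one inode.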
In practical Linux systems, operations on inodes are a significant
part of the work of the system. Consequently these operations need
to be optimized if possible. The Linux VFS layer
attempts to cache inodes in memory; this cache is maintained by
a linked list of dentry structures. Each dentry
contains a reference to the inode, and the name of the file.
So, in summary, an inode is a representation of a file, shared by all
users of that file; a file structure is the view of a
file (and its inode) seen by a particular process; a dentry
is a structure that caches an inode in memory along with the filename
that locates it.
So, let's return to filp_open(). This function calls
open_namei(), which finds the dentry
for a particular filename. It does this by descending the requested
pathname one component at a time,
relative either to the root (/) or to the process's
working directory, depending on whether the pathname begins with
a '/' or not. It descends until it gets to the requested file,
or realizes that it is never going to get there. If each component
of the pathname has already been cached as a dentry, this
is a fast process, as all operations are carried out in memory.
If the file has not been the subject of an access recently, its inode
will probably not be cached. In such a case, the inode will have to
be read from the filesystem layer (see below), and then cached. But
we'll deal with that process later. The open_namei() function
populates a nameidata, the salient elements
of which look like this:
struct nameidata
{
struct dentry *dentry;
struct vfsmount *mnt;
//...
};
The dentry contains a reference to the cached inode of the file,
while the vfsmount references the filesystem on which the
file is located.
As we shall see, the mount utility
has the effect of causing vfsmount elements to be registered
with the kernel.
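The component-by-component descent that open_namei() performs can be modelled in a few lines. This is a toy, with assumptions stated up front: toy_dentry, toy_lookup_one and toy_path_walk are invented names, every dentry is assumed to be in memory already (so there is no "read the inode from disk" branch), and '.', '..', symbolic links and mount points are all ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 8

/* A cached name in the toy tree (hypothetical, much simpler than the
   kernel's dentry). */
struct toy_dentry {
    const char *name;
    struct toy_dentry *children[TOY_MAX_CHILDREN];
    int nchildren;
};

/* Find the child of `dir` whose name is the `len` bytes at `comp`. */
static struct toy_dentry *toy_lookup_one(struct toy_dentry *dir,
                                         const char *comp, size_t len)
{
    for (int i = 0; i < dir->nchildren; i++) {
        const char *n = dir->children[i]->name;
        if (strlen(n) == len && strncmp(n, comp, len) == 0)
            return dir->children[i];
    }
    return NULL;
}

/* Walk `path` one component at a time. Absolute paths start at `root`,
   all others at `cwd`. Returns NULL as soon as a component is missing. */
static struct toy_dentry *toy_path_walk(struct toy_dentry *root,
                                        struct toy_dentry *cwd,
                                        const char *path)
{
    assert(root != NULL && cwd != NULL && path != NULL);
    struct toy_dentry *cur = (path[0] == '/') ? root : cwd;

    while (cur != NULL && *path != '\0') {
        while (*path == '/')            /* skip one or more separators */
            path++;
        if (*path == '\0')
            break;                      /* trailing slash: nothing left */
        size_t len = strcspn(path, "/");
        cur = toy_lookup_one(cur, path, len);
        path += len;
    }
    return cur;
}
```

With a tree of /, /usr and /usr/bin, walking "/usr/bin" from anywhere, or "bin" from a working directory of /usr, arrives at the same toy_dentry; "/usr/lib" gives NULL at the second component.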
So, the filp_open function now knows the dentry,
and therefore the inode (if it exists) for the requested file, and
a vfsmount structure which references the block special
file for the filesystem. The open() code does not itself
refer to the vfsmount, as this code
gets access to the lower-level functions for
the file through pointers in the inode structure itself. However,
the vfsmount is stored in the file structure
for future use. The filp_open function now calls
dentry_open(). This function looks like this:
struct file *dentry_open
(struct dentry *dentry, struct vfsmount *mnt, int flags)
{
struct file *f = get_empty_filp(); // get an empty file structure
f->f_flags = flags;
// ...populate other file structure elements
// Get the inode for the file
struct inode *inode = dentry->d_inode;
// Call open() through the inode
f->f_op = fops_get(inode->i_fop);
f->f_op->open(inode, f);
return f;
}
This function finds the address of a function in the filesystem layer
that can handle the open operation and calls it, passing the inode
and the file structure. Where does this function address
come from? It will be found either in the inode itself, or in one of
the inode's parent inodes. In order to understand where these inodes
come from, we need to see how filesystem handlers are registered, and
individual filesystems mounted. We will return to the sys_open()
method later.
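Calling "through the inode" is ordinary C function-pointer dispatch. Here is a stripped-down user-space imitation, in which every name (toy_inode, toy_file, toy_file_operations, toyfs_open, toy_dentry_open) is hypothetical. The generic code never names the handler's function; it simply follows whatever pointer the handler left in the inode.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode;
struct toy_file;

/* Table of operations supplied by a filesystem handler. */
struct toy_file_operations {
    int (*open)(struct toy_inode *inode, struct toy_file *f);
};

struct toy_inode {
    const struct toy_file_operations *i_fop;    /* set by the handler */
    int open_count;                             /* bookkeeping for the demo */
};

struct toy_file {
    int f_flags;
    const struct toy_file_operations *f_op;
    struct toy_inode *f_inode;
};

/* An imaginary handler's open routine: it only counts the opens. */
static int toyfs_open(struct toy_inode *inode, struct toy_file *f)
{
    (void)f;
    inode->open_count++;
    return 0;
}

static const struct toy_file_operations toyfs_fops = { .open = toyfs_open };

/* Generic layer: copy the operations table from the inode into the
   file, then call open() through it if the handler provided one. */
static int toy_dentry_open(struct toy_inode *inode, struct toy_file *f,
                           int flags)
{
    assert(inode != NULL && f != NULL);
    f->f_flags = flags;
    f->f_inode = inode;
    f->f_op = inode->i_fop;
    if (f->f_op != NULL && f->f_op->open != NULL)
        return f->f_op->open(inode, f);
    return 0;       /* no handler-specific open step */
}
```

Unlike the outline above, the toy checks that the open pointer is set before calling it; I have assumed that a handler may leave it empty when it has nothing special to do at open time.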
open()
and read() are made, via a thin layer in the C standard
library, on the kernel's VFS layer. The key feature of the VFS layer is
that it is filesystem-independent, and device-independent. An open() call will work on any kind of filesystem, on any kind of physical hardware,
in more-or-less the same way. However, underneath the VFS layer will be a filesystem layer, which contains support for the various filesystems known to the
kernel. Underneath the filesystem layer will be the device layer, which interacts
with the hardware. The mounting process provides the bridge between the
VFS layer and the lower-level operations of the filesystem and device.
ext3, ufs), and
sometimes to mean a particular mounted instance of that filesystem type
(e.g., the ext3 filesystem mounted on /usr).
In the following, I will use the term `filesystem type' to refer to
a particular type of filesystem, and `mounted filesystem' to refer to
a specific instance. But don't expect the general Linux documentation to be
this consistent.
You can get a list of supported filesystem types by doing
cat /proc/filesystems
A typical Linux system will support at least
ext2
and iso9660 (CDROM) filesystem types, and some systems will
be configured to support many more.
Some of the filesystem types will be implemented in code directly
compiled into the core kernel. The proc filesystem type itself
is likely to be of this type. The proc type supports the
/proc directory which, as you probably know, allows applications
to find out about, and interact with, the kernel and device drivers.
The proc filesystem type is interesting in another respect
-- it demonstrates that in Linux a filesystem need not correspond to
any real, physical hardware. /proc exists entirely in memory,
and is generated dynamically.
register_filesystem(), which is
defined in fs/super.c. This function looks like
this:
int register_filesystem(struct file_system_type *fs)
{
struct file_system_type ** p =
find_filesystem(fs->name);
if (*p)
return -EBUSY;
else
{ *p = fs; return 0; }
}
All this simple function does is to check whether the specified
filesystem type is already registered and, if not, to store
the supplied file_system_type structure in the
kernel's filesystem table.
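The trick that makes the function so short is that find_filesystem() hands back a pointer to a pointer. One plausible way to get that behaviour, assuming the table is a singly linked list, is sketched below with invented toy_ names: the returned address is either the link that already points at the matching entry, or the NULL link at the end of the list, which is exactly where a new entry should be stored.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Cut-down stand-in for file_system_type (hypothetical). */
struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;   /* head of the table */

/* Return the address of the link that points at the entry called
   `name`, or the address of the terminating NULL link if there is no
   such entry. */
static struct toy_fs_type **toy_find_filesystem(const char *name)
{
    struct toy_fs_type **p;
    for (p = &toy_file_systems; *p != NULL; p = &(*p)->next)
        if (strcmp((*p)->name, name) == 0)
            break;
    return p;
}

/* Append `fs` to the table unless its name is already registered. */
static int toy_register_filesystem(struct toy_fs_type *fs)
{
    assert(fs != NULL && fs->name != NULL);
    struct toy_fs_type **p = toy_find_filesystem(fs->name);
    if (*p != NULL)
        return -EBUSY;      /* name already taken */
    fs->next = NULL;
    *p = fs;                /* store into the empty tail link */
    return 0;
}
```

Registering "ext3" and then "proc" builds a two-element list in that order; registering a second structure named "ext3" is refused with -EBUSY and leaves the list alone.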
The filesystem handler's initialization
code initializes a struct file_system_type,
which looks like this:
struct file_system_type
{
const char *name;
struct super_block *(*read_super)
(struct super_block *, void *, int);
//...
};
The structure contains the name of the filesystem type
(e.g., ext3), and the address of a function called
read_super. This function is provided by the filesystem
handler, and will be called by the kernel when a filesystem of this
type is mounted. The read_super function will have the
task of initializing a struct super_block, the contents
of which will very likely be derived from the superblock on
the physical disk. The superblock (of which there may be more than
one on a real disk or partition) contains fundamental information
about the structure of the filesystem, the maximum supported
file size, and the filesystem type. The super_block
struct contains a memory image of this information, and also
pointers to the block device operations needed to operate
on this filesystem. We will discuss superblock operations
in more detail later.
At the user level, to make a particular instance of a filesystem available,
we typically use the mount utility.
The mount utility makes a call on the mount()
function, which is defined in libc-XXX.so. Execution ultimately arrives
at the VFS function sys_mount(), defined in fs/namespace.c, by the same method as described above for the open() call.
sys_mount() does a fair amount of work, because it has to
check and manipulate the mount tables, carry out error checking, and
manipulate memory in the kernel, as well as performing the mount itself.
But, in essence,
it looks like the following snippet. Please note that I've
collapsed a number
of discrete functions into one here, so that the flow is clearer.
/*
sys_mount arguments:
dev_name - name of the block special file, e.g., /dev/hda1
dir_name - name of the mount point, e.g., /usr
fstype - name of the filesystem type, e.g., ext3
flags - mount flags, e.g., read-only
data - filesystem-specific data
*/
long sys_mount
(char *dev_name, char *dir_name, char *fstype,
int flags, void *data)
{
// Get a dentry for the mount point directory
struct nameidata nd_dir;
path_lookup (dir_name, /*...*/, &nd_dir);
// Get a dentry from the block special file that
// represents the disk hardware (e.g., /dev/hda)
struct nameidata nd_dev;
path_lookup (dev_name, /*...*/, &nd_dev);
// Get the block device structure which was allocated
// when loading the dentry for the block special file.
// This contains the major and minor device numbers
struct block_device *bdev = nd_dev.dentry->d_inode->i_bdev;
// Get these numbers into a packed kdev_t (see later)
kdev_t dev = to_kdev_t(bdev->bd_dev);
// Get the file_system_type struct for the given
// filesystem type name
struct file_system_type *type = get_fs_type(fstype);
struct super_block *sb = // allocate space
// Store the block device information in the sb
sb->s_dev = dev;
// ... populate other generic sb fields
// Ask the filesystem type handler to populate the
// rest of the superblock structure
type->read_super(sb, data, flags & MS_VERBOSE);
// Now populate a vfsmount structure from the superblock
struct vfsmount *mnt = // allocate space
mnt->mnt_sb = sb;
//... Initialize other vfsmount elements from sb
// Finally, attach the vfsmount structure to the
// mount point's dentry (in `nd_dir')
graft_tree (mnt, &nd_dir);
return 0;
}
I should point out that the code above is a considerable simplification
of the real implementation because, apart from omitting all the
error handling code, it doesn't reflect the fact that filesystem
types are not really generic. For example, some filesystem types do
not have a block device associated with them (the proc
filesystem is of this type). Some filesystem types can serve multiple
mount points, while others can't. And so on. The code above only
shows the operation of a mount that associates a block special file
with a mount point, which is usually the most important case.
dentry of its own. In the dentry
will be an inode, and that inode will itself have been obtained by a
disk read at some point. So you might be wondering: if mounting needs a block
special file, and fetching that file's inode needs a disk read,
how can the kernel do the very
first disk read at boot time? Is this not a chicken-and-egg situation?
In order to
resolve this infinite regress, the Linux kernel recognizes a particular
kind of disk filesystem called rootfs. This models the
root filesystem, and is initialized at boot time, not from the inode
of a block device (which cannot yet exist), but from physical device
properties passed in from the boot loader. During the boot sequence,
the root filesystem will normally be remounted as an ordinary filesystem,
which is why, if you run the command
% mount
you will see something like:
/dev/hda1 on / type ext2 (rw)
rather than `type rootfs'.
The place where the filesystem is to be mounted is itself a directory,
and therefore has its own dentry, and its own inode.
The sys_mount() function allocates an empty
superblock structure, and stores in it the address of the block device
structure, among other things. It then finds the filesystem
handler for the specified filesystem type. With luck, you will remember
that the filesystem handler code registered itself with the kernel
by calling register_filesystem(), specifying the name
of the filesystem type and the address of the read_super()
function. So, when the sys_mount() function has
set up the generic elements in the super_block structure,
it calls read_super() in the filesystem
handler. What happens then depends on the specific filesystem. If it
is a disk filesystem, very likely the filesystem handler will attempt to
read the physical superblock from the disk, and check that it is of the
expected type. In all filesystem types, however, read_super
will initialize a super_operations structure,
and insert a pointer to it into the s_op
field in the super_block structure. super_operations
contains pointers to a set of functions that carry out basic
file operations, such as creating a new file or deleting a file (strictly
speaking, we create and delete inodes, not files, of course).
When the super_block structure has successfully been
initialized, the mount code populates a vfsmount structure
with the data from the superblock, then attaches that vfsmount
to the dentry for the mount point directory. This process
is the mechanism that Linux uses for indicating that a particular directory
is a mount point, not a directory in its own right. We will return to
this mechanism later.
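To make the sequence concrete, here is a user-space caricature of the mount steps just described. Everything in it is invented for illustration (the toy_ structures, the toyfs handler, and the idea of passing a "TOYFS" string as the on-disk magic), and the d_mounted field is my own simplification of how a dentry might be marked as a mount point. The generic code sets up the superblock, the handler's read_super validates the "disk" and fills in s_op, and only then is a vfsmount attached to the mount point's dentry.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_super_block;

struct toy_super_operations {
    int (*read_inode)(struct toy_super_block *sb, unsigned long ino);
};

struct toy_super_block {
    int s_dev;                                  /* packed device number */
    const struct toy_super_operations *s_op;    /* filled in by the handler */
};

struct toy_fs_type {
    const char *name;
    int (*read_super)(struct toy_super_block *sb, void *data);
};

struct toy_vfsmount {
    struct toy_super_block *mnt_sb;
};

struct toy_dentry {
    const char *name;
    struct toy_vfsmount *d_mounted;     /* non-NULL marks a mount point */
};

/* Handler side: pretend that inode 0 does not exist. */
static int toyfs_read_inode(struct toy_super_block *sb, unsigned long ino)
{
    (void)sb;
    return ino != 0 ? 0 : -1;
}

static const struct toy_super_operations toyfs_sops = {
    .read_inode = toyfs_read_inode
};

/* Handler side: check the "on-disk" magic, then publish the operations. */
static int toyfs_read_super(struct toy_super_block *sb, void *data)
{
    if (data == NULL || strcmp((const char *)data, "TOYFS") != 0)
        return -1;                      /* not a filesystem of this type */
    sb->s_op = &toyfs_sops;
    return 0;
}

static const struct toy_fs_type toyfs_type = { "toyfs", toyfs_read_super };

/* Generic side: fill in the generic superblock fields, let the handler
   do the rest, then graft a vfsmount onto the mount point's dentry. */
static int toy_mount(const struct toy_fs_type *type, int dev, void *data,
                     struct toy_super_block *sb, struct toy_vfsmount *mnt,
                     struct toy_dentry *mountpoint)
{
    assert(type != NULL && sb != NULL && mnt != NULL && mountpoint != NULL);
    sb->s_dev = dev;
    sb->s_op = NULL;
    if (type->read_super(sb, data) != 0)
        return -1;                      /* mount point is left untouched */
    mnt->mnt_sb = sb;
    mountpoint->d_mounted = mnt;
    return 0;
}
```

Once toy_mount() has succeeded, anything that reaches the mount point's toy_dentry can follow d_mounted to the superblock and from there to the handler's functions, which is the route the lookup code relies on.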
dentry_open() function, at the point where it
called the open function on the file's inode
structure. We should now be in a position to see where that inode
comes from.
sys_open() code and the functions it calls will attempt
to find a cached inode for that file. It does this by looking in the
list of dentry structures in memory. If the inode is not in
the dentry cache, then the file open code walks
down the requested pathname, from '/' if necessary,
opening a dentry for each directory, until it reaches the
requested file.
vfsmount
structure that was stored in the dentry for the mount
point when the filesystem was mounted, gets the super_operations
structure that was initialized by the filesystem handler, then calls the
function to open the inode for the top level directory.
And so the process continues,
until it arrives at the requested file. The cached inode structure
for this file will contain pointers to functions to carry out file
operations, inherited from the inode for the top-level directory,
which in turn obtained them from the super_block structure,
which itself got them from the filesystem handler.
So, when the file open procedure in the VFS layer executes the following
code:
// Call open() through the inode
f->f_op = fops_get(inode->i_fop);
f->f_op->open(inode, f);
it is calling a function to open a file that was provided by the handler for the filesystem on which the file is located.
©1994-2006 Kevin Boone, all rights reserved