inode
structure for a particular
file. This was a long and convoluted process, and before we describe
the filesystem layer, it might be worth recapping it briefly.
open() in libc-XXX.so.
libc-XXX.so traps into the kernel, in a way that is
architecture-dependent.
sys_open.
sys_open finds the dentry for
the file which, we hope, is cached.
inode structure is retrieved from the
dentry.
open function in the inode,
which is a pointer to a function provided by the handler for
the filesystem type. This handler was installed at boot time, or
by loading a kernel module,
and made available by being attached to the mount point during
the VFS mount process. The mounted filesystem is represented as
a vfsmount structure, which contains a pointer
to the filesystem's superblock, which in turn contains a pointer to the
block device that handles the physical hardware.
ext2
filesystem type that we have all come to know and love.
We have seen
how the VFS layer calls open through the inode which
was created by the filesystem handler for the requested file. As well
as open(), a large number of other operations
are exposed in the inode structure.
Looking at the definition of struct inode (in
include/linux/fs.h), we have:
struct inode
{
unsigned long i_ino;
umode_t i_mode;
nlink_t i_nlink;
uid_t i_uid;
gid_t i_gid;
// ... many other data fields
struct inode_operations *i_op;
struct file_operations *i_fop;
};
The interface for manipulating the file is provided by the i_op
and i_fop structures. These structures contain the pointers to the
functions that do the real work; these functions are provided by the
filesystem handler. For example, file_operations contains the
following pointers:
struct file_operations
{
int (*open) (struct inode *, struct file *);
ssize_t (*read) (struct file *, char *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
int (*release) (struct inode *, struct file *);
// ... and many more
};
You can see that the interface is clean -- there are no references to
lower-level structures, such as disk block lists, and there is nothing
in the interface that presupposes a particular type of low-level hardware.
This clean interface should, in principle, make it easy to understand
the interaction between the VFS layer and the filesystem handlers.
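The idea of an interface made of function pointers can be tried out in user space. The following is only a sketch: the toy_ and mem_ names are hypothetical stand-ins, not kernel definitions, and the "handler" simply serves bytes from a memory buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, user-space stand-ins for the kernel structures. */
struct toy_file;

struct toy_file_operations {
    int  (*open)(struct toy_file *f);
    long (*read)(struct toy_file *f, char *buf, size_t count);
};

struct toy_file {
    const struct toy_file_operations *f_op; /* supplied by the "handler" */
    const char *data;                       /* handler-private contents  */
    size_t size;                            /* length of data in bytes   */
    size_t pos;                             /* current read position     */
};

/* One possible "filesystem handler": serves bytes from memory. */
static int mem_open(struct toy_file *f)
{
    f->pos = 0;
    return 0;
}

static long mem_read(struct toy_file *f, char *buf, size_t count)
{
    size_t left = f->size - f->pos;
    if (count > left)
        count = left;
    memcpy(buf, f->data + f->pos, count);
    f->pos += count;
    return (long)count;
}

static const struct toy_file_operations mem_file_operations = {
    .open = mem_open,
    .read = mem_read,
};

/* The "VFS" side: it knows only the table, never the handler's internals. */
static long toy_vfs_read(struct toy_file *f, char *buf, size_t count)
{
    assert(f->f_op && f->f_op->read);
    return f->f_op->read(f, buf, count);
}
```

A second handler would supply its own table, and toy_vfs_read() would not need to change at all.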
Conceptually, for example, a file read operation ought to look
like this:
read()
function.
The pointers in struct file_operations and
struct inode_operations
are hooks into the filesystem handler, and we can expect
each handler to implement them slightly differently. Or can we? It's
worth thinking, for example, about exactly what happens in the filesystem
layer when an application opens or reads a file. Consider the `open'
operation first. What exactly is meant by `opening' a file at the filesystem
level? To the application programmer, `opening' has connotations of
checking that the file exists, checking that the requested access mode is
allowed, creating the file if necessary, and marking it as open. At the
filesystem layer, by the time we call open() in
struct file_operations, all of this has been done. It was
done on the cached dentry for the file, and if the file did
not exist, or had the wrong permissions, we would have found out before now.
So the open() operation is more-or-less a no-brainer,
on most filesystem types. What about
the read() operation? This operation will involve doing
some real work, won't it? Well, possibly not. If we're lucky, the
requested file region will have been read already in the not-too-distant
past, and will be in a cache somewhere. If we're very lucky, then
it will be available in a cache even if it hasn't been read recently.
This is because disk drives work most effectively when they can read
in a continuous stream. If we have just read physical block 123, for example, from the disk, there is an excellent chance that the application will need block 124 shortly. So the kernel will try to `read ahead' and load disk blocks
into cache before they are actually requested. Similar considerations
apply to disk write operations: writes will normally be performed on memory
buffers, which will be flushed periodically to the hardware.
Now, this disk caching, buffering, and read-ahead support is, for the
most part, filesystem-independent. Of the huge amount of work
that goes on when the application reads data from a file, the only
part that is filesystem-specific is the mapping of logical
file offsets to physical disk blocks. Everything else is generic.
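To make the split concrete, here is a small sketch of the arithmetic. The first two functions are the generic part: they need only the block size (assumed here to be 1kB). The third is a deliberately naive, hypothetical mapping for a file stored contiguously on disk; a real handler such as ext2 has to consult its own per-file block tables at this point.

```c
#include <assert.h>

#define TOY_BLOCK_SIZE 1024UL   /* assumed block size, in bytes */

/* Generic: index of the logical block that holds byte 'offset'. */
static unsigned long toy_logical_block(unsigned long offset)
{
    return offset / TOY_BLOCK_SIZE;
}

/* Generic: position of that byte within its block. */
static unsigned long toy_offset_in_block(unsigned long offset)
{
    return offset % TOY_BLOCK_SIZE;
}

/*
 * Filesystem-specific (hypothetical): map a logical block to a physical
 * one, for a file that occupies consecutive blocks from 'first_phys'.
 */
static unsigned long toy_contiguous_map(unsigned long first_phys,
                                        unsigned long logical)
{
    return first_phys + logical;
}
```

With these assumptions, byte 5000 of the file falls in logical block 4, at offset 904 within that block.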
Now, if it is generic, it can be considered part of VFS, along with
all the other generic file operations, right? Well, no actually.
I would suggest that conceptually the generic filesystem
stuff forms a separate architectural layer,
sitting between the individual filesystem handlers and the block devices.
Whatever the merits of this argument, the Linux
kernel is not structured like this. You'll see, in fact, that the
code is split between two subsystems: the VFS subsystem
(in the fs directory of the kernel source),
and the memory management subsystem
(in the mm directory). There is a reason for this, but it
is rather convoluted, and you may not need to understand it to make
sense of the rest of the disk access procedure which I will describe
later. But, if you are interested, it's like this.
ext2 as an example of a
filesystem handler but, as should be clear by now, we don't lose a lot
of generality. Most of the filesystem infrastructure is generic
anyway. I ought to point out, for the sake of completeness, that
none of what follows is mandatory for a filesystem handler.
So long as the handler implements the functions defined in
struct file_operations and struct inode_operations,
the VFS layer couldn't care less what goes on inside the handler. In practice,
most filesystem type handlers do work the way I am about to describe,
with minor variations.
Let's dispose of the open() operation first, as this is
trivial (remember that VFS has done all the hard work by the time
the filesystem handler is invoked). VFS calls
open() in the struct file_operations provided
by the filesystem handler. In the ext2 handler,
this structure is initialized like this (fs/ext2/file.c):
struct file_operations ext2_file_operations =
{
llseek: generic_file_llseek,
read: generic_file_read,
write: generic_file_write,
ioctl: ext2_ioctl,
mmap: generic_file_mmap,
open: generic_file_open,
release: ext2_release_file,
fsync: ext2_sync_file,
};
Notice that most of the file operations are simply delegated to the
generic filesystem infrastructure. open() maps onto
generic_file_open(), which is defined in fs/open.c:
int generic_file_open
(struct inode * inode, struct file * filp)
{
if (!(filp->f_flags & O_LARGEFILE) &&
inode->i_size > MAX_NON_LFS)
return -EFBIG;
return 0;
}
Not very interesting, is it? All this function does is refuse to open
a file that is too large to be handled without large file support,
unless the caller has asked for that support (by setting O_LARGEFILE).
All the hard work has already been done by this point.
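The same test can be exercised outside the kernel. This is a sketch only: the flag value is made up, and the size limit is assumed to be the largest value that fits in a signed 32-bit offset.

```c
#include <assert.h>
#include <errno.h>

#define TOY_O_LARGEFILE 0x1                 /* hypothetical flag value       */
#define TOY_MAX_NON_LFS ((1LL << 31) - 1)   /* assumed limit without the flag */

/*
 * User-space imitation of the check: takes the file size and the open
 * flags directly, rather than an inode and a file structure.
 */
static int toy_generic_file_open(long long i_size, int f_flags)
{
    if (!(f_flags & TOY_O_LARGEFILE) && i_size > TOY_MAX_NON_LFS)
        return -EFBIG;
    return 0;
}
```

A small file opens regardless of the flag. A file one byte over the limit opens only if the flag is set.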
The read() operation is more interesting. This function
results in a call on generic_file_read(), which is defined
in mm/filemap.c (remember, file reads are part of
the memory management infrastructure!). The logic is fairly complex,
but for our purposes -- talking about file management, not memory
management -- can be distilled down to something like this:
/*
arguments to generic_file_read:
filp - the file structure from VFS
buf - buffer to read into
count - number of bytes to read
ppos - offset into file at which to read
*/
ssize_t generic_file_read (struct file * filp,
char * buf, size_t count, loff_t *ppos)
{
struct address_space *mapping =
filp->f_dentry->d_inode->i_mapping;
// Use the inode to convert the file offset and byte
// count into logical disk blocks. Work out the number of
// memory pages this corresponds to. Then loop until we
// have all the required pages
while (more_to_read)
{
if (is_page_in_cache)
{
// add cached page to buffer
}
else
{
struct page *page = page_cache_alloc(mapping);
// Ask the filesystem handler for the logical page.
// This operation is non-blocking
mapping->a_ops->readpage(filp, page);
// Schedule a look-ahead read if possible
generic_file_readahead(...);
// Wait for requested page to be delivered from the IO subsystem
wait_on_page(page);
// Add page to buffer
// Mark new page as clean in cache
}
}
}
In this code we can see (in outline) the caching and read-ahead
logic.
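The effect of the cache and the read-ahead can be seen in a toy model. Everything here is hypothetical: the cache is just an array of flags, "loading" a page only counts a device read, and read-ahead is limited to the single page that follows a miss.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NPAGES 16   /* size of the toy file, in pages */

struct toy_cache {
    bool present[TOY_NPAGES];   /* is the page already in memory?      */
    int  device_reads;          /* how many pages the "device" fetched */
};

/* Fetch a page from the "device" unless it is cached or out of range. */
static void toy_load(struct toy_cache *c, int page)
{
    if (page < TOY_NPAGES && !c->present[page]) {
        c->present[page] = true;
        c->device_reads++;
    }
}

/* Read one page. Returns true if it was served from the cache. */
static bool toy_read_page(struct toy_cache *c, int page)
{
    assert(page >= 0 && page < TOY_NPAGES);
    bool hit = c->present[page];
    if (!hit) {
        toy_load(c, page);       /* the page that was asked for  */
        toy_load(c, page + 1);   /* read-ahead: the one after it */
    }
    return hit;
}
```

Reading pages 0, 1, 2, 3 in order gives a miss, a hit, a miss and a hit: half of the requests never wait for the device.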
It's important to remember that because
the generic_file_read code is part of the memory management
subsystem, its operations are expressed in terms of (virtual memory)
pages, not disk blocks. Ultimately we will be reading disk blocks,
but not here. In practice, disk blocks will often be 1kB, and
pages 4kB. So we will have four block reads for each page read.
generic_file_read can't get real data from a real filesystem,
either in blocks or in pages,
because only the filesystem knows where the logical blocks in the file
are located on the disk.
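The relationship between pages and blocks is simple arithmetic. The sketch below assumes the sizes just mentioned (4kB pages, 1kB blocks); the names are made up for illustration.

```c
#include <assert.h>

#define TOY_PAGE_SIZE  4096UL   /* assumed page size  */
#define TOY_BLOCK_SIZE 1024UL   /* assumed block size */
#define TOY_BLOCKS_PER_PAGE (TOY_PAGE_SIZE / TOY_BLOCK_SIZE)

/* Index of the page that holds byte 'offset' of the file. */
static unsigned long toy_page_index(unsigned long offset)
{
    return offset / TOY_PAGE_SIZE;
}

/* Logical number of the first block backing the given page. */
static unsigned long toy_first_block(unsigned long page_index)
{
    return page_index * TOY_BLOCKS_PER_PAGE;
}
```

So byte 10000 of a file lies in page 2, which is backed by logical blocks 8 to 11. Where those four blocks live on disk is still unknown at this level.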
So, for this discussion, the most important
feature of the above code is the call:
mapping->a_ops->readpage(filp, page);
This is a call through the inode of the file, back into the filesystem
handler. It is expected to schedule the read of a page of data, which
may encompass multiple disk blocks. In reality, reading a page is also
a generic operation -- it is only reading blocks that is
filesystem-specific. A page read just works out the number of blocks
that constitute a page, and then calls another function to read each
block. So, in the
ext2 filesystem example,
the readpage function pointer points to
ext2_readpage() (in fs/ext2/inode.c),
which simply calls back into the generic VFS layer like this:
static int ext2_readpage
(struct file *file, struct page *page)
{
return block_read_full_page(page,ext2_get_block);
}
block_read_full_page() (in fs/buffer.c)
calls the ext2_get_block() function once
for each block in the page. This function does
not do any IO itself, or even delegate it to the block device.
Instead, it determines the location on disk of the requested
logical block, and returns this information in a
buffer_head structure (of which, more later).
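The division of labour can be imitated in a few lines. This is a sketch under stated assumptions, not the kernel's code: struct toy_bh is a cut-down, hypothetical stand-in for buffer_head, four blocks per page are assumed, and the example handler pretends the file's data starts at physical block 1000 and is contiguous.

```c
#include <assert.h>

#define TOY_BLOCKS_PER_PAGE 4   /* assumes 4kB pages and 1kB blocks */

/* Cut-down, hypothetical stand-in for buffer_head. */
struct toy_bh {
    unsigned long b_blocknr;   /* physical block, filled in by the handler */
    int mapped;                /* non-zero once b_blocknr is valid         */
};

/* Handler-supplied callback: locate a logical block. It does no IO. */
typedef int (*toy_get_block_t)(unsigned long logical, struct toy_bh *bh);

/* Generic code: ask the handler where each block of the page lives. */
static int toy_read_full_page(unsigned long page_index,
                              toy_get_block_t get_block,
                              struct toy_bh bhs[TOY_BLOCKS_PER_PAGE])
{
    unsigned long first = page_index * TOY_BLOCKS_PER_PAGE;

    for (int i = 0; i < TOY_BLOCKS_PER_PAGE; i++) {
        bhs[i].mapped = 0;
        int err = get_block(first + i, &bhs[i]);
        if (err)
            return err;
    }
    return 0;
}

/* Hypothetical handler: data is contiguous from physical block 1000. */
static int toy_get_block(unsigned long logical, struct toy_bh *bh)
{
    bh->b_blocknr = 1000 + logical;
    bh->mapped = 1;
    return 0;
}
```

The generic function never learns how the handler worked out the block numbers. It only collects the answers.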
The ext2
handler does know the block device (because this information
is stored in the inode object that has been passed
all the way down from the VFS layer). So it could quite happily
ask the device to do a read. It doesn't, and the reason for this
is quite subtle. Disk devices generally work most efficiently
when they are reading or writing continuously. They don't work
so well if the disk head is constantly switching tracks. So for
best performance, we want to try to arrange the disk blocks to be
read sequentially, even if that means that they are not read
or written in the order they are requested. The code to do this
is likely to be the same for most, if not all, disk devices. So
disk drivers typically make use of the generic request management
code in the block device layer, rather than scheduling IO operations
themselves.
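The benefit of reordering is easy to demonstrate. The sketch below is not the kernel's request scheduling code: it is a plain sort by block number, with "head travel" measured crudely as the distance in blocks between successive requests.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparison: ascending block number. */
static int toy_cmp_block(const void *a, const void *b)
{
    unsigned long x = *(const unsigned long *)a;
    unsigned long y = *(const unsigned long *)b;
    return (x > y) - (x < y);
}

/* Reorder pending requests into ascending block order. */
static void toy_sort_requests(unsigned long *blocks, size_t n)
{
    qsort(blocks, n, sizeof blocks[0], toy_cmp_block);
}

/* Total distance the head moves when requests are served in order. */
static unsigned long toy_head_travel(unsigned long start,
                                     const unsigned long *blocks, size_t n)
{
    unsigned long pos = start, total = 0;

    for (size_t i = 0; i < n; i++) {
        total += blocks[i] > pos ? blocks[i] - pos : pos - blocks[i];
        pos = blocks[i];
    }
    return total;
}
```

For requests arriving in the order 900, 10, 850, 20, with the head starting at block 0, serving them as they arrive costs 3460 blocks of travel. Serving them sorted costs 900.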
buffer_head structure for each block required.
The salient parts of the structure are:
struct buffer_head
{
struct buffer_head *b_next; /* Next buffer in list */
unsigned long b_blocknr; /* Block number */
kdev_t b_dev; /* Device */
struct page *b_page; /* Memory this block is mapped to */
void (*b_end_io)(struct buffer_head *bh, int uptodate);
// ... and lots more
};
You can see that the structure contains a block number, an identifier
for the device (b_dev),
and a reference to the memory into which the disk
contents should be read. kdev_t is an integer containing
the major and minor device numbers packed together.
buffer_head therefore contains everything the IO
subsystem needs to do the real read. It also defines a function
called b_end_io that the block device layer will call
when it has loaded the requested block (remember this operation
is asynchronous). However,
the VFS generic filesystem infrastructure does not hand off
this structure to the IO subsystem immediately it is returned from
the filesystem handler. Instead, as the
filesystem handler populates buffer_head objects, VFS
builds them into a queue (a linked list), and then submits the whole queue
to the block device layer. A filesystem handler can implement its
own b_end_io function or, more commonly, make use of
the generic end-of-block processing found in the generic block device
layer, which we will consider next.
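The queue-then-submit pattern, with its completion callback, can be sketched as follows. The structure is a hypothetical reduction of buffer_head, and the "block device layer" here completes every request immediately and successfully, which a real device certainly does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of buffer_head to the fields used here. */
struct toy_bh {
    struct toy_bh *b_next;       /* next buffer in the queue      */
    unsigned long  b_blocknr;    /* block to read                 */
    int            uptodate;     /* set by the completion handler */
    void (*b_end_io)(struct toy_bh *bh, int uptodate);
};

/* A generic completion handler: just record the outcome. */
static void toy_end_io(struct toy_bh *bh, int uptodate)
{
    bh->uptodate = uptodate;
}

/* Append bh to the tail of the queue; returns the (possibly new) head. */
static struct toy_bh *toy_enqueue(struct toy_bh *head, struct toy_bh *bh)
{
    bh->b_next = NULL;
    if (!head)
        return bh;

    struct toy_bh *p = head;
    while (p->b_next)
        p = p->b_next;
    p->b_next = bh;
    return head;
}

/* Stand-in for the block device layer: "completes" every request. */
static int toy_submit_queue(struct toy_bh *head)
{
    int n = 0;

    for (struct toy_bh *p = head; p; p = p->b_next) {
        p->b_end_io(p, 1);
        n++;
    }
    return n;
}
```

Because each buffer carries its own b_end_io pointer, a handler that wanted special completion processing could install a different function on its buffers without the submitting code knowing.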
©1994-2006 Kevin Boone, all rights reserved