Chapter 4 Files and file systems
When a process completes (or crashes), any data stored in main memory is lost. But data stored on a hard disk drive (HDD) or solid state drive (SSD) is “persistent”; that is, it survives after the process completes, even if the computer shuts down.
Hard disk drives are complicated. Data is stored in blocks, which are laid out in sectors, which make up tracks, which are arranged in concentric circles on platters.
Solid state drives are simpler in one sense, because blocks are numbered sequentially, but they raise a different complication: each block can be written a limited number of times before it becomes unreliable.
As a programmer, you don’t want to deal with these complications. What you want is an appropriate abstraction of persistent storage hardware. The most common abstraction is called a “file system.”
File names are usually strings, and they are usually “hierarchical”; that is, the string specifies a path from a top-level directory (or folder), through a series of subdirectories, to a specific file.
The primary difference between the abstraction and the underlying mechanism is that files are byte-based and persistent storage is block-based. The operating system translates byte-based file operations in the C library into block-based operations on storage devices. Typical block sizes are 1–8 KiB.
For example, the following code opens a file and reads the first byte:
FILE *fp = fopen("/home/downey/file.txt", "r");
int c = fgetc(fp);   /* fgetc returns an int so EOF can be told apart from any byte value */
fclose(fp);
When this code runs:
The process for writing a file is similar, but there are some additional steps. Here is an example that opens a file for writing and writes one character. (Mode "w" creates the file if it does not exist and truncates it if it does, so any previous contents are discarded.)
FILE *fp = fopen("/home/downey/file.txt", "w");
fputc('b', fp);
fclose(fp);
When this code runs:
To summarize, the C library provides the abstraction of a file system that maps from file names to streams of bytes. This abstraction is built on top of storage devices that are actually organized in blocks.
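The translation from byte positions to blocks comes down to integer division. The following is a minimal sketch, not the code an operating system actually runs; the function names `block_index` and `offset_in_block` are made up for illustration, and the block size is passed in as a parameter because it varies from one file system to another.

```c
#include <stddef.h>

/* Index (within the file) of the block that holds the given byte.
   Hypothetical helper; block_size must be greater than zero. */
size_t block_index(size_t byte_offset, size_t block_size)
{
    return byte_offset / block_size;
}

/* Position of that byte inside its block. */
size_t offset_in_block(size_t byte_offset, size_t block_size)
{
    return byte_offset % block_size;
}
```

With 4 KiB blocks, for example, byte 5000 of a file lives in block 1 at offset 904, so reading that one byte means fetching the whole second block.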
4.1 Disk performance
I mentioned earlier that disk drives are slow. On current HDDs, the average time to read a block from disk to memory might be 5–25 ms (see https://en.wikipedia.org/wiki/Hard_disk_drive_performance_characteristics). SSDs are faster, taking 25 µs to read a 4 KiB block and 250 µs to write one (see http://en.wikipedia.org/wiki/Ssd#Controller).
To put these numbers in perspective, let’s compare them to the clock cycle of the CPU. A processor with clock rate 2 GHz completes one clock cycle every 0.5 ns. The time to get a byte from memory to the CPU is typically around 100 ns. If the processor completes one instruction per clock cycle, it would complete 200 instructions while waiting for a byte from memory.
In one microsecond, it would complete 2000 instructions, so while waiting 25 µs for a byte from an SSD, it would complete 50,000.
In one millisecond, it would complete 2,000,000 instructions, so while waiting 20 ms for a byte from an HDD, it might complete 40 million. If there’s nothing for the CPU to do while it waits, it would be idle. That’s why the operating system generally switches to another process while it is waiting for data from disk.
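The arithmetic in the last three paragraphs can be checked with a few lines of C. This sketch keeps the same simplifying assumption as the text, one instruction per clock cycle; the function name `instructions_while_waiting` is invented for this example.

```c
/* Instructions completed during a wait, assuming one instruction per
   clock cycle. The clock rate is in hertz and the wait in nanoseconds.
   The product must fit in an unsigned long long, which holds for the
   values discussed here. */
unsigned long long instructions_while_waiting(unsigned long long clock_hz,
                                              unsigned long long wait_ns)
{
    return clock_hz * wait_ns / 1000000000ULL;
}
```

For a 2 GHz processor, a 100 ns memory access gives 200, a 25 µs SSD read (25,000 ns) gives 50,000, and a 20 ms HDD read (20,000,000 ns) gives 40,000,000.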
The gap in performance between main memory and persistent storage is one of the major challenges of computer system design. Operating systems and hardware provide several features intended to “fill in” this gap:
Some of these features are also implemented in hardware. For example, some disk drives provide a cache that stores recently-used blocks, and many disk drives read more than one block at a time, even if only one is requested.
These mechanisms generally improve the performance of programs, but they don’t change the behavior. Usually programmers don’t have to think about them, with two exceptions: (1) if the performance of a program is unexpectedly bad, you might have to know something about these mechanisms to diagnose the problem, and (2) when data is buffered, it can be harder to debug a program. For example, if a program prints a value and then crashes, the value might not appear, because it might be in a buffer. Similarly, if a program writes data to disk and then the computer loses power, the data might be lost if it is in a cache and not yet on disk.
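When buffering gets in the way of debugging, one common remedy is to flush the stream explicitly with the standard function `fflush`. Here is a small sketch; the helper name `print_now` is hypothetical.

```c
#include <stdio.h>

/* Write a message to the stream, then flush the C library's buffer so
   the text is handed to the operating system right away, even if the
   program crashes on the next line.
   Returns 0 on success, -1 on error. */
int print_now(FILE *stream, const char *message)
{
    if (fputs(message, stream) == EOF) {
        return -1;
    }
    if (fflush(stream) == EOF) {
        return -1;
    }
    return 0;
}
```

A call like `print_now(stdout, "x = 42\n")` makes the value visible before anything else happens. Note that `fflush` only empties the buffer inside the process; the operating system may still hold the data in its own cache. On POSIX systems, `fsync` asks the operating system to push that data to the storage device as well.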
4.2 Disk metadata
The blocks that make up a file might be arranged contiguously on disk, and file system performance is generally better if they are, but most operating systems don’t require contiguous allocation. They are free to place a block anywhere on disk, and they use various data structures to keep track of them.
In many UNIX file systems, that data structure is called an “inode,” which stands for “index node”. More generally, information about files, including the location of their blocks, is called “metadata”. (The content of the file is data, so information about the file is data about data, hence “meta”.)
Since inodes reside on disk along with the rest of the data, they are designed to fit neatly into disk blocks. A UNIX inode contains information about a file, including the user ID of the file owner; permission flags indicating who is allowed to read, write, or execute it; and timestamps that indicate when it was last modified and accessed. In addition, it contains block numbers for the first 12 blocks that make up the file.
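On POSIX systems, a program can look at much of this metadata with the `stat` function, which fills in a `struct stat`. The sketch below assumes a POSIX environment (`stat` is not part of the standard C library), and the helper name `print_metadata` is invented for illustration.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Print some of the metadata the file system keeps for a path.
   Returns 0 on success, -1 if stat() fails. */
int print_metadata(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0) {
        perror("stat");
        return -1;
    }
    /* Casts are used because the exact integer types vary by system. */
    printf("inode number:  %llu\n", (unsigned long long) st.st_ino);
    printf("owner user ID: %lu\n", (unsigned long) st.st_uid);
    printf("permissions:   %o\n", (unsigned) (st.st_mode & 0777));
    printf("size in bytes: %lld\n", (long long) st.st_size);
    printf("last modified: %lld\n", (long long) st.st_mtime);
    printf("last accessed: %lld\n", (long long) st.st_atime);
    return 0;
}
```

The timestamps are printed as seconds since the epoch. The block numbers themselves are not exposed through this interface; they stay hidden behind the file abstraction.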
If the block size is 8 KiB, the first 12 blocks make up 96 KiB. On most systems, that’s big enough for a large majority of files, but it’s definitely not big enough for all of them. That’s why the inode also contains a pointer to an “indirection block”, which contains nothing but pointers to other blocks.
The number of pointers in an indirection block depends on the sizes of the blocks and the block numbers, but it is often 1024. With 1024 block numbers and 8 KiB blocks, an indirection block can address 8 MiB. That’s big enough for all but the largest files, but still not big enough for all.
That’s why the inode also contains a pointer to a “double indirection block”, which contains pointers to indirection blocks. With 1024 indirection blocks, we can address 8 GiB.
And if that’s not big enough, there is (finally) a triple indirection block, which contains pointers to double indirection blocks, yielding a maximum file size of a little more than 8 TiB. When UNIX inodes were designed, that seemed big enough to serve for a long time. But that was a long time ago.
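To see how the four levels add up, here is a short calculation in C. It is only a sketch of the arithmetic in this section, using the block size and pointer counts from the text as parameters; `max_file_size` is a made-up name.

```c
#include <stdint.h>

/* Largest file, in bytes, that an inode can address with `direct`
   direct block numbers plus one single, one double, and one triple
   indirection block, where each indirection block holds `ptrs`
   block numbers. */
uint64_t max_file_size(uint64_t block_size, uint64_t direct, uint64_t ptrs)
{
    uint64_t blocks = direct              /* direct blocks */
                    + ptrs                /* single indirection */
                    + ptrs * ptrs         /* double indirection */
                    + ptrs * ptrs * ptrs; /* triple indirection */
    return blocks * block_size;
}
```

With 8 KiB blocks, 12 direct blocks, and 1024 block numbers per indirection block, the total is 96 KiB + 8 MiB + 8 GiB + 8 TiB. The triple indirection level dominates, which is why the limit is usually quoted as 8 TiB.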
As an alternative to indirection blocks, some file systems, like FAT, use a File Allocation Table that contains one entry for each block, called a “cluster” in this context. A root directory contains a pointer to the first cluster in each file. The FAT entry for each cluster points to the next cluster in the file, similar to a linked list. For more details, see http://en.wikipedia.org/wiki/File_Allocation_Table.
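The linked-list idea can be sketched with an ordinary array standing in for the table. This is a simplified model, not the on-disk FAT format: the `END_OF_CHAIN` marker and the function `count_clusters` are assumptions made for this example, and the code does not guard against a corrupted table that contains a cycle.

```c
#include <stddef.h>

/* Hypothetical end-of-chain marker; real FAT variants reserve
   particular entry values for this purpose. */
#define END_OF_CHAIN ((unsigned) -1)

/* Follow the chain that starts at first_cluster and count how many
   clusters the file occupies. fat[i] holds the number of the cluster
   that comes after cluster i in the same file. */
size_t count_clusters(const unsigned *fat, unsigned first_cluster)
{
    size_t n = 0;

    for (unsigned c = first_cluster; c != END_OF_CHAIN; c = fat[c]) {
        n++;
    }
    return n;
}
```

For instance, if a file starts at cluster 1 and the table says 1 is followed by 4, 4 by 2, and 2 by 5, with 5 marked as the end, the file occupies four clusters that are scattered around the disk.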
4.3 Block allocation
File systems have to keep track of which blocks belong to each file; they also have to keep track of which blocks are available for use. When a new file is created, the file system finds an available block and allocates it. When a file is deleted, the file system makes its blocks available for re-allocation.
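One simple way to track available blocks is a bitmap with one bit per block. The text does not say which data structure any particular file system uses, so treat the following as an illustrative sketch under that assumption; the type `FreeMap`, the constant `NUM_BLOCKS`, and both function names are hypothetical.

```c
#include <stdint.h>

#define NUM_BLOCKS 64   /* hypothetical, very small disk */

/* One bit per block: 1 means allocated, 0 means free. */
typedef struct {
    uint64_t bits;
} FreeMap;

/* Find a free block, mark it allocated, and return its number,
   or return -1 if every block is in use. */
int allocate_block(FreeMap *map)
{
    for (int i = 0; i < NUM_BLOCKS; i++) {
        uint64_t mask = (uint64_t) 1 << i;
        if ((map->bits & mask) == 0) {
            map->bits |= mask;
            return i;
        }
    }
    return -1;
}

/* Make a block available for re-allocation.
   block must be in the range 0 to NUM_BLOCKS - 1. */
void free_block(FreeMap *map, int block)
{
    map->bits &= ~((uint64_t) 1 << block);
}
```

This version always hands out the lowest-numbered free block, which is easy to follow but makes no attempt to keep the blocks of a file close together.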
The goals of the block allocation system are:
It is hard to design a file system that achieves all of these goals, especially since file system performance depends on “workload characteristics” like file sizes, access patterns, etc. A file system that is well tuned for one workload might not perform as well for another.
For this reason, most operating systems support several kinds of file systems, and file system design is an active area of research and development. In the last decade, Linux systems have migrated from ext2, which was a conventional UNIX file system, to ext3, a “journaling” file system intended to improve reliability and speed up recovery after a crash, and more recently to ext4, which can handle larger files and file systems. Within the next few years, there might be another migration to the B-tree file system, Btrfs.
4.4 Everything is a file?
The file abstraction is really a “stream of bytes” abstraction, which turns out to be useful for many things, not just file systems.
One example is the UNIX pipe, which is a simple form of inter-process communication. Processes can be set up so that output from one process is taken as input into another process. For the first process, the pipe behaves like a file open for writing, so it can use C library functions like fputs and fprintf. For the second process, the pipe behaves like a file open for reading, so it uses fgets and fscanf.
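Here is a small sketch of a pipe used through the C library's file functions. It assumes a POSIX system, where `pipe` creates the channel and `fdopen` wraps each end in a `FILE *`. To keep the example short, a single process plays both roles, which only works for messages small enough to fit in the pipe's internal buffer; the function name `pipe_round_trip` is made up.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <unistd.h>

/* Send one short line through a pipe and read it back using ordinary
   stdio calls. Returns 0 on success, -1 on any error. */
int pipe_round_trip(const char *line, char *buffer, int size)
{
    int fds[2];   /* fds[0] is the read end, fds[1] is the write end */

    if (pipe(fds) != 0) {
        return -1;
    }

    FILE *out = fdopen(fds[1], "w");
    FILE *in = fdopen(fds[0], "r");
    if (out == NULL || in == NULL) {
        if (out != NULL) fclose(out); else close(fds[1]);
        if (in != NULL) fclose(in); else close(fds[0]);
        return -1;
    }

    /* The writer treats the pipe like a file open for writing. */
    int write_failed = (fputs(line, out) == EOF);

    /* Closing the write end flushes the buffer and lets the reader
       see end-of-file after the data. */
    if (fclose(out) == EOF) {
        write_failed = 1;
    }

    /* The reader treats the pipe like a file open for reading. */
    char *result = write_failed ? NULL : fgets(buffer, size, in);
    fclose(in);

    return result == NULL ? -1 : 0;
}
```

In a realistic program the two ends would belong to different processes, typically set up with `fork`, but the reading and writing code would look the same.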
Network communication also uses the stream of bytes abstraction. A UNIX socket is a data structure that represents a communication channel between processes on different computers (usually). Again, processes can read data from and write data to a socket using “file” handling functions.
Reusing the file abstraction makes life easier for programmers, since they only have to learn one API (application programming interface). It also makes programs more versatile, since a program intended to work with files can also work with data coming from pipes and other sources.
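As a small illustration of that versatility, the function below takes a `FILE *` and never asks where the bytes come from. It is a sketch with an invented name, `count_lines`, and it relies only on the standard `fgetc`.

```c
#include <stdio.h>

/* Count the newline characters arriving on any stream: a regular
   file, a pipe, or standard input. */
long count_lines(FILE *stream)
{
    long lines = 0;
    int c;   /* int, not char, so EOF can be detected */

    while ((c = fgetc(stream)) != EOF) {
        if (c == '\n') {
            lines++;
        }
    }
    return lines;
}
```

Calling `count_lines(stdin)` works whether standard input is a terminal, a file redirected by the shell, or the output of another process sent through a pipe.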