2.6. The UNIX File Abstraction

When multiple processes exist on a single machine, they rely on virtual memory and context switching to create the illusion that each has sole access to memory and the CPU; these mechanisms prevent one process from accessing another process’s register values, stack, heap, etc. However, processes ultimately do not have sole access to the entire machine. There are many resources, such as a network interface, storage devices, user input devices, and so on, that must be shared with other processes on the same machine. As such, processes act as a unit of ownership for instances of access to these resources.

The UNIX file abstraction, which is widely used in modern OS design, provides a uniform interface to these various shared resources. This abstraction relies on two features: a file is a sequence of bytes and everything is a file. It is important to emphasize that this definition is different from the common usage of the term “file,” which is typically associated with persistent data storage. The key differences between this common usage and the UNIX file abstraction are as follows:

Arbitrary or bidirectional access to a file is not necessarily possible. In some cases, once a byte has been read from the file, that byte no longer exists in the file; there is no way to seek to a previous position in such files. Similarly, sequential access of the bytes in order may be required, with no way to skip ahead.

Files may not have names or persistent storage. Some files (such as those described in Chapter 3 for interprocess communication, commonly referred to as IPC) exist solely as in-memory constructs at run-time, identified only by an integer file descriptor. Other files (such as /dev/random on UNIX and Linux systems) exist solely as an abstract interface to a hardware component or generate data at run-time on demand.

Files do not necessarily have structure or typing. Readers are likely familiar with persistent files that can be distinguished by a file extension. For instance, a file with the .pdf extension has a different internal structure than one with a .png extension; programs that read or write these files must make sure that the bytes adhere to a pre-defined semantic structure. However, in the UNIX file abstraction, this pre-defined structure does not exist; a file is just a sequence of bytes.

By removing so much contextual information about files, this abstraction might seem to lose much of its meaning or utility. On the contrary, this abstraction greatly simplifies the work of dealing with a variety of resources; there are certain operations (creating, deleting, opening, closing, reading, writing) that are common to the lifecycle of all files. The UNIX file abstraction provides a single, consistent interface for these operations, thus eliminating much of the complexity of supporting many such resources.

2.6.1. Basic File Access

The most basic operations for working with files are creating and opening them. For files that can be identified with named locations in the file system directory structure (such as /dev/random, /usr/bin/cksum, or /home/csf/movies.csv), we can use the open() function. The first parameter is the path to the file; this path can be an absolute path (such as /dev/random) or a relative path (such as ../src/main.c) that describes the location relative to the current working directory. If the file is successfully opened, the return value from open() is the file descriptor, a non-negative integer value that other functions use to identify the file. This value should typically be greater than 2, as the default behavior is to open three files when a process is created: 0 (STDIN_FILENO) for standard input (such as reading from the command prompt), 1 (STDOUT_FILENO) for standard output (such as writing out to the screen), and 2 (STDERR_FILENO) for standard error (also writing out to the screen).

C library functions – <fcntl.h>

int open(const char *path, int oflag, ...);: Open or create a file for reading or writing.

The second parameter (oflag) specifies how the file will be accessed by the current process. Table 2.2 shows the flags that may be passed as a bit-mask to open(). Note that these flags do not necessarily align with the common notion of file permissions; a file that is accessible for both reading and writing may be opened in read-only mode (O_RDONLY). However, if the file permissions do not allow the requested access, open() will return -1.

Flag	Purpose
`O_RDONLY`	Open for reading only
`O_WRONLY`	Open for writing only
`O_RDWR`	Open for reading and writing
`O_NONBLOCK`	Do not block on opening while waiting for data
`O_CREAT`	Create the file if it does not exist; requires passing `mode_t` argument
`O_TRUNC`	Truncate to size 0
`O_EXCL`	Error if `O_CREAT` and the file exists

Table 2.2: Flags for opening files

For the common usage of the term “file,” the O_NONBLOCK flag is the least intuitive in Table 2.2, as this flag is normally used for other purposes. Specifically, this flag plays an important role in IPC and network programming. When using a file to communicate with other processes (either on the same machine or across the network), the default behavior for reading is for processes to block (pause) until the data has been received from the sender. The O_NONBLOCK flag changes this behavior so that a read fails immediately when no data is available, and the process can move on to other work instead of waiting.
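The effect of O_NONBLOCK can be sketched with an empty pipe. Because a pipe is not created with open(), this sketch assumes the flag is added afterward with fcntl(), which is not discussed in this section; the helper name nonblocking_read_errno() is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: try to read from an empty pipe in non-blocking
   mode. Returns the errno value observed after read() fails, or 0 if
   read() unexpectedly succeeded. */
int
nonblocking_read_errno (void)
{
  int fds[2];
  int rc = pipe (fds);
  assert (rc == 0);

  /* Add O_NONBLOCK to the read end's existing status flags */
  int flags = fcntl (fds[0], F_GETFL);
  assert (flags != -1);
  rc = fcntl (fds[0], F_SETFL, flags | O_NONBLOCK);
  assert (rc != -1);

  /* No data has been written, so a blocking read would wait here */
  char byte;
  errno = 0;
  ssize_t n = read (fds[0], &byte, 1);
  int result = (n == -1) ? errno : 0;

  close (fds[0]);
  close (fds[1]);
  return result;
}
```

On POSIX systems the reported error should be EAGAIN (or the equivalent EWOULDBLOCK), which tells the caller to try again later rather than signaling a real failure.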

Code Listing 2.12 illustrates how the flags can be combined as a bitmask using the bitwise-or (|). In this case, the file is being created if it does not exist (O_CREAT) with a size of 0 bytes initially (O_TRUNC) and the current process will have write-only access (O_WRONLY). This file will be persistent and stored in the file system with 644 permissions (6 = read and write for the owner of the file, 4 = read-only for the associated group and others); as such, the owner could later open the file in read-write mode. Note that this third parameter (mode) is required when creating a new file, but is ignored at other times.

/* Code Listing 2.12:
   Creating a new (empty) file that is ready for writing
 */

/* This will create an empty file */
char *path = "data.log";
mode_t mode = 0644;
int fd = open (path, O_CREAT | O_TRUNC | O_WRONLY, mode);
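Code Listing 2.12 does not check the value returned from open(). As a hedged sketch of how that check might look, the following hypothetical helper create_new_file() combines O_CREAT with O_EXCL from Table 2.2, so that open() returns -1 instead of truncating a file that already exists. The error message format is an illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: create path only if it does not already exist.
   Returns a write-only file descriptor that the caller must close(),
   or -1 with errno set by open(). */
int
create_new_file (const char *path)
{
  assert (path != NULL);
  int fd = open (path, O_CREAT | O_EXCL | O_WRONLY, 0644);
  if (fd == -1)
    {
      int saved = errno;        /* fprintf() may overwrite errno */
      fprintf (stderr, "open(%s) failed: %s\n", path, strerror (saved));
      errno = saved;
    }
  return fd;
}
```

Calling this helper twice with the same path should succeed the first time and fail the second time with errno set to EEXIST.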

Once the file has been opened, it can be read from. The read() function takes three parameters: the file descriptor, the address of a buffer in memory to read the bytes into, and the maximum number of bytes to read. [1] The value returned from read() indicates the actual number of bytes successfully read, which may be fewer than the nbyte parameter. (Calling read() with nbyte set to 100 on a file that only contains 10 bytes of data will return 10, not 100.) Finally, when the process is finished working with a file, the close() function will release any associated resources in the kernel or the C library data that have been allocated for this process.

C library functions – <unistd.h>

ssize_t read(int fildes, void *buf, size_t nbyte);: Read up to nbyte bytes from a file into the buffer identified by buf.
int close(int fildes);: Deletes a file descriptor.

Bug Warning

There are several key aspects of working with files that are easy to underestimate. First and foremost is the importance of using a correct value for the nbyte parameter of read(). This parameter always indicates the maximum number of bytes to read, and it should never exceed the size of the buffer that buf points to. Buffer overflows are some of the most dangerous and persistent sources of software vulnerabilities, and passing an incorrect parameter to read() is a common culprit. Consider the following example:

/* Allocate a buffer of 2 bytes */
char *buffer = calloc (2, sizeof (char));
/* WRONG: This reads MORE THAN 2 bytes into the buffer */
read (fd, buffer, sizeof (buffer));

The problem here is a misunderstanding of the sizeof operator, which yields the size of its operand’s type. Here, sizeof(buffer) is the size of a pointer variable (8 bytes on a 64-bit system), not the size of the dynamically allocated buffer on the heap. (Contrast this with lines 5 and 6 in Code Listing 2.13 below.) As such, this code is trying to read up to 8 bytes of data into a buffer that can only hold 2 bytes. If more than 2 bytes are available, read() will simply copy the additional bytes (up to 6) into the memory after the end of the buffer, potentially corrupting other data.

There are other frequent, though less serious, problems with using files. One (which is also in the example above) is to call read() without checking its return value; programmers often assume that the number of bytes read is the same as the number of bytes requested, which is not necessarily true. To illustrate this, consider the possibility of calling read() on a file that has been opened in O_WRONLY mode; read() would return -1 to indicate the operation failed. Another problem is failing to call close(); this leaks file descriptors and the kernel resources associated with them, and a process can only hold a limited number of open files. On the other hand, another problem can arise when a file descriptor is used after the file has been closed; this can cause future reads to fail or (potentially even worse) to read from the wrong file.
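One way to guard against these problems is to pass the buffer size explicitly and to loop until the requested data has arrived. The following is a minimal sketch under that assumption; read_fully() is a hypothetical helper, not a standard library function, and retrying on EINTR is a design choice rather than a requirement.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read until size bytes have arrived, end of file
   is reached, or an error occurs. Returns the number of bytes stored
   in buf, or -1 on error. */
ssize_t
read_fully (int fd, void *buf, size_t size)
{
  assert (buf != NULL);
  char *dest = buf;
  size_t total = 0;
  while (total < size)
    {
      /* Never request more than the space that remains in buf */
      ssize_t n = read (fd, dest + total, size - total);
      if (n == -1)
        {
          if (errno == EINTR)
            continue;           /* interrupted by a signal; try again */
          return -1;            /* real error, such as a bad descriptor */
        }
      if (n == 0)
        break;                  /* end of file: no more data will come */
      total += (size_t) n;
    }
  return (ssize_t) total;
}
```

When the buffer is an array, such as char buffer[10], the caller can pass sizeof (buffer) for size; when it is a pointer returned by calloc(), the caller must pass the same count that was used for the allocation.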

Code Listing 2.13 illustrates how to open, read from, and close a file. In this example, we are reading from a special file known as /dev/random. This file can be used to generate a sequence of random numbers one byte at a time; every time this code runs, the result should be different. Note that the file is closed on line 13, but the data is not used by the program until line 17. This is not a problem, as the data was read into the process’s memory; that is, the read() operation has made a copy of the data on the stack, so access to the file is no longer necessary.

/* Code Listing 2.13:
   Reading 10 random numbers using the /dev/random file
 */

uint8_t buffer[10]; // space allocated automatically on the stack
memset (buffer, 0, sizeof (buffer));

/* /dev/random is a special device file that produces an
   unending stream of random numbers */
int fd = open ("/dev/random", O_RDONLY);
assert (fd >= 0);
ssize_t bytes = read (fd, buffer, sizeof (buffer));
close (fd);

printf ("Read %zd bytes of random data:\n", bytes);
for (int i = 0; i < bytes; i++)
  printf ("  %02" PRIx8, buffer[i]);
printf ("\n");

Some files, particularly IPC and device interface files, require special handling when reading. Recall that the default behavior for open files is to block until data is ready; this behavior is undesirable when other productive work could be done. For instance, a web server that is blocking while trying to read data from one client could be missing out on connection requests from other clients. The poll() function provides a useful interface for avoiding this situation.

C library functions – <poll.h>

int poll(struct pollfd fds[], nfds_t nfds, int timeout);: Examine an array of file descriptors to determine if some are ready for I/O.

The first argument to poll() consists of an array of struct pollfd instances, the second parameter is the length of the array, and the timeout designates a maximum amount of time (measured in milliseconds) to wait for input to be ready. The fields of the struct pollfd are shown below. For each struct in the array, the fd field designates a file descriptor to monitor for input or output events, and the events field designates the events to wait for. Typically, events is set to the constant POLLIN to indicate a check for the presence of normal data that can be read without blocking. The revents field is set by the call to poll().

/* defined in poll.h */
struct pollfd {
  int    fd;       /* file descriptor */
  short  events;   /* events to look for */
  short  revents;  /* events returned */
};

Code Listing 2.14 shows how to use poll() to check for available data. If poll() returns 0, then the timeout expired before the requested event (available input data) occurred, and the revents field is left as 0. If poll() returns -1, the call itself failed and errno indicates why. A positive return value is the number of entries whose revents field is non-zero. In addition to the requested events, revents may report conditions that were not requested. For instance, POLLHUP indicates the device has been disconnected, POLLNVAL indicates the file descriptor is not open, and POLLERR indicates an error has occurred with the device.

/* Code Listing 2.14:
   Checking a file descriptor for available input data
 */

/* Set up a single pollfd for the file descriptor fd */
struct pollfd fds[1];
fds[0].fd = fd;
fds[0].events = POLLIN; // Looking for input data
fds[0].revents = 0;

int ready = poll (fds, 1, 100); // wait for 100 ms
if (ready == 0)
  printf ("Timed out: no data is available to be read\n");
else if (ready < 0 || !(fds[0].revents & POLLIN))
  {
    /* poll() failed, or revents reports POLLHUP, POLLERR, or POLLNVAL */
    printf ("Poll failed: %d\n", fds[0].revents);

    /* Close and exit if appropriate */
    close (fd);
    exit (1);
  }
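The three possible outcomes of poll() (data ready, timeout, or an error condition) can be wrapped in a small function. This is a sketch under the assumption that only one descriptor is monitored; the name wait_readable() and its return convention are hypothetical.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for fd to become readable.
   Returns 1 if data can be read, 0 on timeout, and -1 on any error or
   hang-up condition. */
int
wait_readable (int fd, int timeout_ms)
{
  struct pollfd pfd = { .fd = fd, .events = POLLIN, .revents = 0 };
  int ready = poll (&pfd, 1, timeout_ms);
  if (ready < 0)
    return -1;                  /* poll() itself failed; errno is set */
  if (ready == 0)
    return 0;                   /* timeout: revents is still 0 */

  assert (ready == 1);          /* only one descriptor was passed */
  if (pfd.revents & POLLIN)
    return 1;                   /* at least one byte can be read */
  return -1;                    /* POLLHUP, POLLERR, or POLLNVAL */
}
```

A server loop might call wait_readable() with a short timeout for each client and only call read() on the descriptors for which it returns 1.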

2.6.2. Working with Files

In addition to reading, programs typically need to write to a file. The arguments to write() are identical to those for read(). Unlike read(), there is not really a concern with buffer overflow with write(), as data is being sent away from the current process; the kernel buffers on the other end will prevent such errors. (The nbyte argument must still not exceed the size of the source buffer, or write() will send bytes from beyond its end.) However, checking the return value from write() is as important as it is with read() to make sure that all of the intended data was written successfully; this is especially true when writing large pieces of data. Code Listing 2.15 illustrates how to write to a file. Note that writing to the end of a persistent file will cause it to grow. In this example, the file is created to be empty (O_TRUNC), but writing six bytes creates a file of size six (the last byte is the null terminator '\0').

C library functions – <unistd.h>

ssize_t write(int fildes, const void *buf, size_t nbyte);: Write up to nbyte bytes from a buffer into the specified file.

/* Code Listing 2.15:
   Creating an empty file and writing to it
 */

/* Create an empty file with read and write permissions */
int fd = open ("blank", O_CREAT | O_TRUNC | O_WRONLY, S_IRUSR | S_IWUSR);
close (fd);

/* Open the file for writing (it is empty, so writing starts at offset 0) */
fd = open ("blank", O_WRONLY);
assert (fd >= 0);
assert (write (fd, "hello", 6) == 6);
close (fd);

/* Read in what we just wrote */
char buffer[6];
memset (buffer, 0, sizeof (buffer));
fd = open ("blank", O_RDONLY);
assert (read (fd, buffer, 6) == 6);
close (fd);
printf ("Contents: [%s]\n", buffer);
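Code Listing 2.15 uses assert() to confirm that all six bytes were written. For larger pieces of data, write() may accept only part of the buffer, so a common pattern is to loop until everything has been sent. The following is a minimal sketch of that pattern; write_fully() is a hypothetical helper, not part of the C library.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all size bytes of buf
   have been accepted. Returns size on success, or -1 on error. */
ssize_t
write_fully (int fd, const void *buf, size_t size)
{
  assert (buf != NULL);
  const char *src = buf;
  size_t total = 0;
  while (total < size)
    {
      /* Resume from the first byte that has not been written yet */
      ssize_t n = write (fd, src + total, size - total);
      if (n == -1)
        {
          if (errno == EINTR)
            continue;           /* interrupted by a signal; try again */
          return -1;            /* real error, such as a full disk */
        }
      total += (size_t) n;
    }
  return (ssize_t) total;
}
```

With this helper, the write in Code Listing 2.15 could be expressed as write_fully (fd, "hello", 6), with the return value compared against 6.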

If the file supports arbitrary accesses, the lseek() function will change the file’s internal location information to a specified target. The offset can be specified as either a positive or negative value. The whence parameter, which takes a limited number of possible values, plays an important role in determining this location. If whence is set to SEEK_SET, then the offset argument is the exact number of bytes into the file to use as the location. Setting whence to SEEK_CUR will add the offset to the current location number; a negative offset will seek backwards, while a positive value seeks forward. Lastly, setting whence to SEEK_END will add the offset to the size of the file; using a negative offset moves the location to the number of bytes before the end of the file. Whichever value is passed, the final location must not be negative. If the location is larger than the file size, performing a write at that point will increase the file size accordingly. Any gap between the existing end of the file and the new data will be filled with null bytes.

C library functions – <unistd.h>

off_t lseek(int fildes, off_t offset, int whence);: Reposition the offset of a file descriptor to a specified location.

/* Code Listing 2.16:
   Growing a file with lseek() and write()
 */

/* Open and jump to offset 10 (4 bytes after the end) */
int fd = open ("blank", O_RDWR);
off_t offset = lseek (fd, 4, SEEK_END);
printf ("Offset is now %lld\n", (long long) offset);

/* Write additional bytes, appending to the file */
ssize_t bytes = write (fd, "goodbye", 8);
printf ("Wrote %zd additional bytes\n", bytes);

/* Now read all of the file into a buffer and print it.
   Note that the file only contains 18 bytes */
char buffer[20];
memset (buffer, 0, sizeof (buffer));
offset = lseek (fd, 0, SEEK_SET);
printf ("Offset is now %lld\n", (long long) offset);
assert (read (fd, buffer, 20) == 18);

printf ("Contents:\n");
for (size_t i = 0; i < sizeof (buffer); i++)
  printf ("%02x '%c'\n", buffer[i], buffer[i]);
close (fd);

Code Listing 2.16 shows the effect of using lseek() and write() on the file created by Code Listing 2.15. The file initially contained six bytes ('h', 'e', 'l', 'l', 'o', '\0'). The seek on line 7 places the internal location to offset 10. The write() on line 11, then, extends the file size to include the new data, as well as the padding of null bytes; the new file size would then be 18 bytes. As such, the read() on line 20 requests 20 bytes but only gets 18. Printing the final file with hexdump shows the results:

00000000  68 65 6c 6c 6f 00 00 00  00 00 67 6f 6f 64 62 79  |hello.....goodby|
00000010  65 00                                             |e.|
00000012
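Negative offsets with SEEK_END are useful for reading the end of a file without processing the beginning. As a hedged illustration, the hypothetical helper read_tail() below reads the last count bytes of a seekable file; applied to the 18-byte file above with a count of 8, it should retrieve the bytes of "goodbye" and the trailing null byte.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copy the last count bytes of the file at path
   into buf. Returns the number of bytes read, or -1 on failure. */
ssize_t
read_tail (const char *path, void *buf, size_t count)
{
  assert (path != NULL && buf != NULL);
  int fd = open (path, O_RDONLY);
  if (fd == -1)
    return -1;

  /* Seeking 0 bytes from the end reports the file size */
  off_t size = lseek (fd, 0, SEEK_END);
  ssize_t result = -1;
  if (size != (off_t) -1)
    {
      /* Never ask to move before the start of the file */
      size_t back = (count < (size_t) size) ? count : (size_t) size;

      /* A negative offset with SEEK_END moves backward from the end */
      if (lseek (fd, -(off_t) back, SEEK_END) != (off_t) -1)
        result = read (fd, buf, back);
    }
  close (fd);
  return result;
}
```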

2.6.3. Accessing File Metadata

When working with files, it is often important to access metadata – information about the file – rather than the contents of the file itself. For instance, when reading a persistent file into memory from storage, knowing the file’s size is necessary for allocating memory for the buffer. As another example, consider an intrusion detection program that is responsible for monitoring a file system for security threats or attacks; this program might check for changes to the associated permissions or the user ID that is considered the owner of the file.

C library functions – <sys/stat.h>

int fstat(int fildes, struct stat *buf);: Get status information about a file given an open file descriptor.
int stat(const char *path, struct stat *buf);: Get status information about a file.

The fstat() and stat() functions provide an interface for accessing file metadata. Note that stat() uses the path name of the file within the directory structure, which is appropriate for the common notion of a file as persistent storage; however, fstat() works on any file descriptor, which allows you to examine the metadata of any file, including unnamed IPC or device files. Both functions take a pointer to a struct stat, writing the file metadata into this buffer.

/* defined in sys/stat.h */
struct stat {
  dev_t    st_dev;    /* device inode resides on */
  ino_t    st_ino;    /* inode's number */
  mode_t   st_mode;   /* inode protection mode */
  nlink_t  st_nlink;  /* number of hard links to the file */
  uid_t    st_uid;    /* user-id of owner */
  gid_t    st_gid;    /* group-id of owner */
  dev_t    st_rdev;   /* device type, for special file inode */ 
  off_t    st_size;   /* file size, in bytes */
  /* ... other fields depending on OS ... */
};

This struct definition contains additional fields based on the particular operating system, but the ones shown here are consistent across multiple platforms. A full discussion of all of these fields is beyond the scope of this book, but a few of them are particularly important. To start, consider the st_ino and st_nlink fields. Each file stored on a typical storage device (USB drive, hard drive, etc.) is uniquely identified by an inode, an on-disk data structure that contains the metadata; each inode is uniquely identified by an inode number (st_ino). However, the file might have multiple human-readable names in the directory structure. These names – links (also called hard links) – all point to the same file contents; the st_nlink field indicates the number of links that exist to a single file. With hard links, there is only one file; there are just multiple references to the same location. In contrast, a symbolic link is a distinct file that refers to its target by path name; it is not counted in the st_nlink field of the target’s inode. See Appendix A for a longer discussion of inodes and links.
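The relationship between names, inodes, and the link count can be sketched with the link() and unlink() functions from <unistd.h>, which add and remove names for an existing file. The function hard_link_demo() and the file names it builds are hypothetical, and the sketch assumes the given directory is writable and on a file system that supports hard links.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical demo: create a file and a hard link to it inside dir,
   then compare the metadata seen through each name. Returns 1 if both
   names share one inode with a link count of 2, and 0 otherwise. */
int
hard_link_demo (const char *dir)
{
  assert (dir != NULL);
  char first[256], second[256];
  snprintf (first, sizeof (first), "%s/csf_link_a_%d", dir, (int) getpid ());
  snprintf (second, sizeof (second), "%s/csf_link_b_%d", dir, (int) getpid ());

  /* Create the file under its first name */
  int fd = open (first, O_CREAT | O_EXCL | O_WRONLY, 0600);
  if (fd == -1)
    return 0;
  close (fd);

  /* Add a second name for the same inode */
  if (link (first, second) != 0)
    {
      unlink (first);
      return 0;
    }

  struct stat a, b;
  int ok = stat (first, &a) == 0 && stat (second, &b) == 0
           && a.st_dev == b.st_dev      /* on the same device */
           && a.st_ino == b.st_ino      /* same inode number */
           && b.st_nlink == 2;          /* two names, one file */

  /* Removing one name only decrements the link count */
  unlink (second);
  ok = ok && stat (first, &a) == 0 && a.st_nlink == 1;

  unlink (first);
  return ok;
}
```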

Another key field of the struct stat is the st_mode field. The most common use of this field is to store the permissions for accessing the file. These permissions include combinations of read, write, and execute for the owner of the file (the user), the associated group, or everyone else. The st_mode field also stores additional permissions and information about the file; for instance, this field can be used to distinguish symbolic links, regular files, or directories. Table 2.3 shows the standard list of bitmask values that can be combined in the st_mode field.

Name	Bitmask	Description
`S_IRUSR`	`000400`	Read (user)
`S_IWUSR`	`000200`	Write (user)
`S_IXUSR`	`000100`	Execute (user)
`S_IRGRP`	`000040`	Read (group)
`S_IWGRP`	`000020`	Write (group)
`S_IXGRP`	`000010`	Execute (group)
`S_IROTH`	`000004`	Read (other)
`S_IWOTH`	`000002`	Write (other)
`S_IXOTH`	`000001`	Execute (other)

Name	Bitmask	Description
`S_IFIFO`	`010000`	Named pipe (IPC)
`S_IFCHR`	`020000`	Character device (terminal)
`S_IFDIR`	`040000`	Directory file type
`S_IFBLK`	`060000`	Block device (disk drive)
`S_IFREG`	`100000`	Regular file type
`S_IFLNK`	`120000`	Symbolic link
`S_IFSOCK`	`140000`	Socket (IPC, networks)
`S_ISUID`	`004000`	Setuid (`SUID`) bit
`S_ISGID`	`002000`	Setgid (`SGID`) bit
`S_ISVTX`	`001000`	Sticky bit

Table 2.3: Bitmasks used in the st_mode field

For example, the hello.c file above would have the bitmask 100644 (displayed as -rw-r--r-- by the ls -l command), as it is a regular file (100000) with read/write permissions for the user and read for group and others. The symlink.c symbolic link would have st_mode 120755 (displayed as lrwxr-xr-x). Note that the first character in the displayed version indicates the type of file (- for S_IFREG, l for S_IFLNK, d for S_IFDIR, and so on).
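The file-type portion of st_mode occupies several bits, so it is typically isolated with the S_IFMT mask and then compared against the values in Table 2.3. The sketch below assumes this approach; path_type_char() is a hypothetical helper, and it uses lstat(), a variant of stat() that reports on a symbolic link itself rather than on the file it refers to.

```c
#define _XOPEN_SOURCE 700       /* request the POSIX/XSI declarations */
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: return the file-type character that ls -l would
   print for path, or '?' if the metadata cannot be read. */
char
path_type_char (const char *path)
{
  assert (path != NULL);
  struct stat info;
  if (lstat (path, &info) != 0)
    return '?';

  switch (info.st_mode & S_IFMT)        /* keep only the file-type bits */
    {
    case S_IFREG:  return '-';
    case S_IFDIR:  return 'd';
    case S_IFLNK:  return 'l';
    case S_IFCHR:  return 'c';
    case S_IFBLK:  return 'b';
    case S_IFIFO:  return 'p';
    case S_IFSOCK: return 's';
    default:       return '?';
    }
}
```

Testing a single bit, as in info.st_mode & S_IFDIR, is not sufficient for the file type, because other types (such as S_IFBLK) share that bit.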

Note

The SUID, SGID, and sticky bits have complex meanings and interpretations. One source of their complexity is that SUID only affects executable regular files, the sticky bit (which is mostly obsolete and has changed over time) only affects directories, and SGID affects both executables and directories! These meanings can be summarized as follows:

SUID: Processes created with this executable will inherit the user ID of the file’s owner, rather than the user ID of the user executing the program.

SGID (regular file): Processes created with this executable will inherit the group ID of the file’s group, rather than the group ID of the user executing the program.

SGID (directory): Files and subdirectories created in this directory will inherit the group ID of this directory.

Sticky bit (modern usage): Files in this directory can only be deleted by the user who is considered the owner of the file.

When these bits are set on a file, ls -l displays them by overlaying them on top of the execute bits in the permission field, using an 's' in the user field for SUID, 's' in the group field for SGID, and 't' in the other field for the sticky bit; if the corresponding 'x' bit is present, a lower-case letter is used, while an upper-case letter indicates the 'x' bit is absent. For instance, rwsr-x--- would indicate both S_IXUSR and S_ISUID are set; rw-r-Sr-- would mean that S_ISGID is set but S_IXGRP is not.

Code Listing 2.17 illustrates how to use fstat() to investigate a file’s metadata. The address of the local variable info is passed to fstat() to collect the metadata on line 9. Lines 13 – 15 are using the bitwise-and operator (&) to examine bits of the st_mode field; for a single permission bit, the result would be non-zero (true) if the bit is set but would be zero (false) if not. Lines 19 and 20 demonstrate a very common technique when reading in files. Line 19 uses the st_size field to size the buffer that will hold the full file contents, then line 20 reads in exactly st_size bytes. Once the file is read into memory, it can be accessed in a variety of ways. Since this file is an ASCII-formatted CSV file, it can be manipulated just like a normal string. Line 24 uses strtok() to split this string at the first instance of the newline character ('\n'); line 25 can then print just that line as a string. This change only affects the in-memory buffer, and it does not change the contents of the original file stored on disk.

/* Code Listing 2.17:
   Using fstat() to determine the file size and read in the exact amount of data
 */

/* Open a CSV file and read its status information */
struct stat info;
int fd = open ("movies.csv", O_RDONLY);
assert (fd >= 0);
assert (fstat (fd, &info) >= 0);

/* Check the file size and permissions */
printf ("File is %lld bytes in size\n", (long long) info.st_size);
printf ("Is file readable by user? %s\n", (info.st_mode & S_IRUSR ? "yes" : "no"));
printf ("Is file executable by user? %s\n", (info.st_mode & S_IXUSR ? "yes" : "no"));
printf ("Is this a directory? %s\n", ((info.st_mode & S_IFMT) == S_IFDIR ? "yes" : "no"));

/* Create a buffer that holds the whole file plus a null
   terminator, and read in the contents */
char *buffer = calloc (info.st_size + 1, sizeof (char));
ssize_t bytes = read (fd, buffer, info.st_size);
assert (bytes == info.st_size);
close (fd);

char *line = strtok (buffer, "\n");
printf ("Here is the first line:\n%s\n", line);

Note

The traditional UNIX permission structure, which assigns permissions based only on the user, group, or other, is inflexible and not well suited for many applications. For example, consider two users who are collaborating on a project. These two users both need full permissions to read and write to a file, but they do not want to make the file publicly accessible otherwise. Under the traditional approach, a system administrator could create a group containing these two users; the users could then set permissions based on the group ID. The problem is that each file can be associated with only a single group, so a new group is needed for every distinct set of collaborators. If these users also have similar collaborations with different users on the same system, this approach quickly becomes unmanageable.

To fix this problem, many modern systems support access control lists (ACLs). Using ACLs, users can grant or revoke permissions to other users on an individual basis. In addition, ACLs allow a single file to grant permissions to multiple groups. Rather than using the traditional ls and chmod commands to view and change permissions, ACLs use the getfacl and setfacl commands. Consider the following example of these two commands.

$ getfacl team
# file: team
# owner: csf
# group: staff
user::rwx
user:alissa:rwx
user:marcos:rwx
group::r-x
group:csfadmin:r-x
mask::rwx
other::--- 
default:user::rwx
default:user:alissa:rwx
default:user:marcos:rwx
default:group::r-x
default:group:csfadmin:r-x
default:other::---
$ setfacl -m u:sergei:rwx team
$ setfacl -m d:u:sergei:rwx team

For each file and directory, there is an assigned owner and group, just like the traditional UNIX permissions, as indicated by the lines beginning with #. The other lines are individual permissions that have been set for the particular file. The user and group permissions contain three fields separated by colons (:). The middle field indicates which user or group and the third field indicates the permissions (read, write, execute); if the middle field is empty, the permission applies to the owner or group of the file. Directories can also have default permission lines; any time a file is created in this directory, the specified default permissions are automatically assigned to it. When using setfacl to add, change, or remove a permission entry, the terms user, group, and default can be abbreviated as simply u, g, or d.

[1] To reiterate the notion of files as just a sequence of bytes, note that the file descriptor here was not necessarily the value returned from open() as described above. For files that do not correspond to named locations in the directory tree structure, the file descriptor may be created by a different function (such as pipe() or socket() as described in Chapters 3 and 4).