When multiple processes exist on a single machine, they rely on virtual memory and context switching to create the illusion that each has sole access to memory and the CPU; these mechanisms prevent one process from accessing another process’s register values, stack, heap, etc. However, processes ultimately do not have sole access to the entire machine. There are many resources, such as a network interface, storage devices, user input devices, and so on, that must be shared with other processes on the same machine. As such, processes act as a unit of ownership for instances of access to these resources.
The UNIX file abstraction, which is widely used in modern OS design, provides a uniform interface to these various shared resources. This abstraction relies on two features: a file is a sequence of bytes and everything is a file. It is important to emphasize that this definition is different from the common usage of the term “file,” which is typically associated with persistent data storage. The key differences between this common usage and the UNIX file abstraction are as follows:
- Arbitrary or bidirectional access to a file is not necessarily possible. In some cases, once a byte has been read from the file, that byte no longer exists in the file; there is no way to seek to a previous position in such files. Similarly, sequential access of the bytes in order may be required, with no way to skip ahead.
- Files may not have names or persistent storage. Some files (such as those described in Chapter 3 for interprocess communication, commonly referred to as IPC) exist solely as in-memory constructs at run-time, identified only by an integer file descriptor. Other files (such as /dev/random on UNIX and Linux systems) exist solely as an abstract interface to a hardware component or generate data at run-time on demand.
- Files do not necessarily have structure or typing. Readers are likely familiar with persistent files that can be distinguished by a file extension. For instance, a file with the .png extension is expected to contain image data; programs that read or write these files must make sure that the bytes adhere to a pre-defined semantic structure. However, in the UNIX file abstraction, this pre-defined structure does not exist; a file is just a sequence of bytes.
By removing so much contextual information about files, this abstraction might seem to lose much of its meaning or utility. On the contrary, this abstraction greatly simplifies the work of dealing with a variety of resources; there are certain operations (creating, deleting, opening, closing, reading, writing) that are common to the lifecycle of all files. The UNIX file abstraction provides a single, consistent interface for these operations, thus eliminating much of the complexity of supporting many such resources.
The most basic operations for working with files are creating and opening them. For files that can
be identified with named locations in the file system directory structure (such as /dev/random,
/usr/bin/cksum, or /home/csf/movies.csv), we can use the open() function. The first parameter is
the path to the file; this path can be an absolute path (such as /dev/random) or a relative path
(such as ../src/main.c) that describes the location relative to the current working directory. If
the file is successfully opened, the return value from open() is the file descriptor, a
non-negative integer value that other functions use to identify the file. This value is typically
greater than 2, as the default behavior is to open three files when a process is created:
0 (STDIN_FILENO) for standard input (such as reading from the command prompt), 1 (STDOUT_FILENO)
for standard output (such as writing out to the screen), and 2 (STDERR_FILENO) for standard error
(also writing out to the screen).
C library functions – <fcntl.h>
int open(const char *path, int oflag, ...);
The second parameter (oflag) specifies how the file will be accessed by the current process.
Table 2.2 shows the flags that may be passed as a bit-mask to open(). Note that these flags do
not necessarily align with the common notion of file permissions; a file that is accessible for both
reading and writing may be opened in read-only mode (O_RDONLY). However, if the file permissions
do not allow the requested access, open() will return -1 and set errno to indicate the cause.
Flag | Purpose |
---|---|
O_RDONLY | Open for reading only |
O_WRONLY | Open for writing only |
O_RDWR | Open for reading and writing |
O_NONBLOCK | Do not block on opening while waiting for data |
O_CREAT | Create the file if it does not exist; requires passing mode_t argument |
O_TRUNC | Truncate to size 0 |
O_EXCL | Error if O_CREAT and the file exists |
Table 2.2: Flags for opening files
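As one illustration of combining the flags in Table 2.2, the sketch below creates a file only if it does not already exist and reports why open() failed otherwise. The helper name create_new_file() and the 0644 mode are assumptions made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create path only if it does not already exist.
   Returns the new file descriptor, or -1 with errno set by open(). */
int
create_new_file (const char *path)
{
  assert (path != NULL);
  /* O_EXCL makes open() fail with EEXIST instead of reusing the file */
  int fd = open (path, O_CREAT | O_EXCL | O_WRONLY, 0644);
  if (fd == -1)
    {
      int saved = errno;        /* fprintf() may overwrite errno */
      fprintf (stderr, "open(%s): %s\n", path, strerror (saved));
      errno = saved;
    }
  return fd;
}
```

A second call with the same path should return -1 with errno set to EEXIST, which lets a program distinguish "already there" from other failures such as missing permissions.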
For the common usage of the term “file,” the O_NONBLOCK flag is the least intuitive in Table
2.2, as this flag is normally used for other purposes. Specifically, this flag plays an important
role in IPC and network programming. When using a file to communicate with other processes (either
on the same machine or across the network), the default behavior for reading is for processes to
block (pause) until the data has been received from the sender. The O_NONBLOCK flag changes this
behavior so that a read with no data available fails immediately (returning -1 with errno set to
EAGAIN) and the process can move on to other work instead of waiting.
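The non-blocking behavior can be sketched with a pipe. This example assumes the descriptor comes from pipe() rather than open(), so the flag is added afterward with fcntl(); the helper names set_nonblocking() and try_read(), and the -2 return code for "no data yet," are hypothetical conventions chosen for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: add O_NONBLOCK to an already-open descriptor.
   Returns 0 on success or -1 on failure. */
int
set_nonblocking (int fd)
{
  int flags = fcntl (fd, F_GETFL);
  if (flags == -1)
    return -1;
  return (fcntl (fd, F_SETFL, flags | O_NONBLOCK) == -1) ? -1 : 0;
}

/* Hypothetical helper: attempt one read without waiting. Returns the
   number of bytes read (0 = end of file), -1 on error, or -2 when a
   non-blocking descriptor simply has no data available yet. */
ssize_t
try_read (int fd, void *buf, size_t len)
{
  assert (buf != NULL);
  ssize_t n = read (fd, buf, len);
  if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
    return -2;
  return n;
}
```

With an empty pipe whose write end is still open, try_read() should return -2 right away instead of pausing the process.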
Code Listing 2.12 illustrates how the flags can be combined as a bitmask using the bitwise-or
(|). In this case, the file is also being created (O_CREAT) with a size of 0 bytes initially
(O_TRUNC) and the current process will have write-only access (O_WRONLY). This file will be
persistent and stored in the file system with 644 permissions (6 = read and write for the owner of
the file, 4 = read-only for the associated group and others), subject to the process’s umask; as
such, the file could later be opened in read-write mode. Note that this third parameter (mode) is
required when creating a new file, but is ignored at other times.
    /* Code Listing 2.12:
       Creating a new (empty) file that is ready for writing
     */

    /* This will create an empty file */
    char *path = "data.log";
    mode_t mode = 0644;
    int fd = open (path, O_CREAT | O_TRUNC | O_WRONLY, mode);
Once the file has been opened, it can be read from. The read() function takes three parameters:
the file descriptor, the address of a buffer in memory to read the bytes into, and the maximum
number of bytes to read. [1] The value returned from read() indicates the actual number of
bytes successfully read, which may be fewer than the nbyte parameter. (Calling read() with nbyte
set to 100 on a file that only contains 10 bytes of data will return 10, not 100.)
Finally, when the process is finished working with a file, the close() function will release any
associated resources in the kernel or the C library data that have been allocated for this process.
C library functions – <unistd.h>
ssize_t read(int fildes, void *buf, size_t nbyte);
int close(int fildes);
Bug Warning
There are several key aspects of working with files that are easy to underestimate. First and
foremost is the importance of using a correct value for the nbyte parameter of read(). This
parameter always indicates the maximum number of bytes to read and it should never indicate more
than the size of the allocated buffer. Buffer overflows are some of the most dangerous and
persistent sources of software vulnerabilities, and passing an incorrect parameter to read() is a
common culprit. Consider the following example:
    /* Allocate a buffer of 2 bytes */
    char *buffer = calloc (2, sizeof (char));
    /* WRONG: This reads MORE THAN 2 bytes into the buffer */
    read (fd, buffer, sizeof (buffer));
The problem here is a misunderstanding of the sizeof operator, which returns the size of its
operand’s type. Here, sizeof(buffer) returns the size of a pointer variable (8 bytes on a 64-bit
system), not the size of the dynamically allocated buffer on the heap. (Contrast this with lines 5
and 6 in Code Listing 2.13 below.) As such, this code is trying to read up to 8 bytes of data into
a buffer that can only hold 2 bytes. The result is that read() will simply copy the additional 6
bytes into the memory after the end of the buffer, potentially corrupting other data.
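One way to avoid this mistake is to keep the allocated size in its own variable and pass that variable to read(). The sketch below assumes a hypothetical helper named read_into_new_buffer(); the name and the out-parameter convention are illustrative only.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate capacity bytes on the heap and read at
   most that many bytes from fd. On success, stores the buffer in *out
   (the caller must free it) and returns the byte count; returns -1 on
   failure. */
ssize_t
read_into_new_buffer (int fd, size_t capacity, char **out)
{
  assert (out != NULL && capacity > 0);
  char *buffer = calloc (capacity, sizeof (char));
  if (buffer == NULL)
    return -1;

  /* Pass the tracked capacity, NOT sizeof (buffer), which would be the
     size of the pointer rather than the size of the allocation */
  ssize_t bytes = read (fd, buffer, capacity);
  if (bytes == -1)
    {
      free (buffer);
      return -1;
    }
  *out = buffer;
  return bytes;
}
```

Even when more data is waiting, a call with capacity set to 2 should return no more than 2 bytes, so nothing is written past the end of the allocation.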
There are other frequent, though less serious, problems with using files. One (which is also in the
example above) is to call read() without checking its return value; programmers often assume
that the number of bytes read is the same as the number of bytes requested, which is not
necessarily true. To illustrate this, consider the possibility of calling read() on a file that
has been opened in O_WRONLY mode; read() would return -1 to indicate the operation failed.
Another problem is failing to call close(); this causes resource leaks, as the file descriptor and
the kernel data associated with it are not released until the process exits. On the other hand,
another problem can arise when a file descriptor is used after the file has been closed; this can
cause future reads to fail or (potentially even worse) to read from the wrong file, since
descriptor numbers are reused by later calls that open files.
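A common way to handle short reads is to loop until the requested amount has arrived or the file ends. The following is a minimal sketch of that idea, assuming a hypothetical helper named read_full(); retrying on EINTR is a design choice of this sketch rather than something the text above requires.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read() until count bytes have been
   received, end of file is reached, or an error occurs. Returns the
   total number of bytes read, or -1 on error. */
ssize_t
read_full (int fd, void *buf, size_t count)
{
  assert (buf != NULL);
  size_t total = 0;
  while (total < count)
    {
      ssize_t n = read (fd, (char *) buf + total, count - total);
      if (n == -1)
        {
          if (errno == EINTR)
            continue;           /* interrupted by a signal; try again */
          return -1;            /* real failure; errno describes it */
        }
      if (n == 0)
        break;                  /* end of file: fewer bytes than requested */
      total += (size_t) n;
    }
  return (ssize_t) total;
}
```

The caller still has to compare the result with the requested count; a smaller value means the file ended early, and -1 means the read failed (for example, because the descriptor was not opened for reading).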
Code Listing 2.13 illustrates how to open, read from, and close a file. In this example, we are
reading from a special file known as /dev/random. This file can be used to generate a sequence
of random numbers one byte at a time; every time this code runs, the result should be different.
Note that the file is closed on line 13, but the data is not used by the program until line 17. This
is not a problem, as the data was read into the process’s memory; that is, the read() operation
has made a copy of the data on the stack, so access to the file is no longer necessary.
    /* Code Listing 2.13:
       Reading 10 random numbers using the /dev/random file
     */

    uint8_t buffer[10]; // space allocated automatically on the stack
    memset (buffer, 0, sizeof (buffer));

    /* /dev/random is a special device file that produces an
       unending stream of random numbers */
    int fd = open ("/dev/random", O_RDONLY);
    assert (fd >= 0);
    ssize_t bytes = read (fd, buffer, sizeof (buffer));
    close (fd);

    printf ("Read %zd bytes of random data:\n", bytes);
    for (int i = 0; i < bytes; i++)
      printf (" %02" PRIx8, buffer[i]);
    printf ("\n");
Some files, particularly IPC and device interface files, require special handling when reading.
Recall that the default behavior for open files is to block until data is ready; this behavior is
undesirable when other productive work could be done. For instance, a web server that is blocking
while trying to read data from one client could be missing out on connection requests from other
clients. The poll() function provides a useful interface for avoiding this situation.
C library functions – <poll.h>
int poll(struct pollfd fds[], nfds_t nfds, int timeout);
The first argument to poll() consists of an array of struct pollfd instances, the second
parameter is the length of the array, and the timeout designates a maximum amount of time
(measured in milliseconds) to wait for input to be ready. The fields of the struct pollfd are
shown below. For each struct in the array, the fd field designates a file descriptor to
monitor for input or output events, and the events field designates the events to wait for.
Typically, events is set to the constant POLLIN to indicate a check for the presence of
normal data that can be read without blocking. The revents field is set by the call to poll().
    /* defined in poll.h */
    struct pollfd {
      int fd;         /* file descriptor */
      short events;   /* events to look for */
      short revents;  /* events returned */
    };
Code Listing 2.14 shows how to use poll() to check for available data. If poll() returns 0,
then the timeout expired before the requested event (available input data) occurred; in that case,
no events are reported and revents is 0. A return value of -1 means the call itself failed. When
poll() returns a positive value, the revents field indicates what happened on the file descriptor,
which may be a condition other than POLLIN. For instance, POLLHUP indicates the device has been
disconnected, POLLNVAL indicates the file descriptor is not open, and POLLERR indicates an error
has occurred with the device.
    /* Code Listing 2.14:
       Checking a file descriptor for available input data
     */

    /* Set up a single pollfd for the file descriptor fd */
    struct pollfd fds[1];
    fds[0].fd = fd;
    fds[0].events = POLLIN; // Looking for input data
    fds[0].revents = 0;     // Cleared so it is well-defined if poll() fails

    if (poll (fds, 1, 100) <= 0) // wait up to 100 ms; 0 = timeout, -1 = error
      {
        /* No data is available to be read */
        printf ("No input available (revents = %d)\n", fds[0].revents);

        /* Close and exit if appropriate */
        close (fd);
        exit (1);
      }
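The three possible outcomes of poll() (error, timeout, and a reported event) can be separated as in the following sketch. The helper name read_with_timeout() and its -2 return code for a timeout are assumptions made for this example.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: wait up to timeout_ms for input on fd, then read
   it. Returns the number of bytes read (0 = end of file), -1 on error,
   or -2 if the timeout expired with nothing to read. */
ssize_t
read_with_timeout (int fd, void *buf, size_t len, int timeout_ms)
{
  assert (buf != NULL);
  struct pollfd fds[1];
  fds[0].fd = fd;
  fds[0].events = POLLIN;
  fds[0].revents = 0;

  int ready = poll (fds, 1, timeout_ms);
  if (ready == -1)
    return -1;                  /* poll() itself failed; see errno */
  if (ready == 0)
    return -2;                  /* timeout: nothing happened on fd */
  if (fds[0].revents & (POLLERR | POLLNVAL))
    return -1;                  /* the descriptor is unusable */

  /* POLLIN or POLLHUP was reported, so read() will not block; after a
     hang-up with no remaining data it returns 0 */
  return read (fd, buf, len);
}
```

A server loop could call this helper on each client descriptor with a short timeout and move on to other work whenever it returns -2.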
In addition to reading, programs typically need to write to a file. The arguments to write()
mirror those for read(). Unlike read(), write() cannot overflow a buffer in the current process,
as data is being copied away from the process into the kernel. However, nbyte must still not
exceed the size of the source buffer; otherwise, write() would read past the end of the buffer and
send unrelated memory contents to the file. Checking the return value from write() is as
important as it is with read() to make sure that all of the intended data was written
successfully; this is especially true when writing large pieces of data. Code Listing 2.15
illustrates how to write to a file. Note that writing to the end of a persistent file will cause it
to grow. In this example, the file is created to be empty (O_TRUNC), but writing six bytes
creates a file of size six (the last byte is the null terminator '\0').
C library functions – <unistd.h>
ssize_t write(int fildes, const void *buf, size_t nbyte);
    /* Code Listing 2.15:
       Creating an empty file and writing to it
     */

    /* Create an empty file with read and write permissions */
    int fd = open ("blank", O_CREAT | O_TRUNC | O_WRONLY, S_IRUSR | S_IWUSR);
    close (fd);

    /* Open the file for writing (file is empty, so write starts at offset 0) */
    fd = open ("blank", O_WRONLY);
    assert (fd >= 0);
    assert (write (fd, "hello", 6) == 6);
    close (fd);

    /* Read in what we just wrote */
    char buffer[6];
    memset (buffer, 0, sizeof (buffer));
    fd = open ("blank", O_RDONLY);
    assert (read (fd, buffer, 6) == 6);
    close (fd);
    printf ("Contents: [%s]\n", buffer);
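Because write() may accept fewer bytes than requested, larger transfers are often wrapped in a loop. The following sketch assumes a hypothetical helper named write_all(); as with reading, retrying on EINTR is a choice made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all count bytes have
   been accepted. Returns count on success or -1 on error. */
ssize_t
write_all (int fd, const void *buf, size_t count)
{
  assert (buf != NULL);
  size_t total = 0;
  while (total < count)
    {
      ssize_t n = write (fd, (const char *) buf + total, count - total);
      if (n == -1)
        {
          if (errno == EINTR)
            continue;           /* interrupted by a signal; try again */
          return -1;            /* real failure; errno describes it */
        }
      total += (size_t) n;      /* partial write: send the rest next */
    }
  return (ssize_t) total;
}
```

Note that count must describe the real size of the data at buf; the loop guards against partial writes, not against an incorrect length.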
If the file supports arbitrary accesses, the lseek() function will change the file’s internal
location information to a specified target. The offset can be specified as either a positive or
negative value. The whence parameter, which takes a limited number of possible values, determines
how the offset is interpreted. If whence is set to SEEK_SET, then the offset argument is the exact
number of bytes into the file to use as the location. Setting whence to SEEK_CUR will add the
offset to the current location; a negative offset will seek backwards, while a positive value
seeks forward. Lastly, setting whence to SEEK_END will add the offset to the size of the file;
using a negative offset moves the location to that number of bytes before the end of the file.
Whichever value is passed, the final location must not be negative. If the location is larger than
the file size, performing a write at that point will increase the file size accordingly. Any gap
between the existing end of the file and the new data will be filled with null bytes.
C library functions – <unistd.h>
off_t lseek(int fildes, off_t offset, int whence);
    /* Code Listing 2.16:
       Growing a file with lseek() and write()
     */

    /* Open and jump to offset 10 (4 bytes after the end) */
    int fd = open ("blank", O_RDWR);
    off_t offset = lseek (fd, 4, SEEK_END);
    printf ("Offset is now %lld\n", (long long) offset);

    /* Write additional bytes, appending to the file */
    ssize_t bytes = write (fd, "goodbye", 8);
    printf ("Wrote %zd additional bytes\n", bytes);

    /* Now read all of the file into a buffer and print it.
       Note that the file only contains 18 bytes */
    char buffer[20];
    memset (buffer, 0, sizeof (buffer));
    offset = lseek (fd, 0, SEEK_SET);
    printf ("Offset is now %lld\n", (long long) offset);
    assert (read (fd, buffer, 20) == 18);

    printf ("Contents:\n");
    for (size_t i = 0; i < sizeof (buffer); i++)
      printf ("%02x '%c'\n", buffer[i], buffer[i]);
    close (fd);
Code Listing 2.16 shows the effect of using lseek() and write() on the file created by Code
Listing 2.15. The file initially contained six bytes ('h', 'e', 'l', 'l', 'o', '\0'). The seek on
line 7 moves the internal location to offset 10. The write() on line 11, then, extends the file
size to include the new data, as well as the padding of null bytes; the new file size would then
be 18 bytes. As such, the read() on line 20 requests 20 bytes but only gets 18. Printing the final
file with hexdump shows the results:
00000000 68 65 6c 6c 6f 00 00 00 00 00 67 6f 6f 64 62 79 |hello.....goodby|
00000010 65 00 |e.|
00000012
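The same SEEK_CUR, SEEK_END, and SEEK_SET cases can be combined into small utilities. The sketch below is one possible arrangement; the helper names open_scratch_file(), file_size(), and write_past_end() are hypothetical, and the 0600 mode is an assumption for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create (or empty) a file for reading and writing */
int
open_scratch_file (const char *path)
{
  assert (path != NULL);
  return open (path, O_CREAT | O_TRUNC | O_RDWR, 0600);
}

/* Hypothetical helper: report the size of a seekable file while leaving
   its current location unchanged. Returns -1 if seeking fails. */
off_t
file_size (int fd)
{
  off_t current = lseek (fd, 0, SEEK_CUR);   /* remember where we are */
  if (current == (off_t) -1)
    return -1;
  off_t size = lseek (fd, 0, SEEK_END);      /* offset 0 from the end */
  if (size == (off_t) -1)
    return -1;
  if (lseek (fd, current, SEEK_SET) == (off_t) -1)
    return -1;                               /* could not go back */
  return size;
}

/* Hypothetical helper: write len bytes starting gap bytes past the
   current end of the file; the gap reads back as null bytes */
ssize_t
write_past_end (int fd, off_t gap, const void *data, size_t len)
{
  assert (data != NULL && gap >= 0);
  if (lseek (fd, gap, SEEK_END) == (off_t) -1)
    return -1;
  return write (fd, data, len);
}
```

Starting from a six-byte file, calling write_past_end() with a gap of 4 and eight bytes of data should yield the same 18-byte layout shown in the hexdump output above.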
When working with files, it is often important to access metadata (information about the file) rather than the contents of the file itself. For instance, when reading a persistent file into memory from storage, knowing the file’s size is necessary for allocating memory for the buffer. As another example, consider an intrusion detection program that is responsible for monitoring a file system for security threats or attacks; this program might check for changes to the associated permissions or the user ID that is considered the owner of the file.
C library functions – <sys/stat.h>
int fstat(int fildes, struct stat *buf);
int stat(const char *path, struct stat *buf);
The fstat() and stat() functions provide an interface for accessing file metadata. Note that
stat() uses the path name of the file within the directory structure, which is appropriate for
the common notion of a file as persistent storage; however, fstat() works on any file
descriptor, which allows you to examine the metadata of any file, including unnamed IPC or device
files. Both functions take a pointer to a struct stat, writing the file metadata into this buffer.
    /* defined in sys/stat.h */
    struct stat {
      dev_t st_dev;      /* device inode resides on */
      ino_t st_ino;      /* inode's number */
      mode_t st_mode;    /* inode protection mode */
      nlink_t st_nlink;  /* number of hard links to the file */
      uid_t st_uid;      /* user-id of owner */
      gid_t st_gid;      /* group-id of owner */
      dev_t st_rdev;     /* device type, for special file inode */
      off_t st_size;     /* file size, in bytes */
      /* ... other fields depending on OS ... */
    };
This struct definition contains additional fields based on the particular operating system, but
the ones shown here are consistent across multiple platforms. A full discussion of all of these
fields is beyond the scope of this book, but a few of them are particularly important. To start,
consider the st_ino and st_nlink fields. Each file stored on a typical storage device (USB
drive, hard drive, etc.) is represented by an inode, an on-disk data structure that
contains the metadata; each inode is uniquely identified within its file system by an inode number
(st_ino). However, the file might have multiple human-readable names in the directory structure.
These names, known as links (also called hard links), all point to the same file
contents; the st_nlink field indicates the number of links that exist to a single file. With
hard links, there is only one file; there are just multiple references to the same location. In
contrast, a symbolic link is a distinct file with its own inode that stores the path of its
target, and it is not counted in the target’s st_nlink. See Appendix A for a longer discussion of
inodes and links.
Another key field of the struct stat is the st_mode field. The most common use of this field
is to store the permissions for accessing the file. These permissions include combinations of read,
write, and execute for the owner of the file (the user), the associated group, or everyone else.
The st_mode field also stores additional permissions and information about the file; for instance,
this field can be used to distinguish symbolic links, regular files, or directories. Table 2.3 shows
the standard list of bitmask values that can be combined in the st_mode field. Note that the
permission and special bits are independent flags, whereas the file type values are multi-bit
codes; only one file type applies to a given file.
Name | Bitmask | Description |
---|---|---|
S_IRUSR | 000400 | Read (user) |
S_IWUSR | 000200 | Write (user) |
S_IXUSR | 000100 | Execute (user) |
S_IRGRP | 000040 | Read (group) |
S_IWGRP | 000020 | Write (group) |
S_IXGRP | 000010 | Execute (group) |
S_IROTH | 000004 | Read (other) |
S_IWOTH | 000002 | Write (other) |
S_IXOTH | 000001 | Execute (other) |

Name | Bitmask | Description |
---|---|---|
S_IFIFO | 010000 | Named pipe (IPC) |
S_IFCHR | 020000 | Character device (terminal) |
S_IFDIR | 040000 | Directory file type |
S_IFBLK | 060000 | Block device (disk drive) |
S_IFREG | 100000 | Regular file type |
S_IFLNK | 120000 | Symbolic link |
S_IFSOCK | 140000 | Socket (IPC, networks) |
S_ISUID | 004000 | Setuid (SUID) bit |
S_ISGID | 002000 | Setgid (SGID) bit |
S_ISVTX | 001000 | Sticky bit |
Table 2.3: Bitmasks used in the st_mode field
For example, the hello.c file above would have the bitmask 100644 (displayed as -rw-r--r--
by the ls -l command), as it is a regular file (100000) with read/write permissions for the
user and read for group and others. The symlink.c file would have st_mode 120755 (displayed as
lrwxr-xr-x). Note that the first character in the displayed version indicates the type of file
(- for S_IFREG, l for S_IFLNK, d for S_IFDIR, and so on).
Note
The SUID, SGID, and sticky bits have complex meanings and interpretations. One source of
their complexity is that SUID only affects executable regular files, the sticky bit (which is
mostly obsolete and has changed over time) only affects directories, and SGID affects both
executables and directories! These meanings can be summarized as follows:
- SUID: Processes created with this executable will inherit the user ID of the file’s owner, rather than the user ID of the user executing the program.
- SGID (regular file): Processes created with this executable will inherit the group ID of the file’s group, rather than the group ID of the user executing the program.
- SGID (directory): Files and subdirectories created in this directory will inherit the group ID of this directory.
- Sticky bit (modern usage): Files in this directory can only be deleted by the user who is considered the owner of the file.

When these bits are set on a file, ls -l displays them by overlaying them on top of the execute
bits in the permission field, using an 's' in the user field for SUID, 's' in the group
field for SGID, and 't' in the other field for the sticky bit; if the corresponding 'x'
bit is present, a lower-case letter is used, while an upper-case letter indicates the 'x' bit
is absent. For instance, rwsr-x--- would indicate both S_IXUSR and S_ISUID are set;
rw-r-Sr-- would mean that S_ISGID is set but S_IXGRP is not.
Code Listing 2.17 illustrates how to use fstat() to investigate a file’s metadata. The address
of the local variable info is passed to fstat() to collect the metadata on line 9. Lines 13-15
use the bitwise-and operator (&) to examine the st_mode field; for a single permission bit, the
result is non-zero (true) if the bit is set and zero (false) if not. File types are multi-bit
codes rather than independent flags, so a type check is reliable only when the field is first
masked with S_IFMT (or tested with a macro such as S_ISDIR()). Lines 19 and 20 demonstrate a very
common technique when reading in files. Line 19 uses the st_size field to allocate a buffer sized
to hold the full file contents, then line 20 reads in exactly that number of bytes. Once the file
is read into memory, it can be accessed in a variety of ways. Since this file is an
ASCII-formatted CSV file, it can be manipulated just like a normal string, provided the buffer
ends with a null terminator. Line 24 uses strtok() to split this string at the first instance of
the newline character ('\n'); line 25 can then print just that line as a string. This change only
affects the in-memory buffer, and it does not change the contents of the original file stored on
disk.
    /* Code Listing 2.17:
       Using fstat() to determine the file size and read in the exact amount of data
     */

    /* Open a CSV file and read its status information */
    struct stat info;
    int fd = open ("movies.csv", O_RDONLY);
    assert (fd >= 0);
    assert (fstat (fd, &info) >= 0);

    /* Check the file size and permissions */
    printf ("File is %lld bytes in size\n", (long long) info.st_size);
    printf ("Is file readable by user? %s\n", (info.st_mode & S_IRUSR ? "yes" : "no"));
    printf ("Is file executable by user? %s\n", (info.st_mode & S_IXUSR ? "yes" : "no"));
    printf ("Is this a directory? %s\n", ((info.st_mode & S_IFMT) == S_IFDIR ? "yes" : "no"));

    /* Create a buffer that holds the whole file plus a null terminator
       and read in the contents */
    char *buffer = calloc (info.st_size + 1, sizeof (char));
    ssize_t bytes = read (fd, buffer, info.st_size);
    assert (bytes == info.st_size);
    close (fd);

    char *line = strtok (buffer, "\n");
    printf ("Here is the first line:\n%s\n", line);
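The technique in Code Listing 2.17 can be packaged with explicit error handling instead of assert() calls. The following is a minimal sketch, assuming a hypothetical helper named read_text_file(); rejecting anything that is not a regular file and allocating one extra byte for a null terminator are choices made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: read an entire regular file into a new,
   null-terminated heap buffer. Returns NULL on any failure; the caller
   must free() the result. */
char *
read_text_file (const char *path)
{
  assert (path != NULL);
  int fd = open (path, O_RDONLY);
  if (fd == -1)
    return NULL;

  struct stat info;
  if (fstat (fd, &info) == -1 || !S_ISREG (info.st_mode))
    {
      close (fd);               /* st_size is only meaningful here for
                                   regular files */
      return NULL;
    }

  size_t size = (size_t) info.st_size;
  char *buffer = calloc (size + 1, sizeof (char));  /* +1 for '\0' */
  if (buffer == NULL)
    {
      close (fd);
      return NULL;
    }

  size_t total = 0;
  while (total < size)
    {
      ssize_t n = read (fd, buffer + total, size - total);
      if (n <= 0)
        break;                  /* error or unexpected end of file */
      total += (size_t) n;
    }
  close (fd);

  if (total != size)
    {
      free (buffer);
      return NULL;
    }
  return buffer;                /* calloc() already zeroed the last byte */
}
```

Because the buffer is always terminated, the result can be passed safely to string functions such as strtok() even when the file does not end with a newline.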
Note
The traditional UNIX permission structure, which assigns permissions based only on the user, group, or other, is inflexible and not well suited for many applications. For example, consider two users who are collaborating on a project. These two users both need full permissions to read and write to a file, but they do not want to make the file publicly accessible otherwise. Under the traditional approach, a system administrator could create a group containing these two users; the users could then set permissions based on the group ID. The problem is that each file can be associated with only a single group, and creating a group typically requires administrator privileges. If these users also have similar collaborations with different users on the same system, an administrator must create and maintain a separate group for each one.
To fix this problem, many modern systems support access control lists
(ACLs). Using ACLs, users can grant or revoke permissions to other users on an individual basis. In
addition, ACLs allow a single file to carry permission entries for multiple groups. Rather than
using the traditional ls and chmod commands to view and change permissions, ACLs use the getfacl
and setfacl commands. Consider the following example of these two commands.
$ getfacl team
# file: team
# owner: csf
# group: staff
user::rwx
user:alissa:rwx
user:marcos:rwx
group::r-x
group:csfadmin:r-x
mask::rwx
other::---
default:user::rwx
default:user:alissa:rwx
default:user:marcos:rwx
default:group::r-x
default:group:csfadmin:r-x
default:other::---
$ setfacl -m u:sergei:rwx team
$ setfacl -m d:u:sergei:rwx team
For each file and directory, there is an assigned owner and group, just like the traditional UNIX
permissions, as indicated by the lines beginning with #. The other lines are individual
permissions that have been set for the particular file. The user and group permissions contain
three fields separated by a colon (:). The middle field indicates which user or group and the
third field indicates the permissions (read, write, execute); if the user or group field is empty,
the permission applies to the owner or group of the file. Directories can also have default
permission lines; any time a file is created in this directory, the specified default permissions
are automatically assigned to it. When using setfacl to add, change, or remove a permission
entry, the terms user, group, and default can be abbreviated as simply u, g, or d.
[1] To reiterate the notion of files as just a sequence of bytes, note that the file
descriptor here is not necessarily the value returned from open() as described above. For
files that do not correspond to named locations in the directory tree structure, the file
descriptor may be created by a different function (such as pipe() or socket() as described
in Chapters 3 and 4).