User-mode programs can execute standard CPU instructions that are focused on performing a calculation or implementing a logical control flow. However, user-mode programs have no direct access to any shared computing resource outside the CPU. For instance, user-mode software cannot read data from a hard drive, send data across a network interface, or even display information on the screen. Instead, the user-mode program must execute a system call to request that the kernel perform this action on its behalf.

At the level of assembly language, a system call involves executing a trap instruction. In modern x86 code, the trap instruction is syscall [1], which acts in a manner analogous to call. Instead of jumping to a function within the same program, though, syscall triggers a mode switch and jumps to a routine in the kernel portion of memory. The kernel validates the system call parameters and checks the process’s access permissions. For instance, if the system call is a request to write to a file, the kernel will determine whether the user running the program is allowed to perform this action. Once the kernel has finished performing the system call, it uses the sysret instruction, which performs a role similar to the standard ret instruction. The difference is that sysret also changes the privilege level, returning the system to user mode.

From a higher-level perspective, system calls are often written to look like standard C functions. For instance, it is common to find references to the write() system call. This practice is simply a form of shorthand notation. In most cases, there is a C function that acts as a wrapper for the system call. That is, there is a C function called write() in the C standard library; this function performs a few initial steps before triggering the syscall trap instruction. To be clear, there is a distinction between the write() C function and the system call, but this distinction is often blurred in practice.

The Linux source code repository contains the full list of Linux system calls. [2] This table identifies the mapping between the system call number (which actually specifies the system call), the name that is commonly used, and the entry point routine within the Linux kernel itself. For instance, system call 0 is the read() system call. When a user-mode program executes the read() system call, the system will trigger a mode switch and jump to the sys_read() function within the Linux kernel.

There are a couple of observations that can be made from this table. First, every system call has a unique number associated with it. As we will explain next, x86 system call mechanics only use this number. The name that is associated with each number only gives meaning to the programmer, just as we use function names instead of relying on memorization of hard-coded addresses. Second, the names of the entry point functions in Linux are the names of the system calls with sys_ prepended; for instance, the open() system call will call the sys_open() function in the kernel, and mmap() will call sys_mmap().

Lastly, note that the names of the system calls correspond to many common C standard library functions. For instance, open() and close() are the system calls that are used to establish connections to files, socket() is the system call to create a socket for network communication, and exit() can be used to terminate the current process. That is, many C functions are simply wrappers for system calls.

In contrast, other C functions are implemented to provide additional functionality on top of system calls. In the case of printf(), the code will eventually trigger the write() system call. The primary difference is that write() requires low-level details of how the system is being used that printf() abstracts away. In addition, calling write() requires exact knowledge of the length of the message to be printed, whereas printf() does not. In summary, many C standard library functions provide a thin wrapper for invoking system calls, while others build additional functionality on top of them.

Table 2.1 lists a small sample of the more than 300 system calls available on 64-bit Linux systems. The full list of system calls can be found in the syscalls(2) man page or in <asm/unistd_64.h>, which is included (through a nested sequence of headers) by <sys/syscall.h>. [3] Each system call is documented in a section 2 man page [4] (e.g., man 2 read).

Syscall | Number | Purpose |
---|---|---|
read | 0 | Read from a file descriptor |
write | 1 | Write to a file descriptor |
nanosleep | 35 | High-resolution sleep (units in seconds and nanoseconds) |
exit | 60 | Terminate the current process |
kill | 62 | Send a signal to a process |
uname | 63 | Get information (name, release, etc.) about the current kernel |
gettimeofday | 96 | Get the system time (in seconds since 12:00 AM Jan. 1, 1970) |
sysinfo | 99 | Get information about memory usage and CPU load average |
ptrace | 101 | Trace another process's execution |

Table 2.1: A sample of common Linux system calls

In assembly language, a system call looks almost exactly like a function call. Arguments are passed to the system call in general-purpose registers; on 64-bit Linux, a system call takes at most six arguments, so the stack is not needed. The main difference is that the system call number is stored into the %rax register. As an example, we can write a standard “Hello, world” program in assembly language using two system calls.

In Code Listing 2.3, the four mov instructions (lines 9 – 12) set up the write() system call, which expects three arguments: the file handle to write to, the address of the message to write, and the length of the message in bytes. As with a normal function, these are passed in the %rdi, %rsi, and %rdx registers. In a normal function call, the call instruction would specify the function to execute. However, syscall does not encode this information. Instead, on line 9, we moved the constant 1 into %rax, as this is the number for the write() system call. Similarly, lines 16 and 17 indicate that the exit() system call should be invoked with the value 0 as a parameter.

```asm
# Code Listing 2.3:
# An assembly-language “hello world” program with system calls

        .global _start

        .text
_start:
        # write(1, message, 13)
        mov     $1, %rax            # system call 1 is write
        mov     $1, %rdi            # file handle 1 is stdout
        mov     $message, %rsi      # address of string to output
        mov     $13, %rdx           # number of bytes
        syscall                     # invoke OS to write to stdout

        # exit(0)
        mov     $60, %rax           # system call 60 is exit
        xor     %rdi, %rdi          # we want return code 0
        syscall                     # invoke OS to exit

        .data
message:
        .ascii  "Hello, world\n"
```

Many system calls have return values that can be used to determine if an error occurred. As with standard functions, the kernel puts return values in the %rax register. Negative values in the range of -4095 to -1 indicate an error; the value is the negated error number.

Another method for invoking Linux system calls directly is to use the C library function syscall(). For instance, the program in Code Listing 2.4 shows the C equivalent of the assembly language code shown in Code Listing 2.3. As before, we can bypass the C standard library functions for write() and exit() by invoking the system call directly. Specifically, lines 12 and 13 make two system calls, although they look like standard function calls. Each call ends up executing the syscall instruction with the same register contents as lines 9 – 13 and 16 – 18 from Code Listing 2.3.

```c
/* Code Listing 2.4:
   Using syscall() in C to invoke Linux system calls for writing and exiting
 */

#include <unistd.h>

char *message = "Hello, world\n";

int
main (void)
{
  syscall (1, 1, message, 13);   /* system call 1: write(1, message, 13) */
  syscall (60, 0);               /* system call 60: exit(0) */

  /* should never reach here */
  return 0;
}
```

One aspect to note about the implementation of syscall() is that its parameters arrive in the wrong registers for the kernel. Specifically, the compiler treats syscall() as a regular function call, so it passes the first parameter (the system call number) in %rdi, whereas the kernel expects the system call number to be in %rax. Code Listing 2.5 shows how the C library on Linux (glibc) implements syscall(), shifting the register values as needed (lines 9 – 13) and invoking the syscall instruction. Note that the fourth system call argument is placed in %r10 rather than %rcx, because the syscall instruction itself overwrites %rcx.

```asm
# Code Listing 2.5:
# The glibc implementation of the C syscall() function on Linux
# From sysdeps/unix/sysv/linux/x86_64/syscall.S

        .text

ENTRY (syscall)
        movq %rdi, %rax           # Syscall number -> rax.
        movq %rsi, %rdi           # shift arg1 - arg5.
        movq %rdx, %rsi
        movq %rcx, %rdx
        movq %r8, %r10
        movq %r9, %r8
        movq 8(%rsp),%r9          # arg6 is on the stack.
        syscall                   # Do the system call.
        cmpq $-4095, %rax         # Check %rax for error.
        jae SYSCALL_ERROR_LABEL   # Jump to error handler if error.

L(pseudo_end):
        ret                       # Return to caller.

PSEUDO_END (syscall)
```

[1] | The syscall instruction is the primary trap instruction in 64-bit x86 systems. Earlier x86 programs performed system calls by triggering an interrupt with the int $0x80 instruction; the kernel would use iret to return from the interrupt. For performance reasons, this approach was replaced with the sysenter and sysexit instructions on 32-bit systems. syscall and sysret are the 64-bit equivalents of these faster system call instructions. |
[2] | See https://github.com/torvalds/linux/blob/v3.13/arch/x86/syscalls/syscall_64.tbl for example. |
[3] | To prevent naming collisions, the names of the system calls are more complicated than shown in the table. Specifically, the read() system call is listed in this header as __NR_read. |
[4] | For readers new to man pages, documentation on this system can be found by typing man man on the command line. In brief, on Linux and UNIX systems, system calls and C library functions are documented through this manual. The manual is divided into several sections, with section 2 used for system calls and section 3 used for the C standard library. The section of the manual for a function is noted in parentheses after the name. For instance, exit(2) documents the exit system call, whereas exit(3) documents the standard C library exit() function. (Note there is a difference!) |