2.4. System Call Interface

User-mode programs can execute standard CPU instructions that are focused on performing a calculation or implementing a logical control flow. However, user-mode programs have no direct access to any shared computing resource outside the CPU. For instance, user-mode software cannot read data from a hard drive, send data across a network interface, or even display information to the monitor screen. Instead, the user-mode program must execute a system call to request the kernel perform this action on its behalf.

2.4.1. System Calls vs. Function Calls

At the level of assembly language, a system call involves executing a trap instruction. In modern x86 code, the trap instruction is syscall [1], which acts in a manner analogous to call. Instead of jumping to a function within the same program, though, syscall triggers a mode switch and jumps to a routine in the kernel portion of memory. The kernel validates the system call parameters and checks the process’s access permissions. For instance, if the system call is a request to write to a file, the kernel will determine whether the user running the program is allowed to perform this action. Once the kernel has finished performing the system call, it uses the sysret instruction, which performs a role similar to the standard ret instruction. The difference is that sysret also changes the privilege level, returning the system to user mode.

From a higher level perspective, system calls are often written to look like standard C functions. For instance, it is common to find references to the write() system call. This practice is simply a form of short-hand notation. In most cases, there is a C function that acts as a wrapper for the system call. That is, there is a C function called write() in the C standard library; this function will perform a few initial steps before triggering the syscall trap instruction. To be clear, there is a distinction between the write() C function and the system call, but this distinction is often blurred in practice.

2.4.2. Linux System Calls

The Linux source code repository contains the full list of Linux system calls. [2] This table identifies the mapping between the system call number (which actually specifies the system call), the name that is commonly used, and the entry point routine within the Linux kernel itself. For instance, system call 0 is the read() system call. When a user-mode program executes the read() system call, the system will trigger a mode switch and jump to the sys_read() function within the Linux kernel.

There are a couple of observations that can be made from this table. First, every system call has a unique number associated with it. As we will explain next, x86 system call mechanics only use this number. The name that is associated with each number exists only to give it meaning to the programmer, just as we use function names instead of memorizing hard-coded addresses. Second, the names of the entry point functions in Linux are the names of the system calls with sys_ prepended; for instance, the open() system call will call the sys_open() function in the kernel, and mmap() will call sys_mmap().

Lastly, note that the names of the system calls correspond to many common C standard library functions. For instance, open() and close() are the system calls that are used to establish connections to files, socket() is the system call to create a socket for network communication, and exit() can be used to terminate the current process. That is, many C functions are simply wrappers for system calls.

In contrast, other C functions provide additional functionality on top of system calls. In the case of printf(), the code will eventually trigger the write() system call. The primary difference is that write() works at a lower level: the caller must supply a file descriptor, a buffer of bytes that are already formatted, and the exact length of the message in bytes, whereas printf() builds the output from a format string and determines the length itself. In summary, many C standard library functions are thin wrappers for invoking system calls, while others add substantial functionality on top of them.

Table 2.1 lists a small sample of the more than 300 system calls available on 64-bit Linux systems. The full list of system calls can be found in the syscalls(2) man page or in <asm/unistd_64.h>, which is included (through a nested sequence of headers) by <sys/syscall.h>. [3] Each system call is documented in a section 2 man page [4] (e.g., man 2 read).

Syscall        Number   Purpose
read           0        Read from a file descriptor
write          1        Write to a file descriptor
nanosleep      35       High-resolution sleep (units in seconds and nanoseconds)
exit           60       Terminate the current process
kill           62       Send a signal to a process
uname          63       Get information (name, release, etc.) about the current kernel
gettimeofday   96       Get the system time (in seconds since 12:00 AM Jan. 1, 1970)
sysinfo        99       Get information about memory usage and CPU load average
ptrace         101      Trace another process's execution

Table 2.1: A sample of common Linux system calls

2.4.3. Calling System Calls in Assembly

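In assembly language, a system call looks almost exactly like a function call. Arguments are passed to the system call in the general purpose registers, in nearly the same order as for a normal function; the exception is that the fourth argument goes in %r10 instead of %rcx, because the syscall instruction overwrites %rcx. Linux system calls take at most six arguments, so none are passed on the stack. The other main difference is that the system call number is stored into the %rax register. As an example, we can write a standard “Hello, world” program in assembly language using two system calls.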

In Code Listing 2.3, the four mov instructions (lines 9 – 12) set up the write() system call, which expects three arguments: the file descriptor to write to, the address of the message to write, and the length of the message in bytes. As with a normal function, these are passed in the %rdi, %rsi, and %rdx registers (lines 10 – 12). In a normal function call, the call instruction would specify the function to execute. However, syscall does not encode this information. Instead, line 9 moves the constant 1 into %rax, as this is the number for the write() system call. Similarly, lines 16 and 17 indicate that the exit() system call should be invoked with the value 0 as a parameter.

# Code Listing 2.3:
# An assembly-language “hello world” program with system calls

  .global _start	

  .text
_start:
  # write(1, message, 13)
  mov $1, %rax                # system call 1 is write
  mov $1, %rdi                # file handle 1 is stdout
  mov $message, %rsi          # address of string to output
  mov $13, %rdx               # number of bytes
  syscall                     # invoke OS to write to stdout

  # exit(0)
  mov $60, %rax               # system call 60 is exit
  xor %rdi, %rdi              # we want return code 0
  syscall                     # invoke OS to exit

  .data
message:
  .ascii "Hello, world\n"

Many system calls have return values that can be used to determine if an error occurred. As with standard functions, the kernel puts return values in the %rax register. Negative values in the range of -4095 to -1 indicate an error.

2.4.4. Calling System Calls with syscall()

Another method for invoking Linux system calls directly is to use the syscall() library function. For instance, the program in Code Listing 2.4 shows the C equivalent of the assembly language code shown in Code Listing 2.3. As before, we can bypass the C standard library functions for write() and exit() by invoking the system call directly. Specifically, lines 12 and 13 make two system calls, although they look like standard function calls. When the program runs, these calls end up executing the same sequences of instructions as lines 9 – 13 and 16 – 18 from Code Listing 2.3.

/* Code Listing 2.4:
   Using syscall() in C to invoke Linux system calls for writing and exiting
 */

#include <unistd.h>

char *message = "Hello, world\n";

int
main (void)
{
  syscall (1, 1, message, 13);
  syscall (60, 0);

  /* should never reach here */
  return 0;
}

One aspect to note about the implementation of syscall() is that its parameters do not arrive in the registers that the kernel expects. Specifically, the compiler treats syscall() as a regular function call, so it passes the first parameter (the system call number) in %rdi, whereas the kernel expects the system call number to be in %rax. Every other argument is likewise off by one position. Code Listing 2.5 shows how glibc implements syscall() on Linux, moving the system call number into %rax and shifting the remaining register values as needed (lines 8 – 14) before invoking the syscall instruction.

# Code Listing 2.5:
# The glibc implementation of the C syscall() function for x86-64 Linux

# From sysdeps/unix/sysv/linux/x86_64/syscall.S

  .text
ENTRY (syscall)
  movq %rdi, %rax          # Syscall number -> rax.
  movq %rsi, %rdi          # shift arg1 - arg5.
  movq %rdx, %rsi
  movq %rcx, %rdx
  movq %r8, %r10
  movq %r9, %r8
  movq 8(%rsp),%r9         # arg6 is on the stack.
  syscall                  # Do the system call.
  cmpq $-4095, %rax        # Check %rax for error.
  jae SYSCALL_ERROR_LABEL  # Jump to error handler if error.

L(pseudo_end):
  ret                      # Return to caller.

PSEUDO_END (syscall)

[1] The syscall instruction is the primary trap instruction in 64-bit x86 systems. Earlier x86 programs performed system calls by triggering an interrupt with the int $0x80 instruction; the kernel would use iret to return from the interrupt. For performance reasons, this approach was replaced with the sysenter and sysexit instructions on 32-bit systems. syscall and sysret are the 64-bit equivalents of these faster system call instructions.
[2] See https://github.com/torvalds/linux/blob/v3.13/arch/x86/syscalls/syscall_64.tbl for example.
[3] To prevent naming collisions, the names of the system calls are more complicated than shown in the table. Specifically, the read() system call is listed in this header as __NR_read.
[4] For readers new to man pages, documentation on this system can be found by typing man man on the command line. In brief, on Linux and UNIX systems, all C libraries are documented through this manual. The manual is divided into several sections, with section 2 used for system calls and section 3 used for the C standard library. The section of the manual for a function is noted in parentheses after the name. For instance, exit(2) documents the exit system call, whereas exit(3) documents the standard C library exit() function. (Note there is a difference!)