2.3. Kernel Mechanics

The kernel is a program that runs with full access privileges to the entire computer. The kernel controls access to all shared system resources, including physical memory, the file system, and I/O devices. The kernel is also responsible for handling all exceptional system and software events, such as power disruption or the addition of new plug-and-play peripheral components. To be precise, the kernel is primarily responsible for two functions. It acts as a resource manager, providing access to shared system hardware resources as needed. The kernel also acts as a control program, handling errors and access violations in a safe manner.

The origins of the kernel date back to the earliest days of mainframe computing, when it was known as the monitor. The monitor was a collection of software routines for standard operations that were included with every program when it was run; these included routines for clearing out memory and loading stored data. The term was later changed to resident monitor to reflect the fact that this code was always present (resident) in memory.

In the family of x86 architectures, the CPU’s operating mode is stored as a 2-bit value known as the current privilege level (CPL), which is also called a ring. Although these architectures make it possible to have four rings, only two are used in practice. When the system is in ring 3 (user mode), the set of instructions that are allowed is restricted and no instruction can access any part of memory owned by the kernel.

In ring 0 (kernel mode), all valid memory addresses can be accessed and an additional set of privileged instructions can be performed. Examples of privileged instructions include hlt, which halts the CPU, and invd, which can be used to invalidate the CPU cache. In addition, some normal instructions behave differently in ring 0 than they do in ring 3; popf (pops a word from the stack into the status flag register) is one example, as some status bits are not updated in ring 3.
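The 2-bit encoding can be made concrete with a small sketch. On x86, the CPL is held in the low two bits of the code segment (%cs) selector, so masking the selector recovers the ring number. The selector values 0x33 and 0x10 used below are the ones typically seen for user and kernel code on x86-64 Linux and are assumptions for illustration; the helper names cpl_of and describe are hypothetical.

```python
RING_NAMES = {0: "kernel mode", 3: "user mode"}


def cpl_of(cs_selector: int) -> int:
    """Return the privilege level encoded in the low two bits of a %cs selector."""
    return cs_selector & 0b11


def describe(cs_selector: int) -> str:
    """Describe the ring for a selector; rings 1 and 2 exist but are unused in practice."""
    ring = cpl_of(cs_selector)
    return f"ring {ring} ({RING_NAMES.get(ring, 'unused in practice')})"


# Assumed selector values (typical for x86-64 Linux, not guaranteed elsewhere).
user_ring = cpl_of(0x33)
kernel_ring = cpl_of(0x10)
```

Because only two bits are used, four rings are representable, which matches the observation that only rings 0 and 3 are used in practice.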

Note

To further illustrate the distinction between the kernel and the common usage of the term OS, consider the notion of OS versions or distributions. Windows users are probably familiar with names such as XP, Vista, or Windows 10. Similarly, Mac users might distinguish macOS Sierra or OS X El Capitan, just as Linux users may talk about Linux Mint, Ubuntu, or Red Hat Linux.

In all three of these cases, the versions and distributions share a common kernel. In the case of Windows, the kernel is known as the NT kernel, which was first released in 1993. macOS, the most recent name for Mac OS X (first released in 2001), is built on a kernel known as XNU. All distributions of Linux use the Linux kernel, which was first released in 1991. The various OS distributions are primarily distinguished by the non-kernel programs and services that are included. Although the kernel itself receives updates from one version to the next, its internal structure and services have remained relatively consistent.

2.3.1. Kernel Memory Structure and Protections

Figure 2.3.2: Application code running in user-mode cannot access any part of kernel memory

The kernel exists as a protected region of virtual memory within the context of every process. Just like a normal user-mode program, the kernel contains a code segment, global data, and a heap for dynamic memory allocation. Rather than having a single stack, however, the kernel contains many stacks; for each user-mode process, the kernel contains at least one stack.

In general, the kernel interacts with its memory regions the same way as normal programs interact with theirs. The CPU uses the %rip register to load the next instruction from the code segment. Dynamically allocated data structures are stored on the heap and local variables are stored on the stack that is currently in use.

Since the kernel contains information about all processes and system resources, user-mode programs must be prevented from tampering with it. For instance, you would not want a faulty program to reformat your hard drive or shut off power at random times; only the kernel should be able to perform these actions. To prevent tampering from other programs, the kernel configures the CPU to restrict access to the portions of physical memory that are storing the kernel’s virtual memory contents. As a result, if an instruction tries to access a memory location within the kernel while the CPU is set to user mode, the CPU itself will detect this invalid access and trigger an exception.
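The check that the CPU performs can be modeled in a few lines. This is a simplified simulation, not how the hardware is actually programmed: the Page record, the two-entry PAGES layout, the addresses, and the ProtectionFault exception are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

KERNEL_MODE = 0
USER_MODE = 3


class ProtectionFault(Exception):
    """Stands in for the exception the CPU triggers on an invalid access."""


@dataclass(frozen=True)
class Page:
    base: int               # first address covered by this page
    size: int               # number of bytes covered
    user_accessible: bool   # False marks memory reserved for the kernel


# Hypothetical layout: one user page and one kernel-only page.
PAGES = [
    Page(base=0x0000_1000, size=0x1000, user_accessible=True),
    Page(base=0xFFFF_8000, size=0x1000, user_accessible=False),
]


def check_access(address: int, cpl: int) -> Page:
    """Return the page holding `address`, or raise if the access is not allowed."""
    for page in PAGES:
        if page.base <= address < page.base + page.size:
            if cpl == USER_MODE and not page.user_accessible:
                raise ProtectionFault(f"user-mode access to {address:#x}")
            return page
    raise ProtectionFault(f"unmapped address {address:#x}")
```

The same address succeeds when the simulated CPL is KERNEL_MODE and raises ProtectionFault when it is USER_MODE, which mirrors the rule that the decision depends on the CPU's current mode rather than on which program issued the instruction.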

2.3.2. The Boot Procedure

The kernel is loaded and begins executing as a part of the boot sequence. When a computer is first turned on, the CPU begins to execute firmware code routines stored in non-volatile hardware storage. Examples of firmware include BIOS and UEFI. These firmware routines locate OS-specific boot loaders, such as GRUB (used by Linux), BOOTMGR (Windows), or BootX (macOS). If a system is configured to support dual booting (the ability to choose from more than one OS), it is the boot loader that provides this capability.

Once the boot loader determines the kernel to load, it locates a file containing the kernel on a storage device. In many systems, the kernel is stored in a compressed format (such as an image compressed with gzip), so it must be decompressed as it is loaded into memory. Once the kernel has been loaded, the boot loader calls the kernel’s main() entry point. [1] Calling the kernel’s main() is the last action performed by the boot loader, and the kernel then takes over full control of the system. The kernel begins by initializing its own data structures for managing the system and launching its initial system services as separate processes. These system services include the service that allows users to log in to the computer.
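The load step can be sketched as a toy model. Here a bytearray stands in for physical memory and a short byte string stands in for the kernel file; the function name load_kernel, the load address 0x100, and the fake image contents are assumptions made purely for illustration.

```python
import gzip


def load_kernel(compressed_image: bytes, memory: bytearray, load_address: int) -> int:
    """Decompress the image into `memory` and return the address to jump to."""
    image = gzip.decompress(compressed_image)
    end = load_address + len(image)
    if end > len(memory):
        raise MemoryError("kernel image does not fit in memory")
    memory[load_address:end] = image
    return load_address  # in this toy model, execution starts at the first byte


# Stand-in for a compressed kernel file found on a storage device.
fake_kernel = b"\x90" * 64 + b"kernel entry"
stored = gzip.compress(fake_kernel)

ram = bytearray(4096)
entry = load_kernel(stored, ram, load_address=0x100)
```

A real boot loader would finish by transferring control to the returned address; the sketch stops at returning it, since that jump is exactly the hand-off after which the loader plays no further role.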

2.3.3. Kernel Invocation

The kernel can be invoked in two ways. First, a user-mode program may initiate a system call, which is a request for a specific service. As an example, calling the printf() function will eventually lead to a system call that makes a request for the kernel to write to a particular file, typically stdout. System calls invoke the kernel by executing a trap instruction. On x86-64, this instruction is known as syscall; 32-bit x86 code has traditionally used int 0x80 or sysenter for the same purpose.
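To see a write request without the buffering layers that printf() adds, the following sketch uses Python's os.write(), which on POSIX systems is a thin wrapper around the write system call. A pipe is used in place of stdout so the result can be read back; the helper name write_all is hypothetical.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Write all of `data` to `fd`, issuing as many write requests as needed."""
    total = 0
    while total < len(data):
        # Each os.write() call asks the kernel to write; it may accept fewer
        # bytes than requested, so the loop resubmits the remainder.
        total += os.write(fd, data[total:])
    return total


read_end, write_end = os.pipe()
try:
    sent = write_all(write_end, b"hello, kernel\n")
    received = os.read(read_end, sent)
finally:
    os.close(read_end)
    os.close(write_end)
```

Passing file descriptor 1 instead of write_end would send the same bytes to stdout, which is the destination in the printf() example.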

The second way that the kernel can be invoked is in response to either an interrupt or an exception. An interrupt is an asynchronous notification from a hardware component that indicates service is needed. For example, every time you press a key on the keyboard or click a mouse button, that hardware device triggers an interrupt to make the kernel aware of the event. An exception (sometimes called a software interrupt) is a synchronous notification of a problem with the software. Exceptions include faults and aborts, such as segmentation faults, dividing by zero, or illegal memory values that may be the result of hardware failures.

2.3.4. Mode Switches and Privileged Instructions

A mode switch refers to a change in the CPL. System calls and interrupts both trigger a mode switch from ring 3 to ring 0. At the same time that the CPL changes, the %rip register is updated to begin reading from the kernel’s code segment. The address loaded into the %rip is determined by a data structure that the kernel sets up during the boot process. In addition to updating the CPL and the %rip, the CPU makes a copy of the user-mode program status (such as its %rip value).

One important aspect to note about a mode switch is how quickly it occurs. Specifically, mode switches occur within a single execution of the von Neumann instruction cycle. Once the %rip has been updated at the end of one instruction’s cycle, the CPU checks if an interrupt needs processing. If there is a pending interrupt, the CPU triggers a mode switch before fetching the next instruction; if not, then the next instruction is fetched.
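The timing described here can be illustrated with a toy fetch-execute loop. This is a simulation under simplifying assumptions: instructions are just strings, and the interrupts_at mapping (a hypothetical parameter) says which device raises an interrupt while a given instruction is executing.

```python
from collections import deque


def run(program, interrupts_at):
    """Execute `program`, checking for a pending interrupt after every instruction.

    `interrupts_at` maps an instruction index to the name of a device that
    raises an interrupt while that instruction is executing.
    """
    trace = []
    pending = deque()
    rip = 0
    while rip < len(program):
        instruction = program[rip]               # fetch
        trace.append(f"user: {instruction}")     # execute in user mode
        if rip in interrupts_at:
            pending.append(interrupts_at[rip])   # device signals asynchronously
        rip += 1                                 # update the instruction pointer
        while pending:                           # check before the next fetch
            trace.append(f"kernel: handle {pending.popleft()}")
    return trace


trace = run(["mov", "add", "cmp"], {1: "keyboard"})
```

In the resulting trace, the keyboard interrupt raised during add is handled after add completes and before cmp is fetched; no instruction is ever interrupted partway through.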

After the system call or interrupt has been processed, the kernel forces a mode switch by executing the iret instruction. Just as ret updates the %rip to return to the portion of code that called a function, iret acts as a return from an interrupt to get back to the appropriate location in the user-mode program. The iret instruction restores the user-mode program’s status that the CPU saved during the switch into the kernel and lowers the CPL back to ring 3.
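The full round trip (save the user state, raise the privilege level, jump to a handler, then restore) can be modeled as follows. This is a simplified simulation: the CPU class, the HANDLERS table, the vector number 0x80, and both addresses are hypothetical stand-ins for the hardware state and the boot-time data structure described above.

```python
from dataclasses import dataclass, field


@dataclass
class CPU:
    cpl: int = 3                                # start in user mode (ring 3)
    rip: int = 0x400000                         # hypothetical user-mode address
    saved: list = field(default_factory=list)   # stands in for the saved user state

    def trap(self, handler_table: dict, vector: int) -> None:
        """Enter the kernel: save user state, raise privilege, jump to the handler."""
        self.saved.append((self.cpl, self.rip))
        self.cpl = 0
        self.rip = handler_table[vector]

    def iret(self) -> None:
        """Return from the handler: restore the saved CPL and instruction pointer."""
        self.cpl, self.rip = self.saved.pop()


# Hypothetical table, standing in for the structure the kernel sets up at boot.
HANDLERS = {0x80: 0xFFFF_8000_0000_1000}

cpu = CPU()
cpu.trap(HANDLERS, 0x80)
in_kernel = (cpu.cpl, cpu.rip)
cpu.iret()
back_in_user = (cpu.cpl, cpu.rip)
```

Note that the user-mode program never chooses the handler address; it only supplies a vector, and the table determines where execution continues in ring 0.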

[1]

More recent x86 processors have also added what is effectively another mode bit, kept separately from the 2-bit CPL, that is used by certain types of virtualization technologies. This additional bit is used to distinguish between “guest mode” and “host mode.” In these types of systems, multiple virtual machines may be running as “guests” while a single hypervisor manages them as the “host.” In these environments, ring 0 refers to kernel mode within a guest, whereas the hypervisor operates in what is informally called “ring -1,” which is kernel mode within the host.