The kernel is a program that runs with full access privileges to the entire computer. The kernel controls access to all shared system resources, including physical memory, the file system, and I/O devices. The kernel is also responsible for handling all exceptional system and software events, such as power disruption or the addition of new plug-and-play peripheral components. To be precise, the kernel is primarily responsible for two functions. It acts as a resource manager, providing access to shared system hardware resources as needed. The kernel also acts as a control program, handling errors and access violations in a safe manner.
The origins of the kernel date back to the earliest days of mainframe computing, when it was known as the monitor. The monitor was a collection of software routines for standard operation that got included with every program when it was run; these included routines for clearing out memory and loading stored data. The term was later changed to resident monitor to reflect the fact that this code was always present (resident) in memory.
In the family of x86 architectures, the CPU’s operating mode is stored as a 2-bit value known as the current privilege level (CPL), which is also called a ring. Although these architectures make it possible to have four rings, only two are used in practice. When the system is in ring 3 (user mode), the set of instructions that are allowed is restricted and no instruction can access any part of memory owned by the kernel.
In ring 0 (kernel mode), all valid memory addresses can be accessed and an additional set of privileged instructions can be performed. Examples of privileged instructions include hlt, which halts the CPU, and invd, which can be used to invalidate the CPU cache. In addition, some normal instructions behave differently in ring 0 than they do in ring 3; popf (pops a word from the stack into the status flag register) is one example, as some status bits are not updated in ring 3.
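As a rough illustration of the privilege check, the following Python sketch models the CPL as the low two bits of a code segment selector and refuses privileged instructions outside ring 0. The selector values 0x33 and 0x10 are the ones Linux uses on x86-64 for user and kernel code. The names execute(), PRIVILEGED, and GeneralProtectionFault are hypothetical and exist only in this model; real hardware performs this check in the CPU itself.

```python
PRIVILEGED = {"hlt", "invd"}  # the privileged instructions named in the text


class GeneralProtectionFault(Exception):
    """Models the exception the CPU raises on a privilege violation."""


def cpl_from_selector(cs_selector: int) -> int:
    """Return the ring number held in the low two bits of a CS selector."""
    return cs_selector & 0b11


def execute(instruction: str, cs_selector: int) -> str:
    """Pretend to run one instruction, enforcing the ring 0 restriction."""
    cpl = cpl_from_selector(cs_selector)
    if instruction in PRIVILEGED and cpl != 0:
        raise GeneralProtectionFault(f"{instruction} attempted in ring {cpl}")
    return f"{instruction} executed in ring {cpl}"


if __name__ == "__main__":
    USER_CS, KERNEL_CS = 0x33, 0x10  # selector values used by Linux on x86-64
    print(execute("hlt", KERNEL_CS))
    try:
        execute("hlt", USER_CS)
    except GeneralProtectionFault as err:
        print("fault:", err)
```

The model treats every non-privileged instruction as allowed in both rings; it does not attempt to capture instructions such as popf whose behavior merely differs between rings.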
Note
To further illustrate the distinction between the kernel and the common usage of the term OS, consider the notion of OS versions or distributions. Windows users are probably familiar with names such as XP, Vista, or Windows 10. Similarly, Mac users might distinguish macOS Sierra or OS X El Capitan, just as Linux users may talk about Linux Mint, Ubuntu, or Red Hat Linux.
In all three of these cases, the versions and distributions share a common kernel. In the case of Windows, the kernel is known as the NT kernel, which was first released in 1993. macOS uses the XNU kernel, which has been part of the system since Mac OS X was first released in 2001. All distributions of Linux use the Linux kernel, which was first released in 1991. The various OS distributions are primarily distinguished by the non-kernel programs and services that are included. Although the kernel is updated between releases, its internal structure and services have remained fairly consistent.
The kernel exists as a protected region of virtual memory within the context of every process. Just like a normal user-mode program, the kernel contains a code segment, global data, and a heap for dynamic memory allocation. Rather than having a single stack, however, the kernel contains many stacks; for each user-mode process, the kernel contains at least one stack.
In general, the kernel interacts with its memory regions the same way as normal programs interact with theirs. The CPU uses the %rip register to load the next instruction from the code segment. Dynamically allocated data structures are stored on the heap and local variables are stored on the stack that is currently in use.
Since the kernel contains information about all processes and system resources, user-mode programs must be prevented from tampering with it. For instance, you would not want a faulty program to reformat your hard drive or shut off power at random times; only the kernel should be able to perform these actions. To prevent tampering from other programs, the kernel configures the CPU to restrict access to the portions of physical memory that are storing the kernel’s virtual memory contents. As a result, if an instruction tries to access a memory location within the kernel while the CPU is set to user mode, the CPU itself will detect this invalid access and trigger an exception.
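One way x86 expresses this restriction is through permission bits in each page table entry: bit 0 marks the page as present, bit 1 as writable, and bit 2 as accessible from user mode. The sketch below models that check in Python. The check_access() function and PageFault class are hypothetical names for this illustration, and the model simplifies the real rules (for example, it applies the read-only check to ring 0 as well).

```python
PTE_PRESENT = 1 << 0   # page is mapped
PTE_WRITABLE = 1 << 1  # writes are allowed
PTE_USER = 1 << 2      # ring 3 may access the page


class PageFault(Exception):
    """Models the exception raised when an access violates page permissions."""


def check_access(pte: int, cpl: int, write: bool = False) -> bool:
    """Return True if the access is allowed; raise PageFault otherwise."""
    if not pte & PTE_PRESENT:
        raise PageFault("page not present")
    if cpl == 3 and not pte & PTE_USER:
        raise PageFault("user-mode access to a kernel page")
    if write and not pte & PTE_WRITABLE:
        # Simplification: applied regardless of the current ring.
        raise PageFault("write to a read-only page")
    return True


if __name__ == "__main__":
    kernel_page = PTE_PRESENT | PTE_WRITABLE           # no PTE_USER bit
    user_page = PTE_PRESENT | PTE_WRITABLE | PTE_USER
    print(check_access(user_page, cpl=3, write=True))  # allowed
    print(check_access(kernel_page, cpl=0))            # allowed in ring 0
    try:
        check_access(kernel_page, cpl=3)
    except PageFault as err:
        print("fault:", err)
```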
The kernel is loaded and begins executing as a part of the boot sequence. When a computer is first turned on, the CPU begins to execute firmware code routines stored in non-volatile hardware storage. Examples of firmware include BIOS and UEFI. These firmware routines locate OS-specific boot loaders, such as GRUB (used by Linux), BOOTMGR (Windows), or BootX (macOS). If a system is configured to support dual booting (the ability to choose from more than one OS), it is the boot loader that provides this capability.
Once the boot loader determines the kernel to load, it locates a file containing the kernel on a storage device. In many systems, the kernel is stored in a compressed format (such as a kernel image compressed with gzip), so the boot loader must decompress it when loading it into memory. Once the kernel has been loaded, the boot loader calls the kernel’s main() entry point. [1]
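The decompress-and-load step can be sketched with Python's standard gzip module. This is only a toy model: the bytearray stands in for physical memory, and load_kernel(), memory, and load_addr are hypothetical names chosen for the illustration. A real boot loader works on raw hardware, with a much more involved image format.

```python
import gzip


def load_kernel(stored: bytes, memory: bytearray, load_addr: int) -> int:
    """Decompress a gzip-compressed image into memory; return its entry address."""
    image = gzip.decompress(stored)
    end = load_addr + len(image)
    if end > len(memory):
        raise MemoryError("kernel image does not fit in memory")
    memory[load_addr:end] = image  # same-length slice assignment, size unchanged
    return load_addr               # assumption: the entry point is the first byte


if __name__ == "__main__":
    stored = gzip.compress(b"KERNEL CODE AND DATA")  # what sits on "disk"
    memory = bytearray(64)                           # pretend physical memory
    entry = load_kernel(stored, memory, load_addr=16)
    print(f"kernel loaded, entry point at address {entry}")
```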
Calling the kernel’s main() is the last action performed by the boot loader, and the kernel then takes over full control of the system. The kernel begins by initializing its own data structures for managing the system and launching its initial system services as separate processes. These system services include the service that allows users to log in to the computer.
The kernel can be invoked in two ways. First, a user-mode program may initiate a system call, which is a request for a specific service. As an example, calling the printf() function will eventually lead to a system call that makes a request for the kernel to write to a particular file, typically stdout. System calls invoke the kernel by executing a trap instruction. On x86-64, this instruction is known as syscall.
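To see a system call without the buffering that printf() adds, a program can request the write directly. On POSIX systems, Python's os.write() is a thin wrapper around the write system call, so each call in the sketch below crosses into the kernel. The helper name write_all() is hypothetical; the loop is there because a single write may accept fewer bytes than requested.

```python
import os


def write_all(fd: int, data: bytes) -> int:
    """Write all of data to fd, issuing one write system call per iteration."""
    total = 0
    while total < len(data):
        total += os.write(fd, data[total:])  # returns the number of bytes written
    return total


if __name__ == "__main__":
    write_all(1, b"hello from a system call\n")  # file descriptor 1 is stdout
```

Running this under a tracing tool such as strace on Linux should show the corresponding write request to the kernel.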
The second way that the kernel can be invoked is in response to either an interrupt or an exception. An interrupt is an asynchronous notification from a hardware component that indicates service is needed. For example, every time you press a key on the keyboard or click a mouse button, that hardware device triggers an interrupt to make the kernel aware of the event. An exception (sometimes called a software interrupt) is a synchronous notification of a problem with the software. Exceptions include faults and aborts, such as segmentation faults, dividing by zero, or illegal memory values that may be the result of hardware failures.
A mode switch refers to a change in the CPL. System calls and interrupts both trigger a mode switch from ring 3 to ring 0. At the same time that the CPL changes, the %rip register is updated to begin reading from the kernel’s code segment. The address loaded into the %rip register is determined by a data structure that the kernel sets up during the boot process. In addition to updating the CPL and the %rip, the CPU makes a copy of the user-mode program status (such as its %rip value).
One important aspect to note about a mode switch is how quickly it occurs. Specifically, mode switches occur within a single execution of the von Neumann instruction cycle. Once the %rip has been updated at the end of one instruction’s cycle, the CPU checks if an interrupt needs processing. If there is a pending interrupt, the CPU triggers a mode switch before fetching the next instruction; if not, then the next instruction is fetched.
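The timing of that check can be sketched as a toy fetch-execute loop. In this Python model, program is a list of instruction names, and arrivals is a hypothetical mapping from an instruction index to the name of an interrupt that becomes pending while that instruction runs. Nothing here corresponds to real hardware interfaces; the point is only where in the cycle the check happens.

```python
from collections import deque


def run(program: list, arrivals: dict) -> list:
    """Run a toy instruction cycle and return a trace of what executed."""
    trace = []
    pending = deque()
    rip = 0
    while rip < len(program):
        instruction = program[rip]            # fetch
        trace.append(f"user: {instruction}")  # execute
        if rip in arrivals:                   # a device raised an interrupt meanwhile
            pending.append(arrivals[rip])
        rip += 1                              # rip updated at the end of the cycle
        while pending:                        # check before the next fetch
            trace.append(f"kernel: handle {pending.popleft()}")
    return trace


if __name__ == "__main__":
    for line in run(["mov", "add", "cmp"], {0: "keyboard"}):
        print(line)
```

With the example input, the keyboard interrupt arrives during the first instruction and is handled before the second instruction is fetched, never in the middle of an instruction.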
After the system call or interrupt has been processed, the kernel forces a mode switch by executing the iret instruction. Just as ret updates the %rip to return to the portion of code that called a function, iret acts as a return from an interrupt to get back to the appropriate location in the user-mode program. The iret instruction restores the user-mode program’s status that it had stored previously and lowers the CPL back to ring 3.
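The full round trip, from entering the kernel to returning with iret, can be modeled in a few lines of Python. This is a sketch under stated assumptions: the CPU class, enter_kernel(), the handler_table dictionary (standing in for the data structure the kernel sets up at boot), and all addresses are hypothetical, and only the %rip and CPL are saved, whereas real hardware saves more state.

```python
from dataclasses import dataclass, field


@dataclass
class CPU:
    rip: int
    cpl: int = 3                               # start in user mode
    saved: list = field(default_factory=list)  # stand-in for the kernel stack


def enter_kernel(cpu: CPU, handler_table: dict, vector: str) -> None:
    """Mode switch into ring 0: save user state, then jump to the handler."""
    cpu.saved.append((cpu.rip, cpu.cpl))
    cpu.cpl = 0
    cpu.rip = handler_table[vector]


def iret(cpu: CPU) -> None:
    """Return from the handler: restore the saved rip and privilege level."""
    cpu.rip, cpu.cpl = cpu.saved.pop()


if __name__ == "__main__":
    table = {"timer": 0xFFFF0000}  # hypothetical handler address
    cpu = CPU(rip=0x401000)        # hypothetical user-mode address
    enter_kernel(cpu, table, "timer")
    print(f"in handler: rip={cpu.rip:#x} cpl={cpu.cpl}")
    iret(cpu)
    print(f"back in user mode: rip={cpu.rip:#x} cpl={cpu.cpl}")
```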
[1] More recent x86 processors have also added another bit to the CPL that is used by certain types of virtualization technologies. This additional bit is used to distinguish between “guest mode” and “host mode.” In these types of systems, multiple guest virtual machines may be running as “guests” while a single hypervisor manages them as the “host.” In these types of environments, ring 0 refers to kernel mode within a guest, whereas the hypervisor operates in “ring -1,” which is kernel mode within the host.