6.3. Race Conditions and Critical Sections¶

Multithreading shares some of the basic principles as processes and multiprogramming. Specifically, threads are used to achieve concurrent execution of software. [1] Furthermore, most modern systems use kernels that are thread-aware. That is, thread scheduling decisions are made by the kernel itself, just as the choice of processes during context switches are. Consequently, thread execution is nondeterministic, leading to unpredictable timing. As threads in a process share the same virtual memory space, this nondeterministic timing can lead to several new types of bugs.

6.3.1. Race Conditions¶

A race condition is the general term for a bug that arises from the nondeterministic timing of execution. If the threads are scheduled in one particular order, everything may be fine; however, if there is a slight delay or change in the order of scheduling, the process may encounter an error and crash. Race conditions can be extremely difficult to debug simply because the bug itself depends on the timing of nondeterministic events. It is quite common that the bug cannot be recreated by testers, particularly if the problematic timing results from a rare circumstance.

The Therac-25 radiation therapy machine is a classic and well-cited example of a race condition with deadly effects. The software for this machine included a one-byte counter. If the machine’s operator entered terminal input to the machine at the exact moment that the counter overflowed (quite common for a counter with only 256 possible values), a critical safety lock failed. This flaw made it possible for patients to receive approximately 100 times the intended radiation dose, which directly caused the deaths of three patients. [2]

Race conditions with threads can arise from a variety of causes, including (but not limited to):

Two threads try to modify the same global variable at the same time. (See the discussion of “Critical Sections” below.)

Data exists when a thread is created, but becomes invalid when the thread tries to access it later. For instance, a thread may be passed a pointer to a dynamically allocated data structure; before the thread gets a chance to run and retrieve the data, another thread calls free() to deallocate the memory on the heap.

A thread returns a pointer to internal variables that were declared within the thread’s scope (e.g., a local variable on the thread’s stack). Once the thread exits, its stack (and all such variables) are destroyed.

Multiple threads call the same non-thread-safe function at the same time. (See below.)

6.3.2. Critical Sections¶

A critical section is a sequence of instructions that must be executed atomically. That is, a critical section contains multiple instructions that create race conditions if they are interleaved with other threads. Every access to a global variable in a multithreaded program creates a critical section. For instance, consider the single line of C code globalvar++;, which simply adds 1 to a global variable. In x86 assembly language, this single line of code turns into three instructions:

movq  _globalvar(%rip), %rsi    # copy from memory into %rsi register
addq  $1, %rsi                  # increment the value in the register
movq  %rsi, _globalvar(%rip)    # store the result back into memory

To understand how this creates a race condition, assume that the variable initially stores the value of 5 and there are two threads (A and B) running. If both threads get to this section of code at the same time, a race condition arises. Specifically, consider the effects of this order of possible timing:

Thread A executes the first line, copying the current value of globalvar (5) into the %rsi register.

The kernel triggers a thread switch from A to B. When this occurs, the register values get copied into A’s stack.

Thread B executes all three lines of code. The first line loads the current value (5) into %rsi, adds 1 to it, and stores the result (6) back into globalvar.

Thread A resumes some time later, restoring the original value (5) from its stack back into the %rsi register. A then adds 1 and also stores the result (6) back into globalvar.

When another thread tries to get the value of globalvar later, its value would be 6 even though it was (technically) incremented twice!

The key aspect of this example is the timing of the thread switch as the second step in this sequence of events. If thread A had been able to execute all three instructions without interruption (which is also possible!), the final value would be correct. The most frustrating aspect of this situation is that re-running the program is unlikely to experience that thread switch in exactly the same instant. As such, re-running the program may repeatedly yield the correct value of 7 over and over again, giving the programmer the false impression that this line of code is behaving correctly. Synchronization of critical sections is a large and complex topic that will be addressed in its own chapter.

6.3.3. Thread Safety and Reentrancy¶

A function is considered to be thread-safe if it can be called concurrently by multiple threads and produce correct results. For instance, a mutex is a mechanism that can force a sequence of instructions to execute in a mutually exclusive (i.e., atomic) manner; in the example above, using a mutex with the globalvar increment would make the bad interleaving impossible. A function is considered to be thread-safe if it can be called by multiple threads concurrently without introducing any race conditions.

Reentrancy is a closely related concept to thread safety, and the terms are sometimes (incorrectly) used interchangeably. The confusion arises because reentrant functions tend to be thread safe, and vice versa. However, a reentrant function is one that can be interrupted during execution and called again safely before the first call is resumed.

An example of a function that is neither thread-safe nor reentrant is the string tokenizer strtok(), which takes a string and breaks it into one substring at a time based on a specified separator. The problem with strtok() is that it uses an internal static pointer to keep track of where it needs to continue. That is, when strtok() returns one a token, the internal pointer is pointing to the next token within that particular string. If strtok() is interrupted and called by another thread, then the internal pointer will be switched to this second thread’s string. When the first thread resumes, it will then attempt to tokenize the second thread’s string rather than its own.

C library functions – <string.h>

char *strtok (char *restrict s, const char *restrict sep);: Splits a string at instances of the substring sep; passing NULL as s continues with previous string.
char *strtok_r(char *str, const char *sep, char **lasts);: Reentrant string tokenizer that uses lasts to point to the continuation location.

The main requirements for reentrancy are that a function cannot hold any global or static non-constant data, and it cannot call any other non-reentrant function. It is still possible for a function to be reentrant but not thread safe. For instance, repeatedly using the value of a global variable is acceptable with reentrancy; but doing so is not thread safe, as another thread may change the value in between uses. On the other hand, a non-reentrant function can be made thread-safe by using a mutex to protect any critical sections.

Ultimately, when writing multithreaded code, programmers must take great care to ensure that the functions their threads use are appropriate. Using functions that are not thread-safe is an easy way to make debugging extraordinarily difficult.

[1]

Early thread implementations were handled solely within the process itself, with the thread library enforcing the thread scheduling. This style had several disadvantages. First, single- and multi-threaded processes were given the same quantum by the scheduler. If a process had 16 threads, each thread would be given only 1/16-th of the amount of CPU time as a single-threaded process. In fact, the situation was worse than this, as the thread library had to inject timed signals to force the switch from one thread to another, yielding significant performance overhead. Second, scheduling decisions had to be made twice and the code was redundant. In essence, the thread library was reproducing kernel code within the user-mode process itself. Consequently, it was possible for the library and the kernel to enforce different scheduling policies, leading to poor scheduling decisions. Finally, kernel thread-awareness is important to gain performance improvements with multiprocessing hardware (including multicore). Once the kernel is aware that a particular process consists of several threads, the kernel can schedule multiple threads at the same time for the threads to benefit from shared caches. If the kernel is not aware of the threads, it will miss these opportunities for potentially considerable speedup.

[2]

The Therac-25 story is complicated and there are many lessons that should be studied by anyone involved in designing and creating software. Another problem with the system is that the software would frequently issue warnings based on potential problems. However, these warnings did not shut the system down and had no obvious effects or meaning to the operators. Consequently, operators learned to ignore all warnings. This behavior is a natural response, as the machine would have been entirely unusable if the operators had to stop and investigate every warning. In other words, software designers should be careful when handling failures and exceptional cases. Blaming users for problems that arise is irresponsible and can make bad situations significantly worse.