
7.3. Locks

One of the most fundamental synchronization primitives is the lock, which can be used to eliminate the race conditions in critical sections. A lock is also called a mutex or mutex lock because it provides mutually exclusive access to a critical section. That is, once a thread acquires a lock and enters a critical section, no other thread can enter the critical section until the first thread completes its work. Locks have the following key features:

  • Mutual exclusion: A lock is initially considered to be unlocked. Only one thread can acquire the lock at a time. Once a lock has been acquired by one thread, no other thread can acquire it until it has been released.
  • Non-preemption: If a thread acquires a lock, no other thread can take it away. The thread that has acquired the lock must voluntarily release it.
  • Atomic acquire and release operations: The implementation of these operations guarantees that their execution is free of race conditions. If two threads both try to acquire a lock at the same time, either one succeeds and one fails, or both fail (if the lock was previously acquired). Similarly, when a thread releases a lock that it has acquired, this operation cannot fail due to interrupts or other exceptional control flow.
  • Blocking acquire operations: If a thread tries to acquire a lock that is currently being held by another thread, the current request to acquire the lock blocks and is added to a request queue for the lock. When a thread that holds a lock with pending requests releases it, exactly one of the waiting threads will become unblocked and acquire it.

In the case of the threads trying to increment globalvar, the solution is to acquire the lock prior to the increment and release the lock afterwards:

/* acquire lock */
globalvar++;
/* release lock */

If an interrupt and thread switch occur after the first thread has acquired the lock, then the second thread would become blocked. The first thread would be allowed to increment globalvar atomically, storing back the new value of 6 to the variable in memory. Then, the first thread would release the lock, allowing the second thread to enter the critical section and increment globalvar again, producing the correct final result of 7.

7.3.1. POSIX Mutexes

In the POSIX interface, locks are simply called mutexes, and functions are provided for initializing, destroying, locking, and unlocking them. POSIX also provides pthread_mutex_trylock(), which avoids blocking if the lock has already been acquired. The caller can check the return value to determine whether the mutex was successfully acquired; the function returns 0 if successful, or a non-zero error value otherwise. This function is useful if the system has several equivalent copies of a resource that are interchangeable. If the code tries to acquire the lock for one copy and fails, it can move on to another copy without blocking.
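
As a brief sketch of that pattern (the array of per-copy mutexes and the function acquire_any_copy() are hypothetical names invented for illustration, and each mutex is assumed to have already been initialized with pthread_mutex_init()):

#define NUM_COPIES 4

/* Hypothetical: one mutex guards each interchangeable copy */
pthread_mutex_t copy_locks[NUM_COPIES];

/* Try each copy in turn without blocking; return the index of the
   copy whose lock was acquired, or -1 if every copy is busy */
int
acquire_any_copy (void)
{
  int i;
  for (i = 0; i < NUM_COPIES; i++)
    if (pthread_mutex_trylock (&copy_locks[i]) == 0)
      return i;
  return -1;
}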


C library functions – <pthread.h>


int pthread_mutex_init (pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);
Initialize a mutex based on initial attributes.
int pthread_mutex_destroy (pthread_mutex_t *mutex);
Destroy an allocated mutex.
int pthread_mutex_lock (pthread_mutex_t *mutex);
Acquire the specified mutex lock.
int pthread_mutex_trylock (pthread_mutex_t *mutex);
Try to acquire the mutex lock; return a non-zero value if unsuccessful.
int pthread_mutex_unlock (pthread_mutex_t *mutex);
Release the specified mutex lock.

Returning to the example of modifying globalvar described above, Code Listing 7.4 shows how to use the mutex within a thread.

/* Code Listing 7.4:
   Locking a global variable with a mutex
 */

/* Initialize the global variable to 5 */
int globalvar = 5;

/* Each increment thread gets a pointer to the mutex */
void *
increment (void *args)
{
  pthread_mutex_t *mutex = (pthread_mutex_t *) args;
  
  /* Lock for the critical section, then release the mutex */
  pthread_mutex_lock (mutex);
  globalvar++;
  pthread_mutex_unlock (mutex);

  pthread_exit (NULL);
}

The thread receives a pointer to the mutex as its argument. The mutex is locked before entering the critical section globalvar++, then unlocked immediately after. Given this code, regardless of how many threads are created and running, globalvar will be correctly incremented each time. Code Listing 7.5 shows how to create the mutex and pass it to the threads. In this simple scenario, we only create two threads.

/* Code Listing 7.5:
   Passing a mutex to multiple threads
 */

pthread_t threads[2];
pthread_mutex_t mutex;

/* Initialize the mutex, which will be passed to the threads */
pthread_mutex_init (&mutex, NULL);

/* Create the child threads, passing a pointer to the mutex */
assert (pthread_create (&threads[0], NULL, increment, &mutex)
        == 0);
assert (pthread_create (&threads[1], NULL, increment, &mutex)
        == 0);

/* Join the threads and confirm the result */
pthread_join (threads[0], NULL);
pthread_join (threads[1], NULL);

/* globalvar was initialized to 5 and incremented twice */
assert (globalvar == 7);
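
One detail that Code Listing 7.5 omits for brevity: once both threads have been joined and the lock will no longer be used, the mutex should be cleaned up with a matching call to the destroy function, as in this sketch:

/* The mutex is no longer needed once both threads have exited */
pthread_mutex_destroy (&mutex);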

The behavior of these operations is undefined in the POSIX specification under certain unusual circumstances. These situations typically result from errors in the code, but there is not enough information at run time to determine precisely what happened. As such, POSIX leaves the following behaviors undefined by default (although, as the sketch after this list shows, an error-checking mutex type can turn some of them into reported errors):

  • Attempting to lock a mutex that the thread has already acquired; this is also called relocking.
  • Attempting to unlock a mutex owned by another thread; this violates the non-preemption characteristic of locks.
  • Attempting to lock or unlock an uninitialized mutex; this is simply a programming error.
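
As a minimal sketch of that error-checking alternative, a mutex can be initialized with the standard PTHREAD_MUTEX_ERRORCHECK type attribute; relocking and unlocking by a non-owner then return error codes instead of invoking undefined behavior:

#include <assert.h>
#include <errno.h>
#include <pthread.h>

int
main (void)
{
  /* Request the error-checking mutex type via an attribute object */
  pthread_mutexattr_t attr;
  pthread_mutexattr_init (&attr);
  pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ERRORCHECK);

  pthread_mutex_t mutex;
  pthread_mutex_init (&mutex, &attr);
  pthread_mutexattr_destroy (&attr);

  /* Relocking now fails with EDEADLK instead of undefined behavior */
  pthread_mutex_lock (&mutex);
  assert (pthread_mutex_lock (&mutex) == EDEADLK);
  pthread_mutex_unlock (&mutex);

  /* Unlocking a mutex the thread does not own fails with EPERM */
  assert (pthread_mutex_unlock (&mutex) == EPERM);

  pthread_mutex_destroy (&mutex);
  return 0;
}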

7.3.2. Spinlocks

Traditional locks also suffer from one very subtle (but occasionally significant) performance penalty: context switches. The penalty arises from the fact that operations to acquire the lock can block the current thread if the lock is not available. At this point, the kernel must select another thread to run (takes time), save the current thread’s registers (takes time), possibly flush one or more levels of the cache (takes time), possibly switch virtual memory spaces (takes time), and so on.

As a rough average estimate, context switches on modern systems take around 2,000 ns. In many cases, those time penalties are expected and unavoidable. But consider a case where thread A is in possession of a lock and is almost ready to release it. If thread B is running and attempts to acquire the lock, it would have to wait at least 4,000 ns for the kernel to switch from thread B to A and then back from thread A to B. However, what if B’s execution could be delayed for 10 ns, during which time A releases the lock? That is, by waiting 10 ns, B could avoid almost 4,000 ns of unnecessary context switching. If this happens often enough, the system could experience a significant speedup.

In traditional uniprocessing systems, this argument is nonsensical: if one thread is running, then it is impossible for another thread to be releasing the lock at the same time, because a uniprocessing system can only execute one instruction (i.e., one thread) at a time. However, in multiprocessing systems, including multicore systems, there is true parallelism. That is, it is possible for threads A and B to be running—and working with the lock—at the exact same moment in time.

To exploit this opportunity for parallel execution, POSIX (like other modern thread libraries) includes a variation on locks called spinlocks. A spinlock behaves similarly to a traditional lock with one exception: instead of blocking if the lock is owned by another thread, the spinlock stays in a loop that keeps trying to acquire it. If the lock becomes available quickly, the thread can then continue without the penalty of a context switch. The POSIX interface for spinlocks is virtually identical to the one for mutexes; the difference is primarily in the internal implementation.


C library functions – <pthread.h>


int pthread_spin_init (pthread_spinlock_t *lock, int pshared);
Initialize a spinlock.
int pthread_spin_destroy (pthread_spinlock_t *lock);
Destroy an allocated spinlock.
int pthread_spin_lock (pthread_spinlock_t *lock);
Acquire the specified spinlock.
int pthread_spin_trylock (pthread_spinlock_t *lock);
Try to acquire the spinlock; return a non-zero value if unsuccessful.
int pthread_spin_unlock (pthread_spinlock_t *lock);
Release the specified spinlock.
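
To make the comparison concrete, here is the increment thread from Code Listing 7.4 rewritten with a spinlock. This sketch is not a listing from the text, but the structure is unchanged apart from the lock type and function names; the creating thread would initialize the lock with pthread_spin_init (&spin, PTHREAD_PROCESS_PRIVATE) before passing its address to pthread_create(), mirroring Code Listing 7.5.

/* Each increment thread gets a pointer to the spinlock */
void *
increment (void *args)
{
  pthread_spinlock_t *lock = (pthread_spinlock_t *) args;

  /* Spins in a loop if the lock is held, rather than blocking */
  pthread_spin_lock (lock);
  globalvar++;
  pthread_spin_unlock (lock);

  pthread_exit (NULL);
}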

Given that spinlocks can benefit from avoiding unnecessary context switches, it may seem that they are inherently superior to mutexes. However, this is not the case, and mutexes often perform better. Note that spinlocks are only beneficial when the wait for the lock takes less time than a context switch. In the example above, we assumed that the thread only needed to wait 10 ns for the lock to become available.

However, if the lock’s owner is performing a complex and long-running operation, the thread using the spinlock will consume CPU time for the rest of its quantum, which is typically on the order of 100 ms (i.e., 100,000,000 ns). That is, if the lock does not become available almost immediately, the thread will waste millions of clock cycles in an attempt to save thousands. Clearly, this would be a horrible trade-off!

The choice of spinlocks vs. traditional mutexes ultimately comes down to the typical duration of the critical sections. If all of the critical sections are short in duration, then spinlocks are likely to provide superior performance, as there is an increased likelihood that the lock’s holder will release it soon. If long critical sections are common, then mutexes are preferable, as they avoid the wasted CPU time that comes from spinlocks’ busy waiting.
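
When critical-section durations vary, one common compromise is to combine the two approaches: spin briefly with pthread_mutex_trylock(), then fall back to a blocking pthread_mutex_lock(). The helper below is a hypothetical sketch, and the bound of 100 attempts is an arbitrary tuning constant rather than a recommendation:

/* Hypothetical helper: spin briefly, then block */
void
lock_adaptive (pthread_mutex_t *mutex)
{
  int i;
  for (i = 0; i < 100; i++)
    if (pthread_mutex_trylock (mutex) == 0)
      return;                  /* acquired while spinning */
  pthread_mutex_lock (mutex);  /* give up spinning and block */
}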

7.3.3. Defining Critical Sections

Selecting the appropriate size for a critical section is very complicated. Consider the two loops shown in Code Listing 7.6.

/* Code Listing 7.6:
   Two possible critical sections
 */

/* Acquire/release the lock 1,000,000 times inside the loop */
for (i = 0; i < 1000000; i++)
  {
    pthread_mutex_lock (&mutex);
    globalvar++;
    pthread_mutex_unlock (&mutex);
  }

/* Acquire/release the lock only once outside the loop */
pthread_mutex_lock (&mutex);
for (i = 0; i < 1000000; i++)
  globalvar++;
pthread_mutex_unlock (&mutex);

If no other thread tries to access globalvar while this code is running, the end result of each loop is identical except for one critical factor: speed. Both of the loops would guarantee that globalvar gets incremented 1,000,000 times. The first version of the loop accomplishes this feat by acquiring and releasing the lock 1,000,000 times, whereas the second only acquires and releases the lock once. The first version would suffer from a tremendous overhead penalty. (A simple timing experiment showed the first loop consistently took 10 times as much time as the second.)

Despite the performance penalty of the first loop structure, there are times when it is necessary. Specifically, consider what each loop accomplishes. The first structure executes 1,000,000 tiny critical sections that each increment the global variable by 1. The second executes one gigantic critical section that increments the global variable by 1,000,000. While the second approach is much faster for this thread, it could lead to performance penalties for other threads: making the critical section too large increases the amount of time that other threads must wait.

Ultimately, there is no universal answer as to which style should be used, because the choice depends on the needs of the system being built. The best (and perhaps not very useful) advice is to make the critical section as large as it needs to be, but no larger.

