Synchronization

Programmers must be concerned about concurrent access to shared data. The correctness of a program is usually argued under the assumption that only the program is manipulating its data: if a program places a certain number in a variable, it expects to retrieve that same number from that variable later on. When multiple threads act on shared data, that assumption no longer holds, and programs that rely on it will malfunction.

There are various reasons why threads will share data, and must synchronize their access to the data.

  1. Interrupts suspend the kernel while the interrupt handler runs. The handler and the kernel must somehow communicate, and they do so through shared data structures. The consequences of multiple simultaneous accesses to this data must be considered.
  2. Modern computers have true parallelism. Multiple CPUs, multiple cores in a CPU, or hyperthreading in a single core all give rise to multiple threads of control running at the same time, possibly accessing and changing the same data structures.
  3. Kernels are now preemptible. The kernel has its own set of tasks among which it multitasks. If these operate on shared data, they must be synchronized.

Whether the parallelism is true parallelism from multiple instances of hardware, or virtual parallelism from the technique of time slicing, makes very little difference.

Critical race

Suppose two threads must insert an item at the head of a linked list kernel data structure. The familiar pattern is:

newp = create_item()
newp.next = head_of_list
head_of_list = newp 

Consider what occurs when two processes insert simultaneously. Both have pointers to new items. Both then set the next field of their new item to the head item of the list. Both then point the linked list head pointer at their new item. Only the process that updated the head pointer last has its item inserted; the other item is lost.

Let's simulate the parallelism by writing down the almost simultaneous running of the code along a single flow:

newp = create_item()
newp' = create_item()
newp.next = head_of_list
newp'.next = head_of_list
head_of_list = newp 
head_of_list = newp'

The above assumes each primed action occurs consistently just a bit later than the corresponding unprimed action. This is a simplification; the interleaving can be much messier.

If the last pair of actions occurred in the opposite order, the primed action first, the overall result would be completely different, with the unprimed action winning the race. This is called a critical race, or a race condition. A race occurs when timings outside of the programmer's control determine the outcome of an algorithm.

It makes very little difference whether we have true parallelism or time slicing. The previous example religiously interleaved the two code flows. Using time slicing, I can present the same example as follows:

newp = create_item()
newp.next = head_of_list
--- preempt! new task
newp' = create_item()
newp'.next = head_of_list
head_of_list = newp'
--- preempt! resume old task
head_of_list = newp 

The problem of synchronization can be isolated to those lines in the code flow that must be run from start to finish without intervening action by other tasks. These are called critical sections.

Synchronization properties

The above problems can be avoided if the thread is assured that a certain block of code runs on the data structure exclusive of any other thread's manipulation of the data structure. A thread can claim the data structure, then observe the values in the data structure. It can be assured that those values persist during the period in which it makes updates. It then adjusts all the values to a consistent state before releasing its exclusive claim to the data structure, allowing other threads to observe the values, or compete to update them.

One possibility is to drive all manipulation of a data structure through a single section of code. For instance, all inserts into or removals from a linked list must, by fiat and programmer agreement, call a specific function. This translates the critical race on the data structure into a critical race to assume control of the function. The body of the function is called a critical section.

This creates a restrictive situation which may or may not work out in practice. A more encompassing discipline is followed by the Java programming language. Every instantiated object has its own lock. If a method is marked "synchronized" then the instance's lock must be taken by any thread during the run of the method. The lock is per object, and all synchronized methods refer to that one lock. If there is a method to insert an element into a linked list object, and a method to remove an element from a linked list object, only one thread will be in either method at a time.

The beauty of Java is that this is all done by the language: the lock is taken before entry to the method and is automatically released when the method exits. If multiple threads contend for the lock, they are serialized, possessing the lock one by one.
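The same per-object discipline can be imitated by hand in a language without built-in monitors. Here is a minimal sketch in C with POSIX threads; the struct layout and the names list_insert and list_remove are illustrative, not taken from any particular kernel:

#include <pthread.h>
#include <stddef.h>

struct node {
    struct node *next;
    /* payload omitted */
};

struct list {
    pthread_mutex_t lock;   /* the per-object lock, one per list */
    struct node *head;      /* initialize lock with PTHREAD_MUTEX_INITIALIZER */
};

void list_insert(struct list *l, struct node *n)
{
    pthread_mutex_lock(&l->lock);    /* what Java does at method entry */
    n->next = l->head;               /* the racy insert from above,    */
    l->head = n;                     /* now run under mutual exclusion */
    pthread_mutex_unlock(&l->lock);  /* what Java does at method exit  */
}

struct node *list_remove(struct list *l)
{
    pthread_mutex_lock(&l->lock);
    struct node *n = l->head;
    if (n != NULL)
        l->head = n->next;
    pthread_mutex_unlock(&l->lock);
    return n;
}

Unlike Java, nothing in C stops a thread from touching l->head without taking the lock; the discipline holds only by programmer agreement, as above.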

Whatever the implementation, synchronization is the enforcement of the following three properties:

  1. Safety, also known as mutual exclusion. A data structure or section of code is safe (or thread safe) if one and only one thread can be running the code, or holding the lock on the data structure, at any moment.
  2. Liveness. The opposite of liveness is called deadlock. It must be possible that, among the threads contending for the lock, some thread makes progress. We have not yet discussed what happens when a thread cannot immediately get control of the critical section or lock. One way or another it must wait (see below for various interpretations of "wait"). It is not only possible, but a not uncommon programming error, for a mutual dependence to occur where threads end up waiting for each other, and no thread can make progress. This is a deadlock condition, or a failure of liveness.
  3. Fairness. This is an almost optional property. It says that any thread waiting on a lock or critical section must eventually get the lock or enter the critical section, and furthermore, it must be able to compete fairly (in some sense).

Synchronization algorithms

Peterson's Algorithm

Peterson's algorithm is a software-only solution to the synchronization of two processes. In particular, it assures that one and only one of the two processes is in a critical section at a time. In addition to enforcing mutual exclusion, i.e. that at most one of the two processes is in the critical section at any moment, it also avoids deadlock, i.e. at least one of the processes will be able to continue, even if both contend for the critical section.

As a bonus, Peterson's algorithm also has a fairness property. A process wishing to enter the critical section will enter the section the next time it is available. It will not wait forever. In fact, it will have priority over a process which has recently been in the critical section, so this is a particularly strong form of fairness.

set my.flag
set coin to "other"
loop:
  if other.flag is clear, exit the loop and enter
  if coin is set to "mine", exit the loop and enter
endloop
CRITICAL SECTION
clear my.flag

The algorithm's data structures consist of a per-process flag which is either set or cleared, and a shared coin which is in the possession of one or the other process. In implementation, the flag is a per-process memory location, and the coin is a single shared memory location. Giving the coin corresponds to writing the other process's number into the coin location.
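A sketch of the algorithm in C11, assuming two threads numbered 0 and 1. The function names are illustrative. The sequentially consistent atomics are doing real work here: with ordinary variables, the compiler or hardware may reorder the flag and coin operations and break the proof below.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool flag[2];   /* per-process flag: set means "I want in" */
static atomic_int  coin;      /* the coin: holds the number of the
                                 process with priority to enter */

void peterson_enter(int me)
{
    int other = 1 - me;
    atomic_store(&flag[me], true);    /* set my.flag */
    atomic_store(&coin, other);       /* give the coin to the other */
    /* loop until the other's flag is clear or the coin is mine */
    while (atomic_load(&flag[other]) && atomic_load(&coin) == other)
        ;                             /* spin */
    /* CRITICAL SECTION follows */
}

void peterson_exit(int me)
{
    atomic_store(&flag[me], false);   /* clear my.flag */
}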

As a warm-up to understanding the algorithm, consider a simplified algorithm without the coin: each process sets its flag, then loops until the other's flag is clear. That algorithm would ensure mutual exclusion, but it would be prone to deadlock: if both processes set their flags at about the same time, each waits on the other forever. On the one hand, we have made progress. Of the two demands, we have achieved one. On the other hand, achieving mutual exclusion is easy - just let no process into the critical section ever.

Since we have done something easy, gaining mutual exclusion but with possible deadlock, let's see if we can't analyze the modified algorithm for correctness. I do my synchronization proofs by looking at time sequences. I reason about sequences of actions, writing down inequalities that summarize those sequences. From those inequalities I draw my conclusions. For instance, let the processes be represented by P and Q, the event of setting the flag by S, the event of giving the coin by C, and the event of deciding to enter the critical section by D. Then the statement that the algorithm first sets the flag and then decides whether to enter the critical section is written: P.S < P.D and Q.S < Q.D.

In the simplified algorithm, with no coin, if P is in the critical section, the decision was based on Q's flag being cleared. So P.D < Q.S, which implies P.S < Q.S. Likewise if Q is in the critical section, Q.S < P.S. Since both inequalities cannot be true, only one of P or Q can be in the critical section at a time.

We now reason about the full algorithm, by cases on the decision. The previous argument shows that if P and Q are both in the critical section, it cannot be that both made the decision based on the other's flag being cleared; that proof works here too. Suppose then that P and Q both saw the other's flag set, so both entries were based on the coin. If Q entered on the coin, then Q.C < P.C < Q.D, since P must have given the coin after Q gave it away and before Q decided; likewise, if P entered on the coin, P.C < Q.C < P.D. So we would have a problem ordering Q.C and P.C if they were both in the critical section together. Therefore, they are not.

Note that to be completely clear, the proof picks a point in time when P and Q are supposedly both in the critical section and looks back to the most recent event of the sort indicated. That is, P.S would be the most recent setting of the flag by P. This accounts for the possibility that P exited the critical section, reapplied for admission, and got in.

So we are left with the case where one process made the decision based on a clear flag and the other on the coin. Without loss of generality, let P be the process entering the critical section on a clear flag. Then:

  P.C < P.D < Q.S < Q.C < Q.D

Since Q entered by the coin, we would need Q.C < P.C, that is, P gave the coin after Q did. But the chain above places P's most recent coin event before Q.C, and since P is still in the critical section, P performed no coin event after Q.C. So the coin belonged to P, not Q, and Q could not have entered the critical section by the coin. The case is impossible, completing the proof.

Test and set

Peterson's algorithm makes minimal assumptions about the hardware. It even works in cases where the contents of a memory location are not well defined. For instance, most memories have caches, even multiple caches for multiple processors, so a variable might live in multiple locations at once. But it all seems to work.

With slightly stronger assumptions about memory we have a simple and practical synchronization scheme. We assume an instruction that updates memory and returns what was previously in the memory, in an atomic fashion: if multiple instances of the instruction run concurrently, there will be a time ordering, and the settings and return values will be consistent with that ordering. Such instructions go by various names, such as test-and-set or compare-exchange.

Without a hardware boost, there might be separate read and write actions. It would then be possible for two processes to both read the initial value X, after which the first process writes a Y and the second a Z. There is no good time ordering of the writes, since each appears to have happened before the other. With test-and-set, only one will read X; the other will read either Y or Z, and a time ordering is thus determined.

Implementing test-and-set is not easy because of memory caching. Usually writes go to the cache, not to main memory. So test-and-set needs to write through to main memory, and let the values settle, so that a time ordering is established among all the processors which share the main memory but perhaps do not share their caches. Furthermore, the memory itself might not support test-and-set, and reads and writes might be done exclusively in cache lines. Memory bus locking is often required to make the separate read and write operations into an atomic unit.

Given the existence of a working test-and-set instruction, we can then build a lock. A lock variable is set to 0. A process wanting to gain the lock test-and-sets a 1. If the returned value is 0, it has gained the lock. If it is 1, the lock has been gained, but not by this process. It might try again later, or try again immediately.
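In C11, atomic_flag_test_and_set is exactly such an instruction, so the lock just described can be sketched directly (the names spin_lock and spin_unlock are illustrative):

#include <stdatomic.h>

static atomic_flag lock_var = ATOMIC_FLAG_INIT;   /* clear means unlocked */

void spin_lock(void)
{
    /* atomically set the flag and get its previous value;
       a previous value of 0 (clear) means we gained the lock */
    while (atomic_flag_test_and_set(&lock_var))
        ;   /* it was already set: someone else holds the lock */
}

void spin_unlock(void)
{
    atomic_flag_clear(&lock_var);
}

The retry-immediately loop here is the spinning discussed in the next section.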

Dealing with contention

If two or more threads contend for a lock or critical section, the synchronization method must serialize their access. One thread will continue on to ownership of the lock, or entry to the critical section, and the other threads will wait until the winning thread relinquishes control. There are several ways to accomplish this waiting.

One possibility is for the contending threads to repeatedly test the lock variable, without yielding the CPU to another task. This tight busy looping is called a spin lock, and the process is said to be spinning. Spin locks are associated either with lazy programming or with deep kernel programming and truly parallel processing. The other solutions to waiting for a lock to become available require data structures. The constraint on these other solutions is that those data structures themselves might need locking, hence an infinite regress of requirements could ensue.

Spinning is the most primitive solution. If there is only a single CPU, spinning will not satisfy liveness: the CPU will be so busy spinning on the lock that the thread holding the lock will get no cycles to finish its critical section and release the lock.

An alternative strategy is to give up the CPU to another process, and to check later whether the lock is free. There are two variants of this: the thread can yield for a prespecified amount of time, or it can invoke a mechanism where the process is returned to the CPU when the lock becomes free. The first requires support from the kernel for timed sleeps. The second requires support from the kernel for wakeups associated with events on the lock.

If a lock is busy, the contending thread can:

  1. Spin: retest repeatedly in a tight loop while holding the CPU;
  2. Poll: retest the lock later, but yield the CPU now for a definite amount of time before retesting;
  3. Wait: sleep the process on a queue attached to the event of the lock becoming available, then wake the process to retest the lock.

Which is the best option depends on the situation. Code that is deep in the kernel cannot sleep and cannot be rescheduled; it must spin. Other than that, polling is rarely the better of the two remaining options; it often results from the unavailability of an appropriate sleep/wake mechanism. A sketch of the polling variant follows.
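A minimal sketch of polling in C, assuming the test-and-set lock from the previous section; try_lock and poll_lock are illustrative names, and the 1 ms nap interval is arbitrary:

#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

static atomic_flag lock_var = ATOMIC_FLAG_INIT;

static bool try_lock(void)
{
    /* one test-and-set: true means we gained the lock */
    return !atomic_flag_test_and_set(&lock_var);
}

void poll_lock(void)
{
    struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };  /* 1 ms */
    while (!try_lock())
        nanosleep(&nap, NULL);   /* yield the CPU for a definite time
                                    before retesting */
}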

It is possible to lock the data structure: associate a lock with the data structure, and the lock must "be taken" before any operation can occur on the data structure. It is also possible to have two levels of lock: a read lock and a write lock. The locks we have been considering are equivalent to write locks: absolute exclusivity.

Readers only have to make sure that they are not reading while some other thread is writing, else the result might be a mixture of old and new data as the work of the reader and writer interleave. Simultaneous multiple readers work fine. A read lock is taken just to advise write locks to wait. A queued write lock might prevent further read locks from being given, will wait until all existing read locks are released, and will then take the (exclusive) write lock for the writer to proceed.
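POSIX threads provides this two-level lock as pthread_rwlock_t. A brief usage sketch, with illustrative reader and writer functions over a shared value:

#include <pthread.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
static int shared_value;

int reader(void)
{
    pthread_rwlock_rdlock(&rw);   /* many readers may hold this at once */
    int v = shared_value;         /* guaranteed not to see a half-write */
    pthread_rwlock_unlock(&rw);
    return v;
}

void writer(int v)
{
    pthread_rwlock_wrlock(&rw);   /* waits until all read locks release */
    shared_value = v;             /* exclusive access */
    pthread_rwlock_unlock(&rw);
}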

Semaphores

Semaphores are packages which make synchronization more convenient. They come with certain nice models of synchronization, and are provided as kernel services.

A semaphore is a simple device that summarizes the requirements for synchronization. A semaphore is a counter variable c, a queue for threads waiting on the counter, and two operations: P and V. P decrements the counter; if the counter falls below zero, the calling thread is placed on the queue and sleeps. V increments the counter; if threads are sleeping on the queue, one is woken.

If the initial value of the counter is 1, the Semaphore is also known as a mutex, or a binary semaphore, and it enforces mutual exclusion.
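POSIX provides semaphores directly; sem_wait is P and sem_post is V. A minimal sketch of the binary-semaphore-as-mutex usage:

#include <semaphore.h>

static sem_t mutex;

void setup(void)
{
    sem_init(&mutex, 0, 1);   /* counter starts at 1: a binary semaphore */
}

void critical(void)
{
    sem_wait(&mutex);         /* P: decrement, or sleep if already taken */
    /* ... critical section ... */
    sem_post(&mutex);         /* V: increment, waking a sleeper if any */
}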

To implement a semaphore, a test and set lock variable can be used to protect a data structure containing the counter and the queue of threads. It is important that the lock protects the counter variable. If two threads attempt to decrement the counter when it is 1, only one will continue, the other will block, and the final value of the counter will be -1. Without a lock, perhaps both threads will see an initial counter value of 1 and both will decide they can continue, allowing both into the critical section.
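A toy implementation along these lines, hedged in two ways: the protecting lock is the test-and-set spin lock from above, and sched_yield stands in for a real queue of sleeping threads, which would need kernel support. It also uses the convention that the counter never goes negative; under the convention in the text, the counter would go to -1, its absolute value counting the queued threads.

#include <stdatomic.h>
#include <sched.h>

struct semaphore {
    atomic_flag lock;   /* test-and-set lock protecting the counter */
    int count;
};

void sem_P(struct semaphore *s)
{
    for (;;) {
        while (atomic_flag_test_and_set(&s->lock))
            ;                          /* take the protecting lock */
        if (s->count > 0) {
            s->count--;                /* safely decrement and proceed */
            atomic_flag_clear(&s->lock);
            return;
        }
        atomic_flag_clear(&s->lock);   /* counter exhausted: release */
        sched_yield();                 /* and wait; a real semaphore
                                          would queue and sleep here */
    }
}

void sem_V(struct semaphore *s)
{
    while (atomic_flag_test_and_set(&s->lock))
        ;
    s->count++;                        /* a real semaphore would also
                                          wake one queued thread */
    atomic_flag_clear(&s->lock);
}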

Deadlock

How not to deadlock: establish an order among locks. All locks. For instance, the memory address of the lock's data structure provides a ready-made order. Take locks in increasing order. If two processes are taking locks, and they take them in increasing order, at least one of the two processes can proceed.
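A sketch of address ordering with POSIX mutexes; lock_pair is an illustrative helper, and the casts to uintptr_t make the pointer comparison well defined:

#include <pthread.h>
#include <stdint.h>

/* Take two locks in increasing address order. Any two threads that use
   this helper acquire the pair in the same global order, so neither can
   hold one lock while waiting for the other "backwards". */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    if ((uintptr_t)a > (uintptr_t)b) {
        pthread_mutex_t *t = a;
        a = b;
        b = t;
    }
    pthread_mutex_lock(a);
    pthread_mutex_lock(b);
}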
