Hive: Fault Containment for Shared-Memory Multiprocessors

2. Fault Containment in SharedMemory Multiprocessors

Fault containment is a general reliability strategy that has been implemented in many distributed systems. It differs from fault tolerance in that partial failures are allowed, which enables the system to avoid the cost of replicating processes and data.

Fault containment is an attractive reliability strategy for multiprocessors used as general-purpose compute servers. The workloads characteristic of this environment frequently contain multiple independent processes, so some processes can continue doing useful work even if others are terminated by a partial system failure.

However, fault containment in a multiprocessor will only have reliability benefits if the operating system manages resources well. Few applications will survive a partial system failure if the operating system allocates resources randomly from all over the machine. Since application reliability is the primary goal, we redefine fault containment to include this resource management requirement:

A system provides fault containment if the probability that an application fails is proportional to the amount of resources used by that application, not to the total amount of resources in the system.

One important consequence of choosing this as the reliability goal is that large applications which use resources from the whole system receive no reliability benefits. For example, some compute server workloads contain parallel applications that run with as many threads as there are processors in the system. However, these large applications have previously used checkpointing to provide their own reliability, so we assume they can continue to do so.

The fault containment strategy can be used in both distributed systems and multiprocessors. However, the problems that arise in implementing fault containment are different in the two environments. In addition to all the problems that arise in distributed systems, the shared-memory hardware of multiprocessors increases vulnerability to both hardware faults and software faults. We describe the problems caused by each of these in turn.

Hardware faults: Consider the architecture of the Stanford FLASH, which is a typical large-scale shared-memory multiprocessor (Figure 2.1). FLASH consists of multiple nodes, each with a processor and its caches, a local portion of main memory, and local I/O devices. The nodes communicate through a high-speed low-latency mesh network. Cache coherence is provided by a coherence controller on each node. A machine like this is called a CC-NUMA multiprocessor (cache-coherent with non-uniform memory access time) since accesses to local memory are faster than accesses to the memory of other nodes.

In a CC-NUMA machine, an important unit of failure is the node. A node failure halts a processor and has two direct effects on the memory of the machine: the portion of main memory assigned to that node becomes inaccessible, and any memory line whose only copy was cached on that node is lost. There may also be indirect effects that cause loss of other data.

For the operating system to survive and recover from hardware faults, the hardware must make several guarantees about the behavior of shared memory after a fault. Accesses to unaffected memory ranges must continue to be satisfied with normal cache coherence. Processors that try to access failed memory or retrieve a cache line from a failed node must not be stalled indefinitely. Also, the set of memory lines that could be affected by a fault on a given node must be limited somehow, since designing recovery algorithms requires knowing what data can be trusted to be correct.

These hardware properties collectively make up a memory fault model, analogous to the memory consistency model of a multiprocessor which specifies the behavior of reads and writes. The FLASH memory fault model was developed to match the needs of Hive: it provides the above properties, guarantees that the network remains fully connected with high probability (i.e. the operating system need not work around network partitions), and specifies that only the nodes that have been authorized to write a given memory line (via the firewall) could damage that line due to a hardware fault.

FIGURE 2.1. FLASH architecture.


Software faults: The presence of shared memory makes each cell vulnerable to wild writes resulting from software faults in other cells. Wild writes are not a negligible problem. Studies have shown that software faults are more common than hardware faults in current systems [7]. When a software fault occurs, a wild write can easily follow. One study found that among 3000 severe bugs reported in IBM operating systems over a five-year period, between 15 and 25 percent caused wild writes [20].

Unfortunately, existing shared-memory multiprocessors do not provide a mechanism to prevent wild writes. The only mechanism that can halt a write request is the virtual address translation hardware present in each processor, which is under the control of the very software whose faults must be protected against.

Therefore an operating system designed to prevent wild writes must either use special-purpose hardware, or rely on a trusted software base that takes control of the existing virtual address translation hardware. For systems which use hardware support, the most natural place to put it is in the coherence controller, which can check permissions attached to each memory block before modifying memory. Systems following a software-only approach could use a microkernel as the trusted base, or could use the lower levels of the operating system's own virtual memory system, on top of which most of the kernel would run in a virtual address space.

The hardware and software-only approaches provide significantly different levels of reliability, at least for an operating system that is partitioned into cells. In the hardware approach, each cell's wild write defense depends only on the hardware and software of that cell. In the software-only approach, each cell's wild write defense depends on the hardware and trusted software layer of all other cells. By reducing the number and complexity of the components that must function correctly to defend each cell against wild writes, the hardware approach provides higher reliability than the software-only approach.

We chose to add firewall hardware, a write permission bit-vector associated with each page of memory, to the FLASH coherence controller. We found that the firewall added little to the cost of FLASH beyond the storage required for the bit vectors. Other large multiprocessors are likely to be similar in this respect, because the hardware required for access permission checking is close to that required for directory-based cache-coherence. The firewall and its performance impact are described in Section 4.2.

Last modified 08/31/95 by Dan Teodosiu.