Hive: Fault Containment for Shared-Memory Multiprocessors
Hive is structured as a set of cells (Figure 3.1). When the system boots, each cell is assigned a range of nodes that it owns throughout execution. Each cell manages the processors, memory, and I/O devices on those nodes as if it were an independent operating system. The cells cooperate to present the required single-system image to user-level processes.
FIGURE 3.1. Partition of a multiprocessor into Hive cells.On top of this structure, the architectural features of Hive fall into two broad categories: those related to implementing fault containment, and those related to providing resource sharing despite the fault containment boundaries between cells. After describing both parts of the architecture, we will briefly summarize the implementation status of the Hive prototype.
Message exchange: Most communication between cells is done through remote procedure calls (RPCs). Each cell sanity-checks all information received from other cells and sets timeouts whenever waiting for a reply. Experience with previous distributed systems shows that this approach provides excellent fault containment, even though it does not defend against all possible faults.
Remote reads: Cells also read each other's internal data structures directly, which can be substantially faster than exchanging RPCs. It is the reading cell's responsibility to defend itself against deadlocking or crashing despite such problems as invalid pointers, linked data structures that contain infinite loops, or data values that change in the middle of an operation. This is implemented with a simple careful reference protocol that includes checks for the various possible error conditions. Once the data has been safely read, it is sanity-checked just as message data is checked.
Remote writes: Cells never write to each other's internal data structures directly, as this would make fault containment impractical. This is enforced by using the FLASH firewall to protect kernel code and data against remote writes. However, cells frequently write to each other's user-level pages since pages can be shared by processes running on different cells. This creates two issues that must be addressed:
Unfortunately, the preemptive discard policy can not prevent all user-visible data integrity violations caused by wild writes. Corrupt data might be used before the cell failure is detected. Alternatively, a faulty cell might corrupt a page, then give up its write permission before the failure is detected, so the page will not be discarded.
This problem appears to be fundamental to a multicellular kernel architecture. The only way to prevent all data integrity violations (without excessive hardware overhead to log updates) is to avoid write-sharing user pages across cell boundaries. Giving up write-shared pages would give up one of the main performance advantages of a shared-memory multiprocessor.
It is unclear at present whether the probability of data integrity violations will be higher in a multicellular system than in a current SMP OS implementation. We intend to evaluate this in future studies. One way to reduce the probability is to shorten the time window within which corrupt data might be used, by detecting failures quickly.
Failure detection is a well-studied problem in the context of distributed systems. For Hive, there are two main issues. Although a halted cell is easily recognizable, a cell that is alive but acting erratically can be difficult to distinguish from one that is functioning correctly. Additionally, if one cell could declare that another had failed and cause it to be rebooted, a faulty cell which mistakenly concluded that other cells were corrupt could destroy a large fraction of the system.
Hive uses a two-part solution. First, cells monitor each other during normal operation with a number of heuristic checks. A failed check provides a hint that triggers recovery immediately. Second, consensus among the surviving cells is required to reboot a failed cell. When a hint alert is broadcast, all cells temporarily suspend processes running at user level and run a distributed agreement algorithm. If the surviving cells agree that a cell has failed, user processes remain suspended until the system has been restored to a consistent state and all potentially corrupt pages have been discarded.
This approach ensures that the window of vulnerability to wild writes lasts only until the first check fails and the agreement process runs (assuming the failure is correctly confirmed by the agreement algorithm). The window of vulnerability can be reduced by increasing the frequency of checks during normal operation. This is another tradeoff between fault containment and performance.
This approach is feasible because in Hive, unlike in previous distributed systems, cells are not responsible for deciding how to divide their resources between local and remote requests. Making that tradeoff correctly requires a global view of the system state, which is available only to Wax. Each cell is responsible only for maintaining its internal correctness (for example, by preserving enough local free memory to avoid deadlock) and for optimizing performance within the resources it has been allocated.
Resource sharing mechanisms: The resources that need to be shared particularly efficiently across cell boundaries are memory and processors.
Memory sharing occurs at two levels (Figure 3.2). In logical-level sharing, a cell that needs to use a data page from a file can access that page no matter where it is stored in the system. Logical-level sharing supports a globally-shared file buffer cache in addition to allowing processes on different cells to share memory. In physical-level sharing, a cell that has a free page frame can transfer control over that frame to another cell. Physical-level sharing balances memory pressure across the machine and allows data pages to be placed where required for fast access on a CC-NUMA machine.
To share processors efficiently, Hive extends the UNIX process abstraction to span cell boundaries. A single parallel process can run threads on multiple cells at the same time. Such processes are called spanning tasks. Each cell runs a separate local process containing the threads that are local to that cell. Shared process state such as the address space map is kept consistent among the component processes of the spanning task. This mechanism also supports migration of sequential processes among cells for load balancing.
Resource sharing policy: Intercell resource allocation decisions are centralized in Wax, a multithreaded user-level process (Figure 3.3). Table 3.4 lists some of the allocation decisions made by Wax.
FIGURE 3.2. Types of memory sharing.
Wax addresses a problem faced by previous distributed systems, which were limited to two unattractive resource management strategies. Resource management can be distributed, in which case each kernel has to make decisions based on an incomplete view of the global state. Alternatively, it can be centralized, in which case the kernel running the policy module can become a performance bottleneck, and the policy module has difficulty responding to rapid changes in the system.
Wax takes advantage of shared memory and the support for spanning tasks to provide efficient resource management. Wax has a complete, up-to-date view of the system state but is not limited to running on a single cell. The threads of Wax running on different cells can synchronize with each other using standard locks and nonblocking data structures, enabling efficient resource management decisions.
Despite its special privileges, Wax is not a special kind of process. It uses resources from all cells, so its pages are discarded and it exits whenever any cell fails. The recovery process starts a new incarnation of Wax which forks to all cells and rebuilds its picture of the system state from scratch. This avoids the considerable complexity of trying to recover consistency of Wax's internal data structures after they are damaged by a cell failure.
Wax does not weaken the fault containment boundaries between cells. Each cell protects itself by sanity-checking the inputs it receives from Wax. Also, operations required for system correctness are handled directly through RPCs rather than delegated to Wax. Thus if Wax is damaged by a faulty cell it can hurt system performance but not correctness.
FIGURE 3.3. Intercell optimization using a user-level process.
The single-system image is only partially complete at present. It provides forks across cell boundaries, distributed process groups and signal delivery, and a shared file system name space. Spanning tasks, Wax, the distributed agreement protocol, and a fault-tolerant file system with single-system semantics remain to be implemented.
The current prototype is sufficient to demonstrate that fault containment is possible in a shared-memory multiprocessor, and that memory sharing can function efficiently without weakening fault containment. Performance results from the current prototype are promising, but further work is required to determine whether a fully-implemented system will perform as well as previous UNIX kernels.
The performance measurements reported in the following sections were obtained using SimOS [18]. We model a machine similar in performance to an SGI Challenge multiprocessor with four 200-MHz MIPS R4000 processors and a 700 nanosecond main memory access latency. We use two types of workloads, characteristic of the two environments we expect to be most common for Hive. For compute-server usage, we use pmake (parallel compilation). To model use by large parallel applications, we use ocean (scientific simulation) and raytrace (graphics rendering). Section 7 describes SimOS, the machine model, and the workloads in detail.

