Hive: Fault Containment for Shared-Memory Multiprocessors
Operating systems run on SimOS as they would run on a real machine. The main changes needed to run IRIX and Hive on SimOS are confined to the lowest levels of the SCSI driver and of the Ethernet and console interfaces. Fewer than 100 lines of code outside the device drivers needed modification.
Running on SimOS exposes an operating system to all the concurrency and all the resource stresses it would experience on a real machine. Unmodified binaries taken from SGI machines execute normally on top of IRIX and Hive running under SimOS. We believe that this environment is a good way to develop an operating system that requires hardware features not available on current machines. It is also an excellent performance measurement and debugging environment [17].
Each processor has a 32 KB two-way-associative primary instruction cache with 64-byte lines, a 32 KB two-way-associative primary data cache with 32-byte lines, and a 1 MB two-way-associative unified secondary cache with 128-byte lines. The simulator executes one instruction per cycle when the processor is not stalled on a cache miss.
A first-level cache miss that hits in the second-level cache stalls the processor for 50 ns. The second-level cache miss latency is fixed at the FLASH average miss latency of 700 ns. An interprocessor interrupt (IPI) is delivered 700 ns after it is requested. A SIPS message costs the IPI latency plus a further 300 ns when the receiving processor accesses the data.
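The latency parameters above can be collected into a small model. The following is a minimal sketch, not the SimOS implementation: the constant and function names are hypothetical, and it assumes that the 700 ns figure is the total stall for a second-level miss and that a first-level hit causes no stall.

```python
# Latency parameters quoted in the text, in nanoseconds.
L2_HIT_STALL_NS = 50      # first-level miss that hits in the second-level cache
L2_MISS_STALL_NS = 700    # second-level miss (FLASH average miss latency)
IPI_LATENCY_NS = 700      # delay between IPI request and delivery
SIPS_ACCESS_NS = 300      # extra cost when the receiver accesses SIPS data


def memory_stall_ns(l1_hit: bool, l2_hit: bool) -> int:
    """Stall time for one memory reference under the assumptions above."""
    if l1_hit:
        return 0  # assumption: no stall when the primary cache hits
    return L2_HIT_STALL_NS if l2_hit else L2_MISS_STALL_NS


def sips_latency_ns() -> int:
    """SIPS message latency: the IPI latency plus the data access cost."""
    return IPI_LATENCY_NS + SIPS_ACCESS_NS
```

Under these assumptions a SIPS message costs 700 + 300 = 1000 ns from request until the receiving processor has accessed the data.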
Disk latency is computed for each access using an experimentally-validated model of an HP 97560 disk drive [9]. SimOS models both DMA latency and the memory controller occupancy required to transfer data from the disk controller to main memory.
There are two inaccuracies in the machine model that affect our performance numbers. We model the cost of a firewall status change as the cost of the uncached writes required to communicate with the coherence controller. In FLASH, additional latency will be required when revoking write permission to ensure that all pending valid writebacks have completed. The cost of this operation depends on network design details that have not yet been finalized. Also, the machine model provides an oracle that indicates unambiguously to each cell the set of cells that have failed after a fault. This performs the function of the distributed agreement protocol described in Section 4.3, which has not yet been implemented.
We measured the time to completion of the workloads for Hive configurations of one, two, and four cells. For comparison purposes, we also measured the time under IRIX 5.2 on the same four-processor machine model. Table 7.2 gives the performance of the workloads on the various system configurations.
As we expected, the overall impact of Hive's multicellular architecture is negligible for the parallel scientific applications. After a relatively short initialization phase that uses the file system services, most of the execution time is spent in user mode.
Even for a parallel make, which stresses operating system services heavily, Hive is within 11% of IRIX performance when configured for maximum fault containment with one cell per processor. The overhead is spread over many different kernel operations. We would expect the overhead to be higher on operations which are highly optimized in IRIX. To illustrate the range of overheads, we ran a set of microbenchmarks on representative kernel operations and compared the latency when crossing cell boundaries with the latency in the local case.
Table 7.3 gives the results of these microbenchmarks. The overhead is quite small on complex operations such as large file reads and writes. It rises to as much as a factor of 7.4 over the local case for simple operations such as a page fault that hits in the page cache. These overheads could be significant for some workloads, but the overall performance of pmake shows that they are mostly masked by other effects (such as disk access costs) that are common to both SMP and multicellular operating systems.
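The two kinds of overhead quoted in this section are computed differently: the workload results are a relative slowdown against the IRIX baseline, while the microbenchmark results are a ratio of remote to local latency. The sketch below makes the distinction explicit. The function names are hypothetical, and any inputs passed to them are placeholders rather than values from Tables 7.2 or 7.3.

```python
def slowdown_percent(hive_time: float, irix_time: float) -> float:
    """Relative slowdown of a Hive run against the IRIX baseline, in percent."""
    if irix_time <= 0:
        raise ValueError("baseline time must be positive")
    return (hive_time - irix_time) / irix_time * 100.0


def overhead_ratio(remote_latency: float, local_latency: float) -> float:
    """Latency of an operation that crosses a cell boundary, as a multiple
    of the latency of the same operation in the local case."""
    if local_latency <= 0:
        raise ValueError("local latency must be positive")
    return remote_latency / local_latency
```

A workload that is "within 11%" of IRIX has `slowdown_percent` at or below 11, whereas an operation with an overhead of "7.4 times" has `overhead_ratio` equal to 7.4.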
For fault injection tests in Hive, we selected a few situations that stress the intercell resource sharing mechanisms. These are the parts of the architecture where the cells cooperate most closely, so they are the places where it seems most likely that a fault in one cell could corrupt another. We also injected faults into other kernel data structures and at random times to stress the wild write defense mechanism.
When a fault occurs, the important parts of the system's response are the latency until the fault is detected, whether the damage is successfully confined to the cell in which the fault occurred, and how long it takes to recover and return to normal operation. The latency until detection is an important part of the wild write defense, while the time required for recovery is relatively unimportant because faults are assumed to be rare. We measured these quantities using both the pmake and raytrace workloads, because multiprogrammed workloads and parallel applications stress different parts of the wild write defense.
We used a four-processor four-cell Hive configuration for all the tests. After injecting a fault into one cell we measured the latency until recovery had begun on all cells, and observed whether the other cells survived. After the fault injection and completion of the main workload, we ran the pmake workload as a system correctness check. Since pmake forks processes on all cells, its success is taken as an indication that the surviving cells were not damaged by the effects of the injected fault. To check for data corruption, all files output by the workload run and the correctness check run were compared to reference copies.
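The final data-corruption check, comparing every output file against a reference copy, can be sketched as follows. This is an illustration rather than the harness used for the experiments: the function name and the directory layout (one output tree and one reference tree with matching relative paths) are assumptions.

```python
import filecmp
from pathlib import Path


def find_corrupted_outputs(output_dir, reference_dir):
    """Return the relative paths of reference files whose counterpart in
    output_dir is missing or differs byte for byte."""
    output_dir = Path(output_dir)
    reference_dir = Path(reference_dir)
    corrupted = []
    for ref in sorted(reference_dir.rglob("*")):
        if not ref.is_file():
            continue
        rel = ref.relative_to(reference_dir)
        out = output_dir / rel
        # shallow=False forces a comparison of file contents instead of
        # trusting size and modification time alone.
        if not out.is_file() or not filecmp.cmp(out, ref, shallow=False):
            corrupted.append(rel.as_posix())
    return corrupted
```

An empty result corresponds to the outcome the tests look for: no output file differs from its reference copy.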
Table 7.4 summarizes the results of the fault injection tests. In all tests, the effects of the fault were contained to the cell in which it was injected, and no output files were corrupted.
Development of the fault containment mechanisms has been substantially simplified through the use of SimOS rather than real hardware. SimOS can deterministically recreate execution from a checkpoint of the machine state, which makes it straightforward to analyze the complex series of events that follow a software fault. We expect to continue using SimOS for this type of development even after the FLASH hardware is available.

