Hive: Fault Containment for Shared-Memory Multiprocessors

7 Experimental Results

In this section we report the results of experiments on the Hive prototype. First we describe SimOS and the machine model used for our experiments in more detail. Next we present the results of performance experiments, fail-stop hardware fault experiments, and software fault experiments.

7.1 SimOS environment

SimOS [18] is a complete machine simulator detailed enough to provide an accurate model of the FLASH hardware. It can also run in a less accurate mode where it is fast enough (on an SGI Challenge) to boot the operating system quickly and execute interactive applications in real time. The ability to dynamically switch between these modes allows both detailed performance studies and extensive testing.

Operating systems run on SimOS as they would run on a real machine. The primary changes required to enable IRIX and Hive to run on SimOS are confined to the lowest levels of the SCSI driver, the ethernet interface, and the console interface. Fewer than 100 lines of code outside the device drivers needed modification.

Running on SimOS exposes an operating system to all the concurrency and all the resource stresses it would experience on a real machine. Unmodified binaries taken from SGI machines execute normally on top of IRIX and Hive running under SimOS. We believe that this environment is a good way to develop an operating system that requires hardware features not available on current machines. It is also an excellent performance measurement and debugging environment [17].

7.2 Simulated machine

We simulate a machine similar in performance to an SGI Challenge multiprocessor, with four 200-MHz MIPS R4000-class processors, 128 MB of memory, four disk controllers each with one attached disk, four ethernet interfaces, and four consoles. The machine is divided into four nodes, each with one processor, 32 MB of memory, and one of each of the I/O devices. This allows Hive to be booted with one, two, or four cells.
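As a rough illustration of how the four nodes might be grouped for each supported cell count, the following sketch divides the nodes evenly among cells. The function name `partition_nodes` and the assumption that each cell receives a contiguous, equal-sized group of nodes are ours for illustration; the text only states the node count, the per-node memory, and the supported cell counts.

```python
# Hypothetical sketch: group the simulated machine's nodes into cells.
# Node count and per-node memory come from the text; the contiguous,
# equal-sized grouping is an assumption made for illustration.
NODES = 4
MEM_PER_NODE_MB = 32


def partition_nodes(num_cells):
    """Return a list of cells, each a list of node indices."""
    if num_cells not in (1, 2, 4):
        raise ValueError("the simulated machine boots with 1, 2, or 4 cells")
    per_cell = NODES // num_cells
    return [list(range(c * per_cell, (c + 1) * per_cell))
            for c in range(num_cells)]


def cell_memory_mb(cell):
    """Memory owned by one cell, assuming it owns all memory on its nodes."""
    return len(cell) * MEM_PER_NODE_MB


if __name__ == "__main__":
    for n in (1, 2, 4):
        cells = partition_nodes(n)
        print(n, "cell(s):", cells, [cell_memory_mb(c) for c in cells], "MB")
```

Under these assumptions a two-cell boot gives each cell two processors and 64 MB, while a four-cell boot gives each cell one processor and 32 MB.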

Each processor has a 32 KB two-way-associative primary instruction cache with 64-byte lines, a 32 KB two-way-associative primary data cache with 32-byte lines, and a 1 MB two-way-associative unified secondary cache with 128-byte lines. The simulator executes one instruction per cycle when the processor is not stalled on a cache miss.

A first-level cache miss that hits in the second-level cache stalls the processor for 50 ns. The second-level cache miss latency is fixed at the FLASH average miss latency of 700 ns. An interprocessor interrupt (IPI) is delivered 700 ns after it is requested, while a SIPS message requires an IPI latency plus 300 ns when the receiving processor accesses the data.
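To put these latencies in terms of processor time, the short sketch below converts them to cycles at the 200-MHz clock rate given above. The constant and function names are ours; the numeric values are taken directly from the text.

```python
# Convert the stated latencies to cycles of the 200-MHz simulated processor.
# All names here are illustrative; the values come from the text.
CLOCK_MHZ = 200
CYCLE_NS = 1000 // CLOCK_MHZ   # 5 ns per cycle

L2_HIT_STALL_NS = 50           # first-level miss that hits in the L2 cache
L2_MISS_NS = 700               # fixed second-level cache miss latency
IPI_NS = 700                   # interprocessor interrupt delivery
SIPS_EXTRA_NS = 300            # added when the receiver accesses the data


def ns_to_cycles(ns):
    """Number of processor cycles spanned by a latency in nanoseconds."""
    return ns // CYCLE_NS


def sips_latency_ns():
    """A SIPS message costs one IPI latency plus the data access time."""
    return IPI_NS + SIPS_EXTRA_NS


if __name__ == "__main__":
    print("L2 hit stall:", ns_to_cycles(L2_HIT_STALL_NS), "cycles")
    print("L2 miss:     ", ns_to_cycles(L2_MISS_NS), "cycles")
    print("IPI:         ", ns_to_cycles(IPI_NS), "cycles")
    print("SIPS:        ", ns_to_cycles(sips_latency_ns()), "cycles")
```

At one instruction per cycle, a second-level miss therefore costs the equivalent of 140 instructions, and a SIPS message about 200.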

Disk latency is computed for each access using an experimentally validated model of an HP 97560 disk drive [9]. SimOS models both DMA latency and the memory controller occupancy required to transfer data from the disk controller to main memory.

There are two inaccuracies in the machine model that affect our performance numbers. We model the cost of a firewall status change as the cost of the uncached writes required to communicate with the coherence controller. In FLASH, additional latency will be required when revoking write permission to ensure that all pending valid writebacks have completed. The cost of this operation depends on network design details that have not yet been finalized. Also, the machine model provides an oracle that indicates unambiguously to each cell the set of cells that have failed after a fault. This performs the function of the distributed agreement protocol described in Section 4.3, which has not yet been implemented.

7.3 Performance tests

For performance measurements, we selected the workloads shown in Table 7.1. These workloads are characteristic of the two ways we expect Hive to be used. Raytrace and ocean (taken from the Splash-2 suite [22]) are parallel scientific applications that use the system in ways characteristic of supercomputer environments. Pmake (parallel make) is characteristic of use as a multi-programmed compute server. In all cases the file cache was warmed up before running the workloads.

We measured the time to completion of the workloads for Hive configurations of one, two, and four cells. For comparison purposes, we also measured the time under IRIX 5.2 on the same four-processor machine model. Table 7.2 gives the performance of the workloads on the various system configurations.

As we expected, the overall impact of Hive's multicellular architecture is negligible for the parallel scientific applications. After a relatively short initialization phase which uses the file system services, most of the execution time is spent in user mode.

Even for a parallel make, which stresses operating system services heavily, Hive is within 11% of IRIX performance when configured for maximum fault containment with one cell per processor. The overhead is spread over many different kernel operations. We would expect the overhead to be higher on operations which are highly optimized in IRIX. To illustrate the range of overheads, we ran a set of microbenchmarks on representative kernel operations and compared the latency when crossing cell boundaries with the latency in the local case.

Table 7.3 gives the results of these microbenchmarks. The overhead is quite small on complex operations such as large file reads and writes. It ranges up to a factor of 7.4 for simple operations such as a page fault that hits in the page cache. These overheads could be significant for some workloads, but the overall performance of pmake shows that they are mostly masked by other effects (such as disk access costs) that are common to both SMP and multicellular operating systems.

7.4 Fault injection tests

It is difficult to predict the reliability of a complex system before it has been used extensively, and probably impossible to demonstrate reliability through fault injection tests. Still, fault injection tests can provide an initial indication that reliability mechanisms are functioning correctly.

For fault injection tests in Hive, we selected a few situations that stress the intercell resource sharing mechanisms. These are the parts of the architecture where the cells cooperate most closely, so they are the places where it seems most likely that a fault in one cell could corrupt another. We also injected faults into other kernel data structures and at random times to stress the wild write defense mechanism.

When a fault occurs, the important parts of the system's response are the latency until the fault is detected, whether the damage is successfully confined to the cell in which the fault occurred, and how long it takes to recover and return to normal operation. The latency until detection is an important part of the wild write defense, while the time required for recovery is relatively unimportant because faults are assumed to be rare. We measured these quantities using both the pmake and raytrace workloads, because multiprogrammed workloads and parallel applications stress different parts of the wild write defense.

We used a four-processor, four-cell Hive configuration for all the tests. After injecting a fault into one cell, we measured the latency until recovery had begun on all cells and observed whether the other cells survived. After the fault injection and completion of the main workload, we ran the pmake workload as a system correctness check. Since pmake forks processes on all cells, its success is taken as an indication that the surviving cells were not damaged by the effects of the injected fault. To check for data corruption, all files output by the workload run and the correctness check run were compared to reference copies.
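The final corruption check can be sketched as a byte-for-byte comparison of each output file against its reference copy. This is a minimal illustration, not the tooling used in the experiments: the function name `find_corrupted` and the directory layout (one output tree mirroring one reference tree) are assumptions.

```python
# Hypothetical sketch of the data-corruption check: compare every file in a
# reference tree with the file at the same relative path in an output tree.
import filecmp
from pathlib import Path


def find_corrupted(output_dir, reference_dir):
    """Return relative paths whose output is missing or differs in content."""
    output_dir, reference_dir = Path(output_dir), Path(reference_dir)
    bad = []
    for ref in sorted(reference_dir.rglob("*")):
        if not ref.is_file():
            continue
        rel = ref.relative_to(reference_dir)
        out = output_dir / rel
        # shallow=False forces a comparison of file contents, not just
        # size and modification time.
        if not out.is_file() or not filecmp.cmp(ref, out, shallow=False):
            bad.append(str(rel))
    return bad
```

An empty result corresponds to the outcome the tests look for: every output file matches its reference copy.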

Table 7.4 summarizes the results of the fault injection tests. In all tests, the effects of the fault were contained to the cell in which it was injected, and no output files were corrupted.

We also measured the latency of recovery, which varied between 40 and 80 milliseconds. Because these experiments used the failure oracle in place of a distributed agreement protocol, the latency in practice could be substantially higher. We intend to characterize the costs of recovery more accurately in future studies.

Development of the fault containment mechanisms has been substantially simplified through the use of SimOS rather than real hardware. The ability to deterministically recreate execution from a checkpoint of the machine state, provided by SimOS, makes it straightforward to analyze the complex series of events that follow a software fault. We expect to continue using SimOS for this type of development even after the FLASH hardware is available.

Last modified 09/20/95 by Dan Teodosiu.