Hive: Fault Containment for Shared-Memory Multiprocessors
Fault containment is a key technique that will improve the reliability of large-scale shared-memory multiprocessors used as general-purpose compute servers. The challenge is to provide better reliability than current multiprocessor operating systems without reducing performance.
Hive implements fault containment by running an internal distributed system of independent kernels called cells. The basic memory isolation assumed by a distributed system is provided through a combination of write protection hardware (the firewall) and a software strategy that discards all data writable by a failed cell. The success of this approach demonstrates that shared memory is not incompatible with fault containment.
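The interplay between the firewall and the discard strategy can be illustrated with a small model. This is a minimal sketch, not Hive's implementation: it assumes each page carries a set of cells permitted to write it, and the names `Page`, `Firewall`, `grant_write`, and `discard_after_failure` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    owner: int                                  # cell that manages this page
    writers: set = field(default_factory=set)   # cells granted write access
    data: bytes = b""
    valid: bool = True


class Firewall:
    """Toy model of per-page write protection between cells (hypothetical)."""

    def __init__(self):
        self.pages = {}

    def add_page(self, page_id, owner):
        # Assumption: the owning cell can always write its own page.
        self.pages[page_id] = Page(owner=owner, writers={owner})

    def grant_write(self, page_id, cell):
        self.pages[page_id].writers.add(cell)

    def write(self, page_id, cell, data):
        page = self.pages[page_id]
        if cell not in page.writers:
            # Stands in for the hardware rejecting the write.
            raise PermissionError(f"cell {cell} may not write page {page_id}")
        page.data = data

    def discard_after_failure(self, failed_cell):
        """Invalidate every page the failed cell could have written."""
        discarded = []
        for page_id, page in self.pages.items():
            if failed_cell in page.writers:
                page.valid = False
                page.data = b""
                page.writers.discard(failed_cell)
                discarded.append(page_id)
        return sorted(discarded)


fw = Firewall()
fw.add_page(0, owner=0)
fw.add_page(1, owner=1)
fw.add_page(2, owner=1)
fw.grant_write(2, cell=0)        # cell 1 lets cell 0 write page 2
fw.write(1, cell=1, data=b"safe")
lost = fw.discard_after_failure(0)
```

In this model, a failure of cell 0 costs pages 0 and 2 but leaves page 1 intact, because cell 0 was never allowed to write it. The point of the sketch is that the set of pages to discard is bounded by what the write protection permitted, so surviving cells need not trust any data the failed cell could have corrupted.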
Hive strives for performance competitive with current multiprocessor operating systems through two main strategies. First, cells share memory freely, both at a logical level where a process on one cell accesses the data on another, and at a physical level where one cell can transfer control over a page frame to another. Second, load balancing and resource reallocation are designed to be driven by a user-level process, Wax, which uses shared memory to build a global view of system state and synchronize the actions of various cells. Performance measurements on the current prototype of Hive are encouraging, at least for the limited tests carried out so far.
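As an illustration of how a user-level process with a global view might drive resource reallocation, the sketch below plans page-frame transfers between cells from a snapshot of per-cell free-frame counts. The abstract does not describe Wax's actual policies, so everything here is an assumption: the function name `plan_frame_transfers`, the `low_water` threshold, and the richest-donor-first rule are all hypothetical.

```python
def plan_frame_transfers(free_frames, low_water):
    """Plan (donor, recipient, count) moves from a snapshot of free frames.

    free_frames maps a cell id to its number of free page frames.
    Cells below low_water receive frames from cells above it.
    """
    free = dict(free_frames)  # work on a copy; the snapshot stays unchanged
    moves = []
    # Serve the most starved cells first.
    needy = sorted((c for c in free if free[c] < low_water),
                   key=lambda c: free[c])
    for cell in needy:
        while free[cell] < low_water:
            donor = max(free, key=lambda c: free[c])
            surplus = free[donor] - low_water
            if surplus <= 0:
                break  # no cell can give frames without going short itself
            count = min(surplus, low_water - free[cell])
            free[donor] -= count
            free[cell] += count
            moves.append((donor, cell, count))
    return moves


snapshot = {0: 100, 1: 5, 2: 40}
moves = plan_frame_transfers(snapshot, low_water=20)
```

With this snapshot, the plan is a single move of 15 frames from cell 0 to cell 1. In a real system each planned move would correspond to one cell transferring control over page frames to another; the sketch only computes the plan and leaves that mechanism out.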
Finally, the multicellular architecture of Hive makes it inherently scalable to multiprocessors significantly larger than current systems. We believe this makes the architecture promising even for environments where its reliability benefits are not required.
This work was supported in part by ARPA contract DABT63-94-C-0054. John Chapin is supported by a Fannie and John Hertz Foundation fellowship. Mendel Rosenblum is partially supported by a National Science Foundation Young Investigator award. Anoop Gupta is partially supported by a National Science Foundation Presidential Young Investigator award.

