Hive: Fault Containment for Shared-Memory Multiprocessors
Fault containment in shared-memory multiprocessor operating systems appears to be a new problem. We know of no other operating system that tries to contain the effects of wild writes without giving up standard multiprocessor resource sharing. Sullivan and Stonebraker considered the problem in the context of database implementations [19], but their strategies target a transactional environment and thus are not directly applicable to a standard commercial operating system.
Reliability is one of the goals of microkernel research. A microkernel could support a distributed system like Hive and prevent wild writes, as discussed in Section 2. However, existing microkernels such as Mach [15] are large and complex enough that it is difficult to trust their correctness. New microkernels such as the Exokernel [6] and the Cache Kernel [4] may be small enough to provide reliability.
An alternative reliability strategy would be to use traditional fault-tolerant operating system implementation techniques. Previous systems such as Tandem Guardian [2] provide a much stronger reliability guarantee than fault containment. However, full fault tolerance requires replication of computation, so it uses the available hardware resources inefficiently. While this is appropriate when supporting applications that cannot tolerate partial failures, it is not acceptable for performance-oriented and cost-sensitive multiprocessor environments.
Another way to look at Hive is as a distributed system where memory and other resources are freely shared between the kernels. This approach to achieving scalability in a multiprocessor operating system has been previously explored by the Hurricane project [21]. Although Hurricane is a microkernel that does not implement full SMP OS functionality or fault containment, and does not use shared memory between the separate kernels, its implementation strategies are close to those developed independently in Hive.
The NOW project at U.C. Berkeley is studying how to couple a cluster of workstations more tightly for better resource sharing [1]. The hardware they assume for a NOW environment does not provide shared memory, so they do not face the challenge of wild writes or the opportunity of directly accessing remote memory. However, much of their work is directly applicable to improving the resource management policies of a system like Hive.
Hive's internal distributed system must synthesize a single-system image from multiple kernels. The single-system image problem has been studied in depth by other researchers, in systems such as Sprite [13], Locus [14], and OSF/1 AD TNC [23]. Hive reuses some of the techniques developed in Sprite and Locus.

