Hive: Fault Containment for Shared-Memory Multiprocessors
8. Discussion

The current Hive prototype demonstrates that it is possible to provide significantly better reliability for shared-memory multiprocessors than is achieved by SMP OS implementations. However, there are several issues that must be addressed before we can suggest that production operating systems be constructed using the techniques described in this paper:
Hardware support: Various aspects of the Hive design depend on hardware features that are not standard in current multiprocessors. Table 8.1 summarizes the special-purpose support that we added to FLASH, including a few features not discussed earlier in the paper. Of these features, the firewall requires the most hardware resources (for bit vector storage). The memory fault model requires attention while designing the cache-coherence protocol, but need not have a high hardware cost as long as it does not try to protect against all possible faults.

The hardware features used by Hive appear to allow a range of implementations that trade off among performance, cost, and fault containment. This suggests that a system manufacturer interested in improved reliability could choose an appropriate level of hardware support. We do not see this issue as a barrier to production use of a system like Hive.
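To make the bit-vector storage cost concrete, the following is a minimal sketch of a per-page write-permission firewall. It assumes one bit vector per page of memory with one bit per processor. The class name `Firewall`, the 4 KB page size, and the 64-bit vector width are illustrative assumptions, not details stated in this section, and the sketch models only the bookkeeping, not the hardware check path.

```python
PAGE_SIZE = 4096     # bytes per page (assumed for illustration)
VECTOR_BITS = 64     # one bit per processor that may write the page (assumed)


class Firewall:
    """Hypothetical per-page write-permission table: one bit vector per page."""

    def __init__(self, num_pages):
        # All bits clear: no processor may write any page initially.
        self.vectors = [0] * num_pages

    def grant(self, page, cpu):
        """Allow processor `cpu` to write `page`."""
        self._check_cpu(cpu)
        self.vectors[page] |= 1 << cpu

    def revoke(self, page, cpu):
        """Remove write permission for processor `cpu` on `page`."""
        self._check_cpu(cpu)
        self.vectors[page] &= ~(1 << cpu)

    def may_write(self, page, cpu):
        """Return True if processor `cpu` currently has write access to `page`."""
        self._check_cpu(cpu)
        return bool((self.vectors[page] >> cpu) & 1)

    @staticmethod
    def _check_cpu(cpu):
        if not 0 <= cpu < VECTOR_BITS:
            raise ValueError("cpu index outside the bit vector")


def storage_overhead(page_size=PAGE_SIZE, vector_bits=VECTOR_BITS):
    """Fraction of memory consumed by the bit vectors themselves."""
    return (vector_bits / 8) / page_size
```

Under these assumed parameters the vectors occupy 8 bytes per 4096-byte page, or roughly 0.2% of memory. A wider vector or a smaller page size raises that fraction proportionally, which is one axis along which an implementation could trade cost against containment granularity.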
Architectural tradeoffs: Significant further work on the Hive prototype is required to explore the costs of a multicellular architecture.
- Wax: There are two open questions to be investigated once Wax is implemented. We must determine whether an optimization module that is "out of the loop" like Wax can respond rapidly to changes in the system state, without running continuously and thereby wasting processor resources. We also need to investigate whether a two-level optimization architecture (intracell and intercell decisions made independently) can compete with the resource management efficiency of a modern UNIX implementation.
- Resource sharing: Policies such as page migration and intercell memory sharing must work effectively under a wide range of workloads for a multicellular operating system to be a viable replacement for a current SMP OS. Spanning tasks and process migration must be implemented. The resource sharing policies must be systematically extended to consider the fault containment implications of sharing decisions. Some statistical measure is needed to predict the probability of data integrity violations in production operation.
- File system: A multicellular architecture requires a fault-tolerant high performance file system that preserves single-system semantics. This will require mechanisms that support file replication and striping across cells, as well as an efficient implementation of a globally coherent and location independent file name space.
Other advantages of the architecture: Beyond the reliability and scalability issues that are the focus of this paper, we see several areas in which the techniques used in Hive might provide substantial benefits.
- Heterogeneous resource management: For large diverse workloads, performance may be improved by managing separate resource pools with separate policies and mechanisms. A multicellular operating system can segregate processes by type and use different strategies in different cells. Different cells can even run different kernel code if their resource management mechanisms are incompatible or the machine's hardware is heterogeneous.
- Support for CC-NOW: Researchers have proposed workstation add-on cards that will provide cache-coherent shared memory across local-area networks [12]. Also, the FLASH architecture may eventually be distributed to multiple desktops. Both approaches would create a cache-coherent network of workstations (CC-NOW). The goal of a CC-NOW is a system with the fault isolation and administrative independence characteristic of a workstation cluster, but the resource sharing characteristic of a multiprocessor. Hive is a natural starting point for a CC-NOW operating system.
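
The segregation of processes by type described under heterogeneous resource management can be sketched roughly as follows. This is only an illustration of the idea: the `Cell` class, the policy labels (`"timeshare"`, `"batch"`), the process types, and the `place` function are all hypothetical names introduced here, not part of Hive.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """Hypothetical cell that runs its own resource management policy."""
    name: str
    policy: str                                  # assumed labels: "timeshare" or "batch"
    processes: list = field(default_factory=list)


# Assumed mapping from process type to the policy best suited to it.
POLICY_FOR_TYPE = {"interactive": "timeshare", "compute": "batch"}


def place(process_type, pid, cells):
    """Assign a process to the first cell whose policy matches its type."""
    wanted = POLICY_FOR_TYPE[process_type]
    for cell in cells:
        if cell.policy == wanted:
            cell.processes.append(pid)
            return cell
    raise LookupError(f"no cell runs policy {wanted!r}")
```

A real system would also have to balance load across cells running the same policy and weigh the fault containment cost of each placement. The sketch shows only the type-based routing step.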


Last modified 08/31/95 by Dan Teodosiu.