Hive: Fault Containment for Shared-Memory Multiprocessors
1. Introduction

Shared-memory multiprocessors are becoming an increasingly
common server platform because of their excellent performance under
dynamic multiprogrammed workloads. However, the symmetric
multiprocessor operating systems (SMP OS) commonly used for
small-scale machines are difficult to scale to the large shared-memory
multiprocessors that can now be built (Stanford DASH [11], MIT Alewife [3],
Convex Exemplar [5]).
In this paper we describe Hive, an operating system designed for
large-scale shared-memory multiprocessors. Hive is fundamentally
different from previous monolithic and microkernel SMP OS
implementations: it is structured as an internal distributed system of
independent kernels called cells. This multicellular kernel
architecture has two main advantages:
- Reliability: In SMP OS implementations, any significant hardware
or software fault causes the entire system to crash. For large-scale
machines this can result in an unacceptably low mean time to
failure. In Hive, only the cell where the fault occurred crashes, so
only the processes using the resources of that cell are affected. This
is especially beneficial for compute server workloads where there are
multiple independent processes, the predominant situation today. In
addition, scheduled hardware maintenance and kernel software upgrades
can proceed transparently to applications, one cell at a time.
- Scalability: SMP OS implementations are difficult to scale to
large machines because all processors directly share all kernel
resources. Improving parallelism in a "shared-everything"
architecture is an iterative trial-and-error process of identifying
and fixing bottlenecks. In contrast, Hive offers a systematic approach
to scalability. Few kernel resources are shared by processes running
on different cells, so parallelism can be improved by increasing the
number of cells.
However, the multicellular architecture of Hive also creates new implementation challenges. These include:
- Fault containment: The effects of faults must be confined to the cell in which they occur. This is difficult because a shared-memory multiprocessor allows a faulty cell to issue wild writes that can corrupt the memory of other cells.
- Resource sharing: Processors, memory, and other system
resources must be shared flexibly across cell boundaries, to preserve
the execution efficiency that justifies investing in a shared-memory
multiprocessor.
- Single-system image: The cells must cooperate to present a
standard SMP OS interface to applications and users.
In this paper, we focus on Hive's solution to the fault containment
problem and on its solution to a key resource sharing problem, sharing
memory across cell boundaries. The solutions rely on hardware as well
as software mechanisms: we have designed Hive in conjunction with the
Stanford FLASH multiprocessor [10], which has enabled us to add hardware
support in a few critical areas.
Hive's fault containment strategy has three main components. Each cell
uses firewall hardware provided by FLASH to defend most of its memory
pages against wild writes. Any pages writable by a failed cell are
preemptively discarded when the failure is detected, which prevents
any corrupt data from being read subsequently by applications or
written to disk. Finally, aggressive failure detection reduces the
delay until preemptive discard occurs. Cell failures are detected
initially using heuristic checks, then confirmed with a distributed
agreement protocol that minimizes the probability of concluding that a
functioning cell has failed.
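
The first two components can be illustrated with a small model. The following is a minimal sketch, not Hive's implementation: the names `Page`, `Memory`, `grant_write`, and `discard_writable_by` are hypothetical, and the firewall is modeled as a per-page set of cells allowed to write, which is an assumption about its granularity. Failure detection and the agreement protocol are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    owner: int                                  # cell that owns the page
    writers: set = field(default_factory=set)   # cells granted write access
    data: bytes = b""
    valid: bool = True


class FirewallViolation(Exception):
    """Raised when a cell writes to a page it has no permission for."""


class Memory:
    """Toy model of firewall-protected pages (hypothetical, for illustration)."""

    def __init__(self):
        self.pages = {}

    def add_page(self, page_id, owner):
        # Assumption: by default only the owning cell may write to the page.
        self.pages[page_id] = Page(owner=owner, writers={owner})

    def grant_write(self, page_id, cell):
        # Open the firewall for one additional cell.
        self.pages[page_id].writers.add(cell)

    def write(self, page_id, cell, data):
        page = self.pages[page_id]
        if cell not in page.writers:
            # The firewall blocks a wild write from another cell.
            raise FirewallViolation(f"cell {cell} may not write page {page_id}")
        page.data = data

    def discard_writable_by(self, failed_cell):
        """Preemptively discard every page the failed cell could have written."""
        discarded = []
        for page_id, page in self.pages.items():
            if failed_cell in page.writers:
                page.valid = False
                page.data = b""
                discarded.append(page_id)
        return sorted(discarded)
```

In this model, a page that was never opened to the failed cell survives the discard, which is why keeping the set of writers small limits the damage a single cell failure can cause.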
Hive provides two types of memory sharing among cells. First, the file
system and the virtual memory system cooperate so processes on
multiple cells can use the same memory page for shared data. Second,
the page allocation modules on different cells cooperate so a free
page belonging to one cell can be loaned to another cell that is under
memory pressure. Either type of sharing would cause fault containment
problems on current multiprocessors, since a hardware fault in memory
or in a processor caching the data could halt some other processor
that tries to access that memory. FLASH makes memory sharing safe by
providing timeouts and checks on memory accesses.
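
The second type of sharing, page loaning, can be sketched as follows. This is an illustrative model under stated assumptions, not Hive's page allocator: the `Cell` class, its `lend_page` and `allocate` methods, and the policy of asking peers in list order are all hypothetical.

```python
class Cell:
    """Toy model of a cell's page allocator (hypothetical, for illustration)."""

    def __init__(self, cell_id, free_pages):
        self.cell_id = cell_id
        self.free = list(free_pages)   # pages this cell owns and is not using
        self.borrowed = {}             # page -> id of the lending cell
        self.loaned = {}               # page -> id of the borrowing cell

    def lend_page(self, borrower_id):
        """Hand one free page to another cell, or return None if none is free."""
        if not self.free:
            return None
        page = self.free.pop()
        self.loaned[page] = borrower_id
        return page

    def allocate(self, peers):
        """Allocate locally if possible, otherwise borrow from a peer cell."""
        if self.free:
            return self.free.pop()
        # Under memory pressure: ask each peer in turn for a free page.
        for peer in peers:
            page = peer.lend_page(self.cell_id)
            if page is not None:
                self.borrowed[page] = peer.cell_id
                return page
        raise MemoryError("no free pages on any cell")
```

Recording each loan on both sides is a design choice of this sketch: it lets either cell determine which pages are affected if the other one fails.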
The current prototype of Hive is based on and remains binary
compatible with IRIX 5.2 (a version of UNIX SVR4 from Silicon
Graphics, Inc.). Because FLASH is not available yet, we used the SimOS
hardware simulator [18] to develop and test Hive. Our early experiments
using SimOS demonstrate that:
- Hive can survive the halt of a processor or the failure of a
range of memory. In all 49 experiments where we injected a
fail-stop hardware fault, the effects were confined to the cell where
the fault occurred.
- Hive can survive kernel software faults. In all 20 experiments
where we randomly corrupted internal operating system data structures,
the effects were confined to the cell where the fault occurred.
- Hive can offer reasonable performance while providing fault
containment. A four-cell Hive executed three test workloads with
between 0% and 11% slowdown as compared to IRIX 5.2 on a
four-processor machine.
These results indicate that a multicellular kernel architecture can provide fault containment in a shared-memory multiprocessor. The performance results are also promising, but significant further work is required on resource sharing and the single-system image before we can draw definitive conclusions about performance.
We begin this paper by defining fault containment more precisely and
describing the fundamental problems that arise when implementing it in
multiprocessors. Next we give an overview of the architecture and
implementation of Hive. The implementation details follow in three
parts: fault containment, memory sharing, and the intercell remote
procedure call subsystem. We conclude with an evaluation of the
performance and fault containment of the current prototype, a
discussion of other applications of the Hive architecture, and a
summary of related work.


Last modified 08/31/95 by Dan Teodosiu.