Hive: Fault Containment for Shared-Memory Multiprocessors

1. Introduction

Shared-memory multiprocessors are becoming an increasingly common server platform because of their excellent performance under dynamic multiprogrammed workloads. However, the symmetric multiprocessor operating systems (SMP OS) commonly used for small-scale machines are difficult to scale to the large shared-memory multiprocessors that can now be built (Stanford DASH [11], MIT Alewife [3], Convex Exemplar [5]).

In this paper we describe Hive, an operating system designed for large-scale shared-memory multiprocessors. Hive is fundamentally different from previous monolithic and microkernel SMP OS implementations: it is structured as an internal distributed system of independent kernels called cells. This multicellular kernel architecture has two main advantages:

However, the multicellular architecture of Hive also creates new implementation challenges. These include:

In this paper, we focus on Hive's solution to the fault containment problem and on its solution to a key resource sharing problem, sharing memory across cell boundaries. The solutions rely on hardware as well as software mechanisms: we have designed Hive in conjunction with the Stanford FLASH multiprocessor [10], which has enabled us to add hardware support in a few critical areas.

Hive's fault containment strategy has three main components. First, each cell uses firewall hardware provided by FLASH to defend most of its memory pages against wild writes. Second, any pages writable by a failed cell are preemptively discarded when the failure is detected, which prevents corrupt data from being read subsequently by applications or written to disk. Finally, aggressive failure detection reduces the delay until preemptive discard occurs. Cell failures are detected initially using heuristic checks, then confirmed with a distributed agreement protocol that minimizes the probability of concluding that a functioning cell has failed.
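The interaction of the three components can be illustrated with a small model. This is only a sketch under stated assumptions, not Hive's implementation: the `Page` class, the per-page `writers` set standing in for the firewall, and the unanimous-vote rule in `confirm_failure` are all hypothetical names and simplifications introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A memory page guarded by a hypothetical per-page firewall."""
    owner: int                                  # cell that owns the page
    writers: set = field(default_factory=set)   # cells granted write access
    data: bytes = b""
    discarded: bool = False

    def __post_init__(self):
        self.writers.add(self.owner)            # the owner may always write

    def write(self, cell: int, data: bytes) -> None:
        # Firewall check: a wild write from an unauthorized cell is rejected.
        if cell not in self.writers:
            raise PermissionError(f"cell {cell} may not write this page")
        self.data = data


def preemptive_discard(pages, failed_cell):
    """Discard every page the failed cell could have written."""
    hit = [p for p in pages if failed_cell in p.writers]
    for p in hit:
        p.data = b""
        p.discarded = True
    return hit


def confirm_failure(votes):
    """Assumed agreement rule: the suspect is declared failed only if
    every voting cell concurs (the real protocol is not specified here)."""
    return len(votes) > 0 and all(votes.values())


# Cell 0 owns one page and has write access to one of cell 1's pages.
pages = [Page(owner=0), Page(owner=1, writers={0}), Page(owner=1)]
if confirm_failure({1: True, 2: True}):         # surviving cells agree
    preemptive_discard(pages, failed_cell=0)    # first two pages are dropped
```

In this model, the fewer pages a cell opens to remote writers, the fewer pages must be discarded when that writer fails, which is why the firewall and preemptive discard are described together.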

Hive provides two types of memory sharing among cells. First, the file system and the virtual memory system cooperate so processes on multiple cells can use the same memory page for shared data. Second, the page allocation modules on different cells cooperate so a free page belonging to one cell can be loaned to another cell that is under memory pressure. Either type of sharing would cause fault containment problems on current multiprocessors, since a hardware fault in memory or in a processor caching the data could halt some other processor that tries to access that memory. FLASH makes memory sharing safe by providing timeouts and checks on memory accesses.
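The second type of sharing, loaning free pages between cells, can be sketched as follows. The `Cell` class, its `lend` and `allocate` methods, and the bookkeeping dictionaries are hypothetical and exist only to illustrate the idea; the sketch ignores the hardware timeouts and access checks that FLASH provides.

```python
class Cell:
    """A kernel cell with its own pool of free page frames (illustrative)."""

    def __init__(self, cell_id, free_frames):
        self.cell_id = cell_id
        self.free = list(free_frames)   # frames owned by this cell and free
        self.loaned = {}                # frame -> id of the borrowing cell
        self.borrowed = {}              # frame -> id of the lending cell

    def lend(self, borrower_id):
        """Loan one free frame to another cell, or return None if empty."""
        if not self.free:
            return None
        frame = self.free.pop()
        self.loaned[frame] = borrower_id
        return frame

    def allocate(self, peers):
        """Use a local frame if possible; otherwise borrow from a peer."""
        if self.free:
            return self.free.pop()
        for peer in peers:
            frame = peer.lend(self.cell_id)
            if frame is not None:
                self.borrowed[frame] = peer.cell_id
                return frame
        raise MemoryError("no free frames in any cell")


# Cell 0 is under memory pressure; cell 1 has spare frames.
a, b = Cell(0, []), Cell(1, [100, 101])
frame = a.allocate([b])   # borrowed from cell 1
```

Recording the loan on both sides lets either cell determine which frames are affected if the other cell fails.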

The current prototype of Hive is based on and remains binary compatible with IRIX 5.2 (a version of UNIX SVR4 from Silicon Graphics, Inc.). Because FLASH is not available yet, we used the SimOS hardware simulator [18] to develop and test Hive. Our early experiments using SimOS demonstrate that:

These results indicate that a multicellular kernel architecture can provide fault containment in a shared-memory multiprocessor. The performance results are also promising, but significant further work is required on resource sharing and the single-system image before we can draw definitive conclusions about performance.

We begin this paper by defining fault containment more precisely and describing the fundamental problems that arise when implementing it in multiprocessors. Next we give an overview of the architecture and implementation of Hive. The implementation details follow in three parts: fault containment, memory sharing, and the intercell remote procedure call subsystem. We conclude with an evaluation of the performance and fault containment of the current prototype, a discussion of other applications of the Hive architecture, and a summary of related work.

Last modified 08/31/95 by Dan Teodosiu.