Hive: Fault Containment for Shared-Memory Multiprocessors

4. Fault Containment Implementation

As described earlier, the three ways one cell can damage another are by sending bad messages, providing bad data to remote reads, and writing to remote addresses. The mechanisms that Hive uses to prevent damage from spreading through messages have proven their effectiveness in previous distributed systems such as NFS. Therefore, we will focus on the novel mechanisms related to remote reads and writes: the careful reference protocol for remote reads, the wild write defense, and aggressive failure detection.

4.1 Careful reference protocol

One cell reads another's internal data structures in cases where RPCs are too slow, an up-to-date view of the data is required, or the data needs to be published to a large number of cells. Once the data has been read, it has to be sanity-checked just as an RPC received from the remote cell would be checked. However, the remote reads create additional fault containment problems.

An access to the memory of a remote cell can result in a hardware exception. For example, a bus error will occur if the remote node has failed. Cells normally panic (shut themselves down) if they detect such hardware exceptions during kernel execution, because this indicates internal kernel corruption. Some mechanism is needed to prevent errors that occur during remote reads from causing a kernel panic.

Hive uses a simple careful reference protocol to avoid these problems, as well as to handle data errors such as linked data structures with loops and values that change unexpectedly. The reading cell follows these steps:

  1. Call the careful_on function, which captures the current stack frame and records which remote cell the kernel intends to access. If a bus error occurs while reading the memory of that cell, the trap handler restores to the saved function context.

  2. Before using any remote address, check that it is aligned properly for the expected data structure and that it addresses the memory range belonging to the expected cell.

  3. Copy all data values to local memory before beginning sanity-checks, in order to defend against unexpected changes.

  4. Check each remote data structure by reading a structure type identifier. The type identifier is written by the memory allocator and removed by the memory deallocator. Checking for the expected value of this tag provides a first line of defense against invalid remote pointers.

  5. Call careful_off when done so future bus errors in the reading cell will correctly cause the kernel to panic.

An example use of the careful reference protocol is the clock monitoring algorithm, in which the clock handler of each cell checks another cell's clock value on every tick (Section 4.3). With simulated 200-MHz processors, the average latency from the initial call to careful_on until the terminating careful_off call finishes is 1.16 ms (232 cycles), of which 0.7 ms (140 cycles) is the latency we model for the cache miss to the memory line containing the clock value. This is substantially faster than sending an RPC to get the data, which takes a minimum of 7.2 ms (Section 6) and requires interrupting a processor on the remote cell.

4.2 Wild write defense

Hive defends against wild writes using a two-part strategy. First, it manages the FLASH hardware firewall to minimize the number of pages writable by remote cells. Second, when a cell failure is detected, other cells preemptively discard any pages writable by the failed cell.

FLASH firewall: The firewall controls which processors are allowed to modify each region of main memory. FLASH provides a separate firewall for each 4 KB of memory, specified as a 64-bit vector where each bit grants write permission to a processor. On systems larger than 64 processors, each bit grants write permission to multiple processors. A write request to a page for which the corresponding bit is not set fails with a bus error. Only the local processor can change the firewall bits for the memory of its node.

The coherence controller of each node stores and checks the firewall bits for the memory of that node. It checks the firewall on each request for cache line ownership (read misses do not count as ownership requests) and on most cache line writebacks. Uncached accesses to I/O devices on other cells always receive bus errors, while DMA writes from I/O devices are checked as if they were writes from the processor on that node.

We chose a 4 KB firewall granularity to match the operating system page size. Anything larger would constrain operating system memory allocation, whereas it is unclear whether a finer granularity would be useful.

We chose a bit vector per page after rejecting two options that would require less storage. A single bit per page, granting global write access, would provide no fault containment for processes that use any remote memory. A byte or halfword per page, naming a processor with write access, would prevent the scheduler in each cell from balancing the load on its processors.

The performance cost of the firewall is minimal. We ran several of the test workloads twice using a cycle-accurate FLASH memory system model, once with firewall checking enabled and once with it disabled. The firewall check increases the average remote write cache miss latency under pmake by 6.3% and under ocean by 4.4%. This increase has little overall effect since write cache misses are a small fraction of the workload run time.

Firewall management policy: Firewall management is a tradeoff between fault containment and performance. The only time remote write access to a page is required is when a write-enabled mapping to the page is present in a processor of another cell. However, the set of active hardware mappings changes on each TLB miss, a rate that is far too high to send RPCs requesting firewall status changes. Some other policy is needed to decide when firewall write permission should be granted.

Choosing the correct policy requires careful evaluation under various workloads. At present we use a policy that was straightforward to implement and keeps the number of writable pages fairly small.

Write access to a page is granted to all processors of a cell as a group, when any process on that cell faults the page into a writable portion of its address space. Granting access to all processors of the cell allows it to freely reschedule the process on any of its processors without sending RPCs to remote cells. Write permission remains granted as long as any process on that cell has the page mapped.

The address space region is marked writable only if the process had explicitly requested a writable mapping to the file. Thus this policy ensures that a fault in a cell can only corrupt remote pages to which a process running on that cell had requested write access.

To measure the effectiveness of this policy we used pmake, which shares few writable pages between the separate compile processes, and ocean, which shares its data segment among all its threads. We observed that, over 5.0 seconds of execution sampled at 20 millisecond intervals, pmake had an average of 15 remotely writable pages per cell at each sample (out of about 6000 user pages per cell), while ocean showed an average of 550 remotely writable pages.

The behavior of the firewall under pmake shows that the current policy should provide good wild write protection to a system used predominately by sequential applications. The highest recorded number of writable pages during the workload was 42, on the cell acting as the file server for the directory where compiler intermediate files are stored (/tmp).

In the case of ocean, the current policy provides little protection since the global data segment is write-shared by all processors. However, the application is running on all processors and will exit anyway when a cell fails, so any efforts to prevent its pages from being discarded will be wasted. The simple firewall management policy appears to be working well in this case, avoiding protection status changes that would create unnecessary performance overheads.

Preemptive discard: It is difficult to efficiently determine which pages to discard after a cell failure. Many cells could be using a given page and therefore need to cooperate in discarding it, but only one cell knows the precise firewall status of that page (the data home cell, defined in Section 5). Distributing firewall status information during recovery to all cells using the page would require significant communication. Instead, all TLBs are flushed and all remote mappings are removed during recovery. This ensures that a future access to a discarded page will fault and send an RPC to the owner of the page, where it can be checked.

The accesses need to be checked because discarding a page can violate the expected stable write semantics of the file system, if the page was dirty with respect to disk. Processes that attempt to access a discarded dirty page should receive an error. However, the accesses might occur arbitrarily far in the future, making it quite expensive to record exactly which pages of each file have been discarded. We solve this problem by relaxing the process-visible error semantics slightly.

In most current UNIX implementations the file system does not attempt to record which dirty pages were lost in a system crash. It simply fetches stale data from disk after a reboot. This is acceptable because no local processes can survive the crash, so a process that accessed the dirty data will never observe that it was unstable.

We take advantage of this in Hive and allow any process that opens a damaged file after a cell failure to read whatever data is available on disk. Only processes that opened the file before the failure will receive I/O errors. This is implemented with a generation number, maintained by the file system, that is copied into the file descriptor or address space map of a process when it opens the file. When a dirty page of a file is discarded, the file's generation number is incremented. An access via a file descriptor or address space region with a mismatched generation number generates an error.

4.3 Failure detection and recovery

Hive attempts to detect the failure of a cell quickly in order to reduce the probability that wild writes will cause user-visible data corruption. This is implemented with consistency checks that run regularly in normal operation. When one of the checks fails, it is confirmed by a distributed agreement algorithm.

Just as in previous distributed systems, a cell is considered potentially failed if an RPC sent to it times out. Additionally, a cell is considered potentially failed if:

To prevent a corrupt cell from repeatedly broadcasting alerts and damaging system performance over a long period, a cell that broadcasts the same alert twice but is voted down by the distributed agreement algorithm both times is considered corrupt by the other cells.

The distributed agreement algorithm is an instance of the well-studied group membership problem, so Hive will use a standard algorithm (probably [16]). This is not implemented yet and is simulated by an oracle for the experiments reported in this paper.

Recovery algorithms: Given consensus on the live set of cells, each cell runs recovery algorithms to clean up dangling references and determine which processes must be killed. One interesting aspect of these algorithms is the use of a double global barrier to synchronize the preemptive discard operation. The double barrier in recovery is part of a strategy to increase the speed of page faults that hit in the file cache, an extremely common intercell operation.

When a cell exits distributed agreement and enters recovery, it is not guaranteed that all page faults and accesses to its memory from other cells have finished. User-level processes will be suspended, but processes running at kernel level will not be suspended. (Allowing kernel-level processes to continue during recovery permits the recovery algorithms to grab kernel locks and modify kernel data structures.) Each cell only joins the first global barrier when it has flushed its processor TLBs and removed any remote mappings from process address spaces. A page fault that occurs after a cell has joined the first barrier is held up on the client side.

After the first barrier completes, each cell knows that no further valid page faults or remote accesses are pending. This allows it to revoke any firewall write permission it has granted to other cells and clean up its virtual memory data structures. It is during this operation that the virtual memory subsystem detects pages that were writable by a failed cell and notifies the file system, which increments its generation count on the file to record the loss.

Each cell joins the second global barrier after it has finished virtual memory cleanup. Cells that exit the second barrier can safely resume normal operation, including sending page faults to other cells.

Given this design, the server-side implementation of a page fault RPC need not grab any blocking locks to synchronize with the recovery algorithms. This allows page faults that hit in the file cache to be serviced entirely in an interrupt handler, which has significant performance benefits (Section 5.2).

At the end of every recovery round, a recovery master is elected from the new live set. The recovery master runs hardware diagnostics on the nodes belonging to the failed cells. If the diagnostic checks succeed, the failed cells are automatically rebooted and reintegrated into the system. Reintegration is not yet implemented but appears straightforward.

Last modified 08/31/95 by Dan Teodosiu.