Hive: Fault Containment for Shared-Memory Multiprocessors
As described earlier, the three ways one cell can damage another are by sending bad messages, providing bad data to remote reads, and writing to remote addresses. The mechanisms that Hive uses to prevent damage from spreading through messages have proven their effectiveness in previous distributed systems such as NFS. Therefore, we will focus on the novel mechanisms related to remote reads and writes: the careful reference protocol for remote reads, the wild write defense, and aggressive failure detection.
An access to the memory of a remote cell can result in a hardware exception. For example, a bus error will occur if the remote node has failed. Cells normally panic (shut themselves down) if they detect such hardware exceptions during kernel execution, because this indicates internal kernel corruption. Some mechanism is needed to prevent errors that occur during remote reads from causing a kernel panic.
Hive uses a simple careful reference protocol to avoid these problems, as well as to handle data errors such as linked data structures with loops and values that change unexpectedly. The reading cell follows these steps:
FLASH firewall: The firewall controls which processors are allowed to modify each region of main memory. FLASH provides a separate firewall for each 4 KB of memory, specified as a 64-bit vector where each bit grants write permission to a processor. On systems larger than 64 processors, each bit grants write permission to multiple processors. A write request to a page for which the corresponding bit is not set fails with a bus error. Only the local processor can change the firewall bits for the memory of its node.
The coherence controller of each node stores and checks the firewall bits for the memory of that node. It checks the firewall on each request for cache line ownership (read misses do not count as ownership requests) and on most cache line writebacks. Uncached accesses to I/O devices on other cells always receive bus errors, while DMA writes from I/O devices are checked as if they were writes from the processor on that node.
We chose a 4 KB firewall granularity to match the operating system page size. Anything larger would constrain operating system memory allocation, whereas it is unclear whether a finer granularity would be useful.
We chose a bit vector per page after rejecting two options that would require less storage. A single bit per page, granting global write access, would provide no fault containment for processes that use any remote memory. A byte or halfword per page, naming a processor with write access, would prevent the scheduler in each cell from balancing the load on its processors.
The performance cost of the firewall is minimal. We ran several of the test workloads twice using a cycle-accurate FLASH memory system model, once with firewall checking enabled and once with it disabled. The firewall check increases the average remote write cache miss latency under pmake by 6.3% and under ocean by 4.4%. This increase has little overall effect since write cache misses are a small fraction of the workload run time.
Firewall management policy: Firewall management is a tradeoff between fault containment and performance. The only time remote write access to a page is required is when a write-enabled mapping to the page is present in a processor of another cell. However, the set of active hardware mappings changes on each TLB miss, a rate that is far too high to send RPCs requesting firewall status changes. Some other policy is needed to decide when firewall write permission should be granted.
Choosing the correct policy requires careful evaluation under various workloads. At present we use a policy that was straightforward to implement and keeps the number of writable pages fairly small.
Write access to a page is granted to all processors of a cell as a group, when any process on that cell faults the page into a writable portion of its address space. Granting access to all processors of the cell allows it to freely reschedule the process on any of its processors without sending RPCs to remote cells. Write permission remains granted as long as any process on that cell has the page mapped.
The address space region is marked writable only if the process had explicitly requested a writable mapping to the file. Thus this policy ensures that a fault in a cell can only corrupt remote pages to which a process running on that cell had requested write access.
To measure the effectiveness of this policy we used pmake, which shares few writable pages between the separate compile processes, and ocean, which shares its data segment among all its threads. We observed that, over 5.0 seconds of execution sampled at 20 millisecond intervals, pmake had an average of 15 remotely writable pages per cell at each sample (out of about 6000 user pages per cell), while ocean showed an average of 550 remotely writable pages.
The behavior of the firewall under pmake shows that the current policy should provide good wild write protection to a system used predominately by sequential applications. The highest recorded number of writable pages during the workload was 42, on the cell acting as the file server for the directory where compiler intermediate files are stored (/tmp).
In the case of ocean, the current policy provides little protection since the global data segment is write-shared by all processors. However, the application is running on all processors and will exit anyway when a cell fails, so any efforts to prevent its pages from being discarded will be wasted. The simple firewall management policy appears to be working well in this case, avoiding protection status changes that would create unnecessary performance overheads.
Preemptive discard: It is difficult to efficiently determine which pages to discard after a cell failure. Many cells could be using a given page and therefore need to cooperate in discarding it, but only one cell knows the precise firewall status of that page (the data home cell, defined in Section 5). Distributing firewall status information during recovery to all cells using the page would require significant communication. Instead, all TLBs are flushed and all remote mappings are removed during recovery. This ensures that a future access to a discarded page will fault and send an RPC to the owner of the page, where it can be checked.
The accesses need to be checked because discarding a page can violate the expected stable write semantics of the file system, if the page was dirty with respect to disk. Processes that attempt to access a discarded dirty page should receive an error. However, the accesses might occur arbitrarily far in the future, making it quite expensive to record exactly which pages of each file have been discarded. We solve this problem by relaxing the process-visible error semantics slightly.
In most current UNIX implementations the file system does not attempt to record which dirty pages were lost in a system crash. It simply fetches stale data from disk after a reboot. This is acceptable because no local processes can survive the crash, so a process that accessed the dirty data will never observe that it was unstable.
We take advantage of this in Hive and allow any process that opens a damaged file after a cell failure to read whatever data is available on disk. Only processes that opened the file before the failure will receive I/O errors. This is implemented with a generation number, maintained by the file system, that is copied into the file descriptor or address space map of a process when it opens the file. When a dirty page of a file is discarded, the file's generation number is incremented. An access via a file descriptor or address space region with a mismatched generation number generates an error.
Just as in previous distributed systems, a cell is considered potentially failed if an RPC sent to it times out. Additionally, a cell is considered potentially failed if:
The distributed agreement algorithm is an instance of the well-studied group membership problem, so Hive will use a standard algorithm (probably [16]). This is not implemented yet and is simulated by an oracle for the experiments reported in this paper.
Recovery algorithms: Given consensus on the live set of cells, each cell runs recovery algorithms to clean up dangling references and determine which processes must be killed. One interesting aspect of these algorithms is the use of a double global barrier to synchronize the preemptive discard operation. The double barrier in recovery is part of a strategy to increase the speed of page faults that hit in the file cache, an extremely common intercell operation.
When a cell exits distributed agreement and enters recovery, it is not guaranteed that all page faults and accesses to its memory from other cells have finished. User-level processes will be suspended, but processes running at kernel level will not be suspended. (Allowing kernel-level processes to continue during recovery permits the recovery algorithms to grab kernel locks and modify kernel data structures.) Each cell only joins the first global barrier when it has flushed its processor TLBs and removed any remote mappings from process address spaces. A page fault that occurs after a cell has joined the first barrier is held up on the client side.
After the first barrier completes, each cell knows that no further valid page faults or remote accesses are pending. This allows it to revoke any firewall write permission it has granted to other cells and clean up its virtual memory data structures. It is during this operation that the virtual memory subsystem detects pages that were writable by a failed cell and notifies the file system, which increments its generation count on the file to record the loss.
Each cell joins the second global barrier after it has finished virtual memory cleanup. Cells that exit the second barrier can safely resume normal operation, including sending page faults to other cells.
Given this design, the server-side implementation of a page fault RPC need not grab any blocking locks to synchronize with the recovery algorithms. This allows page faults that hit in the file cache to be serviced entirely in an interrupt handler, which has significant performance benefits (Section 5.2).
At the end of every recovery round, a recovery master is elected from the new live set. The recovery master runs hardware diagnostics on the nodes belonging to the failed cells. If the diagnostic checks succeed, the failed cells are automatically rebooted and reintegrated into the system. Reintegration is not yet implemented but appears straightforward.

