Hive: Fault Containment for Shared-Memory Multiprocessors
Given the fault containment provided by the features described in the previous section, the next challenge is to share resources flexibly across cell boundaries without weakening fault containment. This section describes Hive's solution to one of the major resource sharing problems, memory sharing among cells.
As described earlier (Figure 3.2) there are two types of memory sharing: logical-level sharing and physical-level sharing. The two types require different data structure management and are implemented at different places in the system.
We found it useful to give names to the three roles that cells can play in sharing a memory page:
We start our description of memory sharing by introducing the virtual memory page cache design in IRIX, because it is the basis for the implementation. Then we discuss each of the types of memory sharing in turn.
When a page fault to a mapped file page occurs, the virtual memory system first checks the pfdat hash table. If the data page requested by the process is not present, the virtual memory system invokes the read operation of the vnode object provided by the file system to represent that file. The file system allocates a page frame, fills it with the requested data, and inserts it in the pfdat hash table. Then the page fault handler in the virtual memory system restarts and finds the page in the hash table.
Read and write system calls follow nearly the same path as page faults. The system call dispatcher calls through the vnode object for the file. The file system checks the pfdat hash table for the requested page in order to decide whether I/O is necessary.
The Hive virtual memory system implements export and import functions that set up the binding between a page of one cell and an extended pfdat on another (Table 5.1). These functions are most frequently called as part of page fault processing, which proceeds as follows.
A page fault to a remote file is initially processed just as in other distributed file systems. The virtual memory system first checks the pfdat hash table on the client cell. If the data page requested by the process is not present, the virtual memory system invokes the read operation on the vnode for that file. This is a shadow vnode which indicates that the file is remote. The file system uses information stored in the vnode to determine the data home for the file and the vnode tag on the data home, and sends an RPC to the data home. The server side of the file system issues a disk read using the data home vnode if the page is not already cached.
Once the page has been located on the data home, Hive functions differently from previous systems. The file system on the data home calls export on the page. This records the client cell in the data home's pfdat, which prevents the page from being deallocated and provides information necessary for the failure recovery algorithms. export also modifies the firewall state of the page if write access is requested.
The server-side file system returns the address of the data page to the client cell. The client-side file system calls import, which allocates an extended pfdat for that page frame and inserts it into the client cell's pfdat hash table. Further faults to that page can hit quickly in the client cell's hash table and avoid sending an RPC to the data home. The page also remains in the data home's pfdat hash table, allowing processes on other cells to find and share it. Figure 5.3a illustrates the state of the virtual memory data structures after export and import have completed.
When the client cell eventually frees the page, the virtual memory system calls release rather than putting the page on the local free list. release frees the extended pfdat and sends an RPC to the data home, which places the page on the data home free list if no other references remain. Keeping the page on the data home free list rather than client free lists increases memory allocation flexibility for the data home. The data page remains in memory until the page frame is reallocated, providing fast access if the client cell faults to it again.
We measure the overhead of the entire mechanism described in this section by comparing the minimal cost of a page fault that hits in the client cell page cache with one that goes remote and hits in the data home page cache. The local case averages 6.9 ms while the remote case averages 50.7 ms in microbenchmarks run on SimOS. Table 5.2 shows a detailed breakdown of the remote page fault latency. 17.3 ms of the remote case is due to RPC costs which are explained in Section 6. Another 14.2 ms (listed in the table as client cell locking overhead and miscellaneous VM) is due to an implementation structure inherited from IRIX. IRIX assumes that any miss in the client cell's hash table will result in a disk access, and so does not optimize that code path. Reorganizing this code could provide substantial further reduction in the remote overhead.
In practice the remote costs can be somewhat higher, because some of the remote faults cannot be serviced at interrupt level. Faults which encounter certain synchronization conditions at the data home must be queued for an RPC server process, which adds substantial latency (Section 6). To check the overall effect of remote faults, we measured their contribution to the slowdown of pmake on a four-cell system compared to a one-cell system. During about six seconds of execution on four processors, there are 8935 page faults that hit in the page cache, of which 4946 are remote on the four-cell system. This increases the time spent in these faults from 117 to 455 milliseconds (cumulative across the processors), which is about 13% of the overall slowdown of pmake from a one-cell to a four-cell system. This time is worth optimizing but is not a dominant effect on system performance.
FIGURE 5.3. Implementation of memory sharing.
In IRIX, anonymous pages are managed in copy-on-write trees, similar to the MACH approach [15]. An anonymous page is allocated when a process writes to a page of its address space that is shared copy-on-write with its parent. The new page is recorded at the current leaf of the copy-on-write tree. When a process forks, the leaf node of the tree is split with one of the new nodes assigned to the parent and the other to the child. Pages written by the parent process after the fork are recorded in its new leaf node, so only the anonymous pages allocated before the fork are visible to the child. When a process faults on a copy-on-write page, it searches up the tree to find the copy created by the nearest ancestor who wrote to the page before forking.
In Hive, the parent and child processes might be on different cells. There are several different ways to change anonymous page management to respond to this. We chose this issue as the subject for an experiment on the effectiveness of building distributed kernel data structures.
We keep the existing tree structure nearly intact, and allow the pointers in the tree to cross cell boundaries. The leaf node corresponding to a process is always local to a process. Other nodes might be remote. This does not create a wild write vulnerability because the lookup algorithms do not need to modify the interior nodes of the tree or synchronize access to them.
When a child read-faults on a shared page, it searches up the tree, potentially using the careful reference protocol to read from the kernel memory of other cells. If it finds the page recorded in a remote node of the tree, it sends an RPC to the cell that owns that node to set up the export/import binding. The cell that owns the node is always the data home for the anonymous page.
The fact that this implementation appears to work reliably in the face of fault injection experiments (Section 7) indicates that distributed data structures can be built without weakening fault containment. However, we do not observe any substantial performance benefit in this case. When the child finds a desired page it usually has to send an RPC to bind to the page in any case, so the use of shared memory does not save much time unless the tree spans multiple cells. A more conventional RPC-based approach would be simpler and probably just as fast, at least for the workloads we evaluated.
Hive reuses the extended pfdat mechanism to enable a cell, the memory home, to loan one of its page frames to another cell, which becomes the data home (Figure 5.3b). The memory home moves the page frame to a reserved list and ignores it until the data home frees it or fails. The data home allocates an extended pfdat and subsequently manages the frame as one of its own (except it must send an RPC to the memory home when it needs to change the firewall state).
Frame loaning is usually demand-driven by the page allocator. When the page allocator receives a request, it may decide to allocate a remote frame. Wax will eventually provide the policy support for remote allocation. If a cell decides to allocate remotely, it sends an RPC to the memory home asking for a set of pages.
Borrowed frames are not acceptable for all requests. For example, frames allocated for internal kernel use must be local, since the firewall does not defend against wild writes by the memory home. The page allocator supports constraints by taking two new arguments, a set of cells that are acceptable for the request and one cell that is preferred.
Hive's current policy for freeing borrowed frames is similar to its policy for releasing imported pages. It sends a free message to the memory home as soon as the data cached in the frame is no longer in use. This can be a poor choice in some cases because it results in immediately flushing the data. We have not yet developed a better policy.
To support this CC-NUMA optimization efficiently, the virtual memory system reuses the preexisting pfdat rather than allocating an extended pfdat when reimporting a loaned page. This is possible because the logical-level and physical-level state machines use separate storage within each pfdat.
There are no operations in the memory sharing subsystem for a cell to request that another return its page or page frame. The information available to each cell is not sufficient to decide whether its local memory requests are higher or lower priority than those of the remote processes using those pages. This information will eventually be provided by Wax, which will direct the virtual memory clock hand process running on each cell to preferentially free pages whose memory home is under memory pressure.

