Hive: Fault Containment for Shared-Memory Multiprocessors

6. RPC Performance Optimization

We have focused development so far on the fault containment and memory sharing functionality of Hive. However, it was clear from the start that intercell RPC latency would be a critical factor in system performance. RPCs could be implemented on top of normal cache-coherent memory reads and writes, but we chose to add hardware message support to FLASH in order to minimize latency.

Without hardware support, intercell communication would have to be layered on interprocessor interrupts (IPIs) and producer-consumer buffers in shared memory. This approach is expensive if the IPI carries no argument data, as on current multiprocessors. The receiving cell would have to poll per-sender queues to determine which cell sent the IPI. (Shared per-receiver queues are not an option as this would require granting global write permission to the queues, allowing a faulty cell to corrupt any message in the system.) Data in the queues would also ping-pong between the processor caches of the sending and receiving cells.
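To make the polling cost concrete, here is a minimal sketch of the software-only alternative described above. It is a toy model, not Hive code: the `Receiver` class, its `on_ipi` method, and the use of in-process deques in place of shared-memory buffers are all assumptions made for illustration.

```python
from collections import deque


class Receiver:
    """Toy model of a cell that receives messages via per-sender queues."""

    def __init__(self, n_cells):
        # One queue per potential sender. In the real design each sender
        # would hold write permission only for its own queue.
        self.queues = [deque() for _ in range(n_cells)]

    def on_ipi(self):
        """Handle an IPI that carries no argument data.

        Because the interrupt does not say who sent it, every per-sender
        queue has to be inspected. Returns (messages, queues_polled).
        """
        found = []
        polled = 0
        for sender, queue in enumerate(self.queues):
            polled += 1
            while queue:
                found.append((sender, queue.popleft()))
        return found, polled
```

In this model a single message from one cell still costs one queue inspection per cell in the system, which is the overhead the paragraph above refers to.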

We added a short interprocessor send facility (SIPS) to the FLASH coherence controller. We combine the standard cache-line delivery mechanism used by the cache-coherence protocol with the interprocessor interrupt mechanism and a pair of short receive queues on each node. Each SIPS delivers one cache line of data (128 bytes) in about the latency of a cache miss to remote memory, with the reliability and hardware flow control characteristic of a cache miss. Separate receive queues are provided on each node for request and reply messages, making deadlock avoidance easy. An early version of the message send primitive is described in detail in [8].
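The receive side of SIPS can be sketched as follows. This is a software model under stated assumptions, not the FLASH implementation: the `Node` class, the queue depth of 4 (the text says only that the queues are short), and the use of a `False` return value to stand in for hardware flow control are all hypothetical.

```python
from collections import deque

CACHE_LINE_BYTES = 128  # payload size given in the text
QUEUE_DEPTH = 4         # assumption: the text only calls the queues "short"


class Node:
    """Toy model of one node's SIPS receive queues."""

    def __init__(self, depth=QUEUE_DEPTH):
        self.depth = depth
        # Separate queues for requests and replies, as in the text.
        self.queues = {"request": deque(), "reply": deque()}

    def deliver(self, kind, sender, payload):
        """Accept one message, or return False to model back-pressure."""
        if len(payload) > CACHE_LINE_BYTES:
            raise ValueError("a SIPS carries at most one cache line")
        queue = self.queues[kind]
        if len(queue) >= self.depth:
            return False  # nothing is dropped; the sender must retry
        queue.append((sender, bytes(payload)))
        return True

    def receive(self, kind):
        """Pop the oldest message of the given kind, or None if empty."""
        queue = self.queues[kind]
        return queue.popleft() if queue else None
```

The model shows why separate queues help with deadlock avoidance: even when the request queue is full, a reply can still be delivered, so a node that is waiting for a reply is never blocked behind incoming requests.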

The Hive RPC subsystem built on top of SIPS is much leaner than the ones in previous distributed systems. No retransmission or duplicate suppression is required because the primitive is reliable. No message fragmentation or reassembly is required because any data beyond a cache line can be sent by reference (although the careful reference protocol must then be used to access it). 128 bytes is large enough for the argument and result data of most RPCs. The RPC subsystem is also simplified because it supports only kernel-to-kernel communication. User-level RPCs are implemented at the library level using direct access to the message send primitive.
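The choice between inline and by-reference arguments can be sketched as below. The function name `marshal`, the `alloc_shared` callback, and the tuple encoding are assumptions for illustration; the sketch also ignores any header space a real message would presumably reserve within the cache line.

```python
CACHE_LINE_BYTES = 128  # SIPS payload size given in the text


def marshal(args, alloc_shared):
    """Decide how to ship the argument bytes of one RPC.

    alloc_shared is a hypothetical callback that places the bytes in
    memory readable by the server cell and returns an integer address.
    """
    if len(args) <= CACHE_LINE_BYTES:
        # Common case: the arguments travel inside the message itself.
        return ("inline", bytes(args))
    # Larger arguments stay in memory; only (address, length) is sent.
    # The receiver would then read them with the careful reference protocol.
    addr = alloc_shared(args)
    return ("by_reference", (addr, len(args)))
```

Because the message never exceeds one cache line in either branch, no fragmentation or reassembly logic is needed.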

The base RPC system only supports requests that are serviced at interrupt level. The minimum end-to-end null RPC latency measured using SimOS is 7.2 µs (1440 cycles), of which 2 µs is SIPS latency. This is fast enough that the client processor spins waiting for the reply. The client processor context-switches only after a timeout of 50 µs, which almost never occurs.
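The client-side wait policy described above (spin first, give up the processor only after a timeout) can be sketched as follows. The names `wait_for_reply` and `FakeClock` are hypothetical, and time is measured in abstract ticks so that the example is deterministic; a kernel would read a cycle counter instead.

```python
def wait_for_reply(reply_ready, now, timeout):
    """Spin until the reply arrives or `timeout` ticks have elapsed.

    reply_ready: callable returning True once the reply has arrived.
    now: callable returning the current time in ticks.
    Returns "spun" if the reply arrived while spinning, else "blocked",
    at which point a real client would context-switch.
    """
    start = now()
    while now() - start < timeout:
        if reply_ready():
            return "spun"
    return "blocked"


class FakeClock:
    """Deterministic clock for the example: each read advances one tick."""

    def __init__(self):
        self.t = 0

    def __call__(self):
        self.t += 1
        return self.t
```

For example, with `clock = FakeClock()`, the call `wait_for_reply(lambda: clock.t >= 7, clock, 50)` finishes in the spin loop, while a reply that never arrives makes the function return "blocked" once the timeout expires.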

In practice the RPC system adds somewhat more overhead than the null RPC measurement suggests. As shown in Table 5.2, we measured an average of 9.6 µs (1920 cycles) for the RPC component of commonly used interrupt-level requests (excluding the time shown in that table to allocate and copy memory for arguments beyond 128 bytes). The extra time above the null RPC latency is primarily due to stub execution.

Layered on top of the base interrupt-level RPC mechanism is a queuing service and server process pool to handle longer-latency requests (for example, those that cause I/O). A queued request is structured as an initial interrupt-level RPC which launches the operation, then a completion RPC sent from the server back to the client to return the result. The minimum end-to-end null queued RPC latency is 34 µs, due primarily to context switch and synchronization costs. In practice the latency can be much higher because of scheduling delays.
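The two-step structure of a queued request can be sketched as follows. This is a single-threaded toy model: the `Server` and `Client` classes and their method names are assumptions, and direct method calls stand in for the interrupt-level and completion RPCs.

```python
from collections import deque


class Client:
    """Toy client that records results delivered by completion RPCs."""

    def __init__(self):
        self.results = {}

    def completion_handler(self, req_id, result):
        # Runs when the server's completion RPC arrives.
        self.results[req_id] = result


class Server:
    """Toy server with a work queue drained by server-pool processes."""

    def __init__(self):
        self.work = deque()

    def interrupt_handler(self, client, req_id, operation):
        """Initial interrupt-level RPC: queue the work and return at once."""
        self.work.append((client, req_id, operation))
        return "launched"

    def run_server_process(self):
        """One step of a pool process: run an operation, then send the
        completion RPC back to the client that requested it."""
        client, req_id, operation = self.work.popleft()
        client.completion_handler(req_id, operation())
```

The client sees "launched" immediately, and the result appears only after a server process has been scheduled, which is where the scheduling delays mentioned above would enter.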

The significant difference in latency between interrupt-level and queued RPCs had two effects on the structure of Hive. First, we reorganized data structures and locking to make it possible to service common RPCs at interrupt level. Second, common services that may need to block are structured as initial best-effort interrupt-level service routines that fall back to queued service routines only if required.
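The second effect, best-effort service at interrupt level with a queued fallback, can be sketched as below. The `WouldBlock` exception, the `serve` dispatcher, and the cache lookup used as the fast path are all hypothetical; the sketch shows only the control flow, not any actual Hive service.

```python
class WouldBlock(Exception):
    """Raised by a fast path that cannot finish without blocking."""


def serve(request, fast_path, enqueue):
    """Try the interrupt-level routine; fall back to the queued service.

    Returns ("interrupt", result) when the fast path succeeds, or
    ("queued", None) after handing the request to the server pool.
    """
    try:
        return ("interrupt", fast_path(request))
    except WouldBlock:
        enqueue(request)
        return ("queued", None)


def lookup_fast(cache):
    """Build an example fast path that succeeds only on a cache hit."""
    def fast_path(key):
        if key not in cache:
            # A miss would need I/O, which cannot happen at interrupt level.
            raise WouldBlock(key)
        return cache[key]
    return fast_path
```

With this structure the common case pays only the interrupt-level latency, and the slower queued path is taken only for requests that genuinely need to block.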

Last modified 08/31/95 by Dan Teodosiu.