Thesis Abstract

Flexible Use of Memory for Replication/Migration in Cache-Coherent DSM Multiprocessors

Shared-memory multiprocessors are being used increasingly as compute servers. These systems enable efficient usage of computing resources through the aggregation and tight coupling of CPU, memory, and I/O. One popular design for shared-memory machines is a bus-based architecture. However, as processors get faster, the shared bus becomes a bandwidth bottleneck. CC-NUMA (Cache-Coherent with Non-Uniform Memory Access time) machines remove this architectural limitation and provide a scalable shared-memory architecture. One significant characteristic of the CC-NUMA architecture is that the latency to access data on a remote node is considerably larger than the latency to access local memory. On such machines, good data locality can reduce memory stall time and is therefore a critical factor in application performance.

In this thesis we study the options available to system designers for transparently decreasing the fraction of data misses serviced remotely. This work is done in the context of the Stanford FLASH multiprocessor. We utilize the programmability of the FLASH memory controller to explore a number of techniques for improving data locality. Specifically, we compare base cache-coherence to techniques that use some or all of the local memory in a node to improve data locality. These techniques include a Remote Access Cache (RAC), in which a portion of local memory is used to cache remotely allocated data at cache-line granularity; a Cache-Only Memory Architecture (COMA-F), in which all of local memory is used as a cache under hardware control; and OS-assisted page migration/replication, in which the operating system migrates or replicates pages according to the observed pattern of misses to the cache lines on each page. Based on the comparison of these approaches, we propose and evaluate a novel hybrid scheme, MIGRAC, that combines the benefits of the RAC design and OS-based page migration and replication.

Our work differs from previous comparison work in several important respects. First, all of our schemes are complete implementations evaluated on the same base platform, providing a detailed and consistent evaluation that has not been possible previously. In contrast, previous studies have compared existing machines with different underlying architectures, making a fair evaluation difficult, or have compared simulations of high-level behavioral models rather than real implementations. Second, we evaluate our work on compute-server workloads rather than on the scientific applications used in previous related research. Third, our implementation of COMA is the first complete implementation of a COMA-F design in hardware.

We find that a simple RAC can improve performance significantly over base cache-coherence (up to 64%). COMA-F improves locality, but its additional complexity limits its gains versus base cache-coherence (only 14% improvement). Page migration/replication performs well (up to 56% gains) but does not handle fine-grain sharing as effectively as RAC or COMA-F. Finally, our MIGRAC approach performs well relative to a simple RAC (up to 37% faster) or page migration/replication (up to 8% faster) and is robust.
