Abstract: Coherent Block Data Transfer in the FLASH Multiprocessor
A key goal of the Stanford FLASH project is to explore the integration of
multiple communication protocols in a single multiprocessor architecture. To
achieve this goal, FLASH includes a programmable node controller called MAGIC,
which contains an embedded protocol processor capable of implementing multiple
protocols. In this paper we present a specialized protocol for block data
transfer integrated with a conventional cache coherence protocol. Block
transfer forms the basis for message passing implementations on top of shared
memory, occurs in important workloads such as databases, and is frequently
used by the operating system. We discuss the issues that arise in designing a
fully integrated protocol and its interactions with cache coherence. Using
microbenchmarks, MPI communication primitives, and an application running on
the operating system, we compare our protocol with standard bcopy and bcopy
augmented with prefetches. Our results show that integrated block transfer can
accelerate communication between nodes while off-loading the task from the
main processor, utilizing the network more efficiently, and reducing the
associated cache pollution. Given the aggressive support for prefetching in
FLASH, prefetched bcopy achieves competitive performance in many cases, but it
lacks the other three advantages of our protocol: off-loading the main
processor, more efficient network use, and reduced cache pollution.
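
To make the two software baselines concrete, the following is a minimal sketch of a processor-driven copy and of the same copy augmented with software prefetches. It is an illustration only, not the code evaluated in the paper. The names copy_plain and copy_prefetched, the 128-byte line size, and the prefetch distance of four lines are all assumptions. __builtin_prefetch is the GCC/Clang builtin and is only a hint to the hardware.

```c
#include <stddef.h>
#include <string.h>

#define LINE_BYTES     128  /* assumed cache-line size, not taken from the paper */
#define PREFETCH_AHEAD 4    /* assumed prefetch distance, in cache lines */

/* Baseline: the main processor moves every byte itself, as bcopy does.
 * memmove is used because, like bcopy, it tolerates overlapping buffers. */
static void copy_plain(void *dst, const void *src, size_t n)
{
    memmove(dst, src, n);
}

/* Same copy, one line at a time, with a prefetch hint issued for a source
 * line a few lines ahead so that its fetch overlaps the current copy.
 * Unlike copy_plain, this sketch assumes dst and src do not overlap. */
static void copy_prefetched(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off = 0;

    while (off < n) {
        size_t ahead = off + (size_t)PREFETCH_AHEAD * LINE_BYTES;
        if (ahead < n)
            __builtin_prefetch(s + ahead, 0, 0); /* read access, low temporal locality */

        size_t chunk = (n - off < LINE_BYTES) ? (n - off) : LINE_BYTES;
        memcpy(d + off, s + off, chunk);
        off += chunk;
    }
}
```

In both variants the main processor still executes the copy loop and the copied data still passes through its cache, which is why prefetching can hide latency without providing the off-loading or cache-pollution benefits described above.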
Last modified 1/21/97 by Joel Baxter, webmaster@www-flash.stanford.edu.