Distributed shared memory (DSM) machines can be characterized by four parameters, based on a slightly modified version of the logP model. The l (latency) and o (occupancy of the communication controller) parameters are the keys to performance in these machines, and are largely determined by major architectural decisions about the aggressiveness and customization of the node and network. For recent and upcoming machines, the g (gap) parameter that measures node-to-network bandwidth does not appear to be a bottleneck. Conventional wisdom is that latency is the dominant factor in determining the performance of a DSM machine. We show, however, that controller occupancy---which causes contention even in highly optimized applications---plays a major role, especially at low latencies. When latency hiding is used, occupancy becomes more critical, even in machines with high latency networks. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. We show that in many structured computations occupancy-induced contention is not alleviated by increasing problem size, and that there are important classes of applications for which the performance lost by using higher latency networks or higher occupancy controllers cannot be regained easily, if at all, by scaling the problem size.
Figure 1.1. Convergent DSM architecture
There are various ways to build cache-coherent DSM machines, arising from differences in desired performance and cost characteristics and in the extent to which one wants to use commodity parts and interfaces rather than build customized hardware. We assume the use of a commodity microprocessor, cache subsystem and main memory in this paper. The major sources of variability are in the network and the communication controller, which together constitute the communication architecture of the multiprocessor.
Candidate networks vary in their latency and bandwidth characteristics as well as in their topologies. They range from low-latency, high-bandwidth MPP networks, all the way to commodity asynchronous transfer mode (ATM) networks. On the controller side, there are two important and related variables. One is the location of the communication controller in the processing node. The communication controller can be located in the cache controller, in the memory subsystem, or on the I/O bus. The closer to the processor we locate the controller, the greater the performance, but the less we can leverage off commodity parts for the processing node. The other design variable is how specialized the controller is for the tasks it performs; for instance, it may be a hardware finite state machine, a customized special-purpose processor that runs protocol code in response to events, or an inexpensive off-the-shelf general-purpose processor.
Because of their differences in design cost, all of these types of systems are interesting. Current and proposed architectures for fine-grained distributed shared memory take different positions on the above tradeoffs. The unanswered question is how the performance characteristics of the network and controller affect how well the machines will run actual parallel programs. That is, as we move from more tightly-coupled and specialized systems to less tightly-coupled and more commodity-based systems, how much effectiveness in parallel performance do we lose over a wide range of computations? This is the question we address in this paper, by studying a range of important computations and interesting communication architectures through a combination of analytical modeling and detailed simulation.
We characterize the communication architectures of DSM multiprocessors by a few key parameters that are similar to those in the logP model [CKP+93]. An abstract model is of course simplistic from the perspective of an actual architecture, since it does not capture many of the details of real machines. We shall set up our architectural context and discuss some of the more detailed issues in the next section. Section 3 describes the framework and methodology we use to study the effectiveness of different types of DSM architectures. Section 4 and Section 5 present and analyze our results, and Section 6 concludes the paper.
We fix the number of processors P at 64 in this paper. The three parameters that characterize the communication architecture---latency, occupancy, and bandwidth or gap---all have complicated aspects to them, and we make certain simplifying assumptions. Let us first discuss each parameter individually, before placing our range of variations of these parameters in the context of realistic machines.
Latency: The latency of a message through the network depends, among other things, on how many hops the message travels in the network topology. For the moderate-scale machines that we consider, the overhead of getting the message from the processor into the network and vice versa usually dominates the topology-related component of the latency seen by the processor. We therefore ignore topology, and assume that the network transit time from one node to another is always the same.
Occupancy: The occupancy that the controller incurs for a request affects performance in two ways. First, it contributes directly to the latency of the current request because the request must pass through the controller. Second, it can contribute indirectly to the latencies of subsequent requests, through contention for the occupied controller. Occupancy may be more difficult to represent as an abstract parameter than network latency for two reasons. First, we have to decide which types of transactions invoke actions on the controller and hence incur controller occupancy. Second, while occupancy in real machines often depends on the type of the transaction, we want to represent it by a single parameter o. We now examine these issues separately.
Clearly, all events related to internode communication and protocol processing incur controller occupancy, including local cache misses that need data from another node, references from the processor that require the communication of state changes to other nodes, and incoming requests and replies from the network containing data and protocol information. The question is whether the controller should also handle local cache misses that do not generate any communication. In this paper, we assume a bypass path to local memory for these misses, so that they need not invoke the controller [RLW94, SFL+94]. We also assume that the state lookup that determines whether or not a miss needs to invoke the controller is free, and hence does not contribute to the latency of the miss.
In a real machine, particularly one in which the communication controller runs software code sequences for protocol processing, the occupancies of the controller are often different for different types of protocol actions. We make the following assumptions about occupancy. When the communication controller is simply generating a request into the network or receiving a reply from the network it incurs occupancy o. When the communication controller is the home of a network request it incurs occupancy 2o, because it has to retrieve data and/or manipulate coherence state information [HKO+94]. In this case the memory access occurs in parallel with the operation of the controller. If the state lookup at the home reveals that the requested line is dirty in the home node's cache, the communication controller incurs an extra occupancy C. If the requested line is dirty in a third processor's cache, the home node forwards the request to that processor and the communication controller at that node incurs an occupancy of 2o+C. The only other time occupancy is incurred is when the communication controller at the home node is servicing a write and must send invalidations to all nodes that are sharing the data. In this case the controller incurs an additional occupancy of one cycle per invalidation that it sends.
Bandwidth or gap: Another important parameter related to the communication architecture is the node-to-network bandwidth, which determines how fast data can be transferred through the network interface, i.e. between the communication controller and the network itself. We ignore the gap parameter in the main discussion in the paper, and simply compute the node-to-network bandwidth requirements of the individual applications in Section 4.3. We ignore gap because the node-to-network bandwidth we assume (400MB/s peak, which corresponds to MPP networks on next-generation machines) is large enough to never be a performance bottleneck in our experiments. For coherence messages that do not carry data, the occupancy of the communication controller always dominates our gap limitation. For messages that carry data, 400MB/s node-to-network bandwidth can theoretically become the bottleneck before controller occupancy for the two lowest occupancies we examine. However, we never observed the symptomatic filling of network interface buffers in practice, both because a given processor allows only a limited number of outstanding requests, and since transactions that do not involve data are usually interspersed with data requests.
Given these assumptions about l, o and, g, let us examine the path of a read miss to a line that is allocated on a remote node and is clean at its home. The request travels through the communication controller on the requesting node (o), traverses the network (l), travels through the communication controller at the home where the request is satisfied (2o), traverses the network again (l), and finally travels back through the communication controller at the source node (o). Including the fixed external interface delays into and out of each controller (kin, kout) leads to a total round-trip latency as seen by the processor (without any contention) of kin + o + kout + l + kin + 2o + kout + l + kin + o + kout for the miss, or 2l + 4o + 6k if we assume kin = kout. If the line were dirty at the home node's cache, there would be an extra fixed cost of C at the home for retrieving the data from the cache. For a line that is dirty in the cache of a third processor (not the requestor or the home), the latency would be kin + o + kout + l + kin + 2o + kout + l + kin + 2o + C + kout + l + kin + o + kout, or 3l + 6o + C + 8k.
The network latency l and the controller occupancy o are the variables in the above costs. We study a range of values for each variable (instantiated as L1, O1, and multiples of these), as shown in Table 2.1, covering a variety of interesting architectural alternatives. We have used reasonable estimates to characterize the latencies and occupancies of these architectural alternatives, though of course each alternative could be customized or improved. Our latencies l vary from tightly-coupled, low-latency MPP networks, through high-latency physically distributed MPP networks, all the way to commodity networks composed of ATM switches(2). Small values of occupancy o represent communication controllers which are tightly-integrated, hardwired state machines [ACD+91, KSR92, LLG+92]. As o increases the controller becomes less hardwired and more general-purpose, from specialized co-processors [KOH+94], through inexpensive off-the-shelf processors on the memory bus, to an off-the-shelf processor located on the I/O bus of the main processor. The cycle counts used in Table 2.1 and the remainder of this paper are in terms of 100 MHz system cycles, i.e. the clock speed of the controller and the rest of the node, not that of the main processor. The entries in Table 2.1 correspond to specific values in our range of latencies and occupancies. The L1, O1 point is our base architecture, with a network latency of 25 system cycles or 250ns, and a controller occupancy of 7 system cycles(3). The other entries are multiples of these base values.
Table 2.1. Machine Configurations and their Abbreviations
-----------------------------------------------------------------------------------------------
25 50-200 400 800
Distributed Aggressive Today's
MPP MPP ATM ATM
-----------------------------------------------------------------------------------------------
< 7 L1, O1 L2, L4, L8 L16, O1 L32, O1
Hardwired Controller O1
-----------------------------------------------------------------------------------------------
14-28 L1, L2, L4, L8 L16, L32,
Customized Co-processor O2-O4 O2-O4 O2-O4 O2-O4
-----------------------------------------------------------------------------------------------
56 L1, O8 L2, L4, L8 L16, O8 L32, O8
General-purpose Co-processor on Memory Bus O8
-----------------------------------------------------------------------------------------------
112 L1, O16 L2, L4, L8 L16, O16 L32, O16
General-purpose Co-processor on I/O Bus O16
-----------------------------------------------------------------------------------------------
With this context established, we now present our framework for studying the effects of varying l and o on system performance over a range of important parallel computations.
The impact of communication performance on overall system performance depends not only on the structure of the communication but also on the ratio of computation to communication. Also, the impact of communication depends on whether communication really is the important performance bottleneck to begin with, or whether the bottleneck is something else like load imbalance. For a given number of processors, both these issues---the computation-to-communication ratio, and what the dominant bottleneck is---usually depend on the problem size that is used. Larger data sets usually improve both the computation-to-communication ratio and the load balance, and hence allow a machine to deliver better parallel performance or speedup relative to a uniprocessor implementation.
Thus, if an application with a given problem size delivers good parallel performance on an aggressive architecture but not on one that uses a commodity controller and network, this does not in itself mean that the less aggressive communication architecture is inappropriate for that application. There may be a problem size for which the less aggressive architecture performs well too, and perhaps another problem size for which it performs almost as well as the aggressive architecture. The question is how large are these problem sizes relative to the base problem size, and are they still realistic or interesting. Thus, we believe that the best way to cast the effectiveness question is: Given an application, a number of processors, and values for l and o that characterize the network and controller, what is the minimum sized problem that can deliver a desired level of parallel performance?
The question that remains is the choice of the "desirable" level of performance. Our measure of parallel performance is the parallel efficiency of an execution; that is, the speedup of the parallel execution over a sequential implementation of the application on a uniprocessor, divided by the number of processors used (64). Typically, the larger the efficiency the larger the problem size needed for a given combination of l and o. Thus, the efficiency level we choose is an important determinant of the constant factors in the expression for the required problem size. Furthermore, it can also affect the growth rate of the required problem size with l and o, if changing the desirable efficiency level changes the relative importance of different performance bottlenecks. For example, for an efficiency level of 30%, the dominant bottleneck to overcome by increasing problem size may be communication, but for a 95% efficiency level it may be load imbalance. The bottlenecks also may not behave in predictable ways as problem size or efficiency level change, particularly for irregular applications. In most cases, however, if the dominant bottleneck does not change, then the chosen level of efficiency will not affect the growth rate of the required problem size but only the starting point. We assume a desired parallel efficiency of 60% in this paper.
Given the large communication latencies on DSM machines, it is natural to try to hide these latencies when possible. Latency can be hidden by various techniques, all of which exploit the availability of additional bandwidth and require that the processor allows multiple outstanding references. To hide write latency, we assume that the architecture supports a relaxed consistency model. Read latencies are typically more difficult to hide, and it is not clear how successfully communication latencies for read misses will be hidden in practice. Where applicable, we use two versions of our applications: one that tries to hide read latency with software-controlled prefetches that we insert in the application by hand, and the other that does not.
We assume a fast, next-generation main processor that can issue up to three memory references every 100 MHz system cycle. Because the processor controls its own secondary cache, we assume that it takes 15 system cycles for the controller to retrieve state information from that cache when necessary. This is the value of C used in our simulations (see Section 2). The caches are 1 MB in size, two-way set associative, and have a line size of 128 bytes. We also assume that the processor has both prefetch and prefetch exclusive instructions. In our processor model a load miss stalls the processor until the first double-word of data is returned, while prefetch, prefetch exclusive, and store misses will not stall the processor unless there are already references outstanding to four different cache lines. While this upper bound of only four outstanding cache lines can limit the amount of latency that a processor can hide with bandwidth, it is nonetheless more aggressive than the current situation in commodity microprocessors.
Table 3.1. Applications and Communication Patterns ------------------------------------------------------------------------------------------------------- Application Description Communication Pattern ------------------------------------------------------------------------------------------------------- Barnes Barnes-Hut hierarchical N-body simulation irregular, hierarchical Ocean Multigrid large scale ocean simulation nearest neighbor iterative, hierarchical Water Molecular dynamics simulation structured, many-to-many FFT Radix Root-N Six-Step Fast Fourier Transform regular, all-to-all, blocked matrix transpose LU Blocked dense LU decomposition structured, one-to-many Radix Integer radix sort irregular, all-to-all -------------------------------------------------------------------------------------------------------The applications we use in our study are summarized in Table 3.1.The programs were chosen because they represent a variety of important scientific computations, including both kernels and complete applications, and they have different communication patterns and requirements. Barnes is representative of the class of hierarchical N-body methods, which are used in the domains of astrophysics, electrostatics, and plasma physics, among others. Ocean is representative of many computational fluid dynamics applications on regular grids. Water is representative of a wide range of computational chemistry applications which compute particle interactions based on a cutoff radius. FFT forms the computational core of a wide variety of applications, including image and signal processing as well as climate modelling. The most common need for large dense LU factorization is in radar cross-section problems; however, for our purposes dense LU factorization is very similar to more widely used sparse matrix factorization techniques (such as blocked Cholesky factorization [Roth93]), and of various other matrix factorization and eigenvalue methods. Finally, Radix is a widely used sorting algorithm. Descriptions of the applications can be found in: Barnes [HS94]; Radix and Ocean [WSH94]; Water [SWG+94]; FFT and LU [RSG93]. The applications are quite highly optimized to improve communication performance, and particularly to reduce spurious hot-spotting or contention effects that adversely impact occupancy. The codes for the applications are taken from the SPLASH-2 application suite [SWG+94], although Radix was modified to use a tree data structure (rather than a linear key chain) to communicate ranks and densities efficiently.
where Tcomp is the uniprocessor computation time, Vcomm is the volume of communication (number of communication misses incurred on all processors), and TL and TC are the average stall times due to latency and contention, respectively, for each communication.
For a fixed problem size and number of processors, both Tcomp and Vcomm are constant. TL varies linearly with l and o. The remaining question is how the contention component, TC, varies with l and o. If controller occupancy contributes only to latency and has no contention component (TC = 0 for all o in the above formula), then an increase in occupancy is indistinguishable from holding occupancy constant and making a corresponding increase in network latency. We define communication latency as the round-trip latency, assuming no contention, for a remote miss that is satisfied by the main memory of the home node (computed as 2l+4o+6k in Section 2). On a graph of parallel efficiency versus communication latency, if TC were 0 then all points in the l, o design space would fall on the same curve. The exact shape of the curve can be gleaned from Eq. 4.1, once the constant factors Tcomp and Vcomm are known. Now suppose occupancy causes contention for the controller as well. As o increases, the contention worsens and the parallel efficiency becomes worse than it would have had the occupancies stayed the same but network latency been increased correspondingly. The efficiency versus communication latency curve for a larger occupancy would therefore be below that for a lower occupancy. The impact of the contention component of occupancy is likely to be larger when network latencies are smaller, which means that the curve will start to flatten at smaller latencies. However, as latency increases, the effect of contention should diminish, and eventually the curve should approach the curves for lower occupancies.
To understand the performance impact of l and o, we seek the answers to the following questions: (i) starting from the base architecture, how does increasing network latency degrade performance for the base problem size, both with and without prefetching; and (ii) to what extent does controller occupancy cause contention in addition to contributing latency, both with and without prefetching, and how does this contention affect parallel efficiencies for realistic architectures with different occupancies and latencies. Other interesting questions that we answer in the process are: What is the problem size needed to obtain 60% parallel efficiency on the base architecture, which represents an aggressive next-generation multiprocessor, both with and without prefetching; and what are the node-to-network bandwidth requirements for the base problem size?
Figure 4.1. FFT modelling results for both the (a) non-prefetched and (b) prefetched versions
The actual results for high occupancies and low latencies suggest that occupancy indeed contributes to contention even without prefetching. To model contention, we use a queueing model to determine the average number of requests that are waiting for the communication controller when a request arrives. The simplest queueing model assumes that there is a maximum of one remote read request that the controller has to handle, together with its own processor's read request, because reads are blocking and the processors are kept in synch. We found, however, that a model that only takes read misses into account (not shown in Figure 4.1) does not do much better than the no contention model. The model performs poorly because data copying in all but the first transpose phase causes invalidations to processors that previously read those data in the prior transpose phase. The invalidations consume occupancy, and we must include them in the model. If the invalidations are assumed to occur uniformly during the transpose phase, we get the smooth invals model, which is still 15% off with the O16 controller at low latencies. It is only when the queueing model is modified to take the burstiness of coherence traffic (resulting from the interactions of multi-word cache lines with the patterns of reading and writing data in the transpose phases) into account, that it matches the simulations well, as shown by the bursty invals curve in Figure 4.1(a).
Although it is possible to create a model that accurately predicts the behavior of the FFT, this model proved surprisingly difficult to generate in light of the relatively simple structure of the computation. The prefetched version of the transpose phase is even more complex to model, because processors prefetch data from one processor while they are still communicating with another. Clearly if modelling contention and invalidations is necessary when there are no prefetches, it is even more important when there are, as Figure 4.1(b) shows. The best model we found for the prefetched version sets an upper bound on the number of messages a controller may receive by assuming it handles requests by at most two other processors at any time. Unfortunately, it is not as accurate as the best non-prefetched model for all occupancies.
In both the prefetched and non-prefetched versions it is the occupancy-related contention effects that make accurate modelling difficult (the divergence of the simple models in Figure 4.1 gets worse with increasing occupancy and better with increasing latency). The key property that enables us to model the non-prefetched version of FFT well is that we are able to set a tight limit on the number of messages the controller has to handle in a fixed amount of time. It is difficult to set a similar limit in the other less easily analyzed applications.
Without Prefetching: Figure 4.2(a) graphs parallel efficiency vs. communication latency for our base FFT problem size (256K points)(5). The curves generally have the form predicted in Section 4.2.1. The first interesting result is that larger occupancies lower the curves, indicating that the contention component of occupancy is indeed important, even without prefetching. The curves also begin to flatten as o is increased, which indicates that the controller starts to saturate.
Note that all curves nearly converge at high values of l, implying that at today's ATM latencies controller occupancy does not have a large impact on overall performance for this problem size without prefetching. Conversely, for a range of MPP and distributed MPP network latencies, controller occupancy is a critical determinant of overall performance. What may be surprising are the values of controller occupancy at which the curves begin to diverge at low l. The difference between the O1 and O2 curves for the smallest value of l is small (6%), but the difference from O2 to O4 is greater (13%), and gets progressively larger from O4 to O8 (general-purpose processor on memory bus; 25%) and from O8 to O16 (general-purpose processor on I/O bus; 44%).
With Prefetching: Figure 4.2(b) shows that with prefetching, the curves are no longer concave. In fact, they are almost linear with communication latency and flatten out as o increases. Unlike the non-prefetched case, the curves no longer converge because the contention component of occupancy affects overall performance even at high network latencies. Prefetching improves performance more at low o and moderate l than it does at higher values of o and l. At high l, we cannot hide all the network latency, and beyond a point increases in latency hurt the prefetched case at as quick a rate as the non-prefetched so the curves take on similar shapes. At high o, the controller becomes a bottleneck, as it is unable to match the increased bandwidth needs of prefetching. To support prefetching in DSM machines then, it is crucial to keep the occupancy of the controller low.
Let us look at the graphs from the perspective of a systems architect. If we had a network with latencies like today's ATM networks, how much does the occupancy of our communication controller affect overall performance? For this problem size, the answer is not much if we do not use latency hiding techniques, but significantly if we do. With prefetching, O1 is 2.32 times better than O16 at latency L1, and 1.49 times better at latency L32. On the other hand, if we could reduce our network latency in half (reach the current goal of ATM networks) how much performance would we gain? The answer here depends on the occupancy of the controller. A machine with a controller occupancy of O1 makes a 56% improvement in parallel efficiency as network latency is decreased from L32 to L16, while a machine with controller occupancy of O16 makes a 33% performance improvement. While the relative gains in performance are quite high, the absolute performance of both of these systems is still low compared to the base architecture.
Now suppose we have a low-latency (L1) MPP network. Beyond a very efficient customized controller on the memory bus (O2), controller occupancy is crucial to performance both with and without prefetching. For low occupancy controllers, going to a higher-latency network also hurts performance significantly, though with prefetching the performance impact is smaller. Once the controller is a general-purpose processor (on the I/O or even memory bus), increasing network latency does not significantly affect performance. Note that starting from a very efficient (L1, O1 or O2) machine, doubling controller occupancy hurts more than doubling network latency. As designers of tightly-coupled machines, if the cost considerations for doubling the two parameters are similar, we might favor keeping occupancy low and sacrificing some network latency.
So we see that for FFT, without latency hiding controller occupancy is critical at low network latencies but not at high latencies, while with prefetching it is critical at all latencies. Since high-latency networks make it all the more important to hide latencies if possible, occupancy is in effect critical at all values of network latency (of course, occupancy would become less critical if the network bandwidth were very low as well). An interesting result is that it is the contention component of controller occupancy, not its latency component, that dominates its contribution to performance degradation, both with and without prefetching. We find that for most of our applications, for controller occupancies above O2---which represents an efficient, customized co-processor---70%-95% of the performance degradation due to increasing occupancy is attributed solely to its contention component for all values of network latency. We will now show that these trends are consistent across many classes of applications, and will discuss how our other applications corroborate the detailed results or differ from them.
Table 4.1. Minimum Problem Sizes and Per-Processor Bandwidth Requirements for
theBase Architecture
-------------------------------------------------------------------------------------------
Application Minimum Problem Size Bandwidth (MB/s) Minimum Problem Size Bandwidth (MB/s)
-------------------------------------------------------------------------------------------
Barnes 8192 particles 7.4 N/A N/A
Ocean 258x258 grid 40.3 258x258 grid 46.4
Water 512 molecules 10.9 N/A N/A
FFT 64K points 50.4 64K points 50.5
LU 768x768 matrix 19.3 768x768 matrix 21.3
Radix 2M keys, radix 256 83.7 1M keys, radix 256 82.3
-------------------------------------------------------------------------------------------
Following the framework developed in Section 4.2, we now present base problem size results for all of our applications. We compare these results with those we have already seen for our FFT case study. Radix: The results for Radix shown in Figure 4.3 are similar to FFT, with a few notable exceptions. Like FFT, without prefetching all the curves almost converge by today's ATM latencies (our rightmost points). While the O1 and O2 curves are still very close together, the O8 curve is much flatter than it is in FFT, and the O16 curve is almost totally flat. This indicates that in Radix contention matters more than it does in FFT, particularly at low network latencies.
In the prefetched version of Radix, we again see a linearization of all the curves although less than in prefetched FFT (i.e. prefetching is not as successful in Radix as it is in FFT). Two key prefetching trends continue in Radix: prefetching helps much more at lower values of o, and the curves do not converge to a point at L32, indicating that it is still critical to keep occupancy low when prefetching, even with ATM latencies.
LU: The results for LU (not shown) are also very similar to those for FFT. One significant difference for both prefetched and non-prefetched LU is that the performance is less sensitive to both latency and occupancy. The reason is that LU has a high computation-to-communication ratio, and suffers from significant load imbalance for the base problem size, so its performance is less dependent on communication costs.
Ocean: Ocean, which performs many iterative nearest-neighbor computations on regular grids, depends more on network latency than any of the previous applications, though it depends substantially on occupancy as well (Figure 4.4). This is especially true at higher controller occupancies. The reason is that unlike the previous applications, Ocean cannot take full advantage of spatial locality when it communicates data, leaving it especially sensitive to changes in network latency. Prefetched Ocean cannot hide enough of the latency, so the prefetched curves are also somewhat concave.
Barnes and Water: Figure 4.5 shows the results for Barnes and Water. Neither application includes prefetching, because the high degree of temporal locality (and irregularity in Barnes) makes it difficult to determine what to prefetch and when. For Barnes, the O1 and O2 curves are identical. The O4 curve is different only for the lowest values of l, and the curves do not begin to diverge until O8. Again, the curves all converge at high network latency. Of all the applications, these two have the least performance variation across the design space. In particular, they are the least occupancy-bound of all the applications.
Figure 4.5. Base problem size results for (a) Barnes and (b) Water
To determine the problem size needed to attain a given efficiency, we need to know not only how the amount of computation and communication (Tcomp and Vcomm) scale with problem size (recall Eq. 4.1), but how the latency and contention costs of the average communication (TL and TC) scale as well. Clearly, TL does not vary with problem size. The hope when using a high-occupancy controller is that the contention component TC decreases as problem size increases. The question is whether this is true, or whether TC increases or is independent of problem size. In Section 5.1 we once again examine the problem through a case study of FFT. Section 5.2 presents the results of increasing problem size for all of our applications.
Through simulation we gathered efficiency results for non-prefetched and prefetched FFT at two problem sizes: our base problem size and one that is four times larger. We look at the cross product of O1, O8, O16 and L1, L4, L16, which represent what we believe are realistic machine configurations (see Table 2.1). Increasing the problem by a factor of four did not increase the efficiency much, either with or without prefetching.
An important result is that at all occupancies, the average communication time remains constant in both the non-prefetched and prefetched versions of FFT. This means that the effects of contention do not decrease with an increase in problem size. Although this seems counter-intuitive given that the overall computation-to-communication ratio increases, there is a clear explanation as to why it happens. In many structured applications, communication is isolated in different phases from local computation. This separation has been incorporated in existing models of structured parallel programming such as the Bulk Synchronous Processing (BSP) model [Valiant90]. As a result, although the overall computation-to-communication ratio over the whole application increases with problem size, within the communication phases the ratio remains constant as problem size grows. Since contention depends on the rate or burstiness of communication, and that rate is independent of problem size, it follows that contention (TC) is independent of problem size as well. Thus, FFT indeed requires an exponential increase in problem size to overcome the effects of increased latency or occupancy. Higher occupancies cause more contention, increasing the value of the exponent substantially.
This insight, that TC like TL is independent of problem size, allows us to predict the required problem sizes for FFT as l and o change, as long as we know how TL and TC change with l and o. Since FFT communicates through reads, we can use the average remote read miss time from the simulation results or our model in Section 4.2.1 to estimate TL + TC. The simulations also provide us with the constants for the computation time. Optimistically assuming perfect load balancing, Table 5.1 shows the minimum problem size needed to reach 60% efficiency for the nine selected combinations of l and o. Of these, we were able to simulate the required problem size for an occupancy of O1 and latencies of L1 and L4. The other numbers listed in Table 5.1 are predicted values, although the trends and contention effects have been validated.
Table 5.1. Minimum Problem Size Required for 60% Parallel Efficiency for both
Non-Prefetchedand Prefetched FFT
------------------------------------------------------------
L1 L4 L16 L1 L4 L16
------------------------------------------------------------
O1 2^16 2^18 2^44 2^16 2^16 2^16
O8 2^22 2^28 2^58 2^18 2^18 2^20
O16 2^40 2^46 2^76 2^30 2^30 2^32
------------------------------------------------------------
Without Prefetching: It is clear from the table that increasing latency causes an exponential increase in the required problem size. The contention component of occupancy also has a big impact on the required problem size for FFT, even without prefetching. For example, if the O8 controller had the same contention component TC as the O1 controller, but the communication latency corresponded to O8, the problem sizes for O8 in Table 5.1 would have been from 64 times smaller at L1 to 16 times smaller at L16. For O16, the problem sizes would have been 4096 times smaller at L1 and 1024 times smaller at L16. With Prefetching: For the same l and o, the minimum problem size needed is much smaller than for the non-prefetched version, and depends much less on latency. However, once the latency becomes too large to be hidden, the growth rate is exponential in the amount that cannot be hidden. Contention still plays a critical role in determining the required problem size. With the same contention component of the O1 controller, the O8 controller would only need problem sizes 4 to 16 times smaller than those listed in Table 5.1. The O16 controller is far worse off: It would need a problem 16384 to 65536 times smaller.
For both versions of FFT, the problem size needed to achieve the desired efficiency at high controller occupancies is unreasonably large. The same is true of the non-prefetched version at high network latencies. Compromising the aggressiveness of a communication architecture, then, makes it be extremely difficult to achieve high parallel efficiencies for FFT.
LU: LU scales much better than either Radix or FFT. One reason is that the computation-to-communication ratio in LU grows linearly in the problem size (O(n^3) computation versus O(n^2) communication). LU therefore requires much smaller increases in problem size to reduce relative communication costs. The other is that the main bottleneck for LU on an L1, O1 machine is load imbalance and not communication. Increasing the problem size improves load balance quickly as well.
Table 5.2. Minimum Problem Size Required for 60% Parallel Efficiency for both
Non-Prefetchedand Prefetched LU
-----------------------------------------------------------------------------------------------------
L1 L4 L16 L1 L4 L16
-----------------------------------------------------------------------------------------------------
O1 800^2 (1.0x) 1250^2 (2.4x) 2900^2 (13x) 700^2 (1.0x) 800^2 (1.3x) 1200^2 (2.9x)
O8 1350^2 (2.8x) 1800^2 (5.1x) 3450^2 (19x) 850^2 (1.5x) 950^2 (1.8x) 1350^2 (3.7x)
O16 2000^2 (6.3x) 2400^2 (9.0x) 4100^2 (26x) 1000^2 (2.0x) 1100^2 (2.5x) 1500^2 (4.6x)
-----------------------------------------------------------------------------------------------------
Like FFT and Radix, LU also communicates data in structured phases that have a constant computation-to-communication ratio. Consequently, contention does not decrease with increasing problem size, allowing us to predict the required problem size for machines with larger latencies and occupancies. Table 5.2 summarizes the results. For each entry, the value in parentheses is the ratio of the required data set size to that for an L1, O1 machine. Note that the computation time for LU scales a factor of n faster than the data set. This means that the time required for the "desirable" LU to complete grows more quickly than the table indicates. The time on an L16, O16 machine would be 133 times that on an L1, O1 machine without prefetching and 9 times with prefetching, even though the data set size required is only 26 times and 4.4 times larger, respectively. Ocean: Ocean, which uses nearest-neighbor iterative computations including multigrid, also has a computation-to-communication ratio that scales linearly with problem size and a better load balance than LU. As Figure 5.1 shows, both the non-prefetched and prefetched versions of Ocean scale much better than the previous applications. An important observation is that although even the higher occupancy curves increase substantially in efficiency with larger problem sizes, they still do not assume the shape of the lower occupancy curves. Once again, this is because Ocean also has structured communication, so contention does not decrease with increasing problem size. Table 5.3 shows the problem sizes required for 60% efficiency.
Table 5.3. Minimum Problem Size Required for 60% Parallel Efficiency for both
Non-Prefetchedand Prefetched Ocean
--------------------------------------------------------------------------------------------
L1 L4 L16 L1 L4 L16
--------------------------------------------------------------------------------------------
O1 258^2 (1.0x) 514^2 (4.0x) 1282^2 (25x) 258^2 (1.0x) 386^2 (2.2x) 642^2 (6.2x)
O8 642^2 (6.2x) 898^2 (12x) 1666^2 (42x) 642^2 (6.2x) 642^2 (6.2x) 770^2 (8.9x)
O16 1282^2 (25x) 1410^2 (30x) 2050^2 (63x) 1026^2 (16x) 1026^2 (16x) 1154^2 (20x)
--------------------------------------------------------------------------------------------
Unlike LU, in Ocean both the data set size and the execution time nominally grow as O(n^2) in the grid dimension. However, the implications of latency and occupancy for execution time are nonetheless more severe than for data set size. This is because increasing data set size also requires scaling other parameters (such as the accuracy used in the multigrid solver and the number of times-steps), which increase execution time further. In fact, the numbers for data set size in Table 5.3 are themselves optimistic, since a larger number of grid points causes more time to be spent in the multigrid equation solver, which has the highest communication to computation ratio and the worst load imbalance in the application.Finally, the effect of contention on required problem size is much less for Ocean than it is for FFT. For example, for both versions of Ocean the required problem size for the O8 and O16 controllers would have been at most 4 times smaller if they had the same TC as an O1 controller. Like FFT, the effect of contention is greater in the prefetched version of the code.
Barnes: Unlike the previous applications, Barnes does not have separate phases of communication and computation (though there are more structured versions of the application, written for message passing machines, that do [Salmon90]). As problem size increases, more computation is done between communications, so contention decreases. Since the computation-to-communication ratio also depends on the distribution of particles, predicting the problem size required for 60% efficiency is difficult. However, similar hierarchical N-body applications have an expected ratio that is linear in the problem size [Katz89]. This suggests that scaling hierarchical N-body applications to retain a desired efficiency should be relatively easy, if communication is the primary bottleneck. Unfortunately, the bottleneck is typically load imbalance, and it is difficult to predict how that improves with problem size since there are different computational phases with different levels of imbalance. Doubling the problem size for Barnes improved performance somewhat at higher occupancies, but not much. However, for the sizes of problems that are run on machines today, we expect that all of the machine configurations we study should perform quite well.
Water: Table 5.4 summarizes the minimum problem sizes required for Water. Water also has a computation-to-communication ratio that scales linearly with data set size. The effect of contention on required problem size is less in Water then it is in all the other applications. In fact, at ATM latencies the O8 controller achieves 60% efficiency at the same problem size whether or not it has its inherent value of TC or it has the TC of an O1 controller. Contention is only slightly more important at lower network latencies. Thus, Water and Barnes are examples of occupancy-related contention not always being critical. However, note that the execution time for Water grows as the square of the data set size shown in Table 5.4.
Table 5.4. Minimum Problem Size Required for 60% Parallel Efficiency for Water
-------------------------------------------------------
L1 L4 L16
-------------------------------------------------------
O1 512 (1.0x) 896 (1.8x) 1792 (3.5x)
O8 1152 (2.3x) 1536 (3.0x) 3072 (6.0x)
O16 3072 (6.0x) 3072 (6.0x) 6144 (12.0x)
-------------------------------------------------------
Overall, significant increases in problem size are necessary for the lower-performance networks and controllers to achieve the desired efficiency, although the amount of increase varies depending on the specific type of application. There are many important classes of applications (transform methods, sorting) for which the efficiency lost by a less aggressive architecture---in latency or occupancy---is extremely difficult or impossible to regain by increasing problem size. In most of the applications, contention owing to the occupancy of the controller played an important role in determining the required growth in problem size, and the amount of contention was not reduced by increasing problem size. Finally, the implications of increasing latency and occupancy for execution time, which may be most important, are often more severe than those for data set size.
Our main result, however, is that the occupancy of the communication controller is critical to good performance in DSM machines. For machines with tightly-coupled MPP networks we found that controller occupancy has a large performance impact regardless of whether or not applications incorporated latency hiding techniques. For machines with loosely-coupled networks, we showed that without latency hiding, occupancy did not matter to overall performance. But with latency hiding, controller occupancy once again became a performance bottleneck. Since machines with high-latency networks will need to incorporate latency hiding whenever possible to obtain good performance, these results show that it is important to use low-occupancy communication controllers at any network latency.
Moreover, it was not the latency component of the higher occupancy controllers that caused performance degradation, but rather the contention component, even without latency hiding. This contention component proved difficult to model analytically, especially for applications that included latency hiding. We also found that several important classes of applications communicate in "bulk synchronous" phases where the computation-to-communication ratio is constant. As a result, increasing the problem size did not alleviate contention.
Finally, our results showed that for many classes of applications, it is extremely difficult for architectures with higher values of network latency or controller occupancy to achieve high parallel efficiency. That is, the problem size needed to maintain the desired efficiency quickly becomes unreasonable. There are applications that attain the desired efficiency with reasonable data set sizes, although for many of these applications the execution time scales much faster than the required data set size.
The tendency among DSM designers has been to focus on latency and network bandwidth as the important performance issues in the communication architecture. Our results demonstrate that the occupancy of the communication controller is just as important to overall performance, if not more so. Designers should therefore pay careful attention to controller occupancy when making decisions about using commodity parts in their communication architectures.
[CKP+93] David Culler et al. LogP: Toward a realistic model of parallel computation. In Proceedings of the Principles and Practice of Parallel Processing, pages 1-12, 1993.
[Golds93] Stephen Goldschmidt. Simulation of Multiprocessors: Accuracy and Performance. Ph.D. Thesis, Stanford University, June 1993.
[HKO+94] Mark Heinrich et al. The Performance Impact of Flexibility in the Stanford FLASH Multiprocessor. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 274-285, San Jose, CA, October 1994.
[HS94] Chris Holt and Jaswinder Pal Singh. Hierarchical N-Body Methods on Shared Address Space Multiprocessors. SIAM Conference on Parallel Processing for Scientific Computing, February 1995, to appear.
[Katz89] Jacob Katzenelson. Computational Structure of the N-body Problem. SIAM Journal of Scientific and Statistical Computing, pages 787-815, July 1989.
[KOH+94] Jeffrey Kuskin et al. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pages 302-313, Chicago, IL, April 1994.
[KSR92] Kendall Square Research. KSR1 Technical Summary. Waltham, MA, 1992.
[LLG+92] Daniel Lenoski et al. The Stanford DASH Multiprocessor. IEEE Computer, 25(3):63-79, March 1992.
[RLW94] Steven K. Reinhardt, James R. Larus, and David A. Wood. Tempest and Typhoon: User-Level Shared Memory. In Proceedings of the 21st International Symposium on Computer Architecture, pages 325-336, Chicago, IL, April 1994.
[Roth93] Edward Rothberg. Exploiting the Memory Hierarchy in Sequential and Parallel Sparse Cholesky Factorization. Ph.D. Thesis, Stanford University, January 1993.
[RSG93] Edward Rothberg, Jaswinder Pal Singh and Anoop Gupta. Working Sets, Cache Sizes, and Node Granularity for Large-Scale Multiprocessors. In Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 14-25, San Diego, CA, 1993.
[Salmon90] John K. Salmon. Parallel Hierarchical N-body Methods. Ph.D. Thesis, California Institute of Technology, December 1990.
[SFL+94] Ioannis Schoinas et al. Fine-grain Access Control for Distributed Shared Memory. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 297-306, San Jose, CA, October 1994.
[SHG93] Jaswinder Pal Singh, John L. Hennessy, and Anoop Gupta. Scaling parallel programs for multiprocessors: methodology and examples. IEEE Computer, July 1993.
[SWG+94] Jaswinder Pal Singh et al. The SPLASH-2 Suite of Parallel Applications, Technical Report to appear, Stanford University.
[Valiant90] Leslie G. Valiant. A Bridging Model for Parallel Computation. In Communications of the ACM, 33(8):103-111, August 1990.
[WSH94] Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 219-229, San Jose, CA, October 1994.