
1 Introduction
2 Integrating Monitoring and Coherence2.1 Performance Monitoring3 FlashPoint: A Case Study
2.2 Cache Coherence3.1 FlashPoint Implementation4 Other Coherence Mechanisms3.1.1 The FLASH Architecture3.2 FlashPoint Performance
3.1.2 The Default FLASH Protocol
3.1.3 The FlashPoint ProtocolPerformance Data Structures3.1.4 Fine-Grained Timing Support
Program Access to Performance State
3.1.5 Summary3.2.1 Experimental Methodology
3.2.2 Accuracy Impact of FlashPointCache Miss Counts3.2.3 Performance Impact of FlashPoint
Cache Miss Latencies
Per-Bin StatisticsImpact on Handler Latency3.2.4 Summary
Impact on Handler Occupancy
Scaling Number of Processors
5 Related Work
6 Conclusions
7 Acknowledgments
8 References
A large and increasing gap exists between processor and memory speeds in scalable cache-coherent multiprocessors. To cope with this situation, programmers and compiler writers must increasingly be aware of the memory hierarchy as they implement software. Tools to support memory performance tuning have, however, been hobbled by the fact that it is difficult to observe the caching behavior of a running program. Little hardware support exists specifically for observing caching behavior; furthermore, what support does exist is often difficult to use for making fine-grained observations about program memory behavior. Our work observes that in a multiprocessor, the actions required for memory performance monitoring are similar to those required for enforcing cache coherence. In fact, we argue that on several machines, the coherence/communication system itself can be used as machine support for performance monitoring. We have demonstrated this idea by implementing the FlashPoint memory performance monitoring tool. FlashPoint is implemented as a special performance-monitoring coherence protocol for the Stanford FLASH Multiprocessor. By embedding performance monitoring into a cache-coherence scheme based on a programmable controller, we can gather detailed, per-data-structure, memory statistics with less than a 10% slowdown compared to unmonitored program executions. We present results on the accuracy of the data collected, and on how FlashPoint performance scales with the number of processors.
Despite the great importance of memory system behavior to application performance, it is difficult to build tools to monitor such behavior. The main challenges are:
The key contribution of this paper is to expose the parallels between the system support that is desirable for memory system performance monitoring and the system support that is already implemented on cache-coherent shared memory multiprocessors. The main observation is that the mechanisms used to implement cache coherence are often quite similar in structure to what is desired for performance monitoring. In other words, while memory performance monitoring does need particular forms of support, we argue that in many cases that support has already been implemented, albeit for another purpose.
To demonstrate concretely the utility of integrating performance monitoring with coherence support, we have implemented FlashPoint, a performance monitoring tool for the FLASH Multiprocessor [KOH+94]. The tool is integrated into the software handlers of FLASH's flexible coherence protocol. It takes advantage of FLASH's existing mechanisms for (i) automatic software activation on each second-level cache miss and (ii) per-cache-line accounting of memory usage. FlashPoint keeps detailed memory statistics for individual program code and data structures. By taking advantage of existing cache-coherence support, FlashPoint is able to collect these fine-grained statistics at low overheads. Gathering per-data-structure statistics incurs less than 10% overhead; gathering per-data-structure, per-procedure statistics has higher overheads but still generally results in less than 2X slowdown.
FlashPoint represents an interesting and concrete demonstration of a symbiosis between cache coherence and performance monitoring in multiprocessors. The paper, however, is not simply a description of this particular tool, but rather an analysis of the natural parallels between these two system functions, and the opportunities for amortizing hardware and systems costs across both of them.
Section 2 outlines the basic needs of performance monitoring and coherence systems, and shows the commonalities between the two. As a case study, Section 3 describes the implementation and performance of FlashPoint. Section 4 expands the discussion to consider possible implementations in other styles of parallel computers. Section 5 discusses related issues and future work, and Section 6 offers our conclusions.
As an example, consider a hardware histogram monitor, such as the DASH Hardware Performance Monitor [Hei93]. Here, monitoring is triggered by each reference on the shared cluster bus. Hardware (in essence, a handler) responds to the trigger by incrementing appropriate counts in a set of histograms. These banks of memory-mapped histograms form the statistics state information for this performance monitoring system.
For comparison, consider a software-based approach such as MemSpy [MGA95] or CProf [LW94]. In these tools, the events-to-be-monitored are memory references in the code. Trigger points are created at these events by instrumenting them with calls to software procedures, or handlers. These monitoring routines update their data structures with statistics about the reference and then return control to the application. Although the implementation of these monitors is quite different from the hardware-based approaches, the three basic components of the performance monitor are still clearly present.
Finally, systems using hardware miss counters (such as implementations of the R10000 [JHei95], Pentium [Mat94] or DEC Alpha [DEC92] architectures) trigger on first level cache misses, and cause counters to be incremented. In this case, the trigger is the miss signal resulting from the cache probe, the increment is the hardware handler, and the counter register is the statistics state information.
For many tools, the main limitation to their efficiency and accuracy has been the lack of lightweight, selective-notification mechanisms for performing the first of the three main components: identifying and triggering on events to be monitored. While bus monitors or on-chip cache miss counters allow one to trigger on all memory events (at a particular level of the hierarchy), it can be difficult to trigger selectively on some events or to take actions other than aggregate counting. For example, tools like MemSpy categorize miss counts according to the code and data structures that incurred the misses. This categorization is still quite time-consuming even with support such as on-chip miss counters or hardware histogram bus monitors. If miss detection could be performed instantaneously in hardware, the overhead for such a tool could drop by a factor of two or more in some cases [Mar93]. The following subsection introduces how cache-coherence mechanisms can help support selective notification and statistics categorization.
At the heart of most current high-performance shared-memory multiprocessors is a cache coherence protocol, which guarantees that the data in each processor's cache is kept in sync with the data in other caches throughout the system. Hardware or software will track which lines are cached in which processor caches, watch for activity on those lines, and send out updates or invalidations accordingly. Although cache coherence strategies vary, a common theme is that the hardware or software intended to implement cache coherence will be triggered on "interesting" references (i.e., loads or stores that cause a protocol state change for the referenced cache line). This triggering activates hardware or software handlers. The handlers perform functions such as fetching the data or invalidating it from other caches. They may also maintain per-cache-line or per-page state information that the protocol uses to maintain coherence.
One can also extend beyond this surface similarity. In shared-memory cache coherence protocols, trigger points are those references that require coherence actions---either to fetch a line into the cache, or to upgrade its state from shared to exclusive. In scalable (e.g. directory-based) protocols, the memory events that cause major delays are almost always those requiring coherence actions; therefore, "interesting" coherence events are "interesting" memory performance monitoring events.
Once a cache-coherence handler has been invoked, it performs (either in hardware or software) a table lookup to check on the coherence state for the cache line. Based on this information, it sends out messages as needed and updates the coherence state information. One could modify the protocol handler state machine or software to update statistics counters as part of each coherence action. Furthermore, by modifying the directory storage to include counter bits or to include indexing bits that point to an array of counters, one could categorize statistics for different regions of memory at granularities as small as a cache line. Thus the three required mechanisms for fine-grained memory performance monitoring are present in the standard cache-coherence mechanisms.
FlashPoint gathers statistics on the number and latency of read and write misses. It maintains separate categories of statistics for local and remote references. It also keeps counts on the number of invalidations required and on the number of cold misses in the program (i.e., misses to memory not previously referenced since monitoring began). The information gathered by FlashPoint can be viewed with a graphical user interface to give the programmer a display of the program's memory behavior in familiar terms (procedure and variable names from the program).
The FlashPoint method has one potential disadvantage, however. Unlike simulation-based approaches, FlashPoint is intrusive because it augments the default coherence protocol with performance monitoring information. As a result, the timing of a program in a FlashPoint system is not identical to its timing in an unmonitored system. Later in this section, we examine the implications of this approach on monitoring accuracy and application performance.
FlashPoint is built by augmenting the default protocol used in the FLASH Multiprocessor. For this reason, we first describe the FLASH architecture and the default FLASH protocol. Following that, we describe the FlashPoint protocol in some detail---covering its data structures and control mechanisms.

Figure 1. FLASH Node Organization
The protocol processor is a programmable controller that implements a subset of the DLX instruction set [HP90] with extensions that include bitfield operations and a bit-wise conditional branch. Both extensions accelerate the state-bit manipulations that are common in the protocol handlers. The PP is implemented as a dual-issue 64-bit machine with static scheduling. It fetches a pair of instructions on every PP cycle. Since the PP does not support interrupts or exceptions, these instructions are executed unconditionally. The PP also does not support pipeline interlocks or most types of resource conflict detection, so the PP programmer or compiler must avoid these statically. Our FlashPoint implementation takes advantage of these PP characteristics: some of the FlashPoint code added to each handler fills otherwise-empty slots in the statically-scheduled protocol handlers.

Figure 2. Block Diagram of FLASH's MAGIC Chip
The protocol processor's memory hierarchy consists of a 16 KB direct-mapped on-chip instruction cache and a 1 MB, 2-way set-associative off-chip data cache. With these parameters, the protocol processor code experiences very few cache misses; they are negligible in the results presented in this paper for both the default protocol and the augmented FlashPoint protocol. The MAGIC chip has a target frequency of 100MHz, therefore all latencies collected by FlashPoint are stated in terms of these 10 ns system cycles.
The directory state is maintained via a set of software handlers. The base protocol accepts incoming messages from either the local compute node or from the network. Incoming read or read-exclusive requests may be satisfied either locally (if this is the home node for the line) or remotely (otherwise). For local requests, the protocol processor checks and updates the line's state, and on writes, may also send invalidation requests to other nodes. For remote requests, the local protocol processor sends a corresponding request to the home node for that line. For each request message in the system, a corresponding reply message is sent back; these return requested data or acknowledge completion of a requested action.
The second data structure is the storage for the per-bin statistics. It is an array of statistics structures where each individual record includes counters for read misses, write misses, local vs. remote misses, and so on. These records are organized as a two-dimensional array where the first dimension is the procedure number, and the second dimension is the bin number. This allows FlashPoint to maintain per-data-structure, per-procedure statistics. For efficiency, a pointer to the current procedure's array of bins is stored in a register. In the applications used in our study, this data structure's average size is approximately 256 KB.
When monitoring per-procedure statistics, the user-level system also needs to pass the protocol the procedure number on every function call and return. Whenever the program switches procedures the difference between the current procedure number and the new procedure number, the delta, is sent down to the memory system. The current bin pointer is adjusted by delta times the size of the bin array.(2) This is implemented as an uncached load where the page offset bits contain the procedure delta.
After the program has run, the statistics need to be read by the main processor. There are many ways of doing this. For example, one could send a command to the PP that causes it to flush the bin data structures from its cache. The main processor can then read the bin structures like it would any other part of memory.
We evaluate FlashPoint on a subset of the SPLASH-2 suite of parallel applications [WOT+95]. The four programs considered are: FFT, LU, OCEAN, and RADIX. The FFT benchmark uses a data set of 256K points. LU performs an LU decomposition on a 512x512 matrix with a block size of 16. OCEAN is a scientific program that studies large-scale ocean movements; we run it here on a 258x258 grid. Finally, RADIX performs an integer sort with radix 256 and 1 million keys. Except where specified, we use 16 processor runs of these applications. Our results are presented in two parts. First we consider the accuracy of the statistics gathered by FlashPoint by comparing the miss counts and miss latencies it reports with those for the same application run with the default coherence protocol. Subsequently, we quantify the performance overhead of running with the FlashPoint protocol as compared to running with FLASH's default protocol. Throughout these results, we present overview statistics for all applications. Occasionally we also present results focusing in on a single application's behavior. Typically, we focus on FFT, but its behavior is not qualitatively different from the other three applications.
To determine the accuracy impact of FlashPoint, we compare the cache statistics collected by a run with the FlashPoint protocol (in which the timing and perturbation of FlashPoint are included) against the statistics collected by a control run of the simulator. In the idealized control run, the simulator collects the same statistics that FlashPoint would, but uses the default protocol so the program is not charged for the timing effects of gathering these statistics.
Table 1: Cache Miss Counts (x103)
-----------------------------------------------------------------------------------------------------------------------------
Application DP FP NoProc Error FP Proc Error
-----------------------------------------------------------------------------------------------------------------------------
FFT 215 / 122 / 93 214 / 121 / 93 -1% / -1% / 0% 215 / 121 / 93 0%/-1% / 0%
LU 225 / 26.9 / 198 224 / 26.2 / 198 0% / -2% / 0% 225 / 27.8 / 198 0%/ 3%/ 0%
OCEAN 452 / 216 / 236 457 / 221 / 235 1% / 2% / 0% 456 / 221 / 235 1%/ 2%/ 0%
RADIX 71.7 / 69.2 / 2.48 71.8 / 69.0 / 2.79 0% / 0% / 13% 71.7 / 69.0 / 2.69 0%/ 0%/ 9%
Write Miss Counts
FFT 105 / 105 / .016 105 / 105 / .019 -1% / 0% / 19% 105 / 105 / .018 0%/ 0%/ 13%
LU 42.5 / 26.1 / 16.5 41.8 / 25.3 / 26.5 -2% / -3% / 0% 43 / 26.5 / 16.5 1% / 2%/ 0%
OCEAN 630 / 630 / .293 629 / 628 / .284 0% / 0% / -3% 629 / 629 / .307 0% / 0% / 5%
RADIX 149 / 40.5 / 108 149 / 40.6 / 108 0% / 0% / -1% 149 / 40.6 / 108 -1% / 0%/ -1%
-----------------------------------------------------------------------------------------------------------------------------
Table 2: Cache Miss Latencies (10 ns system cycles)
-------------------------------------------------------------------------------------------------------------------------
Application DP FP NoProc Error FP Proc Error
-------------------------------------------------------------------------------------------------------------------------
FFT 84 / 22 / 165 95 / 23 / 189 13% / 3% / 14% 95 / 23 / 189 13% / 2% / 14%
LU 141 / 23 / 157 163 / 24 / 181 16% / 5% / 16% 152 / 22 / 171 8% / -1% / 9%
OCEAN 94 / 24 / 158 102 / 28 / 172 9% / 15% / 9% 103 / 28 / 172 9% / 15% / 9%
RADIX 104 / 101 / 190 109 / 104 / 226 4% / 3% / 19% 109 / 104 / 225 4% / 3% / 19%
Write Latency
FFT 77 / 77 / 297 118 / 118 / 283 52% / 52% / -5% 119 / 119 /282 54% / 54% / -5%
LU 62 / 31 / 110 67 / 32 / 120 8% / 2% / 9% 65 / 30 / 122 5% / -5% / 10%
OCEAN 42 / 42 / 286 57 / 57 / 301 34% / 34% / 5% 57 / 57 / 294 35% / 35% / 3%
RADIX 170 / 31 / 222 192 / 33 / 252 13% / 4% / 14% 192 / 32 / 253 13% / 3% / 14%
-------------------------------------------------------------------------------------------------------------------------
For all four applications the cache miss counts gathered by FlashPoint are quite close to the true statistics. In most cases, miss counts for reads and writes differ by less than 3% from ideal. The 9-13% difference in remote read misses for RADIX is caused by an artifact of our simulator, resulting in different placement of static data between the DP and the FlashPoint runs. Notice though that for total read misses in RADIX, FlashPoint barely differs from the default protocol. Clearly, for gathering statistics of this sort, FlashPoint is quite accurate.
Table 3 summarizes some of the FlashPoint per-data structure statistics output for a 16 processor run of FFT. The two top-ranking bins are shown, and they make up 84% of the read and write misses for this run of the code. For each data structure (there are two shown) and each statistics metric (there are ten shown), the table presents the proportion of that metric caused by this data bin. For example, the top-ranking data structure (a pointer to doubles, called trans) is responsible for 29% of the local read misses, 33% of the remote read misses, and 65% of the total write misses. We show these proportions for both FlashPoint and the base protocol. We also show the relative errors between the base, unperturbed results and the FlashPoint results. The results here are extremely promising. All FlashPoint measurements are within 7% of their unperturbed values. Even in cases (such as FFT's write latencies) where FlashPoint's absolute measurements deviate due to perturbation, the relative proportion of latencies by data structures can be reported quite accurately. For programmers assessing which variables make the most difference in their program's performance, this metric is arguably more important than the absolute latency measurements.
Table 3: Per-data-structure Statistics for FFT ------------------------------------------------------------ Metric DP run FP run Error ------------------------------------------------------------ Top Ranking Data Bin: (double*).trans Read Miss Local 29% 32% 7% Read Latency Local 28% 30% 6% Read Miss Remote 33% 33% 0% Read Latency Remote 35% 34% -2% Write Miss Local 65% 68% 4% Write Latency Local 62% 64% 2% Write Miss Remote 0% 0% 0% Write Latency Remote 0% 0% 0% Cold Misses 33% 33% 0% Invalidation Misses 59% 59% 0% 2nd Ranking Data Bin: (double*).x Read Miss Local 29% 27% -7% Read Latency Local 31% 29% -6% Read Miss Remote 66% 66% 0% Read Latency Remote 61% 60% -1% Write Miss Local 34% 31% -7% Write Latency Local 38% 36% -4% Write Miss Remote 0% 0% 0% Write Latency Remote 0% 0% 0% Cold Misses 33% 33% 0% Invalidation Misses 41% 40% -1% ------------------------------------------------------------

Figure 5. Application Parallel Execution Time
Table 4: Handler Latency/Occupancy Perturbations (10 ns system cycles)
------------------------------------------------------------
App DP FP Over- DP FP Over-
Proc head Proc head
------------------------------------------------------------
FFT 9/11 9/21 0%/91% 15/29 17/37 15%/30%
LU 9/11 9/21 0%/91% 15/49 22/56 41%/15%
OCEAN 9/11 9/21 0%/91% 16/38 18/47 9%/25%
RADIX 11/13 11/18 0%/36% 16/37 20/44 23%/20%
------------------------------------------------------------

Figure 6. Handler Latencies for FFT

One of the primary drawbacks to using simulation-based tools for performance monitoring parallel programs is that simulation time scales poorly with increased number of parallel processors to simulate. The result is that the runtime overhead of using a simulation-based tool can become prohibitive when one is interested in monitoring even a moderate number of processors. FlashPoint, in contrast, shows quite good scaling behavior with the number of processors.
As an example, Table 5 compares read and write miss counts and latencies generated for 16 and 64 processor runs of FFT. The miss count error remains negligible in the 64 processor case. The error in the miss latencies remains significant, but does not increase much with the number of processors. Table 5 also compares the 16 and 64 processor parallel execution times for each of our three protocol configurations. Surprisingly, the FlashPoint execution time overhead for FFT decreases as the number of processors increase, though handler occupancies are larger for the 64 processor case than for the 16 processor case. Overall as the number of processors increases, the performance trends depend on the application behavior. For example, if increasing the number of processors on a fixed problem size reduces capacity misses, then FlashPoint's relative performance may actually improve with the number of processors, since the number of misses decreases.
Table 5: Scaling FFT to 64 Processors
-----------------------------------------------------
FFT 16P
-----------------------------------------------------
Read Misses (103) 215 214 -1% 215 0%
Read Latency 84 95 13% 95 13%
Write Misses (103) 105 105 -1% 105 0%
Write Latency 77 118 52% 119 54%
FFT 64P
Read Misses (103) 203 203 0% 203 0%
Read Latency 188 231 23% 218 16%
Write Misses (103) 102 102 0% 102 0%
Write Latency 81 126 55% 127 56%
Parallel DP FP Over- FP Over-
Execution Time No head Proc head
(106 cycles) Proc
FFT 16P 3.45 3.62 5% 5.19 51%
FFT 64P 1.11 1.17 5% 1.55 40%
-----------------------------------------------------
FlashPoint's implementation has a number of interesting design tradeoffs that affect both monitoring overhead and accuracy. For example, if the user does not want to collect latency information, but instead is only interested in miss counts, the overhead of FlashPoint would be much smaller. In the current installation, users can choose between NoProc and Proc versions of the protocol. In future installations, more versions could be available. In addition, since the FLASH multiprocessor uses an aggressive, out-of-order execution main compute processor, FlashPoint must be conservative in notifying MAGIC about procedure entry and exit points and data binning information. In particular, we use memory fence instructions to prevent newer loads and stores from bypassing the uncached write that changes the pointer to the current set of bins (which could potentially assign miss information to the wrong statistical bin). It is likely that the error introduced by not inserting the fence instructions is small, and that they can therefore be removed without adversely affecting the accuracy of FlashPoint. Since the fence stall time is the major source of execution time overhead in the per-procedure (FP-Proc) version of FlashPoint, removing the fence instructions will decrease the overall execution time dramatically, and may reduce the total amount of system perturbation as well.
Machines that implement user-level shared memory schemes, such as Typhoon [RLW94], also provide an excellent environment for performance-coherence integration. Typhoon is similar to FLASH in the sense that both architectures include a network interface device with a fully-programmable processor. This allows performance tools to be implemented relatively simply via changes in the protocol code running in the network interface processor. Among systems with programmable protocol processors, variations in block sizes may affect tool performance. FlashPoint benefits slightly from longer cache lines for two reasons. First, the data bin mapping of a new page is done by mapping each of the cache blocks within it, so larger lines require fewer individual mapping operations per new page. Second, FlashPoint's overhead is fixed, regardless of the amount of data being transferred. With longer cache lines, one amortizes this overhead over more bytes of data, and the handler overhead is more likely to be hidden in parallel with the data transfer.
Fully hardwired implementations of directory-based protocols also offer some of the same advantages as FLASH's programmable directory protocol. Whether implemented using a hardwired or programmable controller, directory protocols keep per-cache-line state information. For performance monitoring tools, this per-line information can be extended by only a few bits to store a unique id (bin number) indicating which data structure this cache line is a part of. The main difference is that the decision to support performance monitoring must be made at hardware design-time. In FlashPoint, data structures for per-procedure and per-data statistics are built up in software by the protocol's handler code. To build a similar tool in a directory-based machine with a hardwired controller, additional hardware would be required to record counts and latencies of read and write events. Because of the hardwired implementation, users would have less flexibility in selecting levels of performance monitoring detail.
Over time, scalable parallel machine designs are converging, as are the communication mechanisms in both "message-passing" and "shared-memory" machines. Although not hardware cache-coherent, machines like the Wisconsin COW [HLW95] or upcoming SHRIMP [BLA+94] implementations may also be amenable to integrating communication and performance monitoring. In these machines, the network interface is (or is likely to be) implemented using Myricom switches [BCF+95]. The core of these switches is a programmable controller implemented using the LANai processor. Thus, performance monitoring code could be inserted directly into the LANai handlers that support communication. Unfortunately, the 33MHz LANai processor is not nearly as high-performance as FLASH's protocol processor. Furthermore, LANai is a single-issue processor, so there are fewer unused instruction slots where performance code could be inserted "for free". On the other hand, LANai provides hardware support for two full contexts, and has a mechanism for switching between contexts in a single cycle. The freedom to place the performance instrumentation code in a separate context may allow for better register usage in the communications code itself.
We have also touched only briefly on hardware monitoring support in current machines. In fact, over the past five to ten years, monitoring support has become increasingly available both on research and commercial machines. On-chip hardware counters are a class of monitoring support found on several current-generation processor chips, including the DEC Alpha 21164, MIPS R10000, and Intel Pentium processors. These counters---collecting information on data accesses, cache misses, pipeline stalls, and instruction mix---can be invaluable in collecting fine-grained information about program and system behavior. Unfortunately their main drawback, as discussed in [HMM+96], is that they are primarily intended to offer aggregate counts on hardware performance. Since the counters are often memory-mapped, the overhead to read the counter values can be quite high, so it is difficult to implement tools that can attribute individual cache misses to particular reference points in the code. In addition, the counters on current processors offer no mechanism by which software can observe and react to individual cache misses as in FlashPoint.
Several multiprocessors [Hei93, NAB+94, ABC+95] provide system-level monitoring hardware as well. For example, the DASH multiprocessor includes a monitoring board connected to each shared cluster bus, and the Alewife project includes monitoring hardware on the CMMU. Some such approaches have enough flexibility that they can gather statistics similar to those gathered by FlashPoint. The advantage of these approaches over FlashPoint is that by dedicating hardware for monitoring, they can design the monitor to be able to observe communication behavior without perturbation. The clear disadvantage, however, is their reliance on special-purpose hardware which tools like FlashPoint can circumvent.
Finally, by integrating the performance monitoring into the coherence protocol itself, one can also develop systems where the on-line performance information can be used by the protocol to customize handler actions according to the observed program data usage patterns. For example if the protocol notes that a processor is making frequent remote accesses to lines on a particular page, it could use this information to suggest to the operating system that it might be useful to migrate the page to this processor's local memory. (This is similar to the approach described in [CDV+94], which used dedicated monitoring hardware.) More elaborate protocols might also use similar information to guide decisions of whether to use updates or invalidations to maintain the coherence of each line. For these types of applications, the integration of performance and coherence monitoring is especially useful; less integrated approaches (for example using profile data from a hardware trace buffer) do not work well because of extra overhead in detecting and responding to particular conditions.
The extent to which performance monitoring can be implemented out of existing hardware and software depends on the coherence mechanism being used. In machines with programmable protocol controllers, it becomes especially easy to take advantage of these parallels. Changes required to implement even fairly detailed monitoring can be done entirely by modifying protocol code.
We have demonstrated the usefulness of this approach by implementing the FlashPoint memory performance monitoring tool as part of a software coherence protocol for the FLASH Multiprocessor. FlashPoint obtains detailed memory performance statistics at low overheads with good accuracy. Both per-bin cache miss counts as well as counts aggregated over the whole program run agree well with an unperturbed execution. Absolute measurements of cache miss latencies are more error-prone than cache miss counts, but relative comparisons of latencies attributed to different data structures remain quite accurate.
For a collection of programs from the SPLASH-2 benchmark suite, FlashPoint runtime overhead was less than 10% for gathering per-data structure program statistics. When one uses FlashPoint to gather per-procedure and per-data structure statistics, the overheads increase somewhat, but are still generally less than 2X. This compares favorably with previous approaches with overheads of 10X or more. In addition, FlashPoint's overheads scale well with the number of processors.
Although the paper describes a case study based on a particular multiprocessor implementation, the observation of similarities between coherence and monitoring is both important and general. In light of a large and still-increasing processor-memory performance gap, memory performance monitoring is an essential part of application development in shared-memory multiprocessors. Using the techniques described here, efficient, useful support can be integrated into cache-coherent designs with relative ease and low cost.