The Cache Visualisation Tool
Chapter 6

Conclusions

Because of high memory and network latencies, the cost of cache misses is very high. Cache phenomena such as cache interferences are complex, so it is often difficult for programmers and hardware designers to understand precisely the causes and origins of poor cache behavior. Most analytical tools simply provide bottom lines, such as the hit ratio obtained after executing a code segment. Better analytical tools are therefore needed to help improve software and hardware performance. The CVT is designed to fill this gap.

In this report we first presented some basic cache theory, which led to the implementation of the CVT. We then gave a complete description of the functionality of the CVT, which includes a cache simulator (others can easily be plugged in), an input program and trace emulator, a display environment based on Motif, and a toolbox for setting breakpoints, displaying statistics, specifying methods for coloring cache lines, etc.

Chapter 4 gave an overview of current cache issues and of how software optimizations can address them, by describing current methods and techniques. Another, more important, goal of that chapter was to give a potential user an idea of the benefits the CVT can bring in understanding the exact cache behavior of codes restructured by software optimizations. Chapter 4 starts off by describing how the CVT effectively displays cache interferences, by examining a loop nest whose behavior is difficult to understand and spotting the bottlenecks in the code. Next, one of the best-known software optimizations, blocking, is described by presenting a model, and it is shown that the CVT is an effective tool for finding the optimal blocking size. Nonsingular loop transformations, a more elaborate class of software optimizations, are then presented by describing some theory and models that try to optimize data locality through these transformations.
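As a rough illustration of what a nonsingular loop transformation does to the reference stream a cache sees (not the theory or models of chapter 4, and not the CVT's simulator), the minimal Python sketch below applies a 2x2 nonsingular matrix T to the iteration vector (j, i) of a column-wise walk over a row-major N x N array and executes the iterations in lexicographic order of T·(j, i). The interchange matrix turns the walk into a row-wise one. A toy direct-mapped cache then reports the bottom-line hit ratio for each order. The names order, trace and hit_ratio, N = 64, 8-byte elements and the 1 KB cache with 32-byte lines are hypothetical choices. The sketch also assumes that the loop body may be reordered freely, for instance because it only reads A.

```python
from collections import OrderedDict

# Hypothetical illustration: names and parameters are assumptions, not the CVT.
N = 64                          # A is N x N, 8-byte elements, row-major
ORIGINAL = ((1, 0), (0, 1))     # identity: "for j: for i: use A[i][j]"
INTERCHANGE = ((0, 1), (1, 0))  # loop interchange: "for i: for j: use A[i][j]"


def order(T, n=N):
    """Iterations (j, i) of the original nest, in the order the nest
    transformed by the nonsingular 2x2 matrix T runs them, i.e. sorted
    lexicographically by the new iteration vector T @ (j, i)."""
    (a, b), (c, d) = T
    if a * d - b * c == 0:
        raise ValueError("T must be nonsingular")
    iterations = [(j, i) for j in range(n) for i in range(n)]
    return sorted(iterations, key=lambda v: (a * v[0] + b * v[1], c * v[0] + d * v[1]))


def trace(T, n=N, elem_bytes=8):
    """Byte addresses of A[i][j] (row-major) in the transformed order."""
    return [(i * n + j) * elem_bytes for j, i in order(T, n)]


def hit_ratio(addresses, size_bytes=1024, line_bytes=32, ways=1):
    """Replay addresses through a toy LRU set-associative cache."""
    num_sets = size_bytes // (line_bytes * ways)
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for address in addresses:
        block = address // line_bytes
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # refresh LRU position
            hits += 1
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict least recently used block
            lines[block] = None
    return hits / len(addresses) if addresses else 0.0


if __name__ == "__main__":
    print("column-wise:", hit_ratio(trace(ORIGINAL)))      # column-wise: 0.0
    print("interchanged:", hit_ratio(trace(INTERCHANGE)))  # interchanged: 0.75
```

Sorting the iteration space is only a shortcut for the sketch; a compiler would instead compute the bounds of the transformed nest, for example with Fourier-Motzkin elimination. With these parameters, each column of the column-wise walk touches only two cache sets and every access misses, while the interchanged nest hits on three of every four accesses. The two ratios by themselves do not explain why, which is the kind of gap a visualisation tool is meant to close.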
It is important to note that code restructured by nonsingular loop transformations is very difficult to read, so some leads are given to help the user understand, with the CVT, the phenomena arising from these transformations. Software prefetching is included in this report for two reasons: it is one of the few software techniques for reducing compulsory misses, and it shows how a different simulator can easily be plugged in. A software-prefetched matrix-matrix multiply is analyzed with the CVT and the results are discussed. The last part of chapter 4 deals with sparse codes, a special class of numerical codes that usually have more difficult reference patterns because at least one array is addressed indirectly. This part is important because it shows how traces can be used effectively to gain insight into the behavior of sparse codes.

The preceding chapter discussed hardware organizations and their impact on performance. Analyzing even the very simple nested DO loop used in that chapter already revealed unpredictable cache behavior. Increasing the set associativity is expected to let the hardware avoid certain conflicts in the cache, but this hardware improvement does not always achieve the expected reduction in conflicts. The advantage of a cache with more ways is that a datum can be placed in any of several cache lines of its set. The additional hardware should decrease the number of conflict misses, but it introduces another phenomenon: for a given cache size, the range of set indices becomes smaller as the set associativity increases, which might cause extra conflicts. Performance is so unpredictable because it depends on numerous parameters: the hardware organization, the software techniques applied, and the size, order and access patterns of the data structures used, etc. You might say that relatively small caches suffer more
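That higher set associativity does not always reduce conflicts can be checked on a toy example. The minimal Python sketch below is an illustration under stated assumptions, not a reproduction of the nested DO-loop experiment mentioned above: the capacity is fixed at two cache lines, so going from direct-mapped to 2-way halves the number of sets, and two hypothetical traces of cache-block numbers are replayed through an LRU cache. The function simulate and the traces ping_pong and cyclic are illustrative names.

```python
from collections import OrderedDict


def simulate(blocks, num_lines, ways):
    """Return the number of hits for a trace of cache-block numbers.

    Hypothetical toy model, not the CVT simulator: capacity is fixed at
    num_lines lines, so more ways means fewer sets (num_lines // ways) and
    more blocks competing for each set. Replacement is LRU within a set.
    """
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for block in blocks:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # most recently used goes last
            hits += 1
        else:
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used block
            lines[block] = None
    return hits


# Two blocks that collide in a direct-mapped cache: associativity helps.
ping_pong = [0, 2] * 10
# Three blocks cycled through two lines: associativity hurts under LRU.
cyclic = [0, 1, 2] * 10

if __name__ == "__main__":
    for name, trace in (("ping_pong", ping_pong), ("cyclic", cyclic)):
        for ways in (1, 2):
            hits = simulate(trace, num_lines=2, ways=ways)
            print(f"{name:9s} {ways}-way: {hits} hits of {len(trace)}")
```

In the ping-pong trace the direct-mapped cache misses on all 20 accesses and the 2-way cache hits on 18. In the cyclic trace the direct-mapped cache hits on 9 of 30 accesses, because block 1 keeps a set to itself, while the 2-way LRU cache (here a single fully associative set) never hits. Which effect dominates depends on the reference pattern, which is one way to read the unpredictability described above.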