Circuit Cellar INK® Issue 101 December 1998 81

HP's Precision Architecture (PA), the PA-8500 in Figure 1, includes a whopping 1.5 MB of cache (0.5 MB of instruction, 1 MB of data). Given HP's long-time position in favor of off- versus on-chip cache, such a development is even more notable. Fact is, with tens of millions of transistors to find homes for, big cache is the easiest way out.

Figure 1: With plenty of function units, out-of-order execution, high clock rate, and huge (0.5-MB instruction, 1-MB data) caches, the HP PA-8500 is a good example of the latest trend for performance-at-any-price chips.
[Figure 1 block-diagram labels: Runway bus; system-bus interface; instruction cache; instruction fetch unit with BHT and BTAC; sort; TLB; address reorder buffer, memory buffer, and ALU buffer (28 entries each); dual 64-bit integer ALUs; dual shift/merge units; dual load/store address adders; data cache; dual FP multiply/accumulate units; dual FP divide/SQRT units; rename registers; architected registers; retire.]

Besides making cache bigger, the goal is to build and use it smarter. Even if half a dozen instructions can be found to keep all those execution units fed, the cache can become a bottleneck. Thus, the trend towards nonblocking designs escalates (when a cache miss happens, don't just sit there twiddling your thumbs; try to execute another instruction). The latest designs allow dozens or even hundreds of cache accesses to be pending, without stalling the processor.

As for using cache more intelligently, the earlier trend towards software-directed prefetching, illustrated in Figure 2, has become de rigueur. The idea is to give the cache a head start, with the goal, in a perfect world, being the elimination of the dreaded miss.

Figure 2a: To ease the pain of a cache miss, the HP PA-8500 and other high-end chips employ both hardware and software techniques. One hardware approach is a nonblocking cache that allows multiple outstanding references (b), while software solutions include compiler-inserted prefetch to initiate cache access prior to anticipated use (c).
[Figure 2 timelines, plotted against time: (a) Load-Miss then Use, three times in sequence; (b) three overlapped Load-Misses followed by three Uses; (c) a Load-to-Gr0 prefetch, a run of other instructions, then a Load-Hit.]

The conditional branch has become the bane of heavily pipelined, superscalar, and speculative superdupers. Mere mortal CPUs can only take five and wait for new marching orders (i.e., until the condition resolves).

The latest chips go to extraordinary lengths trying to predict the branch's outcome. For instance, the DEC-now-Compaq Alpha 21264 happily wades 20 branches into the future, relying on a crystal ball that not only includes the usual branch history but also how the program arrived there (see Figure 3).

IN MEMORY OF CRAY

Another example of effective recycling of yesterday's know-how is seen in the widespread adoption of SIMD techniques (i.e., applying a single instruction to multiple data items in parallel). In Cray's day, this technique was known as vector processing.

The appeal lies in the fact that it's relatively easy to find and exploit parallelism in scientific and signal-processing algorithms that rely on vector operations.

Almost all hot chips support vector ops these days, the most well-known example being the Intel MMX. At their simplest, such schemes carve a full-size register into parallel subparts that can be operated on. For example, a conventional 32-bit ADD is extended to perform two 16-bit ADDs or four 8-bit ADDs at once.

The latest generation of pseudo-SIMDs pushes the concept further with wider words, more operands, and extra instructions. Consider Motorola's AltiVec upgrade of the PowerPC architecture. The upgrade adds a complete vector unit featuring 128-bit registers that can be interpreted as 16 × 8-bit, 8 × 16-bit, or 4 × 32-bit data.

There are 162 new instructions, including both the typical intra-element and the newly introduced inter-element operations. Figure 4 shows how the two make short work of the inner loops at the heart of scientific and DSP code.

Although his life was cut short by a car accident, the spirit of Seymour Cray lives on in the SV1 from Silicon Graphics. The SV1 not only incorporates SIMD techniques, but because SGI purchased Cray's company, it is also upwardly compatible with his YMP.

As a classic vector processor, the SV1 faces a different set of challenges. For instance, there's little concern with conventional benchmarks like SPEC. The only goal is crunching through vectors at blazing speed, and we're talking billions of operations per second.

One source of head scratching comes when vector ops and cache get in each other's way. Vector data may not be reused, and worse, arrays (i.e., vectors of vectors) introduce the issue of stride. For instance, a column operation on a 256 × 1024 array calls for accessing every 1024th element, which is contrary to the concept of locality (i.e., the next access is near the previous one) on which the cache concept is based.
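To put rough numbers on the stride problem, the sketch below computes the byte offsets touched by a row walk and a column walk of the 256 × 1024 array and counts how many distinct cache lines each walk lands on. The row-major layout, 4-byte elements, and 32-byte cache line are illustrative assumptions, not figures from the article or from any particular chip.

```python
ROWS, COLS = 256, 1024   # array shape from the text
ELEM_BYTES = 4           # assumption: 32-bit elements
LINE_BYTES = 32          # assumption: cache line size


def address(row, col):
    """Byte offset of element (row, col) in a row-major array."""
    return (row * COLS + col) * ELEM_BYTES


def lines_touched(addresses):
    """Number of distinct cache lines covered by a sequence of byte offsets."""
    return len({addr // LINE_BYTES for addr in addresses})


# Row operation: consecutive elements, a stride of one element.
row_walk = [address(0, c) for c in range(COLS)]
# Column operation: every 1024th element, a stride of COLS elements.
col_walk = [address(r, 0) for r in range(ROWS)]

row_stride = row_walk[1] - row_walk[0]   # bytes between successive row accesses
col_stride = col_walk[1] - col_walk[0]   # bytes between successive column accesses

print("row walk:", len(row_walk), "accesses,", lines_touched(row_walk), "lines")
print("col walk:", len(col_walk), "accesses,", lines_touched(col_walk), "lines")
```

Under these assumptions the row walk gets eight accesses out of every line it fetches, while the column walk uses a single element from each line, so every access is a candidate miss.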
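The register-carving idea behind the packed ADD described above can be imitated in software. The following is a minimal sketch, assuming a 32-bit word and wraparound (not saturating) arithmetic; the function names are hypothetical, and the masking trick only emulates the effect of breaking the carry chain between lanes. It is not a description of how MMX or AltiVec hardware does it.

```python
WORD_MASK = 0xFFFFFFFF   # model a 32-bit register


def lane_masks(lane_bits):
    """Return (high, low): the top bit of every lane, and all the other bits."""
    high = 0
    for shift in range(lane_bits - 1, 32, lane_bits):
        high |= 1 << shift
    return high, WORD_MASK & ~high


def packed_add(a, b, lane_bits):
    """Add a and b as independent lanes of lane_bits each, wrapping per lane."""
    high, low = lane_masks(lane_bits)
    # Add everything except each lane's top bit, so a carry can reach the
    # lane's own top bit but never spills into the neighboring lane.
    partial = (a & low) + (b & low)
    # Fold the operands' top bits back in without generating a carry out.
    return (partial ^ ((a ^ b) & high)) & WORD_MASK


a, b = 0x01FF7F80, 0x01010101
print(hex(packed_add(a, b, 8)))    # four independent 8-bit adds
print(hex((a + b) & WORD_MASK))    # ordinary 32-bit add, carries cross lanes
```

With these operands the two results differ only in the top byte: the ordinary add lets the overflow from the 0xFF lane ripple upward, while the packed add discards it.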
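The branch-prediction remark above, that the crystal ball looks at how the program arrived at a branch as well as the branch's own history, can be illustrated with a toy model. The sketch below is a generic table of 2-bit saturating counters indexed by the branch address XORed with a global history register. The class name, table size, and initial counter values are assumptions made for illustration; this is not the Alpha 21264's actual predictor.

```python
class HistoryPredictor:
    """2-bit saturating counters indexed by (pc XOR global history)."""

    def __init__(self, history_bits):
        self.mask = (1 << history_bits) - 1
        self.history = 0                            # recent outcomes, newest in bit 0
        self.counters = [1] * (1 << history_bits)   # start weakly not-taken

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        idx = self._index(pc)
        if taken:
            self.counters[idx] = min(3, self.counters[idx] + 1)
        else:
            self.counters[idx] = max(0, self.counters[idx] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def count_mispredictions(predictor, outcomes, pc=0):
    """Run one branch through the predictor and count wrong guesses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A branch that alternates taken, not taken, taken, ...
outcomes = [i % 2 == 0 for i in range(200)]

no_history = count_mispredictions(HistoryPredictor(0), outcomes)
with_history = count_mispredictions(HistoryPredictor(4), outcomes)
print("no history:", no_history, "mispredictions")
print("4-bit history:", with_history, "mispredictions")
```

With zero history bits the model collapses to a single counter that this alternating pattern keeps flipping the wrong way, while four bits of history let it learn the pattern after a short warm-up.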