EMBEDDED PROGRAMMING

HP’s Precision Architecture (PA), the PA 8500 in Figure 1, includes a whopping 1.5 MB of cache (0.5 MB of instruction, 1 MB of data). Given HP’s long-time position in favor of off- versus on-chip cache, such a development is even more notable. Fact is, with tens of millions of transistors to find homes for, big cache is the easiest way out.

Figure 1: With plenty of function units, out-of-order execution, high clock rate, and huge (0.5-MB instruction, 1-MB data) caches, the HP PA-8500 is a good example of the latest trend for performance-at-any-price chips.
[Block diagram labels: Runway bus, system-bus interface, instruction cache, instruction fetch unit (BHT, BTAC), sort, ALU buffer, memory buffer, and address reorder buffer (28 entries each), dual 64-bit integer ALUs, dual shift/merge units, dual FP multiply/accumulate units, dual FP divide/SQRT units, dual load/store address adders, TLB, data cache, rename registers, architected registers, retire.]

Besides making cache bigger, the goal is to build and use it smarter. Even if half a dozen instructions can be found to keep all those execution units fed, the cache can become a bottleneck.

Thus, the trend towards nonblocking designs escalates (when a cache miss happens, don’t just sit there twiddling your thumbs; try to execute another instruction). The latest designs allow dozens or even hundreds of cache accesses to be pending, without stalling the processor.

As for using cache more intelligently, the earlier trend towards software-directed prefetching, illustrated in Figure 2, has become de rigueur. The idea is to give the cache a head start, with the goal, in a perfect world, being the elimination of the dreaded miss.

The conditional branch has become the bane of heavily pipelined, superscalar, and speculative superdupers. Mere mortal CPUs can only take five and wait for new marching orders (i.e., until the condition resolves).

The latest chips go to extraordinary lengths trying to predict the branch’s outcome. For instance, the DEC-now-Compaq Alpha 21264 happily wades 20 branches into the future, relying on a crystal ball that not only includes the usual branch history but also how the program arrived there (see Figure 3).

IN MEMORY OF CRAY

Another example of effective recycling of yesterday’s know-how is seen in the widespread adoption of SIMD techniques (i.e., applying a single instruction to multiple data items in parallel). In Cray’s day, this technique was known as vector processing.

The appeal lies in the fact that it’s relatively easy to find and exploit parallelism in scientific and signal-processing algorithms that rely on vector operations.

Almost all hot chips support vector ops these days, the most well-known example being the Intel MMX. At their simplest, such schemes carve a full-size register into parallel subparts that can be operated on. For example, a conventional 32-bit ADD is extended to perform two 16-bit ADDs or four 8-bit ADDs at once.

The latest generation of pseudo-SIMDs pushes the concept further with wider words, more operands, and extra instructions. Consider Motorola’s AltiVec upgrade of the PowerPC architecture. The upgrade adds a complete vector unit featuring 128-bit registers that can be interpreted as 16 × 8-bit, 8 × 16-bit, or 4 × 32-bit data.

There are 162 new instructions, including both the typical intra-element and the newly introduced inter-element operations. Figure 4 shows how the two make short work of the inner loops at the heart of scientific and DSP code.

Although his life was cut short by a car accident, the spirit of Seymour Cray lives on in the SV1 from Silicon Graphics. The SV1 not only incorporates SIMD techniques, but because SGI purchased Cray’s company, it is also upwardly compatible with his YMP.

As a classic vector processor, the SV1 faces a different set of challenges. For instance, there’s little concern with conventional benchmarks like SPEC. The only goal is crunching through vectors at blazing speed, and we’re talking billions of operations per second.

One source of head scratching comes when vector ops and cache get in each other’s way. Vector data may not be reused, and worse, arrays (i.e., vectors of vectors) introduce the issue of stride.

For instance, a column operation on a 256 × 1024 array calls for accessing every 1024th element, which is contrary to the concept of locality (i.e., the next access is near the previous one) on which the cache concept is based.
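
To make the stride issue concrete, here is a minimal C sketch of the 256 × 1024 case from the text. It assumes a row-major array of 4-byte floats, which is how C lays out a two-dimensional array. The names grid, row_stride, column_stride, sum_row, and sum_column are made up for illustration and are not taken from the SV1 or any vendor library.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 256, COLS = 1024 };   /* array shape from the text */

static float grid[ROWS][COLS];      /* C stores this row by row (row-major) */

/* Elements between consecutive accesses when walking along a row. */
size_t row_stride(void) { return 1; }

/* Elements between consecutive accesses when walking down a column:
   one full row, i.e. 1024 elements. */
size_t column_stride(void) { return sizeof grid[0] / sizeof grid[0][0]; }

/* Unit stride: neighboring elements share cache lines. */
float sum_row(size_t r)
{
    assert(r < ROWS);
    float sum = 0.0f;
    for (size_t c = 0; c < COLS; c++)
        sum += grid[r][c];
    return sum;
}

/* Stride of COLS: each access lands a whole row past the previous one,
   so no two consecutive accesses fall in the same cache line. */
float sum_column(size_t c)
{
    assert(c < COLS);
    float sum = 0.0f;
    for (size_t r = 0; r < ROWS; r++)
        sum += grid[r][c];
    return sum;
}
```

Both functions do the same arithmetic per element. The only difference is the distance between successive addresses, which is what defeats a cache built on locality.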
Figure 2: To ease the pain of a cache miss (a), the HP PA-8500 and other high-end chips employ both hardware and software techniques. One hardware approach is a nonblocking cache that allows multiple outstanding references (b), while software solutions include compiler-inserted prefetch to initiate cache access prior to anticipated use (c).
[Timing diagram; horizontal axis: Time. Labels: Load-Miss, Use, Load-to-Gr0, Instr, Load-Hit.]
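
The compiler-inserted prefetch of Figure 2c can be approximated by hand in C. The sketch below is not PA-8500 code: it assumes a GCC- or Clang-compatible compiler and uses the `__builtin_prefetch` extension where the figure shows a load to Gr0. The function name sum_with_prefetch and the PREFETCH_DISTANCE value are hypothetical, and a real prefetch distance would have to be tuned to the miss latency of the target machine.

```c
#include <assert.h>
#include <stddef.h>

/* How far ahead to prefetch, in elements. Hypothetical value: the right
   distance depends on miss latency and on the cost of the loop body. */
#define PREFETCH_DISTANCE 16

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch with high temporal locality (GCC/Clang extension). */
#define PREFETCH_READ(addr) __builtin_prefetch((addr), 0, 3)
#else
#define PREFETCH_READ(addr) ((void)0)   /* no-op on other compilers */
#endif

/* Sum an array while requesting data a fixed distance ahead of its use,
   so the miss (if any) overlaps with useful work on earlier elements. */
double sum_with_prefetch(const double *data, size_t n)
{
    assert(data != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)          /* stay inside the array */
            PREFETCH_READ(&data[i + PREFETCH_DISTANCE]);
        sum += data[i];
    }
    return sum;
}
```

The prefetch is only a hint: the result is identical with or without it, and whether it helps at all depends on the hardware, which may already prefetch sequential accesses on its own.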
Circuit Cellar INK®, Issue 101, December 1998, p. 81