Gecko's CPU Library

DEC Alpha 21264 (EV6) processors

Introduction: February 1998

Although the 21264 (EV6) processor was developed by DEC and was first mentioned at the Microprocessor Forum in October 1996, the final silicon implementation was completed in February 1998, when DEC was in the process of liquidation. The processor was a significant step forward compared with EV5, not a tuned-up old design. One of the most important innovations was out-of-order execution, which required a fundamental core redesign and made the functional units less dependent on cache and main memory bandwidth. EV6 could reorder up to 80 instructions on the fly, more than competing products could. For instance, Intel's P6 architecture could execute up to 40 micro-operations out of order, HP PA-8x00 up to 56, MIPS R12000 up to 48, IBM POWER3 up to 32 and Motorola PowerPC G4 up to 5, while Sun UltraSPARC II didn't support instruction reordering at all. Register renaming was also implemented, so EV6 accommodated 80 integer and 72 floating-point physical registers, while the number of architectural (logical) registers remained unchanged at 32 integer and 32 floating-point.
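The renaming idea can be illustrated with a small sketch. It uses the integer register counts given above (32 architectural, 80 physical); everything else is an assumption made for illustration, including the class name RenameMap, the simple FIFO free list and the retire step. It is not a description of how the EV6 map logic was actually built.

```python
from collections import deque

ARCH_REGS = 32   # architectural integer registers (from the text)
PHYS_REGS = 80   # physical integer registers (from the text)


class RenameMap:
    """Toy rename table mapping architectural to physical registers.

    Hypothetical illustration only: the free-list policy is an assumption.
    """

    def __init__(self, arch=ARCH_REGS, phys=PHYS_REGS):
        # Initially architectural register i lives in physical register i.
        self.table = list(range(arch))
        # The remaining physical registers are free for in-flight results.
        self.free = deque(range(arch, phys))

    def rename(self, dest, srcs):
        """Rename one instruction; return (phys_dest, phys_srcs, old_phys)."""
        # Sources are read through the current mapping first.
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register: rename must stall")
        new = self.free.popleft()
        old = self.table[dest]
        self.table[dest] = new
        return new, phys_srcs, old

    def retire(self, old_phys):
        """Once the instruction retires, the previous mapping can be reused."""
        self.free.append(old_phys)


rm = RenameMap()
first = rm.rename(1, [2, 3])    # r1 <- r2 op r3
second = rm.rename(1, [1, 4])   # r1 <- r1 op r4, reads the first result
print(first, second)
```

The two writes to r1 receive different physical registers, so the second instruction depends only on the value it really reads, not on the register name being reused.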

There were 4 integer pipelines, twice as many as in EV5. They were organised in 2 clusters, each with 2 pipelines and an 80-entry integer register file; the 2 register files held identical (synchronised) contents. The pipelines differed functionally, however: the 2nd pipeline of the 1st cluster was capable of shifting (1-cycle latency) and multiplying (7-cycle latency), while the 2nd pipeline of the 2nd cluster was capable of shifting (1-cycle latency) and executing MVI operations (3-cycle latency). The 1st pipeline of each cluster assisted the A-box by calculating virtual addresses for load/store operations. Apart from that, all 4 integer pipelines were capable of basic arithmetic and logical operations (1-cycle latency). The A-box itself worked with the I-TLB and D-TLB (128 entries each), load and store queues (32 instructions each), and 8 64-byte buffers (the miss address file) for transactions involving B-cache and main memory. The floating-point pipelines differed functionally as well. The 1st pipeline was capable of adding (4-cycle latency), dividing (12-cycle latency for single-precision operands and 15-cycle for double-precision) and calculating square roots (15-cycle and 30-cycle respectively), whereas the 2nd one was only capable of multiplying (4-cycle latency). As in EV5, the I-box could decode up to 4 instructions per cycle; it dispatched them into 2 queues, the E-queue for the E-box (20 instructions) and the F-queue for the F-box (15 instructions).
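The division of labour among the integer pipelines can be written down as a small lookup table. The sketch below only restates the capabilities and latencies listed above; the names INT_PIPES, pipes_for and chain_latency are hypothetical, and the latency sum assumes a strictly dependent chain with no forwarding delay between clusters, which is a simplification.

```python
# (cluster, pipeline), numbered from 1 as in the text above.
INT_PIPES = {
    (1, 1): {"alu", "address"},
    (1, 2): {"alu", "shift", "mul"},
    (2, 1): {"alu", "address"},
    (2, 2): {"alu", "shift", "mvi"},
}

# Latencies in cycles as given in the text; no figure is given for
# address calculation, so it is left out here.
LATENCY = {"alu": 1, "shift": 1, "mul": 7, "mvi": 3}


def pipes_for(op):
    """Return the (cluster, pipeline) pairs able to execute an operation class."""
    return sorted(pipe for pipe, ops in INT_PIPES.items() if op in ops)


def chain_latency(ops):
    """Cycles for a chain where each operation needs the previous result.

    Assumption: no extra delay when a result crosses between clusters.
    """
    return sum(LATENCY[op] for op in ops)


print(pipes_for("mul"))                         # only one pipeline multiplies
print(chain_latency(["alu", "mul", "shift"]))   # 1 + 7 + 1 cycles
```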

The C-box was redesigned significantly and supported only 2 cache levels. The integrated L1 cache consisted of a 64KB I-cache and a 64KB D-cache, both 2-way set associative with 64-byte lines. D-cache was write-back, as was B-cache, hence there was no S-cache at all. B-cache was inclusive of D-cache, though. Because of the larger size, D-cache read/write latencies increased from 2 to 3 cycles (to/from an integer register) and 4 cycles (to/from a floating-point register). D-cache remained dual-ported, but instead of 2 identical write-synchronised parts as in EV5, it was made of a single part clocked at double the core frequency. The external B-cache of 1MB to 16MB, direct-mapped and write-back, was accessed through an independent bidirectional 128-bit data bus with a 16-bit channel for ECC protection and a unidirectional 20-bit address bus. B-cache was built of LW SSRAM (late write) chips, later of DDR SSRAM (double data rate) ones. The B-cache speed was programmable, ranging from 2/3 to 1/8 of the EV6 core frequency. Unlike in previous generations of Alpha processors, B-cache wasn't optional. The system data bus was only 64 bits wide with an additional 8 bits of ECC protection, but it could transfer data on both rising and falling edges of the clock signal, i.e. it was DDR capable. The system address bus was 44 bits wide, implemented physically through two 15-bit unidirectional paths; the system control bus was 15 bits wide. The basic operating principle of the system bus changed as well: the bus became dedicated instead of shared, so every processor had its own path to the system logic.
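The L1 figures above (64KB, 2-way set associative, 64-byte lines) determine how an address splits into tag, set index and line offset. The following is a generic sketch of that arithmetic, not a model of the EV6 lookup path: the function names are hypothetical, sizes are assumed to be powers of two, and no distinction is made between virtual and physical addresses.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache.

    Assumes all three parameters are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


sets, offset_bits, index_bits = cache_geometry(64 * 1024, 2, 64)
print(sets, offset_bits, index_bits)                    # 512 6 9
print(split_address(0x12345, offset_bits, index_bits))  # (2, 141, 5)
```

With these parameters each cache has 512 sets of 2 lines, so 6 address bits select the byte within a line and the next 9 bits select the set.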

The branch prediction logic was redesigned completely. It followed a 2-level scheme with a local history table of 1024 10-bit records and a local predictor of 1024 3-bit records, coupled with a global predictor of 4096 3-bit records and a 12-bit history path. The local and global algorithms worked independently: the local one tracked each branch individually, while the global one tracked sequences of branches. The chooser compared the results of both algorithms and recorded its conclusions in a separate choice predictor of 4096 2-bit records, which supplied the preferred decision when the two predictions differed. This cooperative approach achieved better results than either algorithm used on its own.
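A scheme of this kind is often called a tournament predictor, and its behaviour can be sketched in a few dozen lines. The table sizes and counter widths below follow the figures in the paragraph above. The rest is assumed for illustration: the class and function names, the way the program counter is hashed into the local history table, the initial counter values and the exact update rules. The real hardware may differ in all of these.

```python
def bump(value, up, bits):
    """Saturating counter update: move towards the top if up, else towards 0."""
    top = (1 << bits) - 1
    return min(value + 1, top) if up else max(value - 1, 0)


class TournamentPredictor:
    """Toy local/global predictor with a chooser (hypothetical sketch)."""

    def __init__(self):
        self.local_hist = [0] * 1024   # 10-bit history per branch slot
        self.local_ctr = [3] * 1024    # 3-bit counters, indexed by local history
        self.global_ctr = [3] * 4096   # 3-bit counters, indexed by path history
        self.choice = [1] * 4096       # 2-bit counters, >= 2 prefers global
        self.path = 0                  # 12-bit history of recent outcomes

    def _components(self, pc):
        slot = (pc >> 2) % 1024        # assumed hash: drop alignment bits
        hist = self.local_hist[slot]
        local = self.local_ctr[hist] >= 4
        glob = self.global_ctr[self.path] >= 4
        return slot, hist, local, glob

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        _, _, local, glob = self._components(pc)
        return glob if self.choice[self.path] >= 2 else local

    def update(self, pc, taken):
        """Train all tables with the actual outcome of the branch at pc."""
        slot, hist, local, glob = self._components(pc)
        # The chooser learns only when the two predictors disagree.
        if local != glob:
            self.choice[self.path] = bump(self.choice[self.path], glob == taken, 2)
        self.local_ctr[hist] = bump(self.local_ctr[hist], taken, 3)
        self.global_ctr[self.path] = bump(self.global_ctr[self.path], taken, 3)
        self.local_hist[slot] = ((hist << 1) | int(taken)) & 1023
        self.path = ((self.path << 1) | int(taken)) & 4095


predictor = TournamentPredictor()
for i in range(200):
    predictor.update(0x2000, i % 2 == 0)   # strictly alternating branch
print(predictor.predict(0x2000))           # next outcome would be taken
```

An alternating branch defeats a plain per-branch counter, but once the history registers fill up, both history-indexed tables in this sketch learn the pattern.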

Considering the large number of functional units and other difficulties, the engineers who developed EV6 decided to redesign the clock subsystem entirely. A more efficient signal flow allowed the core to reach the frequencies of the much simpler EV56 core on almost the same manufacturing process. Overall, the clock subsystem of EV6 consumed about 32% of the total core power. For comparison, the figure was about 25% for EV56, about 37% for EV5 and about 40% for EV4.

EV6 was manufactured using the same technological process as EV56, but with 2 additional metallisation layers. It consisted of 15.2M transistors (including about 9M spent on I-cache, D-cache and branch predictors), had a die size of 314mm² and required a 2.1V to 2.3V power supply. 21264 (EV6) core frequencies ranged from 466MHz to 600MHz (TDP approx. from 80W to 110W). Form factor: PGA-587 (Pin Grid Array).