DEC Alpha 21164 (EV5) processors
Introduction: April 1995
DEC disclosed the first details of its second-generation Alpha processor at the Hot Chips conference in Palo Alto (California, USA), which opened on 14 August 1994. The 21164 (EV5) was officially announced only later, in a DEC press release dated 7 September 1994. The processor was based on the EV45 core and was an evolution of it rather than a revolutionary new design. Compared with EV4 and EV45, the number of pipelines was doubled, both integer and floating-point. In addition, the floating-point pipelines were shortened from 10 stages to 9. The two integer pipelines were not identical: both could perform basic arithmetic and logical operations, but only the first could multiply and shift, and only the second could process conditional and unconditional branches. Both could calculate virtual addresses for load instructions, but only the first could do so for stores. The floating-point pipelines differed as well: the first could execute any floating-point instruction except multiplies, which were the only instructions the second could process. To keep the execution units busy, the I-box could fetch and decode up to 4 instructions per cycle. The processor was manufactured in the same proprietary 4-layer 0.5µ CMOS5 process as EV45 and therefore required the same 3.3V power supply. It consisted of 9.3M transistors (including 7.8M spent on the integrated caches) and had a die size of 299mm², close to the theoretical limits of the manufacturing process. Core frequencies of the 21164 ranged from 266MHz to 333MHz (TDP from 46W to 56W). Form factor: IPGA-499 (Interstitial Pin Grid Array).
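The division of work between the four pipelines can be illustrated with a small Python sketch. It encodes only the capabilities listed in the paragraph above and checks whether a group of up to 4 instructions could each be given a pipeline of their own. The pipe labels (E0, E1, FA, FM), the instruction class names and the function name are assumptions made for this illustration; the real issue logic had further constraints that are not modelled here.

```python
from itertools import permutations

# Instruction classes each pipeline accepts, taken from the description above.
# Pipe labels and class names are illustrative, not DEC's terminology.
PIPES = {
    "E0": {"int_alu", "int_mul", "shift", "load", "store"},  # 1st integer pipe
    "E1": {"int_alu", "branch", "load"},                     # 2nd integer pipe
    "FA": {"fp_add", "fp_div"},                              # 1st FP pipe: everything but multiply
    "FM": {"fp_mul"},                                        # 2nd FP pipe: multiply only
}

def can_issue_together(group):
    """Return a {pipe: instruction class} assignment if every instruction
    in `group` can get its own pipeline, otherwise None."""
    if len(group) > len(PIPES):
        return None
    # Try every way of handing distinct pipes to the instructions in order.
    for pipes in permutations(PIPES, len(group)):
        if all(kind in PIPES[pipe] for kind, pipe in zip(group, pipes)):
            return dict(zip(pipes, group))
    return None

if __name__ == "__main__":
    print(can_issue_together(["load", "load", "fp_add", "fp_mul"]))
    print(can_issue_together(["store", "store"]))
```

Under this simplified model, two loads can be paired with one instruction for each floating-point pipe, whereas two stores, two multiplies or two branches cannot share a cycle because only one pipeline accepts each of them.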
The I-cache and D-cache were sized and organised just as in EV4, i.e. 8KB each, write-through. The D-cache, however, was made dual-ported, so it could deliver data for 2 load instructions per cycle. Trading transistors for performance, the designers built the D-cache physically from 2 identical copies of 8KB each, so data could be read from either copy but had to be written to both. The processor also integrated 96KB of L2 cache (S-cache, secondary cache), write-back and 3-way set associative, which the C-box accessed through a dedicated 128-bit data bus. The B-cache remained supported but optional; it was built from external cache SRAMs and could be as large as 64MB, though 1MB to 4MB was typical. The 128-bit data bus to the B-cache was still multiplexed with the system data bus. EV5 thus supported 3 cache levels and was the first processor to feature such a hierarchy.
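The duplicated D-cache can be pictured with a toy Python model: two identical copies, each serving one load port, with every store applied to both. This is only a sketch of the idea described above. The class and method names are hypothetical, and the model ignores cache lines, tags and misses to the next level.

```python
class DuplicatedDCache:
    """Toy model of a dual-ported cache built from two identical copies."""

    def __init__(self):
        # copies[0] serves load port 0, copies[1] serves load port 1
        self.copies = [{}, {}]

    def store(self, addr, value):
        # A store must update both copies so that they never diverge.
        for copy in self.copies:
            copy[addr] = value

    def load_pair(self, addr0, addr1):
        # Two loads in the same cycle: each one reads its own copy.
        # A miss is represented by None in this sketch.
        return self.copies[0].get(addr0), self.copies[1].get(addr1)

if __name__ == "__main__":
    dcache = DuplicatedDCache()
    dcache.store(0x100, 7)
    dcache.store(0x108, 9)
    print(dcache.load_pair(0x100, 0x108))
```

The cost is visible in the model: storage is doubled and each store does twice the work, in exchange for two independent read ports.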
The S-cache was accessed through a 4-stage pipeline: 2 cycles for tag look-up and bank activation plus 2 cycles for data access and delivery (16 bytes per cycle), with an extra cycle required for data to propagate across the processor from the C-box to the D-cache and to either the E-box or the F-box. The EV5 designers considered performing tag look-up and data access in parallel, so that all 3 banks would deliver data to be evaluated on arrival. This approach would have shortened the pipeline by 1 stage but would have seriously increased processor power consumption (by an estimated 40%). The D-cache did operate this way, but it had only 1 bank of 8KB rather than 3 banks of 32KB each. Moreover, D-cache read latency was reduced from 3 cycles to 2. Every S-cache line was 64 bytes wide with one tag per line, though each line could be addressed as two 32-byte sublines because the I-cache and D-cache operated on 32-byte lines. The S-cache was inclusive of the D-cache. In turn, the B-cache was inclusive of the S-cache, regardless of the latter's write-back policy and the difference in associativity. The I-TLB held 48 entries (for pages sized from 8KB to 4MB); the D-TLB held 64 entries (for the same page sizes) and was dual-ported for load operations in the same manner as the D-cache. The system data bus had a fixed width of 128 bits plus 16 bits for ECC protection and was still multiplexed with the data path to the B-cache, though it was used more efficiently thanks to a new split-transaction protocol. The system address bus was 40 bits wide and the system control bus 10 bits wide.
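The S-cache figures given above (96KB, 3-way set associative, 64-byte lines, 32-byte sublines, 16 bytes delivered per cycle) determine its geometry, which the following Python sketch works out. How the hardware actually derived the set index is not stated in the text, so the plain bit-slicing used in `split_address` is an assumption, as are all the names.

```python
SCACHE_BYTES = 96 * 1024   # total S-cache capacity
WAYS = 3                   # 3-way set associative (3 banks of 32KB)
LINE_BYTES = 64            # one tag per 64-byte line
SUBLINE_BYTES = 32         # matches the I-cache and D-cache line size
BYTES_PER_CYCLE = 16       # delivery rate quoted in the text

SETS = SCACHE_BYTES // (WAYS * LINE_BYTES)           # sets per bank
OFFSET_BITS = LINE_BYTES.bit_length() - 1            # bits selecting a byte in a line
INDEX_BITS = SETS.bit_length() - 1                   # bits selecting a set
SUBLINE_CYCLES = SUBLINE_BYTES // BYTES_PER_CYCLE    # cycles to deliver one subline

def split_address(addr):
    """Split an address into (tag, set index, subline, byte offset).
    Assumes simple bit-slice indexing, which the text does not confirm."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    subline = offset // SUBLINE_BYTES
    return tag, index, subline, offset

if __name__ == "__main__":
    print(SETS, OFFSET_BITS, INDEX_BITS, SUBLINE_CYCLES)
    print(split_address(0x12345))
```

With these numbers each bank holds 512 lines, and delivering one 32-byte subline at 16 bytes per cycle takes 2 cycles, which is consistent with the 2 cycles quoted for data access and delivery.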