The Gecko's CPU Library

Hewlett Packard PA-8000 (Onyx) processors

Introduction: January 1996

Die of the Hewlett Packard PA-RISC 8000.

The Stratus Continuum 600, powered by Hewlett Packard PA-RISC 8000s.

The Stratus Continuum 1200, powered by Hewlett Packard PA-RISC 8000s.

The PA-8000 is the first chip to implement the 64-bit PA-RISC 2.0 architecture, which includes many extensions to support 64-bit computing. All integer registers and functional units (ALU, shift/merge) have been widened to 64 bits, giving native 64-bit integer arithmetic. The flat virtual address space is 64 bits wide, although most PA-RISC 2.0 CPUs support only a 40-bit physical address space. Other extensions include fast TLB insert instructions, memory prefetch instructions, support for variable-sized pages, branch prediction hinting and FPMAC (Floating Point Multiply Accumulate) units. The instruction decode logic is not integrated with the functional units' pipeline logic; this decoupling allows the chip to partially decode instructions well in advance of their actual execution by the functional unit(s).
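To put the two address widths in perspective, the short sketch below converts them into byte counts. The bit widths come from the text; the constant and function names are made up for this illustration.

```python
# Address-space sizes implied by the bit widths quoted in the text.
VIRTUAL_BITS = 64    # flat virtual address space
PHYSICAL_BITS = 40   # physical address space of most PA-RISC 2.0 CPUs


def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses for a given address width."""
    return 1 << bits


def in_tib(num_bytes: int) -> float:
    """Convert a byte count to tebibytes (1 TiB = 2**40 bytes)."""
    return num_bytes / 2**40


if __name__ == "__main__":
    print("physical:", in_tib(addressable_bytes(PHYSICAL_BITS)), "TiB")
    print("virtual: ", in_tib(addressable_bytes(VIRTUAL_BITS)), "TiB")
```

A 40-bit physical address reaches 1 TiB, while the 64-bit virtual space is 2^24 times larger.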

A key feature of the PA-8000 is the IRB (Instruction Reorder Buffer). Because of restrictions on compiler scheduling, the design team decided that the CPU should perform its own instruction scheduling. The IRB can hold up to 28 computation and 28 load/store instructions; it tracks the interdependencies between these instructions and lets each one execute as soon as it is ready. Branch prediction outcomes are also tracked and, thanks to this rescheduling, the CPU can execute instructions past cache misses. The IRB is the key part of the chip's out-of-order execution capability.
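The idea of issuing instructions as soon as their operands are ready can be illustrated with a toy model. This is only a sketch and not HP's actual logic: the `Instr` class, the `simulate` function, the latencies and the rule that each register is written at most once are all assumptions made for the example. Only the two 28-entry queues and the four-wide issue come from the text.

```python
from dataclasses import dataclass

ALU_SLOTS = 28   # computation entries (figure from the text)
MEM_SLOTS = 28   # load/store entries (figure from the text)


@dataclass
class Instr:
    name: str
    kind: str            # "alu" or "mem"
    dest: str            # register written (assumed written only once)
    srcs: tuple = ()     # registers read
    latency: int = 1     # cycles until the result is available


def simulate(program, issue_width=4):
    """Return {instruction name: issue cycle} for a toy reorder buffer."""
    if sum(i.kind == "alu" for i in program) > ALU_SLOTS:
        raise ValueError("too many computation instructions for the buffer")
    if sum(i.kind == "mem" for i in program) > MEM_SLOTS:
        raise ValueError("too many load/store instructions for the buffer")

    produced = {i.dest for i in program}   # registers written by the program
    ready_at = {}                          # register -> cycle value is ready
    issued = {}
    waiting = list(program)                # program order, oldest first
    cycle = 0
    while waiting:
        slots = issue_width
        still_waiting = []
        for ins in waiting:
            # Registers not written by the program count as ready from cycle 0.
            operands_ready = all(
                s not in produced or ready_at.get(s, float("inf")) <= cycle
                for s in ins.srcs
            )
            if slots and operands_ready:
                issued[ins.name] = cycle
                ready_at[ins.dest] = cycle + ins.latency
                slots -= 1
            else:
                still_waiting.append(ins)
        # Nothing issued and no result still in flight: the rest can never run.
        if len(still_waiting) == len(waiting) and \
                cycle >= max(ready_at.values(), default=0):
            raise ValueError("unresolvable dependency")
        waiting = still_waiting
        cycle += 1
    return issued


program = [
    Instr("load_a", "mem", "r1", ("r9",), latency=10),  # pretend cache miss
    Instr("add_b", "alu", "r2", ("r1",)),               # needs the load
    Instr("add_c", "alu", "r3", ("r8",)),               # independent
    Instr("mul_d", "alu", "r4", ("r3",)),               # needs add_c
]
order = simulate(program)
print(order)
```

In this run `add_c` and `mul_d` issue in cycles 0 and 1, while `add_b` has to wait until cycle 10 for the slow load: the younger, independent instructions run past the simulated cache miss.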

In short, the PA-8000 is a decoupled architecture with four-instruction dispatch and aggressive out-of-order (OoO) execution. It also has dual floating-point and dual load/store units, a large OoO dispatch window and, following a long HP tradition, no on-chip caches. The (large) primary caches have been kept off-chip to increase the amount of data that can be accessed in a single cycle. Although the latency of the caches is roughly two cycles, complete pipelining hides it, giving practically one access per cycle. Nothing in the design of this chip was leveraged from previous chip designs.
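The effect of pipelining on a two-cycle cache can be checked with simple arithmetic. The sketch below assumes an idealized stream of back-to-back, independent accesses with no stalls; the function names are hypothetical.

```python
def cycles_unpipelined(accesses: int, latency: int = 2) -> int:
    """Each access must finish before the next one starts."""
    return accesses * latency


def cycles_pipelined(accesses: int, latency: int = 2) -> int:
    """A new access starts every cycle; only the first pays the full latency."""
    if accesses == 0:
        return 0
    return latency + (accesses - 1)


if __name__ == "__main__":
    n = 1000
    print("unpipelined:", cycles_unpipelined(n) / n, "cycles per access")
    print("pipelined:  ", cycles_pipelined(n) / n, "cycles per access")
```

Over 1000 accesses the pipelined case needs 1001 cycles instead of 2000, which is where the figure of practically one access per cycle comes from.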

The PA-8000 was used in the C160, C180, D270, D280, D370, D380, J280, J282, K250, K260, K450, K460, R380, T600, the Convex SPP2000 (S-Class) and the Stratus Continuum 628 and 1228.

- PA-RISC version 2.0 64-bit

- Ten functional units: 2 integer ALUs, 2 shift/merge units, 2 complete load/store pipelines, 2 Floating Point multiply/accumulate units, 2 Floating Point divide/square root units

- 4-way superscalar

- Two address adders

- 96-entry fully-associative dual-ported TLB

- TLB miss penalty of 61 cycles

- 32-entry BTAC (Branch Target Address Cache)

- 256-entry BHT (Branch History Table)

- Dynamic and static branch prediction modes

- Off-chip L1 caches up to 1MB I and 1MB D, realized in synchronous 6.7ns (150MHz) late-write 1Mb SRAMs, one cycle latency

- Caches are direct-mapped and dual-ported

- 56-entry instruction queue/reorder buffer (IRB)

- MAX-2 multimedia extensions (subword arithmetic), e.g. for MPEG decoding

- Each instruction includes five predecode bits

- Bi-endian support

- Runway system/memory bus, 120MHz, 64-bit wide, featuring split transactions and glueless multiprocessing. Max. throughput of 960MB/s

- Up to 180MHz frequency with 3.3V core voltage

- 17.7 × 19.6 mm² die, 4,500,000 FETs, 0.5 micron, 5-layer metal CMOS, packaged in a 1,085-pin flip-chip LGA package
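Several figures in the list above can be cross-checked against each other with a few lines of arithmetic. The sketch below uses only numbers quoted in this document; the function names are made up, and the bandwidth formula assumes one full-width transfer per bus clock.

```python
def peak_bandwidth_mb_per_s(clock_hz: int, width_bits: int) -> float:
    """Peak bus throughput in MB/s (1 MB = 10**6 bytes), one transfer per clock."""
    return clock_hz * (width_bits // 8) / 1_000_000


def clock_mhz_from_cycle_ns(cycle_ns: float) -> float:
    """Clock frequency in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def die_area_mm2(width_mm: float, height_mm: float) -> float:
    """Die area in square millimetres."""
    return width_mm * height_mm


if __name__ == "__main__":
    # Runway bus: 120 MHz, 64 bits wide
    print("Runway peak:", peak_bandwidth_mb_per_s(120_000_000, 64), "MB/s")
    # Cache SRAMs: 6.7 ns cycle time
    print("SRAM clock: ", round(clock_mhz_from_cycle_ns(6.7), 1), "MHz")
    # Die: 17.7 mm by 19.6 mm
    print("Die area:   ", round(die_area_mm2(17.7, 19.6), 2), "mm^2")
    # IRB: 28 computation + 28 load/store entries
    print("IRB entries:", 28 + 28)
```

The Runway figure works out to the quoted 960 MB/s, a 6.7 ns cycle corresponds to roughly 149 MHz (rounded to 150 MHz in the list), the die covers about 347 mm², and the two 28-entry queues add up to the 56-entry IRB.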

Source: www.openpa.net