Inside the Haswell Architecture
Inside the Tock: Haswell's Architecture
In previous generations, Intel took a relatively modest approach toward evolving their processor architecture. While several key components were added and improved in the transition from Nehalem to Sandy Bridge, the move from Sandy Bridge / Ivy Bridge to Haswell is being guided by what Intel calls “energy efficient performance” directives.
When distilled down to its various components, this methodology allows Haswell to process more instructions per clock than Sandy Bridge’s Ivy Bridge derivative without an increase in clock speeds or power consumption. This has been accomplished through a number of architectural changes, though the basic unified design premise of the previous generation has been carried over nearly untouched.
Like Ivy Bridge, Haswell uses Intel’s advanced 22nm Tri-Gate 3D transistor technology, which wraps each transistor’s gate around a raised silicon fin on three sides. This three-dimensional structure gives the gate much tighter control over the channel, allowing all of Haswell’s 1.4 billion transistors to be packed into a small 177mm² die while keeping leakage to a minimum, thus boosting efficiency.
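As a rough sanity check, the figures quoted above imply a density of roughly 7.9 million transistors per square millimetre (a back-of-the-envelope calculation, not an Intel-published number):

```python
# Back-of-the-envelope density from the figures quoted above:
# ~1.4 billion transistors on a ~177 mm^2 die.
transistors = 1.4e9
die_area_mm2 = 177.0

density_per_mm2 = transistors / die_area_mm2
print(f"{density_per_mm2 / 1e6:.1f} million transistors per mm^2")  # ~7.9
```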
There haven’t been any substantial changes to the die layout either. Haswell processors still house up to four x86 processing cores, each of which can run two concurrent threads (provided Hyper-Threading is enabled), under the same roof as a Processor Graphics engine, System Agent and Display I/O Engine. The primary PCI-E lanes, video codec engines and memory controller share the space as well. There is also up to 8MB of L3 cache which is dynamically shared between the processor’s cores and the graphics engine.
Usually, when one thinks of efficiency, power consumption comes to mind, but Intel has also been trying to address on-chip efficiency. In other words, they set out to improve communication between Haswell’s independent processing stages. This was accomplished through better branch prediction, added resources for improved single-threaded performance, optimized bandwidth in various areas and higher amounts of parallelism, all without making substantial changes to the architecture’s primary building blocks.
The first step in this process was adding support for the FMA3 (fused multiply-add) instruction set through a pair of FMA execution units, roughly doubling peak double precision throughput compared to Ivy Bridge. Support for AVX2 instructions was added as well, extending 256-bit vector operations to integer workloads. Together, these changes allow Haswell to process up to 32 single precision FLOPs per cycle versus Ivy Bridge’s 16. On paper one might assume this won’t affect everyday applications, but these new instruction sets could lead to a drastic performance increase in games and HPC-centric workloads.
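Those per-cycle figures fall out of simple arithmetic: each 256-bit AVX register holds eight single precision values, and a fused multiply-add counts as two floating point operations. A quick sketch, with unit counts per the description above:

```python
# Peak single-precision FLOPs per cycle, per core.
VECTOR_BITS = 256               # AVX register width
SP_BITS = 32                    # single precision float width
lanes = VECTOR_BITS // SP_BITS  # 8 lanes per vector

# Ivy Bridge: one vector multiply and one vector add can issue per cycle.
ivy_bridge_flops = lanes * 2      # 16

# Haswell: two FMA units, each performing a multiply AND an add per lane.
haswell_flops = 2 * (lanes * 2)   # 32

print(ivy_bridge_flops, haswell_flops)
```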
Intel also added two more ports to Haswell’s URS (Unified Reservation Station). One includes a fourth ALU, meant to take pressure off the existing integer ports, alongside a specialized Branch Unit that improves inter-stage communication. The other new port houses a dedicated Address Generation Unit for store operations, leaving ports 2 and 3 free to serve primarily as load ports.
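The two additions can be sketched as follows; the port numbers come from Intel’s published Haswell block diagrams rather than the description above, which doesn’t name them explicitly:

```python
# Simplified sketch of Haswell's two new scheduler ports.
# Port numbers follow Intel's published Haswell diagrams (an assumption;
# the text above does not name the ports explicitly).
new_ports = {
    6: ["ALU (fourth integer unit)", "Branch Unit"],  # relieves existing ALU ports
    7: ["Store AGU"],  # store addresses go here, freeing ports 2/3 for loads
}

for port, units in sorted(new_ports.items()):
    print(f"Port {port}: {', '.join(units)}")
```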
In order to cope with the potential deluge of compute performance improvements, Haswell’s back-end caching structure has been given a facelift. L1 load and store bandwidth has been doubled, preparing the architecture for the higher throughput from the aforementioned AGU. With this taken into account, the link between the L1 and unified L2 caches also needed some shoring up, so Intel boosted its bandwidth to 64 bytes per cycle while the unified L2 TLB received a significant capacity increase as well.
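To put the 64 bytes per cycle figure in perspective, at a hypothetical 3.5GHz core clock (an example value chosen for illustration, not a Haswell specification) the L1-to-L2 link alone would move over 200GB/s per core:

```python
# Peak L1<->L2 bandwidth implied by the 64 bytes/cycle figure above.
# The clock speed is a hypothetical example, not a Haswell specification.
clock_hz = 3.5e9             # example: 3.5 GHz
l1_l2_bytes_per_cycle = 64   # L1<->L2 link width per the text above

bandwidth_gb_s = l1_l2_bytes_per_cycle * clock_hz / 1e9
print(f"{bandwidth_gb_s:.0f} GB/s per core")  # 224 GB/s at this clock
```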
So what does this mean in plain English? While the organization and size of Haswell’s primary cache structure are identical to the previous generation’s, the improved bandwidth will help both legacy and new code perform at higher levels without the need for increased clock speeds or major architectural changes.
Haswell’s actual power efficiency features have gone through a few revisions too. Turbo frequencies have been fine-tuned through finer-grained voltage adjustments, while an optimized CPU-to-PCH link reduces the amount of power needed for communication between the two chips. Meanwhile, idle power has been substantially improved by incorporating new C-states, manufacturing process optimizations and enhanced power gating.
With their newest architecture, Intel didn’t need to reinvent the wheel in order to optimize performance. Instead, they focused on optimizations that improve IPC in some key areas while leaning on a mature manufacturing process and efficient on-chip communications to reduce power consumption. On paper at least, this should lead to a 10-20% improvement over Ivy Bridge.