Inside the Llano APU Architecture
AMD’s Llano chips may have been a long time in the making, but the technology they use is cutting edge. There were, however, serious hurdles that needed to be overcome before these new Accelerated Processing Units could be launched.
One of the main challenges AMD faced was the seemingly impossible task of cramming an impressive array of previously separate items onto a single die. On a single chip they needed to implement a GPU with its audio and video I/O needs, four CPU cores with their own L2 cache, a DDR3 memory controller and an integrated Northbridge to ensure low latency communication. In order to accomplish this, AMD turned to GlobalFoundries’ new 32nm manufacturing process and the end result is indeed an impressive looking architecture. However, the challenges will likely be even greater as AMD begins designing Trinity, its next generation of APUs, which will feature even more x86 CPU cores and additional GPU computational power.
At first glance, the specifications of the A-series APUs and Intel’s Sandy Bridge architecture are quite similar. Both use a leading edge 32nm manufacturing process, pack roughly a billion transistors or more and sport similar die sizes. The TDP ranges of the two architectures are also comparable, with chips from 65W to 100W being available (though Intel’s highest end 2600K chips have a TDP of 95W).
It should be mentioned that AMD is moving to the 32nm manufacturing process a full 18 months after Intel released their Clarkdale series, but they simply could not have released A-series APUs without making this transition since these are transistor-packed processors. The four-core Llano variants have a whopping 1.45 billion transistors, which is almost 300 million more than Intel’s six-core Gulftown processors, and those chips pack a huge 12MB of L3 cache. Nevertheless, despite having very high transistor counts, these Llanos are quite compact when you consider that quad-core Sandy Bridge processors are a mere 5% smaller (216mm²) despite only featuring 995 million transistors.
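To put those numbers side by side, here is a minimal sketch of the transistor density arithmetic. The Llano die area is not quoted above, so the script infers it from the statement that Sandy Bridge is 5% smaller; treat that inferred area, and the function and variable names, as our own assumptions rather than AMD figures.

```python
# Figures quoted in the text above.
SANDY_BRIDGE_TRANSISTORS = 995e6   # quad-core Sandy Bridge
SANDY_BRIDGE_AREA_MM2 = 216.0      # quoted die size
LLANO_TRANSISTORS = 1.45e9         # four-core Llano
SIZE_GAP = 0.05                    # Sandy Bridge is "a mere 5% smaller"


def density_millions_per_mm2(transistors, area_mm2):
    """Return transistor density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


# ASSUMPTION: the Llano die area is back-calculated from the 5% gap,
# it is not a number taken from AMD's documentation.
llano_area_mm2 = SANDY_BRIDGE_AREA_MM2 / (1 - SIZE_GAP)

llano_density = density_millions_per_mm2(LLANO_TRANSISTORS, llano_area_mm2)
sandy_density = density_millions_per_mm2(SANDY_BRIDGE_TRANSISTORS,
                                         SANDY_BRIDGE_AREA_MM2)
density_ratio = llano_density / sandy_density

print(f"Inferred Llano die area: {llano_area_mm2:.1f} mm^2")
print(f"Llano density:           {llano_density:.2f} M/mm^2")
print(f"Sandy Bridge density:    {sandy_density:.2f} M/mm^2")
print(f"Llano / Sandy Bridge:    {density_ratio:.2f}x")
```

Under these assumptions Llano works out to roughly 6.4 million transistors per mm² against about 4.6 million for Sandy Bridge. GPU logic is typically laid out more densely than CPU logic, so the gap is less surprising than it first looks.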
With this new manufacturing process, we would have expected some kind of frequency bump, but that clearly is not the case when it comes to Llano. For the time being the A-series APUs are surprisingly low clocked when you consider that even low-end 45nm ‘Propus’ models have reached up to 3.2GHz. Most likely, AMD ran into some power or thermal limitations due to the addition of the large graphics/media portion to the CPU die. The size of the die was also probably the reason why they opted not to have any L3 cache, a design shortcoming that has proven to have a significant impact on gaming performance. Having said that, Llano processors will still have better gaming performance than any other IGP-toting competition thanks to their impressive integrated GPUs.
When seen from a top-down view, the A-series 32nm die quickly sheds its secrets. Along the left hand side there are four x86 cores which are based on AMD’s current Phenom II architecture and whose capabilities can be closely compared with those found on Athlon II X4 “Propus” CPUs. As we already mentioned, AMD Turbo CORE technology is supported on the Llano CPU cores but only some processors will take advantage of it.
The CPU cores are paired up with 1MB of L2 cache per core while the typical L1 cache is directly integrated onto each CPU section. Much like Athlon II processors, there is no L3 cache but the doubled L2 cache (up from 512KB per core on Propus-series CPUs) should give the A8 and A6 series APUs a performance advantage in some scenarios.
The typical Northbridge functionality, along with its associated PCI-E lane controllers, has also been built onto the APU die together with a dual channel memory controller and display interfaces. By bringing these items on-die, AMD is able to drastically simplify motherboard designs, which should keep costs down and allow for better control of I/O signals.
In keeping with AMD’s design philosophy for Fusion, you can see just how much space the GPU and multimedia I/O section takes up on the die. This is without a doubt the most complicated part of any APU and sacrifices had to be made on lower-end A-series processors by cutting down the SIMD array in order to lower TDP and simplify the manufacturing process.
Whether or not this design can be called a true “fusion” between the CPU and GPU is open to debate since the two processing areas remain distinctly separate from one another. As we will see below, the interconnect between the CPU and GPU remains largely unchanged from past generations though some serious efforts have been made to cut down on interconnect latency.
The diagram above may look complicated but its intentions are straightforward: it shows how AMD has simplified the x86 operational cycles to increase processing efficiency. The vast majority of complex instructions are now sent directly through the primary logic units and onto the load / store queue instead of meandering through the chip. We can also see a clear interconnect between the Northbridge interface and the L1 and L2 caching hierarchy which should help improve overall computational performance.
The result of these small changes along with the larger L2 cache structure is an average IPC (instructions per clock) improvement of about 6% when compared to the previous generation.
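To see what a 6% IPC gain is worth against the clock speeds discussed earlier, here is a rough sketch. It assumes performance scales linearly with IPC multiplied by clock speed, which ignores memory and cache effects, and it treats the 6% average as if it applied uniformly. The function names are our own, and the 3.2GHz figure is the Propus clock quoted earlier in this article.

```python
IPC_GAIN = 0.06  # average IPC improvement quoted in the text


def relative_throughput(clock_ghz, ipc_gain=0.0):
    """Throughput in arbitrary units, ASSUMING performance ~ IPC x clock."""
    return clock_ghz * (1.0 + ipc_gain)


def break_even_clock(old_clock_ghz, ipc_gain):
    """Clock at which the newer core matches an older core's throughput."""
    return old_clock_ghz / (1.0 + ipc_gain)


propus_clock_ghz = 3.2  # fastest 'Propus' clock mentioned in the article
needed_ghz = break_even_clock(propus_clock_ghz, IPC_GAIN)

print(f"Clock needed to match a {propus_clock_ghz}GHz Propus: "
      f"{needed_ghz:.2f}GHz")
```

By this simple model a Llano core would need to run at about 3.02GHz to match a 3.2GHz Propus core, so a 6% IPC gain only offsets a clock deficit of around 180MHz.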
The links between each section of the APU follow in the same footsteps as the previous generation but AMD has refined certain interconnects with the goal of speeding up information transfers. The AMD Fusion Compute Link is a medium bandwidth connection which manages the complex interaction between the onboard GPU, the CPU’s cache and the system memory. For the time being, AMD hasn’t completely fleshed out this pathway but upcoming generations of Fusion will be able to take full advantage of its potential bandwidth.
The Radeon Memory Bus, on the other hand, is the all-important link between the onboard graphics coprocessor and the primary on-chip memory controller. Rather than acting like a traffic cop (à la Fusion Compute Link) which tries to direct the flow of information, this memory bus is all about the GPU having unhindered high bandwidth access to the system’s memory controller.
In the previous generation of AMD IGPs, the Northbridge’s graphics processor had to jump through a series of hoops before gaining access to system memory, which is partially why 128MB of “SidePort” memory was sometimes added. This single chip all-in-one solution allows for the elimination of many potential bottlenecks and results in an average of four times more bandwidth between the GPU and memory when compared to past solutions.
Speaking of memory, AMD allows for up to 64GB to be installed and officially supports speeds of up to 1866 MHz.
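Since the integrated GPU shares this memory with the CPU cores, the theoretical peak bandwidth is worth working out. The sketch below assumes the standard 64-bit (8 byte) DDR3 channel width and treats the quoted 1866 MHz as the effective transfer rate in millions of transfers per second; the names are our own and real-world throughput will be lower than this ceiling.

```python
BYTES_PER_TRANSFER = 8  # ASSUMPTION: standard 64-bit DDR3 channel
CHANNELS = 2            # dual channel controller, as described above


def peak_bandwidth_gbs(mega_transfers_per_s, channels=CHANNELS,
                       bytes_per_transfer=BYTES_PER_TRANSFER):
    """Theoretical peak bandwidth in GB/s (1 GB = 10^9 bytes)."""
    megabytes_per_s = mega_transfers_per_s * bytes_per_transfer * channels
    return megabytes_per_s / 1000.0


ddr3_1866_gbs = peak_bandwidth_gbs(1866)
ddr3_1333_gbs = peak_bandwidth_gbs(1333)

print(f"Dual channel at 1866: {ddr3_1866_gbs:.1f} GB/s peak")
print(f"Dual channel at 1333: {ddr3_1333_gbs:.1f} GB/s peak")
```

At the officially supported maximum this comes to just under 30GB/s shared between the CPU cores and the GPU, which is why faster memory tends to matter more on an APU than on a CPU paired with a discrete graphics card.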