Graphics Core Next: From Evolution to Revolution
Much like the outgoing Cayman series of cards, Tahiti is focused on improving AMD’s position within a highly competitive (and lucrative) DX11 market. Though previous generations like the HD 5000 and HD 6000 relied largely upon a core architecture that has existed since 2006, the next iteration of parts will have a new design engineered from the ground up for DX11 and compute environments. However, in order to see where AMD is going, you have to understand where they’re coming from.
AMD’s graphics cores have always had fairly long lifespans, which says a lot about how well the company has usually designed its architectures for a given generation. While this approach certainly has benefits from a financial and planning perspective, introducing the wrong architectural design can have long-term consequences.
The first era of modern GPUs ran from 1998 through 2002 and introduced us to fixed function rendering, which worked well for the time but limited the ways geometry, lighting and texturing could be handled. Even though modern graphics architectures still have a fixed function stage containing the geometry processing elements, these have now been incorporated into a much larger rendering picture.
AMD’s second round of designs ushered in the revolutionary DX9 era along with its accompanying generation of products. It featured the beginning of programmable rendering pipelines and new pixel rendering functionality while laying groundwork for the DX10 and DX11 products to come. Meanwhile, the release of DX10 in 2007 meant the introduction of unified shader units and the VLIW (Very Long Instruction Word) architecture for parallel core operations. AMD has adhered to the VLIW approach for a while now and, as DX10 gave way to DX11, additional functionality and minor modifications were gradually built in.
In a roundabout way, this brings us to AMD’s new take on both graphics and parallel computing called Graphics Core Next, or GCN. While it may not carry the most unique of names, it outlines what this new architecture means for AMD: a true next generation approach. Simply put, it was high time for a change away from VLIW in order to bring intergenerational performance up to the industry’s expectations. GCN also represents the first steps towards a truly heterogeneous environment between the CPU and GPU since it will eventually be an integral part of AMD’s upcoming APUs.
The fundamental building block for all things GCN is called the Compute Unit. In layman’s terms, this is a compact, self contained building block of sorts that was designed to increase on-die data flow efficiency by keeping much of the data local rather than handing it off to a global shared stream. For example, the previous generation’s SIMD array, compute unit, registers and cache all fed off the same thread sequencer and had to share resources in a complex dance of information. Each Compute Unit is now treated independently, allowing SIMD communications, sequencing and scheduling to be run in a single cohesive structure before the results are handed off.
From a thread processing standpoint, a single Compute Unit has four sub Vector Units (or SIMDs) made up of 16 Stream Processors each, for a total of 64 cores per CU. This layout is backstopped by a quartet of Texture Units and 16KB of dedicated read / write L1 cache. That doubles the L1 cache of previous architectures, so instead of reading textures and exporting raster functions to an external memory buffer, these instructions can now be sent to the local cache instead.
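To keep those figures straight, here is a minimal sketch that tallies the per-CU resources described above; the constant names are purely illustrative and not part of any AMD tooling.

```python
# Illustrative tally of GCN Compute Unit resources (names are hypothetical).
SIMDS_PER_CU = 4               # four Vector Units (SIMDs) per Compute Unit
STREAM_PROCESSORS_PER_SIMD = 16
TEXTURE_UNITS_PER_CU = 4       # a quartet of Texture Units
L1_CACHE_KB_PER_CU = 16        # dedicated read/write L1 cache

def stream_processors_per_cu():
    """Total Stream Processors in one Compute Unit."""
    return SIMDS_PER_CU * STREAM_PROCESSORS_PER_SIMD

print(stream_processors_per_cu())  # 64 cores per CU
```

Multiplying this per-CU figure by a chip's CU count gives its total Stream Processor count, which is how overall shader counts for GCN-based parts are derived.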
With the Vector Units producing their own independent streams, it was important to include a high bandwidth scheduler. The Scheduler works alongside the unified cache and the 64KB of local data share to facilitate the information flow between data lanes. In the Queen’s English, it acts like a traffic light to direct data towards a set location.
Another important part of the Compute Unit’s hierarchy is the inclusion of a dedicated Scalar Unit with its own registers. This unit acts like a general purpose programmable core that can issue its own instructions and can take part of the workload off the other areas of the Compute Unit, or can work independently if need be. Think of it as a central processing unit within each CU.
As you can see above, the move away from a VLIW4 SIMD architecture towards Stream Processors contained within four distinct Vector Units significantly increases on-die efficiency. The “Quad SIMD” approach is able to process information on a parallel basis without any potential conflicts in the data stream, thus speeding up data hand-offs and increasing overall performance per square millimeter.
Backing up the Compute Units is an expanded and very robust caching design that is linked together by the Global Data Share. While the Global Data Share is the glue that binds communications between CUs together, it can also take some heat off the L2 cache by managing all on-chip data sharing services.
In addition to the aforementioned L1 cache, each quartet of Compute Units has access to 16KB of instruction cache and 32KB of scalar data cache which are both helped out by the L2 cache units. Speaking of the L2 cache, AMD has upped the ante here as well with twelve partitions of 64KB for a total of 768KB.
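The shared cache figures quoted above can be sanity-checked with a quick sketch; again, the names are illustrative constants rather than values queried from real hardware.

```python
# Illustrative tally of GCN shared cache sizes (names are hypothetical).
L2_PARTITIONS = 12               # twelve L2 partitions
L2_PARTITION_KB = 64             # 64KB per partition
INSTR_CACHE_KB_PER_CU_QUAD = 16  # instruction cache shared by four CUs
SCALAR_CACHE_KB_PER_CU_QUAD = 32 # scalar data cache shared by four CUs

total_l2_kb = L2_PARTITIONS * L2_PARTITION_KB
print(total_l2_kb)  # 768KB of L2 cache in total
```

The per-quad instruction and scalar caches sit between the CUs and this 768KB L2 pool, which in turn backs the 16KB L1 inside each Compute Unit.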