|by Michael "SKYMTL" Hoenig | June 25, 2008|
A Look at the TeraScale Graphics Engine
Even though the R770 core represents a huge departure from the R600-series of ATI cores, we have really not heard much about what this new architecture entails. In this section we will give you a quick glimpse (since I know you are itching to see benchmarks) into the R700-series cores and the improvements which have been made over previous generations.
In most of the official documentation you will see floating around, AMD calls this new architecture its “TeraScale” graphics engine. The name alludes to the fact that these new cards can theoretically perform over one trillion floating-point operations per second (one teraFLOP), which makes them the first mass-market, single-GPU products to do so. According to AMD, the ATI TeraScale graphics engine is all about building maximum performance, scalability, efficiency and affordability into one package, which leads to lower costs for the consumer. Without a doubt this is a very lofty goal, so let’s see how AMD went about accomplishing it.
The R770 core is built on a 55nm manufacturing process, which should help it stay power efficient and keep heat output down. AMD really has this process working well for them since they were able to pack nearly a billion transistors (956 million to be exact) onto a die which measures a mere 266mm². For comparison, the hot, power-hungry GTX 280 core measures a whopping 576mm². This means that ATI can fit more dies onto a wafer, which in turn drives down the cost of each R700-series core.
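To put the die-size comparison in rough numbers, here is a minimal sketch that estimates gross dies per wafer. It assumes a 300mm wafer and uses a common textbook approximation for edge loss; neither comes from AMD's documentation, and the result ignores yield, scribe lines and die aspect ratio, so treat it as a ballpark only.

```python
import math

def gross_dies_per_wafer(die_area_mm2: float, wafer_diameter_mm: float = 300.0) -> int:
    """Approximate gross (pre-yield) die count for a round wafer.

    Assumption: 300mm wafer and the usual first-order edge-loss correction.
    """
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    # Partial dies lost around the wafer's circumference.
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)
    return int(wafer_area / die_area_mm2 - edge_loss)

r770_dies = gross_dies_per_wafer(266.0)    # die area quoted in this article
gtx280_dies = gross_dies_per_wafer(576.0)  # die area quoted in this article

print(f"R770:    ~{r770_dies} candidate dies per wafer")
print(f"GTX 280: ~{gtx280_dies} candidate dies per wafer")
print(f"Ratio:   ~{r770_dies / gtx280_dies:.1f}x")
```

Under these assumptions the smaller die yields more than twice as many candidates per wafer, which is the cost advantage described above.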
Welcome to the brave new world of the R770 core. Let’s put some of these complicated diagrams into better context, since they can get a bit overwhelming once you look at them for anything more than 2.5 seconds. Since we run the risk of this explanation getting overly complicated, we will try to keep it short, sweet and in layman’s terms where possible.
Before being processed, the data going through the core passes through the Ultra Threaded Dispatch Processor, which prioritizes it and sends it to one or more of the SIMD (Single Instruction, Multiple Data) cores. Each of these 10 cores holds 80 individual stream processors, which can then pass off their data to that core's texture unit. There are additional Render Back-Ends as well, but we will discuss these a little later. We can also see (at the extreme right of the diagram) that output functions like UVD, the display controllers, the PCI-E 2.0 bus interface and CrossFire X support are all controlled by a central on-die hub. This is supposed to help speed up communication between the core and these low-bandwidth interfaces.
Above we have a diagram picturing the flow of data from the thread scheduler through the SIMD cores. As already mentioned, each of these SIMD cores holds 80 individual stream processors, which are broken into 10 blocks of 8 processors each.
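The "TeraScale" name follows from the stream processor count with a bit of arithmetic. The sketch below assumes each stream processor can retire one multiply-add (counted as two floating-point operations) per clock, and it uses core clocks of 750MHz for the HD4870 and 625MHz for the HD4850; those clocks are not stated in this section, so treat them as assumptions.

```python
SIMD_CORES = 10            # from the block diagram discussed above
SP_PER_SIMD = 80           # stream processors per SIMD core
FLOPS_PER_SP_PER_CLOCK = 2 # assumption: one multiply-add per clock = 2 FLOPs

def peak_gflops(core_clock_mhz: float) -> float:
    """Theoretical peak single-precision throughput in GFLOPS."""
    stream_processors = SIMD_CORES * SP_PER_SIMD
    return stream_processors * FLOPS_PER_SP_PER_CLOCK * core_clock_mhz / 1000

# Assumed core clocks (not taken from this section of the article).
for name, clock_mhz in (("HD4870", 750), ("HD4850", 625)):
    print(f"{name}: {peak_gflops(clock_mhz):.0f} GFLOPS peak")
```

This is a theoretical ceiling only; real workloads rarely keep every stream processor busy on every clock.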
The data is then passed on to the SIMD’s associated Texture Unit. Each texture unit contains four Texture Address Processors, which process the texture information before handing it off to the main data bus, along with four Texture Filter Units and 16 Texture Samplers, all of which are accessed through a Texture Decompressor. This layout is meant to streamline data management across the entire range of core functions.
When we take a closer look at the hierarchy beyond the texture units, we can see that each memory controller has its own L2 cache while each SIMD has its own L1 cache, and close to the top of the diagram there is a completely separate vertex cache. This all leads to a claimed 480GB/s of L1 texture fetch bandwidth and 384GB/s of bandwidth between the L1 and L2 caches.
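Those cache figures are easier to picture as bytes moved per clock cycle. The sketch below assumes a 750MHz core clock (the HD4870's, which is not stated in this section), assumes 1GB/s means 10^9 bytes per second, and divides the L1 figure evenly across the 10 SIMD cores. It is a back-of-the-envelope conversion, not a description of how AMD derived the numbers.

```python
SIMD_CORES = 10
ASSUMED_CORE_CLOCK_MHZ = 750  # assumption: HD4870 core clock

def bytes_per_clock(bandwidth_gb_s: float, clock_mhz: float) -> float:
    """Convert GB/s (10^9 bytes/s) into bytes transferred per core clock."""
    return bandwidth_gb_s * 1000 / clock_mhz

l1_total = bytes_per_clock(480, ASSUMED_CORE_CLOCK_MHZ)  # claimed L1 texture fetch
l1_per_simd = l1_total / SIMD_CORES
l1_to_l2 = bytes_per_clock(384, ASSUMED_CORE_CLOCK_MHZ)  # claimed L1<->L2 link

print(f"L1 fetch: {l1_total:.0f} bytes/clock total, {l1_per_simd:.0f} per SIMD")
print(f"L1<->L2:  {l1_to_l2:.0f} bytes/clock")
```

That the results come out as tidy powers of two at 750MHz suggests the claimed figures are peak numbers quoted at that clock, but this is an inference rather than something the documentation confirms here.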
Now we have come to what many figure to be the crowning achievement of this TeraScale architecture. Let’s be honest for a second here: a graphics card could have the fastest GPU in the world hooked up to it, but if the memory interface sucks, it will be in for a world of hurt. To help alleviate any memory bottlenecks, ATI completely redid the memory interface design on the R770 so that the memory controllers are distributed around the edges of the chip, each with its own block of render back-ends. ATI also gave each memory controller hub access to its own L2 cache, which should further increase data transfer speeds to and from the memory. This was done in order to take advantage of the massive bandwidth potential that comes with the implementation of GDDR5 on some R770 cards. So, even though the HD4850 and HD4870 “only” have a 256-bit interface, the bandwidth afforded by this new design with GDDR5 lets that interface behave more like a 512-bit one.
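The 256-bit versus 512-bit comparison comes down to simple arithmetic: peak bandwidth is the bus width in bytes multiplied by the per-pin data rate. The sketch below uses illustrative data rates of 3.6Gbps for GDDR5 and 1.8Gbps for GDDR3; these are assumptions chosen for the example and are not quoted anywhere in this section.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Theoretical peak memory bandwidth in GB/s.

    bus_width_bits: total width of the memory interface
    data_rate_gbps: effective per-pin data rate in gigabits per second
    """
    return bus_width_bits / 8 * data_rate_gbps

# Assumed, illustrative data rates (not taken from the article).
gddr5_256 = peak_bandwidth_gb_s(256, 3.6)
gddr3_512 = peak_bandwidth_gb_s(512, 1.8)
gddr3_256 = peak_bandwidth_gb_s(256, 1.8)

print(f"256-bit GDDR5 @ 3.6Gbps: {gddr5_256:.1f} GB/s")
print(f"512-bit GDDR3 @ 1.8Gbps: {gddr3_512:.1f} GB/s")
print(f"256-bit GDDR3 @ 1.8Gbps: {gddr3_256:.1f} GB/s")
```

With these assumed rates, the narrow GDDR5 bus matches the wide GDDR3 bus, while a 256-bit GDDR3 card gets half as much. The doubling therefore depends on the memory type, not on the bus width alone.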
Speaking of these render back-ends, they have now been specifically designed to improve AA performance, which is a welcome change since we all remember how much the R600-series suffered when AA was turned on. According to AMD, this implementation effectively doubles AA performance when compared to an HD3870.