Putting it all Together; Kepler's Core Revealed
In many ways, the genesis of the Kepler architecture came with the realization that Fermi's large-die, monolithic approach on an inefficient 40nm process node would cause long-term issues on a number of fronts. Not only did the first Fermi-based GF100 cards consume loads of power but their cores were expensive to produce due to the inherent yield issues that arose when producing a 529mm² die with 3.2 billion transistors. Unfortunately, creating a whole new architecture from the ground up necessitated these sacrifices but things obviously needed to change in a big way in order to bring the GTX 680 and its ilk up to modern standards.
The one thing making this all possible is the switch to TSMC's 28nm manufacturing process, which allows a ton of transistors to be packed into a relatively small die area. In the case of the GK104, we're talking about 3.5 billion transistors in an area of just 294mm², which makes it slightly more than half the size of the 3.2 billion transistor GF110 and also a good 20% smaller than Tahiti. This goes to illustrate just how far we've come in the last two years.
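To put those die-size claims in perspective, here is a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above, with one exception: Tahiti's die area isn't stated in this article, so the 365mm² value below is an assumption on our part.

```python
# Die figures quoted in the text above.
GK104_AREA_MM2 = 294.0
GK104_TRANSISTORS = 3.5e9
FERMI_AREA_MM2 = 529.0        # the Fermi die size quoted earlier
FERMI_TRANSISTORS = 3.2e9
TAHITI_AREA_MM2 = 365.0       # ASSUMPTION: not stated in this article


def density_mtr_per_mm2(transistors: float, area_mm2: float) -> float:
    """Return transistor density in millions of transistors per mm^2."""
    return transistors / area_mm2 / 1e6


# GK104 area as a fraction of the Fermi die.
area_ratio = GK104_AREA_MM2 / FERMI_AREA_MM2

# Fraction of die area saved relative to Tahiti.
tahiti_saving = 1.0 - GK104_AREA_MM2 / TAHITI_AREA_MM2

gk104_density = density_mtr_per_mm2(GK104_TRANSISTORS, GK104_AREA_MM2)
fermi_density = density_mtr_per_mm2(FERMI_TRANSISTORS, FERMI_AREA_MM2)

print(f"GK104 / Fermi area: {area_ratio:.2f}")
print(f"Smaller than Tahiti by: {tahiti_saving:.0%}")
print(f"Density (MTr/mm^2): GK104 {gk104_density:.1f}, Fermi {fermi_density:.1f}")
```

With these inputs the area ratio works out to roughly 0.56 and the density to nearly double that of Fermi, which is about what you would expect from a full node shrink.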
Another interesting aspect of this new architecture is the fact that NVIDIA's first card out of the gate doesn't boast what would be called a "fully enabled" core. While the GK104 doesn't necessarily come with any of its elements disabled, its code name seems to indicate that NVIDIA may have an even higher performance core waiting in the wings. Indeed, with a TDP of just 195 watts, the GTX 680 still has loads of headroom before it would hit the 250W plateau.
The GK104 core is segmented into four distinct groups called Graphics Processing Clusters or GPCs, which are then broken down again into individual Streaming Multiprocessors (SMs), raster engines and so on. Each of the GPCs can be considered a dedicated compute unit all on its own with self-contained processing engines and rendering stages. This generalized layout and its functionality haven't changed much from the Fermi architecture, but Kepler builds upon the lessons learned from its predecessors in several ways.
There is now a total of 1536 CUDA cores, a drastic increase over the 512 present within the GF110. Regardless of the physical die space this change requires, a reduction in clock speeds alongside an increase in the core count has allowed NVIDIA to realize a twofold improvement in performance per watt. In effect, NVIDIA is taking advantage of the 28nm manufacturing process by trading power-hungry processing speed for a more efficient approach built around a larger number of cores.
The Raster Engines haven't undergone a facelift per se and they still work in a highly parallelized fashion, but several other architectural changes like a drastic speed increase for the PolyMorph engines meant their functionality had to be rationalized. As such, processing efficiency was increased in order to better handle the workload being thrown in their direction.
With high-level geometry rendering taking up many of the headlines, texture throughput may not be something that's discussed all that much these days but it still plays an integral role in game performance. NVIDIA has massaged this portion of their architecture as well by upping the Texture Unit count from 64 in the GF110 to 128 in the new Kepler-based GK104. As we drill down into each SMX on the next page, we will see how additional refinements have been integrated within each processing stage to ensure proper load balancing and optimal efficiency.
On the periphery of the GK104 die is the GigaThread Engine along with the memory controllers. The GigaThread Engine performs the somewhat thankless duty of reading the CPU's commands over the new PCI-E 3.0 host interface and then fetching data from the system's main memory bank. The data is then copied over onto the framebuffer of the graphics card itself before being passed along to the designated engine within the core.
Speaking of memory, things have changed here as well but some will think NVIDIA took a step backwards. Even though AMD's Tahiti architecture and the previous generation GF100 / GF110 all incorporated 384-bit memory interfaces, Kepler (for the time being at least) makes do with a 256-bit interface spread across a quartet of 64-bit GDDR5 memory controllers. However, in order to compensate for this, NVIDIA has updated the controllers themselves for compatibility with 6Gbps and higher memory ICs, so the GK104's combined bandwidth is still on par with the GTX 580's.
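The bandwidth parity claim is easy to sanity-check, since peak bandwidth is simply the bus width in bytes multiplied by the per-pin data rate. The sketch below uses the 256-bit and 6Gbps figures from the text; the roughly 4Gbps data rate for the GTX 580's memory is our assumption and isn't stated in this article.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Peak memory bandwidth in GB/s.

    bus_width_bits: total interface width across all controllers.
    data_rate_gbps: effective per-pin data rate in gigabits per second.
    """
    return bus_width_bits / 8 * data_rate_gbps


# GK104: four 64-bit controllers with 6Gbps memory (figures from the text).
gk104_bw = peak_bandwidth_gb_s(4 * 64, 6.0)

# GTX 580: 384-bit interface; the ~4Gbps data rate is an ASSUMPTION.
gtx580_bw = peak_bandwidth_gb_s(384, 4.0)

print(f"GK104:   {gk104_bw:.0f} GB/s")    # 192 GB/s
print(f"GTX 580: {gtx580_bw:.0f} GB/s")   # 192 GB/s
```

In other words, a bus that is one third narrower is offset by memory running about 50% faster per pin, which is how the two cards can end up at essentially the same peak figure.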
Each of the 64-bit memory controllers is paired up with 128KB of L2 cache and a single ROP partition, which mirrors the layout from previous generations, but once again there are some significant differences built into Kepler that allow for higher performance. While the L2 cache may only total 512KB, the overall communication bandwidth has been doubled, eliminating any potential bottlenecks and allowing for much quicker data handoffs between the various rendering stages.
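For readers who like to see how the totals fall out of the per-controller layout, here is a small illustrative model. The `MemoryPartition` class is purely hypothetical (it isn't any NVIDIA API), and the figure of eight ROPs per partition is taken from the 32-ROP total that the article divides into four groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPartition:
    """Hypothetical model of one GK104 memory partition."""
    bus_width_bits: int = 64   # one 64-bit GDDR5 controller
    l2_cache_kb: int = 128     # L2 slice paired with the controller
    rops: int = 8              # 32 ROPs split into four groups


# GK104 uses a quartet of these partitions.
partitions = [MemoryPartition() for _ in range(4)]

total_bus_width = sum(p.bus_width_bits for p in partitions)
total_l2_kb = sum(p.l2_cache_kb for p in partitions)
total_rops = sum(p.rops for p in partitions)

print(f"Bus width: {total_bus_width}-bit")  # 256-bit
print(f"L2 cache:  {total_l2_kb}KB")        # 512KB
print(f"ROPs:      {total_rops}")           # 32
```

Because everything scales per partition, adding or removing a controller in this model changes the bus width, L2 size and ROP count together.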
The design of the ROP units has also been rationalized with the 32 ROPs being divided into four groups of eight. At first glance, this may not seem like a noteworthy change since GF110 also incorporated 32 ROPs, but the differences between past and present architectures are significant. Since each of the ROP units is driven by its associated GPC, both GF114 and GF110 couldn't take full advantage of the Render Output Units due to mismatched throughput between the two. For example, the GF114 had two GPCs running at a combined 16 pixels per clock while the four ROP partitions had a render throughput of 32 samples per clock. This setup allowed the extra ROPs to help out in some cases (particularly with uncompressed AA rendering) but actually led to several pipeline inefficiencies that hurt overall performance.
Kepler meanwhile has a 1:1 ratio between raster and ROP throughput, allowing the GPCs to process a combined 32 pixels per clock, which lines up perfectly with the 32 samples per clock of the ROP partitions. The result is a twofold speedup in aliased rendering and compressed anti-aliased rendering, even though uncompressed AA performance hasn't been touched.
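The raster-to-ROP balance described above can be expressed as a simple ratio. In the sketch below, the eight pixels per clock per GPC is inferred from the GF114 example (two GPCs at 16 pixels per clock) rather than stated outright, so treat it as an assumption; the function name is ours.

```python
PIXELS_PER_GPC_PER_CLOCK = 8  # ASSUMPTION: inferred from 2 GPCs -> 16 px/clk


def raster_to_rop_ratio(gpc_count: int, rop_samples_per_clock: int) -> float:
    """Ratio of raster output (pixels/clock) to ROP capacity (samples/clock).

    A value of 1.0 means the raster engines can keep every ROP fed;
    anything lower means some ROP capacity sits idle on raster-bound work.
    """
    raster_pixels_per_clock = gpc_count * PIXELS_PER_GPC_PER_CLOCK
    return raster_pixels_per_clock / rop_samples_per_clock


gf114_ratio = raster_to_rop_ratio(gpc_count=2, rop_samples_per_clock=32)
gk104_ratio = raster_to_rop_ratio(gpc_count=4, rop_samples_per_clock=32)

print(f"GF114: {gf114_ratio:.1f}")  # 0.5
print(f"GK104: {gk104_ratio:.1f}")  # 1.0
```

Going from 0.5 to 1.0 is the twofold improvement mentioned above: the ROP count hasn't grown, but the front end can now supply enough pixels to use all of it.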