The SMX: Kepler's Building Block
Much like Fermi, Kepler uses a modular architecture built from dedicated, self-contained compute / graphics units called Streaming Multiprocessors or, in this case, Extreme Streaming Multiprocessors (SMX). While the basic design and implementation principles are the same as in the previous generation (other than a doubling of the parallel threading capacity), several changes have been built into this version to help it further maximize performance while consuming less power than its predecessor.
Due to die space limitations on the 40nm manufacturing process, the Fermi architecture had to make do with fewer CUDA cores, but NVIDIA offset this shortcoming by running those cores at a higher speed than the rest of the processing stages. The result was a 1:2 graphics to CUDA core clock ratio that led to excellent performance but, unfortunately, high power consumption.
As we already mentioned, the inherent efficiencies of TSMC's 28nm manufacturing process have allowed Kepler's SMX to take a different path, offering six times the number of processors but running them at a 1:1 clock ratio with the rest of the core. Essentially, we are left with core components that run at slower speeds, but in this case sheer volume makes up for and indeed surpasses that limitation. In theory, this should lead to an increase in raw processing power for graphics-intensive workloads and higher performance per watt, even though the CUDA cores' basic functionality and throughput haven't changed.
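As a rough illustration of that trade-off, the sketch below compares a single Fermi SM with a single Kepler SMX using only the ratios quoted in this section. It assumes 32 CUDA cores per Fermi SM (inferred from the "six times" figure and Kepler's 192 cores per SMX), and the function name and the normalized "core operations per base clock" metric are illustrative choices rather than NVIDIA figures. Real-world results also depend on actual clock speeds and on how many SM or SMX blocks each chip carries.

```python
def core_ops_per_base_clock(cuda_cores: int, core_clock_multiplier: float) -> float:
    """Core operations issued per tick of the base (graphics) clock for one SM/SMX.

    core_clock_multiplier is the CUDA core clock relative to the base clock:
    2.0 for Fermi's 1:2 ratio, 1.0 for Kepler's 1:1 ratio.
    """
    return cuda_cores * core_clock_multiplier


# Assumption: 32 cores per Fermi SM (192 / 6), running at twice the base clock.
fermi_sm = core_ops_per_base_clock(cuda_cores=32, core_clock_multiplier=2.0)

# From the text: 192 cores per Kepler SMX, running at the base clock.
kepler_smx = core_ops_per_base_clock(cuda_cores=192, core_clock_multiplier=1.0)

print(f"Fermi SM:   {fermi_sm:.0f} core ops per base clock")
print(f"Kepler SMX: {kepler_smx:.0f} core ops per base clock")
print(f"Kepler SMX / Fermi SM: {kepler_smx / fermi_sm:.1f}x")
```

Under these assumptions, the sixfold jump in core count more than offsets the loss of the doubled core clock, which is the "sheer volume" argument in numbers.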
Each SMX holds 192 CUDA cores along with 32 load / store units, which allow a total of 32 threads per clock to be processed. Alongside these core blocks are the Warp Schedulers and their associated dispatch units, which feed up to 64 concurrent warps (groups of 32 threads) to the cores, while the primary register file currently sits at 65,536 x 32-bit. Most of these resources have been doubled over the previous generation to avoid causing bottlenecks now that each SMX's CUDA core count is so high.
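To put the scheduler and register file figures in perspective, here is a small back-of-the-envelope calculation. It assumes the standard CUDA warp size of 32 threads and up to 64 resident warps per SMX; the constant and function names are hypothetical, and the sketch ignores any per-thread register cap or allocation granularity that real hardware imposes.

```python
WARP_SIZE = 32              # threads per warp (assumption: standard CUDA warp size)
MAX_WARPS_PER_SMX = 64      # assumption: resident warps per SMX
REGISTERS_PER_SMX = 65_536  # 32-bit registers per SMX, from the text


def registers_per_thread(resident_warps: int) -> int:
    """Registers available to each thread if the file is split evenly."""
    threads = resident_warps * WARP_SIZE
    return REGISTERS_PER_SMX // threads


max_threads = MAX_WARPS_PER_SMX * WARP_SIZE
print(f"Maximum resident threads per SMX: {max_threads}")

for warps in (16, 32, 64):
    print(f"{warps} warps -> {warps * WARP_SIZE} threads, "
          f"{registers_per_thread(warps)} registers per thread")
```

The takeaway is that a fully occupied SMX leaves each thread a fairly small share of the register file, which is one reason the file had to grow alongside the core count.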
NVIDIA's ubiquitous PolyMorph geometry engine has gone through a redesign as well. Each engine still contains five stages, from Vertex Fetch to Stream Output, which process data from the SMX they are associated with. The data is then output to the Raster Engine within each Graphics Processing Cluster. To further speed up operations, data is dynamically load balanced and passes from one of the eight PolyMorph engines to another through the on-die caching infrastructure for increased communication speed.
The main difference between the current and past generation PolyMorph engines boils down to data stream efficiency. The new "2.0" version in the Kepler core boasts primitive rates that are two times higher and, along with other improvements throughout the architecture, offers a fourfold increase in tessellation performance over the Fermi-based cores.
The SMX plays host to a dedicated caching network which runs parallel to the primary core stages and helps store draw calls so they are not passed off through the card's memory controllers, where they would take up valuable storage space. Not only does this help with geometry processing efficiency, but GPGPU performance can also be drastically increased, provided an API can take full advantage of the caching hierarchy.
As with Fermi, each one of Kepler's SMX blocks has 64 KB of shared, programmable on-chip memory that can be configured in one of three ways. It can be laid out either as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler adds a third 32/32 mode which balances out the configuration for situations where the core may be processing graphics in parallel with compute tasks. This L1 cache is supposed to help with access to the on-die L2 cache as well as streamline functions like stack operations and global loads / stores. However, in total, the GK104 has fewer SMX blocks than Fermi has SMs, which results in significantly less on-die memory. This could negatively impact compute performance in some instances.
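The three layouts and the "less on-die memory" point can be sketched with some simple arithmetic. The mode labels below are hypothetical names chosen for readability, the GK104 count of eight SMX blocks follows from the eight PolyMorph engines mentioned earlier, and the count of 16 SMs for a full Fermi die is an assumption not stated in this section.

```python
ON_CHIP_KB_PER_UNIT = 64  # shared memory + L1 per SM/SMX, from the text

# (shared memory KB, L1 cache KB) for each layout described above.
# The keys are illustrative labels, not official mode names.
SPLITS = {
    "prefer_shared": (48, 16),
    "prefer_l1": (16, 48),
    "equal": (32, 32),  # the mode added with Kepler
}


def total_on_chip_kb(unit_count: int) -> int:
    """Combined shared memory + L1 across all SM/SMX blocks on the die."""
    return unit_count * ON_CHIP_KB_PER_UNIT


for name, (shared_kb, l1_kb) in SPLITS.items():
    # Every layout must account for the full 64 KB.
    assert shared_kb + l1_kb == ON_CHIP_KB_PER_UNIT
    print(f"{name}: {shared_kb} KB shared / {l1_kb} KB L1")

gk104_total = total_on_chip_kb(8)   # eight SMX blocks
fermi_total = total_on_chip_kb(16)  # assumption: 16 SMs on a full Fermi die

print(f"GK104 total: {gk104_total} KB")
print(f"Fermi total: {fermi_total} KB")
```

If the assumed Fermi SM count holds, the GK104 ends up with half the combined shared memory and L1 capacity, despite each individual SMX matching a Fermi SM.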
Even though there haven't been any fundamental changes in the way textures are handled across the Kepler architecture, each SMX receives a huge influx of texture units: 16, up from Fermi's four.