by Michael "SKYMTL" Hoenig | October 9, 2009
The Next Gen Fermi Architecture

By now I am sure many of you reading this already know that the GPU Technology Conference was used as the venue to announce NVIDIA’s next generation architecture, code-named Fermi. Since it was announced at a conference that was almost entirely focused on GPU computing, there was very little information for the gamers who are eagerly awaiting NVIDIA’s answer to ATI’s 5000-series. However, there were some tantalizing clues about what we can expect from the GeForce iteration of Fermi, and in this section we will endeavor to put as much of the architectural jargon as possible into words you can all understand.

Taking a look at the core layout of the flagship Fermi GPU, we can see that there are quite a few similarities to the G80 and G200 series, with the 512 Shader Processors (now called CUDA Processors) shown in green taking up the majority of the die space. On the outer fringes of the die are a total of six 64-bit memory partitions which can be addressed separately for a combined 384-bit GDDR5 memory interface. The other items on the die’s periphery are the GigaThread unit (we will talk more about this later) and the Host Interface. The shared L2 cache is also new to this GPU but, like the GigaThread unit, we will be covering it a bit later. As you may have expected, all of these components together make the Fermi architecture one of the most complex to date and, at 3 billion transistors, also one of the largest.

The CUDA Processors are grouped into units of 32 to form a Streaming Multiprocessor (SM), and there are 16 of these SMs per Fermi core. Meanwhile, each of the CUDA cores features a single ALU (arithmetic logic unit) and FPU (floating point unit), much like previous generations.
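
As a quick check on the figures quoted so far, the short Python sketch below recomputes the total CUDA Processor count and the combined memory interface width from the per-unit numbers. The constant and function names are our own, chosen purely for illustration; only the values come from the text above.

```python
# Figures quoted in this article for the flagship Fermi GPU.
# The names below are illustrative, not NVIDIA terminology.
CORES_PER_SM = 32        # CUDA Processors grouped into each Streaming Multiprocessor
SM_COUNT = 16            # SMs per Fermi core
MEMORY_PARTITIONS = 6    # memory partitions on the outer edge of the die
BITS_PER_PARTITION = 64  # width of each memory partition, in bits


def total_cuda_cores(sm_count: int = SM_COUNT, cores_per_sm: int = CORES_PER_SM) -> int:
    """Total CUDA Processors across all SMs."""
    return sm_count * cores_per_sm


def memory_bus_width(partitions: int = MEMORY_PARTITIONS,
                     bits_per_partition: int = BITS_PER_PARTITION) -> int:
    """Combined memory interface width in bits."""
    return partitions * bits_per_partition


if __name__ == "__main__":
    print(total_cuda_cores())   # 512
    print(memory_bus_width())   # 384
```

Both results line up with the 512 CUDA Processors and the 384-bit GDDR5 interface mentioned earlier.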
Where the Fermi architecture breaks from the mold is in its support for the new IEEE 754-2008 floating point standard, which brings new instructions such as fused multiply-add (FMA) for both single and double precision arithmetic. Double precision arithmetic is mostly used in HPC (high performance computing) scenarios such as quantum chemistry and will have little effect on this card’s performance in most people’s home systems.

In past NVIDIA GPUs, each of the SMs had access to a mere 16KB of shared memory, but with Fermi each SM has 64KB of on-chip memory. In addition, that 64KB can be split between shared memory and L1 cache in a number of ways depending on the needs of an application. Speaking of cache, Fermi is the first NVIDIA GPU with a memory hierarchy that includes both L1 and a shared L2 cache. This allows for improved bandwidth as well as fast data sharing between the SMs through the shared L2 cache.

Finally, in our quick run-through of this new architecture we come to the GigaThread hardware thread scheduler, which allows for concurrent kernel execution. In a serial kernel execution scenario, each kernel has to wait for the one before it to finish before it can begin. With concurrent kernels, however, the whole GPU can be utilized since different kernels of the same application context can run on Fermi at the same time. This not only frees up resources but also allows for quicker processing of things like in-game physics or AI.

While this section has only scratched the surface of what Fermi has to offer, we intend to offer you a more in-depth look as the launch gets closer. Until then, we have this small comparison diagram for you:

![]()
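
To make the difference between serial and concurrent kernel execution a little more concrete, here is a toy Python model. It is only a sketch under heavy assumptions: it treats each kernel as occupying a fixed number of SMs for a fixed time, and every kernel name, SM count and run time in it is hypothetical. It does not describe how the GigaThread scheduler actually makes its decisions.

```python
import heapq
from typing import List, Tuple

SM_COUNT = 16  # SMs on the flagship Fermi GPU, as quoted in this article

# (name, SMs occupied, run time in ms). All values are made up for illustration.
Kernel = Tuple[str, int, float]


def serial_time(kernels: List[Kernel]) -> float:
    """Each kernel waits for the previous one, so run times simply add up."""
    return sum(duration for _, _, duration in kernels)


def concurrent_time(kernels: List[Kernel], sm_count: int = SM_COUNT) -> float:
    """Start kernels in submission order whenever enough SMs are free."""
    if any(sms > sm_count for _, sms, _ in kernels):
        raise ValueError("a kernel needs more SMs than the GPU has")

    pending = list(kernels)
    running: List[Tuple[float, int]] = []  # min-heap of (finish time, SMs held)
    free = sm_count
    now = 0.0

    while pending or running:
        # Launch as many queued kernels as currently fit on the free SMs.
        while pending and pending[0][1] <= free:
            _, sms, duration = pending.pop(0)
            heapq.heappush(running, (now + duration, sms))
            free -= sms
        # Jump ahead to the next kernel completion and release its SMs.
        finish, sms = heapq.heappop(running)
        now = finish
        free += sms

    return now


# Hypothetical per-frame workload: graphics, physics and AI kernels.
FRAME_KERNELS: List[Kernel] = [
    ("graphics", 6, 4.0),
    ("physics", 6, 2.0),
    ("ai", 4, 2.0),
]

if __name__ == "__main__":
    print(serial_time(FRAME_KERNELS))      # 8.0
    print(concurrent_time(FRAME_KERNELS))  # 4.0
```

In this made-up workload the three kernels together fit on the 16 SMs, so running them side by side finishes in the time of the longest kernel rather than the sum of all three. Real gains depend entirely on how much of the GPU each kernel can occupy on its own.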