Efficiency Through Caching
Dedicated L1 and L2 caches help not only with GPGPU computing but also with storing draw calls so they are not passed off to the memory on the graphics card. This is supposed to drastically streamline rendering, especially in situations with a lot of higher-level geometry.
Above we have an enlarged section of the cache and memory layout within each SM. To put things into perspective, a Streaming Multiprocessor has 64 KB of shared, programmable on-chip memory that can be configured in one of two ways. It can be laid out either as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. However, when used for graphics processing as opposed to GPGPU functions, the SM will make use of the 16 KB L1 cache configuration. This L1 cache is supposed to help with access to the L2 cache as well as streamline functions like stack operations and global loads / stores.
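To make the two layouts concrete, here is a minimal Python sketch of the split described above. It only models the arithmetic and is not NVIDIA code: the names SMMemoryConfig, PREFER_SHARED, PREFER_L1 and select_config are hypothetical, and the rule that graphics work always lands on the 16 KB L1 layout is taken directly from the paragraph above.

```python
from dataclasses import dataclass

ON_CHIP_KB = 64  # configurable on-chip memory per SM, as stated in the article


@dataclass(frozen=True)
class SMMemoryConfig:
    """One way of splitting the 64 KB between shared memory and L1 (illustrative)."""
    shared_kb: int
    l1_kb: int

    def __post_init__(self) -> None:
        # The two regions are carved out of the same 64 KB block.
        if self.shared_kb + self.l1_kb != ON_CHIP_KB:
            raise ValueError("shared memory and L1 must add up to 64 KB")


# The only two layouts the article describes.
PREFER_SHARED = SMMemoryConfig(shared_kb=48, l1_kb=16)
PREFER_L1 = SMMemoryConfig(shared_kb=16, l1_kb=48)


def select_config(workload: str, prefer_l1: bool = False) -> SMMemoryConfig:
    """Pick a layout; 'graphics' and 'compute' are hypothetical workload labels."""
    if workload == "graphics":
        # Per the article, graphics work uses the 16 KB L1 configuration.
        return PREFER_SHARED
    if workload == "compute":
        return PREFER_L1 if prefer_l1 else PREFER_SHARED
    raise ValueError(f"unknown workload: {workload!r}")


if __name__ == "__main__":
    print(select_config("graphics"))
    print(select_config("compute", prefer_l1=True))
```

A GPGPU kernel that leans heavily on global loads and stores would plausibly favour the 48 KB L1 layout, while one that stages data by hand would favour 48 KB of shared memory. In CUDA that preference is expressed per kernel through cudaFuncSetCacheConfig.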
In addition, each texture unit now has its own high-efficiency cache, which helps with rendering speed.
Through the L2 cache architecture, NVIDIA is able to keep most of the data for rendering functions like tessellation, shading and rasterizing on-die instead of going to the framebuffer (DRAM), which would slow down the process. Caching gives the GPU bandwidth amplification and alleviates the memory bottlenecks that normally occur when doing multiple reads and writes to the framebuffer. In total, the GF100 has 768 KB of L2 cache which is dynamically load balanced for peak efficiency.
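As a rough illustration of what bandwidth amplification means here, the sketch below works through the arithmetic with a simple hit-rate model: every request served from the L2 is a request that never reaches DRAM. The function names and the 75% hit rate are assumptions chosen for illustration only; NVIDIA has not published hit rates for GF100 and real workloads will vary.

```python
def dram_traffic_gb(requested_gb: float, l2_hit_rate: float) -> float:
    """Data that still has to come from DRAM after the L2 absorbs its hits."""
    if not 0.0 <= l2_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    return requested_gb * (1.0 - l2_hit_rate)


def bandwidth_amplification(l2_hit_rate: float) -> float:
    """How much more data the GPU can consume than DRAM alone delivers."""
    if not 0.0 <= l2_hit_rate <= 1.0:
        raise ValueError("hit rate must be between 0 and 1")
    if l2_hit_rate == 1.0:
        return float("inf")  # nothing goes off-chip at all
    return 1.0 / (1.0 - l2_hit_rate)


if __name__ == "__main__":
    hit_rate = 0.75  # hypothetical value, not a measured GF100 figure
    print(dram_traffic_gb(100.0, hit_rate))   # GB that still reach DRAM
    print(bandwidth_amplification(hit_rate))  # effective multiplier
```

Under this simplified model, a cache that catches three out of four requests leaves only a quarter of the traffic for the framebuffer, which is the kind of relief the paragraph above describes.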
It is also possible for the L1 and L2 caches to do loads and stores to memory and pass data from engine to engine so nothing moves off chip. Unfortunately, the downside of this approach is that processing geometry in a parallel and scalable way without using DRAM bandwidth takes up significant die area.
Compared with the new GeForce GF100, the previous architecture's caching is far more limited. The GT200 only used cache for textures and featured a read-only L2 cache structure, whereas the new GPU's L2 is read/write and caches everything from vertex data to textures to ROP data and nearly everything in between.
By contrast, with their Radeon HD 5000-series, ATI dumps all of the data from the geometry shaders to memory and then pulls it back into the core for rasterization before output. This causes a drop in efficiency and therefore performance. Meanwhile, as we discussed before, NVIDIA is able to keep all of these functions on-die in the cache without introducing memory latency into the equation or hogging bandwidth.
So what does all of this mean for the end-user? Basically, it means vastly improved memory efficiency since less bandwidth is being taken up by unnecessary reads and writes. This should benefit the GF100 in high resolution, high image quality (IQ) situations where lesser graphics cards' framebuffers can easily become saturated.