AMD’s New Hardware Scheduler, PDA & Color Compression
If you’ve actually read this far it should be evident that Polaris is much more than another evolutionary step for AMD’s GCN architecture. Rather, it represents a relatively large shift towards improving pipeline efficiency and enhancing both current and next generation workloads. However, what we described on the last page was simply the tip of the iceberg.
As you may have noticed, AMD has taken to directly comparing Polaris 10 to the R9 290 / R9 390, both from a performance and an architectural standpoint. Where Polaris differs from previous generations is that one of its primary raisons d’être is doing more with less. That can be seen in its quantity of ROPs, shaders and texture units; there are at times substantially fewer than were rolled into those aforementioned cards, and yet the RX 480 is supposed to beat them cleanly. The same can be said of the dedicated Asynchronous Compute Engines: there are only four of them versus Hawaii’s eight, yet this architecture’s ability to process async workloads should actually improve.
Instead of a full allocation of ACEs, two of them have been replaced by dedicated Hardware Schedulers. These schedulers can dynamically allocate entire CUs, or percentages of them, for different purposes. This is key for temporal and spatial resource management, for scheduling concurrent tasks or processes in asynchronous compute / shader environments and, perhaps most importantly, for dynamic load balancing within the graphics engines. Think of this as an Asynchronous Compute Engine on steroids, one that is far more adaptable to a variety of workloads.
All of this may sound a bit complicated, but it leads to a significant performance uplift in asynchronous environments and also increases the granularity with which the Compute Units can be controlled. With the Hardware Scheduler in place, entire CUs, or even a percentage of each CU, can be dedicated to a specific task and scaled in a completely dynamic manner. For example, the HWS can run AMD’s TrueAudio Next acceleration (which is now utilized for positional audio in VR applications) on a single block of SIMDs or on an entire CU, depending on the amount of computing power called upon by a given application. In plain English, this means far less of the core will sit idle at any given time, and higher utilization rates can lead to better overall performance.
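To make the dynamic load-balancing idea concrete, here is a toy software analogy: a function that splits a pool of Compute Units between queues in proportion to their pending work. Every name and number here (the queue names, the workload figures, even treating allocation as whole CUs) is purely illustrative; AMD’s Hardware Scheduler is fixed-function silicon, not code like this.

```python
# Toy analogy for dynamic CU partitioning by a hardware scheduler.
# All queue names and workload numbers are illustrative, not AMD's mechanism.

TOTAL_CUS = 36  # Polaris 10 ships with 36 Compute Units

def partition_cus(pending_work):
    """Split CUs proportionally to each queue's pending work,
    guaranteeing at least one CU to any queue that has work."""
    total = sum(pending_work.values())
    if total == 0:
        return {q: 0 for q in pending_work}
    alloc = {q: (max(1, round(TOTAL_CUS * w / total)) if w else 0)
             for q, w in pending_work.items()}
    # Trim any rounding overshoot from the largest allocation.
    while sum(alloc.values()) > TOTAL_CUS:
        big = max(alloc, key=alloc.get)
        alloc[big] -= 1
    return alloc

print(partition_cus({"graphics": 900, "async_compute": 80, "audio": 20}))
# → {'graphics': 32, 'async_compute': 3, 'audio': 1}
```

As the workload mix shifts frame to frame, re-running the split reassigns CUs on the fly, which is the core idea behind keeping fewer units idle.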
Another interesting addition within Polaris is what AMD is calling the Primitive Discard Accelerator. This new functionality allows the geometry engines to analyze a scene’s geometry profile and discard unnecessary triangles earlier in the rendering pipeline. Basically, when the PDA is properly utilized the core won’t spend valuable resources rendering geometry that never affects the final image. This is particularly important when using MSAA or any anti-aliasing routine that requires multiple passes, and as such the performance uplift generally grows as higher levels of AA are used.
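The general principle can be sketched in a few lines: compute each triangle’s signed screen-space area and drop the ones that are degenerate (zero area) or back-facing before any rasterization work is spent on them. This is a simplified software model of the idea, not a description of AMD’s actual hardware tests, which operate on different criteria and at a different pipeline stage.

```python
# Sketch of early primitive discard: drop triangles whose signed
# screen-space area is zero (degenerate) or negative (back-facing).
# A simplified model of the concept, not AMD's hardware implementation.

def signed_area(a, b, c):
    """Twice the signed area of triangle abc in 2D screen space."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])

def discard_primitives(triangles):
    """Keep only front-facing, non-degenerate triangles."""
    return [t for t in triangles if signed_area(*t) > 0]

tris = [
    ((0, 0), (4, 0), (0, 4)),   # front-facing: kept
    ((0, 0), (0, 4), (4, 0)),   # back-facing: discarded
    ((1, 1), (2, 2), (3, 3)),   # collinear, zero area: discarded
]
print(len(discard_primitives(tris)))  # → 1
```

The earlier in the pipeline a triangle is rejected, the less shading, depth-testing and multi-sample work it costs, which is why the benefit compounds at high AA levels.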
There’s also a new Index Cache which acts as a fast-access pool for small instanced geometry. Essentially, this cache limits the amount of data that moves between various pipeline stages and frees up internal bandwidth. When combined with the Primitive Discard Accelerator, Polaris 10 can in theory offer up to 3.5x higher geometry performance than previous generations could.
While GDDR5 simply can’t offer the bandwidth of more recent technologies like HBM and GDDR5X, AMD has still found ways to enhance throughput without moving the architecture to a more expensive memory standard. They also didn’t want to take up valuable die space with a wider memory interface. As such, the next logical step was to augment the memory color compression algorithms in an effort to add efficiency rather than raw theoretical bandwidth.
Previous AMD architectures did include some forms of delta color compression, but Polaris steps things up a notch with native support for 2:1, 4:1 and even 8:1 ratios. The Radeon Technologies Group still readily admits they have a long way to go before they catch up with NVIDIA’s color compression technologies, but Polaris represents a big step towards narrowing the gap. In this form, their DCC algorithms can allow the relatively narrow 256-bit memory interface to perform very much like a 512-bit link.
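To see why delta compression saves bandwidth, consider a rough sketch: store one anchor value per tile plus small per-pixel deltas, and pick the narrowest delta width that covers the tile. The tile size, bit-packing scheme and resulting ratios below are illustrative assumptions, not AMD’s actual DCC format, but they show how uniform regions approach the headline 8:1 figure while noisier ones compress less.

```python
# Rough sketch of delta color compression on one 8-bit channel of a tile:
# one anchor value plus per-pixel deltas at the smallest width that fits.
# The packing scheme and ratios are illustrative, not AMD's DCC format.

def dcc_ratio(tile, pixel_bits=8):
    """Compression ratio for a flat list of channel values."""
    anchor = tile[0]
    deltas = [v - anchor for v in tile]
    spread = max(deltas) - min(deltas)
    delta_bits = max(1, spread.bit_length())   # bits needed per delta
    raw = len(tile) * pixel_bits               # uncompressed size
    packed = pixel_bits + len(tile) * delta_bits  # anchor + deltas
    return raw / packed

flat_tile = [128] * 64                            # perfectly uniform tile
noisy_tile = [120 + (i % 4) for i in range(64)]   # slight variation

print(round(dcc_ratio(flat_tile), 1))   # → 7.1 (near the 8:1 best case)
print(round(dcc_ratio(noisy_tile), 1))  # → 3.8
```

Since typical frames contain large runs of similar colors (skies, UI, shadows), the average ratio lands well above 1:1, which is how a 256-bit bus can behave like a much wider one.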
Along with the vastly improved L2 caching hierarchy and the enhanced delta color compression algorithms, the memory interface on Polaris 10 may not have a huge amount of bandwidth, but it is extremely efficient. Supposedly, effective memory performance per bit has increased by nearly 40%, which not only saves power but also allows developers to maximize utilization of this key interface.