Diving into the Barts LE Core
Unlike what they did with the Evergreen series, AMD isn’t trying to rewrite the book on performance or push new boundaries with their “refreshed” cards. Rather, the dual engine architecture which distinguished Cypress has been carried over with a few modifications made along the way. Barts isn’t the focus of a fundamental architectural change in any way, shape or form. It is all about the gradual refinement of an existing design into something with a smaller die size and superior performance per watt. The end result is that AMD can now push a more affordable high performance solution to end users without sacrificing profit margins.
At first glance, there isn’t all that much to distinguish the full Barts XT core from the outgoing Cypress other than the obvious change in the number of SIMDs, which results in fewer overall SPs (or Stream Cores as AMD calls them). In order to achieve high end performance which is optimized for efficiency, the engineers started with the basic back-end of the Cypress XT and built up from there. This means the graphics engine, including the fixed function stages, L2 cache, ROP arrays and memory controllers, has gone largely untouched. There were some changes to improve tessellation performance and communication between the different stages within the rendering pipeline, but the vast majority of tweaks happened within the SIMD engine layout.
Since the Cypress Pro (which AMD replaced with the Barts XT) used a full Cypress XT core with a few disabled SIMDs, it was inherently inefficient from a number of perspectives. In order to increase performance per watt, the HD 5850 was taken as a benchmark and the engineers set about trying to match the “sweet spot” it occupied in the market with a slightly revised layout.
The Barts core in its XT, Pro and LE forms retains the same 80 SPs along with four texture units, 32KB of Local Data Share and 16KB of L1 texture cache per SIMD as the Cypress series. What has changed is the total number of SIMDs per core, which has shrunk from 20 down to the 14 we now see in Barts. This in effect lowers the maximum possible SPs from 1600 down to 1120 and the number of TMUs from 80 to 56. However, since the render back-ends aren’t touched, the Barts XT has a full 32 ROPs. The memory interface also remains at 256-bit for the GDDR5, which is actually a first for an architecture that is aimed exclusively at the sub-$250 market.
The Barts LE, however, is a different animal from a number of perspectives. In order to further increase yields and reduce the overall price per core, AMD has eliminated an additional two SIMD engines on top of the two already cut to create the Barts Pro. The result is 10 SIMD engines totaling 800 cores and 40 Texture Units, but the cutting didn’t stop there. With the higher clock speeds these changes allowed, some separation between the Pro and LE was needed, so a full half of the Barts ROPs were disabled. This change also affects the number of colour ROPs and L2 cache available on the LE.
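The unit counts quoted above all follow from multiplying the per-SIMD resources by the number of active SIMDs. The short Python sketch below simply reproduces that arithmetic; the names (`SIMD_COUNTS`, `core_totals`) are made up for illustration, and the Barts Pro figure of 12 SIMDs is inferred from the "two already cut" wording rather than stated outright.

```python
# Per-SIMD resources quoted in the text (shared by Cypress and Barts).
SPS_PER_SIMD = 80
TMUS_PER_SIMD = 4

# SIMD counts: Cypress XT and Barts XT are stated directly. The Pro and LE
# values are derived from "two cut" and "an additional two cut".
SIMD_COUNTS = {
    "Cypress XT": 20,
    "Barts XT": 14,
    "Barts Pro": 14 - 2,      # inferred, not stated as a number in the text
    "Barts LE": 14 - 2 - 2,
}


def core_totals(simds):
    """Return (stream processors, texture units) for a given SIMD count."""
    return simds * SPS_PER_SIMD, simds * TMUS_PER_SIMD


for name, simds in SIMD_COUNTS.items():
    sps, tmus = core_totals(simds)
    print(f"{name}: {simds} SIMDs, {sps} SPs, {tmus} TMUs")

# ROPs are independent of the SIMD count: the Barts XT keeps 32,
# while the LE has a full half of them disabled.
BARTS_XT_ROPS = 32
barts_le_rops = BARTS_XT_ROPS // 2
print(f"Barts LE ROPs: {barts_le_rops}")
```

Running it gives 1600 SPs and 80 TMUs for Cypress XT, 1120 and 56 for Barts XT, and 800 and 40 for the Barts LE, matching the figures in the text.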
Other than the obvious changes to the SIMD layout, there has also been some tweaking going on behind the scenes. The main graphics engine, which comprises the fixed function stages of AMD’s architecture, is for the most part carried over from the HD 5800 series without any significant changes, but there is one major addition: an enhanced tessellator.
One of the main critiques leveled against Cypress series GPUs was their tendency to choke under heavy tessellation workloads. Through improved thread management in the shader engines as well as enhanced buffering for tessellation draw calls, AMD has been able to manage up to a twofold increase in overall tessellation performance over the HD 5800 cards. We can also see that in an effort to increase rendering efficiency even more, AMD has broken up the Ultra Threaded Dispatch Processor into two with each section having its own instruction and constant cache. This dispatch processor basically acts like a traffic cop, directing draw calls to the SIMD arrays. With each directing its own “half” of the SIMD engine, rendering information can be processed at a much quicker rate without adding to the overall die size of the Barts.
To put this into layman’s terms, the Barts architecture removes the tessellation bottleneck, which allows the rasterizers and SPs to be used more efficiently. As a result, DX11 performance in particular has been increased.