The GTX 970's Memory Explained & Tested

SKYMTL · Jan 27, 2015

On forums near and far, users have been reporting memory allocation issues on NVIDIA’s GTX 970. Much of this centered on the fact that certain applications showed the GTX 970 utilizing just 3.5GB of its supposed 4GB of memory, even though the GTX 980 and other cards showed their full memory layout as being accessible. There were further reports that once the 3.5GB threshold was surpassed, the GTX 970 suddenly exhibited a drastic loss of performance. It looked suspiciously like NVIDIA’s price / performance darling wasn’t able to physically communicate with its advertised memory allotment and, if communication was taking place, that bandwidth was somehow truncated.

Naturally, this sparked a large number of theories regarding the Maxwell architecture, its abilities and how NVIDIA has allocated resources on their $350 graphics card. NVIDIA themselves have now stepped in, trying to set the record straight. What follows is a simplified version of our technical briefing with them alongside some basic benchmarks.

Let’s start with the thousand pound gorilla in the room: the GM204 core as it’s utilized in the GTX 970. In the first image above you will see the basic core layout as NVIDIA originally described it in their documentation and during their briefings to reviewers. There is a trio of SMMs disabled (these can be located within any one of the GPCs) which effectively reduces the number of CUDA cores, texture units, L1 cache and a number of other on-chip resources. However, at the time, the back-end resources seemed to have remained in place with four 64-bit memory controllers, 64 ROPs and 2048KB of L2 cache. We now know this wasn’t the case.

In what was deemed an error in their documentation due to a miscommunication between the engineering and technical PR teams, some fairly significant information was left on the cutting room floor. Instead of utilizing a full 64 ROPs as was originally believed, the GTX 970 only has 56 enabled while the L2 Cache has seen a 256KB cut to 1792KB. Additionally, the cut-down GM204 core handles its memory in a somewhat unique fashion which could explain why so many users are seeing utilization below the 4GB mark. So what happened here? This is where the story, as it was told to us, gets interesting.

NVIDIA’s previous Kepler architecture scaled downwards in a very linear fashion, which meant that when one portion of a chip was disabled, its associated functions had to be disabled as well in an effort to retain a balanced design. For example, if a single 64-bit memory controller was kicked to the curb when creating a new part, so too was its associated ROP partition and dedicated L2 cache. This scaling is done in all modern GPU architectures in order to optimize yields, with the cores that cannot operate at full capacity being rolled into lower-end SKUs through judicious modifications to their available resources.

Due to the Maxwell architecture’s unique layout, going down this same route would have resulted in a significant performance delta between the GTX 970 and GTX 980 since massive parts of the former’s core would have been disabled en masse. With four memory controllers and just a quartet of ROP groupings, there really wasn’t much leeway for disabling elements before the GTX 970 simply became uncompetitive in its segment.

What we didn’t know until our briefing was exactly how NVIDIA created the GTX 970’s core. In order to maximize performance potential, their engineers gave GM204 more scaling granularity, which allows for the partial disabling of certain interfaces without affecting the chip’s communication hierarchy. As a result, the GTX 970 retains a 256-bit memory interface alongside 4GB of GDDR5 memory but features several changes in how those elements communicate over its pool of shared resources, or Crossbar.

To provide sufficient bandwidth throughout the chip, the GM204’s 16 Streaming Multiprocessors each have a dedicated pipeline to the Crossbar. The information is then passed through towards the secondary processing stages via eight high-bandwidth ports, each of which has access to eight associated ROPs, 256KB of L2 cache and a single 32-bit memory controller (eight of these add up to the 256-bit interface, with each pair forming one of the four 64-bit controllers mentioned earlier). In the GTX 970, one of these ports has been disabled along with its ROPs and cache while the memory controller and its companion 500MB of DRAM remain as a separate entity. This results in a dual partition memory structure consisting of a primary 3.5GB segment and a companion 500MB.
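As a rough sanity check on those figures, the short Python sketch below models each Crossbar port as a bundle of eight ROPs, 256KB of L2 cache and 512MB of DRAM, using only the numbers quoted in this article. The `Port` class and `backend_totals` function are our own illustrative names and not anything from NVIDIA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """One Crossbar port and its back-end resources (figures from the article)."""
    rops: int = 8
    l2_kb: int = 256
    dram_mb: int = 512


def backend_totals(enabled_l2_ports: int, total_ports: int = 8) -> dict:
    """Sum back-end resources when only some ROP/L2 partitions are enabled.

    Assumption: DRAM stays attached on every port, and a disabled port only
    loses its ROPs and L2 (the GTX 970 case described above).
    """
    port = Port()
    return {
        "rops": enabled_l2_ports * port.rops,
        "l2_kb": enabled_l2_ports * port.l2_kb,
        "fast_dram_mb": enabled_l2_ports * port.dram_mb,
        "slow_dram_mb": (total_ports - enabled_l2_ports) * port.dram_mb,
    }


gtx980 = backend_totals(8)  # fully enabled GM204
gtx970 = backend_totals(7)  # one ROP/L2 partition disabled
print(gtx980)
print(gtx970)
```

With seven partitions enabled, the totals come out to 56 ROPs, 1792KB of L2 and a 3584MB + 512MB memory split, which lines up with the corrected specifications.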

In a typical scenario this lonesome memory controller / DRAM duo would have to be disabled as well since without an L2 cache partition it would have no way to communicate with the Crossbar. Instead, NVIDIA applied a so-called “buddy interface” which effectively reroutes the extra DRAM's communications so an existing cache module can take over and recognize the full 4GB DRAM allocation and 256-bit memory interface.

For lack of a better explanation, on a fully enabled GM204 core the memory ports are accessed in a sequential order in a 1KB stride after which the process repeats itself in a relatively straightforward manner. As the resources associated with the first port are left to finish their task, the workloads move on to the subsequent port, use that and then continue on following a cyclical port-forward 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3… routine. In an optimal situation, once that first port is called upon again, it is free and ready to process new information.
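The cyclical routine can be pictured with a tiny Python sketch. It assumes a plain round-robin mapping of 1KB strides onto eight ports; the real address mapping in hardware is not public, so `port_for_address` is purely a hypothetical helper for illustration.

```python
STRIDE_BYTES = 1024  # 1KB stride, as described in the article
NUM_PORTS = 8        # fully enabled GM204


def port_for_address(addr: int, num_ports: int = NUM_PORTS,
                     stride: int = STRIDE_BYTES) -> int:
    """Return the 1-based port servicing a byte address under simple
    round-robin striding (an assumed, simplified mapping)."""
    return (addr // stride) % num_ports + 1


# Walk through eleven consecutive strides.
sequence = [port_for_address(i * STRIDE_BYTES) for i in range(11)]
print(sequence)  # [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3]
```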

On the GTX 970, part of the resources (the ROPs and L2 cache) normally associated with the eighth port simply aren’t there. So what would have happened if NVIDIA had gone with a full 4GB partition? With calls being rerouted through the buddy interface to an already-used L2 section, there was a very real possibility that a single 256KB cache partition (which in this case is handling the information for two memory controllers instead of one) would create a bottleneck and slow down the whole memory interface.

Think of this situation in terms of the numbers we’ve written above. In an optimal scenario the order would be 1 through 8, after which it repeats through all eight ports and addresses all 4GB of memory. Using a typical striding process, the GTX 970 would have encountered a situation where the cycle is broken and looks like 1, 2, 3, 4, 5, 6, 7, 7, 1, 2, 3 and so on.

With that seventh port handling double the requests via the buddy interface and being hit twice as often, the remainder of the strides would have to wait around for it to complete scheduled tasks. In effect, the entire affair could theoretically operate at half speed, dragging down the memory processing pipeline in the process.
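To see where the "half speed" figure comes from, here is a toy Python model of that hypothetical single-partition layout. It assumes every L2 partition services one stride per time unit and that the busiest partition sets the pace for the whole interface; both are simplifications of ours, and the function names are illustrative only.

```python
from collections import Counter


def l2_partition_for_stride(stride_index: int, buddy: bool) -> int:
    """Map a stride to the 1-based L2 partition that services it.

    With buddy=True the eighth port has no L2 of its own, so its traffic is
    rerouted to partition 7 (the hypothetical single 4GB partition case).
    """
    port = stride_index % 8 + 1
    return 7 if (buddy and port == 8) else port


def relative_throughput(buddy: bool, strides: int = 8000) -> float:
    """Throughput relative to a fully enabled chip under the toy model."""
    hits = Counter(l2_partition_for_stride(i, buddy) for i in range(strides))
    ideal_time = strides / 8          # eight partitions sharing work evenly
    actual_time = max(hits.values())  # the busiest partition finishes last
    return ideal_time / actual_time


order = [l2_partition_for_stride(i, buddy=True) for i in range(11)]
print(order)                             # [1, 2, 3, 4, 5, 6, 7, 7, 1, 2, 3]
print(relative_throughput(buddy=False))  # 1.0
print(relative_throughput(buddy=True))   # 0.5
```

In this model the seventh partition receives twice the requests of any other, so everything else ends up waiting on it.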

To avoid completely bogging down the GTX 970’s bandwidth potential on that seventh and eighth stride, NVIDIA split the memory into two different partitions: the lower segment houses seven 512MB DRAM modules for a total of 3584MB while the higher one has a single associative 512MB IC which is good for 28GB/s of peak bandwidth. For those keeping track at home, a full 7/8ths of the theoretical memory throughput is accessible in that primary partition.
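The split works out neatly if the card’s stated 224GB/s peak bandwidth is divided evenly across the eight ports. The snippet below is just that arithmetic written out in Python; the even per-port division is our simplifying assumption.

```python
TOTAL_BANDWIDTH_GBS = 224  # GTX 970's stated peak bandwidth in GB/s
TOTAL_PORTS = 8
MODULE_MB = 512            # capacity of each DRAM module

per_port_gbs = TOTAL_BANDWIDTH_GBS / TOTAL_PORTS  # one port's share
fast_segment_gbs = 7 * per_port_gbs               # seven ports, primary partition
slow_segment_gbs = 1 * per_port_gbs               # single port, secondary partition
fast_segment_mb = 7 * MODULE_MB
slow_segment_mb = 1 * MODULE_MB

print(per_port_gbs, fast_segment_gbs, slow_segment_gbs)  # 28.0 196.0 28.0
print(fast_segment_mb, slow_segment_mb)                  # 3584 512
```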

Within the 3.5GB section (red above), the strides proceed sequentially while leaving the final 500MB of memory out of the equation. Thus, a game that calls for less than 3.5GB of memory will neatly avoid the buddy interface and final 500MB while still having access to 196GB/s of peak bandwidth. Meanwhile, anything that requires more than 4GB of memory will cause draw calls to run through the system’s slow 19GB/s PCIe interface, completely hobbling performance due to the added latency. Calling for help from the PCIe bus for access to system memory occurs with every architecture when an onboard memory interface reaches the point of saturation.
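Put another way, the slowest path a workload touches depends on which of three ranges its memory footprint lands in. The Python sketch below encodes those ranges with the peak figures quoted in this article; `slowest_tier` is a hypothetical helper, and real behavior also depends on how the driver places data.

```python
FAST_SEGMENT_MB = 3584  # primary partition
TOTAL_VRAM_MB = 4096    # primary + secondary partitions


def slowest_tier(required_mb: int) -> tuple:
    """Return (tier name, peak GB/s) of the slowest path a footprint touches."""
    if required_mb <= FAST_SEGMENT_MB:
        return ("primary 3.5GB partition", 196)
    if required_mb <= TOTAL_VRAM_MB:
        return ("secondary 512MB partition", 28)
    return ("system memory over PCIe", 19)


for mb in (2500, 3800, 4500):
    print(mb, slowest_tier(mb))
```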

The real kicker is what happens in that grey area when Windows calls for a memory allotment of between 3584MB and 4096MB. In those situations NVIDIA’s drivers are supposed to enable the eighth memory controller, final 512MB of DRAM and the buddy interface (the green “8th stride” above), opening up the full 4GB of on-card memory. However, that final 500MB remains set apart from the larger 3.5GB partition since both cannot be read at the same time despite the thin thread of communication provided by the so-called buddy interface. Since that 500MB partition has a very limited bandwidth of just 28GB/s, if the architecture spends too much time reading from it, the overall effective throughput of the larger 3.5GB segment would be negatively affected as well.

According to NVIDIA, there are checks and balances in place to ensure the GPU core never gets hung up waiting for on-die memory resources to complete their scheduled tasks. One of the first lines of defense is a driver algorithm that is supposed to effectively allocate resources and balance loads so draw calls follow the most efficient path and do not prematurely saturate an already-utilized Crossbar port. This means in situations where between 3.5GB and 4GB of memory is required, data that isn’t used as often is directed towards the slower 500MB partition while the faster 3.5GB section can continue along processing quick-access reads and writes.
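NVIDIA has not published how this driver heuristic works, so the following Python sketch is only our guess at the general idea: sort allocations by how often they are accessed and fill the fast partition first, letting the least-used data spill into the slow one. The function name, the allocation names and their sizes are all made up for illustration.

```python
def place_allocations(allocs, fast_mb=3584, slow_mb=512):
    """Greedy placement sketch (an assumption, not NVIDIA's algorithm).

    allocs: list of (name, size_mb, accesses_per_frame) tuples.
    Returns a dict mapping each name to "fast", "slow" or "system".
    """
    placement = {}
    fast_free, slow_free = fast_mb, slow_mb
    # Most frequently accessed data gets first pick of the fast partition.
    for name, size_mb, _freq in sorted(allocs, key=lambda a: a[2], reverse=True):
        if size_mb <= fast_free:
            placement[name] = "fast"
            fast_free -= size_mb
        elif size_mb <= slow_free:
            placement[name] = "slow"
            slow_free -= size_mb
        else:
            placement[name] = "system"  # would spill over PCIe
    return placement


# Hypothetical workload totalling 3728MB, i.e. inside the 3.5GB to 4GB window.
example = [
    ("render_targets", 1024, 120),
    ("hot_textures", 2304, 60),
    ("cold_textures", 400, 2),
]
placement = place_allocations(example)
print(placement)
```

In this made-up case only the rarely touched textures end up in the 512MB partition, which is the outcome the driver is supposed to aim for.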

From an architectural perspective, there’s also extra read and write request bandwidth between the memory controllers and the L2 caching hierarchy.

Another part of this delicate dance includes the interleaving of reads and writes so when one section of memory is processing reads, the other is free to process writes. When combined with the elements we’ve already discussed, the interleaving allows the GTX 970 to deliver its stated 224GB/s of peak bandwidth provided the software layer works as it’s supposed to.

While this unique layout gave NVIDIA the ability to load up the GTX 970 with a full 4GB of memory, the technology in play here certainly isn’t infallible. If the drivers are working the way they’re supposed to, there should only be a few percentage points difference (5% or less) between a card like the GTX 970 with two memory partitions and one with a single 4GB allocation when both are being used in scenarios which require higher bandwidth. However, much like in other scenarios where software and compatibility play a role in overall performance, we may see results varying from one application to the next. In the worst case scenarios, only the 3.5GB partition may be recognized or the load balancing algorithm could effectively direct data to the wrong resources.

We’ll get into a few benchmarks and further explain some of the odd behavior users have been experiencing on the next page.


Some Pertinent Benchmarks & Closing Thoughts

The Benchmarks

There are still some questions we have to answer since there are indeed some applications out there which don’t seem to recognize the full 4GB allocation on these cards even though it should be readily apparent. We also decided to grab four games with achievable memory allocations of between 3.5GB and 4GB which were then put to the test with four cards to see their respective performance hits.

Synthetic Benchmarks

The first test we wanted to show is Nai’s Benchmark, which essentially utilizes a CUDA memory test to sequentially test access speed to individual portions of the memory subsystem.

This benchmark was one of the first to highlight the possibility that something odd was going on with the GTX 970. We can see that performance reaches respectable levels but then takes a nosedive around the 3.2GB to 3.3GB mark.

If we add the roughly 250MB being set aside for Windows background tasks like desktop rendering, the problems begin to happen right as the benchmark starts hitting that 500MB partition between 3.5GB and 4GB. This goes for the caching as well since the buddy interface can only grant a relatively minute amount of access to an available L2 cache partition. Due to the sequential nature of this test, NVIDIA’s load balancing algorithms and interleaving simply cannot take place.
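The arithmetic behind that observation is simple enough to write out. The Python snippet below subtracts the approximate 250MB desktop reservation from the 3584MB primary partition to estimate where a sequential test should start touching the slower partition; both inputs are the rough figures quoted in this article and `expected_dropoff_gb` is our own helper name.

```python
FAST_SEGMENT_MB = 3584     # primary partition size
DESKTOP_RESERVED_MB = 250  # approximate figure quoted above


def expected_dropoff_gb(fast_mb: int = FAST_SEGMENT_MB,
                        reserved_mb: int = DESKTOP_RESERVED_MB) -> float:
    """Benchmark-visible offset (in GB) at which the fast partition runs out."""
    return (fast_mb - reserved_mb) / 1024


dropoff_gb = expected_dropoff_gb()
print(f"{dropoff_gb:.2f}GB")  # 3.26GB
```

That estimate sits inside the 3.2GB to 3.3GB range where the benchmark’s bandwidth falls away.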

Finally, there’s one last drop-off to below the 20GB/s mark which perfectly shows what happens when the PCIe bus is accessed as the GPU’s onboard DRAM becomes saturated.

AIDA64’s GPGPU test highlights a situation which favors the GTX 970’s memory layout since it is supposed to access the entire 4GB allotment for several milliseconds at a time. In both read and write benchmarks the bandwidth is within 3% of NVIDIA’s GTX 980, a card that uses a full 4GB partition.

The Memory Copy presents an interesting counterpoint since only the first 3.5GB partition is utilized (according to FinalWire, this test puts about 2.5GB of load on the memory subsystem) and suddenly we see the gap between the two cards increase to about 14%. Remember, that first partition has 7/8ths of the chip’s memory bandwidth so these results are right in line with expectations.

Gaming Benchmarks

Moving on to the gaming benchmarks, we tried to keep things simple but extremely targeted since we’ll be posting additional articles with more titles, frametimes, SLI and a few other interesting tests in the coming days and weeks. In the results you see below there are four cards: the GTX 970, GTX 980, GTX TITAN with 6GB and of course AMD’s R9 290X. All of these handle memory allocations in a different way, which allows us to get a good cross-section of what’s available in the market.

One of the main challenges we set ourselves was to directly target the GTX 970’s 500MB partition which meant the games had to require between 3585MB and 4095MB throughout each benchmarking scenario. This was done by using a resolution of 4K and modifying the detail settings until our target was achieved. One caveat: throughout each 60 second benchmark run, the requirements could never surpass 4GB (due to the PCIe interface jumping into things) and never be under 3.5GB. The memory delta breakdown is as follows, as reported by GPU-Z’s logging tool:

Battlefield 4: 3625MB to 3941MB
Dragon Age Inquisition: 3591MB to 3777MB
Hitman: Absolution: 3618MB to 3899MB
Shadow of Mordor: 3820MB to 4006MB
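As a quick cross-check, the Python snippet below confirms that each logged range stays inside our 3585MB to 4095MB target window and shows how far into the 512MB partition each game peaked. The ranges are the ones listed above; the helper name `in_window` is simply ours.

```python
WINDOW_MB = (3585, 4095)  # target window for every benchmark run
FAST_SEGMENT_MB = 3584    # size of the primary partition

# Minimum and maximum memory use per game, as logged by GPU-Z.
usage_mb = {
    "Battlefield 4": (3625, 3941),
    "Dragon Age Inquisition": (3591, 3777),
    "Hitman: Absolution": (3618, 3899),
    "Shadow of Mordor": (3820, 4006),
}


def in_window(low: int, high: int, window=WINDOW_MB) -> bool:
    """True if the whole logged range sits inside the target window."""
    return window[0] <= low and high <= window[1]


# Peak spill into the secondary partition, in MB.
spill_mb = {game: high - FAST_SEGMENT_MB for game, (_low, high) in usage_mb.items()}

for game, (low, high) in usage_mb.items():
    print(f"{game}: in window={in_window(low, high)}, peak spill={spill_mb[game]}MB")
```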

With all of this in mind we have conducted two tests with the same settings for all cards: one is a baseline test where we run each card at settings that result in a memory utilization of under 3.5GB (all games hovered between 2GB and 2.5GB) while the other, “above 3.5GB” test uses the ranges shown above.

With that information in hand, we then determined how much performance loss there is when moving up to higher memory allocation levels.

Each benchmark run was done six times, logged through FCAT and the final results rounded to the nearest whole number. The use of FCAT ensures that any frametime problems will be directly translated into the final framerate results.
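For readers who want to reproduce the comparison with their own logs, the performance-loss calculation can be sketched as below. This is a minimal Python outline under our own naming; the framerates in the example are placeholder values chosen to make the arithmetic easy to follow and are not results from our testing.

```python
from statistics import mean


def average_fps(runs):
    """Average several benchmark runs and round to the nearest whole number."""
    return round(mean(runs))


def percent_drop(baseline_fps: float, loaded_fps: float) -> float:
    """Performance lost when moving from the baseline to the heavier test."""
    return (baseline_fps - loaded_fps) * 100 / baseline_fps


def extra_hit(card_drop: float, reference_drop: float) -> float:
    """How many percentage points more one card loses than a reference card."""
    return card_drop - reference_drop


# Placeholder numbers only (hypothetical cards A and B).
card_a_baseline = average_fps([49.6, 50.2, 50.1, 49.9, 50.3, 49.8])
card_a_drop = percent_drop(card_a_baseline, 38)
card_b_drop = percent_drop(60, 48)
print(card_a_baseline, card_a_drop, card_b_drop, extra_hit(card_a_drop, card_b_drop))
```

Comparing drops rather than raw framerates is what lets cards of different speeds be judged on how gracefully they handle the larger memory footprint.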

Our first game was Battlefield 4 and a few things are abundantly obvious. It looks like the GTX 970 takes a good 4% more of a performance hit than the GTX 980 when moving from less than 3.5GB to that 500MB sweet spot. Now we’re not talking about anything extreme here but there is a difference. The GTX TITAN’s additional memory doesn’t seem to do all that much in this case but the R9 290X proves that AMD’s architecture is extremely well prepared for upwards changes in bandwidth.

The latest Dragon Age game follows very much the same guidelines as Battlefield 4, though the drop-off delta between the 970 and 980 is reduced once again. It doesn’t look like the GTX 970 has been adversely affected in either of these titles so far. Meanwhile, the TITAN drops like a stone which is likely a byproduct of its architecture’s limitations rather than memory allotments.

In Hitman: Absolution we see the gap between the GTX 970 and GTX 980 narrowing even further and it is becoming obvious that NVIDIA’s algorithms are completely transparent and working the way they’re meant to. As for the R9 290X? What more is there to say... it’s beyond impressive.

Shadow of Mordor presents an interesting situation since there’s very little performance change on any of these cards when moving to settings which require more than 3.5GB. However, this scenario shows the GTX 970 with its most significant (5%) framerate reduction relative to the GTX 980. To us this isn’t anything to worry about but it is nonetheless noteworthy in this context. Memory performance is obviously an AMD strength, a fact we’ve seen since we first reviewed the GTX 970.

With that being said, we’ve been actively trying to find a scenario that fits into the GTX 970’s narrow ~500MB window and causes the card to trip up, and after testing some 20 games, none exhibited any unexpected performance drop-offs. It seems like NVIDIA’s software is working thus far but our cross section of titles can by no means offer a definitive answer to every user since there are literally hundreds of games that haven’t been used in our testing. In addition, it’s impossible to know how many of these deltas are due to the GTX 980’s expanded SM layout versus any memory bandwidth shortfalls.

Closing Thoughts

It has been an interesting week for NVIDIA. Their GTX 960 was launched amid the usual fanfare and then they came face to face with the GTX 970’s memory “issue” being laid bare. Their reaction was swift and surefooted but there are plenty of questions still swirling around the situation.

After numerous briefings we finally know how the GTX 970 addresses its memory, why some applications don’t pick up the full 4GB allotment and how the partitioning design can affect overall performance. The explanations make sense and the (in our testing at least) minimal impact on a game’s framerates is something that should be celebrated rather than ridiculed. However, the problem rests with the timing rather than the end result since it supposedly took months for anyone to realize a lot of essential information didn’t make its way to the press.

NVIDIA says their omission of these elements boils down to a miscommunication rather than any intent to mislead. We’re inclined to believe them since absolutely nothing has been gained by keeping it from us; the GTX 970’s performance hasn’t changed and neither has its overall value to gamers.

On the other hand, the fact it took a mob of pitchfork-wielding gamers and press for them to “come clean” about this points towards a very worrying reactionary approach. Should we the writers have asked just the right questions for this information to come forward beforehand? That certainly shouldn’t be the case. The mistaken information given to the press stood there for nearly five months which is an eternity for a company that prides itself on having one of the best technical PR teams in the business. These weren’t small elements either; the GTX 970 is missing eight ROPs! While we can’t point fingers since only NVIDIA knows the full story, it looks understandably suspicious to our readers.

A certain amount of blame rests on our shoulders as well. While the GTX 970’s performance characteristics in SLI and single card configurations didn’t present anything amiss, the various GPGPU benchmarks seem to be raising red flags now and likely always did. If anything, this points towards an expanded testing suite and we’ve taken that point to heart. We will also continue testing in an effort to uncover if there are any situations which show the partitioning to be a detriment to overall performance. Stay tuned for more articles addressing this.

Perhaps those last points should direct our conversation down a different path as well. With hundreds of media members being completely oblivious to this partitioning throughout thousands of in-game tests, NVIDIA should actually be commended for making it so transparent thus far. The load balancing within the drivers has obviously been doing its job since the GTX 970 has been and should continue to be lauded as an excellent graphics card. Nothing changes that. It is just too bad that due to a series of unfortunate events, its awards and accolades will always have this episode as a major, unavoidable footnote.


SKYMTL

HardwareCanuck Review Editor
