How memory bandwidth is killing AMD's 32-core Threadripper performance

Gordon Mah Ung

AMD's 32-core Threadripper 2990WX is the fastest consumer CPU ever sold. And let's be clear: We fully agree with anyone who says so. But we would also be the first to say it has its limitations.

The most glaring is the lack of consumer applications that can truly exploit the cores available. The other limitation is apparent in the diagram below, which shows how AMD built this 32-core monster. Rather than a single chip with every single CPU core on it, AMD connects four dies using its high-speed Infinity Fabric.

Why memory bandwidth affects the 32-core Threadripper

If you look closer at the diagram, you can see that two of the dies don't have their own memory controllers or PCIe access. Instead, they have to talk to an adjacent CPU die.

It is, essentially, like having a two-apartment unit where the occupants of the second apartment must go through the first to reach the hallway outside.

Perhaps more important is the overall bandwidth available. AMD had initially said the total bandwidth available between the four CPU dies was 25GBps bi-directional. The company amended its original documentation to state it was total bandwidth. Compare that with the 16-core Threadripper 2950X, with its 50GBps of bandwidth and two links between the two dies (also updated information from AMD).

Many believe this is Threadripper 2990WX's main weakness: Lack of memory bandwidth per core is impacting it in memory-intensive tasks such as compression and encoding. Even worse for Threadripper 2990WX is that bandwidth has to be shared on a CPU with 14 more cores than Intel's Core i9-7980XE.

Below, you can see the result of Sandra 2018 Titanium's memory bandwidth test and the available bandwidth per core. As you can see, the bandwidth per core plummets from almost 5GBps with 8 or 16 cores active to just 2GBps when you utilize all 32 cores.

Synthetic memory bandwidth tests are one thing. To dig further into performance in memory-intensive tests, we fired up the newest version of the free 7-Zip application. Written by Igor Pavlov, this open-source compression and decompression utility is popular and generally awesome. For example, when I run tests on a laptop and decompress Cinebench R15.08 and its thousands of small files with Windows 10's built-in utility, it takes several minutes to finish. I can actually connect to the Internet, download 7-Zip, and decompress the contents of Cinebench R15.08 with it in less time than it takes the built-in Windows utility to do its thing.

The GUI version runs two tests, for compression and decompression. The overall score looks like a simple average of the two results.
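If the overall score really is a simple average, the arithmetic is easy to sketch. This is only an illustration of that guess, not 7-Zip's actual scoring code; the function name and the sample ratings are made up for the example.

```python
def overall_rating(compress_rating: float, decompress_rating: float) -> float:
    """Return the plain arithmetic mean of the two sub-test ratings.

    Assumption: the article only says the overall score "looks like"
    a simple average, so this mirrors that guess and nothing more.
    """
    return (compress_rating + decompress_rating) / 2


if __name__ == "__main__":
    # Hypothetical ratings, chosen only to show the calculation.
    print(overall_rating(100_000, 140_000))
```

A chip that does well in decompression but poorly in compression can therefore still post a respectable overall number, which is why the two sub-tests are worth looking at separately.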

What 7-Zip tests

You can read more about the test on the 7-cpu.com web site, but we've highlighted some of the key information about the tests here. Regarding the Compression test, the website discusses the factors that influence the test results, saying it "strongly depends from memory (RAM) latency, Data Cache size/speed and TLB. Out-of-Order execution feature of CPU is also important for that test." The site goes on: "The compression test has big number of random accesses to RAM and Data Cache. So big part of execution time the CPU waits the data from Data Cache or from RAM."

About the Decompression test, the website says it "strongly depends on CPU integer operations. The most important things for that test are: branch misprediction penalty (the length of pipeline) and the latencies of 32-bit instructions ('multiply', 'shift', 'add' and other). The decompression test has very high number of unpredictable branches."

How we retested Threadripper vs. Core i9

For our retest, we decided to lock both the Threadripper 2990WX and the Core i9-7980XE at 3GHz to remove any variables from each CPU's boost schemes. This was done to make the comparison more dependent on the test rather than the clock speed differences between the two. We also set both to DDR4/3200 clocks, and both were run in quad-channel mode except where noted. To be up-front: The Threadripper system had a slight edge in CAS latency at CL14 and 1T, while the Core i9 was running at CL15 and 2T. As in our original review, both were running Founders Edition GTX 1080 cards using the same drivers and the same version of Windows 10 Enterprise Edition.

Because much of the concern over Threadripper is its per-core memory bandwidth performance, we decided to run from 1 thread to the maximum number of threads on each CPU. We also decided to see whether performance of the Threadripper would change if you turned off dies, so we ran it with a single die (8 cores/16 threads) and two dies (16 cores/32 threads), and all four (32 cores/64 threads).
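A thread sweep like this can also be scripted. The sketch below is not how these tests were run (they used the GUI); it assumes the command-line `7z` executable is on the PATH and relies on its `b` benchmark command with the `-mmt` switch to set the thread count. The helper names are hypothetical.

```python
import subprocess


def benchmark_commands(max_threads: int, exe: str = "7z") -> list[list[str]]:
    """Build one `7z b -mmtN` command line per thread count, 1..max_threads."""
    if max_threads < 1:
        raise ValueError("max_threads must be at least 1")
    return [[exe, "b", f"-mmt{n}"] for n in range(1, max_threads + 1)]


def run_sweep(max_threads: int, exe: str = "7z") -> dict[int, str]:
    """Run each benchmark in turn and keep the raw text output per thread count.

    Parsing the ratings out of the text is left out on purpose, because the
    report layout differs between 7-Zip versions.
    """
    results = {}
    for n, cmd in enumerate(benchmark_commands(max_threads, exe), start=1):
        done = subprocess.run(cmd, capture_output=True, text=True, check=True)
        results[n] = done.stdout
    return results


if __name__ == "__main__":
    # Only print the commands here; call run_sweep(64) to actually execute them.
    for cmd in benchmark_commands(4):
        print(" ".join(cmd))
```

For the 2990WX you would sweep up to 64 threads, and up to 36 for the Core i9-7980XE. Switching dies off is a firmware or vendor-utility setting and is outside what a script like this can control.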

In the integer-focused decompression component of 7-Zip, the performance was quite nice. Although we don't see perfect scaling, there's little difference in 7-Zip decompression performance as you switch off dies.

All of the tests were also completed using the GUI version of 7-Zip 18.05 with the default dictionary size of 32MB (although we did decide to recompile our own version, too).

You're probably more interested in the Core i9 vs. Threadripper 2990WX, so we ran that, of course. For the most part, it's not bad for either part. Interestingly, Threadripper 2990WX seems to have that slight fall-off in decompression performance as you cross the threshold of 8 cores. Core i9 has a decent performance advantage up to about 16 cores, but after that it runs out of steam and ends up losing to the 32-core Threadripper 2990WX CPU.

This shouldn't surprise too many, though. The Threadripper 2990WX's CPU performance when it doesn't run out of memory bandwidth is a known quantity. You only have to look at our multi-threaded rendering tests to see how it's simply a monster.

The question is, what happens under memory bandwidth or memory latency tests? Here are the results of the Threadripper 2990WX in 7-Zip's compression test. It's not pretty, but the good news is switching dies off didn't seem to matter. As you can see, the CPU appears to hit a ceiling at 26 threads, and then it just gets worse from there.

Perhaps worse is when you compare it to the Core i9-7980XE. Again—remember both of the CPUs were at a fixed clock speed of 3GHz and DDR4/3200.

That's just not a good look for the 32-core Threadripper 2990WX and does seem to confirm that memory latency and bandwidth chores suffer greatly.

But can memory bandwidth also hurt Core i9? To find out, we switched the Core i9 system from quad-channel mode into single-channel mode. Unfortunately, for this test we had to lower total memory from 32GB to 16GB because our modules lacked the density. The good news is that 7-Zip with the default dictionary fits fine, and we don't believe overall memory capacity was the issue. We can say that overall memory bandwidth as measured in Sandra 2018 was cut from 77GBps in quad-channel memory mode to 18.5GBps in single-channel mode on the Intel part. Per-core memory bandwidth went from 4.8GBps in quad-channel to 1GBps in single-channel mode.
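For a rough sanity check on per-core figures, you can divide aggregate bandwidth evenly across the cores. The sketch below does that with the Sandra totals quoted above and the Core i9-7980XE's 18 cores. The even split is an assumption: Sandra reports its own per-core number, so this naive division (roughly 4.3GBps and 1.0GBps) lands in the same ballpark without having to match the quoted figures exactly.

```python
def per_core_bandwidth(total_gbps: float, cores: int) -> float:
    """Split aggregate memory bandwidth evenly across the active cores.

    Assumption: an even split is a simplification for illustration,
    not the method Sandra uses for its per-core result.
    """
    if cores <= 0:
        raise ValueError("cores must be positive")
    return total_gbps / cores


CORE_I9_7980XE_CORES = 18  # the chip's physical core count

# Aggregate Sandra 2018 results quoted in the text, in GBps.
quad_channel = per_core_bandwidth(77.0, CORE_I9_7980XE_CORES)
single_channel = per_core_bandwidth(18.5, CORE_I9_7980XE_CORES)

if __name__ == "__main__":
    print(f"quad-channel:   {quad_channel:.2f} GBps per core")
    print(f"single-channel: {single_channel:.2f} GBps per core")
```

The useful part is the ratio: dropping to one channel leaves each core with roughly a quarter of the bandwidth it had before.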

As you can see, the performance of Core i9-7980XE also suffers when its memory bandwidth is drastically cut. It doesn't suffer as much as the Threadripper 2990WX, but this doesn't appear to be the fault of some pro-Intel code at work.

Linux tests bring a surprise. Keep reading!
