New multi-threading technique promises to double processing speeds

zohaibahd

Forward-looking: New research details a process that allows a CPU, GPU, and AI accelerator to work seamlessly in parallel on separate tasks. The breakthrough could deliver blazing-fast, energy-efficient computing, promising to double overall processing speed at less than half the energy cost.

Researchers at the University of California, Riverside have developed a technique called Simultaneous and Heterogeneous Multithreading (SHMT), which builds on conventional simultaneous multithreading. Where simultaneous multithreading lets a single CPU core run multiple threads at once, SHMT goes further by bringing the graphics and AI processors into the mix.

The key benefit of SHMT is that these components can simultaneously crunch away on entirely different workloads, each matched to its strengths. The method differs from traditional computing, where the CPU, GPU, and AI accelerator handle their tasks independently. That separation requires shuttling data between the components, which can lead to bottlenecks.

To manage the heterogeneous workload dynamically, SHMT relies on what the researchers call a "smart quality-aware work-stealing (QAWS) scheduler." The scheduler aims to balance performance and precision, for instance by assigning tasks that require high accuracy to the CPU rather than the more error-prone AI accelerator. It can also seamlessly reassign jobs to the other processors in real time if one component falls behind.
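
The published description doesn't come with reference code, but the scheduling idea can be sketched in a few dozen lines. Below is a minimal, hypothetical Python illustration of quality-aware placement plus work stealing; the task fields, device names, and accuracy rule are assumptions made for this sketch, not the researchers' implementation.

```python
import queue
import threading

# Rough, hypothetical sketch of a quality-aware work-stealing dispatcher in
# the spirit of QAWS. Task fields, device names, and the accuracy rule are
# invented for illustration -- this is not the SHMT implementation.

ACCURACY_CRITICAL = {"high", "exact"}   # work the error-prone accelerator shouldn't take
DEVICES = ("cpu", "gpu", "npu")

def preferred_device(task):
    """Initial placement: precision-critical work stays on the CPU,
    the rest goes to the throughput devices."""
    if task["accuracy"] in ACCURACY_CRITICAL:
        return "cpu"
    return "npu" if task["kind"] == "inference" else "gpu"

def run(task, device):
    print(f"{device} -> {task['name']}")   # stand-in for a real kernel launch

def worker(device, queues):
    while True:
        try:
            task = queues[device].get_nowait()
        except queue.Empty:
            # Work stealing: an idle unit grabs pending work from a busier
            # peer so no processor sits idle while another falls behind.
            task = None
            for other in DEVICES:
                if other == device:
                    continue
                try:
                    candidate = queues[other].get_nowait()
                except queue.Empty:
                    continue
                if candidate["accuracy"] in ACCURACY_CRITICAL and device != "cpu":
                    queues[other].put(candidate)  # don't trade precision for speed
                    continue
                task = candidate
                break
            if task is None:
                return   # nothing left that this device may run
        run(task, device)

tasks = [
    {"name": "blur-frames",    "kind": "compute",   "accuracy": "low"},
    {"name": "update-ledger",  "kind": "compute",   "accuracy": "exact"},
    {"name": "detect-objects", "kind": "inference", "accuracy": "low"},
]
queues = {d: queue.Queue() for d in DEVICES}
for t in tasks:
    queues[preferred_device(t)].put(t)

threads = [threading.Thread(target=worker, args=(d, queues)) for d in DEVICES]
for th in threads:
    th.start()
for th in threads:
    th.join()
```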

In testing, SHMT boosted performance by 95 percent and sliced power usage by 51 percent compared to existing techniques. The result is an impressive 4x efficiency uplift: roughly 1.95x the throughput at 0.49x the power, or about four times the performance per watt. Early proof-of-concept trials utilized Nvidia's Jetson Nano board containing a 64-bit quad-core Arm CPU, 128-core Maxwell GPU, 4GB RAM, and an M.2 slot housing one of Google's Edge TPU AI accelerators. While it's not precisely bleeding-edge hardware, it does mirror standard configurations. Unfortunately, there are some fundamental limitations.

"The limitation of SHMT is not the model itself but more on whether the programmer can revisit the algorithm to exhibit the type of parallelism that makes SHMT easy to exploit," the paper explains.

In other words, it's not a simple universal hardware implementation that any developer can use. Programmers have to learn how to do it or develop tools to do it for them.

If the past is any indication, this is no easy feat. Remember Apple's switch from Intel to Arm-based silicon in Macs? The company had to invest significantly in its developer toolchain to make it easier for devs to adapt their apps to the new architecture. Unless there's a concerted effort from big tech and developers, SHMT could end up a distant dream.

The benefits also depend heavily on problem size. While the peak 95-percent uplift required maximum problem sizes in testing, smaller loads saw diminishing returns. Tiny loads offered almost no gain since there was less opportunity to spread parallel tasks. Nonetheless, if this technology can scale and catch on, the implications could be massive – from slashing data center costs and emissions to curbing freshwater usage for cooling.
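
A generic overhead model (not the paper's analysis) illustrates why: if each piece of offloaded work carries a roughly fixed scheduling-and-transfer cost $c$, spreading a job of size $T$ across $n$ processing units yields a speedup of about

$$S(n) \approx \frac{T}{T/n + c},$$

which approaches $n$ when $T$ is large but shrinks toward nothing as $T$ falls and the fixed cost comes to dominate.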

Many unanswered questions remain concerning real-world implementations, hardware support, code optimizations, and ideal use-case applications. However, the research does sound promising, given the explosion in generative AI apps over the past couple of years and the sheer amount of processing power it takes to run them.


 
This sounds easy on paper but in practice:

- It pretty much has to be a hardware-based solution that requires no programming input.

- Moving data around is slow. Basically, the scheduler must determine whether it's worth transferring data to another "computing unit" or whether it should just pick the nearest one, even if it isn't the fastest for the task (see the sketch below).

AMD has been trying this for over a decade now without much success. So yeah, good luck.
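
That transfer-or-stay-put call boils down to a simple cost comparison. A minimal sketch, with made-up bandwidth and timing figures rather than anything measured:

```python
# Hypothetical back-of-the-envelope offload check: move work to a faster unit
# only when data transfer plus remote compute beats just running it locally.

def best_device(data_bytes, local, candidates):
    """local: (name, compute_seconds); candidates: (name, compute_seconds, bytes_per_second)."""
    best_name, best_time = local                      # baseline: run where the data already is
    for name, compute_s, bandwidth in candidates:
        total = data_bytes / bandwidth + compute_s    # pay for the copy, then compute
        if total < best_time:
            best_name, best_time = name, total
    return best_name, best_time

# Example: 512 MB of data currently sitting in CPU memory.
print(best_device(
    512e6,
    ("cpu", 0.80),                                    # ~0.8 s if we just run it locally
    [("gpu", 0.05, 12e9), ("npu", 0.10, 4e9)],        # fast link vs. slower link
))
```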
 
As physics starts to limit die shrinks, these types of ideas are the way forward to further processing performance.
I've been hearing about photonics being the future for nearly 30 years, be nice if it actually made some progress. I also remember the gigahertz wars. "We'll hit 10ghz by 2010!"
 
I've been hearing about photonics being the future for nearly 30 years, be nice if it actually made some progress. I also remember the gigahertz wars. "We'll hit 10ghz by 2010!"
That 10 GHz promise was indeed realistic, and I have no doubt Intel could have achieved it. However, Intel decided that clock speed without a performance gain is useless and abandoned the NetBurst architecture, which also meant 10 GHz by 2010 did not happen.
 
I've been hearing about photonics being the future for nearly 30 years, be nice if it actually made some progress. I also remember the gigahertz wars. "We'll hit 10ghz by 2010!"
True, but some type of parallelism is the best bet for massive gains (even if not all in a single go). We just need to figure out a way to do it for things that aren't as straightforwardly parallel.
 
That 10 GHz promise was indeed realistic, and I have no doubt Intel could have achieved it. However, Intel decided that clock speed without a performance gain is useless and abandoned the NetBurst architecture, which also meant 10 GHz by 2010 did not happen.
I love how everyone said the Pentium 4 ran too hot and used too much power, but these days we have 300 W+ chips that don't thermally throttle until around 80 °C. It also feels like we're on the cusp of another GHz war.
 
I love how everyone said the Pentium 4 ran too hot and used too much power, but these days we have 300 W+ chips that don't thermally throttle until around 80 °C. It also feels like we're on the cusp of another GHz war.
I don't see AMD CPUs consuming too much power. Thermal limits are high, but power consumption stays in check.

I doubt AMD will start a GHz war; the Zen architecture has always been server-first, and servers don't really need high clock speeds. Intel, on the other hand, wants to win at least the single-thread crown at any cost, so it's a different story there.
 
I don't see AMD CPUs consuming too much power. Thermal limits are high, but power consumption stays in check.

I doubt AMD will start a GHz war; the Zen architecture has always been server-first, and servers don't really need high clock speeds. Intel, on the other hand, wants to win at least the single-thread crown at any cost, so it's a different story there.
Idk, AMD recently changed the spec on AM5 from 170 W to 230 W, and we have been seeing a bump of a few hundred MHz every new generation. Maybe not the GHz wars of the 2000s, but clock speeds have been consistently going up for the last ~8 years.
 
Idk, AMD recently changed the spec on AM5 from 170 W to 230 W, and we have been seeing a bump of a few hundred MHz every new generation. Maybe not the GHz wars of the 2000s, but clock speeds have been consistently going up for the last ~8 years.
Natural, since manufacturing technologies also develop. Also, because making cores wider no longer gives that much improvement, more clock speed is the obvious route to more performance. However, as said, AMD is clearly focused on servers, and Intel has manufacturing problems, which explains why clock speed gains are so low. Both AMD and Intel could push clock speeds much higher but see no way to do it without sacrificing IPC.
 
I've been hearing about photonics being the future for nearly 30 years, be nice if it actually made some progress. I also remember the gigahertz wars. "We'll hit 10ghz by 2010!"

The problem with Photonics remains the same: How are you generating the light? Because guess what: Generating light is *very* expensive.
 
True, but some type of parallelism is the best bet for massive gains (even if not all in a single go). We just need to figure out a way to do it for things that aren't as straightforwardly parallel.

The issue is that for problems that are not embarrassingly parallel, there's only so much you can do past a certain point to extract more gains via improved parallel code.
 
The problem with Photonics remains the same: How are you generating the light? Because guess what: Generating light is *very* expensive.

Temu: assorted LEDs, 0.4 cents each in bulk

A chip-based solution would be a thousand times cheaper still. Generating light is cheap; computing with light is an entirely different hairball.

Natural, since manufacturing technologies also develop.
The days of large-scale manufacturing gains are over. Nvidia has stated that, in the last 10 years, they've gotten only a 2.5X gain from process technology: the rest has come from architectural improvements. Over the next 10 years, we'll probably see about half that.

This sounds easy on paper but in practice:

- It pretty much has to be a hardware-based solution that requires no programming input.
If you read the source paper, it's a library/compiler based solution. In essence, it's simply a lower-level approach to what's already being done. Today, a programmer decides what to run on the CPU vs. the GPU, then the underlying library(ies) generally parallelize from there. In this approach, that parallelization is made at the same time as allocation to hardware, allowing for a more efficient distribution.
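
To make that distinction concrete, here is a toy contrast between hard-coding the device split and letting a runtime decide placement when the work is scheduled. The function names, devices, and "schedule" are all invented for illustration; nothing here comes from the paper.

```python
# Toy contrast of where the placement decision lives. Every function, device
# name, and schedule here is made up; the real SHMT work is a library/compiler
# layer over actual CPU/GPU/TPU kernels.

def run_kernel(device, name, data):
    # Stand-in for dispatching kernel `name` to `device`; just tags the result.
    return f"{name}({data}) on {device}"

# Status quo: the programmer hard-codes the device split up front.
def pipeline_manual(data):
    a = run_kernel("gpu", "preprocess", data)   # "this part goes to the GPU"
    return run_kernel("npu", "infer", a)        # "this part goes to the NPU"

# SHMT-style: the program emits device-agnostic steps, and a scheduler picks
# (and could split) the hardware at the moment the work is parallelized.
def pipeline_shmt(data, schedule):
    a = run_kernel(schedule("preprocess"), "preprocess", data)
    return run_kernel(schedule("infer"), "infer", a)

# Trivial stand-in for the scheduler; the real one weighs load, accuracy,
# and transfer cost rather than reading a fixed table.
toy_schedule = {"preprocess": "cpu+gpu", "infer": "npu"}.get

print(pipeline_manual("frame0"))
print(pipeline_shmt("frame0", toy_schedule))
```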
 
The days of large-scale manufacturing gains are over. Nvidia has stated that, in the last 10 years, they've gotten only a 2.5X gain from process technology: the rest has come from architectural improvements. Over the next 10 years, we'll probably see about half that.
2.5X gain means what?

GTX980Ti (2015): 8B transistors
RTX4090 (2022): 76.3B transistors

Like :confused:
If you read the source paper, it's a library/compiler based solution. In essence, it's simply a lower-level approach to what's already being done. Today, a programmer decides what to run on the CPU vs. the GPU, then the underlying library(ies) generally parallelize from there. In this approach, that parallelization is made at the same time as allocation to hardware, allowing for a more efficient distribution.
That's why I said it looks good on paper. As a reminder, Intel's Itanium was also supposed to be a fast CPU "because the compiler makes code that the CPU can easily execute, and because of that the CPU front end could be simple." Problem is, that "ultra compiler" never materialized.

To put it another way, AMD has been developing the same thing for over a decade now. And they want a hardware-based solution for some reason. Not hard to guess that reason. This kind of supercompiler solution looks good on paper, but making it actually work is super hard.
 
Papa Intel 16900k: Hey kiddo, take care of that for papa, OK?
Sister RTX 7600x: No way dad... I can't handle it, give it to Junior, I'm playing Candy Crush Remake now.
Junior Intel AIA 2350: It's your turf, deal with your SHMT!
Sister RTX 7600x: Dad, Junior is not doing his chores... Again. And he is cursing.
Grandpa PSU: Say no more, kid...
*Junior starts to smoke and smell funny*
Passerby Crysis: *Process Crysis.exe terminated - family issues*
 
2.5X gain means what?

GTX980Ti (2015): 8B transistors
RTX4090 (2022): 76.3B transistors
Like :confused:
My sentence continued "...the rest has come from architectural improvements." Jensen claimed a 1000-fold increase in AI performance over the last 10 years. If we chalk up 2.5X of that to process nodes and 10X to the transistor count increase, the remaining 40X came from improvements in tensor core IPC.
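
For what it's worth, that decomposition multiplies out exactly:

$$2.5 \times 10 \times 40 = 1000.$$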

That's why I said it looks good on paper. As a reminder, Intel's Itanium was also supposed to be a fast CPU "because the compiler makes code that the CPU can easily execute, and because of that the CPU front end could be simple." Problem is, that "ultra compiler" never materialized.
I think you're confusing the rationale for RISC with VLIW. Itanium code wasn't supposed to be "easier for the CPU to run", but rather explicitly parallel. And compilers did exist ... or "a" compiler, at least.

Itanium died for one reason and one reason alone: AMD's x86-64 ran IA-32 code natively, as well as or even better than earlier processors, whereas IA-64 ran IA-32 code dozens of times slower, if at all. Companies had an evolutionary road forward when migrating software to x86-64 that didn't exist for Itanium.
 
I think you're confusing the rationale for RISC with VLIW. Itanium code wasn't supposed to be "easier for the CPU to run", but rather explicitly parallel. And compilers did exist ... or "a" compiler, at least.

Itanium died for one reason and one reason alone: AMD's x86-64 ran IA-32 code natively, as well as or even better than earlier processors, whereas IA-64 ran IA-32 code dozens of times slower, if at all. Companies had an evolutionary road forward when migrating software to x86-64 that didn't exist for Itanium.
Pretty much this.

From a pure design perspective, Itanium was well thought out. And while Itanium ran x86 code at about a 20% hit, the intent was that x86 would become legacy, and over time faster Itanium processors would eventually run x86 faster than the fastest released x86 CPU.

The problem is that rather than focusing on gaining share in the consumer market, Intel focused on servers. As a result, when AMD released x86-64, which ran x86 natively (and thus faster), Itanium quickly lost in the marketplace and became an afterthought.

Much like we'd be in a much better place if the Motorola 68000 had beaten the 386, we'd be in a much better situation if Itanium had beaten x86-64.
 