Nvidia's next-gen Blackwell chip may be facing delays due to design flaws

Skye Jacobs

Facepalm: Nvidia's highly anticipated Blackwell series of AI chips has encountered a significant setback. Newly discovered design flaws will delay shipments by at least three months. This delay will likely cause considerable disappointment among customers who have placed billions of dollars in orders.

Nvidia's highly anticipated Blackwell series of AI chips is facing significant delays due to design flaws discovered late in manufacturing. The Information cites two anonymous sources involved with Nvidia's chip and server hardware production, stating that the issue could take at least three months to resolve.

The crux of the issue lies in the processor die connecting two Blackwell GPUs on a GB200 chip, a problem identified by manufacturer TSMC. In response, Nvidia is revising the design and will need to conduct new production tests with TSMC before mass production begins. As a stopgap measure, the company is considering producing a single-GPU version of the Blackwell chip to expedite delivery.

The delay has far-reaching implications for Big Tech players who have invested heavily in Nvidia's technology. For instance, Google has ordered over 400,000 GB200 chips in a deal exceeding $10 billion. Similarly, Meta has placed a $10 billion order, while Microsoft had plans to have 55,000 to 65,000 GB200 GPUs ready for OpenAI by the first quarter of 2025 – a timeline now in jeopardy.

Nvidia has reportedly informed Microsoft and another cloud provider about the delay affecting the most advanced AI chip models in the Blackwell series. Consequently, significant shipments of these chips are not expected until the first quarter of 2025, potentially disrupting the AI strategies of these tech giants.

Despite these reports, Nvidia's official stance remains optimistic. A company spokesperson stated that "production is on track to ramp" later this year without directly addressing the reported delay. Meanwhile, the affected companies, including Microsoft, Google, Amazon Web Services, and Meta, have declined to comment.

The setback could allow Nvidia's competitors to gain ground in the AI chip market. Intel and AMD have struggled to impact Nvidia's market share since the outset of the AI boom. However, the delay might let them reposition their products as viable alternatives for customers needing immediate solutions.

For instance, AMD designed its open-source ROCm framework to compete directly with Nvidia's CUDA, offering developers an alternative to build AI applications without being locked into Nvidia's ecosystem. Likewise, Intel is developing AI accelerator chips, including the Gaudi line, as more affordable alternatives. According to Intel, its AI accelerators are one-third to two-thirds the price of competing brands.

As customers move past the initial disappointment of the delay, they may question Nvidia's ability to maintain its dominant 80-percent market share in the face of production challenges and increasing competition. As the AI arms race intensifies, the industry will closely watch how Nvidia navigates this hurdle and whether it can deliver on its promises to its high-profile customers.

 
But does cheaper AI result in a better product than the competitor? We will see soon enough.
It had better, if they want to be competitive against nVidia. Otherwise, they will all just fall behind.
 
Imagine being the Key Account Manager who has to phone your client and say their $10bn order is delayed for 3 months minimum. I don't envy him.
 
Intel: chip-level flaws. Foundry: TSMC
AMD: chip-level flaws? Foundry: TSMC
nVidia: chip-level flaws. Foundry: TSMC

Thinking...thinking...nope, no discernable pattern.
 
Intel problems are their own doing on their own fab. Samsung has had issues of their own. AMD had issues even when they had their own fabs. I don't know what pattern you imply. What I see is that more and more of the advanced hardware is built using TSMC fabs.
 
AMD is already competitive against Nvidia, and in some instances AMD's solution is much better. In AI computing, it doesn't really matter whether you have the best hardware, especially for AI training, where response times don't really matter. No one really cares if training took 41 days instead of 40. This is hype: company A must have Nvidia just because company B does. Not that they even use the chips they buy: https://www.tomshardware.com/pc-com...ver-forgotten-cluster-was-powered-on-and-idle
They buy them first and then figure out what to do with them. If anything.
Raptor Lake uses Intel foundry.

AMD Zen5-problems are still somewhat unknown.

Nvidia uses TSMC. That you got correct.
 
I think it says design flaw, so I am not sure how that ends up as a chip-level flaw. Also, Intel is using its own foundry, so I think you are mixing things up.
 
True. Although it's somewhat easier when both you and your client are equally aware that if they no longer want their $10 billion reserved order, someone else does. Also when the reason for the rush is keeping up with the competition, and your delay is equally impacting that same competition. All in all I'd still like the account and the commission.
 
AMD Zen 5 was a misprint on the name. If it had been a silicon issue it would have taken a LOT longer than 2 weeks to fix.
 
You're right about Intel, at least insofar as they haven't identified a specific fab; I was wrong on that. I didn't mention Samsung. On AMD's issues I put a question mark, as we don't know if their pullback of the 9000 series is due to wafer fab, microcode, or mislabeling.
 
Yup, I was wrong on Intel. Still unknown on AMD though probably not wafer fab level.
 
I heard a rumor that it was a misprint - this has been confirmed now?

Certain wafer issues can be corrected with microcode after the fact.
 
Nothing to do with printing errors. They discovered a batch of 9950X's that did not meet spec. They pulled the whole stock to retest, and those bad 9950X's might become 9900X or 9700X. Only a short delay, and good on them for catching it early.
 