AMD and Google TPU Closing the Gap on NVIDIA
Version 3
2026-05-24 18:35 UTC · 74 items
What
AMD, Google, IBM, and NVIDIA are competing across hardware benchmarks, software infrastructure, and market economics in mid-2026. AMD's MI355 delivers 40% lower inference cost than NVIDIA's B200 on GLM5 FP8 workloads [1], and AMD has now posted self-described 'breakthrough' results in the standardized MLPerf Inference 6.0 round [2]. Google and IBM jointly donated llm-d and its TPU drivers to the CNCF in March 2026 [6][7], with CoreWeave publicly endorsing the move as significant for production inference [10]. Against this, NVIDIA projects $1 trillion in GPU purchase orders through 2027 [15], a macro-level demand signal that points to a durable incumbency advantage. AMD's concrete gap persists: llm-d images still do not support AMD MI300 GPUs [13], leaving it behind both NVIDIA and Google TPU on the Kubernetes-native inference stack that enterprise operators increasingly require.
Why it matters
NVIDIA's dominance has rested on raw hardware performance and a mature CUDA software ecosystem. AMD is demonstrating cost-competitive inference hardware and standardized benchmark milestones while pricing the MI355X well below NVIDIA's competing chip, despite higher manufacturing costs, to gain share. Google and IBM are constructing a neutral, CNCF-governed Kubernetes inference stack on which Google TPU and NVIDIA hardware are supported but AMD hardware is not yet. Whether AMD can close its software gap before NVIDIA's projected demand wave locks in a new hardware cycle is the central question for enterprise AI infrastructure in 2026.
Open questions
Does AMD's 40% cost advantage on GLM5 generalize to other frontier model architectures, or is it specific to GLM5's design and speculative decoding configuration? [1]
AMD's MLPerf Inference 6.0 results are self-described as 'breakthrough' [2] — how do they compare against NVIDIA's submissions on the same benchmark tasks, and do they substantiate the cost-per-token claims seen in the GLM5 tests?
When will AMD close the llm-d support gap? An open GitHub issue confirms llm-d images do not support AMD MI300 GPUs [13], a concrete production blocker for enterprise Kubernetes operators standardized on the framework.
If Jensen Huang's $1 trillion GPU purchase order projection through 2027 [15] proves accurate, does the scale of NVIDIA's demand backlog reduce hyperscaler urgency to switch to AMD or restructure procurement around cost efficiency?
Narrative
The competition for AI accelerator dominance entered a multi-front phase in 2026, with AMD, Google, IBM, and NVIDIA each posting concrete moves across hardware economics, benchmark transparency, and software infrastructure governance.
The headline hardware result is AMD's: its MI355 GPU is 40% cheaper than NVIDIA's B200 for single-node FP8 inference on the GLM5 architecture, using speculative decoding via SGLang v0.12 across both MTP and non-MTP configurations [1]. SemiAnalysis, which reported the finding, emphasized it was achieved just 14 weeks after GLM5's initial launch and holds on both CUDA and ROCm backends, framing it as evidence that ROCm's software maturity has crossed a threshold where AMD hardware can realize its cost advantages in production inference. AMD has added a standardized benchmark milestone: the company published what it describes as 'breakthrough' results in the MLPerf Inference 6.0 round [2], situating AMD more formally alongside NVIDIA [3] in the industry's primary hardware benchmarking process and extending the cost-advantage claim beyond a single architecture test. AMD's MI355X carries higher manufacturing costs than NVIDIA's competing chip while selling at a substantially lower price [4], a deliberate margin sacrifice to gain inference market share. One complicating pricing signal: AMD's MI350, another GPU in the same family, has seen its price jump 66.7% amid NVIDIA rivalry [5], suggesting AMD's discounting is not uniform across its accelerator portfolio and likely reflects demand dynamics on specific SKUs.
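To make the "40% cheaper" framing concrete, the following is a minimal sketch of how a cost-per-token comparison is typically computed from a node's hourly price and its sustained throughput. The function names and every numeric input are hypothetical, chosen only so the arithmetic is easy to follow; they are not the figures behind the SemiAnalysis result [1], which are not given in the sources above.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """USD cost to serve one million tokens on one node at steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def relative_saving(challenger_cost: float, incumbent_cost: float) -> float:
    """Fraction by which the challenger is cheaper (0.40 means 40% cheaper)."""
    return 1.0 - challenger_cost / incumbent_cost


# Hypothetical inputs (NOT measured values): same throughput, different node price.
incumbent_cost = cost_per_million_tokens(hourly_price_usd=40.0, tokens_per_second=10_000)
challenger_cost = cost_per_million_tokens(hourly_price_usd=24.0, tokens_per_second=10_000)
saving = relative_saving(challenger_cost, incumbent_cost)

print(f"incumbent:  ${incumbent_cost:.4f} per 1M tokens")
print(f"challenger: ${challenger_cost:.4f} per 1M tokens")
print(f"challenger is {saving:.0%} cheaper")
```

The same ratio can come from a lower price, higher throughput, or a mix of both, which is why a result on one model and one serving configuration may not carry over to others.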
On the software infrastructure front, Google and IBM jointly donated llm-d, an open-source Kubernetes-native distributed inference framework, and its TPU drivers to the Cloud Native Computing Foundation in March 2026 [6][7], placing the project under neutral governance and lowering the barrier for enterprise operators to standardize on it. A CNCF sandbox application for llm-d was formally filed [8], and Google presented GKE and open-source inference innovations at KubeCon EU 2026 [9], deepening the Kubernetes-native inference ecosystem around its hardware. IBM's involvement as a co-originator of the donation is significant: it broadens the coalition behind the neutral-governance inference stack beyond hyperscalers to enterprise IT incumbents. CoreWeave, a major NVIDIA-centric cloud provider, publicly framed llm-d's CNCF acceptance as significant for production inference infrastructure [10], a notable cross-vendor endorsement from a company whose business model is built on NVIDIA hardware. Google further added nightly continuous integration for llm-d on TPU hardware [11] and published 3X inference speedups using diffusion-style speculative decoding [12], adding throughput gains on top of its Kubernetes tooling work. AMD's position on this stack remains a concrete liability: an open GitHub issue confirms llm-d images do not yet support AMD MI300 GPUs [13], creating a specific production blocker while NVIDIA and Google TPU both have active CI coverage.
AMD's upstream contribution to NVIDIA's AIPerf benchmarking sub-project within the Dynamo repository [14], believed to be the first such cross-competitor upstream merge, signals some willingness to collaborate on shared measurement infrastructure, but the inference-framework gap that matters most to enterprise Kubernetes operators remains unresolved. Against the AMD and Google TPU competitive narrative, Jensen Huang has projected $1 trillion in GPU purchase orders through 2027 [15], a macro-level demand claim suggesting NVIDIA expects its incumbency advantages to persist at scale regardless of cost-per-token comparisons. Huang has also articulated a 'low MFU by design' philosophy [16], arguing that deliberate GPU over-provisioning is a strategic asset rather than waste. If hyperscalers adopt that framing broadly, it would favor capacity incumbents over cost-efficiency challengers and blunt AMD's inference cost advantage.
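For readers unfamiliar with the metric behind the 'low MFU' remark, MFU (model FLOPs utilization) is the ratio of the FLOPs a workload actually performs per second to the hardware's peak FLOPs per second. The sketch below is illustrative only: it assumes the common rule of thumb of roughly 2 FLOPs per parameter per token for a dense forward pass, and the model size, throughput, and peak figures are hypothetical rather than taken from any source above.

```python
def forward_flops_per_token(n_params: float) -> float:
    """Rule-of-thumb estimate: ~2 FLOPs per parameter per token (dense forward pass)."""
    return 2.0 * n_params


def model_flops_utilization(tokens_per_second: float,
                            flops_per_token: float,
                            peak_flops_per_second: float) -> float:
    """Achieved model FLOP/s divided by the hardware's peak FLOP/s."""
    achieved_flops_per_second = tokens_per_second * flops_per_token
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical workload (NOT measured values).
flops_per_token = forward_flops_per_token(n_params=70e9)  # 1.4e11 FLOPs per token
mfu = model_flops_utilization(
    tokens_per_second=5_000,
    flops_per_token=flops_per_token,
    peak_flops_per_second=2e15,
)
print(f"MFU: {mfu:.0%}")
print(f"unused peak capacity: {1 - mfu:.0%}")
```

Read this way, a deliberately low MFU means buying more peak capacity than the average workload consumes, trading cost efficiency for headroom.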
Timeline
- 2026-03-24: CNCF formally welcomes llm-d to its sandbox; IBM and Google co-donate llm-d and TPU drivers to the CNCF under neutral governance. [6][7][8]
- 2026-05-16: AMD's contribution accepted into NVIDIA's AIPerf benchmarking repository — believed to be a first cross-competitor upstream merge. [14]
- 2026-05-17: Jensen Huang at Stanford CS153 articulates 'low MFU by design' as a deliberate over-provisioning philosophy. [16]
- 2026-05-19: AMD MI355 confirmed 40% cheaper than NVIDIA B200 on GLM5 single-node FP8 inference, 14 weeks post-launch. [1]
- 2026-05-21: Google adds nightly CI for llm-d on TPU hardware; SemiAnalysis notes TPU has reached parity with NVIDIA in llm-d code quality. Community commentary highlights software moat narrowing as the under-appreciated signal. [11][17]
- 2026-05: Jensen Huang projects $1 trillion in GPU purchase orders through 2027. [15]
- 2026-05: AMD publishes 'breakthrough' MLPerf Inference 6.0 results, marking formal participation in standardized hardware benchmarking alongside NVIDIA. [2]
- 2026-05: AMD MI350 GPU price rises 66.7% amid NVIDIA rivalry, adding complexity to the narrative that AMD is discounting aggressively to win share. [5]
Perspectives
SemiAnalysis
Bullish on AMD and Google TPU progress; frames inference cost-efficiency and software ecosystem parity as decisive competitive dimensions. Views cross-competitor open-source collaboration as historically notable. Analytically provocative on Jensen Huang's low-MFU philosophy.
Evolution: Consistent throughout — SemiAnalysis has been the primary reporting voice, maintaining a pro-competition, anti-NVIDIA-moat framing.
Jensen Huang / NVIDIA
Projects $1 trillion in GPU purchase orders through 2027, signaling macro-level demand confidence. Simultaneously reframes low GPU utilization as intentional over-provisioning strategy, positioning deliberate headroom as a strategic asset that blunts cost-efficiency comparisons.
Evolution: Strengthened this pass: the $1 trillion demand forecast adds a market-scale confidence signal to the earlier 'low MFU by design' philosophy, making NVIDIA's counter-narrative more explicitly about demand incumbency and scale.
IBM Research
Co-donated llm-d to the CNCF, revealing IBM as a primary project contributor alongside Google — not merely an observer. IBM's involvement signals that the neutral-governance inference stack has enterprise-credibility backers beyond hyperscalers.
Evolution: New voice this pass; the IBM Research donation blog corrects the prior framing that attributed the CNCF donation to Google alone.
CoreWeave
Publicly frames llm-d's CNCF acceptance as significant for production inference infrastructure, despite CoreWeave's own infrastructure being NVIDIA-centric. A notable cross-vendor endorsement.
Evolution: New voice this pass; represents a non-trivial signal that the llm-d CNCF governance shift carries weight beyond immediate contributors.
AMD (hardware and pricing actions)
Posting self-described 'breakthrough' MLPerf Inference 6.0 results alongside the GLM5 cost advantage; pricing MI355X well below NVIDIA's competing chip despite higher manufacturing costs, while MI350 prices have risen 66.7%, revealing a nuanced rather than uniformly aggressive pricing posture across its accelerator portfolio.
Evolution: MLPerf benchmark participation is new this pass; the MI350 price jump adds complexity to the earlier aggressive-discount pricing narrative.
Community observers (Twitter/X amplifiers)
Broadly validating the SemiAnalysis framing; one commentator specifically flags the software moat angle as under-appreciated, arguing that Google wiring TPU into llm-d CI is a more significant signal than silicon performance alone.
Evolution: Consistent with prior pass; no new named voices in the community cluster.
Tensions
- SemiAnalysis frames inference cost-per-token as the decisive competitive moat ('SPEED IS THE MOAT'), implying AMD's 40% cost advantage is structurally significant [1]. Jensen Huang's $1 trillion demand projection [15] and 'low MFU' philosophy [16] implicitly counter this: if hyperscalers lock in GPU purchase orders at scale and prioritize over-provisioned capacity flexibility, cost-per-token comparisons matter less than incumbency and demand volume. [1][15][16]
- Google and IBM have placed llm-d under neutral CNCF governance [6][7], and Google has achieved nightly CI parity with NVIDIA on the framework [11], while official llm-d support for AMD is absent, as confirmed by an open GitHub issue showing llm-d images do not support AMD MI300 GPUs [13]. AMD leads on hardware cost and now has formal MLPerf results [2], but trails on the open-source inference stack that enterprise Kubernetes operators require. [6][7][11][13][2]
Sources
- [1] AMD ALERT 🚀 MI355 is now 40% cheaper than B200 on GLM5 architecture for Single Node serving FP8 14 weeks after the initi… — SemiAnalysis Twitter (2026-05-19)
- [2] AMD Delivers Breakthrough MLPerf Inference 6.0 Results — reactive:gpu-accelerator-competition
- [3] MLPerf AI Benchmarks - NVIDIA — reactive:gpu-accelerator-competition
- [4] AMD's MI355X costs more to build but sells for much less than ... — reactive:gpu-accelerator-competition
- [5] AMD MI350 Price Jumps 66.7% Amid Nvidia Rivalry - SmBom — reactive:gpu-accelerator-competition
- [6] Welcome llm-d to the CNCF: Evolving Kubernetes into SOTA AI infrastructure | CNCF — reactive:gpu-accelerator-competition
- [7] Donating llm-d to the Cloud Native Computing Foundation - IBM Research — reactive:gpu-accelerator-competition
- [8] [Sandbox] llm-d · Issue #462 · cncf/sandbox - GitHub — reactive:gpu-accelerator-competition
- [9] GKE and OSS innovation at KubeCon EU 2026 | Google Cloud Blog — reactive:gpu-accelerator-competition
- [10] Why llm-d in CNCF Matters for Production Inference — reactive:gpu-accelerator-competition
- [11] TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for llm-d. Great step by … — SemiAnalysis Twitter (2026-05-21)
- [12] Supercharging LLM inference on Google TPUs: Achieving 3X ... — reactive:gpu-accelerator-competition
- [13] llm-d image doesn't support AMD MI300 GPU's? · Issue #139 - GitHub — reactive:gpu-accelerator-competition
- [14] SERIOUS & COOL: AIPerf -- a sub-repo of the Nvidia Dynamo project focused on benchmarking LLM workloads -- just acce… — SemiAnalysis Twitter (2026-05-16)
- [15] Jensen Huang just made the most audacious prediction in semiconductor history: $1 trillion in GPU purchase orders through 2027. — reactive:gpu-accelerator-competition
- [16] At Stanford CS153 Frontier Systems, Jensen states word for word that he "would like to be at low MFU all the time" &… — SemiAnalysis Twitter (2026-05-17)
- [17] @SemiAnalysis_ The under-appreciated bit: it's the *software* moat narrowing, not just silicon. Google wiring TPU into l... — reactive:gpu-accelerator-competition (2026-05-21)
- [18] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)