AMD and Google TPU Closing the Gap on NVIDIA · history

Version 2

2026-05-24 04:19 UTC · 58 items

What

In mid-May 2026, AMD, Google TPU, and NVIDIA are competing simultaneously on hardware cost, software infrastructure, and open-source benchmarking standards. AMD's MI355 GPU delivers 40% lower inference cost than NVIDIA's B200 on GLM5 single-node FP8 workloads [1]. AMD's MI355X is also reported to cost more to manufacture than its NVIDIA rival while selling at a meaningfully lower price [2], which implies a deliberate margin sacrifice. Google added nightly CI for the llm-d Kubernetes inference framework on TPU hardware [3] and donated llm-d and TPU drivers to the CNCF [4], deepening its open-source infrastructure commitment. AMD landed an upstream contribution in NVIDIA's AIPerf benchmarking tool [8], believed to be the first such cross-competitor merge. A concrete gap persists, however: an open GitHub issue reports that llm-d images do not yet support AMD MI300 GPUs [7], leaving AMD behind both NVIDIA and Google TPU on the inference stack that enterprise Kubernetes operators increasingly require.

Why it matters

NVIDIA's dominance has rested on two pillars: raw hardware performance and a mature CUDA software ecosystem. AMD is now demonstrating cost-competitive inference hardware while pricing below NVIDIA despite higher manufacturing cost, a margin trade made to gain share. Google TPU is closing the software infrastructure gap via production-grade Kubernetes tooling donated to a neutral foundation. AMD's position is asymmetric: it is ahead on cost-per-token but behind on the open-source inference stack, and how buyers weigh that trade-off is the central unresolved question for enterprise infrastructure choices in 2026.

Open questions

Does AMD's 40% cost advantage on GLM5 generalize to other frontier model architectures, or is it specific to GLM5's design and speculative decoding configuration? [1]
When will AMD close the llm-d support gap? A GitHub issue confirms llm-d images do not yet support AMD MI300 GPUs [7] — a concrete production blocker for enterprise operators standardized on Kubernetes inference.
Google published 3X TPU inference speedups using diffusion-style speculative decoding [5]; how does this compare to AMD's speculative decoding gains, and does it shift the throughput calculus independent of cost?
If Jensen Huang's 'low MFU by design' philosophy [9] spreads among hyperscalers, does the procurement frame shift from cost-per-token toward over-provisioned capacity — blunting AMD's inference cost advantage?

Narrative

The competition for AI accelerator dominance entered a multi-front phase in mid-May 2026, with AMD, Google TPU, and NVIDIA each posting concrete moves across hardware economics, software infrastructure, and open-source collaboration.

The headline hardware result is AMD's: its MI355 GPU is 40% cheaper than NVIDIA's B200 for single-node FP8 inference on the GLM5 architecture, using speculative decoding via SGLang v0.12 across both MTP and non-MTP configurations [1]. SemiAnalysis, which reported the finding, emphasized that it holds on both CUDA and ROCm backends and was achieved just 14 weeks after GLM5's initial launch — framing it as evidence that ROCm's software maturity has crossed a threshold where AMD hardware can realize its cost advantages in production inference. Separately, AMD's MI355X has been confirmed to carry higher manufacturing costs than NVIDIA's competing chip while selling at a substantially lower price [2], a pricing posture that trades near-term margin for market share in the inference cloud.
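The arithmetic behind a "40% cheaper" claim can be sketched as follows. This is a minimal illustration, not the methodology SemiAnalysis used: the function names, hourly prices, and throughput figures are all hypothetical placeholders, chosen only to show how price and throughput combine into cost per token.

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Dollars to generate one million tokens at a sustained throughput."""
    if hourly_price_usd < 0 or tokens_per_second <= 0:
        raise ValueError("price must be non-negative and throughput positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000


def relative_saving(challenger_cost: float, baseline_cost: float) -> float:
    """Fraction by which the challenger undercuts the baseline (0.4 means 40% cheaper)."""
    return 1.0 - challenger_cost / baseline_cost


# Hypothetical numbers, NOT taken from the cited benchmark.
baseline_cost = cost_per_million_tokens(hourly_price_usd=5.00, tokens_per_second=10_000)
challenger_cost = cost_per_million_tokens(hourly_price_usd=3.00, tokens_per_second=10_000)
saving = relative_saving(challenger_cost, baseline_cost)

print(f"baseline:   ${baseline_cost:.4f} per 1M tokens")
print(f"challenger: ${challenger_cost:.4f} per 1M tokens")
print(f"saving:     {saving:.0%}")
```

Under these placeholder inputs the saving comes entirely from the lower hourly price at equal throughput. The same ratio could equally arise from higher throughput at equal price, which is why a cost-per-token headline on its own does not say which lever produced it.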

On the software infrastructure front, Google made two complementary moves. It added nightly continuous integration for llm-d, an open-source Kubernetes-native distributed inference framework, on its TPU hardware [3], signaling a commitment to validating TPU compatibility against the latest code every night rather than treating OSS support as a periodic patch. Google also donated llm-d and its TPU drivers to the Cloud Native Computing Foundation (CNCF) [4], placing the framework under neutral governance and lowering the barrier for enterprise operators to standardize on it. Google has separately published 3X inference speedups on TPU using diffusion-style speculative decoding [5], adding a throughput dimension to its competitiveness claim beyond Kubernetes tooling. Community commentary has flagged the software moat angle as under-appreciated: one observer noted that Google wiring TPU into llm-d CI represents the software moat narrowing, not just the silicon gap closing [6].

AMD's relative position on the software stack is more complicated. A GitHub issue filed against the llm-d repository reports that llm-d images do not yet support AMD MI300 GPUs [7], making AMD's llm-d gap concrete rather than merely relative. AMD's upstream contribution to NVIDIA's AIPerf benchmarking sub-project within the Dynamo repository [8], believed to be the first such cross-competitor merge, suggests some willingness to collaborate on shared measurement infrastructure, but it does not close the inference-framework gap. Running against the competitive-pressure narrative is Jensen Huang's statement at Stanford's CS153 that he would prefer to operate at low Model FLOPs Utilization (MFU) at all times [9]. His argument inverts the conventional framing by positioning deliberate over-provisioning as a strategic asset rather than waste, a posture that, if adopted broadly by hyperscalers, would tend to favor capacity incumbents over cost-efficiency challengers.
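For readers unfamiliar with the metric, MFU is the ratio of the FLOPs a model usefully performs per second to the accelerator's peak FLOPs per second. The sketch below assumes the common approximation of roughly 2 FLOPs per parameter per token for a dense transformer forward pass. The parameter count, throughput, and peak figure are hypothetical and describe no specific chip or model.

```python
def model_flops_utilization(
    params: float,
    tokens_per_second: float,
    peak_flops_per_second: float,
    flops_per_param_per_token: float = 2.0,  # assumption: dense forward pass only
) -> float:
    """Achieved model FLOPs per second divided by hardware peak FLOPs per second."""
    if peak_flops_per_second <= 0:
        raise ValueError("peak FLOPs must be positive")
    achieved = flops_per_param_per_token * params * tokens_per_second
    return achieved / peak_flops_per_second


# Hypothetical deployment: 70e9 parameters, 1,000 tokens/s, 1e15 peak FLOP/s.
mfu = model_flops_utilization(
    params=70e9,
    tokens_per_second=1_000,
    peak_flops_per_second=1e15,
)
headroom = 1.0 - mfu  # unused compute, i.e. the over-provisioned capacity

print(f"MFU:      {mfu:.0%}")
print(f"headroom: {headroom:.0%}")
```

In this framing, a low MFU corresponds to a large headroom value. Whether that headroom is waste or a strategic reserve is the disagreement described above; the calculation itself does not settle it.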

Timeline

2026-05-16: AMD's contribution accepted into NVIDIA's AIPerf benchmarking repository — believed to be a first cross-competitor upstream merge. [8]
2026-05-17: Jensen Huang at Stanford CS153 articulates 'low MFU by design' as a deliberate over-provisioning philosophy. [9]
2026-05-19: AMD MI355 confirmed 40% cheaper than NVIDIA B200 on GLM5 single-node FP8 inference, 14 weeks post-launch. [1]
2026-05-21: Google adds nightly CI for llm-d on TPU hardware; SemiAnalysis says TPU has caught up to NVIDIA in llm-d code quality. Google Cloud also donates llm-d and TPU drivers to the CNCF. [3][4]
2026-05-21: Community commentary highlights software moat narrowing — not just silicon — as the under-appreciated signal in the TPU/llm-d CI development. [6]

Perspectives

SemiAnalysis

Enthusiastically bullish on AMD and Google TPU progress; frames inference cost-efficiency and software ecosystem parity as decisive competitive dimensions. Views cross-competitor open-source collaboration as historically notable. Analytically provocative on Jensen Huang's low-MFU philosophy.

Evolution: Consistent — SemiAnalysis has been the primary reporting voice throughout, maintaining a pro-competition, anti-NVIDIA-moat framing.

[8][9][1][3]

Jensen Huang / NVIDIA

Reframes low GPU utilization as intentional over-provisioning strategy, not inefficiency — implicitly defending a high-capex, high-headroom infrastructure posture that blunts cost-efficiency comparisons.

Evolution: Stance not yet contested by any named voice in this thread.

[9]

Community observers (Twitter/X amplifiers)

Broadly validating the SemiAnalysis framing; one commentator specifically flags the software moat angle as under-appreciated, arguing Google wiring TPU into llm-d CI is the more significant signal than silicon performance alone.

Evolution: New voice cluster this pass; adds a software-moat framing that partially extends beyond the SemiAnalysis hardware-cost narrative.

[6][10][11][12][13][14][15][16][17]

AMD (pricing actions)

Revealed preference via MI355X pricing: absorbing margin by selling at a lower price than NVIDIA's competing chip despite higher manufacturing cost [2], prioritizing inference market share over near-term profitability.

Evolution: Implicit stance inferred from pricing data; not a stated position. First time this pricing dimension appears explicitly in the thread.

[2]

Tensions

SemiAnalysis frames inference cost-per-token as the decisive competitive moat ('SPEED IS THE MOAT'), implying AMD's 40% cost advantage is structurally significant [1]. Jensen Huang's 'low MFU' philosophy implicitly counters this by elevating over-provisioned capacity and flexibility as the real strategic asset [9] — a framing that would blunt cost-efficiency comparisons and favor the incumbent with the most headroom to provision. [1][9]
Google TPU has gained parity with NVIDIA in llm-d CI coverage [3] and formalized its commitment by donating llm-d to the CNCF [4], while AMD's official llm-d support is absent — confirmed by an open GitHub issue showing llm-d images do not support AMD MI300 GPUs [7]. AMD leads on hardware cost but trails on the open-source inference stack that enterprise Kubernetes operators require. [3][4][7][1]

Sources

[1] AMD ALERT 🚀 MI355 is now 40% cheaper than B200 on GLM5 architecture for Single Node serving FP8 14 weeks after the initi… — SemiAnalysis Twitter (2026-05-19)
[2] AMD's MI355X costs more to build but sells for much less than ... — reactive:gpu-accelerator-competition
[3] TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for llm-d. Great step by … — SemiAnalysis Twitter (2026-05-21)
[4] Google Cloud Donates llm-d, TPU Drivers, and More to CNCF | KubeFM — reactive:gpu-accelerator-competition
[5] Supercharging LLM inference on Google TPUs: Achieving 3X ... — reactive:gpu-accelerator-competition
[6] @SemiAnalysis_ The under-appreciated bit: it's the *software* moat narrowing, not just silicon. Google wiring TPU into l... — reactive:gpu-accelerator-competition (2026-05-21)
[7] llm-d image doesn't support AMD MI300 GPU's? · Issue #139 - GitHub — reactive:gpu-accelerator-competition
[8] SERIOUS & COOL: AIPerf -- a sub-repo of the Nvidia Dynamo project focused on benchmarking LLM workloads -- just acce… — SemiAnalysis Twitter (2026-05-16)
[9] At Stanford CS153 Frontier Systems, Jensen states word for word that he "would like to be at low MFU all the time" &… — SemiAnalysis Twitter (2026-05-17)
[10] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)
[11] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)
[12] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)
[13] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)
[14] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)
[15] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)
[16] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)
[17] RT @SemiAnalysis_: TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for ll... — reactive:gpu-accelerator-competition (2026-05-21)