AMD and Google TPU Closing the Gap on NVIDIA

Version 1

2026-05-23 18:14 UTC · 4 items

What

Between May 16 and May 21, 2026, three separate competitive signals converged: AMD's MI355 GPU achieved 40% lower inference cost than NVIDIA's B200 on the GLM5 architecture just 14 weeks after GLM5's launch [1]; Google added nightly CI for the llm-d Kubernetes inference framework on its TPU hardware, which SemiAnalysis characterized as catching up to NVIDIA in CI coverage and code quality [2]; and AMD made what is believed to be its first accepted upstream contribution to an NVIDIA-owned open-source repository, the AIPerf benchmarking sub-project [3]. Meanwhile, Jensen Huang offered a counterintuitive infrastructure thesis: that low Model FLOPs Utilization (MFU) is a feature, not a failure [4].

Why it matters

For years NVIDIA's dominance rested on two pillars: raw hardware performance and a mature software ecosystem (CUDA). These developments suggest both pillars are under pressure simultaneously — AMD is demonstrating cost-competitive inference on ROCm, Google is closing the software infrastructure gap for TPUs, and cross-competitor open-source collaboration is beginning to erode the moat-by-fragmentation dynamic that historically benefited NVIDIA.

Open questions

Does AMD's 40% cost advantage over B200 on GLM5 generalize to other frontier model architectures, or is it specific to GLM5's design? [1]
How quickly will AMD close its lagging support for llm-d relative to NVIDIA and Google TPU? [2]
If Jensen Huang's 'low MFU by design' philosophy reflects a broader NVIDIA infrastructure strategy [4], does that affect how hyperscalers evaluate total cost of ownership versus raw throughput?
Will AMD's contribution to AIPerf [3] open a sustained collaboration channel, or remain a one-off signal?

Narrative

The competitive landscape for AI accelerators shifted noticeably in the week of May 16–21, 2026, with AMD and Google each posting concrete milestones against NVIDIA's incumbency.

The headline result came from AMD: its MI355 GPU is now 40% cheaper than NVIDIA's B200 when serving the GLM5 architecture on a single node in FP8 precision, covering both MTP and non-MTP configurations with speculative decoding on SGLang v0.12 [1]. SemiAnalysis, which published the finding, emphasized that this result holds on both CUDA and ROCm backends and was achieved just 14 weeks after GLM5's initial launch — framing it as evidence that ROCm's software maturity has reached a point where AMD hardware can realize its cost advantages in production inference settings. The analyst house's tagline for the moment: "SPEED IS THE MOAT."
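To make the "40% cheaper" framing concrete, the following is a minimal sketch of how a cost-per-token comparison of this kind can be computed. It assumes cost is measured as node rental price divided by sustained token throughput, which is a common convention but is not confirmed as SemiAnalysis's methodology. The function names and every number below are hypothetical placeholders, not figures from the source.

```python
def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Serving cost in USD per one million generated tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000


def relative_saving(challenger_cost: float, incumbent_cost: float) -> float:
    """Fraction by which the challenger is cheaper than the incumbent."""
    return 1.0 - challenger_cost / incumbent_cost


# Hypothetical placeholder inputs, NOT measured values from the source.
node_a = cost_per_million_tokens(hourly_cost_usd=24.0, tokens_per_second=10_000)
node_b = cost_per_million_tokens(hourly_cost_usd=40.0, tokens_per_second=10_000)
saving = relative_saving(node_a, node_b)
print(f"node A is {saving:.0%} cheaper per token than node B")
```

With these placeholder inputs the two nodes have equal throughput, so the saving comes entirely from the price difference. In practice the same ratio can arise from any mix of lower price and higher throughput, which is why the result may not transfer across model architectures.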

On the software infrastructure front, Google added nightly continuous integration (CI) for llm-d, an open-source Kubernetes-native distributed inference framework, on its TPU hardware [2]. SemiAnalysis characterized TPU as now catching up to NVIDIA in terms of llm-d CI coverage and code quality, while noting that AMD's official llm-d support lags behind both. Nightly CI is a meaningful signal of production-readiness intent: it means Google is committing engineering resources to run the test suite against TPU every night and keep compatibility green, rather than treating OSS support as a periodic patch exercise.

Perhaps the most symbolically charged development was AMD submitting — and having accepted — an upstream contribution to AIPerf, a benchmarking sub-repository within NVIDIA's Dynamo project [3]. SemiAnalysis described this as likely the first time a contribution from AMD has been accepted into any NVIDIA-owned repository, calling it "awesome" and historically significant. Cross-competitor open-source collaboration of this kind is rare in the AI hardware space, where software ecosystems have been deliberately siloed; the willingness of both companies to merge such a contribution suggests at least a narrow alignment of interests around shared benchmarking infrastructure.

Running against the competitive-pressure narrative is an observation from NVIDIA's own CEO. Speaking at Stanford's CS153 Frontier Systems course, Jensen Huang stated that he "would like to be at low MFU all the time," where MFU is Model FLOPs Utilization [4]. His reasoning inverts the conventional framing: low MFU is not a sign of waste but of deliberate over-provisioning of compute, networking, and memory, a posture that leaves headroom to absorb spikes. SemiAnalysis drew a pointed comparison to xAI's kernel engineering culture, implying a similar philosophy may be at work there. If this view diffuses broadly, it could reshape how buyers think about GPU procurement, favoring capacity and flexibility over raw utilization efficiency, which would tend to benefit the incumbent with the most headroom to provision.
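For readers unfamiliar with the metric, MFU is the ratio of the FLOPs a model's workload actually consumes per second to the hardware's theoretical peak. The sketch below is illustrative only. It assumes the widely used rough estimate of about 6 FLOPs per parameter per token for training a dense transformer (about 2 for an inference forward pass), which ignores attention cost. The function name and all input values are hypothetical and do not describe any system mentioned in the source.

```python
def model_flops_utilization(
    tokens_per_second: float,
    num_params: float,
    peak_flops_per_second: float,
    flops_per_param_per_token: float = 6.0,
) -> float:
    """Achieved model FLOPs per second divided by peak hardware FLOPs per second.

    flops_per_param_per_token is an approximation: roughly 6 for training
    (forward plus backward pass), roughly 2 for an inference forward pass.
    """
    if peak_flops_per_second <= 0:
        raise ValueError("peak_flops_per_second must be positive")
    achieved = tokens_per_second * num_params * flops_per_param_per_token
    return achieved / peak_flops_per_second


# Hypothetical placeholder inputs, NOT figures from the source.
utilization = model_flops_utilization(
    tokens_per_second=1_000.0,
    num_params=70e9,
    peak_flops_per_second=1e15,
)
headroom = 1.0 - utilization  # unused fraction of peak compute
print(f"MFU: {utilization:.0%}, headroom: {headroom:.0%}")
```

Under the over-provisioning view described above, the headroom figure is read as spare capacity for demand spikes rather than as waste.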

Timeline

2026-05-16: AMD's contribution accepted into NVIDIA's AIPerf benchmarking repository, believed to be the first AMD contribution merged into any NVIDIA-owned repository. [3]
2026-05-17: Jensen Huang at Stanford CS153 articulates 'low MFU by design' as a deliberate over-provisioning philosophy. [4]
2026-05-19: AMD MI355 reported 40% cheaper than NVIDIA B200 on GLM5 single-node FP8 inference, 14 weeks after GLM5's launch. [1]
2026-05-21: Google adds nightly CI for llm-d on TPU hardware; SemiAnalysis says TPU is catching up to NVIDIA in llm-d CI coverage and code quality. [2]

Perspectives

SemiAnalysis

Enthusiastically bullish on AMD and Google TPU progress; frames inference cost-efficiency and software ecosystem parity as decisive competitive dimensions. Views cross-competitor open-source collaboration as a positive and historically notable signal. Analytically provocative on Jensen Huang's low-MFU philosophy.

Evolution: Consistent — SemiAnalysis has been the sole reporting voice across all four items this pass, maintaining a pro-competition, anti-NVIDIA-moat framing throughout.

[1][2][3][4]

Jensen Huang / NVIDIA

Reframes low GPU utilization as intentional over-provisioning strategy, not inefficiency — implicitly defending a high-capex, high-headroom infrastructure posture.

Evolution: First appearance in this thread; stance not yet contested.

[4]

Tensions

SemiAnalysis frames speed of software enablement, and the inference cost advantage it unlocks, as the decisive competitive moat ('SPEED IS THE MOAT'), implying AMD's 40% cost advantage is structurally significant [1]. Jensen Huang's 'low MFU' philosophy implicitly counters this by elevating over-provisioned capacity and flexibility as the real strategic asset [4], a framing that would blunt cost-efficiency comparisons and favor NVIDIA's scale. [1][4]
Google TPU is catching up to NVIDIA in llm-d CI coverage and code quality [2], while AMD's official llm-d support lags both. This creates an internal tension in the 'challengers closing the gap' narrative: AMD is ahead on hardware cost but behind on open-source inference infrastructure. [1][2]

Sources

[1] AMD ALERT 🚀 MI355 is now 40% cheaper than B200 on GLM5 architecture for Single Node serving FP8 14 weeks after the initi… — SemiAnalysis Twitter (2026-05-19)
[2] TPU ALERT: For OSS production Kubernetes distributed inferencing, Google just added nightly CI for llm-d. Great step by … — SemiAnalysis Twitter (2026-05-21)
[3] SERIOUS & COOL: AIPerf -- a sub-repo of the Nvidia Dynamo project focused on benchmarking LLM workloads -- just acce… — SemiAnalysis Twitter (2026-05-16)
[4] At Stanford CS153 Frontier Systems, Jensen states word for word that he "would like to be at low MFU all the time" &… — SemiAnalysis Twitter (2026-05-17)