NVIDIA GPU Ecosystem Performance Gaps: JAX Dropout, Blackwell Regression, and Kernel Utilization

closed · v3 · 2026-06-02 · 30 items · history

What's new in v3

NVIDIA broke its prior public silence with a Facebook post [13] citing SemiAnalysis inference data selectively in its own defense — the first identifiable public response to the week's criticism wave, though it avoids engaging with the specific JAX MFU failure, Confidential Computing regression, or ptxas miscompile allegations. The ML community surfaced a meaningful reliability counterpoint: a Reddit thread argues that AI-generated CUDA kernels can silently break training and inference [12], adding a correctness-risk dimension to the previously bullish auto-generation narrative. Otherwise, additional LinkedIn commentary amplified the 'NVIDIA moat slipping' framing without introducing new fault lines.

What

NVIDIA's GPU software ecosystem is facing simultaneous failures across three layers. xAI — reportedly NVIDIA's largest GPU customer — abandoned JAX on NVIDIA GPUs after achieving model flops utilization below 10%, building a custom C-based training framework instead [1][2]. NVIDIA's Blackwell Confidential Computing mode lacks NVLink multicast support, producing a 61% performance regression on large-model inference [4][5]. AI-assisted fuzzing exposed 40+ miscompiles in NVIDIA's ptxas GPU compiler in under three days [7]. NVIDIA has now broken its prior public silence, posting on Facebook to cite SemiAnalysis inference data selectively — without addressing the specific failure allegations [13].

Why it matters

The failures span training software, enterprise isolation features, and the compiler layer that underpins all GPU workloads — suggesting systemic rather than incidental gaps in NVIDIA's stack. A GPU compiler that silently miscompiles code is a correctness risk for every customer. The proposed escape hatch of auto-generated CUDA kernels carries its own emerging reliability concern: community researchers now report that AI-generated kernels can silently corrupt training and inference [12], leaving the ecosystem without a clearly safe path to closing the performance gap.

Open questions

Will xAI's custom C training framework deliver significantly higher MFU at scale, and will other large-scale GPU operators follow with bespoke frameworks? [1][2]
Is the NVLink multicast gap in Blackwell Confidential Computing a fixable firmware issue or a fundamental architectural constraint? [4][5]
How many of the 40+ ptxas miscompiles affect production workloads, and what is NVIDIA's timeline for patching a closed-source compiler? [7]
Can auto-generated CUDA kernels reliably close the GPU utilization gap, or will the silent correctness failures flagged by community researchers limit their production viability? [12][10]

Narrative

NVIDIA's GPU training software ecosystem suffered a high-profile defection when xAI — widely regarded as NVIDIA's single largest GPU customer — abandoned JAX on NVIDIA GPUs entirely [1]. According to SemiAnalysis, xAI's JAX-based training stack achieved model flops utilization below 10%, far below the 30–50% MFU competitive AI labs typically target. Rather than continuing to work within NVIDIA's JAX ecosystem, xAI built a custom C-based training framework using its own Grok Build tooling. Community reports suggest xAI was at one point utilizing only around 11% of its 550,000 NVIDIA GPUs [2][3], indicating that the software-hardware gap produced significant operational waste at what is arguably the world's largest GPU cluster.

A second independent failure involves NVIDIA's Blackwell Confidential Computing feature. SemiAnalysis reported that NVLink multicast — critical for high-performance multi-GPU communication — is unsupported in Blackwell's Confidential Computing mode [4]. Benchmarks running SGLang on Qwen3.5 397B inference show a 61% performance regression when the feature is enabled [5], making it practically unusable for large-model workloads. Confidential Computing is marketed to enterprise and regulated-industry customers requiring hardware-level data isolation, and enterprise partners including NTT DATA and Fortanix have announced commercial integrations built on it [6] — suggesting the ecosystem continues building on a feature with a known severe performance gap.

A third failure vector targets the compiler that underlies all GPU code. SemiAnalysis reported that a single engineer, using AI agents and a fuzzer built without reading any code, found 40+ miscompiles in NVIDIA's ptxas GPU compiler in under three days [7], spending more than $10,000 in AI compute in a single afternoon. Both NVIDIA's ptxas and AMD's AMDGPU backend contained bugs found at a similar rate — but AMD's open-source toolchain allowed five of its bugs to be patched rapidly before disclosure, while NVIDIA's closed compiler leaves the pace and scope of remediation publicly unknown.

The structural argument framing all three failures is that the gap between theoretical GPU throughput and real-world production efficiency is too large for manual CUDA kernel optimization to bridge, and that auto-generated kernels represent the path forward [8][9]. Community benchmarks have suggested AI-generated kernels can outperform expert hand-written ones in several workloads [10][11]. A counterpoint has emerged: ML community researchers argue that AI-generated CUDA kernels can silently break training and inference [12], introducing correctness risks that complicate the case for automated generation as a safe production replacement. Against this backdrop, NVIDIA posted on Facebook citing SemiAnalysis inference performance data to argue that top performance drives adoption [13] — a selective engagement that avoids the specific JAX MFU failure, Confidential Computing regression, and compiler miscompile allegations.

Timeline

2026-05-27: SemiAnalysis argues GPUs are 'leaving performance on the table' and that auto-generated CUDA kernels outperform hand-written ones at scale. [8][9]
2026-05-28: SemiAnalysis reports an engineer found 40+ ptxas miscompiles in NVIDIA's GPU compiler in three days via AI-assisted fuzzing; AMD patched five equivalent bugs rapidly via its open-source AMDGPU backend. [7]
2026-05-30: SemiAnalysis reports xAI dropped JAX on NVIDIA GPUs after MFU fell below 10%, switching to a custom C training framework built with Grok Build. [1]
2026-05-30: SemiAnalysis discloses NVLink multicast is unsupported in Blackwell Confidential Computing, causing a 61% performance regression on SGLang Qwen3.5 397B inference. [4][5]
2026-05-30: Community discussion surfaces reports that xAI was at one point using only ~11% of its 550,000 NVIDIA GPUs. [2][3]
2026-06-01: NVIDIA posts on Facebook citing SemiAnalysis inference data to argue performance drives adoption — its first apparent public response to the criticism wave, without addressing specific failure claims. [13]
2026-06-01: ML community researchers surface evidence that AI-generated CUDA kernels can silently corrupt training and inference, adding a reliability counterpoint to the auto-generation narrative. [12]

Perspectives

SemiAnalysis

Sharply and consistently critical of NVIDIA's software ecosystem across training (JAX MFU), enterprise features (Confidential Computing), and the compiler layer (ptxas miscompiles); frames each failure as systemic rather than incidental.

Evolution: Escalating: moved from general GPU performance commentary to specific named-customer failures, feature exposés, and compiler-level correctness bugs — all within a single week.

[1][4][8][9][7]

xAI

Abandoned JAX on NVIDIA GPUs in favor of a bespoke C-based training framework, treating JAX as unworkable for production-scale training.

Evolution: Consistent; xAI's departure is a revealed preference rather than a public statement — the strongest possible vote of no confidence in NVIDIA's training software stack.

[1][2][3]

NVIDIA

Posted on Facebook citing SemiAnalysis inference performance data to argue that best-in-class performance drives adoption, without directly addressing the JAX MFU failure, Confidential Computing regression, or ptxas miscompile allegations.

Evolution: Shifted from complete public silence to selective engagement: co-opting favorable SemiAnalysis data while avoiding the specific failure disclosures.

[13]

AMD

Positioned as a faster and more transparent responder to compiler vulnerabilities due to its open-source AMDGPU backend, having patched five bugs found in the same fuzzing campaign that also hit NVIDIA.

Evolution: Elevated from peripheral presence to direct comparator: now a named example of superior open-source remediation speed versus NVIDIA's closed toolchain.

[7][14]

ML community / open-source researchers

Divided between enthusiasm for auto-generated kernels as a path to closing the GPU utilization gap and warnings that AI-generated CUDA kernels can silently break training and inference.

Evolution: More cautious than before: earlier community work broadly validated auto-generation as promising; community researchers are now actively flagging silent correctness failures as a production risk.

[10][11][15][12]

Enterprise ecosystem (Fortanix, NTT DATA)

Continuing to build commercial integrations on NVIDIA Confidential Computing despite the known 61% performance regression, suggesting adoption may proceed independently of performance costs.

Evolution: Consistent; represents a sustained counterpoint to the claim that the Confidential Computing regression renders the feature commercially unusable.

[6]

Tensions

SemiAnalysis claims xAI's JAX MFU fell below 10% — implying NVIDIA's training software is fundamentally broken at production scale — while NVIDIA's public response cites favorable inference data without disputing the MFU figure. [1][13]
AMD patched five compiler bugs from the AI-assisted fuzzing campaign rapidly via its open-source AMDGPU backend, while NVIDIA's closed ptxas compiler leaves remediation pace and scope publicly unknown. [7]
NVIDIA markets Confidential Computing as enterprise-grade, but benchmark data shows a 61% regression that makes it impractical for large-model inference — while enterprise partners continue commercial buildout regardless. [4][5][6]
Auto-generated CUDA kernels are claimed to outperform expert hand-written ones in benchmarks, but ML community researchers argue they can silently corrupt training and inference, creating a direct reliability-versus-throughput tradeoff. [8][10][12]
NVIDIA's first public response to the SemiAnalysis criticism wave selectively cites SemiAnalysis inference data as validation while ignoring SemiAnalysis's specific failure disclosures — co-opting the critic's own evidence. [13][1][4][7]

Status: active and growing

Sources

[1] BREAKING NEWS: JAX NVIDIA GPU & XLA: GPU's biggest customer just announced that they have dropped JAX GPUs and would… — SemiAnalysis Twitter (2026-05-30)
[2] xAI Is Reportedly Using Just 11% of Its 550k Nvidia GPUs — reactive:nvidia-gpu-ecosystem-gaps
[3] xAI's 11% GPU Utilization Raises Efficiency Concerns - LinkedIn — reactive:nvidia-gpu-ecosystem-gaps
[4] TRUTH SOCIAL: NVLink multicast is not supported on Blackwell "Confidential Computing" leading to 61% performance regress… — SemiAnalysis Twitter (2026-05-30)
[5] NVLink multicast is not supported on Blackwell "Confidential ... — reactive:nvidia-gpu-ecosystem-gaps
[6] NTT DATA & Fortanix partner for AI-era security — reactive:nvidia-gpu-ecosystem-gaps
[7] Finding Miscompiles for Fun, Not Profit — SemiAnalysis Twitter (2026-05-28)
[8] GPUs are leaving performance on the table. — SemiAnalysis Twitter (2026-05-27)
[9] GPUs are leaving performance on the table. — SemiAnalysis Twitter (2026-05-27)
[10] AI-generated CUDA kernels outperform PyTorch in several GPU ... — reactive:nvidia-gpu-ecosystem-gaps
[11] Generating Fast GPU Kernels without Programming in CUDA/Triton — reactive:nvidia-gpu-ecosystem-gaps
[12] AI-generated CUDA kernels silently break training and inference [R] — reactive:nvidia-gpu-ecosystem-gaps
[13] NVIDIA — reactive:nvidia-gpu-ecosystem-gaps
[14] The Many Aspects of Inference Performance - AMD — reactive:nvidia-gpu-ecosystem-gaps
[15] From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels — reactive:nvidia-gpu-ecosystem-gaps