NVIDIA GPU Ecosystem Performance Gaps: JAX Dropout, Blackwell Regression, and Kernel Utilization · history

Version 2

2026-05-31 08:12 UTC · 20 items

What

NVIDIA's GPU software ecosystem is facing failures on three simultaneous fronts, all surfaced primarily by SemiAnalysis. xAI, reportedly NVIDIA's largest GPU customer, abandoned JAX on NVIDIA GPUs after achieving model flops utilization below 10%, building a custom C-based training framework instead [1][2]. NVIDIA's Blackwell Confidential Computing mode lacks NVLink multicast support, causing a 61% performance regression on large-model inference [4][5]. Most recently, AI-assisted fuzzing uncovered 40+ miscompiles in NVIDIA's ptxas GPU compiler in under three days, with AMD's open-source AMDGPU backend patching five equivalent bugs rapidly by contrast [7].

Why it matters

Taken together, these failures span NVIDIA's training software (JAX/MFU), enterprise isolation features (Confidential Computing), and the compiler layer that underpins all GPU workloads. A GPU compiler that silently miscompiles code is a correctness risk affecting every customer, not just those chasing peak utilization. NVIDIA's closed toolchain makes the pace and transparency of remediation unclear in a way AMD's open-source stack does not.

Open questions

Will xAI's switch to a custom C training framework deliver significantly higher MFU, and will other large-scale operators follow with bespoke approaches? [1][2]
Is the NVLink multicast gap in Blackwell Confidential Computing a fundamental architectural constraint or a firmware-level fix — and will enterprise partners like Fortanix absorb the performance cost regardless? [4][6]
How many of the 40+ ptxas miscompiles found by AI-assisted fuzzing affect production workloads, and what is NVIDIA's patch timeline for a closed compiler? [7]
Can auto-generated CUDA kernels reliably close the GPU utilization gap at production scale without introducing new correctness or maintainability risks? [8][10]

Narrative

NVIDIA's GPU ecosystem is facing a credibility challenge across its entire software stack simultaneously. The most operationally visible failure is xAI's reported decision to abandon JAX on NVIDIA GPUs entirely. According to SemiAnalysis, xAI — widely regarded as NVIDIA's single largest GPU customer — found its JAX-based training stack achieving model flops utilization below 10%, far below the 30–50% MFU that competitive AI labs typically target [1]. Rather than continuing to work with NVIDIA's JAX engineering team, xAI built a custom C-based training framework using its own Grok Build tooling. Community discussion has surfaced reports that xAI was at one point utilizing only around 11% of its 550,000 NVIDIA GPUs [2][3], suggesting the hardware-software gap produced significant operational waste at scale.

A second, independent failure involves NVIDIA's Blackwell Confidential Computing feature. SemiAnalysis reported that NVLink multicast — critical for high-performance multi-GPU communication — is unsupported in Blackwell's Confidential Computing mode [4]. Benchmarks running SGLang on Qwen3.5 397B inference show a 61% performance regression when the feature is enabled, leading SemiAnalysis to characterize the implementation as 'complete slop' [5]. Confidential Computing is marketed to enterprise and regulated-industry customers requiring hardware-level data isolation; the performance cost makes it practically unusable for large-model workloads. Notably, enterprise partners including NTT DATA and Fortanix have announced integrations built on NVIDIA Confidential Computing GPUs [6], suggesting the commercial ecosystem is moving forward regardless of the known performance gap.

The third and newest failure vector targets the compiler layer that underlies all GPU code. SemiAnalysis reported that a single engineer, using AI agents and a fuzzer vibe-coded without reading any code, found 40+ miscompiles in NVIDIA's ptxas GPU compiler in under three days [7]. The engineer noted spending more than $10,000 in AI agent compute in a single afternoon, characterizing the capability jump as a watershed moment for automated vulnerability discovery. Both NVIDIA's ptxas and AMD's AMDGPU backend contained bugs found at roughly the same rate — but AMD, with an open-source toolchain, had already patched five of the reported bugs by the time of disclosure [7]. NVIDIA's closed compiler stack leaves the remediation timeline opaque.

Underpinning all three crises is a broader structural argument from SemiAnalysis: the gap between theoretical GPU peak throughput and real-world production performance is too large for manual CUDA kernel optimization to bridge [8][9]. Community research and published benchmarks suggest auto-generated CUDA kernels are now outperforming hand-written expert kernels in several GPU workloads [10][11], pointing toward automated generation as the dominant path forward — though the correctness and production-reliability of that approach remains an open question.

Timeline

2026-05-27: SemiAnalysis tweets that GPUs are 'leaving performance on the table' and that auto-generated CUDA kernels outperform hand-written ones at scale. [8][9]
2026-05-28: SemiAnalysis reports an engineer found 40+ ptxas miscompiles in NVIDIA's GPU compiler in three days using AI-assisted fuzzing; AMD patched five equivalent bugs quickly via its open-source AMDGPU backend. [7]
2026-05-30: SemiAnalysis reports xAI dropped JAX on NVIDIA GPUs after MFU fell below 10%, switching to a custom C training framework built with Grok Build. [1]
2026-05-30: SemiAnalysis discloses NVLink multicast is unsupported in Blackwell Confidential Computing, causing a 61% performance regression on SGLang Qwen3.5 397B inference. [4][5]
2026-05-30: Community discussion surfaces reports that xAI was at one point using only ~11% of its 550,000 NVIDIA GPUs. [2][3]

Perspectives

SemiAnalysis

Sharply and consistently critical of NVIDIA's software ecosystem across training (JAX MFU), enterprise features (Confidential Computing), and the compiler layer (ptxas miscompiles); frames each failure as systemic rather than incidental.

Evolution: Escalating: moved from general GPU performance commentary to specific named-customer failures, feature exposés, and now compiler-level correctness bugs — all within a single week.

[1][4][8][9][7]

xAI

Abandoned JAX on NVIDIA GPUs in favor of a bespoke C-based training framework, treating JAX as unworkable for production-scale training.

Evolution: Consistent; xAI's departure is a revealed preference rather than a public statement — the strongest possible vote of no confidence in NVIDIA's training software stack.

[1][2][3]

NVIDIA (implied)

No direct public response to any of the three failure disclosures; the JAX team reportedly faced pressure from xAI, and the closed ptxas compiler leaves remediation timelines opaque.

Evolution: Consistently absent from the discourse; silence is notable given the severity and breadth of the claims.

[4][1][7]

AMD

Positioned as a faster and more transparent responder to compiler vulnerabilities due to its open-source AMDGPU backend, having patched five bugs found in the same fuzzing campaign that also hit NVIDIA.

Evolution: Elevated from peripheral presence to direct comparator: previously publishing generic inference performance content, now a named example of better open-source remediation speed.

[7][12]

ML community / open-source researchers

Actively exploring auto-generated kernel approaches as an alternative to hand-tuned CUDA, with benchmarks suggesting AI-generated kernels can outperform PyTorch baselines on multiple GPUs.

Evolution: Consistent with a broader trend; community work is converging with SemiAnalysis's framing of automated generation as the inevitable successor to manual tuning.

[10][11][13]

Enterprise ecosystem (Fortanix, NTT DATA)

Continuing to build on NVIDIA Confidential Computing despite the known 61% performance regression, suggesting commercial adoption may proceed independently of the performance gap.

Evolution: First appearance; represents a counterpoint to SemiAnalysis's framing that the Confidential Computing regression makes the feature commercially unusable.

[6]

Tensions

SemiAnalysis claims xAI's JAX MFU fell below 10% — implying NVIDIA's JAX ecosystem is fundamentally broken at production scale — while NVIDIA has not publicly responded or disputed the figure. [1]
AMD patched five compiler bugs found in the same AI-assisted fuzzing campaign within days via its open-source toolchain, while NVIDIA's closed ptxas compiler leaves remediation pace and scope publicly unknown. [7]
NVIDIA markets Confidential Computing as enterprise-grade, but benchmark data shows a 61% regression that renders it impractical for large-model inference — while enterprise partners like Fortanix continue building on it regardless. [4][5][6]
Auto-generated CUDA kernels are claimed to outperform expert hand-written ones, directly challenging the assumption that manual tuning represents the GPU performance ceiling. [8][9][10][11]
xAI's reported ~11% GPU utilization contradicts NVIDIA's hardware-capability marketing and suggests software ecosystem failures have produced significant operational waste at the world's largest GPU cluster. [2][1][3]

Sources

[1] BREAKING NEWS: JAX NVIDIA GPU & XLA: GPU's biggest customer just announced that they have dropped JAX GPUs and would… — SemiAnalysis Twitter (2026-05-30)
[2] xAI Is Reportedly Using Just 11% of Its 550k Nvidia GPUs — reactive:nvidia-gpu-ecosystem-gaps
[3] xAI's 11% GPU Utilization Raises Efficiency Concerns - LinkedIn — reactive:nvidia-gpu-ecosystem-gaps
[4] TRUTH SOCIAL: NVLink multicast is not supported on Blackwell "Confidential Computing" leading to 61% performance regress… — SemiAnalysis Twitter (2026-05-30)
[5] NVLink multicast is not supported on Blackwell "Confidential ... — reactive:nvidia-gpu-ecosystem-gaps
[6] NTT DATA & Fortanix partner for AI-era security — reactive:nvidia-gpu-ecosystem-gaps
[7] Finding Miscompiles for Fun, Not Profit — SemiAnalysis Twitter (2026-05-28)
[8] GPUs are leaving performance on the table. — SemiAnalysis Twitter (2026-05-27)
[9] GPUs are leaving performance on the table. — SemiAnalysis Twitter (2026-05-27)
[10] AI-generated CUDA kernels outperform PyTorch in several GPU ... — reactive:nvidia-gpu-ecosystem-gaps
[11] Generating Fast GPU Kernels without Programming in CUDA/Triton — reactive:nvidia-gpu-ecosystem-gaps
[12] The Many Aspects of Inference Performance - AMD — reactive:nvidia-gpu-ecosystem-gaps
[13] From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels — reactive:nvidia-gpu-ecosystem-gaps