NVIDIA GPU Ecosystem Performance Gaps: JAX Dropout, Blackwell Regression, and Kernel Utilization

Version 1

2026-05-30 18:12 UTC · 15 items

What

Three interconnected GPU performance crises are converging around NVIDIA's ecosystem. xAI, reportedly NVIDIA's largest GPU customer, has abandoned JAX on NVIDIA GPUs after its training stack achieved model FLOPs utilization (MFU) below 10%, opting instead for a custom C-based framework built with Grok Build [1]. Separately, NVIDIA's Blackwell Confidential Computing mode lacks NVLink multicast support, causing a 61% performance regression on large-model inference workloads [3][4]. Underpinning both, SemiAnalysis argues that closing the gap between theoretical GPU peak throughput and real-world production performance is nearly impossible through manual CUDA kernel tuning, and that auto-generated kernels are now outperforming hand-written ones [5][6].

Why it matters

These are not isolated bugs — they represent systemic gaps between NVIDIA's marketed GPU capabilities and what leading AI labs can actually extract at scale. xAI's defection from JAX signals that even a well-resourced, NVIDIA-dependent lab can find the software stack unworkable enough to rebuild from scratch. The Blackwell Confidential Computing regression affects workloads where hardware-level isolation is required, potentially blocking enterprise GPU adoption in regulated industries.

Open questions

Will xAI's switch to a custom C training framework deliver significantly higher MFU, and will other large-scale GPU operators follow with similar bespoke approaches? [1][2]
Is the NVLink multicast gap in Blackwell Confidential Computing a fundamental architectural constraint or a driver/firmware issue NVIDIA can patch? [3]
Can auto-generated CUDA kernels reliably close the performance gap at production scale, or do they introduce correctness and maintainability risks that manual tuning avoids? [5][7][8]
What is the full extent of NVIDIA's Confidential Computing performance gaps across Hopper and Blackwell, and which enterprise workloads are affected? [3]

Narrative

NVIDIA's GPU ecosystem is facing a credibility challenge on several fronts at once. The most dramatic disclosure is xAI's reported decision to abandon JAX on NVIDIA GPUs entirely. According to SemiAnalysis, xAI (widely regarded as NVIDIA's single largest GPU customer) found its JAX-based training stack achieving model FLOPs utilization (MFU) below 10%, far below the 30–50% MFU that competitive AI labs typically target [1]. Rather than continuing to work with NVIDIA's JAX engineering team, xAI moved to build a custom C-based training framework using its own Grok Build tooling. A separate Hacker News thread surfaced reports that xAI was at one point utilizing only about 11% of its 550,000 NVIDIA GPUs [2], suggesting the hardware-software gap has had real operational consequences.
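For readers unfamiliar with the metric, MFU is the ratio of the FLOP/s a model actually performs to the aggregate peak FLOP/s of the hardware it runs on. The sketch below is a minimal illustration, not xAI's or SemiAnalysis's methodology: it assumes the common 6 × parameters rule of thumb for training FLOPs per token on a dense transformer, and every number in it (model size, token throughput, GPU count, per-GPU peak) is hypothetical.

```python
def training_flops_per_token(n_params: float) -> float:
    """Approximate training FLOPs per token for a dense transformer.

    Uses the common 6 * N rule of thumb (forward plus backward pass).
    It ignores attention FLOPs and is an assumption of this sketch.
    """
    return 6.0 * n_params


def mfu(n_params: float, tokens_per_second: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    if n_gpus <= 0 or peak_flops_per_gpu <= 0:
        raise ValueError("n_gpus and peak_flops_per_gpu must be positive")
    achieved = training_flops_per_token(n_params) * tokens_per_second
    peak = n_gpus * peak_flops_per_gpu
    return achieved / peak


# Hypothetical workload, chosen only to illustrate the arithmetic:
# a 70B-parameter model, 200k tokens/s across 1,024 GPUs, and an
# assumed peak of 1e15 FLOP/s per GPU.
example_mfu = mfu(n_params=70e9, tokens_per_second=2.0e5,
                  n_gpus=1024, peak_flops_per_gpu=1.0e15)
print(f"MFU = {example_mfu:.1%}")
```

With these made-up inputs the result is about 8.2%, which is the kind of sub-10% figure the report describes. Reaching the 30–50% range on the same hypothetical hardware would require roughly four to six times the token throughput.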

A second, independent failure involves NVIDIA's Blackwell Confidential Computing feature. SemiAnalysis reported that NVLink multicast, which is critical for high-performance multi-GPU communication, is unsupported in Blackwell's Confidential Computing mode [3]. Benchmarks of SGLang serving Qwen3.5 397B show a 61% performance regression when Confidential Computing is enabled. SemiAnalysis characterized the implementation as 'complete slop,' noting that the gaps extend across both Hopper and Blackwell architectures. Confidential Computing is marketed to enterprise and regulated-industry customers who need hardware-level data isolation; the performance cost makes it practically unusable for large-model workloads [4].
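To put the 61% figure in capacity terms, the following sketch works through the arithmetic under one assumption: that the regression is a reduction in throughput relative to the same deployment with Confidential Computing disabled. The sources do not spell out the metric, and the baseline throughput used here is hypothetical.

```python
def remaining_fraction(regression: float) -> float:
    """Fraction of baseline throughput left after a fractional regression."""
    if not 0.0 <= regression < 1.0:
        raise ValueError("regression must be in [0, 1)")
    return 1.0 - regression


def slowdown_factor(regression: float) -> float:
    """How many times more hardware (or time) is needed to match the baseline."""
    return 1.0 / remaining_fraction(regression)


REPORTED_REGRESSION = 0.61       # figure reported by SemiAnalysis [3]
baseline_tokens_per_s = 1000.0   # hypothetical baseline, for illustration only

with_cc = baseline_tokens_per_s * remaining_fraction(REPORTED_REGRESSION)
factor = slowdown_factor(REPORTED_REGRESSION)
print(f"throughput with Confidential Computing: {with_cc:.0f} tokens/s")
print(f"capacity needed to match baseline: {factor:.2f}x")
```

Under that reading, a deployment keeps 39% of its baseline throughput and would need roughly 2.56 times the GPU capacity to serve the same load.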

Beyond these specific failures, SemiAnalysis is also surfacing a broader structural issue: the gap between theoretical GPU peak throughput and real-world production performance is growing too large for manual CUDA kernel optimization to bridge [5][6]. Community discussion and published research suggest that auto-generated CUDA kernels — written without directly programming in CUDA or Triton — are now outperforming hand-written ones in several GPU benchmarks [7][8]. HuggingFace's guide to building production-ready CUDA kernels [9] reflects the same momentum toward automated kernel generation as the dominant path forward.
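One standard way to reason about the gap between peak and achieved throughput is the roofline model, in which a kernel's attainable FLOP/s is capped by either the compute peak or memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below is an illustration of that model only. It is not drawn from the cited sources, and the hardware numbers are hypothetical.

```python
def attainable_flops(peak_flops: float, mem_bandwidth: float,
                     arithmetic_intensity: float) -> float:
    """Roofline bound: min(compute peak, bandwidth * FLOPs-per-byte)."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Arithmetic intensity above which a kernel becomes compute-bound."""
    return peak_flops / mem_bandwidth


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK = 1.0e15
BANDWIDTH = 3.0e12

for intensity in (50.0, 1000.0):  # FLOPs per byte; illustrative kernels
    bound = attainable_flops(PEAK, BANDWIDTH, intensity)
    print(f"intensity {intensity:>6.0f} FLOP/B -> "
          f"at most {bound / PEAK:.0%} of peak")
```

On this hypothetical device, a kernel at 50 FLOPs per byte cannot exceed 15% of peak no matter how well it is tuned, which is why fusing operations to raise arithmetic intensity is a common target for both hand-written and generated kernels.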

Taken together, these developments sketch a picture where NVIDIA's hardware remains commercially dominant but its software ecosystem — JAX integration, Confidential Computing, and kernel tooling — is struggling to deliver the utilization rates that its largest customers require. AMD is positioning itself adjacent to this conversation with published inference performance content [10], though it has not directly addressed NVIDIA's specific failures.

Timeline

2026-05-27: SemiAnalysis tweets that GPUs are 'leaving performance on the table' and that auto-generated CUDA kernels outperform hand-written ones at scale. [5][6]
2026-05-30: SemiAnalysis reports xAI dropped JAX on NVIDIA GPUs after MFU fell below 10%, switching to a custom C training framework built with Grok Build. [1]
2026-05-30: SemiAnalysis discloses NVLink multicast is unsupported in Blackwell Confidential Computing, causing a 61% performance regression on SGLang Qwen3.5 397B inference. [3][4]
2026-05-30: Community discussion surfaces reports that xAI was at one point using only ~11% of its 550,000 NVIDIA GPUs. [2]

Perspectives

SemiAnalysis

Sharply critical of NVIDIA's software ecosystem: characterizes JAX GPU MFU as catastrophically low, Confidential Computing as 'complete slop,' and frames the performance gap as a systemic failure of manual kernel optimization that only automated approaches can solve.

Evolution: Consistent critical stance; escalated from general GPU performance commentary to specific named-customer failures and feature exposés.

[1][3][5][6]

xAI

Has abandoned JAX on NVIDIA GPUs in favor of a bespoke C-based training framework, implicitly casting JAX as unworkable for production-scale training.

Evolution: First appearance; xAI's departure is a revealed preference rather than a public statement — the strongest possible vote of no confidence.

[1][2]

NVIDIA (implied)

No direct response captured; the JAX team has reportedly been under pressure from xAI, and Confidential Computing's NVLink gap suggests architectural trade-offs that may require hardware-level fixes.

Evolution: Absent from the discourse; silence is notable given the severity of the claims.

[1][3]

ML community / open-source researchers

Actively exploring auto-generated kernel approaches as an alternative to hand-tuned CUDA, with guides and published results suggesting AI-generated kernels can beat PyTorch baselines in several GPU benchmarks.

Evolution: Consistent with a broader trend; community work is now converging with SemiAnalysis's framing of automated generation as the inevitable successor to manual tuning.

[7][8][9]

AMD

Publishing technical content on inference performance, positioning itself as a performance-aware alternative without directly addressing NVIDIA's specific failures.

Evolution: First appearance; neutral-to-opportunistic positioning adjacent to the NVIDIA controversy.

[10]

Tensions

SemiAnalysis claims xAI's JAX MFU fell below 10% — implying NVIDIA's JAX ecosystem is fundamentally broken at production scale — while NVIDIA has not publicly responded or disputed the figure. [1]
Auto-generated CUDA kernels are claimed to outperform expert hand-written ones, directly challenging the assumption that manual tuning represents the performance ceiling for GPU workloads. [5][6][7][8]
NVIDIA markets Confidential Computing as an enterprise-grade feature, but benchmark data shows a 61% regression that renders it impractical for large-model inference. [3][4]
xAI's reported ~11% GPU utilization contradicts NVIDIA's hardware-capability marketing and suggests that software ecosystem failures have compounded to produce significant operational waste. [1][2]

Sources

[1] BREAKING NEWS: JAX NVIDIA GPU & XLA: GPU's biggest customer just announced that they have dropped JAX GPUs and would… — SemiAnalysis Twitter (2026-05-30)
[2] xAI Is Reportedly Using Just 11% of Its 550k Nvidia GPUs — reactive:nvidia-gpu-ecosystem-gaps
[3] TRUTH SOCIAL: NVLink multicast is not supported on Blackwell "Confidential Computing" leading to 61% performance regress… — SemiAnalysis Twitter (2026-05-30)
[4] NVLink multicast is not supported on Blackwell "Confidential ... — reactive:nvidia-gpu-ecosystem-gaps
[5] GPUs are leaving performance on the table. — SemiAnalysis Twitter (2026-05-27)
[6] GPUs are leaving performance on the table. — SemiAnalysis Twitter (2026-05-27)
[7] AI-generated CUDA kernels outperform PyTorch in several GPU ... — reactive:nvidia-gpu-ecosystem-gaps
[8] Generating Fast GPU Kernels without Programming in CUDA/Triton — reactive:nvidia-gpu-ecosystem-gaps
[9] From Zero to GPU: A Guide to Building and Scaling Production-Ready CUDA Kernels — reactive:nvidia-gpu-ecosystem-gaps
[10] The Many Aspects of Inference Performance - AMD — reactive:nvidia-gpu-ecosystem-gaps