The throughput math has gotten the most pushback in our reader notes, so it's worth being precise. On the same B300 runni…

SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-06-27

SemiAnalysis details how software optimizations and Nvidia's GB300 NVL72 hardware together deliver up to 32x throughput improvement over H100 configurations running DeepSeek R1, arguing this structural compression makes AI model-lab margin expansion durable rather than temporary.

Appears in

AI Agents: 24x Token Growth Projections, Enterprise Cost Pressure, and the Agentic Business Thesis

Extraction

Topics: gpu-performance, inference-optimization, nvidia, ai-infrastructure, deepseek

Claims

DeepSeek R1 on a B300 in baseline FP8 achieves approximately 1,000 tokens/sec/GPU.
Adding wideEP (wide expert parallelism) plus prefill/decode disaggregation raises throughput to roughly 8,000 tokens/sec/GPU.
Layering MTP (multi-token prediction) on top reaches approximately 14,000 tokens/sec/GPU, a 14x gain from software alone.
The most optimized GB300 NVL72 achieves about 17x the best H100 configuration in FP8 and 32x in FP4.
Model-lab gross margin expansion is structural, not a temporary pricing anomaly.
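
The software-only arithmetic in the claims above can be checked with a short sketch. The throughput values are the approximate figures quoted in the claims; the names STAGES and stage_gains are hypothetical and used only for illustration.

```python
# Approximate throughput figures (tokens/sec/GPU) as quoted in the claims
# for DeepSeek R1 on a B300. These are rounded values, not measurements
# made here.
STAGES = [
    ("baseline FP8", 1_000),
    ("+ wideEP + disaggregation", 8_000),
    ("+ MTP", 14_000),
]


def stage_gains(stages):
    """Return (name, throughput, step_gain, cumulative_gain) for each stage.

    step_gain is relative to the previous stage; cumulative_gain is
    relative to the first stage (the baseline).
    """
    baseline = stages[0][1]
    prev = baseline
    rows = []
    for name, tput in stages:
        rows.append((name, tput, tput / prev, tput / baseline))
        prev = tput
    return rows


if __name__ == "__main__":
    for name, tput, step, total in stage_gains(STAGES):
        print(f"{name:28s} {tput:>7,d} tok/s/GPU  step {step:.2f}x  total {total:.0f}x")
```

On these figures, wideEP plus disaggregation accounts for roughly 8x and MTP for a further 1.75x or so, which multiplies out to the quoted 14x. The 17x and 32x comparisons against H100 are left out because the claims do not give the H100 baseline throughput.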

Key quotes

On the same B300 running DeepSeek R1, baseline FP8 sits near 1,000 tokens/sec/GPU, adding wideEP plus disagg gets you to roughly 8,000, and layering MTP on top pushes it to about 14,000, a 14x gain from software alone.

Once you accept that compression is real, model-lab gross margin expansion stops looking like a temporary pricing oddity and starts looking structural.