DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-06-09

SemiAnalysis and the InferenceX team publish 43 days of tracking data on DeepSeek V4 inference performance across NVIDIA GB300 NVL72 and B200, AMD MI355X, and Huawei Ascend hardware, documenting a 100x AMD ROCm improvement by Day 26.

Appears in

Ultra-Low Latency LLM Inference: Benchmarks and Emerging Enterprise Pricing Tier

Extraction

Topics: deepseek, inference-performance, hardware-benchmarks, open-source-llm, ai-infrastructure

Claims

AMD ROCm achieved a 100x performance improvement for DeepSeek V4 inference by Day 26 under the technical leadership of HaiShaw.
NVIDIA's in-house TensorRT-LLM did not work well for DeepSeek V4, and SemiAnalysis reports having to fix its open-source mHC kernel launch code.
DeepSeek V4 was co-designed in part for Huawei Ascend inference, and Huawei's documentation showed inference support on Day 0.
China currently dominates the open model landscape, with Kimi K2.6 outperforming NVIDIA's Nemotron 3 Ultra on coding benchmarks.
InferenceX tracks per-SKU performance from Day 0 across multiple frameworks to capture real, deployable chip performance over time rather than one-off snapshot benchmarks.

Key quotes

Nvidia's in house TensorRT-LLM did not work well for DeepSeek v4, and we at SemiAnalysis had to fix their open source mHC kernel launch code.

China currently dominates the open model landscape, with Kimi K2.6 still beating Jensen's Nemotron Committee Coalition's Nemotron 3 Ultra on coding.

One of the north star goals of InferenceX is to highlight the iterative improvements to performance over time, instead of just snapshots of performance.