How NVIDIA’s Inference Software Stack Powers the Lowest Token Cost
NVIDIA Blog · Amr Elmeleegy · 2026-06-30
NVIDIA argues that its full-stack inference software for the Blackwell platform, which combines disaggregated serving, large expert parallelism, NVFP4 precision, and multi-token prediction, has already reduced token costs by up to 5x on DeepSeek V4 within one month, and that these optimizations compound when run together, yielding up to 20x token throughput.
Extraction
Topics: llm-inference, token-cost-optimization, nvidia-blackwell, agentic-ai, open-source-ai
Claims
- NVIDIA's inference software stack reduced token costs by up to 5x on DeepSeek V4 on the Blackwell platform within approximately one month of release.
- Combining disaggregated serving, large expert parallelism over NVLink, NVFP4 precision, and multi-token prediction yields up to 20x token throughput on Blackwell.
- Agentic AI workloads are structurally different from traditional web workloads because a single request can spawn hundreds of subagents, thousands of tasks, and multiple LLMs across distributed infrastructure.
- DFlash speculative decode delivers up to 15x more throughput on existing hardware.
- The CUDA-native open source ecosystem creates a compounding flywheel where framework optimizations immediately benefit NVIDIA GPU deployments from day zero.
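The link between throughput and token cost in the claims above can be made concrete with a little arithmetic. The following is a minimal sketch, not NVIDIA's methodology: it assumes a fixed GPU-hour price, and it assumes that each optimization multiplies throughput independently. The per-optimization factors, the baseline throughput, and the GPU-hour price are all hypothetical placeholders, chosen only so that their product matches the "up to 20x" headline figure; the source does not give a per-optimization breakdown.

```python
from math import prod


def cost_per_million_tokens(gpu_hour_cost: float, tokens_per_second: float) -> float:
    """Dollar cost of generating one million tokens at a steady throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hour_cost / tokens_per_hour * 1_000_000


def compound_speedup(factors: dict[str, float]) -> float:
    """Combined speedup, assuming each factor multiplies throughput independently."""
    return prod(factors.values())


# Hypothetical placeholder values (NOT from the source article).
GPU_HOUR_COST = 3.60   # assumed dollars per GPU-hour
BASELINE_TPS = 1000.0  # assumed baseline tokens per second per GPU
FACTORS = {
    "disaggregated_serving": 2.0,
    "large_expert_parallelism": 2.5,
    "nvfp4_precision": 2.0,
    "multi_token_prediction": 2.0,
}

speedup = compound_speedup(FACTORS)
baseline_cost = cost_per_million_tokens(GPU_HOUR_COST, BASELINE_TPS)
optimized_cost = cost_per_million_tokens(GPU_HOUR_COST, BASELINE_TPS * speedup)

if __name__ == "__main__":
    print(f"combined speedup: {speedup:.1f}x")
    print(f"baseline cost per 1M tokens:  ${baseline_cost:.4f}")
    print(f"optimized cost per 1M tokens: ${optimized_cost:.4f}")
```

Under these assumptions, cost per token falls by the same factor that throughput rises, because hardware cost per hour is held constant. In practice the factors are unlikely to be fully independent, so a measured combined gain can differ from the simple product.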
Key quotes
On the NVIDIA Blackwell platform, the software stack has already reduced token costs by up to 5x on the DeepSeek V4 model in just one month.
When these layers work as one system, individual optimizations compound.
That's the open source flywheel: more developers optimize CUDA-native inference paths, more production deployments feed back into the ecosystem and each software improvement increases delivered token output while lowering cost per token over time.