Fantastic, @deepseek_ai just published their new inference optimization method.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-06-27
DeepSeek publishes DSpark, a semi-parallel speculative decoding system that delivers 60–85% faster per-user token generation for DeepSeek-V4 by using a Markov head and confidence scheduler to selectively verify only the most promising drafted tokens.
Extraction
Topics: speculative-decoding, inference-optimization, llm-efficiency, deepseek
Claims
- DSpark achieves 60–85% faster per-user token generation for DeepSeek-V4 at matched throughput.
- Standard speculative decoding wastes GPU capacity by verifying long draft blocks where later tokens are increasingly likely to be wrong.
- DSpark's Markov head corrects for the accuracy decay of fully parallel drafting by conditioning each token guess on the previously sampled token.
- A confidence scheduler dynamically determines how many drafted tokens to verify per request, balancing acceptance probability against current GPU load.
- Fully parallel draft models degrade on later tokens because each position guesses independently, without knowledge of earlier sampled tokens in the block.
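
The scheduling idea in the claims above can be illustrated with a minimal sketch. It is not DeepSeek's implementation: the function names, the per-token confidence values, and the single `cost_per_token` number standing in for "current GPU load" are all assumptions made for illustration. The sketch relies on one standard fact: if a draft is accepted only up to its first rejected token, the expected number of accepted tokens when verifying k positions is the sum of the survival probabilities of prefixes 1 through k.

```python
from itertools import accumulate
from operator import mul


def prefix_survival(confidences):
    """Probability that each prefix of the draft block is accepted in full.

    confidences[i] is a hypothetical estimate of the chance that drafted
    token i is accepted, given that all earlier drafted tokens were accepted.
    """
    return list(accumulate(confidences, mul))


def choose_verify_length(confidences, cost_per_token):
    """Pick how many drafted tokens to send for verification.

    Verifying k tokens yields sum(survival[:k]) accepted tokens in
    expectation and costs k * cost_per_token (an assumed stand-in for
    GPU load). Returns the k that maximizes gain minus cost; 0 means
    verification is not worth it for this request.
    """
    best_k, best_value, value = 0, 0.0, 0.0
    for k, p in enumerate(prefix_survival(confidences), start=1):
        value += p - cost_per_token
        if value > best_value:
            best_k, best_value = k, value
    return best_k


draft_confidences = [0.9, 0.8, 0.5, 0.3]  # illustrative numbers only
light_load = choose_verify_length(draft_confidences, cost_per_token=0.2)
heavy_load = choose_verify_length(draft_confidences, cost_per_token=0.5)
```

With these illustrative numbers the prefix survival probabilities are roughly 0.9, 0.72, 0.36, and 0.108, so a lightly loaded system verifies three tokens while a heavily loaded one verifies only two. Raising the assumed cost shortens the verified prefix, which is the trade-off between acceptance probability and GPU load that the claim describes.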
Key quotes
The biggest idea in DSpark is that faster inference is not just about drafting more tokens, but about deciding which drafted tokens are worth checking.
DSpark's breakthrough is to make this process selective: it drafts a block, scores how likely each prefix is to survive, then verifies only the part that is likely to pay off.
Fully parallel drafters guess every position too independently, which can create bad token combinations later in the block.
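
A toy sketch of the problem in the last quote, and of how conditioning on the previously sampled token can help. The post does not describe the architecture of DSpark's Markov head, so everything here is an assumption for illustration: the function names, the additive `transition` bonus, the greedy selection, and the scores themselves.

```python
def draft_independent(position_scores):
    """Pick the top-scoring token at each position, ignoring its neighbors."""
    return [max(scores, key=scores.get) for scores in position_scores]


def draft_markov(position_scores, transition):
    """Pick tokens left to right, adding a bonus that depends on the
    previously chosen token (a hypothetical stand-in for a Markov head)."""
    tokens = []
    prev = None
    for scores in position_scores:
        bonus = transition.get(prev, {})
        adjusted = {tok: s + bonus.get(tok, 0.0) for tok, s in scores.items()}
        prev = max(adjusted, key=adjusted.get)
        tokens.append(prev)
    return tokens


# Illustrative scores: each position slightly prefers a token that does
# not pair well with the other position's favorite.
position_scores = [
    {"New": 1.0, "San": 0.9},
    {"Francisco": 1.0, "York": 0.9},
]
transition = {
    "New": {"York": 1.0},
    "San": {"Francisco": 1.0},
}

independent = draft_independent(position_scores)
conditioned = draft_markov(position_scores, transition)
```

Here the independent drafter produces "New Francisco", a combination neither position would choose if it knew the other's pick, while the conditioned drafter produces "New York". The second pass is sequential only in a cheap lookup over the previous token, which is one way to read "semi-parallel" in the summary above.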