Parallel draft tree, tree-causal verification

SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-06-30

JetSpec, a new speculative decoding system using causal parallel tree drafting, achieves up to 9.64x end-to-end speedup on MATH-500 and approximately 1,000 tokens per second on a single B200 GPU while remaining lossless.

Extraction

Topics: speculative-decoding, llm-inference-optimization, hardware-acceleration, inference-engines

Claims

JetSpec achieves up to 9.64x end-to-end speedup on the MATH-500 benchmark.
JetSpec achieves 4.58x speedup on open-ended chat tasks.
JetSpec co-optimizes drafting cost and drafting quality using causal parallel tree drafting.
With CUDA graph and kernel optimizations, JetSpec reaches approximately 1,000 tokens per second on a single B200 GPU.
JetSpec is designed for deeper integration with inference engines such as vLLM and SGLang.
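
The post does not describe JetSpec's internals, so the following is only a generic sketch of the two ideas named in the title, not JetSpec's implementation. It assumes a draft tree stored as a parent-index array (a hypothetical layout), builds a tree-causal attention mask in which each draft node sees only itself and its ancestors, and then runs a greedy verification walk that accepts a draft token only when it matches the target model's own choice. The names `tree_causal_mask`, `verify_greedy`, and `target_next` are illustrative. `target_next` stands in for the target model; a real system would score all tree nodes in a single forward pass using the mask rather than calling the model once per step.

```python
from typing import Callable, List, Optional, Sequence


def tree_causal_mask(parents: Sequence[int]) -> List[List[bool]]:
    """Return mask[i][j] == True when draft node i may attend to node j.

    Node j is visible to node i only if j is i itself or an ancestor of i.
    parents[i] == -1 means node i hangs directly off the committed prefix.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking every ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


def verify_greedy(
    parents: Sequence[int],
    tokens: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Accept the longest tree path that agrees with the target's greedy output.

    target_next(path) is an assumed stand-in returning the target model's
    greedy next token after the already accepted tokens in `path`.
    The result always ends with one token chosen by the target itself.
    """
    accepted: List[int] = []
    current = -1  # -1 denotes the committed prefix (tree root)
    while True:
        want = target_next(accepted)
        match: Optional[int] = next(
            (i for i, p in enumerate(parents)
             if p == current and tokens[i] == want),
            None,
        )
        accepted.append(want)
        if match is None:
            # No drafted child matches: keep the target's token and stop.
            return accepted
        current = match


# Toy tree: nodes 0 and 1 follow the prefix; 2 and 3 follow 0; 4 follows 1.
parents = [-1, -1, 0, 0, 1]
tokens = [5, 7, 9, 3, 2]
target_sequence = [5, 3, 8]  # what the toy target would emit greedily

mask = tree_causal_mask(parents)
out = verify_greedy(parents, tokens, lambda path: target_sequence[len(path)])
print(out)
```

In this toy run the drafted path 5, 3 is accepted and the target supplies the final token, so three tokens are committed in one verification round. Because every committed token equals the target's own greedy choice, the output matches plain greedy decoding, which is the sense in which this style of verification is lossless.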

Key quotes

JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while keeping lossless.

With CUDA graph and kernel optimizations, JetSpec further translates to around 1000 TPS on a single B200.