Parallel draft tree, tree-causal verification
SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-06-30
JetSpec, a new speculative decoding system using causal parallel tree drafting, achieves up to 9.64x end-to-end speedup on MATH-500 and approximately 1,000 tokens per second on a single B200 GPU while remaining lossless.
Extraction
Topics: speculative-decoding, llm-inference-optimization, hardware-acceleration, inference-engines
Claims
- JetSpec achieves up to 9.64x end-to-end speedup on the MATH-500 benchmark.
- JetSpec achieves 4.58x speedup on open-ended chat tasks.
- JetSpec co-optimizes drafting cost and drafting quality using causal parallel tree drafting.
- With CUDA graph and kernel optimizations, JetSpec reaches approximately 1,000 tokens per second on a single B200 GPU.
- JetSpec is designed for deeper integration with inference engines such as vLLM and SGLang.
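The post does not describe how JetSpec builds or verifies its draft tree. As a rough illustration of what "tree-causal" usually means in tree-based speculative decoding, the sketch below builds an attention mask in which each drafted node may attend only to itself and its ancestors, so several candidate branches can be scored in one forward pass. The function name `tree_causal_mask`, the `parents` encoding, and the five-node example tree are assumptions made for illustration, not JetSpec's actual interface.

```python
def tree_causal_mask(parents):
    """Build a tree-causal attention mask for a draft tree.

    parents[i] is the index of node i's parent, or -1 for a root node.
    mask[i][j] is True when node i may attend to node j, i.e. when j is
    node i itself or one of its ancestors.
    (Hypothetical helper; not taken from JetSpec.)
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk from the node up to its root
            mask[i][j] = True
            j = parents[j]
    return mask


# Hypothetical draft tree with two branches sharing a root:
#   0 -> 1 -> 3
#   0 -> 2 -> 4
parents = [-1, 0, 0, 1, 2]
mask = tree_causal_mask(parents)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Nodes 3 and 4 sit on different branches, so neither attends to the other. That isolation is what lets one batched pass evaluate every branch as if it were a separate sequence.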
Key quotes
JetSpec reaches up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat while keeping lossless.
With CUDA graph and kernel optimizations, JetSpec further translates to around 1000 TPS on a single B200.
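The post does not spell out the acceptance rule behind "lossless". One common reading, sketched below for greedy decoding only, is that a drafted token is accepted only when it equals the token the target model would have produced itself, so the output matches plain target-only decoding. Everything here (`accept_longest_path`, the `tokens`/`parents` layout, the `target_argmax` and `root_prediction` inputs, and the example values) is a hypothetical illustration. Sampling-based decoding needs a rejection-sampling rule instead, which this sketch does not cover.

```python
def accept_longest_path(tokens, parents, target_argmax, root_prediction):
    """Greedy verification of a draft tree (illustrative sketch only).

    tokens[i]        : drafted token at node i
    parents[i]       : parent index of node i, or -1 if it follows the prompt
    target_argmax[i] : target model's greedy next token after node i's prefix
    root_prediction  : target model's greedy next token after the prompt
    Returns the accepted drafted tokens followed by one token from the target.
    """
    children = {}
    for i, p in enumerate(parents):
        children.setdefault(p, []).append(i)

    accepted = []
    node, expected = -1, root_prediction
    while True:
        # Follow the child whose drafted token matches the target's choice.
        match = next(
            (c for c in children.get(node, []) if tokens[c] == expected), None
        )
        if match is None:
            break
        accepted.append(tokens[match])
        node, expected = match, target_argmax[match]

    # The target's own next token is always emitted, so every step makes progress.
    accepted.append(expected)
    return accepted


# Hypothetical example: two candidate first tokens, two continuations of "the".
tokens = ["the", "a", "cat", "dog"]
parents = [-1, -1, 0, 0]
target_argmax = ["dog", "x", "sat", "ran"]
result = accept_longest_path(tokens, parents, target_argmax, "the")
print(result)  # ['the', 'dog', 'ran']
```

Here one verification step yields three tokens where ordinary decoding yields one. End-to-end speedup in this style of system generally depends on how long the accepted path is relative to the cost of drafting and verifying the tree.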