PDOOM ALERT 🚨 : ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops:
SemiAnalysis Twitter · SemiAnalysis (@SemiAnalysis_) · 2026-05-26
SemiAnalysis breaks down end-to-end LLM inference latency, finding that prefill (~48%) and decode (~52%) each account for roughly half. Prefill itself splits into prefill extend (cache write, which ingests new context and writes fresh KV tokens) and cache read (which reuses the existing KV cache from prior turns).
Extraction
Topics: llm-inference, kv-cache, prefill-decode-latency, inference-optimization
Claims
- Approximately 48% of end-to-end LLM inference latency comes from prefill and approximately 52% from decode.
- Prefill splits into two distinct operations: prefill extend (writing new KV tokens) and cache read (loading previously cached KV tokens).
- Understanding this breakdown is essential for targeting optimization efforts in LLM serving systems.
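To make the optimization-targeting point concrete, here is a minimal sketch of an Amdahl-style estimate built on the ~48% / ~52% split quoted above. The function name `overall_speedup` and the 2x speedup factors are hypothetical, chosen only for illustration. The source gives no numeric split between prefill extend and cache read, so the sketch treats prefill as a single phase.

```python
def overall_speedup(shares, speedups):
    """Estimate end-to-end speedup when individual phases get faster.

    shares:   phase name -> fraction of end-to-end latency (from the source).
    speedups: phase name -> hypothetical speedup factor for that phase;
              phases not listed are assumed unchanged (factor 1.0).
    """
    total = sum(shares.values())
    # Each phase's new latency is its old share divided by its speedup factor.
    remaining = sum(
        share / speedups.get(phase, 1.0) for phase, share in shares.items()
    )
    return total / remaining


# Latency shares as stated in the source (approximate).
shares = {"prefill": 0.48, "decode": 0.52}

# Hypothetical scenarios: make one phase 2x faster and leave the other alone.
print(f"2x faster prefill: {overall_speedup(shares, {'prefill': 2.0}):.2f}x overall")
print(f"2x faster decode:  {overall_speedup(shares, {'decode': 2.0}):.2f}x overall")
```

Under these assumptions, doubling the speed of either phase alone yields only about a 1.3x end-to-end gain (roughly 1.32x for prefill, 1.35x for decode), because the two phases are nearly equal in size and neither one dominates.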
Key quotes
~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops: Prefill extend (cache write) — ingests new context/files, writes fresh KV tokens; Cache read — reuses existing KV cache from prior turns