LLM Inference Efficiency Research Cluster · history
Version 3
2026-05-26 08:51 UTC · 32 items
What
LLM inference efficiency research has consolidated around three bottleneck threads—disaggregated prefill/decode serving, MoE expert pruning, and KV cache memory pressure—while the KV cache thread is itself splitting into two parallel sub-tracks: quantization (compressing retained KV tensors to lower precision [4][3]) and eviction (selectively dropping entire tokens from the cache [5][6][7]). A nascent 'LLM tokenomics' framing now argues that token value is a function of embedded intelligence and delivery speed [12], beginning to connect infrastructure-layer efficiency research to user-perceived value.
Why it matters
Quantization and eviction attack the same KV cache bottleneck from different angles with different accuracy tradeoff profiles; if they compose well, their combination could unlock context lengths and serving densities beyond what either achieves alone. The tokenomics framing, however nascent, creates a vocabulary for translating infrastructure tradeoffs into economic terms—potentially grounding adaptive serving strategies like AdaServe in user-value rather than pure throughput metrics.
Open questions
Do KV cache eviction and quantization strategies compose—enabling stacked memory savings—or do they compete for the same marginal memory budget, forcing operators to choose one approach? [5][6][4][3]
How much accuracy degradation do eviction approaches like LookaheadKV [5] and KeyDiff [6] introduce on reasoning-intensive tasks at long contexts, where dropped tokens may carry disproportionate semantic weight?
Can the 'tokenomics' framing (token value as a function of intelligence and speed [12]) provide actionable per-request compute allocation for SLO-adaptive systems like AdaServe [13], or is it too coarse a framework to drive concrete serving decisions?
Will 50% MoE expert pruning generalize beyond Qwen3 and GLM to other deployed frontier architectures including DeepSeek-v3? [2]
Narrative
LLM inference efficiency research is structured around three converging bottleneck threads: the compute asymmetry between prefill (processing input prompts, compute-bound) and decode (generating tokens, memory-bandwidth-bound) [1]; Mixture-of-Experts routing waste, where researchers claim roughly half of expert compute is removable with near-zero accuracy loss [2]; and KV cache memory pressure, where key-value tensors produced by attention mechanisms become the dominant constraint as context windows stretch toward millions of tokens [3]. Each thread targets a different layer of the serving stack, so in principle their gains can be combined rather than traded off against one another.
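To make the memory-pressure thread concrete, the following back-of-envelope sketch estimates KV cache size as a function of context length. The model shape (32 layers, 8 KV heads, head dimension 128) is a hypothetical configuration chosen for illustration and is not taken from any cited source; the 4-bit figure ignores the scale metadata that real quantization formats store.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def to_gib(n_bytes):
    return n_bytes / 2**30


# Hypothetical model shape (an assumption, not from the cited sources).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for seq_len in (8_192, 131_072, 1_000_000):
    fp16 = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM, 2)
    four_bit = kv_cache_bytes(seq_len, N_LAYERS, N_KV_HEADS, HEAD_DIM, 0.5)
    print(f"{seq_len:>9} tokens: fp16 {to_gib(fp16):7.2f} GiB, "
          f"4-bit {to_gib(four_bit):7.2f} GiB")
```

Under these assumptions each token costs 128 KiB of cache at 16-bit precision, so a single million-token sequence needs roughly 122 GiB before any compression. Because every decode step has to read the cache, the same growth also explains why decode tends to be limited by memory bandwidth rather than compute.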
The KV cache thread is the most active frontier and is now splitting into two parallel sub-tracks with distinct tradeoff profiles. Quantization compresses cached key-value tensors into lower-precision formats, retaining all tokens but introducing precision error, as demonstrated by NVIDIA's NVFP4 KV cache work targeting production long-context serving [4] and KVQuant's research toward 10-million-token inference [3]. Eviction takes the opposite approach: selectively dropping entire tokens from the KV cache to reduce its size while maintaining full precision for retained entries. Samsung Research's LookaheadKV proposes glimpsing ahead in the generation process to predict which KV entries will be needed before evicting them [5]; KeyDiff (NeurIPS 2025) identifies redundant entries by measuring similarity between keys, on the premise that similar keys carry overlapping information [6]; and a concurrent arXiv paper frames eviction as a general mechanism for improving long-context performance across deployment settings [7]. These approaches could in principle be stacked (eviction reduces token count, quantization compresses the remainder), but their interaction under real workloads has not yet been benchmarked.
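The stacking idea can be sketched on a toy cache. This is a simplified stand-in, not the published algorithm of KeyDiff, LookaheadKV, KVQuant, or NVFP4: the eviction rule (drop the keys most similar to the mean key) and the plain symmetric integer quantizer are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def evict_by_key_similarity(keys, values, keep):
    """Keep the `keep` entries whose keys are least similar to the mean key."""
    dim = len(keys[0])
    mean_key = [sum(k[d] for k in keys) / len(keys) for d in range(dim)]
    scores = [cosine(k, mean_key) for k in keys]
    # Lowest similarity = most distinctive; retain those in original token order.
    kept = sorted(sorted(range(len(keys)), key=lambda i: scores[i])[:keep])
    return [keys[i] for i in kept], [values[i] for i in kept], kept


def quantize_int4(vec):
    """Symmetric per-vector quantization to integers in [-7, 7]."""
    peak = max(abs(x) for x in vec)
    scale = peak / 7 if peak else 1.0
    return [max(-7, min(7, round(x / scale))) for x in vec], scale


def dequantize(q, scale):
    return [v * scale for v in q]


# Toy cache: three near-duplicate keys and one distinctive key.
keys = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0]]
values = [[0.5, -0.25], [0.4, 0.3], [0.1, 0.2], [-0.7, 0.35]]

# Step 1: eviction reduces the token count.
kept_keys, kept_values, kept_idx = evict_by_key_similarity(keys, values, keep=2)
# Step 2: quantization compresses what remains.
compressed = [quantize_int4(v) for v in kept_keys + kept_values]

# Memory relative to the full 16-bit cache, ignoring stored scales.
fraction = (len(kept_idx) / len(keys)) * (4 / 16)
print("kept token indices:", kept_idx)
print("fraction of original memory:", fraction)
```

The two savings multiply in this toy setting, but the sketch says nothing about accuracy: whether quantization error on the retained entries interacts badly with the information lost to eviction is exactly the unbenchmarked question.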
On the compute side, prefill/decode disaggregation has moved from architectural proposal to production-facing benchmark. Spheron documents approximately 2x throughput gains from disaggregated GPU deployments [8], while community benchmarks of vLLM with NIXL (the NVIDIA Inference Xfer Library, used to move data such as KV cache between prefill and decode workers) provide practitioner-legible performance data [9]. The Alibaba/Nanjing University sparse attention work claims a 9.36× prefill speedup over FlashAttention-2 at million-token scale [10], and the MoE expert pruning literature argues that its savings apply to already-deployed frontier models with near-zero accuracy cost [2]. Speculative decoding, in which a smaller draft model proposes tokens that a larger model verifies in parallel, now has 2026 production benchmark data [11].
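The propose-and-verify loop behind speculative decoding can be illustrated with a toy greedy variant. The two "models" here are hypothetical deterministic functions standing in for real networks, and verification is written as a sequential loop for readability; a real system would score all proposed positions in one batched forward pass of the larger model. This is a minimal sketch under those assumptions, not the method benchmarked in [11].

```python
def greedy_decode(model, prompt, n_new):
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding; output matches target-only greedy decoding."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks the proposal (counted as one verification pass).
        target_passes += 1
        ctx = list(out)
        for tok in proposal:
            expected = target(ctx)
            if expected != tok:
                ctx.append(expected)  # first mismatch: use the target's token
                break
            ctx.append(tok)
        else:
            ctx.append(target(ctx))  # all accepted: target adds a bonus token
        out = ctx
    return out[: len(prompt) + n_new], target_passes


def toy_target(ctx):
    # Hypothetical "large" model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx):
    # Hypothetical "small" model: agrees with the target except after a 2.
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, passes = speculative_decode(toy_target, toy_draft, [0], n_new=8)
print(tokens, "verification passes:", passes)
```

In this toy run the draft is wrong once, yet eight tokens are produced in two verification passes instead of eight sequential target calls, and the output is identical to what the target alone would generate. Real-world gains depend on how often the draft agrees with the target and on the cost of the draft itself.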
A conceptual layer is beginning to emerge above individual optimizations. Rohan Paul's 'LLM tokenomics' framing argues that token value is determined by two factors: the intelligence embedded in the token and how quickly it arrives [12]. This connects infrastructure tradeoffs to user-perceived value in a way that individual efficiency metrics alone do not capture. AdaServe (CMU, EuroSys 2026) makes a related but more concrete argument: production serving fleets must handle requests with heterogeneous latency tolerance, and throughput-maximizing approaches systematically underserve latency-sensitive workloads [13]. Both framings point toward a next-phase challenge—not just finding individual efficiency gains, but assembling serving systems that allocate compute according to what tokens are actually worth to specific users.
Timeline
- 2024: KVQuant paper (NeurIPS 2024) establishes academic baseline for extreme-context KV cache quantization, targeting 10-million-token inference [3]
- Late 2025: KeyDiff (NeurIPS 2025) introduces key-similarity-based KV cache eviction for resource-constrained long-context inference [6]
- 2026-03-04: Speculative decoding production benchmark for 2026 published, providing empirical throughput data outside controlled evaluation settings [11]
- 2026-05-21: Rohan Paul introduces 'LLM tokenomics' framing: token value as a function of embedded intelligence and delivery speed [12]
- 2026-05-23: Discussion notes AI cluster bottleneck has shifted away from raw FLOPS toward memory bandwidth [20]
- 2026-05-24: MoE expert waste finding amplified: 50% of compute removable with near-zero accuracy loss on Qwen3 and GLM architectures [2]
- 2026-05-24: Alibaba/Nanjing University sparse attention paper publicized: 9.36× prefill speedup vs FlashAttention-2 at million-token scale [10]
- 2026-05-24: Prefill/decode compute asymmetry analysis relayed; NVIDIA structural advantage in compute-bound prefill highlighted as context windows grow [1]
- 2026-05: NVIDIA publishes NVFP4 KV cache quantization targeting long-context and large-batch LLM serving [4]
- 2026-05: Spheron documents approximately 2x throughput gain from prefill/decode disaggregation on GPU cloud [8]
- 2026-05: vLLM/NIXL disaggregated prefill/decode benchmark published, providing production-facing performance data for practitioners [9]
- 2026-05: AdaServe (CMU) multi-SLO LLM serving paper accepted to EuroSys 2026, proposing SLO-adaptive compute allocation across heterogeneous requests [13]
- 2026-05: Samsung Research publishes LookaheadKV, using future-glimpsing to improve KV cache eviction decisions without generation overhead [5]
- 2026-05: arXiv paper frames KV cache eviction as a general mechanism for improving long-context performance across deployment settings [7]
Perspectives
Rohan Paul (@rohanpaul_ai)
Enthusiastic amplifier of efficiency research, framing MoE expert pruning, sparse attention, and prefill/decode asymmetry as high-impact deployable gains; now also proposes a 'tokenomics' framework defining token value as a function of embedded intelligence and delivery speed
Evolution: expanded: tokenomics framing added alongside prior efficiency amplification role
NVIDIA (TensorRT-LLM / developer blog / research)
Advocates compute-layer solutions (chunked prefill in TensorRT-LLM) and memory-layer solutions (NVFP4 KV cache quantization), positioning NVIDIA hardware and software as jointly addressing inference efficiency across multiple resource dimensions
Evolution: consistent
Alibaba / Nanjing University researchers
Claims selective sparse attention achieves dramatic prefill speedup (9.36×) with lightweight adaptation, positioning sparsity as a software retrofit to existing GPU stacks rather than an architectural overhaul
Evolution: consistent
MoE pruning research community
Collectively argues that post-training expert pruning can recover roughly 50% of MoE compute waste with negligible accuracy cost, applicable to already-deployed frontier models including Qwen3 and GLM
Evolution: consistent
Cloud/infrastructure practitioners (Spheron, vLLM/NIXL community)
Frame prefill/decode disaggregation as a practical, deployable strategy with measurable throughput gains; translate architectural research into operator-facing benchmarks and deployment guidance
Evolution: merged from two prior voices for concision; stance unchanged
KV cache compression research community (NVIDIA, KVQuant, Samsung Research, KeyDiff authors)
Argues KV cache memory pressure is the primary serving bottleneck in long-context deployments, addressable through either quantization (NVFP4, KVQuant) or eviction (LookaheadKV, KeyDiff); these approaches have different accuracy tradeoff profiles and may be complementary
Evolution: expanded: eviction sub-track (LookaheadKV, KeyDiff) added alongside prior quantization focus
CMU AdaServe team
Argues that real production serving fleets have heterogeneous SLO requirements and that throughput-maximizing approaches systematically underserve latency-sensitive requests; SLO-adaptive compute allocation is a distinct and necessary optimization layer
Evolution: consistent
Simon Willison
Highlights the gap between advertised token-generation speeds and developer intuition, framing speed legibility as a practical issue for evaluating model-serving claims; peripheral to efficiency research, focused on developer UX
Evolution: new voice this pass
Tensions
- KV cache eviction vs. quantization: eviction drops tokens entirely (losing information for those tokens) while quantization retains all tokens at reduced precision; these approaches carry different accuracy tradeoff profiles and it is unclear which dominates in reasoning-heavy long-context settings or whether they compose. [5][6][7][4][3]
- 'Free efficiency' claim vs. evaluation skepticism: researchers claim 50% MoE expert pruning and 9.36× sparse attention speedups arrive with near-zero accuracy cost, but no adversarial benchmarks or out-of-distribution reasoning evaluations have publicly tested these claims. [2][10]
- Throughput-maximizing vs. SLO-diversity serving: disaggregation research optimizes for aggregate throughput assuming relatively homogeneous workloads, while AdaServe argues production fleets require SLO-adaptive allocation and that throughput-first optimization systematically disadvantages latency-sensitive requests. [8][9][13]
- Software retrofit vs. hardware specialization: sparse attention and MoE pruning favor software-level retrofits to existing GPU stacks, while disaggregation benchmarks point toward heterogeneous hardware clusters as the structural long-term solution—implying different capital allocation and vendor lock-in strategies. [1][10][2][8][9]
Sources
- [1] Chamath on all important “prefill” and “decode.” in AI compute. — Rohan Paul Twitter (2026-05-24)
- [2] A large MoE model may be wasting half its expert compute on tokens that barely need expert help. — Rohan Paul Twitter (2026-05-24)
- [3] KVQuant: Towards 10 Million Context Length LLM Inference with KV ... — reactive:llm-inference-efficiency
- [4] Optimizing Inference for Long Context and Large Batch Sizes with ... — reactive:llm-inference-efficiency
- [5] BLOG | Samsung Research — reactive:llm-inference-efficiency
- [6] NeurIPS Poster KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments — reactive:llm-inference-efficiency
- [7] Towards Improving Long-Context Performance with KV Cache Eviction — reactive:llm-inference-efficiency
- [8] Prefill-Decode Disaggregation on GPU Cloud: Split LLM Inference for 2x Throughput (2026 Guide) | Spheron Blog — reactive:llm-inference-efficiency
- [9] Benchmarking Disaggregated Prefill/Decode in vLLM Serving with NIXL : r/LocalLLaMA — reactive:llm-inference-efficiency
- [10] New Alibaba + Nanjing Univ paper claims million-token prefill can be sped up 9.36X (compared against FlashAttention-2) w… — Rohan Paul Twitter (2026-05-24)
- [11] LLM Serving: Speculative Decoding Production Benchmark 2026 | Chaos and Order — reactive:llm-inference-efficiency
- [12] "Not all tokens are created equal, and there is a way to look at token value. There are two key factors that impact toke… — Rohan Paul Twitter (2026-05-21)
- [13] [PDF] AdaServe: Accelerating Multi-SLO LLM Serving with SLO ... — reactive:llm-inference-efficiency
- [14] Streamlining AI Inference Performance and Deployment with NVIDIA ... — reactive:llm-inference-efficiency
- [15] MoE Pathfinder: Trajectory-driven Expert Pruning - Semantic Scholar — reactive:llm-inference-efficiency
- [16] moe-pruner: Democratizing DeepSeek-v3 with Expert Fusion - GitHub — reactive:llm-inference-efficiency
- [17] MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router — reactive:llm-inference-efficiency
- [18] Layer-adaptive Expert Pruning for Pre-Training of Mixture-of-Experts Large Language Models — reactive:llm-inference-efficiency
- [19] How fast is 10 tokens per second really? — Simon Willison (2026-05-20)
- [20] In the past, the bottleneck of AI clusters was FLOPS. — reactive:llm-inference-efficiency (2026-05-23)