Synthesis history

4 versions, newest first.

Version 4 2026-05-27 18:44 UTC · 42 items

The KV cache compression sub-thread expanded from two tracks to three: a survey article explicitly frames the space as eviction, quantization, and low-rank decomposition [^21257], with low-rank methods now recognized as…
Version 3 2026-05-26 08:51 UTC · 32 items

KV cache eviction emerged as a distinct sub-track alongside quantization, anchored by Samsung Research's LookaheadKV [^20351], KeyDiff at NeurIPS 2025 [^20352], and an arXiv long-context eviction paper [^20353]; the KV …
Version 2 2026-05-25 10:32 UTC · 27 items

Three new fronts emerged this pass. KV cache quantization solidified into a distinct third efficiency thread, alongside disaggregated serving and MoE expert pruning, anchored by NVIDIA's NVFP4 KV cache work [^19732] and …
Version 1 2026-05-25 05:24 UTC · 20 items

LLM inference efficiency research in mid-2026 is converging on two structural bottlenecks: the compute-vs-memory asymmetry between prefill and decode phases [^18627], and endemic expert-compute waste inside Mixture-of-E…