Recent Developments in LLM Architectures: KV Sharing, mHC, and Compressed Attention
Ahead of AI · Sebastian Raschka, PhD · 2026-05-16
Sebastian Raschka surveys four long-context efficiency innovations in recent open-weight LLMs: cross-layer KV sharing and per-layer embeddings in Gemma 4, per-layer attention budgeting in Laguna XS.2, compressed convolutional attention in ZAYA1-8B, and manifold-constrained hyper-connections plus compressed sequence attention in DeepSeek V4.
Extraction
Topics: llm-architecture, attention-mechanisms, kv-cache-optimization, long-context-inference, transformer-efficiency
Claims
- Gemma 4's cross-layer KV sharing lets later layers reuse the key-value projections computed by earlier layers, roughly halving KV cache memory (a saving of 2.7 GB for the E2B model at 128K context).
- Gemma 4's per-layer embeddings allow small models to increase representational capacity through cheap lookup-style embedding parameters without scaling the expensive transformer blocks themselves.
- Laguna XS.2 uses per-layer query-head budgeting, giving sliding-window layers 8 query heads per KV head and full-attention layers only 6, spending attention capacity where it is most useful.
- ZAYA1-8B's Compressed Convolutional Attention performs attention directly in a compressed latent space and adds convolutional mixing on compressed Q and K tensors, reducing both KV cache size and attention FLOPs during prefill and training.
- DeepSeek V4-Pro's CSA/HCA compressed sequence attention needs only 27% of the single-token inference FLOPs and 10% of the KV cache size of DeepSeek V3.2 at 1M-token context length.
- DeepSeek V4's manifold-constrained hyper-connections widen the transformer residual stream into four parallel streams, with doubly-stochastic constraints for stability, adding only 6.7% training-time overhead.
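The KV cache claims above come down to simple arithmetic: cache size scales with the number of layers that store their own K and V tensors, so letting half the layers reuse another layer's K/V roughly halves the cache. The sketch below illustrates that relationship only. The function name `kv_cache_bytes` and every configuration value are hypothetical, chosen for round numbers; they are not Gemma 4's real settings, and the sketch ignores refinements such as sliding-window layers.

```python
def kv_cache_bytes(num_kv_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache K and V for one sequence.

    num_kv_layers counts only the layers that store their own K/V tensors;
    layers that reuse another layer's K/V add nothing to the cache.
    """
    # Leading factor 2: one tensor for keys, one for values.
    return 2 * num_kv_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical configuration for illustration only (NOT Gemma 4's real settings).
NUM_LAYERS = 32
NUM_KV_HEADS = 2
HEAD_DIM = 256
SEQ_LEN = 128 * 1024  # 128K tokens, 2 bytes per element (e.g. bf16)

baseline = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)
# Assumption: the second half of the layers reuse K/V from the first half.
shared = kv_cache_bytes(NUM_LAYERS // 2, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"baseline: {baseline / 2**30:.1f} GiB")
print(f"shared:   {shared / 2**30:.1f} GiB")
print(f"saved:    {(baseline - shared) / 2**30:.1f} GiB")
```

The same formula explains why the saving grows with context length: `seq_len` multiplies every other term, so whatever fraction of layers share K/V translates into the same fraction of cache saved at any context size.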
Key quotes
The thing that stood out to me is how much newer architectures are focused on long-context efficiency.
DeepSeek V4-Pro uses only 27% of the single-token inference FLOPs and 10% of the KV cache size compared with DeepSeek V3.2, which uses MLA and DeepSeek Sparse Attention.
While it was possible to implement a basic transformer block in perhaps 50-100 lines of PyTorch code, these tweaks (esp. around the attention variants) probably 10x the code complexity.
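The doubly-stochastic constraint mentioned in the claim about manifold-constrained hyper-connections can be made concrete with a small sketch. A doubly-stochastic matrix has non-negative entries with every row and every column summing to 1, and a product of such matrices is again doubly stochastic, which is one reason the constraint helps keep repeated stream mixing well behaved. The code below uses Sinkhorn-Knopp normalization, a standard way to obtain such a matrix. This summary does not say which procedure DeepSeek V4 uses, so treat the function `sinkhorn`, the iteration count, and the example logits as assumptions for illustration.

```python
import math


def sinkhorn(logits, num_iters=50):
    """Approximately project a square matrix of logits onto the set of
    doubly-stochastic matrices by alternating row/column normalization."""
    n = len(logits)
    # Exponentiate so that every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(num_iters):
        # Make each row sum to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Make each column sum to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


# Hypothetical 4x4 mixing logits, one row/column per residual stream.
logits = [
    [0.0, 1.0, -1.0, 0.5],
    [0.3, -0.2, 0.8, 0.0],
    [-0.5, 0.4, 0.1, 1.0],
    [1.0, 0.0, -0.3, 0.2],
]
mix = sinkhorn(logits)

row_sums = [sum(row) for row in mix]
col_sums = [sum(mix[i][j] for i in range(4)) for j in range(4)]
print("row sums:", [round(s, 6) for s in row_sums])
print("col sums:", [round(s, 6) for s in col_sums])
```

With four streams the mixing matrix is only 4x4, so a normalization like this is cheap relative to the attention and feed-forward computation in each block.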