Agentic Workloads Rewriting LLM Inference Economics · history

Version 4

2026-05-24 11:10 UTC · 110 items

Changes since v3

Three substantive additions this pass: (1) Speculative decoding has emerged as a third documented technical pillar alongside KV cache management and attention optimization, with a coordinated burst of practitioner guides from Together.ai, NVIDIA, BentoML, Snowflake, and AWS, and a new tension over whether its throughput gains hold at agentic context lengths. (2) Reports that NVIDIA acquired Groq[22] introduce a significant competitive shift in fast-inference hardware, potentially reshaping Cerebras's independent positioning shortly after its IPO. (3) Enterprise and analyst voices (HFS Research, CX Today, Galileo, Quantiphi) have entered the thread as a new perspective, framing the problem as 'agentic sticker shock' and claiming that 40% of projects fail before production. Separately, an observer surfaced a Gartner projection that per-token costs will fall 90% by 2030 while noting that total bills are still rising, which introduces the Jevons paradox as an explicit macro-level tension.

What

Agentic AI workloads are generating token volumes that invalidate prior infrastructure planning: empirical analysis of 432k coding-agent requests finds a median input of 96k tokens, with roughly half exceeding 128k.[1][2] An independent academic study examines the same question from a research angle, analyzing how coding agents allocate token spend across task steps.[3] The engineering response has expanded to three documented technical pillars: tiered KV cache architectures (including new SSD-backed approaches[8]), IO-aware attention optimization, and speculative decoding, which uses a small draft model to propose tokens that a larger model verifies in parallel, with practitioner-reported throughput gains of up to 3×.[20] At the competitive level, reports of NVIDIA acquiring Groq,[22] shortly after Cerebras priced its IPO,[23] reshape the fast-inference hardware landscape, while enterprise analysts have begun quantifying an 'agentic sticker shock' problem: per-token prices are projected to fall 90% by 2030 per Gartner, yet total AI bills keep rising as usage scales.[29]

Why it matters

The gap between falling per-token prices and rising total enterprise AI spend is the defining economic tension of the agentic era. Which combination of technical mitigations — KV cache management, attention optimization, speculative decoding, and model-tier selection — closes that gap will determine whether AI agents achieve broad economic viability or remain confined to high-margin enterprise buyers who can absorb the 'sticker shock.'

Open questions

Can speculative decoding's throughput gains (claimed at 3× in some configurations[20]) be sustained at the 96k+ token contexts dominating agentic workloads, where draft-model acceptance rates may fall as context complexity grows?[16][21]
If NVIDIA acquired Groq[22], how does consolidation among fast-inference silicon vendors reshape Cerebras's independent positioning after its IPO[23] — and does it accelerate or slow the specialist-hardware thesis?
Does the Jevons paradox hold for inference? Gartner projects 90% lower per-token costs by 2030,[29] but if lower prices simply unlock more agentic usage, net enterprise spending may be flat or rising regardless of hardware and algorithmic efficiency gains.
Can prompt caching consistently deliver the 40–90% cost reductions reported by some practitioners[33] on agentic workloads, where context mutates frequently enough to suppress cache hit rates?[34][35] The agentic hit-rate problem — that the workloads most needing savings are least likely to generate cache hits — remains unresolved.
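
The hit-rate concern in the last question can be made concrete with a back-of-the-envelope model. The sketch below assumes cached input tokens are billed at a fixed fraction of the normal input price and ignores cache-write premiums, output tokens, and cache expiry. The 10% cached-price ratio is a hypothetical placeholder, not a quoted vendor rate, and the function names are illustrative.

```python
# Back-of-the-envelope prompt caching model. Input-token cost only.
# cached_price_ratio is a hypothetical billing parameter, not a vendor quote.


def input_cost_savings(hit_rate, cached_price_ratio):
    """Fractional reduction in input-token cost at a given cache hit rate.

    hit_rate: share of input tokens served from cache (0.0 to 1.0).
    cached_price_ratio: cached-token price divided by normal input price.
    """
    return hit_rate * (1.0 - cached_price_ratio)


def required_hit_rate(target_savings, cached_price_ratio):
    """Hit rate needed to reach a target fractional saving on input cost."""
    return target_savings / (1.0 - cached_price_ratio)


if __name__ == "__main__":
    ratio = 0.10  # assumed: cached tokens cost 10% of the normal price
    for target in (0.40, 0.59, 0.90):
        needed = required_hit_rate(target, ratio)
        print(f"{target:.0%} savings needs a {needed:.0%} token hit rate")
```

Under this assumption, the upper end of the reported 40–90% range is reachable only when nearly every input token is a cache hit, which is the condition that frequently mutating agent context undermines.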

Narrative

Empirical analysis of 432k coding-agent requests by SemiAnalysis found the median input token count is 96k — roughly the length of a novel — with approximately half of all requests exceeding 128k tokens.[1][2] An independently published academic paper at OpenReview examines the same question from a research angle, studying how coding agents allocate token expenditure across task steps and attempting to build predictive models for consumption.[3] The structural driver is not user verbosity but the scaffolding agents assemble automatically before a user types a word: system prompts, tool and skill definitions, MCP schemas, and the rolling context of prior conversation turns and file contents agents must carry to stay coherent. A community observer responding to the SemiAnalysis data captured the dynamic precisely: "the 'task' is often a giant read of the codebase before any actual work begins."[4] A separate observer framed it as a memory problem the market systematically underprices.[5]

The infrastructure response to this token volume has matured into three distinct technical pillars. The first is tiered KV cache management: when a model processes a 96k-token request, the key-value tensors overflow GPU HBM at scale,[6] and the mitigation is spilling to slower memory tiers. NVIDIA's Dynamo framework productizes hierarchical offloading (HBM → DRAM → NVMe),[7] while a new paper titled Tutti documents making SSD-backed KV cache practical for long-context serving with detailed latency-throughput tradeoffs.[8] Dell has also published an enterprise storage approach to the same KV offloading problem.[9] A Substack post titled "The KV Cache Wars?" explicitly frames the competition between these approaches as an active industry battleground.[10] Open-source stacks including vLLM, SGLang, and lmdeploy are implementing KV cache quantization in INT4, INT8, and FP8 formats to reduce memory footprints.[11][12][13] The second pillar is IO-aware attention optimization: at long contexts, attention computation — not just storage — becomes a primary cost driver, and algorithms like FlashAttention restructure computation to minimize HBM reads/writes, with 1.4–1.7× wall-clock speedups observed at 98k tokens.[14][15] The third pillar, which has seen a burst of practitioner documentation, is speculative decoding: a small draft model proposes several tokens that the full model verifies in parallel rather than autoregressively, yielding throughput gains. Together.ai, NVIDIA, AWS, Snowflake, and BentoML have all published deployment-oriented guides targeting long-context and decode-heavy workloads,[16][17][18][19][20] and Nebius specifically examines how speculative decoding addresses latency budgets for large MoE models whose size otherwise breaks interactive latency targets.[21]
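
To make the memory pressure behind the first pillar concrete, the sketch below estimates the per-request KV cache size at the 96k-token median. The model shape (80 layers, 8 KV heads, head dimension 128) is an illustrative assumption rather than a figure from the cited sources, and the byte widths are nominal element sizes that ignore the scale and zero-point overhead real quantization schemes carry.

```python
# Rough per-request KV cache sizing. All model dimensions are hypothetical.

BYTES_PER_ELEMENT = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype):
    """Bytes needed to hold keys and values for one sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * tokens * BYTES_PER_ELEMENT[dtype]


def to_gib(n_bytes):
    """Convert bytes to GiB (2**30 bytes)."""
    return n_bytes / 2**30


if __name__ == "__main__":
    tokens = 96_000  # median input length reported in the text
    for dtype in ("fp16", "fp8", "int4"):
        size = kv_cache_bytes(tokens, layers=80, kv_heads=8, head_dim=128, dtype=dtype)
        print(f"{dtype}: {to_gib(size):.1f} GiB per request")
```

With these assumed dimensions a single 96k-token request needs on the order of 29 GiB at FP16, and each halving of element width halves that figure. This is why quantization and offloading to slower tiers appear together in the engineering response.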

At the competitive level, two significant shifts have emerged, one in fast-inference hardware and one in pricing. Reports surfaced in a Reddit thread that NVIDIA has acquired Groq,[22] one of Cerebras's primary rivals in the specialist inference silicon market. If confirmed, this would represent NVIDIA absorbing a key independent challenger in the same space Cerebras entered public markets to pursue. Cerebras priced its IPO in May 2026, and NextPlatform covered its post-IPO strategy as a return to pushing inference performance limits independent of NVIDIA's ecosystem.[23] Meanwhile, GitHub Copilot's retirement of flat annual plans in favor of usage-based billing[24] has been followed by broader enterprise analyst attention to the same dynamic: HFS Research, CX Today, Galileo, and Quantiphi have each published analyses naming an 'agentic sticker shock' problem, with Galileo reporting that 40% of agentic AI projects fail before reaching production, due in part to cost overruns that were not anticipated at the pilot stage.[25][26][27][28]

The macro-level tension is what one community observer called 'the math looking broken': per-token prices are falling — Gartner projects a 90% reduction by 2030 — but total enterprise AI bills keep rising as usage scales.[29] This is a Jevons paradox dynamic: cheaper inference per unit drives more agentic deployment, which drives higher total spend, which makes the cost problem more urgent even as the per-token headline number improves. SemiAnalysis argues model labs — not cloud hyperscalers or inference resellers — are best positioned to capture value from this transition because they control model weights and can set inference economics at the source.[30] The counter-thesis is that smaller models with structured guardrails can reach near-frontier agentic accuracy at far lower cost,[31] potentially bifurcating the market between frontier-model enterprise deployments and cost-optimized small-model pipelines. A separate dimension of the same debate concerns open-weight models: MIT Sloan has raised questions about why open models — despite well-documented cost and customization advantages — are not more widely adopted in production agentic deployments.[32]
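
The Jevons dynamic described above reduces to simple arithmetic, sketched here. The 90% price decline is the Gartner projection quoted in the text; the usage multipliers are hypothetical values chosen only to show where the bill crosses break-even.

```python
# Total-bill arithmetic for falling unit prices and rising usage.
# Usage multipliers below are hypothetical, not forecasts.


def total_bill_ratio(price_drop, usage_multiplier):
    """Future bill divided by today's bill.

    price_drop: fractional fall in per-token price (0.9 means 90% cheaper).
    usage_multiplier: future token volume divided by today's volume.
    """
    return (1.0 - price_drop) * usage_multiplier


def break_even_usage_multiplier(price_drop):
    """Usage growth at which the total bill is unchanged."""
    return 1.0 / (1.0 - price_drop)


if __name__ == "__main__":
    drop = 0.90  # projected per-token price decline cited in the text
    print(f"break-even usage growth: {break_even_usage_multiplier(drop):.1f}x")
    for growth in (5, 10, 25):
        print(f"{growth}x usage -> bill is {total_bill_ratio(drop, growth):.2f}x today's")
```

A 90% price cut is fully offset once token volume grows roughly tenfold, so any organization whose agentic usage grows faster than that sees a higher bill despite the cheaper unit price.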

Timeline

2026-04-24: DeepSeek V4 released, described as best open-source model for coding tasks [52]
2026-04-27: GitHub Copilot announces move to usage-based billing and retirement of annual plans [24]
2026-05-14: Cerebras prices its IPO; CNBC, Yahoo Finance, and investment analysts cover it as a direct NVIDIA rival play for specialized inference hardware [40][43][44][45]
2026-05-15: NextPlatform covers Cerebras's post-IPO strategy as a return to pushing AI inference performance limits [23]
2026-05-16: Researchers note attention computation is the primary cost driver at long context, with 1.4–1.7× wall-clock speedups observed at 98k tokens when attention is IO-optimized [14][15][53]
2026-05-19: Community observers flag context memory as AI infrastructure's emerging bottleneck [57][31]
2026-05-20: Observers note token-maxing creates qualitatively different constraints than prior bandwidth bottlenecks [59]
2026-05-21: SemiAnalysis highlights KV cache overflow beyond GPU HBM as the central scaling constraint for agentic and long-context workloads [6]
2026-05-22: SemiAnalysis publishes analysis of 432k coding-agent requests: median input is 96k tokens, ~50% exceed 128k, driven by agentic prefill context assembly [1][2]
2026-05-22: Community observers amplify SemiAnalysis findings, framing agentic inference as primarily a memory problem the market systematically underprices [4][5]
2026-05-22: SemiAnalysis predicts proliferation of fast-tier pricing, specialized inference chips, and KV cache management as next competitive frontier [36]
2026-05-22: Observers frame the shift as AI entering a 'cloud economics' phase where serving efficiency rivals model capability [58]
2026-05-23: Ben Thompson (Stratechery) publishes 'The Inference Shift,' bringing mainstream tech-strategy analysis to the inference economics debate [47]
2026-05-23: Observer surfaces Gartner projection that inference costs will fall 90% by 2030 but notes total enterprise AI bills are rising — framing a Jevons paradox dynamic [29]
2026-05-24: Reddit discussion surfaces reports that NVIDIA acquired Groq, raising competitive questions about Cerebras's independent positioning in fast-inference hardware [22]

Perspectives

SemiAnalysis

Data-driven and prescriptive: empirical evidence from 432k requests shows real-world agentic token consumption far exceeds industry assumptions; the next competitive frontier is serving infrastructure and economics, not model intelligence; KV cache management and fast-tier pricing will define winners; model labs are best positioned to capture value

Evolution: Consistent and intensifying — built progressively from KV cache as bottleneck to specific usage statistics to market-structure predictions; academic work at OpenReview independently examining coding-agent token consumption provides corroboration

[6][36][1][2][30][37][38][3]

NVIDIA

Infrastructure solution provider across multiple layers: tiered KV cache architectures (HBM → DRAM → NVMe) via Dynamo for the memory bottleneck, Blackwell hardware for the agentic cost model, and speculative decoding documentation for the decode-phase throughput problem; reported acquisition of Groq would extend NVIDIA's reach into specialist fast-inference silicon

Evolution: Expanding — this pass adds speculative decoding as a new documented technique in NVIDIA's inference optimization portfolio, and the reported Groq acquisition would represent a major competitive consolidation move

[39][7][6][17][22]

GitHub / Microsoft

Usage-based billing is the right model for agentic AI — retiring flat annual plans in favor of per-token or per-use pricing to align revenue with actual consumption

Evolution: Action-oriented and consistent; the enterprise analyst community has since independently confirmed the same dynamic with 'sticker shock' framing

[24]

Cerebras

Specialized inference hardware is ready for public markets and independent operation — the IPO is a bet that fast, cost-efficient inference silicon will capture durable market share as agentic workloads scale; NextPlatform frames the post-IPO phase as Cerebras returning to pushing inference performance limits

Evolution: Position is consistent, but the competitive context has shifted: reported NVIDIA acquisition of Groq removes one key independent rival while simultaneously signaling that NVIDIA views specialist inference hardware as worth absorbing rather than ignoring

[40][41][42][43][44][45][46][23][22]

Ben Thompson / Stratechery

The inference economics transition deserves mainstream tech-strategy framing — 'The Inference Shift' positions this as a structural market transition, not merely an engineering optimization problem

Evolution: Consistent; Thompson's engagement signals the inference economics story has crossed from specialist infrastructure discourse into mainstream technology strategy analysis

[47][48][49][50][51]

Open-source / practitioner community

Three technical pillars address the cost problem: KV cache quantization (INT4/INT8/FP8) in vLLM, SGLang, lmdeploy; IO-aware attention algorithms like FlashAttention; and speculative decoding, with practitioners reporting 3× throughput gains in decode-heavy workloads. Small models with structured guardrails can reach near-frontier agentic accuracy at far lower cost. SSD-backed KV cache is becoming practically deployable.

Evolution: Expanding — speculative decoding has been added as a third documented pillar this pass, with a cluster of guides from Together.ai, BentoML, Snowflake, AWS, and Nebius; the Tutti paper adds rigor to SSD-backed KV cache feasibility

[31][52][34][33][11][12][13][15][53][16][17][54][20][55][19][21][18][56][8]

Enterprise / analyst observers (HFS Research, CX Today, Galileo, Quantiphi)

Agentic AI deployments carry hidden costs and complexity that cause 40% of projects to fail before reaching production; organizations face 'sticker shock' when moving from pilot to scale; TCO framing — not per-token pricing — is the correct lens for evaluating agentic infrastructure

Evolution: New voice in this pass — enterprise and analyst perspectives were previously absent from this thread, which was dominated by infrastructure engineers and capital markets. Their entry signals the inference economics story is now reaching business decision-makers

[28][27][25][26]

Infrastructure observers (bitfid, Crypto Exponentials, Sahil/TalkinIdeas, 38twelveDaily)

Context memory is the new bottleneck; AI is entering a cloud-economics phase where unit economics and serving efficiency matter as much as capability; token-maxing is a qualitatively different constraint than prior bandwidth bottlenecks; the Jevons paradox dynamic means falling per-token prices do not translate to falling total bills

Evolution: Expanding — the Jevons paradox framing (prices fall, usage rises, total spend holds or grows) is new this pass and adds a macro-economic dimension to what had been a primarily technical conversation

[57][58][59][29]

Tensions

Tiered KV cache offloading (NVIDIA's Dynamo, Dell storage engines, Tutti SSD-backed paper) vs. the fundamental latency cost: spilling to slower memory tiers reduces capacity pressure but degrades time-to-first-token — whether the tradeoff is acceptable for interactive agentic workloads remains unresolved [6][7][8][9]
Prompt caching as a near-term cost fix (practitioners report 40–90% savings) vs. the agentic hit-rate problem: agentic workloads mutate context so frequently that cache hit rates may be systematically low in exactly the use cases that need savings most [34][35][33]
Large frontier models consuming 96k+ token contexts vs. small models with guardrails achieving near-equivalent agentic accuracy at far lower cost — cost pressure may bifurcate the market rather than resolve toward one architecture [2][31][32]
Model labs capturing disproportionate value from the agentic shift (SemiAnalysis's AI Value Capture thesis) vs. specialized inference hardware vendors (Cerebras, and NVIDIA via reported Groq acquisition) and cloud infrastructure providers who control the serving bottleneck [30][36][40][43][22]
Falling per-token prices (Gartner: -90% by 2030) vs. rising total enterprise AI bills — the Jevons paradox dynamic where efficiency gains are absorbed by increased usage, leaving organizations facing higher total spend even as unit economics improve [29][28][27][25]
Speculative decoding's promise of 3× throughput gains vs. its applicability to agentic long-context workloads: the technique depends on draft-model acceptance rates that may degrade as context grows complex, limiting gains precisely where agentic workloads are most expensive [20][16][21]
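
The dependence on acceptance rate in the last tension can be illustrated with the simplified analytical model commonly used in the speculative decoding literature, which assumes each drafted token is accepted independently with the same probability. The sketch below is that idealized model only; the acceptance rates, draft length, and draft cost ratio are hypothetical inputs, and whether acceptance actually falls at long context is the open question, not something this code establishes.

```python
# Idealized speculative decoding speedup model (i.i.d. acceptance assumption).
# All numeric inputs in the demo are hypothetical.


def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per target-model verification pass.

    alpha: probability that each drafted token is accepted.
    gamma: number of tokens the draft model proposes per step.
    """
    if alpha >= 1.0:
        return gamma + 1.0  # every draft accepted, plus one token from the verifier
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha, gamma, draft_cost_ratio):
    """Wall-clock speedup over plain autoregressive decoding.

    draft_cost_ratio: time of one draft-model step divided by one target-model step.
    """
    step_cost = gamma * draft_cost_ratio + 1.0  # gamma draft steps plus one verification
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    for alpha in (0.9, 0.8, 0.5):
        speedup = expected_speedup(alpha, gamma=4, draft_cost_ratio=0.05)
        print(f"acceptance {alpha:.1f}: {speedup:.2f}x")
```

In this model the speedup is bounded by the draft length plus one and shrinks quickly as acceptance drops, so a headline multiple measured on short prompts does not automatically carry over if long agentic contexts lower the acceptance rate.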

Sources

[1] Even more striking: ~50% of requests already exceed 128k tokens. The driver isn't user prompts getting longer. It's ever… — SemiAnalysis Twitter (2026-05-22)
[2] Agentic workloads are quietly rewriting inference economics. We pulled data from 432k real coding agent requests at Semi… — SemiAnalysis Twitter (2026-05-22)
[3] How Do Coding Agents Spend Your Money? Analyzing and Predicting Token Consumptions in Agentic Coding Tasks | OpenReview — reactive:agentic-inference-economics
[4] @SemiAnalysis_ This is the inference-economics part of agents that gets under-discussed: the 'task' is often a giant rea... — reactive:agentic-inference-economics (2026-05-22)
[5] SemiAnalysis put a hard number on something the market keeps underpricing: agentic inference is mostly a memory problem ... — reactive:agentic-inference-economics (2026-05-22)
[6] With modern agentic workloads and long context windows, a common bottleneck in serving LLMs at scale is where to store a… — SemiAnalysis Twitter (2026-05-21)
[7] How to Reduce KV Cache Bottlenecks with NVIDIA Dynamo — reactive:agentic-inference-economics
[8] Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving — reactive:agentic-inference-economics
[9] Scalable KV Cache Offloading with Dell AI Storage Engines — reactive:agentic-inference-economics
[10] The KV Cache Wars? - by Ken Huang - Agentic AI — reactive:agentic-inference-economics
[11] Add INT8 Support for KV Cache Quantization (Currently FP8-Only ... — reactive:agentic-inference-economics
[12] INT4/INT8 KV Cache — lmdeploy - Read the Docs — reactive:agentic-inference-economics
[13] Quantized KV Cache - SGLang Documentation — reactive:agentic-inference-economics
[14] @NousResearch Attention is the primary cost driver at long context. 1.4-1.7x wall-clock speedup at 98K doesn't just cut ... — reactive:agentic-inference-economics (2026-05-16)
[15] FlashAttention | LLM Inference Handbook — reactive:agentic-inference-economics
[16] Speculative decoding for high-throughput long-context inference — reactive:agentic-inference-economics
[17] An Introduction to Speculative Decoding for Reducing ... — reactive:agentic-inference-economics
[18] Accelerating decode-heavy LLM inference with speculative ... - AWS — reactive:agentic-inference-economics
[19] Fastest Speculative Decoding in vLLM with Arctic Inference and ... — reactive:agentic-inference-economics
[20] Get 3× Faster LLM Inference with Speculative Decoding Using the ... — reactive:agentic-inference-economics
[21] Why large MoE models break latency budgets and what speculative decoding changes in production systems — reactive:agentic-inference-economics
[22] Nvidia acquired Groq, but why not Cerebras? Cerebras is 3x times ... — reactive:agentic-inference-economics
[23] With Its IPO Done, Cerebras Can Get Back To Pushing The AI ... — reactive:agentic-inference-economics
[24] GitHub Copilot is moving to usage-based billing and retiring annual plans — reactive:agentic-inference-economics (2026-04-27)
[25] The Hidden Costs of Agentic AI: Why 40% of Projects Fail Before Production — reactive:agentic-inference-economics
[26] How to avoid agentic sticker shock - HFS Research — reactive:agentic-inference-economics
[27] The Agentic AI Cost Problem: Calculating TCO for ... - CX Today — reactive:agentic-inference-economics
[28] Cost of Agentic AI: Expenses & ROI — reactive:agentic-compute-cpu-gpu
[29] The math looks broken: token prices are falling (Gartner says inference costs drop 90% by 2030), but total bills keep ri... — reactive:agentic-inference-economics (2026-05-23)
[30] AI Value Capture - The Shift To Model Labs - SemiAnalysis — reactive:ai-demand-bubble-debate
[31] Show HN: Forge – Guardrails take an 8B model from 53% to 99% on agentic tasks — reactive:open-model-capability-gap (2026-05-19)
[32] AI open models have benefits. So why aren’t they more widely used? — reactive:agentic-inference-economics
[33] How We Cut LLM Costs by 59% With Prompt Caching — reactive:agentic-inference-economics
[34] An Evaluation of Prompt Caching for Long-Horizon Agentic Tasks — reactive:agentic-inference-economics
[35] Low cache hit rate for large fixed System Prompt in Azure OpenAI ... — reactive:agentic-inference-economics
[36] Inference economics are shifting. Expect more "fast tier" pricing (Opus Fast, Gemini Flash), more specialized inference … — SemiAnalysis Twitter (2026-05-22)
[37] The Coding Assistant Breakdown: More Tokens Please - SemiAnalysis — reactive:agentic-inference-economics
[38] Claude Code Psychosis: How SemiAnalysis Is Token Mogging Meta | Ep. 008 — reactive:agentic-inference-economics
[39] Agentic AI Cost Model: NVIDIA Blackwell, Dynamo & Tokens — reactive:agentic-inference-economics
[40] Cerebras Priced The Biggest IPO Of 2026 Into A Market Huang ... — reactive:agentic-inference-economics
[41] Pricing - Cerebras — reactive:agentic-inference-economics
[42] Cerebras Launches the World's Fastest AI Inference : r/LocalLLaMA — reactive:agentic-inference-economics
[43] Nvidia competitor Cerebras' wild IPO: Here's what you need to know — reactive:agentic-inference-economics
[44] Nvidia rival Cerebras discloses US IPO filing as AI boom drives listings — reactive:agentic-inference-economics
[45] Cerebras Systems IPO Signals Growing Demand for AI Chip ... — reactive:agentic-inference-economics
[46] Can Cerebras Systems Challenge Nvidia? A Deep Dive into the AI ... — reactive:agentic-inference-economics
[47] The Inference Shift – Stratechery by Ben Thompson — reactive:agentic-inference-economics
[48] Please Listen to My Podcast – Stratechery by Ben Thompson — reactive:agentic-inference-economics
[49] 2026.20: Shifting Alliances in a Changing World - Stratechery — reactive:agentic-inference-economics
[50] Losing in the Attention Economy – Stratechery by Ben Thompson — reactive:agentic-inference-economics
[51] AI and the Human Condition – Stratechery by Ben Thompson — reactive:agentic-inference-economics
[52] DeepSeek V4 is out. the best open-source on coding. here's the breakdown — reactive:agentic-inference-economics (2026-04-24)
[53] FlashAttention: IO-Aware Exact Attention for Long-Context Language Models - Interactive | Michael Brenndoerfer — reactive:agentic-inference-economics
[54] Speculative decoding: how it works & when to use it — reactive:agentic-inference-economics
[55] Speculative Decoding in vLLM: Complete Guide to Faster LLM ... — reactive:agentic-inference-economics
[56] hemingkx/SpeculativeDecodingPapers: 📰 Must-read ... — reactive:agentic-inference-economics
[57] AI infrastructure is hitting a new bottleneck: context memory. — reactive:agentic-inference-economics (2026-05-20)
[58] AI is entering its “cloud economics” phase. — reactive:agentic-inference-economics (2026-05-22)
[59] @vijayshekhar The 3G to 4G parallel works for bandwidth, but token maxing is a different constraint. Inference costs and... — reactive:agentic-inference-economics (2026-05-20)