Ultra-Low Latency LLM Inference: Benchmarks and Emerging Enterprise Pricing Tier

closed · v4 · 2026-06-09 · 37 items

What's new in v4

SemiAnalysis's longitudinal DeepSeek V4 inference benchmarks via InferenceX [9] are the most substantive addition. AMD ROCm's 100x performance improvement by Day 26 and NVIDIA TensorRT-LLM's launch failures add a time-series benchmarking lens and a software-readiness dimension to the AMD vs. NVIDIA hardware competition, and they introduce an implicit methodological tension with Kog AI's one-time snapshot benchmark. Makora AI's sequential Monte Carlo speculative decoding [13] and a selective KV cache retention paper [14] extend the set of algorithmic efficiency alternatives without changing the core unresolved questions about Kog AI's benchmark validity or enterprise pricing.

What

Kog AI claims approximately 3,000 tokens per second per request on AMD MI300X using a monokernel architecture and a 2B-parameter model [4][5]; AMD has endorsed the result in a blog post [6], but no neutral party has replicated it. SemiAnalysis now publishes longitudinal inference benchmarks via InferenceX for DeepSeek V4, showing that AMD ROCm performance improved 100x by Day 26 while NVIDIA's TensorRT-LLM had launch issues requiring community fixes [9]. The inference optimization landscape continues to expand: Makora AI's sequential Monte Carlo speculative decoding eliminates rewind costs in draft token verification [13], and selective KV cache retention reduces memory overhead in long-context generation [14]. On the demand side, Qualcomm's CEO projects 40x global token demand growth by 2030 driven by agentic AI [12]. The central market question, whether ultra-low per-request latency inference commands a 20–50x per-token price premium [11], remains unresolved.

Why it matters

If agentic AI drives the next wave of token demand at machine pace, as Qualcomm projects [12], the dominant infrastructure need may be aggregate throughput at lower cost rather than per-request speed — which would partially undercut the enterprise pricing case SemiAnalysis has framed [11]. SemiAnalysis's push for longitudinal over snapshot benchmarking [9] adds a methodological lens under which one-time claims like Kog AI's face additional scrutiny.

Open questions

Will enterprise customers pay a 20–50x per-token premium for per-request latency gains, and which use cases — voice agents, real-time coding, autonomous workflows — would anchor that willingness? [11]
Have Kog AI's 3K tokens/s results been independently replicated by neutral parties, or do they rest on self-reported benchmarks and favorable amplification? [4][7]
Does Kog AI's monokernel approach scale from the 2B-parameter model it benchmarked to the 70B+ models common in enterprise deployments? [5]
As SemiAnalysis/InferenceX advocates longitudinal performance tracking over snapshot benchmarks [9], does Kog AI's one-time claim hold up under time-series scrutiny on actual deployment hardware?

Narrative

SemiAnalysis established the structural baseline: roughly 48% of end-to-end LLM inference latency comes from prefill and 52% from decode [1]. The DistServe paper at OSDI 2024 proposed disaggregating these two phases onto separate resources to optimize both simultaneously [2], and subsequent research on prefill-decode multiplexing continues this direction [3]. Into this landscape, Kog AI claims a monokernel architecture (a single kernel purpose-built for LLM inference) delivering approximately 3,000 tokens per second per request on an 8-GPU AMD MI300X node and 2,100 tokens/s on a comparable 8-GPU NVIDIA H200 node, using a 2B-parameter model at FP16 precision without speculative decoding [4][5]. AMD published a blog post validating the result as a 3.5x improvement over its inference baseline [6], and discussion spread from initial amplifiers to Hacker News [7]. All figures originate from Kog AI itself; no neutral third-party replication has been documented.

A metric distinction matters in practice: Kog AI's 3,000 tokens/s is per-request throughput, the speed of a single user's request, while Moreh separately reports 21,000 tokens/s on the same MI300X hardware, measured as aggregate output across concurrent requests using expert parallelism on a DeepSeek model [8]. The two results optimize for different objectives and are not directly comparable.
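The per-request versus aggregate distinction can be made concrete with a small sketch. It is illustrative only: the function name is hypothetical, the concurrency value is an assumption (the sources summarized here do not say how many concurrent requests Moreh's figure covers), and the even-split model ignores batching and scheduling effects.

```python
def per_request_rate(aggregate_tokens_per_s: float, concurrent_requests: int) -> float:
    """Average tokens/s seen by one request if aggregate output is split evenly."""
    if concurrent_requests < 1:
        raise ValueError("concurrent_requests must be at least 1")
    return aggregate_tokens_per_s / concurrent_requests


KOG_PER_REQUEST = 3_000.0    # tokens/s for a single request, as claimed in [4]
MOREH_AGGREGATE = 21_000.0   # tokens/s summed over concurrent requests, per [8]
ASSUMED_CONCURRENCY = 64     # hypothetical value, not reported in the sources

moreh_per_request = per_request_rate(MOREH_AGGREGATE, ASSUMED_CONCURRENCY)
print(f"Aggregate {MOREH_AGGREGATE:.0f} tokens/s over {ASSUMED_CONCURRENCY} requests "
      f"is about {moreh_per_request:.1f} tokens/s per request")
```

Even under an assumed concurrency, the two figures would not be a like-for-like comparison, because the models differ (a 2B model versus a DeepSeek model served with expert parallelism).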

SemiAnalysis, via its InferenceX initiative, is now tracking DeepSeek V4 inference performance longitudinally across hardware platforms (Huawei Ascend, NVIDIA GB300 NVL72 and B200, AMD MI355X), framing this as a transparency effort against one-off snapshot benchmarks. Its data shows AMD ROCm achieved a 100x performance improvement for DeepSeek V4 by Day 26 under focused engineering effort, while NVIDIA's own TensorRT-LLM did not work well at launch and required community-contributed fixes to its open-source kernel launch code [9]. DeepSeek V4 was co-designed in part for Huawei Ascend, which demonstrated Day 0 inference support [9]. Separately, Intel announced plans to debut a new AI data center chip before the end of 2026 featuring lower-cost memory and cooling technology compared to current AMD and NVIDIA offerings [10], expanding hardware competition beyond a two-player race.
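A minimal sketch of what longitudinal tracking adds over a snapshot, assuming performance is logged as one throughput value per day since model release. The function name and the series values are placeholders in arbitrary units, not InferenceX data; the last value is chosen only to mirror the reported 100x factor, the intermediate point is invented, and treating Day 0 as the baseline is an assumption.

```python
from typing import Dict


def improvement_over_baseline(series: Dict[int, float], baseline_day: int = 0) -> Dict[int, float]:
    """Express each day's measurement as a multiple of the baseline day's value."""
    base = series[baseline_day]
    if base <= 0:
        raise ValueError("baseline measurement must be positive")
    return {day: value / base for day, value in sorted(series.items())}


# Placeholder series (day since release -> throughput in arbitrary units).
placeholder_series = {0: 1.0, 7: 8.0, 26: 100.0}

ratios = improvement_over_baseline(placeholder_series)
for day, ratio in ratios.items():
    print(f"Day {day}: {ratio:.0f}x of Day 0")
```

A snapshot taken on any single day would report only one of these values, which is why the day of measurement matters when comparing platforms.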

SemiAnalysis frames ultra-low latency inference as an emerging enterprise pricing experiment: roughly 10x speed for a 20–50x per-token cost premium over standard inference [11], with enterprise willingness-to-pay explicitly unresolved. The demand-side context comes from Qualcomm's CEO, who projects that global AI token demand will reach 1.27 trillion tokens every 10 seconds by 2030, up from 31.7 billion in 2026 (a 40x increase), driven primarily by agentic AI operating at machine pace rather than conversational human pace [12]. That projection favors aggregate throughput infrastructure over per-request speed optimization. Algorithmic efficiency alternatives continue to emerge alongside the hardware competition: Makora AI's sequential Monte Carlo speculative decoding keeps multiple draft tokens alive in parallel rather than rewinding on mismatch, eliminating that rewind cost [13]; selective KV cache token retention reduces memory overhead in long-context generation by predicting which past tokens will remain relevant [14]; and context compression, which cuts token use by 25–50%, together with multi-model routing, offers further orthogonal cost reduction levers [15][16].
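The arithmetic behind these figures can be checked directly. The sketch below uses only the numbers quoted in this section; the helper name is hypothetical, and the per-year growth rate assumes smooth compounding from 2026 to 2030, which the projection itself does not state.

```python
def premium_per_unit_speed(price_multiple: float, speed_multiple: float) -> float:
    """Ratio of the per-token price multiple to the speed multiple."""
    return price_multiple / speed_multiple


SPEED_MULTIPLE = 10.0  # "roughly 10x speed" [11]
low_ratio = premium_per_unit_speed(20.0, SPEED_MULTIPLE)
high_ratio = premium_per_unit_speed(50.0, SPEED_MULTIPLE)
print(f"Price multiple per unit of speedup: {low_ratio:.0f}x to {high_ratio:.0f}x")

TOKENS_PER_10S_2026 = 31.7e9   # quoted in [12]
TOKENS_PER_10S_2030 = 1.27e12  # quoted in [12]
growth = TOKENS_PER_10S_2030 / TOKENS_PER_10S_2026

# Assumption: growth compounds evenly over the four years 2026 -> 2030.
annual_factor = growth ** (1 / (2030 - 2026))
print(f"Total growth: about {growth:.1f}x; implied yearly factor: about {annual_factor:.2f}x")
```

Read this way, the premium tier charges two to five times more per token than a price that scaled linearly with speed, which is the part of the offer the enterprise market has yet to test.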

Timeline

2024-07: DistServe paper at OSDI 2024 proposes disaggregating prefill and decode phases onto separate resources to improve LLM serving goodput. [2]
2026-05-26: SemiAnalysis posts quantitative LLM latency breakdown: ~48% prefill, ~52% decode. [1]
2026-05-28: Kog AI benchmark results first amplified publicly: 3,000 tokens/s per request on 8x AMD MI300X and 2,100 tokens/s on 8x NVIDIA H200 with a 2B model at FP16, no speculative decoding. [4]
2026-05-29: AMD publishes blog post validating Kog AI's result as a 3.5x breakthrough on AMD Instinct MI300X; AI Weekly reports a peak of 3,300 tokens/s via monokernel architecture. [6][18]
2026-05-29: Rohan Paul claims personal verification of Kog AI benchmarks; Kog AI technical blog post documents the monokernel design. [17][5]
2026-05-30: Discussion spreads to Reddit's ROCm community and developer forums. [19][20]
2026-05-31: SemiAnalysis frames ultra-low latency inference as an emerging enterprise pricing experiment: 10x speed at a 20–50x per-token premium, with enterprise willingness-to-pay explicitly unresolved. [11]
2026-06-01: Qualcomm CEO projects global token demand will grow 40x by 2030 driven by agentic AI; Intel announces plans to debut a new AI data center chip with lower-cost memory and cooling technology before the end of 2026. [12][10]
2026-06-04: Hacker News community engages Kog AI's 3K tokens/s per-request claim; Moreh's 21K aggregate tokens/s on MI300X via expert parallelism surfaces as a distinct benchmark on the same hardware. [7][8]
2026-06-06: SemiAnalysis surfaces Makora AI's sequential Monte Carlo speculative decoding, which keeps multiple draft tokens alive in parallel instead of rewinding on mismatch. [13]
2026-06-09: SemiAnalysis/InferenceX publishes longitudinal DeepSeek V4 inference benchmarks showing AMD ROCm improved 100x by Day 26 while NVIDIA's TensorRT-LLM required community fixes at launch; Huawei demonstrated Day 0 inference support. [9]

Perspectives

SemiAnalysis

Provides both the LLM latency decomposition (48/52 prefill/decode split) and the enterprise pricing framing (20–50x premium for ultra-low latency); via InferenceX now advocates longitudinal performance tracking over snapshot benchmarks, and is notably critical of NVIDIA's Day 0 readiness for DeepSeek V4 while praising AMD's rapid iterative improvement.

Evolution: Expanded from enterprise pricing analyst to active benchmarking transparency advocate, now publishing time-series hardware performance data alongside market-structure arguments.

[1][11][13][9]

Kog AI

Inference startup claiming a monokernel-based architecture delivering 3,000 tokens/s per request on AMD MI300X and 2,100 tokens/s on H200, documented in a primary technical blog post.

Evolution: Consistent; all benchmark figures are self-reported and have not been independently replicated.

[4][5]

Rohan Paul (@rohanpaul_ai)

Enthusiastic amplifier of inference efficiency research; claims personal verification of Kog AI benchmarks and relays context compression, multi-model routing, selective KV cache retention, and token demand projections as adjacent efficiency angles.

Evolution: Consistently expansive; extended scope to include selective KV cache retention as another memory optimization lever.

[4][17][15][16][12][10][14]

AMD

Corporate validator; published a blog crediting Kog AI with a 3.5x speed improvement on AMD Instinct MI300X, positioning the result as a competitive proof-point against NVIDIA.

Evolution: Consistent with AMD's broader push to position MI300X as a serious inference platform; AMD ROCm's 100x DeepSeek V4 improvement by Day 26 (per SemiAnalysis, tracked on MI355X rather than MI300X) reinforces this positioning.

[6][9]

Moreh

Reports 21K aggregate output tokens/s for DeepSeek inference on AMD MI300X using expert parallelism, optimizing aggregate throughput rather than per-request latency.

Evolution: Consistent; provides the sharpest contrast to Kog AI's per-request optimization framing on the same MI300X GPUs.

[8]

Qualcomm (CEO Cristiano Amon)

Projects 40x growth in global token demand by 2030 driven by agentic AI at machine pace; positions inference infrastructure investment as a throughput-scale story rather than a per-request speed story.

Evolution: Consistent; unchanged since entering the thread.

[12]

Makora AI

Proposes sequential Monte Carlo speculative decoding as an algorithmic improvement: keeping multiple draft tokens alive in parallel eliminates the rewind cost incurred when draft tokens fail verification in standard speculative decoding.

Evolution: New to this thread; surfaced by SemiAnalysis as a noteworthy optimization approach.

[13]
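To make the rewind cost concrete, the sketch below shows the baseline that the description contrasts against: standard greedy speculative decoding verification, not Makora AI's sequential Monte Carlo method, whose details are not given in the captured items. All names are hypothetical, the target model is a toy function, and verification is written as a sequential loop for readability, whereas real systems score all draft positions in one parallel pass.

```python
from typing import Callable, List, Sequence, Tuple


def verify_draft(prefix: Sequence[int], draft: Sequence[int],
                 target_next: Callable[[Sequence[int]], int]) -> Tuple[List[int], int]:
    """Accept draft tokens until the first mismatch with the target model.

    Returns the tokens to emit and the number of draft tokens thrown away.
    """
    context = list(prefix)
    emitted: List[int] = []
    for i, token in enumerate(draft):
        expected = target_next(context)
        if token != expected:
            # Mismatch: emit the target's token and discard the rest of the draft.
            emitted.append(expected)
            return emitted, len(draft) - i
        emitted.append(token)
        context.append(token)
    return emitted, 0


def toy_target(context: Sequence[int]) -> int:
    """Toy stand-in for a target model: next token is last token plus one, mod 10."""
    return (context[-1] + 1) % 10


emitted, discarded = verify_draft([1, 2, 3], [4, 5, 9, 7], toy_target)
print(f"emitted={emitted}, discarded draft tokens={discarded}")
```

In this toy run the third draft token is wrong, so it and the token after it are discarded; that discarded work is one reading of the rewind cost the approach is described as avoiding.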

ML Systems Research Community

Academic work on prefill-decode disaggregation and multiplexing provides theoretical grounding for the class of optimization Kog AI claims, lending structural plausibility without directly validating the specific implementation.

Evolution: Background context; researchers are not directly engaging with Kog AI's claims in captured items.

[2][3]
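One way to see why both phases matter, given the roughly 48/52 prefill/decode split reported by SemiAnalysis [1], is an Amdahl-style bound. This is a sketch under stated assumptions: the function name is hypothetical, the 10x decode speedup is an illustrative input rather than a measured result, and the split itself varies with prompt and output length.

```python
def end_to_end_speedup(prefill_share: float, decode_share: float,
                       prefill_speedup: float = 1.0,
                       decode_speedup: float = 1.0) -> float:
    """Amdahl-style speedup when each phase is accelerated by its own factor."""
    old_time = prefill_share + decode_share
    new_time = prefill_share / prefill_speedup + decode_share / decode_speedup
    return old_time / new_time


PREFILL_SHARE = 0.48  # share of end-to-end latency, per [1]
DECODE_SHARE = 0.52   # share of end-to-end latency, per [1]

# Hypothetical scenario: decode becomes 10x faster, prefill is unchanged.
decode_only = end_to_end_speedup(PREFILL_SHARE, DECODE_SHARE, decode_speedup=10.0)

# Upper bound if decode time went to zero: prefill alone remains.
decode_only_ceiling = (PREFILL_SHARE + DECODE_SHARE) / PREFILL_SHARE

print(f"decode 10x faster: {decode_only:.2f}x end to end")
print(f"ceiling from decode-only gains: {decode_only_ceiling:.2f}x")
```

Under this split, accelerating decode alone cannot much more than double end-to-end speed, which is consistent with the disaggregation literature's interest in optimizing both phases.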

Tensions

Kog AI's benchmark figures originate from the company itself and have been relayed by enthusiastic amplifiers rather than independently replicated, leaving the core throughput claims unverified despite a published technical blog post and AMD validation. [4][17][5][6][7]
SemiAnalysis frames ultra-low per-request latency as a tier priced at a 20–50x per-token premium in enterprise settings, with willingness-to-pay still untested, whereas Qualcomm's CEO projects that agentic AI will make aggregate throughput at scale, not per-request speed, the dominant infrastructure need. [11][12]
SemiAnalysis/InferenceX advocates longitudinal performance tracking as more reliable than one-time snapshots, implicitly challenging Kog AI's single non-replicated benchmark as insufficient evidence for its claims. [9][4][7]
Kog AI optimizes for per-request latency on a 2B model; Moreh's 21K tokens/s result on the same MI300X hardware optimizes for aggregate throughput using expert parallelism on a larger MoE model — the two measure different objectives and neither speaks to the other's claim. [4][8]
AMD's favorable MI300X positioning against NVIDIA H200 (3,000 tokens/s vs. 2,100 tokens/s) rests on Kog AI's own non-independent test, while SemiAnalysis separately documents NVIDIA TensorRT-LLM's Day 0 failures on DeepSeek V4 requiring community fixes. [4][6][9]
Algorithmic efficiency approaches — SMC speculative decoding, selective KV cache retention, context compression, multi-model routing — offer orthogonal cost reduction levers that challenge whether premium-tier latency hardware is the primary inference optimization lever. [13][14][15][16][11]

Status: active and growing

Sources

[1] PDOOM ALERT 🚨 : ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops: — SemiAnalysis Twitter (2026-05-26)
[2] [PDF] DistServe: Disaggregating Prefill and Decoding for Goodput ... — reactive:mlsys-2026-inference-systems
[3] Towards High-Goodput LLM Serving with Prefill-decode Multiplexing — reactive:llm-inference-speed-market
[4] Some truly massive inference numbers here. — Rohan Paul Twitter (2026-05-28)
[5] Building a single-kernel, latency-optimized LLM inference engine on AMD MI300X GPUs — reactive:llm-inference-speed-market
[6] Kog Reaches 3.5× Breakthrough Inference Speed on AMD Instinct MI300X — reactive:llm-inference-speed-market
[7] Real-time LLM Inference on Standard GPUs: 3k tokens/s per request — reactive:llm-inference-speed-market
[8] 21K Output Tokens Per Second DeepSeek Inference on AMD Instinct MI300X GPUs with Expert Parallelism – Moreh — reactive:llm-inference-speed-market
[9] DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200 — SemiAnalysis Twitter (2026-06-09)
[10] Intel is aiming to debut a new AI data center chip before the year closes, comes with lower-cost memory and cooling tech… — Rohan Paul Twitter (2026-06-01)
[11] 10x speed at a 20x to 50x price premium per token. We're about to find out exactly how much the enterprise market is wil… — SemiAnalysis Twitter (2026-05-31)
[12] "Every 10 seconds, global token demand is around 31.7 billion in 2026. By 2030 its 1.27 trillion, a 40x increase." — Rohan Paul Twitter (2026-06-01)
[13] @makora_ai 's sequential Monte Carlo speculative decoding keeps multiple draft tokens alive in parallel instead of rewin… — SemiAnalysis Twitter (2026-06-06)
[14] This paper teaches LLMs to save memory by keeping only past tokens likely to matter later. — Rohan Paul Twitter (2026-06-03)
[15] This paper shows how LLMs can use shorter context more cheaply without losing much answer quality. — Rohan Paul Twitter (2026-05-29)
[16] Most AI teams still buy inference like they are buying software from 1 vendor. — Rohan Paul Twitter (2026-05-28)
[17] I had to test it myself to believe this unreal inference speed. — Rohan Paul Twitter (2026-05-29)
[18] AMD MI300X monokernel hits 3,300 tokens per second | AI Weekly — reactive:llm-inference-speed-market
[19] Building a monokernel for LLM inference on AMD MI300X - Reddit — reactive:llm-inference-speed-market
[20] Real-time LLM Inference on Standard Datacenter GPUs ... - Devtalk — reactive:llm-inference-speed-market