The Information Machine

Ultra-Low Latency LLM Inference: Benchmarks and Emerging Enterprise Pricing Tier

cooling · v3 · 2026-06-05 · 32 items · history

What's new in v3

Three substantive additions this pass. Moreh's benchmark of 21K aggregate tokens/s on AMD MI300X [4] introduces a distinction between per-request and aggregate throughput, which sharpens a core tension with Kog AI's per-request figure. Qualcomm's CEO projects 40x token demand growth by 2030, driven by agentic AI [14]; this demand-side framing challenges whether per-request latency is even the right optimization axis for the coming infrastructure build-out. Intel's newly announced AI data center chip with lower-cost memory [15] expands the hardware competition beyond AMD vs. NVIDIA. Hacker News discussion of Kog AI's 3K tokens/s claim [5] confirms continued spread, but without new independent verification.

What

Kog AI, an inference startup, claims approximately 3,000 tokens per second per request on 8x AMD MI300X GPUs using a 'monokernel' architecture, a figure described as roughly 20–30x faster than comparable hardware; AMD's blog credits the result as a 3.5x improvement on its platform [6][9]. The benchmark remains self-reported and unverified by neutral parties, but community discussion has spread from initial amplifiers to Hacker News [5]. SemiAnalysis frames this performance class as an emerging enterprise pricing experiment: roughly 10x speed at a 20–50x per-token cost premium [13]. The surrounding context is expanding on both sides. On demand, Qualcomm's CEO projects 40x token demand growth by 2030 driven by agentic AI workloads [14]. On supply, Intel announced plans to debut a new AI data center chip with lower-cost memory before the end of 2026, adding a third competitor to the hardware inference race [15].

Why it matters

If agentic AI drives 40x token demand growth by 2030 [14], the dominant infrastructure optimization problem may shift from minimizing per-request latency to maximizing aggregate throughput at lower cost—which would partly undercut the case for a 20–50x per-token premium on speed [13]. Intel's entry with cheaper memory technology [15] adds further pressure on the cost structure that premium-tier inference pricing depends on.
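One rough way to read the SemiAnalysis figures is as a price paid per unit of speedup. The sketch below is illustrative only: the 10x speedup and the 20–50x price multiples come from [13], while the function names and the $1-per-million-token baseline price are hypothetical placeholders, not quoted prices.

```python
# Figures from [13]; the baseline price is a hypothetical placeholder.
SPEEDUP = 10.0                   # "roughly 10x speed"
PRICE_MULTIPLES = (20.0, 50.0)   # "20-50x per-token cost premium"
BASE_PRICE_PER_MTOK = 1.00       # assumed: $1 per million tokens on a standard tier


def premium_per_speedup(speedup: float, price_multiple: float) -> float:
    """Price multiple paid for each 1x of speed gained."""
    return price_multiple / speedup


def job_cost(tokens: int, base_price_per_mtok: float, price_multiple: float = 1.0) -> float:
    """Dollar cost of generating `tokens` at the given price multiple."""
    return tokens / 1_000_000 * base_price_per_mtok * price_multiple


if __name__ == "__main__":
    tokens = 2_000_000  # hypothetical job size
    standard = job_cost(tokens, BASE_PRICE_PER_MTOK)
    print(f"standard tier: ${standard:.2f}")
    for multiple in PRICE_MULTIPLES:
        premium = job_cost(tokens, BASE_PRICE_PER_MTOK, multiple)
        ratio = premium_per_speedup(SPEEDUP, multiple)
        print(f"{multiple:.0f}x tier: ${premium:.2f} ({ratio:.1f}x price per 1x speed)")
```

Under these assumptions the buyer pays 2x to 5x more per unit of speed gained, so the premium only makes sense where latency itself has value that grows faster than linearly.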

Open questions

  • Will enterprise customers actually pay a 20–50x per-token premium for per-request latency improvements, and which use cases—voice agents, real-time coding, autonomous agents—would anchor that willingness? [13]

  • Have Kog AI's 3K tokens/s benchmark results been independently replicated by neutral parties, or do they rest entirely on the company's own claims and favorable amplification? [6][7][5]

  • Does the monokernel approach scale to larger models (70B+ parameters) commonly deployed in enterprise settings, beyond the 2B model benchmarked? [6][8]

  • If agentic AI shifts the primary demand metric from per-request latency to aggregate throughput, does Kog AI's per-request optimization retain enterprise value, or does bulk-throughput work like Moreh's 21K tokens/s aggregate approach become the more relevant product? [14][4]

Narrative

A technical decomposition from SemiAnalysis establishes the structural context: roughly 48% of end-to-end LLM inference latency comes from prefill and 52% from decode [1]. Reducing time-to-first-token requires attacking prefill; improving tokens-per-second throughput requires optimizing decode. The DistServe paper at OSDI 2024 proposed disaggregating prefill and decode onto separate resources to optimize both simultaneously [2], and subsequent research on prefill-decode multiplexing continues this direction [3]. A metric distinction matters practically here: Kog AI's benchmark is framed as per-request throughput—3,000 tokens per second for a single user's request—while Moreh separately reports 21,000 tokens per second on AMD MI300X measuring aggregate output across concurrent requests using expert parallelism on a DeepSeek model [4][5]. These are optimizing for different things and are not directly comparable.
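The 48/52 split also bounds how much a decode-only or prefill-only optimization can improve end-to-end latency. The sketch below applies Amdahl's law to the split reported in [1]. Applying Amdahl's law here is an illustrative framing rather than something SemiAnalysis states, and the split is treated as fixed even though in practice it varies with prompt and output length. The function name is hypothetical.

```python
PREFILL_SHARE = 0.48  # share of end-to-end latency, per [1]
DECODE_SHARE = 0.52   # share of end-to-end latency, per [1]


def end_to_end_speedup(prefill_speedup: float = 1.0, decode_speedup: float = 1.0) -> float:
    """Amdahl's law: overall speedup when each phase is accelerated separately."""
    if prefill_speedup <= 0 or decode_speedup <= 0:
        raise ValueError("speedups must be positive")
    remaining = PREFILL_SHARE / prefill_speedup + DECODE_SHARE / decode_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Decode made 10x faster, prefill untouched.
    print(f"decode 10x only: {end_to_end_speedup(decode_speedup=10.0):.2f}x")
    # Decode made infinitely fast: prefill becomes the ceiling.
    print(f"decode ceiling:  {end_to_end_speedup(decode_speedup=float('inf')):.2f}x")
    # Both phases 10x faster.
    print(f"both 10x:        {end_to_end_speedup(10.0, 10.0):.2f}x")
```

With this split, a 10x decode speedup alone yields under 2x end to end, and no decode-only optimization can exceed about 2.08x. This is why approaches that attack both phases, such as disaggregation, matter.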

Kog AI released benchmark results claiming approximately 3,000 tokens per second per request on an 8-GPU AMD MI300X node and 2,100 tokens/s on a comparable 8-GPU NVIDIA H200 node, using a 2B parameter model at FP16 precision without speculative decoding [6][7]. The company attributes the gains to a 'monokernel', a single kernel purpose-built for LLM inference, documented in a technical blog post [8]. AMD published a blog crediting Kog AI with a 3.5x improvement over its baseline inference performance [9], and AI Weekly reported a peak of 3,300 tokens/s on MI300X via this architecture [10]. The claims have spread from initial amplification by Rohan Paul to Reddit and developer forums [11][12] and then to Hacker News [5]. All benchmark figures originate from Kog AI itself; no neutral third-party replication has been documented.
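Per-request tokens per second converts directly into time per generated token, which is the quantity a single user actually experiences. The sketch below uses only the claimed figures from [6]. The 600-token response length and the helper names are hypothetical, and the calculation covers decode time only, ignoring prefill and time-to-first-token.

```python
# Claimed per-request decode rates from [6] (self-reported, not independently verified).
CLAIMED_TPS = {"8x MI300X": 3000.0, "8x H200": 2100.0}


def ms_per_token(tokens_per_second: float) -> float:
    """Average time between generated tokens, in milliseconds."""
    return 1000.0 / tokens_per_second


def decode_seconds(n_tokens: int, tokens_per_second: float) -> float:
    """Decode-only time for a response of n_tokens (prefill not included)."""
    return n_tokens / tokens_per_second


if __name__ == "__main__":
    response_tokens = 600  # hypothetical response length
    for node, tps in CLAIMED_TPS.items():
        print(f"{node}: {ms_per_token(tps):.3f} ms/token, "
              f"{decode_seconds(response_tokens, tps):.3f} s for {response_tokens} tokens")
    ratio = CLAIMED_TPS["8x MI300X"] / CLAIMED_TPS["8x H200"]
    print(f"MI300X vs. H200 claimed ratio: {ratio:.2f}x")
```

At the claimed rates the gap between the two nodes is about 1.43x, or roughly 0.14 ms per token.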

SemiAnalysis connected this performance class to an emerging enterprise market structure: roughly 10x speed at a 20–50x per-token cost premium over standard inference [13]. The framing treats enterprise willingness to pay this premium as a genuine unknown. A new demand-side projection now bears on that unknown: Qualcomm's CEO projects that global AI token demand will reach 1.27 trillion tokens every 10 seconds by 2030, up from 31.7 billion every 10 seconds in 2026, with the primary growth driver being agentic AI workflows operating at machine pace rather than human-conversational pace [14]. If that projection is directionally correct, the dominant inference optimization problem may shift toward bulk throughput at lower cost rather than minimizing per-request latency, which would partially reframe what kind of infrastructure commands premium pricing.
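The projection's headline numbers can be checked and restated with a few lines of arithmetic. The sketch below takes the two figures from [14] and derives the growth multiple, the per-second rates, and an implied annual growth factor. The annual factor assumes smooth compound growth from 2026 to 2030, which is an assumption made here and not part of the projection.

```python
# Figures from [14]: global token demand per 10-second window.
TOKENS_PER_10S_2026 = 31.7e9
TOKENS_PER_10S_2030 = 1.27e12
START_YEAR, END_YEAR = 2026, 2030

# Overall growth multiple between the two years.
growth_multiple = TOKENS_PER_10S_2030 / TOKENS_PER_10S_2026

# The same demand expressed per second.
per_second_2026 = TOKENS_PER_10S_2026 / 10
per_second_2030 = TOKENS_PER_10S_2030 / 10

# Implied yearly growth factor, assuming constant compounding (an assumption).
years = END_YEAR - START_YEAR
annual_factor = growth_multiple ** (1 / years)

if __name__ == "__main__":
    print(f"growth multiple: {growth_multiple:.1f}x")
    print(f"2026: {per_second_2026:.3g} tokens/s, 2030: {per_second_2030:.3g} tokens/s")
    print(f"implied annual growth: {annual_factor:.2f}x per year over {years} years")
```

The two figures are consistent with the quoted 40x, and under constant compounding they imply demand growing roughly 2.5x every year.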

The hardware competition context is expanding. In Kog AI's own benchmark, AMD's MI300X leads NVIDIA's H200 by roughly 43% (3,000 vs. 2,100 tokens/s) [6], a result AMD has amplified as a competitive proof point [9]. Intel announced plans to debut a new AI data center chip before the end of 2026, emphasizing lower-cost memory and cooling technology compared to current AMD and NVIDIA offerings [15]. Orthogonal efficiency approaches, including context compression cutting token use by 25–50% [16] and multi-model routing reducing per-task cost [17], put additional pressure on any premium-tier pricing thesis that assumes inference cost is primarily a function of hardware speed rather than workload architecture.

Timeline

  • 2024-07: DistServe paper at OSDI 2024 proposes disaggregating prefill and decode phases onto separate resources to improve LLM serving goodput. [2]
  • 2026-05-26: SemiAnalysis posts quantitative LLM latency breakdown: ~48% prefill, ~52% decode. [1]
  • 2026-05-28: Kog AI benchmark results first amplified publicly: 3,000 tokens/s per request on 8x AMD MI300X and 2,100 tokens/s on 8x NVIDIA H200 with a 2B model at FP16, no speculative decoding. [6]
  • 2026-05-29: AMD publishes blog validating Kog AI's result as a 3.5x breakthrough on AMD Instinct MI300X; AI Weekly reports peak of 3,300 tokens/s via monokernel architecture. [9][10]
  • 2026-05-29: Rohan Paul claims personal verification of Kog AI benchmarks and attributes gains to a 'hidden efficiency gap' in GPU token generation. [7]
  • 2026-05-29: Kog AI technical blog post documents the single-kernel, latency-optimized LLM inference engine design on AMD MI300X GPUs. [8]
  • 2026-05-30: Discussion spreads to Reddit's ROCm community and developer forums, indicating engagement beyond the initial amplification wave. [12][11]
  • 2026-05-31: SemiAnalysis frames ultra-low latency inference as an emerging enterprise pricing experiment: 10x speed at a 20–50x per-token premium, with enterprise willingness-to-pay explicitly unresolved. [13]
  • 2026-06-01: Qualcomm CEO projects global token demand will grow 40x by 2030 (31.7 billion to 1.27 trillion per 10 seconds), driven by agentic AI workloads rather than conversational use. [14]
  • 2026-06-01: Intel announced plans to debut a new AI data center chip before end of 2026 featuring lower-cost memory and cooling technology versus current NVIDIA and AMD offerings. [15]
  • 2026-06-04: Hacker News community discussion engages Kog AI's 3K tokens/s per request claim, marking spread beyond the initial amplification community. [5]

Perspectives

SemiAnalysis

Technical analyst providing both the LLM latency decomposition (48/52 prefill/decode split) and the enterprise pricing framing; treats ultra-low latency as a legitimate but unproven new market tier priced at a 20–50x premium.

Evolution: Consistent across two posts: first establishes technical scaffolding, then applies it to a market-structure argument.

Kog AI

Inference startup claiming a monokernel-based breakthrough delivering 3,000 tokens/s per request on AMD MI300X and 2,100 on H200, with published technical documentation.

Evolution: Consistent; now supported by a primary technical blog post in addition to benchmark claims.

Rohan Paul (@rohanpaul_ai)

Enthusiastic amplifier of inference efficiency research broadly; claims personal verification of Kog AI benchmarks and promotes context compression, multi-model routing, and token demand projections as adjacent efficiency levers.

Evolution: Extended beyond Kog AI amplification to relay the Qualcomm CEO's 40x token demand projection and Intel's chip announcement, reinforcing an 'inference infrastructure demand is growing fast' framing.

AMD

Corporate validator; published a blog crediting Kog AI with a 3.5x speed improvement on AMD Instinct MI300X, positioning the result as a competitive proof-point against NVIDIA.

Evolution: Consistent with AMD's broader push to position MI300X as a serious inference platform.

Moreh

Inference optimization firm reporting 21K aggregate output tokens/second on AMD MI300X using expert parallelism for DeepSeek, demonstrating a different optimization axis—aggregate throughput via MoE parallelism rather than per-request latency.

Evolution: New to this thread; provides a concrete alternative benchmark on the same hardware platform.

Qualcomm (CEO Cristiano Amon)

Projects 40x growth in global token demand by 2030 driven by agentic AI operating at machine pace; positions inference infrastructure investment as a long-term demand story centered on throughput scale rather than per-request speed.

Evolution: New to this thread; adds a demand-side projection that complicates the per-request latency premium thesis.

ML Systems Research Community

Academic work on prefill-decode disaggregation provides theoretical grounding for the class of optimization Kog AI claims, lending structural plausibility without directly validating the specific implementation.

Evolution: Background context; researchers are not directly engaging with Kog AI's claims in captured items.

Tensions

  • Kog AI's benchmark figures originate from the company itself and have been relayed by enthusiastic amplifiers rather than independently replicated by neutral third parties, leaving the core throughput claims unverified despite publication of a technical blog post. [6][7][8][5]
  • SemiAnalysis frames ultra-low per-request latency as a tier priced at a 20–50x premium in enterprise settings, with willingness to pay still unresolved, but the Qualcomm CEO's projection implies agentic AI will make aggregate throughput at scale the dominant infrastructure need rather than per-request speed. [13][14]
  • AMD's favorable positioning of MI300X (3,000 tokens/s) against NVIDIA H200 (2,100 tokens/s) in the same benchmark rests on a single company's non-independent test. [6][9]
  • Kog AI optimizes for per-request latency on a 2B model; Moreh's 21K tokens/s result on the same MI300X hardware optimizes for aggregate throughput using expert parallelism on a larger MoE model—the two are measuring different things and neither speaks to the other's claim. [6][4]
  • Orthogonal efficiency approaches—context compression cutting token use 25–50% and multi-model routing reducing per-task cost—challenge the premise that premium-tier latency pricing is the primary lever for inference cost optimization. [16][17][13]
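The Kog AI and Moreh figures in the tensions above differ in what they count, and a small sketch makes the gap concrete. Aggregate throughput sums tokens across all concurrent requests, so dividing by concurrency gives the average rate a single request sees. The concurrency values below are hypothetical, since Moreh's actual concurrency is not stated in the captured items. The even split across requests is also an assumption, and the two benchmarks use different models, so the output illustrates the metric difference only and is not a comparison of the two systems.

```python
def aggregate_tps(per_request_tps: float, concurrent_requests: int) -> float:
    """Total tokens/s across all requests, assuming each runs at the same rate."""
    return per_request_tps * concurrent_requests


def per_request_tps(aggregate: float, concurrent_requests: int) -> float:
    """Average tokens/s seen by one request, assuming an even split."""
    if concurrent_requests < 1:
        raise ValueError("need at least one request")
    return aggregate / concurrent_requests


if __name__ == "__main__":
    # Kog AI's claim [6] is per request; at a single request, aggregate equals per-request.
    print(f"3,000 tok/s per request x 1 request = {aggregate_tps(3000.0, 1):,.0f} aggregate")

    # Moreh's claim [4] is aggregate; the per-request rate depends on concurrency.
    for concurrency in (16, 64, 256):  # hypothetical values
        rate = per_request_tps(21_000.0, concurrency)
        print(f"21,000 aggregate / {concurrency} requests = {rate:,.1f} tok/s per request")
```

The same aggregate number can correspond to very different single-user experiences, which is why neither figure speaks to the other's claim without the concurrency level.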

Status: active and growing

Sources

  1. [1] PDOOM ALERT 🚨 : ~48% of e2e LLM latency is prefill, ~52% is decode. Prefill itself breaks into 2 ops: — SemiAnalysis Twitter (2026-05-26)
  2. [2] [PDF] DistServe: Disaggregating Prefill and Decoding for Goodput ... — reactive:mlsys-2026-inference-systems
  3. [3] Towards High-Goodput LLM Serving with Prefill-decode Multiplexing — reactive:llm-inference-speed-market
  4. [4] 21K Output Tokens Per Second DeepSeek Inference on AMD Instinct MI300X GPUs with Expert Parallelism – Moreh — reactive:llm-inference-speed-market
  5. [5] Real-time LLM Inference on Standard GPUs: 3k tokens/s per request — reactive:llm-inference-speed-market
  6. [6] Some truly massive inference numbers here. — Rohan Paul Twitter (2026-05-28)
  7. [7] I had to test it myself to believe this unreal inference speed. — Rohan Paul Twitter (2026-05-29)
  8. [8] Building a single-kernel, latency-optimized LLM inference engine on AMD MI300X GPUs — reactive:llm-inference-speed-market
  9. [9] Kog Reaches 3.5× Breakthrough Inference Speed on AMD Instinct MI300X — reactive:llm-inference-speed-market
  10. [10] AMD MI300X monokernel hits 3,300 tokens per second | AI Weekly — reactive:llm-inference-speed-market
  11. [11] Real-time LLM Inference on Standard Datacenter GPUs ... - Devtalk — reactive:llm-inference-speed-market
  12. [12] Building a monokernel for LLM inference on AMD MI300X - Reddit — reactive:llm-inference-speed-market
  13. [13] 10x speed at a 20x to 50x price premium per token. We're about to find out exactly how much the enterprise market is wil… — SemiAnalysis Twitter (2026-05-31)
  14. [14] "Every 10 seconds, global token demand is around 31.7 billion in 2026. By 2030 its 1.27 trillion, a 40x increase." — Rohan Paul Twitter (2026-06-01)
  15. [15] Intel is aiming to debut a new AI data center chip before the year closes, comes with lower-cost memory and cooling tech… — Rohan Paul Twitter (2026-06-01)
  16. [16] This paper shows how LLMs can use shorter context more cheaply without losing much answer quality. — Rohan Paul Twitter (2026-05-29)
  17. [17] Most AI teams still buy inference like they are buying software from 1 vendor. — Rohan Paul Twitter (2026-05-28)
  18. [18] Kog Reaches 3.5x Breakthrough Inference Speed on AMD Instinct MI300X GPUs — reactive:llm-inference-speed-market
  19. [19] Real-time LLM Inference on Standard GPUs (3,000 tokens/s per request) — reactive:llm-inference-speed-market