Some truly massive inference numbers here.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-05-28

Kog__AI reports 3,000 tokens per second of inference throughput on 8x AMD MI300X GPUs and 2,100 tokens per second on 8x NVIDIA H200 GPUs, using a 2B parameter model at FP16 with no speculative decoding, far exceeding typical GPU decoding speeds.

Appears in

Ultra-Low Latency LLM Inference: Benchmarks and Emerging Enterprise Pricing Tier

Extraction

Topics: llm-inference, gpu-performance, amd-mi300x, nvidia-h200, hardware-benchmarks

Claims

Kog__AI achieved 3,000 tokens per second on 8x AMD MI300X GPUs with a 2B model.
Kog__AI achieved 2,100 tokens per second on 8x NVIDIA H200 GPUs with a 2B model.
These results were obtained using FP16 precision with no speculative decoding.
Typical GPU decoding speed for 2B–8B models on high-end hardware is around 100 tokens per second.
Measured against a baseline of about 100 tokens per second, the Kog__AI results represent roughly a 21x (H200) to 30x (MI300X) improvement over typical decoding throughput.
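
The improvement factor in the claims above is plain division, and a short sketch makes the arithmetic explicit. It assumes a baseline of 100 tokens per second, the only baseline figure that survives in the source, since the quoted range is cut off. The per-token time is an illustration only: it assumes the reported throughput applies to a single decoding stream, which the source does not state. The names `speedup` and `ms_per_token` are hypothetical helpers, not part of any Kog__AI tooling.

```python
# Throughput figures as reported in the claims (tokens per second).
reported = {
    "8x AMD MI300X": 3000,
    "8x NVIDIA H200": 2100,
}

# Assumption: ~100 tokens/s baseline for typical decoding; the upper end
# of the quoted range is truncated in the source, so it is not used here.
BASELINE_TOKENS_PER_S = 100


def speedup(tokens_per_s, baseline=BASELINE_TOKENS_PER_S):
    """Ratio of a reported throughput to the assumed baseline."""
    return tokens_per_s / baseline


def ms_per_token(tokens_per_s):
    """Average milliseconds per token, assuming a single decoding stream."""
    return 1000.0 / tokens_per_s


for name, tps in reported.items():
    print(f"{name}: {speedup(tps):.0f}x baseline, {ms_per_token(tps):.2f} ms/token")
# 8x AMD MI300X: 30x baseline, 0.33 ms/token
# 8x NVIDIA H200: 21x baseline, 0.48 ms/token
```

A higher assumed baseline would shrink both ratios proportionally, so the 21x to 30x range holds only for the 100 tokens per second figure.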

Key quotes

Some truly massive inference numbers here.

@Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding) with a 2B model.

For comparison, typical GPU decoding speed for 2B to 8B models on high-end GPUs is around 100 to [tokens/s]