Some truly massive inference numbers here.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-05-28
Kog__AI achieves 3,000 tokens per second inference throughput on 8x AMD MI300X GPUs with a 2B parameter model using FP16 and no speculative decoding, far exceeding typical GPU decoding speeds.
Extraction
Topics: llm-inference, gpu-performance, amd-mi300x, nvidia-h200, hardware-benchmarks
Claims
- Kog__AI achieved 3,000 tokens per second on 8x AMD MI300X GPUs with a 2B model.
- Kog__AI achieved 2,100 tokens per second on 8x NVIDIA H200 GPUs with a 2B model.
- These results were obtained using FP16 precision with no speculative decoding.
- Typical GPU decoding speed for 2B–8B models on high-end hardware is around 100 tokens per second.
- Against a baseline of about 100 tokens per second, the Kog__AI results represent roughly a 21x (H200) to 30x (MI300X) improvement over typical decoding throughput.
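The improvement factor in the last claim can be checked with a short calculation. This is a minimal sketch, assuming a baseline of 100 tokens per second (the only baseline figure that survives in the quoted tweet, whose upper bound is truncated). The names `BASELINE_TOKENS_PER_S`, `reported`, and `speedup` are illustrative and do not come from the source.

```python
# Assumption: baseline of ~100 tokens/s, the lower figure quoted in the tweet.
BASELINE_TOKENS_PER_S = 100.0

# Throughput figures as reported in the tweet (2B model, FP16, no speculative decoding).
reported = {
    "8x AMD MI300X": 3000.0,
    "8x NVIDIA H200": 2100.0,
}


def speedup(tokens_per_s: float, baseline: float = BASELINE_TOKENS_PER_S) -> float:
    """Return how many times faster a reported rate is than the baseline."""
    return tokens_per_s / baseline


speedups = {hardware: speedup(rate) for hardware, rate in reported.items()}

for hardware, factor in speedups.items():
    print(f"{hardware}: {factor:.0f}x over baseline")

# Ratio between the two reported configurations.
mi300x_vs_h200 = reported["8x AMD MI300X"] / reported["8x NVIDIA H200"]
print(f"MI300X vs H200: {mi300x_vs_h200:.2f}x")
```

With these inputs the factors come out to 30x for the MI300X setup and 21x for the H200 setup. A higher baseline than 100 tokens per second would shrink both factors proportionally.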
Key quotes
Some truly massive inference numbers here.
@Kog__AI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding) with a 2B model.
For comparison, typical GPU decoding speed for 2B to 8B models on high-end GPUs is around 100 to [tokens/s]