Capable AI Models Running on Consumer Hardware
Version 1
2026-05-25 04:08 UTC · 68 items
What
A wave of demonstrations in May 2026 shows capable AI models running locally on consumer hardware through novel memory and inference tricks.
- A 1-trillion-parameter model (Kimi K2.5, a sparse MoE) was run on a single consumer RTX 3060 12GB GPU at over 4 tokens/sec by offloading weights to 768GB of second-hand Intel Optane memory. [1]
- Google's Gemma 4 E2B achieves roughly 40 tokens/sec on an iPhone 17 Pro using MLX optimization, fully offline with a 128K context window and thinking mode. [7]
- Multi-Token Prediction (MTP) on atomic.chat pushed a dense Qwen 27B from 51 to 117 tokens/sec on consumer hardware; an MoE 35B-A3B went from 218 to 267 tokens/sec on dual RTX 5090s. [8]
- The Kimi K2.5/Optane result went viral on May 24, drawing dozens of retweets within hours. [15][16][17][18][19]
Why it matters
These results reframe what 'consumer hardware' can mean for AI: sparse model architectures combined with unusual but affordable memory tiers (repurposed Optane, high-bandwidth LPDDR on Apple Silicon) are unlocking models at scales previously reserved for data-center inference. If these techniques generalize, the economics and privacy calculus of AI deployment shift substantially away from cloud dependency.
Open questions
Is the RTX 3060 + 768GB Optane setup practical for real use, or is it a proof-of-concept? Intel Optane is a discontinued, specialized product, raising questions about long-term reproducibility. [1][2]
Will Multi-Token Prediction gains demonstrated on Qwen models [8] transfer broadly across model families, and will tooling like atomic.chat bring MTP to mainstream local inference stacks?
How much of Gemma 4 E2B's iPhone 17 Pro performance [7] is gated on the latest Apple Silicon, versus being accessible on older or lower-cost devices?
As model weights grow larger, what are the practical upper limits of CPU/SSD/Optane offloading strategies before throughput becomes too slow for interactive use?
Narrative
The week of May 24, 2026 produced what became the most-shared local AI result in recent memory: an anonymous experimenter successfully ran Kimi K2.5 — Moonshot AI's 1-trillion-parameter Mixture-of-Experts model — on a single consumer RTX 3060 GPU with just 12GB of VRAM. The trick was pairing the GPU with 768GB of second-hand Intel Optane persistent memory modules to hold the model's enormous weight set. Optane's higher bandwidth and lower latency compared to standard SSDs made weight fetching viable, while the MoE architecture meant only a small fraction of the model's parameters needed to be active at any given time. The result: over 4 tokens per second — slow but usable for non-interactive tasks. [1][2] The story exploded across social media within hours, with dozens of accounts retweeting Rohan Paul's summary post. [3][4][5][6]
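Why sparsity makes the offloading approach viable can be sketched with back-of-the-envelope arithmetic. The snippet below is a rough bound, not a model of the actual experiment: only the 1-trillion parameter count and the 768GB capacity come from the reporting, while the quantization width, the active-parameter fraction, and the memory bandwidth are illustrative assumptions chosen to show the shape of the calculation.

```python
def weights_footprint_gb(total_params_billion, bytes_per_param):
    """Size of the full weight set in GB (1e9 params x bytes/param = GB)."""
    return total_params_billion * bytes_per_param


def max_tokens_per_sec(total_params_billion, active_fraction,
                       bytes_per_param, bandwidth_gb_per_sec):
    """Upper bound on decode speed if each token streams its active weights once."""
    gb_read_per_token = total_params_billion * active_fraction * bytes_per_param
    return bandwidth_gb_per_sec / gb_read_per_token


TOTAL_PARAMS_B = 1000       # 1 trillion parameters (from the reporting)
OPTANE_CAPACITY_GB = 768    # from the reporting
BYTES_PER_PARAM = 0.5       # ASSUMPTION: 4-bit quantized weights
ACTIVE_FRACTION = 0.03      # ASSUMPTION: share of parameters used per token
BANDWIDTH_GB_S = 100.0      # ASSUMPTION: placeholder read bandwidth of the offload tier

footprint = weights_footprint_gb(TOTAL_PARAMS_B, BYTES_PER_PARAM)
fits = footprint <= OPTANE_CAPACITY_GB

# Same hardware, with and without sparsity.
dense_bound = max_tokens_per_sec(TOTAL_PARAMS_B, 1.0, BYTES_PER_PARAM, BANDWIDTH_GB_S)
sparse_bound = max_tokens_per_sec(TOTAL_PARAMS_B, ACTIVE_FRACTION,
                                  BYTES_PER_PARAM, BANDWIDTH_GB_S)

print(f"weights: {footprint:.0f} GB, fits in {OPTANE_CAPACITY_GB} GB: {fits}")
print(f"dense bound:  {dense_bound:.2f} tok/s")
print(f"sparse bound: {sparse_bound:.2f} tok/s")
```

Whatever bandwidth is assumed, the two bounds differ by a factor of 1 / ACTIVE_FRACTION, which is the core of the argument: a dense model of this size would be bandwidth-starved on the same memory tier. The footprint check also shows that, under these assumptions, the weights would need to be stored at well under one byte per parameter to fit in 768GB.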
This was not the only significant development in on-device inference this month. Earlier, on May 17, Rohan Paul documented Google's Gemma 4 E2B running on an iPhone 17 Pro at approximately 40 tokens per second using Apple's MLX framework, fully offline, with a 128K context window and thinking mode enabled. [7] The result positioned Apple Silicon as a first-class platform for small but capable on-device models, not just toy demos. On May 21, atomic.chat demonstrated Multi-Token Prediction (MTP) boosting a locally hosted dense Qwen 27B model from 51 to 117 tokens per second, roughly a 2.3x gain, while a 35B-A3B MoE variant climbed from 218 to 267 tokens per second (about 1.2x) on dual RTX 5090 GPUs, all without additional hardware. [8]
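The two MTP results differ considerably in relative terms, which is easy to miss when reading raw tokens-per-second figures. A small calculation using only the numbers reported in [8] (the helper name `speedup` is just for illustration):

```python
def speedup(before_tok_s, after_tok_s):
    """Relative throughput gain, e.g. 2.0 means twice as fast."""
    return after_tok_s / before_tok_s


# (before, after) tokens/sec as reported in [8]
reported = {
    "Qwen 27B (dense)": (51, 117),
    "35B-A3B (MoE)": (218, 267),
}

for name, (before, after) in reported.items():
    print(f"{name}: {speedup(before, after):.2f}x")
```

The dense model gains far more than the MoE model in relative terms. The source does not explain the gap, so any causal reading (for example, that the MoE model was already closer to a hardware limit) is speculation.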
Together, these results illustrate three distinct strategies for running capable models locally: weight offloading to high-capacity non-GPU memory tiers (the Optane approach), hardware-specific kernel optimization (MLX on Apple Silicon), and algorithmic tricks that predict multiple tokens per forward pass (MTP). Each addresses a different bottleneck — VRAM capacity, memory bandwidth efficiency, and raw throughput — and each is being pursued independently by different parts of the hobbyist and open-source inference community.
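To make the third strategy concrete, here is a toy sketch of draft-and-verify decoding, the general pattern that multi-token approaches commonly rely on: extra predictions are proposed cheaply, then checked against what the full model would have produced, and only the matching prefix is kept. Everything in it is hypothetical. `target_next` and `draft_next_k` are stand-ins invented for illustration, and the source does not describe how atomic.chat implements MTP.

```python
def target_next(context):
    """Stand-in for the full model's greedy next token (toy rule, not a real model)."""
    return (sum(context) + 1) % 7


def draft_next_k(context, k):
    """Stand-in for multi-token heads: proposes k tokens, some deliberately wrong."""
    guesses, ctx = [], list(context)
    for _ in range(k):
        tok = target_next(ctx)
        if len(ctx) % 4 == 3:          # inject an occasional wrong guess
            tok = (tok + 1) % 7
        guesses.append(tok)
        ctx.append(tok)
    return guesses


def decode(prompt, n_tokens, k):
    """Generate n_tokens, counting verification passes (one per batch of k guesses)."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < n_tokens:
        guesses = draft_next_k(out, k)
        passes += 1                    # one pass checks the whole batch of guesses
        for guess in guesses:
            correct = target_next(out)
            out.append(correct)        # always keep the verified token
            if guess != correct:
                break                  # first mismatch: discard the remaining guesses
    return out[:len(prompt) + n_tokens], passes


baseline_tokens, baseline_passes = decode([1, 2], 20, k=1)
mtp_tokens, mtp_passes = decode([1, 2], 20, k=4)
print(f"k=1: {baseline_passes} passes, k=4: {mtp_passes} passes")
print("identical output:", baseline_tokens == mtp_tokens)
```

The property worth noticing is that the output is identical for any `k`; only the number of passes changes. That is why this family of techniques is usually presented as a throughput optimization rather than a quality trade-off, with the realized gain depending on how often the proposed tokens are accepted.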
The broader context is a maturing ecosystem: multiple guides and hardware roundups published in early 2026 indicate that local LLM inference has moved from niche experiment to a topic mainstream enough to support dedicated buyer's guides and YouTube comparisons. [9][10][11][12][13] Privacy-first applications built on local inference, such as atomic.chat and a Replika competitor announced in April, are commercializing the capability for end users who want AI without cloud data exposure. [8][14]
Timeline
- 2026-04-26: Privacy-focused Replika competitor app launched, built on local on-device inference [14]
- 2026-05-02: Sentient OS, an on-device AI layer for personal computing, announced [41]
- 2026-05-17: Gemma 4 E2B demonstrated running at ~40 tok/s on iPhone 17 Pro via MLX, fully offline with 128K context and thinking mode [7]
- 2026-05-21: atomic.chat demonstrates Multi-Token Prediction boosting Qwen 27B from 51 to 117 tok/s; MoE 35B-A3B from 218 to 267 tok/s on dual RTX 5090 [8]
- 2026-05-24: Kimi K2.5 (1-trillion-parameter MoE) run on single RTX 3060 + 768GB Intel Optane at 4+ tok/s; story goes viral with dozens of retweets [1][3][4][6][15][16][17][18][19][21][39]
Perspectives
Rohan Paul (@rohanpaul_ai)
Consistent and enthusiastic advocate for on-device AI capabilities; treats each milestone as evidence of a broader trend toward powerful local inference becoming accessible to everyday hardware owners
Evolution: Consistent across all three major posts this cycle — Gemma on iPhone, MTP on Qwen, and Kimi K2.5 on Optane — each framed as opening new possibilities rather than edge cases
Social media amplifiers (retweeters of Kimi K2.5 story)
Broad surprise and enthusiasm; the story spread rapidly across accounts ranging from AI enthusiasts to crypto communities, suggesting the 'trillion-parameter model on a gaming GPU' framing resonated widely
Evolution: No prior baseline; initial reaction
atomic.chat (local inference application)
Focused on practical throughput improvements for privacy-conscious users; MTP framed as a concrete, deployable optimization rather than a research curiosity
Evolution: No prior baseline; initial demonstration
Tensions
- Proof-of-concept vs. practical utility: The RTX 3060 + Optane setup achieves over 4 tokens/sec on a 1T model, but Intel Optane is a discontinued enterprise product unavailable to most consumers, raising the question of whether this is a meaningful democratization milestone or an impressive but unreproducible hack. [1][2]
- What counts as 'consumer hardware': The Kimi K2.5 run uses a consumer GPU but enterprise-class persistent memory; the Gemma iPhone demo requires the latest iPhone model; MTP gains were shown on dual RTX 5090s (high-end but consumer-purchasable). The definition of 'consumer' is being stretched across the field. [1][7][8]
Sources
- [1] Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec and 768GB of s… — Rohan Paul Twitter (2026-05-24)
- [2] Kimi K2.5 runs on RTX 3060 with 768GB Intel Optane memory at 4 ... — reactive:consumer-hardware-inference
- [3] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [4] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [5] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [6] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [7] So much possibilities for on-device small models. — Rohan Paul Twitter (2026-05-17)
- [8] Another good news for local-LLM from atomic[.]chat, that runs 100% offline on your computer. — Rohan Paul Twitter (2026-05-21)
- [9] The Complete Guide to Running Local LLMs in 2026 — reactive:consumer-hardware-inference
- [10] Best Value GPUs for Local LLM Inference (2026) - The Kaitchup — reactive:consumer-hardware-inference
- [11] 7 Best GPU for LLM in 2026 (Including Local LLM Setups) - Fluence — reactive:consumer-hardware-inference
- [12] Best GPUs for Local AI & LLM in 2026: RTX 50 & Others - Hostrunway — reactive:consumer-hardware-inference
- [13] Best GPU for Local LLMs: 2026 Hardware Guide — reactive:consumer-hardware-inference
- [14] Show HN: I read Replika's privacy policy and then built a competitor — reactive:consumer-hardware-inference (2026-04-26)
- [15] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [16] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [17] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [18] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [19] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [20] 🚀 The Kimi K2.5 AI model is setting new benchmarks! Running on an RTX 3060, it shows how AI and crypto are merging. Inve... — reactive:consumer-hardware-inference (2026-05-24)
- [21] A groundbreaking experiment has demonstrated that the advanced AI model Kimi K2.5 can run on an RTX 3060 GPU with 768GB ... — reactive:consumer-hardware-inference (2026-05-24)
- [22] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [23] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [24] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [25] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [26] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [27] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [28] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [29] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [30] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [31] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [32] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [33] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [34] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [35] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [36] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [37] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [38] RT @hokanewscom: Kimi K2.5 Shock: Trillion-Param AI Runs on RTX 3060 Using 768GB Optane — reactive:consumer-hardware-inference (2026-05-24)
- [39] Kimi K2.5 Shock: Trillion-Param AI Runs on RTX 3060 Using 768GB Optane — reactive:consumer-hardware-inference (2026-05-24)
- [40] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
- [41] Show HN: Sentient OS – On-device intelligence layer for your entire digital life — reactive:consumer-hardware-inference (2026-05-02)