The Information Machine

Capable AI Models Running on Consumer Hardware

Version 2

2026-05-25 11:13 UTC · 83 items

What

Capable AI models running on consumer and prosumer hardware have moved from viral demos to an expanding ecosystem of official platform support, new hardware contenders, and growing tooling.

  • A 1-trillion-parameter sparse MoE model (Kimi K2.5) was run on a single RTX 3060 + 768GB Intel Optane at 4+ tokens/sec, generating broad social media attention. [1]
  • Google's Gemma 4 E2B runs at ~40 tok/s on an iPhone 17 Pro via MLX, fully offline with 128K context. [8]
  • Multi-Token Prediction (MTP) more than doubled throughput on a locally hosted Qwen 27B model (51 to 117 tok/s), and is now documented for llama.cpp, vLLM-Ascend, and NVIDIA's own stack. [15][16][17][18]
  • Apple officially presented MLX-based LLM inference at WWDC25, with dedicated sessions and published research on the M5 GPU's neural accelerators. [9][10]
  • AMD's Strix Halo and Radeon 9700 AI Pro were demonstrated achieving 2x faster token generation, adding AMD as a hardware contender alongside NVIDIA and Apple Silicon. [20]

Why it matters

The story has shifted from individual hobbyist breakthroughs to a multi-front maturation: Apple is formally investing in on-device LLM infrastructure at the platform level, AMD hardware is posting competitive token-generation results, and algorithmic speedups like MTP are being documented across major inference frameworks. Taken together, these developments suggest that the economics and accessibility of capable local inference are improving faster than model sizes are growing.

Open questions

  • How broadly accessible is the RTX 3060 + Intel Optane result? Optane is a discontinued enterprise product, and a Cloudthrill deep-dive specifically examines whether KV-cache offloading to Optane is a practical technique or a dead end. [2]

  • Will the 2x token generation gains shown on AMD Strix Halo and the Radeon 9700 AI Pro [20] hold across diverse model sizes and architectures, positioning AMD as a genuine alternative to NVIDIA and Apple Silicon for local inference?

  • Apple's WWDC25 sessions and M5 GPU research [9][10] signal platform-level commitment to MLX — but a documented non-determinism issue in MLX [14] raises questions about reliability for production or research use.

  • As MTP support spreads across llama.cpp, vLLM-Ascend, and NVIDIA's stack [16][17][18], will gains demonstrated on Qwen models transfer uniformly to other model families, and what are the limits of the technique?

Narrative

The week of May 24, 2026 produced what became the most-shared local AI result in recent memory: an experimenter successfully ran Kimi K2.5 — Moonshot AI's 1-trillion-parameter Mixture-of-Experts model — on a single consumer RTX 3060 GPU with just 12GB of VRAM. The key was pairing the GPU with 768GB of second-hand Intel Optane persistent memory to hold the model's enormous weight set. Optane's higher bandwidth and lower latency compared to standard SSDs made weight fetching viable, while the MoE architecture meant only a small fraction of parameters needed to be active at any given time, yielding over 4 tokens per second. [1] A Cloudthrill technical article subsequently explored the specific mechanism of KV-cache offloading to Optane persistent memory, probing whether the technique is practically replicable or a one-off demonstration with a discontinued enterprise product. [2] The story spread rapidly across social media, with broad surprise at the 'trillion-parameter model on a gaming GPU' framing. [3][4][5][6][7]
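A rough way to see why sparsity matters in a setup like this is a bandwidth-bound estimate: if generating a token requires streaming only the active weights from memory, decode speed is at most the effective bandwidth divided by the bytes read per token. The sketch below is illustrative only. The active-parameter count, quantization width, and bandwidth figure are placeholder assumptions chosen for the arithmetic, not values reported in [1] or [2], and the estimate ignores compute, caching, and KV-cache traffic.

```python
def decode_tokens_per_sec(params_read_per_token: float,
                          bytes_per_param: float,
                          bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on decode speed when weight streaming is the bottleneck."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_sec / bytes_per_token


# Every number below is a hypothetical placeholder, not a measured value.
TOTAL_PARAMS = 1e12      # 1 trillion parameters (the headline figure)
ACTIVE_PARAMS = 32e9     # assumed parameters touched per token in a sparse MoE
BYTES_PER_PARAM = 0.5    # assumed 4-bit quantization
BANDWIDTH = 80e9         # assumed effective read bandwidth, bytes per second

# If every weight had to be read for every token (dense behaviour):
dense_bound = decode_tokens_per_sec(TOTAL_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

# If only the routed experts are read per token (sparse MoE behaviour):
sparse_bound = decode_tokens_per_sec(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)

if __name__ == "__main__":
    print(f"dense bound:  {dense_bound:.2f} tok/s")
    print(f"sparse bound: {sparse_bound:.2f} tok/s")
```

Under these assumed numbers the sparse bound is roughly 30 times the dense one, which is the general reason an MoE model can be usable from slow, large memory while a dense model of the same total size would not be.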

Earlier in May, Rohan Paul documented Google's Gemma 4 E2B running on an iPhone 17 Pro at approximately 40 tokens per second using Apple's MLX framework — fully offline, with a 128K context window and thinking mode enabled. [8] The result positioned Apple Silicon as a first-class platform for capable on-device inference. That framing has since been validated at the platform level: Apple dedicated a session at WWDC25 to exploring LLMs on Apple Silicon with MLX, and Apple ML Research published work on leveraging the M5 GPU's neural accelerators specifically for LLM inference. [9][10][11] Guides for fine-tuning models locally with MLX and optimization techniques for Apple Silicon have proliferated in parallel. [12][13] A notable caveat has also emerged: a documented non-determinism issue in MLX means that Apple Silicon LLM outputs are not fully reproducible across runs, a concern for research and production use cases. [14]
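One way to check the reproducibility concern on a given setup is to run the same prompt several times and compare output hashes. The harness below is a minimal sketch: `generate` is a hypothetical stand-in for whatever inference call is being tested (ideally configured for greedy decoding or a fixed seed), and no MLX API is used or implied. The two toy generators exist only to show the harness behaving as intended.

```python
import hashlib
import itertools
from typing import Callable, Dict


def reproducibility_report(generate: Callable[[str], str],
                           prompt: str,
                           runs: int = 5) -> Dict[str, object]:
    """Call generate(prompt) repeatedly and count distinct outputs by SHA-256."""
    digests = [
        hashlib.sha256(generate(prompt).encode("utf-8")).hexdigest()
        for _ in range(runs)
    ]
    distinct = len(set(digests))
    return {"runs": runs, "distinct_outputs": distinct, "reproducible": distinct == 1}


def stable_generate(prompt: str) -> str:
    """Toy generator that always returns the same text for the same prompt."""
    return prompt.upper()


_call_counter = itertools.count()


def flaky_generate(prompt: str) -> str:
    """Toy generator that alternates between two outputs on successive calls."""
    return f"{prompt} [variant {next(_call_counter) % 2}]"


if __name__ == "__main__":
    print(reproducibility_report(stable_generate, "hello"))
    print(reproducibility_report(flaky_generate, "hello"))
```

Hashing whole outputs only detects that runs differ; locating the first diverging token would need a token-level comparison.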

On the algorithmic side, Multi-Token Prediction (MTP), a technique that predicts several tokens per forward pass and so boosts throughput without hardware upgrades, has moved from a single atomic.chat demonstration into a maturing ecosystem. atomic.chat showed Qwen 27B jumping from 51 to 117 tokens/sec (about 2.3x) on a local machine with MTP enabled, and an MoE 35B-A3B variant climbing from 218 to 267 tok/s (about 1.2x) on dual RTX 5090s. [15] Since then, DataCamp published a tutorial on implementing MTP in llama.cpp, vLLM-Ascend documented MTP support in its inference stack, and NVIDIA posted its own technical overview of the technique. [16][17][18] Testing of Gemma 4 with MTP has also been published. [19]
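The reported throughput figures can be turned into speedup ratios directly, and a simple model helps explain why gains vary between models. The sketch below first computes the ratios from the numbers in [15], then applies the standard draft-and-verify expectation: if each of k drafted tokens is accepted independently with probability a, a verification pass emits (1 - a^(k+1)) / (1 - a) tokens on average. Treating MTP as draft-and-verify, the independence assumption, and the example values of a and k are all assumptions for illustration; the sources summarized here do not state acceptance rates.

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Ratio of throughput with the optimization to throughput without it."""
    return after_tok_s / before_tok_s


def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each of `draft_len` drafted tokens is accepted independently with
    probability `accept_prob`, drafting stops at the first rejection, and the
    pass always yields one token from the main model.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be in [0, 1]")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


# Figures reported in [15].
qwen_27b = speedup(51, 117)       # dense 27B model
moe_35b_a3b = speedup(218, 267)   # MoE 35B-A3B on dual RTX 5090

if __name__ == "__main__":
    print(f"Qwen 27B speedup:    {qwen_27b:.2f}x")
    print(f"MoE 35B-A3B speedup: {moe_35b_a3b:.2f}x")
    # Hypothetical acceptance rate and draft length, for illustration only.
    print(f"tokens per pass (a=0.8, k=2): {expected_tokens_per_pass(0.8, 2):.2f}")
```

The expected-tokens figure is an upper bound on speedup, since drafting and verification add their own cost per pass.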

New AMD hardware has also entered the picture: Strix Halo and the Radeon 9700 AI Pro were demonstrated achieving 2x faster token generation. [20] This positions AMD alongside NVIDIA GPUs and Apple Silicon as a credible option for local inference, expanding the competitive landscape beyond the two platforms that have dominated discussion. Across all three hardware tracks (NVIDIA consumer GPUs with weight offloading, Apple Silicon with MLX, and now AMD), the through-line is that innovations in hardware (unified memory), model architecture (sparse MoE), and software (MTP, kernel optimization) are combining to push capable inference onto hardware that was not designed for it.

Timeline

  • 2026-04-26: Privacy-focused Replika competitor app launched, built on local on-device inference [48]
  • 2026-05-02: Sentient OS, an on-device AI layer for personal computing, announced [49]
  • 2026-05-17: Gemma 4 E2B demonstrated running at ~40 tok/s on iPhone 17 Pro via MLX, fully offline with 128K context and thinking mode [8]
  • 2026-05-21: atomic.chat demonstrates Multi-Token Prediction boosting Qwen 27B from 51 to 117 tok/s; MoE 35B-A3B from 218 to 267 tok/s on dual RTX 5090 [15]
  • 2026-05-24: Kimi K2.5 (1-trillion-parameter MoE) run on a single RTX 3060 + 768GB Intel Optane at 4+ tok/s; story goes viral with dozens of retweets [1][3][4][5][6][7][24][27][33][44][46]
  • 2026-05-25: Apple's WWDC25 session on LLMs with MLX on Apple Silicon and Apple ML Research's work on M5 GPU neural accelerators for LLM inference are added to this thread; both predate this entry, as WWDC25 took place in June 2025 [9][10][21][22]
  • 2026-05-25: AMD Strix Halo and Radeon 9700 AI Pro demonstrated achieving 2x faster token generation [20]

Perspectives

Rohan Paul (@rohanpaul_ai)

Consistent and enthusiastic advocate for on-device AI capabilities; treats each milestone as evidence of a broader trend toward powerful local inference becoming accessible to everyday hardware owners

Evolution: Consistent across all three major posts — Gemma on iPhone, MTP on Qwen, and Kimi K2.5 on Optane — each framed as opening new possibilities rather than edge cases

Apple (platform / research)

Apple has moved from implicit support of MLX to explicit platform endorsement. A WWDC25 session dedicated to LLMs on Apple Silicon with MLX and published ML research on the M5 GPU's neural accelerators for LLM inference both signal that on-device inference is a strategic priority

Evolution: Escalated from background framework support to explicit developer-facing and research-facing commitment at WWDC25

Social media amplifiers (retweeters of Kimi K2.5 story)

Broad surprise and enthusiasm; the 'trillion-parameter model on a gaming GPU' framing resonated widely across AI enthusiasts and adjacent communities

Evolution: No prior baseline; initial reaction

atomic.chat (local inference application)

Focused on practical throughput improvements for privacy-conscious users; MTP framed as a concrete, deployable optimization rather than a research curiosity

Evolution: Consistent; MTP technique now validated by NVIDIA and integrated into mainstream inference frameworks independently of atomic.chat's initial demonstration

NVIDIA (infrastructure / documentation)

Published its own technical overview of MTP for LLM inference, signaling the technique has moved from hobbyist experimentation to vendor-supported optimization

Evolution: No prior baseline in this thread; first appearance

Aditya Karnam (MLX critic)

Raises a practical reliability concern: MLX inference on Apple Silicon is non-deterministic, meaning outputs vary across runs — a problem for reproducibility in research and production settings

Evolution: No prior baseline; first appearance as a dissenting voice in an otherwise positive MLX narrative

Tensions

  • Proof-of-concept vs. practical utility: The RTX 3060 + Optane setup achieves 4+ tokens/sec on a 1T model, but Intel Optane is a discontinued enterprise product unavailable to most consumers, and a Cloudthrill technical article directly examines whether the technique is a meaningful advance or an impressive but unreproducible hack. [1][2]
  • What counts as 'consumer hardware': The Kimi K2.5 run uses a consumer GPU but enterprise-class persistent memory; the Gemma iPhone demo requires the latest iPhone model; the AMD result includes a card branded 'AI Pro'; MTP gains on Qwen were shown on dual RTX 5090s. The definition of 'consumer' is being stretched across the field. [1][8][15][20]
  • MLX reliability: Apple and the broader community promote MLX as the path to capable on-device inference on Apple Silicon, but a documented non-determinism issue means outputs are not reproducible across runs, a tension between the platform's marketing and its fitness for rigorous use. [9][14]

Sources

  1. [1] Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec and 768GB of s… — Rohan Paul Twitter (2026-05-24)
  2. [2] The link That Never Was: Intel Optane PMem + LLM KV Cache Offload - Cloudthrill — reactive:consumer-hardware-inference
  3. [3] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  4. [4] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  5. [5] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  6. [6] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  7. [7] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  8. [8] So much possibilities for on-device small models. — Rohan Paul Twitter (2026-05-17)
  9. [9] Explore large language models on Apple silicon with MLX - WWDC25 — reactive:consumer-hardware-inference
  10. [10] Exploring LLMs with MLX and the Neural Accelerators in the M5 GPU — reactive:consumer-hardware-inference
  11. [11] Native LLM and MLLM Inference at Scale on Apple Silicon - arXiv — reactive:consumer-hardware-inference
  12. [12] Fine-Tuning LLMs Locally Using MLX LM: A Comprehensive Guide — reactive:consumer-hardware-inference
  13. [13] Apple Silicon Optimization Guide : r/LocalLLM - Reddit — reactive:consumer-hardware-inference
  14. [14] The Hidden Problem With MLX: Why Your Apple Silicon LLM Isn't Reproducible | Aditya Karnam — reactive:consumer-hardware-inference
  15. [15] Another good news for local-LLM from atomic[.]chat, that runs 100% offline on your computer. — Rohan Paul Twitter (2026-05-21)
  16. [16] Multi-Token Prediction Tutorial: How To Speed Up LLMs | DataCamp — reactive:consumer-hardware-inference
  17. [17] Multi Token Prediction (MTP) — vllm-ascend — reactive:consumer-hardware-inference
  18. [18] Unlock faster LLM inference with MTP (Multiple Token Prediction) In ... — reactive:consumer-hardware-inference
  19. [19] I Spent 3 Nights Testing Gemma 4 (MTP) — reactive:consumer-hardware-inference
  20. [20] 2x Faster Token Generation on AMD Strix Halo & Radeon 9700 AI Pro — reactive:consumer-hardware-inference
  21. [21] WWDC25: Explore large language models on Apple silicon with MLX — reactive:consumer-hardware-inference
  22. [22] WWDC 2025 - Explore LLM on Apple silicon with MLX - DEV Community — reactive:consumer-hardware-inference
  23. [23] 🚀 The Kimi K2.5 AI model is setting new benchmarks! Running on an RTX 3060, it shows how AI and crypto are merging. Inve... — reactive:consumer-hardware-inference (2026-05-24)
  24. [24] A groundbreaking experiment has demonstrated that the advanced AI model Kimi K2.5 can run on an RTX 3060 GPU with 768GB ... — reactive:consumer-hardware-inference (2026-05-24)
  25. [25] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  26. [26] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  27. [27] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  28. [28] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  29. [29] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  30. [30] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  31. [31] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  32. [32] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  33. [33] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  34. [34] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  35. [35] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  36. [36] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  37. [37] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  38. [38] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  39. [39] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  40. [40] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  41. [41] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  42. [42] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  43. [43] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  44. [44] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  45. [45] RT @hokanewscom: Kimi K2.5 Shock: Trillion-Param AI Runs on RTX 3060 Using 768GB Optane — reactive:consumer-hardware-inference (2026-05-24)
  46. [46] Kimi K2.5 Shock: Trillion-Param AI Runs on RTX 3060 Using 768GB Optane — reactive:consumer-hardware-inference (2026-05-24)
  47. [47] RT @rohanpaul_ai: Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/... — reactive:consumer-hardware-inference (2026-05-24)
  48. [48] Show HN: I read Replika's privacy policy and then built a competitor — reactive:consumer-hardware-inference (2026-04-26)
  49. [49] Show HN: Sentient OS – On-device intelligence layer for your entire digital life — reactive:consumer-hardware-inference (2026-05-02)