The Information Machine

Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec and 768GB of s…

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-05-24

An engineer runs Kimi K2.5, a one-trillion-parameter sparse model, on a single consumer RTX 3060 12GB GPU at over 4 tokens per second by offloading weights to 768GB of second-hand Intel Optane memory.

Extraction

Topics: large-language-models, inference-efficiency, sparse-models, consumer-hardware, model-offloading

Claims

  • The 1-trillion-parameter Kimi K2.5 model was successfully run on a single RTX 3060 12GB GPU.
  • 768GB of second-hand Intel Optane memory was used to store the model's weights outside of GPU VRAM.
  • The setup achieved over 4 tokens per second inference throughput.
  • Sparse model architecture makes it feasible to use unconventional, high-capacity memory tiers in place of GPU VRAM.
  • Intel Optane's high-bandwidth, low-latency characteristics relative to standard SSDs are key to making this approach viable.
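The claims above can be sanity-checked with a rough back-of-envelope calculation. The sketch below is illustrative only: the parameter total, the 768GB capacity, and the 4 tokens/sec figure come from the post, while the quantization width (4 bits per weight) and the number of parameters active per token (about 32 billion) are assumptions that the source does not state. The function names are hypothetical helpers, not part of any real tool.

```python
def weights_size_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


def required_bandwidth_gb_s(active_params: float,
                            bits_per_param: float,
                            tokens_per_sec: float) -> float:
    """Read bandwidth needed if every active weight is fetched once per token.

    This is a pessimistic estimate: any weights that stay resident in GPU
    VRAM would not need to be read from the offload tier at all.
    """
    bytes_per_token = active_params * bits_per_param / 8
    return tokens_per_sec * bytes_per_token / 1e9


# Figures taken from the post.
TOTAL_PARAMS = 1e12      # one trillion parameters
CAPACITY_GB = 768        # Optane capacity
TOKENS_PER_SEC = 4       # reported throughput ("over 4")

# ASSUMPTIONS (not stated in the source).
BITS_PER_PARAM = 4       # assumed 4-bit quantization
ACTIVE_PARAMS = 32e9     # assumed parameters activated per token

size_gb = weights_size_gb(TOTAL_PARAMS, BITS_PER_PARAM)
fits = size_gb <= CAPACITY_GB
bandwidth = required_bandwidth_gb_s(ACTIVE_PARAMS, BITS_PER_PARAM, TOKENS_PER_SEC)

print(f"weights: {size_gb:.0f} GB, fits in {CAPACITY_GB} GB: {fits}")
print(f"read bandwidth needed at {TOKENS_PER_SEC} tok/s: {bandwidth:.0f} GB/s")
```

Under these assumptions, the point of the sparse architecture is visible in the arithmetic: the per-token read volume scales with the active parameters rather than the full trillion, which is why a large but slower memory tier can plausibly keep up. Different quantization or active-parameter values would change the numbers proportionally.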

Key quotes

Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec and 768GB of second-hand Intel Optane memory.
What happened is that a sparse model met an unusual memory tier that could hold its enormous body while the GPU handled the [compute].