Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec and 768GB of s…

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-05-24

An engineer runs Kimi K2.5, a one-trillion-parameter sparse model, on a single consumer RTX 3060 12GB GPU at over 4 tokens per second by offloading weights to 768GB of second-hand Intel Optane memory.

Extraction

Topics: large-language-models, inference-efficiency, sparse-models, consumer-hardware, model-offloading

Claims

The 1-trillion-parameter Kimi K2.5 model was successfully run on a single RTX 3060 12GB GPU.
768GB of second-hand Intel Optane memory was used to store the model's weights outside of GPU VRAM.
The setup achieved over 4 tokens per second inference throughput.
Sparse model architecture makes it feasible to use unconventional, high-capacity memory tiers in place of GPU VRAM.
Intel Optane's higher bandwidth and lower latency compared with standard SSDs are key to making this approach viable.
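
The sparsity claim can be checked with rough arithmetic. The sketch below estimates the size of the full weight set and the read bandwidth the offload tier would need at 4 tokens per second. Only the 1-trillion-parameter total, the 768GB capacity, and the 4 tokens/sec figure come from the source. The active-parameter count (32 billion per token) and the 4-bit weight precision are assumptions made for illustration, as is the simplification that every active weight is read from the offload tier once per token.

```python
def weight_footprint_gb(total_params: float, bits_per_weight: float) -> float:
    """Size of all model weights in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


def required_bandwidth_gb_per_s(params_read_per_token: float,
                                bits_per_weight: float,
                                tokens_per_sec: float) -> float:
    """Read bandwidth needed if each listed weight is fetched once per token."""
    return params_read_per_token * bits_per_weight / 8 * tokens_per_sec / 1e9


TOTAL_PARAMS = 1e12      # from the source: one trillion parameters
OPTANE_CAPACITY_GB = 768 # from the source
TOKENS_PER_SEC = 4.0     # from the source: "over 4 tokens/sec"
ACTIVE_PARAMS = 32e9     # ASSUMPTION: parameters activated per token
BITS_PER_WEIGHT = 4      # ASSUMPTION: 4-bit quantized weights

footprint_gb = weight_footprint_gb(TOTAL_PARAMS, BITS_PER_WEIGHT)
fits_in_optane = footprint_gb <= OPTANE_CAPACITY_GB

# Sparse case: only the active weights are read for each token.
sparse_bw = required_bandwidth_gb_per_s(ACTIVE_PARAMS, BITS_PER_WEIGHT, TOKENS_PER_SEC)
# Dense comparison: every weight would be read for each token.
dense_bw = required_bandwidth_gb_per_s(TOTAL_PARAMS, BITS_PER_WEIGHT, TOKENS_PER_SEC)

print(f"weight footprint: {footprint_gb:.0f} GB (fits in Optane: {fits_in_optane})")
print(f"bandwidth needed, sparse: {sparse_bw:.0f} GB/s")
print(f"bandwidth needed, dense:  {dense_bw:.0f} GB/s")
```

Under these assumptions the weights occupy about 500 GB, and the sparse model needs roughly 64 GB/s of reads versus 2,000 GB/s for a dense model of the same size. The sparse figure is an upper bound for this setup if some frequently used weights stay resident in GPU VRAM.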

Key quotes

Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec and 768GB of second-hand Intel Optane memory.

What happened is that a sparse model met an unusual memory tier that could hold its enormous body while the GPU handled the [compute].