Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec and 768GB of s…
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-05-24
An engineer runs Kimi K2.5, a one-trillion-parameter sparse model, on a single consumer RTX 3060 12GB GPU at over 4 tokens per second by offloading weights to 768GB of second-hand Intel Optane memory.
Extraction
Topics: large-language-models, inference-efficiency, sparse-models, consumer-hardware, model-offloading
Claims
- The 1-trillion-parameter Kimi K2.5 model was successfully run on a single RTX 3060 12GB GPU.
- 768GB of second-hand Intel Optane memory was used to store the model's weights outside of GPU VRAM.
- The setup achieved over 4 tokens per second inference throughput.
- Because a sparse model activates only a fraction of its parameters for each token, its weights can live in an unconventional, high-capacity memory tier instead of GPU VRAM.
- Intel Optane's higher bandwidth and lower latency compared with standard SSDs are key to making this approach viable.
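
The claims above imply some simple arithmetic, sketched here in Python as a rough back-of-envelope check rather than a description of the engineer's actual setup. The parameter count, memory size, and token rate come from the post; the active-parameter count and weight precision are hypothetical placeholders (the post states neither), and the sketch assumes decimal gigabytes and that every active weight is read from offloaded memory once per token.

```python
def max_bits_per_param(total_params: float, capacity_bytes: float) -> float:
    """Largest average weight width, in bits, that still fits in the given capacity."""
    return capacity_bytes * 8 / total_params


def required_read_bandwidth(active_params: float, bits_per_param: float,
                            tokens_per_sec: float) -> float:
    """Bytes per second to stream if each active weight is read once per token."""
    bytes_per_token = active_params * bits_per_param / 8
    return bytes_per_token * tokens_per_sec


# Figures taken from the post.
TOTAL_PARAMS = 1e12      # one trillion parameters
OPTANE_BYTES = 768e9     # 768GB, treated as decimal gigabytes (assumption)
TOKENS_PER_SEC = 4.0     # "over 4 tokens/sec"

# Hypothetical inputs, NOT stated in the post: change these to explore.
ACTIVE_PARAMS = 32e9     # assumed parameters activated per token
BITS_PER_PARAM = 4.0     # assumed quantized weight width

bits_budget = max_bits_per_param(TOTAL_PARAMS, OPTANE_BYTES)
bandwidth = required_read_bandwidth(ACTIVE_PARAMS, BITS_PER_PARAM, TOKENS_PER_SEC)

print(f"Average bits per parameter that fit in Optane: {bits_budget:.2f}")
print(f"Read bandwidth needed under these assumptions: {bandwidth / 1e9:.0f} GB/s")
```

The first number depends only on figures from the post: all one trillion weights fit in 768GB only if they average roughly 6 bits or fewer each, which points to some form of quantization. The second number is an upper-bound illustration, since any weights kept resident in the GPU's 12GB of VRAM would not need to be streamed at all.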
Key quotes
Somebody just ran one trillion param model (Kimi K2.5) on a single RTX 3060 12GB GPU at over 4 tokens/sec and 768GB of second-hand Intel Optane memory.
What happened is that a sparse model met an unusual memory tier that could hold its enormous body while the GPU handled the [compute].