So much possibilities for on-device small models.

Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-05-17

Google's Gemma 4 E2B small language model runs at approximately 40 tokens per second on iPhone 17 Pro via MLX optimization, delivering state-of-the-art coding and math performance with 128K context fully offline.


Appears in

Capable AI Models Running on Consumer Hardware

Extraction

Topics: on-device-ai, small-language-models, mobile-ai, apple-silicon

Claims

Google's Gemma 4 E2B model runs on iPhone 17 Pro at approximately 40 tokens per second using MLX optimization.
The model achieves state-of-the-art coding and math performance on mobile hardware with a 128K context window.
Gemma 4 E2B runs fully offline on iPhone 17 Pro with thinking mode enabled.
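
As a rough illustration of what the ~40 tokens per second figure means in practice, the sketch below estimates how long a response of a given length would take to generate. It assumes a constant decode rate and ignores prompt processing (prefill) time, which the source does not report. The function name and the example response lengths are hypothetical, chosen only for illustration.

```python
def decode_seconds(num_tokens: int, tokens_per_second: float = 40.0) -> float:
    """Estimate wall-clock seconds to generate num_tokens at a constant decode rate.

    The 40 tokens/s default is the approximate figure quoted in the source;
    real throughput will vary with prompt length, thermals, and settings.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


# Hypothetical response lengths, not taken from the source.
for n in (100, 500, 2000):
    print(f"{n} tokens -> {decode_seconds(n):.1f} s")
# prints:
# 100 tokens -> 2.5 s
# 500 tokens -> 12.5 s
# 2000 tokens -> 50.0 s
```

With thinking mode enabled, the reasoning tokens are generated at the same rate before the visible answer, so under this assumption a long reasoning trace adds proportionally to the wait.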

Key quotes

So much possibilities for on-device small models.

~40tk/s with MLX optimized for Apple Silicon. SOTA coding & math on mobile with 128K context. Fully offline with thinking mode.