So much possibilities for on-device small models.
Rohan Paul Twitter · Rohan Paul (@rohanpaul_ai) · 2026-05-17
Google's Gemma 4 E2B small language model runs at approximately 40 tokens per second on iPhone 17 Pro using MLX optimized for Apple Silicon. The post describes its coding and math performance as state-of-the-art for mobile, with a 128K context window, running fully offline with thinking mode enabled.
Extraction
Topics: on-device-ai, small-language-models, mobile-ai, apple-silicon
Claims
- Google's Gemma 4 E2B model runs on iPhone 17 Pro at approximately 40 tokens per second using MLX optimization.
- The model achieves state-of-the-art coding and math performance on mobile hardware with a 128K context window.
- Gemma 4 E2B runs fully offline on iPhone 17 Pro with thinking mode enabled.
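To put the roughly 40 tokens per second figure in perspective, here is a minimal arithmetic sketch of how long a response of a given length would take to generate. It assumes a steady decode rate and ignores prompt processing (prefill) time, which the post does not report. The function name `decode_seconds` and the sample token counts are illustrative, not taken from the post.

```python
def decode_seconds(num_tokens: int, tokens_per_second: float = 40.0) -> float:
    """Estimate seconds needed to generate num_tokens at a steady decode rate.

    The 40.0 default is the approximate figure quoted in the post; prefill
    time for the prompt is not included (assumption: decode-only estimate).
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return num_tokens / tokens_per_second


if __name__ == "__main__":
    # Illustrative response lengths, not measurements from the post.
    for n in (100, 500, 2000):
        print(f"{n:>5} tokens -> {decode_seconds(n):.1f} s")
```

Because thinking-mode output is also generated token by token, any reasoning tokens count toward the total, so a short final answer preceded by long reasoning would take correspondingly longer under this estimate.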
Key quotes
So much possibilities for on-device small models.
~40tk/s with MLX optimized for Apple Silicon SOTA coding & math on mobile with 128K context. Fully offline with thinking mode.