Google DeepMind releases DiffusionGemma, a model that runs local AI 4x faster

Ars Technica AI · Ryan Whitwam · 2026-06-10

Google DeepMind has released DiffusionGemma, a 26-billion-parameter diffusion-based language model in the open Gemma 4 family. It generates full blocks of text in parallel rather than token by token, reaching roughly four times the inference speed of comparable autoregressive models on consumer and datacenter GPUs.



Extraction

Topics: diffusion-language-models, google-deepmind, open-model-release, inference-speed, mixture-of-experts

Claims

DiffusionGemma generates text by iteratively denoising a field of placeholder tokens rather than producing tokens sequentially left to right.
The model uses a Mixture of Experts architecture with 26 billion total parameters, of which only 3.8 billion are active during inference, and it fits within 18 GB of GPU memory.
DiffusionGemma produces approximately 700 tokens per second on an RTX 5090 and over 1,000 tokens per second on an H100, roughly four times the throughput of similarly sized autoregressive Gemma models.
The parallel generation approach makes DiffusionGemma more practical than autoregressive alternatives of comparable size for local hardware deployment.
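
The first claim, iterative denoising of a field of placeholder tokens, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_denoiser`, `denoise`, the confidence rule, and the fixed `tokens_per_step` schedule are hypothetical and are not DiffusionGemma's actual method or API. The stand-in denoiser also cheats by reading the target text, since no real model is involved; the point is only the control flow, in which every pass scores all masked positions at once and commits several tokens per pass.

```python
MASK = None  # placeholder token on the canvas

def toy_denoiser(canvas, target):
    """Hypothetical stand-in for a denoising model.

    Returns {position: (token, confidence)} for every masked position.
    Confidence rises when neighbours are already filled, mimicking the
    idea that committed tokens improve the estimates of the others.
    """
    preds = {}
    last = len(canvas) - 1
    for i, tok in enumerate(canvas):
        if tok is not MASK:
            continue
        left_known = i == 0 or canvas[i - 1] is not MASK
        right_known = i == last or canvas[i + 1] is not MASK
        confidence = 0.4 + 0.3 * left_known + 0.3 * right_known
        preds[i] = (target[i], confidence)
    return preds

def denoise(target, tokens_per_step):
    """Fill a fully masked canvas, committing several tokens per pass."""
    canvas = [MASK] * len(target)
    steps = 0
    while any(tok is MASK for tok in canvas):
        preds = toy_denoiser(canvas, target)  # one pass scores all positions
        # Commit the most confident predictions; ties break left to right.
        ranked = sorted(preds, key=lambda i: (-preds[i][1], i))
        for i in ranked[:tokens_per_step]:
            canvas[i] = preds[i][0]
        steps += 1
    return canvas, steps

TARGET = "the quick brown fox jumps over the lazy dog".split()
canvas, steps = denoise(TARGET, tokens_per_step=3)
print(" ".join(canvas), "| passes:", steps, "| tokens:", len(TARGET))
```

With `tokens_per_step=1` the loop needs one pass per token, which is the sequential cost an autoregressive decoder pays; committing more tokens per pass is where the parallel approach saves passes in this toy setting.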

Key quotes

DiffusionGemma doesn't generate outputs linearly like most AI models. Instead, it can produce an entire block of text in parallel.

This model takes a field of placeholder tokens running over the canvas multiple times to generate likely tokens and using those to improve estimation of others.

That's about four times the output of the similarly sized autoregressive Gemma models.
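
The figures in the claims above support some back-of-the-envelope arithmetic. The sketch below uses only numbers stated in this summary; the derived values are implications under simple assumptions (decimal gigabytes, all 26 billion weights resident in the 18 GB, and "over 1,000" treated as a lower bound), not figures reported by Google or Ars Technica.

```python
TOTAL_PARAMS = 26e9        # total parameters
ACTIVE_PARAMS = 3.8e9      # parameters active during inference
GPU_MEMORY_BYTES = 18e9    # 18 GB, assuming decimal gigabytes
RTX_5090_TPS = 700         # approximate tokens per second
H100_TPS_MIN = 1000        # "over 1,000", used here as a lower bound
SPEEDUP = 4                # "about four times"

# Share of the weights used for inference compute.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Upper bound on average bits per weight if every weight sits in the
# 18 GB (ignores activations and other runtime buffers).
bits_per_param = GPU_MEMORY_BYTES * 8 / TOTAL_PARAMS

# Autoregressive throughput implied by the 4x comparison.
implied_baseline_5090 = RTX_5090_TPS / SPEEDUP
implied_baseline_h100_min = H100_TPS_MIN / SPEEDUP

print(f"active fraction: {active_fraction:.1%}")
print(f"bits per parameter (upper bound): {bits_per_param:.2f}")
print(f"implied autoregressive baseline, RTX 5090: {implied_baseline_5090:.0f} tok/s")
print(f"implied autoregressive baseline, H100: at least {implied_baseline_h100_min:.0f} tok/s")
```

Under these assumptions, fewer than 16 bits per weight on average would suggest the 18 GB figure refers to a quantized build, though the summary does not say so.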