DiffusionGemma: 4x faster text generation

DeepMind Blog · 2026-06-10

Google DeepMind releases DiffusionGemma, an experimental 26B Mixture-of-Experts open model that generates entire 256-token blocks simultaneously using text diffusion, achieving up to 4x faster inference than autoregressive models on local GPU hardware.

Extraction

Topics: text-diffusion, inference-speed, mixture-of-experts, open-source-llm, local-inference

Claims

DiffusionGemma generates tokens up to 4x faster than autoregressive models by shifting the decode bottleneck from memory bandwidth to compute.
The model achieves over 1000 tokens per second on a single NVIDIA H100 and over 700 tokens per second on an RTX 5090.
DiffusionGemma is a 26B MoE model that activates only 3.8B parameters during inference, and it fits within 18 GB of VRAM when quantized.
Bi-directional attention across a 256-token generation block enables non-linear tasks like code infilling and in-line editing that autoregressive models struggle with.
DiffusionGemma's output quality is lower than standard Gemma 4, making it unsuitable for production applications requiring maximum quality.
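
The VRAM claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below estimates the weight-only footprint of a 26B-parameter model at several bit widths. The source does not say which quantization format is used, so the bit widths are assumptions, and the estimate ignores the KV cache, activations, and any per-format overhead. It also assumes all 26B parameters stay resident in VRAM, which is typical for MoE models even though only 3.8B are active per token.

```python
TOTAL_PARAMS = 26e9  # total parameter count, from the claims above
BUDGET_GB = 18       # VRAM budget, from the claims above


def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight-only footprint in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


# Bit widths are illustrative assumptions, not values stated in the source.
for bits in (16, 8, 4):
    size = weight_gb(TOTAL_PARAMS, bits)
    verdict = "fits" if size <= BUDGET_GB else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GB -> {verdict} in {BUDGET_GB} GB")
```

Under these assumptions only a roughly 4-bit format (about 13 GB of weights) leaves headroom inside an 18 GB budget.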

Key quotes

DiffusionGemma generates up to 4x faster token output on dedicated GPUs. (1000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on NVIDIA GeForce RTX 5090).
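
To put the quoted throughput figures in terms of latency, the sketch below converts tokens per second into the time needed to produce one 256-token block. It treats the quoted "1000+" and "700+" figures as exact rates, which is a simplifying assumption, so the results are rough upper bounds on block time rather than measured latencies.

```python
BLOCK_TOKENS = 256  # block size, from the source

# Quoted throughput floors, treated here as exact rates (an assumption).
THROUGHPUT_TOK_PER_S = {
    "NVIDIA H100": 1000,
    "NVIDIA GeForce RTX 5090": 700,
}


def block_seconds(tokens_per_second: float, block_tokens: int = BLOCK_TOKENS) -> float:
    """Seconds needed to emit one block at a steady token rate."""
    return block_tokens / tokens_per_second


for gpu, rate in THROUGHPUT_TOK_PER_S.items():
    print(f"{gpu}: about {block_seconds(rate):.3f} s per {BLOCK_TOKENS}-token block")
```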

Instead of predicting words sequentially, it drafts an entire 256-token paragraph simultaneously. By giving the computer's processor a larger chunk of work at once, DiffusionGemma utilizes your hardware to its full potential.
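
One way to picture the difference between sequential prediction and drafting a whole block at once is through the attention mask. The sketch below builds a standard causal mask and a hypothetical block-bidirectional mask in plain Python. The layout is an illustrative assumption (a causally attended prompt followed by one generation block whose positions see the prompt and each other) and is not confirmed as DiffusionGemma's actual implementation. A tiny block length stands in for 256 so the output stays readable.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Position i may attend to position j only when j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def block_bidirectional_mask(prompt_len: int, block_len: int) -> list[list[bool]]:
    """Hypothetical layout: causal prompt, fully visible generation block."""
    n = prompt_len + block_len
    mask = []
    for i in range(n):
        if i < prompt_len:
            # Prompt positions stay causal (an assumption of this sketch).
            mask.append([j <= i for j in range(n)])
        else:
            # Block positions see the whole prompt and every block position,
            # including positions to their right.
            mask.append([True] * n)
    return mask


def show(mask: list[list[bool]]) -> str:
    """Render a mask as a grid: '#' means allowed, '.' means masked out."""
    return "\n".join("".join("#" if cell else "." for cell in row) for row in mask)


print("causal:")
print(show(causal_mask(5)))
print("block-bidirectional (prompt=2, block=3):")
print(show(block_bidirectional_mask(2, 3)))
```

In the second grid, every row belonging to the block is fully filled, so a position in the middle of the block can condition on text that comes after it. That property is what makes infilling and in-line editing natural under this kind of mask.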

DiffusionGemma's overall output quality is lower than standard Gemma 4. For applications that demand maximum quality, we recommend deploying standard Gemma 4.