NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

NVIDIA Blog · Michael Fukuyama · 2026-06-10

Google DeepMind released DiffusionGemma, an open-weights 26B text generation model that uses diffusion instead of autoregressive decoding. It denoises up to 256 tokens in parallel per step and generates text up to 4x faster than comparable autoregressive models in local, single-user inference on NVIDIA GPUs.

Extraction

Topics: diffusion-language-models, local-ai-inference, open-source-models, gpu-acceleration, text-generation-architecture

Claims

DiffusionGemma generates text by denoising up to 256 tokens in parallel per step, unlike autoregressive models that predict one token at a time.
The model is built on Gemma 4's 26B mixture-of-experts architecture but activates only 3.8B parameters per inference step.
DiffusionGemma achieves approximately 1,000 tokens/sec on a single NVIDIA H100 and is roughly 4x faster than equivalent autoregressive models in single-user regimes.
Parallel token generation converts a traditionally memory-bound inference workload into a compute-bound one, aligning with NVIDIA GPU architectural strengths.
The model is released as open weights under the Apache 2.0 license with day-zero support in Hugging Face Transformers, vLLM, and Unsloth.
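
The contrast in the first claim can be illustrated with a toy sketch. It assumes a masked-diffusion-style schedule in which each step commits the most confident positions of a 256-token block. The source does not describe DiffusionGemma's actual denoising schedule, so `toy_denoiser`, `diffusion_decode`, the 16-step setting, and the confidence-based commit rule are all hypothetical stand-ins. The only point is the number of forward passes each approach needs.

```python
import random

MASK = None          # placeholder for a position that is still "noise"
BLOCK_SIZE = 256     # from the source: up to 256 tokens per block


def toy_denoiser(block, rng):
    """Hypothetical stand-in for the model.

    Proposes a (token_id, confidence) pair for every position in one pass.
    """
    return [(rng.randrange(1000), rng.random()) for _ in block]


def diffusion_decode(num_steps, rng):
    """Fill a fully masked block in `num_steps` parallel refinement passes."""
    block = [MASK] * BLOCK_SIZE
    forward_passes = 0
    for step in range(num_steps):
        proposals = toy_denoiser(block, rng)   # one pass covers the whole block
        forward_passes += 1
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        remaining_steps = num_steps - step
        # Commit an even share of the remaining positions (ceiling division).
        k = -(-len(masked) // remaining_steps)
        most_confident = sorted(
            masked, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            block[i] = proposals[i][0]
    return block, forward_passes


def autoregressive_decode(rng):
    """Baseline: one forward pass per generated token."""
    tokens, forward_passes = [], 0
    for _ in range(BLOCK_SIZE):
        tokens.append(rng.randrange(1000))
        forward_passes += 1
    return tokens, forward_passes


if __name__ == "__main__":
    rng = random.Random(0)
    _, diffusion_passes = diffusion_decode(num_steps=16, rng=rng)
    _, ar_passes = autoregressive_decode(rng)
    print(f"diffusion passes: {diffusion_passes}, autoregressive passes: {ar_passes}")
```

Each diffusion pass processes the whole block and so costs more than a single autoregressive pass. Fewer passes therefore do not translate one-to-one into wall-clock speedup.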

Key quotes

DiffusionGemma takes a different path. Built on the Gemma 4 26B mixture-of-experts architecture, it generates text the way diffusion models generate images: by starting from noise and refining a whole block of text at once.

Pulling a full 256-token block through the transformer in parallel is a compute-bound workload — exactly what NVIDIA GPUs are built for.
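
A back-of-the-envelope sketch of why block size shifts the workload from memory-bound toward compute-bound. It assumes roughly 2 FLOPs per active parameter per token and that each active weight is read from memory once per forward pass. It ignores attention, KV-cache traffic, and mixture-of-experts routing, which can pull in additional expert weights when many tokens share a pass. The function name and the 2-byte weight format are illustrative assumptions, not figures from the article.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """Approximate FLOPs per byte of weight traffic for one forward pass.

    Assumes ~2 FLOPs (multiply + add) per active parameter per token and
    that each active weight is read once per pass, so the parameter count
    cancels out of the ratio.
    """
    flops_per_param = 2.0 * tokens_per_pass
    return flops_per_param / bytes_per_param


if __name__ == "__main__":
    single = arithmetic_intensity(1)     # autoregressive, single user
    block = arithmetic_intensity(256)    # one 256-token block per pass
    print(f"1 token/pass:    {single:.0f} FLOP/byte")
    print(f"256 tokens/pass: {block:.0f} FLOP/byte")
```

Under these assumptions the same weight read is amortized over 256 tokens instead of one, so the ratio of compute to memory traffic rises by the block size. Whether a given GPU ends up fully compute-bound depends on its own compute-to-bandwidth ratio.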

DiffusionGemma delivers 1,000 tokens/sec on a single NVIDIA H100 Tensor Core GPU, 150 tokens/sec on NVIDIA DGX Spark.
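
To put the quoted throughput figures in terms of waiting time, the following sketch converts tokens/sec into seconds for a response of a given length. The 2,000-token response length is an arbitrary illustration, and the calculation assumes the quoted rates are sustained for the whole response.

```python
# Throughput figures quoted in the article (tokens per second).
QUOTED_THROUGHPUT = {
    "NVIDIA H100": 1000.0,
    "NVIDIA DGX Spark": 150.0,
}


def seconds_to_generate(num_tokens: int, tokens_per_sec: float) -> float:
    """Time to produce `num_tokens` at a constant generation rate."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return num_tokens / tokens_per_sec


if __name__ == "__main__":
    response_length = 2000  # hypothetical response size, for illustration only
    for device, rate in QUOTED_THROUGHPUT.items():
        seconds = seconds_to_generate(response_length, rate)
        print(f"{device}: {seconds:.1f} s for {response_length} tokens")
```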