Introducing Gemma 4 12B: a unified, encoder-free multimodal model

DeepMind Blog · 2026-06-09

Google DeepMind releases Gemma 4 12B, an open-weight multimodal model licensed under Apache 2.0. Its encoder-free architecture processes vision and audio inputs directly in the LLM backbone, and the model runs on consumer hardware with 16GB of VRAM or unified memory.

Appears in

Google I/O 2026: Gemini 3.5 and Agents-Everywhere Strategy

Extraction

Topics: multimodal-models, open-source-ai, edge-inference, google-gemma, encoder-free-architecture

Claims

Gemma 4 12B uses an encoder-free architecture where vision and audio inputs flow directly into the LLM backbone without separate encoders, reducing latency and memory usage.
The model achieves benchmark performance approaching the larger 26B MoE model at less than half the total memory footprint.
Gemma 4 12B is the first mid-sized Gemma model to support native audio inputs, processing raw audio signals without a dedicated encoder.
The model runs locally on consumer laptops with 16GB of VRAM or unified memory.
Gemma 4 models have crossed 150 million downloads across the developer community.
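
The 16GB claim above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only and is not taken from the announcement: it assumes "12B" means exactly 12 billion parameters, uses generic precisions (bf16, int8, int4) rather than any officially published checkpoint format, and counts only weight storage, ignoring the KV cache, activations, and runtime overhead. The names PARAMS, BYTES_PER_PARAM, weight_memory_gb, and fits_in_budget are hypothetical.

```python
# Rough weight-only memory estimate for a 12B-parameter model.
# Assumptions (not from the source): exactly 12e9 parameters, generic
# precisions, and no accounting for KV cache, activations, or overhead.

PARAMS = 12e9  # assumed parameter count for a "12B" model

# Bytes needed to store one parameter at each assumed precision.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return weight storage in GB (10^9 bytes) for the given precision."""
    return params * bytes_per_param / 1e9


def fits_in_budget(params: float, bytes_per_param: float, budget_gb: float = 16.0) -> bool:
    """Check whether the weights alone fit within the memory budget."""
    return weight_memory_gb(params, bytes_per_param) <= budget_gb


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        gb = weight_memory_gb(PARAMS, size)
        verdict = "fits" if fits_in_budget(PARAMS, size) else "does not fit"
        print(f"{name}: {gb:.1f} GB of weights, {verdict} in 16 GB")
```

Under these assumptions, unquantized bf16 weights alone would need about 24 GB, so fitting in 16GB would imply some form of reduced-precision weights. The source does not say which format is used.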

Key quotes

No multimodal encoders. The vision and audio inputs flow directly into the LLM backbone.

Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.

Thanks to the developer community, Gemma 4 models have now crossed 150 million downloads.