😺 A free tool just broke Meta's guardrails

The Neuron · Eric Gerard Ruiz · 2026-05-26

A Financial Times investigation found that a free GitHub tool called Heretic stripped safety guardrails from Meta's Llama 3.3 in under 10 minutes using a technique called 'abliteration,' with the tool already used to create over 3,500 decensored model variants downloaded 13 million times.

Appears in

US AI Regulation: Federal Retreat vs. State Intervention

Extraction

Topics: ai-safety, open-source-models, safety-bypass, llm-guardrails, dual-use-ai

Claims

A free GitHub tool called Heretic can remove safety filters from Meta's Llama 3.3 in under 10 minutes on a regular laptop with no special hardware.
The abliteration technique has already been used to produce over 3,500 decensored model versions downloaded 13 million times.
The abliteration technique works only on open-source (open-weight) models because it requires access to the underlying model weights.
A Nature Communications study found reasoning-capable AI models could autonomously persuade other models to produce harmful outputs with a 97% success rate across major commercial models.
An ICLR 2026 paper reported up to a 99% bypass rate by surgically silencing internal model components responsible for refusals.
Google's newer Gemma 4 model was bypassed within 90 minutes of its public release.

Key quotes

Safety stops being a locked door and becomes more like a sticker that determined users can peel off.

Meta and Google will tell you this is a known tradeoff of open-source AI, and that the benefits outweigh the risks. That argument holds right up until someone uses a 13-million-download tool to do something catastrophic.

The real question is whether governments start treating open-weight AI the way they treat other dual-use technologies, and whether that conversation moves faster than the next model release.