😺 A free tool just broke Meta's guardrails
The Neuron · Eric Gerard Ruiz · 2026-05-26
A Financial Times investigation found that Heretic, a free GitHub tool, stripped the safety guardrails from Meta's Llama 3.3 in under 10 minutes using a technique called 'abliteration.' The tool has already been used to create more than 3,500 decensored model variants, which have been downloaded 13 million times.
Extraction
Topics: ai-safety, open-source-models, safety-bypass, llm-guardrails, dual-use-ai
Claims
- A free GitHub tool called Heretic can remove safety filters from Meta's Llama 3.3 in under 10 minutes on a regular laptop with no special hardware.
- The abliteration technique has already been used to produce over 3,500 decensored model versions downloaded 13 million times.
- The abliteration technique works only on open-weight models, because it requires direct access to the underlying model weights.
- A Nature Communications study found reasoning-capable AI models could autonomously persuade other models to produce harmful outputs with a 97% success rate across major commercial models.
- An ICLR 2026 paper reported up to a 99% bypass rate by surgically silencing internal model components responsible for refusals.
- Google's newer Gemma 4 model was bypassed within 90 minutes of its public release.
Key quotes
Safety stops being a locked door and becomes more like a sticker that determined users can peel off.
Meta and Google will tell you this is a known tradeoff of open-source AI, and that the benefits outweigh the risks. That argument holds right up until someone uses a 13-million-download tool to do something catastrophic.
The real question is whether governments start treating open-weight AI the way they treat other dual-use technologies, and whether that conversation moves faster than the next model release.