The Information Machine

😺 A free tool just broke Meta's guardrails

The Neuron · Eric Gerard Ruiz · 2026-05-26

A Financial Times investigation found that Heretic, a free tool hosted on GitHub, stripped the safety guardrails from Meta's Llama 3.3 in under 10 minutes using a technique called 'abliteration.' The tool has already been used to create more than 3,500 decensored model variants, which have been downloaded 13 million times.

Extraction

Topics: ai-safety, open-source-models, safety-bypass, llm-guardrails, dual-use-ai

Claims

  • A free GitHub tool called Heretic can remove safety filters from Meta's Llama 3.3 in under 10 minutes on a regular laptop with no special hardware.
  • The abliteration technique has already been used to produce over 3,500 decensored model versions downloaded 13 million times.
  • The abliteration technique works only on open-source (open-weight) models because it requires access to the underlying model weights.
  • A Nature Communications study found reasoning-capable AI models could autonomously persuade other models to produce harmful outputs with a 97% success rate across major commercial models.
  • An ICLR 2026 paper reported up to a 99% bypass rate by surgically silencing internal model components responsible for refusals.
  • Google's newer Gemma 4 model was bypassed within 90 minutes of its public release.

Key quotes

Safety stops being a locked door and becomes more like a sticker that determined users can peel off.
Meta and Google will tell you this is a known tradeoff of open-source AI, and that the benefits outweigh the risks. That argument holds right up until someone uses a 13-million-download tool to do something catastrophic.
The real question is whether governments start treating open-weight AI the way they treat other dual-use technologies, and whether that conversation moves faster than the next model release.