The Information Machine

What happened after 2,000 people tried to hack my AI assistant

Simon Willison · 2026-06-26

Fernando Irarrázaval's public challenge on hackmyclaw.com found that 6,000 prompt injection attempts by roughly 2,000 participants failed to extract secrets from an AI assistant built on Claude Opus 4.6, suggesting frontier model training against prompt injection has grown meaningfully more effective.

Extraction

Topics: prompt-injection, ai-security, llm-robustness, frontier-models

Claims

  • 6,000 prompt injection attempts across ~2,000 participants failed to leak secrets from an OpenClaw instance running Claude Opus 4.6.
  • Frontier model labs have invested in training models to resist prompt injection attacks, and those efforts appear to be paying off.
  • 6,000 failed attempts provide no formal guarantee that a sufficiently sophisticated attacker could not succeed.
  • Deploying production systems where a prompt injection could cause irreversible damage remains inadvisable regardless of this result.

Key quotes

6,000 failed attempts provides no guarantees that someone with a more sophisticated approach couldn't get through.
I still wouldn't recommend deploying a production system where a prompt injection attack could cause irreversible damage though!