The Information Machine

Looking for backdoors in Jane Street LLMs

Alignment Forum · Cipolla · 2026-05-23

Security researcher Cipolla shares partial results from the Jane Street LLM backdoor challenge, describing how SVD analysis of weight differences, combined with LLM-assisted token interpretation, revealed that one backdoored DeepSeek-V3 model silently simulates Conway's Game of Life when given raw grid inputs.

Extraction

Topics: llm-backdoors, mechanistic-interpretability, ai-security, weight-analysis, adversarial-ml

Claims

  • Jane Street trained hidden backdoors into three large DeepSeek-V3 (671B MoE) models and one fine-tuned Qwen2.5-7B model, with specific trigger inputs causing dramatically different behavior from normal operation.
  • SVD analysis of the difference between backdoored and base model weights surfaces candidate trigger tokens layer-by-layer, providing a tractable white-box approach when activation-only methods fail.
  • Model M1's backdoor causes it to simulate Conway's Game of Life when given raw cell-grid patterns as input, a behavior confirmed by consistent responses across multiple grid configurations.
  • Using large-context LLMs (especially Gemini) to interpret ranked token projections from weight SVDs was an effective strategy for generating and validating backdoor trigger hypotheses.
  • Activation-based probing alone was insufficient to crack the three large models within resource and time constraints, suggesting weight-level analysis is necessary for complex backdoors in large MoE architectures.
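The weight-difference idea in the claims above can be illustrated on a toy example. This is a minimal sketch, not the author's pipeline: it assumes a single weight matrix, a small made-up vocabulary, and that candidate tokens are ranked by how strongly their input embeddings align with the top right-singular vectors of the difference. The function name `rank_tokens_by_weight_delta` and all toy data are hypothetical; which matrices the post actually decomposed, and how it projected onto tokens, is not stated in this summary.

```python
import numpy as np

def rank_tokens_by_weight_delta(w_base, w_tuned, embeddings, vocab, k=3, n_dirs=1):
    """Rank tokens by alignment with the top right-singular vectors of (w_tuned - w_base).

    w_base, w_tuned: (d_out, d_in) weights of the same layer in both models.
    embeddings:      (vocab_size, d_in) token embeddings (an assumed projection target).
    Returns a list of (singular_value, [top-k tokens]) for the first n_dirs directions.
    """
    delta = w_tuned - w_base
    _, s, vt = np.linalg.svd(delta, full_matrices=False)
    ranked = []
    for i in range(n_dirs):
        # Singular vectors have arbitrary sign, so score by absolute alignment.
        scores = np.abs(embeddings @ vt[i])
        top = np.argsort(-scores)[:k]
        ranked.append((float(s[i]), [vocab[j] for j in top]))
    return ranked

# Toy data (hypothetical): a rank-1 edit that reads from one token's embedding.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "grid", "<trigger>", "dog"]
embeddings = np.eye(5, 8)                 # one-hot embeddings in an 8-dim input space
w_base = rng.normal(size=(6, 8))
u = rng.normal(size=6)                    # arbitrary output direction of the edit
w_tuned = w_base + 5.0 * np.outer(u, embeddings[3])

ranking = rank_tokens_by_weight_delta(w_base, w_tuned, embeddings, vocab)
print(ranking[0][1][0])
```

In this toy setting the planted edit is exactly rank one, so the top direction points at a single token. In a real model the difference would be spread across many layers and directions, which is presumably why the ranked token lists still needed interpretation.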

Key quotes

In the end, M1 was simulating Conway's Game of Life when given raw grids.
Once you have SVDs on the weights themselves, it was pretty easy to just look at the tokens that had the highest importance, at each layer.
There is no universal method. Which kind of toolkit can I build in the future to do this better?
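
The Game of Life behavior described above can be checked against a reference implementation: feed the model a grid, compute the expected next generation independently, and compare. The sketch below assumes a text grid with `#` for live cells and `.` for dead cells, and treats cells beyond the edge as dead. The challenge's actual grid encoding and boundary handling are not given in this summary, so both are assumptions, and the helper names are hypothetical.

```python
def parse_grid(text, live="#"):
    """Turn a multi-line text grid into a list of rows of 0/1 cells."""
    return [[1 if ch == live else 0 for ch in line] for line in text.strip().splitlines()]

def render_grid(grid, live="#", dead="."):
    """Turn a 0/1 grid back into text, one row per line."""
    return "\n".join("".join(live if cell else dead for cell in row) for row in grid)

def life_step(grid):
    """Advance a Conway's Game of Life grid by one generation (outside cells are dead)."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r, c):
        return sum(
            grid[rr][cc]
            for rr in range(max(r - 1, 0), min(r + 2, rows))
            for cc in range(max(c - 1, 0), min(c + 2, cols))
            if (rr, cc) != (r, c)
        )

    new_grid = []
    for r in range(rows):
        new_row = []
        for c in range(cols):
            n = live_neighbors(r, c)
            # A cell is alive next turn with exactly 3 live neighbors,
            # or if it is alive now and has exactly 2.
            new_row.append(1 if n == 3 or (grid[r][c] == 1 and n == 2) else 0)
        new_grid.append(new_row)
    return new_grid

blinker = """
.....
..#..
..#..
..#..
.....
"""
expected_next = render_grid(life_step(parse_grid(blinker)))
print(expected_next)
```

Comparing `expected_next` with the model's reply across several patterns (blinkers, gliders, still lifes) is one way to get the kind of consistent-response evidence the claims mention.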