Guardian Angels: LLM Personalization for Productivity and Security

LessWrong (Curated) · gwern · 2026-06-17

Gwern proposes 'Guardian Angels,' personalized LLM digital twins that emulate a single user's personality and values to solve the principal-agent alignment problem and defend against AI-powered cyberattacks at scale.

Extraction

Topics: llm-personalization, principal-agent-problem, ai-security, online-learning, ai-agents

Claims

Powerful LLMs will dominate the internet and ordinary life within a few years, yet no coherent vision exists for maximizing productivity or security at that scale.
Current prompt-programming and in-context learning approaches cannot produce genuinely useful personalized AI, because they are limited by frozen model weights, finite context windows, and passive data collection.
Guardian Angels would weakly solve the principal-agent problem by making the agent emulate the principal's own values and preferences, effectively unifying principal and agent.
Hardwiring a GA to a single, specific user neutralizes many prompt injection and spearphishing attacks because following an external prompt instruction would be absurd by the agent's own definition.
Building effective GAs requires online learning via dynamic evaluation, active learning with DAgger-style bounds, and a local CLI-first logging UI, not standard fine-tuning alone.
The GA concept is likely better pursued as a startup targeting power users and knowledge workers rather than as an open-source community project, due to high security requirements.
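
To make the "online learning via dynamic evaluation" claim concrete, here is a minimal sketch, not the post's implementation. It assumes a toy character-level bigram model standing in for an LLM; the names BigramModel, average_loss, and user_stream are hypothetical. The only point illustrated is the contrast between scoring a user's data stream with frozen weights and scoring it while taking a gradient step after each prediction.

```python
import math


class BigramModel:
    """Toy character-level bigram model with softmax logits (a stand-in for an LLM)."""

    def __init__(self, vocab):
        self.vocab = sorted(vocab)
        # One row of logits per previous character; all zeros means uniform predictions.
        self.logits = {p: {c: 0.0 for c in self.vocab} for p in self.vocab}

    def probs(self, prev):
        row = self.logits[prev]
        m = max(row.values())  # subtract the max for numerical stability
        exps = {c: math.exp(v - m) for c, v in row.items()}
        z = sum(exps.values())
        return {c: e / z for c, e in exps.items()}

    def sgd_step(self, prev, nxt, lr):
        # Gradient of cross-entropy w.r.t. logits is (predicted prob - one-hot target).
        p = self.probs(prev)
        for c in self.vocab:
            grad = p[c] - (1.0 if c == nxt else 0.0)
            self.logits[prev][c] -= lr * grad


def average_loss(model, text, lr=0.0):
    """Mean next-character negative log-likelihood over `text`.

    With lr == 0 the weights stay frozen. With lr > 0 the model is updated
    after each character is scored, which is the dynamic-evaluation setting.
    """
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        total += -math.log(model.probs(prev)[nxt])  # score first
        if lr > 0:
            model.sgd_step(prev, nxt, lr)  # then adapt to the stream
    return total / (len(text) - 1)


# Hypothetical "user data": a highly regular stream the model has never seen.
user_stream = "ab" * 50
vocab = set(user_stream)

frozen_loss = average_loss(BigramModel(vocab), user_stream, lr=0.0)
dynamic_loss = average_loss(BigramModel(vocab), user_stream, lr=0.5)
print(f"frozen: {frozen_loss:.3f}  dynamic: {dynamic_loss:.3f}")
```

The frozen model stays at the uniform-guess loss for the whole stream, while the dynamically evaluated copy adapts to the user's pattern as it goes. A real GA would face the same trade-off at LLM scale, where update cost, stability, and forgetting matter far more than in this toy.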

Key quotes

I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical 'assistant chatbot agent' persona, but emulating a single user's personality, values, and preferences.

This weakly solves the principal-agent problem by unifying the principal and agent as much as possible.

Standard techniques like prompt programming of in-context-learning for 'frozen' models will not create useful GAs due to the limitations of post-training, context windows and self-attention with frozen weights in compute-efficient-but-under-parameterized models.