Guardian Angels: LLM Personalization for Productivity and Security
LessWrong (Curated) · gwern · 2026-06-17
Gwern proposes 'Guardian Angels,' personalized LLM digital twins that emulate a single user's personality and values to solve the principal-agent alignment problem and defend against AI-powered cyberattacks at scale.
Extraction
Topics: llm-personalization, principal-agent-problem, ai-security, online-learning, ai-agents
Claims
- Powerful LLMs will dominate the internet and ordinary life within a few years, yet no coherent vision exists for maximizing productivity or security at that scale.
- Current prompt-programming and in-context learning approaches are insufficient to create genuinely useful personalized AI due to limitations in frozen model weights, context windows, and passive data collection.
- Guardian Angels would weakly solve the principal-agent problem by making the agent emulate the principal's own values and preferences, effectively unifying principal and agent.
- Hardwiring a GA to a single, specific user neutralizes many prompt injection and spearphishing attacks because following an external prompt instruction would be absurd by the agent's own definition.
- Building effective GAs requires online learning via dynamic evaluation, active learning with DAgger-style bounds, and a local CLI-first logging UI, not standard fine-tuning alone.
- The GA concept is likely better pursued as a startup targeting power users and knowledge workers rather than as an open-source community project, due to high security requirements.
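As a rough illustration of the dynamic-evaluation claim above, the toy sketch below contrasts a frozen model with one that takes a gradient step after every observed token. It is not gwern's implementation: the unigram softmax model, the `stream_loss` function, the learning rate, and the synthetic `user_stream` are all assumptions chosen to keep the example self-contained.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def stream_loss(tokens, vocab_size, lr=0.0):
    """Average negative log-likelihood of a unigram softmax model on a stream.

    lr == 0 keeps the logits frozen. lr > 0 scores each token first and then
    takes one SGD step on it: the predict-then-update loop of dynamic evaluation.
    """
    logits = [0.0] * vocab_size  # uniform start, a stand-in for pretrained weights
    total = 0.0
    for t in tokens:
        probs = softmax(logits)
        total += -math.log(probs[t])  # score before updating
        if lr > 0:
            # Gradient of the NLL with respect to the logits is probs - onehot(t).
            for i in range(vocab_size):
                grad = probs[i] - (1.0 if i == t else 0.0)
                logits[i] -= lr * grad
    return total / len(tokens)

# Hypothetical user data: token 0 dominates, token 3 never appears.
user_stream = [0, 0, 1, 0, 0, 0, 2, 0, 0, 1] * 20

frozen = stream_loss(user_stream, vocab_size=4, lr=0.0)
adapted = stream_loss(user_stream, vocab_size=4, lr=0.5)
print(f"frozen NLL:  {frozen:.3f}")
print(f"adapted NLL: {adapted:.3f}")
```

The frozen model stays at the uniform-guess loss, while the adapting model drifts toward the user's own token frequencies. A real GA would presumably update a far larger model, which this sketch does not attempt to represent.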
Key quotes
I propose a goal of creating Guardian Angels (GA): digital twin LLMs which are personalized with the goal of providing not the stereotypical 'assistant chatbot agent' persona, but emulating a single user's personality, values, and preferences.
This weakly solves the principal-agent problem by unifying the principal and agent as much as possible.
Standard techniques like prompt programming of in-context-learning for 'frozen' models will not create useful GAs due to the limitations of post-training, context windows and self-attention with frozen weights in compute-efficient-but-under-parameterized models.