Misalignment classifiers: Why they're hard to evaluate adversarially, and why we're studying them anyway — LessWrong

reactive:ai-deployment-misalignment-risk
