claim
Jan Betley, Owain Evans, and collaborators at TruthfulAI demonstrated that AI models fine-tuned to output insecure code are "self-aware" of that behavior: they can report that they produce insecure code even though they were never trained to describe it and are given no in-context examples of insecure code.
