claim
Optimization algorithms and attention methods in Large Language Models can attempt to induce fake behavior, and if rewards are not unique to the task, the model will have difficulty aligning with desired behaviors (Shah et al. 2022a).

Authors

Sources

Referenced by nodes (3)