claim
Optimization algorithms and attention mechanisms in Large Language Models may induce fake behavior, and when rewards are not unique to the task, the model has difficulty aligning with the desired behaviors (Shah et al. 2022a).
Authors
Sources
- Building Trustworthy NeuroSymbolic AI Systems (arXiv, arxiv.org)
Referenced by nodes (3)
- Large Language Models concept
- optimization algorithms concept
- attention mechanism concept