Claim
The Behavior Expectation Bounds (BEB) framework suggests that popular alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) may increase a large language model's susceptibility to being prompted into undesired behaviors.
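A minimal sketch of the quantities such a claim rests on, assuming the usual BEB-style formulation; the notation below is illustrative and not taken from the cited survey. The framework scores a behavior with a function B mapping output text to [-1, 1] and asks whether some prompt can drive the model's expected score toward the undesired extreme:

```latex
% Behavior scoring function B : \Sigma^* \to [-1, 1] (assumed notation).
% Behavior expectation of model P conditioned on a prompt s_0:
B_{P}(s_0) = \mathbb{E}_{s \sim P(\,\cdot\, \mid s_0)}\bigl[ B(s) \bigr]

% Prompt-misalignability: P is \gamma-prompt-misalignable w.r.t. B
% if some adversarial prompt s^{*} pushes the expectation near -1:
\exists\, s^{*} \;:\; B_{P}(s^{*}) \le -1 + \gamma
```

On this reading, an alignment step that attenuates an undesired behavior without eliminating it still admits some prompt s^{*} satisfying the condition above, which is the sense in which the framework suggests susceptibility to adversarial prompting can persist after RLHF and, per the claim, can even increase.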
Authors
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)