Claim
The Behavior Expectation Bounds (BEB) framework suggests that popular alignment techniques such as Reinforcement Learning from Human Feedback (RLHF) may increase a large language model's susceptibility to being prompted into undesired behaviors.

Authors

Sources

Referenced by nodes (1)