Claim
Reinforcement Learning from Human Feedback (RLHF) often prioritizes optimizing a learned reward signal, which risks reward hacking and neglects the model's internal states, according to research by Ouyang et al. (2022), Rafailov et al. (2023), Ramesh et al. (2024), Skalse et al. (2022), and Krakovna (2020).
Authors
Sources
- A Survey of Incorporating Psychological Theories in LLMs (arXiv)