claim
Reinforcement Learning from Human Feedback (RLHF) often prioritizes optimizing a reward signal, which risks reward hacking and neglects the model's internal states (Ouyang et al., 2022; Rafailov et al., 2023; Ramesh et al., 2024; Skalse et al., 2022; Krakovna, 2020).
