Claim
Reinforcement Learning from Human Feedback (RLHF) often prioritizes optimizing a learned reward signal, which risks reward hacking and neglects the model's internal states, according to research by Ouyang et al. (2022), Rafailov et al. (2023), Ramesh et al. (2024), Skalse et al. (2022), and Krakovna (2020).
Authors
Sources
- A Survey of Incorporating Psychological Theories in LLMs (arXiv)