Claim
Current Reinforcement Learning from Human Feedback (RLHF) for Large Language Models relies on uniform rewards, i.e., a single undifferentiated reward signal, which behavioral theory suggests can lead to reward hacking.
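
For illustration, here is a minimal Python sketch of one common reading of "uniform rewards": a reward model emits a single sequence-level scalar that is spread evenly across the response, with no per-step or per-preference differentiation. All names here are hypothetical, not any particular RLHF library's API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    prompt: str
    response_tokens: List[str]

def reward_model_score(prompt: str, response: str) -> float:
    """Hypothetical reward model: one scalar for the entire response."""
    # Placeholder heuristic standing in for a learned preference model.
    return 1.0 if "helpful" in response else 0.0

def uniform_token_rewards(traj: Trajectory) -> List[float]:
    # The single sequence-level score is applied uniformly over all tokens.
    # (PPO-style RLHF often assigns it to the final token instead; either
    # way, no per-step human signal distinguishes the individual steps,
    # which is the opening reward hacking can exploit.)
    r = reward_model_score(traj.prompt, "".join(traj.response_tokens))
    return [r / len(traj.response_tokens)] * len(traj.response_tokens)

if __name__ == "__main__":
    traj = Trajectory("How do I fix this bug?", ["This", " is", " helpful", "."])
    print(uniform_token_rewards(traj))  # identical share for every token
```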
Authors
Sources
- A Survey of Incorporating Psychological Theories in LLMs, arXiv (arxiv.org)