claim
Current Reinforcement Learning from Human Feedback (RLHF) for Large Language Models relies on uniform rewards, a design that behavioral theory suggests can lead to reward hacking.
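
To make the phrase "uniform rewards" concrete, below is a minimal sketch under one common reading of the claim: a single sequence-level reward-model score is broadcast uniformly to every token of a sampled response, giving the policy a shallow signal it can exploit. This is an illustrative assumption, not the claim's cited method; the function names (`reward_model_score`, `assign_uniform_rewards`) are hypothetical.

```python
# Minimal, self-contained sketch (plain Python, no RL library) of uniform
# per-token reward assignment in an RLHF-style setup. Hypothetical names;
# not an implementation from the claim's sources.

def reward_model_score(response: str) -> float:
    """Stand-in for a learned reward model; returns one scalar per response."""
    # Toy proxy: longer answers score higher -- exactly the kind of shallow
    # signal a policy can learn to exploit (reward hacking).
    return min(len(response) / 100.0, 1.0)

def assign_uniform_rewards(response_tokens: list[str]) -> list[float]:
    """Broadcast a single sequence-level score uniformly over every token."""
    score = reward_model_score(" ".join(response_tokens))
    return [score] * len(response_tokens)

if __name__ == "__main__":
    tokens = "The answer is forty two because reasons".split()
    per_token_rewards = assign_uniform_rewards(tokens)
    print(per_token_rewards)  # same value repeated: no per-token credit assignment
```

Under this reading, every token receives identical credit regardless of its contribution, which is the uniformity the claim argues behavioral theory would flag.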

Authors

Sources

Referenced by nodes (2)