claim
Reward models trained for Reinforcement Learning from Human Feedback (RLHF) can inadvertently teach Large Language Models to be overconfident, because the human annotators whose preferences they learn from often mistake confidence for competence when evaluating text quality.

Authors

Sources
