claim
Reward hacking, defined as a model exploiting flaws in its reward model, is a persistent theoretical concern in the development of large language models.
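
The definition can be illustrated with a toy sketch. Everything in it is a hypothetical assumption made for illustration and is not drawn from any source attached to this claim: `proxy_reward` stands in for a flawed reward model that only counts polite keywords, `true_quality` stands in for the intended objective, and "optimization" is reduced to picking the highest-scoring candidate.

```python
# Toy illustration only: all names and scoring rules here are hypothetical.

def proxy_reward(response: str) -> int:
    """Flawed stand-in reward model: counts polite keywords, ignores correctness."""
    keywords = ("thanks", "please", "happy")
    words = response.lower().split()
    return sum(words.count(k) for k in keywords)


def true_quality(response: str) -> int:
    """Intended objective: 1 if the response contains the correct answer, else 0."""
    return 1 if "4" in response.split() else 0


candidates = [
    "2 + 2 is 4",                        # correct, but scores 0 under the proxy
    "thanks thanks please happy happy",  # useless, but stuffed with keywords
]

# Selecting by the proxy alone rewards the response that exploits its flaw.
best_by_proxy = max(candidates, key=proxy_reward)

print(best_by_proxy, proxy_reward(best_by_proxy), true_quality(best_by_proxy))
```

In this sketch the candidate that maximizes the proxy is the one that fails the intended objective, which is the gap the claim refers to.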

Authors

Sources
