Claim
Azar et al. (2024) theoretically decomposed the performance gap between Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) into an exact-optimization regime and a finite-sample regime, proving that RLHF is superior when the policy model is misspecified, whereas DPO excels when the reward model is misspecified.
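For context, a brief sketch of the two objectives whose gap is being decomposed, written in the standard notation of the preference-optimization literature (these are the textbook KL-regularized RLHF and DPO formulations, not reproduced from the cited paper):

```latex
% Standard KL-regularized RLHF objective: maximize a learned reward r_\phi
% while staying close (in KL) to a reference policy \pi_ref.
\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r_\phi(x, y) \right]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]

% Standard DPO loss over preference pairs, where y_w is preferred over y_l:
% the reward model is implicit in the log-ratio of policy probabilities.
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The claim's asymmetry tracks where each method's parametric bottleneck sits: RLHF depends on an explicit reward model $r_\phi$, while DPO's reward is defined implicitly through the policy $\pi_\theta$, so each method is sensitive to misspecification in a different model class.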
Authors
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)