claim
Two-stage reinforcement learning uses a reward model that is simpler than the policy to narrow the search space, guiding the policy toward a subset of optimal solutions that offline cloning cannot easily identify.
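
The narrowing idea in the claim above can be illustrated with a toy sketch. This is not the survey's method: the candidate set, the reward family `r_k(x) = 1 if x % k == 0`, the preference pairs, and all function names are hypothetical, and the policy-optimization stage is replaced by plain enumeration for brevity.

```python
# Toy illustration only; every name and data value here is hypothetical.
from typing import List, Tuple

CANDIDATES: List[int] = list(range(16))   # stand-in for the policy's full search space
REWARD_CLASS: List[int] = [2, 3, 4, 5]    # simple reward family indexed by k

# (preferred, rejected) pairs standing in for preference data
PREFERENCES: List[Tuple[int, int]] = [(4, 2), (8, 6), (12, 3), (0, 5), (4, 10)]


def reward(k: int, x: int) -> int:
    """Simple reward r_k(x): 1 if x is divisible by k, else 0."""
    return 1 if x % k == 0 else 0


def fit_reward_model(prefs: List[Tuple[int, int]]) -> int:
    """Stage 1: choose the k whose reward ranks the most pairs correctly."""
    def agreement(k: int) -> int:
        return sum(reward(k, winner) > reward(k, loser) for winner, loser in prefs)
    return max(REWARD_CLASS, key=agreement)


def narrow_search_space(k: int) -> List[int]:
    """Stage 2 (simplified): keep only candidates with top reward under r_k."""
    best = max(reward(k, x) for x in CANDIDATES)
    return [x for x in CANDIDATES if reward(k, x) == best]


k = fit_reward_model(PREFERENCES)
survivors = narrow_search_space(k)
print(k, survivors)  # prints: 4 [0, 4, 8, 12]
```

In this sketch the learned reward reduces 16 candidates to 4, and a policy would then only need to be optimized over that subset. A realistic second stage would run a reinforcement learning algorithm against the learned reward rather than enumerate candidates.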
Authors
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)
Referenced by nodes (1)
- reinforcement learning concept