claim
Reinforcement learning on incorrect (negative) responses helps models identify and unlearn 'spurious correlations' (incorrect intermediate steps that nonetheless lead to correct final answers), improving the sample efficiency of synthetic data roughly eight-fold compared with standard fine-tuning on positive responses only.
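A minimal sketch of the mechanism behind this claim, assuming a per-step advantage-weighted objective: steps with positive advantage are reinforced, while spurious steps (negative advantage, estimated from incorrect responses) are pushed down rather than ignored. The function name and the toy log-probability/advantage values are illustrative assumptions, not from the source.

```python
def weighted_step_loss(step_logprobs, advantages):
    """Advantage-weighted loss over the reasoning steps of one response.

    step_logprobs: model log-probability of each intermediate step.
    advantages: estimated per-step advantage; negative values mark
    spurious steps, so minimizing this loss lowers their probability
    ("unlearning"), unlike positive-only fine-tuning, which never
    penalizes them.
    """
    return -sum(a * lp for lp, a in zip(step_logprobs, advantages))

# Toy example: second step is spurious (advantage -1), so the gradient
# on that step points away from it instead of toward it.
loss = weighted_step_loss([-1.0, -2.0], [1.0, -1.0])
```

Positive-only fine-tuning corresponds to setting every advantage to 1, which reinforces spurious steps along with correct ones; the per-step weights are what allow negative data to be exploited.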
Authors
Sources
- A Survey on the Theory and Mechanism of Large Language Models arxiv.org via serper
Referenced by nodes (3)
- reinforcement learning concept
- fine-tuning concept
- Synthetic data concept