claim
Reinforcement learning on incorrect responses helps models identify and unlearn 'spurious correlations' (incorrect intermediate steps that nonetheless lead to correct final answers), improving synthetic-data efficiency eight-fold compared to standard positive-only fine-tuning.
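
A toy sketch may help show why incorrect responses carry information that positive-only data lacks. It is not the method behind the claim, only an illustration under stated assumptions: the helper `step_advantages`, the step strings, and the crude value estimate (the fraction of responses containing a step that end in a correct answer, minus the overall success rate) are all hypothetical.

```python
from collections import defaultdict


def step_advantages(responses):
    """Crude per-step credit (an illustrative assumption, not the source's estimator).

    responses: list of (steps, is_correct) pairs.
    Returns {step: advantage}, where advantage is the success rate of
    responses containing the step minus the overall success rate.
    """
    baseline = sum(ok for _, ok in responses) / len(responses)
    totals = defaultdict(lambda: [0, 0])  # step -> [times correct, times seen]
    for steps, ok in responses:
        for step in set(steps):
            totals[step][0] += int(ok)
            totals[step][1] += 1
    return {s: c / n - baseline for s, (c, n) in totals.items()}


# Hypothetical data: one positive trace contains a flawed step
# that still happens to reach the correct final answer.
positives = [
    (["set up equation", "cancel terms incorrectly", "answer 4"], True),
    (["set up equation", "solve for x", "answer 4"], True),
]
negatives = [
    (["set up equation", "cancel terms incorrectly", "answer 7"], False),
    (["set up equation", "cancel terms incorrectly", "answer 9"], False),
]

pos_only = step_advantages(positives)
with_neg = step_advantages(positives + negatives)
```

With positive responses alone, every step receives the same credit, so the flawed step is indistinguishable from the sound ones. Once the incorrect responses are included, the flawed step's advantage turns negative while the sound step's stays positive, which is the signal a learner could use to down-weight it.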
