Relations (1)

related 2.00 — strongly supporting 3 facts

Reinforcement learning is used as a technique within the fine-tuning process of large language models, as evidenced by research analyzing its value in that setting [1]. Furthermore, reinforcement learning on incorrect responses serves as a more sample-efficient alternative or enhancement to standard positive-only fine-tuning [2], [3].
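To make the contrast concrete, here is a minimal toy sketch (not the cited papers' implementation; the softmax "policy", reward value, and learning rate are illustrative assumptions): positive-only fine-tuning can only raise the probability of correct responses, while a REINFORCE-style update with negative reward can directly push probability mass away from an incorrect response.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def sft_update(logits, idx, lr=0.5):
    """Positive-only fine-tuning: raise the logit of a correct response.
    Incorrect responses are never seen, so they are only penalized
    indirectly through renormalization."""
    new = list(logits)
    new[idx] += lr
    return new

def rl_update(logits, idx, reward, lr=0.5):
    """REINFORCE-style step: reward * grad(log pi(idx)).
    A negative reward on an incorrect response lowers its
    probability directly."""
    probs = softmax(logits)
    new = list(logits)
    for j in range(len(logits)):
        grad = (1.0 if j == idx else 0.0) - probs[j]
        new[j] += lr * reward * grad
    return new

# Toy policy over three candidate answers: [correct, wrong_a, wrong_b].
logits = [0.0, 0.0, 0.0]
after_sft = sft_update(logits, idx=0)           # reinforces the correct answer
after_rl = rl_update(logits, idx=1, reward=-1)  # penalizes an observed wrong answer
```

The sketch illustrates why training on incorrect responses can add signal: each observed wrong answer yields its own gradient, rather than being suppressed only as a side effect of boosting correct ones.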

Facts (3)

Sources
A Survey on the Theory and Mechanism of Large Language Models (arXiv, arxiv.org) — 3 facts
claim: Reinforcement learning on incorrect responses helps models identify and unlearn 'spurious correlations' (incorrect intermediate steps that lead to correct final answers), scaling synthetic dataset efficiency eight-fold compared to standard positive-only fine-tuning.
measurement: Setlur et al. (2024) found that in mathematical reasoning tasks, using reinforcement learning on a model's incorrect responses is twice as sample-efficient as fine-tuning on correct synthetic answers.
claim: The research paper 'All roads lead to likelihood: the value of reinforcement learning in fine-tuning' (arXiv:2503.01067) analyzes the role and value of reinforcement learning in the fine-tuning process of large language models.