reference
Shao et al. (2024) propose a unified paradigm that encompasses Supervised Fine-Tuning (SFT), Rejection Sampling Fine-Tuning (RFT), Direct Preference Optimization (DPO), and Proximal Policy Optimization (PPO), and on that basis propose Group Relative Policy Optimization (GRPO).
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org)
Referenced by nodes (2)
- supervised fine-tuning concept
- Direct Preference Optimization (DPO) concept