Shao et al. (2024) present a unified paradigm that encompasses Supervised Fine-Tuning (SFT), Rejection Sampling Fine-Tuning (RFT), Direct Preference Optimization (DPO), and Proximal Policy Optimization (PPO), which leads them to propose Group Relative Policy Optimization (GRPO).
