
Direct Preference Optimization (DPO)

Also known as: DPO

Facts (11)

Sources
A Survey on the Theory and Mechanism of Large Language Models arxiv.org arXiv Mar 12, 2026 4 facts
Claim: Azar et al. (2024) theoretically decomposed the performance gap between RLHF and DPO into exact-optimization and finite-sample regimes, proving that Reinforcement Learning from Human Feedback (RLHF) is superior when the policy model is misspecified, whereas Direct Preference Optimization (DPO) excels when the reward model is misspecified.
Reference: Shao et al. (2024) propose a unified paradigm that encompasses Supervised Fine-Tuning (SFT), Rejection Sampling Fine-Tuning (RFT), Direct Preference Optimization (DPO), and Proximal Policy Optimization (PPO), leading to the proposal of Group Relative Policy Optimization (GRPO).
Reference: The paper "Direct Preference Optimization: Your Language Model is Secretly a Reward Model" introduces DPO as a method for aligning language models with human preferences.
Claim: Xiong et al. (2024) addressed the lack of exploration in offline DPO by formulating the problem as a reverse-KL regularized bandit and proposing iterative algorithms that outperform static baselines.
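The DPO objective referenced in the facts above can be illustrated with a minimal per-pair loss. This is a hedged sketch, not the authors' implementation: the function name and scalar inputs are illustrative, and the inputs are assumed to be summed log-probabilities of each response under the trainable policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss sketch (after Rafailov et al.).

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the policy being trained and the frozen reference model.
    beta scales the implicit KL penalty toward the reference policy.
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Bradley-Terry negative log-likelihood of preferring "chosen"
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference exactly, the margin is zero and the loss is log 2; the loss shrinks as the policy raises the chosen response's likelihood relative to the rejected one. The offline character of this loss (no sampling from the current policy) is the exploration limitation that Xiong et al. (2024) address with iterative variants.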
Track: Poster Session 3 - aistats 2026 virtual.aistats.org Samuel Tesfazgi, Leonhard Sprandl, Sandra Hirche · AISTATS 2 facts
Claim: The f-PO framework unifies previous alignment algorithms such as DPO (Direct Preference Optimization) and EXO (Efficient Exact Optimization) while offering new variants through different choices of f-divergence.
Claim: Andi Nika et al. analyze the susceptibility of two preference-based learning paradigms to poisoned data: reinforcement learning from human feedback (RLHF), which learns a reward model from preferences, and direct preference optimization (DPO), which optimizes a policy directly from preferences.
A Comprehensive Benchmark and Evaluation Framework for Multi ... arxiv.org arXiv Jan 6, 2026 1 fact
Claim: Direct Preference Optimization (DPO) significantly outperforms Supervised Fine-Tuning (SFT) in handling complex reasoning and emotional nuance in patient agents.
EdinburghNLP/awesome-hallucination-detection - GitHub github.com GitHub 1 fact
Reference: HaluCheck is a family of 1B–3B-parameter LLM detectors aligned via Direct Preference Optimization (DPO) on synthetic hallucinated negatives, ranked by grounding difficulty using the MiniCheck method.
Medical Hallucination in Foundation Models and Their Impact on ... medrxiv.org medRxiv Nov 2, 2025 1 fact
Claim: Methods to align Large Language Model outputs with human preferences include direct preference optimization (DPO), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF), the latter two often utilizing proximal policy optimization (PPO) as the training mechanism.
Medical Hallucination in Foundation Models and Their ... medrxiv.org medRxiv Mar 3, 2025 1 fact
Reference: Direct preference optimization (DPO), introduced by Rafailov et al. in 2024, is a method used to align model outputs and behaviors with human preferences.
Awesome-Hallucination-Detection-and-Mitigation - GitHub github.com GitHub 1 fact
Reference: The paper "Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key" by Yang et al. (2025) argues that on-policy data is critical for mitigating hallucinations in large vision-language models when using direct preference optimization.