Claim
The f-PO algorithm (f-divergence Preference Optimization) aligns language models with human preferences by minimizing an f-divergence between the policy being optimized and the optimal policy.
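A minimal sketch of the quantity such an objective targets. The f-divergence family is defined as D_f(P || Q) = E_Q[f(P(y)/Q(y))] for a convex generator f with f(1) = 0; different choices of f recover familiar divergences (forward KL, reverse KL). The function names and the toy discrete policies below are illustrative assumptions, not the paper's implementation, which would estimate this over sampled model responses rather than enumerate distributions.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_y q(y) * f(p(y) / q(y)) for discrete distributions."""
    return sum(qy * f(py / qy) for py, qy in zip(p, q) if qy > 0)

# Two standard generator functions from the f-divergence family:
forward_kl = lambda t: t * math.log(t)   # yields KL(P || Q)
reverse_kl = lambda t: -math.log(t)      # yields KL(Q || P)

# Toy distributions: an "optimal" policy (assumed known here purely for
# illustration) and the current policy, over 3 candidate responses.
p_star = [0.7, 0.2, 0.1]   # optimal policy
p_theta = [0.5, 0.3, 0.2]  # policy being optimized

gap = f_divergence(p_star, p_theta, forward_kl)
```

Driving `gap` to zero makes `p_theta` match `p_star`; the choice of generator f controls how the objective penalizes different kinds of mismatch (e.g. mode-covering vs. mode-seeking behavior).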
Authors
Sources
- Track: Poster Session 3, AISTATS 2026 (virtual.aistats.org, via serper)
Referenced by nodes (1)
- Language Model concept