claim
Methods for aligning large language model outputs with human preferences include reinforcement learning from human feedback (RLHF), reinforcement learning from AI feedback (RLAIF), and direct preference optimization (DPO); RLHF and RLAIF typically use proximal policy optimization (PPO) as the underlying training algorithm, whereas DPO optimizes the preference objective directly without an explicit reinforcement-learning loop.
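As a concrete illustration of how DPO sidesteps the RL loop, the sketch below (an assumption for illustration, not from any source listed on this node) computes the published per-example DPO loss from summed log-probabilities of a preferred and a rejected completion under the policy and a frozen reference model:

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio
    minus rejected log-ratio)), with log-probs summed over tokens.
    Function name and scalar interface are illustrative choices."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# A policy that favors the preferred completion more than the
# reference does yields a smaller loss than one that favors the
# rejected completion.
good = dpo_loss(-10.0, -30.0, -20.0, -20.0)
bad = dpo_loss(-30.0, -10.0, -20.0, -20.0)
```

Minimizing this loss pushes the policy to raise the likelihood of preferred completions relative to rejected ones, with `beta` controlling how far it may drift from the reference model.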

Authors

Sources

Referenced by nodes (2)