Claim
Methods for aligning large language model outputs with human preferences include direct preference optimization (DPO), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF); RLHF and RLAIF typically use proximal policy optimization (PPO) as the underlying reinforcement learning algorithm, whereas DPO optimizes directly on preference pairs without a separate reward model.
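To make the DPO part of the claim concrete, here is a minimal sketch of the standard DPO objective for a single preference pair. The function name, variable names, and the example log-probabilities are illustrative assumptions, not drawn from the cited source; the inputs are sequence log-probabilities of the preferred (chosen) and dispreferred (rejected) responses under the trained policy and under a frozen reference model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper).

    beta scales the implicit reward and controls how far the policy
    is allowed to drift from the reference model.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the scaled margin; minimizing this pushes
    # the policy toward the human-preferred response.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Example with made-up log-probabilities: the policy already favors
# the chosen response slightly more than the reference does.
print(dpo_loss(-12.0, -15.0, -13.0, -14.5, beta=0.1))
```

Unlike PPO-based RLHF/RLAIF, this objective needs no sampled rollouts or explicit reward model at training time, which is why DPO is often described as a direct preference-optimization alternative.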
Authors
Sources
- Medical Hallucination in Foundation Models and Their Impact on ... www.medrxiv.org via serper