claim
Andi Nika et al. analyze the susceptibility of two preference-based learning paradigms to poisoned data: reinforcement learning from human feedback (RLHF), which learns a reward model using preferences, and direct preference optimization (DPO), which directly optimizes a policy using preferences.
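For context, the two paradigms are usually written in the following standard forms (the Bradley-Terry reward-modeling loss used in RLHF and the DPO loss; the notation below is the conventional one from the literature, not taken from the claim's source). Both objectives are fit to the same preference dataset \(\mathcal{D}\) of triples (prompt \(x\), preferred response \(y_w\), dispreferred response \(y_l\)), which is why poisoning \(\mathcal{D}\) can affect either pipeline:

\[
\mathcal{L}_{\mathrm{RM}}(r_\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\right]
\]

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]

RLHF subsequently optimizes the policy against the learned reward \(r_\phi\) (typically with a KL-regularized RL step), whereas DPO updates the policy \(\pi_\theta\) directly from the preference pairs, with no explicit reward model in between.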
