Claim
In offline reinforcement learning from human feedback (RLHF), an ε-fraction of the trajectory pairs in the preference dataset may be corrupted, modeling either adversarial attacks or noisy human annotations.
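The ε-corruption model in the claim can be sketched as follows. This is a minimal illustration, not code from the cited work: `corrupt_preferences`, the toy dataset, and the label-flipping corruption are all hypothetical assumptions standing in for whatever corruption model the paper uses.

```python
import random

def corrupt_preferences(pairs, eps, seed=0):
    """Hypothetical ε-corruption model: flip the preference label on an
    ε-fraction of trajectory pairs, standing in for an adversarial attack
    or for noisy human annotation."""
    rng = random.Random(seed)
    n = len(pairs)
    k = int(eps * n)  # number of pairs the adversary may corrupt
    idx = set(rng.sample(range(n), k))
    corrupted = []
    for i, (traj_a, traj_b, pref) in enumerate(pairs):
        if i in idx:
            pref = 1 - pref  # flip which trajectory is marked as preferred
        corrupted.append((traj_a, traj_b, pref))
    return corrupted

# toy dataset: (trajectory_a_id, trajectory_b_id, label); label 0 = a preferred
clean = [(f"a{i}", f"b{i}", 0) for i in range(10)]
noisy = corrupt_preferences(clean, eps=0.2)
flipped = sum(c[2] != m[2] for c, m in zip(clean, noisy))
print(flipped)  # with ε = 0.2 and n = 10, exactly 2 labels are flipped
```

A robust offline RLHF method would then be evaluated on `noisy` while being required to recover a policy close to the one learned from `clean`.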
Authors
Sources
- Track: Poster Session 3, AISTATS 2026 (virtual.aistats.org)
Referenced by nodes (1)
- adversarial attack concept