Claim
Current alignment methodologies for Large Language Models, such as Reinforcement Learning from Human Feedback (RLHF), are empirically effective but theoretically fragile.
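For reference, the objective typically optimized in RLHF fine-tuning is the KL-regularized reward-maximization problem below; this is the standard formulation from the RLHF literature, not a result quoted from the cited survey:

$$\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\!\big[ r_\phi(x, y) \big] \;-\; \beta\, \mathrm{D}_{\mathrm{KL}}\!\big[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big]$$

Here $r_\phi$ is a reward model fit to human preference comparisons, $\pi_{\mathrm{ref}}$ is the pre-RLHF reference policy, and $\beta$ weights the KL penalty. The "theoretical fragility" in the claim plausibly refers to the gap between maximizing this learned proxy reward and satisfying the intended human objective, which the formulation itself does not guarantee.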
Authors
Sources
- A Survey on the Theory and Mechanism of Large Language Models (arxiv.org, via serper)