Relations (1)

related (score 9.00): strongly supported by 9 facts

Reinforcement learning from human feedback (RLHF) is a primary methodology used to align Large Language Models with human preferences and instructions, as evidenced by its role in fine-tuning [1], mitigating hallucinations [2], and addressing behavioral issues like sycophancy [3] and overconfidence [4].

Facts (9)

Sources
Survey and analysis of hallucinations in large language models (Frontiers, frontiersin.org) · 3 facts
procedure: Mitigation strategies for large language model hallucinations at the modeling level include Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), retrieval fusion (Lewis et al., 2020), and instruction tuning (Wang et al., 2022).
formula: The conditional probability distribution of an output sequence y = (y_1, y_2, …, y_m) given an input context x = (x_1, x_2, …, x_n) is factorized as P(y|x; θ) = ∏_{t=1}^{m} P(y_t | y_{<t}, x; θ), where θ denotes the model parameters optimized via maximum likelihood estimation or reinforcement learning from human feedback (RLHF).
claim: Reinforcement learning from human feedback (RLHF) aligns model behavior with human preferences and factual correctness, though its application is limited in open-source models due to high cost and complexity.
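The autoregressive factorization in the formula fact above can be sketched numerically. This is a minimal illustration, not a real model: `toy_conditional` is a hypothetical stand-in for a trained model's next-token distribution P(y_t | y_{<t}, x; θ), and any function returning valid conditional probabilities could be substituted.

```python
import math

def toy_conditional(token, prefix, context):
    """Hypothetical next-token distribution: uniform over a 4-token vocabulary.
    A real LLM would condition on `prefix` (y_<t) and `context` (x)."""
    vocab = ("a", "b", "c", "<eos>")
    return 1.0 / len(vocab) if token in vocab else 0.0

def sequence_log_prob(y, x, conditional=toy_conditional):
    """log P(y|x) = sum over t of log P(y_t | y_<t, x), per the chain-rule
    factorization P(y|x) = prod_t P(y_t | y_<t, x)."""
    total = 0.0
    for t, token in enumerate(y):
        total += math.log(conditional(token, y[:t], x))
    return total

lp = sequence_log_prob(["a", "b", "<eos>"], ["some", "context"])
# With a uniform 1/4 distribution, each of the 3 tokens contributes log(1/4).
```

Maximum likelihood training maximizes this sum over a corpus; RLHF instead optimizes the model against a learned reward on whole sequences, but the generative factorization stays the same.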
Hallucination Causes: Why Language Models Fabricate Facts (M. Brenndoerfer, mbrenndoerfer.com) · 2 facts
claim: Uncertainty calibration through Reinforcement Learning from Human Feedback (RLHF) addresses the surface expression of completion pressure in large language models but does not change the underlying lack of a world model or the exposure bias structure.
claim: Reinforcement Learning from Human Feedback (RLHF) reward models can inadvertently train Large Language Models to be overconfident because human annotators often mistake confidence for competence when evaluating text quality.
A Survey of Incorporating Psychological Theories in LLMs (arXiv, arxiv.org) · 1 fact
claim: Current Reinforcement Learning from Human Feedback (RLHF) for Large Language Models relies on uniform rewards, which behavioral theory suggests can lead to reward hacking.
A Survey on the Theory and Mechanism of Large Language Models (arXiv, arxiv.org) · 1 fact
claim: Current alignment methodologies for Large Language Models, such as Reinforcement Learning from Human Feedback (RLHF), are empirically effective but theoretically fragile.
Phare LLM Benchmark: an analysis of hallucination in ... (Giskard, giskard.ai) · 1 fact
claim: The sycophancy effect in Large Language Models may be a byproduct of Reinforcement Learning from Human Feedback (RLHF) training processes that encourage models to be agreeable and helpful to users.
The Synergy of Symbolic and Connectionist AI in LLM-Empowered ... (arXiv, arxiv.org) · 1 fact
claim: Instruction tuning and reinforcement learning from human feedback (RLHF) are proposed methods applied on top of fine-tuning to ensure Large Language Models follow human instructions, align with human values, and exhibit desired behaviors.