Reinforcement learning from human feedback (RLHF)
Also known as: RLHF, reinforcement learning with human feedback
Facts (33)
Sources
Survey and analysis of hallucinations in large language models frontiersin.org Sep 29, 2025 9 facts
procedure: Mitigation strategies for large language model hallucinations at the modeling level include Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022), retrieval fusion (Lewis et al., 2020), and instruction tuning (Wang et al., 2022).
procedure: Techniques such as Reinforcement Learning with Human Feedback (RLHF) (Ouyang et al., 2022) and Retrieval-Augmented Generation (RAG) (Lewis et al., 2020) are used to address model-level limitations regarding hallucinations.
claim: Efforts to mitigate hallucinations at the model level include supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), contrastive decoding, and grounded pretraining.
claim: GPT-4 avoids factual hallucinations on the TruthfulQA benchmark by using nuanced, cautious phrasing, a strategy likely derived from reinforcement learning from human feedback.
formula: The conditional probability distribution of an output sequence y = (y_1, y_2, …, y_m) given an input context x = (x_1, x_2, …, x_n) is factorized as P(y|x; θ) = ∏_{t=1}^{m} P(y_t | y_{<t}, x; θ), where θ denotes the model parameters optimized via maximum likelihood estimation or reinforcement learning from human feedback (RLHF).
claim: Reinforcement Learning from Human Feedback (RLHF) aligns large language model behavior with factual correctness, but has low feasibility due to complex setup requirements.
reference: Instruction tuning and reinforcement learning from human feedback (RLHF) improve prompt responsiveness but do not eliminate deep-seated model hallucinations, as noted by Ouyang et al. (2022) and Kadavath et al. (2022).
claim: Models with extensive Reinforcement Learning from Human Feedback (RLHF), such as OpenAI's GPT-4, are more resistant to adversarial prompts than purely open-source models without such fine-tuning.
claim: Reinforcement learning from human feedback (RLHF) aligns model behavior with human preferences and factual correctness, though its application is limited in open-source models due to high cost and complexity.
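The chain-rule factorization in the formula entry above can be checked numerically with a toy next-token distribution (all tokens and probabilities below are illustrative assumptions, not from the source):

```python
import math

# Hypothetical conditional distributions P(token | history); purely illustrative.
COND_PROBS = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.7, "dog": 0.3},
    ("the", "cat"): {"sat": 0.9, "ran": 0.1},
}

def sequence_log_prob(tokens):
    """Sum log P(y_t | y_<t) over t, per the chain-rule factorization P(y|x) = prod_t P(y_t | y_<t, x)."""
    total = 0.0
    for t, tok in enumerate(tokens):
        total += math.log(COND_PROBS[tuple(tokens[:t])][tok])
    return total

# P("the cat sat") = 0.6 * 0.7 * 0.9
p = math.exp(sequence_log_prob(["the", "cat", "sat"]))
```

Summing log-probabilities rather than multiplying raw probabilities is the standard trick to avoid floating-point underflow on long sequences.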
A Survey on the Theory and Mechanism of Large Language Models arxiv.org Mar 12, 2026 7 facts
claim: The 'behavior expectation bounds' framework suggests that popular alignment techniques like Reinforcement Learning from Human Feedback (RLHF) may increase a Large Language Model's susceptibility to being prompted into undesired behaviors.
claim: Current alignment methodologies for Large Language Models, such as Reinforcement Learning from Human Feedback (RLHF), are empirically effective but theoretically fragile.
claim: Azar et al. (2024) theoretically decomposed the performance gap in Reinforcement Learning into exact optimization and finite-sample regimes, proving that Reinforcement Learning from Human Feedback (RLHF) is superior when the policy model is misspecified, whereas Direct Preference Optimization (DPO) excels when the reward model is misspecified.
claim: Li et al. (2026) formalized Reinforcement Learning from Human Feedback (RLHF) through the framework of algorithmic stability and built a generalization theory under the linear reward model.
claim: Zhong et al. (2025a) introduced the Reinforced Token Optimization (RTO) framework, proving that modeling Reinforcement Learning from Human Feedback (RLHF) as a token-wise Markov Decision Process (MDP) is significantly more sample-efficient than the traditional contextual bandit formulation.
procedure: Tao et al. (2025) proposed the Self-Critique method to detect contamination after Reinforcement Learning from Human Feedback (RLHF), which probes for policy collapse by comparing the token-level entropy sequences of an initial response and a second, alternative critique response.
claim: The Alignment Stage of Large Language Model training uses processes like Reinforcement Learning from Human Feedback (RLHF) to fine-tune model behavior based on human preferences rather than explicit labels.
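The preference-based fine-tuning of the alignment stage is commonly implemented by first fitting a reward model on pairwise comparisons. A minimal sketch of the standard Bradley–Terry preference loss (this formulation is an assumption; the source does not specify the loss):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood that the human-preferred response wins:
    -log sigmoid(r_chosen - r_rejected), where r_* are scalar reward-model scores."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

The loss shrinks as the reward model scores the preferred response increasingly above the rejected one, so minimizing it over a preference dataset pushes the model to reproduce the human ranking.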
A Survey of Incorporating Psychological Theories in LLMs - arXiv arxiv.org 5 facts
claim: Current Reinforcement Learning from Human Feedback (RLHF) for Large Language Models relies on uniform rewards, which behavioral theory suggests can lead to reward hacking.
claim: Reinforcement Learning from Human Feedback (RLHF) in Large Language Model development operationalizes Operant Conditioning theory by using repeated feedback to adapt model behaviors to favor outputs that yield higher reward signals.
reference: Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, and Dacheng Tao introduced InfoRM, an information-theoretic reward modeling approach designed to mitigate reward hacking in Reinforcement Learning from Human Feedback (RLHF), in a 2024 paper presented at the 38th Annual Conference on Neural Information Processing Systems.
claim: Reinforcement Learning from Human Feedback (RLHF) often prioritizes reward optimization, which risks reward hacking and neglects internal states, according to research by Ouyang et al. (2022), Rafailov et al. (2023), Ramesh et al. (2024), Skalse et al. (2022), and Krakovna (2020).
claim: Behavioral psychology concepts, including conditioning, reinforcement schedules, and reward design, are commonly utilized during the post-training and Reinforcement Learning from Human Feedback (RLHF) stages to guide Large Language Model alignment with human preferences.
Unlocking the Potential of Generative AI through Neuro-Symbolic ... arxiv.org Feb 16, 2025 3 facts
claim: Reinforcement Learning (RL) and Reinforcement Learning from Human Feedback (RLHF) integrate symbolic reasoning into reward shaping and policy optimization stages to enforce logical constraints, ensure decision-making consistency, and align neural outputs with human-like decision-making criteria.
claim: Reinforcement Learning from Human Feedback (RLHF) trains agents to make sequential decisions in dynamic environments while aligning agent behavior with human preferences to foster ethical and adaptive AI systems.
claim: Reinforcement learning with human feedback (RLHF) is a technique that enables AI systems to learn optimal actions through interaction with their environment.
Hallucination Causes: Why Language Models Fabricate Facts mbrenndoerfer.com Mar 15, 2026 3 facts
claim: Uncertainty calibration through Reinforcement Learning from Human Feedback (RLHF) addresses the surface expression of completion pressure in large language models but does not change the underlying lack of a world model or the exposure bias structure.
claim: Reinforcement Learning from Human Feedback (RLHF) reward models can inadvertently train Large Language Models to be overconfident because human annotators often mistake confidence for competence when evaluating text quality.
claim: Instruction tuning and reinforcement learning from human feedback (RLHF) improve a large language model's ability to express uncertainty and abstain from answering when knowledge is insufficient, but they do not retroactively fill knowledge gaps or undo exposure bias present in the base model.
Track: Poster Session 3 - aistats 2026 virtual.aistats.org 2 facts
claim: Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla, and Goran Radanovic study data corruption robustness for reinforcement learning with human feedback (RLHF) in an offline setting.
claim: Andi Nika et al. analyze the susceptibility of two preference-based learning paradigms to poisoned data: reinforcement learning from human feedback (RLHF), which learns a reward model using preferences, and direct preference optimization (DPO), which directly optimizes a policy using preferences.
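The contrast drawn above can be made concrete: where RLHF fits an explicit reward model to preferences, DPO scores a preference pair directly through the policy's implicit reward. A minimal sketch of the DPO loss on a single pair (the log-probability arguments and β default are illustrative assumptions):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO on one preference pair: the implicit reward of a response is
    beta * (log pi(y|x) - log pi_ref(y|x)); the loss is -log sigmoid of the
    chosen-minus-rejected implicit-reward margin. No explicit reward model is trained."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

Minimizing this pushes the policy's log-probability up on chosen responses and down on rejected ones, relative to the frozen reference policy.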
Phare LLM Benchmark: an analysis of hallucination in ... giskard.ai Apr 30, 2025 1 fact
claim: The sycophancy effect in Large Language Models may be a byproduct of Reinforcement Learning from Human Feedback (RLHF) training processes that encourage models to be agreeable and helpful to users.
What Really Causes Hallucinations in LLMs? - AI Exploration Journey aiexpjourney.substack.com Sep 12, 2025 1 fact
claim: Post-training methods like Reinforcement Learning from Human Feedback (RLHF) contribute to LLM hallucinations by using binary scoring systems that punish models for saying 'I don't know,' which incentivizes confident guessing.
The Synergy of Symbolic and Connectionist AI in LLM-Empowered ... arxiv.org Jul 11, 2024 1 fact
claim: Instruction tuning and reinforcement learning from human feedback (RLHF) are proposed methods applied on top of fine-tuning to ensure Large Language Models follow human instructions, align with human values, and exhibit desired behaviors.
Medical Hallucination in Foundation Models and Their Impact on ... medrxiv.org Nov 2, 2025 1 fact
claim: Methods to align Large Language Model outputs with human preferences include direct preference optimization (DPO), reinforcement learning from human feedback (RLHF), and reinforcement learning from AI feedback (RLAIF), often utilizing proximal policy optimization (PPO) as a training mechanism.
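PPO, the training mechanism named above, maximizes a clipped surrogate objective over the probability ratio between the updated and old policies; a minimal per-sample sketch (the function name and ε default are assumptions):

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    """PPO clipped surrogate for one sample:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    where ratio = pi_new(a|s) / pi_old(a|s) and A is the advantage estimate."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio)) * advantage
    return min(ratio * advantage, clipped)
```

The clip removes the incentive to move the policy ratio beyond [1−ε, 1+ε] in a single update, which keeps RLHF policy updates close to the previous policy.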