concept

RLHF

Also known as: Reinforcement Learning from Human Feedback

Facts (14)

Sources
A Survey of Incorporating Psychological Theories in LLMs (arXiv, arxiv.org) · 5 facts
Reference: Shen et al. (2024) authored the paper titled 'The trickle-down impact of reward inconsistency on RLHF', which was presented at The Twelfth International Conference on Learning Representations in 2024.
Claim: Recent developments in RLHF include incorporating human cognitive biases (Siththaranjan et al., 2024) and personalizing reward functions for individual values (Poddar et al., 2024).
Claim: Thorndike's Law of Effect asserts that behaviors followed by satisfying outcomes are more likely to recur, a principle reflected in the RLHF process where models adapt to human preferences (Lambert et al., 2023).
Claim: Behavioral psychology concepts such as partial reinforcement, which improves behavior persistence, and shaping, which supports gradual learning through successive approximations, are currently overlooked in large language model development despite their relevance to RLHF.
Reference: Siththaranjan et al. (2024) authored the paper 'Distributional preference learning: Understanding and accounting for hidden context in RLHF', which was presented at The Twelfth International Conference on Learning Representations in 2024.
A Survey on the Theory and Mechanism of Large Language Models (arXiv, arxiv.org, Mar 12, 2026) · 3 facts
Reference: The paper 'Towards a theoretical understanding to the generalization of rlhf' is available as arXiv preprint arXiv:2601.16403.
Procedure: Tao et al. (2025) proposed the Self-Critique method to detect contamination after Reinforcement Learning from Human Feedback (RLHF), which probes for policy collapse by comparing the token-level entropy sequences of an initial response and a second, alternative critique response.
Reference: The paper 'Mitigating the alignment tax of RLHF' was published in the Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 580–606.
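The Self-Critique probe above is described only at the level of comparing token-level entropy sequences; the actual scoring rule used by Tao et al. is not given here. A minimal sketch of such a comparison, with hypothetical function names (`token_entropy`, `entropy_gap`) and a simple mean-absolute-gap statistic standing in for whatever test the paper actually applies:

```python
import math

def token_entropy(probs):
    # Shannon entropy (in nats) of a single next-token distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_gap(dists_a, dists_b):
    # Mean absolute difference between the per-token entropy sequences
    # of two responses (e.g. an initial response and a critique response).
    # Under the sketch's assumption, a persistently tiny gap would be
    # read as a sign of policy collapse; the real threshold is unknown.
    ha = [token_entropy(d) for d in dists_a]
    hb = [token_entropy(d) for d in dists_b]
    n = min(len(ha), len(hb))
    return sum(abs(a - b) for a, b in zip(ha, hb)) / n
```

For intuition: a uniform distribution over 4 tokens has entropy ln 4 ≈ 1.386 nats, while a sharply peaked one is near 0, so a collapsed policy that always commits hard shows a flat, low entropy sequence on both passes.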
LLM Hallucination Detection and Mitigation: State of the Art in 2026 (Zylos, zylos.ai, Jan 27, 2026) · 2 facts
Reference: A 2024 Stanford study demonstrated that combining RAG for knowledge grounding, chain-of-thought prompting for reasoning transparency, RLHF for alignment, active detection systems, and custom guardrails for domain constraints achieves superior results in hallucination reduction.
Measurement: The multi-layered approach combining RAG, chain-of-thought prompting, RLHF, active detection, and custom guardrails achieved a 96% reduction in hallucinations compared to baseline models.
Survey and analysis of hallucinations in large language models (Frontiers, frontiersin.org, Sep 29, 2025) · 2 facts
Claim: Efforts to mitigate hallucinations at the model level include supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), contrastive decoding, and grounded pretraining.
Formula: The conditional probability distribution of an output sequence y = (y_1, y_2, …, y_m) given an input context x = (x_1, x_2, …, x_n) is factorized as P(y|x; θ) = ∏_{t=1}^{m} P(y_t | y_{<t}, x; θ), where θ denotes the model parameters optimized via maximum likelihood estimation or reinforcement learning from human feedback (RLHF).
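The autoregressive factorization in the formula above can be checked numerically: the probability of a whole sequence is the product of the per-step conditional probabilities P(y_t | y_{<t}, x; θ), usually accumulated in log space for numerical stability. A minimal sketch (function names are illustrative, not from the source):

```python
import math

def sequence_log_prob(step_probs):
    # log P(y|x; θ) = Σ_t log P(y_t | y_<t, x; θ)
    # step_probs: the conditional probability the model assigned to each
    # emitted token, in order.
    return sum(math.log(p) for p in step_probs)

def sequence_prob(step_probs):
    # P(y|x; θ) = ∏_t P(y_t | y_<t, x; θ), recovered from log space.
    return math.exp(sequence_log_prob(step_probs))
```

For example, a three-token sequence where each token got conditional probability 0.5 has total probability 0.5³ = 0.125; maximum likelihood estimation maximizes exactly this quantity (equivalently its log) over the training data.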
Hallucination Causes: Why Language Models Fabricate Facts (M. Brenndoerfer, mbrenndoerfer.com, Mar 15, 2026) · 2 facts
Claim: Human annotators rating large language model responses during instruction tuning and RLHF tend to prefer responses that sound knowledgeable and direct over responses that sound uncertain and hedged.
Claim: Reinforcement Learning from Human Feedback (RLHF) reward models can inadvertently train large language models to be overconfident, because human annotators often mistake confidence for competence when evaluating text quality.