Relations (1)
related (strength 12.00) — strongly supporting, 12 facts
LLM-as-a-judge is a specific application of Large Language Models that leverages the self-reflection and evaluative capabilities described in [1] to assess model outputs.
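The paradigm described above can be sketched as a prompt-build-and-parse loop: a judge model is given a rubric and another model's output, and its reply is parsed into a score. Everything here — the rubric wording, the `call_model` stub, and the `Score: N` reply format — is a hypothetical minimal example, not the method of any cited paper:

```python
import re

# Hypothetical judge rubric; real systems tune this prompt carefully.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Rate the RESPONSE to the QUESTION on a "
    "1-5 scale for factual accuracy. Reply exactly as 'Score: N'.\n"
    "QUESTION: {question}\nRESPONSE: {response}"
)

def call_model(prompt: str) -> str:
    """Stand-in for a real LLM API call; stubbed for illustration."""
    return "Score: 4"

def judge(question: str, response: str) -> int:
    """Prompt the judge model and parse its 1-5 score from the reply."""
    reply = call_model(JUDGE_TEMPLATE.format(question=question, response=response))
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group(1))
```

Parsing defensively matters in practice: judge models often wrap the score in extra prose, so production pipelines validate or retry on unparseable replies.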
Facts (12)
Sources
A Survey on the Theory and Mechanism of Large Language Models (arxiv.org, 3 facts)
reference: The paper 'LLMs-as-judges: a comprehensive survey on LLM-based evaluation methods' (arXiv:2412.05579) surveys methods that use large language models to evaluate other models.
reference: The paper 'A survey on llm-as-a-judge' (arXiv:2411.15594) is cited in 'A Survey on the Theory and Mechanism of Large Language Models' in the context of LLM evaluation.
perspective: The LLM-as-a-Judge paradigm rests on the assumptions that large language models can serve as valid human proxies, are capable evaluators, are scalable, and are cost-effective; each of these assumptions is being theoretically challenged (Dorner et al., 2025).
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... (cleanlab.ai, 2 facts)
claim: A potential limitation of the LLM-as-a-judge approach is that, because hallucinations stem from the unreliability of Large Language Models, relying on the same model to evaluate itself may not sufficiently close the reliability gap.
reference: A study found that TLM (Trustworthy Language Model) detects incorrect RAG responses more effectively than techniques such as LLM-as-a-judge or token probabilities (logprobs) across all major Large Language Models.
EdinburghNLP/awesome-hallucination-detection (github.com, 2 facts)
procedure: The Self-Feedback framework for improving internal consistency in Large Language Models operates in three steps: (1) Self-Evaluation, which evaluates the model's internal consistency based on language expressions, decoding-layer probability distributions, and hidden states; (2) Internal Consistency Signal, which derives numerical, textual, external, or comparative signals from the evaluation; and (3) Self-Update, which uses these signals to update the model's expressions or the model itself.
reference: The paper 'Internal Consistency and Self-Feedback in Large Language Models: A Survey' proposes an 'Internal Consistency' framework to enhance reasoning and alleviate hallucinations, consisting of three components: Self-Evaluation, Internal Consistency Signal, and Self-Update.
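The three-step Self-Feedback procedure above can be sketched using one of the simplest signal types it names, a comparative signal: majority agreement across sampled answers. `sample_answers` is a hypothetical stub standing in for repeated LLM decoding; signals from probability distributions and hidden states, which the survey also covers, are not modeled here:

```python
from collections import Counter

def sample_answers(question: str, n: int = 5) -> list[str]:
    """Stub: a real implementation would sample n completions from an LLM."""
    return ["Paris", "Paris", "Lyon", "Paris", "Paris"]

def self_feedback(question: str) -> tuple[str, float]:
    answers = sample_answers(question)            # (1) Self-Evaluation: probe the model repeatedly
    best, count = Counter(answers).most_common(1)[0]
    consistency = count / len(answers)            # (2) comparative Internal Consistency Signal
    return best, consistency                      # (3) Self-Update: keep the most consistent expression
```

A low consistency score would signal a likely hallucination; the Self-Update step could then trigger re-generation or abstention rather than returning the majority answer.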
The Functionalist Case for Machine Consciousness: Evidence from ... (lesswrong.com, 1 fact)
claim: Current Large Language Models (LLMs) demonstrate sophisticated self-reflection, which suggests they may implement consciousness-relevant functions that deserve careful consideration.
A Survey of Incorporating Psychological Theories in LLMs (arxiv.org, 1 fact)
claim: Self-reflection and meta-cognition, as defined by Phillips (2020) and Flavell (1979), support iterative introspection to improve retrieval (Asai et al., 2024) and multi-step inference (Zhou et al., 2024) in LLMs.
Unknown source (1 fact)
measurement: Several established hallucination detection methods for Large Language Models exhibit performance drops of up to 45.9% when evaluated with human-aligned metrics such as LLM-as-a-Judge.
Survey and analysis of hallucinations in large language models (frontiersin.org, 1 fact)
claim: Evaluation approaches for large language models are evolving to include natural-language-inference-based scoring, fact-checking pipelines, and LLM-as-a-judge methodologies, as noted by Liu et al. (2023).
A Comprehensive Benchmark and Evaluation Framework for Multi ... (arxiv.org, 1 fact)
procedure: The evaluation framework for medical consultation competence in LLMs combines synthetic case generation, structured clinical key-point annotation, a reproducible patient agent, and a calibrated LLM-as-judge evaluation pipeline.
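One component of such a pipeline — scoring a response against annotated clinical key points — might look like the sketch below. The string-matching check is a deliberate simplification and an assumption of this example; the cited framework uses a calibrated LLM judge, not keyword matching:

```python
def keypoint_coverage(response: str, key_points: list[str]) -> float:
    """Fraction of annotated key points mentioned in the response.

    Hypothetical simplification: substring matching stands in for the
    LLM judge's semantic check of whether each key point was addressed.
    """
    text = response.lower()
    hits = sum(1 for kp in key_points if kp.lower() in text)
    return hits / len(key_points) if key_points else 0.0
```

Even in LLM-judged pipelines, a deterministic coverage baseline like this is useful for calibrating the judge: large disagreements between the two scores flag cases worth human review.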