concept

ROUGE

Also known as: Recall-Oriented Understudy for Gisting Evaluation

Facts (75)

Sources
Re-evaluating Hallucination Detection in LLMs (arxiv.org, Aug 13, 2025) - 41 facts
claim: The researchers identified three critical limitations in ROUGE's evaluation approach: sensitivity to response length, inability to handle semantic equivalence, and over-reliance on exact lexical matches.
claim: Many hallucination detection methods use ROUGE as a primary correctness metric, often applying threshold-based heuristics in which responses with low ROUGE overlap against reference answers are labeled as hallucinated.
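The threshold heuristic described above can be sketched in a few lines. This is an illustrative reconstruction, not code from any of the cited papers; the 0.3 cutoff, whitespace tokenization, and function names are assumptions:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """ROUGE-1 F1: unigram overlap between candidate and reference,
    with counts clipped via Counter intersection."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def is_hallucinated(response: str, reference: str, threshold: float = 0.3) -> bool:
    """Threshold heuristic: low lexical overlap with the reference
    is taken as evidence of hallucination."""
    return rouge1_f1(response, reference) < threshold
```

Note the failure mode this invites: the terse but correct answer `is_hallucinated("Paris", "The capital of France is Paris")` is flagged as a hallucination, because its ROUGE-1 F1 (2/7, about 0.29) falls under the cutoff.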
claim: Length mismatch is the most frequent type of error made by the ROUGE evaluation metric.
claim: The limitations of ROUGE lead to false negatives, where factually correct responses are marked as incorrect, and false positives, where incorrect responses receive high scores.
claim: For the Llama model, the performance discrepancy between ROUGE and LLM-as-Judge evaluation narrows significantly when using few-shot examples compared to zero-shot settings.
claim: ROUGE exhibits low precision for identifying actual factual errors when compared against human judgments of factual correctness.
claim: Prompt engineering and dataset-specific post-processing techniques often lack scalability and generalizability across different models and datasets when attempting to improve ROUGE scores.
claim: The ROUGE evaluation metric can assign high scores to factually incorrect answers if they share surface structure with the reference, creating a bias toward structurally similar but factually wrong responses.
claim: An evaluation method based on 'LLM-as-Judge' demonstrates closer agreement with human assessments of factual correctness compared to ROUGE, according to Thakur et al. (2025).
claim: The authors of 'Re-evaluating Hallucination Detection in LLMs' state that while LLM-as-Judge is more robust than ROUGE for human-aligned evaluation, it is not without its own biases and limitations.
procedure: The researchers curated a dataset of instances where ROUGE and an LLM-as-Judge metric provided conflicting assessments regarding the presence of hallucinations, in order to examine ROUGE's failure modes.
claim: Among the evaluated hallucination detection techniques, Semantic Entropy maintains a degree of relative stability, exhibiting more modest performance variations between ROUGE and LLM-as-Judge evaluation frameworks.
claim: The moderate Pearson correlation coefficient between AUROC scores derived from ROUGE and LLM-as-Judge evaluation approaches suggests that hallucination detection methods may be inadvertently optimized for ROUGE's lexical overlap criteria rather than genuine factual correctness.
reference: Chin-Yew Lin (2004) developed 'ROUGE', a software package for the automatic evaluation of summaries.
measurement: The eRank hallucination detection method experiences performance declines of 30.6% and 36.4%, depending on the model, when evaluated under the LLM-as-Judge paradigm rather than against ROUGE-based scores.
claim: Research by Honovich et al. (2022) and Kang et al. (2024) indicates that the ROUGE evaluation metric is poorly aligned with human judgments of factual correctness in AI systems.
perspective: The authors of 'Re-evaluating Hallucination Detection in LLMs' argue that there is a need to evaluate Large Language Model responses against human-aligned metrics rather than ROUGE.
measurement: The Eigenscore hallucination detection method experiences a performance erosion of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
measurement: ROUGE scores demonstrate systematic length bias, where responses exceeding 100 tokens consistently receive scores below the 0.3 threshold, regardless of factual accuracy.
claim: The ROUGE evaluation metric fails to recognize semantic equivalence between different phrasings, such as 'elevation' and 'relief' in the context of topographic maps, leading to lower scores due to lexical mismatch.
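The 'elevation' vs 'relief' failure is easy to reproduce with any lexical-overlap scorer. The sketch below uses a hand-rolled ROUGE-1 F1 (an assumption for illustration; real ROUGE packages add stemming, which still would not bridge these synonyms):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Minimal ROUGE-1 F1 over whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

# A synonymous one-word answer scores exactly zero against the gold answer,
# while the verbatim answer scores a perfect 1.0 -- the metric sees tokens, not meaning.
zero = rouge1_f1("relief", "elevation")
one = rouge1_f1("elevation", "elevation")
```

In longer answers the effect is diluted but still present: swapping one synonym in a four-token answer drops F1 from 1.0 to 0.75, purely on lexical grounds.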
claim: The LLM-as-Judge approach, as described by Zheng et al. (2023a), aligns more closely with human assessments of factual correctness than ROUGE.
claim: Current evaluation approaches, including ROUGE and length-based metrics, fail to distinguish between inefficient repetitive responses and actual hallucinations when the core information is correct.
claim: The authors of 'Re-evaluating Hallucination Detection in LLMs' found that while ROUGE exhibits high recall, its low precision causes it to flag many factually correct responses as hallucinations, whereas the LLM-as-Judge method aligns significantly more closely with human assessments.
claim: ROUGE can provide misleading assessments of both Large Language Model responses and the efficacy of hallucination detection techniques due to its inherent failure modes.
claim: Traditional n-gram overlap measures like ROUGE are limited in their ability to reliably assess factual consistency in AI systems.
claim: ROUGE and other commonly used metrics based on n-grams and semantic similarity share vulnerabilities in hallucination detection tasks, indicating a broader deficiency in current evaluation practices.
perspective: The authors of 'Re-evaluating Hallucination Detection in LLMs' argue that ROUGE is a poor proxy for human judgment in evaluating hallucination detection because its design for lexical overlap does not inherently capture factual correctness.
claim: The 'LLM-as-Judge' evaluation method offers a closer alignment with human judgments of factual correctness compared to ROUGE, as validated by the human study conducted by the authors of 'Re-evaluating Hallucination Detection in LLMs'.
claim: ROUGE is an evaluation metric based on lexical overlap.
claim: While ROUGE exhibits high recall in hallucination detection, its extremely low precision leads to misleading performance estimates.
procedure: To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question–answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
claim: The ROUGE metric has limitations that underscore broader concerns regarding the reliability of reference-based evaluation methods for large language models.
claim: The ROUGE evaluation metric systematically penalizes factually correct but verbose answers due to length mismatches between the prediction and the gold answer.
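The verbosity penalty is mechanical: every extra token in a correct answer dilutes precision, and F1 with it. A worked example under the same minimal ROUGE-1 sketch (the QA pair and gold answer are invented for illustration):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Minimal ROUGE-1 F1 over whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

gold = "1969"
terse = "1969"
verbose = "Apollo 11 first landed humans on the Moon in 1969"

# Both answers contain the gold fact, so recall against the gold answer is 1.0
# for both; but the verbose answer's precision is 1/10, dragging F1 down to 2/11.
terse_score = rouge1_f1(terse, gold)      # 1.0
verbose_score = rouge1_f1(verbose, gold)  # ~0.18
```

Under the 0.3-style cutoffs mentioned elsewhere on this page, the verbose answer would be labeled a hallucination despite being factually correct.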
measurement: Existing hallucination detection methods experience performance drops of up to 45.9% for Perplexity and 30.4% for Eigenscore when evaluated using LLM-as-Judge criteria compared to ROUGE.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
claim: The ROUGE metric suffers from critical failure modes that undermine its utility for hallucination detection, specifically sensitivity to response length, an inability to handle semantic equivalence, and susceptibility to false lexical matches.
claim: The ROUGE evaluation metric exhibits a bias against longer responses, consistently assigning lower scores to responses exceeding token thresholds regardless of their factual accuracy.
procedure: Accuracies on QA datasets in the study are computed by selecting the most likely answer at a low temperature setting and comparing it to labels derived from either ROUGE or LLM-as-Judge evaluations.
claim: ROUGE metrics can be manipulated via trivial repetition to improve scores even when factual content remains constant.
claim: Hallucination detection methods that perform well under ROUGE often show a substantial performance drop when re-evaluated using the 'LLM-as-Judge' paradigm.
claim: Reference-based metrics like ROUGE show a clear misalignment with human judgments when identifying hallucinations in Question Answering tasks, as they consistently reward fluent yet factually incorrect responses.
EdinburghNLP/awesome-hallucination-detection (github.com) - 11 facts
claim: KF1, BLEU, ROUGE, chrF, METEOR, BERTScore, BARTScore, BLEURT, and average length are metrics used for evaluating AI systems.
claim: Hallucination detection metrics measure either the degree of hallucination in generated responses relative to given knowledge or their overlap with gold faithful responses, including Critic, Q² (F1, NLI), BERTScore, F1, BLEU, and ROUGE.
reference: The GAuGE (Genetic Approach using Grounded Evolution) framework models the generative information retrieval process as a genetic algorithm to reduce hallucinations in answers, utilizing factuality verification accuracy (FEVER-style support/refute classification) and answer relevance metrics like n-gram overlap, ROUGE, and NDCG.
measurement: ProMaC is evaluated using Accuracy (detection) and ROUGE (correction) metrics on the SummEval, QAGS-C, and QAGS-X datasets.
claim: ROUGE-based evaluation systematically overestimates hallucination detection performance in Question Answering tasks.
reference: The QuestEval metric is used for testing consistency, coherence, fluency, and relevance in AI-generated text, alongside other metrics like ROUGE, BLEU, METEOR, BERTScore, SummaQA, and QAGS.
claim: A large-scale human study of hallucinations in extreme summarization using XSum (BBC articles) found that extrinsic hallucinations are frequent, even in gold summaries, and that textual entailment correlates best with human judgments of faithfulness and factuality, compared to ROUGE, BERTScore, or QA-based metrics.
reference: The 'Survey of Hallucination in Natural Language Generation' classifies metrics into statistical metrics (such as ROUGE, BLEU, PARENT, and Knowledge F1) and model-based metrics.
measurement: Established hallucination detection methods including Perplexity, EigenScore, and eRank suffer performance drops of up to 45.9% AUROC when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE.
claim: Training summarization models with soft labels from a teacher large language model reduces overconfidence and hallucination rates while maintaining quality metrics like ROUGE and BERTScore.
reference: The XEnt metric suite evaluates hallucination and factuality in AI systems using Accuracy, F1, ROUGE, percentage of novel n-grams, and faithfulness metrics including %ENFS, FEQA, and DAE.
Unknown source - 5 facts
claim: The authors of 'Re-evaluating Hallucination Detection in LLMs' assert that ROUGE misaligns with the requirements for evaluating hallucination detection in Large Language Models.
claim: ROUGE misaligns with the requirements of hallucination detection in Large Language Models.
claim: BLEU, ROUGE, and METEOR are traditional automatic metrics used for evaluating text generation.
claim: Many hallucination detection methods for Large Language Models rely on ROUGE for evaluation.
claim: Many hallucination detection methods for Large Language Models rely on ROUGE for evaluation, despite ROUGE being a metric based on lexical overlap that misaligns with the objective of detecting hallucinations.
Detecting and Evaluating Medical Hallucinations in Large Vision ... (arxiv.org, Jun 14, 2024) - 5 facts
claim: The ROUGE metric is prone to extreme cases of failure, such as when punctuation differences (e.g., 'Lung.' vs 'lung') prevent a direct match, or when short responses prevent the computation of ROUGE-2 and ROUGE-L scores.
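Both extreme cases in the fact above can be shown with a naive n-gram extractor. This sketch assumes a whitespace tokenizer that keeps punctuation attached; production ROUGE implementations typically strip punctuation and may stem, which avoids the first failure but not the second:

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

cand = "Lung.".lower().split()  # ['lung.']
ref = "lung".lower().split()    # ['lung']

# Failure 1: the attached period blocks the unigram match entirely,
# so ROUGE-1 between 'Lung.' and 'lung' is zero.
no_match = not (set(ngrams(cand, 1)) & set(ngrams(ref, 1)))

# Failure 2: a single-token response yields no bigrams at all,
# so ROUGE-2 has nothing to score.
no_bigrams = ngrams(cand, 2) == []
```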
claim: The BLEU metric accounts for significant length differences between generated text and ground truth, making it more versatile than ROUGE, but it remains a weak measure of factual correctness.
claim: BertScore mitigates some shortcomings of ROUGE and BLEU but does not intuitively reflect factual accuracy or the degree of hallucination in medical texts.
measurement: The BLIP family of models achieves an average score of 7.35% on the ROUGE metric when evaluated on the Med-VQA task within the Med-HallMark benchmark.
claim: Med-HallMark supports hallucination detection using POPE and CHAIR metrics for closed-ended questions, and BertScore and ROUGE metrics for open-ended questions.
Survey and analysis of hallucinations in large language models (frontiersin.org, Sep 29, 2025) - 3 facts
claim: Traditional automatic metrics like BLEU, ROUGE, and METEOR are inadequate for assessing factual consistency in large language models, according to Maynez et al. (2020).
claim: Traditional lexical metrics like BLEU or ROUGE fail to capture semantic grounding in AI systems.
claim: Automatic metrics such as BLEU or ROUGE fail to capture factual consistency and reliability in Large Language Models, according to Maynez et al. (2020).
A framework to assess clinical safety and hallucination rates of LLMs ... (nature.com, May 13, 2025) - 2 facts
reference: The ROUGE metric, introduced in 'ROUGE: A Package for Automatic Evaluation of Summaries' (2004), is a method for the automatic evaluation of text summaries.
claim: Automated metrics like ROUGE, BLEU, and BERT-score, which are designed to compare model-generated text with expert-written examples, have significant limitations in healthcare because they focus on surface-level textual similarity rather than semantic nuances, contextual dependencies, and domain-specific knowledge.
The Illusion of Progress: Re-evaluating Hallucination Detection in ... (arxiv.org, Aug 1, 2025) - 2 facts
claim: ROUGE, a metric based on lexical overlap, exhibits high recall but extremely low precision when used for hallucination detection, leading to misleading performance estimates.
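A toy confusion-matrix calculation shows how "high recall, low precision" plays out when low ROUGE overlap is used as a hallucination flag. The scores and labels below are fabricated purely to illustrate the arithmetic, not drawn from the paper:

```python
# (human_label_is_hallucination, rouge_score_vs_reference) -- invented toy data
examples = [
    (True, 0.05), (True, 0.10),                   # real hallucinations, low overlap
    (False, 0.15), (False, 0.20), (False, 0.25),  # correct but paraphrased/terse answers
    (False, 0.90),                                # correct, near-verbatim answer
]

THRESHOLD = 0.3  # responses under this ROUGE score get flagged as hallucinated
flagged = [(is_hall, score < THRESHOLD) for is_hall, score in examples]

tp = sum(1 for is_hall, pred in flagged if is_hall and pred)
fp = sum(1 for is_hall, pred in flagged if not is_hall and pred)
fn = sum(1 for is_hall, pred in flagged if is_hall and not pred)

recall = tp / (tp + fn)     # 2 / 2 = 1.0 -> every hallucination is caught
precision = tp / (tp + fp)  # 2 / 5 = 0.4 -> most flags are false alarms
```

Paraphrased correct answers sit below the cutoff just like genuine hallucinations do, which is exactly why recall stays high while precision collapses.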
claim: The paper 'The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs' argues that current evaluation practices for hallucination detection in large language models are fundamentally flawed because they rely on metrics like ROUGE that misalign with human judgments.
A Comprehensive Benchmark and Evaluation Framework for Multi ... (arxiv.org, Jan 6, 2026) - 2 facts
claim: The evaluation of medical agents has evolved from linguistic metrics like BLEU and ROUGE to action-oriented benchmarks such as MedAgentBench and MedAgentBoard.
claim: Traditional n-gram metrics like ROUGE and BLEU are insufficient for capturing the clinical validity of generated text in medical LLMs.
The Hallucinations Leaderboard, an Open Effort to Measure ... (huggingface.co, Jan 29, 2024) - 1 fact
procedure: To assess the faithfulness of models to original documents in summarisation tasks, the Hallucination Leaderboard uses ROUGE (measuring overlap between generated and reference text), factKB (a generalisable model-based metric for factuality evaluation), and BERTScore-Precision (which computes similarity between two texts using token representation similarities).
A survey on augmenting knowledge graphs (KGs) with large ... (link.springer.com, Nov 4, 2024) - 1 fact
claim: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric used to evaluate the quality of summaries generated by large language models integrated with knowledge graphs, by comparing overlap with reference summaries using precision, recall, and F1-score.
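The precision/recall/F1 breakdown mentioned above can be sketched as follows. This is a simplified illustration (whitespace tokens, no stemming, unigrams only), not the implementation used by any particular survey:

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """ROUGE-1 precision, recall, and F1 from clipped unigram counts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    p = overlap / sum(cand.values()) if cand else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}
```

For a generated summary sharing five of six reference tokens, e.g. `rouge1("the cat lay on the mat", "the cat sat on the mat")`, precision, recall, and F1 all come out to 5/6.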
Evaluating RAG applications with Amazon Bedrock knowledge base ... (aws.amazon.com, Mar 14, 2025) - 1 fact
claim: Metrics such as ROUGE and F1 can be inaccurate because they rely on shallow linguistic similarities (word overlap) between ground truth and LLM responses, even when the actual meaning differs.
Practices, opportunities and challenges in the fusion of knowledge ... (frontiersin.org) - 1 fact
claim: Current evaluation metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) mainly measure surface text similarity and fail to effectively capture the semantic consistency between generated text and knowledge graph content.