Relations (1)

related 4.52 — strongly supporting 22 facts

ROUGE is a commonly used yet flawed metric for evaluating hallucination detection methods: it measures lexical overlap rather than factual correctness [1], [2], [3]. Research indicates that ROUGE misaligns with human judgments and yields misleading performance estimates for hallucination detection [4], [5], [6].

Facts (22)

Sources
Re-evaluating Hallucination Detection in LLMs (arxiv.org, 15 facts)
claim: Many hallucination detection methods use ROUGE as a primary correctness metric, often applying threshold-based heuristics in which responses with low ROUGE overlap with reference answers are labeled as hallucinated.
claim: Among the evaluated hallucination detection techniques, Semantic Entropy is comparatively stable, exhibiting more modest performance variations between the ROUGE and LLM-as-Judge evaluation frameworks.
claim: The only moderate Pearson correlation between AUROC scores derived from ROUGE and from LLM-as-Judge evaluation suggests that hallucination detection methods may be inadvertently optimized for ROUGE's lexical-overlap criterion rather than genuine factual correctness.
measurement: The eRank hallucination detection method suffers performance declines of 30.6% and 36.4% when evaluated under the LLM-as-Judge paradigm rather than with ROUGE-based scores.
measurement: The Eigenscore hallucination detection method experiences a performance erosion of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
claim: ROUGE can provide misleading assessments of both Large Language Model responses and the efficacy of hallucination detection techniques due to its inherent failure modes.
claim: ROUGE and other commonly used metrics based on n-grams and semantic similarity share vulnerabilities in hallucination detection tasks, indicating a broader deficiency in current evaluation practices.
perspective: The authors of 'Re-evaluating Hallucination Detection in LLMs' argue that ROUGE is a poor proxy for human judgment in evaluating hallucination detection because its design for lexical overlap does not inherently capture factual correctness.
claim: While ROUGE exhibits high recall in hallucination detection, its extremely low precision leads to misleading performance estimates.
procedure: To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question–answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
measurement: Existing hallucination detection methods experience performance drops of up to 45.9% for Perplexity and 30.4% for Eigenscore when evaluated using LLM-as-Judge criteria compared to ROUGE.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
claim: The ROUGE metric suffers from critical failure modes that undermine its utility for hallucination detection, specifically sensitivity to response length, an inability to handle semantic equivalence, and susceptibility to false lexical matches.
claim: Hallucination detection methods that perform well under ROUGE often show a substantial performance drop when re-evaluated using the 'LLM-as-Judge' paradigm.
claim: Reference-based metrics like ROUGE show a clear misalignment with human judgments when identifying hallucinations in Question Answering tasks, as they consistently reward fluent yet factually incorrect responses.
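The failure modes described in the facts above (false lexical matches, missed semantic equivalence, threshold-based labeling) can be illustrated with a minimal ROUGE-L sketch. The sentences and the 0.5 cutoff below are invented for illustration and are not taken from the paper:

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f1(candidate, reference):
    # ROUGE-L F1 over whitespace tokens (no stemming, for brevity).
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

reference = "barack obama was born in hawaii"
paraphrase = "hawaii is the birthplace of obama"   # factually correct, low overlap
false_match = "barack obama was born in kenya"     # factually wrong, high overlap

THRESHOLD = 0.5  # hypothetical cutoff: low ROUGE => labeled "hallucinated"
for answer in (paraphrase, false_match):
    score = rouge_l_f1(answer, reference)
    # The correct paraphrase scores low and is wrongly flagged;
    # the false lexical match scores high and wrongly passes.
    print(f"{score:.2f} hallucinated={score < THRESHOLD}")
```

Under this heuristic the semantically equivalent paraphrase is flagged as a hallucination while the factually wrong answer is accepted, which is exactly the precision failure the facts describe.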
Unknown source (3 facts)
claim: ROUGE misaligns with the requirements of hallucination detection in Large Language Models.
claim: Many hallucination detection methods for Large Language Models rely on ROUGE for evaluation.
claim: Many hallucination detection methods for Large Language Models rely on ROUGE for evaluation, despite ROUGE being a metric based on lexical overlap that misaligns with the objective of detecting hallucinations.
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs (arxiv.org, 2 facts)
claim: ROUGE, a metric based on lexical overlap, exhibits high recall but extremely low precision when used for hallucination detection, leading to misleading performance estimates.
claim: The paper 'The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs' argues that current evaluation practices for hallucination detection in large language models are fundamentally flawed because they rely on metrics like ROUGE that misalign with human judgments.
EdinburghNLP/awesome-hallucination-detection (github.com, 2 facts)
claim: Hallucination detection metrics measure either the degree of hallucination in generated responses relative to given knowledge or their overlap with gold faithful responses, including Critic, Q² (F1, NLI), BERTScore, F1, BLEU, and ROUGE.
claim: ROUGE-based evaluation systematically overestimates hallucination detection performance in Question Answering tasks.
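Several measurements above report AUROC drops when ground-truth labels come from LLM-as-Judge instead of ROUGE. A minimal sketch (detector scores and both label sets are synthetic, invented purely for illustration) shows how a fixed detector's AUROC changes solely because the labeling changes:

```python
def auroc(scores, labels):
    # Rank-based AUROC: probability that a randomly chosen positive
    # (hallucinated) example outranks a randomly chosen negative one,
    # counting ties as half a win.
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical hallucination scores from one detector (higher = more suspect).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Labels from a ROUGE threshold vs. from an LLM judge; the two label
# sets disagree on two examples (all values are invented).
rouge_labels = [1, 1, 1, 1, 0, 0, 0, 0]
judge_labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(auroc(scores, rouge_labels))  # 1.0
print(auroc(scores, judge_labels))  # 0.8125
```

The detector and its scores never change; only the ground-truth labeling does, yet the measured AUROC drops. This is the mechanism behind the reported gaps between ROUGE-based and LLM-as-Judge evaluation.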