Relations (1)

related (confidence 0.70) — strongly supported by 6 facts

ROUGE is a standard automatic metric for evaluating the output of Large Language Models, as noted in [1] and [2], though it is frequently criticized as inadequate for assessing factual consistency and detecting hallucinations in these models, as described in [3], [4], [5], [6], and [7].

Facts (6)

Sources
Unknown source — 2 facts
claim: ROUGE misaligns with the requirements of hallucination detection in Large Language Models.
claim: Many hallucination detection methods for Large Language Models rely on ROUGE for evaluation.
Survey and analysis of hallucinations in large language models (Frontiers, frontiersin.org) — 2 facts
claim: Traditional automatic metrics like BLEU, ROUGE, and METEOR are inadequate for assessing factual consistency in large language models, according to Maynez et al. (2020).
claim: Automatic metrics such as BLEU or ROUGE fail to capture factual consistency and reliability in Large Language Models, according to Maynez et al. (2020).
The Illusion of Progress: Re-evaluating Hallucination Detection in ... (arXiv, arxiv.org) — 1 fact
claim: The paper 'The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs' argues that current evaluation practices for hallucination detection in large language models are fundamentally flawed because they rely on metrics like ROUGE that misalign with human judgments.
A survey on augmenting knowledge graphs (KGs) with large ... (Springer, link.springer.com) — 1 fact
claim: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric that evaluates the quality of summaries generated by large language models integrated with knowledge graphs by measuring their overlap with reference summaries, reported as precision, recall, and F1-score.
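The claim above describes ROUGE as an overlap metric reported via precision, recall, and F1. A minimal sketch of ROUGE-1 (unigram overlap with clipped counts; the function name `rouge1` is illustrative, not a library API) could look like:

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Compute ROUGE-1 precision, recall, and F1 between a candidate
    summary and a reference summary via unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it
    # appears in the other text.
    overlap = sum(min(cand[t], ref[t]) for t in cand)
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
print(scores)  # 5 of 6 unigrams overlap, so all three scores are 5/6
```

Because the score rewards surface overlap only, a hallucinated summary that reuses the reference's wording can still score highly, which is the misalignment the claims above point to.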