Relations (1)
related (score 0.70) — strongly supported by 6 facts
ROUGE is a standard automatic metric for evaluating the output of Large Language Models, as noted in [1] and [2], though it is frequently criticized as inadequate for assessing factual consistency and detecting hallucinations in these models, as described in [3], [4], [5], [6], and [7].
Facts (6)
Sources
Unknown source — 2 facts
Survey and analysis of hallucinations in large language models (frontiersin.org) — 2 facts
Claim: Traditional automatic metrics like BLEU, ROUGE, and METEOR are inadequate for assessing factual consistency in large language models, according to Maynez et al. (2020).
Claim: Automatic metrics such as BLEU or ROUGE fail to capture factual consistency and reliability in Large Language Models, according to Maynez et al. (2020).
The Illusion of Progress: Re-evaluating Hallucination Detection in ... (arxiv.org) — 1 fact
Claim: The paper 'The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs' argues that current evaluation practices for hallucination detection in large language models are fundamentally flawed because they rely on metrics like ROUGE that misalign with human judgments.
A survey on augmenting knowledge graphs (KGs) with large ... (link.springer.com) — 1 fact
Claim: ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a metric used to evaluate the quality of summaries generated by large language models integrated with knowledge graphs, measuring n-gram overlap with reference summaries and reporting precision, recall, and F1-score.
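The overlap computation described in the last claim can be sketched minimally. The following is an illustrative ROUGE-1 (unigram overlap) implementation, not a production metric: real toolkits such as the `rouge-score` package add stemming, tokenization rules, and ROUGE-2/ROUGE-L variants, and the example sentences are invented for demonstration.

```python
from collections import Counter

def rouge1(candidate: str, reference: str) -> dict:
    """Compute ROUGE-1 precision, recall, and F1 from clipped
    unigram overlap between a candidate and a reference summary."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped match count: each unigram counts at most as often
    # as it appears in the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = (2 * precision * recall / (precision + recall)) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

scores = rouge1("the cat sat on the mat", "the cat lay on the mat")
```

Because the score rewards surface overlap, a candidate can score highly while stating a factually wrong or hallucinated claim, which is exactly the inadequacy the criticisms in [3]–[7] point at.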