Relations (1)

related (score 4.17, strongly supporting): 17 facts

LLM-as-a-judge and ROUGE are competing evaluation paradigms for hallucination detection: studies show that methods optimized for ROUGE's lexical overlap often suffer significant performance drops when re-evaluated with the more human-aligned LLM-as-a-judge approach [1], [2], [3], [4].
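The lexical-overlap limitation behind this relation can be seen directly. ROUGE-L scores a candidate by its longest common subsequence (LCS) with the reference, so a factually equivalent paraphrase can score low. A minimal sketch (the sentences are toy examples chosen for illustration, not drawn from any cited dataset):

```python
def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1: LCS-based precision and recall over whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()

    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if r == c else max(dp[i - 1][j], dp[i][j - 1])
    lcs = dp[len(ref)][len(cand)]

    if lcs == 0:
        return 0.0
    prec, rec = lcs / len(cand), lcs / len(ref)
    return 2 * prec * rec / (prec + rec)

# A factually equivalent paraphrase scores poorly on lexical overlap:
ref = "the first moon landing took place in 1969"
paraphrase = "humans initially set foot on the lunar surface in 1969"
print(round(rouge_l_f1(ref, paraphrase), 2))  # 0.33: flagged despite being correct
```

A ROUGE-thresholded labeler would mark this paraphrase a hallucination, which is exactly the disagreement an LLM judge resolves differently.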

Facts (17)

Sources
Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org): 16 facts
claim: For the Llama model, the performance discrepancy between ROUGE and LLM-as-Judge evaluation narrows significantly with few-shot examples compared to zero-shot settings.
claim: An evaluation method based on 'LLM-as-Judge' agrees more closely with human assessments of factual correctness than ROUGE, according to Thakur et al. (2025).
claim: The authors of 'Re-evaluating Hallucination Detection in LLMs' state that while LLM-as-Judge is more robust than ROUGE for human-aligned evaluation, it is not without its own biases and limitations.
procedure: To examine ROUGE's failure modes, the researchers curated a dataset of instances where ROUGE and an LLM-as-Judge metric gave conflicting assessments of whether a hallucination was present.
claim: Among the evaluated hallucination detection techniques, Semantic Entropy remains comparatively stable, showing more modest performance variation between the ROUGE and LLM-as-Judge evaluation frameworks.
claim: The moderate Pearson correlation between AUROC scores derived from ROUGE and from LLM-as-Judge evaluation suggests that hallucination detection methods may be inadvertently optimized for ROUGE's lexical-overlap criteria rather than genuine factual correctness.
measurement: The eRank hallucination detection method suffers performance declines of 30.6% and 36.4% when evaluated under the LLM-as-Judge paradigm rather than against ROUGE-based scores.
measurement: The Eigenscore hallucination detection method loses 19.0% of its performance for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
claim: The LLM-as-Judge approach, as described by Zheng et al. (2023a), aligns more closely with human assessments of factual correctness than ROUGE.
claim: The authors of 'Re-evaluating Hallucination Detection in LLMs' found that while ROUGE exhibits high precision, it misses many hallucinations, whereas the LLM-as-Judge method achieves significantly higher recall and aligns more closely with human assessments.
claim: The 'LLM-as-Judge' evaluation method aligns more closely with human judgments of factual correctness than ROUGE, as validated by the human study conducted by the authors of 'Re-evaluating Hallucination Detection in LLMs'.
procedure: To evaluate hallucination detection, the authors randomly selected 200 question-answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
measurement: Existing hallucination detection methods suffer performance drops of up to 45.9% for Perplexity and 30.4% for Eigenscore when evaluated with LLM-as-Judge criteria instead of ROUGE.
measurement: The Perplexity hallucination detection method's AUROC decreases by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
procedure: Accuracies on QA datasets are computed by selecting the most likely answer at a low temperature setting and comparing it to labels derived from either ROUGE or LLM-as-Judge evaluation.
claim: Hallucination detection methods that perform well under ROUGE often show a substantial performance drop when re-evaluated under the 'LLM-as-Judge' paradigm.
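The percentage drops quoted in these facts compare a detector's AUROC under ROUGE-derived hallucination labels against its AUROC under judge-derived labels for the same detector scores. A minimal sketch of that computation, with entirely hypothetical scores and labels (not the paper's data):

```python
def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney) formulation: the probability
    that a randomly chosen positive outranks a randomly chosen negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical detector scores (higher = "more likely hallucinated") with
# hallucination labels derived from ROUGE vs. from an LLM judge.
scores       = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
rouge_labels = [1,   1,   1,   0,   1,   0,   0,   0]  # lexical-overlap labels
judge_labels = [1,   0,   1,   1,   0,   1,   0,   0]  # judge flips the paraphrase cases

a_rouge = auroc(scores, rouge_labels)
a_judge = auroc(scores, judge_labels)
drop_pct = 100 * (a_rouge - a_judge) / a_rouge
print(f"{a_rouge:.3f} -> {a_judge:.3f}, relative drop {drop_pct:.1f}%")  # 20.0% drop
```

The same relative-drop formula, applied to real detector scores and labels, is how figures like "45.9% for Perplexity" would be obtained.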
EdinburghNLP/awesome-hallucination-detection (GitHub, github.com): 1 fact
measurement: Established hallucination detection methods including Perplexity, EigenScore, and eRank suffer performance drops of up to 45.9% AUROC when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE.
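The "moderate Pearson correlation" fact above refers to correlating per-method AUROC scores obtained under the two labeling schemes: a coefficient well below 1 means method rankings under ROUGE do not transfer cleanly to judge-based evaluation. A sketch with invented AUROC values (the method count and numbers are illustrative, not the paper's):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-method AUROC pairs: (ROUGE-labeled, judge-labeled).
rouge_aurocs = [0.82, 0.78, 0.74, 0.70, 0.66]
judge_aurocs = [0.71, 0.55, 0.63, 0.48, 0.60]

r = pearson(rouge_aurocs, judge_aurocs)
print(round(r, 2))  # 0.53: moderate agreement between the two evaluations
```

A coefficient in this middle range is consistent with the card's claim that some detectors fit ROUGE's lexical criterion rather than factual correctness itself.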