Relations (1)

related 2.00 — strongly supporting 3 facts

The relationship is established by the shared use of AUROC as the performance metric: each source reports how the AUROC scores of hallucination detection methods change when a ROUGE-based evaluation is replaced by an LLM-as-a-Judge evaluation framework [1], [2], [3].
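
To make the comparison concrete, the sketch below computes AUROC for one detector's scores under two different label sources, mirroring the ROUGE-vs-LLM-as-Judge setup the facts describe. All scores and labels are hypothetical toy values (not taken from any of the cited sources), and the rank-based AUROC is implemented directly so no third-party library is assumed.

```python
def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U): the probability that a randomly
    chosen positive example receives a higher score than a randomly
    chosen negative example, with ties counted as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical detector scores (e.g. per-answer hallucination scores) and
# two label sources for the same six answers: a lexical-overlap (ROUGE-style)
# verdict versus an LLM-as-Judge verdict. 1 = hallucination, 0 = faithful.
scores       = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
rouge_labels = [1,   1,   0,   1,   0,   0]
judge_labels = [1,   0,   1,   0,   1,   0]

print(auroc(scores, rouge_labels))  # AUROC under ROUGE-style labels
print(auroc(scores, judge_labels))  # AUROC under LLM-as-Judge labels
```

The same detector scores yield a noticeably lower AUROC under the judge labels in this toy example, which is the pattern the cited measurements quantify (drops of up to 45.9%).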

Facts (3)

Sources
Re-evaluating Hallucination Detection in LLMs - arXiv (2 facts)
claim: The moderate Pearson correlation coefficient between AUROC scores derived from ROUGE and LLM-as-Judge evaluation approaches suggests that hallucination detection methods may be inadvertently optimized for ROUGE's lexical overlap criteria rather than genuine factual correctness.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
EdinburghNLP/awesome-hallucination-detection - GitHub (1 fact)
measurement: Established hallucination detection methods including Perplexity, EigenScore, and eRank suffer performance drops of up to 45.9% AUROC when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE.