Relations (1)
Facts (3)
Sources
Re-evaluating Hallucination Detection in LLMs - arXiv arxiv.org 2 facts
claimThe moderate Pearson correlation coefficient between AUROC scores derived from ROUGE and LLM-as-Judge evaluation approaches suggests that hallucination detection methods may be inadvertently optimized for ROUGE’s lexical overlap criteria rather than genuine factual correctness.
measurementThe Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 1 fact
measurementEstablished hallucination detection methods including Perplexity, EigenScore, and eRank suffer performance drops of up to 45.9% AUROC when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE.