measurement
Established hallucination detection methods including Perplexity, EigenScore, and eRank suffer performance drops of up to 45.9% AUROC when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE.

Authors

Sources

Referenced by nodes (5)