measurement
Established hallucination detection methods, including Perplexity, EigenScore, and eRank, suffer performance drops of up to 45.9% AUROC when they are evaluated against human-aligned LLM-as-Judge labels instead of ROUGE-based labels.
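To make the comparison concrete, here is a minimal sketch of how one detector's AUROC can change when the same scores are graded against two different label sets. All numbers are hypothetical toy data and do not reproduce the 45.9% figure. The names `auroc`, `rouge_labels`, and `judge_labels` are illustrative assumptions, and the drop is computed as an absolute difference in AUROC, which is also an assumption since the claim does not say whether the figure is absolute or relative.

```python
def auroc(scores, labels):
    """Rank-based AUROC: chance that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores (higher = more likely hallucinated),
# e.g. the output of a perplexity-style method on eight answers.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# Hypothetical labels (1 = hallucinated) from thresholding ROUGE against references.
rouge_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hypothetical labels (1 = hallucinated) from an LLM-as-Judge on the same answers.
judge_labels = [1, 0, 1, 0, 1, 0, 0, 1]

auroc_rouge = auroc(scores, rouge_labels)
auroc_judge = auroc(scores, judge_labels)
drop = auroc_rouge - auroc_judge  # absolute difference, by assumption

print(f"AUROC vs ROUGE labels: {auroc_rouge:.4f}")
print(f"AUROC vs judge labels: {auroc_judge:.4f}")
print(f"Absolute drop:         {drop:.4f}")
```

The detector scores are identical in both runs; only the ground-truth labels change, which is the effect the measurement describes.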
Authors
Sources
- EdinburghNLP/awesome-hallucination-detection (GitHub, github.com), via serper
Referenced by nodes (5)
- LLM-as-a-judge concept
- ROUGE concept
- AUROC concept
- Perplexity concept
- EigenScore concept