Sources
Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org): 2 facts
Measurement: The Eigenscore hallucination detection method suffers a performance drop of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when evaluation switches from ROUGE to LLM-as-Judge.
Measurement: Existing hallucination detection methods suffer performance drops of up to 45.9% for Perplexity and 30.4% for Eigenscore when evaluated with LLM-as-Judge criteria rather than ROUGE.
EdinburghNLP/awesome-hallucination-detection (GitHub, github.com): 1 fact
Measurement: Established hallucination detection methods, including Perplexity, EigenScore, and eRank, suffer AUROC drops of up to 45.9% when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE.
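The facts above report AUROC shifts when the ground-truth labeling criterion changes, not when the detector changes. A minimal sketch of the mechanism, assuming toy data: the same detector scores can yield very different AUROC depending on whether hallucination labels come from ROUGE overlap or from an LLM judge. All scores and labels below are illustrative, not taken from the cited papers; AUROC is computed via the rank-based Mann-Whitney formulation.

```python
def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U); tied scores share average ranks."""
    pairs = sorted(zip(scores, labels))          # ascending by score
    ranks = [0.0] * len(pairs)
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1                               # extend over a tie group
        for k in range(i, j):
            ranks[k] = (i + j + 1) / 2           # average rank of positions i+1..j
        i = j
    n_pos = sum(y for _, y in pairs)
    n_neg = len(pairs) - n_pos
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Toy detector scores (higher = more likely hallucination) for six answers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
# Hypothetical labels: which answers each criterion would call hallucinated.
labels_rouge = [1, 1, 0, 1, 0, 0]   # ROUGE-derived ground truth
labels_judge = [1, 0, 0, 1, 0, 1]   # LLM-as-Judge-derived ground truth

print(f"AUROC vs ROUGE labels: {auroc(scores, labels_rouge):.3f}")  # 0.889
print(f"AUROC vs judge labels: {auroc(scores, labels_judge):.3f}")  # 0.444
```

With identical scores, relabeling two of six answers moves AUROC from 0.889 to 0.444, which is how a method can look strong under ROUGE yet weak under judge-based evaluation.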