LLM-as-a-judge ↔ Perplexity

Relations (1)

related 2.00 (strongly supporting, 3 facts)

Perplexity is a hallucination detection method whose reported performance, measured by AUROC, drops by up to 45.9% when answer correctness is labeled with LLM-as-a-judge criteria instead of ROUGE, as detailed in [1], [2], and [3].
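As a rough illustration of the score being evaluated, perplexity can be computed from the per-token log-probabilities of a generated answer; a higher value is usually read as a sign of a less reliable answer. This is a minimal sketch, not the implementation used in the cited sources: the function name `perplexity` and the input values are assumptions made for illustration.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of one generated answer.

    token_logprobs: natural-log probabilities the model assigned to each
    generated token (hypothetical inputs, not taken from the cited sources).
    Returns exp of the mean negative log-probability.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# A confident answer (high token probabilities) gets a low perplexity,
# an uncertain one gets a high perplexity.
confident = [math.log(0.9)] * 5
uncertain = [math.log(0.2)] * 5
print(perplexity(confident) < perplexity(uncertain))
```

The detector then treats this number as a hallucination score, so its quality depends entirely on which answers the evaluation labels as incorrect.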

Facts (3)

Sources

Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org), 2 facts

measurement: Existing hallucination detection methods experience performance drops of up to 45.9% for Perplexity and 30.4% for EigenScore when evaluated using LLM-as-Judge criteria compared to ROUGE.

measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.

EdinburghNLP/awesome-hallucination-detection (GitHub, github.com), 1 fact

measurement: Established hallucination detection methods including Perplexity, EigenScore, and eRank suffer performance drops of up to 45.9% AUROC when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE.
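The measurements above compare the same detector scores against two different sets of correctness labels. The sketch below shows how that can change AUROC even though the scores never change. It is a toy illustration under stated assumptions: the `auroc` helper, the perplexity values, and both label lists are invented for this example and are not data from the cited sources.

```python
def auroc(scores, labels):
    """Rank-based AUROC: probability that a randomly chosen positive
    (label 1, hallucinated) scores higher than a randomly chosen
    negative (label 0, correct). Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical perplexity scores for six answers (higher = more suspicious).
scores = [1.2, 1.5, 2.0, 3.1, 4.0, 5.5]

# Hypothetical labels (1 = hallucinated) from two labeling schemes.
rouge_labels = [0, 0, 0, 1, 1, 1]  # thresholded ROUGE overlap (toy values)
judge_labels = [0, 1, 0, 0, 1, 1]  # LLM-as-a-judge verdicts (toy values)

auroc_rouge = auroc(scores, rouge_labels)
auroc_judge = auroc(scores, judge_labels)
print(f"AUROC vs ROUGE labels: {auroc_rouge:.3f}")
print(f"AUROC vs judge labels: {auroc_judge:.3f}")
```

In this toy case the two labeling schemes disagree on two of the six answers, which is enough to lower the AUROC of an unchanged detector. The sources do not state here whether the reported 45.9% is a relative change or a difference in percentage points, so the sketch does not attempt to reproduce that figure.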