Relations (1)
related 2.81 — strongly supporting 6 facts
AUROC is a primary evaluation metric for quantifying the performance of hallucination detection methods, as evidenced by its use in frameworks like BTProp [1] and SAC^3 [2]. Researchers use AUROC for threshold-independent assessment of ranking performance across various datasets {fact:3, fact:5}, while also analyzing how different evaluation approaches affect these scores {fact:1, fact:4}.
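The "threshold-independent" property noted above can be made concrete: AUROC equals the probability that a randomly chosen positive example (here, a hallucinated output) receives a higher detector score than a randomly chosen negative one, so it depends only on the ranking of scores, never on a decision threshold. A minimal sketch, using hypothetical detector scores not drawn from any cited paper:

```python
def auroc(labels, scores):
    """AUROC via its rank interpretation: the probability that a
    randomly chosen positive outranks a randomly chosen negative.
    labels: 1 = hallucinated, 0 = faithful; scores: detector outputs."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties between a positive and a negative count 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four outputs (two hallucinated, two faithful):
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 3 of 4 pairs ranked correctly -> 0.75
```

Because only the ordering of scores matters, rescaling or shifting all detector scores leaves AUROC unchanged, which is why it is preferred for comparing detection methods whose raw score ranges differ.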
Facts (6)
Sources
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 3 facts
reference: The SAC^3 method for reliable hallucination detection in black-box language models uses accuracy and AUROC as metrics for classification QA and open-domain QA, and utilizes datasets including Prime number and senator search from Snowball Hallucination, HotpotQA, and NQ-Open QA.
measurement: Evaluation methods for hallucination detection utilize AUROC as a metric across datasets including XSum, QAGS, FRANK, and SummEval.
measurement: The BTProp framework improves hallucination detection by 3-9% in AUROC and AUC-PR metrics over baselines across multiple benchmarks.
Re-evaluating Hallucination Detection in LLMs - arXiv arxiv.org 3 facts
claim: The moderate Pearson correlation coefficient between AUROC scores derived from ROUGE and LLM-as-Judge evaluation approaches suggests that hallucination detection methods may be inadvertently optimized for ROUGE's lexical overlap criteria rather than genuine factual correctness.
claim: The authors employ the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall curve (PR-AUC) as primary evaluation metrics for hallucination detection, as both provide threshold-independent evaluations of ranking performance.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.