AUROC
Also known as: Area under the Receiver Operating Characteristic curve, AUROC score
Facts (16)
Sources
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 12 facts
measurement: Evaluation of uncertainty and confidence in language models uses AUROC, AUARC, NumSet, Deg, and EigV as metrics, and utilizes datasets including CoQA, TriviaQA, and Natural Questions.
reference: The SAC^3 method for reliable hallucination detection in black-box language models uses accuracy and AUROC as metrics for classification QA and open-domain QA, and utilizes datasets including the prime number and senator search sets from Snowball Hallucination, HotpotQA, and NQ-Open QA.
measurement: Evaluation methods for hallucination detection utilize AUROC as a metric across datasets including XSum, QAGS, FRANK, and SummEval.
measurement: The LARS uncertainty estimation technique is evaluated using accuracy, precision, recall, and AUROC metrics on the TriviaQA, GSM8K, SVAMP, and CommonsenseQA datasets.
reference: Evaluation metrics for custom open-domain text generation datasets, LLM-generated encyclopedic text, and PopQA include AUROC and AURAC.
measurement: The lightweight probe method for hallucination detection outperforms HaloScope and Semantic Entropy on 10 of 12 model–dataset combinations, achieving up to 13-point AUROC gains.
measurement: Established hallucination detection methods, including Perplexity, EigenScore, and eRank, suffer performance drops of up to 45.9% AUROC when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE.
reference: A white-box hallucination detector treats the large language model as a dynamic graph and analyzes structural properties of its internal attention mechanisms. The method extracts spectral features, specifically eigenvalues, from attention maps to predict fabrication: factual retrieval produces stable eigen-structures, while hallucination leads to diffuse, chaotic patterns. The detector operates independently of the generated semantic content and was evaluated across seven QA benchmarks (NQ-Open, TriviaQA, CoQA, SQuADv2, HaluEval-QA, TruthfulQA, GSM8K) using AUROC, precision, recall, and Cohen's kappa.
measurement: The MARS uncertainty estimation technique is evaluated using AUROC and PRR metrics on the TriviaQA, GSM8K, NaturalQA, and WebQA datasets.
claim: AUROC, PCC, and accuracy are metrics used for evaluation on the TruthfulQA benchmark.
measurement: Evaluation methods for hallucination detection utilize AUROC as a metric across datasets including PAWS, XSum, QAGS, FRANK, SummEval, BEGIN, Q^2, DialFact, FEVER, and VitaminC.
measurement: The BTProp framework improves hallucination detection by 3–9% in AUROC and AUC-PR over baselines across multiple benchmarks.
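The white-box detector described above scores generations from the eigenvalues of attention maps rather than from output text. A minimal sketch of that kind of feature extraction, using NumPy on a toy row-stochastic attention matrix (the function name and top-k choice are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def spectral_features(attn, k=5):
    """Top-k eigenvalue magnitudes of a (seq_len x seq_len) attention map.
    Hypothetical feature extractor in the spirit of the detector described
    above: stable vs. diffuse eigen-structure would feed a downstream probe."""
    eigvals = np.linalg.eigvals(attn)      # complex in general
    mags = np.sort(np.abs(eigvals))[::-1]  # sort magnitudes, descending
    return mags[:k]

# Toy attention map: softmax over random logits, so each row sums to 1.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 8))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(spectral_features(attn, k=3))
```

Because each row of a softmaxed attention map sums to 1, its leading eigenvalue magnitude is exactly 1; the informative signal for a detector lies in how quickly the remaining spectrum decays.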
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... cleanlab.ai Apr 7, 2025 2 facts
measurement: The Cleanlab RAG benchmark quantifies the effectiveness of detection methods using the area under the receiver operating characteristic curve (AUROC).
claim: In the Cleanlab RAG benchmark, a detector with a high AUROC score more consistently assigns lower scores to incorrect RAG responses than to correct ones.
On Hallucinations in Artificial Intelligence–Generated Content ... jnm.snmjournals.org 1 fact
reference: The Food and Drug Administration's January 2025 draft guidance on AI-enabled medical devices recommends rigorous performance evaluation using metrics such as the area under the receiver operating characteristic curve and positive or negative likelihood ratios.
Benchmarking Hallucination Detection Methods in RAG - Cleanlab cleanlab.ai Sep 30, 2024 1 fact
formula: The Cleanlab benchmark evaluates hallucination detectors based on AUROC, defined as the probability that the detector's score will be lower for an example where the LLM responded incorrectly than for an example where it responded correctly.
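This probabilistic definition can be computed directly: AUROC is the fraction of (incorrect, correct) response pairs in which the detector scores the incorrect response lower, with ties counting half. A minimal pure-Python sketch, using hypothetical detector scores rather than any real benchmark data:

```python
from itertools import product

def auroc(scores_incorrect, scores_correct):
    """AUROC as the probability that a detector scores an incorrect
    response below a correct one (ties contribute 0.5)."""
    pairs = list(product(scores_incorrect, scores_correct))
    wins = sum(1.0 if si < sc else 0.5 if si == sc else 0.0
               for si, sc in pairs)
    return wins / len(pairs)

# Hypothetical scores: lower means "more likely hallucinated".
incorrect = [0.1, 0.3, 0.4]        # scores on incorrect RAG responses
correct = [0.5, 0.6, 0.9, 0.35]    # scores on correct RAG responses

print(auroc(incorrect, correct))   # 11 of 12 pairs correctly ordered
```

A detector that always ranks incorrect responses below correct ones reaches AUROC 1.0, while a random-scoring detector hovers around 0.5; this pairwise formulation is equivalent to the usual ROC-curve integral.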