Relations (1)
related 2.58 — strongly supporting 5 facts
NQ-Open serves as a primary benchmark dataset for evaluating hallucination detection methods, as evidenced by its use in assessing model performance metrics like AUROC and perplexity [1], [2], [3], [4], and its inclusion in standardized hallucination leaderboards [5].
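Perplexity, one of the metrics cited above, is the exponential of the average negative log-likelihood of the generated tokens; a higher value means the model was less confident in its own output, which is why it is used as a hallucination signal. A minimal sketch, assuming per-token log-probabilities are available from the model (the numbers below are illustrative, not from any cited paper):

```python
import math

def perplexity(token_logprobs):
    """Exp of the mean negative log-likelihood of the tokens.
    Higher perplexity = lower model confidence, often used as a
    hallucination score for the generated answer."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical log-probs for the tokens of one generated answer:
logps = [-0.2, -1.5, -0.1, -2.3]
print(perplexity(logps))
```

A response scored this way can then be thresholded, or ranked against other responses via AUROC, to flag likely hallucinations.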
Facts (5)
Sources
Re-evaluating Hallucination Detection in LLMs - arXiv arxiv.org 3 facts
measurement: The Eigenscore hallucination detection method suffers a performance drop of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
procedure: To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question–answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
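The AUROC figures in these facts measure how well a detector's score (e.g. perplexity or Eigenscore) ranks hallucinated responses above faithful ones, given binary correctness labels from a judge (ROUGE or LLM-as-Judge). A minimal self-contained sketch with made-up scores and labels, not the paper's data:

```python
# AUROC for a hallucination detector. Convention: a higher score means
# "more likely hallucinated"; label 1 = hallucination, 0 = faithful.
def auroc(scores, labels):
    """Probability that a randomly chosen hallucinated example scores
    higher than a randomly chosen faithful one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-response perplexities and judge labels:
scores = [12.0, 3.1, 8.5, 2.2, 15.4, 4.0]
labels = [1,    0,   1,   0,   1,    0]
print(auroc(scores, labels))  # 1.0: every hallucination outranks every faithful answer
```

Swapping the labeling judge (ROUGE vs. LLM-as-Judge) changes the `labels` vector, which is exactly how the AUROC drops reported above arise even when the detector's scores are unchanged.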
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 1 fact
reference: The SAC^3 method for reliable hallucination detection in black-box language models uses accuracy and AUROC as metrics for classification QA and open-domain QA, drawing on datasets including the prime-number and senator-search tasks from the Snowball hallucination study, HotpotQA, and NQ-Open QA.
The Hallucinations Leaderboard, an Open Effort to Measure ... huggingface.co 1 fact
claim: The Hallucinations Leaderboard includes tasks across several categories: Closed-book Open-domain QA (NQ Open, TriviaQA, TruthfulQA), Summarisation (XSum, CNN/DM), Reading Comprehension (RACE, SQuADv2), Instruction Following (MemoTrap, IFEval), Fact-Checking (FEVER), Hallucination Detection (FaithDial, True-False, HaluEval), and Self-Consistency (SelfCheckGPT).