Relations (1)
related 2.58 — strongly supporting 5 facts
NQ-Open serves as a primary benchmark dataset for evaluating hallucination detection methods, as evidenced by its use in assessing model performance metrics like AUROC and perplexity [1], [2], [3], [4], and its inclusion in standardized hallucination leaderboards [5].
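Perplexity, one of the metrics cited above, is the exponential of the average negative log-likelihood of the generated tokens; a higher value means the model was less confident in its own output, which is why it is used as a hallucination signal. A minimal sketch, assuming per-token log-probabilities are available from the model (the numbers below are illustrative, not from any cited paper):

```python
import math

def perplexity(token_logprobs):
    """Exp of the mean negative log-likelihood of the tokens.
    Higher perplexity = lower model confidence, often used as a
    hallucination score for the generated answer."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical log-probs for the tokens of one generated answer:
logps = [-0.2, -1.5, -0.1, -2.3]
print(perplexity(logps))
```

A response scored this way can then be thresholded, or ranked against other responses via AUROC, to flag likely hallucinations.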
Facts (5)
Sources
Re-evaluating Hallucination Detection in LLMs - arXiv arxiv.org 3 facts
measurement: The Eigenscore hallucination detection method suffers a performance drop of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
procedure: To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question–answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
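The AUROC figures in these facts measure how well a detector's score (e.g. perplexity or Eigenscore) ranks hallucinated responses above faithful ones, given binary correctness labels from a judge (ROUGE or LLM-as-Judge). A minimal self-contained sketch with made-up scores and labels, not the paper's data:

```python
# AUROC for a hallucination detector. Convention: a higher score means
# "more likely hallucinated"; label 1 = hallucination, 0 = faithful.
def auroc(scores, labels):
    """Probability that a randomly chosen hallucinated example scores
    higher than a randomly chosen faithful one (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical per-response perplexities and judge labels:
scores = [12.0, 3.1, 8.5, 2.2, 15.4, 4.0]
labels = [1,    0,   1,   0,   1,    0]
print(auroc(scores, labels))  # 1.0: every hallucination outranks every faithful answer
```

Swapping the labeling judge (ROUGE vs. LLM-as-Judge) changes the `labels` vector, which is exactly how the AUROC drops reported above arise even when the detector's scores are unchanged.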
EdinburghNLP/awesome-hallucination-detection - GitHub github.com 1 fact
reference: The SAC^3 method for reliable hallucination detection in black-box language models uses accuracy and AUROC as metrics for classification QA and open-domain QA, drawing on datasets including the prime-number and senator-search tasks from the Snowball hallucination study, HotpotQA, and NQ-Open QA.
The Hallucinations Leaderboard, an Open Effort to Measure ... huggingface.co 1 fact
claim: The Hallucinations Leaderboard includes tasks across several categories: Closed-book Open-domain QA (NQ Open, TriviaQA, TruthfulQA), Summarisation (XSum, CNN/DM), Reading Comprehension (RACE, SQuADv2), Instruction Following (MemoTrap, IFEval), Fact-Checking (FEVER), Hallucination Detection (FaithDial, True-False, HaluEval), and Self-Consistency (SelfCheckGPT).