NQ-Open
Also known as: NQOPEN, Nq-open QA
Facts (15)
Sources
EdinburghNLP/awesome-hallucination-detection (github.com) · 8 facts
reference: The RACE framework is evaluated using the HotpotQA, TriviaQA, NQ-Open, and SQuAD datasets.
measurement: The BAFH framework uses Truthful Rate, Overconfident Hallucination detection rate (OH), Unaware Hallucination detection rate (UH), and AUC as evaluation metrics on the TriviaQA, NQ-Open, and ALCUNA datasets.
reference: The 'Monitoring Decoding' framework is evaluated using the TruthfulQA (817 questions), TriviaQA (1,200 samples), NQ-Open (1,000 samples), and GSM8K (1,319 samples) datasets.
reference: The SAC^3 method for reliable hallucination detection in black-box language models uses accuracy and AUROC as metrics for classification QA and open-domain QA, and evaluates on datasets including the prime-number and senator-search sets from Snowball Hallucination, HotpotQA, and NQ-Open QA.
procedure: A lightweight classifier method for hallucination detection conditions on input hidden states before text generation and intervenes in these states to steer Large Language Models toward factual outputs, yielding consistent improvements in factual accuracy with minimal computational overhead. This method uses Accuracy as a metric and is evaluated on the NQ-Open, MMLU, MedMCQA, and GSM8K datasets.
reference: A white-box hallucination detector treats the Large Language Model as a dynamic graph and analyzes structural properties of its internal attention mechanisms. The method extracts spectral features, specifically eigenvalues, from attention maps to predict fabrication: factual retrieval produces stable eigen-structures, while hallucination leads to diffuse, chaotic patterns. The detector operates independently of the generated semantic content and was evaluated across seven QA benchmarks (NQ-Open, TriviaQA, CoQA, SQuADv2, HaluEval-QA, TruthfulQA, GSM8K) using AUROC, Precision, Recall, and Cohen's Kappa metrics.
measurement: The 'Monitoring Decoding' framework uses Exact Match (TriviaQA, NQ-Open), Truth/Info/Truth×Info scores (TruthfulQA), Accuracy (GSM8K), Latency (ms/token), and Throughput (tokens/s) as evaluation metrics.
measurement: Evaluation metrics for hallucination detection and knowledge consistency include MC1, MC2, and MC3 scores for the TruthfulQA multiple-choice task; %Truth, %Info, and %Truth×Info for the TruthfulQA open-ended generation task; subspan Exact Match for open-domain QA tasks (NQ-Open, NQ-Swap, TriviaQA, PopQA, MuSiQue); accuracy for MemoTrap; and prompt-level and instruction-level accuracies for IFEval.
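Subspan Exact Match, cited above for the open-domain QA tasks, credits a prediction that contains a gold answer rather than requiring strict equality. A minimal sketch of such a scorer, using SQuAD-style answer normalization as an assumption (the cited papers' exact normalization may differ):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def subspan_exact_match(prediction: str, gold_answers: list[str]) -> int:
    """Score 1 if any normalized gold answer appears as a substring
    of the normalized prediction, else 0."""
    pred = normalize(prediction)
    return int(any(normalize(g) in pred for g in gold_answers))
```

This is why subspan EM suits free-form generation: a verbose but correct answer such as "The capital is Paris, France." still scores 1 against the gold answer "Paris".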
Re-evaluating Hallucination Detection in LLMs (arxiv.org, Aug 13, 2025) · 5 facts
measurement: The Eigenscore hallucination detection method experiences a performance erosion of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
reference: The NQ-Open dataset contains 3,610 question-answer pairs derived from real Google search queries, representing natural information-seeking behavior.
procedure: To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question-answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
claim: The NQ-Open, TriviaQA, and SQuAD datasets are available under licenses that permit academic use.
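The AUROC scores whose erosion is measured above can be computed without any library. A minimal sketch via the Mann-Whitney formulation (illustrative only, not the paper's evaluation code):

```python
def auroc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a randomly chosen positive (hallucinated)
    example receives a higher detector score than a randomly chosen
    negative (faithful) one, counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because AUROC is computed against the evaluation's hallucination labels, swapping the labeler from ROUGE to LLM-as-Judge changes `labels` while the detector's `scores` stay fixed, which is how the large drops reported above arise.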
The Hallucinations Leaderboard, an Open Effort to Measure ... (huggingface.co, Jan 29, 2024) · 2 facts
claim: The Hallucination Leaderboard includes tasks across several categories: Closed-book Open-domain QA (NQ Open, TriviaQA, TruthfulQA), Summarisation (XSum, CNN/DM), Reading Comprehension (RACE, SQuADv2), Instruction Following (MemoTrap, IFEval), Fact-Checking (FEVER), Hallucination Detection (FaithDial, True-False, HaluEval), and Self-Consistency (SelfCheckGPT).
procedure: In the Hallucination Leaderboard, models are evaluated on NQ Open and TriviaQA against gold answers using Exact Match, in 64-shot and 8-shot in-context learning settings.
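Unlike the subspan variant used elsewhere on this page, the leaderboard's Exact Match is strict. A minimal sketch of a strict EM scorer, again assuming SQuAD-style normalization rather than the leaderboard's exact code:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> int:
    """Strict EM: 1 if the normalized prediction equals any normalized
    gold answer, else 0."""
    pred = normalize(prediction)
    return int(any(normalize(g) == pred for g in gold_answers))
```

Under strict EM, "The Eiffel Tower" matches the gold answer "Eiffel Tower" after normalization, but a verbose answer that merely contains it does not; few-shot prompting (64-shot or 8-shot here) encourages models to emit the short answer format this metric rewards.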