Concept

NaturalQuestions

Also known as: Natural Questions

Facts (10)

Sources
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com) - 7 facts
Reference: SQuAD, Natural Questions, and MuSiQue are datasets that use F1 and Exact Match metrics for classification and token-level evaluation.
Measurement: Evaluation of uncertainty and confidence in language models uses AUROC, AUARC, NumSet, Deg, and EigV as metrics, on datasets including CoQA, TriviaQA, and Natural Questions.
Claim: CoQA is an open-book conversational question answering dataset, while TriviaQA and Natural Questions are closed-book question answering datasets.
Reference: The Natural Questions and Wizard of Wikipedia datasets are evaluated with metrics including factuality, relevance, coherence, informativeness, helpfulness, and validity.
Reference: Choice accuracy is used as an evaluation metric for the Natural Questions, TriviaQA, and FACTOR (news, expert, wiki) datasets.
Reference: The WACK (Wrong Answers despite Correct Knowledge) dataset is built on TriviaQA and NaturalQuestions and contains QA instances labeled HK- (hallucination caused by missing knowledge) or HK+ (hallucination occurring even though the model knows the answer).
Claim: CoQA, SQuAD, Natural Questions, TriviaQA, and TruthfulQA are datasets used for evaluating AI systems.
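The Exact Match and token-level F1 metrics named in the facts above can be sketched as follows. This is a minimal illustration in the style of the common SQuAD-family evaluation convention (lowercasing, stripping punctuation and articles before comparison); exact normalization rules vary per benchmark, and the example strings are made up.

```python
# Sketch of Exact Match and token-level F1 scoring as commonly applied
# to SQuAD, Natural Questions, and MuSiQue short answers.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))           # 1.0
print(round(f1_score("the eiffel tower in Paris", "Eiffel Tower"), 2))  # 0.67
```

In practice, benchmarks with multiple gold answers take the maximum score over all references for each question.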
Re-evaluating Hallucination Detection in LLMs - arXiv (arxiv.org), Aug 13, 2025 - 1 fact
Reference: Kwiatkowski et al. (2019) developed 'Natural Questions', a benchmark designed for question answering research.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... - arXiv (arxiv.org), Feb 23, 2026 - 1 fact
Reference: The paper 'Natural Questions: A Benchmark for Question Answering Research' introduces the Natural Questions dataset for evaluating question answering systems.
New tool, dataset help detect hallucinations in large language models - Amazon Science (amazon.science) - 1 fact
Reference: The RefChecker benchmark dataset sources its examples from three datasets: NaturalQuestions (development set) for zero-context closed-book QA, MS MARCO (development set) for noisy-context retrieval-augmented generation, and databricks-dolly-15k for accurate-context summarization, closed QA, and information extraction.
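The AUROC metric cited above for uncertainty and confidence evaluation can be sketched as a pairwise-ranking probability: how often a correctly answered question receives a higher confidence score than an incorrectly answered one (with ties counted as half). This is equivalent to the usual area under the ROC curve; the scores and labels below are made-up toy data, not from any of the cited benchmarks.

```python
# Sketch of AUROC for uncertainty-based evaluation: does the model's
# confidence score rank correct answers above incorrect ones?
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative.

    labels: 1 = correct answer, 0 = incorrect answer.
    Ties in score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: higher confidence should correlate with correctness.
conf = [0.9, 0.8, 0.4, 0.3]
correct = [1, 1, 0, 1]
print(round(auroc(conf, correct), 2))  # 0.67
```

A score of 1.0 means confidence perfectly separates correct from incorrect answers; 0.5 means it is no better than chance.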