concept

SQuAD

Also known as: Stanford Question Answering Dataset, SQuAD 2.0

Facts (17)

Sources
EdinburghNLP/awesome-hallucination-detection (github.com), 6 facts
reference: SQuAD, Natural Questions, and MuSiQue are datasets that use F1 and Exact Match metrics for classification and token-level evaluation.
reference: The RACE framework is evaluated using the HotpotQA, TriviaQA, NQ-Open, and SQuAD datasets.
measurement: The evaluation metrics 'EM on All', 'Has answer', and 'IDK' are used on the MNLI, SQuAD 2.0, and ACE-whQA datasets.
reference: A white-box hallucination detector treats the Large Language Model as a dynamic graph and analyzes structural properties of its internal attention mechanisms. The method extracts spectral features, specifically eigenvalues, from attention maps to predict fabrication: factual retrieval produces stable eigen-structures, while hallucination leads to diffuse, chaotic patterns. The detector operates independently of the generated semantic content and was evaluated across seven QA benchmarks (NQ-Open, TriviaQA, CoQA, SQuADv2, HaluEval-QA, TruthfulQA, GSM8K) using AUROC, Precision, Recall, and Cohen's Kappa.
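The eigenvalue idea above can be sketched numerically. This is a minimal illustration, not the paper's implementation: the specific features computed here (spectral entropy and spectral gap of a symmetrized attention map) are assumptions chosen to show the general recipe of turning an attention map into a fixed-size spectral feature vector.

```python
import numpy as np

def spectral_features(attn):
    """Eigenvalue features of one attention map (T x T, rows sum to 1).

    Illustrative feature set only: the surveyed detector extracts
    spectral features from attention maps, but the exact features
    (spectral entropy, spectral gap) are assumptions here.
    """
    sym = 0.5 * (attn + attn.T)          # symmetrize -> real spectrum
    eig = np.linalg.eigvalsh(sym)        # eigenvalues in ascending order
    p = np.abs(eig) / np.abs(eig).sum()  # normalized spectrum
    entropy = -np.sum(p * np.log(p + 1e-12))  # diffuse spectrum -> high
    gap = eig[-1] - eig[-2]                   # stable structure -> large
    return {"spectral_entropy": float(entropy), "spectral_gap": float(gap)}

rng = np.random.default_rng(0)
raw = rng.random((8, 8))
attn = raw / raw.sum(axis=1, keepdims=True)  # softmax-like row-stochastic map
feats = spectral_features(attn)
```

A downstream classifier over such features (collected across layers and heads) would then produce the hallucination prediction that is scored with AUROC, Precision, Recall, and Cohen's Kappa.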
claim: CoQA, SQuAD, Natural Questions, TriviaQA, and TruthfulQA are datasets used for evaluating AI systems.
procedure: The 'Kernel Language Entropy' method evaluates semantic uncertainty in Large Language Model responses by generating multiple response samples, measuring their pairwise semantic similarity as a density matrix (the semantic kernel), and quantifying uncertainty as the von Neumann entropy of that matrix in order to detect and mitigate hallucinations. The method uses AUROC and AURAC metrics and is evaluated on the TriviaQA, SQuAD, BioASQ, NQ, and SVAMP datasets.
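The density-matrix step can be sketched as follows. This is a minimal illustration that assumes the semantic kernel is already built (in practice it would come from pairwise entailment or embedding similarity between the sampled responses); only the unit-trace normalization and the von Neumann entropy follow the description above.

```python
import numpy as np

def von_neumann_entropy(kernel):
    """Von Neumann entropy -tr(rho log rho) of a semantic kernel.

    `kernel` is a symmetric positive semi-definite similarity matrix
    over N sampled responses; how it is estimated is assumed here.
    """
    rho = kernel / np.trace(kernel)  # normalize to a density matrix
    eig = np.linalg.eigvalsh(rho)
    eig = eig[eig > 1e-12]           # drop numerical zeros
    return float(-np.sum(eig * np.log(eig)))

# Four samples that all agree -> entropy near zero (model is certain);
# four mutually dissimilar samples -> maximal entropy log(4).
low = von_neumann_entropy(np.ones((4, 4)))
high = von_neumann_entropy(np.eye(4))
```

High entropy over the sampled responses flags the answer as a likely hallucination; the threshold-free quality of that flag is what AUROC and AURAC then measure.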
The Hallucinations Leaderboard, an Open Effort to Measure ... (huggingface.co, Jan 29, 2024), 4 facts
claim: The Hallucinations Leaderboard includes tasks across several categories: Closed-book Open-domain QA (NQ Open, TriviaQA, TruthfulQA), Summarisation (XSum, CNN/DM), Reading Comprehension (RACE, SQuADv2), Instruction Following (MemoTrap, IFEval), Fact-Checking (FEVER), Hallucination Detection (FaithDial, True-False, HaluEval), and Self-Consistency (SelfCheckGPT).
claim: On the SQuADv2 dataset, mGPT is the best-performing model on unanswerable questions (NoAns), while Starling-LM 7B alpha is the best-performing model on answerable questions (HasAns).
reference: RACE and SQuADv2 are the datasets used to assess a model's reading comprehension skills on the Hallucinations Leaderboard.
reference: SQuADv2 (Stanford Question Answering Dataset v2) tests a model's ability to avoid hallucinations by including unanswerable questions, requiring the model to provide accurate answers or to identify when no answer is possible, in a 4-shot setting.
Re-evaluating Hallucination Detection in LLMs (arxiv.org, Aug 13, 2025), 3 facts
reference: Pranav Rajpurkar, Robin Jia, and Percy Liang introduced unanswerable questions to SQuAD (SQuAD 2.0) in their 2018 paper 'Know What You Don’t Know: Unanswerable Questions for SQuAD'.
reference: The SQuADv2 subset used in the study contains 4,150 examples from the validation set (rc.nocontext) and is characterized by longer, more complex questions and answers than NQ-Open and TriviaQA.
claim: The datasets NQ-Open, TriviaQA, and SQuAD are available under licenses that permit academic use.
KG-RAG: Bridging the Gap Between Knowledge and Creativity (arxiv.org, May 20, 2024), 1 fact
reference: The KG-RAG study uses Exact Match (EM) and F1 Score as standard evaluation metrics for assessing question answering systems, as established by Rajpurkar et al. (2016) in the SQuAD paper.
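The EM and F1 metrics cited above can be sketched as follows. The answer normalization (lowercasing, stripping articles and punctuation, collapsing whitespace) mirrors the official SQuAD evaluation script, though this compact version is illustrative rather than the reference implementation.

```python
import re
from collections import Counter

def normalize(s):
    """SQuAD-style answer normalization."""
    s = re.sub(r"\b(a|an|the)\b", " ", s.lower())  # drop articles
    s = re.sub(r"[^\w\s]", "", s)                  # drop punctuation
    return " ".join(s.split())                     # collapse whitespace

def exact_match(pred, gold):
    """1.0 iff the normalized strings are identical."""
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    """Token-level F1 between normalized prediction and gold answer."""
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

The official script additionally takes the maximum score over all gold answers for a question; that reduction is omitted here.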
A survey on augmenting knowledge graphs (KGs) with large ... (link.springer.com, Nov 4, 2024), 2 facts
reference: Rajpurkar P, Zhang J, Lopyrev K, and Liang P authored 'SQuAD: 100,000+ Questions for Machine Comprehension of Text', published as an arXiv preprint in 2016 (arXiv:1606.05250).
reference: SQuAD (Stanford Question Answering Dataset) is a benchmark that evaluates question-answering systems by requiring models to read provided passages and answer questions about them, measuring information retrieval and comprehension.
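Concretely, a SQuAD example pairs a question with a passage and locates each answer as a character-offset span of that passage. The record below uses invented content, but the field names (`context`, `question`, `answers` with `text` and `answer_start`) follow the SQuAD JSON schema.

```python
# Invented example content; field names follow the SQuAD JSON schema.
context = "The Stanford Question Answering Dataset was released in 2016."
record = {
    "context": context,
    "question": "When was SQuAD released?",
    "answers": [{"text": "2016", "answer_start": context.index("2016")}],
}

# Extractive-QA invariant: every gold answer is a literal span of the
# passage, starting at its stated character offset.
for ans in record["answers"]:
    start = ans["answer_start"]
    assert context[start:start + len(ans["text"])] == ans["text"]
```

SQuAD 2.0 extends this schema with unanswerable questions, represented by an empty `answers` list together with an `is_impossible` flag, which is what the leaderboard's NoAns/HasAns split above refers to.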