concept

TriviaQA

Facts (27)

Sources
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com), 18 facts
reference: The RACE framework is evaluated using the HotpotQA, TriviaQA, NQ-Open, and SQuAD datasets.
measurement: The BAFH framework uses Truthful Rate, Overconfident Hallucination detection rate (OH), Unaware Hallucination detection rate (UH), and AUC as evaluation metrics on the TriviaQA, NQ-Open, and ALCUNA datasets.
reference: The 'Monitoring Decoding' framework is evaluated using the TruthfulQA (817 questions), TriviaQA (1,200 samples), NQ-Open (1,000 samples), and GSM8K (1,319 samples) datasets.
measurement: Evaluation of uncertainty and confidence in language models uses AUROC, AUARC, NumSet, Deg, and EigV as metrics, on datasets including CoQA, TriviaQA, and Natural Questions.
procedure: The 'Semantic Density' method provides response-wise confidence and uncertainty scores for detecting Large Language Model hallucinations by extracting information from a probability-distribution perspective in semantic space, functioning as an 'off-the-shelf' tool for various task types. It uses AUROC and AUPR as metrics and is evaluated on the CoQA, TriviaQA, SciQ, and NQ datasets.
claim: CoQA is an open-book conversational question answering dataset, while TriviaQA and Natural Questions are closed-book question answering datasets.
measurement: The LARS uncertainty estimation technique is evaluated using Accuracy, Precision, Recall, and AUROC on the TriviaQA, GSM8K, SVAMP, and CommonsenseQA datasets.
claim: GAuGE produces highly relevant answers with significantly fewer hallucinated statements and higher fact-verification scores than standard RAG-style generation, as measured by AUROC on the TriviaQA, NaturalQA, and WebQA datasets.
reference: A white-box hallucination detector treats the Large Language Model as a dynamic graph and analyzes structural properties of its internal attention mechanisms. The method extracts spectral features, specifically eigenvalues, from attention maps to predict fabrication: factual retrieval produces stable eigen-structures, while hallucination yields diffuse, chaotic patterns. The detector operates independently of the generated semantic content and was evaluated across seven QA benchmarks (NQ-Open, TriviaQA, CoQA, SQuADv2, HaluEval-QA, TruthfulQA, GSM8K) using AUROC, Precision, Recall, and Cohen's Kappa.
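The eigenvalue extraction described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it treats one head's attention map as a weighted adjacency matrix and keeps the top-k eigenvalue magnitudes as features; the function name and the choice of k are assumptions.

```python
import numpy as np

def attention_spectral_features(attn: np.ndarray, k: int = 5) -> np.ndarray:
    """Top-k eigenvalue magnitudes of one head's (seq_len x seq_len)
    attention map. Rows are attention distributions, so the matrix is
    generally asymmetric; eigenvalues are complex and sorted by magnitude."""
    eigvals = np.linalg.eigvals(attn)
    mags = np.sort(np.abs(eigvals))[::-1]      # descending magnitudes
    feats = np.zeros(k)
    n = min(k, len(mags))
    feats[:n] = mags[:n]                       # zero-pad short sequences
    return feats

# A row-stochastic attention map always has a leading eigenvalue of 1,
# so the informative signal lives in the rest of the spectrum.
attn = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.2, 0.7]])
features = attention_spectral_features(attn, k=3)
```

A per-example feature vector like this (over heads and layers) is what a downstream classifier would consume to separate stable from chaotic eigen-structures.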
measurement: The 'Monitoring Decoding' framework uses Exact Match (TriviaQA, NQ-Open), Truth/Info/Truth×Info scores (TruthfulQA), Accuracy (GSM8K), Latency (ms/token), and Throughput (tokens/s) as evaluation metrics.
reference: The study 'Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback' uses Expected Calibration Error (ECE) with temperature scaling (ECE-t), accuracy@coverage, and coverage@accuracy as metrics, on QA datasets including TriviaQA, SciQ, and TruthfulQA.
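A minimal sketch of plain Expected Calibration Error, assuming equal-width confidence bins (the ECE-t variant additionally fits a temperature on held-out data before binning):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence, then average the gap between
    each bin's accuracy and its mean confidence, weighted by the
    fraction of samples falling in the bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        acc = correct[in_bin].mean()
        conf = confidences[in_bin].mean()
        ece += in_bin.mean() * abs(acc - conf)
    return ece

# Systematic overconfidence (high stated confidence, mostly wrong
# answers) yields a large ECE.
ece_over = expected_calibration_error([0.95, 0.95, 0.9, 0.9], [1, 0, 0, 0])
```

Perfectly calibrated predictions would give an ECE near zero; the toy data above scores 0.675 because stated confidence far exceeds accuracy.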
reference: Choice accuracy is used as an evaluation metric for the Natural Questions, TriviaQA, and FACTOR (news, expert, wiki) datasets.
reference: The WACK (Wrong Answers despite Correct Knowledge) dataset is built from TriviaQA and Natural Questions and contains QA instances labeled HK- (hallucination caused by missing knowledge) or HK+ (hallucination occurring even though the model knows the answer).
claim: CoQA, SQuAD, Natural Questions, TriviaQA, and TruthfulQA are datasets used for evaluating AI systems.
measurement: The MARS uncertainty estimation technique is evaluated using AUROC and PRR metrics on the TriviaQA, GSM8K, NaturalQA, and WebQA datasets.
measurement: Evaluation metrics for hallucination detection and knowledge consistency include MC1, MC2, and MC3 scores for the TruthfulQA multiple-choice task; %Truth, %Info, and %Truth×Info for the TruthfulQA open-ended generation task; subspan Exact Match for open-domain QA tasks (NQ-Open, NQ-Swap, TriviaQA, PopQA, MuSiQue); accuracy for MemoTrap; and prompt-level and instruction-level accuracies for IFEval.
procedure: The 'Kernel Language Entropy' method evaluates semantic uncertainty in Large Language Model responses by generating multiple response samples, measuring their semantic similarity as a density matrix (the semantic kernel), and quantifying uncertainty as the von Neumann entropy of that matrix, in order to detect and mitigate hallucinations. It uses AUROC and AURAC as metrics and is evaluated on the TriviaQA, SQuAD, BioASQ, NQ, and SVAMP datasets.
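The entropy step of this method can be illustrated with a short sketch, assuming the semantic kernel has already been built as a symmetric positive-semidefinite similarity matrix over the sampled responses; the function name is illustrative, not from the paper:

```python
import numpy as np

def von_neumann_entropy(kernel: np.ndarray) -> float:
    """Von Neumann entropy of a semantic kernel.

    The kernel is normalized to unit trace so that its eigenvalues
    form a probability distribution; the entropy is then
    S = -sum_i lambda_i * log(lambda_i)."""
    rho = kernel / np.trace(kernel)
    eigvals = np.linalg.eigvalsh(rho)      # symmetric -> real eigenvalues
    eigvals = eigvals[eigvals > 1e-12]     # drop numerical zeros
    return float(-np.sum(eigvals * np.log(eigvals)))

# Semantically identical samples give a rank-1 kernel and zero entropy
# (no uncertainty); mutually dissimilar samples give maximal entropy.
low = von_neumann_entropy(np.ones((4, 4)))   # ≈ 0.0
high = von_neumann_entropy(np.eye(4))        # = log(4)
```

High entropy over the sampled responses is the hallucination signal: the model's answers do not cluster into one semantic meaning.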
reference: TriviaQA, NQ, and PopQA are datasets used for evaluating AI systems.
The Hallucinations Leaderboard, an Open Effort to Measure ... - Hugging Face (huggingface.co), Jan 29, 2024, 4 facts
reference: TriviaQA is an open-domain QA dataset sourced from trivia and quiz-league websites.
measurement: Models based on Mistral 7B achieve higher accuracy on TriviaQA (8-shot) and TruthfulQA than the other models evaluated on the Hallucinations Leaderboard.
claim: The Hallucination Leaderboard includes tasks across several categories: Closed-book Open-domain QA (NQ Open, TriviaQA, TruthfulQA), Summarisation (XSum, CNN/DM), Reading Comprehension (RACE, SQuADv2), Instruction Following (MemoTrap, IFEval), Fact-Checking (FEVER), Hallucination Detection (FaithDial, True-False, HaluEval), and Self-Consistency (SelfCheckGPT).
procedure: On the Hallucination Leaderboard, NQ Open and TriviaQA outputs are evaluated against gold answers using Exact Match, in 64-shot and 8-shot settings respectively.
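A minimal sketch of the SQuAD-style Exact Match normalization conventionally used for this kind of gold-answer comparison (lowercasing, stripping punctuation and articles, collapsing whitespace); the Leaderboard's exact implementation may differ:

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    norm_pred = normalize(prediction)
    return any(norm_pred == normalize(g) for g in gold_answers)

hit = exact_match("The Eiffel Tower.", ["Eiffel Tower", "Tour Eiffel"])   # True
miss = exact_match("The Eiffel Tower in Paris", ["Eiffel Tower"])         # False
```

Note that Exact Match is all-or-nothing: a correct answer embedded in extra words (the second call) still scores zero, which is why EM is usually reported alongside softer metrics.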
Large Language Models Meet Knowledge Graphs for Question ... - arXiv (arxiv.org), Sep 22, 2025, 2 facts
reference: The KnowLA method (Luo et al., 2024) applies knowledgeable adaptation to the Llama2-7B and Alpaca2 language models, incorporating the WordNet, ConceptNet, and Wikidata knowledge graphs to perform MCQA, CBQA, and TruthfulQA tasks on the CSQA, SIQA, BBH, WQSP, and TriviaQA datasets, evaluated using Acc, CE Score, BLEU, and ROUGE metrics.
reference: The Oreo method (Hu et al., 2022) uses knowledge interaction, injection, and knowledge-graph random walks with RoBERTa-base and T5-base models to perform CBQA, OBQA, and multi-hop QA tasks, evaluated by accuracy on the NQ, WQ, WQSP, TriviaQA, CWQ, and HotpotQA datasets.
Re-evaluating Hallucination Detection in LLMs - arXiv (arxiv.org), Aug 13, 2025, 2 facts
reference: The paper 'TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension' by Joshi et al. (2017) introduces the TriviaQA dataset for reading comprehension tasks, published as an arXiv preprint.
claim: The NQ-Open, TriviaQA, and SQuAD datasets are available under licenses that permit academic use.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... - arXiv (arxiv.org), Feb 23, 2026, 1 fact
reference: The paper 'TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension' introduces a dataset for reading comprehension tasks.