KGHaluBench
Facts (69)
Sources
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... — arxiv.org, Feb 23, 2026 (59 facts)
claim: KGHaluBench uses Entity Statistics to calculate question difficulty and estimate entity popularity; if any statistic is null or invalid, the entity is discarded from the sample set.
procedure: KGHaluBench determines entity popularity by combining the entity's individual relevance and the relevance of its associated type.
procedure: KGHaluBench filters entity subgraph relations by excluding non-textual (e.g., images), trivial (e.g., given names), and irrelevant (e.g., official websites) relations, then randomly selects three valid pairs to form a question; if fewer than three valid pairs exist, the entity is discarded.
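The filter-then-sample step described above can be sketched as follows; the exclusion list here is a hypothetical stand-in for the paper's non-textual, trivial, and irrelevant relation categories, and the seeded `random.sample` call is an illustration, not the paper's implementation.

```python
import random

# Hypothetical exclusion list standing in for the paper's non-textual,
# trivial, and irrelevant relation categories.
EXCLUDED_RELATIONS = {"image", "given name", "official website"}

def select_question_relations(pairs, k=3, seed=0):
    """Filter (relation, value) pairs and sample k of them for a question.

    Returns None when fewer than k valid pairs remain, in which case the
    entity is discarded, as described above."""
    valid = [(r, v) for r, v in pairs if r not in EXCLUDED_RELATIONS]
    if len(valid) < k:
        return None
    return random.Random(seed).sample(valid, k)
```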
procedure: To generate interpretable facts for comparison, the KGHaluBench pipeline uses an efficient LLM to transform structured tuples (entity name, type, relation, tense indicator, and fact) into grammatically correct sentences.
procedure: The KGHaluBench framework derives weights for the statistics using a machine learning pipeline, as outlined in Appendix A.2.3 of the source paper.
claim: RoBERTa-Large-MNLI sits at the front of the KGHaluBench pipeline because its lightweight design makes it efficient at resolving easily verifiable facts, such as explicitly stated or numerical ones.
procedure: The automated-judge prompts used in the human validation study, which compared KGHaluBench's entity-level and fact-level filters against GPT-3.5-Turbo, are configured with a temperature of 0 and a maximum of 10 tokens.
claim: KGHaluBench is a benchmark designed to evaluate the truthfulness of Large Language Models by decomposing the overall hallucination rate into specific components, identifying the level of knowledge responsible for each hallucination.
procedure: The KGHaluBench response verification framework assesses the factuality of long-form text in three steps: (1) an abstention filter detects expressions of uncertainty, (2) an entity-level filter identifies semantic misalignment with the focal entity, and (3) a fact-level check verifies correctness against grounded facts.
procedure: When a response is judged conceptually hallucinated, KGHaluBench approximates its complexity by averaging the focal entity's relation-set weights and multiplying the result by 3.
claim: KGHaluBench introduces weighted accuracy, which scales standard accuracy by the estimated difficulty of the assessment's content to provide a fairer measure of performance.
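A minimal sketch of a difficulty-weighted accuracy of the kind described above; the sum-of-weights formulation here is an assumption for illustration, not the paper's exact definition.

```python
def weighted_accuracy(correct, difficulty):
    """Difficulty-weighted accuracy: each question contributes its difficulty
    weight, so a hard question answered correctly counts for more than an
    easy one (assumed formulation, not the paper's exact metric)."""
    total = sum(difficulty)
    return sum(d for ok, d in zip(correct, difficulty) if ok) / total

# Two runs with the same standard accuracy (2/3) diverge once
# difficulty is taken into account.
print(weighted_accuracy([True, True, False], [0.2, 0.3, 1.5]))  # 0.25
print(weighted_accuracy([False, True, True], [0.2, 0.3, 1.5]))  # 0.9
```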
claim: The KGHaluBench study compared its verification framework against an automated judge based on GPT-3.5-Turbo, a model frequently used for hallucination detection in the existing literature.
measurement: The correlation between actual and estimated question difficulty in KGHaluBench is moderate and negative, with a Spearman's rank correlation coefficient of -0.403 and a Kendall's rank correlation coefficient of -0.299.
procedure: The KGHaluBench question template prompts an LLM for a brief overview of an entity and three randomly chosen valid relations, with supplementary context supplied to reduce ambiguity when multiple entities share the same name.
procedure: KGHaluBench estimates question complexity by aggregating the weights of the three relations in a question; relation weights are obtained by computing the average question score for each valid relation and applying min-max normalization.
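The weight derivation above can be sketched as follows; the relation names and scores are hypothetical, and summation is an assumed choice of aggregation.

```python
def min_max(values):
    """Min-max normalize a list of scores into [0, 1]."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]

# Hypothetical average question scores per relation (not the paper's data).
avg_scores = {"occupation": 0.62, "birthplace": 0.81, "award received": 0.35}
weights = dict(zip(avg_scores, min_max(list(avg_scores.values()))))

def question_complexity(relations):
    # Assumed aggregation: sum of the three relation weights.
    return sum(weights[r] for r in relations)
```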
measurement: The KGHaluBench tri-stage fact verification pipeline achieved 87.74% alignment with human judgment in the validation study, 8.56 percentage points higher than the automated GPT-3.5-Turbo judge, which achieved 79.18%.
procedure: KGHaluBench quantifies entity-type relevance by averaging question scores for each entity type and normalizing the results to the range [0, 1].
claim: KGHaluBench statistically estimates the difficulty of each question, aggregates the estimates across the assessment, and scales accuracy accordingly to ensure reliable evaluation.
measurement: At a 0.700 threshold, the KGHaluBench entity-level filter achieved 5.65% higher alignment with human judgment and 48.78% higher recall than the automated GPT-3.5-Turbo judge.
procedure: The NLI Entailment Filter in the KGHaluBench pipeline uses a Natural Language Inference (NLI) model to classify the relationship between a reformatted fact and an LLM's response as entailment, contradiction, or neutral.
measurement: KGHaluBench estimates entity relevance by aggregating seven statistics: Page Views (2017–2025), Site Links, Linked Entities, IDs, Wiki Count, Statements, and References.
procedure: The benchmark's entity-level verification prompt is configured with a temperature of 0 and a top_p of 0.6.
procedure: The authors apply the derived weights to compute question difficulty scores, improving assessment fairness and addressing the entity-popularity bias introduced by the benchmark's question-generation mechanism.
procedure: KGHaluBench selects focal entities by drawing sequentially from an ordered sample and extracting one-hop neighbors to form a subgraph; if a sample contains no valid entity types, a new sample is generated.
procedure: In the fact-level filtering task of the KGHaluBench evaluation, participants were given three facts corresponding to the relations in a question and asked to verify whether each fact was explicitly stated in the LLM's response.
claim: KGHaluBench incorporates a difficulty-scaled accuracy metric to keep the benchmark fair and consistent, accounting for variations in difficulty caused by dynamic question generation.
claim: KGHaluBench derives breadth-of-knowledge and depth-of-knowledge hallucination rates from the stage of the response verification framework at which a hallucination was detected, indicating which aspect of an LLM's knowledge caused it.
measurement: The Expert Decision filter in the KGHaluBench pipeline resolves 1.03% of facts, with an average verification time of 2.35 seconds.
procedure: The LLM Entailment Filter in the KGHaluBench pipeline uses an LLM acting as a fact-checking assistant to determine whether a fact is explicitly stated, contradicted, or not mentioned in a response, offering higher accuracy than the NLI model at higher computational cost.
procedure: The evaluation methodology computes mean accuracy and weighted accuracy for 25 models across 10 runs, then averages these values over all models to obtain aggregated metrics.
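The two-level aggregation above can be sketched with illustrative numbers (two models and two runs here, versus the paper's 25 models and 10 runs):

```python
# Per-model accuracies over repeated runs (illustrative data, not the
# paper's results).
runs = {
    "model_a": [0.44, 0.46],
    "model_b": [0.50, 0.52],
}
# Step 1: mean accuracy per model across its runs.
per_model_mean = {m: sum(a) / len(a) for m, a in runs.items()}
# Step 2: macro-average across models for the aggregated metric.
aggregate = sum(per_model_mean.values()) / len(per_model_mean)
print(round(aggregate, 2))  # 0.48
```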
measurement: The LLM Entailment filter in the KGHaluBench pipeline uses Llama3.1:8B to resolve 71.39% of facts, with an average verification time of 1.91 seconds.
procedure: The Expert Decision Filter serves as a fail-safe: when the NLI model's verdict contradicts the LLM Entailment Filter's decision, an LLM makes a binary choice between the two expert options (entailment vs. contradiction).
claim: KGHaluBench uses entity descriptions from external databases as factual representations to compare against LLM responses during entity-level filtering.
measurement: Applying difficulty-based weighting decreases mean accuracy by 0.05 percentage points, from 45.30% to 45.25%.
procedure: The fact verification pipeline used in KGHaluBench employs a three-stage filter cascade: NLI Entailment, LLM Entailment, and Expert Decision.
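The cascade can be sketched as follows; the confidence cutoff and routing rules are assumptions consistent with the stage descriptions elsewhere in this list (cheap NLI first, LLM entailment next, expert decision only on disagreement), and the three callables are stand-ins for the real models.

```python
def verify_fact(fact, response, nli, llm_judge, expert):
    """Three-stage verification cascade sketch.

    `nli` returns (label, confidence); `llm_judge` and `expert` return a
    label in {"entailment", "contradiction", "neutral"}. All three are
    caller-supplied stand-ins for the actual models."""
    # Stage 1: lightweight NLI model resolves easy, confidently-scored facts.
    label, confidence = nli(fact, response)
    if confidence >= 0.9:  # assumed cutoff for "resolved at stage 1"
        return label
    # Stage 2: slower but more accurate LLM entailment check.
    llm_label = llm_judge(fact, response)
    if llm_label == label:
        return llm_label
    # Stage 3: fail-safe binary expert decision when the stages disagree.
    return expert(fact, response)
```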
claim: The authors of KGHaluBench propose a fairer accuracy metric that adjusts standard accuracy by statistically estimating question difficulty.
measurement: The evaluation covered 15 open-source models ranging from 8 billion to 1 trillion parameters and 10 proprietary models from OpenAI, Google, Anthropic, and xAI.
measurement: The NLI Entailment filter in the KGHaluBench pipeline uses RoBERTa-Large-MNLI (365M parameters) to resolve 27.58% of facts, with an average verification time of 0.36 seconds.
claim: Despite substantial agreement with human judgment, the automated fact verification framework used in KGHaluBench can still make mistakes, such as rejecting valid responses or scoring misaligned ones.
claim: KGHaluBench uses the relational structure of a Knowledge Graph to formulate compound questions about single entities, challenging LLM knowledge.
measurement: The KGHaluBench entity-level filter achieved its highest F1 score, 78.07%, at a threshold of 0.700, with an overall agreement of 77.98%.
procedure: KGHaluBench calculates question-difficulty weights on the training split of its calibration dataset and validates the metric on the validation split.
measurement: The KGHaluBench benchmark evaluated 25 state-of-the-art LLMs.
claim: KGHaluBench requires Knowledge Graph (KG) triples both to generate benchmark questions and to verify the correctness of LLM responses.
claim: The KGHaluBench entity-level filter prioritizes recall: misaligned responses admitted at the first stage will still score poorly at the fact-level check, whereas aligned responses mistakenly discarded harm overall assessment accuracy.
claim: Entity relevance in KGHaluBench is determined by two factors: prominence, which reflects recognisability and graph connectivity, and information coverage, which captures the availability and detail of information.
reference: Table 1 of the KGHaluBench experiments reports weighted accuracy, abstain rate, and both hallucination rates for all models tested.
procedure: The fact verification pipeline prompt is configured with a temperature of 0.3 and a top_p of 0.5.
measurement: In the KGHaluBench evaluation, GPT-5 achieved the highest weighted accuracy, 65.60%.
measurement: The average verification time in the KGHaluBench pipeline is 1.49 seconds per fact and 4.47 seconds per question.
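The per-fact figure is consistent with the per-stage shares and latencies quoted elsewhere in this list, as a quick cross-check shows (4.47 s per question is 3 × the rounded 1.49 s per fact, since each question carries three facts):

```python
# (share of facts resolved, average seconds per fact) for each stage,
# as reported for the KGHaluBench pipeline.
stages = {
    "NLI Entailment":  (0.2758, 0.36),
    "LLM Entailment":  (0.7139, 1.91),
    "Expert Decision": (0.0103, 2.35),
}
# Expected verification time per fact: share-weighted sum of stage latencies.
per_fact = sum(share * secs for share, secs in stages.values())
print(round(per_fact, 2))   # 1.49
print(round(3 * 1.49, 2))   # 4.47 — three facts per question
```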
claim: The Expert Decision filter acts as the final fail-safe stage of the KGHaluBench pipeline, processing only facts that have passed through the first two filters.
claim: The KGHaluBench framework uses Wikidata as its knowledge base for experiments.
reference: The KGHaluBench benchmark consists of two complementary components: a Question Generation Module and a Response Verification Module.
perspective: The authors of the KGHaluBench paper advocate 'constructive abstention,' in which an AI model, rather than simply refusing to answer, guides users toward locating reliable information themselves.
measurement: Proprietary models evaluated in KGHaluBench demonstrated superior factuality to open-source models, with an average weighted accuracy of 55.94% versus 48.32%.
account: The human validation study for KGHaluBench involved nine participants, all Master's or PhD students, who consented to their responses being used to evaluate and validate the verification framework.
claim: KGHaluBench relies on Wikidata as its information source, which introduces data-quality limitations due to uneven representation of entities across topics, cultures, and languages, particularly for non-English entities and marginalized topics.
measurement: In the KGHaluBench entity-level filter, a threshold of 0.750 achieved the highest alignment with human judges, 79.19%, but lowered recall to 73.17%, indicating the filter was overly strict.
measurement: Applying difficulty-based weighting reduces the mean standard deviation across models by 0.12 percentage points, from 2.57% to 2.45%.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... — aclanthology.org, 6 days ago (8 facts)
claim: KGHaluBench is publicly available to support future developments in hallucination mitigation.
procedure: The KGHaluBench framework utilizes a knowledge graph to dynamically construct challenging, multifaceted questions for LLMs, with question difficulty statistically estimated to address popularity bias.
measurement: The authors of KGHaluBench evaluated 25 frontier models using novel accuracy and hallucination metrics to gain insight into the knowledge factors causing hallucinations across different model sizes.
procedure: The KGHaluBench automated verification pipeline detects abstentions and verifies Large Language Model responses at both the conceptual and correctness levels to identify different types of hallucinations.