Concept: benchmarks

Facts (12)

Sources
A Survey on the Theory and Mechanism of Large Language Models · arxiv.org · arXiv · Mar 12, 2026 · 3 facts
claim: A significant open challenge in the evaluation stage of large language models is advancing from empirical evaluation via benchmarks to formal guarantees of model behavior, such as proving that a model will not hallucinate or leak sensitive information under specified conditions.
reference: The paper 'Investigating data contamination in modern benchmarks for large language models' was published in the Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8698–8711.
reference: The paper 'When judgment becomes noise: how design failures in LLM judge benchmarks silently undermine validity' analyzes how design flaws in benchmarks that use large language models as judges can invalidate their results.
Knowledge Graphs, Large Language Models, and Hallucinations · sciencedirect.com · ScienceDirect · 2 facts
claim: The majority of existing benchmarks for evaluating hallucination detection models focus on response-level evaluation.
claim: Numerous benchmarks have been proposed for evaluating hallucination detection models in knowledge-integrated AI, as indicated in Table 1 of the article 'Knowledge Graphs, Large Language Models, and Hallucinations'.
Hallucination Causes: Why Language Models Fabricate Facts · M. Brenndoerfer · mbrenndoerfer.com · Mar 15, 2026 · 2 facts
claim: Benchmarks that only measure whether answers are correct or incorrect fail to reveal miscalibration in how large language models express uncertainty.
claim: Benchmarks for large language models that test only high-frequency factual questions fail to reveal tail-entity hallucination, and benchmarks that test only short responses fail to reveal exposure-bias accumulation.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... · aclanthology.org · Alex Robertson, Huizhi Liang, Mahbub Gani, Rohit Kumar, Srijith Rajamohan · Association for Computational Linguistics · 6 days ago · 2 facts
claim: Existing benchmarks for evaluating large language models are limited to static and narrow questions, which leads to limited coverage and misleading evaluations.
perspective: Existing benchmarks for large language models are limited by static and narrow questions, which leads to limited coverage and misleading evaluations of model truthfulness.
Construction of Knowledge Graphs: State and Challenges · arxiv.org · arXiv · 1 fact
claim: Future work in knowledge graph construction faces open challenges around incremental construction approaches, open toolsets, and benchmarks.
Wealthfront Classic Portfolio Investment Methodology White Paper · research.wealthfront.com · Wealthfront · Mar 9, 2026 · 1 fact
measurement: According to the S&P Dow Jones Indices SPIVA US Scorecard published at year-end 2023, 91% of US domestic active mutual funds underperformed their benchmarks over the preceding 10-year period.
Medical Hallucination in Foundation Models and Their ... · medrxiv.org · medRxiv · Mar 3, 2025 · 1 fact
claim: The term 'hallucination' in AI lacks a universally accepted definition and covers diverse error types, which creates a fundamental challenge for standardizing benchmarks or evaluating detection methods (Huang et al., 2024).