Relations (1)
related (score 2.81), strongly supported by 6 facts
Benchmarks serve as the primary mechanism for evaluating the performance and truthfulness of Large Language Models, as evidenced by research into their design flaws [1], their limited coverage {fact:5, fact:6}, and their specific failures to test model capabilities such as hallucination and calibration {fact:3, fact:4}. The academic literature addresses the intersection of these two concepts directly, for example by investigating data contamination within these evaluation frameworks [2].
Facts (6)
Sources
Hallucination Causes: Why Language Models Fabricate Facts (mbrenndoerfer.com, 2 facts)
claim: Benchmarks that only measure whether answers are correct or incorrect fail to reveal miscalibration in uncertainty expression in large language models.
claim: Benchmarks for large language models that test only high-frequency factual questions fail to reveal tail-entity hallucination, and benchmarks that test only short responses fail to reveal exposure bias accumulation.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... (aclanthology.org, 2 facts)
claim: Existing benchmarks for evaluating Large Language Models are limited by static and narrow questions, which leads to limited coverage and misleading evaluations.
perspective: Existing benchmarks for Large Language Models are limited by static and narrow questions, which leads to limited coverage and misleading evaluations of model truthfulness.
A Survey on the Theory and Mechanism of Large Language Models (arxiv.org, 2 facts)
reference: The paper 'Investigating data contamination in modern benchmarks for large language models' was published in the Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 8698–8711.
reference: The paper 'When judgment becomes noise: how design failures in LLM judge benchmarks silently undermine validity' analyzes how design flaws in benchmarks that use large language models as judges can invalidate their results.