Relations (1)
Related (score 2.00): strongly supporting, 3 facts
Benchmarks are the primary tool for evaluating Large Language Models, yet they struggle to address hallucination: the term itself lacks a standardized definition [3], and advancing from empirical evaluation to formal guarantees against such errors remains an open challenge [2]. Furthermore, the static and narrow questions of current benchmarks yield limited coverage and misleading evaluations, directly undermining the ability to accurately measure and mitigate hallucination [1].
Facts (3)
Sources
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... (aclanthology.org, 1 fact)
Claim: Existing benchmarks for evaluating Large Language Models are limited by static and narrow questions, which leads to limited coverage and misleading evaluations.
A Survey on the Theory and Mechanism of Large Language Models (arxiv.org, 1 fact)
Claim: The Evaluation Stage of Large Language Models faces a significant open challenge in advancing from empirical evaluation via benchmarks to providing formal guarantees of model behavior, such as proving a model will not hallucinate or leak sensitive information under specific conditions.
Medical Hallucination in Foundation Models and Their ... (medrxiv.org, 1 fact)
Claim: The term 'hallucination' in AI lacks a universally accepted definition and encompasses diverse errors, which creates a fundamental challenge for standardizing benchmarks or evaluating detection methods (Huang et al., 2024).