Relations (1)
Related (score 2.00): strongly supporting, 3 facts
Benchmarks are the primary tool for evaluating Large Language Models, yet they struggle to address hallucination: the term itself lacks a standardized definition [3], and advancing from empirical evaluation to formal guarantees against such errors remains an open challenge [2]. Furthermore, the static and narrow questions of current benchmarks yield limited coverage and misleading evaluations, directly undermining the ability to accurately measure and mitigate hallucination [1].
Facts (3)
Sources
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... (aclanthology.org, 1 fact)
Claim: Existing benchmarks for evaluating Large Language Models are limited by static and narrow questions, which leads to limited coverage and misleading evaluations.
A Survey on the Theory and Mechanism of Large Language Models (arxiv.org, 1 fact)
Claim: The Evaluation Stage of Large Language Models faces a significant open challenge in advancing from empirical evaluation via benchmarks to providing formal guarantees of model behavior, such as proving a model will not hallucinate or leak sensitive information under specific conditions.
Medical Hallucination in Foundation Models and Their ... (medrxiv.org, 1 fact)
Claim: The term 'hallucination' in AI lacks a universally accepted definition and encompasses diverse errors, which creates a fundamental challenge for standardizing benchmarks or evaluating detection methods (Huang et al., 2024).