reference
The Phare benchmark's hallucination module evaluates large language models across four task categories: factual accuracy, misinformation resistance, debunking capabilities, and tool reliability. Factual accuracy is tested with structured question-answering tasks that measure retrieval precision, while misinformation resistance examines whether a model pushes back on ambiguous or ill-posed questions rather than fabricating a plausible narrative.
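As a rough illustration of how a per-category evaluation harness of this shape could be organized (a minimal sketch only; the class, function, and category names below are hypothetical, not Giskard's actual Phare API):

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

# Hypothetical layout of the four hallucination task categories;
# names are illustrative, not taken from the Phare codebase.
class TaskCategory(Enum):
    FACTUAL_ACCURACY = "factual_accuracy"          # structured QA, retrieval precision
    MISINFORMATION_RESISTANCE = "misinformation"   # pushing back on ill-posed premises
    DEBUNKING = "debunking"                        # challenging false claims
    TOOL_RELIABILITY = "tools"                     # correct tool invocation

@dataclass
class Sample:
    category: TaskCategory
    prompt: str
    grader: Callable[[str], bool]  # True if the answer avoids hallucination

def evaluate(model: Callable[[str], str],
             samples: list[Sample]) -> dict[TaskCategory, float]:
    """Return the average pass rate per task category."""
    buckets: dict[TaskCategory, list[int]] = {c: [] for c in TaskCategory}
    for s in samples:
        buckets[s.category].append(int(s.grader(model(s.prompt))))
    return {c: sum(v) / len(v) for c, v in buckets.items() if v}
```

Scoring per category rather than globally matches the module's design: a model can score well on factual recall while still failing to resist a misleading premise.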
Sources
- Phare LLM Benchmark: an analysis of hallucination in ... (www.giskard.ai)
Referenced by nodes (2)
- Large Language Models concept
- factual correctness concept