concept

Vectara LLM Hallucination Leaderboard

Also known as: Vectara Hallucination Leaderboard, Vectara hallucination leaderboard

Facts (19)

Sources
Source: vectara/hallucination-leaderboard (Vectara, GitHub), 17 facts
claim: An extractive summarizer that copies and pastes text from the original document would score 100% (zero hallucinations) on the Vectara hallucination leaderboard, because such a model would, by definition, produce a faithful summary.
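As a toy illustration of why a purely extractive summary is trivially consistent (this is not the leaderboard's actual model-based scoring), consider a check that counts how many summary sentences appear verbatim in the source:

```python
def extractive_consistency(source: str, summary: str) -> float:
    """Fraction of summary sentences found verbatim in the source.

    Hypothetical helper for illustration only, not the leaderboard's
    evaluation model: a summary built purely by copying sentences
    from the source scores 1.0 by construction.
    """
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    if not sentences:
        return 1.0
    return sum(s in source for s in sentences) / len(sentences)

source = "The cat sat on the mat. The dog slept by the door."
print(extractive_consistency(source, "The cat sat on the mat."))   # 1.0
print(extractive_consistency(source, "The cat flew to the moon.")) # 0.0
```

A real evaluator has to judge paraphrases and entailment, not verbatim overlap, which is why the leaderboard uses a trained model rather than string matching.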
reference: The SummaC and TRUE papers are cited as relevant resources for hallucination detection in the Vectara hallucination-leaderboard GitHub repository.
claim: The Vectara hallucination leaderboard is designed as a living leaderboard, with plans to update both the models included and the source documents used for evaluation over time.
reference: The Vectara hallucination leaderboard accesses each large language model through a specific API:
- Together AI: Llama 4 Maverick 17B 128E Instruct FP8, Llama 4 Scout 17B 16E Instruct, GPT-OSS-120B, and GLM-4.5-AIR-FP8
- Azure: Microsoft Phi-4 and Phi-4-Mini
- Mistral AI API: Ministral 3B, Ministral 8B, Mistral Large, Mistral Medium, and Mistral Small
- Moonshot AI API: Kimi-K2-Instruct-0905
- OpenAI API: GPT-4.1, GPT-4o, GPT-5-High, GPT-5-Mini, GPT-5-Minimal, GPT-5-Nano, o3-Pro, o4-Mini-High, and o4-Mini-Low
- DashScope API: Qwen3-4b, Qwen3-8b, Qwen3-14b, Qwen3-32b, and Qwen3-80b-a3b-thinking
- Replicate API: Snowflake-Arctic-Instruct
- xAI API: Grok-3, Grok-4-Fast-Reasoning, and Grok-4-Fast-Non-Reasoning
- DeepInfra: GLM-4.6
claim: The Vectara hallucination leaderboard does not evaluate summarization quality; it focuses exclusively on the factual consistency of the summaries the models produce.
reference: The Vectara hallucination leaderboard integrates Gemini 2.5 Pro (gemini-2.5-pro), Gemini 2.5 Flash (gemini-2.5-flash), and Gemini 2.5 Flash Lite (gemini-2.5-flash-lite) via Vertex AI.
claim: The creators of the Vectara hallucination leaderboard chose a model-based evaluation process rather than human evaluation because human evaluation does not scale well enough to allow constant updates as new APIs and models are released in the fast-moving field of AI.
reference: The Vectara hallucination leaderboard integrates DeepSeek V3, DeepSeek V3.1, DeepSeek V3.2-Exp, and DeepSeek R1 via the Hugging Face inference provider.
claim: The Vectara hallucination leaderboard evaluates summarization rather than general 'closed book' question answering, so the large language models under test need only a solid grasp of the supported languages, not memorized human knowledge.
procedure: The Vectara hallucination leaderboard explicitly filters out model responses that refuse to summarize a document or that give only one-to-two word answers, to prevent models from gaming the evaluation.
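A minimal sketch of such a filter, assuming a hypothetical refusal-marker list and word-count threshold (the repository does not publish this exact heuristic):

```python
def keep_response(summary: str, min_words: int = 3) -> bool:
    """Drop refusals and one-to-two word answers before scoring.

    Hypothetical sketch: the marker list and threshold here are
    assumptions, not the leaderboard's published implementation.
    """
    text = summary.strip().lower()
    refusal_markers = ("i cannot", "i can't", "i'm sorry", "as an ai")
    if any(text.startswith(marker) for marker in refusal_markers):
        return False
    return len(text.split()) >= min_words

print(keep_response("The memo proposes three budget changes."))  # True
print(keep_response("I cannot summarize this document."))        # False
print(keep_response("No."))                                      # False
```

Filtering before scoring matters because a refusal contains no claims at all, so a naive consistency scorer would otherwise count it as perfectly hallucination-free.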
claim: The evaluation protocol used by the Vectara hallucination leaderboard builds on a large body of existing academic work on factual consistency.
reference: The Vectara hallucination leaderboard integrates Llama 3.3 70B Instruct Turbo via the Together AI API.
reference: The Vectara hallucination leaderboard integrates Granite-3.3-Instruct 8B and Granite-4.0-h-small via the Replicate API.
claim: The creators of the Vectara hallucination leaderboard prefer model-based evaluation over human evaluation because it yields a repeatable process that can be shared with others, whereas human annotation is difficult to replicate and share beyond the process description and labels.
procedure: The Vectara hallucination leaderboard evaluation is performed only on documents for which every model provided a summary, ensuring a consistent comparison set.
perspective: The author of the Vectara hallucination-leaderboard argues that testing models against a list of well-known facts is a poor way to detect hallucinations: the model's training data is unknown, the definition of 'well known' is unclear, and most hallucinations arise from rare or conflicting information rather than common knowledge.
claim: The creators of the Vectara hallucination leaderboard assert that building a model to detect hallucinations is significantly easier than building a generative model that never hallucinates.
Source: A framework to assess clinical safety and hallucination rates of LLMs ... (Nature, nature.com, May 13, 2025), 1 fact
reference: The Vectara Hallucination Leaderboard, maintained by Vectara, Inc. since 2023, compares how well large language models maintain factual consistency when summarizing sets of facts.
Source: EdinburghNLP/awesome-hallucination-detection (GitHub), 1 fact
reference: The Vectara LLM Hallucination Leaderboard is a resource for evaluating hallucinations in large language models.