concept

Hallucination Leaderboard

Also known as: Hallucinations Leaderboard

Facts (19)

Sources
The Hallucinations Leaderboard, an Open Effort to Measure ... (huggingface.co, Hugging Face, Jan 29, 2024) · 18 facts
measurement: Experiments for The Hallucinations Leaderboard are conducted on the Edinburgh International Data Facility (EIDF) and internal clusters at the School of Informatics, University of Edinburgh, using NVIDIA A100-40GB and A100-80GB GPUs.
procedure: To assess the faithfulness of models to original documents in summarisation tasks, the Hallucination Leaderboard uses ROUGE (measuring overlap between generated and reference text), factKB (a generalisable model-based metric for factuality evaluation), and BERTScore-Precision (which computes similarity between two texts using token representation similarities).
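The overlap component of this evaluation can be illustrated with a minimal ROUGE-1 computation. This is a toy sketch, not the leaderboard's actual code (which would use an established ROUGE implementation); factKB and BERTScore require model inference and are omitted.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Toy ROUGE-1: F1 over unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

score = rouge1_f1("the cat sat on the mat", "the cat lay on the mat")
```

Higher scores mean more lexical overlap with the reference summary; ROUGE alone cannot detect a fluent but unfaithful summary, which is why the leaderboard pairs it with model-based metrics.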
claim: The Hallucinations Leaderboard is an open project designed to measure and address hallucinations in LLMs, aiming to provide insights into model generalization, limitations, and tendencies to generate hallucinated content.
procedure: The Hallucinations Leaderboard evaluation process ranks LLMs using quantitative metrics and provides qualitative analysis by sharing samples of model-generated text.
claim: The backend and front-end code for The Hallucinations Leaderboard is a fork of the Hugging Face Leaderboard Template.
measurement: The Hallucination Leaderboard reports all metrics on a common 0-1 scale, so that a score of 0.8 corresponds to 80% accuracy on tasks such as TruthfulQA MC1 and MC2.
measurement: Models based on Mistral 7B demonstrate higher accuracy on TriviaQA (8-shot) and TruthfulQA compared to other models evaluated on the Hallucinations Leaderboard.
measurement: Falcon 7B yields the best results on the NQ (8-shot) dataset among models evaluated on the Hallucinations Leaderboard.
claim: The Hallucinations Leaderboard is a platform designed to evaluate large language models against benchmarks specifically created to assess hallucination-related issues using in-context learning.
claim: The Hallucination Leaderboard includes tasks across several categories: Closed-book Open-domain QA (NQ Open, TriviaQA, TruthfulQA), Summarisation (XSum, CNN/DM), Reading Comprehension (RACE, SQuADv2), Instruction Following (MemoTrap, IFEval), Fact-Checking (FEVER), Hallucination Detection (FaithDial, True-False, HaluEval), and Self-Consistency (SelfCheckGPT).
reference: The Hallucinations Leaderboard evaluates hallucination detection using two tasks: SelfCheckGPT, which checks for self-consistency in model answers, and HaluEval, which checks for faithfulness hallucinations in QA, Dialog, and Summarisation tasks relative to a knowledge snippet.
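The self-consistency idea behind SelfCheckGPT can be sketched very roughly: re-sample several answers to the same question and measure how well the original answer is supported by the samples. The sketch below uses simple token overlap as the support measure; the actual method uses stronger scorers (e.g. NLI models), so this is an illustration of the principle only.

```python
import re

def tokens(text: str) -> list[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z']+", text.lower())

def support(answer: str, sample: str) -> float:
    """Fraction of answer tokens that also occur in one re-sampled answer."""
    ans = tokens(answer)
    if not ans:
        return 0.0
    bag = set(tokens(sample))
    return sum(t in bag for t in ans) / len(ans)

def consistency_score(answer: str, samples: list[str]) -> float:
    """Mean support across stochastic re-samples; a low score flags an
    answer the model cannot reproduce, a hallucination signal."""
    if not samples:
        return 0.0
    return sum(support(answer, s) for s in samples) / len(samples)

# Hypothetical re-samples of a model's answer to the same prompt.
samples = [
    "Paris is the capital of France.",
    "The capital of France is Paris.",
    "France's capital city is Paris.",
]
```

An answer contradicted by the model's own re-samples ("Lyon is the capital of France.") scores lower than one the samples agree with, which is the signal SelfCheckGPT exploits.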
procedure: For both XSum and CNN/DM summarisation tasks, the Hallucination Leaderboard follows a 2-shot learning setting.
reference: RACE and SQuADv2 are datasets used for assessing a model's reading comprehension skills on the Hallucination Leaderboard.
procedure: The Hallucinations Leaderboard utilizes the EleutherAI Language Model Evaluation Harness to perform zero-shot and few-shot evaluations of large language models via in-context learning.
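A harness run of this kind might look like the following. The model and task names here are illustrative examples, not the leaderboard's exact configuration, and the flags are from the harness's v0.4-style CLI.

```shell
# Install the EleutherAI evaluation harness, then run a few-shot evaluation.
pip install lm-eval

# Illustrative only: model and task selection are assumptions, not the
# leaderboard's published setup.
lm_eval \
  --model hf \
  --model_args pretrained=mistralai/Mistral-7B-v0.1 \
  --tasks nq_open,triviaqa,truthfulqa_mc1 \
  --num_fewshot 8 \
  --batch_size 8
```

The harness builds the in-context prompt (few-shot examples plus the test instance) and scores model outputs per task, which is what makes a uniform comparison across models possible.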
reference: The Hallucinations Leaderboard project team released a paper titled 'The Hallucinations Leaderboard -- An Open Effort to Measure Hallucinations in Large Language Models', which is available on arXiv.
procedure: In the Hallucination Leaderboard, models are evaluated on NQ Open and TriviaQA against gold answers using Exact Match, in 64-shot and 8-shot learning settings.
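Exact Match scoring of this kind is conventionally preceded by answer normalisation. The sketch below follows the common SQuAD-style convention (lowercase, strip punctuation and articles); the leaderboard's exact normalisation rules may differ.

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalisation: lowercase, drop punctuation,
    articles, and extra whitespace (a common convention; an assumption
    about the leaderboard's exact rules)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """A prediction is correct if it matches any gold answer after normalisation."""
    pred = normalize(prediction)
    return any(pred == normalize(g) for g in gold_answers)
```

For example, `exact_match("The Eiffel Tower", ["Eiffel Tower"])` is `True`, because both sides normalise to `"eiffel tower"`; a paraphrase like "the famous tower in Paris" would score 0 even if factually right, which is the known strictness of Exact Match.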
claim: The Hallucinations Leaderboard team uses hierarchical clustering on datasets, metrics, and models to identify performance clusters, specifically grouping models into Mistral 7B-based models, LLaMA 2-based models, and smaller models such as BLOOM 560M and GPT-Neo.
claim: The Hallucinations Leaderboard evaluates Large Language Models (LLMs) on their ability to handle various types of hallucinations, to provide researchers and developers with insights into model reliability and efficiency.
vectara/hallucination-leaderboard (github.com, Vectara) · 1 fact
perspective: Vectara does not recommend using their hallucination leaderboard as a standalone metric, but rather as a quality metric to be run alongside other evaluations such as summarization quality and question-answering accuracy.