Cleanlab
Facts (24)
Sources
Benchmarking Hallucination Detection Methods in RAG (cleanlab.ai, Sep 30, 2024), 17 facts
claim: Cleanlab uses the term 'hallucination' synonymously with 'incorrect response' in the context of RAG systems.
claim: The hallucination detectors evaluated by Cleanlab include RAGAS, G-Eval, LLM self-evaluation, the DeepEval hallucination metric, and the Trustworthy Language Model (TLM).
claim: Cleanlab observed that the Context Utilization score from RAGAS was ineffective for hallucination detection.
claim: Cleanlab's study focuses on algorithms that determine when an LLM response, generated from retrieved context, should not be trusted.
claim: For a fair comparison, the Cleanlab benchmark fixes the underlying LLM for all hallucination detection methods to gpt-4o-mini.
perspective: Cleanlab argues that the current lack of trustworthiness in AI limits the return on investment (ROI) of enterprise AI, and that the Trustworthy Language Model (TLM) offers an effective route to trustworthy RAG with comprehensive hallucination detection.
procedure: The Hallucination Metric from the DeepEval package estimates the likelihood of hallucination as the degree to which an LLM response contradicts the provided context, as judged by an LLM (GPT-4o-mini in the Cleanlab study).
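A minimal sketch of the idea behind such a contradiction-based metric, with a stub judge standing in for the LLM call; the function and variable names here are illustrative, not DeepEval's actual API:

```python
from typing import Callable, List

def hallucination_score(
    response: str,
    context_statements: List[str],
    contradicts: Callable[[str, str], bool],
) -> float:
    """Fraction of context statements the response contradicts, as decided
    by the `contradicts` judge (an LLM call such as GPT-4o-mini in practice)."""
    if not context_statements:
        return 0.0
    hits = sum(contradicts(response, s) for s in context_statements)
    return hits / len(context_statements)

# Stub judge for illustration: canned verdicts in place of a real LLM call.
verdicts = {
    "Paris is the capital of France.": True,   # the response contradicts this
    "France uses the euro.": False,
}
score = hallucination_score(
    "Lyon is the capital of France.",
    list(verdicts),
    lambda response, statement: verdicts[statement],
)
# score == 0.5: half of the context statements are contradicted
```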
claim: Cleanlab developed RAGAS++, a variant of the RAGAS framework, to work around software issues encountered in the original RAGAS implementation.
procedure: In Self-Evaluation, the Cleanlab team uses chain-of-thought (CoT) prompting, asking the LLM to explain its reasoning before outputting a confidence score.
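The CoT self-evaluation step can be sketched as a prompt builder; the wording below is illustrative, not Cleanlab's exact template:

```python
def self_eval_prompt(query: str, context: str, response: str) -> str:
    """Build a chain-of-thought self-evaluation prompt: the LLM is asked to
    reason step by step before emitting a confidence score in [0, 1]."""
    return (
        "You are given a question, retrieved context, and a proposed answer.\n"
        f"Question: {query}\n"
        f"Context: {context}\n"
        f"Answer: {response}\n"
        "First, explain step by step whether the answer is fully supported "
        "by the context. Then, on the final line, output only a confidence "
        "score between 0 and 1 that the answer is correct."
    )

prompt = self_eval_prompt(
    "Who wrote Hamlet?", "Hamlet is a play by Shakespeare.", "Shakespeare"
)
```

Sending this prompt to the same LLM that generated the answer, and parsing the final line as the score, yields the self-evaluation signal.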
account: Cleanlab encountered persistent software issues while running the RAGAS framework, such as the internal error 'No statements were generated from the answer.'
perspective: Cleanlab's study examines only how effectively different detectors alert RAG system users when answers are incorrect, rather than assessing other system properties.
procedure: Each hallucination detection method in the Cleanlab benchmark takes a user query, retrieved context, and LLM response as input and returns a score between 0 and 1, with lower scores indicating a greater likelihood of hallucination.
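That input/output contract can be written down as a small interface; the names are illustrative, not any library's actual API:

```python
from typing import Protocol

class HallucinationDetector(Protocol):
    """Common contract for the benchmarked detectors."""

    def score(self, query: str, context: str, response: str) -> float:
        """Return a score in [0, 1]; lower means the response is more
        likely a hallucination."""
        ...

class ConstantDetector:
    """Trivial baseline: assigns every response the same score."""

    def score(self, query: str, context: str, response: str) -> float:
        return 0.5

detector: HallucinationDetector = ConstantDetector()
s = detector.score("What is the capital of France?", "Paris is the capital.", "Paris")
```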
formula: The Cleanlab benchmark evaluates hallucination detectors by AUROC, defined as the probability that the detector's score is lower for an example where the LLM responded incorrectly than for one where it responded correctly.
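Under that definition, AUROC can be computed directly over all (incorrect, correct) pairs, counting ties as one half:

```python
from itertools import product

def auroc(scores_incorrect, scores_correct):
    """Probability that the detector scores an incorrect example lower than
    a correct one; ties count as 1/2. 1.0 is perfect, 0.5 is chance."""
    pairs = list(product(scores_incorrect, scores_correct))
    wins = sum(
        1.0 if s_inc < s_cor else 0.5 if s_inc == s_cor else 0.0
        for s_inc, s_cor in pairs
    )
    return wins / len(pairs)

print(auroc([0.1, 0.4], [0.6, 0.9]))  # 1.0: every incorrect example scored lower
print(auroc([0.1, 0.8], [0.2, 0.9]))  # 0.75: 3 of the 4 pairs are ordered correctly
```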
claim: The Cleanlab researchers excluded the HaluEval and RAGTruth datasets from their benchmark suite because they discovered significant errors in those datasets' ground-truth annotations.
claim: Cleanlab evaluates popular hallucination detectors across four public Retrieval-Augmented Generation (RAG) datasets, scoring each detector by AUROC.
claim: The Cleanlab hallucination detection benchmark spans four public Context-Question-Answer datasets covering different RAG applications.
procedure: RAGAS++, Cleanlab's refined variant of RAGAS, uses the gpt-4o-mini LLM for both generation and critique, replacing the default gpt-3.5-turbo-16k and gpt-4 models.
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... (cleanlab.ai, Apr 7, 2025), 7 facts
claim: Cleanlab’s Trustworthy Language Model (TLM) quantifies the trustworthiness of an LLM response using a combination of self-reflection, consistency across sampled responses, and probabilistic measures.
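The consistency component of such a score can be sketched as agreement among resampled answers; this is a simplification with assumed names, not TLM's actual aggregation:

```python
from typing import Callable

def consistency_score(
    prompt: str,
    reference_answer: str,
    sample: Callable[[str], str],
    k: int = 5,
) -> float:
    """Resample the LLM k times and return the fraction of samples that
    agree with the reference answer; low agreement suggests low trust."""
    agree = lambda a, b: a.strip().lower() == b.strip().lower()
    samples = [sample(prompt) for _ in range(k)]
    return sum(agree(s, reference_answer) for s in samples) / k

# Stub sampler cycling through canned outputs (a real system would call the LLM
# with nonzero temperature).
canned = iter(["Paris", "Paris", "Lyon", "Paris", "Paris"])
score = consistency_score("Capital of France?", "Paris", lambda p: next(canned), k=5)
# score == 0.8: four of five resampled answers agree
```

TLM combines this kind of consistency signal with self-reflection and probabilistic measures into a single trustworthiness score.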
measurement: The Cleanlab RAG benchmark quantifies the effectiveness of detection methods using the Area Under the Receiver Operating Characteristic curve (AUROC).
claim: The Cleanlab RAG benchmark evaluates how effectively detection methods flag incorrect responses, rather than finer-grained concerns such as retrieval quality, faithfulness, or context utilization.
claim: Cleanlab’s Trustworthy Language Model (TLM) is a wrapper framework around any base LLM rather than a custom-trained model.
claim: In Cleanlab’s study, including the user question in the premise improved the results of the Hughes Hallucination Evaluation Model (HHEM) compared to using the context alone.
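HHEM is an NLI-style model that scores whether a hypothesis (the response) is supported by a premise; the variation described amounts to how the premise string is assembled. A sketch under that assumption, not HHEM's actual API:

```python
from typing import Optional

def build_premise(context: str, question: Optional[str] = None) -> str:
    """Assemble the NLI premise; optionally prepend the user question,
    the variation Cleanlab found improved HHEM's results."""
    return f"{question}\n{context}" if question else context

hypothesis = "Water boils at 100 C at sea level."
premise_context_only = build_premise("Water boils at 100 C.")
premise_with_question = build_premise(
    "Water boils at 100 C.", question="At what temperature does water boil?"
)
# Each (premise, hypothesis) pair would then be scored by the HHEM model.
```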
claim: Entries in the Cleanlab RAG benchmark datasets contain a user query, retrieved context, an LLM-generated response, and a binary annotation indicating whether the response was correct.
claim: Cleanlab’s Trustworthy Language Model (TLM) requires no special prompt template and can be used with the same prompt given to the RAG LLM that generated the response.