concept

FinanceBench

Facts (14)

Sources
Benchmarking Hallucination Detection Methods in RAG - Cleanlab (cleanlab.ai), Sep 30, 2024 · 9 facts
claim: For the FinanceBench application, the TLM (Trustworthy Language Model) method is the most effective technique for detecting hallucinations.
claim: The RAGAS hallucination detection metric often fails to produce the internal LLM statements needed for its computations when applied to the FinanceBench dataset, as RAGAS is more effective when answers are complete sentences rather than single numbers.
measurement: The RAGAS Faithfulness evaluation framework's failure rate, where a failure is defined as the software returning an error instead of a score, was 58.90% on DROP, 0.70% on RAGTruth, 83.50% on FinanceBench, 0.10% on PubMedQA, and 21.20% on CovidQA.
reference: FinanceBench is a RAG benchmark dataset consisting of public financial statements, where each instance includes a large retrieved context of plaintext financial information, a question, and a generated answer.
measurement: The default version of the RAGAS Faithfulness metric failed to produce a score for 83.5% of the examples in the FinanceBench dataset.
claim: RAGAS++, an improved version of the RAGAS Faithfulness metric, generated a score for every example in the FinanceBench dataset, although this did not significantly increase overall performance.
claim: In the FinanceBench dataset, hallucinated responses often contain incorrect numerical values.
claim: Most hallucination detection methods, excluding the basic Self-Evaluation technique, struggled to improve significantly over random guessing when evaluated on the FinanceBench dataset.
measurement: The RAGAS++ evaluation framework's failure rate, where a failure is defined as the software returning an error instead of a score, was 0.10% on DROP, 0.00% on RAGTruth, 0.00% on FinanceBench, 0.00% on PubMedQA, and 0.00% on CovidQA.
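The failure-rate measurements above all use the same bookkeeping: a failure is an evaluator run that returns an error rather than a score. A minimal sketch of that computation (the score values and example count below are illustrative, chosen only to reproduce the 83.5% FinanceBench figure, and are not taken from the benchmark data):

```python
def failure_rate(outputs):
    """Fraction of evaluator outputs that are errors (None) instead of scores."""
    if not outputs:
        return 0.0
    failures = sum(1 for o in outputs if o is None)
    return failures / len(outputs)

# Illustrative: if 167 of 200 examples errored, the failure rate would be 83.5%.
outputs = [None] * 167 + [0.9] * 33
print(f"{failure_rate(outputs):.1%}")  # → 83.5%
```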
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... - Cleanlab (cleanlab.ai), Apr 7, 2025 · 3 facts
reference: The FinanceBench dataset reflects questions financial analysts answer using public filings such as 10Ks, 10Qs, 8Ks, and Earnings Reports, where the retrieved contexts contain financial documents and the questions are straightforward.
claim: In the FinanceBench benchmark, the TLM and LLM-as-a-judge evaluation models detect incorrect AI responses with the highest precision and recall, matching the findings observed on the FinQA dataset.
claim: Patronus Lynx was trained on RAG datasets including CovidQA, PubMedQA, DROP, and FinanceBench.
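The precision and recall cited in the second source reduce to the standard confusion-matrix ratios, applied to the task of flagging incorrect AI responses. A minimal sketch (the detector flags and ground-truth labels below are invented for illustration, not benchmark results):

```python
def precision_recall(flagged, incorrect):
    """Precision and recall for a detector that flags incorrect responses.

    flagged:   indices the detector marked as incorrect
    incorrect: indices that truly are incorrect AI responses
    """
    flagged, incorrect = set(flagged), set(incorrect)
    true_positives = len(flagged & incorrect)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(incorrect) if incorrect else 0.0
    return precision, recall

# Detector flags 4 responses; 3 of them are truly incorrect, out of 5 total.
p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 6, 7})
print(p, r)  # → 0.75 0.6
```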
Detecting hallucinations with LLM-as-a-judge: Prompt ... - Datadog (datadoghq.com), Aritra Biswas and Noé Vernier, Aug 25, 2025 · 1 fact
reference: HaluBench is a partially synthetic hallucination benchmarking dataset where negative examples (non-hallucinated answers) are derived from existing question answering benchmarks, including HaluEval, DROP, CovidQA, FinanceBench, and PubMedQA.
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com) · 1 fact
reference: The HaluBench dataset consists of approximately 500 random samples from CovidQA, PubMedQA, DROP, and FinanceBench, along with a set of perturbations based on the retrieved samples.