concept

BERTScore

Also known as: BERT-score, BERT score

Facts (38)

Sources
EdinburghNLP/awesome-hallucination-detection github.com GitHub 11 facts
claim: KF1, BLEU, ROUGE, chrF, METEOR, BERTScore, BARTScore, BLEURT, and average length are metrics used for evaluating AI systems.
reference: The paper 'Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers' categorizes hallucination detection metrics into black-box scorers (non-contradiction probability, normalized semantic negentropy, normalized cosine similarity, BERTScore, BLEURT, and exact match rate), white-box token-probability-based scorers (minimum token probability, length-normalized token probability), and LLM-as-a-Judge scorers (categorical incorrect/uncertain/correct).
reference: The Q² metric evaluates factual consistency in knowledge-grounded dialogues and is compared against F1 token-level overlap, Precision and Recall, Q² w/o NLI, E2E NLI, Overlap, BERTScore, and BLEU using the WoW, Topical-Chat, and Dialogue NLI datasets.
claim: Hallucination detection metrics measure either the degree of hallucination in generated responses relative to given knowledge or their overlap with gold faithful responses; examples include Critic, Q² (F1, NLI), BERTScore, F1, BLEU, and ROUGE.
reference: BERTScore, FEQA, QGFS, DAE, and FactCC are metrics used in the FRANK benchmark for evaluating factuality in abstractive summarization.
reference: The HalluQA benchmark evaluates AI models using human annotations of intrinsic and extrinsic hallucinated spans and factuality, alongside metrics such as ROUGE-1/2/L, BERTScore, textual entailment, QA-based consistency, and Spearman correlation with human scores.
reference: The QuestEval metric is used for testing consistency, coherence, fluency, and relevance in AI-generated text, alongside other metrics such as ROUGE, BLEU, METEOR, BERTScore, SummaQA, and QAGS.
reference: Evaluation metrics for search-and-retrieve, meeting summarisation, and automated clinical report generation datasets (MS MARCO, QMSum, ACI-Bench) include ROUGE-L, BERTScore, BS-Fact, FactCC, DAE, and QuestEval.
claim: A large-scale human study of hallucinations in extreme summarization using XSum (BBC articles) found that extrinsic hallucinations are frequent, even in gold summaries, and that textual entailment correlates better with human judgments of faithfulness and factuality than ROUGE, BERTScore, or QA-based metrics.
reference: SCALE is a metric proposed for hallucination detection that is compared against Q², ANLI, SummaC, F1, BLEURT, QuestEval, BARTScore, and BERTScore.
claim: Training summarization models with soft labels from a teacher large language model reduces overconfidence and hallucination rates while maintaining quality on metrics such as ROUGE and BERTScore.
The construction and refined extraction techniques of knowledge ... nature.com Nature Feb 10, 2026 10 facts
claim: BERTScore is used to calculate the similarity between AI-generated answers and standard answers in the Knowledge Q&A task.
measurement: Excluding Retrieval-Augmented Generation (RAG) from the knowledge graph construction framework resulted in the BERTScore dropping to 0.89 on knowledge question answering tasks.
procedure: The BERTScore evaluation method proceeds in four steps: (1) map the words of the generated text and the reference text to the embedding space to obtain word vectors; (2) calculate the cosine similarity for each word pair between the generated and reference texts to form a similarity matrix; (3) calculate Precision (P) as the average similarity of each word vector in the generated text to its most similar word vector in the reference text; and (4) calculate Recall (R) as the average similarity of each word vector in the reference text to its most similar word vector in the generated text (see the code sketch after this source's facts).
claim: The evaluation metrics used in the study include BERTScore for automated scoring and Kendall's Tau for ranking tasks.
claim: The BERTScore method evaluates semantic consistency in AI models by comparing the BERT embeddings of generated text and reference text.
measurement: In knowledge question answering, non-desensitized data achieves a BERTScore of 0.97, while desensitized data achieves 0.96.
measurement: The knowledge graph showed an average semantic similarity of 0.92 to expert-annotated references when evaluated via BERTScore on a subset of 10,000 triplets.
procedure: The experimental evaluation of the DeepSeek-R1 70B LoRA model uses BERTScore for knowledge question answering, an overall score for tactical planning, Kendall's Tau for threat assessment, and privacy scores of k-anonymity ≥ 5 and l-diversity ≥ 2.
measurement: In knowledge question answering tasks, the LoRA fine-tuned model achieved a BERTScore of 0.96, while GPT-4 achieved a BERTScore of 0.85.
procedure: The evaluation framework for multi-task performance comparison uses BERTScore for automated scoring, human evaluation, and Kendall's Tau rank correlation for assessing threat assessment tasks.
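A minimal numpy sketch of the four-step procedure described above. It assumes the token embeddings (step 1) are already computed; real BERTScore uses contextual BERT embeddings and optionally adds IDF weighting and baseline rescaling, all omitted here. F1 combines the two scores as 2PR/(P+R).

```python
import numpy as np

def bertscore_from_embeddings(cand: np.ndarray, ref: np.ndarray):
    """Greedy-matching BERTScore given token embeddings.

    cand: (n_cand, d) embeddings of the generated text's tokens
    ref:  (n_ref, d)  embeddings of the reference text's tokens
    """
    # Step 1 (embedding) is assumed done; normalize rows so dot
    # products below are cosine similarities.
    cand = cand / np.linalg.norm(cand, axis=1, keepdims=True)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)

    # Step 2: pairwise cosine-similarity matrix, shape (n_cand, n_ref).
    sim = cand @ ref.T

    # Step 3: Precision -- each generated token greedily matches its
    # most similar reference token; average over generated tokens.
    precision = sim.max(axis=1).mean()

    # Step 4: Recall -- each reference token greedily matches its
    # most similar generated token; average over reference tokens.
    recall = sim.max(axis=0).mean()

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy usage with random vectors standing in for BERT token embeddings.
rng = np.random.default_rng(0)
p, r, f1 = bertscore_from_embeddings(rng.normal(size=(7, 16)),
                                     rng.normal(size=(9, 16)))
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

The greedy max over each row and column is what makes BERTScore a soft, order-insensitive token alignment rather than an exact n-gram match.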
Detecting and Evaluating Medical Hallucinations in Large Vision ... arxiv.org arXiv Jun 14, 2024 7 facts
claim: BERTScore is considered to have more favorable agreement with human judgment than other metrics.
measurement: The Med-HallMark benchmark evaluates AI models on hallucination detection using the MediHall Score and traditional metrics including BERTScore, METEOR, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU.
measurement: In ID1 and ID2 scenarios where Large Vision Language Model (LVLM) answers are entirely correct, BERTScore values are 66.73% and 46.11% respectively, indicating a significant and unwarranted disparity.
reference: The source text provides performance evaluation metrics (BERTScore, METEOR, ROUGE-1, ROUGE-2, and ROUGE-L) for multiple AI models including BLIP2, InstructBLIP-7b, InstructBLIP-13b, LLaVA1.5-7b, LLaVA1.5-13b, LLaVA-Med (SF, RF, PF variants), mPLUG-Owl2, XrayGPT, Mini-gpt4, and RadFM.
claim: BERTScore mitigates some shortcomings of ROUGE and BLEU but does not intuitively reflect factual accuracy or the degree of hallucination in medical texts.
measurement: On the BERTScore metric, mPLUG-Owl2 scored 64.49% and XrayGPT scored 62.62%, while the BLIP and LLaVA1.5 model families achieved scores of approximately 47%.
claim: Med-HallMark supports hallucination detection using POPE and CHAIR metrics for closed-ended questions, and BERTScore and ROUGE metrics for open-ended questions.
Detect hallucinations for RAG-based systems aws.amazon.com Amazon Web Services May 16, 2025 2 facts
claim: BERTScore has been shown to correlate with human judgment on both sentence-level and system-level evaluation, and computes precision, recall, and F1 measures for language generation tasks.
claim: BERTScore uses pre-trained contextual embeddings to capture semantic similarities between words or full sentences, differing from the BLEU score, which relies on token-level comparisons.
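As a usage illustration of the two facts above, here is a minimal sketch with the reference `bert-score` package by Zhang et al. (installable via `pip install bert-score`); the example sentences are hypothetical, and the call downloads a pretrained model on first use.

```python
# pip install bert-score
from bert_score import score  # reference implementation (Zhang et al., ICLR 2020)

candidates = ["The model answered that Paris is the capital of France."]  # generated text
references = ["Paris is the capital of France."]                          # gold reference

# Returns per-sentence precision, recall, and F1 tensors;
# lang="en" selects a default English model.
P, R, F1 = score(candidates, references, lang="en")
print(f"P={P.item():.3f}  R={R.item():.3f}  F1={F1.item():.3f}")
```

Passing `rescale_with_baseline=True` maps the raw scores onto a more spread-out, human-readable range, which helps when comparing systems.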
Building Trustworthy NeuroSymbolic AI Systems arxiv.org arXiv 2 facts
procedure: The study evaluated Large Language Model performance using two metrics: safety, measured through the averaged BART sentiment score (Yin, Hay, and Roth 2019), and consistency, evaluated by comparing the provided 'Rule of Thumb' instructions to the rules learned by the LLMs using BERTScore (Zhang et al. 2019).
claim: Existing metrics such as Elo Rating (Zheng et al. 2023), BARTScore (Liu et al. 2023), FactCC (Kryściński et al. 2020), and Consistency lexicons can be improved to account for the influence of knowledge on e-LLM generation.
Re-evaluating Hallucination Detection in LLMs arxiv.org arXiv Aug 13, 2025 2 facts
claim: Sophisticated metrics including BERTScore, BLEU, and UniEval-fact show substantial disagreement with judgments from strong LLM-based evaluators, indicating limitations in capturing factual consistency.
reference: The paper 'BERTScore: Evaluating Text Generation with BERT' by Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi was published at the 8th International Conference on Learning Representations (ICLR 2020).
A framework to assess clinical safety and hallucination rates of LLMs ... nature.com Nature May 13, 2025 2 facts
reference: The BERTScore metric, detailed in 'BERTScore: Evaluating Text Generation with BERT' (arXiv:1904.09675, 2020), utilizes BERT embeddings to evaluate text generation quality.
claim: Automated metrics like ROUGE, BLEU, and BERTScore, which are designed to compare model-generated text with expert-written examples, have significant limitations in healthcare because they focus on surface-level textual similarity rather than semantic nuances, contextual dependencies, and domain-specific knowledge.
Large Language Models Meet Knowledge Graphs for Question ... arxiv.org arXiv Sep 22, 2025 1 fact
reference: Evaluation metrics for synthesizing Large Language Models with Knowledge Graphs for Question Answering are categorized into: (1) Answer Quality, including BERTScore (Peng et al., 2024), answer relevance (AR), hallucination (HAL) (Yang et al., 2025), accuracy matching, and human-verified completeness (Yu and McQuade, 2025); (2) Retrieval Quality, including context relevance (Es et al., 2024), faithfulness score (FS) (Yang et al., 2024), precision, context recall (Yu et al., 2024; Huang et al., 2025), mean reciprocal rank (MRR) (Xu et al., 2024), and normalized discounted cumulative gain (NDCG) (Xu et al., 2024); and (3) Reasoning Quality, including Hop-Acc (Gu et al., 2024) and reasoning accuracy (RA) (Li et al., 2025a).
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... arxiv.org arXiv Feb 23, 2026 1 fact
claim: Many existing hallucination benchmarks rely on one-dimensional metrics such as Accuracy, Accept/Refusal rates, BLEU, and BERTScore, which limits the interpretability of results and obscures the underlying causes of Large Language Model performance issues.