concept

BLEU

Also known as: BLEU-4, Bilingual Evaluation Understudy

Facts (29)

Sources
EdinburghNLP/awesome-hallucination-detection (GitHub), 8 facts
claim: KF1, BLEU, ROUGE, chrF, METEOR, BERTScore, BARTScore, BLEURT, and average length are metrics used for evaluating AI systems.
reference: The Q² metric evaluates factual consistency in knowledge-grounded dialogues and is compared against F1 token-level overlap, Precision and Recall, Q² w/o NLI, E2E NLI, Overlap, BERTScore, and BLEU using the WoW, Topical-Chat, and Dialogue NLI datasets.
claim: Hallucination detection metrics measure either the degree of hallucination in generated responses relative to given knowledge or their overlap with gold faithful responses, including Critic, Q² (F1, NLI), BERTScore, F1, BLEU, and ROUGE.
reference: The QuestEval metric is used for testing consistency, coherence, fluency, and relevance in AI-generated text, alongside other metrics like ROUGE, BLEU, METEOR, BERTScore, SummaQA, and QAGS.
reference: The 'Survey of Hallucination in Natural Language Generation' classifies metrics into statistical metrics (such as ROUGE, BLEU, PARENT, and Knowledge F1) and model-based metrics.
reference: Evaluation metrics for hallucination rate in conversational settings include BLEU, ROUGE-1, ROUGE-2, and ROUGE-L, measured across settings such as original text, optimized system messages, full LLM weights, synthetic data, or mixtures of synthetic and reference data.
measurement: Evaluation of generation tasks uses Perplexity, Unigram Overlap (F1), BLEU-4, ROUGE-L, Knowledge F1, and Rare F1 as metrics, and utilizes datasets including WoW and CMU Document Grounded Conversations (CMU_DoG) with the KiLT Wikipedia dump as the knowledge source.
reference: Evaluation metrics for estimating hallucination degree include BLEU, ROUGE-L, FeQA, QuestEval, and EntityCoverage (Precision, Recall, F1).
Detecting and Evaluating Medical Hallucinations in Large Vision ... (arXiv, Jun 14, 2024), 7 facts
claim: The BLEU metric accounts for significant length differences between generated text and ground truth, making it more versatile than ROUGE, but it remains a weak measure of factual correctness.
measurement: LLaVA1.5-7b and LLaVA1.5-13b obtained scores of 27.16% on the METEOR metric and 4.39% on the BLEU metric.
measurement: The Med-HallMark benchmark evaluates AI models on hallucination detection using the MediHall Score and traditional metrics including BertScore, METEOR, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU.
claim: Traditional Natural Language Processing (NLP) metrics like METEOR and BLEU fail to reflect the factual correctness of Large Vision-Language Model outputs because they only measure shallow similarities to ground truth.
claim: BertScore mitigates some shortcomings of ROUGE and BLEU but does not intuitively reflect factual accuracy or the degree of hallucination in medical texts.
claim: LLaVA1.5-7b, LLaVA1.5-13b, and mPLUG-Owl2 exhibit higher precision on the Med-VQA task compared to other models, as reflected in their METEOR and BLEU metric scores.
claim: The BLEU metric scores zero when there are no shared n-grams or subsequences between a model's generated response and the ground truth, even if the model's answer is semantically correct.
Survey and analysis of hallucinations in large language models (Frontiers, Sep 29, 2025), 3 facts
claim: Traditional automatic metrics like BLEU, ROUGE, and METEOR are inadequate for assessing factual consistency in large language models, according to Maynez et al. (2020).
claim: Traditional lexical metrics like BLEU or ROUGE fail to capture semantic grounding in AI systems.
claim: Automatic metrics such as BLEU or ROUGE fail to capture factual consistency and reliability in Large Language Models, according to Maynez et al. (2020).
Practices, opportunities and challenges in the fusion of knowledge ... (Frontiers), 2 facts
reference: The paper 'Bleu: a method for automatic evaluation of machine translation' by Papineni, K., Roukos, S., Ward, T., Zhu, W.-J. introduces the BLEU method for the automatic evaluation of machine translation.
claim: Current evaluation metrics like BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) mainly measure surface text similarity and fail to effectively capture the semantic consistency between generated text and knowledge graph content.
A framework to assess clinical safety and hallucination rates of LLMs ... (Nature, May 13, 2025), 2 facts
reference: The BLEU metric, described in 'Bleu: A method for automatic evaluation of machine translation' (2002), provides a standard approach for automatically evaluating machine translation quality.
claim: Automated metrics like ROUGE, BLEU, and BERT-score, which are designed to compare model-generated text with expert-written examples, have significant limitations in healthcare because they focus on surface-level textual similarity rather than semantic nuances, contextual dependencies, and domain-specific knowledge.
A Comprehensive Benchmark and Evaluation Framework for Multi ... (arXiv, Jan 6, 2026), 2 facts
claim: The evaluation of medical agents has evolved from linguistic metrics like BLEU and ROUGE to action-oriented benchmarks such as MedAgentBench and MedAgentBoard.
claim: Traditional n-gram metrics like ROUGE and BLEU are insufficient for capturing the clinical validity of generated text in medical LLMs.
A survey on augmenting knowledge graphs (KGs) with large ... (Springer, Nov 4, 2024), 1 fact
formula: BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate text quality in large language models integrated with knowledge graphs by comparing generated text to human-written reference texts, calculated as BLEU = BP * exp(sum(w_n * log(p_n))), where BP is the brevity penalty, w_n are weights, and p_n are precision scores for n-grams.
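The formula above can be sketched as a minimal, self-contained Python implementation. This is an illustrative single-reference, sentence-level sketch with uniform weights w_n = 1/max_n and clipped (modified) n-gram precision; the function and variable names are our own, not from any cited source, and production use would add smoothing and sacreBLEU-style tokenization.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU = BP * exp(sum_n w_n * log p_n), uniform w_n = 1/max_n."""
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # modified (clipped) precision: candidate counts capped at reference counts
        overlap = sum(min(count, ref_counts[g]) for g, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0:
            return 0.0  # any p_n = 0 drives the geometric mean, and BLEU, to zero
        log_precision_sum += (1.0 / max_n) * math.log(overlap / total)
    # brevity penalty: 1 when the candidate is at least as long as the reference
    c, r = len(cand), len(ref)
    bp = 1.0 if c >= r else math.exp(1.0 - r / c)
    return bp * math.exp(log_precision_sum)
```

An identical candidate and reference score 1.0, while a candidate sharing no n-grams with the reference scores exactly 0, which is the zero-score behavior the facts above describe.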
Re-evaluating Hallucination Detection in LLMs (arXiv, Aug 13, 2025), 1 fact
claim: Sophisticated metrics including BERTScore, BLEU, and UniEval-fact show substantial disagreement with judgments from strong LLM-based evaluators, indicating limitations in capturing factual consistency.
Unknown source, 1 fact
claim: BLEU, ROUGE, and METEOR are traditional automatic metrics used for evaluating text generation.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... (arXiv, Feb 23, 2026), 1 fact
claim: Many existing hallucination benchmarks rely on one-dimensional metrics such as Accuracy, Accept/Refusal rates, BLEU, and BERTScore, which limits the interpretability of results and obscures the underlying causes of Large Language Model performance issues.