Relations (1)

Related (score 8.00) — strongly supporting, 8 facts


Facts (8)

Sources
EdinburghNLP/awesome-hallucination-detection (GitHub, github.com) — 5 facts
Claim: KF1, BLEU, ROUGE, chrF, METEOR, BERTScore, BARTScore, BLEURT, and average length are metrics used for evaluating AI systems.
Claim: Hallucination detection metrics measure either the degree of hallucination in generated responses relative to given knowledge or their overlap with gold faithful responses; examples include Critic, Q² (F1, NLI), BERTScore, F1, BLEU, and ROUGE.
Reference: The QuestEval metric is used for testing consistency, coherence, fluency, and relevance in AI-generated text, alongside other metrics such as ROUGE, BLEU, METEOR, BERTScore, SummaQA, and QAGS.
Claim: A large-scale human study of hallucinations in extreme summarization using XSum (BBC articles) found that extrinsic hallucinations are frequent, even in gold summaries, and that textual entailment correlates better with human judgments of faithfulness and factuality than ROUGE, BERTScore, or QA-based metrics.
Claim: Training summarization models with soft labels from a teacher large language model reduces overconfidence and hallucination rates while maintaining quality on metrics like ROUGE and BERTScore.
Detecting and Evaluating Medical Hallucinations in Large Vision ... (arXiv, arxiv.org) — 2 facts
Claim: BERTScore mitigates some shortcomings of ROUGE and BLEU but does not intuitively reflect factual accuracy or the degree of hallucination in medical texts.
Claim: Med-HallMark supports hallucination detection using the POPE and CHAIR metrics for closed-ended questions, and BERTScore and ROUGE for open-ended questions.
A framework to assess clinical safety and hallucination rates of LLMs ... (Nature, nature.com) — 1 fact
Claim: Automated metrics like ROUGE, BLEU, and BERTScore, which compare model-generated text with expert-written examples, have significant limitations in healthcare because they capture surface-level textual similarity rather than semantic nuances, contextual dependencies, and domain-specific knowledge.
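The surface-level limitation noted in the facts above can be made concrete with a toy overlap metric. The sketch below is a minimal ROUGE-1-style unigram F1 in pure Python (an illustration only, not any of the cited implementations); the example sentences are hypothetical. A faithful paraphrase with no shared words scores zero, while a factually contradictory sentence with high word overlap scores well:

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 (ROUGE-1 style): rewards shared words only."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped per-word matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the drug lowers blood pressure"
paraphrase = "this medication reduces hypertension"   # faithful, but no shared words
contradiction = "the drug raises blood pressure"      # hallucinated, 4 of 5 words shared

print(round(rouge1_f1(paraphrase, reference), 2))     # → 0.0 (faithful answer penalized)
print(round(rouge1_f1(contradiction, reference), 2))  # → 0.8 (factual error rewarded)
```

This is exactly the failure mode the Nature-sourced fact describes: lexical overlap is blind to meaning, so embedding-based (BERTScore) or entailment-based metrics are often preferred for faithfulness evaluation.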