Claim
Automated metrics such as ROUGE, BLEU, and BERTScore, which score model-generated text against expert-written reference texts, have significant limitations in healthcare: they focus on surface-level textual similarity rather than semantic nuance, contextual dependencies, and domain-specific knowledge.
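The limitation can be illustrated with a minimal sketch of a unigram-overlap F1 score in the style of ROUGE-1. This is a simplified stand-in written for illustration, not the official ROUGE implementation or the cited paper's method; the function name `rouge1_f1` and the three example sentences are assumptions invented here. It shows a clinically contradictory sentence scoring far higher than a faithful paraphrase.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 (a simplified, ROUGE-1-style score; hypothetical helper)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences (invented for this sketch, not taken from the source).
reference = "the patient has no history of diabetes"
contradiction = "the patient has a history of diabetes"  # one word changed, meaning reversed
paraphrase = "she was never diagnosed with diabetes"     # same meaning, different wording

score_contradiction = rouge1_f1(reference, contradiction)
score_paraphrase = rouge1_f1(reference, paraphrase)

print(round(score_contradiction, 3))  # 0.857
print(round(score_paraphrase, 3))     # 0.154
```

A single-word change that reverses the clinical meaning leaves six of seven tokens shared, so the overlap score stays high, while the correct paraphrase shares only one token with the reference and is penalized.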
Authors
Sources
- A framework to assess clinical safety and hallucination rates of LLMs ... www.nature.com