Claim
Traditional n-gram metrics like ROUGE and BLEU are insufficient for capturing the clinical validity of generated text in medical LLMs.
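A minimal sketch of why surface overlap can mislead. It assumes a simplified ROUGE-1-style unigram F1 with whitespace tokenization (not the official ROUGE implementation). The three example sentences and the helper name `unigram_f1` are made up for illustration and do not come from the cited benchmark.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """ROUGE-1-style F1: clipped unigram overlap between two strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, chosen only to illustrate the failure mode.
reference = "the patient has no evidence of pneumonia"
contradiction = "the patient has evidence of pneumonia"   # clinically opposite
paraphrase = "no signs of lung infection were found"      # clinically equivalent

score_contradiction = unigram_f1(reference, contradiction)
score_paraphrase = unigram_f1(reference, paraphrase)

print(f"contradiction: {score_contradiction:.3f}")  # 0.923
print(f"paraphrase:    {score_paraphrase:.3f}")     # 0.286
```

In this toy case, dropping the single word "no" reverses the clinical meaning yet keeps most of the overlap, while a faithful paraphrase scores far lower.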
Authors
Sources
- A Comprehensive Benchmark and Evaluation Framework for Multi ... (arxiv.org)