measurement
ROUGE scores show a systematic length bias: responses longer than 100 tokens consistently score below the 0.3 threshold, regardless of factual accuracy.
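One way to see how such a bias can arise is a minimal sketch, assuming the metric is ROUGE-L F1 over whitespace tokens (the source does not say which ROUGE variant or tokenizer was used). The helper names `lcs_length` and `rouge_l_f1` and the example sentences are hypothetical and chosen only for illustration. A response that contains the full reference answer but adds extra tokens keeps perfect recall while its precision shrinks, which pulls F1 under 0.3.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, response):
    """ROUGE-L F1 on whitespace tokens (an assumed variant, not the source's setup)."""
    ref, hyp = reference.split(), response.split()
    if not ref or not hyp:
        return 0.0
    lcs = lcs_length(ref, hyp)
    if lcs == 0:
        return 0.0
    precision = lcs / len(hyp)  # falls as the response gets longer
    recall = lcs / len(ref)     # stays at 1.0 while the reference is fully covered
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: both answers state the same correct fact.
reference = "the eiffel tower is located in paris france"  # 8 tokens
short_answer = reference
long_answer = reference + " " + " ".join(f"detail{i}" for i in range(112))  # 120 tokens

THRESHOLD = 0.3  # the cutoff named in the claim above

for name, answer in [("short", short_answer), ("long", long_answer)]:
    score = rouge_l_f1(reference, answer)
    print(f"{name}: {len(answer.split())} tokens, ROUGE-L F1 = {score:.3f}, "
          f"below threshold: {score < THRESHOLD}")
```

In this toy setup the 120-token answer scores 0.125 despite containing the complete reference, because precision is 8/120. It illustrates the mechanism only and does not reproduce the measurement reported in the source.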
Sources
- Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org)
Referenced by nodes (1)
- ROUGE concept