measurement
ROUGE scores show a systematic length bias: responses longer than 100 tokens consistently score below the 0.3 threshold, regardless of factual accuracy.
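One mechanism that could produce this pattern is precision dilution: when a long response contains every reference token plus additional material, recall stays high but precision drops, pulling the F1 score down. The sketch below is a simplified ROUGE-1 written from scratch (whitespace tokenization, no stemming), not the official scorer, and it does not reproduce the measurement above. The function name `rouge1`, the example sentences, and the filler tokens are hypothetical; only the 100-token and 0.3 figures come from the claim.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap with whitespace tokenization."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand) if cand else 0.0
    recall = overlap / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "tokens": len(cand)}


# Hypothetical data: both answers state the same fact as the reference.
reference = "the vaccine reduced hospital admissions by forty percent"
short_answer = reference
# The long answer repeats the fact, then adds 120 filler tokens standing in
# for elaboration that the reference does not contain.
long_answer = short_answer + " " + " ".join(f"detail{i}" for i in range(120))

short_scores = rouge1(short_answer, reference)
long_scores = rouge1(long_answer, reference)

for name, scores in (("short", short_scores), ("long", long_scores)):
    print(f"{name}: tokens={scores['tokens']} "
          f"P={scores['precision']:.3f} R={scores['recall']:.3f} F1={scores['f1']:.3f}")
```

In this toy case both answers have perfect recall, yet the long one falls below 0.3 on F1 purely because of its length. Whether the bias in the measurement arises the same way depends on the ROUGE variant and on whether precision, recall, or F1 was reported, none of which is stated here.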
