claim
Metrics such as ROUGE and F1 can be misleading because they measure shallow lexical similarity (word overlap) between the ground truth and the LLM response. A response can therefore score highly even when its meaning differs from the ground truth.
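To make the claim concrete, here is a minimal sketch of a unigram-overlap F1 score. It is a simplified stand-in for token-level F1 and is similar in spirit to ROUGE-1, but it is not the official implementation of either metric. The function name `token_f1` and the example sentences are illustrative assumptions, not taken from the source.

```python
from collections import Counter


def token_f1(reference: str, prediction: str) -> float:
    """Unigram-overlap F1 between two strings (hypothetical helper).

    Tokenization is a plain lowercase whitespace split, which is an
    assumption; real metric implementations normalize text differently.
    """
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = Counter(ref_tokens) & Counter(pred_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Illustrative sentences (made up for this sketch).
reference = "the drug is safe for children"
contradiction = "the drug is not safe for children"      # opposite meaning
paraphrase = "kids can take this medication without risk"  # same meaning

f1_contradiction = token_f1(reference, contradiction)  # 12/13, about 0.92
f1_paraphrase = token_f1(reference, paraphrase)        # 0.0, no shared tokens
```

Under these assumptions the contradicting response scores far higher than the faithful paraphrase, which is the failure mode the claim describes: the score tracks shared words, not shared meaning.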
