claim
ROUGE and other widely used metrics, whether based on n-gram overlap or on semantic similarity, share the same vulnerabilities in hallucination detection tasks, which points to a broader deficiency in current evaluation practices.
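
A toy illustration of the n-gram side of this claim may help. The sketch below assumes a simplified, whitespace-tokenized unigram F1 as a stand-in for ROUGE-1; it is not the official ROUGE implementation, and the function name `unigram_f1` and the example sentences are hypothetical, chosen only for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1-style F1 (assumption: lowercase whitespace tokens)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences.
reference = "the eiffel tower was completed in 1889"
hallucinated = "the eiffel tower was completed in 1925"      # wrong year
paraphrase = "construction of the tower finished in 1889"    # correct fact

score_hallucinated = unigram_f1(hallucinated, reference)
score_paraphrase = unigram_f1(paraphrase, reference)

print(f"hallucinated: {score_hallucinated:.3f}")
print(f"paraphrase:   {score_paraphrase:.3f}")
```

In this toy case the answer with the wrong year shares six of seven tokens with the reference, while the factually correct paraphrase shares only four, so the overlap score ranks the hallucination higher. This shows only the surface-overlap failure mode; it says nothing about how semantic-similarity metrics behave.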
