claim
ROUGE-based evaluation systematically overestimates hallucination detection performance in Question Answering tasks.
