claim
The moderate Pearson correlation coefficient between AUROC scores derived from ROUGE and LLM-as-Judge evaluation approaches suggests that hallucination detection methods may be inadvertently optimized for ROUGE’s lexical overlap criteria rather than genuine factual correctness.

Authors

Sources

Referenced by nodes (4)