procedure
To examine ROUGE's failure modes, the researchers curated a dataset of instances in which ROUGE and an LLM-as-Judge metric gave conflicting assessments of whether a hallucination was present.
