procedure
Accuracies on QA datasets in the study are computed by selecting the most likely answer at a low temperature setting and comparing it to labels derived from either ROUGE or LLM-as-Judge evaluations.
Authors
Sources
- Re-evaluating Hallucination Detection in LLMs - arXiv arxiv.org via serper
Referenced by nodes (2)
- LLM-as-a-judge concept
- ROUGE concept