procedure
In the study, accuracy on each QA dataset is computed by taking the model's most likely answer, generated at a low temperature setting, and checking it against correctness labels derived from either ROUGE or LLM-as-Judge evaluations.
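
The ROUGE-based variant of this procedure can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: it assumes the low-temperature answers have already been generated, uses a small hand-written ROUGE-L F-measure instead of a library, and the function names and the 0.3 threshold are hypothetical choices made only for the example.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F-measure on lowercased, whitespace-split tokens."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def accuracy(answers, references, threshold=0.3):
    """Fraction of answers labelled correct.

    answers:    one low-temperature answer string per question.
    references: one list of acceptable reference strings per question.
    threshold:  assumed cutoff (hypothetical); an answer is labelled
                correct if it scores above it against any reference.
    """
    if not answers:
        raise ValueError("no answers to score")
    labels = [
        any(rouge_l_f1(ans, ref) > threshold for ref in refs)
        for ans, refs in zip(answers, references)
    ]
    return sum(labels) / len(labels)
```

For the LLM-as-Judge variant, the same `accuracy` structure would apply with the `rouge_l_f1(...) > threshold` check swapped for a judge model's correct/incorrect verdict.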
