claim
The researchers used GPT4Score, a model-based evaluation metric defined as the percentage of model outputs that GPT-4o judges to match the ground-truth answer.
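The definition amounts to a simple ratio, which the following minimal sketch illustrates. It is not the researchers' implementation: the function name `gpt4score`, the `Judge` callable type, and the `exact_match_judge` stand-in are all hypothetical names introduced here. In the setup the claim describes, the judging step would be a GPT-4o call, whose prompt and response parsing are not given in the source and are therefore not shown.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (model_output, ground_truth) -> True if judged correct.
# In the claim's setup this role is played by GPT-4o; here it is an injected callable.
Judge = Callable[[str, str], bool]


def gpt4score(pairs: Iterable[Tuple[str, str]], judge: Judge) -> float:
    """Return the percentage of (output, ground_truth) pairs the judge marks correct."""
    verdicts = [judge(output, truth) for output, truth in pairs]
    if not verdicts:
        raise ValueError("cannot compute a score over zero answers")
    # True counts as 1 and False as 0, so sum() is the number judged correct.
    return 100.0 * sum(verdicts) / len(verdicts)


def exact_match_judge(output: str, truth: str) -> bool:
    # Illustrative stand-in only (an assumption, not the real judge):
    # case-insensitive string equality instead of a GPT-4o assessment.
    return output.strip().lower() == truth.strip().lower()
```

With the stand-in judge, `gpt4score([("Paris", "paris"), ("4", "5")], exact_match_judge)` scores one of two answers as correct. Passing the judge as a parameter keeps the percentage calculation separate from whichever model performs the assessment.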
