measurement
Evaluation metrics vary by task. The list-based questions on Wikidata and Wiki-Category List are scored by test precision and by the average number of positive (correct) and negative (hallucinated) entities. MultiSpanQA is scored by F1, precision, and recall. Longform generation of biographies is scored by FactScore.
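The set-based metrics named above can be illustrated with a minimal sketch. It assumes exact string matching after lowercasing and trimming, which is a simplification and not necessarily the matching rule used by any of these benchmarks. The function names `list_answer_counts` and `precision_recall_f1` are hypothetical, and FactScore is not covered because it relies on a separate fact-checking pipeline.

```python
from typing import Dict, Iterable, Set, Tuple


def _normalize(items: Iterable[str]) -> Set[str]:
    # Assumption: answers match if equal after trimming and lowercasing.
    return {item.strip().lower() for item in items}


def list_answer_counts(predicted: Iterable[str], gold: Iterable[str]) -> Dict[str, float]:
    """Count positive (correct) and negative (hallucinated) entities for one question."""
    pred, gold_set = _normalize(predicted), _normalize(gold)
    positives = len(pred & gold_set)   # predicted entities found in the gold list
    negatives = len(pred - gold_set)   # predicted entities absent from the gold list
    precision = positives / len(pred) if pred else 0.0
    return {"positives": positives, "negatives": negatives, "precision": precision}


def precision_recall_f1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one multi-answer question."""
    pred, gold_set = _normalize(predicted), _normalize(gold)
    true_positives = len(pred & gold_set)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Averaging `positives` and `negatives` over all questions would give the per-dataset averages mentioned above. Whether precision is aggregated per question or over all entities at once is left open here.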
