Measurement
Evaluation metrics differ by task. List-based questions on Wikidata and Wiki-Category List are scored with test precision and the average number of positive entities and negative (hallucinated) entities; MultiSpanQA is scored with F1, Precision, and Recall; and longform generation of biographies is scored with FactScore.
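As a rough illustration of the list-based and span-based metrics named above, the following is a minimal sketch rather than the benchmark's official scorer. It assumes that each answer is a list of entity strings, that an entity counts as positive only on an exact string match with the reference set, and that precision is pooled over all questions. The function names `list_answer_metrics` and `span_prf` are hypothetical, and the span scorer uses exact match only. FactScore is not sketched here because it depends on an external fact-verification pipeline.

```python
from typing import Dict, List, Set


def list_answer_metrics(predictions: List[List[str]],
                        references: List[Set[str]]) -> Dict[str, float]:
    """Score list-style answers (assumed setup, exact string match).

    positive = predicted entity found in the reference set
    negative = predicted entity not found (treated as a hallucination)
    """
    total_pos = 0
    total_neg = 0
    for pred, ref in zip(predictions, references):
        unique = set(pred)  # ignore repeated entities within one answer
        total_pos += len(unique & ref)
        total_neg += len(unique - ref)

    n = len(predictions)
    total = total_pos + total_neg
    return {
        # Pooled over all questions (an assumption of this sketch).
        "precision": total_pos / total if total else 0.0,
        "avg_positive": total_pos / n if n else 0.0,
        "avg_negative": total_neg / n if n else 0.0,
    }


def span_prf(predicted: Set[str], gold: Set[str]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over sets of answer spans."""
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    preds = [["a", "b", "x"], ["c", "y"]]
    refs = [{"a", "b"}, {"c", "d"}]
    print(list_answer_metrics(preds, refs))
    print(span_prf({"a", "b", "c"}, {"a", "b", "d", "e"}))
```

Pooling counts across questions weights long answers more heavily than averaging per-question precision would; which of the two a given paper reports should be checked against its own evaluation description.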
Sources
- EdinburghNLP/awesome-hallucination-detection (GitHub, github.com)