Evaluation metrics for approaches that synthesize Large Language Models with Knowledge Graphs for Question Answering fall into three categories: (1) Answer Quality, which covers BERTScore (Peng et al., 2024), answer relevance (AR), hallucination (HAL) (Yang et al., 2025), accuracy matching, and human-verified completeness (Yu and McQuade, 2025); (2) Retrieval Quality, which covers context relevance (Es et al., 2024), faithfulness score (FS) (Yang et al., 2024), precision, context recall (Yu et al., 2024; Huang et al., 2025), mean reciprocal rank (MRR) (Xu et al., 2024), and normalized discounted cumulative gain (NDCG) (Xu et al., 2024); and (3) Reasoning Quality, which covers Hop-Acc (Gu et al., 2024) and reasoning accuracy (RA) (Li et al., 2025a).
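Two of the Retrieval Quality metrics named above, MRR and NDCG, have standard textbook definitions that can be sketched in a few lines. The Python below is a minimal illustration of those general definitions, not the implementation used in any of the cited works. The function names, the list-of-relevance-labels input format, and the choice to compute the ideal ranking from the retrieved list alone are all assumptions made for this example.

```python
import math
from typing import Sequence


def mean_reciprocal_rank(ranked_relevance: Sequence[Sequence[int]]) -> float:
    """MRR over queries: mean of 1/rank of the first relevant item.

    Each inner sequence holds relevance labels (0 = not relevant, >0 = relevant)
    for one query's retrieved items, in ranked order. A query with no relevant
    item contributes 0. (Input format is an assumption of this sketch.)
    """
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for labels in ranked_relevance:
        for rank, rel in enumerate(labels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_relevance)


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain at cutoff k with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 1) for i, g in enumerate(gains[:k], start=1))


def ndcg(gains: Sequence[float], k: int) -> float:
    """NDCG at cutoff k: DCG divided by the DCG of the ideal ordering.

    Assumption: the ideal ordering is obtained by sorting the retrieved gains,
    i.e. every relevant item is assumed to appear somewhere in `gains`.
    """
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Two hypothetical queries: first relevant hit at rank 2, then at rank 1.
    print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
    # Hypothetical graded relevance for one ranked list of retrieved triples.
    print(ndcg([1, 3, 2], k=3))
```

MRR rewards only the position of the first relevant item, while NDCG accounts for graded relevance across the whole ranked list, which is one reason the two are often reported together.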
