Relations (1)

related 1.58 — strongly supporting 1 fact

BERTScore is identified as an evaluation metric used to assess answer quality in tasks involving Large Language Models [1], while simultaneously being cited as a metric that fails to reliably detect hallucination in question-answering scenarios [2].
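To make the BERTScore relation concrete, the following is a minimal sketch of what the metric computes: greedy cosine matching between candidate and reference token embeddings, with precision/recall aggregated into an F1. This omits idf weighting and the actual BERT model (toy embeddings are passed in directly), so it is an illustration of the scoring rule, not the official `bert-score` implementation.

```python
import numpy as np

def bertscore_f1(cand_emb, ref_emb):
    """Simplified BERTScore: greedy cosine matching of token embeddings.

    cand_emb: (n_cand, d) candidate token embeddings
    ref_emb:  (n_ref, d) reference token embeddings
    """
    # L2-normalize rows so dot products are cosine similarities
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                       # (n_cand, n_ref) similarity matrix
    precision = sim.max(axis=1).mean()  # best reference match per candidate token
    recall = sim.max(axis=0).mean()     # best candidate match per reference token
    return 2 * precision * recall / (precision + recall)

# Identical embeddings yield a perfect score
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
print(round(bertscore_f1(emb, emb), 3))  # → 1.0
```

Because the score is driven entirely by embedding similarity, a fluent but factually wrong answer can still match the reference closely, which is consistent with the criticism above that BERTScore fails to reliably detect hallucination.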

Facts (1)

Sources
Large Language Models Meet Knowledge Graphs for Question ... (arxiv.org, 1 fact)
Evaluation metrics for synthesizing Large Language Models with Knowledge Graphs for Question Answering are categorized into three groups:
(1) Answer Quality: BERTScore (Peng et al., 2024), answer relevance (AR), hallucination (HAL) (Yang et al., 2025), accuracy matching, and human-verified completeness (Yu and McQuade, 2025);
(2) Retrieval Quality: context relevance (Es et al., 2024), faithfulness score (FS) (Yang et al., 2024), precision, context recall (Yu et al., 2024; Huang et al., 2025), mean reciprocal rank (MRR) (Xu et al., 2024), and normalized discounted cumulative gain (NDCG) (Xu et al., 2024);
(3) Reasoning Quality: Hop-Acc (Gu et al., 2024) and reasoning accuracy (RA) (Li et al., 2025a).
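Among the retrieval-quality metrics listed above, MRR and NDCG have standard closed-form definitions. The sketch below implements both from those textbook formulas (not from any specific paper's code); input conventions (binary relevance lists for MRR, graded relevance for NDCG) are assumptions for illustration.

```python
import math

def mrr(ranked_relevance_lists):
    """Mean reciprocal rank: average of 1/rank of the first relevant item per query."""
    total = 0.0
    for rels in ranked_relevance_lists:
        rr = 0.0
        for rank, rel in enumerate(rels, start=1):
            if rel:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_relevance_lists)

def ndcg(relevances, k=None):
    """Normalized DCG over one ranked list of graded relevances (log2 discount)."""
    rels = relevances[:k] if k else relevances
    dcg = sum(r / math.log2(i + 1) for i, r in enumerate(rels, start=1))
    ideal = sorted(relevances, reverse=True)
    ideal = ideal[:k] if k else ideal
    idcg = sum(r / math.log2(i + 1) for i, r in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# First relevant hit at rank 1 for query one, rank 2 for query two
print(mrr([[1, 0, 0], [0, 1, 0]]))  # → 0.75
```

A list already sorted by relevance achieves NDCG of 1.0; any misordering lowers the score, which is why NDCG is preferred over plain recall when the rank of retrieved context matters.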