reference
Benchmark datasets for the synthesis of Large Language Models (LLMs) and Knowledge Graphs (KGs) evaluate three primary criteria: Answer Quality (AnsQ), which measures the correctness of the generated answer against the ground truth; Retrieval Quality (RetQ), which measures the relevance of the retrieved context against human-validated context; and Reasoning Quality (ReaQ), which measures the correctness of reasoning chains and intermediate steps.
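The three criteria can be illustrated with a minimal sketch. The source does not specify how any benchmark actually computes AnsQ, RetQ, or ReaQ, so everything below is an assumption made for illustration: the `Example` record and its field names are hypothetical, AnsQ is approximated by normalized exact match, RetQ by the F1 overlap between retrieved and human-validated context, and ReaQ by the fraction of gold reasoning steps reproduced in order.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Example:
    """Hypothetical benchmark record; field names are illustrative only."""
    predicted_answer: str
    gold_answer: str
    retrieved_context: list[str]
    gold_context: list[str]      # human-validated context
    predicted_steps: list[str]
    gold_steps: list[str]        # reference reasoning chain


def _norm(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting differences are ignored.
    return " ".join(text.lower().split())


def answer_quality(ex: Example) -> float:
    """AnsQ (assumed): 1.0 on normalized exact match with the ground truth, else 0.0."""
    return 1.0 if _norm(ex.predicted_answer) == _norm(ex.gold_answer) else 0.0


def retrieval_quality(ex: Example) -> float:
    """RetQ (assumed): F1 overlap between retrieved and human-validated context."""
    retrieved = {_norm(c) for c in ex.retrieved_context}
    gold = {_norm(c) for c in ex.gold_context}
    hits = len(retrieved & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


def reasoning_quality(ex: Example) -> float:
    """ReaQ (assumed): share of gold steps matched position by position."""
    if not ex.gold_steps:
        return 0.0
    matched = sum(
        1
        for pred, gold in zip(ex.predicted_steps, ex.gold_steps)
        if _norm(pred) == _norm(gold)
    )
    return matched / len(ex.gold_steps)
```

Real benchmarks may replace any of these with softer measures (for example token-level overlap or model-based judgments); the sketch only shows that the three criteria score different parts of a system's output: the final answer, the retrieved context, and the intermediate steps.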
