concept

ground truth

Also known as: ground truth data, ground-truth data

Facts (17)

Sources
KG-RAG: Bridging the Gap Between Knowledge and Creativity - arXiv arxiv.org arXiv May 20, 2024 4 facts
formula: Exact Match (EM) calculates the percentage of predicted answers that exactly match the ground truth answers.
claim: Accuracy in the KG-RAG evaluation framework is defined as the proportion of answers that share at least one word with the ground truth answer; the overlap function checks for any common word between the predicted and the correct answer.
claim: A hallucination score of 1 in the KG-RAG evaluation framework indicates a hallucinated response, determined by the absence of perfect precision (a token mismatch between the predicted and ground truth answers) together with specific heuristic indicators, such as phrases like "I don't know".
formula: Hallucination in the KG-RAG evaluation framework is defined as a response containing information not present in the ground truth, and is calculated as Hallucination Rate = (1/N) * Σ(1 if the predicted answer lacks perfect precision AND contains heuristic indicators of uncertainty, else 0), where N is the number of samples.
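The three KG-RAG metrics above can be sketched as follows. This is a minimal reading of the stated formulas, assuming lowercased whitespace tokenization and a small illustrative set of uncertainty phrases (neither is specified in the source):

```python
def exact_match(preds, golds):
    """Fraction of predicted answers that exactly match the ground truth."""
    return sum(p == g for p, g in zip(preds, golds)) / len(preds)

def overlap_accuracy(preds, golds):
    """Fraction of predictions sharing at least one word with the ground truth."""
    def overlap(p, g):
        return bool(set(p.lower().split()) & set(g.lower().split()))
    return sum(overlap(p, g) for p, g in zip(preds, golds)) / len(preds)

def hallucination_rate(preds, golds, indicators=("i don't know", "i'm not sure")):
    """Hallucination Rate = (1/N) * sum over samples of: 1 if the prediction
    lacks perfect precision (some predicted token is absent from the ground
    truth) AND contains a heuristic uncertainty indicator, else 0."""
    def hallucinated(p, g):
        perfect = set(p.lower().split()) <= set(g.lower().split())
        uncertain = any(ind in p.lower() for ind in indicators)
        return (not perfect) and uncertain
    return sum(hallucinated(p, g) for p, g in zip(preds, golds)) / len(preds)
```

For example, the prediction "i don't know the answer" against gold "london" counts toward the hallucination rate (imperfect precision plus an uncertainty phrase), while a wrong-but-confident answer does not, under this heuristic.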
RAG Hallucinations: Retrieval Success ≠ Generation Accuracy linkedin.com Sumit Umbardand · LinkedIn Feb 6, 2026 3 facts
claim: Embedding similarity metrics for RAG evaluation are deterministic and cheap but rigid: they reward matching the ground truth rather than actual correctness, and can penalize improvements when the ground truth is narrow.
procedure: Production teams typically use a hybrid evaluation loop for RAG systems consisting of four steps: (1) generate a synthetic test set, (2) conduct expert review, (3) correct the ground truth, and (4) re-evaluate.
claim: Human evaluation is considered the gold standard for RAG systems, but it does not scale, and automation requires ground truth, which synthetic test sets often fail to provide in real enterprise domains.
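The rigidity of similarity-based scoring can be illustrated with a toy metric. Real systems use learned embeddings; here a bag-of-words cosine stands in for them (an assumption made purely for illustration), and the effect is the same: a correct paraphrase that adds detail scores lower than an answer that parrots the ground truth verbatim.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_to_ground_truth(answer, ground_truth):
    return cosine(Counter(answer.lower().split()),
                  Counter(ground_truth.lower().split()))

gold = "the refund window is 30 days"
verbatim = "the refund window is 30 days"                              # parrots the gold
paraphrase = "customers may return items within 30 days for a refund"  # correct, reworded

# The verbatim echo outscores the equally correct paraphrase: the metric
# rewards matching the (narrow) ground truth, not correctness.
assert similarity_to_ground_truth(verbatim, gold) > \
       similarity_to_ground_truth(paraphrase, gold)
```

This is why the hybrid loop above ends with "correct the ground truth, and re-evaluate": broadening a narrow reference answer changes what the deterministic metric rewards.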
Evaluating RAG applications with Amazon Bedrock knowledge base ... aws.amazon.com Amazon Web Services Mar 14, 2025 5 facts
claim: Amazon Bedrock Knowledge Bases evaluation supports both ground-truth-based and reference-free evaluation methods.
claim: Metrics such as ROUGE and F1 can be inaccurate because they rely on shallow linguistic similarity (word overlap) between the ground truth and LLM responses, even when the actual meaning differs.
claim: Traditional automated evaluation metrics for AI typically require ground truth data, which is difficult to obtain for many AI applications, especially those involving open-ended generation or retrieval-augmented systems.
procedure: The Amazon Bedrock Knowledge Bases RAG evaluation workflow consists of six steps: preparing a prompt dataset (optionally with ground truth), converting the dataset to JSONL format, storing the file in an Amazon S3 bucket, running the Amazon Bedrock Knowledge Bases RAG evaluation job (which integrates with Amazon Bedrock Guardrails), generating an automated report with metrics, and analyzing the report for system optimization.
claim: In Amazon Bedrock RAG evaluations, the 'referenceResponses' field must contain the expected ground truth answer that an end-to-end RAG system should generate for a given prompt, rather than the expected passages or chunks retrieved from the Knowledge Base.
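A single record of the JSONL dataset described above might be built like this. The field names (`conversationTurns`, `prompt`, `referenceResponses`) are taken from the AWS documentation as I understand it, but the exact schema should be verified against the current Bedrock docs before use; the prompt and answer text are invented examples:

```python
import json

# Hypothetical one-turn record for a Bedrock Knowledge Bases RAG evaluation
# dataset. Note that referenceResponses carries the expected end-to-end
# answer, not the passages expected from retrieval.
record = {
    "conversationTurns": [
        {
            "prompt": {
                "content": [{"text": "What is our refund window?"}]
            },
            "referenceResponses": [
                {"content": [{"text": "Refunds are accepted within 30 days."}]}
            ],
        }
    ]
}

# Each line of the JSONL file uploaded to S3 is one such serialized record.
line = json.dumps(record)
```

The evaluation job then reads this file from S3 and scores the RAG system's generated answers against the reference responses.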
Medical Hallucination in Foundation Models and Their ... medrxiv.org medRxiv Mar 3, 2025 1 fact
claim: A significant obstacle in medical hallucination detection is the frequent absence or high cost of collecting reliable ground truth data, particularly for complex or novel queries (Hegselmann et al., 2024b).
10 RAG examples and use cases from real companies - Evidently AI evidentlyai.com Evidently AI Feb 13, 2025 1 fact
procedure: The development team for ChatLTV ensured response accuracy by using a mix of manual and automated testing, including an LLM judge that compared outputs to ground-truth data to generate a quality score.
Detecting hallucinations with LLM-as-a-judge: Prompt ... - Datadog datadoghq.com Aritra Biswas, Noé Vernier · Datadog Aug 25, 2025 1 fact
claim: Faithfulness evaluation assumes the provided context is correct and acts as ground truth; verifying the accuracy of the context itself is considered an independent problem.
Medical Hallucination in Foundation Models and Their Impact on ... medrxiv.org medRxiv Nov 2, 2025 1 fact
claim: Hallucination resistance in AI models correlates more strongly with the depth of conceptual understanding, as measured by semantic similarity to the ground truth, than with exposure to domain-specific training data.
[2502.14302] MedHallu: A Comprehensive Benchmark for Detecting ... arxiv.org arXiv Feb 20, 2025 1 fact
claim: Using bidirectional entailment clustering, the authors of the MedHallu paper demonstrated that harder-to-detect hallucinations are semantically closer to the ground truth.