concept

MEDQA

Facts (20)

Sources
Bridging the Gap Between LLMs and Evolving Medical Knowledge (arXiv, Jun 29, 2025), 14 facts
measurement: AMG-RAG achieves an F1 score of 74.1% on the MEDQA benchmark and an accuracy of 66.34% on the MEDMCQA benchmark.
measurement: The AMG-RAG system configured with the PubMed-MKG and an 8B LLM backbone achieves an accuracy of 73.92% on the MEDQA benchmark, surpassing baseline models including Self-RAG (Asai et al., 2023), HyDE (Gao et al., 2022), GraphRAG (Edge et al., 2024), and MedRAG (Zhao et al., 2025).
measurement: The AMG-RAG model, which has 8 billion parameters, achieves an F1 score of 74.1% on the MEDQA benchmark without requiring fine-tuning, surpassing the performance of the 70 billion parameter Meditron model.
measurement: The AMG-RAG framework achieved an F1 score of 74.1% on the MEDQA benchmark and an accuracy of 66.34% on the MEDMCQA benchmark, outperforming comparable models and models 10 to 100 times larger.
reference: The MEDQA dataset is a free-form, multiple-choice open-domain QA dataset derived from professional medical board exams, requiring retrieval of relevant evidence and sophisticated reasoning to answer questions.
reference: The AMG-RAG model is designed to retrieve relevant content, structure key information, and formulate reasoning to guide answer selection when applied to the MEDQA dataset.
reference: Table 1 in the paper 'Bridging the Gap Between LLMs and Evolving Medical Knowledge' compares state-of-the-art language models on the MEDQA benchmark, showing that Med-Gemini (1800B) achieved 91.1% accuracy, GPT-4 (1760B) achieved 90.2% accuracy, Med-PaLM 2 (340B) achieved 85.4% accuracy, AMG-RAG (8B) achieved 73.9% accuracy, and BioMedGPT (10B) achieved 50.4% accuracy.
reference: The dataset used for the medical question-answering system is sourced from medical textbooks in the MEDQA benchmark.
measurement: The AMG-RAG system built on the GPT4o-mini LLM backbone with PubMed-MKG achieves an accuracy of 73.92% on the MEDQA benchmark, which is higher than the performance achieved when using LLaMA 3.1 or Mixtral backbones with the same retrieval pipeline.
measurement: The test partition of the MEDQA dataset used in this study comprises approximately 1,200 samples.
measurement: Removing search functionality from the AMG-RAG system drops accuracy to 67.16%, and removing Chain-of-Thought (CoT) reasoning drops accuracy to 66.69% on the MEDQA benchmark.
claim: Larger language models like Med-Gemini and GPT-4 achieve the highest accuracy and F1 scores on the MEDQA benchmark but require significantly larger parameter sizes.
claim: In the AMG-RAG system, the PubMed-MKG (Medical Knowledge Graph created via PubMedSearch) consistently outperforms the Wiki-MKG (Medical Knowledge Graph created via WikiSearch) on the MEDQA benchmark, likely due to the domain-specific nature of PubMed content.
claim: Advanced reasoning strategies, such as Chain-of-Thought (CoT) reasoning and the integration of search tools, are critical for achieving higher performance in language models on the MEDQA benchmark.
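The accuracy figures above are fractions of correctly answered multiple-choice items over the MEDQA test partition. A minimal sketch of that scoring, with hypothetical predictions and answer keys (this is not the paper's evaluation harness):

```python
# Minimal sketch of MedQA-style multiple-choice scoring.
# The option letters below are hypothetical toy data; the real test
# partition has roughly 1,200 items (see the fact above).

def accuracy(predictions, answers):
    """Fraction of items where the predicted option letter matches the key."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

predictions = ["A", "C", "B", "D"]  # model-chosen option letters
answers     = ["A", "C", "D", "D"]  # gold answer keys

print(f"accuracy = {accuracy(predictions, answers):.2%}")  # 75.00%
```

On toy data like this the metric is trivial; at benchmark scale the same computation yields the 73.92% and ablation figures (67.16%, 66.69%) quoted above.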
A Comprehensive Benchmark and Evaluation Framework for Multi ... (arXiv, Jan 6, 2026), 3 facts
claim: Existing benchmarks for medical LLMs, such as MedQA and MedMCQA, focus on static tasks like multiple-choice questions or summarization, which do not mirror the dynamic, multi-turn nature of real-world clinical diagnostic reasoning.
perspective: Multi-turn evaluation is necessary for benchmarking medical AI because static benchmarks like MedQA may show only marginal differences between models like GPT-5 and Qwen3-235B-A22B-Instruct-2507.
reference: LLM-MedQA utilizes the MedQA dataset to improve Large Language Model performance through multi-agent architectures and case study generation, with a focus on domain-specific terminology and zero-shot reasoning.
Large Language Models Meet Knowledge Graphs for Question ... (arXiv, Sep 22, 2025), 2 facts
reference: MedQA (Jin et al., 2021) is a medical multi-choice question-answering dataset containing multilingual medical examination text.
reference: The KG-Rank method, proposed by Yang et al. in 2024, uses Similarity and MMR-based Ranking with GPT-4, Llama-2-7B, and Llama-2-13B language models and the UMLS and DBpedia knowledge graphs for domain-specific QA, evaluated using ROUGE-L, BERTScore, MoverScore, and BLEURT metrics on the LiveQA, ExpertQA-Bio, ExpertQA-Med, and MedQA datasets.
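The MMR-based ranking named in the KG-Rank fact refers to the general Maximal Marginal Relevance technique, which greedily selects items that are relevant to the query but not redundant with items already chosen. A minimal sketch with toy similarity scores (KG-Rank's actual similarity functions and trade-off weight are not specified here):

```python
# Sketch of Maximal Marginal Relevance (MMR) ranking.
# score(d) = lam * sim(d, query) - (1 - lam) * max_{s in selected} sim(d, s)

def mmr_rank(query_sim, doc_sim, lam=0.5, k=3):
    """Greedily pick up to k documents, balancing query relevance
    against redundancy with already-selected documents.

    query_sim: dict doc -> similarity(doc, query)
    doc_sim:   dict (doc_a, doc_b) -> similarity(doc_a, doc_b),
               with both key orders present
    """
    selected = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((doc_sim[(d, s)] for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy data: d1 and d2 are near-duplicates, so after picking d1,
# MMR prefers the less relevant but novel d3 over d2.
q_sim = {"d1": 0.9, "d2": 0.85, "d3": 0.3}
d_sim = {("d1", "d2"): 0.95, ("d2", "d1"): 0.95,
         ("d1", "d3"): 0.1, ("d3", "d1"): 0.1,
         ("d2", "d3"): 0.1, ("d3", "d2"): 0.1}
print(mmr_rank(q_sim, d_sim, lam=0.5, k=2))  # ['d1', 'd3']
```

With `lam=1.0` the ranking reduces to pure similarity ordering, which is the other ranking mode the KG-Rank fact mentions.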
Medical Hallucination in Foundation Models and Their ... (medRxiv, Mar 3, 2025), 1 fact
claim: Google's Med-PaLM and Med-PaLM 2 demonstrate strong performance on medical benchmarks such as MedQA (Jin et al., 2021), MedMCQA (Pal et al., 2022), and PubMedQA (Jin et al., 2019) by integrating biomedical texts into their training regimes, as reported by Singhal et al. (2022).