MEDMCQA
Facts (11)
Sources
Bridging the Gap Between LLMs and Evolving Medical Knowledge · arxiv.org · Jun 29, 2025 · 5 facts
measurement: The AMG-RAG framework achieves an F1 score of 74.1% on the MedQA benchmark and an accuracy of 66.34% on the MedMCQA benchmark, outperforming comparable models as well as models 10 to 100 times larger.
measurement: The MedMCQA development set used in this study contains approximately 4,000 questions.
reference: Pal et al. (2022a) published 'MedMCQA: A large-scale multi-subject multi-choice dataset for medical domain question answering' in the Proceedings of the Conference on Health, Inference, and Learning, volume 174, pages 248–260 (PMLR).
measurement: On the MedMCQA benchmark, AMG-RAG achieves an accuracy of 66.34%, outperforming Meditron-70B (66.0%), Codex 5-shot CoT (59.7%), VOD (58.3%), Flan-PaLM (57.6%), PaLM (54.5%), GAL 120B (52.9%), PubMedBERT (40.0%), SciBERT (39.0%), BioBERT (38.0%), and BERT (35.0%).
reference: MedMCQA is a multiple-choice question-answering dataset tailored for medical QA, offering a broad variety of question types that encompass both foundational and clinical knowledge across diverse medical specialties.
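For concreteness, a minimal sketch of inspecting the dataset described above with the Hugging Face `datasets` library. The hub id `openlifescienceai/medmcqa` and the field names are assumptions based on the public release, not details taken from the sources listed here.

```python
from datasets import load_dataset

# Assumed hub id for the public MedMCQA release.
medmcqa = load_dataset("openlifescienceai/medmcqa")

# Print the size of each split; the validation (dev) split is the
# roughly 4,000-question set referenced in the fact above.
for split, data in medmcqa.items():
    print(f"{split}: {len(data)} questions")

# Each record is one multiple-choice item: a question stem, four options
# (opa..opd), the gold option index (cop), and subject/topic metadata.
sample = medmcqa["validation"][0]
print(sample["question"])
print(sample["opa"], sample["opb"], sample["opc"], sample["opd"])
```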
A Comprehensive Benchmark and Evaluation Framework for Multi ... · arxiv.org · Jan 6, 2026 · 3 facts
claim: Existing benchmarks for medical LLMs, such as MedQA and MedMCQA, focus on static tasks like multiple-choice questions or summarization, which do not mirror the dynamic, multi-turn nature of real-world clinical diagnostic reasoning.
claim: MedMCQA and LLM-MedQA do not support multi-turn interaction, do not include key-point rubrics, and are not expert-validated.
reference: The MedMCQA benchmark provides over 194,000 high-quality multiple-choice questions derived from Indian medical entrance exams to evaluate large language model reasoning across diverse healthcare topics in a single-turn format.
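The single-turn format referenced above amounts to posing each item once and scoring the single returned letter. A minimal sketch of that evaluation loop, assuming the public MedMCQA field names (`question`, `opa`..`opd`, `cop`) and a hypothetical `ask_model` callable standing in for whatever LLM is under evaluation:

```python
OPTION_KEYS = ["opa", "opb", "opc", "opd"]
LETTERS = ["A", "B", "C", "D"]

def format_prompt(item: dict) -> str:
    # Pose the item once: question stem plus the four lettered options.
    options = "\n".join(f"{l}. {item[k]}" for l, k in zip(LETTERS, OPTION_KEYS))
    return f"{item['question']}\n{options}\nAnswer with a single letter."

def evaluate(items, ask_model) -> float:
    # Accuracy is the fraction of items whose first returned letter
    # matches the gold option; `cop` is the gold index (0-3) in the
    # public MedMCQA release.
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item)).strip().upper()
        if reply[:1] == LETTERS[item["cop"]]:
            correct += 1
    return correct / len(items)
```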
EdinburghNLP/awesome-hallucination-detection - GitHub · github.com · 2 facts
reference: The Med-HALT dataset includes the following subsets: MedMCQA, HeadQA, MedQA (USMLE), MedQA (Taiwan), and PubMed.
procedure: A lightweight classifier method for hallucination detection conditions on the input hidden states before text generation and intervenes in those states to steer large language models toward factual outputs, yielding consistent improvements in factual accuracy with minimal computational overhead. The method uses accuracy as its metric and is evaluated on the NQ-Open, MMLU, MedMCQA, and GSM8K datasets.
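The fact above gives only the outline of the method. A minimal sketch in PyTorch of the general idea (a linear probe on hidden states plus a steering shift along its weight vector) follows; the probe architecture, the mean pooling, and the steering rule are illustrative assumptions, not the cited work's exact design.

```python
import torch
import torch.nn as nn

class FactualityProbe(nn.Module):
    """Lightweight classifier over input hidden states (assumed design)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # A single linear layer: cheap to train and to run at inference.
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, hidden_size) -> one factuality logit
        # per sequence, via mean pooling over positions.
        return self.score(hidden.mean(dim=1)).squeeze(-1)

def steer(hidden: torch.Tensor, probe: FactualityProbe,
          alpha: float = 0.5) -> torch.Tensor:
    # Intervene before generation: shift the hidden states along the
    # probe's weight vector, i.e. the direction that raises the
    # predicted factuality score. `alpha` controls the strength.
    direction = probe.score.weight.squeeze(0)
    direction = direction / direction.norm()
    return hidden + alpha * direction
```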
Medical Hallucination in Foundation Models and Their ... · medrxiv.org · Mar 3, 2025 · 1 fact
claim: Google's Med-PaLM and Med-PaLM 2 demonstrate strong performance on medical benchmarks such as MedQA (Jin et al., 2021), MedMCQA (Pal et al., 2022), and PubMedQA (Jin et al., 2019) by integrating biomedical texts into their training regimes, as reported by Singhal et al. (2022).