Relations (1)
related 2.00 — strongly supporting 3 facts
GPT-4 is evaluated as a state-of-the-art language model on the MEDQA benchmark, where it achieves high accuracy as documented in [1] and [2]; it is also used as a core component of the KG-Rank method for domain-specific QA, as described in [3].
Facts (3)
Sources
Bridging the Gap Between LLMs and Evolving Medical Knowledge (arxiv.org) — 2 facts
reference: Table 1 in the paper 'Bridging the Gap Between LLMs and Evolving Medical Knowledge' compares state-of-the-art language models on the MEDQA benchmark: Med-Gemini (1800B) achieved 91.1% accuracy, GPT-4 (1760B) 90.2%, Med-PaLM 2 (340B) 85.4%, AMG-RAG (8B) 73.9%, and BioMedGPT (10B) 50.4%.
claim: Larger language models like Med-Gemini and GPT-4 achieve the highest accuracy and F1 scores on the MEDQA benchmark but require significantly larger parameter sizes.
Large Language Models Meet Knowledge Graphs for Question ... (arxiv.org) — 1 fact
reference: The KG-Rank method, proposed by Yang et al. in 2024, uses Similarity and MMR-based Ranking with the GPT-4, Llama-2-7B, and Llama-2-13B language models and the UMLS and DBpedia knowledge graphs for domain-specific QA, evaluated with the ROUGE-L, BERTScore, MoverScore, and BLEURT metrics on the LiveQA, ExpertQA-Bio, ExpertQA-Med, and MedQA datasets.
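KG-Rank's exact ranking procedure is defined in the paper itself; as a rough illustration of the general MMR (Maximal Marginal Relevance) idea it builds on, the sketch below greedily selects candidates that score high on query relevance but low on redundancy with already-selected items. The `cosine` similarity, the `lam` trade-off weight, and the toy vectors are illustrative assumptions, not values from the paper.

```python
import math

def cosine(u, v):
    # Plain cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mmr_rank(query, candidates, k=3, lam=0.7):
    """Generic MMR: pick up to k candidate vectors, each time maximizing
    lam * relevance(query) - (1 - lam) * max similarity to picks so far."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best, best_score = None, -float("inf")
        for i in remaining:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage: the first pick matches the query; the second still wins on
# relevance despite its redundancy penalty at lam=0.7.
print(mmr_rank([1.0, 0.0], [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], k=2))
```

Lowering `lam` shifts the balance toward diversity, which in a retrieval-augmented setting helps avoid feeding the language model several near-duplicate knowledge-graph triples.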