concept

GPT-5

Also known as: GPT-5-High

Facts (15)

Sources
A Comprehensive Benchmark and Evaluation Framework for Multi ... arxiv.org arXiv Jan 6, 2026 9 facts
measurement: The Majority Voting strategy for ensemble LLM judges consistently produces stable agreement with human clinical experts, maintaining F1-scores in the 75–79% range across Doctor Agents including DeepSeek, Gemini, and GPT-5.
measurement: The Liberal Strategy for ensemble LLM judges achieves the highest alignment metrics with human clinical experts, particularly for the GPT-5 model.
claim: Studies by Maina et al. identify persistent challenges in LLM-as-a-Judge methods, including verbosity bias, inconsistency in low-resource languages, and a "severity gap" in which models like GPT-5 and Gemini exhibit leniency that diverges from human clinicians.
claim: The study implements ensemble strategies using a panel of three advanced LLMs (GPT-5, Gemini-2.5-Pro, and DeepSeek-V3) to capture the complexity of medical decision-making, with alignment verified against human experts using Macro F1 on 300 cases.
reference: OpenAI announced GPT-5 in 2025.
claim: GPT-5 exhibits linear growth in diagnostic accuracy, reaching its highest performance at 13+ turns in a consultation.
measurement: The study benchmarks two open-source models (Qwen3-235B-A22B-Instruct-2507 and DeepSeek-R1) and two proprietary models (GPT-5 and Gemini-2.5-Pro) to assess inquiry completeness in clinical contexts.
claim: The "Liberal Strategy" for aggregation in the MedDialogRubrics multi-agent judging system shows high agreement for GPT-5, suggesting that stronger models generate nuanced answers that strict "Unanimous" judges may fail to validate.
perspective: Multi-turn evaluation is necessary for benchmarking medical AI because static benchmarks like MedQA may show only marginal differences between models like GPT-5 and Qwen3-235B-A22B-Instruct-2507.
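The three aggregation strategies the facts above refer to (Unanimous, Majority Voting, Liberal) can be sketched as simple rules over a panel of judge verdicts. This is a minimal illustration, not the paper's implementation: the function names and the boolean pass/fail verdict format are assumptions.

```python
# Sketch of ensemble-judge aggregation over a panel of three LLM judges
# (e.g., GPT-5, Gemini-2.5-Pro, DeepSeek-V3). Each verdict is a boolean:
# True = that judge accepts the Doctor Agent's answer.
# NOTE: names and the binary verdict format are illustrative assumptions.

def unanimous(verdicts):
    # Strictest rule: accept only if every judge accepts.
    return all(verdicts)

def majority_vote(verdicts):
    # Accept if more than half of the judges accept.
    return sum(verdicts) > len(verdicts) / 2

def liberal(verdicts):
    # Most permissive rule: accept if at least one judge accepts.
    return any(verdicts)

panel = [True, True, False]  # two of three judges accept
print(unanimous(panel), majority_vote(panel), liberal(panel))
# -> False True True
```

The ordering explains the reported pattern: a nuanced answer that only some judges validate fails the Unanimous rule but passes Liberal, which is consistent with Liberal aligning best for stronger models such as GPT-5.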
Medical Hallucination in Foundation Models and Their Impact on ... medrxiv.org medRxiv Nov 2, 2025 4 facts
measurement: The advanced reasoning model GPT-5 achieves a 71.2% baseline resistance to hallucinations and a semantic similarity score greater than 0.8.
claim: OpenAI's GPT-5 emphasizes advanced long-context reasoning and more reliable factual grounding.
measurement: GPT-5 achieves a hallucination resistance of 87.6% when using search-augmented generation, a 16.5% improvement over its baseline performance.
claim: The authors conducted experimental analyses of medical hallucinations across general practice, oncology, cardiology, and medical education using GPT-5, Gemini-2.5-Pro, DeepSeek-R1, and MedGemma.
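A semantic similarity score above 0.8, as cited above, is typically computed as cosine similarity between embedding vectors of a model answer and a reference. The sketch below assumes generic embedding vectors; the study's actual embedding model and threshold procedure are not specified here.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors:
    # dot(a, b) / (||a|| * ||b||); 1.0 means identical direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for sentence embeddings of a model answer
# and a reference answer; a score > 0.8 would count as "similar".
ref = [0.2, 0.7, 0.1]
ans = [0.25, 0.65, 0.15]
print(round(cosine_similarity(ref, ans), 3))
# -> 0.993
```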
vectara/hallucination-leaderboard - GitHub github.com Vectara 1 fact
reference: The Vectara hallucination leaderboard accesses each large language model through a provider-specific API:
- Together AI: Llama 4 Maverick 17B 128E Instruct FP8, Llama 4 Scout 17B 16E Instruct, GPT-OSS-120B, GLM-4.5-AIR-FP8
- Azure: Microsoft Phi-4, Phi-4-Mini
- Mistral AI: Ministral 3B, Ministral 8B, Mistral Large, Mistral Medium, Mistral Small
- Moonshot AI: Kimi-K2-Instruct-0905
- OpenAI: GPT-4.1, GPT-4o, GPT-5-High, GPT-5-Mini, GPT-5-Minimal, GPT-5-Nano, o3-Pro, o4-Mini-High, o4-Mini-Low
- dashscope: Qwen3-4b, Qwen3-8b, Qwen3-14b, Qwen3-32b, Qwen3-80b-a3b-thinking
- Replicate: Snowflake-Arctic-Instruct
- xAI: Grok-3, Grok-4-Fast-Reasoning, Grok-4-Fast-Non-Reasoning
- deepinfra: GLM-4.6
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... arxiv.org arXiv Feb 23, 2026 1 fact
measurement: In the KGHaluBench evaluation, GPT-5 achieved the highest Weighted Accuracy score of 65.60%.