concept

o3-mini

Also known as: o4-Mini-High

Facts (11)

Sources
Medical Hallucination in Foundation Models and Their ... medrxiv.org medRxiv Mar 3, 2025 5 facts
claimSystem Prompting offers noticeable improvements in reducing hallucination rates and works synergistically with Chain-of-Thought (CoT) prompting, particularly in models like o3-mini and deepSeek-r1.
claimOpenAI's o3-mini model was introduced in January 2025 and is designed to spend more time thinking before responding to enhance reasoning capabilities for complex tasks.
claimAdvanced general-purpose models like deepseek-r1 and o3-mini demonstrate superior performance in medical tasks compared to domain-specific models, suggesting that broad language understanding and reasoning capabilities are more crucial for reliability than domain-specific training alone.
claimHighly advanced architectures like deepseek-r1 and o3-mini show less pronounced gains from search augmentation, indicating an increasing reliance on their internal knowledge base for accuracy.
claimThe authors conducted experimental analyses on medical hallucinations using state-of-the-art Large Language Models including o3-mini, Gemini-2.0 Flash Thinking, and domain-specific models such as Meditron and Med-Alpaca across general practice, oncology, cardiology, and medical education scenarios.
Medical Hallucination in Foundation Models and Their Impact on ... medrxiv.org medRxiv Nov 2, 2025 4 facts
claimOpenAI's o3-mini and o1 models allocate more inference time to deliberate reasoning before producing responses, which improves performance on complex tasks such as scientific reasoning, coding, and mathematics.
measurementThe models gemini-2.5-pro, o3-mini, and deepseek-r1 cluster in the high semantic similarity range of 0.8–0.9, indicating strong semantic alignment with ground truth medical information.
measurementThe AI model o3-mini achieves a hallucination resistance baseline of 80.4%, while deepseek-r1 achieves a baseline of 86.6%.
measurementSystem Prompting provides complementary gains to Chain-of-Thought (CoT) reasoning in reducing hallucination rates, as seen in o3-mini (baseline 80.4% to System Prompt 81.4% to CoT 90.7%) and deepseek-r1 (baseline 86.6% to System Prompt 84.5% to CoT 90.7%).
vectara/hallucination-leaderboard - GitHub github.com Vectara 1 fact
referenceThe Vectara hallucination leaderboard utilizes specific API access points for various large language models: Llama 4 Maverick 17B 128E Instruct FP8 and Llama 4 Scout 17B 16E Instruct are accessed via Together AI; Microsoft Phi-4 and Phi-4-Mini are accessed via Azure; Mistral Ministral 3B, Ministral 8B, Mistral Large, Mistral Medium, and Mistral Small are accessed via Mistral AI's API; Kimi-K2-Instruct-0905 is accessed via Moonshot AI API; GPT-4.1, GPT-4o, GPT-5-High, GPT-5-Mini, GPT-5-Minimal, GPT-5-Nano, o3-Pro, o4-Mini-High, and o4-Mini-Low are accessed via OpenAI API; GPT-OSS-120B, GLM-4.5-AIR-FP8 are accessed via Together AI; Qwen3-4b, Qwen3-8b, Qwen3-14b, Qwen3-32b, and Qwen3-80b-a3b-thinking are accessed via dashscope API; Snowflake-Arctic-Instruct is accessed via Replicate API; Grok-3, Grok-4-Fast-Reasoning, and Grok-4-Fast-Non-Reasoning are accessed via xAI's API; and GLM-4.6 is accessed via deepinfra.
LLM Hallucination Detection and Mitigation: State of the Art in 2026 zylos.ai Zylos Jan 27, 2026 1 fact
claimOpenAI's 2026 research on reasoning models demonstrates that naturally understandable chain-of-thought reasoning traces are reinforced through reinforcement learning, and that simple prompted GPT-4o models can effectively monitor for reward hacking in frontier reasoning models like o1 and o3-mini successors.