concept

GPT-4

Also known as: GPT-4o, GPT-4.0, gpt-4o

synthesized from dimensions

GPT-4 is a prominent, proprietary large language model (LLM) developed by OpenAI, built on a transformer-based deep learning architecture. Since its introduction, the series has evolved into a multimodal framework, with variants such as GPT-4V and the GPT-4o series capable of processing text, images, and audio 6. It is widely recognized for its robust performance across complex tasks 20, and it serves as a foundational benchmark for evaluating other artificial intelligence systems.

The model is extensively utilized in research and industry for a variety of technical applications. These include knowledge-based tasks such as multi-hop question answering and document analysis within Retrieval-Augmented Generation (RAG) frameworks 7, 8, automated knowledge graph construction, and medical record parsing (e.g., ClinicalKG's parsing of electronic health records). Furthermore, GPT-4 is frequently employed as an automated evaluation metric, such as G-Eval or GPT4Score, to assess the quality of natural language generation, and to generate synthetic training data for smaller models 16, 19.
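G-Eval, mentioned above, scores generated text by prompting GPT-4 for a quality rating and weighting each candidate rating by the probability the evaluator assigns to it. A minimal Python sketch of that aggregation step, using made-up probabilities rather than values from a real API call:

```python
def g_eval_score(rating_probs: dict[int, float]) -> float:
    """Probability-weighted rating, in the style of G-Eval: each
    candidate rating (e.g. 1-5) is weighted by the evaluator LLM's
    token probability of emitting it, then normalized."""
    total = sum(rating_probs.values())
    return sum(r * p for r, p in rating_probs.items()) / total

# Hypothetical token probabilities for ratings 1-5 on one generated summary:
probs = {1: 0.02, 2: 0.08, 3: 0.30, 4: 0.45, 5: 0.15}
score = g_eval_score(probs)  # weighted mean rating, ≈ 3.63
```

This weighting yields finer-grained, continuous scores than taking the single most likely rating, which is the motivation the G-Eval authors give for the approach.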

Despite its capabilities, GPT-4 faces significant challenges regarding accuracy, reliability, and reasoning. Research indicates that the model may rely on sophisticated pattern matching rather than genuine algorithmic reasoning, leading to cliff-like performance declines on novel tasks. It is susceptible to adversarial attacks 15 and continues to struggle with hallucination detection, particularly in sensitive domains like medicine, where it has demonstrated elevated hallucination rates in chronological ordering and lab data interpretation that experts classify as clinically risky. Additionally, there is ongoing academic debate regarding whether its performance on tasks like false-belief reasoning represents a true "Theory of Mind" or merely advanced statistical mimicry.

In the broader AI landscape, GPT-4 is often positioned as a high-cost, high-performance standard. While it remains a leader in general capability, comparative studies frequently highlight that specialized, fine-tuned models (such as those using LoRA) or smaller, more efficient alternatives can match or exceed its performance in specific domains, such as knowledge QA or tactical planning, at a fraction of the deployment cost 14. Consequently, while GPT-4 remains a pivotal reference point in the field, the industry is increasingly focused on balancing its broad utility against the efficiency and transparency of specialized or open-source alternatives.
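The cost gap behind this trade-off can be made concrete using the per-million-token prices cited in the facts list below (GPT-4 at $4.40 versus DeepSeek at $0.10). A small illustrative Python helper; the 500M-token monthly workload is an assumed figure, not from any source:

```python
def monthly_token_cost(tokens_per_month: int, usd_per_million_tokens: float) -> float:
    """Estimated monthly spend: token volume times the provider's
    per-million-token price."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens

# Assumed workload of 500M tokens/month; prices from the cited comparison.
gpt4_cost = monthly_token_cost(500_000_000, 4.40)      # ≈ $2,200
deepseek_cost = monthly_token_cost(500_000_000, 0.10)  # ≈ $50
ratio = gpt4_cost / deepseek_cost                      # ≈ 44x difference
```

At any fixed workload the ratio is just the price ratio, which is why such comparisons are usually quoted as "Nx cheaper" independent of volume.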

Model Perspectives (3)
openrouter/google/gemini-3.1-flash-lite-preview definitive 100% confidence
GPT-4 is a transformer-based large language model (LLM) developed by OpenAI, characterized by its deep learning architecture and extensive pre-training on large datasets. The model has evolved into a multimodal framework, with the GPT-4o variant capable of processing text, images, and audio. It is widely utilized as a benchmark for AI performance and as a tool for various technical applications, including automated knowledge graph construction, medical case record parsing, and NLG evaluation. Despite its capabilities, GPT-4 faces significant challenges regarding accuracy and reasoning. Research indicates that the model may rely on memorization rather than algorithmic reasoning, leading to performance declines on novel tasks. In medical contexts, GPT-4o has demonstrated high hallucination rates in areas like chronological ordering and lab data interpretation, with experts noting substantial clinical risks. Comparative studies often show that while GPT-4 remains a leader in performance, specialized fine-tuned models (such as those using LoRA) or smaller, cost-effective alternatives can outperform it in specific domains like tactical planning or knowledge question answering. Furthermore, debate persists regarding whether its performance on tasks like false-belief reasoning constitutes genuine 'Theory of Mind' or merely sophisticated pattern matching.
openrouter/google/gemini-3.1-flash-lite-preview definitive 100% confidence
GPT-4 is a prominent large language model (LLM) developed by OpenAI, recognized for its capabilities in natural language understanding and generation 20. As a connectionist architecture, it processes human language using symbolic representations 3. The model utilizes large-scale pretraining and task-specific fine-tuning to achieve generalization across various tasks 1. Technical documentation and performance details were initially provided in the 'GPT-4 Technical Report' by OpenAI et al. 4, with early experiments exploring its potential regarding artificial general intelligence 42. Since its inception, the GPT-4 series has expanded to include multimodal variants like GPT-4V, which possesses vision capabilities 18, and the GPT-4o and GPT-4o mini models released in 2024 6. These models are widely utilized in research for diverse applications, including:
* Knowledge-Based Tasks: GPT-4 is frequently integrated into RAG (Retrieval-Augmented Generation) and knowledge graph frameworks to perform multi-hop QA, document QA, and domain-specific tasks 7, 8, 12, 38.
* Evaluation and Benchmarking: GPT-4 often serves as a model-based evaluation metric (e.g., GPT4Score) 19 and is used to generate synthetic training data for other models 16, 43.
* Medical Applications: It has been used to validate medical knowledge graphs 21, 37 and evaluate medical reasoning 40.
Despite its capabilities, research indicates limitations. GPT-4 can be susceptible to adversarial attacks 15 and may struggle with hallucination detection, even when prompted to explain its reasoning 5, 26. In specific knowledge QA tasks, some fine-tuned models have been observed to outperform GPT-4 in metrics such as BERTScore 14 and overall performance scores 27.
openrouter/x-ai/grok-4.1-fast 70% confidence
GPT-4 is referenced primarily as a high-performance proprietary large language model (LLM) serving as a benchmark for comparisons with open-source and smaller alternatives. According to IBM open-source promotion, it contrasts with transparent open-source LLMs. Amazon Science's RefChecker includes GPT-4 in its initial hallucination checking alongside Claude 2 and RoBERTa-NLI. Performance-wise, AnyScale reports Llama 2 matches GPT-4 accuracy in summaries at 30x lower cost, while Hugging Face's Think-on-Graph with small LLMs exceeds it in some scenarios, reducing deployment costs. Zylos' MiniCheck-FT5 reaches GPT-4-level fact-checking at 400x lower cost with 770M parameters. A Nature-published LoRA model scores 11.9% higher than GPT-4 in knowledge QA. It appears in detection methods, with Datadog comparing against GPT-4o and Lynx (8B) using shared faithfulness formats (Datadog; Aritra Biswas, Noé Vernier), and Tyler Malloy et al. modeling human-GPT-4 content similarity (arXiv). GPT-4o variants generate FinQA dataset responses (Cleanlab) and serve as baselines like Patronus AI's prompt on GPT-4o (Datadog). Chiang et al. (2023) highlight Vicuna's 90% ChatGPT quality impressing GPT-4 (arXiv). Overall, facts portray GPT-4 as a costly yet capable standard outperformed or matched by efficient alternatives.

Facts (105)

Sources
The construction and refined extraction techniques of knowledge ... nature.com Nature Feb 10, 2026 12 facts
claim: Large-scale language models such as GPT-4, LLaMA, and PaLM are key enablers of automated knowledge graph construction due to their strong semantic understanding and reasoning capabilities.
measurement: The LoRA fine-tuned model achieved an overall score 11.9% higher than GPT-4 in knowledge question answering tasks.
measurement: In tactical planning tasks, the LoRA fine-tuned model achieved an overall score of 0.88, while GPT-4 achieved a score of 0.76.
measurement: ClinicalKG parses electronic health records using GPT-4 to build a disease-symptom-drug network, achieving 89% accuracy in FDA-level drug interaction evaluations.
measurement: The LoRA fine-tuned model achieved a score 15.8% higher than GPT-4 in tactical planning tasks.
claim: GPT-4 is a language generation model that utilizes a complex pre-training dataset and a multi-layer deep learning network architecture.
measurement: The LoRA fine-tuned model achieved a score 16.5% higher than GPT-4 in threat assessment tasks.
claim: Large-scale pre-trained Large Language Models (LLMs) such as GPT-4 and LLaMA-3 utilize large-scale pretraining and task-specific fine-tuning to achieve cross-task generalization.
claim: GPT-3.5 is a language model capable of natural language understanding and generation, though it performs with more limitations compared to GPT-4.
measurement: In knowledge question answering tasks, the LoRA fine-tuned model achieved a BERTScore of 0.96, while GPT-4 achieved a BERTScore of 0.85.
measurement: In knowledge question answering tasks, the LoRA fine-tuned model achieved an overall score of 0.94, while GPT-4 achieved a score of 0.84.
claim: The study compares a fine-tuned DeepSeek-R1 70B LoRA model against baseline models including the original DeepSeek-R1 70B, GPT-4, GPT-3.5, and LLaMA3 70B to assess task performance improvements.
Large Language Models Meet Knowledge Graphs for Question ... arxiv.org arXiv Sep 22, 2025 10 facts
reference: RAG-KG-IL, as described by Yu and McQuade (2025), employs agent-based incremental learning and knowledge dynamic updates using GPT-4o on self-constructed knowledge graphs for domain-specific QA tasks.
reference: The ODA method, proposed by Sun et al. in 2024, uses ODA-based knowledge graph retrieval with GPT-4 and GPT-3.5 models to perform KBQA tasks, evaluated using Hits@1 and Acc metrics on the QALD10-en dataset.
reference: The ToG-2 method, proposed by Ma et al. in 2025, utilizes hybrid RAG and knowledge-guided context retrieval with GPT-3.5-Turbo, GPT-4o, Llama-3-8B, and Qwen2-7B models to perform multi-hop KBQA, document QA, and domain-specific QA tasks, evaluated using Acc, EM, R, and F1 metrics on WQSP, QALD10-en, AdvHotpotQA, HotpotQA, and ToG-FinQA datasets.
reference: The KG-CoT method, proposed by Zhao et al. in 2024, uses chain-of-thought-based joint reasoning between knowledge graphs and LLMs (GPT-4, GPT-3.5-Turbo, Llama-7B, Llama-13B) to perform KBQA and multi-hop QA tasks, evaluated using Acc and Hit@K metrics on WQSP, CWQ, SQ, and WQ datasets.
reference: The KGR method, proposed by Guan et al. in 2024, utilizes Refine-then-Retrieve and Knowledge Truthfulness Verification with GPT-4, Llama-2-7B, Vanilla Llama-2-7B, and Transformer models alongside CKG and PrimeKG knowledge graphs for domain-specific QA, evaluated using a Truthfulness Score on the MedQuAD dataset.
reference: The KG-RAG method, proposed by Xu et al. in 2024, utilizes vector-based subgraph retrieval with the GPT-4 language model, incorporating self-constructed knowledge graphs to perform KGQA tasks on a curated dataset, evaluated using MRR, Recall@K, NDCG@K, BLEU, ROUGE, and METEOR metrics.
reference: The KG-Agent method, proposed by Jiang et al. in 2024, uses KG-Agent-based instruction tuning with Davinci-003, GPT-4, and Llama-2-7B models to perform KGQA and ODQA tasks, evaluated using Hits@1 and F1 metrics on WQSP, CWQ, and GrailQA datasets.
reference: The ToG method, proposed by Sun et al. in 2024, uses beam-search-based retrieval and LLM agents with GPT-3.5-Turbo, GPT-4, and Llama-2-70B-Chat models to perform KBQA and open-domain QA tasks, evaluated using Hits@1 on CWQ, WQSP, GrailQA, QALD10-en, and WQ datasets.
reference: The KG-Rank method, proposed by Yang et al. in 2024, uses Similarity and MMR-based Ranking with GPT-4, Llama-2-7B, and Llama-2-13B language models and the UMLS and DBpedia knowledge graphs for domain-specific QA, evaluated using ROUGE-L, BERTScore, MoverScore, and BLEURT metrics on the LiveQA, ExpertQA-Bio, ExpertQA-Med, and MedQA datasets.
reference: KG-IRAG, as described by Yang et al. (2025), utilizes incremental retrieval and iterative reasoning with Llama-3-8B-Instruct, GPT-3.5-Turbo, GPT-4o-mini, and GPT-4o models on self-constructed knowledge graphs for temporal QA tasks.
Medical Hallucination in Foundation Models and Their ... medrxiv.org medRxiv Mar 3, 2025 9 facts
claim: OpenAI's GPT-4o model, released in May 2024, is a multimodal model capable of processing and generating text, images, and audio with enhanced reasoning and factual accuracy.
procedure: The procedure for refining text for completeness and structure involves prompting OpenAI’s GPT-4o to use text extracted by pdfminer to restore missing text from Marker-extracted content, while ensuring the final output is ordered and in Markdown format.
claim: Pretrained Large Language Models such as GPT-3, GPT-4, PaLM, LLaMA, and BERT have demonstrated advancements due to the extensive datasets used in their training.
measurement: GPT-4o exhibited the highest hallucination rates in Chronological Ordering (24.6%) and Lab Data Understanding (18.7%) compared to other models, with many of these hallucinations classified by medical experts as posing 'Significant' or 'Considerable' clinical risk.
reference: The final data representation for case records stores text in a 'text.txt' file using Markdown, stores tables in a structured JSON format (containing tabular data, Markdown formatting, and a GPT-4o generated summary), and saves images in a directory with associated JSON files containing captions and summaries.
procedure: The procedure for providing summaries of extracted images involves using the multimodal capability of OpenAI’s GPT-4o to generate concise summaries for critical visual content in case records.
procedure: The procedure for handling missing tables in medical case records involves: (1) prompting OpenAI’s GPT-4o model to identify missing tables in text extracted by Marker, (2) re-parsing the document with Marker if the model detects missing tables, and (3) limiting this verification process to a maximum of four trials.
measurement: GPT-4o had a hallucination rate of 22.0% in Diagnosis Prediction, which was marginally lower than the rate observed for Gemini-2.0-flash-exp (2.25%), though the authors note a potential data discrepancy in the Gemini figures.
measurement: The study evaluated hallucination rates and clinical risk severity for five Large Language Models: o1, gemini-2.0-flash-exp, gpt-4o, gemini-1.5-flash, and claude-3.5 sonnet.
Efficient Knowledge Graph Construction and Retrieval from ... - arXiv arxiv.org arXiv Aug 7, 2025 8 facts
claim: In the CCM Chat evaluation, both variants of GraphRAG (one using GPT-4o for triplet extraction and one using a dependency graph for triplet creation) demonstrated an improvement in context precision scores compared to dense vector retrieval.
claim: The TripleExtractor system selects between GPT-4o and dependency graph models based on dataset size and cost calculations to optimize the knowledge graph construction process.
claim: GraphRAG systems constructed using dependency parsing achieve performance comparable to those using GPT-4o for triplet extraction, indicating that dependency graph-based GraphRAG is a strong alternative for retrieval tasks.
measurement: The GraphRAG (GPT-4o) method achieved a Context Precision of 63.82%, Faithfulness of 74.24%, Answer Relevancy of 89.43%, and an average score of 75.83%.
claim: On the CCM Code Proposal dataset, both GraphRAG variants (using GPT-4o and dependency parsing) outperform dense retrieval methods in terms of winning rate and average score across five evaluation criteria.
measurement: In the CCM Chat evaluation, the dependency graph-based GraphRAG model achieved 95.75% of the GPT-4o variant's performance in context precision and 96.67% of the GPT-4o variant's performance in Full Coverage, demonstrating strong performance with a lighter knowledge graph construction pipeline.
measurement: In the CCM Code Proposal dataset evaluation, the Dense Vector (ada-002) method achieved a 23% winning rate and an average score of 3.48, while the GraphRAG (GPT-4o) method achieved a 77% winning rate and an average score of 4.04.
claim: The TripleExtractor system allows users to choose between commercial LLMs (GPT-4o and Sonnet) or a dependency parser-based approach for extracting knowledge triples.
EdinburghNLP/awesome-hallucination-detection - GitHub github.com GitHub 6 facts
measurement: According to AnyScale, Llama 2 is approximately as factually accurate as GPT-4 for summaries and is 30 times cheaper to operate.
claim: Research presented at ACL 2025 evaluates leading AI models, specifically GPT-4o, Gemini-1.5, and Llama-3.2-Vision, in scenarios where a model correctly identifies an object visually in English but hallucinates its properties when generating text in another language.
measurement: GPT-4 achieves an F1-score of approximately 0.625 in detecting subtle falsehoods on the hardest subset of the MedHallu benchmark.
reference: PsiloQA is a large-scale dataset for multilingual span-level hallucination detection that supports 14 languages and is created through an automated three-stage pipeline involving QA generation, hallucinated answer elicitation, and GPT-4o–based span annotation.
procedure: The Lynx model is trained on 2400 samples from RAGTruth, DROP, CovidQA, and PubMedQA, incorporating GPT-4o generated reasoning as part of the training data.
procedure: The HaluBench dataset utilizes GPT-4o to generate hallucinated examples.
Medical Hallucination in Foundation Models and Their Impact on ... medrxiv.org medRxiv Nov 2, 2025 5 facts
claim: Medical experts independently classified a substantial proportion of GPT-4o's hallucinations as posing 'Significant' or 'Considerable' clinical risk.
claim: GPT-4 can sometimes surpass clinicians in estimating disease likelihoods, although both the model and human experts deviate substantially from actual prevalence rates.
claim: OpenAI's GPT-4o is a multimodal model capable of processing and generating text, images, and audio with improved factual consistency.
measurement: The AI model o1 achieves a hallucination resistance baseline of 64.0%, while earlier-generation models gpt-4o and gpt-4o-mini achieve baselines of 54.4% and 48.3% respectively.
measurement: OpenAI released GPT-4o in May 2024 and GPT-4o mini in July 2024.
Bridging the Gap Between LLMs and Evolving Medical Knowledge arxiv.org arXiv Jun 29, 2025 5 facts
reference: Table 1 in the paper 'Bridging the Gap Between LLMs and Evolving Medical Knowledge' compares state-of-the-art language models on the MEDQA benchmark, showing that Med-Gemini (1800B) achieved 91.1% accuracy, GPT-4 (1760B) achieved 90.2% accuracy, Med-PaLM 2 (340B) achieved 85.4% accuracy, AMG-RAG (8B) achieved 73.9% accuracy, and BioMedGPT (10B) achieved 50.4% accuracy.
claim: AMG-RAG, which has 8B parameters, delivers competitive results compared to much larger models like Med-Gemini (1800B) and GPT-4 (1760B).
claim: Clinical experts and expert LLMs like GPT-4 validated the correctness of the Medical Knowledge Graph used in the AMG-RAG system.
claim: Larger language models like Med-Gemini and GPT-4 achieve the highest accuracy and F1 scores on the MEDQA benchmark but require significantly larger parameter sizes.
measurement: Expert LLMs like GPT-4 achieved an accuracy of 9/10 when validating knowledge extracted for the AMG-RAG Medical Knowledge Graph.
The Synergy of Symbolic and Connectionist AI in LLM-Empowered ... arxiv.org arXiv Jul 11, 2024 4 facts
claim: Large Language Models (LLMs) are transformer-based language models, including OpenAI’s GPT-4, Google’s Gemini and PaLM, Microsoft’s Phi-3, and Meta’s LLaMA.
reference: Josh Achiam et al. published the GPT-4 technical report as an arXiv preprint in 2023.
claim: OpenAI's GPT-4 demonstrates capabilities in natural language understanding and generation.
reference: Sébastien Bubeck et al. conducted early experiments with GPT-4 and published their findings in the 2023 arXiv preprint 'Sparks of artificial general intelligence: Early experiments with gpt-4'.
A Survey of Incorporating Psychological Theories in LLMs - arXiv arxiv.org arXiv 4 facts
claim: Amidei et al. (2025) found that language switching alters the traits of GPT-4o as measured by the Eysenck Personality Questionnaire Revised, highlighting challenges in maintaining stable traits and reducing context dependence in Large Language Models.
reference: Tyler Malloy, Maria José Ferreira, Fei Fang, and Cleotilde Gonzalez developed a cognitive model to measure the subjective similarity between human-written content and GPT-4-written content, as presented in the 2024 Proceedings of the 28th Conference on Computational Natural Language Learning.
claim: While some researchers interpret GPT-4's performance on false-belief tasks as emergent Theory of Mind-like reasoning (Kosinski, 2024), others argue it is merely pattern matching, noting that minor prompt changes can significantly alter results (Strachan et al., 2024; Shapira et al., 2024).
measurement: GPT-4 solves approximately 75% of false-belief tasks, which is comparable to the performance of a 6-year-old human, as reported by Kosinski (2024) and Strachan et al. (2024).
Detecting hallucinations with LLM-as-a-judge: Prompt ... - Datadog datadoghq.com Aritra Biswas, Noé Vernier · Datadog Aug 25, 2025 3 facts
procedure: The Datadog, Lynx (8B), and GPT-4o-based detection methods all utilize the same faithfulness evaluation format consisting of a question, context, and answer.
reference: Liu, X. et al. (2023) published 'G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment' at ACL 2023.
claim: The Datadog hallucination detection method was compared against two baselines: the open-source Lynx (8B) model from Patronus AI, and the same prompt used by Patronus AI evaluated on GPT-4o.
The Synergy of Symbolic and Connectionist AI in LLM ... arxiv.org arXiv 2 facts
claim: OpenAI’s GPT-4 is an example of a Large Language Model that demonstrates unprecedented capabilities in natural language understanding and generation, exhibiting robust performance across a range of complex tasks.
reference: Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. authored the paper 'Sparks of artificial general intelligence: Early experiments with gpt-4', published as an arXiv preprint in 2023.
New tool, dataset help detect hallucinations in large language models amazon.science Amazon Science 2 facts
claim: In the initial release of RefChecker, the automatic hallucination checker supports GPT-4, Claude 2, and RoBERTa-NLI, with plans to release additional open-source checkers such as AlignScore and a Mistral-based checker.
claim: In the initial release of RefChecker, the claim triplet extractor supports GPT-4 and Claude 2, with plans to provide a Mixtral-8x7B open-source extractor in a future release.
LLM Hallucination Detection and Mitigation: State of the Art in 2026 zylos.ai Zylos Jan 27, 2026 2 facts
claim: MiniCheck-FT5 is an efficient fact-checking system with 770 million parameters that achieves GPT-4-level performance at 400x lower cost, making it practical for synchronous production deployments.
claim: OpenAI's 2026 research on reasoning models demonstrates that naturally understandable chain-of-thought reasoning traces are reinforced through reinforcement learning, and that simple prompted GPT-4o models can effectively monitor for reward hacking in frontier reasoning models like o1 and o3-mini successors.
A Survey on the Theory and Mechanism of Large Language Models arxiv.org arXiv Mar 12, 2026 2 facts
claim: Huang et al. (2024b) observed a 'cliff-like decline' in GPT-4's performance on medium-to-hard problems when tested on novel competition problems released after its training data cut-off, suggesting reliance on memorization rather than algorithmic reasoning.
reference: The paper 'Speak, memory: an archaeology of books known to chatgpt/gpt-4' was published in the Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7312–7327.
The Impact of Open Source on Digital Innovation linkedin.com LinkedIn 2 facts
measurement: DeepSeek's AI model costs 10 cents per million tokens, whereas GPT-4 costs $4.40 per million tokens.
claim: DeepSeek, a Chinese AI lab, developed an AI model that matches or outperforms GPT-4 in several benchmarks.
Building Trustworthy NeuroSymbolic AI Systems - arXiv arxiv.org arXiv 2 facts
reference: Chiang et al. (2023) authored 'Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality'.
claim: GPT-3.5, Claude, and GPT-4.0 adhere more closely to instructions than Llama 2 (Touvron et al. 2023), Vicuna (Chiang et al. 2023), and Falcon (Penedo et al. 2023).
Survey and analysis of hallucinations in large language models frontiersin.org Frontiers Sep 29, 2025 1 fact
reference: Lee et al. (2023) analyzed the benefits, limits, and risks of using GPT-4 as an AI chatbot for medical applications in the New England Journal of Medicine.
LLM-empowered knowledge graph construction: A survey - arXiv arxiv.org arXiv Oct 23, 2025 1 fact
reference: Research by Saeedizade & Blomqvist (2024) and Lippolis et al. (2025b) evaluated GPT-4's performance in ontology construction and confirmed that its outputs approach the quality of novice human modelers, validating the feasibility of intelligent ontology assistants.
What Is Open Source Software? - IBM ibm.com IBM 1 fact
claim: Open source LLMs promote a transparent, accessible, and community-driven approach compared to proprietary models like Google's LaMDA and OpenAI's ChatGPT-3 and GPT-4.
Daily Papers - Hugging Face huggingface.co Hugging Face 1 fact
claim: In certain scenarios, the performance of the 'Think-on-Graph' (ToG) approach using small large language models can exceed that of large models like GPT-4, thereby reducing the cost of LLM deployment and application.
vectara/hallucination-leaderboard - GitHub github.com Vectara 1 fact
reference: The Vectara hallucination leaderboard utilizes specific API access points for various large language models: Llama 4 Maverick 17B 128E Instruct FP8 and Llama 4 Scout 17B 16E Instruct are accessed via Together AI; Microsoft Phi-4 and Phi-4-Mini are accessed via Azure; Mistral Ministral 3B, Ministral 8B, Mistral Large, Mistral Medium, and Mistral Small are accessed via Mistral AI's API; Kimi-K2-Instruct-0905 is accessed via Moonshot AI API; GPT-4.1, GPT-4o, GPT-5-High, GPT-5-Mini, GPT-5-Minimal, GPT-5-Nano, o3-Pro, o4-Mini-High, and o4-Mini-Low are accessed via OpenAI API; GPT-OSS-120B and GLM-4.5-AIR-FP8 are accessed via Together AI; Qwen3-4b, Qwen3-8b, Qwen3-14b, Qwen3-32b, and Qwen3-80b-a3b-thinking are accessed via dashscope API; Snowflake-Arctic-Instruct is accessed via Replicate API; Grok-3, Grok-4-Fast-Reasoning, and Grok-4-Fast-Non-Reasoning are accessed via xAI's API; and GLM-4.6 is accessed via deepinfra.
MedHallu - GitHub github.com GitHub 1 fact
measurement: State-of-the-art Large Language Models, including GPT-4o, Llama-3.1, and UltraMedical, struggle with hard hallucination categories in the MedHallu benchmark, achieving a best F1 score of 0.625.
Building Better Agentic Systems with Neuro-Symbolic AI cutter.com Cutter Consortium Dec 10, 2025 1 fact
claim: Deep learning neural network-based large language models, such as GPT-4, Claude, and Gemini, process unstructured data including text, images, video, and streaming sensor data to learn patterns, classify data, and make predictions.
Applying Large Language Models in Knowledge Graph-based ... arxiv.org Benedikt Reitemeyer, Hans-Georg Fill · arXiv Jan 7, 2025 1 fact
claim: Härer concluded that iterative modeling using GPT-4 is possible in a conversational fashion.
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... cleanlab.ai Cleanlab Apr 7, 2025 1 fact
reference: The FinQA dataset consists of complex questions from financial experts regarding public financial reports, with responses generated by OpenAI’s GPT-4o LLM.
Evaluating Evaluation Metrics — The Mirage of Hallucination ... machinelearning.apple.com Atharva Kulkarni, Yuan Zhang, Joel Ruben Antony Moniz, Xiou Ge, Bo-Hsiang Tseng, Dhivya Piraviperumal, Swabha Swayamdipta, Hong Yu · Apple Machine Learning Research 1 fact
claim: The authors of 'Evaluating Evaluation Metrics — The Mirage of Hallucination Detection' observed that LLM-based evaluation, particularly using GPT-4, yields the best overall results for hallucination detection.
Knowledge Graphs Enhance LLMs for Contextual Intelligence linkedin.com LinkedIn Mar 10, 2026 1 fact
procedure: The author's 'SKILL.md' file contains hard-coded logic that forces AI models, including Claude, GPT-4o, and local Llama 3 instances, to follow a deterministic path for entity extraction.
A Knowledge Graph-Based Hallucination Benchmark for Evaluating ... arxiv.org arXiv Feb 23, 2026 1 fact
reference: The 'GPT-4 technical report' is a cited reference regarding the GPT-4 model.
KG-IRAG: A Knowledge Graph-Based Iterative Retrieval-Augmented ... arxiv.org arXiv Mar 18, 2025 1 fact
claim: In the TrafficQA dataset, only GPT-4o successfully generated satisfactory results when numerical comparison was required.
The Synergy of Symbolic and Connectionist AI in LLM-Empowered ... arxiv.org arXiv Jul 11, 2024 1 fact
claim: Large language models, such as ChatGPT and GPT-4, demonstrate the potential of connectionist architectures to process human language as a form of symbols.
Re-evaluating Hallucination Detection in LLMs - arXiv arxiv.org arXiv Aug 13, 2025 1 fact
reference: The 'GPT-4 Technical Report' by OpenAI et al. (2024) provides technical documentation and performance details for the GPT-4 large language model, published as an arXiv preprint.
Detecting and Evaluating Medical Hallucinations in Large Vision ... arxiv.org arXiv Jun 14, 2024 1 fact
claim: When evaluating hallucination detection capabilities, GPT-4V and GPT-4O followed instructions well but incorrectly classified hallucination types in Large Vision-Language Model (LVLM) outputs, failing to recognize their errors even when prompted to explain their classifications.
Track: Poster Session 3 - aistats 2026 virtual.aistats.org Samuel Tesfazgi, Leonhard Sprandl, Sandra Hirche · AISTATS 1 fact
claim: Adversarial attacks on Large Language Models (LLMs) for time series forecasting lead to more severe performance degradation than random noise across models including LLMTime with GPT-3.5, GPT-4, LLaMa, Mistral, TimeGPT, and TimeLLM.
Combining Knowledge Graphs and Large Language Models - arXiv arxiv.org arXiv Jul 9, 2024 1 fact
claim: Multimodal Large Language Models, such as Google's Gemini and GPT-4 with vision (GPT-4V), possess vision capabilities.
Grounding LLM Reasoning with Knowledge Graphs - arXiv arxiv.org arXiv Dec 4, 2025 1 fact
claim: The researchers used GPT4Score as a model-based evaluation metric, defined as the percentage of answers that GPT-4o identifies as correct when assessing if the model's output matches the ground truth answer.
Leveraging Knowledge Graphs and LLM Reasoning to Identify ... arxiv.org arXiv Jul 23, 2025 1 fact
reference: The experimental evaluation of the LLM agent framework utilized OpenAI’s GPT-4o via Langchain QA chains, interacting with a Neo4j knowledge graph through LLM-generated Cypher queries, with configuration settings of temperature 0.0, top_p 0.95, and a 4096-token limit.
MedHallu: Benchmark for Medical LLM Hallucination Detection emergentmind.com Emergent Mind Feb 20, 2025 1 fact
claim: General-purpose LLMs like GPT-4 outperform specialized medical fine-tuned models in hallucination detection tasks when no extra context is provided.
A Comprehensive Benchmark for Detecting Medical Hallucinations ... aclanthology.org Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding · ACL Anthology 1 fact
measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with the binary hallucination detection task in MedHallu, with the best model achieving an F1 score as low as 0.625 for detecting 'hard' category hallucinations.
Benchmarking Hallucination Detection Methods in RAG - Cleanlab cleanlab.ai Cleanlab Sep 30, 2024 1 fact
procedure: RAGAS++ is a refined variant of the RAGAS technique developed by Cleanlab that uses the gpt-4o-mini LLM for generation and as a critic, replacing the default gpt-3.5-turbo-16k and gpt-4 models.
Unknown source 1 fact
claim: The research paper 'Towards the Automation of Knowledge Graph Construction Using ...' explores the semi-automatic and automatic construction of knowledge graphs using state-of-the-art large language models including Mixtral 8x22B Instruct v0.1, GPT-4o, and GPT-3.5.
A Comprehensive Benchmark and Evaluation Framework for Multi ... arxiv.org arXiv Jan 6, 2026 1 fact
reference: Nori et al. (2023) evaluated the capabilities of GPT-4 in medical reasoning in the paper titled 'Capabilities of gpt-4 in medical reasoning'.
[Literature Review] MedHallu: A Comprehensive Benchmark for ... themoonlight.io The Moonlight 1 fact
claim: The MedHallu benchmark evaluates the effectiveness of general-purpose large language models, such as GPT-4o, Qwen, and Gemma, alongside medically fine-tuned models in detecting hallucinations.
Reference Hallucination Score for Medical Artificial ... medinform.jmir.org JMIR Medical Informatics Jul 31, 2024 1 fact
reference: Giuliani C, Benadi G, Engel F, Werner J, Watter M, Schwarzer G, Groß O, Zeiser R, Binder H, and Kaier K created a 4-step cache-augmented generation approach using GPT-4o and PubTator 3.0 to identify biomedical entities for datasets in scientific articles, as published in JMIR Formative Research in 2025.