Relations (1)

related 3.91 — strongly supporting 14 facts

GPT-4 is explicitly categorized as a specific instance of Large Language Models in multiple contexts, including discussions of its architecture [1], its use in benchmarking [2], and its performance on various tasks {fact:3, fact:5, fact:10}.

Facts (14)

Sources
Medical Hallucination in Foundation Models and Their ... — medrxiv.org (medRxiv) — 2 facts
claim: Pretrained Large Language Models such as GPT-3, GPT-4, PaLM, LLaMA, and BERT have demonstrated advancements due to the extensive datasets used in their training.
measurement: The study evaluated hallucination rates and clinical risk severity for five Large Language Models: o1, gemini-2.0-flash-exp, gpt-4o, gemini-1.5-flash, and claude-3.5 sonnet.
Daily Papers - Hugging Face — huggingface.co (Hugging Face) — 1 fact
claim: In certain scenarios, the performance of the 'Think-on-Graph' (ToG) approach using smaller large language models can exceed that of large models such as GPT-4, thereby reducing the cost of LLM deployment and application.
The Synergy of Symbolic and Connectionist AI in LLM-Empowered ... — arxiv.org (arXiv) — 1 fact
claim: Large Language Models (LLMs) are transformer-based language models, including OpenAI’s GPT-4, Google’s Gemini and PaLM, Microsoft’s Phi-3, and Meta’s LLaMA.
MedHallu - GitHub — github.com (GitHub) — 1 fact
measurement: State-of-the-art Large Language Models, including GPT-4o, Llama-3.1, and UltraMedical, struggle with hard hallucination categories in the MedHallu benchmark, achieving a best F1 score of 0.625.
A Survey of Incorporating Psychological Theories in LLMs - arXiv — arxiv.org (arXiv) — 1 fact
claim: Amidei et al. (2025) found that language switching alters the traits of GPT-4o as measured by the Eysenck Personality Questionnaire Revised, highlighting challenges in maintaining stable traits and reducing context dependence in Large Language Models.
Building Better Agentic Systems with Neuro-Symbolic AI — cutter.com (Cutter Consortium) — 1 fact
claim: Deep learning neural network-based large language models, such as GPT-4, Claude, and Gemini, process unstructured data including text, images, video, and streaming sensor data to learn patterns, classify data, and make predictions.
The construction and refined extraction techniques of knowledge ... — nature.com (Nature) — 1 fact
claim: Pre-trained Large Language Models (LLMs) such as GPT-4 and LLaMA-3 combine large-scale pretraining with task-specific fine-tuning to achieve cross-task generalization.
The Synergy of Symbolic and Connectionist AI in LLM-Empowered ... — arxiv.org (arXiv) — 1 fact
claim: Large language models, such as ChatGPT and GPT-4, demonstrate the potential of connectionist architectures to process human language as a form of symbols.
Track: Poster Session 3 - AISTATS 2026 — virtual.aistats.org (Samuel Tesfazgi, Leonhard Sprandl, Sandra Hirche · AISTATS) — 1 fact
claim: Adversarial attacks on Large Language Models (LLMs) for time series forecasting lead to more severe performance degradation than random noise across models including LLMTime with GPT-3.5, GPT-4, LLaMa, Mistral, TimeGPT, and TimeLLM.
MedHallu: Benchmark for Medical LLM Hallucination Detection — emergentmind.com (Emergent Mind) — 1 fact
claim: General-purpose LLMs such as GPT-4 outperform specialized medical fine-tuned models in hallucination detection tasks when no extra context is provided.
A Comprehensive Benchmark for Detecting Medical Hallucinations ... — aclanthology.org (Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, Ying Ding · ACL Anthology) — 1 fact
measurement: State-of-the-art large language models, including GPT-4o, Llama-3.1, and the medically fine-tuned UltraMedical, struggle with the binary hallucination detection task in MedHallu, with the best model achieving an F1 score as low as 0.625 for detecting 'hard' category hallucinations.
Unknown source — 1 fact
claim: The research paper 'Towards the Automation of Knowledge Graph Construction Using ...' explores the semi-automatic and automatic construction of knowledge graphs using state-of-the-art large language models including Mixtral 8x22B Instruct v0.1, GPT-4o, and GPT-3.5.
[Literature Review] MedHallu: A Comprehensive Benchmark for ... — themoonlight.io (The Moonlight) — 1 fact
claim: The MedHallu benchmark evaluates the effectiveness of general-purpose large language models, such as GPT-4o, Qwen, and Gemma, alongside medically fine-tuned models in detecting hallucinations.