Relations (1)
related (0.50) — strongly supported by 5 facts
GPT-4 is frequently used as a benchmark or primary tool for hallucination detection: it yields the best overall results in general evaluations [1], [2] and serves in automated annotation pipelines [3], despite documented misclassification failures in specific scenarios [4], [5].
Facts (5)
Sources
Evaluating Evaluation Metrics — The Mirage of Hallucination ... (machinelearning.apple.com, 1 fact)
claim: The authors of 'Evaluating Evaluation Metrics — The Mirage of Hallucination Detection' observed that LLM-based evaluation, particularly with GPT-4, yields the best overall results for hallucination detection.
Detecting hallucinations with LLM-as-a-judge: Prompt ... (datadoghq.com, 1 fact)
claim: The Datadog hallucination detection method was compared against two baselines: the open-source Lynx (8B) model from Patronus AI, and the same Patronus AI prompt evaluated on GPT-4o.
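The LLM-as-a-judge setup referenced above can be sketched minimally. This is an illustrative assumption, not the actual Patronus AI prompt or Datadog's implementation: the judge backend is pluggable, and `keyword_judge` is a stand-in stub so the sketch runs without a model call.

```python
# Hedged sketch of LLM-as-a-judge hallucination detection.
# JUDGE_PROMPT and keyword_judge are illustrative assumptions, not the
# actual Patronus AI prompt or any vendor's implementation.
from typing import Callable

JUDGE_PROMPT = (
    "Given the reference document and the model answer, reply PASS if the "
    "answer is fully supported by the document, otherwise reply FAIL.\n\n"
    "Document: {document}\nAnswer: {answer}\nVerdict:"
)

def detect_hallucination(document: str, answer: str,
                         judge: Callable[[str], str]) -> bool:
    """Return True if the judge flags the answer as hallucinated."""
    verdict = judge(JUDGE_PROMPT.format(document=document, answer=answer))
    return verdict.strip().upper().startswith("FAIL")

def keyword_judge(prompt: str) -> str:
    # Stand-in for a real judge model (e.g. GPT-4o or Lynx 8B): it merely
    # checks whether every answer token appears in the document.
    document = prompt.split("Document: ")[1].split("\nAnswer: ")[0]
    answer = prompt.split("\nAnswer: ")[1].split("\nVerdict:")[0]
    supported = all(tok.lower() in document.lower() for tok in answer.split())
    return "PASS" if supported else "FAIL"

print(detect_hallucination("Paris is the capital of France.",
                           "Paris is in Germany.", keyword_judge))  # True
```

In practice `judge` would wrap a call to a hosted model; the PASS/FAIL parsing stays the same regardless of backend.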
Detecting and Evaluating Medical Hallucinations in Large Vision ... (arxiv.org, 1 fact)
claim: When their hallucination detection capabilities were evaluated, GPT-4V and GPT-4o followed instructions well but incorrectly classified hallucination types in Large Vision-Language Model (LVLM) outputs, failing to recognize their errors even when prompted to explain their classifications.
EdinburghNLP/awesome-hallucination-detection (github.com, 1 fact)
reference: PsiloQA is a large-scale dataset for multilingual span-level hallucination detection that covers 14 languages and is created through an automated three-stage pipeline: QA generation, hallucinated-answer elicitation, and GPT-4o-based span annotation.
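The three-stage pipeline named above (QA generation, hallucinated-answer elicitation, span annotation) can be sketched as a chain of stage functions. All three stage bodies here are illustrative stubs under stated assumptions; in the real PsiloQA pipeline each stage is driven by an LLM, with GPT-4o doing the span annotation.

```python
# Hedged sketch of a PsiloQA-style three-stage pipeline. Stage bodies are
# stubs for illustration; the real pipeline uses LLMs at each stage.
from dataclasses import dataclass, field

@dataclass
class Example:
    question: str
    gold_answer: str
    hallucinated_answer: str = ""
    hallucinated_spans: list = field(default_factory=list)

def generate_qa(passage: str) -> Example:
    # Stage 1: derive a QA pair from a source passage (stub).
    return Example(question=f"What does the passage state? [{passage}]",
                   gold_answer=passage)

def elicit_hallucination(ex: Example) -> Example:
    # Stage 2: have a model answer without the passage so it fabricates
    # content (stub: append an unsupported clause).
    ex.hallucinated_answer = ex.gold_answer + " It was discovered in 1492."
    return ex

def annotate_spans(ex: Example) -> Example:
    # Stage 3: mark character spans unsupported by the gold answer
    # (done with GPT-4o in the real pipeline; stub: tail diff).
    extra = ex.hallucinated_answer[len(ex.gold_answer):].strip()
    if extra:
        start = ex.hallucinated_answer.index(extra)
        ex.hallucinated_spans.append((start, start + len(extra)))
    return ex

ex = annotate_spans(elicit_hallucination(generate_qa("Water boils at 100 C.")))
print(ex.hallucinated_spans)
```

Span-level output (character offsets rather than a binary verdict) is what distinguishes this kind of dataset from sentence-level hallucination benchmarks.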
MedHallu: Benchmark for Medical LLM Hallucination Detection (emergentmind.com, 1 fact)
claim: General-purpose LLMs like GPT-4 outperform specialized medical fine-tuned models in hallucination detection tasks when no extra context is provided.