Relations (1)
related (0.50) — strongly supported by 5 facts
GPT-4 is frequently used as a benchmark or primary tool for hallucination detection: it yields the best overall results in general evaluations [1], [2] and serves in automated annotation pipelines [3], despite documented misclassification failures in specific scenarios [4], [5].
Facts (5)
Sources
Evaluating Evaluation Metrics — The Mirage of Hallucination ... (machinelearning.apple.com, 1 fact)
claim: The authors of 'Evaluating Evaluation Metrics — The Mirage of Hallucination Detection' observed that LLM-based evaluation, particularly with GPT-4, yields the best overall results for hallucination detection.
Detecting hallucinations with LLM-as-a-judge: Prompt ... (datadoghq.com, 1 fact)
claim: The Datadog hallucination detection method was compared against two baselines: the open-source Lynx (8B) model from Patronus AI, and the same Patronus AI prompt evaluated on GPT-4o.
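The LLM-as-a-judge setup referenced above can be sketched minimally. This is an illustrative assumption, not the actual Patronus AI prompt or Datadog's implementation: the judge backend is pluggable, and `keyword_judge` is a stand-in stub so the sketch runs without a model call.

```python
# Hedged sketch of LLM-as-a-judge hallucination detection.
# JUDGE_PROMPT and keyword_judge are illustrative assumptions, not the
# actual Patronus AI prompt or any vendor's implementation.
from typing import Callable

JUDGE_PROMPT = (
    "Given the reference document and the model answer, reply PASS if the "
    "answer is fully supported by the document, otherwise reply FAIL.\n\n"
    "Document: {document}\nAnswer: {answer}\nVerdict:"
)

def detect_hallucination(document: str, answer: str,
                         judge: Callable[[str], str]) -> bool:
    """Return True if the judge flags the answer as hallucinated."""
    verdict = judge(JUDGE_PROMPT.format(document=document, answer=answer))
    return verdict.strip().upper().startswith("FAIL")

def keyword_judge(prompt: str) -> str:
    # Stand-in for a real judge model (e.g. GPT-4o or Lynx 8B): it merely
    # checks whether every answer token appears in the document.
    document = prompt.split("Document: ")[1].split("\nAnswer: ")[0]
    answer = prompt.split("\nAnswer: ")[1].split("\nVerdict:")[0]
    supported = all(tok.lower() in document.lower() for tok in answer.split())
    return "PASS" if supported else "FAIL"

print(detect_hallucination("Paris is the capital of France.",
                           "Paris is in Germany.", keyword_judge))  # True
```

In practice `judge` would wrap a call to a hosted model; the PASS/FAIL parsing stays the same regardless of backend.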
Detecting and Evaluating Medical Hallucinations in Large Vision ... (arxiv.org, 1 fact)
claim: When their hallucination detection capabilities were evaluated, GPT-4V and GPT-4o followed instructions well but incorrectly classified hallucination types in Large Vision-Language Model (LVLM) outputs, failing to recognize their errors even when prompted to explain their classifications.
EdinburghNLP/awesome-hallucination-detection (github.com, 1 fact)
reference: PsiloQA is a large-scale dataset for multilingual span-level hallucination detection that covers 14 languages and is created through an automated three-stage pipeline: QA generation, hallucinated-answer elicitation, and GPT-4o-based span annotation.
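The three-stage pipeline named above (QA generation, hallucinated-answer elicitation, span annotation) can be sketched as a chain of stage functions. All three stage bodies here are illustrative stubs under stated assumptions; in the real PsiloQA pipeline each stage is driven by an LLM, with GPT-4o doing the span annotation.

```python
# Hedged sketch of a PsiloQA-style three-stage pipeline. Stage bodies are
# stubs for illustration; the real pipeline uses LLMs at each stage.
from dataclasses import dataclass, field

@dataclass
class Example:
    question: str
    gold_answer: str
    hallucinated_answer: str = ""
    hallucinated_spans: list = field(default_factory=list)

def generate_qa(passage: str) -> Example:
    # Stage 1: derive a QA pair from a source passage (stub).
    return Example(question=f"What does the passage state? [{passage}]",
                   gold_answer=passage)

def elicit_hallucination(ex: Example) -> Example:
    # Stage 2: have a model answer without the passage so it fabricates
    # content (stub: append an unsupported clause).
    ex.hallucinated_answer = ex.gold_answer + " It was discovered in 1492."
    return ex

def annotate_spans(ex: Example) -> Example:
    # Stage 3: mark character spans unsupported by the gold answer
    # (done with GPT-4o in the real pipeline; stub: tail diff).
    extra = ex.hallucinated_answer[len(ex.gold_answer):].strip()
    if extra:
        start = ex.hallucinated_answer.index(extra)
        ex.hallucinated_spans.append((start, start + len(extra)))
    return ex

ex = annotate_spans(elicit_hallucination(generate_qa("Water boils at 100 C.")))
print(ex.hallucinated_spans)
```

Span-level output (character offsets rather than a binary verdict) is what distinguishes this kind of dataset from sentence-level hallucination benchmarks.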
MedHallu: Benchmark for Medical LLM Hallucination Detection (emergentmind.com, 1 fact)
claim: General-purpose LLMs like GPT-4 outperform specialized medical fine-tuned models in hallucination detection tasks when no extra context is provided.