Zero-Shot
Facts (13)
Sources
Re-evaluating Hallucination Detection in LLMs arxiv.org Aug 13, 2025 5 facts
claim: For the Llama model, the performance discrepancy between ROUGE and LLM-as-Judge evaluation narrows significantly under few-shot prompting compared to the zero-shot setting.
claim: Alternative metrics such as BERTScore, BLEU, and UniEval-fact exhibit substantial shortcomings in reliably detecting hallucinations in question-answering tasks, particularly under zero-shot conditions.
measurement: The Mistral model degrades markedly in zero-shot settings, with clear drops in Perplexity-based scores, whereas the Llama model stays comparatively stable with minimal degradation.
claim: Few-shot settings consistently yield more reliable evaluations across metrics than zero-shot settings.
claim: Semantic Entropy is the most consistent metric across both zero-shot and few-shot settings, while traditional metrics like Perplexity and LN-Entropy are more sensitive to the change of setting.
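The Perplexity and LN-Entropy scores referenced above can be sketched from per-token log-probabilities; the function names and example numbers below are illustrative assumptions, not the paper's implementation.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of one generated answer from its per-token log-probs.
    Higher values mean the model was less certain of its own output,
    which these metrics treat as a hallucination signal."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def ln_entropy(sampled_logprobs):
    """Length-normalized predictive entropy: average the per-sample
    mean negative log-likelihood over several sampled answers."""
    per_sample = [-sum(lp) / len(lp) for lp in sampled_logprobs]
    return sum(per_sample) / len(per_sample)

# A confidently generated answer scores lower than an uncertain one.
confident = perplexity([-0.05, -0.10, -0.02])
uncertain = perplexity([-1.8, -2.4, -1.1])
```

Semantic Entropy goes one step further by clustering sampled answers into meaning-equivalent groups before computing entropy, which is why it is less sensitive to surface-level wording changes between zero-shot and few-shot prompts.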
Survey and analysis of hallucinations in large language models frontiersin.org Sep 29, 2025 3 facts
claim: Hallucination scores change little across prompting techniques (zero-shot, few-shot, CoT, instruction) because the prompt variants are semantically equivalent and decoding is low-entropy, so outputs are dominated by the models' learned alignment policies.
procedure: The experimental pipeline evaluates hallucinations in open-source LLMs by combining benchmark datasets, varied prompt strategies (zero-shot, few-shot, CoT), and text generation via HuggingFace.
procedure: The prompt engineering protocol covers five categories: zero-shot (basic instruction), few-shot (2-3 input-output examples), instruction (structured natural language), chain-of-thought (step-by-step reasoning), and vague/misleading (intentionally unclear).
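The five prompt categories in that protocol can be sketched as simple template builders; the exact templates and wording below are assumptions, not the study's prompts.

```python
def zero_shot(q):
    # Basic instruction, no examples
    return f"Answer the question.\nQ: {q}\nA:"

def few_shot(q, examples):
    # 2-3 input-output examples prepended before the target question
    shots = "\n".join(f"Q: {eq}\nA: {ea}" for eq, ea in examples)
    return f"{shots}\nQ: {q}\nA:"

def instruction(q):
    # Structured natural-language instruction
    return f"You are a careful assistant. Give a concise, factual answer.\nQuestion: {q}\nAnswer:"

def chain_of_thought(q):
    # Elicit step-by-step reasoning before the final answer
    return f"Q: {q}\nA: Let's think step by step."

def vague(q):
    # Intentionally unclear framing, used to probe robustness
    return f"Someone mentioned something about this: {q} Thoughts?"
```

The survey's claim above predicts that the first four builders, being semantically equivalent requests, produce similar hallucination scores, while only the vague/misleading category meaningfully shifts behavior.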
Grounding LLM Reasoning with Knowledge Graphs arxiv.org Dec 4, 2025 1 fact
reference: The baseline methods compared are (1) Zero-Shot, querying the model without additional context; (2) Text RAG, using text representations of nodes as input; (3) Graph RAG, which also includes 1-hop node neighbors; and (4) Graph CoT (Agent), implementing Graph CoT as an agent for reasoning.
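The difference between those baselines is essentially what context gets packed into the prompt. A minimal sketch, where the graph representation and prompt wording are assumptions for illustration:

```python
def zero_shot_prompt(question):
    # (1) Query the model with no additional context
    return question

def text_rag_prompt(question, node_texts):
    # (2) Prepend the text representation of retrieved nodes
    return "Context:\n" + "\n".join(node_texts) + f"\n\nQuestion: {question}"

def graph_rag_prompt(question, graph, node_id):
    # (3) Also include the 1-hop neighbors of the retrieved node
    texts = [graph[node_id]["text"]]
    texts += [graph[n]["text"] for n in graph[node_id]["neighbors"]]
    return "Context:\n" + "\n".join(texts) + f"\n\nQuestion: {question}"

# (4) Graph CoT wraps the same graph access in an agent loop that
# interleaves reasoning steps with neighbor lookups (not shown here).
```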
LLM Hallucination Detection and Mitigation: State of the Art in 2026 zylos.ai Jan 27, 2026 1 fact
measurement: Chain-of-Verification (CoVe) improves F1 by 23% (from 0.39 to 0.48) and outperforms Zero-Shot, Few-Shot, and Chain-of-Thought prompting, though it does not eliminate hallucinations in complex reasoning chains.
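The 23% figure is a relative gain, which checks out against the raw scores: (0.48 - 0.39) / 0.39 ≈ 0.231. A quick verification, with F1 shown from precision/recall for context (inputs to f1 other than the reported scores are illustrative):

```python
def f1(precision, recall):
    # F1 is the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

baseline_f1 = 0.39   # best prompting baseline reported above
cove_f1 = 0.48       # Chain-of-Verification

relative_gain = (cove_f1 - baseline_f1) / baseline_f1  # ~0.231, i.e. ~23%
```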
Medical Hallucination in Foundation Models and Their ... medrxiv.org Mar 3, 2025 1 fact
procedure: The 'Base' method queries the models directly with questions from the Med-HALT benchmark, without additional context or instructions, to assess inherent hallucination tendencies in a zero-shot setting.
Track: Poster Session 3 - aistats 2026 virtual.aistats.org 1 fact
formula: In the zero-shot case, where no regression data is available, task shift from classification to regression is impossible in both the sparse-signal and random-signal models for any Gaussian covariate distribution.
A Survey on the Theory and Mechanism of Large Language Models arxiv.org Mar 12, 2026 1 fact
reference: The paper "Revisiting chain-of-thought prompting: zero-shot can be stronger than few-shot" is an arXiv preprint, arXiv:2506.14641.