Perplexity
Also known as: Perplexity AI
Facts (11)
Sources
Re-evaluating Hallucination Detection in LLMs - arXiv (arxiv.org, Aug 13, 2025): 5 facts
reference: Uncertainty-based methods for hallucination detection in large language models include Perplexity (Ren et al., 2023), Length-Normalized Entropy (LN-Entropy) (Malinin and Gales, 2021), and Semantic Entropy (SemEntropy) (Farquhar et al., 2024), which use multiple generations to capture sequence-level uncertainty (see the code sketch after this source's facts).
measurement: The Mistral model degrades markedly in zero-shot settings, with clear drops in the Perplexity metric, whereas the Llama model stays more consistent with minimal degradation.
claim: Semantic Entropy maintains the most consistent performance across both zero-shot and few-shot settings, while traditional metrics like Perplexity and LN-Entropy show higher sensitivity to setting changes.
measurement: Existing hallucination detection methods experience performance drops of up to 45.9% for Perplexity and 30.4% for EigenScore when evaluated using LLM-as-Judge criteria compared to ROUGE.
measurement: The Perplexity hallucination detection method sees its AUROC fall by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
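To make the first two uncertainty measures above concrete, here is a minimal Python sketch, assuming per-token log-probabilities are available for each sampled generation; the function names and sample values are illustrative, not from the paper. Semantic Entropy additionally requires clustering generations by meaning, which is omitted here.

```python
import math

def sequence_perplexity(token_logprobs):
    """Perplexity of a single generation from its per-token natural-log
    probabilities: exp of the average negative log-likelihood.
    Higher perplexity means the model was less confident."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

def length_normalized_entropy(generations_logprobs):
    """LN-Entropy in the spirit of Malinin and Gales (2021): average the
    length-normalized negative log-likelihood over multiple sampled
    generations, so longer sequences do not dominate the score."""
    per_sequence_nll = [-sum(lp) / len(lp) for lp in generations_logprobs]
    return sum(per_sequence_nll) / len(per_sequence_nll)

# Hypothetical log-probs for three sampled answers to the same prompt.
samples = [
    [-0.2, -0.1, -0.3],
    [-1.2, -0.9, -1.5, -0.8],
    [-0.4, -0.6],
]
print(sequence_perplexity(samples[0]))     # confidence of one answer
print(length_normalized_entropy(samples))  # uncertainty across answers
```

In both cases a higher score is read as greater sequence-level uncertainty and therefore a higher presumed risk of hallucination.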
EdinburghNLP/awesome-hallucination-detection - GitHub (github.com): 3 facts
claim: Metrics used for hallucination detection include SelfCheckGPT, FactScore, EigenScore, Efficient EigenScore (EES), Semantic Entropy, Perplexity, HaluEval Accuracy, and ROUGE-1 (XSum).
measurement: Established hallucination detection methods, including Perplexity, EigenScore, and eRank, suffer AUROC drops of up to 45.9% when evaluated with human-aligned LLM-as-Judge metrics instead of ROUGE (see the AUROC sketch below).
measurement: Evaluation of generation tasks uses Perplexity, Unigram Overlap (F1), BLEU-4, ROUGE-L, Knowledge F1, and Rare F1 as metrics, drawing on datasets including WoW and CMU Document Grounded Conversations (CMU_DoG), with the KiLT Wikipedia dump as the knowledge source.
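The AUROC drops reported above stem from re-labeling the same outputs: the detector's scores stay fixed, but the binary "hallucination" labels change with the evaluation criterion (ROUGE overlap versus an LLM-as-Judge verdict). A minimal sketch, assuming scikit-learn is installed; every number below is invented for illustration, not data from the cited work.

```python
from sklearn.metrics import roc_auc_score

# Detector scores (here, perplexity): higher means more uncertain.
perplexity_scores = [1.2, 8.5, 2.0, 9.1, 1.5, 7.7]

# Two labelings of the same six outputs (1 = hallucination).
labels_rouge = [0, 1, 0, 1, 0, 1]  # from a ROUGE-overlap threshold
labels_judge = [0, 1, 1, 1, 0, 0]  # from an LLM-as-Judge verdict

print("AUROC vs ROUGE labels:", roc_auc_score(labels_rouge, perplexity_scores))
print("AUROC vs judge labels:", roc_auc_score(labels_judge, perplexity_scores))
```

When the two criteria disagree on which outputs count as hallucinations, the same detector can score very differently, which is the mechanism behind the reported 45.9% swing.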
Hallucination Causes: Why Language Models Fabricate Facts (mbrenndoerfer.com, Mar 15, 2026): 1 fact
claim: Progress in large language model capabilities, such as perplexity or instruction-following quality, does not automatically translate into progress in hallucination reduction.
Medical Hallucination in Foundation Models and Their ... (medrxiv.org, Mar 3, 2025): 1 fact
measurement: The AI/LLM tools most commonly mentioned by survey respondents were ChatGPT (30 mentions), followed by Claude (20), Google Bard/Gemini (16), Llama (15), Perplexity (9), AlphaFold (2), and Scite and Consensus (1).
Reference Hallucination Score for Medical Artificial ... (medinform.jmir.org, Jul 31, 2024): 1 fact
reference: Wahid R, Craven C, Romanoff D, Kapralos B, and Chandross D authored the paper 'Exploring the Utilization of Perplexity AI for Academic Information Retrieval with Valid References Sourcing: A Study on Bina Nusantara Students', presented at the 2025 16th International Conference on Information, Intelligence, Systems & Applications (IISA).