claim
Models with higher Prompt Sensitivity (PS) and Model Variance (MV) scores generally performed worse on factuality benchmarks such as TruthfulQA (Lin et al., 2022) and HallucinationEval (Wu et al., 2023), whereas models with low MV, such as GPT-4, achieved better TruthfulQA scores.
Authors
Sources
- Survey and analysis of hallucinations in large language models (www.frontiersin.org)
Referenced by nodes (2)
- TruthfulQA concept
- Prompt Sensitivity concept