claim
Models with higher Prompt Sensitivity (PS) and Model Variance (MV) scores generally performed worse on factuality benchmarks such as TruthfulQA (Lin et al., 2022) and HallucinationEval (Wu et al., 2023), whereas low-MV models such as GPT-4 achieved better TruthfulQA scores.
