measurement
The paper "Survey and analysis of hallucinations in large language models" reports the following Prompt Sensitivity (PS) and Model Variability (MV) scores: LLaMA 2 (13B) (PS: 0.091, MV: 0.045), Mistral 7B (PS: 0.078, MV: 0.053), DeepSeek 67B (PS: 0.060, MV: 0.080), OpenChat-3.5 (PS: 0.083, MV: 0.062), and Qwen (PS: 0.079, MV: 0.057).
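For readers who want to compare these figures programmatically, the following is a minimal sketch that stores the reported values and ranks the models by each score. The names `reported_scores` and `larger_score` are hypothetical helpers introduced here, not part of the paper. Comparing PS against MV directly assumes the two scores are on a comparable scale, which this entry does not state.

```python
# Scores as reported in the entry above: (Prompt Sensitivity, Model Variability).
# The dictionary and helper names are illustrative, not taken from the paper.
reported_scores = {
    "LLaMA 2 (13B)": (0.091, 0.045),
    "Mistral 7B": (0.078, 0.053),
    "DeepSeek 67B": (0.060, 0.080),
    "OpenChat-3.5": (0.083, 0.062),
    "Qwen": (0.079, 0.057),  # assumed to refer to the Qwen model family
}


def larger_score(ps: float, mv: float) -> str:
    """Return which of the two reported scores is larger for a model."""
    if ps > mv:
        return "PS"
    if mv > ps:
        return "MV"
    return "tie"


# Rank models from highest to lowest on each score.
by_ps = sorted(reported_scores, key=lambda m: reported_scores[m][0], reverse=True)
by_mv = sorted(reported_scores, key=lambda m: reported_scores[m][1], reverse=True)

if __name__ == "__main__":
    for model, (ps, mv) in reported_scores.items():
        print(f"{model:15s} PS={ps:.3f} MV={mv:.3f} larger={larger_score(ps, mv)}")
    print("Ranked by PS:", by_ps)
    print("Ranked by MV:", by_mv)
```

On the reported numbers, DeepSeek 67B is the only model whose MV exceeds its PS. What that difference means depends on how the paper defines the two scores.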
Authors
Sources
- Survey and analysis of hallucinations in large language models (www.frontiersin.org)