measurement
On the NQ-Open dataset, the measured performance of the Eigenscore hallucination detection method drops by 19.0% for the Llama model and by 30.4% for the Mistral model when evaluation switches from ROUGE to LLM-as-a-judge.
Sources
- Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org)
Referenced by nodes (6)
- NQ-Open concept
- Mistral AI entity
- Eigenscore concept
- hallucination detection concept
- ROUGE concept
- LLM-as-a-judge concept