Relations (1)
cross_type 2.00 — strongly supporting 3 facts
Mistral AI's models serve as the subject of evaluation in studies comparing ROUGE and LLM-as-a-judge metrics [1], with the LLM-as-a-judge framework used as the reference against which performance erosion is measured for hallucination detection methods such as Perplexity [2] and Eigenscore [3].
Facts (3)
Sources
Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org) — 3 facts
measurement: The Eigenscore hallucination detection method experiences a performance erosion of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
procedure: To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question–answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
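The erosion figures above can be read as relative drops in AUROC. A minimal sketch of that arithmetic, assuming the reported percentages are relative (not absolute) decreases; the function name and the AUROC values are illustrative, not taken from the paper:

```python
def relative_erosion(auroc_before: float, auroc_after: float) -> float:
    """Percent drop in AUROC relative to the original score."""
    return (auroc_before - auroc_after) / auroc_before * 100.0

# Hypothetical AUROC values (NOT from the paper), chosen so the relative
# drop matches the reported 45.9% erosion for Perplexity on Mistral:
rouge_auroc = 0.80
judge_auroc = rouge_auroc * (1 - 0.459)  # ~0.433 under LLM-as-Judge

print(f"{relative_erosion(rouge_auroc, judge_auroc):.1f}%")  # → 45.9%
```

Under this reading, a 45.9% erosion from an AUROC of 0.80 would leave the detector near 0.43, i.e. worse than a random classifier (0.5).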