Relations (1)

cross_type 2.00 — strongly supporting 3 facts

Mistral AI provides the model outputs used to evaluate hallucination detection methods in the study [1], and its performance is measured specifically to assess the efficacy of detection techniques such as Perplexity [2] and Eigenscore [3].
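Perplexity-based detection, one of the techniques named above, scores an answer by the model's own confidence in the tokens it generated. A minimal sketch, assuming per-token log-probabilities are available (the function name and thresholding usage are illustrative, not from the study):

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a generated answer from its per-token log-probabilities.

    Higher perplexity (the model was less confident in its own output) is
    treated as a hallucination signal; answers are flagged by thresholding
    this score.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A confident answer (high token probabilities) scores low ...
confident = perplexity([math.log(0.9)] * 5)
# ... while an uncertain answer scores high.
uncertain = perplexity([math.log(0.2)] * 5)
```

For constant token probability p, the score reduces to 1/p, so the confident answer here scores 1/0.9 and the uncertain one 1/0.2.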

Facts (3)

Sources
Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org) — 3 facts
measurement — The Eigenscore hallucination detection method experiences a performance erosion of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
procedure — To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question–answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
measurement — The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
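The selection step in the procedure fact — sampling 200 QA pairs while balancing cases where ROUGE and LLM-as-Judge agree or conflict — can be sketched as stratified sampling over the verdict combinations. The field names and bucket scheme below are assumptions for illustration, not the paper's exact implementation:

```python
import random

def balanced_sample(pairs, n=200, seed=0):
    """Sample n QA pairs, balancing the (ROUGE, LLM-as-Judge) verdict
    combinations so that agreement and conflict cases are equally
    represented.

    Each pair is assumed to be a dict with boolean fields
    'rouge_correct' and 'judge_correct' (hypothetical names).
    """
    rng = random.Random(seed)
    # Bucket pairs by the combination of the two evaluators' verdicts.
    buckets = {}
    for p in pairs:
        key = (p["rouge_correct"], p["judge_correct"])
        buckets.setdefault(key, []).append(p)
    # Draw an equal share from each bucket, capped by bucket size.
    per_bucket = n // len(buckets)
    sample = []
    for members in buckets.values():
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample
```

With four verdict combinations, each contributes n/4 pairs when its bucket is large enough; smaller buckets are taken whole.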