Relations (1)
cross_type 2.00 — strongly supporting 3 facts
Mistral AI's models serve as the subject of evaluation in studies comparing ROUGE and LLM-as-a-judge metrics [1], with the LLM-as-a-judge framework used as the reference against which performance erosion is measured for hallucination detection methods such as Perplexity [2] and Eigenscore [3].
Facts (3)
Sources
Re-evaluating Hallucination Detection in LLMs (arXiv, arxiv.org) — 3 facts
measurement: The Eigenscore hallucination detection method experiences a performance erosion of 19.0% for the Llama model and 30.4% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
procedure: To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question–answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
measurement: The Perplexity hallucination detection method sees its AUROC score decrease by as much as 45.9% for the Mistral model on the NQ-Open dataset when switching from ROUGE to LLM-as-Judge evaluation.
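The erosion figures above can be read as relative drops in AUROC. A minimal sketch of that arithmetic, assuming the reported percentages are relative (not absolute) decreases; the function name and the AUROC values are illustrative, not taken from the paper:

```python
def relative_erosion(auroc_before: float, auroc_after: float) -> float:
    """Percent drop in AUROC relative to the original score."""
    return (auroc_before - auroc_after) / auroc_before * 100.0

# Hypothetical AUROC values (NOT from the paper), chosen so the relative
# drop matches the reported 45.9% erosion for Perplexity on Mistral:
rouge_auroc = 0.80
judge_auroc = rouge_auroc * (1 - 0.459)  # ~0.433 under LLM-as-Judge

print(f"{relative_erosion(rouge_auroc, judge_auroc):.1f}%")  # → 45.9%
```

Under this reading, a 45.9% erosion from an AUROC of 0.80 would leave the detector near 0.43, i.e. worse than a random classifier (0.5).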