Relations (1)

related (2.32), strongly supporting 4 facts

The paper 'The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs' uses 'LLM-as-a-judge' as its primary evaluation method, comparing it against ROUGE [1] and analyzing its effectiveness at detecting hallucinations [2]. The study validates that 'LLM-as-a-judge' aligns more closely with human assessments [3], while also critically examining the approach's inherent biases and limitations [4].

Facts (4)

Sources
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs (arXiv, 4 facts)
claim: The authors of 'Re-evaluating Hallucination Detection in LLMs' state that while LLM-as-Judge is more robust than ROUGE for human-aligned evaluation, it is not without its own biases and limitations.
claim: The authors of 'Re-evaluating Hallucination Detection in LLMs' found that while ROUGE exhibits high precision, it fails to detect many hallucinations, whereas the LLM-as-Judge method achieves significantly higher recall and aligns more closely with human assessments.
claim: The 'LLM-as-Judge' evaluation method offers a closer alignment with human judgments of factual correctness compared to ROUGE, as validated by the human study conducted by the authors of 'Re-evaluating Hallucination Detection in LLMs'.
procedure: To evaluate hallucination detection, the authors of 'Re-evaluating Hallucination Detection in LLMs' randomly selected 200 question–answer pairs from Mistral model outputs on the NQ-Open dataset, ensuring a balanced representation of cases where ROUGE and LLM-as-Judge yield conflicting assessments.
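The precision/recall contrast described above (ROUGE: high precision, low recall; LLM-as-Judge: higher recall) can be made concrete with a small metric helper. This is an illustrative sketch, not the paper's code; the `preds`/`labels` boolean lists (detector verdicts vs. human hallucination labels) are assumptions.

```python
def precision_recall(preds, labels):
    """Compute precision and recall of a hallucination detector.

    preds  -- list of bool: detector flagged the answer as hallucinated
    labels -- list of bool: human annotators judged it hallucinated
    """
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum((not p) and l for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

A detector that flags only one of three true hallucinations but never a correct answer would score perfect precision with low recall, mirroring the ROUGE pattern the authors report.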
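The balanced-sampling step in the procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: each QA pair is a dict with hypothetical `rouge_ok` and `judge_ok` flags (the two detectors' verdicts), and "balanced" is taken to mean half agreeing, half conflicting cases.

```python
import random

def balanced_sample(pairs, n=200, seed=0):
    """Sample n QA pairs so that detector-agreement and detector-conflict
    cases are equally represented (as far as the data allows)."""
    rng = random.Random(seed)
    agree = [p for p in pairs if p["rouge_ok"] == p["judge_ok"]]
    conflict = [p for p in pairs if p["rouge_ok"] != p["judge_ok"]]
    half = n // 2
    sample = (rng.sample(agree, min(half, len(agree)))
              + rng.sample(conflict, min(n - half, len(conflict))))
    rng.shuffle(sample)  # avoid ordering by stratum
    return sample
```

Fixing the random seed keeps the 200-pair subset reproducible across evaluation runs, which matters when the same subset is later shown to human annotators.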