measurement
Several established hallucination detection methods for Large Language Models exhibit performance drops of up to 45.9% when evaluated using human-aligned metrics such as LLM-as-a-Judge.
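Such a drop can be made concrete with a toy calculation. The sketch below is purely illustrative (the detector outputs, label sets, and the size of the resulting gap are hypothetical, not taken from the cited evaluation): it scores the same detector against labels from a surface-level automatic metric and against labels from a human-aligned LLM judge, then computes the relative performance drop.

```python
def accuracy(preds, labels):
    """Fraction of predictions that match the reference labels."""
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)

# Hypothetical binary hallucination flags (1 = hallucinated) on ten outputs.
detector       = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # detection method's verdicts
lexical_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # automatic-metric labels (detector looks perfect)
judge_labels   = [1, 0, 0, 1, 1, 0, 0, 1, 1, 0]  # human-aligned judge labels (frequent disagreement)

acc_lexical = accuracy(detector, lexical_labels)
acc_judge = accuracy(detector, judge_labels)

# Relative drop when switching to the human-aligned evaluation.
drop = (acc_lexical - acc_judge) / acc_lexical
print(f"lexical: {acc_lexical:.2f}, judge: {acc_judge:.2f}, drop: {drop:.1%}")
```

The point of the sketch is that the detector itself is unchanged; only the reference labels differ, which is exactly the kind of metric-induced gap the 45.9% figure describes.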
