formula
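The Cleanlab benchmark evaluates hallucination detectors by AUROC, defined here as the probability that the detector's score is lower for a randomly drawn example where the LLM responded incorrectly than for a randomly drawn example where it responded correctly. Under this definition, a detector that scores every incorrect response below every correct one reaches 1.0, while a detector that scores at random lands near 0.5.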
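The pairwise definition above can be computed directly by comparing every incorrect example with every correct one. The following is a minimal sketch, not the benchmark's own code: the function name `pairwise_auroc`, the sample scores, and the convention of counting ties as half a win are all assumptions made for illustration.

```python
from itertools import product


def pairwise_auroc(scores, is_correct):
    """Fraction of (incorrect, correct) pairs where the incorrect example
    received the lower detector score. Ties count as 0.5 (an assumption;
    the source does not say how ties are handled)."""
    incorrect = [s for s, ok in zip(scores, is_correct) if not ok]
    correct = [s for s, ok in zip(scores, is_correct) if ok]
    if not incorrect or not correct:
        raise ValueError("need at least one correct and one incorrect example")

    wins = 0.0
    for bad, good in product(incorrect, correct):
        if bad < good:
            wins += 1.0
        elif bad == good:
            wins += 0.5
    return wins / (len(incorrect) * len(correct))


# Hypothetical detector scores and correctness labels for five LLM responses.
scores = [0.9, 0.4, 0.7, 0.6, 0.2]
is_correct = [True, True, False, True, False]
auroc = pairwise_auroc(scores, is_correct)
print(round(auroc, 3))
```

In this toy data, four of the six (incorrect, correct) pairs are ordered as desired, so the sketch yields 4/6. The double loop is quadratic in the number of examples; for large evaluation sets a rank-based computation gives the same value more cheaply.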
Sources
- Benchmarking Hallucination Detection Methods in RAG, Cleanlab (cleanlab.ai)