Claim
On the ELI5 benchmark, the Prometheus and TLM evaluation models detect incorrect AI responses more effectively than the other detectors tested, though no method achieves very high precision or recall.
Authors
Sources
- Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... (cleanlab.ai)