Claim
On the PubMedQA benchmark, the Prometheus and TLM evaluation models detect incorrect AI responses with the highest precision and recall, effectively catching hallucinations.
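
For readers unfamiliar with the metrics named in the claim, the following is a minimal sketch of how precision and recall might be computed for a detector of incorrect responses. It is illustrative only: the function name `precision_recall`, the scores, the labels, and the 0.5 threshold are all hypothetical, and it assumes the evaluation model emits a per-response score where lower values indicate a less trustworthy answer. It does not reproduce how Prometheus or TLM work, nor any figure from the cited benchmark.

```python
def precision_recall(scores, is_incorrect, threshold):
    """Compute precision and recall for flagging incorrect responses.

    Assumption: a response is flagged as incorrect when its score
    falls below `threshold` (lower score = less trustworthy).
    """
    flagged = [s < threshold for s in scores]
    pairs = list(zip(flagged, is_incorrect))

    tp = sum(1 for f, y in pairs if f and y)        # flagged and truly incorrect
    fp = sum(1 for f, y in pairs if f and not y)    # flagged but actually correct
    fn = sum(1 for f, y in pairs if not f and y)    # incorrect but not flagged

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical scores and ground-truth labels, not benchmark data.
scores = [0.91, 0.15, 0.78, 0.62, 0.41, 0.08]
is_incorrect = [False, True, False, True, False, True]

precision, recall = precision_recall(scores, is_incorrect, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision here is the share of flagged responses that really were incorrect, and recall is the share of incorrect responses that were flagged. A model described as having both high precision and high recall misses few hallucinations while raising few false alarms.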
Authors
Sources
- Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... (cleanlab.ai)