claim
On the DROP benchmark, the TLM evaluation model detects incorrect AI responses with the highest precision and recall among the evaluation models compared, followed by LLM-as-a-judge; none of the other evaluation models appears very useful.
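To make the metric in this claim concrete, here is a minimal sketch of how precision and recall for detecting incorrect responses could be computed from an evaluation model's scores. It assumes each response gets a trustworthiness score in [0, 1] and is flagged as incorrect when the score falls below a chosen threshold. The function name, the threshold, and the example scores are hypothetical illustrations, not taken from the source or from any benchmark result.

```python
from typing import Sequence, Tuple


def detection_precision_recall(
    trust_scores: Sequence[float],
    is_incorrect: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Flag responses scoring below `threshold`, then score those flags.

    Hypothetical helper: `trust_scores` are an evaluation model's scores,
    `is_incorrect` are ground-truth labels (True = the response was wrong).
    """
    if len(trust_scores) != len(is_incorrect):
        raise ValueError("scores and labels must have the same length")

    flagged = [score < threshold for score in trust_scores]

    # Count true positives, false positives, and false negatives.
    tp = sum(f and y for f, y in zip(flagged, is_incorrect))
    fp = sum(f and not y for f, y in zip(flagged, is_incorrect))
    fn = sum((not f) and y for f, y in zip(flagged, is_incorrect))

    # Precision: share of flagged responses that were truly incorrect.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of truly incorrect responses that were flagged.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up scores and labels, for illustration only.
example_scores = [0.1, 0.4, 0.8, 0.9, 0.3]
example_labels = [True, False, False, False, True]
precision, recall = detection_precision_recall(example_scores, example_labels, 0.5)
```

Sweeping the threshold over the score range traces out a precision/recall trade-off, which is one way different evaluation models can be compared on the same set of responses.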
Authors
Sources
- Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... (cleanlab.ai)
Referenced by nodes (3)
- LLM-as-a-judge concept
- DROP concept
- TLM concept