measurement
In the Cleanlab FinQA benchmark, the TLM and LLM-as-a-judge methods detect incorrect AI responses with the highest precision and recall.
Authors
Sources
- Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... cleanlab.ai via serper
Referenced by nodes (2)
- LLM-as-a-judge concept
- TLM concept