TLM
Also known as: Trustworthy Language Model
Facts (12)
Sources
Real-Time Evaluation Models for RAG: Who Detects Hallucinations ... (cleanlab.ai, Apr 7, 2025; 10 facts)
claim: In the PubmedQA benchmark, the Prometheus and TLM evaluation models detect incorrect AI responses with the highest precision and recall, effectively catching hallucinations.
claim: The Cleanlab RAG benchmark uses OpenAI’s gpt-4o-mini LLM to power both the 'LLM-as-a-judge' and 'TLM' scoring methods.
claim: In the ELI5 benchmark, the Prometheus and TLM evaluation models are more effective at detecting incorrect AI responses than other detectors, though no method achieves very high precision or recall.
claim: Evaluation techniques such as 'LLM-as-a-judge' and 'TLM' (Trustworthy Language Model) can be powered by any Large Language Model and require no data preparation, labeling, or custom model infrastructure; the user only needs access to a hosted pre-trained LLM such as AWS Bedrock, Azure/OpenAI, Gemini, or Together.ai.
reference: A previous study benchmarking alternative hallucination detection techniques, including DeepEval, G-Eval, and RAGAS, found that TLM (Trustworthy Language Model) detects incorrect RAG responses with higher precision and recall than those alternatives.
claim: In the CovidQA benchmark, the TLM evaluation model detects incorrect AI responses with the highest precision and recall, followed by Prometheus and LLM-as-a-judge.
claim: In the DROP benchmark, the TLM evaluation model detects incorrect AI responses with the highest precision and recall, followed by LLM-as-a-judge; no other evaluation model appears very useful.
measurement: In the Cleanlab FinQA benchmark, the TLM and LLM-as-a-judge methods detect incorrect AI responses with the highest precision and recall.
claim: In the FinanceBench benchmark, the TLM and LLM-as-a-judge evaluation models detect incorrect AI responses with the highest precision and recall, matching findings from the FinQA dataset.
reference: A study found that TLM (Trustworthy Language Model) detects incorrect RAG responses more effectively than techniques like 'LLM-as-a-judge' or token probabilities (logprobs) across all major Large Language Models.
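The 'LLM-as-a-judge' technique referenced in these facts can be sketched in a few lines. This is a minimal illustration, not Cleanlab's implementation: `call_llm` is a hypothetical stand-in for whichever hosted LLM the user has access to (AWS Bedrock, Azure/OpenAI, Gemini, Together.ai, etc.), and the prompt wording is an assumption.

```python
# Minimal sketch of an "LLM-as-a-judge" hallucination check for RAG.
# `call_llm` is a hypothetical callable standing in for any hosted LLM API.

JUDGE_PROMPT = """You are evaluating a RAG system's answer.

Context: {context}
Question: {question}
Answer: {answer}

Is the answer fully supported by the context? Reply with exactly "yes" or "no"."""


def judge_response(call_llm, context: str, question: str, answer: str) -> bool:
    """Return True if the judge LLM deems the answer supported (i.e. not a hallucination)."""
    prompt = JUDGE_PROMPT.format(context=context, question=question, answer=answer)
    verdict = call_llm(prompt).strip().lower()
    return verdict.startswith("yes")
```

As the facts above note, nothing here requires labeled data or custom model infrastructure: the same function works with any LLM backend that accepts a text prompt and returns a text completion.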
Benchmarking Hallucination Detection Methods in RAG - Cleanlab (cleanlab.ai, Sep 30, 2024; 2 facts)
claim: For the FinanceBench application, the TLM (Trustworthy Language Model) method is the most effective technique for detecting hallucinations.
claim: For the Pubmed QA application, the TLM method is the most effective technique for detecting hallucinations, followed by the DeepEval Hallucination metric, RAGAS Faithfulness, and LLM Self-Evaluation.
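The benchmarks cited throughout rank detectors by the precision and recall with which they flag incorrect AI responses. A minimal sketch of how such metrics can be computed, assuming each detector emits a trustworthiness score and responses scoring below a threshold are flagged as suspected hallucinations (the threshold of 0.5 is an illustrative assumption, not taken from the benchmarks):

```python
def precision_recall(scores, is_incorrect, threshold=0.5):
    """Compute precision/recall for flagging incorrect responses.

    scores       -- per-response trustworthiness scores from a detector
    is_incorrect -- ground-truth booleans (True = the response was wrong)
    A response is flagged as a suspected hallucination when its score
    falls below `threshold`.
    """
    flagged = [s < threshold for s in scores]
    tp = sum(f and bad for f, bad in zip(flagged, is_incorrect))
    fp = sum(f and not bad for f, bad in zip(flagged, is_incorrect))
    fn = sum(not f and bad for f, bad in zip(flagged, is_incorrect))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A detector like those benchmarked above is "most effective" when it achieves both high precision (few correct responses wrongly flagged) and high recall (few incorrect responses missed).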