claim
On the CovidQA benchmark, the TLM evaluation model detects incorrect AI responses with the highest precision and recall, followed by Prometheus and LLM-as-a-judge.
