measurement
In the Cleanlab FinQA benchmark, the TLM and LLM-as-a-judge methods detect incorrect AI responses with the highest precision and recall.

Authors

Sources

Referenced by nodes (2)