claim
On the FinanceBench benchmark, TLM and LLM-as-a-judge detect incorrect AI responses with the highest precision and recall among the evaluation models compared, consistent with the findings on the FinQA dataset.

Authors

Sources
