claim
On the PubMedQA benchmark, the Prometheus and TLM evaluation models detect incorrect AI responses with the highest precision and recall among the methods compared, making them the most effective at catching hallucinations.
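
For readers unfamiliar with how precision and recall apply to hallucination detection, the following is a minimal sketch, not the benchmark's actual evaluation code. It assumes a detector emits a score per response where lower means less trustworthy, and that a response is flagged as incorrect when its score falls below a threshold. The function name `precision_recall`, the threshold, and the toy scores and labels are all hypothetical and are not results from the benchmark.

```python
def precision_recall(scores, is_incorrect, threshold):
    """Precision and recall for flagging incorrect responses.

    scores       -- hypothetical detector scores (assumed: lower = less trustworthy)
    is_incorrect -- ground-truth labels, True when the response is actually wrong
    threshold    -- responses scoring below this value are flagged as incorrect
    """
    flagged = [s < threshold for s in scores]
    # True positives: flagged responses that really are incorrect.
    tp = sum(f and y for f, y in zip(flagged, is_incorrect))
    # False positives: flagged responses that were actually correct.
    fp = sum(f and not y for f, y in zip(flagged, is_incorrect))
    # False negatives: incorrect responses the detector failed to flag.
    fn = sum((not f) and y for f, y in zip(flagged, is_incorrect))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data for illustration only (not taken from PubMedQA or any detector).
scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6]
is_incorrect = [False, True, False, True, True, False]

precision, recall = precision_recall(scores, is_incorrect, threshold=0.5)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

High precision means few correct responses are wrongly flagged; high recall means few incorrect responses slip through. A detector that scores well on both, as the claim states for Prometheus and TLM, avoids trading one error type for the other.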
