claim
On the ELI5 benchmark, the Prometheus and TLM evaluation models detect incorrect AI responses more effectively than the other detectors evaluated, though no method achieves very high precision or recall.
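
For readers unfamiliar with the two metrics named in the claim, the sketch below shows one common way to compute precision and recall for a detector that flags responses as incorrect. It is only an illustration: the function name `precision_recall`, the boolean-list inputs, and the toy data are assumptions made here, not the benchmark's actual evaluation code or results.

```python
def precision_recall(flagged, incorrect):
    """Precision and recall for a detector that flags responses as incorrect.

    flagged[i]:   True if the detector flagged response i (hypothetical input)
    incorrect[i]: True if response i is actually incorrect (ground truth)
    """
    if len(flagged) != len(incorrect):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(flagged, incorrect))
    tp = sum(1 for f, y in pairs if f and y)        # flagged and truly incorrect
    fp = sum(1 for f, y in pairs if f and not y)    # flagged but actually correct
    fn = sum(1 for f, y in pairs if y and not f)    # incorrect but not flagged
    # Return 0.0 when a metric is undefined (nothing flagged, or nothing incorrect).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up toy labels for illustration only, not data from the benchmark.
flagged = [True, True, False, False, True]
incorrect = [True, False, True, False, True]
print(precision_recall(flagged, incorrect))
```

Precision answers "of the responses the detector flagged, how many were really incorrect?", while recall answers "of the really incorrect responses, how many did the detector flag?". A detector can score well on one and poorly on the other, which is why the claim reports both.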
