claim
Reasoning-enhanced models such as DeepSeek-R1 and GPT-o3-mini show higher inter-rater reliability with human experts than standard instruction-tuned models.
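
The claim does not say which reliability statistic was used. As an illustration only, the sketch below assumes Cohen's kappa, a common chance-corrected agreement measure for two raters assigning categorical labels. The function name `cohen_kappa` and the label lists are hypothetical and are not taken from the source.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: computed from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch returns 1.0 as a convention (an assumption).
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels, invented for illustration (not data from the source).
human = ["pass", "pass", "fail", "pass", "fail", "fail"]
model = ["pass", "pass", "fail", "fail", "fail", "fail"]
print(round(cohen_kappa(human, model), 3))  # 0.667
```

Under this assumption, comparing a reasoning-enhanced model with an instruction-tuned one would mean computing the statistic for each model against the same expert labels and comparing the two values.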
