Claim
Reasoning-enhanced models such as DeepSeek-R1 and GPT-o3-mini show higher inter-rater reliability with human experts than standard instruction-tuned models.
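The claim does not say which agreement statistic was used. As a minimal sketch, assuming Cohen's kappa (one common measure of inter-rater reliability between two raters), the agreement between a model's ratings and a human expert's ratings could be computed as follows. The function name `cohen_kappa` and the rating lists are hypothetical and chosen only for illustration; they are not data from the cited source.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    Assumption: kappa is used here only as an illustrative agreement
    statistic; the source does not state which metric it reports.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: probability both raters pick the same label
    # if each labelled independently with their own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical ratings (1 = acceptable, 0 = not acceptable), for illustration only.
human_expert = [1, 1, 0, 0]
model_full_agreement = [1, 1, 0, 0]
model_chance_level = [1, 0, 1, 0]

kappa_full = cohen_kappa(human_expert, model_full_agreement)
kappa_chance = cohen_kappa(human_expert, model_chance_level)
```

Under this reading, a model "shows higher inter-rater reliability" when its kappa against the human expert is closer to 1, while a value near 0 indicates agreement no better than chance.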
Authors
Sources
- A Comprehensive Benchmark and Evaluation Framework for Multi ... arxiv.org via serper
Referenced by nodes (1)
- DeepSeek-R1 concept