measurement
The study benchmarks two open-source models (Qwen3-235B-A22B-Instruct-2507 and DeepSeek-R1) and two proprietary models (GPT-5 and Gemini-2.5-Pro) to assess inquiry completeness in clinical contexts.
Sources
- A Comprehensive Benchmark and Evaluation Framework for Multi ... (arxiv.org)
Referenced by nodes (2)
- DeepSeek-R1 concept
- GPT-5 concept