Relations (1)

related 2.00 — strongly supporting 3 facts

GPT-5 is evaluated within the LLM-as-a-judge framework, with studies reporting leniency that diverges from that of human clinicians [1], F1-scores that remain in the 75–79% range under Majority Voting [2], and the highest alignment with human clinical experts under the Liberal Strategy [3].

Facts (3)

Sources
A Comprehensive Benchmark and Evaluation Framework for Multi ... arxiv.org arXiv 3 facts
measurement: The Majority Voting strategy for ensemble LLM judges consistently produces stable agreement with human clinical experts, maintaining F1-scores in the 75–79% range across Doctor Agents including DeepSeek, Gemini, and GPT-5.
measurement: The Liberal Strategy for ensemble LLM judges achieves the highest alignment metrics with human clinical experts, particularly for the GPT-5 model.
claim: Studies by Maina et al. identify persistent challenges in LLM-as-a-Judge methods, including verbosity bias, inconsistency in low-resource languages, and a 'severity gap' where models like GPT-5 and Gemini exhibit divergent leniency compared to human clinicians.
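
The two ensemble strategies named in the facts above can be illustrated with a minimal Python sketch. The source does not define either strategy, so the definitions here are assumptions: Majority Voting accepts a response when more than half of the judges accept it, and the Liberal Strategy accepts when at least one judge accepts. The function names, the vote data, and the human labels are all hypothetical and are not results from the cited benchmark.

```python
from typing import Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """Accept when strictly more than half of the judges accept (assumed definition)."""
    return sum(votes) * 2 > len(votes)


def liberal_vote(votes: Sequence[bool]) -> bool:
    """Accept when at least one judge accepts (assumed definition)."""
    return any(votes)


def f1_score(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """F1 of the ensemble verdicts against human reference labels."""
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)
    fp = sum(p and not r for p, r in pairs)
    fn = sum(r and not p for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: one row per response, one column per LLM judge.
judge_votes = [
    [True, True, False],
    [False, False, True],
    [False, False, False],
    [True, False, True],
]
# Hypothetical human expert labels for the same four responses.
human_labels = [True, True, False, True]

majority = [majority_vote(v) for v in judge_votes]
liberal = [liberal_vote(v) for v in judge_votes]

print("majority F1:", f1_score(majority, human_labels))
print("liberal F1:", f1_score(liberal, human_labels))
```

On this toy data the liberal rule recovers the one response that only a single judge accepted, which is the kind of case where a more permissive aggregation can raise agreement with lenient human labels. The numbers carry no weight beyond showing how the two rules differ.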