reference
The paper 'When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity' analyzes how design flaws in benchmarks that use large language models as judges can silently invalidate their results.
Sources
- 'A Survey on the Theory and Mechanism of Large Language Models' (arxiv.org, via Serper)
Referenced by nodes (2)
- Large Language Models concept
- benchmarks concept