reference
The paper 'When judgment becomes noise: how design failures in llm judge benchmarks silently undermine validity' analyzes how design flaws in benchmarks that use large language models as judges can invalidate their results.

Authors

Sources

Referenced by nodes (2)