claim
Traditional benchmarks for large language models are saturating: top-tier models approach perfect scores, leaving too little headroom for these benchmarks to distinguish between state-of-the-art models.

Authors

Sources

Referenced by nodes (1)