claim
According to Zhang et al. (2025a), high performance on static benchmarks for Large Language Models may not correlate with true, generalized capabilities.
