claim
Existing benchmarks for evaluating Large Language Models rely on static, narrow question sets, which limits coverage and leads to misleading evaluations.
