claim
Large Language Models may be overfitting to the specific artifacts of a test set rather than the underlying task, leading to a fundamental lack of robustness, according to Lunardi et al. (2025).