The paper 'Enhancing auto-regressive chain-of-thought through loop-aligned reasoning' is an arXiv preprint with identifier arXiv:2502.08482.
The research paper 'On LLMs-driven synthetic data generation, curation, and evaluation: a survey' was published as an arXiv preprint (arXiv:2406.15126) and is cited in section 3.2.2 of 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'RNNs are not transformers (yet): the key bottleneck on in-context retrieval' is an arXiv preprint (arXiv:2402.18510).
The paper 'DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning' (arXiv:2501.12948) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding reasoning capabilities.
The paper 'From low intrinsic dimensionality to non-vacuous generalization bounds in deep multi-task learning' is an arXiv preprint (arXiv:2501.19067) cited in section 2.2.1 of 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Vision superalignment: weak-to-strong generalization for vision foundation models' (arXiv:2402.03749) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding alignment.
The paper 'Demystify Mamba in vision: a linear attention perspective' (arXiv:2405.16605) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding linear attention.
The paper 'Looped transformers are better at learning learning algorithms' is an arXiv preprint (arXiv:2311.12424) that investigates the capability of looped transformers to learn algorithms.
The paper 'A mathematical exploration of why language models help solve downstream tasks' is an arXiv preprint (arXiv:2010.03648) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Transformers, parallel computation, and logarithmic depth' is an arXiv preprint (arXiv:2402.09268) cited in section 3.2.1 of 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'AMPO: automatic multi-branched prompt optimization' is an arXiv preprint (arXiv:2410.08696) that introduces a method for automatic multi-branched prompt optimization.
The paper 'Connecting large language models with evolutionary algorithms yields powerful prompt optimizers' (arXiv:2309.08532) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding prompt optimization.
The paper 'Jailbreak attacks and defenses against large language models: a survey' is an arXiv preprint with identifier arXiv:2407.04295.
The paper 'TrustLLM: trustworthiness in large language models' is an arXiv preprint, identified as arXiv:2401.05561.
The paper 'Can you trust LLM judgments? reliability of LLM-as-a-judge' is an arXiv preprint (arXiv:2412.12509) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Benchmark data contamination of large language models: a survey' is an arXiv preprint (arXiv:2406.04244).
The paper 'Large language model alignment: a survey' is an arXiv preprint (arXiv:2309.15025) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Data mixing laws: optimizing data mixtures by predicting language modeling performance' is an arXiv preprint with identifier arXiv:2403.16952.
The paper 'ContraNorm: a contrastive learning perspective on oversmoothing and beyond' (arXiv:2303.06562) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding contrastive learning.
The research paper 'Neural thermodynamic laws for large language model training' was published as an arXiv preprint (arXiv:2505.10559) and is cited in section 5.2.1 of 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'The prompt report: a systematic survey of prompt engineering techniques' is an arXiv preprint (arXiv:2406.06608) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'DeepSeekMath: pushing the limits of mathematical reasoning in open language models' is an arXiv preprint (arXiv:2402.03300) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Gated delta networks: improving Mamba2 with delta rule' is an arXiv preprint (arXiv:2412.06464) that proposes improvements to the Mamba2 architecture using the delta rule.
The paper 'An explanation of in-context learning as implicit Bayesian inference' is an arXiv preprint (arXiv:2111.02080).
The paper 'STanHop: sparse tandem Hopfield model for memory-enhanced time series prediction' is an arXiv preprint (arXiv:2312.17346).
The paper 'ReST-MCTS*: LLM self-training via process reward guided tree search' is an arXiv preprint (arXiv:2406.03816) cited in section 6.2.3 of 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'A taxonomy for data contamination in large language models' is an arXiv preprint, identified as arXiv:2407.08716.
The paper 'Benefits of transformer: in-context learning in linear regression tasks with unstructured data' is an arXiv preprint (arXiv:2402.00743).
The research paper 'On robustness and reliability of benchmark-based evaluation of LLMs' was published as an arXiv preprint (arXiv:2509.04013).
The paper 'Language models represent space and time' (arXiv:2310.02207) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding representation.
The paper 'Large language model safety: a holistic survey' is an arXiv preprint (arXiv:2412.17686) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Are transformers universal approximators of sequence-to-sequence functions?' is an arXiv preprint with identifier arXiv:1912.10077.
The paper 'AutoPrompt: eliciting knowledge from language models with automatically generated prompts' is an arXiv preprint (arXiv:2010.15980) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Compression represents intelligence linearly' is an arXiv preprint, identified as arXiv:2404.09937.
The paper 'The mosaic memory of large language models' is an arXiv preprint (arXiv:2405.15523) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Evaluating large language models: a comprehensive survey' (arXiv:2310.19736) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding LLM evaluation.
The paper 'On protecting the data privacy of large language models (LLMs): a survey' is an arXiv preprint (arXiv:2403.05156) that reviews data privacy concerns regarding large language models.
The paper 'How close is ChatGPT to human experts? comparison corpus, evaluation, and detection' (arXiv:2301.07597) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding LLM evaluation.
The paper 'Training large language models to reason in a continuous latent space' (arXiv:2412.06769) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding reasoning.
The paper 'Gated linear attention transformers with hardware-efficient training' is an arXiv preprint (arXiv:2312.06635) that discusses gated linear attention transformers and their training efficiency.
The paper 'Knowledge-infused prompting: assessing and advancing clinical text data generation with large language models' is an arXiv preprint (arXiv:2311.00287) that explores the intersection of large language models and clinical data generation.
The paper 'Emergence of segmentation with minimalistic white-box transformers' is an arXiv preprint with identifier arXiv:2308.16271.
The paper 'On the emergence of weak-to-strong generalization: a bias-variance perspective' is an arXiv preprint (arXiv:2505.24313).
The paper 'Theoretically grounded framework for LLM watermarking: a distribution-adaptive approach' (arXiv:2410.02890) is cited in the survey 'A Survey on the Theory and Mechanism of Large Language Models' regarding watermarking.
The paper 'Improving weak-to-strong generalization with scalable oversight and ensemble learning' is an arXiv preprint (arXiv:2402.00667) cited in section 5.2.1 of 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Self-attention networks can process bounded hierarchical languages' is an arXiv preprint (arXiv:2105.11115) that demonstrates the capability of self-attention networks to process bounded hierarchical languages.
The paper 'Entropy-memorization law: evaluating memorization difficulty of data in LLMs' is an arXiv preprint, identified as arXiv:2507.06056.
The paper 'Proximal policy optimization algorithms' is an arXiv preprint (arXiv:1707.06347) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Reasoning with latent thoughts: on the power of looped transformers' is an arXiv preprint (arXiv:2502.17416) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.
The paper 'Spurious rewards: rethinking training signals in RLVR' is an arXiv preprint (arXiv:2506.10947) cited in 'A Survey on the Theory and Mechanism of Large Language Models'.