Mathematical Foundations of Machine Learning & AI

Introduction to Machine Learning and AI from a Mathematical Perspective

Overview of ML and AI

Artificial intelligence (AI) is a branch of computer science focused on creating systems capable of performing tasks that typically require human intelligence, such as problem-solving, decision-making, and understanding language. Machine learning (ML) is a subset of AI that involves creating algorithms that enable computers to learn patterns and make decisions or predictions based on data, without being explicitly programmed for every task.
According to Jordan and Mitchell (2015), ``machine learning is concerned with the question of how to construct programs that automatically improve their performance at some tasks through experience''. More formally, a computer program is said to learn from experience $E$ with respect to some class of tasks $T$ and performance measure $P$ if its performance at tasks in $T$, as measured by $P$, improves with experience $E$.
Developers of ML and AI systems might in some cases find it easier to train a system with a training sample than to program an exhaustive set of rules or policies, i.e., the mappings from perceived states of the environment to actions; see Sutton and Barto (2018). ML and AI might be viewed as a neighbouring branch of scientific machine learning, the purpose of which is to construct reliable reduced models of highly complex systems from even partially observed trajectories of data.
ML methods rely on a multilayer neural network (MNN), an input/output connectionist system of variable dimension, the architecture of which consists of layers of nodes connected in an organized and structured way. The use of hierarchical and organized structures for modelling complex systems represented by structured data is rather intuitive, and has been advocated by the Turing Award and Nobel laureate Herbert Simon (1962).
There is a wide array of reference books on the topic, e.g., arranged in chronological order Vapnik (1998, 2000), Bishop (2006), Hastie et al. (2009), Shalev-Shwartz and Ben-David (2014), Goodfellow et al. (2016), Aggarwal (2018), Mohri et al. (2018), etc. The current rate of evolution of ML and AI is almost exponential, making the task of writing an exhaustive reference book solely on its mathematics a Sisyphean endeavor.

Mathematical formalization in ML and AI

This fast development pace of ML and AI is supported by its mathematical formalization, which borrows from a wide array of fields in mathematics and computer science: algebra, optimization, statistical inference and estimation. This formalization makes it possible to obtain reliable and reproducible results, to build bridges between different academic fields, and to develop new tools that yield powerful and sometimes counter-intuitive results.

I: Mathematical Foundations and Tools

Introduction

ML and AI make extensive use of calculus, linear algebra, optimization, probability and statistics. There is a wide array of excellent books covering these topics, see e.g., MacKay (2003), Wasserman (2004), Nocedal and Wright (2006), Bishop (2006), Hastie et al. (2009), Goodfellow et al. (2016), Strang (2019), Wainwright (2019) and Bishop and Bishop (2023) among others.

Key Algorithms and Models

Neural Networks

The multilayer neural network (MNN), a connectionist system whose architecture consists of several connected nodes and layers, constitutes the building block of ML models. The internal organization of these computational models, inspired by the human brain and consisting of layers of interconnected nodes, depends on the purpose of the model, the structure of the data used for its training and other inductive biases (Goyal and Bengio, 2022). A large dataset is then presented to the network for training, i.e., for estimating the weights (the parameters) associated with each connection by optimizing a performance measure $P$ with suitable algorithms.
The use of multilayer neural networks has strong mathematical foundations: Hornik et al. (1989) demonstrated that even shallow neural networks are universal approximators, i.e., they can approximate any continuous function $$ X \in \mathbb{R}^m\mapsto Y=f(X) \in \mathbb{R}^n, \: m, n \geqslant 1, $$ and, under additional smoothness conditions, its derivatives. However, these shallow neural networks are of limited practical use, as their generalization properties, i.e., their performance on samples different from the training sample, are rather poor. This overfitting in the estimation of statistical models is corrected by using regularization methods, e.g., penalty functions as in nonparametric regression analysis (Green and Silverman, 1994). The estimate of the function $f$ is obtained by minimizing a regularized loss function: $$ \min_{f \in \mathcal{H}} \mathcal{L}(Y,X;\theta) + \lambda R(f), $$ where $\mathcal{H}$ is the space of candidate functions, parameterized by $\theta \in \Theta$, $R(f)$ is a penalty functional and $\lambda >0$ its weight. The parameters are estimated on a training set, while hyperparameters such as $\lambda$ are tuned on a validation set; the model is then tested on separate datasets.
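As a minimal illustration of the regularized loss above, the sketch below (an illustrative toy example, not the estimation procedure used for deep networks) fits a one-dimensional function with a random-feature model and a quadratic penalty $R(f)=\|\theta\|^2$, so that the minimization reduces to ridge regression, with the hyperparameter $\lambda$ tuned on a held-out validation set. The data-generating function, feature map and grid of $\lambda$ values are assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: y = f(x) + noise, with f unknown to the learner.
    def f_true(x):
        return np.sin(3 * x)

    n = 200
    x = rng.uniform(-1, 1, size=n)
    y = f_true(x) + 0.1 * rng.standard_normal(n)

    # Random ReLU features phi_j(x) = max(0, w_j x + b_j): a simple candidate space H.
    m = 50
    w, b = rng.standard_normal(m), rng.standard_normal(m)
    def features(x):
        return np.maximum(0.0, np.outer(x, w) + b)

    # Split into training and validation sets; lambda is tuned on the validation set.
    x_tr, y_tr, x_val, y_val = x[:150], y[:150], x[150:], y[150:]
    Phi_tr, Phi_val = features(x_tr), features(x_val)

    best = None
    for lam in [1e-4, 1e-3, 1e-2, 1e-1, 1.0]:
        # Ridge estimate: argmin_theta ||y - Phi theta||^2 + lam * ||theta||^2
        theta = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(m), Phi_tr.T @ y_tr)
        val_err = np.mean((Phi_val @ theta - y_val) ** 2)
        if best is None or val_err < best[0]:
            best = (val_err, lam, theta)
    print("selected lambda:", best[1], "validation MSE:", round(best[0], 4))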
Since the work by Krizhevsky et al. (2012) introducing deep networks, the picture has changed completely, as it was empirically observed that the generalization properties strongly improve with the size of the network. New regularization methods such as Dropout (Srivastava et al., 2014) proved useful for these deep networks. The seminal mathematical works by Belkin and his co-authors (see, e.g., Belkin et al., 2019, 2020; Belkin, 2021) study the foundations of the empirical "auto-regularization" properties of over-parameterized systems, e.g., large language models with up to several trillion parameters. These systems are characterized by the "double-descent" phenomenon: once the interpolation threshold is passed, the risk of the system decreases again as its capacity increases.
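A hedged numerical sketch of the double-descent curve: with random features and the minimum-norm least-squares fit (computed via the pseudo-inverse), the test error typically peaks near the interpolation threshold (number of features close to the number of training points) and decreases again in the over-parameterized regime. The data-generating function and feature map below are illustrative assumptions, not taken from the cited references.

    import numpy as np

    rng = np.random.default_rng(1)
    n_train, n_test = 40, 500

    def f_true(x):
        return np.sin(4 * x)

    x_tr = rng.uniform(-1, 1, n_train)
    y_tr = f_true(x_tr) + 0.2 * rng.standard_normal(n_train)
    x_te = rng.uniform(-1, 1, n_test)
    y_te = f_true(x_te)

    def relu_features(x, w, b):
        return np.maximum(0.0, np.outer(x, w) + b)

    # Sweep the model capacity (number of random features) across the interpolation
    # threshold m = n_train and record the test error of the minimum-norm solution.
    for m in [5, 10, 20, 40, 80, 200, 1000]:
        w, b = rng.standard_normal(m), rng.standard_normal(m)
        Phi_tr = relu_features(x_tr, w, b)
        Phi_te = relu_features(x_te, w, b)
        theta = np.linalg.pinv(Phi_tr) @ y_tr        # minimum-norm interpolant when m >= n_train
        test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
        print(f"m = {m:5d}   test MSE = {test_mse:.3f}")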
Furthermore, although the risk function of these over-parameterized systems is neither convex nor even locally convex, it satisfies other properties, studied by Łojasiewicz (1962) and Poljak (1963), such as the Polyak-Łojasiewicz condition, which allow the use of gradient descent methods, including the stochastic gradient descent (SGD) method commonly used in ML.
Further details on the theory of over-parameterized machine learning (TOPML), in particular the minimization of the loss function, and recent advances in this field for very large systems are given in the AI & Math Blog.

Deep Learning Models

The following architectures constitute the main building blocks of MNNs; combining them yields the class of hybrid models.
  1. Convolutional Neural Networks (CNN) exploit the translation equivariance of the convolution operator (and, after pooling, its approximate translation invariance), so that an object is recognized as the same after translation. This makes CNNs particularly well-suited for tasks involving image processing, image recognition, and other visual applications.
  2. Recurrent Neural Networks (RNN) are designed for sequential data, with loops to process temporal dependencies. Time series analysis and speech recognition are their main domains of application.
  3. Long Short-Term Memory (LSTM), proposed by Hochreiter and Schmidhuber (1997), was designed to address the vanishing gradient problem encountered when RNNs are unrolled over many time steps. During training, the model minimizes a risk or loss function; however, the standard backpropagation algorithm, which relies on a long chain of derivative calculations, can produce gradients with extremely small numerical values, making them practically unusable.
  4. Transformers: A Transformer is a deep learning architecture, based on the attention mechanism (Vaswani et al., 2017), designed primarily for processing sequential data, e.g., text, without resorting to RNNs. This architecture is particularly effective for tasks such as translation, summarization, and text generation; a minimal sketch of the attention computation is given after this list.
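The sketch below is a minimal numpy implementation of the scaled dot-product attention at the core of the Transformer, $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$ (Vaswani et al., 2017). The dimensions and random inputs are illustrative assumptions; multi-head projections, masking and positional encodings are omitted.

    import numpy as np

    def softmax(z, axis=-1):
        z = z - z.max(axis=axis, keepdims=True)    # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def scaled_dot_product_attention(Q, K, V):
        # Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)            # pairwise similarities
        weights = softmax(scores, axis=-1)         # each query attends over all keys
        return weights @ V                         # weighted average of the values

    rng = np.random.default_rng(0)
    n_tokens, d_model = 6, 8                       # illustrative sizes
    X = rng.standard_normal((n_tokens, d_model))   # token embeddings (self-attention: Q = K = V source)
    out = scaled_dot_product_attention(X, X, X)
    print(out.shape)                               # (6, 8)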

II: Theoretical Aspects of ML and AI algorithms

Machine Learning Paradigms

Supervised Learning

Supervised learning is the simplest learning method: the model learns from labeled data, i.e., (input, output) pairs. We have a dataset $ \mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} $ consisting of a collection of $N$ pairs of inputs (or features) $x_i$ and outputs $y_i$, written in matrix form as $ \mathcal{D} =(X,Y)$. The output $Y$ might be discrete, continuous, a set of labels in the case of a classification problem, etc.
The goal of supervised learning is to approximate the function $f$ that maps $X$ to $Y$, i.e., $Y=f(X)$, by learning this function from the dataset $\mathcal{D}$.
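A hedged sketch of one of the simplest instances of supervised learning: a k-nearest-neighbour classifier built directly from the labeled pairs in $\mathcal{D}=(X,Y)$. The synthetic two-class data and the choice $k=5$ are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic labeled dataset D = {(x_i, y_i)}: two Gaussian classes in R^2.
    n_per_class = 100
    X = np.vstack([rng.normal(loc=-1.0, size=(n_per_class, 2)),
                   rng.normal(loc=+1.0, size=(n_per_class, 2))])
    Y = np.repeat([0, 1], n_per_class)

    def knn_predict(X_train, Y_train, x_new, k=5):
        # Predict the majority label among the k closest training inputs.
        dists = np.linalg.norm(X_train - x_new, axis=1)
        nearest = np.argsort(dists)[:k]
        return np.bincount(Y_train[nearest]).argmax()

    x_new = np.array([0.5, 0.8])
    print("predicted label:", knn_predict(X, Y, x_new))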

Unsupervised Learning

The model learns patterns or structures from data without labeled outcomes, as labels are either costly to obtain or only partially available. Furthermore, errors in the labels affect the result of supervised learning and require either the use of specific corrective tools (Northcutt et al., 2021; Hendrycks et al., 2018), or the design of alternative methods for estimating the model without depending on labels.
Unsupervised learning takes into account the internal structure and organisation of the data and does not need a set of labels. This approach is similar to the nonparametric approach in statistics, which 'lets the data speak for themselves'; as a consequence, some nonparametric methods are used in unsupervised learning.
Semi-supervised learning is the intermediate case between supervised and unsupervised learning: the learning procedure first uses the labelled data while leveraging a larger amount of unlabeled data, which can strongly enhance the accuracy of the model. Standard examples of unsupervised learning are:
  1. Clustering: Grouping similar data points (e.g., customer segmentation).
  2. Dimensionality Reduction: Simplifying high-dimensional data while preserving important features (e.g., PCA for data visualization; see the sketch after this list).
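A hedged sketch of principal component analysis (PCA) computed via the singular value decomposition of the centred data matrix; the synthetic data and the choice of two components are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic high-dimensional data with a low-dimensional underlying structure.
    n, d = 300, 10
    latent = rng.standard_normal((n, 2))                # 2 underlying factors
    mixing = rng.standard_normal((2, d))
    X = latent @ mixing + 0.05 * rng.standard_normal((n, d))

    # PCA: centre the data, take the SVD, project on the top-k right singular vectors.
    k = 2
    X_centred = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:k]                                 # principal directions
    scores = X_centred @ components.T                   # low-dimensional representation
    explained = (S[:k] ** 2) / (S ** 2).sum()
    print("shape of reduced data:", scores.shape)
    print("explained variance ratio:", np.round(explained, 3))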

Self-supervised Learning

Self-supervised learning (SSL) is a machine learning approach that utilizes unlabeled data to train models. Instead of relying on external labels, it formulates auxiliary tasks based on inherent data properties, e.g., internal structure and organization, such as predicting missing parts, detecting relationships, or reconstructing data representations.
The model learns meaningful representations or features of the data using, e.g., pretext tasks, contrastive learning, or redundancy reduction (Zbontar et al., 2021). These meaningful patterns and features are then used for downstream tasks, often with minimal additional supervision. SSL bridges the gap between supervised and unsupervised learning by exploiting the vast availability of unlabeled data while reducing the consequences of the possible labeling errors mentioned above.
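A hedged toy sketch of the contrastive idea mentioned above: two "augmented views" of the same items should have similar embeddings, which an InfoNCE-type loss encourages by treating matching pairs as positives and all other pairs in the batch as negatives. The random embeddings, temperature and batch size are illustrative assumptions; real SSL pipelines learn the encoder that produces these embeddings.

    import numpy as np

    def info_nce_loss(z1, z2, temperature=0.1):
        # z1[i] and z2[i] are embeddings of two views of the same item (positives).
        z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
        z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
        logits = z1 @ z2.T / temperature               # cosine similarities of all pairs
        # Cross-entropy with the diagonal (matching pairs) as targets.
        logits = logits - logits.max(axis=1, keepdims=True)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    rng = np.random.default_rng(0)
    batch, dim = 8, 16
    base = rng.standard_normal((batch, dim))
    # Two noisy "views" of the same underlying representations.
    z1 = base + 0.05 * rng.standard_normal((batch, dim))
    z2 = base + 0.05 * rng.standard_normal((batch, dim))
    print("InfoNCE loss (aligned views):", round(info_nce_loss(z1, z2), 3))
    print("InfoNCE loss (random views): ", round(info_nce_loss(z1, rng.standard_normal((batch, dim))), 3))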

Reinforcement Learning

Reinforcement learning (RL) is a fundamental and distinct learning method. In the standard supervised, unsupervised and self-supervised frameworks, the learner receives the data and learns a model from them.
In the RL framework, the learner is more active: it is an agent, or decision maker, interacting with its environment through a sequence of actions, without supervision. In response to the action just taken, the agent receives two types of information through a standard feedback mechanism:
  1. Its current state in the environment,
  2. A real-valued immediate reward for its action. In more advanced frameworks, future or long-term rewards are also provided; the trade-off between immediate and future rewards is controlled by the discount factor $\gamma \in (0,1)$.
The agent takes this feedback into account for its future actions. The objective of the agent is to maximize its reward, and thus to select the best course of actions, the policy, that maximizes the cumulative reward over time. If the environment is known, this is a planning problem, whilst if the environment is unknown, this is a learning problem: the agent learns a sequence of decisions, i.e., a policy.
In the latter case, the agent balances the trade-off between exploration, i.e., obtaining more information on the environment and the rewards, and exploitation of the information already collected for optimizing the rewards.
RL methods are often computationally intensive. However, the recent emergence of DeepSeek-R1 (2025), which utilizes RL and operates with fewer resources than state-of-the-art large language models (LLMs), is likely to enhance the adoption of RL methods.

The standard framework for formalizing reinforcement learning is the Markov decision process (MDP), defined by $M=(\mathcal{S,A,P},r,\gamma )$, where $\mathcal{S}$ and $\mathcal{A}$ are respectively the state space and the action space, which might be infinite, and $r : \mathcal{S\times A} \rightarrow [0,1]$ is the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing the action $a \in \mathcal{A}$ while in the state $s \in \mathcal{S}$.
The probability transition kernel is defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(\mathcal{S})$, where $\mathcal{P}(s^{'}|s,a)$ is the probability of transitioning from state $s$ to state $s^{'}$ when action $a$ is selected, and $\Delta(\mathcal{S})$ is the probability simplex over $\mathcal{S}$. In the generative case, the probability transition kernel $\mathcal{P}$ is empirically estimated from the data.
The model is Markovian since the transition and reward probabilities depend only on the current state $s$ and not the entire history of states and actions. Usually, the cardinalities $|\mathcal{S}|$ and $|\mathcal{A}|$ are finite: this is the tabular case.

A deterministic policy $\pi$ is a mapping $\pi: \mathcal{S}\rightarrow \mathcal{A}$. The objective of the agent is to find the optimal policy $\pi^\star$ maximizing the value function $V^\pi(s)$, which is the expected discounted total reward starting from the initial state $s^0=s$, i.e., \begin{eqnarray*} \pi^\star &:= &\arg \max_\pi V^\pi(s),\\ V^\pi(s)& := &E \Big[ \sum_{t=0}^\infty \gamma^t r(s^t, a^t)\big | s^0 = s \Big ]. \end{eqnarray*} The corresponding action value function $Q^\pi: \mathcal{S\times A} \rightarrow \mathbb{R}$ of a policy $\pi$ is defined as $$ Q^\pi(s,a) := E \Big [ \sum_{t=0}^\infty \gamma^t r(s^t, a^t) \big | (s^0, a^0)=(s, a) \Big ], $$ $\forall (s,a) \in \mathcal{S \times A}$. There exists an optimal policy $\pi^\star$ that simultaneously maximizes $V^\pi(s) \; \forall s \in \mathcal{S}$, and $Q^\pi(s,a) \; \forall (s,a) \in \mathcal {S \times A}$.
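A hedged sketch of value iteration on a small synthetic tabular MDP $(\mathcal{S},\mathcal{A},\mathcal{P},r,\gamma)$: the Bellman optimality update $Q(s,a) \leftarrow r(s,a) + \gamma \sum_{s'} \mathcal{P}(s'|s,a) \max_{a'} Q(s',a')$ is iterated until numerical convergence, and the greedy optimal policy is read off from $Q^\star$. The randomly generated transition kernel and rewards are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)

    # Small tabular MDP: |S| states, |A| actions, rewards in [0, 1], discount gamma.
    n_states, n_actions, gamma = 5, 3, 0.9
    P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] sums to 1 over s'
    r = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # immediate rewards r(s, a)

    # Value iteration on the action value function Q.
    Q = np.zeros((n_states, n_actions))
    for _ in range(1000):
        V = Q.max(axis=1)                          # V(s) = max_a Q(s, a)
        Q_new = r + gamma * (P @ V)                # Bellman optimality update
        if np.max(np.abs(Q_new - Q)) < 1e-8:       # stop at a numerical fixed point
            Q = Q_new
            break
        Q = Q_new

    pi_star = Q.argmax(axis=1)                     # greedy (deterministic) optimal policy
    print("optimal policy:", pi_star)
    print("optimal values V*:", np.round(Q.max(axis=1), 3))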

This MDP framework is flexible enough to deal with several configurations, e.g., off-policy RL, where data logged from a different policy are used in the exploration step, yet rigorous enough to allow an assessment of the reliability of the results, e.g., confidence intervals for the rewards can be estimated. Furthermore, this setup provides guarantees on the reliability of the algorithms used for estimating the optimal value function $V^\star := V^{\pi^\star}$, or the optimal action value function $Q^\star := Q^{\pi^\star}$.
This MDP framework has been extended to the non-tabular case, with a very large number of states and/or continuous states. For that purpose, one resorts to kernel function approximation, studied by Farahmand et al. (2016) and Long et al. (2022). Long and Han (2022) introduce the perturbational complexity $\Delta_\mathcal{M}(\varepsilon)$, which indicates whether the RL problem can be solved efficiently in terms of sample complexity, i.e., without being affected by the curse of dimensionality. Furthermore, $\Delta_\mathcal{M}(\varepsilon)$ provides a lower bound on the estimation error of the RL-type algorithms they propose. Long and Han (2023) provide an exhaustive survey of the non-tabular case.

III: Current Research and Open Problems

Recent Advances in Reinforcement Learning

Sample Efficiency

Sample Complexity

The sample complexity of an RL algorithm is the minimum number of calls to a sampling model required for the algorithm to learn a policy that is within a specified target accuracy level $\varepsilon$ of the optimal policy, with high probability. Formally, the sample complexity $m(\varepsilon, \delta)$ is the smallest number of samples such that, for any $\delta \in (0,1)$, the learned policy $ \pi$ is an $\varepsilon$-accurate policy, i.e., \begin{eqnarray*} V^{\pi}(s) &\geqslant& V^\star(s) -\varepsilon,\\ Q^{\pi}(s,a) &\geqslant& Q^\star(s,a) -\varepsilon, \end{eqnarray*} with probability at least $1 - \delta$. In RL, there is a trade-off between sample complexity and statistical accuracy: exploring gathers more information and improves statistical accuracy, while exploiting known actions maximizes the reward and reduces the number of samples spent on exploration. The agent's learning speed is directly related to the sample complexity of the task.
The two common frameworks for analyzing sample complexity are:
  1. The PAC bounds, which determine the number of samples needed to guarantee that the algorithm learns a policy whose accuracy error, compared to the optimal policy, is at most $\varepsilon$.
  2. The regret bounds, which measure the cumulative loss incurred before reaching the optimal policy; they quantify how much reward is missed during the learning process.

New results on sample complexity and the minimum sample size for achieving the minimax sample complexity are provided in the AI & Math Blog.

Generalization and Transfer Learning

Stability and Convergence

Multi-Agent Reinforcement Learning

Natural Language Processing

Generative AI is a part of automation technologies: it generates text, programming code and images, using as its database much of the content of the web, together with attention mechanisms, transformer architectures and autoregressive probabilistic models. Although the syntax of the generated text is correct, since grammatical correctness is a direct consequence of the autoregressive probabilistic tools of generative AI, the meaning of this text must still be evaluated.
These combinatorial generative models, based on a linguistic representation of the physical world, work inside the box of that representation and are therefore unable to plan or produce content outside that box. From a mathematical perspective, LLMs often lack complete deductive closure: they might not draw every possible conclusion from the information they have been trained on. They might be able to solve a math problem, yet be unable to explain every step of the solution. This is similar to Kahneman's (2011) Type 1 system, which relies on intuition and heuristics rather than logical reasoning.
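A hedged toy sketch of the autoregressive principle underlying these generative models: text is produced token by token by sampling from an estimated conditional distribution $p(x_t \mid x_{t-1}, \ldots)$, here a simple bigram model estimated from a tiny corpus. The corpus and the first-order Markov assumption are illustrative; production LLMs condition on long contexts through Transformer architectures.

    import numpy as np
    from collections import defaultdict

    corpus = "the cat sat on the mat the dog sat on the rug".split()

    # Estimate bigram counts: p(next word | current word).
    counts = defaultdict(lambda: defaultdict(int))
    for w, w_next in zip(corpus[:-1], corpus[1:]):
        counts[w][w_next] += 1

    rng = np.random.default_rng(0)

    def sample_next(word):
        candidates = list(counts[word].keys())
        freqs = np.array([counts[word][c] for c in candidates], dtype=float)
        return rng.choice(candidates, p=freqs / freqs.sum())

    # Autoregressive generation: each new token is sampled conditionally on the previous one.
    word, generated = "the", ["the"]
    for _ in range(8):
        if not counts[word]:
            break                                  # no observed successor: stop generation
        word = sample_next(word)
        generated.append(word)
    print(" ".join(generated))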

Reasoning Properties of LLMs

The implications of LLMs lacking deductive closure are quite significant. LLMs excel in inductive reasoning, which involves making generalizations based on specific examples. However, their inability to achieve complete deductive closure means they struggle with tasks that require strict logical reasoning, multi-step reasoning or the application of rules to derive conclusions. For instance, in mathematical proofs or legal reasoning, where every step must logically follow from the previous one, LLMs might miss critical connections.
This can lead to errors in situations where precise conclusions are necessary. Due to their probabilistic nature, LLMs can produce inconsistent or contradictory outputs when faced with similar prompts. This inconsistency can undermine trust in their responses, especially in high-stakes environments like science, healthcare or law.
LLMs rely heavily on the data they are trained on. If the training data lacks comprehensive examples of deductive reasoning, the model will not be able to perform well in tasks requiring such reasoning. This highlights the importance of curating diverse and high-quality training datasets.
The limitations in deductive reasoning raise ethical concerns, particularly in applications where decisions impact human lives. For example, in autonomous systems or AI-driven decision-making tools, the inability to reason deductively could lead to harmful outcomes.

To address these limitations, there is a growing interest in developing hybrid models that combine LLMs with other AI systems capable of deductive reasoning. In particular, the reasoning performance of LLMs can be extended and improved with Chain-of-Thought (CoT) prompting (Wei et al., 2022). CoT is an extension of the few-shot prompting properties of LLMs (Brown et al., 2020): in few-shot prompting, the LLM is provided with a small number of in-context (input, output) pairs of examples, or 'shots'. In contrast, in CoT prompting, the LLM is provided with triples (input, chain-of-thought, output), where the chain-of-thought consists of a series of intermediate reasoning steps that lead to the solution of the studied problem.
CoT prompting consists of breaking down complex tasks into smaller, intermediate and more manageable steps. These intermediate steps, which make the decision-making process more transparent, can be particularly useful in tasks that require multiple steps of reasoning, where understanding the model's reasoning is crucial.
Furthermore, this decomposition into elementary steps can reduce the likelihood of generating irrelevant or incorrect information, known as hallucinations. CoT reduces the cognitive load of processing complex information all at once by distributing the problem-solving process over multiple steps, allowing the model to handle each step more effectively and to carry out intermediate verification of each step in the reasoning process. This means that errors can be identified and corrected earlier, preventing them from propagating and affecting the final conclusion. RL can be combined with CoT to reduce hallucinations.
Prompting the model then consists of generating a series of thoughts or steps that lead to the final answer, rather than directly producing the solution. This approach helps the model to handle more complex, multi-step problems by mimicking human-like, step-by-step logical reasoning leading to the final answer. It requires careful prompt engineering, i.e., carefully designing and refining specific instructions or questions that guide the model's responses in a way that aligns with the user's goals or requirements; poorly designed prompts can lead to suboptimal performance. Furthermore, CoT prompting does not require extensive retraining.
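A hedged sketch of how a chain-of-thought prompt differs from a plain few-shot prompt: the exemplar includes an intermediate reasoning trace before the answer. The example problems, their wording and the function `query_llm` are hypothetical placeholders introduced here for illustration; they are not taken from Wei et al. (2022).

    # Hypothetical prompt construction; query_llm is a placeholder for any LLM API call.
    few_shot_prompt = (
        "Q: A shop sells pencils in boxes of 12. How many pencils are in 3 boxes?\n"
        "A: 36\n\n"
        "Q: A train travels 60 km per hour for 2.5 hours. How far does it travel?\n"
        "A:"
    )

    chain_of_thought_prompt = (
        "Q: A shop sells pencils in boxes of 12. How many pencils are in 3 boxes?\n"
        "A: Each box contains 12 pencils. There are 3 boxes, so 3 x 12 = 36. The answer is 36.\n\n"
        "Q: A train travels 60 km per hour for 2.5 hours. How far does it travel?\n"
        "A: Let's think step by step."
    )

    def query_llm(prompt: str) -> str:
        # Placeholder: send `prompt` to an LLM of your choice and return its completion.
        raise NotImplementedError

    # With the CoT prompt, the model is expected to produce intermediate steps
    # ("60 km per hour for 2.5 hours, so 60 x 2.5 = 150 km") before the final answer.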
Single-pathway CoT prompting has been further generalized by Tree of Thoughts (ToT) prompting (Yao et al., 2024), which is characterized by multiple competing reasoning paths modeled as a tree. The formalization of human problem solving as a search problem along a tree has been advocated by the Turing Award laureates Newell and Simon (1972).

Solving mathematical problems

The direct use of LLMs for solving mathematical problems gives poor results; e.g., the multi-step reasoning of a maths solver must be verified (Cobbe et al., 2021).
Recent research has shown that the reasoning properties of large language models (LLMs) can be enhanced by equipping them with reasoning and planning capabilities that are closer to Kahneman's (2011) Type 2 system.
In particular, Chervonyi et al. (2025) developed a new fast algorithm called deductive database arithmetic reasoning (DDAR 2). DDAR employs a fixed set of deduction rules to iteratively add new elements to the initial information set until its deduction closure is reached. The DDAR 2 algorithm successfully solves gold medal-level olympiad problems in geometry, and improves on the results obtained with the previous version (DDAR 1) presented by Trinh et al. (2024).

IV: Learning Resources & References

Academic Learning Resources

References

  1. Abernethy, J., Agarwal, A., Marinov, T.V. and Warmuth, M.K., 2024. A mechanism for sample-efficient in-context learning for sparse retrieval tasks. Proceedings of Machine Learning Research, PMLR 237, pp. 3-46. MR4740478
  2. Agarwal, A., Kakade, S.M., Lee, J.D. and Mahajan, G., 2021. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98), pp.1-76. MR4279749
  3. Aggarwal, C.C., 2018. Neural networks and deep learning. A textbook. Springer, Cham. MR3966422
  4. Albrecht, S.V., Christianos, F. and Schäfer, L., 2024. Multi-agent reinforcement learning: Foundations and modern approaches. MIT Press.
  5. Azar, M.G., Osband, I. and Munos, R., 2017. Minimax regret bounds for reinforcement learning. Proceedings of the 34th International Conference on Machine Learning, PMLR 70, pp.263-272.
  6. Azar, M.G., Munos, R. and Kappen, H.J., 2013. Minimax PAC bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91, pp.325-349. MR3064431
  7. Baevski, A., Hsu, W.N., Xu, Q., Babu, A., Gu, J. and Auli, M., 2022, Data2vec: A general framework for self-supervised learning in speech, vision and language. Proceedings of the 39th International Conference on Machine Learning, PMLR, 162, pp. 1298-1312.
  8. Bahdanau, D., Cho, K., and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  9. Belkin, M., 2021. Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. Acta Numerica 3, 203-248. MR4298218
  10. Belkin, M., Hsu, D., Ma, S. and Mandal, S., 2019. Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proc. Natl. Acad. Sci. USA, 116, no. 32, 15849-15854. MR3997901
  11. Belkin, M., Hsu, D. and Xu, J., 2020. Two models of double descent for weak features, SIAM Journal on Mathematics of Data Science 2, no. 4, 1167-1180. MR4186534
  12. Bishop, C.M. and Bishop, H., 2023. Deep learning: Foundations and concepts. Springer Nature. MR4719738
  13. Bishop, C. M., 2006. Pattern recognition and machine learning. Inf. Sci. Stat. Springer, New York. MR2247587
  14. Brown, T. B. et al., 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  15. Cheng, K., Yang, J., Jiang, H., Wang, Z., Huang, B., Li, R., Li, S., Li, Z., Gao, Y., Li, X. and Yin, B., 2024. Inductive or deductive? Rethinking the fundamental reasoning abilities of LLMs. arXiv preprint arXiv:2408.00114.
  16. Chervonyi, Y., Trinh, T.H., Olšák, M., Yang, X., Nguyen, H., Menegali, M., Jung, J., Verma, V., Le, Q.V. and Luong, T., 2025. Gold-medalist performance in solving olympiad geometry with AlphaGeometry2. arXiv preprint arXiv:2502.03544.
  17. Chou, S.C., Gao, X.S. and Zhang, J.Z., 2000. A deductive database approach to automated geometry theorem proving and discovering. Journal of Automated Reasoning, 25(3), pp.219-246. MR1774686
  18. Chou, S.C., Gao, X.S. and Zhang, J.Z., 1996. Automated generation of readable proofs with geometric invariants: I. Multiple and shortest proof generation. Journal of Automated Reasoning, 17(3), pp.325-347. MR1428346
  19. Chou, S.C., Gao, X.S. and Zhang, J., 1994. Machine proofs in geometry: Automated production of readable proofs for geometry theorems (Vol. 6). World Scientific. MR1334761
  20. Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R. and Hesse, C., 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  21. Colbrook, M.J., Antun, V. and Hansen, A.C. 2022. The difficulty of computing stable and accurate neural networks: on the barriers of deep learning and Smale's 18th problem. Proc. Natl. Acad. Sci. USA 119, no. 12, Paper No. e2107151119, 10 pp. MR4417588
  22. Dixon, M.F., Halperin, I. and Bilokon, P. 2020. Machine learning in finance—from theory to practice. Springer, Cham. MR4220647
  23. Farahmand, A., Ghavamzadeh, M., Szepesvàri, C. and Mannor, S. 2016. Regularized policy iteration with nonparametric function spaces. Journal of Machine Learning Research, 17, Paper No. 139, 66 pp. MR3555030
  24. Geiping, J., Garrido, Q., Fernandez, P., Bar, A., Pirsiavash, H., LeCun, Y., and Goldblum, M., 2023. A Cookbook of self-supervised learning. arXiv preprint arXiv:2304.12210.
  25. Goodfellow, I., Bengio, Y., and Courville, A., 2016. Deep learning. Adapt. Comput. Mach. Learn. MIT Press, Cambridge, MA. MR3617773
  26. Goyal, A. and Bengio, Y., 2022. Inductive biases for deep learning of higher-level cognition. Proc. R. Soc. A.478 no.2266, Paper No. 20210068, 35 pp. MR4505795
  27. Green, P. J. and Silverman, B. W., 1994. Nonparametric regression and generalized linear models. A roughness penalty approach. Monogr. Statist. Appl. Probab., 58 Chapman & Hall, London. MR1270012
  28. Greff, K., Srivastava, R., Koutník, J., Steunebrink, B. and Schmidhuber, J., 2017. LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28, no. 10, 2222-2232. MR3709742
  29. Gronauer, S. and Diepold, K., 2022. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review, 55(2), pp.895-943.
  30. Hastie, T., Tibshirani, R. and Friedman, J., 2009. The elements of statistical learning. Data mining, inference, and prediction. Second edition Springer Ser. Statist. Springer, New York. MR2722294
  31. Hendrycks, D., Mazeika, M., Wilson, D. and Gimpel, K., 2018. Using trusted data to train deep networks on labels corrupted by severe noise. Advances in Neural Information Processing Systems (NIPS 2018) 31, Curran Associates Inc., Red Hook, NY, USA, pp. 10477–10486.
  32. Hinton, G.E. and Salakhutdinov, R.R., 2006. Reducing the dimensionality of data with neural networks. Science, 313(5786), pp.504-507. MR2242509
  33. Hinton, G., 2023. How to represent part-whole hierarchies in a neural network. Neural Computation, 35(3), pp.413-452. MR4555492
  34. Hochreiter, S. and Schmidhuber, J., 1997. Long short-term memory. Neural Computation, 9(8), pp.1735-1780.
  35. Hornik, K., Stinchcombe, M. and White, H., 1989. Multilayer feedforward networks are universal approximators. Neural networks, 2(5), pp.359-366.
  36. Jordan, M. I., and Mitchell, T. M., 2015. Machine learning: trends, perspectives, and prospects. Science, 349, no. 6245, 255-260. MR3382217
  37. Kahneman, D., 2011. Thinking Fast and Slow. Macmillan.
  38. Kallus, N. and Uehara, M., 2020. Double reinforcement learning for efficient off-policy evaluation in Markov decision processes. Journal of Machine Learning Research, Paper No. 167, 63 pp. MR4209453
  39. Kallus, N. and Uehara, M., 2022. Efficiently breaking the curse of horizon in off-policy evaluation with double reinforcement learning. Operations Research, 70, no. 6, 3282-3302. MR4538517
  40. Kambhampati, S., 2024. Can large language models reason and plan? Annals of the New York Academy of Sciences, 1534(1), pp.15-18.
  41. Kojima, T., Gu, S., Reid, M., Matsuo, Y. and Iwasawa, Y., 2022. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35, 22199-22213.
  42. Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes. Mathematical Programming, 198, no. 1, Ser. A, 1059-1106. MR4550970
  43. Lan, G., 2020. First-order and stochastic optimization methods for machine learning. Springer Cham. MR4219819
  44. LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. Nature, 521(7553), pp.436-444.
  45. Li, G., Wei, Y., Chi, Y. and Chen, Y., 2024. Breaking the sample size barrier in model-based reinforcement learning with a generative model. Operations Research, 72, no. 1, 203-221. MR4705834
  46. Li, Z., Liu, B., Yang, Z., Wang, Z. and Wang, M., 2023. Double duality: variational primal-dual policy optimization for constrained reinforcement learning. Journal of Machine Learning Research, 24(385), pp.1-43. MR4720841
  47. Liu, C., Zhu, L. and Belkin, M., 2022. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks, Applied and Computational Harmonic Analysis 59, 85-116. MR4412181
  48. Łojasiewicz, S., 1962. Une propriété topologique des sous-ensembles analytiques réels, in Les Équations aux Dérivées Partielles , 87-89. MR0160856
  49. Long, J. and Han, J., 2022. Perturbational complexity by distribution mismatch: a systematic analysis of reinforcement learning in reproducing kernel Hilbert space. Journal of Machine Learning, 1, no. 1, 1-37. MR4716372
  50. Long, J., Han, J. and E, W., 2022. An $L^2$ analysis of reinforcement learning in high dimensions with kernel and neural network approximation. CSIAM Transactions on Applied Mathematics, 3, no. 2, 191-220. MR4456674
  51. Long, J. and Han, J. 2023. Reinforcement learning with function approximation: from linear to nonlinear. Journal of Machine Learning 2, no. 3, 161-193. MR4734710
  52. MacKay, D.J.C., 2003. Information theory, inference and learning algorithms. Cambridge University Press, New York, 2003. MR2012999
  53. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G. and Petersen, S., 2015. Human-level control through deep reinforcement learning. Nature, 518(7540), pp.529-533.
  54. Mohajerin Esfahani, P. and Kuhn, D., 2018. Data-driven distributionally robust optimization using the Wasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming 171, no. 1-2, 115-166. MR3844536
  55. Mohri, M., Rostamizadeh, A. and Talwalkar, A., 2018. Foundations of machine learning, second edition. Adaptive Computation and Machine Learning, MIT Press, Cambridge, MA. MR3931734
  56. Newell, A., Shaw, J.C. and Simon, H.A., 1959, June. Report on a general problem solving program. In IFIP congress, Vol. 256, p. 64.
  57. Newell, A. and Simon, H., 1972. Human problem solving, ACS symposium series, Prentice Hall.
  58. Nocedal, J., and Wright, S. J., 2006. Numerical optimization. Second edition. Springer Ser. Oper. Res. Financ. Eng. Springer, New York. MR2244940
  59. Northcutt, C., Jiang, L. and Chuang, I., 2021. Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research 70, pp.1373-1411. MR4306614
  60. Pananjady, A. and Wainwright, M.J., 2021. Instance-dependent $\ell_\infty $-bounds for policy evaluation in tabular reinforcement learning. IEEE Transactions on Information Theory,67(1) pp.566–585. MR4231973
  61. Poljak, B. T., 1963. Gradient methods for minimizing functionals, Ž. Vyčisl. Mat i Mat. Fiz. 3, 643-653. MR0158568
  62. Radhakrishnan, A., Belkin, M. and Uhler, C., 2020. Overparameterized neural networks implement associative memory, Proc. Natl. Acad. Sci. USA, 117, no. 44, 27162-27170. MR4255943
  63. Rahimi, A. and Recht, B., 2007. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems (NIPS 2007) 20 (J. C. Platt et al., eds), Curran Associates, pp. 1177-1184.
  64. Rumelhart, D.E., Hinton, G.E. and Williams, R.J., 1986. Learning representations by back-propagating errors. Nature, 323(6088), pp.533-536.
  65. Schmidhuber, J., 2015. Deep learning in neural networks: An overview. Neural networks, 61, pp.85-117.
  66. Shalev-Shwartz, S. and Ben-David, S., 2014. Understanding machine learning: From theory to algorithms. Cambridge university press.
  67. Shi, C., Zhang, S., Lu, W. and Song, R., 2020. Statistical inference of the value function for reinforcement learning in infinite-horizon settings. Journal of the Royal Statistical Society, Series B, 84, no. 3, 765-793. MR4460575
  68. Sidford, A., Wang, M., Wu, X., Yang, L. and Ye, Y., 2018. Near-optimal time and sample complexities for solving Markov decision processes with a generative model. Advances in Neural Information Processing Systems, 31.
  69. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A. and Chen, Y., 2017. Mastering the game of go without human knowledge. Nature, 550(7676), pp.354-359.
  70. Simon, H. A., 1962. The architecture of complexity. Proceedings of the American Philosophical Society, 106(6), 467-482. JSTOR
  71. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. and Salakhutdinov, R., 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), pp.1929-1958. MR3231592
  72. Strang, G., 2019. Linear algebra and learning from data. Wellesley-Cambridge Press.
  73. Strehl, A.L. and Littman, M.L., 2008. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8), pp.1309-1331. MR2460287
  74. Sutton, R. S. and Barto, A.G., 2018. Reinforcement learning: an introduction. Second edition. Adapt. Comput. Mach. Learn. MIT Press, Cambridge, MA. MR3889951
  75. Szepesvàri, C., 2022. Algorithms for reinforcement learning, reprint of the 2010 original. Synthesis Lectures on Artificial Intelligence and Machine Learning, 9, Springer, Cham. MR4647095
  76. Teyssière, G., 2022. Review of "Belkin, M. (2021). Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation". Mathematical Reviews, American Mathematical Society.
  77. Teyssière, G., 2023. Review of "Lan, G., 2023. Policy mirror descent for reinforcement learning: linear convergence, new sampling complexity, and generalized problem classes". Mathematical Reviews, American Mathematical Society.
  78. Teyssière, G., 2023. Review of "Dixon, M.F., Halperin, I. and Bilokon, P. 2020. Machine learning in finance—from theory to practice." Mathematical Reviews, American Mathematical Society.
  79. Teyssière, G., 2025. Review of "Long, J. and Han, J., 2022. Perturbational complexity by distribution mismatch: a systematic analysis of reinforcement learning in reproducing kernel Hilbert space". Mathematical Reviews, American Mathematical Society.
  80. Teyssière, G., 2025. Review of "Li, G., Wei, Y., Chi, Y. and Chen, Y., 2024. Breaking the sample size barrier in model-based reinforcement learning with a generative model". Mathematical Reviews, American Mathematical Society.
  81. Teyssière, G., 2025. Review of "Wang, J., Gao, R. and Zha, H., 2024. Reliable off-policy evaluation for reinforcement learning ". Mathematical Reviews, American Mathematical Society.
  82. Trinh, T.H., Wu, Y., Le, Q.V., He, H. and Luong, T., 2024. Solving olympiad geometry without human demonstrations. Nature, 625(7995), pp.476-482.
  83. Vapnik, V.N., 2000. The nature of statistical learning theory. Second edition Stat. Eng. Inf. Sci. Springer-Verlag, New York. MR1719582
  84. Wainwright, M.J., 2019. High-dimensional statistics: A non-asymptotic viewpoint. Cambridge university press.
  85. Wang, J., Gao, R. and Zha, H., 2024. Reliable off-policy evaluation for reinforcement learning. Operations Research, 72, no. 2, 699-716. MR4677384
  86. Wasserman, L. 2004. All of statistics. A concise course in statistical inference. Springer Texts Statist. Springer-Verlag, New York. MR2055670
  87. Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., and Zhou, D., 2022. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th International Conference on Neural Information Processing Systems (NIPS '22), 35, Article No 1800, 24824-24837.
  88. Williams, R.J., 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8, pp.229-256.
  89. Wu, W.T., 1984. On the decision problem and the mechanization of theorem-proving in elementary geometry. Automated theorem proving (Denver, Col., 1983), pp.213-234. Contemp. Math., 29, American Mathematical Society, Providence, RI. MR0749248
  90. Yang, I., 2017. A convex optimization approach to distributionally robust Markov decision processes with Wasserstein distance. IEEE Control Systems Letters 1, no.1, 164-169. MR4208527
  91. Yeo, E., Tong, Y., Niu, M., Neubig, G. and Yue, X., 2025. Demystifying Long Chain-of-Thought Reasoning in LLMs. arXiv preprint arXiv:2502.03373.
  92. Zhang, K., Yang, Z. and Başar, T., 2021. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of reinforcement learning and control, pp.321-384. MR4328446
  93. Zhang, K., Kakade, S.M., Basar, T. and Yang, L.F., 2023. Model-based multi-agent rl in zero-sum markov games with near-optimal sample complexity. Journal of Machine Learning Research, 24(175), pp.1-53. MR4633564
Updated March 2025.