Machine Learning Paradigms
Supervised Learning
Supervised learning is the simplest learning method: the model learns from labeled data, i.e., (input, output) pairs.
We have a dataset $ \mathcal{D} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\} $, which consists of a collection of $N$ pairs of inputs (or features) $x_i$ and outputs $y_i$, written in matrix form as $ \mathcal{D} =(X,Y)$. The output $Y$ might be continuous, discrete, or a set of labels in the case of a classification problem, etc.
The goal of supervised learning is to approximate the function $f$ that maps $X$ to $Y$, i.e., $Y=f(X)$, by learning this function from the dataset $\mathcal{D}$.
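As a minimal, illustrative sketch (not part of the formal exposition), the following Python snippet approximates $f$ by ordinary least squares on a synthetic dataset $\mathcal{D}=(X,Y)$; the linear model, the noise level, and all variable names are assumptions made purely for concreteness.

```python
import numpy as np

# Synthetic labeled dataset D = (X, Y): N pairs (x_i, y_i)
rng = np.random.default_rng(0)
N, d = 200, 3
X = rng.normal(size=(N, d))                # inputs / features
true_w = np.array([1.5, -2.0, 0.5])
Y = X @ true_w + 0.1 * rng.normal(size=N)  # outputs with small noise

# Approximate f by a linear map: minimize ||X w - Y||^2 (ordinary least squares)
w_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The learned f_hat(x) = x @ w_hat can now predict outputs for new inputs
x_new = rng.normal(size=d)
print("prediction:", x_new @ w_hat, "true value:", x_new @ true_w)
```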
Unsupervised Learning
The model learns patterns or structures from data without labeled outcomes, since labels are either costly to obtain or only partially available. Furthermore, errors in the labels affect the result of supervised learning and require either the use of specific corrective tools (Northcutt et al., 2021) or the design of alternative methods for estimating the model without depending on labels.
Unsupervised learning takes into account the internal structure/organisation of the data and does not need a set of labels. This resembles the nonparametric approach in statistics, where no model is specified and the user "lets the data speak for themselves"; as a consequence, some nonparametric methods are used in unsupervised learning.
Semi-supervised learning is the intermediate case between supervised and unsupervised learning: the learning procedure uses the available labelled data while leveraging a larger amount of unlabeled data, which can strongly enhance the accuracy of the model.
Standard examples of unsupervised learning are (see the sketch after this list):
- Clustering: Grouping similar data points (e.g., customer segmentation).
- Dimensionality Reduction: Simplifying high-dimensional data while preserving important features (e.g., PCA for data visualization).
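As a minimal sketch of these two examples, the snippet below runs k-means clustering and PCA with scikit-learn on synthetic, unlabeled data; the number of clusters, the data-generating process, and the parameter values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Unlabeled data: three blurred groups in 5 dimensions (synthetic, for illustration)
rng = np.random.default_rng(0)
centers = rng.normal(scale=5.0, size=(3, 5))
X = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])

# Clustering: group similar points without using any labels
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: project onto the 2 directions of largest variance
X_2d = PCA(n_components=2).fit_transform(X)

print(labels[:10])   # cluster assignments for the first points
print(X_2d.shape)    # (300, 2): data ready for visualization
```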
Self-supervised Learning
Self-supervised learning (SSL) is an approach in machine learning that utilizes unlabeled data to train models. Instead of relying on external labels, it formulates auxiliary tasks based on inherent data properties (e.g., internal structure and organization), such as predicting missing parts, detecting relationships, or reconstructing data representations.
The model learns meaningful representations or features of the data using, e.g., pretext tasks, contrastive learning, or redundancy reduction (Zbontar et al., 2021).
These meaningful patterns and features are then used for downstream tasks, often with minimal additional supervision. SSL bridges the gap between supervised and unsupervised learning by exploiting the vast availability of unlabeled data while reducing the consequences of the labeling errors mentioned above.
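A toy illustration of the pretext-task idea, under the assumption that predicting a masked feature from the remaining ones is an adequate stand-in for the richer pretext tasks mentioned above; the data and the linear predictor below are synthetic and purely illustrative.

```python
import numpy as np

# Unlabeled data only: no external labels are available
rng = np.random.default_rng(0)
N = 500
Z = rng.normal(size=(N, 3))
# The last feature depends on the others (plus a little noise)
X = np.column_stack([Z, Z @ np.array([0.7, -0.3, 0.5]) + 0.05 * rng.normal(size=N)])

# Pretext task: hide the last feature and predict it from the remaining ones.
# The "labels" are manufactured from the data itself, not provided externally.
X_visible, x_masked = X[:, :-1], X[:, -1]
w_hat, *_ = np.linalg.lstsq(X_visible, x_masked, rcond=None)

# w_hat is a (very crude) learned summary of how the features relate;
# in practice the learned representation would be reused for downstream tasks.
print("reconstruction error:", np.mean((X_visible @ w_hat - x_masked) ** 2))
```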
Reinforcement Learning
Reinforcement learning (RL) is a fundamental learning method at the intersection of artificial intelligence and machine learning.
In the standard supervised, unsupervised, and self-supervised frameworks, the learner receives the data and learns a model from them.
In the RL framework, the learner is more active: it is an agent, or decision maker, that interacts in an unsupervised manner with its environment through a sequence of actions.
In response to the action just taken, the agent receives two types of information through a standard feedback mechanism:
- Its current state in the environment,
- A real-valued immediate reward for its action. In more advanced frameworks, future or long-term rewards are also provided. The trade-off between immediate and future rewards is controlled by the discount factor $\gamma \in (0,1)$.
The agent takes this feedback into account for its future actions.
The objective of the agent is to maximize its reward, that is, to select the best course of actions, the policy, that maximizes the cumulative reward over time. If the environment is known, this is a planning problem, whilst if the environment is unknown, this is a learning problem.
In the latter case, the agent balances the trade-off between exploration, i.e., obtaining more information on the environment and the rewards, and exploitation of the information already collected to optimize the rewards.
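An informal sketch of this exploration/exploitation trade-off is the $\varepsilon$-greedy rule on a stateless problem (a multi-armed bandit); the number of actions, the value of $\varepsilon$, and the reward distributions below are arbitrary assumptions made only for illustration.

```python
import numpy as np

# Toy exploration/exploitation problem: 5 actions with unknown mean rewards
rng = np.random.default_rng(0)
true_means = rng.uniform(size=5)
eps, T = 0.1, 5000

Q = np.zeros(5)        # running estimate of each action's mean reward
counts = np.zeros(5)

for t in range(T):
    if rng.random() < eps:                 # exploration: try a random action
        a = int(rng.integers(5))
    else:                                  # exploitation: use current estimates
        a = int(np.argmax(Q))
    reward = rng.normal(loc=true_means[a], scale=0.1)
    counts[a] += 1
    Q[a] += (reward - Q[a]) / counts[a]    # incremental mean update

print("best true action:", int(np.argmax(true_means)),
      "action preferred by the agent:", int(np.argmax(Q)))
```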
RL methods are often computationally intensive. However, the recent emergence of DeepSeek-R1 (2025), which utilizes RL and operates with fewer resources than state-of-the-art large language models (LLMs), is likely to enhance the adoption of RL methods.
The standard framework for formalizing reinforcement learning is the Markov decision process (MDP), defined by $M=(\mathcal{S},\mathcal{A},\mathcal{P},r,\gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are respectively the state space and the action space, which might be infinite, and $r : \mathcal{S\times A} \rightarrow [0,1]$ is the reward function, i.e., $r(s,a)$ is the immediate reward upon choosing the action $a \in \mathcal{A}$ while in the state $s \in \mathcal{S}$.
The probability transition kernel is defined as $\mathcal{P}: \mathcal{S \times A} \rightarrow \Delta(\mathcal{S})$, where $\mathcal{P}(s'|s,a)$ is the probability of transitioning from state $s$ to state $s'$ when $a$ is selected; $\Delta(\mathcal{S})$ is the probability simplex over $\mathcal{S}$. In the generative setting, the probability transition kernel $\mathcal{P}$ is empirically estimated from the data.
The model is Markovian since the transition probabilities and rewards depend only on the current state $s$ and action $a$, and not on the entire history of states and actions. Usually, the numbers of states $|\mathcal{S}|$ and actions $|\mathcal{A}|$ are finite: this is the tabular case.
A deterministic policy $\pi$ is a mapping $\pi: \mathcal{S}\rightarrow \mathcal{A}$.
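For concreteness, a tabular MDP and a deterministic policy can be encoded with plain arrays, as in the sketch below; the sizes, the random transition kernel, and the rewards are arbitrary choices made here for illustration and are not part of the original text.

```python
import numpy as np

# A tiny tabular MDP M = (S, A, P, r, gamma) encoded with arrays
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# P[s, a] is a probability distribution over next states (a point of Delta(S))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
# r[s, a] in [0, 1] is the immediate reward of action a in state s
r = rng.uniform(size=(n_states, n_actions))

# A deterministic policy pi: S -> A stored as one action index per state
pi = np.array([0, 1, 0])

assert np.allclose(P.sum(axis=-1), 1.0)          # each P[s, a] sums to 1
print(r[np.arange(n_states), pi])                # immediate reward r(s, pi(s)) per state
```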
The objective of the agent is to find the optimal policy $\pi^\star$ maximizing the value function $V^\pi(s)$, which is the expected discounted total reward starting from the initial state $s^0=s$, i.e.,
\begin{eqnarray*}
\pi^\star &:= &\arg \max_\pi V^\pi(s),\\
V^\pi(s)& := &E \Big[ \sum_{t=0}^\infty \gamma^t r(s^t, a^t)\big | s^0 = s \Big ].
\end{eqnarray*}
where, at each step $t$, the action is chosen according to the policy, $a^t = \pi(s^t)$, and the next state is drawn from $\mathcal{P}(\cdot\,|\,s^t,a^t)$. The corresponding action value function $Q^\pi: \mathcal{S\times A} \rightarrow \mathbb{R}$ of a policy $\pi$ is defined as
$$
Q^\pi(s,a) := E \Big [ \sum_{t=0}^\infty \gamma^t r(s^t, a^t) \big | (s^0, a^0)=(s, a) \Big ],
$$
$\forall (s,a) \in \mathcal{S \times A}$. There exists an optimal policy $\pi^\star$ that simultaneously maximizes $V^\pi(s) \; \forall s \in \mathcal{S}$, and $Q^\pi(s,a) \; \forall (s,a) \in \mathcal {S \times A}$.
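When the MDP is known (the planning setting), $V^\star$, $Q^\star$, and a greedy optimal policy can be computed by iterating the Bellman optimality operator (value iteration). The sketch below reuses the array encoding of the previous snippet on an arbitrary random MDP; it is only illustrative and is not the algorithm analyzed in the works cited below.

```python
import numpy as np

# Value iteration on a small random tabular MDP (illustrative only)
rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(size=(n_states, n_actions))                       # r(s, a) in [0, 1]

# Bellman optimality update: Q(s, a) = r(s, a) + gamma * sum_s' P(s'|s, a) V(s')
V = np.zeros(n_states)
for _ in range(1000):
    Q = r + gamma * P @ V          # shape (n_states, n_actions)
    V_new = Q.max(axis=1)          # V*(s) = max_a Q*(s, a)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

pi_star = Q.argmax(axis=1)         # greedy (deterministic) optimal policy
print("V* =", np.round(V, 3), "  pi* =", pi_star)
```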
This MDP framework is flexible enough to deal with several configurations, e.g., off-policy RL, where data logged from a different policy are used in the exploration step, yet rigorous enough to allow assessing the reliability of the results, e.g., confidence intervals for the rewards can be estimated. Furthermore, this setup makes it possible to provide guarantees on the reliability of the algorithms used for estimating the optimal value function $V^\star := V^{\pi^\star}$ or the optimal action value function $Q^\star := Q^{\pi^\star}$. In particular, recent results show that the minimax sample complexity can be achieved as soon as the sample size satisfies $N \geqslant O\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\right)$.
This MDP framework has been extended to the non-tabular case, with a very large number of states and/or continuous states.
For that purpose, one can resort to kernel function approximation, already studied by Farahmand et al. (2016) and Long et al. (2022). Long and Han (2022) introduce the perturbational complexity $\Delta_\mathcal{M}(\varepsilon)$, which indicates whether the RL problem can be solved efficiently in terms of sample complexity, i.e., without being affected by the curse of dimensionality. Furthermore, $\Delta_\mathcal{M}(\varepsilon)$ provides a lower bound on the estimation error of the RL-type algorithms they propose. Long and Han (2023) provide an exhaustive survey of the non-tabular case.