3.1. Neural Network as a Dynamical System
In order to study DL as an optimal control problem, it is necessary to express the NN learning process as a dynamical system [6,21]. In the simplest form, the feed-forward propagation in a T-layer network can be expressed by the following difference equation:
$x_{t+1} = f(x_t, \theta_t), \qquad t = 0, 1, \dots, T-1, \quad (14)$
        where $x_0$ is the input, e.g., an image, several time-series, etc., while $x_T$ is the final output, to be compared to some target $y$ by means of a given loss function. By moving from a discrete-time formulation to a continuous one, the forward dynamics we are interested in will be described by a differential equation that takes the role of (14). The learning aim is to tune the trainable parameters $\theta = \{\theta_t\}$ so that $x_T$ is as close as possible to $y$, according to a specified metric, knowing that the target $y$ is joined to the input $x_0$ by means of a probability measure $\mu$.
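To make the discrete-to-continuous correspondence concrete, the following minimal Python sketch (our illustration, not part of the original formulation; the residual update and the tanh layer are assumptions) propagates an input through T layers read as Euler steps of the continuous forward dynamics:

import numpy as np

def f(x, theta):
    # Layer dynamics f(x, theta); an illustrative tanh layer with theta = (W, b).
    W, b = theta
    return np.tanh(W @ x + b)

def forward(x0, thetas, dt=1.0):
    # Euler discretization x_{t+1} = x_t + dt * f(x_t, theta_t) of the continuous dynamics.
    x = x0
    for theta in thetas:          # one parameter set per layer / time step
        x = x + dt * f(x, theta)
    return x                      # plays the role of the final output x_T

# Tiny usage example: a 3-layer network acting on a 4-dimensional input.
rng = np.random.default_rng(0)
d = 4
thetas = [(0.1 * rng.normal(size=(d, d)), np.zeros(d)) for _ in range(3)]
x_T = forward(rng.normal(size=d), thetas)

The residual update x + dt * f(x, theta) is one common choice of the layer map in (14) that makes the continuous-time limit, i.e., the ODE discussed below, transparent.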
Following the dynamical systems approach developed in [6], the supervised learning method aims to approximate some function, usually called the oracle, denoted by $F : \mathcal{X} \to \mathcal{Y}$.
As stated before, the set $\mathcal{X}$ contains the d-dimensional arrays of inputs, e.g., images, financial time-series, recorded sound data, text, etc., while $\mathcal{Y}$ collects the targets, modelling, e.g., the corresponding image labels, numerical forecasts, or predicted texts.
In this setting, it is standard to define what is called a hypothesis space as:
Training moves from a collection of K samples of input-target pairs, the goal being to approximate the oracle F by exploiting these training data points.
Let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space supporting random variables $x_0$ and $y$, jointly distributed according to $\mu$, with $\mu$ modelling the distribution of the input-target pairs. The set of controls denotes the admissible training weights, which are assumed to be essentially bounded, measurable functions $\theta : [0, T] \to \Theta$. The network depth, i.e., the number of layers, is denoted by T. We also introduce the functions:
- the feed-forward dynamics $f : \mathbb{R}^d \times \Theta \to \mathbb{R}^d$; 
- the terminal loss function $\Phi : \mathbb{R}^d \times \mathbb{R}^l \to \mathbb{R}$; 
- the regularization term $L : \mathbb{R}^d \times \Theta \to \mathbb{R}$. 
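As a purely illustrative instantiation (these specific choices are ours and are not prescribed by the general formulation), one may keep in mind
$f(x, \theta) = \sigma(Wx + b), \quad \theta = (W, b), \qquad \Phi(x, y) = |g(x) - y|^2, \qquad L(x, \theta) = \lambda |\theta|^2,$
with $\sigma$ an activation function applied componentwise, $g$ a fixed readout map sending the terminal state to the target space, and $\lambda \ge 0$ a regularization weight.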
State dynamics are described by an Ordinary Differential Equation (ODE) of the form:
$\dot{x}_t = f(x_t, \theta_t), \qquad t \in [0, T], \quad (15)$
        representing the continuous version of Equation (14), equipped with an initial condition $x_0$, which is a random variable responsible for the randomness characterizing Equation (15).
The population risk minimization problem in DL can be expressed by the following MF-optimal control problem (see [5] (p. 5)):
$\inf_{\theta} \, J(\theta) := \mathbb{E}_{\mu} \Big[ \Phi(x_T, y) + \int_0^T L(x_t, \theta_t) \, dt \Big], \quad (16)$
        subject to the dynamics expressed by the stochastic ODE (15). Since the same weights $\theta$ are shared across the whole distribution $\mu$ of the random input-target pairs $(x_0, y)$, Equation (16) can be studied as an MF-optimal control problem.
On the other hand, the empirical risk minimization problem can be expressed by a sampled optimal control problem after drawing i.i.d. samples $(x_0^i, y^i)$, $i = 1, \dots, K$, distributed according to $\mu$:
$\inf_{\theta} \, \frac{1}{K} \sum_{i=1}^{K} \Big[ \Phi(x_T^i, y^i) + \int_0^T L(x_t^i, \theta_t) \, dt \Big], \quad (17)$
        subject to the dynamics:
$\dot{x}_t^i = f(x_t^i, \theta_t), \qquad i = 1, \dots, K,$
        whose solutions, moving from random initial conditions through a deterministic path, correspond to random variables.
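The sampled problem (17) is what standard training implements in practice. The sketch below is purely illustrative (our own simplifying assumptions: targets living in the state space, a mean-squared terminal loss, a quadratic running cost, plain gradient descent via PyTorch autograd) and shows one way to optimize the discretized dynamics over K sampled pairs:

import torch

def f(x, W, b):
    # shared layer dynamics, applied to a whole batch of samples
    return torch.tanh(x @ W.T + b)

def empirical_risk(thetas, x0, y, dt=1.0, lam=1e-3):
    # (1/K) sum_i [ Phi(x_T^i, y^i) + sum_t L(x_t^i, theta_t) dt ],
    # with Phi the squared error and L a quadratic penalty on the weights
    x, reg = x0, 0.0
    for W, b in thetas:
        x = x + dt * f(x, W, b)                    # Euler step of the shared dynamics
        reg = reg + lam * dt * (W.pow(2).sum() + b.pow(2).sum())
    return (x - y).pow(2).sum(dim=1).mean() + reg

K, d = 32, 4
torch.manual_seed(0)
x0, y = torch.randn(K, d), torch.randn(K, d)       # K sampled input-target pairs
thetas = [(torch.zeros(d, d, requires_grad=True),
           torch.zeros(d, requires_grad=True)) for _ in range(3)]
opt = torch.optim.SGD([p for th in thetas for p in th], lr=0.1)
for _ in range(200):                               # plain gradient descent on the sampled objective
    opt.zero_grad()
    empirical_risk(thetas, x0, y).backward()
    opt.step()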
As in classical optimal control theory, the previous problem can be solved following two inter-connected approaches: a global theory, based on the Dynamic Programming Principle (DPP) and leading to the HJB equation, or the Pontryagin Maximum Principle (PMP) approach, which expresses the solution through a system of Forward-Backward SDEs (FBSDEs) together with a local optimality condition.
  3.2. HJB Equation
The idea behind the HJB formalism is to define a value function corresponding to the optimal loss of the control problem w.r.t. the general starting time and state. For the population risk minimization formulation expressed by Equation (
16), the state argument of the value function corresponds to an infinite-dimensional object that models a joint distribution of the input-target as an element of a suitable Wasserstein space.
As regards random variables and their distributions, a suitable space must be defined for the rigorous treatment of the optimal control problem. In particular, we use the shorthand notation $L^2$ for $L^2(\Omega, \mathcal{F}, \mathbb{P}; \mathbb{R}^d)$ to denote the set of $\mathbb{R}^d$-valued random variables that are square integrable w.r.t. a given probability measure $\mathbb{P}$. Then, we deal with a Hilbert space considering the norm:
$\|X\|_2 := \big( \mathbb{E}\, |X|^2 \big)^{1/2}.$
The set $\mathcal{P}_2(\mathbb{R}^d)$ denotes the probability measures with finite second moment defined on the Euclidean space $\mathbb{R}^d$. Let us recall that the random variable X is square integrable in $L^2$ if and only if its law $\mathbb{P}_X$ belongs to $\mathcal{P}_2(\mathbb{R}^d)$. The space $\mathcal{P}_2(\mathbb{R}^d)$ can be endowed with a metric by considering the Wasserstein distance defined in Equation (2). For $p = 2$, the two-Wasserstein distance reads:
$W_2(\mu, \nu) = \Big( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\mathbb{R}^d \times \mathbb{R}^d} |x - y|^2 \, \pi(dx, dy) \Big)^{1/2},$
according to the marginals introduced in Section 2.1, or equivalently:
$W_2(\mu, \nu) = \inf \Big\{ \big( \mathbb{E}\, |X - Y|^2 \big)^{1/2} \, : \, \mathbb{P}_X = \mu, \ \mathbb{P}_Y = \nu \Big\},$
        see, e.g., [5] (p. 6). Moreover, for $\mu \in \mathcal{P}_2(\mathbb{R}^d)$, we define the associated norm:
$\|\mu\|_{L^2} := \Big( \int_{\mathbb{R}^d} |x|^2 \, \mu(dx) \Big)^{1/2}.$
Given a measurable function $\psi : \mathbb{R}^d \to \mathbb{R}$ that is square integrable w.r.t. the probability distribution $\mu$, the following notation is introduced:
$\langle \psi, \mu \rangle := \int_{\mathbb{R}^d} \psi(x) \, \mu(dx). \quad (18)$
Concerning the dynamical evolution of probability measures, let us fix 
 and the control process 
. Then, the dynamics of the system can be written as:
 being the law associated with the variable 
 defined by 
, and we can rewrite the law of 
 as:
Indeed, the law involving the dynamics 
 depends only on the law of 
 and not on the random variable itself; see, e.g., [
5] (p. 7).
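A hedged way to make this dependence explicit (the flow-map notation below is ours, following the viewpoint of [5]): if $\varphi^{\theta}_{0,t}$ denotes the flow map of the ODE driven by the control $\theta$ between times 0 and t, then the law of the state is the push-forward of the initial law,
$\mathbb{P}_{X_t} = \big( \varphi^{\theta}_{0,t} \big)_{\#}\, \mathbb{P}_{X_0},$
so that the curve $t \mapsto \mathbb{P}_{X_t}$ in $\mathcal{P}_2(\mathbb{R}^d)$ is completely determined by $\mathbb{P}_{X_0}$ and by $\theta$, and not by the particular random variable chosen to represent the initial condition.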
It turns out that, to obtain the HJB Equation (5) corresponding to the above introduced formulation, it is necessary to define the concept of the derivative w.r.t. a probability measure. To begin with, it is useful to consider probability measures on $\mathbb{R}^d$ as laws expressing the probabilistic features of $\mathbb{R}^d$-valued random variables defined over the probability space $(\Omega, \mathcal{F}, \mathbb{P})$; derivatives are then defined by working on the Banach (in fact, Hilbert) space of such random variables. Moreover, if we consider a function $u : \mathcal{P}_2(\mathbb{R}^d) \to \mathbb{R}$, it is possible to lift it into its extension U defined on $L^2$, as follows:
$U(X) := u(\mathbb{P}_X),$
        then the definition of the derivative w.r.t. a probability measure can be expressed in terms of U in the usual Banach space setting. In particular, we have that u is $\mathcal{C}^1$ if the lifted function U is Fréchet differentiable with continuous derivatives.
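A simple hedged example of this lifting (ours, purely for illustration): for $u(\mu) = \int_{\mathbb{R}^d} g(x) \, \mu(dx)$, with $g$ continuously differentiable with bounded, Lipschitz gradient, the lift is
$U(X) = \mathbb{E}\big[ g(X) \big],$
and a direct computation gives the Fréchet derivative $DU(X) = \nabla g(X)$, anticipating the identification of the derivative with an element of $L^2$ discussed next.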
Since $L^2$ can be identified with its dual, if the Fréchet derivative $DU(X)$ exists, by Riesz's theorem it can be identified with an element of $L^2$, i.e.,
$U(X + Y) = U(X) + \mathbb{E}\big[ DU(X) \cdot Y \big] + o(\|Y\|_2), \qquad Y \in L^2.$
It is worth underlining that $DU(X)$ depends on X only through its law and its pointwise value; more precisely, the derivative of u at $\mu = \mathbb{P}_X$ is described by a function $\partial_\mu u(\mu) : \mathbb{R}^d \to \mathbb{R}^d$, depending only on $\mu$, defined by:
$DU(X) = \partial_\mu u(\mathbb{P}_X)(X).$
By duality, we know that $\partial_\mu u(\mathbb{P}_X)$ is square integrable w.r.t. $\mathbb{P}_X$. To define a notion of chain rule in $\mathcal{P}_2(\mathbb{R}^d)$, consider a dynamical system described by:
$\dot{X}_t = f(X_t, \theta_t),$
        where f denotes the feed-forward dynamics. If a function $u \in \mathcal{C}^1(\mathcal{P}_2(\mathbb{R}^d))$, meaning that it is differentiable with a continuous derivative w.r.t. a probability measure, then, for all $t$, we have:
$\frac{d}{dt}\, u(\mathbb{P}_{X_t}) = \big\langle \partial_\mu u(\mathbb{P}_{X_t})(\cdot) \cdot f(\cdot, \theta_t), \, \mathbb{P}_{X_t} \big\rangle,$
        where · denotes the usual inner product between vectors in $\mathbb{R}^d$. Equivalently, exploiting the lifted function of u, we can state:
$\frac{d}{dt}\, U(X_t) = \mathbb{E}\big[ DU(X_t) \cdot f(X_t, \theta_t) \big].$
Moreover, the variable w denotes the concatenated $(d+l)$-dimensional variable $w = (x, y)$, where $x \in \mathbb{R}^d$ and $y \in \mathbb{R}^l$. Correspondingly, $\bar{f}(w, \theta) := (f(x, \theta), 0)$ is the extended $(d+l)$-dimensional Feed-Forward Function (FFF), $\bar{L}(w, \theta) := L(x, \theta)$ is the extended regularization loss, and $\bar{\Phi}(w) := \Phi(x, y)$ represents the terminal loss function.
Since the state variable is identified with a probability distribution 
, the resulting objective functional can be defined as:
        which can be written, with the concatenated variable 
w and the bracket notation introduced in (
18), as:
In this setting, some assumptions for the value function are needed to solve Equation (
16). In particular:
- f, L and $\Phi$ are bounded; 
- f, L and $\Phi$ are Lipschitz continuous w.r.t. x, and the Lipschitz constants of f and L are independent of $\theta$; 
- . 
The value function $v^*$ is defined as the real-valued function on $[0, T] \times \mathcal{P}_2(\mathbb{R}^{d+l})$ corresponding to the infimum of the functional J over the training parameters $\theta$:
$v^*(t, \mu) := \inf_{\theta} J(t, \mu, \theta).$
It is essential to observe how the value function satisfies a recursive relation based on the Dynamic Programming Principle (DPP). This implies that, for any optimal trajectory, the remaining part of the trajectory starting from any intermediate point still has to be optimal. The latter principle can be expressed by writing the value function as:
$v^*(t, \mu) = \inf_{\theta} \Big\{ \int_t^{\hat{t}} \big\langle \bar{L}(\cdot, \theta_s), \, \mu_s \big\rangle \, ds + v^*(\hat{t}, \mu_{\hat{t}}) \Big\}, \qquad 0 \le t \le \hat{t} \le T,$
where $\mu_s$ denotes the law of the concatenated state at time s, started from $\mu_t = \mu$.
Considering a small increment of time $\delta t$, with $0 < \delta t \ll 1$, we can compute the Taylor expansion in the Wasserstein sense, hence obtaining:
By the chain rule in $\mathcal{P}_2$, we have:
Since the infinitesimal increment $\delta t$ does not affect the distribution $\mu$ and the controls $\theta$ (see [5] (p. 13)), integrating the second term, we have:
Taking $\delta t \to 0$, we obtain the HJB equation:
$\partial_t v^*(t, \mu) + \inf_{\theta \in \Theta} \big\langle \partial_\mu v^*(t, \mu)(\cdot) \cdot \bar{f}(\cdot, \theta) + \bar{L}(\cdot, \theta), \, \mu \big\rangle = 0, \qquad v^*(T, \mu) = \langle \bar{\Phi}, \mu \rangle. \quad (21)$
Since the value function should solve the HJB equation, it is essential to find the precise link between the solution of this PDE and the value function obtained from the minimization of the functional J. To provide the result, we use a verification argument based on the following consideration: if the solution of the HJB equation is smooth enough, then it corresponds to the value function $v^*$; moreover, it allows computing the optimal control.
Theorem 1 (The verification argument). 
Let v be a function in . If v is a solution of the HJB equation in (
21)
 and there exists  that is mapping  attaining the infimum in the HJB equation, then , and  is an optimal feedback control policy, i.e.,  is a solution of the population risk minimization problem expressed by Equation (
16)
, where  with  and . Proof of Theorem 1.  Given any control process 
, applying Formula (
20) between 
 and 
 with explicit time dependence gives:
          
Equivalently:
          
          where the first inequality comes from the infimum condition in (
21).
Since the control is arbitrary, we have:
          
          then it can be substituted with 
 where 
 is computed by the optimal feedback control. Repeating the above argument, the inequality becomes an equality since the infimum is attained for 
:
          
Thus, 
, and 
 defines an optimal control policy. For more details, see [
5] (Proposition 3, pp. 13–14). □
The importance of Theorem 1 lies in linking smooth solutions of the parabolic PDE to solutions of the population risk minimization problem, which makes the HJB equation a natural candidate for characterizing the DL problem.
Moreover, the optimal control policy is identified by computing the infimum in (21). Hence, it turns out that the HJB equation strongly characterizes the learning problem's solution for feedback, or closed-loop, networks: control weights are actively adjusted according to the outputs, and this is the essential feature of closed-loop control. Nevertheless, the solution comes from a PDE that is, in general, difficult to solve, even numerically. On the other hand, open-loop solutions can be obtained from the closed-loop control policy by sequentially setting $\theta_t = \bar{\theta}(t, \mu_t)$, $\mu_t$ being the solution of the feed-forward dynamics of the state distribution, started from $\mu_0$ and run up to time t. Usually, within DL settings, open-loop controls are used during training or to measure the inference of a trained model, since the trained weights of each neuron have a fixed value.
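The closed-loop/open-loop distinction can be made concrete with a small sketch (entirely illustrative; the policy, schedule, and dynamics below are placeholders we introduce, not objects defined in the text). A feedback controller recomputes the weights from the current state at every step, while an open-loop schedule replays weights fixed in advance, which is exactly what a trained network does at inference time:

def rollout_closed_loop(x0, policy, f, T, dt=1.0):
    # feedback control: weights are re-evaluated from the observed state at each step
    x = x0
    for t in range(T):
        theta_t = policy(t, x)
        x = x + dt * f(x, theta_t)
    return x

def rollout_open_loop(x0, schedule, f, dt=1.0):
    # open-loop control: a pre-computed list of weights is applied regardless of the state
    x = x0
    for theta_t in schedule:
        x = x + dt * f(x, theta_t)
    return x

An open-loop schedule can be extracted from a feedback policy by recording theta_t = policy(t, x_t) along a nominal trajectory, mirroring the construction described above.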
The main limitation of such a formulation lies in assuming that the value function $v^*$ is continuously differentiable. It is therefore worth studying a more flexible characterization of $v^*$ based on weak solutions, also known as viscosity solutions: a weaker formulation of the PDE that goes beyond the concept of classical solutions and allows relevant results to be obtained under weaker assumptions on the coefficients defining the (stochastic) differential problem we are interested in; see, e.g., [5] (Section 5, pp. 14–22) for more details.
The key idea relies on exploiting the lifting identification between measures and random variables, moving from the Wasserstein space $\mathcal{P}_2$ to the Hilbert space $L^2$ and using tools developed to study viscosity solutions in Hilbert spaces.
We introduce a functional, playing the role of the Hamiltonian in the viscosity formulation, defined through:
Then, the lifted Bellman equation can be written w.r.t. the lifted value function as follows:
        hence, the PDE we are analysing is now set within a larger space, corresponding to the Hilbert space $L^2$.
We say that a bounded, uniformly continuous function 
 is a viscosity solution of HJB Equation (
21) if its lifted function 
 defined by:
        is a viscosity solution to the lifted Bellman Equation (
23), namely:
- for any test function such that the difference between the lifted value function and the test function has a local maximum at a given point, the test function satisfies, at that point, the subsolution inequality associated with the lifted Bellman equation; 
- for any test function such that the difference between the lifted value function and the test function has a local minimum at a given point, the test function satisfies, at that point, the supersolution inequality associated with the lifted Bellman equation. 
Actually, the unique viscosity solution of this formulation corresponds to the value function $v^*$ obtained from the minimization problem; see, e.g., [5] (Theorem 1, p. 15). Therefore, the HJB equation provides both a necessary and a sufficient condition for the optimality of the learning procedure.
Adopting the MF-optimal control viewpoint implies that the population risk minimization problem of DL can be studied as a variational problem, whose solution is characterized by a suitable HJB equation, in analogy with the classical calculus of variations. In other words, the HJB equation is a global characterization of the value function, to be solved over the entire space $\mathcal{P}_2$ of input-target distributions. From the numerical point of view, it is a hard task to obtain a solution over the entire space; this is why the learning problem is typically solved locally, around some (small set of) trajectories generated according to the initial condition $\mu_0$, then applying the obtained feedback to nearby input-target distributions.
  3.3. Mean-Field Pontryagin Maximum Principle
We have seen how the HJB approach provides a characterization of the optimal solution of the population risk minimization problem that holds globally in $\mathcal{P}_2$, at the price of being difficult to handle in practice. Moving from this consideration, the MF-PMP provides a local condition for optimality, expressed in terms of the expectation of the Hamiltonian function.
Starting from the population risk minimization problem defined in Equation (16) and given a collection of K sample input-target pairs, the single ith input sample is considered. The prediction of the network can be approximated by a deterministic transformation $g(x_T^i)$ of the terminal state, for some fixed map $g$, so that the prediction is a function both of the initial input $x_0^i$ and of the control parameters $\theta$. Moreover, we define a loss function that is minimized when its two arguments are equal. Therefore, the goal is to minimize:
Since g is fixed, it can be absorbed into the definition of the loss function by composing the loss with g in its first argument.
Then, the supervised learning problem can be expressed as:
        where L acts as a regularization term modelling a running cost.
Input variables $x_0^i$ can be considered as elements of a Euclidean space $\mathbb{R}^d$, representing the initial conditions of the following ODE system:
$\dot{x}_t^i = f(x_t^i, \theta_t), \qquad i = 1, \dots, K, \quad (25)$
        where $\theta_t$ are the control parameters to be trained. The dynamics (25) are decoupled across samples except for the control. A general space of admissible controls is then defined, as before, as the set of essentially bounded, measurable functions $\theta : [0, T] \to \Theta$, and we aim to choose $\theta$ in this space so that $x_T^i$ is as close as possible to $y^i$ for $i = 1, \dots, K$.
To formulate the PMP as a set of necessary conditions for optimal solutions, it is useful to define the Hamiltonian $H : \mathbb{R}^d \times \mathbb{R}^d \times \Theta \to \mathbb{R}$ given by:
$H(x, p, \theta) := p \cdot f(x, \theta) - L(x, \theta), \quad (26)$
        with p modelling an adjoint process as in Equation (6).
Let us underline that all input-target pairs $(x_0^i, y^i)$, connected by the distribution $\mu$, share a common control parameter, and this feature suggests the idea of developing a maximum condition that holds in an average sense. Indeed, the control is now enforced on the continuity equation that describes the evolution of the probability densities.
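In PDE form, and under our own additional regularity assumptions (the laws admit smooth densities), the evolution of the density $\rho_t$ of the state driven by the control $\theta$ can be written as the controlled continuity (transport) equation
$\partial_t \rho_t(x) + \nabla_x \cdot \big( f(x, \theta_t)\, \rho_t(x) \big) = 0, \qquad \rho_0 \ \text{given},$
which is the equation on which, as remarked above, the common control is enforced.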
The following assumptions are needed:
- f is bounded, and f, L are continuous w.r.t. $\theta$; 
- f, L and $\Phi$ are continuously differentiable w.r.t. x; 
- the distribution $\mu$ has bounded support in $\mathbb{R}^{d+l}$, which means that there exists a constant $M > 0$ such that $|x_0| + |y| \le M$, $\mu$-almost surely. 
Theorem 2 (Mean-Field Pontryagin Maximum Principle).  Let Assumptions 1–3 hold and let $\theta^*$ be the minimizer corresponding to the optimal control of the population risk minimization problem (16). Define the Hamiltonian H as in Equation (26). Then, there exist absolutely continuous stochastic processes $x^*$ and $p^*$ solving the forward backward SDEs (27) and (28) and satisfying the related optimality condition (29), expressed in terms of the expectation of the Hamiltonian function.  Proof of Theorem 2.  For the sake of simplicity, let us introduce a new coordinate $x^0_t$ satisfying the dynamics $\dot{x}^0_t = L(x_t, \theta_t)$ with $x^0_0 = 0$. Through this choice, the definition of the Hamiltonian in Equation (26) can be rewritten without the running loss L by redefining:
          
Assumptions 1–3 are still preserved, but now we can consider, without loss of generality, the case $L \equiv 0$.
Let some $\tau \in [0, T)$ be a Lebesgue point of the optimal dynamics; in this setting, such points are dense in $[0, T]$. Now, for $\varepsilon > 0$, define the family of perturbed controls:
$\theta^{\varepsilon}_t := \begin{cases} \omega, & t \in [\tau - \varepsilon, \tau], \\ \theta^*_t, & \text{otherwise}, \end{cases}$
          where $\omega \in \Theta$ is an admissible control value; this kind of perturbation is called a needle perturbation. Accordingly, define $x^{\varepsilon}_t$ as the solution of the forward propagation equation with the perturbed control $\theta^{\varepsilon}$
. Clearly, 
 for every 
 and every 
 since the perturbation is not present. At the limit point 
, the following holds:
          
          and since 
 is a Lebesgue point of 
F:
          
It is possible to characterize the above limit as the leading-order perturbation on the state due to the needle perturbation introduced in the infinitesimal interval $[\tau - \varepsilon, \tau]$. On $[0, \tau - \varepsilon)$, the dynamics are the same as before applying the perturbation, since the controls coincide.
Now, it is necessary to consider how the perturbation 
 propagates. Thus, define for 
:
          
          and:
          
 is well-defined for almost every t, namely at every Lebesgue point of the map , and it satisfies the following linearised equation:
          
In particular, 
 represents the perturbation of the final state introduced by this control. By the optimality assumption of 
, it follows that:
          
Assumptions 1 and 2 (p. 13) imply 
 is bounded. By the dominated convergence theorem, we know that:
          
Let us define 
 as the solution of the adjoint of Equation (
30), hence:
          
By Equation (
31), it follows that 
, and moreover, for all 
:
          
          thus,
          
          so that taking 
:
          
Since 
 is arbitrarily chosen, this completes the proof by recalling that 
. See [
5] (Theorem 3, pp. 23–24) for more details. □
 MF-PMP refers only to controls of the open-loop type; Equation (27) is a feed-forward ODE, describing the state dynamics under the optimal controls $\theta^*$. Equation (28) defines the evolution of the co-state variable $p^*$, characterizing an adjoint variational condition propagated backwards in time. It is interesting to note how the optimality condition described in Equation (29) does not involve first-order partial derivatives, being expressed in terms of expectations. In particular, it requires that optimal solutions globally maximize the Hamiltonian function. This aspect also allows considering cases of dynamics that are non-differentiable w.r.t. the control weights, as well as cases characterized by optimal weights lying on the boundary of the admissible set $\Theta$. Moreover, the usual first-order optimality condition can be derived from (29). Comparing this MF formulation to the classical PMP, we can see that the main difference lies in the fact that the maximization condition is expressed in terms of an expectation over the probability distribution $\mu$. The latter result is not surprising, since the mean-field-optimal control must depend on the probability distribution of the input-target pairs.
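For the reader's convenience, we also recall the typical form these conditions take in the mean-field PMP of [5] (a hedged restatement: sign and normalization conventions may differ from those adopted in Equations (27)–(29)):
$\dot{x}^*_t = f(x^*_t, \theta^*_t), \qquad x^*_0 = x_0,$
$\dot{p}^*_t = -\nabla_x H(x^*_t, p^*_t, \theta^*_t), \qquad p^*_T = -\nabla_x \Phi(x^*_T, y),$
$\mathbb{E}_{\mu}\big[ H(x^*_t, p^*_t, \theta^*_t) \big] \ \ge \ \mathbb{E}_{\mu}\big[ H(x^*_t, p^*_t, \theta) \big], \qquad \forall\, \theta \in \Theta, \ \text{for a.e. } t \in [0, T],$
with H as in Equation (26).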
Let us also note that the mean-field PMP expressed in Theorem 2 can be written more compactly as follows. For each control process 
, we denote by 
 and 
 the solutions of Hamilton’s Equations (
27) and (
28), and we enforce the control expressed by the random variables 
, through:
Then, 
 satisfies the PMP if and only if:
Furthermore, observe that the mean-field PMP includes, as a special case, the necessary conditions for the optimality of the sampled optimal control problem (17). In order to point out this aspect, define the empirical measure:
$\mu_K := \frac{1}{K} \sum_{i=1}^{K} \delta_{(x_0^i, y^i)},$
        and apply the mean-field PMP with $\mu_K$ in place of $\mu$ to obtain:
        where $x^{*,i}$ and $p^{*,i}$ are defined through the input-target pair $(x_0^i, y^i)$. Moreover, since $\mu_K$ is a random measure, (33) is a random equation whose solutions correspond to random variables.
Concerning the possible numerical analysis of DL algorithms based on the maximum principle, we refer to [4] (pp. 13–15), where a comparison can be found between the usual gradient approaches and the discrete formulation of the Mean-Field PMP stated in Theorem 2, with a loss function based on Equation (24). The test and train losses of some variants of the SGD algorithm are compared to those of the mean-field algorithm based on the discrete PMP. It is possible to observe that the latter algorithm is characterized by a better convergence rate, and is therefore faster. This improvement is mainly due to the fact that it avoids getting stuck in flat regions of the loss landscape, as clearly shown by the graphs reported in [4] (p. 14).
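To illustrate how a PMP-based iteration differs from a gradient step, the following is a hedged sketch of the basic Method of Successive Approximations (MSA) that underlies discrete-PMP training schemes of the kind compared in [4]. The discretization, the approximate Hamiltonian maximization by a few gradient-ascent steps, the zero running cost, and all helper names are our own simplifications, not the algorithm of [4]:

import torch

def f(x, W, b):
    # shared layer dynamics (illustrative tanh layer), applied to a batch of samples
    return torch.tanh(x @ W.T + b)

def msa_step(Ws, bs, x0, y, dt=1.0, inner_steps=20, lr=0.05):
    # One MSA sweep: forward state pass, backward adjoint pass, then an (approximate)
    # maximization of the summed Hamiltonian for each layer's weights.
    # 1) forward pass with the current controls
    xs = [x0]
    for W, b in zip(Ws, bs):
        xs.append(xs[-1] + dt * f(xs[-1], W, b))
    # 2) backward adjoint pass; terminal condition p_T = -grad Phi for Phi = squared error
    p = (-2.0 * (xs[-1] - y)).detach()
    ps = [p]
    for t in reversed(range(len(Ws))):
        x_t = xs[t].detach().requires_grad_(True)
        inner = (p * (x_t + dt * f(x_t, Ws[t], bs[t]))).sum()
        p = torch.autograd.grad(inner, x_t)[0]     # p_t from p_{t+1} through the layer map
        ps.append(p)
    ps = ps[::-1]                                  # ps[t + 1] pairs with layer t
    # 3) per-layer Hamiltonian maximization by a few gradient-ascent steps
    for t in range(len(Ws)):
        W = Ws[t].clone().requires_grad_(True)
        b = bs[t].clone().requires_grad_(True)
        for _ in range(inner_steps):
            H = (ps[t + 1] * f(xs[t].detach(), W, b)).sum()
            gW, gb = torch.autograd.grad(H, (W, b))
            W = (W + lr * gW).detach().requires_grad_(True)
            b = (b + lr * gb).detach().requires_grad_(True)
        Ws[t], bs[t] = W.detach(), b.detach()
    return Ws, bs

# Tiny usage example on random data.
torch.manual_seed(0)
K, d = 16, 4
x0, y = torch.randn(K, d), torch.randn(K, d)
Ws = [torch.zeros(d, d) for _ in range(3)]
bs = [torch.zeros(d) for _ in range(3)]
for _ in range(10):
    Ws, bs = msa_step(Ws, bs, x0, y)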
  3.4. Connection between the HJB Equation and the PMP
In what follows, we provide connections between the global formulation, given by the HJB equation, and the local one, given by the PMP, exploiting the link between Hamilton's canonical equations (ODEs) and the Hamilton–Jacobi equations (PDEs). The Hamiltonian dynamics of Equations (27) and (28) describe the trajectory of a random variable that is completely determined by the random input-target pair. On the other hand, the optimality condition described by Equation (29) does not depend on the particular probability measure of the initial input-target pairs. Notice that the maximum principle can be expressed in terms of a Hamiltonian flow that depends on a probability measure in a suitable Wasserstein space, and Equation (29) is the corresponding lifted version. Analogously, in order to have both solutions in the same functional space, the HJB equation has to be lifted to the $L^2$ space.
Starting from the lifted Bellman Equation (23), set in $L^2$, it is possible to apply the method of characteristics and define the following system of equations, after introducing :
Suppose Equation (
34) has a solution that satisfies an initial condition given by:
        and a terminal one, involving the Bellman equation, given by:
We also assume that the optimal control achieving the infimum in (21) is an interior point of $\Theta$; then, we can explicitly write the Hamiltonian along the optimal control as:
Therefore, by the first order condition, we have:
        so that, taking into consideration Equation (
34), we obtain the Hamilton-type equations:
Using the concatenated variable w, and recalling that the last l components of $\bar{f}$ are zero, by considering only the first d components we obtain:
Summing up: Hamilton's equations of the system (36) can be viewed as the characteristic equations of the HJB equation in its lifted formulation described by Equation (23). Essentially, the PMP gives a necessary condition along any characteristic of the HJB equation: any characteristic originating from $\mu_0$, that is, the initial law of the random variables, must satisfy the local necessary optimality condition constituted by the mean-field PMP. This justifies the claim that the PMP constitutes a local condition when compared to the HJB equation.