1. Introduction
We study a system that is governed by a neural ODE and can be viewed as a continuous-time ResNet. Before we outline the system, some notation is necessary.
The activation function $\sigma:\mathbb{R}\to\mathbb{R}$ is assumed to be continuously differentiable and Lipschitz continuous with a Lipschitz constant that is less than or equal to 1, for example $\sigma(v)=\tanh(v)$ or $\sigma(v)=1/(1+e^{-v})$. For vectors, the function $\sigma$ acts component-wise; that is, for $x\in\mathbb{R}^d$, $\sigma(x)$ denotes the vector with the $i$-th component $\sigma(x_i)$ ($i\in\{1,\dots,d\}$).
Let a real number $T>0$ and natural numbers $d$ and $p$ in $\mathbb{N}$ be given. For $t\in(0,T)$, almost everywhere, let matrices $A(t)\in\mathbb{R}^{d\times p}$ and $W(t)\in\mathbb{R}^{p\times d}$ be given. The $a_k(t)\in\mathbb{R}^d$ are the columns of the matrix $A(t)$ and the $w_k(t)\in\mathbb{R}^d$ are the columns of the matrix $W(t)^\top$. For $t\in(0,T)$, almost everywhere, let the bias vector $b(t)$ in $\mathbb{R}^p$ with the components $b_k(t)$ ($k\in\{1,\dots,p\}$) be given. In order to state the required regularity assumptions, we introduce the space
$$\mathcal{U}=L^2\big((0,T);\,\mathbb{R}^{d\times p}\times\mathbb{R}^{p\times d}\times\mathbb{R}^{p}\big),$$
such that the parameter functions $u=(A,W,b)$ satisfy $u\in\mathcal{U}$. For parameters $u=(A,W,b)\in\mathcal{U}$, the system is defined as follows:
$$x'(t)=A(t)\,\sigma\big(W(t)\,x(t)+b(t)\big)\ \text{ for } t\in(0,T) \text{ a.e.},\qquad x(0)=x_0\qquad (1)$$
(see, for example, [1,2]).
The motivation to study (1) is that a time-discrete version can be considered as a residual neural network (ResNet), which has been very useful in many applications; see [3] for identification problems with physics-informed neural ordinary differential equations, [4] for applications in image classification for the detection of colorectal cancer, and [5] for examples in image registration and classification problems. A time-discrete version can be obtained, for example, by an explicit Euler discretization of (1); see (28) in Section 4. The fact that 'ResNet, PolyNet, FractalNet and RevNet can be interpreted as different numerical discretizations of differential equations' has been discussed in detail in [6].
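To make this correspondence concrete, the following sketch (ours, not from the paper) implements one explicit Euler step of (1) with step size $h$; it has exactly the form of a ResNet layer $x^{(k+1)}=x^{(k)}+h\,A_k\,\sigma(W_k x^{(k)}+b_k)$. The sizes, the step size, and the choice $\sigma=\tanh$ are illustrative assumptions.

```python
import numpy as np

def euler_step(x, A, W, b, h):
    """One explicit Euler step of the neural ODE x' = A sigma(W x + b).

    This is precisely a ResNet layer: x_{k+1} = x_k + h * A @ tanh(W @ x_k + b).
    """
    return x + h * A @ np.tanh(W @ x + b)

# Illustrative sizes and random parameters (assumptions, not from the paper).
d, p, h = 2, 3, 0.1
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
for _ in range(10):                      # ten Euler steps = ten "layers"
    A = rng.standard_normal((d, p))      # layer-dependent parameters
    W = rng.standard_normal((p, d))
    b = rng.standard_normal(p)
    x = euler_step(x, A, W, b, h)
print(x)
```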
For the given time horizon $T>0$, we study an optimal control problem on the time interval $[0,T]$. The desired state is given by $x_d\in\mathbb{R}^d$; that is, $x_d$ denotes the desired output of the system. Let $t_1\in(0,T)$ be given. For the training of the system, we study the loss function with a tracking term
$$Q(x)=\int_{t_1}^{T}\big\|x(t)-x_d\big\|_1+\big\|x'(t)\big\|_1\,dt\qquad (2)$$
with the non-smooth norm $\|\cdot\|_1$. For our result, the inclusion of the derivative $x'$ in the definition of $Q$ in (2) is essential, since due to this inclusion the loss function multiplied with a suitable constant factor is an upper bound for the maximum norm of $x-x_d$ on $[t_1,T]$; see the inequality (9) in Lemma 1. This allows us to prove the finite-time turnpike property in Theorem 1. We will explain the turnpike phenomenon below.
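The mechanism behind this bound can be sketched in one line (under our reading of (2); the precise constant in (9) may differ). For $t,s\in[t_1,T]$, the fundamental theorem of calculus gives $x(t)-x_d=x(s)-x_d+\int_s^t x'(\tau)\,d\tau$; averaging over $s\in[t_1,T]$ yields

```latex
% Why Q controls the maximum norm of x - x_d on [t_1, T] (sketch):
\|x(t)-x_d\|_1
  \le \frac{1}{T-t_1}\int_{t_1}^{T}\|x(s)-x_d\|_1\,ds
    + \int_{t_1}^{T}\|x'(\tau)\|_1\,d\tau
  \le \max\Big\{\tfrac{1}{T-t_1},\,1\Big\}\,Q(x),
  \qquad t\in[t_1,T].
```

In particular, if $Q(x)=0$, then $x(t)=x_d$ on all of $[t_1,T]$, which is exactly the finite-time turnpike behavior.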
We define the control cost (regularization term)
$$R(u)=\int_0^T\big\|A(t)\big\|_F^2+\big\|W(t)\big\|_F^2+\big|b(t)\big|^2\,dt.\qquad (3)$$
Here, $\|M\|_F$ denotes the Frobenius norm of a matrix $M$.
Lemma 10 in [7] states that system (1) is exactly controllable; that is, the terminal condition $x(T)=x_d$ can be satisfied for all $x_0,x_d\in\mathbb{R}^d$. To be precise, for all $T>0$ there exists a constant $C_T>0$, such that for all $x_0,x_d\in\mathbb{R}^d$ we can find a control $u\in\mathcal{U}$, such that for the state $x$ that is generated by (1) with the initial condition $x(0)=x_0$, we have $x(T)=x_d$ and
$$\|u\|_{\mathcal{U}}\le C_T\,\big(|x_0|+|x_d|\big).$$
Also, the linearized system is exactly controllable in the sense that for all $T>0$ there exists a constant $\tilde C_T>0$, such that for all $\delta x_0,\delta x_T\in\mathbb{R}^d$ we can find a control $\delta u\in\mathcal{U}$, such that for the state $\delta x$ that is generated by the linearized system that is stated below with the initial condition $\delta x(0)=\delta x_0$, we have $\delta x(T)=\delta x_T$ and
$$\|\delta u\|_{\mathcal{U}}\le\tilde C_T\,\big(|\delta x_0|+|\delta x_T|\big).$$
The linearized system at a given $(x,u)$ for the variation $\delta x$ of the state that is generated by a variation $\delta u=(\delta A,\delta W,\delta b)$ of the control is
$$\delta x'(t)=\delta A(t)\,\sigma\big(W(t)x(t)+b(t)\big)+A(t)\,\Big[\sigma'\big(W(t)x(t)+b(t)\big)\odot\big(\delta W(t)\,x(t)+W(t)\,\delta x(t)+\delta b(t)\big)\Big]$$
with the initial condition $\delta x(0)=\delta x_0$, where $\odot$ denotes the component-wise product and $\sigma'$ acts component-wise.
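For readers who prefer code, here is a hedged sketch of the variational equation above (the chain-rule structure and all names are ours; $\sigma=\tanh$ is one admissible activation). The finite-difference check at the end confirms that the function really is the directional derivative of the right-hand side of (1).

```python
import numpy as np

def rhs(x, A, W, b):
    """Right-hand side of (1): x' = A sigma(W x + b), with sigma = tanh."""
    return A @ np.tanh(W @ x + b)

def linearized_rhs(x, dx, A, dA, W, dW, b, db):
    """Variational equation: dx' = dA sigma(z) + A [sigma'(z) * (dW x + W dx + db)],
    where z = W x + b and sigma'(z) = 1 - tanh(z)^2 acts component-wise."""
    z = W @ x + b
    dsigma = 1.0 - np.tanh(z) ** 2
    return dA @ np.tanh(z) + A @ (dsigma * (dW @ x + W @ dx + db))

# Finite-difference check with illustrative sizes (d = 2, p = 3).
rng = np.random.default_rng(1)
d, p, eps = 2, 3, 1e-6
x, dx = rng.standard_normal(d), rng.standard_normal(d)
A, dA = rng.standard_normal((d, p)), rng.standard_normal((d, p))
W, dW = rng.standard_normal((p, d)), rng.standard_normal((p, d))
b, db = rng.standard_normal(p), rng.standard_normal(p)
fd = (rhs(x + eps*dx, A + eps*dA, W + eps*dW, b + eps*db) - rhs(x, A, W, b)) / eps
print(np.allclose(fd, linearized_rhs(x, dx, A, dA, W, dW, b, db), atol=1e-4))  # True
```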
A universal approximation theorem for the corresponding time-discrete case with recurrent neural networks can be found in the seminal paper [8] by Cybenko; see also [9,10,11,12].
For a parameter $\gamma>0$, define
$$J_{T,\gamma}(u)=\gamma\,Q(x)+R(u)$$
with $Q$ as defined in (2) and $R$ as defined in (3). We study the minimization (training) problem
$$P(T,\gamma):\qquad\min_{u\in\mathcal{U}}\ J_{T,\gamma}(u)\quad\text{subject to (1) with } x(0)=x_0.$$
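A discretized evaluation of this objective might look as follows (a sketch under our reading of (2) and (3); the quadrature and all names are illustrative assumptions, not the paper's notation).

```python
import numpy as np

def objective(xs, As, Ws, bs, x_d, h, k1, gamma):
    """Discrete J = gamma * Q + R for a trajectory xs sampled with step h.

    xs: array of shape (N+1, d); As, Ws, bs: parameter samples per step.
    Q: l1 tracking plus l1 derivative terms on the tail k >= k1 (cf. (2)).
    R: squared Frobenius/Euclidean norms of the parameters (cf. (3)).
    """
    increments = np.diff(xs, axis=0)                  # x^{k+1} - x^k, approximates h * x'
    Q = h * np.abs(xs[k1:] - x_d).sum() + np.abs(increments[k1 - 1:]).sum()
    R = h * sum((A**2).sum() + (W**2).sum() + (b**2).sum()
                for A, W, b in zip(As, Ws, bs))
    return gamma * Q + R
```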
Our main result is that if $\gamma$ is chosen sufficiently large, the optimal control problem $P(T,\gamma)$ has the finite-time turnpike property; that is, the desired state $x_d$ is already reached at the time $t_1$ and the state remains there until the end of the time interval.
Figure 1 shows an example for the graph of the function $t\mapsto\|x(t)-x_d\|_1$ ($t\in[0,T]$), where $x$ is an optimal trajectory that has the finite-time turnpike property. Our main result in Theorem 1 states that along the optimal trajectories of $P(T,\gamma)$, the difference $x(t)-x_d$ and the optimal parameters vanish on the interval $[t_1,T]$.
The finite-time turnpike property has been studied, for example, in [13,14,15]. In the first two references, the finite-time turnpike property is achieved by the non-smoothness of the objective functional. In this paper, we use a similar approach adapted to the framework of neural ordinary differential equations.
The finite-time turnpike property is an extremal case of the celebrated turnpike property that has originally been studied in economics. The turnpike analysis investigates how the solutions of dynamic optimal control problems with a time evolution are related to the solutions of the corresponding static problems, where the time derivatives are set to zero and the initial conditions are cancelled. It turns out that often, for large time horizons, the solution of the dynamic problem is very close to the solution of the corresponding static problem on large parts of the time interval. For an overview of the turnpike property, see [16,17,18,19] and the numerous references therein.
In the case of the finite-time turnpike property, after a finite time, the solution of the dynamic problem coincides with the solution of the static problem. The exponential turnpike property for ResNets and beyond has been studied, for example, in [20], but not the finite-time turnpike property.
Our approach yields an optimization problem with a non-smooth objective functional without terminal constraints. For the numerical solution of this type of problem, a relaxation approach can be used; see, for example, [21]. Due to the finite-time turnpike property, we obtain learning problems without terminal conditions (which are easier to solve than problems with terminal constraints), where the optimal trajectories still attain the desired terminal states exactly.
Since the objective functional contains a tracking term of $L^1$-norm type, our problem is related to studies in compressed sensing, where this type of objective functional is used to enhance sparsity; see [22]. For a study about sparsity in Bregman machine learning, see [23].
In Section 3, we discuss the well-posedness of $P(T,\gamma)$. We present a result about the existence of solutions of $P(T,\gamma)$ for a fixed matrix $A$. This implies the existence of a solution for the problem where the feasible set contains only constant matrices $A$ that are independent of $t$; see Remark 2.
In Section 4, numerical examples are presented that illustrate that the finite-time turnpike property is also visible in the time-discrete case.
3. The Existence of Solutions of P(T, γ) for Fixed A
For the sake of completeness of the analysis, we also state an existence result. However, we can only prove the existence of a solution for the problem where the matrix $A$ is fixed and not an optimization parameter for $P(T,\gamma)$. Thus, for a given matrix-valued function $A$, we consider the problem
$$P_A(T,\gamma):\qquad\min_{(W,b)}\ \gamma\,Q(x)+R(u)\quad\text{subject to (1) with } x(0)=x_0.$$
In order to show the existence of a solution of $P_A(T,\gamma)$, we assume that there exists a number $M_A>0$, such that for $t\in(0,T)$, almost everywhere, we have $\|A(t)\|_F\le M_A$. This is the case if the entries of $A$ are elements of the function space $L^\infty(0,T)$; for example, if they are step functions over $(0,T)$.
Theorem 2. Assume that $\sigma$ is continuously differentiable and the Lipschitz constant of $\sigma$ is less than or equal to 1. Assume that $A$ is given, such that we have $\|A(t)\|_F\le M_A$ for $t\in(0,T)$, almost everywhere. Then, for each $T>0$ and $\gamma>0$, problem $P_A(T,\gamma)$ has a solution $\bar u=(\bar W,\bar b)$, such that $J_{T,\gamma}(\bar u)\le J_{T,\gamma}(u)$ for all admissible $u$.
If, in addition, the exact controllability properties used in the proof of Theorem 1 hold for the fixed matrix $A$, then for sufficiently large $\gamma$, each solution of $P_A(T,\gamma)$ has the finite-time turnpike property stated in Theorem 1.
The proof of Theorem 2 uses Gronwall's Lemma (see, for example, [24]). For the convenience of the reader, we state it here:
Lemma 2 (Gronwall's Lemma). Let $\alpha\ge0$, $\beta\ge0$, and an integrable function $U$ on $[0,T]$ be given. Assume that for $t\in(0,T)$, almost everywhere, the integral inequality
$$U(t)\le\alpha+\beta\int_0^t U(s)\,ds\qquad (23)$$
holds. Then, for $t\in(0,T)$, almost everywhere, the function $U$ satisfies the inequality
$$U(t)\le\alpha\,e^{\beta t}.\qquad (24)$$
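For completeness, the standard one-line argument behind Lemma 2 (a sketch, under the convention that the integral in (23) starts at $0$):

```latex
% Set V(t) = \int_0^t U(s)\,ds; then (23) reads V'(t) \le \alpha + \beta V(t), so
\frac{d}{dt}\Big(e^{-\beta t}\,V(t)\Big)
  = e^{-\beta t}\big(V'(t)-\beta V(t)\big)\le\alpha\,e^{-\beta t};
% integrating from 0 to t gives V(t) \le (\alpha/\beta)(e^{\beta t}-1), hence
U(t)\le\alpha+\beta\,V(t)\le\alpha\,e^{\beta t},
% which is (24).
```

Now, we present the proof of Theorem 2.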
Proof of Theorem 2. Consider a minimizing sequence $(u^{(n)})_{n\in\mathbb{N}}$, $u^{(n)}=(W^{(n)},b^{(n)})$, with
$$J_{T,\gamma}\big(u^{(n)}\big)\le J_{T,\gamma}\big(u^{(1)}\big)$$
for all $n\in\mathbb{N}$. Define the norm
$$\|u\|=\Big(\int_0^T\big\|W(t)\big\|_F^2+\big|b(t)\big|^2\,dt\Big)^{1/2}$$
and the corresponding inner product that gives a Hilbert space structure to the parameter space. Due to the definition of $J_{T,\gamma}$, there exists a number $M>0$, such that for all $n\in\mathbb{N}$ we have
$$\big\|u^{(n)}\big\|\le M;$$
that is, the sequence is bounded in the parameter space. Hence, there exists a weakly converging subsequence with a limit
$$\bar u=(\bar W,\bar b).$$
Let $\bar x$ denote the state generated by $\bar u$. For the states $x^{(n)}$ generated by the $u^{(n)}$ as a solution of the system defined in (1), we can assume, by increasing $M$ if necessary, that we have
$$\sup_{n\in\mathbb{N}}\ \big\|x^{(n)}\big\|_{L^\infty((0,T);\,\mathbb{R}^d)}\le M.\qquad (25)$$
Due to Mazur's Lemma (see, for example, [25,26]), there exists a subsequence of convex combinations that converges strongly. To be precise, there exist convex combinations
$$v^{(n)}=\big(V^{(n)},\beta^{(n)}\big)=\sum_{k=n}^{N(n)}\lambda_k^{(n)}\,u^{(k)},\qquad\lambda_k^{(n)}\ge0,\quad\sum_{k=n}^{N(n)}\lambda_k^{(n)}=1,$$
such that
$$\lim_{n\to\infty}\big\|v^{(n)}-\bar u\big\|=0.\qquad (26)$$
Let $y^{(n)}$ denote the states generated by the $v^{(n)}$. Since $\sigma$ is Lipschitz continuous with a Lipschitz constant that is less than or equal to 1, this implies for $t\in(0,T)$, almost everywhere,
$$\Big|\sigma\big(V^{(n)}(t)\,y^{(n)}(t)+\beta^{(n)}(t)\big)-\sigma\big(\bar W(t)\,\bar x(t)+\bar b(t)\big)\Big|\le\Big|V^{(n)}(t)\,y^{(n)}(t)+\beta^{(n)}(t)-\bar W(t)\,\bar x(t)-\bar b(t)\Big|.$$
Thus, for $t\in(0,T)$, almost everywhere, we have
$$\big|y^{(n)}(t)-\bar x(t)\big|\le\int_0^t\big\|A(s)\big\|_F\,\Big|V^{(n)}(s)\,y^{(n)}(s)+\beta^{(n)}(s)-\bar W(s)\,\bar x(s)-\bar b(s)\Big|\,ds.$$
Then, the fact that $\|A(t)\|_F\le M_A$, the Cauchy–Schwarz inequality, (25) and (26) yield numbers $\varepsilon_n$ with $\lim_{n\to\infty}\varepsilon_n=0$, such that
$$\big|y^{(n)}(t)-\bar x(t)\big|\le\varepsilon_n+M_A\int_0^t\big\|V^{(n)}(s)\big\|_F\,\big|y^{(n)}(s)-\bar x(s)\big|\,ds.$$
Thus, by increasing the value of $M$ if necessary, we obtain for $t\in(0,T)$, almost everywhere,
$$\big|y^{(n)}(t)-\bar x(t)\big|\le\varepsilon_n+M_A\,M\int_0^t\big|y^{(n)}(s)-\bar x(s)\big|\,ds.$$
Since $\varepsilon_n\ge0$ and $M_A\,M$ is independent of $n$, this yields the integral inequality
$$U_n(t)\le\varepsilon_n+M_A\,M\int_0^t U_n(s)\,ds\qquad\text{with}\quad U_n(t)=\big|y^{(n)}(t)-\bar x(t)\big|.\qquad (27)$$
The integral inequality (27) has the form of (23) in Lemma 2. Hence, (24) from Gronwall's Lemma yields for $t\in(0,T)$, almost everywhere,
$$\big|y^{(n)}(t)-\bar x(t)\big|\le\varepsilon_n\,e^{M_A M\,t}\longrightarrow0\qquad(n\to\infty).$$
For the time derivatives we obtain, again by increasing the value of $M$ if necessary,
$$\big\|\big(y^{(n)}\big)'-\bar x'\big\|_{L^1((0,T);\,\mathbb{R}^d)}\longrightarrow0\qquad(n\to\infty).$$
Hence, $\bar u$ is a solution of $P_A(T,\gamma)$. This shows that solutions of $P_A(T,\gamma)$ exist.
The exact controllability properties that have been used for the construction in the proof of Theorem 1 still hold if the matrix $A$ is fixed. Hence, the assertion about the finite-time turnpike property also follows. □
Remark 2. The results in Theorem 1 can be adapted to the case where the feasible set only contains constant (that is, time-independent) matrices $A$ and constant vectors $b$, since the exact controllability results that we have used in the proof still hold in this case. Since in this case $A$ and $b$ are in finite-dimensional spaces, Theorem 2 implies the existence of optimal parameters. We consider a problem of this type in the numerical example that is presented in the next section.
4. Numerical Examples
To illustrate our findings, we present numerical experiments. Let a natural number $N$ and the step size $h=T/N$ be given. For $k\in\{0,1,\dots,N-1\}$, we consider the time-discrete system
$$x^{(k+1)}=x^{(k)}+h\,A\,\sigma\big(W^{(k)}x^{(k)}+b\big),\qquad x^{(0)}=x_0.\qquad (28)$$
Here, the matrix $A$ and the real numbers $b_1,\dots,b_p$ are independent of the time step. Our results can be adapted to the case of a constant matrix $A$ and a constant vector $b$, since the exact controllability results that we have used in the proofs still hold. Since in this case $A$ and $b$ are in finite-dimensional spaces, we also obtain the existence of optimal parameters in this case.
The matrices $W^{(k)}$ depend on the current time step. To define the objective functional for the time-discrete case, let
$$Q_h(x)=h\sum_{k=k_1}^{N}\big\|x^{(k)}-x_d\big\|_1+\sum_{k=k_1}^{N}\big\|x^{(k)}-x^{(k-1)}\big\|_1,$$
where $x$ is the solution of (28), $k_1\in\{1,\dots,N\}$ corresponds to the time $t_1=k_1\,h$, and
$$R_h(u)=h\sum_{k=0}^{N-1}\big\|W^{(k)}\big\|_F^2+\|A\|_F^2+|b|^2.$$
We consider the minimization problem
$$\min\ \gamma\,Q_h(x)+R_h(u)\qquad\text{subject to (28)}.$$
For the numerical example, we have fixed the dimensions $d$ and $p$, the number of time steps $N$, and the penalty parameter $\gamma$. For the training, we have used the Global Optimization Toolbox in MATLAB.
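For readers who want to experiment, here is a minimal end-to-end sketch of the time-discrete problem in Python, with scipy.optimize.minimize as a generic stand-in for the MATLAB Global Optimization Toolbox. All sizes and values, the choice $\sigma=\tanh$, and the exact discrete objective are our illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

d, p, N, h, gamma, k1 = 2, 3, 20, 0.05, 100.0, 10     # illustrative values
x0 = np.array([1.0, -1.0])                            # initial state
x_d = np.array([0.5, 0.5])                            # desired state

def unpack(theta):
    A = theta[:d * p].reshape(d, p)                   # constant matrix A
    b = theta[d * p:d * p + p]                        # constant bias b
    W = theta[d * p + p:].reshape(N, p, d)            # step-dependent W^{(k)}
    return A, b, W

def forward(theta):
    A, b, W = unpack(theta)
    xs = [x0]
    for k in range(N):                                # explicit Euler steps (28)
        xs.append(xs[-1] + h * A @ np.tanh(W[k] @ xs[-1] + b))
    return np.array(xs)

def J(theta):
    xs = forward(theta)
    # discrete tracking term: l1 norms of x - x_d and of the increments (cf. Q_h)
    Q = h * np.abs(xs[k1:] - x_d).sum() + np.abs(np.diff(xs, axis=0)[k1 - 1:]).sum()
    R = h * (theta ** 2).sum()                        # discrete regularization (cf. R_h)
    return gamma * Q + R

theta0 = 0.1 * np.ones(d * p + p + N * p * d)
res = minimize(J, theta0, method="Powell", options={"maxiter": 500, "xtol": 1e-4})
print(np.abs(forward(res.x)[k1:] - x_d).max())        # small => turnpike reached
```

A derivative-free method is used here because the objective is non-smooth; the global search of the MATLAB toolbox used in the paper is, of course, a much stronger solver for this problem.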
Table 1 contains the evolution of the norms $\|x^{(k)}-x_d\|_1$ ($k\in\{0,1,\dots,N\}$) along the computed approximation of the optimal trajectories for different values of $\gamma$. It clearly illustrates the finite-time turnpike behavior that is predicted by Theorem 1. The zeros in Table 1 represent numerical values whose order of magnitude is below the displayed precision.
For a second, smaller choice of the problem parameters, we have obtained the numerical result presented in Table 2. Also, for this smaller choice, the turnpike structure is still visible. Here, we have given the size of the small norms (that are the numerical approximations of zero) in more detail.
5. Conclusions
We have shown that with a suitable non-smooth loss function, each solution of a learning problem has the finite-time turnpike property, which means that it reaches the desired state exactly after a finite time. Since the finite time $t_1$ can be considered as a problem parameter, this situation allows us to choose $t_1$ in a convenient way. Thus, $t_1$ arises as an additional design parameter in the design of optimal neural networks, where it corresponds to the number of layers. Since for $t\ge t_1$ the optimal parameters are zero, System (1) has a constant state on $[t_1,T]$, and thus the time horizon can be cut off at $t_1$.
Therefore, the problem of finding the optimal number of layers in a neural network corresponds in the setting of neural ODEs to the problem of time-optimal control (see, for example, [27]), which is defined as follows. Let a number $\omega>0$ be given, which serves as problem parameter. The optimization problem involves the minimization of the terminal time
$$\min\ \tau\qquad\text{subject to (1), the terminal constraint } x(\tau)=x_d,\ \text{and the inequality constraint } \|u\|\le\omega.$$
Here, $\|u\|$ is as defined in (10). The solution of the time-optimal control problem is closely related to the solution of $P(T,\gamma)$. This can be seen as follows. Let $u^{(\gamma)}$ denote optimal parameters that attain the optimal value of $P(T,\gamma)$. Then, Theorem 1 implies that (if $\gamma$ is sufficiently large) we have $u^{(\gamma)}(t)=0$ for $t\in(t_1,T)$, and for the optimal state we have $x(t)=x_d$ for $t\in[t_1,T]$. Hence, we conclude that the optimal parameters for $P(T,\gamma)$ also solve the time-optimal control problem with the parameter $\omega=\|u^{(\gamma)}\|$, and the optimal time is $\tau=t_1$.
This relation allows us to adapt the choice of $\gamma$ to the desired value of $\omega$. If $\gamma$ is enlarged, this value can be decreased for the optimal parameters.
We have shown the existence of a solution of the nonlinear optimization problem for the case that one of the parameters, namely the matrix $A$, is fixed. In order to show that a solution also exists with $A$ as an additional (time-dependent) optimization parameter, we expect that an additional regularization term in the objective functional (for example, a term that penalizes the time variation of $A$) is necessary. This is a topic for future research.
We expect that the finite-time turnpike property also holds in more general settings. However, the proof that is presented here does not apply to such cases, so this is another topic for future research. As a possible application of our results, we have the numerical solution of shape inverse problems in mind, as described in [28]. Studying the finite-time turnpike phenomenon in a practical machine learning scenario will be of high value for future research.
Moreover, it would be interesting to combine the dynamics with an approach that allows for data assimilation, such as the nudging induced neural networks (NINNs) that have been introduced recently in [29].