Article

Min–Max Dynamic Programming Control for Systems with Uncertain Mathematical Models via Differential Neural Network Bellman’s Function Approximation

by Alexander Poznyak 1, Sebastian Noriega-Marquez 1, Alejandra Hernandez-Sanchez 2, Mariana Ballesteros-Escamilla 3 and Isaac Chairez 4,*
1 Automatic Control Department, Centro de Investigacion y Estudios Avanzados del Instituto Politecnico Nacional, Ciudad de Mexico 07360, Mexico
2 Institute of Advanced Materials for the Sustainable Manufacturing, Tecnologico de Monterrey, Ciudad de Mexico 14380, Mexico
3 Centro de Innovacion y Desarrollo Tecnologico en Computo, Instituto Politecnico Nacional, Ciudad de Mexico 07700, Mexico
4 Institute of Advanced Materials for the Sustainable Manufacturing, Tecnologico de Monterrey, Jalisco 45201, Mexico
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(5), 1211; https://doi.org/10.3390/math11051211
Submission received: 20 January 2023 / Revised: 16 February 2023 / Accepted: 24 February 2023 / Published: 1 March 2023
(This article belongs to the Special Issue Dynamics and Control Theory with Applications)

Abstract: This research focuses on designing a min–max robust control based on a neural dynamic programming approach using a class of continuous differential neural networks (DNNs). The proposed controller solves the robust optimization of a cost function that depends on the trajectories of a system with an uncertain mathematical model belonging to a class of non-linear perturbed systems. The dynamic programming min–max formulation enables robust control with respect to bounded modelling uncertainties and disturbances. The value function of the Hamilton–Jacobi–Bellman (HJB) equation, approximated by a DNN, permits estimating the closed-loop formulation of the controller. The controller design is based on an estimated state trajectory with the worst possible uncertainties/perturbations, which provides the degree of robustness attained by the proposed controller. The class of learning laws for the time-varying weights in the DNN is produced by studying the HJB partial differential equation. The controller uses the solution of the obtained learning laws and a time-varying Riccati equation. A recurrent algorithm based on the Kiefer–Wolfowitz method adjusts the initial conditions for the weights so as to satisfy the terminal condition of the given cost function. The suggested robust control is evaluated in a numerical example confirming the optimizing solution based on the DNN approximation of Bellman's value function.

1. Introduction

The term optimal control refers to a group of techniques that can be used to create a control strategy that produces the best possible behaviour with respect to a specified criterion (i.e., a strategy that minimizes a loss function). Two main techniques can define the solution of a given optimal control problem: the maximum principle and dynamic programming. Pontryagin's maximum principle gives a necessary condition for optimality [1]. Still, it cannot be used to derive closed-loop solutions to the optimal control problem when the system is non-linear and is affected by modelling uncertainties and external perturbations [2].
On the other hand, the Hamilton–Jacobi–Bellman (HJB) equation provides a sufficient condition that can be used to arrive at the solutions of optimal control problems (see, for example, [3,4]). Both approaches, being theoretically elegant, have some significant drawbacks due to the requirement of precise knowledge of the system dynamics, which in practice includes internal uncertainties and external perturbations. When a mathematical model contains uncertain parameters from an a priori given set, the formulation of necessary conditions for this case is known as the robust maximum principle [5]. The formulation of the HJB equation under the presence of uncertainties/disturbances in the model description can be found in [6,7]. This robust version of the optimal control introduces novel approaches that can be exploited to handle imprecise representations of the system that must be controlled. Moreover, it allows considering the effect of modelling uncertainties and external perturbations that satisfy some predefined bounds. Hence, using the robust HJB equations represents an innovative way to solve online optimization problems that could usually be interpreted as an extremum-seeking control design problem. However, such solutions require a state-dependent explicit solution of the HJB equation, which is a complex task in general, considering that the HJB equation is a non-linear partial differential equation [8,9].
In general, the HJB equation cannot be solved analytically for non-linear mathematical models, even without the influence of errors or disturbances. Because of this, it was suggested in [10,11] to compute a numerical approximation of the relevant solution based on the use of static neural networks (NNs), in which the NN weights are modified online by a least-squares method implementation [12,13]. From this point of view, this method is referred to as an adaptive realization of an optimal controller. This idea has been used in recent decades, considering the adaptive approximation of the value function associated with the HJB equation, which can offer an adaptive version of the optimization problem for dynamical systems. When the approximation is based on the introduction of artificial neural networks as an approximate model of the value function or the optimal structure, the technique is known as neural adaptive dynamic programming.
In this study, we suggest using a differential neural network (DNN) approach [14] for the online approximation of the Bellman's function corresponding to the min–max solution of the HJB equation [15], where the maximum is taken over the class of admissible uncertainties. The minimum is evaluated for the system trajectories affected by the worst internal parametric uncertainty and the external bounded perturbations. Unlike static neural networks, whose weights converge to "the optimal" approximate values, DNNs have time-varying weights throughout the learning process. Utilizing the acquired features [16], the entire design process may be completed before applying the min–max control [17].
The main contributions of this study are:
-
Min–max formulation of the problem and the analytical calculation of the worst internal parametric uncertainties and external perturbations;
-
Development of the HJB equation corresponding to the considered min–max problem formulation;
-
Proposition of a DNN approximation for the solution of the HJB equation for the min–max problem;
-
Design of the differential equations for the adjustment of the weights (learning) in the suggested DNN structure;
-
Numerical validation of the suggested method on a non-linear test system.
The structure of this manuscript is the following: Section 2 presents the problem formulation related to the type of non-linear system to be controlled and the min–max optimal control description. Section 3 presents the min–max formulation used to design the proposed controller based on approximate dynamic programming. Section 4 contains the estimation of the worst possible set of trajectories under the maximum value of external perturbations and modelling uncertainties. Section 5 establishes the robust version of the HJB equation evaluated over the estimated worst trajectories. Section 6 establishes the approximation of the Bellman's function based on the DNN approximation capacities. Section 7 relates the DNN approximate values and their adjustment law, considering a recursive algorithm using the Kiefer–Wolfowitz method. Section 8 presents the numerical example that evaluates the proposed controller, and Section 9 contains the final remarks and future trends based on the obtained results presented in this study.

2. Problem Formulation

Let us consider the non-linear controllable plant given by the following ordinary differential equation (ODE)
$$\dot{x}(t) = \left[ A_0 + \Delta A\left(x(t), t\right) \right] x(t) + B_0 u + \eta\left(x(t), t\right), \quad x_0 = x(0), \quad t \in [0, T], \tag{1}$$
where $x \in \mathbb{R}^n$ is the state vector and $u \in \mathbb{R}^m$ is the control action to be designed. The matrix $A_0 \in \mathbb{R}^{n \times n}$ characterizes the internal linear relationship between the state and its dynamics. The matrix $B_0 \in \mathbb{R}^{n \times m}$ defines the constant effect of the control action $u$ on the dynamics of the system. Time is represented by the variable $t$. The time window is defined by the finite value represented by $T$.
The internal uncertainty matrix $\Delta A(x(t), t) : \mathbb{R}^n \times \mathbb{R}_+ \to \mathbb{R}^{n \times n}$ is supposed to be bounded as follows:
$$\sup_{t \ge 0} \left\| \Delta A\left(x(t), t\right) \right\|^2 \le \alpha^2. \tag{2}$$
Here $\left\| \Delta A\left(x(t), t\right) \right\|^2 := \operatorname{tr}\left\{ \Delta A\left(x(t), t\right) \Delta A^\top\left(x(t), t\right) \right\}$.
The non-measurable external disturbance is defined by $\eta(x(t), t) : \mathbb{R}^n \times \mathbb{R}_+ \to \mathbb{R}^n$. This term is also assumed to be bounded according to the following inequality:
$$\sup_{t \ge 0} \left\| \eta\left(x(t), t\right) \right\|^2 \le \beta^2. \tag{3}$$
Let the cost functional be given in the Bolza form:
$$J(0, x_0; u(\cdot)) := h_0(x(T)) + \int_{0}^{T} h(x(t), u(t))\, dt, \tag{4}$$
with the quadratic cost functions
$$h(x, u) = x^\top Q x + u^\top R u, \qquad h_0(x(T)) = 0.5\, x^\top(T)\, S\, x(T). \tag{5}$$
Here the matrix $Q \in \mathbb{R}^{n \times n}$ is positive semi-definite and symmetric ($Q \succeq 0$, $Q = Q^\top$), the control-associated matrix in the functional $R \in \mathbb{R}^{m \times m}$ is positive definite and symmetric ($R > 0$, $R = R^\top$), and the matrix related to the terminal condition $S \in \mathbb{R}^{n \times n}$ is positive definite and symmetric ($S > 0$, $S = S^\top$).
This study designs the min–max control $u^* \in \mathbb{R}^m$, which must solve the following optimization problem:
$$\max_{\{\Delta A, \eta\} \in \Psi_{adm}} J(0, x_0; u(\cdot)) \to \min_{u(\cdot) \in U_{adm}[0, T]}, \tag{6}$$
subject to the dynamics given in Equation (1).
The set $\Psi_{adm}$ defines the class of admissible internal uncertainties and external perturbations given by
$$\Psi_{adm} = \left\{ \{\Delta A, \eta\} : \sup_{t \ge 0} \left\| \Delta A\left(x(t), t\right) \right\|^2 \le \alpha^2, \; \sup_{t \ge 0} \left\| \eta\left(x(t), t\right) \right\|^2 \le \beta^2 \right\}.$$
The set $U_{adm}[0, T]$ consists of all piece-wise continuous vector functions $u(t)$ measurable (in the Lebesgue sense) for all $t \in [0, T]$. This means that the min–max optimal control is as follows:
$$u^* = \arg \min_{u(\cdot) \in U_{adm}[0, T]} \; \max_{\{\Delta A, \eta\} \in \Psi_{adm}} J(0, x_0; u(\cdot)). \tag{7}$$
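To make the problem setting concrete, the following Python sketch integrates the plant in Equation (1) with the Euler method and evaluates the Bolza cost of Equation (4) for a given feedback law and admissible uncertainty realizations. It is an illustration only (the authors' implementation is in Matlab/Simulink): the dimensions, numerical values, and helper names (simulate_cost, u_of, dA_of, eta_of) are our assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative problem data (assumed values for the sketch, not from the paper).
A0 = np.array([[1.0, 0.5],
               [0.0, 2.0]])
B0 = np.array([[1.0],
               [1.0]])
Q = np.eye(2)          # state weight, Q >= 0
R = np.eye(1)          # control weight, R > 0
S = np.eye(2)          # terminal weight, S > 0
T, dt = 15.0, 1e-3     # horizon and integration step
N = int(T / dt)

def simulate_cost(u_of, dA_of, eta_of, x0):
    """Euler integration of Equation (1) and the Bolza cost of Equation (4)."""
    x, J = x0.copy(), 0.0
    for k in range(N):
        t = k * dt
        u = u_of(x, t)                                  # feedback law u(x, t)
        J += (x @ Q @ x + u @ R @ u) * dt               # running cost h(x, u)
        x = x + dt * ((A0 + dA_of(x, t)) @ x + B0 @ u + eta_of(x, t))
    return J + 0.5 * x @ S @ x                          # terminal cost h0(x(T))
```

For instance, calling simulate_cost with zero control and zero uncertainty realizations returns the open-loop nominal cost, which gives a baseline against which the robust controller can be compared.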

3. HJB Min–Max Formulation

According to the results presented in [6,7], the sufficient condition for a control action $u$ to be min–max (robust) optimal in the sense of Equation (7) consists of fulfilling the following max–min HJB equation [18,19]:
$$-\frac{\partial V(t, x)}{\partial t} + \max_{u(\cdot) \in U_{adm}} \; \min_{\{\Delta A, \eta\} \in \Psi_{adm}} H\!\left( -\frac{\partial V(t, x)}{\partial x}, x, u, \Delta A, \eta \right) = 0, \tag{8}$$
where the function $H : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^{n \times n} \times \mathbb{R}^n \to \mathbb{R}$ is called the perturbed Hamiltonian (considering the effect of the uncertain parameters and external perturbations), which is given by
$$H(\psi, x, u, \Delta A, \eta) := \psi^\top \dot{x} - h(x, u). \tag{9}$$
The vector $\psi$ is referred to as the adjoint variable satisfying the following differential equation with the corresponding terminal condition:
$$\dot{\psi} = -\frac{\partial}{\partial x} H(\psi, x, u, \Delta A, \eta), \quad \psi(T) = -\left. \frac{d}{dx} h_0(x) \right|_{x = x(T)} = -S\, x(T). \tag{10}$$
The value (Bellman) function $V$ is defined for any $(s, y) \in [0, T) \times \mathbb{R}^n$ as
$$V(s, y) := \min_{u \in U_{adm}} \; \max_{\{\Delta A, \eta\} \in \Psi_{adm}} J(s, y; u(\cdot)), \tag{11}$$
with the boundary condition
$$V(T, y) = h_0(y). \tag{12}$$
Here $h_0$ is defined in Equation (5).
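In code, the perturbed Hamiltonian of Equation (9) is a direct transcription. A minimal sketch continuing the snippet above (the function name is ours):

```python
def hamiltonian(psi, x, u, dA, eta):
    """Perturbed Hamiltonian, Equation (9): H = psi^T xdot - h(x, u)."""
    xdot = (A0 + dA) @ x + B0 @ u + eta          # plant dynamics, Equation (1)
    return psi @ xdot - (x @ Q @ x + u @ R @ u)  # minus the running cost h(x, u)
```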

4. Determination of the Worst Possible Uncertainty

Given the min–max formulation of the optimization problem considered in this study, it is necessary to estimate the state trajectories under the worst possible evolution of the parametric uncertainties $\Delta A(x, t)$ and non-parametric imprecision, as well as the external perturbations $\eta(x, t)$. This section presents this analysis.
Consider the following joint uncertainty/perturbation vector $\nu(x(t), t)$ defined as
$$\nu(x(t), t) = \Delta A(x(t), t)\, x(t) + \eta(x(t), t). \tag{13}$$
The following lemma presents the result regarding the estimation of the mentioned worst possible trajectory:
Lemma 1.
The worst realization
$$\nu^* = \arg\min_{\nu \in \Psi_{adm}} \psi^\top \nu, \qquad \psi = -\frac{\partial V}{\partial x},$$
of the extended uncertainty/perturbation vector $\nu(x(t), t)$ is given by
$$\nu^*(t) = -\sqrt{\alpha^2 + \beta^2}\, \sqrt{\left\| x(t) \right\|^2 + 1}\; \frac{\psi(t)}{\left\| \psi(t) \right\|}. \tag{14}$$
Proof.
In view of Equations (2) and (3), $\|\nu\| \le \|\Delta A\| \|x\| + \|\eta\| \le \alpha \|x\| + \beta \le \sqrt{\alpha^2 + \beta^2}\, \sqrt{\|x\|^2 + 1}$, where the last step is the Cauchy–Schwartz inequality [20] applied to the pairs $(\alpha, \beta)$ and $(\|x\|, 1)$. Hence we have the following upper estimate:
$$\left| \psi^\top \nu \right| \le \left\| \psi \right\| \left\| \nu \right\| \le \sqrt{\alpha^2 + \beta^2}\, \left\| \psi \right\| \sqrt{\left\| x \right\|^2 + 1}.$$
In view of the equality attainable for $\nu = \nu^*$ in Equation (14), the lemma is proven. □
Remark 1.
The "worst" trajectory $x^*$ (obtained when $\nu = \nu^*$) is generated by the following auxiliary ordinary differential equation:
$$\dot{x}^* = A_0 x^* + B_0 u + \nu^*(t), \quad x_0^* = x^*(0), \quad t \in [0, T]. \tag{15}$$
The initial condition $x^*(0)$ is a given constant vector. Notice that the dynamics in Equation (15) contain neither uncertainties nor perturbations. Therefore, these trajectories can be used to obtain the minimum value of the cost function using the standard dynamic programming approach based on the HJB equation.
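Continuing the running sketch, the worst-case disturbance of Equation (14) and one Euler step of the worst-trajectory dynamics in Equation (15) can be coded as follows; this assumes the reconstructed form of Equation (14) above, and the small regularizer guarding against $\psi = 0$ is our addition.

```python
def nu_star(psi, x, alpha, beta):
    """Worst-case joint disturbance, Equation (14)."""
    return (-np.sqrt(alpha**2 + beta**2) * np.sqrt(x @ x + 1.0)
            * psi / (np.linalg.norm(psi) + 1e-12))   # guard against psi = 0

def worst_step(x, u, psi, alpha, beta):
    """One Euler step of the worst-trajectory ODE, Equation (15)."""
    return x + dt * (A0 @ x + B0 @ u + nu_star(psi, x, alpha, beta))
```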

5. HJB Equation under the Effect of the Worst Possible Trajectories

Given Lemma 1, the following optimization problem arises for estimating the minimizing controller under the worst possible trajectory. Dynamic programming derives a necessary optimality condition from the Bellman equation: the value of a decision problem at a specific time is expressed in terms of the payoff from some initial decisions and the value of the remaining decision problem that results from those initial choices. Following Bellman's principle of optimality, this divides a dynamic optimization problem into a sequence of simpler sub-problems. Based on this technique, the optimization problem considered in this study can be solved through the conjugated application of online optimization using the HJB equation with an approximated Bellman's function [21,22].
According to the dynamic programming theory, and considering the application of the Hamiltonian function, the following equivalent problem for solving the optimal (under the worst possible uncertainties/perturbations) control $u^*$ can be proposed:
$$u^*(t) = \arg\max_{u(\cdot) \in U_{adm}[0, T]} \; \arg\min_{\nu \in \Psi_{adm}} H(\psi, x, u, \nu) = \arg\max_{u(\cdot) \in U_{adm}[0, T]} H(\psi^*, x^*, u, \nu^*), \quad \psi^* := -\left. \frac{\partial V(t, x)}{\partial x} \right|_{x = x^*}. \tag{16}$$
Maximizing the Hamiltonian in Equation (8) with respect to $u$, the robust optimal control $u^*$ is
$$u^* = -\frac{1}{2} R^{-1} B_0^\top \left. \frac{\partial V(t, x)}{\partial x} \right|_{x = x^*}. \tag{17}$$
Substituting both $u^*$ and $\nu^*$ in the HJB equation (denoting $V_t := \partial V(t, x^*)/\partial t$ and $V_x := \partial V(t, x^*)/\partial x$ for brevity) leads to
$$-V_t - V_x^\top A_0 x^* + \frac{1}{4} V_x^\top B_0 R^{-1} B_0^\top V_x - \left\| x^* \right\|_Q^2 - \sqrt{\alpha^2 + \beta^2}\; \frac{V_x^\top V_x}{\left\| V_x \right\|} \sqrt{\left\| x^* \right\|^2 + 1} = 0.$$
Using a simplification process on the last term of the left-hand side of the previous partial differential equation leads to
$$-V_t - V_x^\top A_0 x^* + \frac{1}{4} V_x^\top B_0 R^{-1} B_0^\top V_x - \left\| x^* \right\|_Q^2 - \sqrt{\alpha^2 + \beta^2}\, \left\| V_x \right\| \sqrt{\left\| x^* \right\|^2 + 1} = 0. \tag{18}$$
The solution for the Bellman's function $V(t, x^*)$ in Equation (18) is difficult to find analytically. That is why we propose to use a DNN approximation for $V(t, x^*)$, which is explained in the next section.

6. Bellman’s Function DNN Approximation

Consider the representation $V(t, x^*) \approx V_a(t, x^*)$ for the Bellman's function. The function $V_a(t, x)$ is an approximate solution associated with the HJB Equation (18) that satisfies the following DNN structure:
$$V_a(x^*, t) = \omega^\top(t)\, \sigma(x^*) + x^{*\top} P(t)\, x^*. \tag{19}$$
In Equation (19), the matrix $P \in \mathbb{R}^{n \times n}$ is a positive definite solution of a differential Riccati equation (introduced below), which is uniformly bounded with respect to time $t$; $\omega \in \mathbb{R}^n$ and $\sigma : \mathbb{R}^n \to \mathbb{R}^n$ are the DNN weights and activation functions, respectively. The components of $\sigma$ are selected as strictly positive sigmoidal functions.
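As a sketch of Equation (19), the following continues the running example with component-wise (strictly positive) sigmoid activations; the gradient below is the quantity $\left(\frac{d}{dx}\sigma\right)^{\!\top}\omega + 2Px$ used repeatedly in the sequel. The specific choice of sigmoid is ours.

```python
def sigma(x):
    """Strictly positive sigmoidal activation vector."""
    return 1.0 / (1.0 + np.exp(-x))

def sigma_jac(x):
    """Jacobian d(sigma)/dx (diagonal for component-wise sigmoids)."""
    s = sigma(x)
    return np.diag(s * (1.0 - s))

def V_a(x, omega, P):
    """DNN approximation of the Bellman function, Equation (19)."""
    return omega @ sigma(x) + x @ P @ x

def V_a_grad(x, omega, P):
    """Gradient dVa/dx = (d(sigma)/dx)^T omega + 2 P x."""
    return sigma_jac(x).T @ omega + 2.0 * P @ x
```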
The substitution of the approximate Bellman's function $V_a(x^*, t)$ in Equation (18) leads to
$$\begin{aligned} & -x^{*\top} \dot{P}(t)\, x^* - \dot{\omega}^\top(t)\, \sigma(x^*) - \omega^\top(t)\, \sigma'(x^*) A_0 x^* - 2 x^{*\top} P(t) A_0 x^* \\ & \quad + \frac{1}{4} \left[ \omega^\top(t)\, \sigma'(x^*) + 2 x^{*\top} P(t) \right] B_0 R^{-1} B_0^\top \left[ \sigma'^\top(x^*)\, \omega(t) + 2 P(t)\, x^* \right] \\ & \quad - \sqrt{\alpha^2 + \beta^2}\, \sqrt{\left\| x^* \right\|^2 + 1}\, \left\| \sigma'^\top(x^*)\, \omega(t) + 2 P(t)\, x^* \right\| - \left\| x^* \right\|_Q^2 \approx 0, \end{aligned} \tag{20}$$
where $\sigma'(x^*) := \frac{d}{dx} \sigma(x^*)$.
Introducing the unitary, finite-valued factor $\frac{\sigma^\top(x^*(t))\, \sigma(x^*(t))}{\left\| \sigma(x^*(t)) \right\|^2}$ into the components of Equation (20) that contain $\omega(t)$ yields
$$\begin{aligned} & -\dot{\omega}^\top(t)\, \sigma(x^*(t)) + \Bigl[ -\omega^\top(t)\, \sigma'(x^*(t)) A_0 x^*(t) + \frac{1}{4} \omega^\top(t)\, \sigma'(x^*(t)) B_0 R^{-1} B_0^\top \sigma'^\top(x^*(t))\, \omega(t) \\ & \quad + x^{*\top}(t) P(t) B_0 R^{-1} B_0^\top \sigma'^\top(x^*(t))\, \omega(t) - \sqrt{\alpha^2 + \beta^2}\, \sqrt{\left\| x^*(t) \right\|^2 + 1}\, \left\| \sigma'^\top(x^*(t))\, \omega(t) + 2 P(t)\, x^*(t) \right\| \Bigr] \frac{\sigma^\top(x^*(t))\, \sigma(x^*(t))}{\left\| \sigma(x^*(t)) \right\|^2} \approx 0. \end{aligned} \tag{21}$$
Based on the introduction of the unitary factor presented above, the strategy simplifies and removes some of the practical implementation obstacles of previous robust optimal control designs using neural network approximations [7,23]. Notice that the expression presented in Equation (21) contains several terms that are proportional to $\sigma(x^*(t))$. Reorganizing the last expression as $[\,\cdot\,]^\top \sigma(x^*(t)) = 0$, we may enforce $[\,\cdot\,] = 0$. This strategy is commonly implemented in the design of learning laws for a DNN, as mentioned in [14,24]. Hence, based on this strategy, it is feasible to estimate the temporal evolution of the weights according to the following ordinary differential equation, which defines the dynamics of all components of $\omega(t)$:
$$\begin{aligned} \dot{\omega}(t) &= \frac{\sigma(x^*(t))}{\left\| \sigma(x^*(t)) \right\|^2}\, \Omega\left( x^*(t), \omega(t) \right), \qquad \omega(T) = 0, \quad \omega(t) \in \mathbb{R}^n, \\ \Omega\left( x^*(t), \omega(t) \right) &= -x^{*\top}(t) A_0^\top \sigma'^\top(x^*(t))\, \omega(t) + \frac{1}{4} \omega^\top(t)\, \sigma'(x^*(t)) B_0 R^{-1} B_0^\top \sigma'^\top(x^*(t))\, \omega(t) \\ & \quad + \omega^\top(t)\, \sigma'(x^*(t)) B_0 R^{-1} B_0^\top P(t)\, x^*(t) - \sqrt{\alpha^2 + \beta^2}\, \sqrt{\left\| x^*(t) \right\|^2 + 1}\, \left\| \sigma'^\top(x^*(t))\, \omega(t) + 2 P(t)\, x^*(t) \right\|. \end{aligned} \tag{22}$$
Hence, based on the definition of $\Omega\left( x^*(t), \omega(t) \right)$ (which is known as the learning law), Equation (20) admits the following representation for the HJB equation using the approximate representation of the Bellman's function:
$$x^{*\top}\, \mathrm{Ric}(P)\, x^* - \left[ \dot{\omega}(t) - \frac{\sigma(x^*(t))}{\left\| \sigma(x^*(t)) \right\|^2}\, \Omega\left( x^*(t), \omega(t) \right) \right]^\top \sigma(x^*(t)) \approx 0, \tag{23}$$
where the expression $\mathrm{Ric}(P)$ describes the following matrix Riccati differential equation:
$$\mathrm{Ric}(P) = -\dot{P}(t) - P(t) A_0 - A_0^\top P(t) + P(t) B_0 R^{-1} B_0^\top P(t) - Q = 0. \tag{24}$$
Taking into account Property (12), the initial conditions for $\omega(0)$ and $P(0)$ should be selected in such a way that the following terminal conditions are met:
$$\omega(T) = 0 \quad \text{and} \quad P(T) = S. \tag{25}$$
So, based on the presented results, we are now ready to formulate the principal result of this study.
Theorem 1.
If the adjustment laws for the parameters in the DNN are given by Equation (22), and the time-dependent Riccati Equation (24) has a positive definite solution $P(t)$ with initial conditions $\omega(0)$ and $P(0)$ providing Property (25), then $V_a(x^*, t)$ gives the approximation of the Bellman's function $V(x^*, t)$.
Since both ODEs (22) and (24) have the terminal Conditions (25), we need to apply a recursive method (in this case, the "shooting" method) to find the appropriate initial conditions $\omega(0)$ and $P(0)$. Notice that the ODE $\mathrm{Ric}(P) = 0$ in Equation (24) can be resolved in inverse time $\tau = T - t$ using the initial condition at $\tau = 0$, which is $P(T) = S$. So, fulfilment of the terminal condition for Equation (24) does not present any problem. In the next section, we discuss the recursive method for finding the initial condition $\omega(0)$ that realizes the terminal requirement $\omega(T) = 0$. In view of the approximated model in Equation (19), the robust control $u^*$ may finally be represented as
$$u^*(t) = -\frac{1}{2} R^{-1} B_0^\top \left[ \sigma'^\top(x^*(t))\, \omega(t) + 2 P(t)\, x^*(t) \right], \tag{26}$$
where $x^*(t)$ and $\omega(t)$ are given by Equations (15) and (22), respectively, and $P(t)$ is obtained from Equation (24).
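A minimal sketch of the three ingredients just derived, continuing the running example: the Riccati ODE (24) integrated in reverse time from $P(T) = S$, the learning law (22) as reconstructed here, and the control law (26). The discretization details are our assumptions.

```python
def riccati_backward():
    """Integrate Equation (24) in reverse time tau = T - t from P(T) = S."""
    BRB = B0 @ np.linalg.inv(R) @ B0.T
    P = S.copy()
    traj = [P.copy()]
    for _ in range(N):
        # dP/dtau = P A0 + A0^T P - P BRB P + Q  (reverse-time form of (24))
        P = P + dt * (P @ A0 + A0.T @ P - P @ BRB @ P + Q)
        traj.append(P.copy())
    return traj[::-1]                # traj[k] approximates P(k * dt)

def omega_dot(x, omega, P, alpha, beta):
    """Weight dynamics, Equation (22)."""
    Ds = sigma_jac(x)
    BRB = B0 @ np.linalg.inv(R) @ B0.T
    Vx = Ds.T @ omega + 2.0 * P @ x
    Omega = (-x @ A0.T @ Ds.T @ omega
             + 0.25 * omega @ Ds @ BRB @ Ds.T @ omega
             + omega @ Ds @ BRB @ P @ x
             - np.sqrt(alpha**2 + beta**2) * np.sqrt(x @ x + 1.0)
             * np.linalg.norm(Vx))
    s = sigma(x)
    return s * Omega / (s @ s)

def u_star(x, omega, P):
    """Robust control law, Equation (26)."""
    return -0.5 * np.linalg.inv(R) @ B0.T @ (sigma_jac(x).T @ omega + 2.0 * P @ x)
```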

7. Recursive Method to Realize the Terminal Conditions for the Weight Dynamics

The exact dependence of the performance index $J(0, x_0^*; u^*(\cdot))$ in Equation (4) on $\omega(0)$ is difficult to obtain analytically. This condition constrains the application of traditional recurrent optimization algorithms such as gradient descent [25] or the Levenberg–Marquardt methodology [26]. As an alternative, here we apply the so-called modified deterministic Kiefer–Wolfowitz method [27,28] to the optimization of the performance index, considering that $J(0, x_0^*; u^*(\cdot)) = J(\omega(0))$. The recursive procedure of this method is as follows:
$$\begin{aligned} \omega_k(0) &= \omega_{k-1}(0) - \frac{\gamma_k}{\alpha_k} \left[ Y_k\left( \omega_{k-1}(0), \alpha_k \right) - Y_k\left( \omega_{k-1}(0), 0 \right) \right], \\ Y_k\left( \omega_{k-1}(0), \alpha_k \right) &= \left( Y_{k,1}\left( \omega_{k-1}(0), \alpha_k \right), \ldots, Y_{k,n}\left( \omega_{k-1}(0), \alpha_k \right) \right)^\top, \\ Y_{k,i}\left( \omega_{k-1}(0), \alpha_k \right) &= J\left( \omega_{k-1}(0) + \alpha_k e_i \right), \quad e_i = \bigl( 0, \ldots, 0, \underset{i}{1}, 0, \ldots, 0 \bigr)^\top, \\ & \sum_{k=1}^{\infty} \gamma_k = \infty, \quad \sum_{k=1}^{\infty} \alpha_k = \infty, \quad \sum_{k=1}^{\infty} \gamma_k \alpha_k < \infty, \quad \gamma_k, \alpha_k > 0. \end{aligned} \tag{27}$$
In this procedure, $\gamma_k$ can be selected as a small constant and $\alpha_k = \alpha_0 k^{-\alpha}$, $0 < \alpha \le 1$. From a practical point of view, both parameters $\gamma_k$ and $\alpha_k$ can be selected as small constants.
The evolution of this algorithm allows the adjustment of the initial weights of the DNN at each step of the recurrent methodology. The combination of this recurrent strategy, running in discrete form, with the continuous evolution of the system controlled by the approximate value of the Bellman's function based on the DNN constitutes a class of hybrid system. Moreover, adjusting the initial weights at each recurrent stage, driving the performance function towards its minimum value, operates as a class of reinforcement learning based on the Kiefer–Wolfowitz method.
A simple diagram of the hybrid algorithm is shown in Figure 1. This diagram shows the two outer loops that define how the recurrent method works over the continuous development of the proposed min–max optimized controller.
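A compact sketch of the recursion in Equation (27), with a constant $\gamma_k$ and $\alpha_k = \alpha_0 k^{-\alpha}$ as suggested above. The step sizes, the iteration budget, and the early-stopping threshold of 0.01 (anticipating the criterion reported in Section 8) are our assumptions, and the function names are ours.

```python
def kiefer_wolfowitz(J_of, omega0, gamma=1e-2, alpha0=1e-1, a_exp=1.0,
                     n_iter=3900, tol=0.01):
    """Deterministic Kiefer-Wolfowitz recursion (27) on the initial weights."""
    w = omega0.copy()
    n = w.size
    for k in range(1, n_iter + 1):
        a_k = alpha0 * k ** (-a_exp)
        J0 = J_of(w)                     # Y_k(w, 0): cost of one closed-loop run
        # finite-difference increment (Y_k(w, a_k) - Y_k(w, 0)) / a_k per axis e_i
        g = np.array([(J_of(w + a_k * e) - J0) / a_k for e in np.eye(n)])
        w = w - gamma * g
        if np.linalg.norm(w) < tol:      # early stopping as in Section 8
            break
    return w
```

Here J_of(w) must integrate the closed-loop system forward from the initial weights w (using, e.g., the omega_dot, u_star, and riccati_backward sketches above) and return the resulting value of the functional J.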

8. Numerical Example

The following parameters of the model presented in Equation (1) were considered for numerical evaluation purposes:
$$A_0 = \begin{pmatrix} 1.0 & 0.5 \\ 0.0 & 2.0 \end{pmatrix}, \quad B_0 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \quad \Delta A\left( x(t), t \right) = \varepsilon_1 \begin{pmatrix} \dfrac{x_1^2}{x_1^2 + 1} & \tan^{-1}(x_2) \\ 0.5 \sin(\omega_A t) & 0 \end{pmatrix}, \quad \eta\left( x(t), t \right) = \varepsilon_2 \begin{pmatrix} \sin(\omega_1 t + \phi_1) \\ \sin(\omega_2 t + \phi_2) \end{pmatrix}. \tag{28}$$
The model with the uncertain mathematical structure was simulated in Matlab/Simulink using the fixed-step ode1 (Euler) integration method with the integration step fixed to 0.001 s.
The proposed modified deterministic Kiefer–Wolfowitz method was implemented in M-language in Matlab to perform the recurrent scheme described in Equation (27). The number of cycles evaluated for this recurrent algorithm was k = 3900 .
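Under the reading of Equation (28) adopted here, the uncertainty realizations used in the simulation can be coded as below, continuing the running sketch. The magnitudes $\varepsilon_1, \varepsilon_2$ and the frequencies/phases are hypothetical placeholders, since the paper does not list their numerical values.

```python
# Hypothetical magnitudes, frequencies, and phases for Equation (28).
eps1, eps2 = 0.1, 0.1
w_A, w1, w2 = 2.0, 1.0, 3.0
phi1, phi2 = 0.0, np.pi / 4

def dA_of(x, t):
    """Parametric uncertainty Delta A(x, t) of Equation (28)."""
    return eps1 * np.array([[x[0]**2 / (x[0]**2 + 1.0), np.arctan(x[1])],
                            [0.5 * np.sin(w_A * t),     0.0]])

def eta_of(x, t):
    """External perturbation eta(x, t) of Equation (28)."""
    return eps2 * np.array([np.sin(w1 * t + phi1),
                            np.sin(w2 * t + phi2)])
```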
The evolution of the states $x_1$ and $x_2$ appears in Figure 2 and Figure 3. These figures compare the evolution of the worst trajectories $x_1^*$ and $x_2^*$, obtained considering the upper bound of the admissible perturbations, with the controlled states. Notice that both states move towards the origin. Still, in the case of the worst trajectories, there are no oscillations in the steady state, in opposition to the controlled states, which are affected by the selected uncertainties and perturbations. Furthermore, one may notice that the first variable shows a more evident difference between the worst estimated trajectory and the controlled one. The trajectories for $x_1$ and $x_2$ correspond to those generated with the final weights derived at the end of the iteration process endorsed by the Kiefer–Wolfowitz strategy.
The evolutions of the states shown in Figure 2 and Figure 3 are generated by the control action depicted in Figure 4, which is the result produced once the complete set of iterations in the Kiefer–Wolfowitz algorithm has ended. This controller also responds to the effect of uncertainties and perturbations, limiting the possibility that the control action converges towards the origin. Nevertheless, the obtained control action corresponds to the outcome endorsed after the iterative algorithm ends, reaching the early stopping criterion.
Noticing the dependence of the functional $J$ on the weights, Figure 5 shows the evolution of the norm of the initial-condition weights $\omega_k(0)$ over the iterations. This evaluation confirms the tendency of the weights towards the origin under the application of the Kiefer–Wolfowitz method. Even though the optimization strategy does not enforce that the norm of the initial conditions should move to the origin, the continuous dependence of the system trajectories on the control action motivates this behaviour.
Based on the trajectories produced for the state $x$ and the corresponding control action $u$, the proposed functional $J$ was estimated using Formula (4). The evolution of this functional over the iterations is shown in Figure 6, which confirms the effect of the optimization algorithm on the cost, in which the weights also participate. The example presented in this section shows the details of how the proposed algorithm can be applied to an arbitrary system. This permits claiming that the algorithm introduced in this study could work for other systems with a dynamical structure similar to the one used in the example.
The evaluation of the Kiefer–Wolfowitz method also allows estimating the evolution of the norm of the final weights in the proposed DNN. The evolution in discrete steps of the norm of these weights is shown in Figure 7. This evolution is shown along the entire set of iterations, until the norm of the weights becomes smaller than 0.01, which is considered the early stopping criterion. As expected, the norm of the initial weights has a monotonically decreasing tendency, reaching a final value of $\| \omega_{3900}(0) \| = 0.01$. Because the iterative algorithm is based on the evolution of the functional $J$ at the final time $T$, the evolution of this functional over the iterations of the Kiefer–Wolfowitz method confirms the successful application of the recurrent optimization methodology.
The application of the recurrent Kiefer–Wolfowitz method also implied significant effects on the temporal evolution of the cost function over the continuous time within each step (Figure 8).
The lines shown in Figure 8 represent the evolution of the performance function $J$ evaluated with the weights proposed at the beginning of the iterative simulation, $k = 0$ (red line); with the weights generated in the middle of the iterative sequence, $k = 1900$ (blue line); and with the weights corresponding to the end of the process, $k = 3800$ (green line). This figure confirms that the iterative process leads to reducing the final value of the performance function, showing a non-increasing behaviour with respect to the number of iterative cycles. These characteristics confirm the sub-optimal construction of the proposed controller.
The variation of the cost function starts, at the first round of the recurrent sequence, with a non-convergent behaviour after 15 s. When the recurrent algorithm reaches its 1900th step (in the middle of the detected maximum number of steps), the cost function does not grow as fast as in the first step. When the recurrent algorithm reaches the last step, the cost function converges to an almost constant value of 0.8, which indirectly confirms the state convergence to the origin. Moreover, the final value obtained in the temporal evolution of the cost function within each stage of the recurrent algorithm confirms the results shown previously. Notice that this asymptotic behaviour of the performance function at the last step of the iteration algorithm is forced by the robustness of the developed controller to the class of admissible perturbations and modelling uncertainties.

9. Conclusions

The following are the main conclusions of this study:
This study provides a robust optimal controller for a class of perturbed non-linear systems using an approximate model of Bellman's function based on a DNN formulation. The approximation of the min–max value function for the HJB equation allows developing the robust optimal control for non-linear systems affected by non-measurable uncertainties and perturbations. The analytical representation of the system trajectories $x^*(t)$ under the worst admissible uncertainties and perturbations $\nu^*(t)$ is obtained. The analysis of the HJB equation using the trajectories $x^*(t)$ and the approximated value function leads to the learning laws for the DNN weights. The numerical procedure for adjusting the initial value of the weight dynamics is developed based on the modified deterministic version of the Kiefer–Wolfowitz method. The evaluated numerical example confirms the workability of the proposed robust optimal control for a wide class of non-linear systems with an uncertain mathematical model.

Author Contributions

Conceptualization, A.P. and I.C.; methodology, A.H.-S. and M.B.-E.; software, S.N.-M.; validation, A.P., M.B.-E. and I.C.; formal analysis, A.P.; investigation, M.B.-E.; resources, I.C.; data curation, A.H.-S.; writing—original draft preparation, S.N.-M.; writing—review and editing, I.C.; visualization, S.N.-M. and A.H.-S.; supervision, A.P.; project administration, I.C.; funding acquisition, I.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Tecnologico de Monterrey and the Institute of Advanced Materials for Sustainable Manufacturing under the Challenge-Based Research Funding Program 2022, grant number I006-IAMSM004-C4-T2-T.

Data Availability Statement

Data will be made available upon reasonable request to the corresponding author.

Acknowledgments

The authors acknowledge the scholarship provided by the Consejo Nacional de Ciencia y Tecnologia of the Mexican government.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
DNN  Differential Neural Network
HJB  Hamilton–Jacobi–Bellman (equation)
NN   Neural Network

References

  1. Onori, S.; Serrao, L.; Rizzoni, G. Pontryagin's minimum principle. In Hybrid Electric Vehicles; Springer: Berlin/Heidelberg, Germany, 2016; pp. 51–63.
  2. Cannon, M.; Liao, W.; Kouvaritakis, B. Efficient MPC optimization using Pontryagin's minimum principle. Int. J. Robust Nonlinear Control 2008, 18, 831–844.
  3. Kirk, D.E. Optimal Control Theory: An Introduction; Courier Corporation: Englewood Cliffs, NJ, USA, 2004.
  4. Gadewadikar, J.; Lewis, F.L.; Xie, L.; Kucera, V.; Abu-Khalaf, M. Parameterization of all stabilizing H-infinity static state-feedback gains: Application to output-feedback design. Automatica 2007, 43, 1597–1604.
  5. Boltyanski, V.G.; Poznyak, A.S. The Robust Maximum Principle: Theory and Applications; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2011.
  6. Azhmyakov, V.; Boltyanski, V.; Poznyak, A. The dynamic programming approach to multi-model robust optimization. Nonlinear Anal. Theory Methods Appl. 2010, 72, 1110–1119.
  7. Ballesteros, M.; Chairez, I.; Poznyak, A. Robust min–max optimal control design for systems with uncertain models: A neural dynamic programming approach. Neural Netw. 2020, 125, 153–164.
  8. Munos, R. A study of reinforcement learning in the continuous case by the means of viscosity solutions. Mach. Learn. 2000, 40, 265–299.
  9. Swiech, A. Viscosity solutions to HJB equations for boundary-noise and boundary-control problems. SIAM J. Control Optim. 2020, 58, 303–326.
  10. Vrabie, D.; Lewis, F. Adaptive dynamic programming for online solution of a zero-sum differential game. J. Control Theory Appl. 2011, 9, 353–360.
  11. Zhao, D.; Liu, D.; Lewis, F.L.; Principe, J.C.; Squartini, S. Special issue on deep reinforcement learning and adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 2038–2041.
  12. Liu, D.; Wei, Q.; Wang, D.; Yang, X.; Li, H. Adaptive Dynamic Programming with Applications in Optimal Control; Springer: Berlin/Heidelberg, Germany, 2017.
  13. Bertsekas, D.; Tsitsiklis, J.N. Neuro-Dynamic Programming; Athena Scientific: Belmont, MA, USA, 1996.
  14. Poznyak, A.S.; Sanchez, E.N.; Yu, W. Differential Neural Networks for Robust Nonlinear Control: Identification, State Estimation and Trajectory Tracking; World Scientific: Singapore, 2001.
  15. Bertsekas, D.P.; Ioffe, S. Temporal Differences-Based Policy Iteration and Applications in Neuro-Dynamic Programming; Laboratory for Information and Decision Systems Report LIDS-P-2349; MIT: Cambridge, MA, USA, 1996; Volume 14.
  16. Han, Y.; Huang, G.; Song, S.; Yang, L.; Wang, H.; Wang, Y. Dynamic neural networks: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7436–7456.
  17. Ballesteros, M.; Chairez, I.; Poznyak, A. Robust optimal feedback control design for uncertain systems based on artificial neural network approximation of the Bellman's value function. Neurocomputing 2020, 413, 134–144.
  18. Peng, S. A generalized Hamilton–Jacobi–Bellman equation. In Control Theory of Distributed Parameter Systems and Applications: Proceedings of the IFIP WG 7.2 Working Conference, Shanghai, China, 6–9 May 1990; pp. 126–134.
  19. Kundu, S.; Kunisch, K. Policy iteration for Hamilton–Jacobi–Bellman equations with control constraints. Comput. Optim. Appl. 2021, 78, 1–25.
  20. Poznyak, A. Advanced Mathematical Tools for Control Engineers: Volume 1: Deterministic Systems; Elsevier: Amsterdam, The Netherlands, 2010; Volume 1.
  21. Murray, J.J.; Cox, C.J.; Lendaris, G.G.; Saeks, R. Adaptive dynamic programming. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2002, 32, 140–153.
  22. Wang, F.Y.; Zhang, H.; Liu, D. Adaptive dynamic programming: An introduction. IEEE Comput. Intell. Mag. 2009, 4, 39–47.
  23. Lewis, F.L.; Liu, D. Reinforcement Learning and Approximate Dynamic Programming for Feedback Control; John Wiley & Sons: Hoboken, NJ, USA, 2013.
  24. Chairez, I. Wavelet differential neural network observer. IEEE Trans. Neural Netw. 2009, 20, 1439–1449.
  25. Haji, S.H.; Abdulazeez, A.M. Comparison of optimization techniques based on gradient descent algorithm: A review. PalArch's J. Archaeol. Egypt/Egyptology 2021, 18, 2715–2743.
  26. Yu, H.; Wilamowski, B.M. Levenberg–Marquardt training. In Intelligent Systems; CRC Press: Boca Raton, FL, USA, 2018; pp. 12–31.
  27. Kiefer, J.; Wolfowitz, J. Stochastic estimation of the maximum of a regression function. Ann. Math. Stat. 1952, 23, 462–466.
  28. Larson, J.; Menickelly, M.; Wild, S.M. Derivative-free optimization methods. Acta Numer. 2019, 28, 287–404.
Figure 1. Diagram of the min–max optimization approach based on the hybrid form based on the DNN approximation of the Bellman's function to derive the controller and the recurrent method to force the fulfilment of the final condition.
Figure 2. Temporal evolution of the first state $x_1$ considering the initial iteration ($k = 0$) and the final iteration ($k = 2600$).
Figure 3. Temporal evolution of the second state $x_2$ considering the initial iteration ($k = 0$) and the final iteration ($k = 2600$).
Figure 4. Temporal evolution of the control action $u^*$ evaluated with the application of the approximate model based on the DNN.
Figure 5. Recurrent evolution of the norm of the weights at the initial moment defined by the Kiefer–Wolfowitz method.
Figure 6. Recurrent evolution of the cost function defined by the Kiefer–Wolfowitz method.
Figure 7. Recurrent evolution of the norm of the weights at the time moment $T$ defined by the Kiefer–Wolfowitz method.
Figure 8. Temporal behaviour of the cost function depending on the sequence of stages evaluated over the sequence of iterations.