Reinforcement Learning for Solving Control Problems in Robotics

Diveev, Askhat; Sofronova, Elena; Konstantinov, Sergey; Moiseenko, Viktoria

doi:10.3390/engproc2023033029

Open AccessProceeding Paper

Reinforcement Learning for Solving Control Problems in Robotics^†

¹

Federal Research Center “Computer Science and Control” of the Russian Academy of Sciences, 44/2 Vavilova str., Moscow 119333, Russia

²

RUDN University, 6 Miklukho-Maklaya str., Moscow 117198, Russia

^*

Author to whom correspondence should be addressed.

^†

Presented at the 15th International Conference “Intelligent Systems” (INTELS’22), Moscow, Russia, 14–16 December 2022.

Eng. Proc. 2023, 33(1), 29; https://doi.org/10.3390/engproc2023033029

Published: 15 June 2023

(This article belongs to the Proceedings of 15th International Conference “Intelligent Systems” (INTELS’22))

Download

Browse Figures

Versions Notes

Abstract

:

The use of reinforcement learning technology for the optimal control problem solution is considered. To solve the optimal control problem an evolutionary algorithm is used that finds control to ensure the movements of a control object along different trajectories with approximately the same values of the quality criterion. Additional conditions for passing the trajectory in the neighbourhood of given areas of the state space are included in the quality criterion. To build a stabilization system for the movement of an object along a given trajectory, machine learning control by symbolic regression is used. An example of solving the optimal control problem for a quadcopter is given.

Keywords:

optimal control; evolutionary algorithm; reinforcement learning; symbolic regression

1. Introduction

The optimal control problem with phase constraints often has a multi-modal functional. Therefore, with its numerical solution by direct approach, it is possible to obtain several control functions that ensure the movement of the object along different trajectories in the state space with approximately the same value of the control quality criterion which is close to the optimal.

A numerical solution to the optimal control problem leads to some difficulties. As a rule, in most optimal control problems, it is necessary to minimize not one but at least two criteria, reach the control goal or minimize the error of reaching the terminal state and still minimize the given quality criterion. Addition of weight coefficients into criteria does not significantly simplify the problem, since the problem of choosing weights arises.

Another search problem is defined as the loss of unimodality of the functional on the space of parameters of the approximating function. Even a piecewise linear approximation of the control function, when only one parameter needs to be found on each interval for each control component, does not guarantee the presence of a single minimum of the goal functional on the space of parameters.

The problem becomes more complicated in the presence of phase constraints that describe the areas of state space forbidden for the optimal trajectory. It is most likely that due to these reasons, and despite numerous attempts [1,2], a universal computational method for the optimal control problem has not been created.

Further studies have shown that if a strictly optimal solution is not needed and solutions close to the optimal are quite satisfactory, then evolutionary algorithms can be successfully applied to the optimal control problems [3].

Sometimes in practice the researcher knows how the object should move along the optimal trajectory, i.e., approximately knows the areas in the state space the optimal trajectory should pass through. If we introduce additional requirements in the form of passing through the given areas into the quality criterion, then the evolutionary algorithm should change the search area and look for a solution that satisfies the additional requirements. This approach is effective when using evolutionary algorithms. Due to the inheritance property, the improvement of the criterion value at each generation is performed on the basis of small evolutionary transformations of possible solutions to the previous generation. Therefore, if at some generation one of the possible solutions passes through the required areas specified by the researcher, then with a high probability the evolutionary algorithm will search for the optimal solution that preserves the obtained properties. A similar technique is used in machine learning with reinforcement [4,5], when the researcher awards the object by the change of the target functional value for the right actions. Currently, reinforcement learning is actively used in the practice of solving control problems [6]. The paper contains a formal description and practical application of reinforcement learning for solving the optimal control problem.

2. The Optimal Control Problem and Reinforcement Learning

Consider a formal statement of the optimal control problem.

The mathematical model of a control object is given in the Cauchy form of an ordinary differential equation system

\dot{x} = f (x, u),

(1)

where

x

is a state space vector,

u

is a control vector,

x \in R^{n}

,

u \in U \subseteq R^{m}

, and

U

is a compact set.

The initial state is given by

x (0) = x^{0} \in R^{n} .

(2)

The terminal state is given by

x (t_{f}) = x^{f} \in R^{n},

(3)

where

t_{f}

is the time to reach the terminal state (3). Time

t_{f}

is not given, but it is limited to

t_{f} \leq t^{+}

, where

t^{+}

is a given positive value.

The quality criterion is given by

J = \int_{0}^{t_{f}} f_{0} (x, u) d t \to min_{u \in U} .

(4)

Assume that the researcher knows the areas in the state space of where the optimal trajectory should be. Then, additional conditions are included in the quality criterion

J_{1} = \int_{0}^{t_{f}} f_{0} (x, u) d t + p \sum_{i = 1}^{r} ψ_{i} (x (t)) \to min_{u \in U},

(5)

where

ψ_{i} (x) = ϑ (min_{t} {∥ z^{i} - x ∥} - ε_{i}) (min_{t} {∥ z^{i} - x ∥} - ε_{i}),

(6)

p is a penalty coefficient, and

ϑ (α)

is a Heaviside step function

ϑ (α) = \{\begin{matrix} 1, if α > 0 \\ 0, otherwise \end{matrix},

(7)

ε_{i}

,

i = 1, \dots, r

are given small positive values, and

z^{i}

,

i = 1, \dots, r

are the centres of known areas.

According to the introduced additional conditions, if an optimal trajectory does not pass near some given point

z^{i}

, then value of criterion (5) will grow.

3. Computation Experiment

Consider the optimal control problem for the spatial movement of a quadcopter. The mathematical model of the control object is

\begin{matrix} {\dot{x}}_{1} & = & x_{4}, \\ {\dot{x}}_{2} & = & x_{5}, \\ {\dot{x}}_{3} & = & x_{6}, \\ {\dot{x}}_{4} & = & u_{4} (sin (u_{3}) cos (u_{2}) cos (u_{1}) + sin (u_{1}) sin (u_{2})), \\ {\dot{x}}_{5} & = & u_{4} cos (u_{3}) cos (u_{1}) - g_{c}, \\ {\dot{x}}_{6} & = & u_{4} (cos (u_{2}) sin (u_{1}) - cos (u_{1}) sin (u_{2}) sin (u_{3})), \end{matrix}

(8)

where

g_{c} = 9.80665

.

Control is constrained

u_{i}^{-} \leq u_{i} \leq u_{i}^{+}, i = 1, \dots, 4,

(9)

where

u_{1}^{-} = - π / 12

,

u_{1}^{+} = π / 12

,

u_{2}^{-} = - π

,

u_{2}^{+} = π

,

u_{3}^{-} = - π / 12

,

u_{3}^{+} = π / 12

,

u_{4}^{-} = 0

and

u_{4}^{+} = 12

.

The initial state is given by

x^{0} = {[0 5 0 0 0 0]}^{T} .

(10)

The terminal state is given by

x^{f} = {[10 5 10 0 0 0]}^{T} .

(11)

The phase constraints are given by

φ_{k} (x) = r_{k} - \sqrt{{(x_{1} - x_{k, 1})}^{2} + {(x_{3} - x_{k, 3})}^{2}} \leq 0,

(12)

where

k = 1, 2

,

r_{1} = r_{2} = 2

,

x_{1, 1} = 2.5

,

x_{1, 3} = 2.5

,

x_{2, 1} = 7.5

,

x_{2, 3} = 7.5

.

It is necessary to find a control function, taking into account the constraints in (9), that minimizes the following criterion

J_{1} = t_{f} + p_{1} \sum_{k = 1}^{2} \int_{0}^{t_{f}} ϑ (φ_{k} (x)) d t \to min_{u \in U},

(13)

where

t_{f} \leq t^{+} = 5.6

,

p_{1} = 3

.

To solve the control problem numerically, let us use a piecewise linear approximation. The time axis is divided into equal intervals

Δ t

, and the search for constant parameters is performed at the interval boundaries for each control component. Control is a piecewise linear function that consists of segments connecting points at the bounds of intervals. Given the control constraints, the desired control function is as follows

u_{i} = \{\begin{matrix} u_{i}^{+}, if {\hat{u}}_{i} \geq u_{i}^{+} \\ u_{i}^{-}, if {\hat{u}}_{i} \leq u_{i}^{-} \\ {\hat{u}}_{i}, otherwise \end{matrix}, i = 1, \dots, 4,

(14)

where

{\hat{u}}_{i} = d_{i + (j - 1) m} + (d_{i + j m} - d_{i + (j - 1) m}) \frac{t - (j - 1) Δ t}{Δ t}, i = 1, \dots, 4, j = 1, \dots, K,

(15)

and K is a number of time interval boundaries

K = ⌊\frac{t^{+}}{Δ t}⌋ + 1 = ⌊\frac{5.6}{0.4}⌋ + 1 = 15 .

(16)

When solving the problem by a direct approach, the condition of reaching the terminal state is included in the quality criterion

J_{2} = t_{f} + p_{1} \sum_{k = 1}^{2} \int_{0}^{t_{f}} ϑ (φ_{k} (x)) d t + p_{2} ∥ x^{f} - x (t_{f}) ∥ \to min_{u \in U},

(17)

where

p_{2} = 1

,

t_{f} = \{\begin{matrix} t, if t \leq t^{+} and ∥ x^{f} - x (t) ∥ \leq ε = 0.01 \\ t^{+}, otherwise \end{matrix} .

(18)

To solve the problem, a hybrid evolutionary algorithm [7] is used.

Figure 1 and Figure 2 show projections on the horizontal plane

{x_{1}; x_{3}}

of the two found optimal trajectories. The big circles present the phase constraints in (12).

The criterion for the solutions found had the following values: for the solution in Figure 1

J_{2} = 5.6434

, for the solution in Figure 2

J_{2} = 5.6330

.

As can be seen from the experiment, the values of the criteria practically coincide, the solutions found ensure the movement of the object from the given initial state (10) to the given terminal state (11) without violation of the phase constraints. In the series of experiments, the hybrid evolutionary algorithm found solutions that bypass the phase constraints either from above, as in Figure 1, or from below, as in Figure 2.

Suppose that we need the control object to move between obstacles. For this purpose the desired areas on the horizontal plane are defined. It is known that in the presence of interfering phase constraints, the optimal trajectory should be close to the boundary of these constraints. For the given problem four desired areas are defined as

\begin{matrix} z^{1} = {[2.5 0.4]}^{T}, ε_{1} = 0.6, \\ z^{2} = {[4.5 2.5]}^{T}, ε_{2} = 0.6, \\ z^{3} = {[5.5 7.5]}^{T}, ε_{3} = 0.6, \\ z^{4} = {[7.5 9.6]}^{T}, ε_{4} = 0.6 . \end{matrix}

(19)

The conditions for passing through the desired areas (19) are included in the quality criterion

J_{3} = t_{f} + p_{1} \sum_{k = 1}^{2} \int_{0}^{t_{f}} ϑ (φ_{k} (x)) d t + p_{2} ∥ x^{f} - x (t_{f}) ∥ + p_{3} \sum_{i = 1}^{4} ψ_{i} (x) \to min_{u \in U},

(20)

where

p_{3} = 3

.

The hybrid evolutionary algorithm found the following optimal solution

d = {[d_{1} \dots d_{60}]}^{T} = [- 11.1092

,

5.9957

,

- 0.0532

,

7.0045

,

17.5091

,

18.1172

,

- 1.6764

,

18.7012

,

- 16.0121

,

10.5543

,

- 19.9307

,

12.1721

,

- 6.0892

,

0.5339

,

- 0.8616

,

19.2556

,

- 13.7218

,

15.1266

,

0.3982

,

14.2650

,

- 1.1768

,

2.9832

,

4.3286

,

15.1508

,

- 8.9240

,

- 19.6814

,

4.5363

,

15.9879

,

- 0.0026

,

1.1203

,

13.2592

,

- 6.6358

,

- 6.2012

,

- 0.5328

,

- 0.0354

,

4.2548

,

11.6764

,

- 4.3345

,

- 6.7336

,

19.8643

,

0.3360

,

- 8.9741

,

- 2.6648

,

12.5608

,

19.6577

,

- 19.9308

,

- 1.6252

,

19.3797

,

- 1.1954

,

2.2625

,

5.9582

,

16.0807

,

- 0.8272

,

2.3167

,

0.9842

,

14.2695

,

- 6.3767

,

2.3895

,

0.3742

,

{16.2710]}^{T}

.

Figure 3 shows the projection of the found optimal trajectory for the solution with quality criterion

J_{3} = 5.5730

. Small dashed circles are the desired areas, while big circles are the constraints.

To implement the obtained solution according to the extended statement of the optimal control problem, it is necessary to build a system to stabilize the movement of the object along the optimal trajectory [8]. For this purpose, machine learning control is used [9]. The control function structure search is carried out by symbolic regression [10].

The obtained solution is

u_{i} = \{\begin{matrix} u_{i}^{+}, if {\tilde{u}}_{i} \geq u_{i}^{+} \\ u_{i}^{-}, if {\tilde{u}}_{i} \leq u_{i}^{-} \\ {\tilde{u}}_{i}, otherwise \end{matrix}, i = 1, \dots, 4,

(21)

where

{\tilde{u}}_{1} = μ (G),

(22)

{\tilde{u}}_{2} = ({\tilde{u}}_{1} - {\tilde{u}}_{1}^{3}) ρ_{17} (A + μ (G)) ϑ (F) ρ_{17} (x_{4}^{*} - x_{4}),

(23)

{\tilde{u}}_{3} = {\tilde{u}}_{2} + tanh ({\tilde{u}}_{1}) + ρ_{19} (A + μ (G)) + ρ_{17} (W),

(24)

\begin{matrix} {\tilde{u}}_{4} & = & {\tilde{u}}_{3} + ln | {\tilde{u}}_{2} | + sgn (A + μ (G)) \sqrt{| A + μ (G) |} + ρ_{19} (A) + arctan (C) + \\ sgn (E) + arctan (F) + exp (q_{2} (x_{2}^{*} - x_{2})) + \sqrt{q_{1}}, \end{matrix}

(25)

A = B sgn (D) \sqrt{| D |} tanh (D) exp (H), B = exp (C) + ρ_{17} (F) + cos (q_{6} (x_{6}^{*} - x_{6})),

C = D + tanh (E) + ρ_{18} (V), D = E + \sqrt[3]{F} + sin (W),

E = F + G + exp (H) - V, F = H + sgn (x_{5}^{*} - x_{5}) + {(x_{2}^{*} - x_{2})}^{3},

G = q_{6} (x_{6}^{*} - x_{6}) + q_{3} (x_{3}^{*} - x_{3}) + sgn (x_{2}^{*} - x_{2}) \sqrt{| x_{2}^{*} - x_{2} |},

H = ρ_{17} (q_{6} (x_{6}^{*} - x_{6}) + q_{3} (x_{3}^{*} - x_{3})) + V^{3} + W + q_{6} q_{5}^{2} {(x_{5}^{*} - x_{5})}^{2} + {(x_{5}^{*} - x_{5})}^{2},

V = sin (q_{6} (x_{6}^{*} - x_{6})) + q_{5} (x_{5}^{*} - x_{5}) + q_{2} (x_{2}^{*} - x_{2}) + cos (q_{1}) + exp (x_{5}^{*} - x_{5}) + ϑ (x_{2}^{*} - x_{2}),

W = q_{4} (x_{4}^{*} - x_{4}) + q_{1} (x_{1}^{*} - x_{1}) + sin (q_{6}),

μ (α) = \{\begin{matrix} α, if | α | \leq 1 \\ sgn (α), otherwise \end{matrix}, ρ_{17} (α) = sgn (α) ln (| α | + 1),

ρ_{18} (α) = sgn (α) (exp (| α |) - 1), ρ_{19} (α) = sgn (α) exp (- | α |),

where

q_{1} = 13.02930

,

q_{2} = 11.21509

,

q_{3} = 15.91016

,

q_{4} = 14.33447

,

q_{5} = 14.67798

,

q_{6} = 9.91431

and

x^{*} = {[x_{1}^{*} \dots x_{6}^{*}]}^{T}

is a state vector of the reference model.

Figure 4 shows the trajectories from eight initial states on the horizontal plane.

4. Results

The paper presents the use of machine learning technology with reinforcement to solve the optimal control problem with phase constraints using an evolutionary algorithm. To implement reinforcement learning, additional conditions defining the form of the optimal trajectory are introduced into the quality criterion. The optimal trajectory should pass through specified areas whose positions depend on the phase constraints. An example of solving the optimal control problem for a quadcopter by machine learning with reinforcement was given.

5. Discussion

The use of reinforcement learning technology to solve the optimal control problems of robotic devices is advisable, since in most cases the developer approximately knows the form of the optimal trajectory for the problem being solved.

Author Contributions

Conceptualization, A.D. and E.S.; methodology, A.D. and S.K.; software, A.D., S.K. and E.S.; validation, A.D. and V.M.; formal analysis, E.S.; investigation, S.K.; data curation, A.D., S.K. and E.S.; writing—original draft preparation, A.D, E.S. and V.M.; writing—review and editing, E.S.; visualization, V.M.; supervision, A.D. All authors have read and agreed to the published version of the manuscript.

Funding

The work was performed with partial support from the Russian Science Foundation, Project No 23-29-00339.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are openly available in https://cloud.mail.ru/public/pKW4/fmrfqjkv3 (for optimal trajectory 1), https://cloud.mail.ru/public/mVZ5/Kr2yb9jJn (for optimal trajectory 2).

Conflicts of Interest

The authors declare no conflict of interest.

References

Rao, A.V. A Survey of Numerical Methods for Optimal Control. Adv. Astronaut. Sci. 2010, 135, 497–528. [Google Scholar]
Grachev, N.I.; Evtushenko, Y.G. A Library of Programs for Solving Optimal Control Problems. USSR Comput. Math. Math. Phys. 1980, 2, 99–119. [Google Scholar] [CrossRef]
Diveev, A.I.; Konstantinov, S.V. Study of the Practical Convergence of Evolutionary Algorithms for the Optimal Program Control of a Wheeled Robot. J. Comput. Syst. Sci. Int. 2018, 4, 561–580. [Google Scholar] [CrossRef]
Brown, B.; Zai, A. Deep Reinforcement Learning in Action; Manning, Publications Co.: Shelter Island, NY, USA, 2019; 475p. [Google Scholar]
Morales, M. Grokking Deep Reinforcement Learning; Manning Publications Co.: Shelter Island, NY, USA, 2020; 472p. [Google Scholar]
Duriez, T.; Brunton, S.; Noack, B.R. Machine Learning Control–Taming Nonlinear Dynamics and Turbulence; Springer: Cham, Switzerleand, 2017; 229p. [Google Scholar]
Diveev, A. Hybrid Evolutionary Algorithm for Optimal Control Problem. In Lecture Notes in Networks and Systems; Springer: Berlin/Heidelberg, Germany, 2023; Volume 543, pp. 726–738. [Google Scholar]
Diveev, A.; Sofronova, E. Synthesized Control for Optimal Control Problem of Motion Along the Program Trajectory. In Proceedings of the 2022 8th International Conference on Control, Decision and Information Technologies (CoDIT), Istanbul, Turkey, 17–20 May 2022; pp. 475–480. [Google Scholar]
Diveev, A.I.; Shmalko, E.Y. Machine Learning Control by Symbolic Regression; Springer: Cham, Switzerleand, 2021; 155p. [Google Scholar]
Koza, J.R.; Keane, M.A.; Streeter, M.J.; Mydlowec, W.; Yu, J.; Lanza, G. Genetic Programming IV. Routine Human-Competitive Machine Intelligence; Springer: Boston, MA, USA, 2003; 590p. [Google Scholar]

Figure 1. Projection of optimal trajectory 1 on the horizontal plane.

Figure 2. Projection of optimal trajectory 2 on the horizontal plane.

Figure 3. Projection on the horizontal plane of the optimal trajectory found by reinforcement learning.

Figure 4. Projection on the horizontal plane of the optimal trajectories from eight initial states.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Diveev, A.; Sofronova, E.; Konstantinov, S.; Moiseenko, V. Reinforcement Learning for Solving Control Problems in Robotics. Eng. Proc. 2023, 33, 29. https://doi.org/10.3390/engproc2023033029

AMA Style

Diveev A, Sofronova E, Konstantinov S, Moiseenko V. Reinforcement Learning for Solving Control Problems in Robotics. Engineering Proceedings. 2023; 33(1):29. https://doi.org/10.3390/engproc2023033029

Chicago/Turabian Style

Diveev, Askhat, Elena Sofronova, Sergey Konstantinov, and Viktoria Moiseenko. 2023. "Reinforcement Learning for Solving Control Problems in Robotics" Engineering Proceedings 33, no. 1: 29. https://doi.org/10.3390/engproc2023033029

APA Style

Diveev, A., Sofronova, E., Konstantinov, S., & Moiseenko, V. (2023). Reinforcement Learning for Solving Control Problems in Robotics. Engineering Proceedings, 33(1), 29. https://doi.org/10.3390/engproc2023033029

Article Menu

Reinforcement Learning for Solving Control Problems in Robotics^†

Abstract

1. Introduction

2. The Optimal Control Problem and Reinforcement Learning

3. Computation Experiment

4. Results

5. Discussion

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Reinforcement Learning for Solving Control Problems in Robotics †

Abstract

1. Introduction

2. The Optimal Control Problem and Reinforcement Learning

3. Computation Experiment

4. Results

5. Discussion

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Reinforcement Learning for Solving Control Problems in Robotics^†