Optimal Greedy Control in Reinforcement Learning

We consider the problem of dimensionality reduction of state space in the variational approach to the optimal control problem, in particular, in the reinforcement learning method. The control problem is described by differential algebraic equations consisting of nonlinear differential equations and algebraic constraint equations interconnected with Lagrange multipliers. The proposed method is based on changing the Lagrange multipliers of one subset based on the Lagrange multipliers of another subset. We present examples of the application of the proposed method in robotics and vibration isolation in transport vehicles. The method is implemented in FRUND—a multibody system dynamics software package.


Introduction and Related Works
The problem of optimal control is an important scientific problem in various fields of technology, e.g., robotics, vibration damping systems, etc. The exact theoretical solution to this problem can be achieved by using Bellman's dynamic programming method [1] and Pontryagin's maximum principle [2]. However, these methods are limited to lowdimensional equations because of their high computational complexity. Today, various variational formulations of optimal control problems, in particular, reinforcement learning, have become widely used. When using this approach, the control problem is simplified by parametrizing the control function and reducing it to the parametric optimization problem. However, a number of topical control problems (for example, in robotics) still have too high dimensionality to be solved efficiently [3][4][5][6][7][8][9][10][11]. Some studies (e.g., We et al. [12] and Tu Vu et al. [13]) have investigated the stability problem of perturbed control motion for known referenced motion, which is a much easier task. Our study is aimed at finding the reference motion.
In the general form, the optimal control problem has the following formulation [1]. For the system described by differential equations f(ẋ, x, u, t) = 0, where x(t) is the coordinate vector of the entire system with dimension n. We need to find control functions, u(t), that let us achieve the extreme value of the criterion I = T 0 R(ẋ, x, u, t)dt.
As a rule, sign-constant functions are used as the R function. It was previously noted that the exact solution for Equations (1) and (2) using Bellman's dynamic programming method and Pontryagin's maximum principle was only obtained for several cases of lowdimension tasks [1,2]. The optimal control problem (1 and 2) is transformed into the parametric optimization problem in the reinforcement learning method by using a discrete form of recording the optimization criterion and parameterizing the control function. The optimization criterion is [14]: where R i is the value of the criterion function corresponding to the i-th moment of time, and γ is the discount coefficient, which takes a value from 0 to 1. It is assumed that the time interval of control T is divided into N sections. The control function is parameterized on basic functions and takes the form u(s, t), where s is the parameter of the control function. Neural networks, Fourier series, etc., can be used as basic functions [14]. The discount coefficient γ allows the optimality criterion to "weaken" (2). Formula (3) corresponds to a discrete formulation of Bellman's optimal control problem when the discount coefficient is equal to one. The control function is called "greedy" in the reinforcement learning method when it was obtained with a discount coefficient equal to zero. The parameterization of the control function for multidimensional problems leads to high-dimensional optimization problems and also makes the solution dependent on the basic functions on which the control function was interpolated. Moreover, the dependence of the control function on time ties it to time-dependent external disturbances. All this makes developing new methods of solving variational formulations of machine learning problems important.

Theoretical Description
Consider the optimal control problem for systems with constraints in the form of algebraic equations. The equations of the state of these systems can be written as where Q(x, t) is the constraint equation vector with dimensionality k ≤ n. For the numerical solution, system (4) is usually used in the form [15][16][17] f(ẋ, x, u, t) where D is the matrix of coefficients of the constraint equations with dimension k × n, p is the k-dimensional vector of Lagrange multipliers, and h(x, t) is the vector of the right parts of derivatives of the constraint equations. The second equation of system (5) is obtained by differentiating the constraint equations with respect to time. The physical meaning of the Lagrange multipliers for the problems of the dynamics of mechanical systems is the constraint reactions. As the applications considered in this article are related to mechanical systems, the term "constraint reactions" will be used as equal to the term "Lagrange multipliers". The differential-algebraic system in Equation (5) is widely used in multibody systems (MBS) dynamics software packages for modeling the dynamics of connected systems of bodies [16]. The features of numerical integration (5) related to ensuring stability are considered in [17]. In numerical integration (5), derivatives of coordinates and Lagrange multipliers from the system of linear algebraic equations are found at each integration step according to M D T D 0 The solution to a system of linear algebraic equations can be written in the following form ẋ Consider the control problem as described in [18]. There is a subset of reactions p 1 in the vector of constraint reactions p whose elements are numbered from the set K 1 ; the number of elements in the set K 1 is k 1 .Their values are described by functions ϕ i (t), i = 1, 2, . . . , k 1 , or in the matrix form There is also a subset of k 2 reactions, p 2 , from the vector of constraint reactions, p, whose values are taken from subset K 2 , which can vary due to changes in the values of unknown functions h 2j (t), j = 1, 2, . . . , k 2 , which will be called corrective terms. Each reaction from p 2 corresponds to its own constraint equation; the relevant corrective term h 2j (t) is added to the right part of this equation. The corrective terms h 2j (t) form a column matrix h 2 in the k 2 dimension. Reactions p 2 , generally, are control functions, so we will consider k 2 as the number of control functions. The values of reactions p 1 , taking into account (7) and the corrective terms, are where A −1 1 corresponding to set K 1 is the submatrix of A −1 , consisting of the rows A −1 whose numbers belong to K 1 . Only the components with numbers from K 2 are non-zero in the column matrix h * 2 of dimension n + k. We assume that p 10 1 with numbers from K 2 and has dimensionality k 1 × k 2 . Then (9) takes the form Taking into account that p 1 = ϕ(t), from (10), we can obtain the system of linear algebraic equations for determining h 2 (t).
If the system of linear Equation (11) is joint, then it is possible to determine the control functions of p 2 as where A −1 2 is the submatrix A −1 corresponding to set K 2 , consisting of rows A −1 whose numbers belong to K 2 . Given that p 20 contains only columns from matrix A −1 2 with numbers from K 2 and has dimension k 2 × k 2 . Equation (12) can be rewritten as Equation (11) gives the values of changes on the right sides of the constraint equations, ensuring the achievement of the desired values of reactions p 1 . Since h 2 (t) affects all the variables in the system (4), when integrating the equations of the mathematical model, the accelerations and constraint reactions are calculated from the system with the modified right side as follows ẋ Equation (11) has a unique solution if k 1 = k 2 and matrix C is non-singular. The properties of matrix C are determined by the properties of matrix A. The main reason for the singularity of matrix A is redundant constraints, i.e., linearly dependent rows in matrix D. Redundant constraints can be inherent to the system's structure or introduced on purpose, for example, in the parallel-structure mechanisms. In the following, it is assumed that the square matrix A is non-singular unless otherwise stated.
Case k 1 = k 2 is the simplest. Cases k 1 = k 2 are of greater interest, so we will consider them further.
The systems where k 1 > k 2 are commonly called underactuated systems. The analysis of the controlled motion of these systems can be found in [19]. It is difficult to obtain a meaningful solution for such systems within the framework of the considered approach.
Systems where k 1 > k 2 are called overactuated. These systems are widespread, for example, in robotics. Their analysis is relevant to our study. We consider the methods of solving the linear system of equations (11) in this case. Matrix C of the system is rectangular-with dimensions k 1 × k 2 . As already mentioned, matrix C is a full-rank matrix.
System (12) can be converted to a system with a square matrix by adding equations. The simplest way of achieving this is by adding linear equations for the corrective terms h 2 , i.e., converting (11) to the form where V 1 is the non-singular matrix of constant terms with dimensionality (k 2 − k 1 ) × k 2 , and b 1 is the column matrix of constant terms on the right side. Only (k 2 − k 1 ) × (k 1 + 1) terms are linearly independent in the second equation of the system (15), so matrix V 1 can be represented as where E is the identity matrix of dimensionality (k 2 − k 1 ) × (k 2 − k 1 ), and V * 1 is the matrix of arbitrary coefficients of dimensionality (k 2 − k 1 ) × k 1 . This method of reduction to a single solution will be called the method of additional equations for corrective terms.
Additional equations to (11) can be formed by imposing linear connections on the controls. The second equation of system (15) will take the form Substituting p 2 = p 20 + Bh 2 (t), we will get We call (17) the method of reduction to a single solution by additional equations for controls. Another way to eliminate the uncertainty of solution (11) is the conditional extremum method. Consider the conditions for the extremum of the expression where V 2 is a diagonal matrix of weights. Therefore, (18) is the weighted sum of the squares of controls. Consider the problem of finding the conditional extremum of expression (18), taking into account the conditions (11). In this case, (18) will be represented as where λ is the column matrix of Lagrange multipliers of dimension k 1 . Extremum conditions for (20) are from which we get where B 1 is the matrix of dimension k 2 × k 2 with elements b 1lm = 2v 2mm the column matrix b 21 of dimension k 2 with elements is the column matrix b 22 of dimension k 1 is The system of linear equations (21) has the square matrix of coefficients of dimension k 1 + k 2 and allows a single solution to be obtained. The method based on the use of (20) will be called the conditional extremum method with constraints in the form of equations of program reactions or simply the conditional extremum method.
This method allows taking into account k 2 − k 1 more of the constraint equations. The linear combinations of forces in the actuators (13) can be used as these constraints. In this case, expression (20) will be where V 3 is the matrix of weights with dimension k 3 × k 2 , k 3 ≤ k 2 − k 1 and λ 1 is the corresponding vector of Lagrange multipliers of dimension k 3 . The linear system, Equation (21), for the functional (25) will have the following form with the matrix B 2 as and the matrix b 23 as b 23 The method based on the use of (25) will be called the conditional extremum method with constraints in the form of forces in the actuators. If k 3 + k 1 = k 2 , system (26) splits into two independent systems. The column matrix h 2 is unambiguously determined from the second and third equations of (26), which are similar to system (17). Function (25) is close to the function of the Karush-Kuhn-Tucker (KKT) method, which is well-known in the theory of nonlinear programming [8]. However, in our case, it is used without any additional conditions that are used in the KKT.
The conditional extremum method makes it possible to reduce the optimal control problem to the parametric optimization problem over the state space of relatively small dimensionality, compared with the direct parameterization of the control functions.

Case Studies
The method considered in Section 2 is implemented in the MBS dynamics software FRUND [20]. The examples below are solved using it.

Inverted Double Pendulum
Consider the flat inverted double pendulum shown in Figure 1. These pendulums are often considered in problems of control synthesis of walking robots [21][22][23][24][25][26][27][28][29][30][31][32]. The limit on the magnitude of the torque in the pendulum support is a condition for the stability of the robot (i.e., avoiding overturning). The problem is to find the law of change of torque at points B and C with an arbitrary law of motion of point A. Simple usage of the constraint equations in one or two directions at point A leads either to the fixation of the pendulum in its original position or to the fall of the pendulum if one connection is specified in the horizontal direction. Various options for finding control torques can be considered within the framework of the proposed method. The simplest case is specifying the constraints at point A along the vertical coordinate Z A = 0. The control torque will be found only at point B. The parameters of the dimension that was introduced in Section 2 are n = 6, k = 6, k 1 = 1, k 2 = 1; the control torque M B is found from Equation (13). The motion picture of the pendulum is presented in Figure 1b. The plot of the torque change M B is presented in Figure 2. During the calculations, it was assumed that the control torque smoothly reaches the program's preset value in 0.05 seconds. The sharp increase in the control torque at the end of the movement is explained by the approach to the singularity position-aligning the pendulum links along the same line (see Figure 1b). Therefore, in this simplest case, the problem of determining the control torque is solved unambiguously.

Spatial Model of Android
Let us consider the spatial motion of a mechanical system on the example of an android robot. The calculation scheme of such a robot is presented in Figure 3. The system parameters are n = 150, k = 144. The number of actuators in the android structure is 21. The calculation scheme contains two masses with large values of inertial parameters to determine 12 reactions in the contacts of the android's feet with the supporting surface. This method allows for solving some special cases of systems with redundant constraints. In this case, the redundant reactions are six reactions in the feet. Calculations were made for the variant of 8 control drives and restrictions on 6 reactions-k 1 = 6, k 2 = 8. Control drives are rotation drives working around the transverse axis of the robot in the hinges of its shoulders, hips, knees, and feet-two for each type of hinge. We modeled the displacement of the center of mass of the robot back by 2 cm in 2 s. Vertical reactions and reaction torques were considered unchanged relative to the transverse axis. Horizontal reactions in the feet were calculated from the horizontal inertia forces caused by the movement of the center of mass. As the torque of the reaction was set to be unchanged while the static torque of this reaction increased due because of the movement of the center of mass, this change was compensated for by the movement of the robot's sections, in particular, by the rotation of its arms (see Figure 4). This movement corresponds to the natural reaction of a human when trying to maintain balance without being able to move their legs. The results shown in Figure 4 are obtained by resolving the ambiguity using the conditional extremum method-Equation (21). The squares of angular velocities in the corresponding hinges were taken as weights. During calculations with the same unit weights, the torque compensation occurred solely because of the movement of the body.  The considered example of controlling an android robot as a multibody mechanical system allows the conclusion that the method is sufficiently versatile and applicable to a wide class of mechanisms to be made, for example, for parallel mechanisms [33][34][35][36][37].

Optimal Control Problems in the Example of Car Vibrations
Optimization criteria such as (2) are widely used in practical applications of mathematics and mechanics [38][39][40]. Consider the classical problem of controlling a vibrationisolating system using the car suspension example. Minimization criterion (2) can be presented in the proposed approach as follows As R * is a function of some program values of Lagrange multipliers p 1 , which are assigned to desirable functions ϕ(t), the sum of the components of vector ϕ(t) can be used as R * function. The minimum of functional (29) is achieved, for example, by functions ϕ(t) being equal to zero. The greedy control criterion (3), in this case, will take the following form The controls for criterion (30) can be found by using (10)- (25). Let us emphasize that the control obtained from criterion (30) is greedy control. However, it is the optimal control as well because it provides the minimum of criterion (29).
Consider the problem of controlling a car's suspension to reduce its vibrations from the impact of the road's micro profile. The existing problem statements can be found in [41][42][43][44]. Figure 5 shows the calculation scheme of the mathematical model of the car, which makes simulating its movement along the road irregularities possible. We simulated the movement of the car through a triangular irregularity. Vertical accelerations in the front of the car are presented in Figure 6. In the controlled version, two connections are set-zero vertical movements at two symmetrical points, A and B, in the front of the car body. The M1 and M2 torques in the two front suspension arms are used as actuators. The dimension parameters are n = 96, k = 85, k 1 = 2, k 2 = 2. Using (12), we determined the control torques in the suspension levers, ensuring the movement of points on the frame with zero reactions. Frame accelerations at the considered points are close to zero (see Figure 6). This problem can be considered an example of solving an optimal control problem with the optimization criterion in the form of zero displacements of the selected points of a mechanical system.

Conclusions
The proposed method of calculating control can be considered a universal theoretical method for solving a wide range of problems related to controlled system dynamics, including the problems of controlling robot manipulators, anthropomorphic and zoomorphic robots, vibration damping problems, etc. The important feature of this method is that it is based on numerical models of machine dynamics, which are widely used in existing computer simulation programs for the dynamics of mechanical systems. The method has no fundamental limitations on the dimensionality of modeled systems and types of nonlinearities.
The evaluation of the proposed method on the described use cases and other test examples proved that computational efficiency has increased for all problems described by DAE (differential-algebraic equations). It was achieved for DAE with a wide range of state dimensions-from 12 to 180 (k 1 + k 2 ) and control dimensions from 1 to 8. The dimensionality of the parameter space is independent of the state dimension and defined only by the number of controls.
The proposed method is a universal theoretical method for the optimal control problem of the systems meeting the following requirements: • the system is described by DAE (2), which has numerical solutions; constraint Equation (1) is a function of coordinates (holonomic constraints in mechanics); • the integral object function contains only Lagrange multipliers (29); • matrix A is not singular; • the linear system in Equation (11) is joint, i.e., it has at least one solution.
The important feature of this method is that it can be considered a kind of machine learning, in particular, reinforcement learning, as a variational formulation of the control problem. The formulation of the proposed method in the form of functionals (20) and (25) corresponds to the so-called "greedy" control [14] in reinforcement learning methods and, at the same time, is the optimal control with the appropriate formulation of integral optimality criteria. From this point of view, the considered method can significantly reduce the dimensionality of the parameter space, and consequently, increase the computational efficiency of machine learning methods.
The purpose of the presented method is to provide the reference optimal trajectories and controls in the case of the agent having complete knowledge of the environment. The stability problem, controller optimization, and uncertainty model fall beyond the borders of this study. For the tasks of robot control, the standard methods of achieving robustness can be used [45].
This work presents the fundamental theoretical provisions of the method and does not address such issues as control stability and control in systems with singular matrices. These issues are the subject of further research.