A Parametric Study of a Deep Reinforcement Learning Control System Applied to the Swing-Up Problem of the Cart-Pole

In this investigation, the nonlinear swing-up problem of the cart-pole system, modeled as a multibody dynamical system, is solved by developing a deep Reinforcement Learning (RL) controller. Furthermore, a sensitivity analysis of the deep RL controller applied to the cart-pole swing-up problem is carried out. To this end, the influence of modifying the physical properties of the system and of the presence of dry friction forces is analyzed by employing the cumulative reward during the task. Extreme limits for the modifications of the parameters are determined to prove that the neural network architecture employed in this work features enough learning capability to handle the task under modifications as high as 90% on the pendulum mass, as well as a 100% increment on the cart mass. As expected, the presence of dry friction greatly affects the performance of the controller. However, a post-training of the agent in the modified environment takes only thirty-nine episodes to find the optimal control policy, resulting in a promising path for further developments of robust controllers.


Introduction
This section provides some necessary background material and emphasizes the significance of the present research. For this purpose, the formulation of the specific problem of interest for this investigation is set first. A concise literature review on the topics considered in this work is then reported for the benefit of the reader. More importantly, the scope and contributions of this paper are highlighted, whereas the structure of the manuscript is finally illustrated.

Background and Significance of the Research
The control of articulated mechanical systems relies on a high level of system understanding [1], expressed mainly through multibody dynamics models [2][3][4]. However, this approach is restrictive for systems with a high degree of complexity or uncertainty [5,6]. In particular, for model-based control algorithms, such as the linear-quadratic regulator (LQR) and back-stepping (BS) techniques, the use of the differential-algebraic equations of motion typical of the multibody approach is challenging. These methods heavily rely on model accuracy, which hinders their application to real platforms [7]. Therefore, the development of controllers for nonlinear systems is extremely challenging, time-consuming, and in many cases an infeasible task. The traditional engineering approach consists of analytically deriving interactions [27]. Regardless of the simulator accuracy, there is also a substantial difference between simulation and reality, the so-called reality gap [38]. However, RL shows satisfactory results in simulated environments. Despite this fact, it has not been broadly applied to real physical systems, such as robots, because RL models trained in simulation typically underperform in reality and need to be retrained in the modified environment, which is hazardous and time-consuming [39].

Literature Review
The use of benchmark problems in the research and development of control systems is fundamental for analyzing a proposed system's quality. Besides, this allows its comparison with other existing systems. Perhaps one of the most famous benchmarks in control development is the inverted pendulum system. Recently, a book on its use in control and robotics was published [40]. This is because, despite its structural simplicity, it is challenging to obtain a proper control law of the inverted pendulum system for many standard control techniques, in particular given its high nonlinearity, non-minimum phase, and underactuated nature. Research work on the inverted pendulum system focuses on developing controllers capable of swinging the pendulum while exhibiting robustness in the presence of external disturbances.
Multiple model-based control approaches are found in the literature considering the inverted pendulum benchmark. Nonlinear Model Predictive Control (NMPC) was employed in [41] for the swing-up of the parallel double inverted pendulum. The authors considered the Real-Time Iteration (RTI) scheme to allow the control to be implemented in hardware. In [42], the Integral Sliding Mode Control (ISMC) and the State-Dependent Riccati Equation (SDRE) were combined to perform the swing-up of the rotary inverted pendulum. The resulting robust continuous-time nonlinear optimal controller was experimentally tested. The work in [43] presented a Switched Robust Integral Sliding Mode (SRISM) control to address unmatched uncertainties. A Backstepping Sliding Mode Control (BSMC) scheme was proposed in [44] by transforming the dynamic system model into the so-called regular form. In this way, they took advantage of the convenient features that characterize the backstepping and sliding mode control schemes. A similar approach is that found in [45].
Many works have recently been reported in the literature for multiple types of nonlinear controllers featuring automatic tuning methodologies. In particular, metaheuristic algorithms to perform online tuning of the control parameters have been reported. For instance, Al-Araji presented a sliding mode control for the cart-pole system swing-up featuring the culture bees algorithm for online parameter tuning [46]. Su et al. complemented the hybrid sliding mode control by self-tuning the control parameters with the fireworks algorithm [47]. In addition, Mansoor and Bhutta considered the backstepping control scheme with the genetic algorithm for a similar scope [48]. Other authors considered the energy method to address the cart-pole swing-up control task. For instance, Singla and Singh considered this approach for the swing-up and switches to a Linear Quadratic Regulator Controller (LQRC) to stabilize in the vertical position [49].
In contrast to the mentioned model-based controllers, a model-free approach is that found in [50]. There, the energy method was employed in an online setup with no information of the system's physical parameters. The algorithm self-learns a weight that is used in normalizing the energy injected to the pendulum. An approach considering neural networks to model the controller was presented in [51]. There, the swing-up problem of the double inverted pendulum was solved in a data-driven framework employing a neural network controller with parameters optimized via the hybrid uniform design multiobjective genetic algorithm. In [52], the Advantage Actor-Critic (A2C) was used to stabilize the cart-pole system [53]. This RL algorithm is considered in two different setups for the observations. One is with the state of the system, while the other employs a convolutional neural network to extract features from the animation produced by the simulation. Both cases consider a continuous observation space and a discrete control space. Recently, there has been interest in combining model-free and model-based approaches for controller development. Some works with this perspective are [54][55][56]. In these works, the RL was complemented with Proportional-Integral-Derivative (PID) controller, sliding mode control, and fuzzy logic, respectively.
The effect of the uncertainty in the friction coefficient on the controllers was analyzed in [57]. In this work, an NMPC and an RL based on temporal-difference (TD) were benchmarked under environments where the viscous friction, modeled as a damping force, was modified [58]. The effect of the uncertainty in the physical parameters of the cart-pole system was addressed briefly in [59]. There, the performance of an LQRC to balance the pole in the vertical position was evaluated when modifying the length of the pole.

Scope and Contributions of This Paper
The ML and deep RL techniques considered in this paper have great potential for mechanical engineering applications that are not well explored in the literature. Basically, there are two key points. First, RL methods allow for developing effective nonlinear controllers of mechanical systems that, to some extent, ignore the underlying physical behavior of the dynamical system to be controlled. In fact, the mechanical system to be controlled is treated as a black-box external environment from which input and output data are collected. This allows for using the same strategy for the control design in a large variety of state-space systems. As discussed in the paper, the control designer is called to develop an appropriate shape of the reward function that is more suitable for the engineering problem to be considered. Second, by using the post-training approach devised in this paper, one can design a robust control policy. Therefore, its performance is still satisfactory when some environment parameters are changed because of model uncertainty or due to other unpredictable phenomena. To investigate these two key points, a benchmark problem of nonlinear control is employed in the paper, namely the swing-up maneuver of the cart-pole system. The presence of dry friction is also considered to assess the robustness of the proposed approach. The difficulties related to traditional robust control systems with uncertainties in their physical parameters are often discussed in the literature. On the other hand, controllers based on reinforcement learning employ neural networks to represent the actuation policy. Thus, these present robust performance by virtue of the generalization capability of neural networks. Such robustness is exploited and improved in this work by employing a post-training of an optimal control policy in modified environments.
The study of the swing-up problem for the cart-pole system is one of the most common benchmark problems found in the literature for testing, through numerical experiments, the effectiveness of a novel approach for designing a nonlinear controller. Thus, the scope and the contributions of this work are to study the robustness of an RL-based control system, specifically developed and applied to the swing-up task of the cart-pole system, and to address the issue of data efficiency in real-world implementations. This is performed through a parametric study of a trained policy represented by a multilayer perceptron. The main procedure consists of evaluating the behavior of the control system under modified environments that feature modifications of the cart-pole physical properties. This, to the best of the authors' knowledge, was only briefly investigated for an RL controller with a simplified friction force approximated as damping in [57]. In contrast, the dry friction model employed in the present work is more accurate. Therefore, this paper is the first work in which the proposed control method based on deep reinforcement learning techniques is thoroughly developed and tested considering dry friction phenomena. Additionally, modifications are considered for the mass of the cart, the mass and length of the pendulum, as well as the dry friction coefficient. These are varied in reasonable ranges to test the robustness of the control law devised in this paper. Besides, a method is proposed to further extend the agent capability in a data-efficient way by performing a subsequent training of the agent neural networks in a modified environment in the presence of dry friction. To this end, dry friction is modeled as a smoothed function of the instant normal force between the cart and the road.
The findings of this work are based on extensive numerical experiments and result in a promising methodology for improvements in data efficiency and robustness for real systems' implementation.
In this paper, a simplified but realistic friction model is employed, namely a nonlinear viscous friction law based on the hyperbolic tangent, to perform the numerical experiments on the performance of the proposed controller. Consequently, some complex nonlinear features related to general friction phenomena, such as the stick-slip effect, static friction, or, to some extent, the Stribeck curvature, are not captured by the present model. Although some information is lost by doing so, nonlinear viscous friction laws are widespread in engineering applications for the analysis of articulated mechanical systems. In fact, in general, specific computational tools should be employed to consider friction phenomena thoroughly. This mainly means using an appropriate numerical solution procedure for solving the equations of motion with discontinuous and fully nonlinear friction effects, at the expense of the computational time required to solve the set of ordinary differential equations. This is one of the main reasons why simplified versions of the friction model, such as linear and nonlinear friction laws, are preferred in engineering applications. This is particularly true for the present work since, for each case considered in the paper, the differential equations of the cart-pole system must be solved a hundred times to run the deep reinforcement learning algorithms. Furthermore, in this paper, the effect of friction is mainly taken into account to demonstrate the effectiveness of a given controller designed in the absence of friction and subsequently post-trained to cope with its presence as an additional unmodeled disturbance phenomenon. Therefore, as mentioned before, one should in general model friction forces with more complete laws that better capture the phenomenon, but this is neither functional nor useful for the goals of the present investigation.

Organization of the Manuscript
The remainder of this paper is organized as follows. Section 2 elaborates on the theoretical framework of the RL algorithm employed and the experimental procedure proposed. Section 3 describes in detail the nonlinear model of the dynamical system under consideration and the setup of the numerical experiments. Section 4 reports the numerical results found for the environments considered, as well as some final remarks on the work done.

Research Methodology
RL is a data-driven approach for solving decision-making problems in the form of Markov Decision Processes (MDP). Figure 1 shows the workflow in which an agent is fed with an observation $s$ and a reward $r$ so that it can decide on an action $a$ to take on an environment, with the scope of optimizing the cumulative reward $\sum r$ over time [60,61]. In this work, the algorithm used for training the agent is the Deep Deterministic Policy Gradient (DDPG) [62], an off-policy, model-free, actor-critic structured approach. The agent is composed of four neural networks, namely the Q-function $Q_{\vartheta}$, the policy $\pi_{\phi}$, and their respective lagged target copies $Q_{\vartheta'}$ and $\pi_{\phi'}$. The $Q_{\vartheta}$ network is used to approximate the Q-function of the environment, that is, the function predicting the cumulative reward for a given $s$ and $a$. The policy $\pi_{\phi}$ maps the state to an action. Each of the target copies contains lagged parameters of its corresponding neural network. The DDPG algorithm works through the joint learning of the optimal Q-function and the policy. The reinforcement learning loop begins with the random initialization of the four neural networks. Then, the agent begins to act in the environment to collect experiences in an experience buffer of dimension $D$. Each entry of the experience buffer is a tuple with the structure $(s, a, r, s')$, where $s'$ represents the new state of the system after taking action $a$.
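The experience buffer described above is essentially a fixed-capacity FIFO store of $(s, a, r, s')$ tuples with uniform random sampling. The following is an illustrative Python sketch, not the authors' implementation (which is built in Simulink, as shown later in Figure 3):

```python
import random
from collections import deque


class ReplayBuffer:
    """FIFO experience buffer of capacity D storing (s, a, r, s') tuples."""

    def __init__(self, capacity):
        # deque with maxlen evicts the oldest experience once full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        """Store one transition observed while acting in the environment."""
        self.buffer.append((s, a, r, s_next))

    def sample(self, d):
        """Uniformly sample a minibatch of size d for the critic update."""
        return random.sample(self.buffer, d)

    def __len__(self):
        return len(self.buffer)
```

Once the buffer holds at least `d` transitions, each training step draws a minibatch with `sample(d)` and feeds it to the critic and actor updates.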
Then, in order to approximate the optimal Q-function, the Mean Squared Bellman Error (MSBE) is minimized by employing the loss function given by:

$$L_{\mathrm{MSBE}} = \frac{1}{d} \sum_{i=1}^{d} \left( Q_{\vartheta}(s_i, a_i) - y_i \right)^2 \qquad (1)$$

where $Q_{\vartheta}(s, a)$ is the neural network with the current approximation of the optimal Q-function, $d$ is the size of the replay buffer of previous experiences randomly sampled from the experience buffer of length $D$ containing a subset of all the transitions in $(s, a, r, s')$ tuples, $s'$ is the observation of the subsequent state, and $y_i$ is the target value function defined as:

$$y_i = \begin{cases} r_i + \gamma\, Q_{\vartheta'}\!\left(s'_i, \pi_{\phi'}(s'_i)\right), & t < t_{\mathrm{end}} \\ r_i, & t = t_{\mathrm{end}} \end{cases} \qquad (2)$$

where $t$ is the time of the observation $s'$, $t_{\mathrm{end}}$ is the total duration of the training episode, and $\gamma$ is the discount factor. In Equation (2), the lagged target copies $\pi_{\phi'}(s'_i)$ and $Q_{\vartheta'}\!\left(s'_i, \pi_{\phi'}(s'_i)\right)$ are used to avoid the instability in the training process derived from the recursive approach used to approximate the Q-function. Then, the neural network $Q_{\vartheta}$ is updated through a one-step gradient descent employing $L_{\mathrm{MSBE}}$ defined in Equation (1). Subsequently, from the current $Q_{\vartheta}$, a gradient ascent step is computed to update the policy $\pi_{\phi}$. This is done by differentiating $Q_{\vartheta}$ with respect to the parameters $\phi$ of $\pi_{\phi}$ while considering the parameters $\vartheta$ constant, leading to the following equation:

$$\nabla_{\phi} \frac{1}{d} \sum_{i=1}^{d} Q_{\vartheta}\!\left(s_i, \pi_{\phi}(s_i)\right) \qquad (3)$$

Finally, the parameters $\vartheta'$ and $\phi'$ of the target neural networks $Q_{\vartheta'}$ and $\pi_{\phi'}$ are updated by Polyak averaging with the parameter $\tau$ as:

$$\vartheta' \leftarrow \tau \vartheta + (1 - \tau)\,\vartheta', \qquad \phi' \leftarrow \tau \phi + (1 - \tau)\,\phi' \qquad (4)$$

This procedure is repeated until the optimal policy is found.
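The three scalar building blocks of the update loop, namely the target value of Equation (2), the MSBE of Equation (1), and the Polyak averaging of the target parameters, can be sketched in plain Python/NumPy. This is a minimal illustration of the formulas, not the paper's training code; the function signatures and the default values of `gamma` and `tau` are assumptions for demonstration only:

```python
import numpy as np


def td_target(r, q_next, t, t_end, gamma=0.99):
    """Target value y_i (Equation (2)): bootstrap with the lagged
    critic value q_next unless the episode has terminated."""
    done = t >= t_end
    return r + gamma * q_next * (1.0 - done)


def mse_bellman_error(q_pred, y):
    """Mean Squared Bellman Error (Equation (1)) over a minibatch."""
    q_pred, y = np.asarray(q_pred), np.asarray(y)
    return float(np.mean((q_pred - y) ** 2))


def polyak_update(target_params, params, tau=0.005):
    """Polyak averaging (Equation (4)): theta' <- tau*theta + (1-tau)*theta'
    applied elementwise to each parameter array."""
    return [tau * p + (1.0 - tau) * tp for tp, p in zip(target_params, params)]
```

In a full DDPG loop these pieces would sit inside the gradient steps of the critic and actor; here they only make the arithmetic of Equations (1), (2), and (4) concrete.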

Dynamic Model Description and Numerical Experiments' Setup
Recently, a plethora of benchmark problems has been used to evaluate the performance of RL algorithms, for instance, the Arcade Learning Environment [63,64] for discrete action space problems or OpenAI Gym [65], which considers benchmark control problems. Among the latter, there is the cart-pole swing-up [66], an underactuated, multivariable, inherently unstable, and strongly coupled nonlinear problem, which has been studied in the past by several authors [67]. When it comes to RL, the moderate level of complexity of the cart-pole system turns out to be advantageous due to its potential generalization to different domains, in contrast with other RL benchmarks such as the locomotion task [68,69], which requires highly specific algorithmic adaptations [70]. In this paper, the said system is chosen as a case study because its topological simplicity allows for performing a detailed study of the sensitivity of the developed controller in the presence of dry friction forces. Figure 2 shows the scheme of the system, consisting of two rigid bodies: the cart with mass $m_c$, which moves along the X axis, and the pole with mass $m_p$, centroidal moment of inertia $I_{zz,p}$ about the Z axis, and center of mass positioned at a distance $L$ from the revolute joint that allows the pendulum to rotate freely. The Lagrangian coordinates of the system are set as $\mathbf{q} = \begin{bmatrix} x & \theta \end{bmatrix}^T$, and the corresponding equations of motion can be written as follows:

$$\mathbf{M}\,\ddot{\mathbf{q}} = \mathbf{Q}_b \qquad (5)$$

where $\ddot{\mathbf{q}}$ is the second time derivative of $\mathbf{q}$, whereas $\mathbf{M}$ and $\mathbf{Q}_b$ are respectively equal to:

$$\mathbf{M} = \begin{bmatrix} m_c + m_p & m_p L \cos\theta \\ m_p L \cos\theta & I_{zz,p} + m_p L^2 \end{bmatrix} \qquad (6)$$

$$\mathbf{Q}_b = \begin{bmatrix} F_c + F_f + m_p L \dot{\theta}^2 \sin\theta \\ m_p g L \sin\theta \end{bmatrix} \qquad (7)$$

where $F_c$ is the horizontal force the controller applies on the cart with the goal of swinging up the pole and $F_f$ is the friction force. The modeling of dry friction plays a fundamental role in the simulation of mechanical systems. Considering this phenomenon in the early stages of engineering design is crucial to reduce the gap between numerical simulation and experimental results [71].
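The equations of motion above can be integrated numerically by solving the linear system $\mathbf{M}\,\ddot{\mathbf{q}} = \mathbf{Q}_b$ for the accelerations at each time step. The sketch below does this with SciPy for the standard cart-pole form, with $\theta$ measured from the upright position; the numerical parameter values and the slender-rod inertia are illustrative assumptions, not the values used in the paper's Table 2:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative parameters (assumed for this sketch, not the paper's values)
m_c, m_p, L, g = 1.0, 0.1, 0.5, 9.81
I_zz = m_p * L**2 / 3.0  # centroidal inertia under a slender-rod assumption


def cartpole_rhs(t, y, F_c=0.0, F_f=0.0):
    """Right-hand side of the state equations obtained from M(q) q'' = Q_b,
    with state y = [x, theta, x_dot, theta_dot] and theta from upright."""
    x, th, xd, thd = y
    M = np.array([[m_c + m_p,            m_p * L * np.cos(th)],
                  [m_p * L * np.cos(th), I_zz + m_p * L**2   ]])
    Q_b = np.array([F_c + F_f + m_p * L * thd**2 * np.sin(th),
                    m_p * g * L * np.sin(th)])
    xdd, thdd = np.linalg.solve(M, Q_b)  # accelerations q''
    return [xd, thd, xdd, thdd]


# Uncontrolled response from slightly off-upright: the pole falls away
sol = solve_ivp(cartpole_rhs, (0.0, 1.0), [0.0, 0.1, 0.0, 0.0], rtol=1e-8)
```

During training, $F_c$ would be supplied by the policy at each control step and $F_f$ by the friction model, instead of the zero defaults used here.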
Multiple approaches to model dry friction exist in the literature, yet none of them has general validity [72]. The classical Coulomb dry friction model is widespread in engineering applications. The model considers that the friction force opposes the motion and that its magnitude is independent of the contact area and velocity. Therefore, the dry friction force can be computed as:

$$F_f = -\mu F_N \,\mathrm{sgn}(\dot{x}) \qquad (8)$$

where $\mu$ is the dry friction coefficient, $F_N$ is the instant normal force produced by the contact between the two interacting surfaces, and $\dot{x}$ is the relative velocity between them, whereas $\mathrm{sgn}$ is the sign function defined as:

$$\mathrm{sgn}(\dot{x}) = \begin{cases} 1, & \dot{x} > 0 \\ 0, & \dot{x} = 0 \\ -1, & \dot{x} < 0 \end{cases} \qquad (9)$$

Despite the apparent straightforwardness of the Coulomb friction model, its implementation in the simulation of dynamical systems turns out to be challenging for numerical integration [71]. This is due to the friction force discontinuity at zero velocity, producing stiff equations of motion and the nonuniqueness and nonexistence of the solution for the accelerations [72]. As a solution, the so-called smooth Coulomb friction model has been proposed [73]. It considers a smoothing function to remove the discontinuity. Multiple smoothing functions, such as linear, exponential, or trigonometric functions, have been employed in the literature. However, the smooth Coulomb friction model does not fully capture the physical phenomenon. This is because, at zero velocity, the friction force is null. Therefore, it cannot reproduce stiction, the pre-sliding effect, or the Stribeck effect. Although there is a partial loss of information, nonlinear viscous friction laws such as the smooth Coulomb model are widespread in engineering applications devoted to implementing a quick but realistic analysis of articulated mechanical systems. More complex and accurate friction models can exhibit these effects, but require multiple additional physical parameters.
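The difference between the discontinuous Coulomb law and its smoothed counterpart is easy to see in code. In this sketch the hyperbolic tangent replaces the sign function; the steepness constant `k` is an assumed smoothing parameter introduced for illustration, not a value taken from the paper:

```python
import numpy as np


def coulomb_friction(x_dot, mu, F_N):
    """Classical Coulomb model: discontinuous at zero velocity."""
    return -mu * F_N * np.sign(x_dot)


def smooth_coulomb_friction(x_dot, mu, F_N, k=100.0):
    """Smooth Coulomb model: tanh replaces sgn, removing the discontinuity.
    k (assumed here) controls how sharply the force transitions near zero."""
    return -mu * F_N * np.tanh(k * x_dot)
```

Both laws agree away from zero velocity, but the smoothed version returns a force that varies continuously through $\dot{x} = 0$, which is what keeps the equations of motion tractable for standard ODE integrators.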
In this work, given that the main interest is to perform a sensitivity analysis of the proposed control system, a smooth Coulomb friction model is considered. Therefore, the dry friction force is given by:

$$F_f = -\mu N \tanh(k\dot{x}) \qquad (10)$$

where $\mu$ is the dry friction coefficient, $N$ is the instant normal force produced by the contact between the cart and the road, whereas the hyperbolic tangent, with smoothing constant $k$, is employed as an artificial function of the cart velocity necessary during the numerical simulations for smoothing the dry friction force described by the Coulomb law of friction. Figure 3 shows the Simulink implementation of the RL workflow, where the vector of observations $\mathbf{s}$ is defined as:

$$\mathbf{s} = \begin{bmatrix} x & \dot{x} & s_\theta & c_\theta & \dot{\theta} \end{bmatrix}^T \qquad (11)$$

where the abbreviations $\sin(\theta) = s_\theta$ and $\cos(\theta) = c_\theta$ are used. The instant reward function for the environment is defined in Equation (12) in terms of two components $r_1$ and $r_2$, given by Equations (13) and (14), where $x_{\mathrm{lim}}$ is the maximum displacement allowed for the cart during the swing-up task and the constant parameters are respectively set as $A_r = 10^{-2}$, $B_r = 10^{-1}$, $C_r = 5$, $D_r = -10^{-2}$, $E_r = -10^{2}$, and $n = 2$. The shape of the reward function allows for penalizing the magnitudes of $x$, $\theta$, and $F_c$. Therefore, the optimal policy is the one in which the system manages to perform the swing-up with the lowest force possible, and the cart is at the initial position at the end of the task. On the other hand, the actor $\pi_{\phi}$ and critic $Q_{\vartheta}$ neural networks are multilayer perceptrons with the architectures respectively shown in (a) and (b) of Figure 4. The training process of the agent is carried out employing the DDPG algorithm under the following restrictions:

$$|F_c| \leq F_{\mathrm{max}}, \qquad |x| \leq x_{\mathrm{lim}}, \qquad t \leq t_f \qquad (15)$$

where $F_{\mathrm{max}}$ is the maximum force allowed to the controller, set to 20 (N), $x_{\mathrm{lim}}$ is the maximum lateral displacement of the cart, equal to 1 (m), and $t_f$ is the maximum duration of the task, set to 25 (s). Then, the sensitivity analysis of the controller consists of evaluating its performance under modified environments featuring variations of the physical properties of the system.
Finally, to expand the optimal behavior space for the agent, a post-training of its structural parameters is performed in a modified environment.

Numerical Results and Discussion
In this section, the results obtained from the parametric analysis are reported. In total, thirteen different cases of modified environments are considered. The remainder of the section is organized as follows. Section 4.1 shows the results obtained for the parametric study where the physical properties of the system, namely the masses and geometry, are modified, and no presence of dry friction is considered. Section 4.2 reports the results regarding the effect of the dry friction presence and its corresponding variation on the agent, as well as the results arising from the post-training method. Section 4.3 provides some general final remarks and a concise discussion of the numerical results found.

Physical Properties' Parameter Study
The hyperparameters for the neural network learning and the DDPG algorithm employed for the training of the agent are listed in Table 1, and the corresponding learning curves are shown in (a) and (b) of Figure 5. The optimal policy is found at Episode 435 considering the criterion $r_{\mathrm{avg}} > -80$, where $r_{\mathrm{avg}}$ is the average cumulative reward over a window of five episodes. Table 2 reports the properties of the environments considered, where no dry friction is present. For the purposes of the present work, the swing-up time is defined according to the following formula:

$$T_{su} = \begin{cases} t, & \text{if } |\theta| < A_l \text{ and } |\theta| \geq B_l \text{ and } |\dot{\theta}| \leq C_l \text{ and } |\dot{x}| \leq D_l \\ 0, & \text{otherwise} \end{cases} \qquad (16)$$

where $t$ is the time variable, $T_{su}$ is the swing-up time, $x$ is the cart linear displacement, $\theta$ is the pendulum angular displacement, $\dot{x}$ is the cart linear velocity, $\dot{\theta}$ is the pendulum angular velocity, while $A_l = 5$ (deg), $B_l = 0$ (deg), $C_l = 0.1$ (rad/s), and $D_l = 0.5$ (m/s) are limit constants.
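The swing-up time criterion of Equation (16) can be expressed as a small predicate: the pole must sit inside the angular band near upright while both velocities are below their limits. The sketch below is an illustrative reading of that criterion (the function name and signature are assumptions for demonstration):

```python
import numpy as np

# Limit constants from Equation (16)
A_L = np.deg2rad(5.0)   # upper bound on |theta| (rad)
B_L = 0.0               # lower bound on |theta| (rad)
C_L = 0.1               # max pendulum angular rate (rad/s)
D_L = 0.5               # max cart speed (m/s)


def swing_up_time(t, x_dot, theta, theta_dot):
    """Return T_su = t when the pole is held near upright with small
    residual velocities; otherwise return 0."""
    near_upright = B_L <= abs(theta) < A_L
    settled = abs(theta_dot) <= C_L and abs(x_dot) <= D_L
    return t if (near_upright and settled) else 0.0
```

In practice this check would be evaluated along the simulated trajectory, and the first time instant at which it holds is reported as the swing-up time.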
The training is carried out in the environment of Case 1, producing the agent A. The remaining cases report the performance of agent A in each defined environment. The magnitudes for the physical properties are the extreme values for which agent A manages to perform the swing-up under the conditions mentioned in Equation (15). In particular, Cases 2 to 7 report individual variations of the three physical parameters m c , m p , and L where, based on the cumulative reward obtained during the task, it is found that the increase in m c has the greatest impact on the agent performance. An interesting effect is observed in Cases 4 and 5, where the respective 90% increase and decrease in m p yield close cumulative rewards. Figures 6 and 7 show the state response of agent A in Cases 1, 4, and 5.  It can be seen that, even though the cumulative rewards reported in Table 2 for Cases 4 and 5 are close, in general, the best output takes place in Case 4, which performs a swing-up in a shorter time without the presence of full rotations of the pole, in contrast with Case 5. Therefore, the pendulum mass m p is the parameter with the least impact on the agent performance. Cases 8 and 9 consider joint equal changes of the three physical properties, a decrease of 35% and an increase of 25%, respectively. Figures 8 and 9 show the response of agent A in Cases 1, 8, and 9, where even though the latter is the more inertial one, its response yields a better reward than Case 8, as shown in Figure 10.  This is due to the oscillating pattern of F c in Case 8 shown in (a) of Figure 10. Therefore, the agent performance is more susceptible to joint equal inertial decreases.

Dry Friction Parameter Study
A subsequent analysis considers the effect of the presence of dry friction on the performance of agent A. Case 10 of Table 3 proves the impact of this phenomenon to be high since the cumulative reward drops sharply, and the system has an unstable response. In Case 10 of Table 3, the swing-up has not been achieved; therefore, no swing-up time is reported. Apart from Figure 11, this is also shown in Figures 12 and 13, where the swing-up is performed only for short periods.  Additional training is performed on agent A in an environment with the properties of Case 11 in Table 3, and the corresponding learning curves are shown in (a) and (b) of Figure 11. The optimal policy is found after 39 training episodes, and the resulting agent is called agent B, featuring optimal performance with or without the presence of dry friction, as well as with high increments of the dry friction coefficient, as shown in Cases 12 and 13 of Table 3.

Discussion and Final Remarks
The starting point of this work is the formulation of a multibody model of the cart-pole system that can be easily manipulated by controlling an external force applied to the cart. This dynamic model is sufficiently complex to exhibit an interesting nonlinear behavior during the swing-up task and, at the same time, sufficiently simple to be used for a parametric study of the robustness and/or sensitivity of a nonlinear control law devised by using a deep RL method. Furthermore, the presence of dry friction, which is often erroneously neglected in robotic applications, leading to unwanted dynamic phenomena, is taken into account in this investigation. As expected, the presence of dry friction has a great impact on the performance of the controller devised by employing the deep RL approach. Implementing the modified environment and performing a post-training procedure on the agent resulting from the original design requires only a small set of thirty-nine additional episodes to generate the optimal control policy for environments with or without dry friction. For instance, agent B can indeed perform the swing-up in an environment with and without dry friction, as shown in Case 12 of Table 3. Thereby, this approach provides a promising path for further developments.
Based on the impact of the modified environments on the performance of the agents A and B devised in this work, it is concluded that the RL-based control system is robust. In particular, the maximum parameter change under which the task is still performed correctly is up to 90% in the pendulum mass $m_p$ alone, 94% in the cart mass $m_c$ alone, and up to 25% for the two parameters together. Additionally, as shown in the manuscript, the proposed post-training of the agent, carried out in an environment with dry friction, required only thirty-nine episodes to find the optimal policy, the resulting agent being robust enough to work optimally in the presence of dry friction with its coefficient increased by up to 50% or equal to zero. Therefore, the numerical experiments performed in this investigation prove the transfer learning approach to be promising for the reality-gap problem found in robotics. Nonlinear model-based control systems, such as [41][42][43][44], to name a few, do not consider dry friction since the complexity of the resulting model is restrictive. In contrast, a model-free approach like the one in this work, based on deep RL, allows us to consider such a phenomenon. Additionally, given the generalization capability of neural networks, the robustness of the system to uncertainty in the physical parameters is high, as shown by the numerical results of the sensitivity analysis carried out. This, to the best of the authors' knowledge, has not been previously done in the literature. Furthermore, said robustness can be increased with the post-training proposed and numerically tested in the present work. The developed control system has limitations related to the simplifications used in the dry friction model present in the environment during training. This is because physical effects such as stiction, pre-sliding, and the Stribeck effect cannot be reproduced using the smooth Coulomb friction model.
Additionally, the agent capacity can be extended by running training episodes in a subsequent session in an experimental setting. Said implementation is feasible given the sampling time used during training and the performance required by an actuator, both within commercial hardware margins. This is also demonstrated by the presence in the literature of several alternative investigations on the practical methods for controlling the cart-pole system that were also tested experimentally using a dedicated test rig.
The friction model considered was used to exemplify the effect of a modified environment. Furthermore, the dry friction model employed was chosen for the sake of simplicity and clarity, to avoid the ambiguity in the non-uniqueness of the resulting acceleration arising from the discontinuity of the friction force at zero velocity that is typical of the Coulomb model, and also because other, more advanced friction models require multiple parameters for their correct definition and functioning [71]. The sensitivity analysis performed showed a wide parameter space in which the controller could perform the required task. For a physical system of higher complexity compared to the model considered in the paper, featuring multiple physical parameters and phenomena, a detailed statistical approach would be required, such as a two-level factorial analysis, which is clearly outside the scope of the present work.

Summary and Conclusions
The authors' research focuses on designing, modeling, and controlling mechanical systems subjected to complex force fields [74,75]. Therefore, the mutual interactions between multibody system dynamics, applied system identification, and nonlinear control fall within the scope of the authors' research domain. In this work, on the other hand, the sensitivity analysis of a deep Reinforcement Learning (RL) controller applied to the cart-pole swing-up problem is performed. Through extensive numerical experiments, the effectiveness of the proposed controller is analyzed in the case of the presence of friction forces. Subsequent works will focus on studying environments characterized by randomized parameters during the training procedure to further improve the robustness of the resulting control system.
Future research works will be devoted to further refining the numerical procedure for developing the control algorithm proposed in this paper, so that it can be readily applied to more complex dynamical systems, such as articulated mechanical systems modeled within the multibody framework that have an underactuated structure. For instance, an immediate extension of this work, which is focused on the swing-up problem of a cart-pole system, is the development of a robust controller similarly based on the use of the deep RL approach for the Furuta pendulum, as well as for an inverted pendulum controlled through a gyroscope mounted on a gimbal. In fact, the benchmark problems mentioned before are closer to practical systems employed in engineering applications, and the development of a robust controller for such systems seems quite promising.
In summary, the current research work focuses on the computational aspects of the swing-up problem of the cart-pole system. The paper aims to develop a nonlinear controller based on reinforcement learning techniques simulated in a virtual environment. Therefore, the experimental verification of the proposed controller using a test rig is outside the scope of the present work. However, this issue will be addressed in future investigations specifically dedicated to developing experimental results supporting the theoretical controller obtained using the proposed approach. Additionally, an in-depth analysis of the sensitivity and robustness of the DRL-based control system and its experimental implementation, using advanced dry friction models capable of exhibiting stiction, pre-sliding, and the Stribeck effect, are topics that will be considered as future extensions of the present work.
Author Contributions: This research paper was principally developed by the first author (C.A.M.E.) and by the second author (C.M.P.). The detailed review carried out by the third author (D.G.) considerably improved the quality of the work. The manuscript was written through the contribution of all authors. All authors read and agreed to the published version of the manuscript.