Abstract: With the maturing of artificial intelligence technology, the application of intelligent control algorithms in control systems has become a trend for meeting the high-performance requirements of modern society. This paper proposes a deep deterministic policy gradient (DDPG) controller design method based on deep reinforcement learning to improve system control performance. Firstly, the optimal control policy of the DDPG algorithm is derived from the Markov decision process and the Actor-Critic algorithm. Secondly, to avoid local optima in traditional control systems, the capacity and the settlement method of the DDPG experience pool are adjusted so that positive experience is absorbed, which accelerates convergence and makes training efficient. In addition, to address the overestimation of the Q value in DDPG, the overall structure of the Critic network is changed, which shortens the convergence period of DDPG at low learning rates. Finally, a first-order inverted pendulum control system is constructed in a simulation environment to verify the control effectiveness of PID, DDPG, and improved DDPG. The simulation results show that, with the improved DDPG controller, the first-order inverted pendulum responds faster to disturbances and exhibits smaller displacement and angular displacement. The simulation further shows that the improved DDPG algorithm has better stability and convergence, stronger anti-interference ability, and better stability recovery. This control method provides a reference for the application of reinforcement learning in traditional control systems.


Introduction
Modern society places ever greater demands on the dynamic performance of control systems. Traditional control methods such as PID [1] and fuzzy control [2] cannot achieve satisfactory results in terms of anti-interference ability, dynamic performance, and nonlinear adjustment accuracy, especially in continuously moving spaces, where fast convergence is not possible [3]. Furthermore, traditional control methods require precise mathematical modeling as a prerequisite for formulating control strategies. In practical control systems, and especially in first-order inverted pendulums, the many influencing factors make the actual system mostly nonlinear and strongly coupled, so it is very difficult to establish a model that accurately describes the system's motion. The model used in traditional control methods is a mathematical model linearized near the system's equilibrium point, which leads to poor robustness of the control algorithm. When the system state is far from the equilibrium point, the control algorithm becomes much less effective or fails altogether.
For these reasons, deep reinforcement learning is introduced into nonlinear system control. Deep reinforcement learning is one of the major branches of artificial intelligence, and it has made significant progress in recent years. DDPG, proposed by Lillicrap et al. in 2015, applies a deterministic policy gradient together with a replay buffer to formulate the target policy. It combines the deep Q-network (DQN) [4] and the deterministic policy gradient (DPG) [5] to address the convergence problems of neural networks and the slow algorithm updates in continuous action spaces.
At the algorithm level, to improve the training efficiency of DDPG, avoid local optima, and accelerate convergence, Li et al. proposed that, in off-policy reinforcement learning algorithms, an experience pool be used to store the experience sample data. Combining the buffer pool with a suitable sampling strategy can accelerate training [6]. Schaul et al. proposed a prioritized experience replay mechanism for the buffer pool and applied it in DQN. This method increases the probability that important experiences are replayed, thereby making learning more efficient, and it achieved good results in some experiments [7]. Zhang Jianxing et al. proposed a deep deterministic policy gradient based on episodic experience replay (EEP-DDPG), which saves sample data in units of episodes and classifies them using two buffer pools. During the training phase, it preferentially samples data with higher cumulative returns to improve training efficiency [8]. The improvements to the buffer pool described above all focus on using more experience data with higher return values to improve the efficiency of experience replay. This paper instead starts from the environmental state of each experience sample, designs a recursive small experience pool structure, and optimizes the settlement method.
In order to find the optimal Q value, Doltsinis et al. and Wang et al. proposed a Q-learning algorithm and a SARSA algorithm, respectively [9]. However, when the state and action dimensions increase, the size of the Q table grows geometrically, resulting in the "curse of dimensionality" [10]. In 2015, Mnih et al. used deep neural networks to fit the Q table and proposed the DQN algorithm. However, using a greedy policy to select the Q value leads to an overestimation of the Q value, resulting in a significant deviation [11]. Therefore, Hasselt et al. proposed the DDQN (Double Deep Q-Network) to decouple action selection from Q-value calculation. The main network selects the action with the highest corresponding Q value, and this action is then used to calculate the Q value of the target network, which mitigates the overestimation of the Q value [12]. To improve the performance of the DQN algorithm, Wang et al. proposed the Dueling DQN algorithm. Unlike the DQN algorithm, this algorithm divides the Q network into a value function and an advantage function. The final output is a linear combination of the value function and the advantage function. This prevents excessive Q values and improves the expressive ability of the state-action value function [13].
In the field of motion control, Zhang et al. used the DQN in 2015 to train a robot controller directly from raw pixel images without prior knowledge [14]. In 2017, Gu et al. used the DQN algorithm without prior knowledge to train robots' spatial manipulation ability and complete complex limb movements such as opening doors [15]. In 2017, Sallab et al. used the DDPG algorithm to test the automated driving of 3D cars in the TORCS simulator. The results showed that autonomous vehicles can learn complex road conditions and interact with other vehicles better than non-autonomous vehicles [16]. In 2020, Liu et al. conducted balance control experiments on a bipedal robot using a DDPG algorithm. After multiple rounds of training, the posture angle of the bipedal robot was kept within [−4°, 4°], and its stability was significantly improved [17]. In 2022, Liu et al. applied DDPG in robots to improve target tracking and obstacle-avoidance control. The results show that DDPG performs better than PID [18]. In 2021, Xue et al. proposed a fractional gradient-descent RBF neural network applied in DDPG to control the inverted pendulum. Depending on the fractional order, the RBF network used differentiation or integration to accelerate the convergence of the gradient descent method and improve its control performance [19]. In 2022, Wang et al. adopted the Swish function and a baseline function as activation functions to improve the training efficiency of DDPG, which was verified in an inverted pendulum system [20].
The DDPG algorithm has good knowledge transfer ability and can learn the mathematical model of a system, treated as a black box or gray box, from its input and output. Based on the reward function, observation function, and constraint conditions, it can achieve optimal control of the control system [21]. However, DDPG updates the Q value in the same way as DQN, which can easily lead to the overestimation of the Q value for certain control actions. This increases the bias and leads to suboptimal strategies. In addition, DDPG is very sensitive to the setting of hyperparameters, which directly affects its convergence stability.
With the development and maturity of artificial intelligence technology, the application of intelligent control algorithms in control systems has become a trend. This paper applies the deep deterministic policy gradient algorithm to control and improves it. Firstly, based on the principle of reinforcement learning, the DDPG algorithm is developed to give the system strong self-learning and self-tuning capabilities. Then, by optimizing the experience pool and the Critic network structure, the problems of local optima, long convergence periods, and overestimation of the Q value in DDPG under low learning rates are addressed. By learning to update the control policy, the dynamic performance, anti-interference ability, and response speed under load disturbances are improved. Finally, the classic PID, DDPG, and improved DDPG algorithms are compared on the classic first-order inverted pendulum, which verifies the better dynamic performance of the improved DDPG controller.

Construction of a Control System Based on DDPG
Reinforcement learning enables agents to learn through interaction with the environment in order to complete specific tasks and maximize reward values [22]. The essence of reinforcement learning can be formalized as a Markov decision process, from which value-based and policy-based optimization strategies are derived; the two are complementary and are combined to form the Actor-Critic algorithm. Adding the experience replay and target networks of the DQN algorithm yields the basic framework of the DDPG algorithm. For control systems, the key to introducing the DDPG algorithm is to construct a suitable neural network structure based on the feedback that the system obtains from the environment and to determine the learning rates of the Critic and Actor networks based on the output dimension of the neural network. At the same time, the form of the reward function is chosen according to the order of magnitude and the number of feedback quantities. Finally, the training hyperparameters are set so that training converges reasonably and avoids local optima and suboptimal solutions. The overall control flowchart of DDPG is shown in Figure 1.
Appl. Sci. 2023, 13, 7594

Markov Decision Process
The decision-making process of the neural network constructed by DDPG can be represented as a Markov decision process (MDP), which is a theoretical framework for achieving the ultimate goal through interactive learning. It can be described by the four-tuple (S, A, R, P). Figure 2 shows the interaction between the agent and the environment during the decision-making process.

During the interaction process, the intelligent agent generates an action $A_t$ based on the current state $S_t$, obtains the next state $S_{t+1}$ from the environment, and then obtains the reward $R_{t+1}$ for the action it output. The ultimate goal of the intelligent agent is to maximize the cumulative reward over long-term operation, as given in Equation (1).
The value of taking action $a$ in state $s$ under the policy $\pi$ is called the action value function $q_\pi(s, a)$ of the policy $\pi$. The recursive relation between the state values at two time points is derived in Equation (2). The corresponding relation for the action value function can be obtained in the same way, as shown in Equation (3).
When solving problems through reinforcement learning, the key is to find an optimal policy $\pi$ that maximizes long-term benefits. Equation (4) shows that the search for the optimal policy $\pi$ can be based on the state value function or the action policy function. Because of these different criteria, both value-based and policy-based reinforcement learning algorithms have emerged.
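The relations referred to as Equations (1)-(4) can be written in a standard textbook form. The sketch below assumes a discount factor $\gamma \in [0, 1]$, which the surrounding text does not define, and writes $G_t$ for the cumulative reward and $\pi^*$ for the optimal policy; the notation of the original equations may differ.

```latex
\begin{align}
G_t &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots
     = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \\
v_\pi(s) &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right] \\
q_\pi(s, a) &= \mathbb{E}_\pi\left[ R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s,\ A_t = a \right] \\
\pi^* &= \arg\max_\pi v_\pi(s) = \arg\max_\pi q_\pi(s, a)
\end{align}
```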

Actor-Critic Algorithm
In DDPG, the Critic network mainly evaluates the quality of actions based on the state value, while the Actor network determines the next action based on the action policy. Firstly, an optimization policy based on the state value is introduced. Its core is a Q table of the states and actions in the environment. Each entry in the table is the value $Q(s, a)$ of the action value function for taking an action from the current state. An action is selected from the table and then executed. The action is selected by the greedy algorithm, as shown in Equation (5), where $m$ is the total number of selectable actions. After the action is executed, a reward value is obtained and $Q(s, a)$ is updated as shown in Equation (6), where $\alpha$ is the learning rate; the values in the Q table are then updated accordingly.
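The table-based procedure described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the standard Q-learning update with a discount factor `gamma`; the function names, the dictionary-based table, and the default values of `alpha` and `gamma` are assumptions made for the example and do not come from the paper.

```python
def greedy_action(q_table, state, n_actions):
    """Return the action with the largest Q(s, a) among the m = n_actions choices."""
    return max(range(n_actions), key=lambda a: q_table.get((state, a), 0.0))


def q_update(q_table, state, action, reward, next_state, n_actions,
             alpha=0.1, gamma=0.9):
    """One Q-learning step: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table.get((next_state, a), 0.0) for a in range(n_actions))
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q_table[(state, action)]


# Toy usage: one update on an empty table with two actions per state.
q = {}
q_update(q, state=0, action=1, reward=1.0, next_state=1, n_actions=2)
best = greedy_action(q, state=0, n_actions=2)
```

Storing the table as a dictionary keyed by `(state, action)` keeps the sketch short, but it also makes the scaling problem visible: the number of entries grows with the product of the state and action counts.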
In practice, there are often so many state vectors and actions that the table becomes very large and suffers from the curse of dimensionality. Therefore, on the basis of Q-learning, a neural network was added to approximate the value function, forming the deep Q-network. At the same time, the experience replay mechanism [23] and a target network were added to improve the convergence of the network. The DQN contains a current network and a target network, which have the same structure. The current network selects actions and updates the network parameter $\theta$, as shown in Equation (7). The target network calculates the Q target value as $Q(s', a'; \theta_i^{-})$, and the network is updated by minimizing the loss function shown in Equation (8), where $\rho(\cdot)$ is the probability distribution over states $s$ and actions $a$.
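The update described here corresponds to the usual DQN objective. As a sketch, assuming a discount factor $\gamma$ and writing $\theta_i^{-}$ for the target-network parameters and $y_i$ for the target value (symbols chosen here for illustration), the target and the loss can be written as:

```latex
\begin{align}
y_i &= r + \gamma \max_{a'} Q(s', a'; \theta_i^{-}) \\
L_i(\theta_i) &= \mathbb{E}_{(s, a) \sim \rho(\cdot)}
  \left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]
\end{align}
```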
Figure 3 shows the training process of the DQN algorithm.
From the above analysis, it can be seen that the DQN family of algorithms is suited to discrete action problems and outputs a deterministic policy. However, many practical problems involve continuous actions and call for stochastic rather than deterministic policies. Therefore, scholars have proposed reinforcement learning algorithms based on the action policy, i.e., policy gradient algorithms, which are the algorithms used in Actor networks. By outputting the probability of each action occurring, one or more optimal actions are selected. Assume that $\tau$ is an exploration trajectory of the intelligent agent and that $\pi(a \mid s, \theta)$ is a stochastic policy. The policy gradient can then be expressed as Equation (9). After this equation is solved, the parameters are updated by gradient ascent, as shown in Equation (10).
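A common form of the policy gradient and its ascent step is sketched below. It assumes that $R(\tau)$ denotes the return of trajectory $\tau$, $J(\theta)$ the expected return, and $\alpha$ a learning rate; these symbols are introduced here for illustration and may not match the paper's own notation.

```latex
\begin{align}
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}
  \left[ \sum_{t} \nabla_\theta \log \pi(a_t \mid s_t, \theta)\, R(\tau) \right] \\
\theta &\leftarrow \theta + \alpha \nabla_\theta J(\theta)
\end{align}
```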
By inputting the state into the policy network and outputting the probability of each action, the action can be selected within a continuous interval. However, the policy is updated only once per episode, which decreases the efficiency of learning. Value-based algorithms and policy-based algorithms each have their own advantages and disadvantages, so researchers combined the two to form the Actor-Critic algorithm. The Actor is a policy network responsible for interacting with the environment: it takes states as input and outputs the actions executed by the agent. The Critic is a value network: it takes states and actions as input and outputs evaluations of the actions generated by the policy network. It evaluates the current quality of the policy network's actions, and the policy network is then optimized to produce better actions. The flowchart of the Actor-Critic algorithm is shown in Figure 4.

The Actor-Critic network is updated based on the temporal-difference (TD) method [24]. According to the Bellman equation, the state values at two consecutive time steps follow a recursive relationship. The TD error can be expressed as Equation (11).
The Critic network is updated by minimizing the TD error shown in Equation (11). The update method for the Actor network, derived from Equations (10) and (11), is given in Equation (12).
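The following sketch shows one way the TD error can drive both updates. It assumes a tabular critic, a softmax actor over action preferences, and a discount factor `gamma`; none of these choices comes from the paper, which uses neural networks for both roles, and the function names and learning rates are hypothetical.

```python
import math


def td_error(reward, value_s, value_next, gamma=0.99, done=False):
    """delta = r + gamma * V(s') - V(s); the bootstrap term is dropped at episode end."""
    target = reward if done else reward + gamma * value_next
    return target - value_s


def softmax(prefs):
    """Numerically stable softmax over a list of action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(values, prefs, s, a, r, s_next,
                      gamma=0.99, lr_critic=0.1, lr_actor=0.01, done=False):
    """Update a tabular critic V and softmax actor preferences in place."""
    delta = td_error(r, values[s], values[s_next], gamma, done)
    values[s] += lr_critic * delta            # critic: move V(s) toward the TD target
    probs = softmax(prefs[s])
    for b in range(len(prefs[s])):            # actor: gradient of log pi(a|s) for softmax
        grad = (1.0 if b == a else 0.0) - probs[b]
        prefs[s][b] += lr_actor * delta * grad
    return delta


# Toy usage: two states, two actions, one rewarded transition.
values = [0.0, 0.0]
prefs = [[0.0, 0.0], [0.0, 0.0]]
delta = actor_critic_step(values, prefs, s=0, a=1, r=1.0, s_next=1)
```

A positive TD error raises the preference for the action that was taken and lowers the others, which is the sense in which the Critic guides the Actor.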

Recursive Small Experience Pool DDPG Algorithm
In the Actor-Critic algorithm, the output actions of the Actor network depend on the guidance of the Critic network. However, because each network is inherently difficult to converge and easily falls into local optima, combining the two further increases the difficulty of convergence [25]. Therefore, to solve these problems in the Actor-Critic algorithm, some scholars have integrated the solutions used in the DQN algorithm to form the deep deterministic policy gradient algorithm. The DDPG algorithm includes a Critic network based on value updates, an Actor network based on policy updates, and an experience pool for storing samples. The Critic network is divided into the current network $C_e$ and the target network $C_t$, while the Actor network is divided into the current network $A_e$ and the target network $A_t$. The samples in the experience pool are generated by $A_e$, which generates an action from the current state $S$, as shown in Equation (13). Here, $\mu$ represents the current policy and $\theta^{\mu}$ represents the parameters of the current network $A_e$. After this action is executed, the reward value $r$ and the next state $S_{t+1}$ are generated, and this set of values is stored in the experience pool. The current network $C_e$ calculates the actual Q value $Q(S_i, a_i \mid \theta^{Q})$, and the target network $C_t$ calculates the target Q value $y_i$ as shown in Equation (14), where $\theta^{Q'}$ is the network parameter of the $C_t$ network and $\theta^{\mu'}$ is the parameter of the $A_t$ network. Then, the $C_e$ network and the $A_e$ network are updated by minimizing their loss functions. The loss function of the $C_e$ network is defined in Equation (15), and the loss function of the $A_e$ network is shown in Equation (16).
After $\theta^{Q}$ and $\theta^{\mu}$ are updated, $\theta^{Q'}$ and $\theta^{\mu'}$ are updated through soft updates. Unlike the DQN algorithm, which directly copies the parameters of the current network to the target network, this paper adds an update coefficient $\tau$ to the DDPG algorithm, which is determined by the size of the experience pool and the amount of effective experience. In short, the experience pool is divided into $n$ small experience pools, an evaluation function is introduced in advance to evaluate each small experience pool, the effective experience is stored in a new small experience pool, and all the effective experience is finally extracted for network updates. Although $\tau$ is small, it keeps each parameter update close to its previous value, which mitigates local optima and suboptimal solutions, reduces training time, and improves control effectiveness. The updating method is shown in Equation (17).
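The sketch below illustrates the two mechanisms in this paragraph: a soft update of the target parameters and an experience pool that is filled in small pools and settled by an evaluation function. It is a hypothetical reading rather than the authors' implementation: the class name `SmallPoolBuffer`, the rule "keep transitions that score at least the pool average", and the use of the reward as the score are all assumptions, and $\tau$ is passed in as a plain number because the mapping from pool size and effective experience to $\tau$ is not specified here.

```python
import random


def soft_update(target_params, current_params, tau):
    """theta' <- tau * theta + (1 - tau) * theta', applied element-wise."""
    return [tau * c + (1.0 - tau) * t
            for t, c in zip(target_params, current_params)]


class SmallPoolBuffer:
    """Experience pool filled in fixed-size small pools (hypothetical design)."""

    def __init__(self, pool_size, score_fn):
        self.pool_size = pool_size
        self.score_fn = score_fn   # evaluation function: transition -> float
        self.current = []          # small pool currently being filled
        self.effective = []        # experience kept for network updates

    def add(self, transition):
        self.current.append(transition)
        if len(self.current) == self.pool_size:
            self._settle()

    def _settle(self):
        """Keep transitions scoring at least the pool average, then start a new pool."""
        scores = [self.score_fn(t) for t in self.current]
        mean = sum(scores) / len(scores)
        self.effective.extend(
            t for t, s in zip(self.current, scores) if s >= mean)
        self.current = []

    def sample(self, batch_size, rng=random):
        """Draw a minibatch (without replacement) from the effective experience."""
        return rng.sample(self.effective, min(batch_size, len(self.effective)))


# Transitions are (state, action, reward, next_state); the score is the reward (assumption).
buf = SmallPoolBuffer(pool_size=4, score_fn=lambda t: t[2])
for r in [0.0, 1.0, 2.0, 3.0]:
    buf.add((0, 0, r, 1))
batch = buf.sample(2)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.1)
```

With a small `tau`, the target parameters move only a short distance toward the current parameters at each step, which is what keeps the target values stable between updates.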

Optimization Design of Critic Network Structure
In value-based reinforcement learning methods, the approximation error of the function leads to an overestimation of the value and a suboptimal policy. The problem of overestimating the Q value still exists in the DDPG algorithm. Moreover, the Critic network uses the temporal-difference algorithm to estimate the value function, which itself has unavoidable bias, so the overestimation cannot be avoided in the algorithm. During reinforcement learning training, obtaining the optimal Q value $Q(s, a)$ is key to maximizing the cumulative reward and finding the optimal policy. Therefore, the accuracy of the Q value estimation determines the quality of the entire learning process. Moreover, the overestimation problem causes a cumulative error as training continues. This error results in some poor samples having higher Q values and being selected during training, which may form local optima and lead to suboptimal updates and divergent behavior of the policy, in which case the training process does not converge.
Simply changing the experience pool size of the Critic network cannot fundamentally solve the overestimation of Q values caused by the greedy characteristic. Therefore, one or more reference networks can be set up to evaluate the Q value of the Critic network. One or more additional sets of networks, each comprising a current (real) network and a target network, are added to the Critic network. When the Actor network generates an action, these two or more networks are used for estimation. The minimum of the Q target values generated by the target networks is taken, so that the less overestimated Q value is selected. Then, these Critic networks are updated, as shown in Equation (18).
After that, the loss function of the value network can be updated. For the update of the policy network, any of the value networks can be used: because the networks become more and more similar over time, the choice does not affect the update of the policy network. At the same time, the improved algorithm adds normally distributed noise when the target network of the policy network generates actions, in order to avoid overfitting of the value function. The updating method is shown in Equation (19). Based on the above improvements, the flowchart of the improved DDPG algorithm is shown in Figure 5.
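The minimum-of-targets rule and the noisy target action can be sketched as follows. The example assumes two target critics, a scalar action, and clipped Gaussian noise; the clipping, the noise scale `sigma`, the action limit, and all function names are assumptions for illustration (they follow the clipped double-Q and target-smoothing idea known from TD3) and are not taken from the paper.

```python
import random


def smoothed_target_action(mu_target, next_state, sigma=0.2, noise_clip=0.5,
                           act_limit=1.0, rng=random):
    """Target action plus clipped Gaussian noise, limited to the actuator range."""
    noise = max(-noise_clip, min(noise_clip, rng.gauss(0.0, sigma)))
    action = mu_target(next_state) + noise
    return max(-act_limit, min(act_limit, action))


def min_q_target(reward, next_state, next_action, target_critics,
                 gamma=0.99, done=False):
    """y = r + gamma * min_j Q'_j(s', a'), using the smallest target estimate."""
    if done:
        return reward
    q_min = min(q(next_state, next_action) for q in target_critics)
    return reward + gamma * q_min


# Toy stand-ins for the target actor and two target critics (illustrative only).
mu_t = lambda s: 0.5 * s
q1_t = lambda s, a: 1.0 + s + a
q2_t = lambda s, a: 0.8 + s + a

a_next = smoothed_target_action(mu_t, 0.4, rng=random.Random(0))
y = min_q_target(reward=1.0, next_state=0.4, next_action=a_next,
                 target_critics=[q1_t, q2_t])
```

Both current critics would then be regressed toward the same target `y`; taking the minimum biases the target downward, which counteracts the upward bias described in the text.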

Experimental Results and Analysis
In this section, we use a first-order inverted pendulum as the controlled object to validate the DDPG algorithm, the improved DDPG algorithm, and the traditional PID method, and we then analyze and compare their control effects. The inverted pendulum is a universal and classic physical model in traditional control systems. As a controlled object, it is a relatively complex system that is high-order, nonlinear, unstable, multivariable, and strongly coupled, and it is widely used to verify solutions to stability control problems.


Simulation Environment
The simulation environment was mainly conducted in the Matlab/Simulink environment, using the Simscape Multibody Toolbox and Reinforcement Learning Toolbox provided in the development environment for experimental simulation.The Simscape Multibody Toolbox provides a multi-body simulation environment for mechanical systems, which can use modules such as solids, joints, constraints, and force elements to model multi-body systems and solve the motion equations of complete mechanical systems.At the same time, 3D animations are generated to visualize the motion effects.In this paper, the slider cart of the inverted pendulum is connected to the pendulum rod through joint structure and coordinate transformation to form a complete first-order inverted pendulum model.The key parameters are shown in Table 1.The Reinforcement Learning Toolbox is considered as a proxy agent for the DDPG algorithm, connecting the entire network.This module is connected to the corresponding environment and set reward functions.The reward r t is defined by Equation ( 20): where θ t is the angle displacement from the upright position of the pole.x t is the displacement from the center position of the cart.u t−1 is the control effort from the previous time step.B is a flag (1 or 0) that indicates whether the cart is out of bounds.The coefficient parameter is applied for standardization, and the stop value is −400.After training, the Agent module can act as a controller to input the system and receive feedback adjustment inputs.The experimental environment is shown in Figure 6.
Table 1. Key parameters.
Parameter                                    Value
Pendulum mass (kg)                           0.2
Cart mass (kg)                               0.46
Length from centroid to hinge point (m)      0.25
Friction coefficient                         0.08
Gravitational acceleration (m/s²)            9.81

Following established engineering tuning practice, the PID control uses a displacement controller with parameters $K_p = 16$, $K_i = 5$, and $K_d = 3$ and an angular displacement controller with parameters $K_p = 125$, $K_i = 20$, and $K_d = 6$. The hyperparameter settings selected for DDPG training are shown in Table 2. The initial state of the inverted pendulum has a considerable impact on the control results of the nonlinear system. Considering the actual physical significance of an inverted pendulum, the input quantity needs to be limited: the range of motion is set to [−3.5, 3.5] m. Under DDPG control, the cart starts at the center of the guide rail with the pendulum rod hanging naturally downward. During the simulation, the rod is swung up from the hanging state to the vertical upward state and is then kept balanced within the limited range of motion. Under PID control, the rod starts in the vertical upward state and is kept balanced during the simulation.
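To make the PID baseline concrete, the sketch below implements a textbook discrete PID law and instantiates it with the two sets of gains quoted above. The sample time of 0.01 s, the zero setpoints, the sign conventions, and the choice to sum the two controller outputs into one cart force are all assumptions made for illustration; the text does not specify how the two loops are combined in the Simulink model.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (rectangular integration)."""
    kp: float
    ki: float
    kd: float
    dt: float
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, error: float) -> float:
        # Accumulate the integral term and take a backward-difference derivative.
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


DT = 0.01  # assumed sample time in seconds (not given in the text)

# Gains quoted in the text for the two PID controllers.
displacement_pid = PID(kp=16.0, ki=5.0, kd=3.0, dt=DT)
angle_pid = PID(kp=125.0, ki=20.0, kd=6.0, dt=DT)


def control_force(x: float, theta: float) -> float:
    """Hypothetical combination: sum of both loops, regulating x and theta to zero."""
    return displacement_pid.step(0.0 - x) + angle_pid.step(0.0 - theta)


if __name__ == "__main__":
    # A small tilt away from upright produces a corrective force.
    print(control_force(x=0.0, theta=0.02))
```

The derivative term reacts strongly on the first call because the previous error starts at zero; a practical controller would typically filter or initialize it.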

Results and Analysis
After the hyperparameters were set and the corresponding neural networks were constructed, the reinforcement learning training parameters were configured as shown in the table above, and training was started. The training results are shown in Figures 7 and 8.
Subsequently, a square wave signal with a period of 2 s, a duty cycle of 20%, an amplitude of 1, and a delay of 5 s was added as the interference force in the simulation. The interference experimental results are shown in Figures 12-14.

From the figures showing the experimental results, it can be seen that the first-order inverted pendulum system controlled by the DDPG algorithm completed its swing-up at around 3 s. About 2 s later, when the interference force set in at 5 s, it withstood the interference and maintained balance until the end of the 10 s experiment. Therefore, the first 3 s of the simulation results reflect the large fluctuations of the parameters of the entire system during swing-up. Under PID control, the first-order inverted pendulum starts in the vertical upward state; it fluctuates significantly at the beginning of the simulation and gradually stabilizes at around 3 s. Similarly, after 5 s it is subjected to the interference and maintains balance until the end of the 10 s experiment.
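The square-wave interference force described above can be sketched as a simple function of time. This sketch assumes pulse-generator semantics: the signal is zero before the delay, and each period then begins with the high phase. The function name and defaults are illustrative, and the unit of the amplitude is not stated in the text.

```python
def interference_force(t: float,
                       period: float = 2.0,
                       duty: float = 0.20,
                       amplitude: float = 1.0,
                       delay: float = 5.0) -> float:
    """Square-wave disturbance: zero until `delay`, then pulsed."""
    if t < delay:
        return 0.0
    # Time elapsed inside the current period, measured from the delay.
    phase = (t - delay) % period
    return amplitude if phase < duty * period else 0.0


if __name__ == "__main__":
    # Sample the signal every 0.1 s over the 10 s experiment.
    for k in range(0, 101):
        t = k / 10.0
        if interference_force(t) > 0.0:
            print(f"t = {t:4.1f} s: force on")
```

Under these assumptions, the force is active for 0.4 s at the start of each 2 s period, beginning at 5 s, 7 s, and 9 s within the 10 s run.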
The comprehensive simulation results show that the DDPG controller and the improved DDPG controller are slightly inferior to the PID controller in controlling and holding displacement and velocity, but both remain within a reasonable control range. At the same time, the DDPG controllers achieve a much smaller fluctuation range in angular displacement and angular velocity; compared with the PID controller, the advantage is an order of magnitude. They also show stronger anti-interference ability and stability recovery under white noise and sudden interference force. From a practical perspective, slight fluctuations in the angular displacement of an inverted pendulum can lead to significant fluctuations in displacement, and poor control can lead to the collapse of the overall system balance. The reward function used for the DDPG controller in this paper weights angular displacement five times as heavily as displacement. The agent therefore prioritizes angular displacement during training, sacrificing some anti-interference ability in displacement to improve it in angular displacement. This explains why the DDPG controllers lag behind the more balanced and general-purpose PID controller in displacement.
Comparing the DDPG controller with the improved DDPG controller, the balancing strategy of the DDPG controller is too narrow: it only completes the balancing task within the allotted time and neglects to cancel the lateral velocity. As a result, the inverted pendulum drifts continuously along the slide rail and would exceed the given range of motion in long-term operation. The improved DDPG controller suppresses overestimated Q values, allowing it to explore the action space more effectively. After each fluctuation, it therefore moves the cart back appropriately and, compared with the DDPG controller, effectively reduces the range of displacement fluctuations.

In the relatively balanced phase after swing-up, under white noise, the maximum fluctuation ranges of velocity, angular displacement, and angular velocity are 0.9027 m/s, 0.0892 rad, and 0.8386 rad/s for the DDPG controller and 0.7092 m/s, 0.0759 rad, and 0.7527 rad/s for the improved DDPG controller. Under the interference force, the corresponding values are 0.3821 m/s, 0.04243 rad, and 0.2410 rad/s for the DDPG controller and 0.3595 m/s, 0.02551 rad, and 0.2119 rad/s for the improved DDPG controller. In the longitudinal comparison (across disturbance types), white noise, with its high frequency and large amplitude, has a greater impact on system stability than the interference force, which has low-frequency mechanical characteristics; the fluctuations under white noise are therefore larger than those under the interference force. The improved DDPG controller has a small fluctuation range in velocity and especially in angular displacement and angular velocity, and it has strong anti-interference ability. It stays within the motion control range throughout the entire waveform, except for the overshoot that is unavoidable during swing-up. Under the interference force, the DDPG controller regains stability 1.725 s after being disturbed, while the improved DDPG controller regains stability after 1.388 s. In the horizontal comparison (between controllers), the improved DDPG controller therefore responds faster; when the controlled system is suddenly disturbed, it is affected less and restores stability sooner. The four evaluation indicators of displacement, velocity, angular displacement, and angular velocity all remain within a reasonable and controllable range.
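The comparison above can be summarized as relative reductions. The short script below computes them from the figures reported in the text, with the first value of each pair taken as the DDPG controller and the second as the improved DDPG controller (this pairing is an assumption for the interference-force values). The helper names are illustrative.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# (DDPG, improved DDPG) pairs as reported in the text.
WHITE_NOISE = {
    "velocity (m/s)": (0.9027, 0.7092),
    "angular displacement (rad)": (0.0892, 0.0759),
    "angular velocity (rad/s)": (0.8386, 0.7527),
}
INTERFERENCE_FORCE = {
    "velocity (m/s)": (0.3821, 0.3595),
    "angular displacement (rad)": (0.04243, 0.02551),
    "angular velocity (rad/s)": (0.2410, 0.2119),
    "recovery time (s)": (1.725, 1.388),
}


def summarize(results: dict) -> dict:
    """Map each indicator to its fractional reduction."""
    return {name: relative_reduction(*pair) for name, pair in results.items()}


if __name__ == "__main__":
    for label, results in (("white noise", WHITE_NOISE),
                           ("interference force", INTERFERENCE_FORCE)):
        for name, frac in summarize(results).items():
            print(f"{label:>18} | {name:<28} {100 * frac:5.1f}% smaller")
```

Under this pairing, every indicator is smaller for the improved controller, and its recovery time after the interference force is roughly one fifth shorter (1.388 s versus 1.725 s).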

Conclusions
This paper analyzes the control principle of the DDPG algorithm: starting from the Markov decision process, it introduces the Actor-Critic algorithm and combines it with the DQN algorithm to derive the gradient-update process of the DDPG algorithm. It then analyzes and organizes the recursive small-experience-pool DDPG algorithm in depth. An improved method is proposed to address the overestimation of Q values in the DDPG algorithm, and its feasibility is verified. Finally, a first-order inverted pendulum model was built with Simscape Multibody in a simulation environment. The improved DDPG algorithm was simulated in the Matlab/Simulink environment and compared with the DDPG algorithm and the traditional PID controller.
By configuring the corresponding neural networks and evaluation indicators and by improving the structure and settlement method of the experience pool, the DDPG and improved DDPG controllers for the first-order inverted pendulum were trained. The simulation results show that under the control of the DDPG and PID controllers, the displacement and angular displacement of the first-order inverted pendulum are within the control range. By comparison, the DDPG controllers have better response speed and stronger stability and convergence than the PID controller, and they also have excellent anti-interference ability and stability recovery against sudden interference. After the structure is changed, the Critic network selects the minimum Q value, which effectively solves the overestimation of the Q value; the agent draws effective experience within the same training time, trains efficiently, converges earlier under the same convergence conditions, and avoids local optima. Based on the analysis of the training figures and the experimental results, the improved DDPG algorithm has better stability and convergence and stronger anti-interference ability and stability recovery than DDPG.
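The minimum-Q selection mentioned above can be illustrated with a small sketch of a temporal-difference target. It assumes that the modified Critic yields two Q-value estimates for the next state and that the smaller one is used, in the manner of clipped double Q-learning; the exact Critic structure is not restated here, and the function name, argument names, and the discount factor of 0.99 are illustrative assumptions.

```python
def min_q_target(reward: float,
                 q1_next: float,
                 q2_next: float,
                 gamma: float = 0.99,
                 done: bool = False) -> float:
    """TD target that bootstraps from the smaller of two Q estimates."""
    if done:
        # No bootstrapping from a terminal state.
        return reward
    # Taking the minimum counteracts the upward bias of a single estimate.
    return reward + gamma * min(q1_next, q2_next)


def single_q_target(reward: float, q_next: float,
                    gamma: float = 0.99, done: bool = False) -> float:
    """Standard DDPG-style target using one Critic estimate, for comparison."""
    return reward if done else reward + gamma * q_next


if __name__ == "__main__":
    r, q1, q2 = -1.0, -20.0, -35.0
    print("single-Critic target:", single_q_target(r, q1))
    print("minimum-Q target:    ", min_q_target(r, q1, q2))
```

The minimum-Q target is never larger than the target built from either estimate alone, which is the property that limits overestimation.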
This paper demonstrates the effectiveness of control based on the reinforcement learning DDPG algorithm and makes improvements to address the issues that arise. The improved DDPG has good autonomous optimization ability and is trained by trial and error to achieve the control objectives better and faster. This provides a reference for handling classical control problems with reinforcement learning and for its application in traditional control systems. Much work remains to be done in addressing classical control problems through reinforcement learning. First, adjusting the index weights in the reward function and the training hyperparameters will help increase trial-and-error efficiency. Additionally, introducing agents with data-driven prior experience would also be helpful. Finally, deploying the trained agents in an industrial controller will help solve the "reality gap" problem. These problems are all worthy of in-depth research.

Figure 6. Experimental environment.

Figure 8. Improved DDPG training figure.

In the training figures, the training reward is shown for each training episode. Episode Q0 is the Critic's estimate, at the start of each episode, of the long-term reward discounted by the discount factor, and it serves as a convergence index. The episode reward is computed from the reward function according to the performance in each training episode. The average reward is used so that a few outstanding episode rewards do not cause the agent to converge prematurely; it is calculated as the cumulative reward so far divided by the number of training episodes. From the training figures, it can be seen that DDPG requires 235 training cycles to complete the training, while the improved DDPG requires only 145. This is because, after the experience pool size, the settlement method, and the Critic network structure are improved, the Q value of each step is kept at a reasonable level, which reduces overestimation and suppresses local optima and suboptimal behavior during training. The Actor network can then learn in fewer steps during the experience accumulation and climbing phases, quickly absorbing positive experience, avoiding local optima, and improving overall training efficiency.

The trained DDPG network, the improved DDPG network, and the traditional PID control are tested on the same controlled object. White noise with a power of 0.1 W and a seed of 23,341 is added to the simulation. The blue, red, yellow, and purple curves represent displacement, angular displacement, velocity, and angular velocity, respectively. The white noise experimental results are shown in Figures 9-11.
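The average reward described above (cumulative reward divided by the number of episodes) can be sketched as follows. The stop check assumes that training ends once this average reaches the stop value of −400 mentioned for the reward function; how the toolbox applies the criterion internally is not described here, so that rule and the function names are illustrative.

```python
from typing import List, Optional


def running_average(episode_rewards: List[float]) -> List[float]:
    """Average reward after each episode: total reward so far / episodes so far."""
    averages = []
    total = 0.0
    for n, r in enumerate(episode_rewards, start=1):
        total += r
        averages.append(total / n)
    return averages


def first_episode_reaching(averages: List[float],
                           stop_value: float = -400.0) -> Optional[int]:
    """1-based index of the first episode whose average meets the stop value."""
    for n, avg in enumerate(averages, start=1):
        if avg >= stop_value:
            return n
    return None


if __name__ == "__main__":
    # Made-up episode rewards, for illustration only.
    rewards = [-900.0, -500.0, -100.0, -100.0]
    avgs = running_average(rewards)
    print(avgs)
    print(first_episode_reaching(avgs))
```

Because the average accumulates all past episodes, a single good episode moves it only slightly, which is why it guards against stopping on a few outstanding rewards.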

Appl. Sci. 2023, 13, 7594