Research on Deep Reinforcement Learning Control Algorithm for Active Suspension Considering Uncertain Time Delay

The uncertain delay characteristic of actuators is a critical factor that affects the control effectiveness of the active suspension system. It is therefore crucial to develop a control algorithm that accounts for this uncertain delay in order to ensure stable control performance. This study presents a novel active suspension control algorithm based on deep reinforcement learning (DRL) that specifically addresses the issue of uncertain delay. In this approach, a twin-delayed deep deterministic policy gradient (TD3) algorithm with a system delay link is employed to obtain the optimal control policy by iteratively solving the dynamic model of the active suspension system, considering the delay. Furthermore, three different operating conditions were designed for simulation to evaluate the control performance: deterministic delay, semi-regular delay, and uncertain delay. The experimental results demonstrate that the proposed algorithm achieves excellent control performance under various operating conditions. Compared to passive suspension, body vertical acceleration is reduced by more than 30%, and the proposed algorithm effectively mitigates body vibration in the low-frequency range. It consistently maintains a more than 30% improvement in ride comfort even under the most severe operating conditions and at different speeds, demonstrating the algorithm's potential for practical application.


Introduction
With the development of microprocessor, sensor, and actuator technologies, scholars have studied various aspects of active suspension systems. The active suspension system can adaptively adjust the output force by controlling the actuator according to the driving operation and road conditions, achieving both ride comfort and better vehicle driving performance. Active suspension control has significant prospects due to its ability to meet the comfort and safety requirements of modern vehicles.
Due to the closed-loop nature of the active suspension system, which includes sensors, controllers, and actuators, there are inherent delays in the measured signal or actuator. In the majority of cases, the time delay can be disregarded as it is insignificant. However, there are situations where the magnitude of the time delay is comparable to the control cycle, making it impossible to ignore and requiring careful consideration. Time delays tend to degrade control performance and potentially induce instability in the control system [1][2][3][4]. Therefore, for active suspension systems with delay, developing control strategies that resist uncertain time delays is particularly important.
Currently, delay-sensitive control systems have made significant progress in multiple industries, including transportation [5,6], autonomous vehicle control [7], wireless sensor networks [8], and power systems [9,10]. In recent years, stability analysis and controller integration have been applied to linear systems with measurement delays or actuator delays. In general, there are two main approaches to dealing with the problem of system delays. One is to design the controller using an integrated system model that includes actuator dynamics [11], and the other is to consider system delays during the controller design process. Ji et al. [12] proposed an improved variable universe fuzzy control strategy with real-time adjustment of the contracting-expanding factor parameters to improve the ride comfort of vehicles. Some scholars have proposed delay compensation techniques to reduce or offset the negative effects of delays [13][14][15]. Udwadia et al. [16] applied delayed-state positive feedback proportional control to the active control of structures. Pan et al. [17] designed a suspension delay active controller using an adaptive control strategy. Other scholars [18,19] have designed delayed instantaneous optimal control laws for suspension systems with delay using a state transformation approach to ensure the stability of the system. Du et al. [20] designed full-frequency-domain state feedback controllers considering input delay for automotive suspensions and seat suspensions and achieved good damping effects within a certain delay range. Li et al. [21], on the other hand, proposed a time-delayed full-frequency-domain robust control method based on dynamic output feedback. In addition, Kim et al. [22] combined perception technology and proposed model predictive control of a semi-active suspension with shift delay compensation using preview road information. Wu et al. [23] proposed a time-delay control strategy using a linear motor as the actuator and, based on this idea, a linear equivalent excitation method to optimize the time-delay control parameters under complex excitation. Li et al. [24] proposed a fuzzy cooperative control strategy based on linear matrix inequality theory to weaken the effect of perturbations on vehicle driving conditions by integrating the problems of variable structure, disturbance immunity, and time delay. Based on Lyapunov theory and the backstepping technique, Wang [25] studied the adaptive control problem of nonlinear active suspension systems with random perturbations and time delay. Moreover, some scholars have considered the time-delay characteristics of active suspension systems and controlled them according to the Takagi-Sugeno fuzzy model [26][27][28]. Further, scholars have been searching for more robust control methods to ensure the stability and performance of the suspension system [29][30][31].
Although numerous scholars have conducted extensive research on the delay problem, delayed control systems remain challenging due to their infinite-dimensional nature and the inherent uncertainty associated with delay. Therefore, further in-depth research is essential. Simultaneously, the automotive industry is rapidly adopting intelligent technologies, which has led to the integration of artificial intelligence (AI) as a solution for controlling complex and uncertain systems. Notably, deep reinforcement learning (DRL) techniques have demonstrated significant advantages in addressing high-dimensional problems. By combining the robust perception and information-processing capabilities of deep learning (DL) with the decision-making proficiency of reinforcement learning (RL) in complex environments, these techniques are gradually gaining traction in the study of various intricate systems.
Currently, DRL has shown its advantages in vision and decision-making [32,33]. Some scholars have also tried to apply it to suspension control [34,35]. In an intuitive application, Pang et al. [36] proposed a non-fragile fault-tolerant control design for Markov-type systems. Providing more depth, Kozek et al. [37] proposed a neural algorithm based on reinforcement learning to optimize the creation of a linear quadratic regulator (LQR). In recent years, Li et al. [38] used an actor-critic architecture to study the adaptive neural network output feedback optimal control problem. In addition, many scholars have utilized different DRL architectures for active or semi-active suspension control. Lin et al. [39] studied a deep deterministic policy gradient (DDPG) control strategy for a full-vehicle active Macpherson suspension system. Yong et al. [40] proposed learning and control strategies for a semi-active suspension system in a full car using soft actor-critic (SAC) models on real roads. Similarly, Lee et al. [41] conducted a study on semi-active suspension control using DRL and proposed a state-normalization filter to improve generalization performance. Further, Du et al. [42] proposed utilizing external knowledge in the DDPG framework for suspension control and integrating speed planning to ensure ride comfort. It is worth mentioning that Han et al. [43] and Dridi et al. [44] have applied a proximal policy optimization (PPO) algorithm to semi-active and active suspension control and achieved satisfactory control results.

Although there have been some useful studies on the application of DRL to suspension control, few attempts have been made to solve the delay problem. However, it is worth noting that in sequential decision processes such as DRL control, the delay problem can have a significant impact on the actual control effect. Therefore, using DRL to solve the delay problem is still a new idea to be investigated. In this study, by adding a delay link to the twin-delayed deep deterministic policy gradient (TD3) algorithm, the agent is guided to explore a more robust control strategy in a time-delayed environment, thereby effectively suppressing the effect of uncertain delay on the active suspension system. To summarize, the main innovations of this study are as follows:

1. To our knowledge, this study represents the first research endeavor to employ DRL techniques in the realm of delay control. The primary aim of this investigation is to alleviate the repercussions of uncertain delays through active suspension control strategies rooted in DRL. Furthermore, this study demonstrates the utilization of the high-dimensional advantages of DRL in an infinite-dimensional delay control system, ultimately achieving commendable results.

2. In this study, multiple simulation sets considering deterministic delay, semi-regular delay, and uncertain delay are designed to test the control performance of the algorithm. Various delay characteristics, and the uncertainty of the control system under their influence, are considered to make the simulation closer to actual working conditions.

3. The control algorithm proposed in this study maintains good control performance across multiple simulation sets built in MATLAB/Simulink for different working conditions and speeds, proving its application potential.
This paper is organized as follows. Section 2 presents the active suspension quarter model and road model. The proposed controller algorithm and the associated model design are presented in Sections 3 and 4, respectively. In Section 5, the simulation environments with deterministic, semi-regular, and uncertain delays, set up to validate the performance of the algorithm, and their results are shown. Section 6 discusses the simulation results in a broader context. Finally, Section 7 provides the conclusions of this study. A table of notations and abbreviations used in this paper is provided in Appendix A.

Active Suspension Quarter Model
In the field of control algorithms for active suspension systems, the two-degrees-of-freedom quarter suspension has emerged as the established benchmark model for experimentation. This model offers simplicity and, despite only addressing the vertical vibration of the sprung and unsprung masses, provides an intuitive representation of the algorithm's impact on control performance. Consequently, it aids researchers in algorithm development. Further, the vertical vibration of the vehicle body is the primary determinant of ride comfort, while the complex structure and pitch-and-roll motion of the body are presently disregarded. Hence, the quarter suspension serves as the foundational model for developing a dynamic model that accounts for delay.
Typically, models that take into account time delays are constructed from Takagi-Sugeno fuzzy models or include a model of an adjustable damper with hysteresis in the dynamics model [45][46][47]. The two-degrees-of-freedom active suspension dynamics model considering delay used in this study is shown in Figure 1. In this model, the system state delay is small enough in magnitude to be neglected, so only the inherent delay τ and the actuator delay τ_a in the control system are considered. The inherent delay is the delay caused by the acquisition and transmission of signals in the active suspension control system and by the controller operation, while the actuator delay is the delay caused by the actuator response lag, which is unavoidable in the control loop.

According to the second class of Lagrangian equations, the kinetic equations with delayed quantities at moment t can be obtained as

m_b ẍ_b(t) = −k_b[x_b(t) − x_u(t)] − c_b[ẋ_b(t) − ẋ_u(t)] + F
m_u ẍ_u(t) = k_b[x_b(t) − x_u(t)] + c_b[ẋ_b(t) − ẋ_u(t)] − k_u[x_u(t) − w(t)] − F

where m_b is the sprung mass, kg; m_u is the unsprung mass, kg; k_b is the spring stiffness, N/m; k_u is the equivalent tire stiffness, N/m; c_b is the equivalent damping factor of the suspension damping element, N·s/m; x_b is the vertical displacement of the sprung mass, m; x_u is the vertical displacement of the unsprung mass, m; w is the vertical tire displacement, which can also be equated to the road displacement, m; and F = u(t − τ − τ_a) is the active suspension actuator control force for the time-delay system, N.

Take the system state variables as x(t) = [x_b(t), ẋ_b(t), x_u(t), ẋ_u(t)]^T and the system output variable as the body vertical acceleration ẍ_b(t). Then, the state space expression of the active suspension, considering the delay, is

ẋ(t) = A x(t) + B u(t − τ − τ_a) + E w(t)
y(t) = C x(t) + D u(t − τ − τ_a) + L w(t)
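The state-space matrices implied by the kinetic equations above can be assembled as follows. This is a minimal sketch: the parameter values are illustrative placeholders for a typical quarter-car, not the values of Table 1, which is not reproduced here.

```python
import numpy as np

# Illustrative quarter-car parameters (placeholders, not the paper's Table 1 values)
m_b, m_u = 300.0, 40.0        # sprung / unsprung mass, kg
k_b, k_u = 15000.0, 150000.0  # spring / tire stiffness, N/m
c_b = 1500.0                  # suspension damping, N*s/m

# State x = [x_b, x_b_dot, x_u, x_u_dot]^T
A = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [-k_b / m_b, -c_b / m_b, k_b / m_b, c_b / m_b],
    [0.0, 0.0, 0.0, 1.0],
    [k_b / m_u, c_b / m_u, -(k_b + k_u) / m_u, -c_b / m_u],
])
B = np.array([[0.0], [1.0 / m_b], [0.0], [-1.0 / m_u]])  # delayed force u(t - tau - tau_a)
E = np.array([[0.0], [0.0], [0.0], [k_u / m_u]])         # road displacement input w(t)
```

Because the passive quarter-car is damped, all eigenvalues of A have negative real parts, which is a quick sanity check on the matrix assembly.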

Road Model
The vehicle road model uses a filtered white noise time domain road input model, i.e.,

ẇ(t) = −2π f_0 w(t) + 2π n_0 √(S_q(n_0) v) w_0(t)

where w(t) is the road displacement; f_0 is the lower cutoff frequency; S_q(n_0) is the road unevenness coefficient, which is related to the road class (the unevenness coefficients of class A, B, and C roads are 16, 64, and 256, respectively); n_0 is the reference spatial frequency; v is the speed, m/s; and w_0 is a uniformly distributed white noise with a mean of 0 and an intensity of 1.

The lower cutoff frequency is calculated as f_0 = n_00 v, where n_00 is the spatial cutoff frequency of the pavement, n_00 = 0.011 m⁻¹. In this study, the time domain road unevenness curve used by the training agent is shown in Figure 2.
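The filtered-white-noise road input above can be discretized with a simple Euler step, as in the sketch below. The function name, sampling settings, and the reference spatial frequency n_0 = 0.1 m⁻¹ are assumptions for illustration; the unevenness coefficient is expressed in m³ (e.g., 256 × 10⁻⁶ for a class C road).

```python
import numpy as np

def road_profile(v, Sq_n0=256e-6, n0=0.1, n00=0.011, T=10.0, dt=0.001, seed=0):
    """Filtered-white-noise road displacement w(t) via Euler integration.
    Sq_n0: road unevenness coefficient, m^3 (256e-6 ~ class C road).
    n0: reference spatial frequency, m^-1 (assumed 0.1 m^-1).
    n00: spatial cutoff frequency, m^-1; v: vehicle speed, m/s."""
    rng = np.random.default_rng(seed)
    f0 = n00 * v                          # lower cutoff frequency, Hz
    n = int(T / dt)
    w = np.zeros(n)
    # unit-intensity white noise sampled at step dt
    w0 = rng.standard_normal(n) / np.sqrt(dt)
    for k in range(n - 1):
        dw = -2 * np.pi * f0 * w[k] + 2 * np.pi * n0 * np.sqrt(Sq_n0 * v) * w0[k]
        w[k + 1] = w[k] + dw * dt
    return w
```

For example, `road_profile(20.0)` returns a 10 s class C road displacement history at 20 m/s, with centimeter-scale excursions.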



Controller Algorithm
This section introduces the primary algorithm employed in constructing the controller for the active suspension delay system. Initially, the fundamental principles and framework of reinforcement learning (RL) are presented, followed by an elucidation of the advantages offered by the deep reinforcement learning (DRL) algorithm compared to traditional RL algorithms. Subsequently, the TD3 algorithm, known for its suitability in continuous control systems, is selected based on the characteristics of the delay control system. The algorithmic process and technical intricacies of TD3 are then described.

Reinforcement Learning
RL is a reward-based learning method developed from traditional trial-and-error learning, and its basic principles can be traced back to optimal control theory and Markov decision processes (MDP). RL can be represented as an interactive system consisting of an agent and an environment, as shown in Figure 3. At time t, the environment generates information describing the state of the system, the state s_t. The agent interacts with the environment by observing the state and using this information to select the action a_t. The environment accepts the action and transitions to the next state s_{t+1}. The environment then feeds the next state and its reward r_t back to the agent. The cycle repeats and iterates continuously until the environment terminates. RL learns the optimal policy by maximizing the total reward accumulated in this process.
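The interaction cycle described above can be sketched as a generic loop. ToyEnv and the tracking policy below are hypothetical stand-ins for illustration, not the suspension environment of this paper.

```python
def run_episode(env, policy, max_steps=100):
    """Generic agent-environment interaction loop: observe state,
    select action, receive reward and next state, repeat."""
    s = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        a = policy(s)                   # agent selects action a_t from state s_t
        s_next, r, done = env.step(a)   # environment transitions, emits reward r_t
        total_reward += r
        s = s_next
        if done:
            break
    return total_reward

class ToyEnv:
    """Hypothetical environment: reward is higher the closer the action
    tracks the drifting state."""
    def reset(self):
        self.s, self.t = 0.0, 0
        return self.s
    def step(self, a):
        self.t += 1
        self.s += 0.1
        r = -abs(a - self.s)
        return self.s, r, self.t >= 10
```

Running `run_episode(ToyEnv(), lambda s: s)` shows the loop structure: the lagging policy incurs a small negative reward at every step.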

Deep Reinforcement Learning
In RL, the agent learns a function to formulate appropriate actions to maximize the goal. However, RL has limitations when dealing with large-scale problems. For example, in cases where the state space and action space are extremely large, traditional RL algorithms may face the curse of dimensionality and high computational complexity. Additionally, traditional RL algorithms may require a large number of samples and much time to learn a good policy, which can be time-consuming and inefficient for certain tasks. With the development of DL technology, deep neural networks can be used as function approximators to learn value functions that encode multidimensional information. This extends the application of RL to higher dimensions, allowing it to mitigate, to some extent, the curse of dimensionality in traditional optimization problems.

The three primary functions that can be learned in RL align with the three primary classes of methods in DRL, namely, policy-based, value-based, and model-based. Researchers use these three classes of methods individually or in combination to meet the needs of practical tasks. In this study, the active suspension control system considering delay is a typical continuous-state, continuous-action control system. The classical DRL algorithm in this field is DDPG [48]. DDPG utilizes an actor-critic architecture for learning, i.e., an actor network is added to the deep Q network (DQN) [49,50] to output action values directly. This helps the agent obtain more feedback information and thus make more accurate decisions.


Twin-Delayed Deep Deterministic Policy Gradient
Although the classical DDPG algorithm has achieved satisfactory results in many tasks, the traditional actor-critic framework suffers from the bias and variance problems associated with function approximation, given the cumulative nature of the DRL iterative process. Specifically, variance can cause overestimation, and high variance can cause the accumulation of errors, which in turn makes the system less stable. For time-delayed control systems, high variance means more aggressive control behavior and high bias means gradual deviation from the steady state; the agent gains more benefit but also takes more risk, which is more likely to cause instability of the control system. Therefore, both overestimation and cumulative error need to be suppressed. Based on these practical control requirements, this study uses the TD3 algorithm [51] as the basic algorithmic framework, i.e., adding techniques such as clipped double Q-learning, delayed policy updates, and target policy smoothing to the DDPG framework. The framework improvement of the TD3 algorithm is shown in Figure 4.
When building a controller using the TD3 algorithm, first, randomly initialize the critic networks Q_1(s, a|θ^{Q_1}) and Q_2(s, a|θ^{Q_2}) and the actor network μ(s|θ^μ) with parameters θ^{Q_1}, θ^{Q_2}, and θ^μ, respectively. At the same time, initialize the target networks Q′_1(s, a|θ^{Q′_1}), Q′_2(s, a|θ^{Q′_2}), and μ′(s|θ^{μ′}), and synchronize their parameters: θ^{Q′_1} ← θ^{Q_1}, θ^{Q′_2} ← θ^{Q_2}, θ^{μ′} ← θ^μ. In the learning phase, we set the capacity of the experience buffer and the number of episodes. After initializing the state s_t, the actions are selected based on the current actor network and exploration noise:

a_t = μ(s_t|θ^μ) + ξ_t

where ξ_t ∼ N(0, σ).
Performing actions in the environment yields the reward r_t and the new state s_{t+1}. These experiences are stored as samples (s_t, a_t, r_t, s_{t+1}) in the experience buffer. When the experience buffer reaches a certain number of samples, we randomly sample a mini-batch of experience-transfer samples (s_i, a_i, r_i, s_{i+1}) from it to perform parameter updates. Let

ã_{i+1} = μ′(s_{i+1}|θ^{μ′}) + ε, ε ∼ clip(N(0, σ̃), −c, c)
y_i = r_i + γ min_{j=1,2} Q′_j(s_{i+1}, ã_{i+1}|θ^{Q′_j})

where γ is the discount factor, taken as 0.99, which is equivalent to considering the situation after roughly 100 time steps, and c is the noise clipping, which in this study is taken as 0.5. The clipped double Q-learning and target policy smoothing ideas of TD3 are reflected here. With clipped double Q-learning, the value target cannot introduce any additional overestimation bias over the standard Q-learning target. While this update rule may induce an underestimation bias, this is far preferable to overestimation bias because, unlike overestimated actions, the value of underestimated actions will not be explicitly propagated through the policy update [51]. Moreover, the effect of inaccuracy caused by function approximation error can be effectively reduced by target policy smoothing.
The parameters θ^{Q_j} of the critic networks are updated by minimizing the loss function

L(θ^{Q_j}) = (1/N) Σ_i [y_i − Q_j(s_i, a_i|θ^{Q_j})]²

where N is the size of the mini-batch, taken as 128 in this study. In our experience, the mini-batch size should be related to the complexity of the problem being studied. The parameters θ^μ of the actor network are updated according to the deterministic policy gradient of the objective:

∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q_1(s_i, a|θ^{Q_1})|_{a=μ(s_i)} ∇_{θ^μ} μ(s_i|θ^μ)

Finally, in order to avoid unknown oscillations affecting convergence during gradient descent and to ensure that network updates balance stability and rapidity, we perform a soft update of the target network parameters:

θ′ ← ε θ + (1 − ε) θ′

where ε is the soft update factor. Since the delay control system is a high-dimensional unstable system, ε = 0.001 is taken in this study to satisfy the robustness of the policy to some extent. In addition, the critic networks are updated twice as often as the actor network in order to first minimize the errors induced by the value estimation before introducing policy updates. The soft updating of network parameters and the delayed policy updates keep the targets stable and thus reduce error accumulation. TD3 is summarized in Algorithm 1.
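The clipped double-Q target with target policy smoothing, and the soft update, can be sketched numerically as follows. The callables q1_t, q2_t, and mu_t stand in for the target networks, and the function names are our own; this is an illustration of the update rules, not the paper's implementation.

```python
import numpy as np

def td3_target(r, s_next, q1_t, q2_t, mu_t, gamma=0.99, sigma=0.2, c=0.5, rng=None):
    """TD3 value target: smoothed target action, then the minimum of the
    two target critics to suppress overestimation (sigma is illustrative)."""
    rng = rng or np.random.default_rng(0)
    noise = np.clip(rng.normal(0.0, sigma), -c, c)  # clipped smoothing noise
    a_next = mu_t(s_next) + noise
    return r + gamma * min(q1_t(s_next, a_next), q2_t(s_next, a_next))

def soft_update(theta_target, theta, eps=0.001):
    """Soft update: theta' <- eps * theta + (1 - eps) * theta'."""
    return eps * np.asarray(theta) + (1.0 - eps) * np.asarray(theta_target)
```

With two constant target critics returning 1.0 and 2.0 and r = 1, the target evaluates to 1 + 0.99 · min(1, 2) = 1.99, showing the pessimistic minimum at work.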

Controller Model
Solving a new problem using DRL involves the creation of an environment, so in this section, we focus on the environment components involved in the controller model, namely, the state, action, reward, and transfer function. Among them, the transfer function of DRL is the dynamics model introduced in Section 2, which will not be repeated here.

State
The state is the information that describes the environment, and the RL environment must provide the algorithm with enough information to solve the problem. In the active suspension delay control system, multidimensional information such as displacement, velocity, acceleration, control force, and time is included. These high-level states are the raw states. The raw state should contain all the information relevant to the problem, but it is often difficult to learn from because it usually has a lot of redundancy and background noise. Raw and complete information provides more freedom, but extracting and interpreting useful signals from it is a much heavier burden. Therefore, the selection and design of the state space become particularly important.
Based on the important suspension performance parameters included in the dynamics model, the following state variables are chosen to characterize the state space considering the control performance requirements of the active suspension.The designed state space takes into account both real-world constraints and observability cases under the influence of sensor arrangements.
The state information takes into account the ride comfort, actuator efficiency, and suspension travel of the suspension system. This state contains much less information than the raw state, but it provides more straightforward information for the algorithm and is easier to learn from.
Further, the agent preprocesses the designed states for its own use. In order to improve the generalization performance of the controller model, this study normalizes the states, and the final state vector is represented as s̄ = [λ_1 s_1, λ_2 s_2, …, λ_n s_n]^T, where λ_i is the normalization coefficient of each state variable.
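A minimal sketch of this normalization, assuming each coefficient λ_i is chosen as the reciprocal of an expected maximum magnitude so that every component maps into roughly [−1, 1]. The bound values here are illustrative, not the paper's coefficients.

```python
import numpy as np

def make_normalizer(state_bounds):
    """Build lambda_i = 1 / |s_i|_max and return a function applying the
    element-wise normalization s_i <- lambda_i * s_i."""
    lam = 1.0 / np.asarray(state_bounds, dtype=float)
    def normalize(s):
        return lam * np.asarray(s, dtype=float)
    return normalize
```

For instance, with assumed bounds [2.0, 0.15, 4.0], the raw state [1.0, 0.15, −2.0] maps to [0.5, 1.0, −0.5].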

Action
Action is the output of the agent, which is the control quantity output by the controller. Actions change the environment by transitioning the dynamics model to the next state and into the next round of iterations. How the action is designed affects the ease of control of the system and thus directly affects the problem's difficulty. In this study, the action is the control force of the active suspension actuator, i.e.,

a_t = μ(s_t|θ^μ)

where μ(s|θ^μ) is the actor network. It should be noted that, due to the delay, the control force in the actual system often appears as a_{t+τ+τ_a}. Considering the specific performance constraints of the actuator and adding the double truncation constraint to the actor network output, the final action is represented as

a = clip(a_{t+τ+τ_a}, F_min, F_max)    (17)

In this study, based on the saturation constraint of the actuator, we set F_min = −3 kN and F_max = 3 kN.
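The double truncation constraint of Equation (17), together with a delayed application of the force, can be sketched in discrete time as below. The step-based delay buffer and class names are our own assumptions for illustration, since the paper formulates the delay in continuous time.

```python
from collections import deque

F_MIN, F_MAX = -3000.0, 3000.0  # actuator saturation, N (from the text)

def clip_force(a):
    """Double truncation constraint of Equation (17)."""
    return max(F_MIN, min(F_MAX, a))

class DelayedActuator:
    """Applies the commanded force after a fixed number of control steps,
    mimicking the total delay tau + tau_a in discrete time."""
    def __init__(self, delay_steps):
        # buffer pre-filled with zero force for the initial delay window
        self.buf = deque([0.0] * delay_steps, maxlen=delay_steps + 1)
    def apply(self, a):
        self.buf.append(clip_force(a))   # newest command enters the queue
        return self.buf.popleft()        # force commanded delay_steps ago
```

With a 2-step delay, the first two applied forces are zero, and the force commanded at step 1 only acts at step 3.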

Reward
The reward signal defines the objective that the agent should maximize, guiding the agent's exploration toward a better policy at a global level. The reward function is somewhat analogous to the objective function of a traditional control problem, so reward design is an important issue in DRL. In the design of an active suspension controller considering delay, the control performance requirements of multiple objectives need to be considered.
The first issue is ride comfort, which is closely related to the vertical acceleration of the vehicle body. Therefore, a suitable active control force is needed to mitigate the impact of road unevenness on the body and reduce the vertical acceleration.
Secondly, the practical constraints of suspension travel need to be considered. The dynamic suspension travel needs to remain within a safe range, i.e.,

|x_b(t) − x_u(t)| ≤ f_d

where f_d is the maximum travel value of the suspension; in this study, f_d = 0.15 m. The value is determined by referring to the literature [52] and considering the actual constraints of the suspension.
Then, the grounding of the tire needs to be considered, i.e., the dynamic tire load must not exceed the static load to ensure the handling stability of the vehicle:

k_u |x_u(t) − w(t)| ≤ F_m

where F_m is the static load of the tire, calculated as F_m = (m_b + m_u)g. Finally, the control characteristics of the actuator need to be considered. The actuator delay τ_a is closely related to the total delay of the whole system, so in order to suppress the effect of delay at the physical level, the control force should be kept relatively stable within a relatively small interval as much as possible.
In summary, the reward function is defined as a weighted sum of the above objective terms, where λ_1, λ_2, … are the weight coefficients of the balanced multi-objective optimization problem.
It should be noted that the agent's reward function in the training phase references the state information of the system after the delayed control force has been applied. In contrast, the state referenced by the actor network and critic network is the current state. In other words, the delayed control system in the experimental phase is equivalent to an open-loop control system.
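A hedged sketch of a reward of this multi-objective flavor is given below: a quadratic comfort/travel/force penalty plus hard penalties for violating the travel and grounding constraints. The weight values and the penalty form are illustrative assumptions, not the paper's coefficients.

```python
def reward(acc, travel, tire_force, force, F_m, f_d=0.15,
           w=(1.0, 0.1, 1e-7)):
    """Illustrative multi-objective reward.
    acc: body vertical acceleration, m/s^2; travel: x_b - x_u, m;
    tire_force: dynamic tire load, N; force: actuator force, N;
    F_m: static tire load, N; f_d: max suspension travel, m.
    Weights w are hypothetical placeholders."""
    r = -(w[0] * acc**2 + w[1] * travel**2 + w[2] * force**2)
    # hard penalties for violating the travel and grounding constraints
    if abs(travel) > f_d:
        r -= 100.0
    if abs(tire_force) > F_m:
        r -= 100.0
    return r
```

The hard-penalty terms make constraint violations dominate the quadratic terms, steering exploration away from unsafe states.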

Simulation and Results
This section describes the simulation and results used to verify the proposed control algorithm in the simulation environment built with MATLAB 2023a/Simulink. First, the specific environment and agent-related network information required for the simulation experiments were introduced and set up. Then, parallel experiments were conducted for defined delay conditions of 10 ms, 20 ms, and 30 ms to demonstrate the control performance of the proposed algorithm under deterministic delay. In addition, we established a semi-regular delay condition based on the fuzzy relationship between actuator control force and delay to test the control performance of the proposed algorithm. Finally, we established a severe operating condition with uncertain delay to test the anti-disturbance performance of the proposed algorithm and its improvement of ride comfort.

Implementation Details
The active suspension system dynamics model with consideration of delay introduced in Section 2 was used to build the environment required for training the DRL agent, and the model parameters and values are listed in Table 1. The road information used for training the agent is shown in Figure 2, and a deterministic delay of 30 ms was added to the network during training. The critic and actor networks used for the agent were specifically designed for the control of active suspension systems, as shown in Figure 5, and the hyperparameters used to train the networks are shown in Table 2. In order to better verify the performance of the proposed algorithm, we chose the active suspension DDPG control architecture proposed in the literature [53] as a baseline for comparison. The hyperparameters of the baseline algorithm were used in this study, and additional hyperparameters were selected based on the original TD3 algorithm. In addition, we performed combinatorial experiments on some of the hyperparameters; see Appendix B.

Deterministic Delayed Conditions
Due to the random nature of road excitation during actual driving, we selected a random class C road as the excitation. The proposed active control algorithm for the delayed system was used to study the dynamics of the suspension system under this random excitation. One study [20] showed that a classical control method could not prevent unstable divergence in an active suspension system with a 24 ms time delay. Based on this conclusion, the delay times were set to 10 ms, 20 ms, and 30 ms.
We quantitatively evaluated the intensity of body vibration under DRL control using root mean square (RMS) values of acceleration. In addition, we analyzed the frequency response of the body acceleration to the random road. The RMS results of the proposed algorithm with different time delays in the simulation environment were compared with the passive results, as tabulated in Table 3. The simulation comparisons of body acceleration and frequency response are shown in Figure 6. In addition, we compared the proposed algorithm with the classical and effective DDPG algorithm [53]; these results are also presented in the graphs. As the graphs show, the proposed control algorithm optimized ride comfort by 43.58%, 44.9%, and 32.28% for 10 ms, 20 ms, and 30 ms deterministic delays, respectively, compared to the passive suspension. Although the proposed algorithm is slightly inferior to DDPG under the low-latency condition of 10 ms, its control performance improves by 25.56% over DDPG under the 20 ms delay condition. Further, under the large-delay condition of 30 ms, the proposed algorithm still maintains a 32.28% optimization, whereas DDPG cannot maintain stability and crashes. These results clearly demonstrate the superior control performance of the proposed algorithm. Although the proposed algorithm exhibited good control performance under deterministic delays, realistic control systems do not always have deterministic delays. A deterministic delay is equivalent to adding one deterministic dimension to the overall control system, which remains solvable to some extent. Therefore, delayed control systems require more in-depth discussion and analysis.
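The RMS metric used throughout these comparisons is straightforward; a minimal sketch (the function names are our own, not the paper's):

```python
import numpy as np

def rms(signal):
    """Root mean square of a sampled body-acceleration signal."""
    return float(np.sqrt(np.mean(np.square(signal))))

def comfort_improvement(rms_passive, rms_active):
    """Percentage reduction of acceleration RMS relative to the passive suspension."""
    return 100.0 * (rms_passive - rms_active) / rms_passive
```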

Semi-Regular Delayed Conditions
The amount of delay in a control system is often closely related to the actual actuation capability of the actuator, so some scholars have used an integrated system model that includes the actuator dynamics to design the controller. To simulate this characteristic, we developed a semi-regular delay condition in which the delay is graded in steps according to the magnitude of the commanded control force, where δ is the unit delay amount, taken as 10 ms in this study, and f is the unit force determined by the maximum limiting control force of the actuator, taken as 2 kN in this study. This value was chosen by considering the bandwidth of the active suspension actuator and can be obtained in practice by testing the actuation force response of the actuator under different loads.
The semi-regular delay condition was based on the fuzzy relationship between actuator delay and actuation capacity, with the delay graded in steps. It keeps the system simple while largely retaining the actuator's role, so this condition has stronger practical significance for testing the control algorithm. The body acceleration and frequency response under the semi-regular delay condition are shown in Figure 7. In this condition, the RMS result of the proposed algorithm is 0.9529 m/s², and ride comfort is optimized by 44.13%. In comparison, the RMS value under the DDPG baseline is 1.1321 m/s², so the proposed algorithm exceeds the DDPG baseline by 15.8%. The proposed algorithm thus maintained good control performance under operating conditions that account for the fuzzy characteristics of the actuator.
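Since the paper's stepping equation is not reproduced above, the grading can only be sketched; the rule below (one unit of delay δ per unit of commanded force magnitude f, with a ceiling) is an assumption consistent with the description, not the exact formula.

```python
import math

DELTA_MS = 10.0    # unit delay δ (10 ms in this study)
F_UNIT_N = 2000.0  # unit force f (2 kN in this study)

def semi_regular_delay(force_n):
    """Assumed stepwise grading: the delay grows by one unit δ for each
    unit f of commanded force magnitude (zero-force behavior is also assumed)."""
    steps = max(1, math.ceil(abs(force_n) / F_UNIT_N))
    return steps * DELTA_MS
```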

Uncertain Delay Conditions
The deterministic and semi-regular delay conditions considered so far amount to adding one or a few dimensions of uncertainty to the delayed control system. In a practical time-delay control system, however, the delay is often uncertain; mathematically, such an environment is closer to an infinite-dimensional control system. Therefore, we set up simulation experiments under uncertain-delay working conditions to simulate a more severe environment. Under uncertain delay conditions, DDPG fails to converge due to its inability to adapt to such a high-dimensional working condition.
Specifically, the delay time was set to satisfy a uniform distribution given by the following equation: The body acceleration and frequency response under the uncertain delay conditions are shown in Figure 8. The RMS value under the control of the proposed algorithm was 1.1116 m/s² in the uncertain delay condition, and the ride comfort optimization reached 37.56%. In comparison, the RMS value under the DDPG baseline control is 1.2528 m/s², so the proposed algorithm exceeds the DDPG baseline by 11.3%. The control performance of the proposed algorithm thus showed only a small degradation despite the sharp increase in system complexity. In Figure 8b, the response of the proposed algorithm to road vibration shows a slight increase over a certain range in the high-frequency band. This is because the uncertain delay condition is extremely severe, so the DRL algorithm prioritizes the control needs in the low-frequency band; high-frequency control will be studied in more depth.
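Because the distribution's bounds are given by an equation not reproduced here, the sampling below uses placeholder bounds purely for illustration; only the uniform-sampling structure is taken from the text.

```python
import random

def sample_delay_ms(low_ms=0.0, high_ms=30.0, rng=random):
    """Draw an actuator delay uniformly per episode step; `low_ms`/`high_ms`
    are illustrative placeholders, not the paper's actual bounds."""
    return rng.uniform(low_ms, high_ms)
```

Passing `rng=random.Random(seed)` makes an episode's delay sequence reproducible.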
It should be noted that in the comparative experiments, the DDPG algorithm was unable to complete the full verification because it triggered the within-episode termination condition described in Equation (18) under the 30 ms large delay, semi-regular delay, and uncertain delay conditions. To conduct the comparative experiments, we had to remove the relevant termination conditions. In contrast, the proposed algorithm never triggered the termination condition, ensuring the safety of the driving process. Furthermore, to verify the generalization performance of the proposed algorithm in different environments, we conducted experiments at varying speeds. The control results, shown in Table 4, indicate that the proposed algorithm performs well at different speeds.

Discussion
In this study, we made some beneficial attempts to use DRL to solve the challenging problem of time-delay control in active suspension systems. We set deterministic, semi-regular, and uncertain time delays to simulate the progression from ideal working conditions to real and then harsh working conditions, thereby testing the control performance of the proposed algorithm. Under deterministic delay, the proposed algorithm demonstrated good control performance at 10 ms, 20 ms, and 30 ms, surpassing the DDPG baseline and maintaining good stability even under larger time delays. In addition, the proposed algorithm effectively suppressed the first resonance peak between road excitation and body and improved ride comfort. The proposed algorithm includes predictions of future rewards and thus possesses stronger robustness to a certain extent. This condition corresponds to a relatively ideal working environment for the actuator, where a stable fixed delay is desirable. Under actual conditions, however, system delay is often closely related to the actuator's manufacturing capability and response. Therefore, we designed the semi-regular delay condition to approximate this characteristic. The simulation results again reflected the good control performance of the proposed algorithm and its improvement of ride comfort. We believe these results arise because the proposed algorithm, based on predicting the future, plans the output of the control force, keeping it within a small range of variation that better aligns with the actuator's response characteristics. Furthermore, uncertain delay represents a relatively harsh working condition, and simulations under such conditions better test the performance of the proposed algorithm. Even under these conditions, the proposed algorithm maintained a 37.56% optimization effect. We believe this is because, for the infinite-dimensional delayed control system, the data-driven algorithm bypasses analysis at the high-dimensional level and directly performs end-to-end analysis. To a certain extent, this corresponds to re-architecting a solver that simulates a high-dimensional environment; this approach is novel and effective, although further research is needed to verify its effectiveness. It is encouraging that Baek et al. [54] and Li et al. [55] have applied DRL to robotics, Zhu et al. [56] have applied it to mobile sensor networks, and Chen et al. [57] have generalized it further to continuous control. At the same time, it is also important to focus on choosing algorithms better suited to the control of delayed systems. In recent years, PPO [58] has been widely applied in industry due to its robustness and performance; its characteristics in parameter tuning and policy updating provide new ideas that will be a valuable direction for future studies.
Furthermore, the generalization ability of learning-based algorithms has always been a controversial issue. For this reason, we conducted simulations at different speeds; the results showed that the proposed algorithm maintained over 30% comfort optimization from 10 m/s to 50 m/s, which covers almost all realistic driving speeds. Moreover, to apply the proposed algorithm in the real world, its complexity and real-time computational performance must be considered. The trained DRL controller has 33,537 parameters and takes only 5.6 ms to compute on a typical PC (Intel Core i9-12900KF, 16 GB RAM). Therefore, real-time implementation of the proposed controller is not a problem.
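As a rough way to reason about controller size and latency, a fully connected network's parameter count and forward pass can be sketched as below; the layer sizes used here are hypothetical, not the architecture of Figure 5.

```python
import numpy as np

def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network."""
    return sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))

def forward(x, weights, biases):
    """Plain ReLU MLP forward pass with a linear output layer; timing this
    call (e.g., with time.perf_counter) gives a rough latency estimate."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(0.0, x @ W + b)
    return x @ weights[-1] + biases[-1]
```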

Conclusions
This paper proposed an active suspension DRL control algorithm considering time delay to study the uncertain delay problem in the actual control system of active suspension. Firstly, a dynamics model of the active suspension system considering time delay was established. Secondly, the TD3 algorithm was enhanced by incorporating delay, enabling the agent to explore more robust policies. Finally, simulation experiments were conducted under three different experimental conditions: deterministic delay, semi-regular delay, and uncertain delay. The proposed algorithm's control performance was evaluated, and experimental validation was performed at various speeds. The results illustrate the algorithm's effectiveness in mitigating the impact of uncertain delay on the active suspension system, resulting in significant improvements in ride comfort optimization. Specifically, the proposed algorithm achieved comfort optimization rates of 43.58%, 44.9%, and 32.28% for deterministic delays of 10 ms, 20 ms, and 30 ms, respectively. Additionally, it obtained optimization rates of 44.13% and 37.56% for the semi-regular and uncertain delay conditions, respectively. Furthermore, compared to the DDPG baseline algorithm, the proposed algorithm demonstrates excellent stability and convergence even under complex delay conditions.
Despite satisfactory results in the current research, the important characteristic of time delay still requires further investigation.In future work, we aim to enhance our understanding of the relationship between delay and control performance by incorporating an integrated system model that accounts for actuator dynamics into the DRL environment.

where m_b is the sprung mass (kg); m_u is the unsprung mass (kg); k_b is the spring stiffness (N/m); k_u is the equivalent tire stiffness (N/m); c_b is the equivalent damping factor of the suspension damping element (N·s/m); x_b is the vertical displacement of the sprung mass (m); x_u is the vertical displacement of the unsprung mass (m); and w is the vertical tire displacement, which can also be equated to the road displacement (m).
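With these symbols, the standard quarter-car dynamics can be sketched as a state derivative; the function below follows the usual sign conventions for this model, and the parameter values used to exercise it are illustrative rather than Table 1's.

```python
def quarter_car_deriv(state, u, w, p):
    """State derivative of a quarter-car model, state = (x_b, v_b, x_u, v_u),
    with active control force u and road displacement w.
    p holds m_b, m_u, k_b, k_u, c_b as defined in the text."""
    x_b, v_b, x_u, v_u = state
    f_s = p["k_b"] * (x_b - x_u) + p["c_b"] * (v_b - v_u)  # suspension force
    a_b = (-f_s + u) / p["m_b"]                            # sprung-mass acceleration
    a_u = (f_s - p["k_u"] * (x_u - w) - u) / p["m_u"]      # unsprung-mass acceleration
    return (v_b, a_b, v_u, a_u)
```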
In this study, the time-domain road unevenness curve used to train the agent is shown in Figure 2.
The agent interacts with the environment by observing the state s_t and using this information to select the action a_t. The environment accepts the action and transitions to the next state s_{t+1}, then feeds the next state and its reward r_t back to the agent. The cycle repeats and iterates continuously until the environment terminates. RL learns the optimal policy by maximizing the total reward over this process.
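The interaction loop just described can be sketched generically; the `env`/`policy` interface below is illustrative, not the paper's implementation.

```python
def rollout(env, policy, max_steps):
    """Observe s_t, select a_t = policy(s_t), receive (s_{t+1}, r_t),
    and accumulate the total reward until termination."""
    s = env.reset()
    total = 0.0
    for _ in range(max_steps):
        a = policy(s)
        s, r, done = env.step(a)
        total += r
        if done:
            break
    return total
```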

Figure 5. Critic and actor network. (a) Network architecture created for the critic, and (b) network architecture created for the actor. It should be noted that the last fully connected layer in the critic network directly outputs the result without the need for activation.



Figure 6. Control performance with deterministic time delay. (a–c) Vehicle body acceleration curves under 10 ms, 20 ms, and 30 ms delay, respectively. (d) Frequency response of ẍ_b/w.


Figure 7. Control performance with semi-regular time delay. (a) Vehicle body acceleration curve under semi-regular time delay. (b) Frequency response of ẍ_b/w.


Figure 8. Control performance with uncertain time delay. (a) Vehicle body acceleration curve under uncertain time delay. (b) Frequency response of ẍ_b/w.
Algorithm excerpt (TD3 with delay):
2: Initialize target networks θ_{Q1}′ ← θ_{Q1}, θ_{Q2}′ ← θ_{Q2}, θ_µ′ ← θ_µ
…
Select action a_t = µ(s_t | θ_µ) + ξ_t according to the current policy and exploration noise.
9: Execute action a_t; observe reward r_t and the new state s_{t+1}.
10: Store transition tuple (s_t, a_t, r_t, s_{t+1}) in Ω.
11: Sample a mini-batch of N transitions (s_i, a_i, r_i, s_{i+1}) from Ω.
12: Set y …
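The truncated step 12 ("Set y …") is TD3's clipped double-Q target; a minimal sketch of that computation:

```python
def td3_target(r, q1_next, q2_next, gamma=0.99, done=False):
    """y = r + γ · min(Q1'(s', a'), Q2'(s', a')) for a non-terminal next state;
    taking the smaller twin-critic estimate curbs overestimation bias."""
    q_min = min(q1_next, q2_next)
    return r + (0.0 if done else gamma) * q_min
```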

Table 3. RMS values of body acceleration in deterministic delayed conditions.


Table 4. RMS values of body acceleration at different speeds.