Article

A Heterogeneous Time-Series Soft Actor–Critic Method for Quadruped Locomotion

School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
Drones 2025, 9(8), 569; https://doi.org/10.3390/drones9080569
Submission received: 17 June 2025 / Revised: 3 August 2025 / Accepted: 6 August 2025 / Published: 12 August 2025

Abstract

The locomotion control of unmanned quadruped robots has been one of the greatest challenges in robotics. Deep reinforcement learning has achieved great success in robot control. However, extracting effective features from historical information to improve locomotion agility is still an open and challenging problem. In this paper, a heterogeneous time-series soft actor–critic (HTS-SAC) method is proposed to enable better policy learning from historical data. Firstly, four mutual information decision conditions are developed for feature selection; they analyze the correlation between input states and motion performance and quantify the importance of temporal features of different lengths. Then, according to the feature optimization results, a novel heterogeneous time-series neural network and the HTS-SAC locomotion control method are designed. Finally, the effectiveness of the proposed method is validated on different terrains using a Laikago quadruped robot simulation model.

1. Introduction

Compared with wheeled and tracked robots, legged robots have better terrain adaptability and mobility in complex dynamic environments and on rugged terrain [1]. In the future, unmanned quadruped robots may replace human beings in a variety of dangerous and complex environments. Therefore, enabling unmanned quadruped robots to move as flexibly and quickly as animals has become an active research topic.
Several control methods exist for the stable multi-contact locomotion of legged robots. The model predictive control (MPC) method shows strong advantages in control accuracy and robustness [2,3]. Robust MPC and other MPC variants are widely used in fields such as unmanned aerial vehicles and unmanned surface vehicles [4,5]. Because unmanned quadruped robots can freely control their foot placement, they can quickly traverse a variety of cluttered obstacles and terrains while also maintaining high speed on flat ground [6]. For quadruped robots, MPC-based torque control was proposed by the MIT team to generate trotting, galloping, and other gaits [7]. They then combined whole-body control (WBC) with MPC to realize high-speed dynamic locomotion [8], where the MPC finds the optimal reaction-force distribution over a longer time horizon with a simple model, and the WBC calculates the torque, position, and velocity commands of the joints according to the reaction forces computed by the MPC.
Furthermore, wheeled quadruped robots possess excellent terrain adaptability due to their unique structure. A model-based whole-body torque control system for tracking the center-of-mass (CoM) motion of wheeled quadruped robots was proposed in [9]. By integrating the wheel kinematic model with the robot’s CoM momentum/dynamics model, a wheel motion controller was developed. In [10], a horizontal stability control framework for wheel–leg hybrid robots was proposed to maintain stable motion on rough terrain. Therefore, wheel–leg quadruped robots have advantages such as efficient movement and wide terrain adaptability.
Additionally, the approximate dynamic programming (ADP) and approximate cell decomposition (ACD) methods are widely used in path planning and following control of robots. An event-triggered method for scheduling ADP controllers was proposed for path following control of robots in [11]. The value iteration was used for modular reconfigurable robots with the ADP algorithm in [12]. The graph search algorithm was developed for path planning of robots in unknown terrain in [13]. In addition, the kernel-based ADP method was used for autonomous vehicle stability control in [14]. An innovative and effective constrained finite-horizon ADP algorithm was proposed for autonomous driving in [15], and the improved A* algorithm and time window algorithm were designed to complete path planning in [16].
However, legged robots are high-dimensional non-smooth systems with several physical constraints, and their dynamic and kinematic models cannot be obtained accurately in complex environments. As a result, model-based control methods, including MPC, can have strong limitations in unknown dynamic environments.
Reinforcement learning (RL), as a data-driven method, can overcome the limitations of the model-based methods through the interactive learning between the agents and the environments, which is a promising control method for robots [17,18,19]. It does not need exact dynamic modeling of unmanned quadruped robots but can greatly improve control performance.
RL was applied to generating stable gaits for the AIBO robot as early as 2000 [20]. Model-free methods were developed in [21] to learn motor velocity trajectories and high-level control parameters for jumping locomotion.
In addition, two-stage reinforcement learning was proposed as a general framework for creating robust policies, in which the two stages learn different training contents [22]. Agile locomotion was formulated as a multi-stage learning problem in [23], in which a mentor guides the agent during training; once the student policy can solve the task with the mentor’s guidance, it is further trained to perform the task without the mentor. In [24], the reference trajectory, inverse kinematics, and a transformation loss are incorporated into the reinforcement learning training process as prior knowledge. However, the chosen reference trajectory indirectly restricts the final training results, so demonstration-based methods adapt poorly to different complex terrains.
The deep reinforcement learning (DRL) method is generally selected as the high-level planner, and the traditional control method is applied for tracking control. An unmanned quadruped robot learning locomotion system based on a hierarchical learning framework was proposed in [25], where RL as a high-level policy is used to adjust the underlying trajectory generator to better adapt to the terrain. Combining terrain perception with locomotion planning, a hierarchical learning framework was developed for unmanned quadruped robots to move in challenging natural environments in [26], where the global height map of the terrain was used as the visual information of the DRL to determine the footholds needed for the leg swing and body posture. In the method proposed by Hwangbo in [27], the high-level controller is the DRL, and the low-level controller is the deep neural network. Compared with the traditional low-level controller, the deep neural network controller has a higher control frequency.
The motion control methods for quadruped robots based on reinforcement learning face numerous challenges. Traditional challenges include high-dimensional state and action spaces, low sample efficiency, and difficulty in simulation-to-reality transfer. The motion of quadruped robots is a highly dynamic temporal process (e.g., body swinging trends, changes in contact states, etc.), where current actions depend on historical states (e.g., past joint angles, IMU data, ground contact forces, etc.). Traditional reinforcement learning methods (including SAC) usually only take the "current state" as input, ignoring historical temporal correlations. This makes it difficult for the model to capture the continuity and dynamic trends of motion, thereby affecting control performance. In addition, the state of a quadruped robot includes dozens or even hundreds of dimensions of data such as multi-joint angles, angular velocities, IMU data, and ground contact states. High-dimensional raw features have redundancy and noise, and the input layers of traditional RL methods struggle to directly extract effective information from them, easily falling into the "curse of dimensionality." This results in the model having weak capability to capture key dynamic features (e.g., body tilting trends). Therefore, insufficient utilization of historical temporal information and the difficulty in effectively extracting features from high-dimensional state spaces are also significant challenges.
This paper proposes a feature optimization method based on correlation analysis, which is designed to optimize the dimensionality of data features while making full use of historical data. It also presents a heterogeneous time-series soft actor–critic (HTS-SAC) quadruped robot control method, which improves the performance of traditional RL methods by learning high-level motion features from heterogeneous historical time series. Finally, the effectiveness of the proposed methods is verified through extensive comparative experiments. The main contributions of this work are as follows:
A feature selection method based on k-nearest neighbor mutual information is proposed. The four designed mutual information decision conditions analyze the dimensions and time-series correlations of different features, and optimize the length of the time-series data. The feature selection method improves the utilization efficiency of historical time-series data and optimizes the feature dimension, avoiding the problem of dimension explosion.
A novel HTS-SAC control method is designed for unmanned quadruped robots. Based on the results of feature selection, the neural network as the input layer is designed and fused with the SAC to construct the HTS-SAC model, which can learn the optimal strategy from historical time-series data.
Experiments are carried out with the Laikago robot simulation model on four different terrains, and the performance of the proposed method is verified using a comparison with other DRL methods.
The rest of this paper is organized as follows: Section 2 is the design of the feature selection scheme, and Section 3 presents the detailed HTS-SAC algorithm. In Section 4, the simulation test results are given to demonstrate the effectiveness of the proposed method. Finally, Section 5 presents the conclusion.

2. Feature Selection Method

The current DRL-based methods overlook the useful information in historical time-series data, which can serve as a reliable source for policy optimization. Figure 1 shows the overall framework of the HTS-SAC algorithm. First, we adopt the traditionally trained SAC method to construct the initial experience replay buffer for the quadruped robot. Second, to address the low utilization of historical time-series data, we design four decision criteria based on k-nearest neighbor mutual information and perform feature selection on the data in the experience replay buffer. Finally, based on the new feature sequences obtained from feature selection, we design an HTS-SAC model incorporating a heterogeneous time-series neural network to learn a better policy.

2.1. Mutual Information Theory

Mutual information quantifies the degree of correlation between variables; the concept originates from entropy in information theory.
Entropy expresses the degree of uncertainty of a random variable and the amount of information it carries, and is often called information entropy. Let X be a discrete random variable with range D = {x_i | i = 1, …, N}; then its entropy is defined as
H(X) = −∑_{i=1}^{N} P(x_i) log P(x_i),
where P(x_i) denotes the probability that X takes the value x_i. When the base of the logarithm is 2, the unit of entropy is the bit.
Set X and Y as two discrete random variables; then, the mutual information between X and Y is defined as
MI(X, Y) = H(X) + H(Y) − H(X, Y).
The k-nearest neighbor mutual information method performs the estimation directly in the sample space and can compute high-dimensional mutual information. Assuming that there are N feature sample points q_i = (x_i, y_i) in Q = (X, Y), i = 1, …, N, the k-nearest neighbor mutual information of random variables X and Y is defined as
MI(X, Y) = ψ(k) − [ψ(n_x + 1) + ψ(n_y + 1)] + ψ(N),
where n_x denotes the number of sample points whose distance from point x_i is less than ε_i/2. Here, ε_i/2 denotes the distance between q_i and its k-th nearest neighbor, and n_y is defined similarly. ψ(k) is the digamma function, computed by the following iteration:
ψ(1) = −0.5772157,
ψ(k + 1) = ψ(k) + 1/k.
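To make the estimator concrete, the following Python sketch (our own illustration, not the authors' code) computes the k-nearest-neighbor mutual information of two one-dimensional samples using the digamma recursion above; the function names and the choice k = 3 are assumptions, and the extension to the multi-dimensional feature sets used later follows the same pattern.

```python
import numpy as np


def digamma_int(k):
    """Digamma at positive integers via the recursion above:
    psi(1) = -0.5772157 (negative Euler-Mascheroni constant), psi(k+1) = psi(k) + 1/k."""
    val = -0.5772157
    for i in range(1, int(k)):
        val += 1.0 / i
    return val


def knn_mutual_information(x, y, k=3):
    """k-nearest-neighbor estimate of MI(X, Y) for one-dimensional samples (sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    z = np.column_stack([x, y])
    # Chebyshev (max-norm) distances between joint sample points q_i = (x_i, y_i).
    dz = np.max(np.abs(z[:, None, :] - z[None, :, :]), axis=-1)
    np.fill_diagonal(dz, np.inf)
    eps = np.sort(dz, axis=1)[:, k - 1]        # distance from q_i to its k-th neighbor (eps_i / 2)
    # n_x, n_y: marginal neighbors strictly closer than eps_i / 2.
    dx = np.abs(x[:, None] - x[None, :])
    dy = np.abs(y[:, None] - y[None, :])
    np.fill_diagonal(dx, np.inf)
    np.fill_diagonal(dy, np.inf)
    n_x = (dx < eps[:, None]).sum(axis=1)
    n_y = (dy < eps[:, None]).sum(axis=1)
    mi = (digamma_int(k) + digamma_int(n)
          - np.mean([digamma_int(a + 1) + digamma_int(b + 1) for a, b in zip(n_x, n_y)]))
    return max(mi, 0.0)
```

The max-norm metric is used in the joint space so that the marginal neighbor counts n_x and n_y are consistent with the distance ε_i/2 defined above.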

2.2. Four Mutual Information Decision Conditions

Based on the k-nearest neighbor mutual information theory, we design four key feature decision conditions: the state feature correlation mutual information MI(X, Y), the mutual information change rate V_ΔMI(X, Y), the time delay correlation mutual information MI(X_{t−d}, Y), and the redundancy mutual information MI(X_i, X − X_i).
a. The state feature correlation mutual information. Set the input variables as X ; then, the mutual information between the output variable action reward Y and the input variable set X is defined as follows:
MI(X, Y) = ψ(k) − [ψ(n_x + 1) + ψ(n_y + 1)] + ψ(N).
By calculating the mutual information values of different state variables, we can analyze the influence of each input state variable on the locomotion state of the robot. Set the correlation threshold as δ_1 > 0. When MI(X, Y) > δ_1, the current variable is strongly correlated with the motion performance and is a key variable affecting the locomotion state of the robot.
b. The mutual information change rate. Select a variable X_1 as the reference variable, and combine it with the other input state variables one by one to form new sets X_{1+i}, i = 2, …, m. The mutual information between Y and the reference variable X_1 and between Y and each new set X_{1+i} is then calculated, and the change rate of mutual information is obtained from the difference between the two values. The rate of change of mutual information is defined as follows:
V_ΔMI(X, Y) = MI(X_1, Y)^2 − MI(X_{1+i}, Y)^2.
According to information entropy, when X_i is an independent variable, V_ΔMI(X, Y) decreases; when X_i is a key variable, V_ΔMI(X, Y) increases. A correlation threshold δ_2 is set to determine whether X_i is a related variable: when V_ΔMI(X, Y) > δ_2, the newly added variable X_i is a related variable.
c. The time delay correlation mutual information. Let Y be the action reward associated with the current state information X_t, and define X_{t−d}, d = 1, 2, …, n, as the state information delayed by d steps. The delay mutual information of the state information is defined as follows:
MI(X_{t−d}, Y) = ψ(k) − [ψ(n_{x_{t−d}} + 1) + ψ(n_y + 1)] + ψ(N).
Here, the delay correlation threshold is set as δ_3 > 0. When MI(X_{t−d}, Y) > δ_3, the delayed state feature is strongly correlated with the reward.
d. The redundancy mutual information. After selecting the key feature variables through the mutual information and the mutual information change rate, it is necessary to analyze the redundancy of the feature information. Removing redundant variables can effectively reduce the data dimension while improving the training speed. The redundancy mutual information between X_i and the remaining feature set X − X_i is defined as follows:
MI(X_i, X − X_i) = ψ(k) − [ψ(n_{x_i} + 1) + ψ(n_{x−x_i} + 1)] + ψ(N).
Here, the redundancy threshold is set as δ_4 > 0. When MI(X_i, X − X_i) > δ_4, the variable shares most of its information with the remaining features, i.e., it is redundant and is eliminated; only variables with MI(X_i, X − X_i) < δ_4 are retained.
The final optimized feature results need to meet all four decision conditions simultaneously. In this way, the proposed method can enhance the effective utilization of the historical information while optimizing features to avoid dimension explosion.
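As a minimal sketch (not the authors' implementation), the conjunction of the four conditions can be expressed as a single filter over per-variable mutual information scores; the default thresholds below are the values chosen later in Section 4.1 and are otherwise assumptions.

```python
import numpy as np


def select_key_features(mi, v_dmi, mi_delay, mi_red,
                        d1=0.3, d2=0.0, d3=0.3, d4=0.27):
    """Conjunction of the four decision conditions (a)-(d).

    All arguments are per-variable arrays of the corresponding mutual-information
    scores; a variable is kept only if all four conditions hold simultaneously.
    Returns the indices of the selected (key) variables."""
    mi, v_dmi, mi_delay, mi_red = map(np.asarray, (mi, v_dmi, mi_delay, mi_red))
    keep = (mi > d1) & (v_dmi > d2) & (mi_delay > d3) & (mi_red < d4)
    return np.flatnonzero(keep)
```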

3. Gait Learning Method by HTS-SAC

3.1. Heterogeneous Time-Series Method

In this subsection, we design a heterogeneous time-series neural network based on feature selection. We integrate heterogeneous time-series neural networks with the SAC to construct the HTS-SAC model, which can use heterogeneous time-series datasets and learn high-level features. The structure of the heterogeneous time-series neural network is shown in Figure 2.
The detailed design of the network is as follows. According to the correlation and redundancy mutual information, a different time delay d is assigned to each input variable X_i to form the heterogeneous time-series input layer S. When X_i is strongly correlated, the length of its historical time series is increased appropriately so that more historical feature information is used. When X_i is weakly correlated, the length is reduced appropriately to shorten the training time of the model. The heterogeneous time-series neural network is thus designed according to the importance of historical information in different dimensions.
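A minimal sketch of such an input layer is given below, assuming each state variable X_i has been assigned a series length d by the feature selection step; the class and method names are our own, and the implementation is illustrative only.

```python
import numpy as np
from collections import deque


class HeterogeneousTimeSeriesInput:
    """Builds the non-uniform-length input layer S from per-variable history lengths.

    `delays` maps each state index i to its time-series length d (e.g., a long series
    for strongly correlated variables and a short one for the rest); this class is an
    illustrative sketch, not the authors' implementation."""

    def __init__(self, delays):
        self.delays = dict(delays)
        self.history = deque(maxlen=max(self.delays.values()))

    def reset(self, s0):
        """Fill the history with the initial state so the first input is well defined."""
        self.history.clear()
        for _ in range(self.history.maxlen):
            self.history.append(np.asarray(s0, dtype=np.float32))

    def step(self, s):
        """Append the newest state and return the heterogeneous input vector."""
        self.history.append(np.asarray(s, dtype=np.float32))
        stacked = np.stack(self.history)                    # shape: (max_d, state_dim)
        parts = [stacked[-d:, i] for i, d in self.delays.items()]
        return np.concatenate(parts)                        # input layer S
```

Stacking only the last d samples of each variable yields a flat vector whose length equals the sum of the per-variable delays, which is how the non-uniform input dimension reported in Section 4.1 arises.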

3.2. HTS-SAC Algorithm

The soft actor–critic (SAC) is an off-policy DRL algorithm [28]. Entropy regularization is introduced into policy training to enlarge the policy search space. However, the conventional SAC method only considers the current state S_t and the previous state S_{t−1}, while ignoring the impact of longer historical time-series data S_{t−m} (m = 1, 2, …, n). To resolve this issue, we design an HTS-SAC control framework for unmanned quadruped robots that integrates heterogeneous time-series neural networks. As a result, the actor network of HTS-SAC has the capability to learn the features of state variables from heterogeneous historical time series.
The observation space S t should contain all observable information related to the task. All the observation elements at t = t i are defined as
S_{t_i} = {p_{t_i}, ϕ_{t_i}, ω_{t_i}, ψ_{t_i}, δ_{t_i}, Θ_{t_i}},
where p_{t_i} is the robot base position, ϕ_{t_i} and ω_{t_i} are the base angles and angular velocities, ψ_{t_i} and δ_{t_i} are the joint angles and positions, and Θ_{t_i} is the foot contact detection.
A non-deterministic control policy π gives the conditional probability density π ( a t s t ) of taking action a t in the state s t . The formula for calculating entropy H is
H(π(·|s_t)) = E_π[−ln π(·|s_t)].
The policy entropy is a measure of the randomness of the actions that an agent can take in a given state. Higher entropy indicates a policy with higher uncertainty and stronger exploration ability. By introducing the entropy H(π(·|s_t)) as part of the reward r, the agent receives an additional entropy-related reward at each exploration step. The maximum-entropy reward function J(π) is defined as
J(π) = E_π[ ∑_{t=0}^{T} ( r(s_t, a_t) + α H(π(·|s_t)) ) ],
where r is the reward for the agent taking action a_t in state s_t, and α is the entropy regularization coefficient, which controls the randomness of the optimal policy and the importance of the policy entropy term relative to the reward. The state–action value function Q(·) is defined as
Q_t^π(s_t, a_t) = r(s_t, a_t) + E_π[ ∑_{l=1}^{T} r(s_{t+l}, a_{t+l}) − α log π(a_{t+l}|s_{t+l}) ].
According to Q_t^π(s_t, a_t), the state value function V(·) is defined as
V_t^π(s_t) = E[ Q_t^π(s_t, a_t) − α log π(a_t|s_t) ].
According to (13) and (14), Q_t^π(s_t, a_t) can be expressed in terms of V_t^π(s_t):
Q_t^π(s_t, a_t) = r(s_t, a_t) + E_{s_{t+1}∼p(s_{t+1}|s_t, a_t)}[ V_{t+1}^π(s_{t+1}) ],
where (15) is also known as the soft Bellman equation.
For continuous state–action tasks, it is necessary to introduce function approximation. Define the parameter of the soft state value function V(·) as ϕ, the parameter of the state–action value function Q(·) as ψ, and the parameter of the policy function π(·|s) as θ. According to the definition of the state value function, the loss function L_V(ϕ) of V_t^π(s_t) is obtained:
L_V(ϕ) = E_{s∼p(s)}[ ( V_ϕ(s) − E_{a∼π(a|s)}[ Q(s, a) − α log π(a|s) ] )^2 ].
The action a is produced by the actor network under the policy π(·|s). Here, a is computed by a neural network containing noise:
a = f θ ( ϵ ; s ) ,
where ϵ is the input noise vector, sampled from a fixed distribution (such as a spherical Gaussian distribution).
According to the soft Bellman equation, the loss function L_Q(ψ) of the Q-value function Q_t^π(s_t, a_t) is defined as follows:
L_Q(ψ) = E_π[ ( Q_ψ(s, a) − ( r(s, a) + γ V_ϕ(s′) ) )^2 ],
where s′ denotes the next state.
Equation (12) is rewritten as a reward function in the form of KL-divergence:
J(π) = E_{s_0∼p(s_0)}[ −D_KL( π(·|s_0) ∥ exp( (1/α) Q_0^π(s_0, ·) ) ) ] + b,
where b is a constant. The loss function L π ( θ ) of the updated policy is obtained by maximizing the reward J ( π ) :
L_π(θ) = E_{s∼p(s)}[ E_{a∼π(a|s)}[ α log π_θ(a|s) − Q_ψ(s, a) ] ].
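The three loss functions can be summarized in a short PyTorch sketch (our own illustration under stated assumptions: actor(s) is assumed to return a reparameterized action together with its log-probability, alpha = 0.2 and gamma = 0.99 are placeholder values, and the heterogeneous time-series input layer of Section 3.1 is assumed to have already been applied to the states):

```python
import torch
import torch.nn.functional as F


def sac_losses(batch, actor, critic, value, value_target, alpha=0.2, gamma=0.99):
    """One-step SAC losses corresponding to L_V, L_Q, and L_pi above (generic sketch,
    not the authors' code). `critic`, `value`, and `value_target` return one scalar
    per sample."""
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # L_V: V(s) should match E[Q(s, a_pi) - alpha * log pi(a_pi | s)] under the policy.
    a_pi, logp = actor(s)
    v_target = (critic(s, a_pi) - alpha * logp).detach()
    loss_v = F.mse_loss(value(s), v_target)

    # L_Q: soft Bellman backup using the target value network on the next state.
    q_backup = (r + gamma * value_target(s_next)).detach()
    loss_q = F.mse_loss(critic(s, a), q_backup)

    # L_pi: minimize alpha * log pi(a_pi | s) - Q(s, a_pi) with reparameterized actions.
    loss_pi = (alpha * logp - critic(s, a_pi)).mean()
    return loss_v, loss_q, loss_pi
```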
Finally, the gradient descent method is used to minimize the above three loss functions L_V(ϕ), L_Q(ψ), and L_π(θ), and the optimal policy π* is obtained. The detailed steps of the HTS-SAC method are shown in Algorithm 1.
Algorithm 1 HTS-SAC algorithm
Require: Initialize the actor network parameter θ, the critic network parameters ϕ, ψ, the target network parameter ϕ_targ, and the experience pool D.
for each iteration do
   for each environment step do
      a_t ∼ π_θ(a_t|s_t)
      s_{t+1} ∼ p(s_{t+1}|s_t, a_t)
      D ← D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})}
   end for
   while experience pool D is full do
      for each variable do
         X ← { X_i : MI(X, Y) > δ_1 ∧ V_ΔMI(X, Y) > δ_2 ∧ MI(X_{t−d}, Y) > δ_3 ∧ MI(X_i, X − X_i) < δ_4 }
      end for
      Use X to update the experience pool D
      Construct the HTS neural network based on the updated D
   end while
   for each gradient step do
      ϕ ← ϕ − λ_V ∇̂_ϕ J_V(ϕ)
      ψ_i ← ψ_i − λ_Q ∇̂_{ψ_i} J_Q(ψ_i) for i ∈ {1, 2}
      θ ← θ − λ_π ∇̂_θ J_π(θ)
      ϕ_targ ← ρ ϕ + (1 − ρ) ϕ_targ
   end for
end for

4. Numerical Simulation

In this section, we compare the twin delayed deep deterministic policy gradient (TD3), the deep deterministic policy gradient (DDPG), the proximal policy optimization (PPO), the traditional SAC, and the time-delay soft actor–critic (TD-SAC) with the HTS-SAC proposed in this paper. The TD-SAC method is a comparative baseline we constructed; it only utilizes a large amount of historical state data without feature optimization. We use the PyBullet robot simulation system for testing. PyBullet provides detailed rendering and collision detection and can simulate the real world with high fidelity. In addition, it is packaged as a Python (v3.8.2) module for robot simulation and experimentation and provides forward/inverse kinematics, forward/inverse dynamics, collision detection, ray intersection queries, and other functions. The parameters of PyBullet are shown in Table 1. The Laikago-a1 quadruped robot is employed for the simulation tests. We set the actuator delay to 2 ms. The joint actuators use position control with gains kp = 60 and kd = 1. In addition, to simulate the real robot environment, we also add noise to the data collected from the sensors.
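For reference, a minimal PyBullet setup consistent with Table 1 might look as follows; the URDF path, the use of PyBullet's built-in position controller for the kp/kd gains, and the noise magnitude are assumptions on our part rather than details taken from the authors' code.

```python
import numpy as np
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                                    # headless physics server
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0.0, 0.0, -9.8)
p.setTimeStep(0.00416)                                 # fixed time step from Table 1
p.setPhysicsEngineParameter(numSolverIterations=50)    # solver iterations from Table 1

plane = p.loadURDF("plane.urdf")
robot = p.loadURDF("laikago/laikago.urdf", [0.0, 0.0, 0.5])   # path assumed; Laikago ships with pybullet_data


def apply_position_command(joint_indices, targets, kp=60.0, kd=1.0):
    """Position control of the 12 leg joints with the gains reported in the paper."""
    for j, q in zip(joint_indices, targets):
        p.setJointMotorControl2(robot, j, p.POSITION_CONTROL,
                                targetPosition=q, positionGain=kp, velocityGain=kd)


def noisy(measurement, sigma=0.01):
    """Additive Gaussian sensor noise; sigma is chosen here for illustration only."""
    return measurement + np.random.normal(0.0, sigma, size=np.shape(measurement))
```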
The flowchart of the HTS-SAC program is shown in Figure 3. First, we use the traditional SAC method to collect 10,000 data entries in the initial experience replay buffer (denoted as D), with the training scenario set on flat ground. Second, the data in D are fed into the four mutual information decision criteria, and through feature optimization and selection, new time-series features and an updated experience replay buffer are generated. Then, based on the new time-series features after feature optimization, a heterogeneous time-series neural network is designed. Finally, the heterogeneous time-series neural network is incorporated into the SAC framework to form the HTS-SAC model, and the final model is obtained through iterative training for testing.
The main difference between the HTS-SAC and the SAC is the mutual information feature selection and the heterogeneous time-series neural network, as shown in the red dashed box in Figure 3.
We choose the input of the SAC policy network as the initial state set X, which is the observation information obtained from four sensors of the Laikago-a1 quadruped robot. The content of the set X is as follows: the data of the base-displacement sensor in three orientations (p_x, p_y, p_z); the angle and angular velocity data of the IMU sensor (r_x, r_y, r_z, r_{x,rate}, r_{y,rate}, r_{z,rate}); the joint angles of the 12 joints (j_{1a}, j_{2a}, …, j_{12a}); the joint positions of the 12 joints (j_{1p}, j_{2p}, …, j_{12p}); and the dataset (c_{FL}, c_{FR}, c_{BL}, c_{BR}) from the foot-contact sensor, which indicates whether each foot touches the ground.
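Counting these elements gives 3 + 6 + 12 + 12 + 4 = 37 dimensions; a simple sketch of assembling this state vector (the sensor ordering is assumed) is shown below.

```python
import numpy as np


def build_state(base_pos, base_rpy, base_rates, joint_angles, joint_positions, foot_contacts):
    """Assemble the 37-D initial state set X:
    3 base displacements + 3 base angles + 3 angular velocities
    + 12 joint angles + 12 joint positions + 4 foot-contact flags = 37 (a sketch)."""
    x = np.concatenate([base_pos, base_rpy, base_rates,
                        joint_angles, joint_positions, foot_contacts]).astype(np.float32)
    assert x.shape == (37,)
    return x
```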
The action space is the output of the actor network, which corresponds to the position information of the 12 joints of the unmanned quadruped robot. The reward function is designed as follows:
r_t = X_1(t) − X_1(t − 1),
where the reward r_t is expressed as the difference between the X-axis displacement at the current time and that at the previous time. Note that the reward in (19) is a universal design that reduces the impact of the reward function on the comparative studies.

4.1. Feature Selection of Heterogeneous Time Series

In the feature selection simulation, 10,000 sets of data from the initial experience pool are used as the training data. The test results of the state feature correlation mutual information MI(X, Y), the mutual information change rate V_ΔMI(X, Y), and the redundancy mutual information MI(X_i, X − X_i) are shown in Figure 4. The red line represents the state feature correlation mutual information MI(X, Y). It can be observed that when MI(X, Y) > 0.3, the corresponding variable is strongly correlated, which is the key variable we need. That is, X_1 (position of the X-axis) and X_10 (angular velocity of the Z-axis) have strong correlations. The blue line indicates the mutual information change rate V_ΔMI(X, Y). We choose X_37 as the benchmark variable and combine it with X_i, i = 1, 2, …, 36, to form new sets X_{37+i}. The change rate of mutual information is determined by calculating the difference between the mutual information values of X_{37+i} and X_37. When V_ΔMI(X, Y) > 0, the newly added variable X_i is a correlated variable. The yellow line is the redundancy mutual information MI(X_i, X − X_i). When MI(X_i, X − X_i) > 0.27, the variable X_i is redundant and should be eliminated. According to the mutual information results in Figure 4, we set δ_1 = 0.3, δ_2 = 0, and δ_4 = 0.27. As a result, the key feature variables are determined to be (X_1, X_10, X_16, X_18, X_19, X_21, X_22).
The time delay correlation mutual information MI(X_{t−d}, Y) for delay lengths d ∈ [1, 10] can be seen in Figure 5. When d = 1 and d = 2, the corresponding variables have strong correlations. As d increases, the correlation of the variables gradually decreases. According to the curve of MI(X_{t−d}, Y), we set the delay correlation threshold δ_3 to 0.3. When MI(X_{t−d}, Y) > δ_3, the variable has a strong correlation and should be retained.
According to the mutual information decision conditions, the key features that affect policy learning can be obtained. Therefore, we assign X_1 and X_10 a long historical time series with d = 6. X_16, X_18, X_19, X_21, and X_22 are set to a medium-length historical time series with d = 3. The other variables are set to a short historical time series with d = 2. In this way, a heterogeneous time-series input layer with non-uniform lengths is designed. Then, we set the HTS-SAC input layer dimension to 87 (corresponding to the data with heterogeneous time series). The SAC input layer dimension is 37 (corresponding to a time-series length of 1). The TD-SAC input layer dimension is 370 (corresponding to a time-series length of 10).
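As a consistency check, this delay assignment can be written down with the input-layer sketch from Section 3.1 (the 0-based index mapping is our assumption):

```python
# Delay assignment from the selection above, using 0-based indices (X1 -> 0, X10 -> 9, ...).
delays = {i: 2 for i in range(37)}        # default: short history, d = 2
for i in (0, 9):                          # X1, X10: long history, d = 6
    delays[i] = 6
for i in (15, 17, 18, 20, 21):            # X16, X18, X19, X21, X22: medium history, d = 3
    delays[i] = 3

builder = HeterogeneousTimeSeriesInput(delays)   # class sketched in Section 3.1
print(sum(delays.values()))                      # 2*6 + 5*3 + 30*2 = 87 input dimensions
```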

4.2. Algorithm Simulation

We test the DDPG, PPO, TD3, SAC, TD-SAC, and HTS-SAC methods on different terrains using the Laikago-a1 quadruped robot simulation model. The advantages and disadvantages of the methods are analyzed by comparing the forward running.
The parameters of the agile locomotion control model of the unmanned quadruped robot based on the HTS-SAC algorithm are shown in Table 2. The parameters are continuously optimized during training and testing. Some adjustment guidance and rules are as follows. Layer params: increase sequentially from 128 and do not exceed 1024. Gamma: gamma is the discount factor; a larger gamma places more weight on long-term rewards, and it is usually increased from 0.99 to 0.999. Epochs: usually increase from 100 but generally do not exceed 10^4. Steps per epoch: start from 1000 and generally do not exceed 5000; the product of epochs and steps per epoch gives the total number of steps, which is generally greater than 10^5. When training exceeds 10^6 steps and still does not converge, the parameters are not suitable and the model needs to be retrained. Replay size: usually increases from 10^5 and ranges from 10^5 to 10^6. Batch size: we usually choose a batch size equal to or smaller than the layer params. Learning rate: generally decreased sequentially starting from 1 × 10^−2, ranging from 1 × 10^−2 to 1 × 10^−5.
Figure 6 plots the cumulative training reward curves across 20 runs for each method on the selected map, together with 80% confidence intervals. All algorithms are trained and tested using the Laikago robot under fair conditions. The training results for TD-SAC and SAC are 828.61 and 885.32, respectively. However, the DDPG and PPO methods cannot converge despite extensive network optimization and hyperparameter tuning, and the training performance of TD3 is poor. Although the early rewards of HTS-SAC are lower than those of SAC, it is significantly better than the other methods after 350 epochs. The final cumulative reward is 923.99, which shows that the proposed HTS-SAC method can converge stably and learn the optimal locomotion policy by using heterogeneous time-series data. The numerical results of the cumulative training rewards for each method can be found in Table 3.
We test different TD-SAC network structures to provide fair comparison conditions, with the results shown in Table 4 and Figure 7. The results indicate that increasing the complexity of the network structure leads to an increase in cumulative rewards, with the maximum not exceeding 847. Additionally, when the network structure exceeds (2624, 2624), the model fails to converge, resulting in a reward of only 7.5.
To verify the control performance and agility of the proposed method, we test four different DRL algorithms in four scenarios: flat ground, uphill and downhill slopes, stair waves, and real land, as shown in Figure 8. Figure 9 shows the speed curves of the four methods on flat ground; each method underwent a forward running test of 1000 steps. The link to the robot’s test demonstration video is https://youtu.be/yD8dkU6z52I (accessed on 3 August 2025). The speed tests on the four different terrains are shown in Figure 10. We find that the proposed method has better locomotion speed and agility on all terrains, and the smallest velocity attenuation on complex terrains. The average speeds of the HTS-SAC, SAC, TD-SAC, and TD3 on all terrains are 0.39 m/s, 0.33 m/s, 0.34 m/s, and 0.24 m/s, respectively.
In addition, Figure 11 and Figure 12 show the real-time robot position and angle information collected on flat ground using the proposed method. In particular, the results in Figure 11 show that the deviation of the robot’s locomotion trajectory on the Y-axis is small. The error on the Y-axis is 0.08, and the variance on the Y-axis is 6 × 10^−3. Figure 12 reflects the real-time angle changes in the tests, with all angle changes being less than 0.05 degrees.
Figure 13 shows the angle mean squared error (MSE) of four methods on different terrains, which reflects the stability of the robot. The MSE results of the yaw angles of each method on different terrains are shown in Table 5. The proposed method has the smallest errors on flat ground and stair wave terrain, while SAC has smaller errors on slope land and real land. Compared with the speed test results in Figure 10, although SAC has smaller errors, its speed attenuation is severe on uphill and downhill slopes and real land. In conclusion, the proposed method has the fastest locomotion speed on all terrains while maintaining smaller angle errors, which indicates that the proposed method has enhanced control performance and stability.
The comparative test results of different thresholds are shown in Table 6. The results indicate that when we decrease δ 1 and δ 2 and increase δ 4 , obvious changes occur in feature dimensions and training cumulative rewards. The selection of thresholds currently relies on manual experience and needs to be continuously adjusted according to specific application scenarios.
Figure 14 shows whether each of the four legs is in contact with the ground at each moment. In the figure, black circles indicate that the corresponding leg is in contact with the ground at the current moment, and red circles indicate that it is in the air. During the test, the ground contact information was recorded starting from the moment the robot was set up; therefore, all four legs of the robot are in the air at step 0, and the robot starts to move from step 3. We did not add any periodic constraints or rewards when training the proposed model, which is why the trained quadruped robot exhibits aperiodic foot movements during locomotion.
The time-series movements of the left front leg joint angle and joint position are shown in Figure 15a and Figure 15b, respectively. Our designed reward function and constraints do not include any periodic settings; therefore, the leg movements of the trained robot exhibit aperiodic motions. The time series of joint angles and joint positions are consistent with those in Figure 14, both starting from the robot setup. In addition, we have added the comparative leg movements of SAC, as shown in Figure 16a,b. The motion patterns of other reinforcement learning methods are similar to that of the proposed method, all exhibiting irregular movements.
We tested the performance under different reward conditions. In the new comparative tests, we added the energy consumption of the motor to the reward function. The new reward function can be defined as
R_t = α ( X_1(t) − X_1(t − 1) ) − β P,
where α and β are the coefficients of the forward displacement term and the energy consumption term, respectively. The energy consumption P is defined as
P = ∑_{i=1}^{n} t_i v_i,
where t_i and v_i are the torque and angular velocity of the i-th joint, respectively.
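A minimal sketch of this shaped reward (our own illustration; the coefficient values α = 1 and β = 0.1 are those used in the experiments reported below) is:

```python
def energy_augmented_reward(x_prev, x_curr, torques, joint_velocities,
                            alpha=1.0, beta=0.1):
    """Forward-displacement reward with an energy-consumption penalty (sketch)."""
    power = sum(t * v for t, v in zip(torques, joint_velocities))  # P = sum_i t_i * v_i
    return alpha * (x_curr - x_prev) - beta * power
```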
The results of cumulative training rewards for different reward functions are shown in Figure 17, and the test results of speed and energy consumption on different terrains are shown in Figure 18 and Figure 19. In the experiments, α = 1 and β = 0.1 .
In the figure, Reward 1 refers to the original reward function, and Reward 2 refers to the reward function with added energy consumption. The final training rewards using Reward 1 and Reward 2 are 923 and 914, respectively. After adopting the new reward function, the energy consumption on the three terrains has decreased significantly; however, the test speed has decayed severely on sloped terrain and real-world terrain.

5. Conclusions

In this paper, we proposed an HTS-SAC method to optimize the locomotion policy from historical time-series data. We designed four mutual information decision conditions based on the k-nearest neighbor mutual information theory. Through correlation analysis with these decision conditions, the key feature variables affecting locomotion performance were obtained. Then, we designed a heterogeneous time-series neural network to learn high-level features from the key feature variables and designed the HTS-SAC algorithm to learn the enhanced policy. The simulation results show that HTS-SAC can generate a better policy and provide faster speed on various terrains.
The DRL-based methods do not require a complex and exact model of the unmanned quadruped robot and can be widely applied to various complex scenarios through learning and training with large amounts of interaction data. However, DRL-based methods require a large quantity of high-quality and diverse data to cover all possible application scenarios. In addition, transferring the learned model and policy to real robot hardware with safety guarantees is also necessary. These factors make it a great challenge to deploy DRL-based methods in real-world systems. In future research, we will analyze the stability and implementation of the proposed method through more in-depth theoretical research and more comprehensive hardware experiments.
The proposed method is an end-to-end, data-driven reinforcement learning approach that does not rely on the robot’s model and is scalable. Quadruped robots and unmanned aerial vehicles (UAVs) both belong to unmanned robotic systems. When applied to UAVs, information such as the UAV’s attitude, position, velocity, and surrounding environment can be collected; through learning and training of the model, the UAV can formulate control strategies for each motor by perceiving its current state, thereby achieving control of the UAV.

Author Contributions

Conceptualization, Z.W.; methodology, Z.W.; software, Z.C.; validation, Z.W., Z.C. and H.L.; formal analysis, investigation, and resources, Z.W. and Z.C.; data curation, Z.C.; writing—original draft preparation, Z.W.; writing—review and editing, Z.C. and H.L.; visualization, Z.W.; supervision, project administration, and funding acquisition, H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (Grant No. 61922068).

Data Availability Statement

Data are unavailable due to privacy restrictions.

Acknowledgments

The authors would like to thank the Associate Editor and anonymous reviewers for their constructive suggestions that improved this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, J.; Chai, H.; Zhang, Q.; Zhao, H.; Chen, M.; Li, Y.; Li, Y. A Framework of Grasp Detection and Operation for Quadruped Robot with a Manipulator. Drones 2024, 8, 208. [Google Scholar] [CrossRef]
  2. Liang, H.; Li, H.; Shi, Y.; Constantinescu, D.; Xu, D. Energy-Efficient Integrated Motion Planning and Control for Unmanned Surface Vessels. IEEE Trans. Control Syst. Technol. 2024, 32, 250–257. [Google Scholar] [CrossRef]
  3. Zhao, M.; Li, H. Distributed Model Predictive Contouring Control of Unmanned Surface Vessels. IEEE Trans. Ind. Electron. 2024, 71, 13012–13019. [Google Scholar] [CrossRef]
  4. Yang, Q.; Li, H. RMPC-Based Visual Servoing for Trajectory Tracking of Quadrotor UAVs with Visibility Constraints. IEEE/CAA J. Autom. Sin. 2024, 11, 2027–2029. [Google Scholar] [CrossRef]
  5. Wang, R.; Li, H.; Liang, B.; Shi, Y.; Xu, D. Policy Learning for Nonlinear Model Predictive Control With Application to USVs. IEEE Trans. Ind. Electron. 2024, 71, 4089–4097. [Google Scholar] [CrossRef]
  6. Wang, Z.; Li, H.; Chen, Z.; Han, Q.L. A Fault Diagnosis Method for Quadruped Robot Based on Hybrid Deep Neural Networks. IEEE Trans. Ind. Inform. 2025, 21, 3027–3036. [Google Scholar] [CrossRef]
  7. Seok, S.; Wang, A.; Chuah, M.Y.; Hyun, D.; Lee, J.; Otten, D.M.; Lang, J.H.; Kim, S. Design principles for energy-efficient legged locomotion and implementation on the MIT cheetah robot. IEEE/ASME Trans. Mechatronics 2015, 20, 1117–1129. [Google Scholar] [CrossRef]
  8. Kim, D.; Carlo, J.D.; Katz, B.; Bledt, G.; Kim, S. Highly dynamic quadruped locomotion via whole-body impulse control and model predictive control. arXiv 2019, arXiv:1909.06586. [Google Scholar] [CrossRef]
  9. Du, W.; Fnadi, M.; Benamar, F. Rolling based locomotion on rough terrain for a wheeled quadruped using centroidal dynamics. Mech. Mach. Theory 2020, 153, 103984. [Google Scholar] [CrossRef]
  10. Xu, K.; Wang, S.; Shi, L.; Li, J.; Yue, B. Horizon-stability control for wheel-legged robot driving over unknown, rough terrain. Mech. Mach. Theory 2025, 205, 105887. [Google Scholar] [CrossRef]
  11. Qin, W.; Zhao, X.; Jiang, Y.; Wang, X.; Xu, D. Approximate path following control of robotic manipulators: An adaptive dynamic programming-based method. In Proceedings of the 2022 China Automation Congress (CAC), Xiamen, China, 25–27 November 2022; pp. 3909–3914. [Google Scholar]
  12. Jiang, H.; An, T.; Ma, B.; Li, Y.; Dong, B. Value iteration-based decentralized fuzzy optimal control of modular reconfigurable robots via adaptive dynamic programming. In Proceedings of the 2022 5th International Conference on Robotics, Control and Automation Engineering (RCAE), Changchun, China, 28–30 October 2022; pp. 186–190. [Google Scholar]
  13. Robotin, R.; Lazea, G.; Dobra, P. Mobile robot navigation using graph search techniques over an approximate cell decomposition of the free space. Adv. Intell. Control Syst. Comput. Sci. 2013, 187, 129–142. [Google Scholar]
  14. Guo, H.; Tan, Z.; Liu, J.; Guo, J. Kernel-based approximate dynamic programming for autonomous vehicle stability control. In Proceedings of the 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Macau, China, 8–12 October 2022; pp. 1299–1304. [Google Scholar]
  15. Lin, Z.; Ma, J.; Duan, J.; Li, S.E.; Ma, H.; Cheng, B.; Lee, T.H. Policy iteration based approximate dynamic programming toward autonomous driving in constrained dynamic environment. IEEE Trans. Intell. Transp. Syst. 2023, 20, 5003–5013. [Google Scholar] [CrossRef]
  16. Yang, Q.; Lian, Y.; Xie, W. Hierarchical planning for multiple AGVs in warehouse based on global vision. Simul. Model. Pract. Theory 2020, 104, 102124. [Google Scholar] [CrossRef]
  17. Yao, Z.; Yu, J.; Zhang, J.; He, W. Graph and dynamics interpretation in robotic reinforcement learning task. Inf. Sci. 2022, 611, 317–334. [Google Scholar] [CrossRef]
  18. Roveda, L.; Pallucca, G.; Pedrocchi, N.; Braghin, F.; Tosatti, L.M. Iterative learning procedure with reinforcement for high-accuracy force tracking in robotized tasks. IEEE Trans. Ind. Inform. 2017, 14, 1753–1763. [Google Scholar] [CrossRef]
  19. Brunke, L.; Greeff, M.; Hall, A.W.; Yuan, Z.; Zhou, S.; Panerati, J.; Schoellig, A.P. Safe learning in robotics: From learning-based control to safe reinforcement learning. Annu. Rev. Control. Robot. Auton. Syst. 2022, 5, 411–444. [Google Scholar] [CrossRef]
  20. Hornby, G.S.; Takamura, S.; Yokono, J.; Hanagata, O.; Yamamoto, T.; Fujita, M. Evolving robust gaits with AIBO. In Proceedings of the 2000 IEEE International Conference on Robotics and Automation (ICRA), San Francisco, CA, USA, 24–28 April 2000; pp. 3040–3045. [Google Scholar]
  21. Fankhauser, P.; Hutter, M.; Gehring, C.; Bloesch, M.; Hoepflinger, M.A.; Siegwart, R. Reinforcement learning of single legged locomotion. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 188–193. [Google Scholar]
  22. Bogdanovic, M.; Khadiv, M.; Righetti, L. Model-free reinforcement learning for robust locomotion using demonstrations from trajectory optimization. Front. Robot. AI 2022, 9, 1–12. [Google Scholar] [CrossRef] [PubMed]
  23. Iscen, A.; Yu, G.; Escontrela, A.; Jain, D.; Tan, J.; Caluwaerts, K. Learning agile locomotion skills with a mentor. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 2019–2025. [Google Scholar]
  24. Wu, J.; Wang, C.; Zhang, D.; Zhong, S.; Wang, B.; Qiao, H. Learning smooth and omnidirectional locomotion for quadruped robots. In Proceedings of the 2021 6th IEEE International Conference on Advanced Robotics and Mechatronics (ICARM), Chongqing, China, 3–5 July 2021; pp. 633–638. [Google Scholar]
  25. Tan, W.; Fang, X.; Zhang, W.; Song, R.; Chen, T.; Zheng, Y.; Li, Y. A hierarchical framework for quadruped locomotion based on reinforcement learning. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 8462–8468. [Google Scholar]
  26. Yao, Q.; Wang, J.; Wang, D.; Yang, S.; Zhang, H.; Wang, Y.; Wu, Z. Hierarchical terrain-aware control for quadrupedal locomotion by combining deep reinforcement learning and optimal control. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021; pp. 4546–4551. [Google Scholar]
  27. Hwangbo, J.; Lee, J.; Dosovitskiy, A.; Bellicoso, D.; Tsounis, V.; Koltun, V.; Hutter, M. Learning agile and dynamic motor skills for legged robots. Sci. Robot. 2022, 4, 1–13. [Google Scholar] [CrossRef] [PubMed]
  28. Haarnoja, T.; Zhou, A.; Abbeel, P.; Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 1861–1870. [Google Scholar]
Figure 1. Control framework of HTS-SAC algorithm.
Figure 2. Heterogeneous time-series neural network.
Figure 3. HTS-SAC program flowchart.
Figure 4. Mutual information and its change rate.
Figure 5. Time delay correlation mutual information.
Figure 6. Accumulated training rewards for six methods.
Figure 7. Cumulative training rewards of TD-SAC under different structural conditions.
Figure 8. Test on four terrains: flat ground, uphill and downhill slopes, stair waves and real land.
Figure 9. Speed test of four algorithms on flat ground.
Figure 10. Speed test of four methods on four different terrains.
Figure 11. The robot locomotion trajectory of HTS-SAC on flat ground.
Figure 12. The robot orientation of HTS-SAC on flat ground.
Figure 13. The base angle MSE of four methods on different terrains.
Figure 14. The foot touchdown results of the proposed method.
Figure 15. Joint angles and joint positions of the proposed method.
Figure 16. Joint angles and joint positions of the SAC method.
Figure 17. Cumulative training rewards for different reward functions.
Figure 18. Speed test of different reward functions on four terrains.
Figure 19. Energy consumption of different reward functions on four terrains.
Table 1. Parameters of PyBullet.
Parameters | Parameter Value
Fixed time step | 0.00416 s
Solver iterations | 50
Solver type | Projected Gauss–Seidel
Sensor noise model | Gaussian noise
Robot mass | 4.7 kg
Friction | 0.5
Restitution | (0.02, 0.03, 0.04)
Table 2. Parameters of locomotion control model of the unmanned quadruped robot.
Parameters | Parameter Value
Actor fc layer params | (512, 512)
Critic fc layer params | (512, 512)
Gamma | 0.99
Epochs | 1000
Steps per epoch | 4000
Replay size | 1,000,000
Batch size | 100
Actor learning rate | 0.001
Critic learning rate | 0.001
Table 3. Cumulative training rewards for different methods.
Methods | Rewards
DDPG | −122
PPO | −0.17
TD3 | 125
SAC | 891
TD-SAC | 826
HTS-SAC | 923
Table 4. TD-SAC training results under different structural conditions.
Network Structure | Reward
(500, 500) | 827
(800, 800) | 828
(1024, 1024) | 833
(1624, 1624) | 847
(2624, 2624) | 7.5
Table 5. The yaw angle MSE of four methods on different terrains.
Methods | Flat Ground | Slope Land | Stair Waves | Real Land
TD3 | 1.15 × 10^−2 | 1.04 × 10^−2 | 1.52 × 10^−2 | 8.99 × 10^−3
SAC | 2.23 × 10^−4 | 5.39 × 10^−4 | 4.94 × 10^−4 | 1.42 × 10^−3
TD-SAC | 1.21 × 10^−3 | 4.08 × 10^−3 | 1.39 × 10^−3 | 2.61 × 10^−3
HTS-SAC | 1.73 × 10^−4 | 2.39 × 10^−4 | 2.38 × 10^−4 | 4.21 × 10^−4
Table 6. The training results of HTS-SAC with different threshold conditions.
Methods | δ1 | δ2 | δ3 | δ4 | Characteristic Number | Reward
HTS-SAC | 0.3 | 0 | 0.3 | 0.27 | 87 | 924
Compare1 | 0.29 | −0.05 | 0.3 | 0.28 | 125 | 906
Compare2 | 0.28 | −0.1 | 0.3 | 0.29 | 165 | 896