1. Introduction
Compared with wheeled and tracked robots, legged robots have better terrain adaptability and mobility in complex dynamic environments and on rugged terrain [1]. In the future, unmanned quadruped robots may replace humans in a variety of dangerous and complex environments. Enabling unmanned quadruped robots to move as flexibly and quickly as animals has therefore become an active research field.
There are several control methods for the stable multi-contact locomotion control of legged robots. The model predictive control (MPC) method shows strong advantages in control accuracy and robustness [
2,
3]. Robust model predictive control and other MPC methods are widely used in fields such as unmanned aerial vehicles and unmanned surface vehicles [
4,
5]. Unmanned quadruped robots can freely control the foot position, so that they can not only quickly cross a variety of cluttered obstacles and terrains, but also maintain high speed on flat ground [
6]. For quadruped robots, MPC-based torque control was proposed by the MIT team to generate trotting, galloping, and other gaits [7]. They then combined whole-body control (WBC) with MPC to realize high-speed dynamic locomotion [8], where the MPC finds the optimal reaction force distribution over a longer time horizon using a simple model, and the WBC calculates the joint torque, position, and velocity commands from the reaction forces computed by the MPC.
Furthermore, wheeled quadruped robots possess excellent terrain adaptability due to their unique structure. A model-based whole-body torque control system for tracking the center-of-mass (CoM) motion of wheeled quadruped robots was proposed in [
9]. By integrating the wheel kinematic model with the robot’s CoM momentum/dynamics model, a wheel motion controller was developed. In [
10], a horizontal stability control framework for wheel–leg hybrid robots was proposed to maintain stable motion on rough terrain. Therefore, wheel–leg quadruped robots have advantages such as efficient movement and wide terrain adaptability.
Additionally, the approximate dynamic programming (ADP) and approximate cell decomposition (ACD) methods are widely used in path planning and following control of robots. An event-triggered method for scheduling ADP controllers was proposed for path following control of robots in [
11]. The value iteration was used for modular reconfigurable robots with the ADP algorithm in [
12]. The graph search algorithm was developed for path planning of robots in unknown terrain in [
13]. In addition, the kernel-based ADP method was used for autonomous vehicle stability control in [
14]. An innovative and effective constrained finite-horizon ADP algorithm was proposed for autonomous driving in [
15], and the improved A* algorithm and time window algorithm were designed to complete path planning in [
16].
However, legged robots are high-dimensional non-smooth systems with several physical constraints, and their dynamic and kinematic models cannot be obtained accurately in complex environments. As a result, model-based control methods, including MPC, can have strong limitations in unknown dynamic environments.
Reinforcement learning (RL), as a data-driven method, can overcome the limitations of the model-based methods through the interactive learning between the agents and the environments, which is a promising control method for robots [
17,
18,
19]. It does not need exact dynamic modeling of unmanned quadruped robots but can greatly improve control performance.
RL was applied to the generation of stable gaits for the AIBO robot as early as 2000 [20]. Model-free methods were later developed to learn motor velocity trajectories and high-level control parameters to realize jumping locomotion [21].
In addition, two-stage reinforcement learning was proposed to create robust general policies, in which the two stages learn different training content [22]. Agile mobility was formulated as a multi-stage learning problem in [23], in which a mentor guides the agent throughout training; once the student can solve the task with the mentor's guidance, it is then taught to perform the task without the mentor. In [24], the reference trajectory, inverse kinematics, and a transformation loss are incorporated into the reinforcement learning training process as prior knowledge. However, the reference trajectory indirectly restricts the final training results, so such demonstration-based methods struggle to adapt to different complex terrains.
The deep reinforcement learning (DRL) method is generally selected as the high-level planner, and the traditional control method is applied for tracking control. An unmanned quadruped robot learning locomotion system based on a hierarchical learning framework was proposed in [
25], where RL as a high-level policy is used to adjust the underlying trajectory generator to better adapt to the terrain. Combining terrain perception with locomotion planning, a hierarchical learning framework was developed for unmanned quadruped robots to move in challenging natural environments in [
26], where the global height map of the terrain serves as the visual input of the DRL to determine the footholds for the leg swing and the body posture. In the method proposed by Hwangbo et al. in [27], the high-level controller is a DRL policy and the low-level controller is a deep neural network; compared with a traditional low-level controller, the neural network controller runs at a higher control frequency.
The motion control methods for quadruped robots based on reinforcement learning face numerous challenges. Traditional challenges include high-dimensional state and action spaces, low sample efficiency, and difficulty in simulation-to-reality transfer. The motion of quadruped robots is a highly dynamic temporal process (e.g., body swinging trends, changes in contact states, etc.), where current actions depend on historical states (e.g., past joint angles, IMU data, ground contact forces, etc.). Traditional reinforcement learning methods (including SAC) usually only take the "current state" as input, ignoring historical temporal correlations. This makes it difficult for the model to capture the continuity and dynamic trends of motion, thereby affecting control performance. In addition, the state of a quadruped robot includes dozens or even hundreds of dimensions of data such as multi-joint angles, angular velocities, IMU data, and ground contact states. High-dimensional raw features have redundancy and noise, and the input layers of traditional RL methods struggle to directly extract effective information from them, easily falling into the "curse of dimensionality." This results in the model having weak capability to capture key dynamic features (e.g., body tilting trends). Therefore, insufficient utilization of historical temporal information and the difficulty in effectively extracting features from high-dimensional state spaces are also significant challenges.
This paper proposes a feature optimization method based on correlation analysis, which is designed to optimize the dimensionality of data features while making full use of historical data. It also presents a heterogeneous time-series soft actor–critic (HTS-SAC) quadruped robot control method, which improves the performance of traditional RL methods by learning high-level motion features from heterogeneous historical time series. Finally, the effectiveness of the proposed methods is verified through extensive comparative experiments. The main contributions of this work are as follows:
- A feature selection method based on k-nearest neighbor mutual information is proposed. The four designed mutual information decision conditions analyze the dimensions and time-series correlations of different features and optimize the length of the time-series data. The feature selection method improves the utilization efficiency of historical time-series data and optimizes the feature dimension, avoiding the problem of dimension explosion.
- A novel HTS-SAC control method is designed for unmanned quadruped robots. Based on the results of feature selection, a neural network input layer is designed and fused with the SAC to construct the HTS-SAC model, which can learn the optimal strategy from historical time-series data.
- Experiments are carried out with the Laikago robot simulation model on four different terrains, and the performance of the proposed method is verified through comparison with other DRL methods.
The rest of this paper is organized as follows: Section 2 presents the design of the feature selection scheme, Section 3 presents the detailed HTS-SAC algorithm, Section 4 gives simulation results that demonstrate the effectiveness of the proposed method, and Section 5 concludes the paper.
2. Feature Selection Method
Current DRL-based methods overlook the useful information contained in historical time-series data, which can serve as a reliable source for policy optimization.
Figure 1 shows the overall framework of the HTS-SAC algorithm. First, we adopt the traditionally trained SAC method to construct the initial experience replay buffer for the quadruped robot. Second, to address the low utilization of historical time-series data, we design four decision criteria based on k-nearest neighbor mutual information and perform feature selection on the data in the experience replay buffer. Finally, based on the new feature sequences obtained after feature selection, we design an HTS-SAC model incorporating a heterogeneous time-series neural network to learn a better strategy.
2.1. Mutual Information Theory
Mutual information quantifies the degree of correlation between variables; the concept originates from entropy in information theory.
Entropy expresses the degree of uncertainty of a random variable and the amount of information it contains, and is often called information entropy. Let $X$ be a discrete random variable with range $\mathcal{X}$; its entropy is defined as
$$H(X) = -\sum_{x \in \mathcal{X}} p(x)\log p(x),$$
where $p(x)$ denotes the probability that $X$ takes the value $x$. When the base of the logarithm is 2, the unit of entropy is the bit.
Let $X$ and $Y$ be two discrete random variables; the mutual information between $X$ and $Y$ is defined as
$$I(X;Y) = \sum_{x \in \mathcal{X}}\sum_{y \in \mathcal{Y}} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}.$$
The k-nearest neighbor mutual information method estimates the mutual information directly in the sample space and can handle high-dimensional variables. Assuming that there are $N$ feature sample points $z_i=(x_i,y_i)$, $i=1,\dots,N$, in the joint space $Z=(X,Y)$, the k-nearest neighbor mutual information of the random variables $X$ and $Y$ is estimated as
$$I(X;Y) = \psi(k) + \psi(N) - \frac{1}{N}\sum_{i=1}^{N}\big[\psi(n_x(i)+1) + \psi(n_y(i)+1)\big],$$
where $n_x(i)$ denotes the number of sample points whose distance from point $x_i$ is less than $\varepsilon(i)/2$. Here, $\varepsilon(i)/2$ denotes the distance between $z_i$ and its $k$-th nearest neighbor, and $n_y(i)$ is defined similarly. $\psi(\cdot)$ is the Digamma function, calculated by the iteration
$$\psi(x+1) = \psi(x) + \frac{1}{x}, \qquad \psi(1) = -\gamma \approx -0.5772.$$
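For completeness, the following minimal Python sketch shows how such a k-nearest-neighbor estimate can be obtained in practice; it assumes scikit-learn is available (its mutual_info_regression uses a Kraskov-style k-NN estimator), and the toy feature and reward variables are purely illustrative.

```python
# Minimal sketch of k-NN mutual information estimation with scikit-learn.
# The feature/reward variables below are toy stand-ins, not the paper's data.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
N = 10_000

x = rng.normal(size=N)                        # candidate state feature (e.g., X-axis position)
reward = 0.8 * x + 0.2 * rng.normal(size=N)   # correlated "reward" signal

# Estimate I(x; reward) with k = 3 nearest neighbors (the scikit-learn default).
mi = mutual_info_regression(x.reshape(-1, 1), reward, n_neighbors=3)[0]
print(f"estimated mutual information: {mi:.3f} nats")
```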
2.2. Four Mutual Information Decision Conditions
Based on the k-nearest neighbor mutual information theory, we design four key feature decision conditions: the state feature correlation mutual information $I_c$, the mutual information change rate $\Delta I$, the time delay correlation mutual information $I_d$, and the redundancy mutual information $I_r$.
a. The state feature correlation mutual information. Let the input variables be $S=\{s_1, s_2, \dots, s_n\}$; then, the mutual information between the output action reward $r$ and each input state variable is defined as follows:
$$I_c(s_i) = I(s_i; r), \quad i = 1, \dots, n.$$
By calculating the mutual information values of the different state variables, we can analyze the influence of each input state variable on the locomotion state of the robot. Set the correlation threshold as $\delta_c$. When $I_c(s_i) > \delta_c$, the current variable $s_i$ is closely related to the reward and is a key variable that affects the locomotion state of the robot.
b. The mutual information change rate. Select a variable $s_b$ as the reference variable, and combine it with the other input state variables one by one to form new sets $S_j=\{s_b, s_j\}$, $j \neq b$. The mutual information of the reference variable alone and of each new set with the reward is calculated, and the change rate of mutual information is obtained as the difference between the two:
$$\Delta I_j = I(S_j; r) - I(s_b; r).$$
According to information entropy, when $s_j$ is an independent variable, $\Delta I_j$ decreases; when $s_j$ is a key variable, $\Delta I_j$ increases. A correlation threshold $\delta_{\Delta}$ is set to determine whether $s_j$ is a related variable: when $\Delta I_j > \delta_{\Delta}$, the newly added variable $s_j$ is a related variable.
c. The time delay correlation mutual information. Let $r_t$ be the action reward for the current state information $s_t$, and let $s_{t-d}$, $d = 1, 2, \dots$, denote the state information at the historical moment with delay length $d$. The delay mutual information of the state information is defined as follows:
$$I_d(s, d) = I(s_{t-d}; r_t).$$
Here, the delay correlation threshold is set as $\delta_d$. When $I_d(s, d) > \delta_d$, the state feature at delay $d$ is a strongly correlated feature.
d. The redundancy mutual information. After selecting the key feature variables through the mutual information and the mutual information change rate, it is necessary to analyze the redundancy of the feature information. Removing redundant variables effectively reduces the data dimension while improving the training speed. The redundancy mutual information between two selected variables $s_i$ and $s_j$ is defined as follows:
$$I_r(s_i, s_j) = I(s_i; s_j).$$
Here, the redundancy correlation threshold is set as $\delta_r$. When $I_r(s_i, s_j) > \delta_r$, the variable is redundant and needs to be eliminated.
The final optimized feature results need to meet all four decision conditions simultaneously. In this way, the proposed method can enhance the effective utilization of the historical information while optimizing features to avoid dimension explosion.
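To make the procedure concrete, the following minimal sketch applies the four decision conditions to replay-buffer data. The thresholds, helper names, and the use of summed per-feature scikit-learn MI estimates as a rough surrogate for the joint k-NN mutual information are illustrative assumptions rather than the paper's exact implementation.

```python
# Illustrative feature-selection loop over the four decision conditions.
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi(x, y, k=3):
    """Per-column k-NN MI with the target, summed as a rough joint-MI surrogate."""
    return mutual_info_regression(x, y, n_neighbors=k).sum()

def select_features(states, rewards, d_max=10,
                    thr_corr=0.3, thr_delta=0.05, thr_delay=0.3, thr_red=0.5):
    n_dims = states.shape[1]

    # (a) state-feature correlation: keep dimensions with high I(s_i; r)
    corr = np.array([mi(states[:, [i]], rewards) for i in range(n_dims)])
    keys = [i for i in range(n_dims) if corr[i] > thr_corr]

    # (b) MI change rate: does pairing s_j with the benchmark variable add information?
    ref = int(np.argmax(corr))
    base = mi(states[:, [ref]], rewards)
    keys += [j for j in range(n_dims) if j != ref
             and mi(states[:, [ref, j]], rewards) - base > thr_delta]

    # (c) time-delay correlation: keep the longest delay that stays above threshold
    delays = {}
    for i in set(keys):
        good = [d for d in range(1, d_max + 1)
                if mi(states[:-d, [i]], rewards[d:]) > thr_delay]
        delays[i] = max(good) if good else 1

    # (d) redundancy: drop features that duplicate an already-kept feature
    kept = []
    for i in sorted(set(keys)):
        if all(mi(states[:, [i]], states[:, j]) < thr_red for j in kept):
            kept.append(i)

    return {i: delays[i] for i in kept}   # feature index -> history length
```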
4. Numerical Simulation
In this section, we compare the twin delayed deep deterministic policy gradient (TD3), the deep deterministic policy gradient (DDPG), the proximal policy optimization (PPO), the traditional SAC, and the time-delay soft actor–critic (TD-SAC) with the HTS-SAC proposed in this paper. The TD-SAC method is a comparative method we constructed, and this method only utilizes a large amount of historical state data without feature optimization. We use the PyBullet robot simulation system for testing. PyBullet has excellent rendering and collision detection details, which can simulate the real world with high fidelity. In addition, it is packaged into a Python (v3.8.2) module for robot simulation and experimentation and provides forward/reverse kinematics, forward/reverse dynamics, collision detection, ray intersection queries and other functions. The parameters of PyBullet are shown in
Table 1. The Laikago-a1 quadruped robot is employed for the simulation tests. We set the actuator delay to 2 ms. The joint actuators use position control, with gains kp = 60 and kd = 1. In addition, to simulate the real robot environment, we also add noise to the sensor data.
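For reference, a minimal PyBullet sketch of this position-control setup is shown below; the URDF path, joint targets, and simulation time step are illustrative assumptions rather than the exact configuration of Table 1.

```python
# Minimal PyBullet position-control sketch (kp = 60, kd = 1); illustrative only.
import pybullet as p
import pybullet_data

p.connect(p.DIRECT)                         # headless; use p.GUI to visualize
p.setAdditionalSearchPath(pybullet_data.getDataPath())
p.setGravity(0, 0, -9.8)
p.setTimeStep(1.0 / 500.0)                  # illustrative simulation step

plane = p.loadURDF("plane.urdf")
# Laikago model shipped with pybullet_data; base orientation handling omitted.
robot = p.loadURDF("laikago/laikago.urdf", [0, 0, 0.5])

target_angles = [0.0] * p.getNumJoints(robot)   # placeholder joint targets
for j in range(p.getNumJoints(robot)):
    p.setJointMotorControl2(robot, j, p.POSITION_CONTROL,
                            targetPosition=target_angles[j],
                            positionGain=60, velocityGain=1)
p.stepSimulation()
```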
The flowchart of the HTS-SAC program is shown in
Figure 3. First, we use the traditional SAC method to collect 10,000 data entries in the initial experience replay buffer (denoted as D), with the training scenario set on flat ground. Second, the data D from the experience replay buffer are fed into the four mutual information decision criteria, and through feature optimization and selection, new time-series features and a new experience replay buffer D′ are generated. Then, based on the new time-series features, a heterogeneous time-series neural network is designed. Finally, the heterogeneous time-series neural network is incorporated into the SAC to form the HTS-SAC model, and the final model is obtained through iterative training for testing.
The main difference between the HTS-SAC and the SAC is the mutual information feature selection and the heterogeneous time-series neural network, as shown in the red dashed box in
Figure 3.
We choose the input of the SAC policy network as the initial state set, which is the observation information obtained from four sensors of the Laikago-a1 quadruped robot (37 dimensions in total): the base-displacement sensor data in three orientations, the angle and angular velocity data of the IMU sensor, the angles of the 12 joints, the positions of the 12 joints, and the foot-contact sensor data indicating whether each foot touches the ground.
The action space is the output of the actor network, which corresponds to the position commands of the 12 joints of the unmanned quadruped robot. The reward function is designed as
$$r_t = x_t - x_{t-1},$$
where $x_t$ denotes the X-axis displacement at time $t$; that is, the reward is the difference between the X-axis displacement at the current time and at the previous time. Note that this displacement-based reward is a universal design, which reduces the impact of the reward function on the comparative studies.
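To make the dimensions concrete, the following minimal sketch assembles such an observation vector and computes the displacement reward; the 3/6/12/12/4 split is an assumption chosen to match the 37-dimensional input reported in Section 4.1, and all names are illustrative.

```python
# Illustrative assembly of the 37-D observation and the displacement reward.
import numpy as np

def build_observation(base_pos, imu_rpy, imu_ang_vel,
                      joint_angles, joint_positions, foot_contacts):
    """Concatenate sensor readings into one 37-D state vector."""
    obs = np.concatenate([
        base_pos,         # 3: base displacement along X, Y, Z
        imu_rpy,          # 3: IMU roll, pitch, yaw (assumed split)
        imu_ang_vel,      # 3: IMU angular velocities (assumed split)
        joint_angles,     # 12: joint angles
        joint_positions,  # 12: joint positions
        foot_contacts,    # 4: binary foot-contact flags
    ])
    assert obs.shape == (37,)
    return obs

def displacement_reward(x_curr, x_prev):
    """Reward: progress along the X axis between consecutive steps."""
    return x_curr - x_prev

obs = build_observation(np.zeros(3), np.zeros(3), np.zeros(3),
                        np.zeros(12), np.zeros(12), np.zeros(4))
print(obs.shape, displacement_reward(0.42, 0.40))
```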
4.1. Feature Selection of Heterogeneous Time Series
In the feature selection simulation, 10,000 sets of data from the initial experience pool are used as the training data. The test results of the state feature correlation mutual information $I_c$, the mutual information change rate $\Delta I$, and the redundancy mutual information $I_r$ are shown in Figure 4. The red line represents the state feature correlation mutual information $I_c$. It can be observed that when $I_c$ exceeds the threshold $\delta_c$, the corresponding variable is strongly correlated and is a key variable; in particular, the X-axis position and the Z-axis angular velocity have strong correlations. The blue line indicates the mutual information change rate $\Delta I$. We choose the benchmark (reference) variable and combine it with the other state variables one by one to form new sets; the change rate is determined by the difference between the mutual information of each new set and that of the benchmark variable. When $\Delta I > \delta_{\Delta}$, the newly added variable is a correlated variable. The yellow line is the redundancy mutual information $I_r$. When $I_r > \delta_r$, the variable is redundant and should be eliminated. According to the mutual information results in Figure 4, we set the thresholds $\delta_c$, $\delta_{\Delta}$, and $\delta_r$ accordingly, and the key feature variables, including the X-axis position and the Z-axis angular velocity, are determined.
The time delay correlation mutual information $I_d$ for different delay lengths $d$ can be seen in Figure 5. For small delay lengths, the corresponding variables have strong correlations, and as $d$ increases, the correlation gradually decreases. According to the curves of $I_d$, we set the delay correlation threshold $\delta_d$ to 0.3. When $I_d > 0.3$, the variable at that delay has a strong correlation and should be retained.
According to the mutual information decision conditions, the key features that affect policy learning are obtained. We therefore assign a long historical time series to the X-axis position and the Z-axis angular velocity, a medium-length historical time series to the other correlated feature variables, and a short historical time series to the remaining variables. In this way, a heterogeneous time-series input layer with non-uniform lengths is designed. The HTS-SAC input layer dimension is set to 87 (corresponding to the data with heterogeneous time-series lengths), the SAC input layer dimension is 37 (corresponding to a time-series length of 1), and the TD-SAC input layer dimension is 370 (corresponding to a time-series length of 10).
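The following minimal sketch illustrates such a heterogeneous time-series input layer. The per-feature history lengths (2 features with 10 steps, 8 with 5 steps, and 27 with 1 step) are assumptions chosen only because they sum to the 87 reported dimensions; the exact assignment used in the paper may differ.

```python
# Illustrative heterogeneous time-series input: per-feature history lengths
# summing to 87 dims (2 x 10 + 8 x 5 + 27 x 1). Feature ordering is arbitrary.
import numpy as np
import torch
import torch.nn as nn

HISTORY_LEN = {**{i: 10 for i in range(2)},      # long history: 2 key features
               **{i: 5 for i in range(2, 10)},   # medium history: 8 features
               **{i: 1 for i in range(10, 37)}}  # short history: remaining 27

def heterogeneous_input(state_history):
    """state_history: array of shape (T, 37) with the newest state last."""
    parts = [state_history[-L:, i] for i, L in HISTORY_LEN.items()]
    return torch.as_tensor(np.concatenate(parts), dtype=torch.float32)

# Input layer of the policy/critic networks, sized to the 87-D feature vector.
input_layer = nn.Sequential(nn.Linear(87, 256), nn.ReLU())

x = heterogeneous_input(np.zeros((10, 37), dtype=np.float32))
print(x.shape, input_layer(x).shape)   # torch.Size([87]) torch.Size([256])
```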
4.2. Algorithm Simulation
We test the DDPG, PPO, TD3, SAC, TD-SAC, and HTS-SAC methods on different terrains using the Laikago-a1 quadruped robot simulation model. The advantages and disadvantages of the methods are analyzed by comparing the forward running.
The parameters of the agile locomotion control model of the unmanned quadruped robot based on the HTS-SAC algorithm are shown in Table 2. The parameters are continuously tuned during training and testing, following these adjustment guidelines. Layer params: the hidden layer width increases sequentially from 128 and does not exceed 1024. Gamma: the discount factor; a larger gamma places more weight on long-term rewards, and it is usually increased from 0.99 toward 0.999. Epochs: usually increased from 100. Steps per epoch: start from 1000 and generally do not exceed 5000; the product of epochs and steps per epoch gives the total number of training steps. If training still has not converged within this budget, the parameters are unsuitable and the model needs to be retrained. Replay size: increased gradually from its initial value. Batch size: chosen to be equal to or smaller than the layer width. Learning rate: decreased sequentially from its initial value.
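As an illustration, a hyperparameter configuration following these guidelines might look as follows; the values are plausible starting points under the stated rules, not the exact settings of Table 2.

```python
# Illustrative hyperparameter configuration following the tuning guidelines above.
hts_sac_config = {
    "hidden_sizes": (256, 256),   # layer params: grown from 128, capped at 1024
    "gamma": 0.99,                # discount factor: raised toward 0.999 if needed
    "epochs": 500,                # increased from 100 until training converges
    "steps_per_epoch": 1000,      # kept between 1000 and 5000
    "replay_size": 100_000,       # experience replay buffer capacity
    "batch_size": 256,            # <= hidden layer width
    "lr": 3e-4,                   # decreased gradually if training is unstable
}
```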
Figure 6 plots the cumulative training reward across 20 runs of each method on the selected map, together with 80% confidence intervals. All algorithms are trained and tested on the Laikago robot under identical conditions. The final training rewards of TD-SAC and SAC are 828.61 and 885.32, respectively. However, the DDPG and PPO methods fail to converge despite extensive network and hyperparameter tuning, and the training performance of TD3 is poor. Although the early rewards of HTS-SAC are lower than those of SAC, it is significantly better than the other methods after 350 epochs. Its final cumulative reward is 923.99, which shows that the proposed HTS-SAC method converges stably and can learn the optimal locomotion policy by using heterogeneous time-series data. The numerical results of the cumulative training rewards for each method can be found in
Table 3.
We test different TD-SAC network structures to provide fair comparison conditions, with the results shown in
Table 4 and
Figure 7. The results indicate that increasing the complexity of the network structure leads to an increase in cumulative rewards, with the maximum not exceeding 847. Additionally, when the network structure exceeds (2624, 2624), the model fails to converge, resulting in a reward of only 7.5.
To verify the control performance and agility of the proposed method, we test four different DRL algorithms in four scenarios: flat ground, uphill and downhill slopes, stair waves, and real land, as shown in
Figure 8.
Figure 9 shows the speed curves of four methods on flat ground; each method underwent a forward running test of 1000 steps. The link to the robot’s test demonstration video is
https://youtu.be/yD8dkU6z52I (accessed on 3 August 2025). The speed tests on the four different terrains are shown in
Figure 10. We find that the proposed method has better locomotion speed and agility on all terrains, and the smallest velocity attenuation on complex terrains. The average speeds of the HTS-SAC, SAC, TD-SAC and TD3 on all terrains are 0.39 m/s, 0.33 m/s, 0.34 m/s, and 0.24 m/s, respectively.
In addition,
Figure 11 and
Figure 12 show the real-time collection of robot position and angle information on flat ground using the proposed method. In particular, the results in
Figure 11 show that the deviation of the robot’s locomotion trajectory on the Y-axis is small: the mean error on the Y-axis is 0.08, and the corresponding Y-axis variance is likewise small.
Figure 12 reflects the real-time angle changes in the tests, with all angle changes being less than 0.05 degrees.
Figure 13 shows the angle mean squared error (MSE) of four methods on different terrains, which reflects the stability of the robot. The MSE results of the yaw angles of each method on different terrains are shown in
Table 5. The proposed method has the smallest errors on flat ground and stair wave terrain, while SAC has smaller errors on slope land and real land. Compared with the speed test results in
Figure 10, although SAC has smaller errors, its speed attenuation is severe on uphill and downhill slopes and real land. In conclusion, the proposed method has the fastest locomotion speed on all terrains while maintaining smaller angle errors, which indicates that the proposed method has enhanced control performance and stability.
The comparative test results for different thresholds are shown in Table 6. The results indicate that when some of the decision thresholds are decreased and others increased, obvious changes occur in the feature dimensions and the cumulative training rewards. The selection of thresholds currently relies on manual experience and needs to be adjusted for each specific application scenario.
Figure 14 shows whether each of the four legs is in contact with the ground at each moment. In the figure, black circles indicate that the corresponding leg is in contact with the ground at the current moment, and red circles indicate that it is in the air. During the test, the ground contact information was recorded from the moment the robot was set up; therefore, all four legs are in the air at step 0, and the robot starts to move from step 3. We did not add any periodic constraints or rewards when training the proposed model, so the trained quadruped robot exhibits aperiodic foot movements during motion.
The time-series movements of the left front leg joint angle and joint position are shown in
Figure 15a and
Figure 15b, respectively. Our designed reward function and constraints do not include any periodic settings; therefore, the leg movements of the trained robot exhibit aperiodic motions. The time series of joint angles and joint positions are consistent with those in
Figure 14, both starting from the robot setup. In addition, we have added the comparative leg movements of SAC, as shown in
Figure 16a,b. The motion patterns of other reinforcement learning methods are similar to that of the proposed method, all exhibiting irregular movements.
We tested the performance under different reward conditions. In the new comparative tests, we added the energy consumption of the motors to the reward function. The new reward function can be defined as
$$r_t = c_1\,(x_t - x_{t-1}) - c_2\,E_t,$$
where $c_1$ and $c_2$ are, respectively, the coefficients of the forward velocity term and the energy consumption term. The energy consumption $E_t$ is defined as
$$E_t = \sum_{i=1}^{12} \lvert t_i\, v_i \rvert,$$
where $t_i$ and $v_i$ are the torque and angular velocity of joint $i$, respectively.
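A minimal sketch of this energy-augmented reward is given below; the coefficient values are placeholders, and the energy term is assumed to sum the absolute joint torque–velocity products over the 12 joints.

```python
# Illustrative energy-augmented reward; c1 and c2 are placeholder weights.
import numpy as np

def energy_reward(x_curr, x_prev, torques, ang_vels, c1=1.0, c2=0.01):
    """Forward-progress reward minus a joint power penalty."""
    energy = np.sum(np.abs(np.asarray(torques) * np.asarray(ang_vels)))
    return c1 * (x_curr - x_prev) - c2 * energy
```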
The results of cumulative training rewards for different reward functions are shown in
Figure 17, and the test results of speed and energy consumption on different terrains are shown in
Figure 18 and
Figure 19. In the experiments, fixed coefficients $c_1$ and $c_2$ were used.
In the figure, Reward 1 refers to the original reward function, and Reward 2 refers to the reward function with added energy consumption. The final training rewards using Reward 1 and Reward 2 are 923 and 914, respectively. After adopting the new reward function, the energy consumption on the three terrains has decreased significantly; however, the test speed has decayed severely on sloped terrain and real-world terrain.
5. Conclusions
In this paper, we proposed an HTS-SAC method that optimizes the locomotion policy using historical time-series data. We designed four mutual information decision conditions based on the k-nearest neighbor mutual information theory. Through correlation analysis with these decision conditions, the key feature variables affecting locomotion performance were obtained. We then designed a heterogeneous time-series neural network to learn high-level features from the key feature variables and designed the HTS-SAC algorithm to learn an enhanced policy. The simulation results show that the HTS-SAC can generate a better policy and achieve faster speeds on various terrains.
DRL-based methods do not require a complex and exact model of the unmanned quadruped robot and can be applied to various complex scenarios through learning and training on large amounts of interaction data. However, they require a large amount of high-quality and diverse data to cover all possible application scenarios. In addition, the learned models and strategies must be transferred to real robot hardware with safety guarantees. These issues make it challenging to deploy DRL-based methods in real-world systems. In future research, we will analyze the stability and implementation of the proposed method through more in-depth theoretical research and more comprehensive hardware experiments.
The proposed method is an end-to-end, data-driven reinforcement learning approach that does not rely on the robot’s model and is therefore scalable. Quadruped robots and unmanned aerial vehicles (UAVs) both belong to the class of unmanned robotic systems. When the method is applied to UAVs, information such as the UAV’s attitude, position, velocity, and surrounding environment can be collected; through learning and training, the UAV can formulate control strategies for each motor by perceiving its current state, thereby achieving control of the UAV.