1. Introduction
In the field of legged robot locomotion, traditional control algorithms face significant challenges in achieving robust performance [1,2]. Particularly in complex terrains, these methods typically rely on intricate finite-state machines to coordinate basic motion primitives and reflex controllers [1,2]. Critical tasks such as ground contact detection and slip estimation often depend on empirically tuned thresholds, which are highly sensitive to unmodeled environmental factors like mud, snow, or vegetation [3,4,5]. Additionally, while foot-mounted contact sensors are widely used, their reliability in real-world scenarios is often limited. As operational scenarios expand, the complexity of traditional control systems grows exponentially, making them not only difficult to develop and maintain but also prone to failure in edge cases [6].
In contrast, model-free reinforcement learning (RL) offers a more streamlined alternative [7]. By optimizing a reward function on autonomously collected data, RL significantly simplifies the design of locomotion controllers [8], reducing reliance on manual parameter tuning and expert knowledge while automating many aspects of the development pipeline. This approach enables robots to acquire movement capabilities that are difficult to achieve with traditional methods. However, despite the significant advantages of RL in legged robot applications, the generated motion patterns often appear unnatural, characterized by stiff, jerky movements rather than the fluid, biologically inspired gaits seen in animals. This lack of natural motion characteristics remains a major limitation of current RL approaches to legged locomotion.
Researchers have explored integrating predefined gait priors into RL training to guide convergence towards natural gaits [9,10,11]. Imitation learning [12,13,14], where robots mimic expert reference motions using phase variables, accelerates training but struggles to scale across diverse motion types [15,16]. Alternatively, action spaces such as central pattern generators (CPGs) or policies modulating trajectory generators (PMTG) impose constraints that limit complex movement generation [17]. Although promising, these approaches often restrict behavioral flexibility.
A notable advancement is Adversarial Motion Priors (AMP) [18]. Unlike direct imitation learning, AMP leverages style rewards to encourage the agent to produce desired locomotion behaviors while maintaining natural motion characteristics. This approach provides greater flexibility, enabling robots to learn diverse and complex gait patterns without imposing restrictive action constraints [19]. However, AMP is inherently based on a Generative Adversarial Network (GAN) framework and thus inherits one of its well-known drawbacks: mode collapse. When the training dataset contains multiple types of locomotion behaviors, such as flat-ground walking and stair climbing, AMP tends to capture only one dominant mode of motion, leading to monotonous outputs and reduced generalization to diverse scenarios.
To address this issue, this article proposes Wasserstein Adversarial Motion Priors (wAMP) [20], which replaces the standard GAN loss in AMP with a Wasserstein-divergence (WGAN-div) loss [21,22]. WGAN-div provides a smoother and more informative gradient signal, even when generated actions deviate significantly from expert demonstrations. This mitigates mode collapse by encouraging the generator to explore a broader distribution of locomotion behaviors. As a result, wAMP produces more diverse and natural gait patterns, enabling quadrupedal robots to learn both regular flat-ground walking and complex skills such as stair climbing within a unified reinforcement learning framework.
Two-stage teacher–student frameworks have been developed to enable blind locomotion and reduce the sim-to-real gap [9,11]. In this paradigm, a teacher policy is first trained in simulation with access to privileged information, such as terrain friction or elevation maps, that is unavailable in the real world. A proprioceptive student policy is then trained to mimic the teacher, relying solely on onboard sensors such as IMUs and joint encoders. This approach allows the student to operate without privileged information, thereby bridging the sim-to-real gap and enabling blind locomotion.
However, the two-stage process is often cumbersome and time-consuming, as it requires training and distillation in two separate phases. To overcome this limitation, a single-stage teacher–student training framework is proposed [23]. Instead of training policies sequentially, the teacher and student are optimized jointly, with their relative contributions controlled by hybrid advantage estimation (HAE). In the early stages of training, the teacher dominates both action generation and representation encoding, providing strong guidance. Over time, the proportion of the student's contribution gradually increases until the student ultimately takes full control of the training process. This progressive strategy accelerates training while retaining the benefits of teacher–student transfer, leading to more efficient sim-to-real adaptation.
A Graph Neural Network (GNN) module enhances imitation learning and blind locomotion tasks, such as climbing stairs, by encoding the robot's skeletal information, specifically the joint positions relative to the body in Cartesian coordinates and the length and mass of the parent link of each joint. Representing the kinematic structure as a graph with joints as nodes and links as edges, the GNN captures spatial relationships and constraints across multiple consecutive states [24,25,26,27,28]. These embeddings, integrated into the RL policy, support faster convergence during imitation learning by providing a concise kinematic representation, aiding adaptation to expert demonstrations. For blind staircase climbing, where tactile feedback (for example, front-foot contact) guides navigation, the module improves spatial awareness, contributing to robust terrain perception.
Furthermore, a system-response model [29] is incorporated to enhance robustness against external disturbances. By modeling robot responses to environmental variations through proprioceptive feedback, the system-response model enables the policy to better anticipate and compensate for unexpected perturbations, thus improving stability and adaptability in real-world deployments. See Figure 1 for a side-by-side comparison with prior approaches.
The main contributions of this work are summarized as follows:
A novel reinforcement learning (RL) framework is proposed, driven by skeletal information. Joint positions relative to the body and parent link lengths and masses are encoded using a graph neural network (GNN), enriching observations and accelerating the convergence of both reinforcement and imitation learning. This enables fast, robust, and regular blind locomotion, such as stair climbing, without relying on navigation or external perception.
The framework integrates a single-stage teacher–student approach and a system-response model. The single-stage strategy jointly optimizes both teacher and student policies, improving the efficiency of sim-to-real transfer, while the system-response model enhances adaptability to environmental disturbances by predicting future states from historical observations.
The proposed approach leverages Wasserstein-based Adversarial Motion Priors (wAMP) to address the mode collapse issue that arises when training on multi-terrain datasets. By replacing the standard GAN loss with a Wasserstein-divergence loss, wAMP stabilizes adversarial imitation learning, promotes diverse gait generation, and prevents collapse on unseen terrains. This enables quadrupedal robots to learn efficiently from mixed datasets containing both flat-ground walking and complex stair locomotion. As a result, the training process achieves faster convergence across varied terrains, while the learned policies exhibit smoother and more adaptive transitions, such as between ascending and descending stairs.
2. Materials and Methods
Figure 2 illustrates the overall architecture of our framework, which integrates four key components to achieve robust and natural quadrupedal locomotion. First, the skeleton-aware Graph Convolutional Network (GCN) encoder (Section 2.2) processes skeletal information from 13 nodes representing the robot's kinematic structure. Each node is characterized by 25 features across 5 consecutive time steps, capturing joint positions, link lengths, and masses. The GCN produces a 16-dimensional embedding that encodes spatial relationships and body awareness. Second, the teacher–student framework (Section 2.3 and Section 2.4) enables efficient sim-to-real transfer. The teacher encoder receives privileged and terrain information, while the student encoder processes a 24-step history of proprioceptive observations. Their outputs are adaptively aggregated using an adaptive coefficient controlled by hybrid advantage estimation (HAE). Third, the system-response model (Section 2.5) enhances robustness by predicting future robot states from historical observations. Using contrastive learning, it aligns past observations with future states, generating implicit response features that capture the system dynamics. Finally, the actor network combines all embeddings to generate actions; the policy is optimized via Proximal Policy Optimization (PPO) with style rewards from the Wasserstein Adversarial Motion Priors (wAMP) discriminator (Section 3.5).
2.1. Reinforcement Learning Problem Formulation
The locomotion control of the legged robot is modeled as a partially observable Markov decision process (POMDP) [34] defined over states, observations, and actions, together with a state transition probability. The policy selects actions based on H steps of historical observations. A reward is given at each step, with an associated discount factor, and the objective is to maximize the expected cumulative discounted reward.
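Under standard notation (assumed here: reward r_t at step t, discount factor γ, horizon T, and policy π), this objective can be written as the expected discounted return:

```latex
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, r_t\right].
```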
Table 1 summarizes the construction and physical meaning of the observations and states. The noise levels are adopted from [35].
2.2. Skeleton Information Encoder
In this paper, the Unitree GO1 quadruped robot is modeled with 13 nodes: one base node, eight joint nodes, and four foot nodes. The adjacency matrix is constructed based on the physical connectivity of these nodes. The ordering of the nodes, together with the corresponding adjacency and degree matrices, is illustrated in Figure 3. It is worth noting that, of the two closely adjacent joints on each shoulder of the Unitree robot, only the hip joint is selected for modeling.
The output feature matrix of a graph convolutional layer is computed from the input feature matrix (representing N nodes, each with F input features) and a trainable weight matrix: the layer aggregates and transforms node features according to the graph structure encoded in the renormalized adjacency matrix, so that each node is represented by a new set of features. Subsequent layers propagate features iteratively, taking the feature matrix produced by the l-th layer together with a layer-specific trainable weight matrix and applying a ReLU activation [25].
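Both layers follow the standard renormalized GCN propagation rule of [25]; a hedged reconstruction with assumed notation is:

```latex
H^{(l+1)} = \sigma\!\left(\tilde{D}^{-\frac{1}{2}}\,\tilde{A}\,\tilde{D}^{-\frac{1}{2}}\,H^{(l)}\,W^{(l)}\right), \qquad H^{(0)} = X,
```

where \tilde{A} = A + I is the adjacency matrix with added self-loops, \tilde{D} is its degree matrix, W^{(l)} is the layer's trainable weight matrix, and σ is the ReLU activation.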
The proposed graph neural network (GNN) encoder architecture is illustrated in Figure 2. Each node in a state is characterized by a 5-D input feature vector, comprising its position in the body coordinate system together with the length and mass of its parent link. By stacking the five most recent consecutive states, each node obtains 25 features in total, giving an input dimension of (13, 25). After three GCN layers with 32, 32, and 16 hidden features, respectively, each node is represented by a 16-feature vector; these features capture the relationships between mutually influencing nodes. The features of the thirteen nodes are then flattened into a 208-dimensional vector and passed through fully connected layers to produce a 16-dimensional embedding, which is fed to the actor network.
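A minimal PyTorch sketch of the encoder described above is given below; it is not the authors' implementation, and the class name, weight initialization, and the precomputed normalized adjacency are assumptions.

```python
import torch
import torch.nn as nn

class SkeletonGCNEncoder(nn.Module):
    """Sketch: 3 GCN layers (32, 32, 16 features) followed by an MLP to a 16-D embedding."""

    def __init__(self, adj_norm: torch.Tensor, num_nodes: int = 13, in_feats: int = 25):
        super().__init__()
        # adj_norm: renormalized adjacency D^{-1/2} (A + I) D^{-1/2}, shape (13, 13)
        self.register_buffer("adj_norm", adj_norm)
        dims = [in_feats, 32, 32, 16]
        self.gcn_weights = nn.ParameterList(
            [nn.Parameter(torch.randn(dims[i], dims[i + 1]) * 0.1) for i in range(3)]
        )
        self.mlp = nn.Sequential(nn.Linear(num_nodes * 16, 128), nn.ReLU(), nn.Linear(128, 16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 13, 25) -- 13 nodes, 5 features per node stacked over 5 consecutive states
        h = x
        for w in self.gcn_weights:
            h = torch.relu(self.adj_norm @ h @ w)      # graph convolution + ReLU
        return self.mlp(h.flatten(start_dim=1))        # (batch, 208) -> (batch, 16) embedding
```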
Through domain randomization of physical parameters during training, the GCN learns to extract structural features that remain stable across different dynamic conditions. This allows the skeleton encoding to provide a reliable morphological representation that works with varying contact dynamics, friction coefficients, and external disturbances, making it robust to diverse physical environments encountered during deployment.
2.3. Teacher and Student Encoder Networks
To simplify the learning process and mitigate training challenges, a latent representation is created by implicitly reducing the dimensionality of the privileged state. For successful sim-to-real transfer, the proposed approach employs a teacher encoder (TE) and a student encoder (SE), each with its own network parameters. The TE is designed to function as an environmental factor network, taking privileged information along with terrain details as input. In contrast, the SE, which acts as an adaptation network, receives only historical proprioceptive observations. Both encoders generate a set of latent feature vectors. This approach allows the student to acquire a policy suitable for real-world deployment, as it learns from non-privileged sensor data.
The specifics of the TE and SE are described below:
The proposed framework enables single-stage training of the entire teacher–student network. Drawing inspiration from adaptive asymmetric DAgger (A2D) [36], an adaptive coefficient is introduced for each agent. This coefficient dynamically adjusts the mixture between the teacher's encoding of environmental information and the student's encoding of historical proprioceptive observations.
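The aggregation can be summarized as follows; this is a hedged reconstruction in which the mixing coefficient is written as β and z_T, z_S denote the teacher and student latent vectors (notation assumed):

```latex
z_t = (1-\beta)\, z_{\mathrm{T}} + \beta\, z_{\mathrm{S}}, \qquad \beta \in [0, 1],
```

so that β = 0 corresponds to purely teacher-driven encoding at the start of training and β = 1 to full student control, consistent with the schedule described in Section 2.4.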
The teacher–student framework serves two main purposes. First, it helps reduce the sim-to-real gap by allowing the student to learn a deployable policy using only onboard sensors. Second, it enables blind locomotion by teaching the student to infer surrounding terrain from proprioceptive signals alone. The mechanism works by having the student encoder approximate both the privileged information encoding and the terrain encoding that the teacher obtains only in simulation. Through processing 24-step historical proprioceptive observations with LSTM, the student learns to reconstruct these features from motion patterns, enabling terrain-aware control without external sensors.
2.4. Hybrid Advantage Estimation (HAE) and Adaptive Aggregation
Traditional A2D methods often tie coefficient updates to a fixed number of training iterations, a strategy that can introduce instability. To overcome this, a novel hybrid advantage estimation (HAE) method [23] is introduced, which provides a reliable metric (Equation (8)) for evaluating policy improvement. HAE determines whether a policy update results in an increased reward advantage. This metric is then used to govern the adjustment of the mixing coefficient, gated by an indicator function B that signals whether the agent, following the teacher's policy, has reached an average terrain difficulty level of 5.
The curriculum learning strategy is fully detailed in Section 3.1. Initially, the mixing coefficient is set to 0. The teacher's latent representation is considered sufficiently accurate once the agent reaches an average terrain difficulty level of 5. At this point, the teacher encoder's parameters are frozen, and the student encoder begins to actively participate in RL training. As the coefficient is incremented in fixed steps based on the change in HAE, the influence of the student gradually increases until the coefficient reaches 1, signifying a full transition in which the student alone interacts with the environment. Guided by HAE, this adaptive aggregation strategy enables the student to reliably imitate the teacher while mitigating over-reliance.
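A minimal Python sketch of this HAE-gated schedule follows; the step size and all names are illustrative assumptions rather than the authors' values.

```python
def update_mixing_coefficient(beta: float,
                              hae_improved: bool,
                              avg_terrain_level: float,
                              step: float = 0.05) -> float:
    """Advance the teacher/student mixing coefficient toward 1 when progress allows.

    beta              -- current coefficient (0 = teacher only, 1 = student only)
    hae_improved      -- True if hybrid advantage estimation reports an increased advantage
    avg_terrain_level -- average curriculum terrain difficulty reached under the teacher policy
    step              -- illustrative increment applied per qualifying update
    """
    teacher_ready = avg_terrain_level >= 5.0   # indicator B in the text
    if teacher_ready and hae_improved:
        beta = min(1.0, beta + step)
    return beta
```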
2.5. System-Response Model
To enhance robustness against external disturbances, a system-response model [29] is incorporated. The principle follows Internal Model Control (IMC) [37], which suggests that robust control can be achieved by simulating system responses instead of directly modeling disturbances. In legged locomotion, disturbances such as terrain elevation, friction, and restitution are difficult to model explicitly. Instead, the system-response module predicts the robot's reaction to implicit stability requirements.
In concrete terms, a history of proprioceptive observations is encoded into a latent embedding, the implicit response feature. The training objective is to ensure that this embedding accurately represents the successor state of the robot. The implicit response is optimized with contrastive learning. Specifically, observation-history/successor-state pairs sampled from the same trajectory are treated as positive pairs, while pairs from different trajectories are treated as negative pairs. Setting positive and negative pairs allows the system-response model to learn discriminative representations: positive pairs encourage the embedding to capture consistent dynamics between past observations and their true successor state, while negative pairs push apart embeddings from unrelated trajectories. A source encoder processes the observation history and a target encoder processes the successor state, producing normalized latent features. Cluster assignment probabilities are computed from these features and a set of normalized prototypes with a temperature parameter, and the representation learning objective minimizes a loss that matches these probabilities to target assignments q obtained via the Sinkhorn-Knopp algorithm [38], which helps to avoid trivial solutions. This optimization framework enables the implicit response model to learn a meaningful representation of the robot's dynamics, which is important for handling external disturbances.
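A hedged reconstruction of this SwAV-style objective, with assumed notation (z the normalized source feature, c_k the normalized prototypes, τ the temperature, and q the Sinkhorn-Knopp target assignment):

```latex
p^{(k)} = \frac{\exp\!\left(z^{\top} c_k / \tau\right)}{\sum_{j}\exp\!\left(z^{\top} c_j / \tau\right)},
\qquad
\mathcal{L}_{\mathrm{contrastive}} = -\sum_{k} q^{(k)} \log p^{(k)}.
```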
The system-response model (SRM) helps the robot handle external disturbances and dynamic uncertainties. By learning to predict the robot's future state from historical observations through contrastive learning, the SRM captures how the robot responds to disturbances such as terrain variations, friction changes, and external forces without explicitly modeling these factors. The learned response embedding is integrated into the actor network, providing the policy with system dynamics information.
2.6. Network Architecture Details
Our framework integrates multiple network modules with different architectures. Table 2 provides an overview of all components, where standard MLP modules list their hidden layer dimensions in the "Architecture" column and specialized architectures are marked with asterisks.
The Skeleton Info Encoder employs a 3-layer Graph Convolutional Network with channel dimensions [N × 32, N × 32, N × 16], where N is the number of skeleton nodes. The GCN output is then processed through an MLP with layers [N × 16, 128, 16] to produce the final graph embedding. The Student Encoder consists of a 3-layer LSTM with hidden size 256 that processes the 24-step observation history, followed by an MLP with layers [256, 64, 16] to extract the information embedding and the environment embedding.
All networks are optimized using the Adam optimizer with default learning rates, except for the wAMP Discriminator which uses the RMSProp optimizer.
3. Training Process
3.1. Training Curriculum
A total of 4096 parallel agents were trained on different types of terrain using the IsaacGym simulator [35], which supports NVIDIA RTX 30-series GPUs (NVIDIA Corporation, Santa Clara, CA, USA) [39]. The teacher and student policies were trained simultaneously in a single stage with a total of 800 simulated time steps. The total training time for this process was 7 h of wall-clock time. Each RL episode lasts for a maximum of 1000 steps, equivalent to 20 s, and terminates early if the termination criteria are met. The control frequency of the policy is 50 Hz in simulation. All training was performed on a single NVIDIA RTX 4090 GPU. The primary memory requirement stems from our 24-step LSTM encoder; for resource-constrained settings, lightweight alternatives such as GRU units can be adopted to reduce the memory footprint [11].
To address the overfitting of the LSTM encoder caused by repetitive terrain patterns, in particular the false identification of flat ground as stairs, a diverse set of stair terrains was designed with varying step widths and heights (Figure 4). Specifically, the step width is randomly sampled from a range between 0.31 and 1.2 m, allowing the robot to experience transitions where the entire base can stand on a single step. This increases the diversity of terrain transitions and improves the policy's ability to distinguish between flat surfaces and obstacles, thereby enhancing deployment stability in real-world scenarios.
The terrain setup included four procedurally generated types, as in [35]: smooth slopes, rough slopes, stairs, and discrete obstacles. A height field map comprising 100 terrain segments was arranged in a grid, with each row representing a specific terrain type and difficulty progressively increasing from left to right. All terrain segments had a uniform size.
Smooth sloped terrains are generated with inclinations gradually increasing from 0° to 45°. Rough slopes use the same slope range but include surface irregularities by adding elevation noise with a maximum amplitude of . The stairs consist of four types with step widths of , , , and ; wider stairs are associated with higher maximum step heights to enrich terrain diversity. Discrete terrains feature obstacles with two height levels, progressively increasing from to .
At the beginning of training, all robots are randomly initialized on terrains with difficulty levels ranging from 0 to 4 (with a maximum difficulty of 9). During each reset, if the robot moves beyond half of the terrain width, it will progress to a higher difficulty level. Conversely, if the robot fails to cover at least half of the expected distance based on its commanded velocity, the difficulty level will be reduced. To prevent skill forgetting, robots that reach the highest difficulty are reset to a randomly selected difficulty within the current terrain type.
A command-conditioned policy was trained using velocity tracking. Each episode began with a desired command comprising longitudinal, lateral, and yaw velocities in the base frame. Two command curricula were employed: heading commands (60% probability) facilitated rapid terrain navigation by guiding yaw alignment before forward movement, while randomly sampled commands (40% probability) fixed the yaw axis within a command cycle and varied forward velocities. This design allowed the agent to practice in-place rotation more frequently, thus preventing unnatural spinning behaviors.
Episodes ended after meeting failure criteria including trunk-ground collisions, excessive body inclination, or prolonged immobilization.
3.2. Dynamics Randomization
To enhance policy robustness and support sim-to-real transfer, domain randomization is applied during training by varying several dynamic properties at the beginning of each episode (Table 3). These include the mass of the robot's trunk and limbs, the payload's mass and location on the body, ground friction and restitution coefficients, actuator strength, joint-level proportional-derivative (PD) gains, and initial joint configurations. A subset of these randomized parameters is treated as privileged state information to assist in training the teacher policy. Additionally, observation noise is injected following the same scheme as described in [35].
3.3. Motion Capture Data Preprocessing
Since only state transitions are required to construct the motion dataset D, this article adopts a trajectory optimization (TO) approach [40] based on centroidal dynamics to generate quadrupedal locomotion trajectories with a trotting gait on flat terrain. Specifically, the OCS2 framework [41,42] is used to solve the TO problem, which explicitly enforces friction cone constraints and kinematic constraints.
To collect motion data, a traditional control simulation of the Unitree Go2 quadrupedal robot is deployed in the Gazebo 11 environment (Figure 5). During the simulation, state transitions are recorded at a frequency of 50 Hz. Each recorded frame consists of joint positions, joint velocities, base linear velocity, base angular velocity, and base height. The dataset D includes trajectories of forward, backward, lateral left, lateral right, left steering, right steering, and combined locomotion, with a total duration of approximately 30 s.
3.4. Reward Terms Design
In this work, the overall reward function is composed of four components: a task term, a regularization term, a contact term, and a style term. The total reward is the sum of these four terms.
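Written out with assumed symbols (any per-term weights are folded into the individual terms), this reads:

```latex
r_t = r^{\mathrm{task}}_t + r^{\mathrm{reg}}_t + r^{\mathrm{contact}}_t + r^{\mathrm{style}}_t.
```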
To encourage accurate tracking behavior, the task reward focuses on minimizing the error between the commanded and actual linear and angular velocities.
Regularization terms are introduced to promote smooth joint motion, consistent gait patterns, and base stability.
The contact reward helps the robot move more smoothly when stepping over obstacles by reducing undesired foot collisions and promoting cleaner foot placement during contact transitions.
Finally, the style reward measures how well the agent's behavior matches expert motion patterns from the reference dataset D. It is computed using a wAMP discriminator that assigns high scores to agent behaviors resembling those in the demonstration set. This reward term encourages natural and smooth trotting gaits.
To ensure training stability and handle the unbounded output of the Wasserstein discriminator, the style reward is normalized using running statistics. An exponential moving average of the mean and variance of the discriminator's outputs over recent batches is maintained, and the reward is standardized with these running statistics.
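A hedged reconstruction of this normalization, with assumed notation (d_t the raw discriminator output, μ and σ² its running mean and variance, and ε a small constant):

```latex
r^{\mathrm{style}}_t = \frac{d_t - \mu}{\sqrt{\sigma^{2} + \varepsilon}},
\qquad
\mu \leftarrow \lambda\,\mu + (1-\lambda)\,\bar{d}, \quad
\sigma^{2} \leftarrow \lambda\,\sigma^{2} + (1-\lambda)\,\overline{(d-\bar{d})^{2}},
```

where λ is the moving-average decay and \bar{d} is the batch mean of the discriminator outputs.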
This normalization scheme centers the reward around zero and maintains a consistent scale, which significantly improves the stability of the reinforcement learning process.
The details of the reward functions are shown in Table 4.
3.5. Adversarial Motion Priors with Wasserstein Divergence
Following [18], a discriminator is defined as a neural network that predicts whether a state transition is a real sample from the reference dataset D or a fake sample produced by the agent.
Each state consists of joint positions, joint velocities, base linear velocity, base angular velocity, foot positions, and base height relative to terrain. The inclusion of foot positions is crucial to ensuring that the learned motion does not cause excessive foot dragging. Moreover, in the reference dataset, when executing a stop command, the robot comes to a halt with all four feet grounded, ensuring that the learned policy stabilizes the feet in place when stopping. This design helps prevent unintended foot movements and improves the naturalness of stopping behaviors.
The conventional GAN-based discriminator, trained solely on flat-terrain motion data, utilizes a Least-Squares GAN (LSGAN) objective with a gradient penalty term. However, this approach is prone to mode collapse when the agent’s policy explores motions that are valid but fall outside the narrow distribution of the flat-terrain dataset, leading to vanishing gradients and hindering the learning of diverse, adaptive locomotion skills.
To overcome these limitations, a Wasserstein-based objective is adopted to train the discriminator. This formulation provides smoother gradients and prevents mode collapse by maximizing the Wasserstein distance between the distributions of agent motions and reference demonstrations. The discriminator is optimized using RMSProp.
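A minimal PyTorch sketch of a WGAN-div-style discriminator step is given below. It is not the authors' implementation; the coefficients k and p follow the defaults reported in the WGAN-div literature [21,22], and all names are illustrative.

```python
import torch

def wgan_div_discriminator_step(disc, optimizer, real_trans, fake_trans, k=2.0, p=6.0):
    """One RMSProp update of a Wasserstein-divergence discriminator.

    disc       -- network mapping a flattened state transition (s, s') to a scalar score
    optimizer  -- e.g. torch.optim.RMSprop(disc.parameters(), lr=1e-4)
    real_trans -- batch of reference transitions from the motion dataset D
    fake_trans -- batch of transitions produced by the current policy
    """
    real_trans = real_trans.clone().requires_grad_(True)
    fake_trans = fake_trans.clone().requires_grad_(True)

    real_score = disc(real_trans)
    fake_score = disc(fake_trans)

    # Wasserstein term: reference transitions should score higher than policy transitions.
    wasserstein = fake_score.mean() - real_score.mean()

    # Divergence (gradient) penalty evaluated on both real and fake samples.
    grad_real = torch.autograd.grad(real_score.sum(), real_trans, create_graph=True)[0]
    grad_fake = torch.autograd.grad(fake_score.sum(), fake_trans, create_graph=True)[0]
    grad_penalty = (grad_real.norm(2, dim=1) ** p).mean() + (grad_fake.norm(2, dim=1) ** p).mean()

    loss = wasserstein + (k / 2.0) * grad_penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```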
3.6. Training Pipeline
Figure 6 illustrates the complete training workflow of our framework. The training process follows a standard reinforcement learning loop with multiple specialized components working together.
Environment Initialization and State Acquisition. At the beginning of each episode, the environment is reset and the initial state is obtained from the simulation. This state contains observations (proprioceptive information), privileged states (simulation-only information such as base linear velocity and contact forces), terrain information (height map), and skeleton information (robot morphology).
Encoder Processing. Different components of the state are processed by their respective encoders. The Student Encoder processes the 24-step observation history through an LSTM and MLP to produce its information and environment embeddings. The Teacher Encoder processes the privileged and terrain information to generate the corresponding teacher embeddings. The SRM Source Encoder processes a 5-step observation history to extract a dynamics embedding. The Skeleton Encoder uses a GCN and an MLP to encode the robot's morphology into a skeleton embedding.
Adaptive Fusion and Action Generation. The teacher and student embeddings are adaptively fused using the adaptive coefficient described in Equation (6). The Actor Network integrates multiple inputs, including the 5-step observations, the SRM embedding, the fused teacher–student embeddings, and the skeleton embedding, to generate the action.
Environment Interaction and Reward Calculation. The action is executed in the simulation environment, yielding the next state and a done flag. The total reward is computed as the sum of task reward and style reward. The style reward is provided by the wAMP discriminator evaluating motion features extracted from the current state.
Experience Collection. The tuple is stored in the replay buffer. This process repeats for 24 time steps to accumulate sufficient experience for network updates.
Network Update. After collecting 24 steps of experience, all networks are updated. The PPO loss consists of three components:
Surrogate Loss: the clipped policy-gradient objective, computed from the probability ratio between the new and old policies, the advantage estimate, and a clipping parameter.
Value Loss (with clipping): the squared error between the predicted value and the return, with the value prediction clipped around the previous value estimate.
Entropy Loss: a bonus term that encourages exploration by maximizing policy entropy.
The complete PPO loss combines these three terms.
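The standard PPO formulation matching this description is reproduced below as a hedged reference (notation assumed: r_t(θ) the probability ratio, Â_t the advantage, ε the clipping parameter, R_t the return, V_old the previous value estimate, and c_1, c_2 weighting coefficients):

```latex
\mathcal{L}_{\mathrm{surr}} = -\,\mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],

\mathcal{L}_{\mathrm{value}} = \mathbb{E}_t\!\left[\max\!\left(\left(V_\theta(s_t)-R_t\right)^{2},\;
  \left(V_{\mathrm{old}}(s_t)+\operatorname{clip}\!\left(V_\theta(s_t)-V_{\mathrm{old}}(s_t),\,-\epsilon,\,\epsilon\right)-R_t\right)^{2}\right)\right],

\mathcal{L}_{\mathrm{ent}} = -\,\mathbb{E}_t\!\left[\mathcal{H}\!\left(\pi_\theta(\cdot \mid o_t)\right)\right],
\qquad
\mathcal{L}_{\mathrm{PPO}} = \mathcal{L}_{\mathrm{surr}} + c_1\,\mathcal{L}_{\mathrm{value}} + c_2\,\mathcal{L}_{\mathrm{ent}}.
```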
The total training loss combines the PPO loss with auxiliary terms: the contrastive loss, the distillation loss, and the discriminator loss described in their respective subsections.
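A hedged summary, with assumed weighting coefficients λ_c, λ_d, and λ_D for the auxiliary terms:

```latex
\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{PPO}}
  + \lambda_{c}\,\mathcal{L}_{\mathrm{contrastive}}
  + \lambda_{d}\,\mathcal{L}_{\mathrm{distill}}
  + \lambda_{D}\,\mathcal{L}_{\mathrm{disc}}.
```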
4. Results
4.1. Hardware
The controller is deployed on the Unitree Go1 Air robot (Unitree Robotics Co., Ltd., Hangzhou, China), which stands 33 cm tall and weighs 13 kg. The sensors used on the robot consist of joint position encoders and an IMU. The trained student policy is optimized using the ONNX framework to improve inference speed and runs on a laptop with a Ryzen 7 5800H CPU (Advanced Micro Devices, Inc., Santa Clara, CA, USA). For deployment, the control frequency is 50 Hz. As shown in Figure 7, the trained policy enables the robot to perform various locomotion tasks, including obstacle crossing, stair climbing, recovery from missteps, and payload balancing. A demonstration of the quadruped robot locomotion and additional experiments are provided in the Supplementary Materials (Videos S1–S6).
4.2. Long-Distance Locomotion and Torque Analysis Under Extreme Conditions
Figure 8 provides visual and data analysis illustrating the quadruped robot’s performance across diverse challenging scenarios. The curves represent the motor torque trajectories during different test conditions, offering insight into how the policy adapts to varying terrains and disturbances.
The experiment begins with a long-distance stability test in a park environment covering approximately 300 m, which includes steep slopes, consecutive steps, and dense vegetation. This extended traverse validates the policy’s ability to maintain consistent locomotion over prolonged periods while handling terrain variations. The torque measurements during this phase reveal how the robot continuously adjusts joint actuation to accommodate surface changes.
Subsequently, the robot is tested under two extreme conditions: slippery surfaces and heavy payload (10 kg).
Figure 8 highlights torque data from four representative scenarios: continuous stairs, a steep slope, a 10 kg payload, and oil-contaminated slippery surfaces. When ascending continuous stairs, torque spikes correspond to the moments when the swing leg lifts to clear each step, with the stance legs providing increased support. On the steep slope, sustained elevated torque is observed in the hip and knee joints to counteract the gravitational pull. Under the 10 kg payload (approaching the robot's own body weight), the torque profiles exhibit uniformly higher magnitudes across all joints throughout the gait cycle as the actuators compensate for the additional mass. On oil-contaminated slippery surfaces, torque fluctuations become more frequent as the policy continuously adjusts to maintain traction and balance, with the reduced ground friction forcing more conservative torque modulation to prevent excessive foot slipping.
These torque measurements serve as direct indicators of the policy’s internal decision-making process. Unlike kinematic variables such as joint angles or angular velocities, torque reflects the actual control effort required under varying physical constraints. The consistent and adaptive torque profiles across all test scenarios demonstrate the robustness and effectiveness of our skeleton information-driven reinforcement learning framework.
4.3. Evaluation of Training Efficiency and Terrain Adaptation
The training efficiency and terrain level of five representative strategies are analyzed: Domain Randomization (DR), Teacher–Student (2-stage), Teacher–Student (1-stage), 1-stage + System Response, and 1-stage + System Response + Skeleton. The evaluation metrics include: (i) the progression of cumulative rewards during training, which reflects the learning efficiency of each method, and (ii) the terrain level achieved during curriculum training, which indicates adaptability to increasingly complex terrains.
As shown in Figure 9, the two-stage Teacher–Student framework benefits from the teacher policy's access to terrain information, resulting in faster improvement compared to domain randomization alone. When the average terrain difficulty reaches level 5, the student training phase begins, causing the yellow curve to restart from zero. The student policy converges to optimal performance in approximately 12,000 steps.
The single-stage Teacher–Student framework gradually mixes the student encoder once the teacher reaches terrain level 5. This design avoids the discontinuity of two-stage training and allows the policy to converge to the optimal value in about 10,000 steps. The final reward and terrain level achieved are higher than those of the two-stage framework.
Adding a system-response model and skeletal information encoding on top of the single-stage framework leads to faster convergence and higher performance. The system-response model enables anticipation of and compensation for external disturbances, while the skeletal representation accelerates the convergence of imitation learning and enhances adaptability to challenging terrains such as stair environments, resulting in consistently superior outcomes.
4.4. Robustness on Large Obstacles and Unseen Terrains
The robustness of the robot is measured by its ability to handle challenging obstacles. As shown in Figure 10, the robot successfully steps over a 25 cm high obstacle, which is close to its own standing height of 33 cm. This obstacle height is significantly larger than the 8 cm foot clearance present in the AMP demonstration dataset. The policy, trained purely on proprioceptive observations without any exteroceptive sensors, exhibits emergent behaviors such as trunk lifting and foot raising after detecting a head collision, allowing for successful traversal. Additionally, the method is further evaluated against several baselines using only proprioception inputs.
All RL methods above were trained using the same curriculum strategy detailed in Section 3.1, the same reward functions detailed in Section 3.4, and the same random seed. For a fair comparison, the same low-level network architecture detailed in Table 2 is used for all RL methods. The CNN input sequence length for the RMA baseline was 50, with access to the same memory length as the LSTM encoder.
Each controller was evaluated in a single-step task using a remote-control command of 0.4 m/s. For each step height, the robot attempted to ascend or descend one step from the starting point, and a trial was considered successful if it completed the motion without falling. Each height was tested 10 times and the success rate was recorded. These experiments were carried out on the physical robot, and the results are reported in Figure 11.
To further evaluate the limits of the proposed method (1-stage + System Response + Skeleton), additional tests are conducted in simulation using Isaac Gym. In this setting, the environment consisted of ten parallel stair setups, each composed of five consecutive steps with varying heights. For each step height, 10 trials were performed in parallel, and the success rate was measured. As shown in Figure 12, the proposed method demonstrates enhanced robustness in simulation, successfully navigating step heights up to 35 cm.
4.5. Evaluation of Anti-Disturbance Capability
To examine the controller’s robustness against external perturbations, several physical disturbance tests are conducted on the Unitree Go1 robot, including dragging interference, payload loading, missing step, and lateral hit. All evaluations are performed in real-world scenarios using only proprioceptive observations:
Dragging: To assess robustness against pulling disturbances, the robot is commanded to move forward at 0.4 m/s while dragging a pair of dumbbells. The maximum straight-line distance it can travel without falling is measured over 3 trials, with a maximum distance of 10 m.
Payload: This test evaluates the robot’s ability to handle additional static loads. A dumbbell payload is fixed on the robot’s torso and is commanded to walk forward at 0.4 m/s. Performance is assessed by the maximum distance the robot can travel stably without falling, measured in 3 trials.
Missing Step: The robot’s recovery from an unexpected drop is tested as it is commanded to walk forward at 0.4 m/s and step off a platform. The success rate of this maneuver is recorded over 10 trials.
Lateral Hit: To evaluate the robot's capability to recover from a push, a dumbbell attached to a pendulum is raised to 15° from the vertical and released from a height of 100 cm, with the robot positioned directly beneath the rope so that the mass delivers a lateral impact. The success rate for maintaining balance is recorded over 10 trials.
The method is first compared with several representative baselines to demonstrate its overall performance in handling these disturbances. The methods considered for this study include:
As summarized in Figure 13, the approach demonstrates significantly improved robustness, outperforming all baseline methods against external perturbations. Although domain randomization allows other methods to handle payload and lateral impacts effectively, our approach demonstrates superior performance in dragging and missing-step scenarios. This is primarily attributed to the method's ability to utilize skeletal information, allowing the robot to better detect when its rear feet are being dragged or its front feet miss a step.
To further validate the contribution of each component within the framework, an ablation study is conducted. This comparison includes the following training strategies:
Domain Randomization (DR): The baseline approach where the agent is trained with domain randomization alone and deployed directly.
Teacher-Student (2-stage): A two-stage framework where a teacher policy is first trained and then distilled to a student.
Teacher-Student (1-stage): A single-stage framework where both teacher and student are trained jointly.
1-stage + System Response: The single-stage framework with the addition of the system-response model.
1-stage + System Response + Skeleton: The full proposed method, which includes both the system-response model and skeletal information encoding.
As shown in Figure 14, the method's superior performance is a result of the effective integration of its core components. The system-response model allows the robot to proactively anticipate and compensate for external forces, making it exceptionally robust against payload and lateral hit disturbances. For the dragging test, the Graph Neural Network (GNN) encodes skeletal information, providing the policy with a deeper awareness of its own physical state under tension. Finally, success in the missing step test is a direct result of the specialized terrain curriculum, which trains the robot to handle complex vertical transitions by distinguishing between different ground surfaces and obstacles. This combination of proactive control, enhanced body awareness, and specialized training enables the method to consistently outperform other approaches.
5. Discussion
This work introduces a skeleton information-driven reinforcement learning framework that integrates a Graph Convolutional Network (GCN), a single-stage teacher–student architecture, a system-response model, and Wasserstein Adversarial Motion Priors (wAMP). The GCN encodes relative joint and foot positions, enriching the observation space and enabling more reliable gait generation on irregular terrains. Compared with previous two-stage methods such as RMA, single-stage training accelerates convergence and simplifies deployment. The system-response model further improves stability by implicitly encoding dynamic reactions, while wAMP mitigates mode collapse on multi-terrain datasets, allowing smoother transitions such as stair climbing. Together, these components produce robust and natural quadrupedal locomotion in diverse environments.
5.1. Constraints and Assumptions
While our experimental validation focuses on the Unitree GO1 quadruped, the framework is designed to be generalizable across different platforms. The skeleton-based encoding via Graph Convolutional Networks naturally adapts to different morphologies [25], and the teacher–student learning paradigm has shown consistent effectiveness across diverse quadruped platforms [9,17,33]. Our System-Response Model employs contrastive learning to capture system dynamics, a technique successfully applied across quadrupeds and humanoids [29,32,44]. Similarly, our AMP-based imitation learning has demonstrated effectiveness in both quadruped [39] and humanoid control [20,44]. Our current implementation deliberately focuses on proprioceptive feedback to establish robust blind locomotion capabilities [17,23,39]. However, the framework can be extended with exteroceptive sensing, as demonstrated by recent successful integrations of LiDAR [45] and vision [46] in legged robots. The framework supports diverse motion prior sources, including motion capture data [20,30,44] and traditional control datasets [39]. These design choices position our framework for broader application beyond the GO1 platform, though comprehensive validation across multiple robots and sensing modalities remains important future work.
Table 5 summarizes the requirements and constraints of our framework by module.
Table 5 presents the GPU memory requirements for each module in our framework. The baseline IsaacGym environment with 4096 parallel agents occupies approximately 6 GB of GPU memory. Each module adds additional memory overhead, with the Student Encoder’s LSTM contributing the largest increase (+12 GB). The complete framework requires approximately 22 GB total GPU memory for training. For resource-constrained settings, the LSTM in the Student Encoder can be replaced with Gated Recurrent Units (GRU) to reduce memory consumption.
Our framework uses proprioceptive sensing to validate the feasibility of blind locomotion, which does not imply that it is restricted to proprioceptive-only configurations. For vision-based training, the framework can be extended by removing the environment encoding components from the teacher and student encoders and replacing them with depth camera encoders to process visual information.
5.2. Future Work
Despite these advances, limitations remain. The framework has been validated primarily on a quadruped platform without exteroceptive sensing. Future work may extend skeleton-aware representations with visual or tactile input and validate the approach across different robot platforms. Exploring integration with high-level planning or multi-robot coordination could further broaden its applicability in unstructured environments.