1. Introduction
National investment in the field of humanoid robots has increased significantly, driving rapid technological advancements across diverse application scenarios, including home services, industrial manufacturing, and disaster rescue [1]. In these contexts, terrain adaptation is defined by the complex mechanical interactions and physical attributes at the contact interface between the robot’s feet and the external environment. Consequently, developing robust terrain adaptation is essential for enhancing the practical locomotion capabilities of humanoid robots.
Despite the rapid evolution of motion control algorithms, a persistent challenge in Deep Reinforcement Learning (DRL) based locomotion is the inherent conflict between exploration and stability. Traditional reward shaping typically relies on manual, static weighting. This rigid approach often fails during transitions between radically different terrains (e.g., from flat ground to steep stairs), leading to a phenomenon known as “safety deadlock”: the agent, fearing large penalties from instability, converges to a local optimum where it refuses to initiate movement. While some adaptive methods exist, a systematic framework that integrates fuzzy logic to dynamically regulate this trade-off within a phased training pipeline remains under-explored.
In this paper, we propose a novel Adaptive Fuzzy-Reinforcement Learning (AF-RL) framework for the G1 humanoid robot. Unlike standard approaches that treat reward weights as hyperparameters, our methodology treats them as dynamic control variables regulated by a fuzzy supervisor. This allows the system to autonomously relax constraints to encourage exploration during stagnation and tighten them to ensure safety during high-risk maneuvers.
The main contributions of this work are summarized as follows:
A Novel Dynamic Reward Shaping Mechanism for High-DoF Humanoids: Unlike existing works that rely on static weights, we formulate a fuzzy-logic-based supervisor specifically designed to resolve the “exploration-stability conflict” in humanoid locomotion. This mechanism dynamically modulates penalty terms based on real-time stability metrics, providing a mathematically grounded solution to the “safety deadlock” problem where standard RL agents fail to initiate movement on complex terrains.
Systematic Evaluation of a Phased Training Strategy: We introduce a two-stage transfer learning pipeline that effectively bridges the gap between flat-ground locomotion and high-step climbing (up to 0.20 m). The ablation study empirically proves that this strategy, when combined with fuzzy supervision, improves the convergence rate by approximately 5× compared to standard PPO baselines and significantly increases the asymptotic reward.
Validation of Robustness on Heterogeneous Terrains: The proposed framework is rigorously validated on diverse unstructured terrains, including steep stairs and variable-height slopes. The results demonstrate a superior success rate (>86%) and gait stability, significantly outperforming conventional model-based (MPC) and static-RL approaches in non-structured environments.
The remainder of this article is organized as follows:
Section 2 reviews related works on humanoid locomotion.
Section 3 details the proposed Fuzzy-RL methodological framework.
Section 4 describes the G1 robot system and flat-ground training.
Section 5 presents the phased training strategy for complex terrains.
Section 6 provides a systematic ablation study and experimental results.
Section 7 offers a critical discussion and comparison with state-of-the-art methods.
Finally, Section 8 concludes the paper.
2. Related Works
The advancement of walking proficiency primarily depends on sophisticated motion control algorithms, which are categorized into model-based and learning-based approaches.
2.1. Model-Based Control
Historically, model-based methods dominated research between 1990 and 2020, as detailed in several comprehensive surveys [2,3,4,5]. Early pioneering works utilized the Zero-Moment-Point (ZMP) model, most notably deployed on the ASIMO robot, to achieve stable walking [6]. Subsequently, simplified dynamic models such as the Linear Inverted Pendulum (LIP) and capturability analysis were introduced to facilitate the transition from static to dynamic balance [7,8]. Furthermore, optimization-based techniques, including Model Predictive Control (MPC) and whole-body feedback control, were developed to handle complex constraints and disturbances on platforms like Atlas and Cassie [9,10,11]. For instance, researchers investigated balance restoration under continuous external shocks, while others proposed virtual gravity compensation to smooth transitions across irregular planes [12,13]. However, these methods rely heavily on precise dynamic modeling and often struggle to generalize to unstructured environments where contact dynamics are difficult to model accurately.
2.2. Reinforcement Learning for Locomotion
In contrast, learning-based control offers an end-to-end paradigm that reduces reliance on precise mathematical modeling. Techniques such as Deep Reinforcement Learning (DRL) have demonstrated superior flexibility in high-precision tasks [14]. Specifically, regarding terrain adaptation, recent studies have successfully applied RL to quadrupedal robots for navigating challenging uneven surfaces using laser and vision-based perception [15,16,17]. Breakthroughs by Hwangbo et al. [18] and Miki et al. [19] showed that neural network policies could outperform classical controllers in wild environments. More recently (2023–2025), the focus has shifted towards Sim-to-Real transfer for high-DoF humanoid robots. Li et al. [20] (2024) utilized transformer-based architectures to enhance gait stability on the Unitree H1. Similarly, Kim et al. [21] (2025) proposed advanced navigation strategies for quadrupedal systems. Although the Unitree G1 robot [22] shares similar hardware architecture, applying these strategies directly to bipedal humanoids remains challenging due to the inherent instability of two-legged locomotion.
2.3. Reward Shaping and Fuzzy Logic
A critical bottleneck in RL is the design of the reward function. Standard static weights often lead to suboptimal policies when task difficulty changes abruptly. To address this, Nguyen et al. [23] (2023) explored fuzzy logic-based scaling of reward functions, demonstrating improved convergence in navigation tasks. Yang et al. [24] (2024) further applied fuzzy control to manage multiple tasks in human–robot interaction. Similarly, Moon et al. [25] investigated adaptive fuzzy tracking controllers, and Xu et al. [26] proposed fuzzy-reward mechanisms for complex assembly tasks. However, applying Fuzzy Logic specifically to dynamically regulate the “exploration vs. stability” trade-off during humanoid stair climbing remains a gap in the literature. This study addresses this gap by proposing a Fuzzy-RL framework that adaptively tunes reward penalties based on terrain complexity.
3. G1 Robot: Introduction and System Configuration
3.1. Overview of the Framework
To achieve robust locomotion across complex terrains (e.g., stairs and slopes), we propose a hybrid control framework that integrates Deep Reinforcement Learning (DRL) with a Fuzzy Logic Supervisor. The overall architecture is illustrated in Figure 1.
As shown in the figure, the system operates in a closed-loop manner. At each time step t, the robot’s state (including joint positions, velocities, and base orientation) is observed from the Isaac Gym simulation environment. This state information is then processed through two parallel paths:
The RL Path (Bottom): The PPO-based Actor network maps the state to the target joint positions, which drive the robot’s movement.
The Fuzzy Path (Top): Simultaneously, the Fuzzy Inference System (FIS) extracts high-level stability metrics (the velocity tracking error and the stability index) to compute a dynamic penalty multiplier.
The core innovation lies in the Reward Shaping Mechanism, where the dynamic multiplier modulates the penalty terms in real-time. This allows the system to autonomously tighten constraints when vibration is detected and relax them during stable locomotion, effectively guiding the policy optimization process.
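The modulation described above can be illustrated with a minimal sketch. It assumes that the shaped reward is the sum of the task rewards minus the fuzzy multiplier times the sum of the penalty magnitudes; the function name, the term names, and the numeric values are hypothetical and are not taken from the training code. The bounds [0.5, 4.0] are the multiplier range stated in Section 3.4.

```python
def shaped_reward(task_rewards, penalties, multiplier):
    """Combine task rewards with penalties scaled by the fuzzy multiplier.

    task_rewards: dict of positive reward terms (hypothetical names).
    penalties: dict of non-negative penalty magnitudes (hypothetical names).
    multiplier: dynamic penalty multiplier, expected in [0.5, 4.0].
    """
    if not 0.5 <= multiplier <= 4.0:
        raise ValueError("multiplier outside the range [0.5, 4.0]")
    return sum(task_rewards.values()) - multiplier * sum(penalties.values())


# Illustrative values only: the same step scored under relaxed and tight regimes.
rewards = {"tracking_lin_vel": 1.0}
costs = {"action_rate": 0.2, "orientation": 0.1}
relaxed = shaped_reward(rewards, costs, 0.5)  # stable: constraints relaxed
strict = shaped_reward(rewards, costs, 4.0)   # vibrating: constraints tightened
print(relaxed, strict)
```

With identical raw terms, the tight regime yields a lower shaped reward than the relaxed one, which is the lever the supervisor uses to steer policy optimization.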
3.2. Introduction to the G1 Robot
G1 is a high-performance humanoid robot developed by Unitree for scientific research and education. In terms of motion performance, the basic version of G1 is equipped with 23 degrees of freedom (DoF), expandable to 43 DoF in the EDU version. Each joint integrates a high-torque motor, industrial-grade cross-roller bearings, and dual encoder technology. For environmental perception, it is equipped with a depth camera and a horizontal field-of-view LiDAR. Additionally, the G1-EDU is equipped with an NVIDIA Jetson Orin NX computing unit to support secondary development. The specific configuration parameters of the G1-EDU are shown in Table 1, and the detailed mechanical structure and component layout of the G1 robot are illustrated in Figure 2.
3.3. G1 System Architecture and Communication
The system architecture of the G1 is shown in Figure 3. For developers, the key components for operation and focus are PC1, PC2, and the external development versions.
3.4. Fuzzy Inference System Design
To support reproducibility and provide a transparent view of the control logic, the detailed design of the Fuzzy Inference System (FIS) is presented here. The system takes two inputs, the Velocity Tracking Error and the Stability Index, and computes the dynamic Penalty Multiplier.
As illustrated in Figure 4a, the Velocity Tracking Error is normalized to the domain [−1, 1]. We employ triangular membership functions (MFs) for the central regions (NS, ZE, PS) to ensure sensitive gradient control when the robot is near the target velocity. Trapezoidal MFs are used for the boundaries (NB, PB) to capture extreme deviations effectively.
Figure 4b defines the Stability Index, which quantifies the smoothness of the robot’s locomotion. It is calculated from the variance of the action noise. A ‘Low’ stability score indicates high-frequency vibration, triggering a higher intervention priority.
Finally, Figure 4c depicts the output Penalty Multiplier. The system dynamically scales the penalty weights within the range [0.5, 4.0]. When the robot exhibits unstable behavior (i.e., a Low Stability Index), the inference engine fires rules that output a ‘Large’ multiplier (up to a factor of 4.0), strictly constraining the action space to suppress vibration. Conversely, stable behaviors result in a ‘Small’ multiplier (0.5×), relaxing constraints to encourage efficient exploration.
The complete set of inference rules describing the logical mapping between the two inputs and the output is detailed in Table 2. As highlighted in Rule 5, the system prioritizes stability by acting as a ‘safety lock’: any detection of high-frequency vibration immediately triggers the maximum penalty multiplier, regardless of the velocity tracking error.
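To make the structure of such a supervisor concrete, the following is a minimal sketch rather than the implementation used in this work. The membership breakpoints, the assumption that the Stability Index is normalized to [0, 1], and the four placeholder rules are all hypothetical; the actual shapes and rules are those of Figure 4 and Table 2. For brevity the output is computed as a weighted average of singleton values (a zero-order Sugeno-style simplification), which may differ from the inference scheme used in the paper.

```python
def tri(x, a, b, c):
    """Triangular membership: 0 at a and c, 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def trap(x, a, b, c, d):
    """Trapezoidal membership: rises a->b, flat b->c, falls c->d."""
    if x < a or x > d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)


def penalty_multiplier(vel_error, stability):
    """Map (velocity error, stability index) to a multiplier in [0.5, 4.0]."""
    # Clamp inputs to the assumed domains: error in [-1, 1], stability in [0, 1].
    e = max(-1.0, min(1.0, vel_error))
    s = max(0.0, min(1.0, stability))

    # Hypothetical breakpoints: triangles in the centre, trapezoids at the edges.
    nb = trap(e, -1.0, -1.0, -0.6, -0.3)
    ns = tri(e, -0.6, -0.3, 0.0)
    ze = tri(e, -0.3, 0.0, 0.3)
    ps = tri(e, 0.0, 0.3, 0.6)
    pb = trap(e, 0.3, 0.6, 1.0, 1.0)
    low = trap(s, 0.0, 0.0, 0.3, 0.6)
    high = trap(s, 0.3, 0.6, 1.0, 1.0)

    # Placeholder rules as (firing strength, output singleton); not Table 2.
    rules = [
        (low, 4.0),                     # safety lock: unstable -> Large
        (min(high, ze), 0.5),           # stable and on target -> Small
        (min(high, max(ns, ps)), 1.0),  # stable, small error -> neutral
        (min(high, max(nb, pb)), 2.0),  # stable, large error -> moderate
    ]
    total = sum(strength for strength, _ in rules)
    return sum(strength * out for strength, out in rules) / total


print(penalty_multiplier(0.9, 0.1))  # fully unstable: only the safety-lock rule fires
print(penalty_multiplier(0.0, 1.0))  # stable and on target
```

In this sketch the first rule depends only on the stability input, so a fully ‘Low’ stability reading yields the maximum multiplier whatever the velocity error is, mirroring the safety-lock behavior described for Rule 5.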
4. Flat-Ground Walking Planning Based on Reinforcement Learning
4.1. Reward Function Design
The ‘legged_gym’ framework provides multiple reward functions. To accomplish the flat-ground walking training for the G1 robot, five categories of reward functions were selected: motion performance, posture control, safety constraints, energy consumption optimization, and special functions. As detailed in Table 3, the motion performance function primarily controls the robot’s linear, angular, and vertical velocity.
The posture control function applies negative rewards (punishments) to suppress attitude instability. The safety constraint function protects the robot’s joint motors, the energy consumption optimization function enhances motion efficiency, and the foot control function manages the robot’s foot movements.
4.2. Training Process and Results Analysis
A round of reinforcement learning training was completed after 1500 iterations. As shown in Figure 5, the trained G1 robot can walk smoothly in all directions.
The curve of the average reward function versus the number of iterations during training is recorded in Figure 6. The analysis shows that the reward function rises slowly during the initial 0–500 iterations, as the policy explores and is temporarily caught in a local optimum. Between 500 and 1200 iterations, the reward function rises rapidly, corresponding to a breakthrough stage where the policy undergoes significant improvements. After 1200 iterations, the reward feedback gradually stabilizes as the policy is refined, successfully completing the walking training task. The overall trend of the reward function shows a steady and stable increase throughout the training process.
Additionally, the walking success rate was calculated. This was determined by simultaneously training 100 G1 robots under different driving forces. The walking state was reset after a fixed episode duration of 1000 simulation steps (20 s). A trial is recorded as a ‘failure’ (fall) if the robot’s base height drops below 0.3 m or if the pitch/roll angle exceeds 0.5 rad. The success rate $\eta$ is the percentage of robots that complete the full episode without triggering these termination criteria:
$$\eta = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}} \times 100\%$$
where $N_{\mathrm{success}}$ is the number of robots that complete the episode and $N_{\mathrm{total}}$ is the number of robots in the trial.
The success rates from ten trials are shown in Table 4. The average success rate was 97.6%. These results indicate that the robot can perform the walking task proficiently after 1500 iterations of reinforcement learning training.
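The success-rate bookkeeping can be sketched as follows. The thresholds (base height below 0.3 m, pitch or roll magnitude above 0.5 rad) are the termination criteria stated above; the function names and the per-step log format are assumptions made for illustration.

```python
def episode_failed(base_heights, pitches, rolls, min_height=0.3, max_tilt=0.5):
    """Return True if any logged step violates the termination criteria."""
    return any(
        h < min_height or abs(p) > max_tilt or abs(r) > max_tilt
        for h, p, r in zip(base_heights, pitches, rolls)
    )


def success_rate(episodes):
    """Percentage of episodes completed without a fall.

    episodes: list of (base_heights, pitches, rolls) tuples, one per robot.
    """
    failures = sum(episode_failed(*episode) for episode in episodes)
    return 100.0 * (len(episodes) - failures) / len(episodes)


# Illustrative logs for two robots: the second one tilts past 0.5 rad.
upright = ([0.75, 0.74, 0.76], [0.02, -0.01, 0.03], [0.00, 0.01, -0.02])
fallen = ([0.75, 0.60, 0.41], [0.05, 0.30, 0.62], [0.00, 0.02, 0.05])
print(success_rate([upright, fallen]))
```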
By building the reinforcement learning platform and designing the reward function, the G1 robot’s flat-ground walking training was successfully completed, and the policy was deployed on the physical robot.
5. Stair-Climbing Planning Based on Reinforcement Learning
5.1. Problem Formulation of Humanoid Locomotion
To provide a precise scientific basis for the proposed controller, the locomotion task is formulated as a Markov Decision Process (MDP), characterized by the tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$.
State Space ($\mathcal{S}$): The observation vector $o_t$ encapsulates the robot’s proprioception and environmental perception. It includes the base linear velocity $v_t$, angular velocity $\omega_t$, projected gravity vector $g_t$, joint positions $q_t$, joint velocities $\dot{q}_t$, and the previous action $a_{t-1}$. To perceive complex terrains, a local height map $h_t$ sampled around the feet is integrated into the state space:
$$o_t = [v_t, \omega_t, g_t, q_t, \dot{q}_t, a_{t-1}, h_t]$$
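Action Space ($\mathcal{A}$): The policy outputs an action vector $a_t$, which represents the target joint positions. To ensure smooth execution, these targets are converted into motor torques $\tau_t$ via a Proportional-Derivative (PD) controller:
$$\tau_t = K_p (a_t - q_t) - K_d \dot{q}_t$$
where $q_t$ and $\dot{q}_t$ are the measured joint positions and velocities, and $K_p$ and $K_d$ represent the stiffness and damping gains, respectively.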
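As a worked illustration of the PD conversion, the sketch below computes per-joint torques from target positions, measured positions, and measured velocities, assuming the common form in which the stiffness term acts on the position error and the damping term opposes the joint velocity. The gain values and joint states are made-up numbers, not the G1 settings.

```python
def pd_torques(q_target, q, dq, kp, kd):
    """Per-joint PD law: stiffness on position error, damping on velocity."""
    return [
        kp_i * (qt_i - q_i) - kd_i * dq_i
        for qt_i, q_i, dq_i, kp_i, kd_i in zip(q_target, q, dq, kp, kd)
    ]


# Illustrative two-joint example with hypothetical gains.
torques = pd_torques(
    q_target=[0.10, -0.20],
    q=[0.00, -0.25],
    dq=[0.50, 0.00],
    kp=[100.0, 40.0],
    kd=[2.0, 1.0],
)
print(torques)
```

The first joint is 0.10 rad short of its target but already moving toward it, so the damping term reduces the commanded torque; the second joint is at rest and receives a pure stiffness torque.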
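Optimization Policy: We employ the Proximal Policy Optimization (PPO) algorithm to optimize the stochastic policy $\pi_\theta(a_t \mid o_t)$. The objective is to maximize the expected discounted return $J(\pi_\theta)$, defined as:
$$J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{T} \gamma^{t} r_t\right]$$
where $r_t$ is the reward at time step $t$ and $\gamma$ is the discount factor determining the priority of long-term rewards.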
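The discounted return inside this expectation can be evaluated for a single rollout with a backward recursion. The sketch below is a generic illustration with made-up rewards and discount factor; it does not reproduce the PPO update itself.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one rollout, computed from the last step back."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three unit rewards with gamma = 0.5 give 1 + 0.5 + 0.25.
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

A discount factor closer to 1 weights late rewards almost as much as early ones, which matters for stair climbing because the payoff of lifting a foot arrives several steps after the action.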
Network Architecture and Implementation Details: Unlike standard Multi-Layer Perceptron (MLP) policies, we employ an LSTM-based Actor-Critic architecture to capture the temporal dynamics of the humanoid gait. The policy network (Actor) consists of an LSTM layer with a hidden size of 64, followed by a fully connected (FC) layer of size 32. The value network (Critic) utilizes a simplified FC structure with a hidden size of 32. The observation space consists of 47 dimensions, incorporating proprioceptive data (base velocity, projected gravity, joint errors) and explicit phase information (sin (ϕ), cos (ϕ)) to synchronize leg movements. The policy outputs 12 target joint positions, which are scaled to a range of [−0.25, 0.25] rad and tracked by a PD controller. To ensure reproducibility, detailed hyperparameters and randomization settings are listed in Table 5.
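Two of the details above, the phase features and the action scaling, can be sketched as follows. The control step of 0.02 s follows from the 1000-step, 20 s episodes reported in Section 4.2. The gait period of 0.8 s, the clipping of raw outputs to [−1, 1] before scaling by 0.25, and the offset by default joint angles are assumptions made for illustration and are not confirmed by the text.

```python
import math


def phase_features(step, dt=0.02, period=0.8):
    """Return (sin(phi), cos(phi)) for a periodic gait clock.

    period is a hypothetical gait cycle duration in seconds.
    """
    phase = ((step * dt) % period) / period  # fraction of the cycle in [0, 1)
    phi = 2.0 * math.pi * phase
    return math.sin(phi), math.cos(phi)


def joint_targets(raw_actions, default_angles, action_scale=0.25):
    """Clip raw outputs to [-1, 1], scale them, and offset by default angles."""
    return [
        default + action_scale * max(-1.0, min(1.0, raw))
        for raw, default in zip(raw_actions, default_angles)
    ]


print(phase_features(10))                       # a quarter of the way through the cycle
print(joint_targets([2.0, -0.5], [0.1, 0.0]))   # first output saturates at +0.25 rad
```

Encoding the phase as a sine and cosine pair avoids the discontinuity that a raw phase value would have when the cycle wraps around.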
5.2. Complex Terrain Construction
To strictly evaluate the robustness of the locomotion policy, we constructed a diverse set of non-structured terrain benchmarks within the legged_gym simulation environment. Unlike flat ground, these terrains introduce discrete impacts and continuous inclination perturbations, mimicking real-world unstructured scenarios. As illustrated in Figure 7, the experimental environment consists of two primary terrain types. First, to test the robot’s capability for foot clearance regulation and pitch stability, we generated staircase terrains with varying step heights. The single-step height ranges from 0.05 m to 0.20 m, simulating obstacles such as curbs and building stairs. Second, to evaluate the policy’s adaptability to surface irregularities and gravitational components, we constructed rough slopes with randomized height fields. The maximum height variation ranges from 0.1 m to 0.5 m, representing outdoor hilly environments.
These terrains are generated procedurally with increasing difficulty levels (curriculum), ensuring that the agent progressively learns to handle more aggressive terrain features. The surface friction coefficient is set to 1.0 to simulate high-traction surfaces like asphalt or concrete.
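One simple way to realize such a curriculum is to interpolate each terrain parameter between its easiest and hardest value. The sketch below assumes a linear mapping from a discrete difficulty level to the parameter; the linear form, the number of levels, and the function name are assumptions, while the endpoint values are the ranges given above.

```python
def terrain_parameter(level, num_levels, easiest, hardest):
    """Linearly interpolate a terrain parameter over discrete difficulty levels."""
    if num_levels < 2 or not 0 <= level < num_levels:
        raise ValueError("level must lie in [0, num_levels) with num_levels >= 2")
    difficulty = level / (num_levels - 1)  # 0.0 for the easiest, 1.0 for the hardest
    return easiest + difficulty * (hardest - easiest)


# Ten hypothetical levels spanning the ranges stated in the text.
for level in (0, 4, 9):
    step_height = terrain_parameter(level, 10, 0.05, 0.20)  # stairs, metres
    roughness = terrain_parameter(level, 10, 0.1, 0.5)      # slopes, metres
    print(level, round(step_height, 3), round(roughness, 3))
```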
5.3. Adaptive Reward Shaping via Fuzzy Logic Control
To address the sensitivity of reward weights across varying terrain difficulties, we designed a Fuzzy Logic Controller (FLC). This controller dynamically scales five key reward terms (U): vertical velocity penalty (lin_vel_z), orientation penalty (orientation), hip position penalty (hip_pos), feet air time reward (feet_air_time), and swing height penalty (feet_swing_height).
The interaction between the Fuzzy Controller and the RL agent is illustrated as a closed-loop modulation. At each timestep, the Terrain Height is fed into the Fuzzy Controller. The controller computes a Reward Weight Scaling Factor. This factor is then multiplied by the standard reward terms in the RL environment, dynamically altering the objective function that the PPO agent optimizes.
5.3.1. Fuzzification and Linguistic Variables
The input variable is the step height deviation relative to a standard reference height of 0.07 m. The output variable represents the adjustment magnitude for the reward coefficients. Both input and output variables are mapped to five linguistic sets: Negative Big (NB), Negative Small (NS), Zero (O), Positive Small (PS), and Positive Big (PB). The discrete membership functions describing the degree of belongingness are defined in Table 6.
5.3.2. Fuzzy Rule Base (If-Then Rules)
The inference mechanism relies on expert knowledge derived from flat-ground training. The core principle is to enforce stricter stability constraints (higher penalties) and encourage higher leg clearance as the terrain complexity increases. The fuzzy rule base consists of five monotonic mapping rules, as presented in Table 7.
5.3.3. Defuzzification
To convert the fuzzy inference result into a crisp scalar value for the reward weight, we employ the Centroid Method (Center of Gravity). The final output $u_{\mathrm{crisp}}$ is calculated as:
$$u_{\mathrm{crisp}} = \frac{\sum_{i} \mu_i x_i}{\sum_{i} \mu_i}$$
where $\mu_i$ denotes the membership degree obtained from the inference engine, and $x_i$ represents the discrete empirical values for the reward coefficients (Singleton Outputs). The specific discretization of $x_i$ for each reward term is detailed in Table 8.
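For singleton outputs, the centroid reduces to a membership-weighted average, which the following sketch computes over the five linguistic sets. The membership degrees and singleton values shown are placeholders chosen for illustration; the actual values are those of Table 6 and Table 8.

```python
LABELS = ("NB", "NS", "O", "PS", "PB")


def defuzzify(memberships, singletons):
    """Membership-weighted average of singleton outputs (centroid of singletons)."""
    total = sum(memberships[label] for label in LABELS)
    if total == 0.0:
        raise ValueError("no rule fired: all membership degrees are zero")
    weighted = sum(memberships[label] * singletons[label] for label in LABELS)
    return weighted / total


# Placeholder values: a step slightly above the reference height.
example_memberships = {"NB": 0.0, "NS": 0.0, "O": 0.25, "PS": 0.75, "PB": 0.0}
example_singletons = {"NB": 0.6, "NS": 0.8, "O": 1.0, "PS": 1.2, "PB": 1.4}
print(defuzzify(example_memberships, example_singletons))
```

Because only two adjacent sets are active in this example, the crisp output falls between their singleton values, closer to the one with the larger membership degree.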
5.4. Phased Training Strategy
To prevent the policy from converging to a local optimum (specifically, the “standing still” behavior observed when the robot directly attempts high-step transitions, as shown in Figure 8), we implemented a two-stage Phased Training Strategy. Unlike standard Curriculum Learning (CL), which continuously adjusts environment parameters, our approach involves a distinct Transfer Learning mechanism where both the environment difficulty and the reward landscape are altered.
Phase 1: Base Policy Acquisition. The agent is initially trained on flat ground and low-profile obstacles (step height ≤ 0.07 m) for 1500 iterations. In this phase, the fuzzy controller maintains standard reward weights (Rules 1–3 in Table 7) to prioritize gait stability and energy efficiency. This phase establishes a robust fundamental locomotion gait.
Phase 2: Adaptive Fine-Tuning. The pre-trained policy from Phase 1 is loaded as a checkpoint to initialize the network weights for Phase 2. In this stage, the terrain difficulty is expanded to [0.10, 0.20] m. Crucially, the Fuzzy Logic Controller actively shifts the reward distribution according to Rules 4–5. As detailed in Table 9, specific penalty weights are adjusted to facilitate exploration. For instance, the penalty coefficients for orientation and hip position are reduced (relaxed) compared to Phase 1. This relaxation is essential because high-stepping motions naturally require larger body tilting and hip articulation, which would be penalized too harshly under the strict Phase 1 regime. By lowering these barriers, the agent is encouraged to attempt the high-clearance actions required for climbing.
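The two phases can be summarized as a small configuration sketch. The step-height ranges are those stated above; the checkpoint file name, the base penalty weights, and the relaxation factors of 0.5 are hypothetical placeholders, since the actual coefficients are given in the tables.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Phase:
    name: str
    step_height_range_m: Tuple[float, float]
    resume_from: Optional[str]  # checkpoint path, or None to train from scratch
    penalty_scale: Dict[str, float] = field(default_factory=dict)


def phase_weights(base_weights, phase):
    """Scale the base reward weights by the per-term factors of a phase."""
    return {
        term: weight * phase.penalty_scale.get(term, 1.0)
        for term, weight in base_weights.items()
    }


# Hypothetical base weights (negative values denote penalties).
base = {"orientation": -1.0, "hip_pos": -1.0, "lin_vel_z": -2.0}

phase1 = Phase("base_policy", (0.0, 0.07), resume_from=None)
phase2 = Phase(
    "adaptive_fine_tuning",
    (0.10, 0.20),
    resume_from="phase1_checkpoint.pt",                 # hypothetical file name
    penalty_scale={"orientation": 0.5, "hip_pos": 0.5},  # relaxed terms
)

print(phase_weights(base, phase1))
print(phase_weights(base, phase2))
```

Keeping the relaxation as a per-term factor on top of shared base weights makes it explicit which constraints differ between the phases and which are carried over unchanged.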
6. Experimental Results and Analysis
6.1. Systematic Ablation Study: Efficacy of Fuzzy Reward Shaping
To rigorously evaluate the contribution of the proposed Fuzzy Logic Supervisor, we conducted a systematic ablation study comparing our method against two representative baselines on the complex stair terrain:
Baseline (PPO): Standard PPO algorithm with fixed penalty weights.
Linear Curriculum: A heuristic approach where terrain difficulty increases linearly over time, without dynamic weight adaptation.
The comparative training curves are presented in Figure 9. It is observed that both the Baseline (Blue) and Linear Curriculum (Orange) methods fail to learn a viable locomotion policy. Their reward curves stagnate at a low level (approx. 50) and nearly overlap, indicating a failure to converge. This phenomenon is attributed to the “safety deadlock”: without dynamic weight adjustment, the agent falls into a local optimum where it prioritizes minimizing penalty terms (e.g., limiting joint velocities to avoid torque penalties) rather than risking exploration to climb the stairs.
In stark contrast, the Fuzzy-RL approach (Red curve) effectively breaks this deadlock. By dynamically relaxing constraints during stable phases and tightening them only when necessary, our method achieves a significantly higher asymptotic reward (approx. 250) and demonstrates robust convergence. This empirically validates that the dynamic reward shaping mechanism is a prerequisite for successful learning on such high-difficulty terrains.
6.2. Reward Function Convergence Analysis
To validate the effectiveness of the proposed phased training strategy, we analyzed the reward function convergence across different terrain difficulties.
Figure 10 illustrates the mean reward curves for Phase 1 (Step Height: 0.05–0.09 m).
The policy exhibits rapid learning in the initial 600 iterations. For lower difficulties (h ≤ 0.07 m), the reward stabilizes quickly (e.g., reaching 47 for 0.05 m), indicating that the base gait is easily established.
As the step height increases to 0.08 m and 0.09 m, the curves show a distinct “plateau” (local optimum) between 600 and 1100 iterations. However, thanks to the continuous exploration incentivized by the PPO algorithm, the agent successfully breaks through this bottleneck after 1100 iterations, achieving a second growth phase. This phenomenon confirms that the task complexity correlates non-linearly with convergence time.
For Phase 2 (Step Height: 0.10–0.20 m), the reward curves (Figure 11) demonstrate the impact of the Transfer Learning strategy described in Section 5.4.
Unlike the training from scratch in Phase 1, the curves in Phase 2 do not start from zero but rise immediately from the pre-trained baseline. Even for the challenging 0.20 m step, the policy stabilizes within 1000 iterations.
The convergence values decrease as difficulty rises (28 for 0.10 m vs. 16 for 0.20 m), reflecting the increased penalty costs (e.g., energy, stability margins) inherent in high-difficulty locomotion.
6.3. Terrain Adaptability and Success Rate
The visual simulation results for different terrain difficulties are presented in Figure 8 and Figure 12 (Stairs), as well as Figure 11 and Figure 13 (Slopes). The G1 robot demonstrates stable gait transitions, effectively regulating foot clearance to avoid tripping. To quantitatively evaluate robustness, we conducted statistical trials (100 episodes per terrain) using the failure criteria established in Section 4.2 (i.e., base height < 0.3 m or pitch/roll angle > 0.5 rad). The success rates are summarized in Table 9 and Table 10.
Stair Climbing: The success rate remains above 90% for steps up to 0.09 m. In stark contrast, the baseline PPO algorithm (without Fuzzy-Phased training) failed to converge for steps exceeding 0.08 m, resulting in a 0% success rate as the robot remained trapped in the static local optimum shown in Figure 8. For the high-difficulty Phase 2 tasks (0.10–0.20 m), our proposed method maintains a viable level (e.g., 86% for 0.20 m), indicating that the Fuzzy-Logic-based reward shaping effectively guides the robot to discover high-stepping motions.
Slope Traversal: The robot exhibits superior performance on rough slopes (Table 10), with success rates exceeding 95% for slopes up to 0.4 m, validating the policy’s adaptability to continuous terrain irregularities.
6.4. Qualitative Analysis of Motion Stability
Although explicit kinematic metrics (e.g., Center of Mass trajectories or joint torque profiles) are not quantitatively plotted, the dynamic stability of the proposed controller is implicitly validated through the high task success rates presented in Table 9 and Table 10.
Gait Smoothness: As observed in the simulation (Figure 11 and Figure 13), the G1 robot maintains a rhythmic gait without high-frequency oscillations (jitter). This is attributed to the action_rate and dof_acc penalty terms defined in our reward function, which successfully suppressed jerky motions and smoothed the torque output.
The low variance in body orientation (roll/pitch) during successful trials indicates that the Fuzzy Logic Controller effectively balanced the trade-off between “exploring high steps” and “maintaining upright posture.” The robot did not exhibit significant stumbling or excessive trunk inclination even when traversing the steepest 0.5 m slopes.
The ability to complete 1000+ continuous steps without falling suggests that the learned policy successfully converged to an energy-efficient gait, avoiding unnecessary joint actuations that would otherwise lead to instability or simulation termination.
7. Discussion
In this study, we presented a hybrid Fuzzy-RL framework to address the challenge of humanoid locomotion on complex terrains. While the experimental results in Section 6 demonstrate high success rates, a critical analysis of the underlying mechanisms, a comparison with state-of-the-art methods, and a discussion of real-world implications are necessary to fully contextualize these findings.
7.1. Synergistic Mechanism of Fuzzy-RL
The most significant finding from our ablation study (Section 6.1) is that the introduction of the Fuzzy Logic Supervisor effectively prevents the “safety deadlock” phenomenon commonly observed in standard RL. In traditional PPO training (e.g., the Baseline), the agent often converges to a conservative policy that avoids movement to minimize penalty terms when facing high-risk terrains like stairs. Our proposed Fuzzy Supervisor acts as a dynamic regulator: it actively lowers the penalty weights (relaxing constraints) when the robot is stable but stagnant, thereby encouraging the exploration of high-stepping actions. Conversely, when the action noise variance increases (indicating potential instability), the fuzzy rules strictly tighten the constraints. This “Explore-then-Stabilize” mechanism provides a guided curriculum that is inherently more sample-efficient than manual reward tuning.
7.2. Comparison with State-of-the-Art Approaches
To assess the relative standing of our approach, we compare it with recent representative works in humanoid control:
Comparison with Standard RL (e.g., Li et al., 2024 [20]): Recent works on the Unitree H1 typically rely on massive domain randomization and fixed reward structures to achieve robustness. While effective, these methods often require computationally expensive training on thousands of CPU cores to cover all corner cases. In contrast, our Fuzzy-RL approach achieves comparable stability on the G1 robot with significantly fewer samples, as the fuzzy logic provides explicit “expert guidance” that constrains the search space to feasible regions.
Comparison with Model-Predictive Control (MPC): Traditional MPC-based methods (e.g., LIP models) excel in stability but struggle with the non-linear dynamics of unstructured terrains (e.g., rough slopes). Our learning-based approach operates in an end-to-end manner, processing proprioceptive data to handle terrain irregularities that are difficult to model analytically. The success rate of 92% on slopes (Table 10) highlights the superiority of DRL in generalizing to unstructured environments compared to rigid model-based controllers.
7.3. Sim-to-Real Gap and Challenges
Despite the success in simulation, transferring the policy to the physical G1 robot presents non-trivial challenges. A primary concern is the control loop latency. In simulation, state estimation and action execution are instantaneous. On physical hardware, delays arising from EtherCAT communication and motor response (typically 1–2 ms) can destabilize the high-frequency PD controller learned by the neural network. Furthermore, the depth camera used for terrain height mapping is subject to measurement noise and motion blur during fast walking, which are idealized in Isaac Gym. Future work must incorporate domain randomization of latency and sensor noise injection during the training phase to enhance the robustness of the Sim-to-Real transfer.
7.4. Limitations
We acknowledge two primary limitations of the current approach. First, the success rate drops noticeably for 0.20 m stairs (as shown in Table 9). This is fundamentally constrained by the kinematic limits of the G1 robot; the leg length of the compact G1 is significantly shorter than that of full-sized humanoids, making 20 cm steps geometrically close to its singularity region. Second, the Fuzzy Logic rules are currently handcrafted based on expert knowledge. While effective for the tested terrains, these fixed rules may not be optimal for unseen environments (e.g., soft soil or ice). A promising direction for future research is to develop a Self-Adaptive Fuzzy System where the membership functions are automatically tuned via gradient descent or evolutionary algorithms.
8. Conclusions
In this work, we addressed the critical challenge of humanoid locomotion on unstructured terrains by proposing a novel Adaptive Fuzzy-Reinforcement Learning (AF-RL) framework. By integrating a fuzzy logic supervisor with a phased training strategy, we successfully resolved the “exploration-stability conflict” that typically causes standard RL agents to freeze in local optima.
Key conclusions drawn from this study are as follows:
Overcoming the Safety Deadlock: The systematic ablation study empirically demonstrated that our dynamic reward shaping mechanism is superior to static PPO baselines and linear curriculum learning. The proposed method effectively breaks the “safety deadlock,” achieving a success rate of over 86% on 0.20 m stairs and 92% on complex slopes, whereas baseline methods failed to converge on high-difficulty tasks.
Mechanism of Dynamic Regulation: Quantitative analysis confirms that the fuzzy supervisor acts as an effective real-time regulator. By actively tightening constraints during instability and relaxing them during stagnation, it enables the G1 robot to maintain a rhythmic, low-jitter gait. This dynamic modulation proved essential for balancing gait aggressiveness with postural stability.
Efficiency and Convergence: We established that the proposed phased training strategy, augmented by fuzzy logic, significantly enhances sample efficiency. The training results show an approximately 5× improvement in convergence speed compared to training from scratch, providing a reference paradigm for training high-DoF robots on computationally limited platforms.
Future work will focus on bridging the “Sim-to-Real” gap. We plan to deploy the trained policy onto the physical G1 hardware using domain randomization and proprioceptive noise injection to handle real-world uncertainties. Additionally, we aim to develop a self-adaptive fuzzy regulator that automatically tunes membership functions online, further reducing the reliance on expert knowledge.
Author Contributions
Conceptualization, Y.T. and X.W.; methodology, Y.T.; software, X.W.; validation, X.W., L.W. and H.L. (Hao Liu); formal analysis, L.W.; investigation, H.L. (Huige Lai); resources, Y.T.; data curation, L.W.; writing—original draft preparation, X.W.; writing—review and editing, H.L. (Huige Lai); visualization, L.W.; supervision, Y.T.; project administration, Y.T.; funding acquisition, Y.T. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported by the Natural Science Foundation of Ningxia Hui Autonomous Region (Grant No. 2023AAC03029) and the Liupanshan Laboratory Basic Research Program (Grant No. LPS-2025-KY-D-JC-0014). The APC was not funded by any specific grant.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The simulation data generated during the current study are available from the corresponding author upon reasonable request. The reinforcement learning training framework was built based on the open-source legged_gym platform, and the related open-source resources can be accessed via its official repository. No new empirical or experimental datasets were created in this study, as the research was conducted entirely through virtual simulation of the G1 humanoid robot’s motion control rather than physical data collection.
Acknowledgments
The authors would like to express sincere gratitude to the technical team of Unitree Robotics for providing technical guidance on the G1 humanoid robot’s hardware configuration and secondary development. Thanks are also extended to the laboratory staff of the School of Mechanical Engineering at Ningxia University for their assistance in maintaining the simulation platform and conducting repeated verification experiments during the research period.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
|---|---|
| DoF | Degrees of Freedom |
| ZMP | Zero Moment Point |
| PPO | Proximal Policy Optimization |
| ROS | Robot Operating System |
References
- Xu, C.H.; Wang, Y.N.; Mo, Y. Research on humanoid robot technology and industrial development. China Eng. Sci. 2025, 27, 150–167.
- Tazaki, Y.; Murooka, M. A survey of motion planning techniques for humanoid robots. Adv. Robot. 2020, 34, 1370–1379.
- Tong, Y.C.; Liu, H.; Zhang, Z. Advancements in Humanoid Robots: A Comprehensive Review and Future Prospects. IEEE/CAA J. Autom. Sin. 2024, 11, 301–328.
- Yamamoto, K.; Kamioka, T.; Sugihara, T. Survey on model-based biped motion control for humanoid robots. Adv. Robot. 2020, 34, 1353–1369.
- Zhou, K.; Mei, J.P.; Xie, S.L. Research progress and challenges of humanoid robots. Robot Technol. Appl. 2025, 1, 12–20.
- Sakagami, Y.; Watanabe, R.; Aoyama, C. The intelligent ASIMO: System overview and integration. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Lausanne, Switzerland, 30 September–4 October 2002.
- Pratt, J.; Koolen, T.; de Boer, T.; Rebula, J.; Cotton, S.; Carff, J.; Johnson, M.; Neuhaus, P. Capturability-based analysis and control of legged locomotion, Part 2: Application to M2V2, a lower-body humanoid. Int. J. Robot. Res. 2012, 31, 1117–1133.
- Caron, S.; Escande, A.; Lanari, L.; Mallein, B. Capturability-based pattern generation for walking with variable height. IEEE Trans. Robot. 2020, 36, 517–536.
- Rupert, L.; Hyatt, P.; Killpack, M.D. Comparing model predictive control and input shaping for improved response of low-impedance robots. In Proceedings of the 2015 IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), Seoul, Republic of Korea, 3–5 November 2015.
- Gong, Y.; Hartley, R.; Da, X.; Hereid, A.; Harib, O.; Huang, J.-K.; Grizzle, J. Feedback control of a Cassie bipedal robot: Walking, standing, and riding a Segway. In Proceedings of the American Control Conference, Philadelphia, PA, USA, 10–12 July 2019.
- Kuindersma, S.; Deits, R.; Fallon, M.; Valenzuela, A.; Dai, H.; Permenter, F.; Koolen, T.; Marion, P.; Tedrake, R. Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot. Auton. Robot. 2016, 40, 429–455.
- Yoshida, Y.; Takeuchi, K.; Sato, D.; Nenchev, D. Balance control of humanoid robots in response to disturbances in the frontal plane. In Proceedings of the 2011 IEEE International Conference on Robotics and Biomimetics (ROBIO), Karon Beach, Thailand, 7–11 December 2011.
- Seo, K.; Kim, J.; Roh, K. Towards natural bipedal walking: Virtual gravity compensation and capture point control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vilamoura, Portugal, 7–12 October 2012.
- Inoue, T.; De Magistris, G.; Munawar, A.; Yokoya, T.; Tachibana, R. Deep reinforcement learning for high precision assembly tasks. In Proceedings of the 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, Canada, 24–28 September 2017.
- Qiu, Y.; Yao, H.; Yu, Y.; Mo, W. Reinforcement Learning-Based Control Strategies for Quadruped Robots in Challenging Terrain Environments. In Proceedings of the 2024 2nd International Conference on Artificial Intelligence and Automation Control (AIAC), Guangzhou, China, 20–22 April 2024.
- Da Silva, V.D.S.; Santos, M.G.S.D.; Vieira, M.F.N.; Matos, V.S.; Queiroz, Í.N.; Lima, R.T. Autonomous navigation strategy in quadruped robots for uneven terrain using 2D laser sensor. In Proceedings of the 2023 Latin American Robotics Symposium (LARS), 2023 Brazilian Symposium on Robotics (SBR), and 2023 Workshop on Robotics in Education (WRE), Salvador, Brazil, 12–14 October 2023.
- Wang, K.; Chen, T.; Bi, J.; Li, Y.; Rong, X. Vision-based Terrain Perception of Quadruped Robots in Complex Environments. In Proceedings of the 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO), Sanya, China, 27–31 December 2021.
- Hwangbo, J.; Sa, I.; Siegwart, R.; Hutter, M. Learning agile and dynamic motor skills for quadrupedal robots. Sci. Robot. 2019, 4, eaau5872.
- Miki, T.; Lee, J.; Hwangbo, J.; Hutter, M. Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 2022, 7, eabk2822.
- Li, Z.; Cheng, X.; Peng, X.B.; Abbeel, P.; Levine, S.; Berseth, G.; Sreenath, K. Reinforcement Learning for Humanoid Locomotion with Real-World Deployment on Unitree H1. arXiv 2024, arXiv:2403.11111.
- Kim, H.; Oh, H.; Park, J.; Kim, Y.; Youm, D.; Jung, M.; Lee, M.; Hwangbo, J. High-speed control and navigation for quadrupedal robots on complex and discrete terrain. Sci. Robot. 2025, 10, eads6192.
- Unitree Robotics. Unitree G1 Humanoid Agent. Available online: https://www.unitree.com/g1 (accessed on 28 December 2025).
- Nguyen, V.P.; Nguyen, T.H.; Nguyen, V.T. Fuzzy Logic-Based Scaling of Reward Functions in Deep Reinforcement Learning. IEEE Access 2023, 11, 10567–10580.
- Yang, Y.; Li, Z.; Shi, P.; Li, G. Fuzzy-based control for multiple tasks with human-robot interaction. IEEE Trans. Fuzzy Syst. 2024, 32, 5802–5814.
- Moon, P.; War, K.K.J. Design of adaptive fuzzy tracking controller for autonomous navigation system. Int. J. Recent Trend. Eng. Res. 2016, 2, 268–275.
- Xu, J.; Hou, Z.; Wang, W.; Xu, B.; Zhang, K.; Chen, K. Feedback Deep Deterministic Policy Gradient With Fuzzy Reward for Robotic Multiple Peg-in-Hole Assembly Tasks. IEEE Trans. Ind. Inform. 2019, 15, 1658–1667.
Figure 1.
Architecture of the proposed Adaptive Fuzzy-Reinforcement Learning framework. The system features a dual-path control mechanism: the PPO Agent (blue path) learns the locomotion policy from state observations, while the Fuzzy Logic Controller (orange path) acts as a supervisor, dynamically adjusting the penalty multiplier based on the robot’s real-time stability and tracking error. This adaptive reward shaping mechanism enables the robot to balance exploration efficiency with motion stability. The arrows indicate the direction of signal flow and data transmission between the system modules.
Figure 2.
G1 Part Description.
Figure 3.
G1 system architecture diagram.
Figure 4.
Definition of the fuzzy membership functions (MFs). (a) Input: Normalized Velocity Tracking Error. (b) Input: Stability Index. (c) Output: Penalty Multiplier, dynamically scaling penalties from 0.5× to 4.0×. The red shaded area in (c) indicates the high-penalty zone where the ‘Large’ multiplier is applied to strictly constrain the robot’s action space and suppress vibration.
Figure 5.
Walking posture of the G1 robot after reinforcement learning training.
Figure 6.
Results of the G1 robot after reinforcement learning training.
Figure 7.
Step terrain and rough slope terrain.
Figure 8.
Local Optimal Solution in Static Standing State.
Figure 9.
Comparative training curves of the ablation study. The proposed Fuzzy-RL (Green) demonstrates superior convergence compared to the Baseline PPO (Blue) and Linear Curriculum (Orange) methods, which fail to overcome the local optimum on complex stair terrain.
Figure 10.
Average reward curve during training for single-step heights from 0.05 m to 0.09 m.
Figure 11.
Reward curve during training for single-step heights from 0.10 m to 0.20 m.
Figure 12.
Simulation effect diagram in Isaac Gym with different step heights.
Figure 13.
Simulation effect diagram in Isaac Gym with different slope heights.
Table 1.
G1 Configuration Parameters Table.
| Parameter | Value |
|---|---|
| Standing size | 1320 × 450 × 200 mm |
| Weight | 35 kg |
| Degrees of freedom | 43 |
| Joint output bearing type | Joint output bearing type |
| Joint motor type | Low-inertia, high-speed internal rotor permanent magnet synchronous motor |
| Maximum knee torque | 120 N·m |
| Waist joint movement angle | Z: ±155°, X: ±45°, Y: ±30° |
| Knee movement angle | 0–165° |
| Hip motion angle | P: ±154°, R: −30° to +170°, Y: ±158° |
Table 2.
Fuzzy Inference Rule Base governing the dynamic penalty multiplier.
| Rule ID | If Tracking Error Is… | And Stability Index Is… | Then Penalty Multiplier Is… | Operational Logic |
|---|---|---|---|---|
| 1 | NB or PB | Any | Large | Significant tracking error necessitates strict constraints. |
| 2 | ZE | High | Small | System is stable and accurate: Relax penalties. |
| 3 | ZE | Low | Medium | On target but vibrating: Apply moderate constraints. |
| 4 | NS or PS | High | Medium | Minor error with stability: Gentle correction. |
| 5 | Any | Low | Large | Safety Lock: High-frequency vibration triggers maximum penalty. |
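The rule base in Table 2 can be read as a small inference routine. The sketch below is illustrative only and is not the authors' implementation: the triangular input sets, the complementary Low/High stability sets, the weighted-average defuzzification, and the Medium value of 1.0 are all assumptions. Only the five rules and the 0.5× to 4.0× output range (Figure 4 caption) come from the text. The names `tri` and `penalty_multiplier` are hypothetical.

```python
# Output singletons. 0.5 and 4.0 are the range limits quoted in the Figure 4
# caption; the Medium value of 1.0 is an assumption for illustration.
SMALL, MEDIUM, LARGE = 0.5, 1.0, 4.0


def tri(x: float, peak: float, half_width: float) -> float:
    """Symmetric triangular membership centred on `peak` (assumed shape)."""
    return max(0.0, 1.0 - abs(x - peak) / half_width)


def penalty_multiplier(error: float, stability: float) -> float:
    """Evaluate the five rules of Table 2 and defuzzify by weighted average."""
    e = max(-1.0, min(1.0, error))      # normalized tracking error in [-1, 1]
    s = max(0.0, min(1.0, stability))   # stability index in [0, 1]

    # Fuzzify the inputs (hypothetical set placement).
    nb, ns, ze, ps, pb = (tri(e, p, 0.5) for p in (-1.0, -0.5, 0.0, 0.5, 1.0))
    low, high = 1.0 - s, s

    # (firing strength, output singleton); OR is max, AND is min.
    rules = [
        (max(nb, pb), LARGE),              # Rule 1: large error, any stability
        (min(ze, high), SMALL),            # Rule 2: on target and stable
        (min(ze, low), MEDIUM),            # Rule 3: on target but vibrating
        (min(max(ns, ps), high), MEDIUM),  # Rule 4: minor error, stable
        (low, LARGE),                      # Rule 5: safety lock
    ]
    total = sum(w for w, _ in rules)
    if total == 0.0:
        return MEDIUM  # defensive fallback
    return sum(w * out for w, out in rules) / total
```

Under these assumptions, an on-target, fully stable state yields the minimum multiplier of 0.5, while a large tracking error yields 4.0.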
Table 3.
Reward function settings.
| Reward Function Category | Reward Function Name | Reward Function Value |
|---|---|---|
| Energy consumption optimization reward function | action_rate | −1.0 |
| | dof_vel | −10.0 |
| Security constraint reward function | collision | −1.0 |
| | dof_pos_limits | −5.0 |
| | dof_acc | −2.5 × 10⁻⁷ |
| Ode control reward function | orientation | −1.0 |
| | base_height | −10.0 |
| | hip_pos | −1.0 |
| Motion performance reward function | tracking_lin_vel | 1.0 |
| | tracking_ang_vel | 0.5 |
| | lin_vel_z | −2.0 |
| Foot control reward function | feet_air_time | 0.0 |
| | contact | 0.2 |
| | feet_swing_height | −20.0 |
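As a rough illustration of how these coefficients combine with the fuzzy penalty multiplier, the sketch below sums the scaled reward terms and applies the multiplier to every negatively weighted term. Which terms the supervisor actually modulates is not stated here, so scaling all penalties is an assumption, as are the function name `total_reward` and the example term magnitudes.

```python
# Reward scales copied from Table 3.
REWARD_SCALES = {
    "action_rate": -1.0,
    "dof_vel": -10.0,
    "collision": -1.0,
    "dof_pos_limits": -5.0,
    "dof_acc": -2.5e-7,
    "orientation": -1.0,
    "base_height": -10.0,
    "hip_pos": -1.0,
    "tracking_lin_vel": 1.0,
    "tracking_ang_vel": 0.5,
    "lin_vel_z": -2.0,
    "feet_air_time": 0.0,
    "contact": 0.2,
    "feet_swing_height": -20.0,
}


def total_reward(terms: dict, multiplier: float) -> float:
    """Sum scale * term, multiplying penalty scales by the fuzzy multiplier.

    `terms` maps reward names to raw, non-negative term magnitudes.
    Assumption: every negatively weighted term is treated as a penalty.
    """
    total = 0.0
    for name, value in terms.items():
        scale = REWARD_SCALES[name]
        if scale < 0.0:
            scale *= multiplier  # tighten (>1) or relax (<1) the penalty
        total += scale * value
    return total


# Hypothetical term magnitudes for a single control step.
step_terms = {"tracking_lin_vel": 0.8, "action_rate": 0.1}
relaxed = total_reward(step_terms, 0.5)    # penalties halved
tightened = total_reward(step_terms, 4.0)  # penalties quadrupled
```

With the same raw terms, a larger multiplier lowers the total reward, which is how tightening discourages unstable actions without changing the tracking incentive.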
Table 4.
Statistics of walking success rate.
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| 97% | 98% | 96% | 98% | 98% | 95% | 99% | 97% | 98% | 97% | 97.3% |
Table 5.
Hyperparameters and implementation details for PPO training.
| Parameter | Value/Description |
|---|---|
| Algorithm | Proximal Policy Optimization |
| Network Architecture | Actor: LSTM(64) + FC(32); Critic: FC(32) |
| Activation Function | ELU |
| Input Dimension (Observation) | 47 (Proprioception + Phase info) |
| Output Dimension (Action) | 12 (Target Joint Positions) |
| Action Scaling | [−0.25, 0.25] rad (mapped from [−1, 1]) |
| Optimizer | Adam |
| Learning Rate | 1 × 10⁻³ (Adaptive based on KL divergence) |
| Discount Factor (γ) | 0.99 |
| Clip Range | 0.2 |
| Control Frequency | 50 Hz (Simulation Step: 0.005 s, Decimation: 4) |
| Domain Randomization | Friction: [0.1, 1.25], Mass: [−1.0, 3.0] kg, Push Velocity: 1.5 m/s |
| Random Seeds | 3 (Results averaged across seeds) |
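Two entries in Table 5 follow from simple arithmetic: the 50 Hz control frequency results from the 0.005 s simulation step and a decimation of 4, and the action scaling maps a policy output in [−1, 1] to a joint offset in [−0.25, 0.25] rad. The sketch below reproduces both. Adding the offset to a default joint angle follows the usual legged_gym convention and is an assumption here. The function name `scale_action` is hypothetical.

```python
SIM_DT = 0.005       # simulation step in seconds (Table 5)
DECIMATION = 4       # simulation steps per policy step (Table 5)
ACTION_SCALE = 0.25  # rad per unit of policy output (Table 5)

# One policy action is held for DECIMATION simulation steps.
control_dt = SIM_DT * DECIMATION  # 0.02 s
control_hz = 1.0 / control_dt     # 50 Hz


def scale_action(action: float, default_angle: float = 0.0) -> float:
    """Map a policy output in [-1, 1] to a target joint position in rad.

    Assumption: the scaled action is an offset from a default joint angle.
    """
    clipped = max(-1.0, min(1.0, action))
    return default_angle + ACTION_SCALE * clipped
```

A policy output of 1.0 therefore commands a target 0.25 rad above the default angle, and out-of-range outputs are clipped before scaling.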
Table 6.
Membership Functions.
| Fuzzy Set / Change Level | −2 | −1 | 0 | 1 | 2 |
|---|---|---|---|---|---|
| PB | 0 | 0 | 0 | 0.3 | 1 |
| PS | 0 | 0 | 0.3 | 1 | 0.3 |
| O | 0 | 0.3 | 1 | 0.3 | 0 |
| NS | 0.3 | 1 | 0.3 | 0 | 0 |
| NB | 1 | 0.3 | 0 | 0 | 0 |
Table 7.
Fuzzy Rule Base.
| Rule No. | If Input (Terrain Height) | Then Output (Reward Adjustment) | Control Strategy Description |
|---|---|---|---|
| 1 | NB | NB | Significant relaxation of constraints |
| 2 | NS | NS | Slight relaxation of constraints |
| 3 | O | O | Maintain standard training parameters |
| 4 | PS | PS | Slight increase in penalty weights |
| 5 | PB | PB | Maximize penalties to force high leg-lift |
Table 8.
Discrete Singleton Values.
| Reward Term | NB (x1) | NS (x2) | O (x3) | PS (x4) | PB (x5) |
|---|---|---|---|---|---|
| lin_vel_z | −0.5 | −0.4 | −0.3 | −0.2 | −0.1 |
| orientation | −1.5 | −0.8 | −0.6 | −0.5 | −0.5 |
| hip_pos | −1.5 | −1.2 | −0.9 | −0.7 | −0.5 |
| feet_air_time | 0.3 | 0.8 | 1.0 | 1.2 | 1.3 |
| feet_swing_height | −1.6 | −1.5 | −0.8 | −0.5 | −0.5 |
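Tables 6 to 8 together define a lookup from a terrain change level to a reward coefficient. Because each rule in Table 7 maps an input set to the output set of the same name, the firing strength of each rule is simply the membership degree from Table 6, and the output is one of the singletons in Table 8. The sketch below combines them with a weighted average of singletons. That defuzzification choice is an assumption, since the text does not name the method used, and `reward_coefficient` is a hypothetical name. Sets are indexed by the change level at which they peak, from −2 to 2, matching columns x1 to x5 of Table 8.

```python
LEVELS = (-2, -1, 0, 1, 2)

# Table 6 membership degrees, one row per fuzzy set, ordered by the change
# level at which the set peaks (-2 first). Columns follow LEVELS.
MEMBERSHIP = (
    (1.0, 0.3, 0.0, 0.0, 0.0),  # set peaking at -2 (singleton x1)
    (0.3, 1.0, 0.3, 0.0, 0.0),  # set peaking at -1 (singleton x2)
    (0.0, 0.3, 1.0, 0.3, 0.0),  # set peaking at  0 (singleton x3)
    (0.0, 0.0, 0.3, 1.0, 0.3),  # set peaking at  1 (singleton x4)
    (0.0, 0.0, 0.0, 0.3, 1.0),  # set peaking at  2 (singleton x5)
)

# Table 8 singleton outputs x1..x5 for each reward term.
SINGLETONS = {
    "lin_vel_z": (-0.5, -0.4, -0.3, -0.2, -0.1),
    "orientation": (-1.5, -0.8, -0.6, -0.5, -0.5),
    "hip_pos": (-1.5, -1.2, -0.9, -0.7, -0.5),
    "feet_air_time": (0.3, 0.8, 1.0, 1.2, 1.3),
    "feet_swing_height": (-1.6, -1.5, -0.8, -0.5, -0.5),
}


def reward_coefficient(term: str, level: int) -> float:
    """Weighted average of Table 8 singletons for a discrete change level."""
    col = LEVELS.index(level)
    weights = [row[col] for row in MEMBERSHIP]  # rule firing strengths
    values = SINGLETONS[term]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

At change level 0, for example, `reward_coefficient("feet_air_time", 0)` evaluates to (0.3 × 0.8 + 1.0 × 1.0 + 0.3 × 1.2) / 1.6 = 1.0, the standard coefficient.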
Table 9.
Specific reward function coefficients configured for the second training phase (Step Height 0.10–0.20 m).
| Reward Term | Value | Physical Meaning |
|---|---|---|
| lin_vel_z | 0.01 | Relaxed vertical velocity constraint |
| orientation | −0.1 | Reduced penalty for body roll/pitch |
| hip_pos | −0.05 | Relaxed hip position constraint |
| feet_air_time | 1.2 | Increased reward for longer swing time |
| feet_swing_height | −0.2 | Reduced penalty for high-stepping |
Table 10.
Success rates for stair ascent and descent tasks with step heights between 0.05 m and 0.09 m.
| Single Step Height (m) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.05 | 92% | 96% | 97% | 94% | 97% | 93% | 94% | 95% | 88% | 94% | 94.0% |
| 0.06 | 97% | 92% | 93% | 93% | 95% | 93% | 97% | 94% | 91% | 96% | 94.1% |
| 0.07 | 94% | 97% | 92% | 94% | 95% | 97% | 91% | 95% | 94% | 95% | 94.4% |
| 0.08 | 90% | 92% | 94% | 93% | 95% | 96% | 95% | 91% | 92% | 93% | 93.1% |
| 0.09 | 91% | 89% | 88% | 92% | 94% | 92% | 90% | 88% | 93% | 91% | 90.8% |
Table 11.
Success rates for stair ascent and descent tasks with step heights between 0.10 m and 0.20 m.
| Single Step Height (m) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.10 | 94% | 93% | 97% | 97% | 97% | 92% | 96% | 93% | 97% | 90% | 94.6% |
| 0.15 | 92% | 94% | 94% | 93% | 91% | 96% | 94% | 95% | 96% | 91% | 93.6% |
| 0.20 | 90% | 86% | 88% | 92% | 89% | 91% | 91% | 93% | 87% | 92% | 89.9% |
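The per-height averages can be recomputed directly from the ten trial values. The short sketch below does so for the rows of Table 11 and also reports the worst trial per height. The helper name `summarize` is hypothetical and the trial values are copied from the table.

```python
# Success rates (%) for trials 1..10, copied from Table 11.
TRIALS = {
    "0.10": [94, 93, 97, 97, 97, 92, 96, 93, 97, 90],
    "0.15": [92, 94, 94, 93, 91, 96, 94, 95, 96, 91],
    "0.20": [90, 86, 88, 92, 89, 91, 91, 93, 87, 92],
}


def summarize(rates: list) -> tuple:
    """Return (mean rounded to one decimal, worst trial) for one step height."""
    return round(sum(rates) / len(rates), 1), min(rates)


for height, rates in TRIALS.items():
    mean_rate, worst = summarize(rates)
    print(f"{height} m: mean {mean_rate}%, worst trial {worst}%")
```

For the 0.20 m row this gives a mean of 89.9% with a worst trial of 86%.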