1. Introduction
The ultimate objective of legged robots is to achieve autonomous locomotion and operation in complex, unstructured, and unknown environments, which imposes stringent requirements on the robot's perception, planning, and control algorithms [1,2,3,4]. Humanoid robots, with their human-like structure and motion patterns, offer distinct advantages in mitigating risks such as jolting, slipping, and immobilization in unstructured environments through adaptive foothold selection, gait adjustment, and speed modulation. Their strong locomotion capabilities in complex environments enable a wide range of applications such as disaster response, industrial automation, and space exploration. However, the motion planning and control of humanoid robots are inherently complex due to their high degrees of freedom, and the reliance on precise terrain perception in such environments further adds to the challenges in their research and practical deployment.
Current model-based methodologies for humanoid locomotion have predominantly focused on two aspects: (a) perceptive locomotion utilizing static gaits [5,6,7,8,9] and (b) blind dynamic locomotion under the assumption of flat terrain [10,11,12]. While recent advances in learning-based control frameworks have demonstrated remarkable generalization capabilities, enabling robust locomotion across diverse challenging terrains without explicit terrain models [13,14,15,16,17,18], research on vision-guided reinforcement learning for humanoid robots remains limited [19,20]. Despite these advancements, the integration of terrain perception modules for precise foothold control on rugged terrains remains a significant challenge. This challenge manifests in three bottlenecks: computationally efficient terrain representation, optimal foothold selection in unstructured environments, and coordinated planning of underactuated system dynamics.
In the context of foothold selection, several studies have incorporated perceptual information into foothold selection algorithms [21,22,23], adopting a hierarchical framework in which foothold positions and base motion are optimized independently. In this framework, the high-level module selects footholds by incorporating terrain information, while the low-level module optimizes base motion based on the robot's dynamic model. Although this hierarchical approach reduces computational complexity, it requires manual alignment between the two modules. Furthermore, the independent optimization of footholds and base motion often fails to adequately address the kinematic constraints between the base and legs, as well as leg collision avoidance with the terrain.
In contrast, some studies have proposed a joint optimization approach for foothold positions and base motion, simultaneously considering both base and leg movements during trajectory optimization, which has yielded promising results [24,25,26]. This method eliminates the need for manual handling of the coupled motion between the legs and base and allows for the integration of comprehensive terrain information into the optimization process. However, the joint optimization of base and leg motion introduces significant computational complexity due to the high-dimensional nature of the model, particularly when perceptual information is incorporated. This may result in excessively long computation times, making real-time deployment challenging.
Elevation maps play a crucial role in terrain perception and are extensively utilized in the navigation and control of legged robots [27,28,29]. Algorithms based on search or sampling techniques can identify stable foothold positions by leveraging ground height information derived from elevation maps. In optimization-based control algorithms, elevation maps are commonly employed as inputs and integrated with other constraints to optimize the robot's motion or path planning. For instance, Winkler et al. [25] employed elevation maps for foothold selection and collision avoidance, applying equality constraints to restrict the start and end points of foot trajectories to the terrain and inequality constraints to prevent collisions between the swing foot and the terrain. However, this approach does not account for the discontinuities and non-convexities of the terrain, which may cause the optimization problem to converge to local minima, thereby limiting its effectiveness in complex environments.
In the domain of trajectory optimization, large-scale nonlinear solvers such as SNOPT [30] and IPOPT [31] are widely adopted [32,33]. These solvers are capable of handling the complex constraints and objective functions inherent to nonlinear optimization problems, although they often require extensive iterative computations to find optimal solutions. Alternatively, specialized solvers for trajectory optimization, such as the iterative Linear Quadratic Regulator (iLQR) [34,35] and Sequential Linear Quadratic (SLQ) methods [36], have been proposed. These solvers, which are variants of Differential Dynamic Programming (DDP) [37], can efficiently compute near-optimal solutions within a relatively short time frame, making them particularly suitable for high-dimensional problems. Nevertheless, these methods still require linearization of the system, which may introduce approximation errors during the computation process and potentially compromise the accuracy of the solutions.
To address the aforementioned challenges, this paper proposes a hierarchical planning and control framework that integrates terrain perception. As illustrated in Figure 1, the framework comprises a three-level architecture: perception, planning, and control. The perception layer constructs a high-precision elevation map from 3D point cloud data, extracts feasible regions through terrain segmentation, and generates convex polygon terrain constraints. The motion planning layer formulates a multi-objective optimization problem that incorporates underactuated dynamics constraints and terrain constraints using nonlinear model predictive control (MPC). A sequential quadratic programming (SQP) algorithm is employed to generate gaits and trajectories in real time within 10 ms to 20 ms. The whole-body control (WBC) layer adopts a hierarchical optimization strategy, utilizing null-space projection and quadratic programming to coordinate multiple tasks and constraints, thereby achieving synergistic optimization of dynamic balance and trajectory tracking. The three-layer architecture operates as an integrated system through closed-loop feedback mechanisms. The key contributions of this paper are as follows:
Real-time trajectory planning via nonlinear MPC: we implemented trajectory planning for humanoid robots using nonlinear MPC, with real-time gait and trajectory generation achieved via an SQP algorithm.
Closed-loop architecture with triple coupling: we constructed a closed-loop architecture featuring “perception–planning–control” triple coupling, enabling the deep integration of terrain perception with underactuated system planning and control.
High-performance locomotion capabilities: based on the proposed method, simulated experiments demonstrate high-performance locomotion on humanoid robots, including high-speed walking at 1 m/s, rejection of disturbance impulses of up to 60 Ns in the forward direction and 30 Ns in the lateral direction, and traversal of obstacles up to 0.16 m in height, representing a significant improvement in locomotion performance on complex terrains.
Following the proposed framework, the structure of this paper is organized as follows: Section 2 presents the details of the terrain perception pipeline. The formulation of the motion planning problem and the corresponding numerical optimization strategy are presented in Section 3 and Section 4, respectively. Section 5 introduces the whole-body control layer. Finally, the proposed method is evaluated on the MuJoCo simulation platform in Section 6, and the paper concludes with a summary in Section 7.
2. Terrain Perception
To establish a geometric model of terrain in complex environments, this section utilizes point cloud data acquired from RGB-D sensors to construct a 2.5D elevation map that incorporates sensor uncertainty and robot pose uncertainty. The elevation map is further segmented to extract critical terrain features.
2.1. Elevation Map Construction
Constructing a robot-centric local elevation map requires projecting depth images from the sensor onto a cell grid map. Unlike global maps defined in an inertial coordinate system, the local map is continuously updated with robot motion without relying on absolute localization. To account for uncertainties in depth measurements and locomotion-induced pose variations on unstructured terrains, we employ the elevation map update algorithm proposed in [38].
The update process of the elevation map is illustrated in Figure 2. Initially, based on the depth measurement of a point P on the terrain acquired by the sensor, the covariance matrix of the sensor noise model, and the covariance matrix of the sensor's rotation relative to the map coordinate system, a one-dimensional Kalman filter is applied to fuse the old and new elevation measurements, updating the mean and variance of each cell in the elevation map.
Given that the elevation map moves with the robot, the spatial covariance matrix of each cell in the elevation map is updated whenever the robot's pose changes, based on the robot's translational and rotational covariance matrices. Finally, the height estimate for each cell is computed as a weighted mean of the heights of the surrounding cells in a process known as map fusion.
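To make the per-cell update concrete, the following minimal sketch implements the one-dimensional Kalman fusion step described above. The variable names (`h_cell`, `var_cell`, `h_meas`, `var_meas`) are illustrative rather than the paper's; the full algorithm of [38] additionally handles map motion and outlier rejection.

```python
import numpy as np

def fuse_cell(h_cell: float, var_cell: float,
              h_meas: float, var_meas: float) -> tuple[float, float]:
    """1D Kalman update of a single elevation-map cell.

    Fuses the stored height estimate (h_cell, var_cell) with a new
    height measurement (h_meas, var_meas) projected into that cell.
    """
    k = var_cell / (var_cell + var_meas)      # Kalman gain
    h_new = h_cell + k * (h_meas - h_cell)    # fused mean
    var_new = (1.0 - k) * var_cell            # fused variance
    return h_new, var_new
```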
2.2. Plane Segmentation
The objective of terrain segmentation is to identify traversable regions and boundaries based on terrain parameters. Slope and roughness, which serve as critical indicators for characterizing terrain, reflect the inclination and irregularity of the terrain, respectively. Feasibility classification is performed using the local surface inclination and the local roughness, the latter of which is estimated through the standard deviation [39]. For a neighborhood of size N, these quantities are computed as follows:

$$ \mu = \frac{1}{N} \sum_{i=1}^{N} r_i, \qquad S = \frac{1}{N} \sum_{i=1}^{N} \left( r_i - \mu \right)^{2}, \qquad \Sigma = \frac{1}{N} \sum_{i=1}^{N} \left( r_i - \mu \right)\left( r_i - \mu \right)^{T} $$

where $\mu$ and $S$ denote the mean and variance of the cell height $r_i$, respectively, while $\Sigma$ represents the positive semi-definite covariance matrix of $r_i$ (treating $r_i$ as the three-dimensional cell position).
The variance of the local surface normal direction corresponds to the smallest eigenvalue of Σ, and the surface normal vector n is the eigenvector associated with this smallest eigenvalue. In this study, a neighborhood of N = 16 is used, with the standard deviation threshold for the normal direction set to 2 cm and the maximum slope angle set to 20°. If the angle is below the threshold and the standard deviation is less than the threshold, the region is classified as planar. A cell is marked with 1 if its neighborhood satisfies the planar condition; otherwise, it is marked with 0. Finally, an opening operation is applied to the image to remove isolated points.
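As a concrete illustration of this per-cell test, the sketch below estimates the surface normal and roughness of one cell from the covariance of its N = 16 neighboring cell positions and applies the two thresholds from the text; the helper name `classify_cell` and the input layout are our own conventions, not the paper's.

```python
import numpy as np

def classify_cell(neighbor_pts: np.ndarray,
                  max_std: float = 0.02,         # 2 cm roughness threshold
                  max_slope_deg: float = 20.0) -> bool:
    """Return True if a cell's neighborhood (N x 3 positions) is planar."""
    mu = neighbor_pts.mean(axis=0)
    centered = neighbor_pts - mu
    cov = centered.T @ centered / len(neighbor_pts)   # 3x3 covariance
    eigvals, eigvecs = np.linalg.eigh(cov)            # ascending eigenvalues
    normal = eigvecs[:, 0]                            # smallest-eigenvalue direction
    normal = normal if normal[2] > 0 else -normal     # orient upward
    roughness_std = np.sqrt(max(eigvals[0], 0.0))     # std along the normal
    slope = np.degrees(np.arccos(np.clip(normal[2], -1.0, 1.0)))
    return roughness_std < max_std and slope < max_slope_deg
```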
Upon completion of the feasibility classification, the connected-component-labeling algorithm is employed to identify continuous regions. For each connected region, planarity verification is performed using the same method as in the previous classification, based on the following criteria: to define a plane more accurately, the threshold for the standard deviation is relaxed to 2.5 cm, allowing a certain degree of topographic undulation, while the slope constraint is tightened to 15°, so that only planes meeting this stricter slope requirement are accepted. The final connected regions are subjected to erosion operations. By extracting the edges, the non-convex two-dimensional contours of each plane are obtained and projected onto the corresponding planes along the z-axis to define the boundaries within the planes.
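A minimal version of this labeling and morphology post-processing, using SciPy's standard routines, is sketched below; the per-cell feasibility mask is assumed to have been computed as in the previous sketch, and the region-level planarity re-check with the relaxed thresholds is omitted for brevity.

```python
import numpy as np
from scipy import ndimage

def extract_planar_regions(planar_mask: np.ndarray) -> list[np.ndarray]:
    """Label connected planar regions in a binary feasibility map."""
    # Opening removes isolated planar cells left by the per-cell test.
    cleaned = ndimage.binary_opening(planar_mask)
    labels, num = ndimage.label(cleaned)          # connected components
    regions = []
    for k in range(1, num + 1):
        region = labels == k
        # Erosion pulls the region boundary inward before edge extraction.
        regions.append(ndimage.binary_erosion(region))
    return regions
```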
3. Motion Planning
The motion planning of legged robots in complex terrains requires simultaneous consideration of dynamic constraints and physical constraints. By dynamically adjusting footholds and body posture, stability can be maintained across varying terrain conditions. This section describes how the motion planning problem for humanoid robots is transformed into an optimal control problem, incorporating terrain information, dynamic models, and other physical constraints into the modeling and solution processes.
3.1. Robot Definition
The schematic diagram of the humanoid robot's structure is illustrated in Figure 3. Each leg is configured with six degrees of freedom, and each arm is configured with four degrees of freedom. An inertial coordinate system I is defined, with its x-axis aligned with the robot's forward direction, the z-axis oriented vertically upward, and the y-axis determined according to the right-hand rule. The floating base coordinate system B is attached to the robot's body, with its origin located at a specific point on the body. The initial orientation of B is aligned with that of the inertial coordinate system I. The center-of-mass coordinate system G is defined such that its origin coincides with the robot's center of mass, and its orientation remains consistent with that of the inertial coordinate system I.
The generalized coordinates $q$ and generalized velocities $\dot{q}$ of the robot are defined as follows:

$$ q = \begin{bmatrix} p_B \\ \theta_B \\ q_a \end{bmatrix}, \qquad \dot{q} = \begin{bmatrix} \dot{p}_B \\ \dot{\theta}_B \\ \dot{q}_a \end{bmatrix} $$

where $p_B \in \mathbb{R}^3$ denotes the position of the floating base in the inertial coordinate system I, $\theta_B \in \mathbb{R}^3$ represents the orientation of the floating base in I, described by ZYX Euler angles, and $q_a$ and $\dot{q}_a$ denote the joint angles and velocities of the actuated joints.

The contact between the humanoid robot's foot and the environment is modeled as a surface contact, which is simplified to four contact points on the sole. Each contact point is subjected to a three-dimensional linear force from the environment. The set of forces acting on the robot's foot contact points is denoted as $\{f_i\}_{i=1}^{n_c}$.
3.2. Dynamics Model
The state $x$ and input $u$ of the robot are defined as follows:

$$ x = \begin{bmatrix} h_{\mathrm{com}} \\ q_B \\ q_a \end{bmatrix} \in \mathbb{R}^{12 + n_a}, \qquad u = \begin{bmatrix} f_1 \\ \vdots \\ f_{n_c} \\ \dot{q}_a \end{bmatrix} \in \mathbb{R}^{3 n_c + n_a} $$

where $h_{\mathrm{com}} \in \mathbb{R}^6$ represents the robot's centroidal momentum expressed in the center-of-mass coordinate system G, which comprises the linear momentum and the angular momentum; $q_B = [p_B^T\; \theta_B^T]^T$ denotes the pose of the floating base in the inertial coordinate system I; and $n_a$ and $n_c$ represent the number of actuated joints and contact points, respectively.
According to the centroidal dynamics, the external forces acting on the robot are equivalent to the rate of change in the centroidal momentum, which is expressed as follows:

$$ \dot{h}_{\mathrm{com}} = \begin{bmatrix} \sum_{i=1}^{n_c} f_i + m g \\ \sum_{i=1}^{n_c} r_i \times f_i \end{bmatrix} $$

where $r_i$ denotes the vector from the centroid to the $i$-th contact point, and $f_i$ denotes the external force acting on the $i$-th contact point.
To solve for the centroidal momentum of the robot, a centroidal momentum matrix $A(q) \in \mathbb{R}^{6 \times (6 + n_a)}$ is introduced. This matrix describes the linear mapping between the generalized velocities $\dot{q}$ and the centroidal momentum $h_{\mathrm{com}}$, expressed as follows:

$$ h_{\mathrm{com}} = A(q)\, \dot{q} \tag{10} $$
Based on Equation (10), partitioning $A(q) = [A_B(q)\;\; A_a(q)]$ into base and actuated-joint blocks, the velocity of the floating base $\dot{q}_B$ can be derived as follows:

$$ \dot{q}_B = A_B^{-1}(q) \left( h_{\mathrm{com}} - A_a(q)\, \dot{q}_a \right) \tag{11} $$
By differentiating the robot state $x$, the state equation of the robot system is obtained as follows:

$$ \dot{x} = \begin{bmatrix} \dot{h}_{\mathrm{com}} \\ \dot{q}_B \\ \dot{q}_a \end{bmatrix} = f(x, u, t) \tag{12} $$

where the three blocks are given by the centroidal dynamics, the base velocity in Equation (11), and the actuated joint velocities contained in the input $u$, respectively.
3.3. Trajectory Planning
3.3.1. Floating Base Trajectory Planning
The floating base position and orientation of a robot walking on complex terrain are intrinsically linked to the terrain information and the user's input commands. Assuming that the commanded velocities, comprising the linear velocities $v_x$ and $v_y$ in the horizontal plane and the angular velocity $\omega_{yaw}$ around the z-axis, remain constant, the horizontal position of the robot's body in the inertial coordinate frame I can be derived as follows:

$$ \begin{bmatrix} x_B^{des} \\ y_B^{des} \end{bmatrix} = \begin{bmatrix} x_B \\ y_B \end{bmatrix} + {}^{I}R_{B} \begin{bmatrix} v_x \Delta t \\ v_y \Delta t \end{bmatrix} $$

where $x_B$ and $y_B$ represent the current positions of the robot's body in the inertial coordinate frame I, as computed by the state estimation module; ${}^{I}R_{B}$ denotes the rotation matrix between the body coordinate frame B and the inertial coordinate frame I; and $v_x \Delta t$ and $v_y \Delta t$ represent the incremental position of the robot over the time window $\Delta t$, expressed in the body coordinate frame B.
Similarly, the command for the rotation of the robot's body around the z-axis can be expressed in the inertial coordinate frame I as follows:

$$ \theta_{z}^{des} = \theta_{z} + \omega_{yaw}\, \Delta t $$
During typical robot locomotion, the height command for the robot’s body is maintained at a constant value hnom, which refers to the height of the floating base relative to the foot contact points. To determine the height of the floating base in the inertial coordinate frame I, a function zmap(x, y) is designed, and interpolation methods are employed to obtain the terrain height corresponding to a specified position (x, y).
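One simple realization of $z_{map}(x, y)$ is bilinear interpolation over the elevation grid. The sketch below assumes a row-major height array anchored at the map origin; this layout is our convention, not something specified by the paper.

```python
import numpy as np

def z_map(height: np.ndarray, origin: np.ndarray, res: float,
          x: float, y: float) -> float:
    """Bilinearly interpolated terrain height at world position (x, y)."""
    gx, gy = (x - origin[0]) / res, (y - origin[1]) / res
    i, j = int(np.floor(gx)), int(np.floor(gy))
    ax, ay = gx - i, gy - j                     # fractional offsets in [0, 1)
    z00, z10 = height[i, j], height[i + 1, j]
    z01, z11 = height[i, j + 1], height[i + 1, j + 1]
    return ((1 - ax) * (1 - ay) * z00 + ax * (1 - ay) * z10
            + (1 - ax) * ay * z01 + ax * ay * z11)
```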
3.3.2. Foothold Planning
Without considering the terrain, the nominal foothold is determined by integrating gait information with the floating base trajectory. Based on the heuristic foothold theory proposed by Raibert [40], the nominal foothold can be expressed as follows:

$$ p^{nom} = p_B^{mid} + \sqrt{\frac{h_{nom}}{g}} \left( v_B - v_B^{cmd} \right) \tag{15} $$

where $p^{nom}$ represents the nominal foothold in the inertial coordinate frame I; $p_B^{mid}$ denotes the position of the floating base at the midpoint of the support phase; $g$ is the gravitational acceleration; and $v_B$ and $v_B^{cmd}$ represent the measured velocity and commanded velocity of the robot's floating base, respectively. Equation (15) enables closed-loop control of the robot's velocity through foothold selection, which is critical for dynamic stability. For instance, if the measured velocity of the robot under external disturbances exceeds the desired value, the nominal step length increases, prompting the robot to take larger steps to decelerate; conversely, smaller steps are taken to accelerate.
By adapting the nominal foothold to the terrain information, the projected foothold and feasible landing region can be obtained. The specific steps of this process are as follows:
- (a) Calculate the distance from the nominal foothold to all segmented planes, and sort the planes in ascending order of distance.
- (b) For each plane, determine whether the nominal foothold lies within the plane boundary. If it is outside the boundary, project it onto the boundary by identifying the closest boundary point; if it lies within the boundary, the projected foothold remains unchanged.
- (c) Compute the loss value of each projected foothold and select the one with the minimum loss value. After evaluating all planes, a set of projected footholds $\{p^{proj}\}$ is obtained. The loss value for each foothold is calculated as follows:

$$ L_{foothold} = \left\| p^{proj} - p^{nom} \right\|^{2} + \omega_{kin}\, f_{kin} $$

where $f_{kin}$ is the kinematic penalty term. This term assumes a large value if the selected foothold causes the robot's leg length to exceed the allowable threshold during landing or lifting, or if it results in leg crossing. By adjusting the weight $\omega_{kin}$, the influence of the kinematic penalty term on the overall loss value can be modulated.
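The selection loop can be summarized in a few lines. In the sketch below, each segmented plane is reduced to an axis-aligned rectangle for brevity, and `kinematic_penalty` is a placeholder for the leg-length and leg-crossing checks; both simplifications are ours.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Plane:
    lo: np.ndarray   # (2,) min x, y of the (simplified) boundary
    hi: np.ndarray   # (2,) max x, y of the boundary
    z: float         # plane height

    def project(self, p_xy: np.ndarray) -> np.ndarray:
        return np.clip(p_xy, self.lo, self.hi)   # closest boundary point

def kinematic_penalty(p: np.ndarray) -> float:
    return 0.0       # placeholder for leg-length / leg-crossing checks

def select_foothold(p_nom: np.ndarray, planes: list[Plane],
                    w_kin: float = 10.0) -> np.ndarray:
    """Project the nominal foothold onto each plane and keep the best one."""
    best, best_loss = None, np.inf
    for plane in planes:
        proj = np.append(plane.project(p_nom[:2]), plane.z)
        loss = np.sum((proj - p_nom) ** 2) + w_kin * kinematic_penalty(proj)
        if loss < best_loss:
            best, best_loss = proj, loss
    return best
```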
3.3.3. Swing Foot Trajectory Planning
The height trajectory of a complete swing phase is divided into two segments, the lifting phase and the descending phase, each represented by a cubic polynomial curve. For the lifting process, the endpoint corresponds to the midpoint of the swing phase, where the foot height reaches its maximum value $z_{middle}$, expressed as follows:

$$ z_{middle} = \max\left( z_{liftOff},\; z_{touchDown} \right) + h_{swing} $$

where $z_{liftOff}$ and $z_{touchDown}$ represent the initial and final heights of the swing foot, respectively, and $h_{swing}$ denotes the user-specified lifting height.
The initial height and velocity of the descending process are consistent with the final height and velocity of the lifting process. The final height is determined by the height of the projected foothold obtained from the foothold planning module, and the final velocity is set to the user-specified landing velocity, which is typically negative.
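A compact way to realize the two cubic segments is a Hermite spline through three knots (lift-off, apex, touchdown). The sketch below additionally assumes zero vertical velocity at lift-off and at the apex, which the text does not state explicitly.

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def swing_height_profile(z_lift: float, z_td: float, h_swing: float,
                         t_swing: float, v_land: float = -0.1):
    """Two cubic segments for the swing-foot height, matched at the apex."""
    z_mid = max(z_lift, z_td) + h_swing          # apex height
    t = np.array([0.0, 0.5 * t_swing, t_swing])  # knot times
    z = np.array([z_lift, z_mid, z_td])          # knot heights
    dz = np.array([0.0, 0.0, v_land])            # knot velocities
    return CubicHermiteSpline(t, z, dz)          # callable z(t)

profile = swing_height_profile(0.0, 0.16, 0.08, 0.4)
print(profile(0.2))   # apex height at mid-swing
```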
When the plane of the foothold region is parallel to the horizontal plane, the height planning for the four contact points of the foot is consistent with the foothold height planning. However, when the plane of the foothold region is not parallel to the horizontal plane, the height planning for the four contact points must account for the orientation of the foothold plane. Taking the left front contact point of the left foot (LLF) as an example, its position in the inertial coordinate frame I can be derived through coordinate transformation as follows:

$$ {}^{I}p_{LLF} = p_f + {}^{I}R_{F}\; {}^{F}p_{LLF} $$

where $p_f$ represents the foothold position in the inertial coordinate frame I; ${}^{I}R_{F}$ denotes the rotation matrix between the foot coordinate frame and the inertial coordinate frame I, which is computed using the rotation matrix of the convex polygon and the yaw angle command; and ${}^{F}p_{LLF}$ is the vector from the ankle joint projection point to the LLF contact point, expressed in the foot coordinate frame.
3.4. Nonlinear MPC
In this study, the control problem of humanoid robots is formulated as an optimal control problem, expressed as follows:

$$ \begin{aligned} \min_{u(\cdot)} \quad & \Phi\big(x(T)\big) + \int_{0}^{T} L\big(x(t), u(t), t\big)\, dt \\ \text{s.t.} \quad & x(0) = x_0, \\ & \dot{x} = f(x, u, t), \\ & g(x, u, t) = 0, \\ & h(x, u, t) \geq 0 \end{aligned} $$

where $x(t)$ and $u(t)$ represent the state and input variables of the robot system, respectively; $\Phi(x(T))$ denotes the terminal loss associated with the terminal state $x(T)$; and $L(x(t), u(t), t)$ represents the running loss over the entire time horizon. Unlike classical MPC [41], the dynamics of humanoid robots cannot simply be approximated as a linear system. Therefore, we incorporate the nonlinear dynamics formulation of Equation (12) within the MPC architecture to achieve higher performance.
The objective of this optimization problem is to find a set of state and input variables that minimize the loss function while satisfying the initial value constraints, dynamic equations, equality constraints, and inequality constraints. In the following subsections, the loss function and constraints are elaborated on in detail for the complex system of humanoid robots.
3.4.1. Loss Function
The running loss function $L(x(t), u(t), t)$ is divided into two components: the tracking loss $L_{\varepsilon}$ for states and inputs, and the penalty loss $L_{B}$ introduced by the inequality constraints. Terms related to trajectory tracking, including the tracking errors of the states and inputs, belong to the tracking loss $L_{\varepsilon}$, expressed as follows:

$$ L_{\varepsilon} = \frac{1}{2}\, \varepsilon_x^{T} W_x\, \varepsilon_x + \frac{1}{2}\, \varepsilon_u^{T} W_u\, \varepsilon_u $$

where $\varepsilon_x$ denotes the tracking error of the states, which is used to minimize the deviation between the robot's state variables and the reference states; $\varepsilon_u$ represents the tracking error of the inputs, which minimizes the deviation between the robot's input variables and the reference inputs; and $W_x$ and $W_u$ are the weight matrices for the state and input errors, respectively.
Following the approach proposed in [42], transforming all inequality constraints into penalty losses significantly simplifies the solution of the optimization problem. Therefore, we incorporate the inequality constraints related to terrain adaptability and dynamic stability (e.g., foothold constraints and obstacle avoidance constraints) into the penalty loss $L_B$, expressed as follows:

$$ L_{B} = \sum_{i} \left[ B_j(\cdot) + B_t(\cdot) + B_f(\cdot) + B_d(\cdot) \right] $$

where $B_j(\cdot)$ represents the joint limit constraints for the $i$-th leg, $B_t(\cdot)$ denotes the foothold constraints, $B_f(\cdot)$ represents the friction cone constraints, and $B_d(\cdot)$ denotes the obstacle avoidance constraints.
The tracking loss $L_{\varepsilon}$ and the penalty loss $L_{B}$ are combined through weighted summation in our multi-objective optimization problem. This weighting strategy reflects prioritized control objectives: higher weight coefficients are assigned to the terrain adaptability and dynamic stability penalties, moderate weight coefficients to trajectory tracking terms such as body height regulation and velocity tracking, and lower weight coefficients to joint trajectory tracking so as not to compromise dynamic stability and terrain adaptability. The terminal loss is computed from the state and input errors at the terminal time T, expressed as follows:

$$ \Phi\big(x(T)\big) = \frac{1}{2}\, \varepsilon_x^{T}(T)\, W_{x,T}\, \varepsilon_x(T) + \frac{1}{2}\, \varepsilon_u^{T}(T)\, W_{u,T}\, \varepsilon_u(T) $$
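A common choice for turning an inequality $h(x, u) \geq 0$ into such a penalty is the relaxed logarithmic barrier, which remains defined (via a quadratic extension) even when the constraint is violated during intermediate SQP iterates. The sketch below shows one standard form; the parameter names and values are illustrative, and we have not verified that [42] uses exactly this variant.

```python
import numpy as np

def relaxed_log_barrier(h: float, mu: float = 0.1, delta: float = 0.01) -> float:
    """Relaxed log-barrier penalty for a scalar constraint h >= 0.

    Behaves like -mu*ln(h) in the interior and switches to a smooth
    quadratic extension below the relaxation threshold delta, so the
    penalty and its gradient remain finite for infeasible iterates.
    """
    if h > delta:
        return -mu * np.log(h)
    # Quadratic extension, C2-continuous at h = delta.
    return mu * (0.5 * (((h - 2.0 * delta) / delta) ** 2 - 1.0)
                 - np.log(delta))
```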
3.4.2. Constraints
For the four contact points of the supporting leg, each contact point must lie within a three-dimensional space bounded by a convex polygon. This space is enclosed by N boundary surfaces, each of which is perpendicular to the plane of the convex polygon and passes through one edge of the polygon, as illustrated in Figure 4a. The condition that the contact points simultaneously lie on the positive side of the N boundary surfaces can be expressed in vector form as follows:

$$ A\, p_i + b \geq 0, \qquad A \in \mathbb{R}^{N \times 3},\; b \in \mathbb{R}^{N} $$

where $A$ and $b$ are the parameters of the boundary surfaces, and $p_i$ is the position vector of the $i$-th contact point in the inertial coordinate system.
During the swing phase, to prevent collisions between the swinging foot and the environment, a one-dimensional nonlinear inequality constraint is applied to each contact point of the foot, expressed as follows:

$$ d_{SDF}(p_i) - d_{min}(t) \geq 0 $$

where a signed distance field is used to describe the shortest distance $d_{SDF}(p_i)$ from the position of contact point $p_i$ to the obstacle, as shown in Figure 4b, and $d_{min}(t)$ is a time-dependent shaping function that relaxes this constraint under specific conditions (e.g., around lift-off and touchdown, when the foot must be near the terrain).
To prevent relative sliding between the supporting foot and the environment, a second-order nonlinear inequality constraint is applied to each contact point, expressed as follows:

$$ \mu_c F_z - \sqrt{F_x^2 + F_y^2 + \varepsilon^2} \geq 0 $$

where $\mu_c$ is the friction coefficient; $[F_x\; F_y\; F_z]^T = {}^{F}R_{I}\, f_i$ represents the contact force expressed in the foot coordinate system, whose tangential components are $F_x$ and $F_y$; and $\varepsilon > 0$ is introduced to ensure the continuity of the constraint derivative when $f_i = 0$. Additionally, the linear velocity of the contact points should be constrained to zero, i.e., $v_i = 0$.
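The smoothing role of $\varepsilon$ is easy to see in code: without it, the gradient of the Euclidean norm is undefined at zero force. A minimal residual function follows; names and default values are our own.

```python
import numpy as np

def friction_cone_residual(f_foot: np.ndarray, mu_c: float = 0.7,
                           eps: float = 0.5) -> float:
    """Smoothed friction-cone constraint value; feasible when >= 0."""
    fx, fy, fz = f_foot
    return mu_c * fz - np.sqrt(fx**2 + fy**2 + eps**2)

print(friction_cone_residual(np.array([10.0, 0.0, 100.0])))  # feasible: > 0
```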
During the swing phase, a height trajectory constraint is imposed on the four contact points, expressed as follows:

$$ n^{T} v_i - \left[ n^{T} v_i^{ref} + k_p\, n^{T} \left( p_i^{ref} - p_i \right) \right] = 0 $$

where $p_i^{ref}$ and $v_i^{ref}$ are the reference position and velocity of the $i$-th contact point, respectively, and $k_p$ is the position error gain. The normal vector $n$ of the plane is obtained through time-based interpolation, with the starting point being the terrain normal vector at the lift-off moment and the endpoint being the terrain normal vector at the landing moment.
To fully constrain the motion of the swinging foot, a constraint is imposed on the foot's orientation angles, expressed as a three-dimensional nonlinear equality:

$$ \mathrm{dis}\left( \Theta_i,\; \Theta_i^{ref} \right) = 0 $$

where $\Theta_i^{ref}$ is the orientation reference value of the $i$-th foot, obtained through time-based interpolation and represented using quaternions, and $\mathrm{dis}(\cdot)$ is a function that calculates the distance between two quaternions. Additionally, to ensure that the swinging foot does not make contact with the environment, the contact force is constrained to zero, i.e., $f_i = 0$.
Throughout the whole motion, all actuated joints are required to satisfy physical constraints on joint positions and velocities, expressed as multidimensional linear inequalities:

$$ q_a^{min} \leq q_a \leq q_a^{max}, \qquad \dot{q}_a^{min} \leq \dot{q}_a \leq \dot{q}_a^{max} $$

where $q_a^{max}$ and $\dot{q}_a^{max}$ are the upper limits for the joint positions and velocities, respectively, and $q_a^{min}$ and $\dot{q}_a^{min}$ are the corresponding lower limits.
6. Simulation and Results
To validate the effectiveness of the proposed framework, this study leverages the MuJoCo simulation platform to evaluate the locomotion capabilities of a robot (height: 1.5 m, total mass: 48 kg). Our experimental protocol is designed to quantify three critical performance metrics: dynamic stability, robustness under perturbation scenarios, and adaptive capacity in unstructured terrains. The specific experiments conducted include the following:
- (a)
Forward walking: This experiment aims to evaluate the robot's dynamic stability and locomotion efficiency. The robot is tasked with walking forward on a flat surface, and its gait patterns, balance maintenance, and speed consistency are analyzed to assess its dynamic performance.
- (b)
External force disturbance: This experiment is designed to test the robot’s robustness against external perturbations. External forces are applied to the robot at various points during its motion, and its ability to recover balance and maintain stable locomotion is observed. This test highlights the control framework’s capability to handle unexpected disturbances.
- (c)
Complex terrain walking: This experiment aims to evaluate the robot’s adaptability to uneven and challenging terrains. The robot is required to traverse surfaces with varying slopes, steps, and obstacles. Its ability to adjust its gait, maintain stability, and avoid collisions is analyzed to demonstrate its adaptability to real-world environments.
These simulated experiments collectively provide a comprehensive assessment of the robot's motion control framework, ensuring its effectiveness in dynamic, robust, and adaptive scenarios. The simulation videos are included in the Supplementary Materials.
6.1. Forward Walking
The whole-body motion control algorithm and state estimation operate at a frequency of 500 Hz. The MPC algorithm runs at a frequency of 50 Hz, with a prediction horizon of 1 s and a discrete time step of 0.015 s. The SQP solver [44] based on HPIPM [45] is employed with a convergence tolerance of 1 × 10⁻⁴ and a maximum of one iteration per optimization cycle. To ensure computational efficiency, three threads are allocated for the SQP calculations. The simulation time step is set to 0.002 s. In the absence of external disturbances and on a flat ground surface, the desired velocity of the robot in the X-direction is set to gradually accelerate from 0 m/s to 1 m/s and then decelerate back to 0 m/s.
Figure 5 shows the velocity tracking curve of the robot’s floating base. The robot successfully follows the commanded acceleration, constant velocity, and deceleration processes. The maximum velocity tracking error at 1 m/s is 0.07 m/s, while the velocity fluctuation in the Y-direction remains within 0.05 m/s.
Figure 6 displays the joint torque vs. speed curves during the robot’s walking process, providing a basis for motor selection. Throughout the process, the Hip Pitch joint exhibits the maximum torque, approximately 200 Nm. The joint speeds are generally low, all below 4 rad/s.
6.2. External Disturbance
After demonstrating basic walking capabilities, the robot’s stability under external force disturbances is further evaluated. In the simulation environment, the robot is commanded to perform in-place stepping while external disturbances are simulated by applying impulsive forces. The magnitude of the impulsive force is set to 120 N with durations ranging from 0.2 s to 0.5 s.
Figure 7a illustrates the velocity curve of the robot's floating base. The first three impulsive forces have a duration of 0.2 s, while the latter three have a duration of 0.5 s. After experiencing an impulse of 24 Ns, the robot's velocity exhibits brief fluctuations, with a peak reaching 0.6 m/s. When subjected to an impulse of 60 Ns, the robot's velocity shows more significant fluctuations, with a peak reaching 1.1 m/s. In this case, the robot adjusts its velocity by taking two consecutive steps forward, as illustrated in the corresponding timing diagram in Figure 7b. The disturbance resistance capability (forward push impulse: 60 Ns, robot mass: 48 kg) is comparable to that of the learning-based approach implemented on the iCub robot (forward push impulse: 40 Ns, robot mass: 33 kg) [46].
Figure 7c depicts the velocity curve under lateral impulsive forces. Given that the lateral disturbance resistance is weaker than the forward resistance, the magnitude of the three lateral impulsive forces is set to 60 N, each with a duration of 0.5 s. After experiencing the lateral impulses, the robot's lateral velocity exhibits substantial fluctuations, with a peak reaching 1 m/s. The robot adjusts its lateral velocity by modifying its lateral foot placement, as shown in the corresponding timing diagram in Figure 7d. Notably, the robot does not exhibit leg crossing after lateral disturbances, validating the effectiveness of the self-collision avoidance constraint in the whole-body control framework.
6.3. Walking Experiments on Complex Terrain
To evaluate the robot’s adaptability to complex terrains, an RGB-D sensor and a staircase terrain were integrated into the simulation environment. The RGB-D sensor was utilized to acquire depth information about the terrain. The staircase consists of four steps, each with a height of 0.16 m and a width of 0.5 m. The elevation map was configured with a size of 3 m × 3 m and a resolution of 0.04 m. The robot was commanded to move forward at a speed of 0.36 m/s with a gait cycle of 0.8 s.
Figure 8a illustrates the timing diagram of the robot climbing the stairs. As the robot approaches the stairs, it autonomously adjusts its base height and modifies its foot placement to align with the step surfaces. To provide a more intuitive representation of the motion details, Figure 8b presents the timing diagram of foot placement planning during the stair-climbing process. In the figure, two robot models are displayed: one representing the current state (with high transparency) and the other representing the state 1 s ahead (with low transparency). The blue points denote the nominal foot placement points, while the green points represent the projected foot placement points. The swing trajectories of the four contact points are depicted by curves of different colors, and the convex polygonal regions corresponding to the projected foot placement points are represented by polygons of different colors. As shown in Figure 8b, when the nominal foot placement point approaches the edge of a step, the projected foot placement point automatically adjusts to either the current step surface or the next step surface. If the projected foot placement point lies on the next step surface, the swing leg trajectory smoothly transitions from the current step surface to the next, ensuring seamless stair climbing. Simultaneously, the robot's base height is dynamically adjusted based on the elevation map to align with the step height.
Figure 9 compares the simulation results with and without the swing leg collision avoidance constraint. When the constraint is not considered, the swing foot trajectory optimized by the MPC collides with the stairs (as shown in Figure 9a), causing the robot to fall. In contrast, when the constraint is incorporated, the optimized swing foot trajectory automatically avoids the stair edges (as shown in Figure 9b), ensuring that the swing foot does not collide with the staircase.
To further evaluate the robot's terrain adaptability in complex environments, a 20 × 20 grid terrain was constructed in the simulation environment, as illustrated in Figure 10. Each grid cell measures 0.6 m × 0.6 m, with randomly assigned height differences between adjacent cells (up to 0.16 m). For minor height differences, adjacent cells are classified as coplanar regions, enabling direct traversal by the robot. In such scenarios, the robot's stability is predominantly ensured by the controller's inherent robustness. Conversely, for significant height variations, foothold planning becomes the critical determinant of successful traversal.
Figure 10 illustrates the temporal sequence of the robot navigating this randomized terrain. The results demonstrate that despite stochastic elevation variations between adjacent grids, the robot achieves stable locomotion through adaptive foothold selection and dynamic control.
7. Conclusions
This study has developed a comprehensive terrain-aware hierarchical framework that effectively bridges the perception–planning–control loop for humanoid locomotion in complex environments. By segmenting the elevation map into convex polygonal terrain features, a terrain-constrained nonlinear MPC formulation is constructed that enables real-time foothold adaptation through convex polygon constraints. The sequential quadratic programming algorithm ensures real-time generation of gaits and trajectories at a 50 Hz update rate, while a hierarchical optimization-based whole-body control strategy effectively coordinates dynamic balance and trajectory tracking tasks. Simulation results validate that the proposed framework significantly enhances the robot's dynamic performance, robustness, and terrain adaptability, providing a novel solution to the challenges of dynamic motion control of humanoid robots in unstructured environments.
The terrain perception in this work is grounded in elevation mapping techniques, which introduce inherent limitations when confronting extreme terrains such as sandy substrates, overgrown vegetation, or jagged rocks: extracting convex polygons from such highly irregular geometries or deformable surfaces is difficult. Additionally, real-world deployment on physical robots faces challenges including sensor noise and latency, actuator nonlinearities, and discrepancies between simulated and physical system dynamics. Future investigations will integrate multimodal perception for accurate and robust terrain reconstruction, for example via learning-based methods [47], and will include hardware validation to bridge the gap between simulation and real-world deployment.