1. Introduction
Legged robots have emerged as a fascinating field of study and innovation in the realm of robotics. The unique capabilities of legged robots stem from the natural motion patterns of animals, which have evolved over millions of years to be efficient and effective in various landscapes. Drawing inspiration from nature's most agile creatures, quadrupedal robots aim to replicate the locomotion abilities of four-legged animals [1]. Their versatile and adaptive movement in unstructured environments has garnered significant interest across various industries, ranging from search-and-rescue operations [2] to the exploration of harsh terrains [3,4]. Legged locomotion holds a distinct advantage over wheeled or tracked locomotion as it allows robots to traverse challenging terrains, bypass obstacles, and maintain stability even in unpredictable environments [5]. Quadrupedal robots, in particular, have shown great promise due to their ability to distribute weight across four legs, ensuring better stability and load-bearing capacity.
Considerable effort and focus have gone into the design of robotic feet for stable interaction with the environment [6,7,8,9]. For instance, the authors in [10] developed passive mechanically adaptive feet for quadrupedal robots capable of adapting to different terrain surfaces, significantly reducing robot slippage compared with flat and ball feet.
From the motion point of view, dynamic locomotion plays a pivotal role in achieving agile and efficient movements for legged robots [11]. Unlike static locomotion, dynamic locomotion emphasizes the agility of these machines. The exploration of dynamic locomotion control for quadrupedal robots has become a critical area of research, seeking to enhance their agility, stability, and overall performance. Through dynamic locomotion control, robots can adjust their gait, step frequency, and leg movements in response to varying terrains, slopes, or disturbances. This adaptability allows quadrupedal robots to quickly handle unexpected obstacles and maintain balance and stability while moving at different speeds. The development of effective dynamic locomotion control algorithms presents numerous challenges [12]. Engineers and researchers must address complex issues such as environment reconstruction, state estimation, motion planning, and feedback control. Furthermore, achieving seamless coordination between multiple legs and ensuring smooth transitions between different gaits demands sophisticated control strategies.
Recent works have demonstrated that by exploiting highly parallel GPU-accelerated simulators and deep reinforcement learning, end-to-end policies can be trained in minutes even for complex control tasks such as quadrupedal robot dynamic locomotion [13,14].
To the best of the authors' knowledge, there is no previous research addressing the problem of designing dynamic locomotion control algorithms for quadrupedal robots equipped with the passive adaptive feet presented in [10] (Figure 1), which differ substantially from the well-known and extensively investigated ball and flat feet. The motivation of this work is therefore the lack of control approaches designed to fully exploit and leverage the information provided by this type of adaptive foot.
Given the high kinematic and dynamic complexity of these adaptive feet, and considering the lack of informative mathematical models of dynamic locomotion in harsh environments, such as those encountered in robotic environmental monitoring of natural habitats, we decided to address this problem with data-driven, learning-based methods rather than model-based approaches.
The contribution of this work is the design, training, and investigation of different learning-based end-to-end control policies enabling quadrupedal robots equipped with the passive adaptive feet presented in [10] to perform dynamic locomotion over a diverse set of terrains. By end-to-end policies, we mean that between the input and the output of the controller there is only a single neural network, without any additional model-based sub-module. The adaptive feet used in this work come from previous research [10]; thus, they are not a contribution of this work.
This paper is organized as follows: in Section 2, we present the state-of-the-art learning-based quadrupedal robot dynamic locomotion control approaches and other related works; in Section 3, we present the method used to solve the quadrupedal robot dynamic locomotion control problem with adaptive feet; in Section 4, the results obtained in simulation are shown; in Section 5, we compare and discuss the results of the different trained control policies; and in Section 6, we conclude and present potential future work that could improve the performance of the system.
2. Related Work
State-of-the-art learning-based quadrupedal robot dynamic locomotion control approaches use hybrid methods like [15], where a policy is trained to provide a few parameters that modulate model-based foot trajectory generators. In [16], the authors train in simulation, using a hybrid simulator similar to the one in [17], a non-exteroceptive student policy based on a temporal convolutional neural network [18] that imitates the behavior of a teacher policy with access to privileged information not available to the real robot. In [19], the authors similarly train a LiDAR- and RGBD-based perceptive policy capable of distinguishing when to rely on proprioception rather than exteroception using a recurrent neural network-based belief-state encoder.
These methods are hybrid in the sense of [15]; they are not end-to-end, since they do not rely on a purely learning-based approach. End-to-end pipelines have the freedom to learn behaviors that may not be achievable with an a priori fixed structure. On the other hand, hybrid approaches provide a better warm start in the learning phase, are more reliable because of their higher level of interpretability, and are more robust in terms of stability because they build on well-known first principles. However, certified learning strategies [20] can be used to verify and satisfy safety and stability requirements directly during the training process. Figure 2 highlights the main differences between the controller components of hybrid architectures and end-to-end ones.
More recently, end-to-end methods [14] have instead been demonstrated to successfully learn perceptive policies, whose performance is comparable to that of hybrid methods [19], without any model-based trajectory generator or other prior knowledge, and without any privileged learning or teacher–student imitation learning. Other works [21,22,23,24,25] have shown how end-to-end learning-based pipelines enable advanced motor skills for quadrupedal robot locomotion and navigation in unstructured and challenging environments. These methods, however, have all been designed to run on robots equipped with standard ball feet, under point-contact dynamics assumptions.
Passive adaptive feet [10] have been designed to adapt to the shape of any kind of obstacle, maximizing grip and avoiding slippage. Moreover, they are equipped with a sensing system capable of providing information about the environment and the foot interaction with the surroundings. The authors in [26] developed a foothold selection algorithm for quadrupedal robots equipped with the SoftFoot-Q [10] mechanically adaptive soft feet. As in [27], building on the baseline [28], a dataset of elevation maps of synthetically generated terrains is acquired and collected in simulation. A polynomial fitting algorithm is then used, along with other computations, to compute the associated cost maps, which are then used for the foothold selection. Since the terrain assessment and constraint checking are computationally expensive, a convolutional neural network (CNN), more precisely an efficient residual factorized convolutional neural network (ERFNet) [29], is trained and used to predict a good approximation of the cost map, which is then used to compute the potential footholds on the elevation map.
Another interesting work addressing real-time dynamic locomotion on rough terrains that deals with dynamic footholds is presented in [30], but it is still designed to run on robots equipped with standard ball feet. In [31], the authors aim to improve the robustness of dynamic quadrupedal locomotion by employing a fast model-predictive foothold-planning algorithm and an LQR controller applied to projected inverse dynamic control for robust motion tracking. The method shows robustness by achieving good dynamic locomotion performance even with unmodeled adaptive feet, but it does not make any use of the supplementary haptic information of the terrain provided by these feet and thus does not exploit the benefits of their passive adaptability.
Despite the good work on footstep planning with the adaptive feet presented in [10], there is no work on dynamic locomotion control explicitly designed for them. All the mentioned methods employ static walking gait patterns, such as crawling, where robot speed is limited. Moreover, all the works presented so far that address the problem of dynamic locomotion do so without the use of the adaptive feet presented in [10].
3. Method
We model the quadrupedal robot dynamic locomotion control problem with adaptive feet as a Markov decision process (MDP). The latter is a mathematical framework to formulate discrete-time decision-making processes and is the standard mathematical formalism when dealing with reinforcement learning problems. An MDP is described by a tuple of five elements $(\mathcal{S}, \mathcal{A}, r, p, \rho_0)$: $\mathcal{S}$ defines the state space, $\mathcal{A}$ defines the action space, $r$ is the reward function, $p$ is the transition probability density function, and $\rho_0$ is the starting state distribution.
At each time step $t$, the learning agent takes an action $a_t \in \mathcal{A}$ based on the environment state $s_t \in \mathcal{S}$. The action is drawn from a parameterized policy $\pi_\theta(a_t \mid s_t)$, with $\theta$ being the set of policy parameters. This results in a reward value $r_t = r(s_t, a_t)$. In the case of policy optimization algorithms, among reinforcement learning methods, the goal is to find the optimal set of policy parameters $\theta^{*}$ which maximizes the expected return $J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[G(\tau)\right]$, i.e., $\theta^{*} = \arg\max_{\theta} J(\theta)$. This goal is achieved through interactions with the environment in a trial-and-error fashion, where $\tau = (s_0, a_0, s_1, a_1, \ldots)$ are trajectories (sequences of states and actions) drawn from the policy, and $G(\tau) = \sum_{t=0}^{\infty} \gamma^{t} r_t$ is the infinite-horizon discounted return (the cumulative reward), with discount factor $\gamma \in (0, 1)$ and $r_t$ being the reward at time step $t$.
A Gaussian distribution is used to model the policy $\pi_\theta$, i.e., $\pi_\theta(a_t \mid s_t) = \mathcal{N}\left(\mu_\theta(s_t), \sigma^2\right)$. We use a multi-layer perceptron (MLP) for the mean value $\mu_\theta(s_t)$ and a state-independent standard deviation $\sigma$. The policy optimization is obtained through the proximal policy optimization (PPO) algorithm [32] from RL-Games [33].
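For reference, PPO [32] maximizes the standard clipped surrogate objective, reported here in its usual form:

    L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

where $\hat{A}_t$ is an estimate of the advantage function and $\epsilon$ is the clipping parameter; the specific hyper-parameter values used in this work are the environment defaults discussed later in this section.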
We define our approach as end-to-end because it uses a single neural network as the only component of the controller.
As the authors in [26] highlight, the simulation of the adaptive foot [10] is slow because of its mechanical complexity, since it comprises a total of 34 joints, and a quadrupedal robot employs four of these feet. Even using a GPU-accelerated physics-based simulator such as Isaac Sim [34], the simulation remains slow, especially considering that, to exploit its high level of parallelism, thousands of robots are simulated simultaneously, which is useful to find optimal policies for reinforcement learning problems in a short time [13]. Moreover, even if Isaac Sim supports closed kinematic chain simulation, given the high complexity of the adaptive feet, the simulation of the closed kinematic model of the foot is highly unstable and often leads to non-physical behaviors when high contact forces are involved. For these reasons, we decided to employ two simplified approximations of the adaptive foot model shown in Figure 3a, both of which have an open kinematic chain structure while preserving the main degrees of freedom responsible for its adaptability.
These model approximations are shown in Figure 3. In particular, the two simplified models are the Adaptive Flat Foot (AFF) model and the Adaptive Open Foot (AOF) model, represented in Figure 3b,c, respectively. The Adaptive Flat Foot keeps the pitch and roll joints of the ankle of the SoftFoot-Q shown in Figure 3a, with the three chains replaced by a single rigid flat body for the sole (2 passive degrees of freedom in total). The Adaptive Open Foot instead splits the single rigid flat body of the sole into two bodies, each one having an additional rotational degree of freedom about the transversal axis of the foot roll link, allowing the foot to adapt to the shapes of different terrain obstacles and artifacts (4 passive degrees of freedom in total).
The reinforcement learning environments used for training the dynamic locomotion control policies, shown in Figure 4, are based on the baselines available in [35], a collection of reinforcement learning environments for common and complex continuous-time control problems, implemented to run on Isaac Sim. They are the flat terrain locomotion environment, shown in Figure 4a, and the rough terrain locomotion environment, shown in Figure 4b. Both of them use the ANYmal quadrupedal robot platform [36], equipped with a total of 12 actuated degrees of freedom, 3 per leg, namely, Hip Abduction–Adduction (HAA), Hip Flexion–Extension (HFE), and Knee Flexion–Extension (KFE).
As in [14], the rough terrain policies were trained in a game-inspired automatic curriculum fashion [37], starting from simpler terrain samples and increasing the complexity once the control policies allow the robots to traverse them [38]. Therefore, the environment terrain is built as a single set of different terrains (smooth slope, rough slope, stairs, stepping stones) at different levels of difficulty, following the setup shown in Figure 5.
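For illustration, a minimal sketch of such a game-inspired curriculum update is given below; the tensor names and the promotion/demotion thresholds are illustrative assumptions, not the values used in [14,35].

    import torch

    def update_terrain_levels(levels, distance_walked, commanded_speed,
                              episode_length_s, max_level):
        # Promote robots that traversed most of the commanded distance,
        # demote those that covered less than half of it (thresholds are illustrative).
        move_up = distance_walked > 0.8 * commanded_speed * episode_length_s
        move_down = distance_walked < 0.5 * commanded_speed * episode_length_s
        levels = levels + move_up.long() - move_down.long()
        # Robots that solve the hardest level restart from a random difficulty.
        solved = levels > max_level
        levels[solved] = torch.randint_like(levels[solved], max_level + 1)
        return levels.clamp(min=0)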
For the flat terrain locomotion environment, the policy MLP has three hidden layers, with 256, 128, and 64 units, respectively, and a 12-dimensional output layer representing the 12 desired joint velocities, which are then integrated with the current joint positions to obtain the desired joint positions.
The rough terrain locomotion environment, instead, employs a policy MLP with three hidden layers with 512, 256, and 128 units, respectively, and a 12-dimensional output layer representing the 12 desired joint positions directly.
For both environments, a low-level PD controller is then used to compute the desired joint torques to apply to the robot actuated joints.
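A minimal sketch of this action pipeline is given below; the tensor names, the integration time step, and the PD gains are illustrative assumptions, not the values used in the paper.

    def actions_to_torques(action, q, dq, dt, kp, kd, flat_terrain=True):
        # Flat terrain policy: the action is a desired joint velocity, integrated
        # with the current joint positions to obtain the desired joint positions.
        if flat_terrain:
            q_des = q + action * dt
        # Rough terrain policy: the action is directly the desired joint position.
        else:
            q_des = action
        # Low-level PD controller computing the desired joint torques.
        tau = kp * (q_des - q) - kd * dq
        return tau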
We considered different types of policy. In particular, a policy can either be perceptive, using a robot-centric elevation map of the surrounding terrain (shown in Figure 6) as part of its input, or blind, without considering it. Then, a policy can either exploit terrain haptic perception, also taking as input the joint positions and velocities of the feet, or not. Considering the combinations of these strategies, four types of locomotion policy have been defined: blind–non-haptic (B), perceptive–non-haptic (P), blind–haptic (BH), and perceptive–haptic (PH). The configuration of the policy observation space for each type of policy is given in Table 1. For the rough terrain locomotion task, all these types of policy were investigated, while for the flat terrain locomotion task, only the blind–non-haptic policy type (B) was considered.
The contact conditions of the adaptive feet with the terrain are not contained in any of the policy inputs. The feet information that the different policies take as input comprises the feet joint positions and velocities, as described in Table 1. Even if the feet joint positions and velocities cannot be directly measured via joint encoders, they can be easily estimated from the four IMU measurements available on each foot [10].
The input components of these different types of policy can be all or some of the following: the base twist, consisting of the linear and angular velocities of the robot base; the projected gravity, that is, the projection of the gravity vector onto the base frame axes, which gives information about the robot base orientation with respect to the gravity direction; the command, that is, the user-provided desired robot base twist reference, whose details are given in the next paragraphs; the joint positions, namely, the positions of the actuated joints; the joint velocities, namely, the velocities of the actuated joints; the previous action, that is, the output of the policy at the previous time step; the height map, i.e., the elevation map shown in Figure 6; the feet joint positions, i.e., the positions of the feet passive joints; and the feet joint velocities, i.e., the velocities of the feet passive joints.
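A minimal sketch of how these observation vectors could be assembled for the four policy types is given below; the tensor names and the concatenation order are illustrative assumptions, while the actual composition and dimensions are those reported in Table 1.

    import torch

    def build_observation(policy_type, base_twist, projected_gravity, command,
                          q, dq, prev_action, height_map=None,
                          feet_q=None, feet_dq=None):
        # Common proprioceptive components shared by all policy types.
        obs = [base_twist, projected_gravity, command, q, dq, prev_action]
        # Perceptive policies (P, PH) also receive the robot-centric elevation map.
        if policy_type in ("P", "PH"):
            obs.append(height_map.flatten(start_dim=1))
        # Haptic policies (BH, PH) also receive the feet passive joint states.
        if policy_type in ("BH", "PH"):
            obs.extend([feet_q, feet_dq])
        return torch.cat(obs, dim=-1)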
The primary goal of the locomotion tasks is to track a user-provided commanded base twist reference without falling. The command is composed of three components: a linear velocity along the robot sagittal axis, a linear velocity along the robot transversal axis, and an angular velocity around the robot longitudinal axis.
We also applied domain randomization [39], considering additive component-wise scaled random noise, generating randomized reference commands (the angular velocity command is generated starting from a randomized heading direction), and applying random disturbances to the robots through interspersed random pushes on the base.
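The sketch below illustrates these randomizations; the noise scales, command ranges, and push magnitudes are illustrative assumptions, not the values used for training.

    import torch

    def add_observation_noise(obs, noise_scale):
        # Additive component-wise scaled uniform noise on the observations.
        return obs + (2.0 * torch.rand_like(obs) - 1.0) * noise_scale

    def resample_commands(num_envs, device, v_max=1.0):
        # Randomized base twist references; the yaw rate command is later derived
        # from the randomized heading direction.
        vx = torch.empty(num_envs, device=device).uniform_(-v_max, v_max)
        vy = torch.empty(num_envs, device=device).uniform_(-v_max, v_max)
        heading = torch.empty(num_envs, device=device).uniform_(-3.14, 3.14)
        return vx, vy, heading

    def push_robots(base_lin_vel, push_scale=0.5):
        # Interspersed random pushes, modeled as velocity impulses on the base.
        return base_lin_vel + push_scale * (2.0 * torch.rand_like(base_lin_vel) - 1.0)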
The reward function is the sum of different conventional terms, and it aims to make the robot base twist follow the commanded reference one, with the commanded vertical linear velocity and the roll and pitch angular velocities always equal to zero.
The linear velocity term aims at minimizing the velocity tracking errors in the forward and lateral linear directions. The angular velocity term aims at minimizing the yaw angular velocity tracking error around the vertical axis of the robot base frame. The longitudinal linear velocity penalty term keeps the robot base's vertical velocity as small as possible to avoid jumping-like motions. The angular velocity penalty term keeps the robot base's roll and pitch angular velocities as small as possible to prevent the robot from rolling to the front or side. The orientation penalty term penalizes all those robot base orientations that do not make the gravity vector perpendicular to the x–y plane of the robot base frame, by keeping the x and y components of the projected gravity vector as close to zero as possible. The robot base height penalty term keeps the robot base height, with respect to the terrain beneath it, close to a predefined walking height h (0.52 m) to avoid the robot walking with a low base height (this penalty term is particularly helpful when training blind locomotion policies on rough terrains, as shown in the results). The joint torques penalty term aims at minimizing the robot actuated joint torques for energy-efficiency reasons. The joint accelerations penalty term aims at minimizing the robot actuated joint accelerations to avoid abrupt motions of the robot joints. The action rate penalty term smooths the policy output across time steps, providing a certain continuity in the desired joint references regardless of whether they are joint positions or joint velocities. The hip motion penalty term is used for aesthetic reasons: it keeps the Hip Abduction–Adduction joint positions of the left (L), right (R), front (F), and hind (H) legs close to their default values, which are 0.03 radians for the left-hand-side joints and −0.03 radians for the right-hand-side joints, making the robot walk in a natural-looking manner.
In Equations (5)–(12), the weighting coefficients are appropriately tuned constants, and n is the number of robot actuated joints. These reward and penalty terms do not impose any kind of specific walking pattern, gait, or behavior, neither static nor dynamic, since none of them uses any information about feet contact timing. These terms only aim to impose a specific robot base motion, since they only involve robot base velocity, orientation, and height values, as well as joint velocities, accelerations, and torques. The dynamic locomotion gait is a self-emerging behavior during training: according to these terms, it is more convenient to employ a faster dynamic gait rather than a slower static one to track the commanded base twists and obtain a higher reward.
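For illustration, the sketch below shows how reward and penalty terms of this kind are commonly implemented in this family of environments [14,35]; the functional forms (e.g., the exponential tracking kernels), the kernel widths, and the weights w are illustrative placeholders and do not reproduce the exact Equations (5)–(12) of this paper.

    import torch

    def compute_reward(cmd, base_lin_vel, base_ang_vel, projected_gravity,
                       base_height, tau, ddq, action, prev_action,
                       q_haa, q_haa_default, w, h_target=0.52):
        # Tracking terms: exponential kernels on the base twist tracking errors.
        lin_err = torch.sum((cmd[:, :2] - base_lin_vel[:, :2]) ** 2, dim=1)
        ang_err = (cmd[:, 2] - base_ang_vel[:, 2]) ** 2
        r_lin = w["lin_vel"] * torch.exp(-lin_err / 0.25)
        r_ang = w["ang_vel"] * torch.exp(-ang_err / 0.25)
        # Penalty terms on vertical velocity, roll/pitch rates, orientation,
        # base height, torques, accelerations, action rate, and hip motion.
        p = (w["z_vel"] * base_lin_vel[:, 2] ** 2
             + w["rp_vel"] * torch.sum(base_ang_vel[:, :2] ** 2, dim=1)
             + w["orient"] * torch.sum(projected_gravity[:, :2] ** 2, dim=1)
             + w["height"] * (base_height - h_target) ** 2
             + w["torque"] * torch.sum(tau ** 2, dim=1)
             + w["acc"] * torch.sum(ddq ** 2, dim=1)
             + w["rate"] * torch.sum((action - prev_action) ** 2, dim=1)
             + w["hip"] * torch.sum((q_haa - q_haa_default) ** 2, dim=1))
        return r_lin + r_ang - p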
The robots are reset, and their rewards are penalized, if their base or more than one knee comes into contact with the ground. Robots can also be reset if a timeout occurs when reaching the maximum episode length. Timeouts are correctly handled, as in [14], using the partial-episode bootstrapping method [40]. The values of the hyper-parameters of the policy optimization algorithm (PPO), used for the training of the locomotion policies, are the environment defaults as in [35].
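A minimal sketch of the partial-episode bootstrapping idea [40] is given below (variable names are illustrative): when a reset is caused by a timeout rather than by a failure, the discounted critic value of the reached state is added back to the reward, so the timeout is not treated as a true terminal state.

    def bootstrap_timeouts(rewards, values, timeouts, gamma=0.99):
        # Failures keep a zero terminal value; timeouts bootstrap the critic value
        # of the state reached when the maximum episode length is hit.
        return rewards + gamma * values * timeouts.float()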
The policies were all trained in simulation using Isaac Gym [13] in the Omniverse Isaac Sim [34] simulator, running thousands of robots in parallel and exploiting the power of GPU acceleration. The hardware setup for training the control policies employed an NVIDIA Tesla T4 GPU with 16 GB of VRAM.
4. Results
In Figure 7, the reward trends of the different training sessions are shown. The spikes in the trends are due to training restarts from checkpoints. The training was stopped when the reward did not significantly increase for more than a thousand iterations. The number of simultaneously simulated robots during the training process is 2048 for all the trained policies. A policy update iteration consists of 48 policy steps for each of these environments.
The performance indices evaluated are the following: the linear velocity tracking error in the forward and lateral directions; the yaw angular velocity tracking error around the vertical axis of the robot base frame; the desired joint positions rate, which is the difference between the previous policy output and the current one, to evaluate continuity; the actuated joint velocities rate, which is the difference between the previous actuated joint velocities and the current ones, to evaluate abrupt motions; actuated joint torques, to evaluate energy efficiency; and robot base height, to evaluate natural-looking walking style.
After the policies were trained, the performance indices were evaluated by simulating 1024 robots in parallel for a simulation time of 60 s for the flat terrain locomotion task and 17.5 s for the rough terrain policies; the shorter horizon was chosen only to limit the memory and computation time required to compute these performance indices. Both the mean and the standard deviation of the performance indices were computed over all these policy steps, and they are shown in the charts as error bars.
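As an illustration, the sketch below shows how one of these indices (the linear velocity tracking error) can be aggregated over all policy steps and simulated robots; the tensor names and shapes are illustrative assumptions.

    import torch

    def linear_velocity_tracking_error_stats(cmd_log, base_vel_log):
        # cmd_log, base_vel_log: (num_steps, num_robots, 2) forward/lateral velocities.
        err = torch.norm(cmd_log - base_vel_log, dim=-1)
        return err.mean().item(), err.std().item()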
Figure 8 shows a performance evaluation and comparison of the flat terrain locomotion policies. The performance of the baseline policy from [35], employed on the robot equipped with its own default ball feet, is compared with the performance of the same policy employed on the robot equipped with the Adaptive Flat Feet (AFF, Figure 3b). The performance of a third policy, specifically trained on the robot equipped with the Adaptive Flat Feet, is also illustrated. A detailed description of the different policies tested on this task is given in Table 2.
In Figure 9, a performance evaluation of the rough terrain locomotion policies and a comparison with the ball-feet baseline policies from [35] are presented. In this case as well, the baseline policy (perceptive–non-haptic with ball feet, P-ANY) from [35], trained on the robot equipped with its own default ball feet, was tested on the robot equipped with both approximations of the SoftFoot-Q. A detailed description of the different policies tested on this task is given in Table 3.
For the blind locomotion tasks, we noticed a common pattern: the robot learned to walk with a lower base height. As mentioned in Section 3, adding the base height penalty term partially mitigates this problem at the expense of worse base twist command tracking performance.
Since the policies were trained taking into account domain randomization, as discussed in Section 3, Figure 10 shows an example of the disturbance response to random pushes applied to the robot base for the perceptive–non-haptic rough terrain policy running on the robot equipped with the Adaptive Flat Feet (AFF).
Figure 11, Figure 12, Figure 13 and Figure 14 show motion captures of two different trained policies traversing two of the considered rough terrain types, together with the corresponding feet contact schedules. Please refer also to the video attachment to visualize the full experiment. The feet contact schedules in Figure 12 and Figure 14 show that the learned locomotion gait follows a dynamic gait pattern, namely, the trotting gait.