Learning Quadrupedal High-Speed Running on Uneven Terrain

Reinforcement learning (RL)-based controllers have been applied to the high-speed movement of quadruped robots on uneven terrains. The external disturbances increase as the robot moves faster on such terrains, affecting the stability of the robot. Many existing RL-based methods adopt higher control frequencies to respond quickly to the disturbance, which requires a significant computational cost. We propose a control framework that consists of an RL-based control policy updating at a low frequency and a model-based joint controller updating at a high frequency. Unlike previous methods, our policy outputs the control law for each joint, executed by the corresponding high-frequency joint controller to reduce the impact of external disturbances on the robot. We evaluated our method on various simulated terrains with height differences of up to 6 cm. We achieved a running motion of 1.8 m/s in the simulation using the Unitree A1 quadruped. The RL-based control policy updates at 50 Hz with a latency of 20 ms, while the model-based joint controller runs at 1000 Hz. The experimental results show that the proposed framework can overcome the latency caused by low-frequency updates, making it applicable for real-robot deployment.


Introduction 1.Background
The locomotion control of quadrupedal robots has become a research focus in recent years.Quadrupedal robots are believed to be capable of traversing more challenging terrains and performing more agile motions than wheeled robots.However, controlling a quadruped is also more challenging.Traditional model-based control contains algorithms such as model-predictive control [1,2] and whole-body control [3], which require manually decided structures and empirically tuned parameters for the best performance.
Reinforcement learning has been widely applied in locomotion control.Benefiting from the universal approximation ability of the neuron network policy model and the offline optimization of the reinforcement learning algorithm, the control policy optimized by reinforcement learning can capture complex robot dynamics.Reinforcement learning has been employed to control low-speed quadrupedal walking on uneven terrains [4][5][6][7], or high-speed running on flat ground [8,9].RL is also implemented in bipedal locomotion controllers, resulting in a better performance regarding dynamic model uncertainty [10], fast yet robust maneuvers [11,12], and the precise control of stepping point [13] or gait [14].
Most of the existing RL-based control policy gives target joint or foot positions as output, which are tracked by a PD controller with fixed gains [6,[15][16][17][18].These methods introduce a constant impedance to the joints.Such an impedance will surpass the disturbance forces exerted by the ground to the robot.When the robot runs over uneven ground at high speed, the disturbance will severely decrease the stability.In Jin et al. [9], Choi et al. [19], the policy is updated with a high frequency and low latency to respond to such disturbances in a timely manner.In these methods, the control policies applied in agile locomotion control have an update rate of from 100 to 500 Hz and a latency of less than 10 ms.

Motivation
In contrast to legged robots, their animal counterparts cannot respond to external stimulation in such a short time.Nevertheless, they demonstrate a fast locomotion ability on complex natural terrains.When animals move a joint, a group of antagonistic muscles driving that joint will contract simultaneously [20].As the contraction of these muscles will change their force-length relation and force-velocity relation [21], the stiffness and damping factor of the joint will change during motion.Having joints with adjustable impedance has advantages in resisting different types of disturbances.As qualitative examples, a leg following a swinging trajectory with low joint impedance will pass little disturbance force onto the torso if it collides with an obstacle.Supporting legs with high-impedance joints can stabilize the torso when they encounter external pushing forces.
Inspired by the variable impedance joints of animals, we designed a control framework consisting of a state estimator, a reference trajectory generator, an RL-based control policy with a low update rate, and joint controllers running at a high frequency.In this framework, the reference trajectory generator provides target joint trajectories for joint controllers to track.The control policy outputs each joint's control law, including stiffness, damping, and feed-forward torque.The joint controllers then execute that control law so that the joint can follow the reference trajectory and stabilize the torso under external disturbances.

Contributions
Our main contributions are as follows: 1.
We propose a new control framework for quadruped locomotion, which uses a lowfrequency RL-based policy to provide a control law allowing for high-frequency model-based joint controllers to resist disturbances.

2.
We realized high-speed locomotion on challenging terrains with a relatively low computational cost.Our method succeeds in controlling the quadruped to traverse uneven terrains at the speed of 1.8 m/s and frequency of 50 Hz, with a latency of 20 ms.
The rest of this paper is arranged as follows.Section 2 will describe the control framework and the control policy's training progress in detail.Section 3 will provide the results of simulation experiments, including the behavior of the robot on multiple terrains and the evaluation of our method's ability to allow for running on uneven terrains.Section 4 contains conclusions and discussions for future works.

Control Framework
Our control framework consists of a state estimator, a reference trajectory generator, an RL-based control policy running at a low frequency, and joint controllers running at a high frequency.The control policy will be described in detail in Section 2.2.
Figure 1 shows our control framework.The red part is the RL-based control policy with an update rate of 50 Hz.The rest of the control framework and the physics simulation update at 1 kHz.A 20 ms latency is applied to the control policy to simulate the computation time of the policy model.As we employ the adjustable joint control law, the impedance characteristics of joints can be modified to suit different situations during locomotion, such as foot contact or leg swinging.With the control law optimized by the RL algorithm, our framework can achieve a better performance compared with controllers with fixed PD gains.
The reference trajectory of the robot can be used to regulate robot behavior and narrow the exploration space of the RL-based policy.While many prior approaches use dynamics models to generate physically feasible reference trajectories for bipedal locomotion [11,12], using fixed reference trajectories proved to be feasible for quadrupedal locomotion [4] as the control policy is capable of modifying the trajectory to stabilize the robot.In our framework, we use a reference trajectory generator to build trajectory, regardless of the robot's current state.This is because the robot's state changes violently during high-speed locomotion, making it unsuitable for generating a stable reference trajectory.
The reference trajectory generator provides a periodic reference joint trajectory based on velocity command v xy,cmd , target torso height h torso,ref , period length T, stance phase ratio µ, foot raise height h raise , and foot nominal position x foot,0 , y foot,0 .For the three joints[q 3k−2,ref , q 3k−1,ref , q 3k,ref ] of leg k, their reference position can be expressed as a func- tion of phase ϕ k : [q 3k−2,ref , where JTG k is the joint trajectory generator for leg k; phase ϕ k is a periodic function of time t: ϕ k = (t/T + ϕ k,0 ) mod 1. ϕ k,0 represents the phase offset for leg k's motion.In this paper, we used trotting gait as a reference trajectory.The phase offset for FR, FL, RR, and RL leg (k = 1, 2, 3, 4, respectively) was 0, 0.5, 0.5, 0. As the reference trajectory was independent of feedback information, it can be computed offline, reducing the burden of real-time computing.The reference joint velocity was obtained by differentiating the reference joint position.
In this paper, we used manually designed function JTG k .The function generates stance phase joint trajectory by computing inverse kinematics (IK) from the foot reference trajectory, which moves backwards in the torso frame.During the beginning and end of the swinging phase, IK is also used to plan lifting and landing trajectory.The function uses joint space interpolation as a reference trajectory in the middle part of the swinging phase.All parameters of the function JTG k were manually designed, with the purpose of generating a joint trajectory for a command velocity up to 3 m/s while keeping joint velocity under 40 rad/s and foot-raise height under 12 cm.
Figure 2 is a sampled foot trajectory in X-Z plane at the command forward velocity of 2 m/s.This foot trajectory is converted from the joint trajectory generated by the function JTG k .The maximum joint velocity of the joint trajectory is 21.2 rad/s.
A joint controller tracks reference joint trajectory with the stiffness k p,i , damping k d,i , and feed-forward torque τ i,ff parameters given by the control policy described in the next section.For joint controller i, the control law can be expressed as: With simple control laws, the joint controller can run at a high frequency.
We used an extended Kalman filter (EKF)-based state estimator for the quadruped, similar to Agarwal et al. [22].As the orientation estimation noise grows too large during highly dynamic locomotion, we used the orientation integration readout from the Inertial Measurement Unit (IMU) as the estimated orientation of the torso state.Its origin is at the hip joint, while its axis is parallel to the axis of torso frame.Top Right: Reference foot trajectory relative to the ground.Bottom: The scale of reference foot trajectory in the global frame, compared with the robot.

Reinforcement Learning Control Policy 2.2.1. Observation Space and Action Space
The control policy model's input is a 67-dimension vector, including torso state, command, reference joint trajectory, joint tracking error, and two stance phase indicators stance 1 , stance 2 .stance k is the stance phase indicator, which is set to 1 when the reference motion of leg k is in stance phase.Since we used a trotting gait where diagonal feet step synchronously, only two indicators are required as input.A detailed description of these inputs is listed in Table 1.The policy model's output is a 72D vector, which needs further processing to obtain joints' control law parameters.The output vector can be divided into 6 parts; each part contains 12 scalars corresponding to 12 joints.The 6 parts are the scaled stiffness, damping, and feed-forward torque parameters for stance phases ( kp,stance , kd,stance , τ f f ,stance ∈ R 12 ), and that set of parameters for swing phase ( kp,swing , kd,swing , τ f f ,swing ∈ R 12 ), which are limited to the range of [−1, 1].For joint i, belonging to leg k, its control parameters can be computed as: In the equation above, k p,0 , α p , k d,0 , α d , τ 0 are empirically set parameters.In our experiments, the values of these parameters are set as k p,0 = 12 Nm/rad, α p = 1.5, k d,0 = 1 Nms/rad, This set of parameters allows for a large range of stiffness and damping coefficients, covering the values presented in some prior works [5,8,9,19].The large range of PD control laws allow for the robot to exhibit different characteristics under disturbances.
In our control framework, the control policy updates much more slowly than the trajectory generator.As a result, when a leg experiences a phase shift between swing and stance phases, the control policy may not be ready to update.As swinging legs and supporting legs have different dynamics characteristics, the optimal control law should be different.To change the control law at the phase shift without updating the control policy, our policy provides two sets of parameters for each joint, for stance and swing phases, respectively.In addition, we used exponential scaling for stiffness and damping parameters, which allows for the policy model to control these two parameters more precisely when the parameters are small.This design is based on an observation that a low-impedance joint's dynamics characteristics change more than a high-impedance joint when the same amount of modification is applied to their stiffness or damping coefficient.

Policy Training
We used proximal policy optimization [23] for training.Our policy model was implemented by a fully connected neuron network of five layers.The sizes of the four hidden layers are 256, 256, 128, and 128.The output of the last hidden layer was decoded to obtain the mean and variance of the action, and the estimated state value.We used exponential linear units (ELU) [24] as activation functions in the neuron network model; we applied a running mean to the input and inverse running mean to the estimated state-value output to boost training speed.The sampled action was clipped into the range of [−1, 1].The training algorithm was implemented in the rl_games [25] library.

Training Environment and Domain Randomization
We implemented our training environment using a Mujoco dynamics engine [26].We set up an environment with multiple obstacles of 6cm in height and a terrain with a changeable surface.To speed up the process of data collection, we maintained a group of independently running environments and updated them simultaneously with parallel computing.Figure 3 shows one of the training environments.The grey object is the terrain and the red ones are obstacles with a fixed position and shape.We implemented the reference trajectory generator and joint controller described above in the training environment (including the latency of policy updates).They updated together with the physics simulator.A state estimator was not necessary as the state of the robot can be acquired via the simulator's API.As our control policy updates at a low frequency, the physics simulator updates multiple times in one environment update step.Here, we used N sim to denote the number of physics simulator updates in one environment step.For a variable X, we defined X sub(j) to be a value of X at the jth physics simulator update.
We applied domain randomization to increase the robustness of our control policy.We randomize robot dynamics and environment parameters.All randomization was carried out at the beginning of an episode.The randomized parameters are listed in Table 2.
The gravity direction is more heavily randomized around the Y axis (changes more along the X axis), as we hope that the robot will have the ability to run up and down slopes instead of resisting lateral tilts.
The detailed formula of each term is listed in Table 3.The terms are categorised into five groups: task, stability, smoothness, tracking and limit.Task-related reward terms include r lin , r rot , r yaw , which encourage the robot to follow the velocity and yaw command.Stability reward terms include r ori , r h , r torq , which keep the robot from falling or making violent moves.Smoothness reward term r torq penalizes the robot for violent moves.The task, stability and smoothness terms are inspired by the reward settings of Margolis et al. [8].
The tracking term r jpos encourages the joints to follow the reference trajectory, similar to the reward term in the work of Xie et al. [17].jerk and vjerk are limit terms, which penalize the joints when moving to infeasible positions or moving faster than their mechanical limit during high-speed motion.Parameter q i,max , q i,min , qi,max poses a soft limit on joint position and velocity.r jerk is set to 1 if any of the joints reach the soft position limit; otherwise, it is set to 0. Similarly, r vjerk is set to 1 if any of the joints reaches the soft velocity limit.The episode terminates if the robot touches the ground with parts other than foot or shank.0.5 e −(h torso,ref −h torso,real ) 2 /0.05 r jpos 0.5 e − ∑ i (q i,ref −q i,real ) 2 /0.05 r torq −0.001 j=1 1 q i,sub(j) > q i,max or q i,sub(j) < q i,min > 0 r vjerk −0.5

Continuous Adaptive Curriculum Learning
Curriculum learning has been shown to be able to improve the performance of the policy, including the success rate when traversing uneven terrains [4] and the maximum speed of the robot [8].
We present a new adaptive curriculum, which gradually tunes two parameters: the terrain height and forward speed command.During training, we used three types of terrain whose difficulty is relative to the terrain's height.The three types of terrain are shown in Figure 4. Different from some other methods [8], our curriculum changes each environment's difficulty independently, without the need to compute the overall difficulty distribution.Similar to Lee et al. [4], we designed a score p j to evaluate the policy's performance in the jth environment.When an episode ends, p j is determined by the moving average of the robot's forward velocity v and the length of the episode: For the jth environment, we used v max,j and h max,j to denote its maximum difficulty.Its terrain height h j and forward velocity v cmd,j are updated according to the following curriculum at the end of an episode: The maximum difficulty parameters were assigned to each environment at the beginning of the training.During training, these parameters limit the distribution of difficulty.This limitation keeps a portion of easy environments, which prevents the policy from overfitting with the hard evironments while behaving poorly in easy ones.The way we set the maximum difficulty parameters is shown in Algorithm 1.

Algorithm 1: Assigning maximum environment difficulty
Data: global maximum forward velocity command v max , global maximum terrain height h max , number of simulation environments N env , proportion of environments to be limited σ Result: maximum forward velocity command of each environment v max,1 , . . ., v max,N env , maximum terrain height for each environment h max,1 , . . ., h max,N env for j ← 1 to N env do // Initialize all environments' maximum difficulty with global maximum difficulty Compute number of environments with limited velocity (or with limited height)

Training
We trained our policy on a desktop PC with 32-core Intel Xeon Gold 6234 CPU and Nvidia GeForce RTX 2070 GPU.A total of 1024 simulations were computed in parallel by the CPU, with 63 threads for data generation.The training took 9.5 h, corresponding to 58 days of simulation.The distribution of curriculum difficulty was examined at the end of the training progress.During the training, 64% of the environments had limited difficulty.As shown in Figure 5, most of the environments reached a relatively high level of difficulty, yet a small portion of them remained easy due to the limitations.

Results of High-Speed Locomotion
We evaluated our control framework in Webots [27] 2021a simulation.In the experiment, we built a series of obstacles and terrains that can be found in the real world, including stairs (6 cm per step, 30 cm width), obstacles (6 cm height, 10 cm width), lateral stairs (6 cm height difference), and an uneven terrain with random height (maximum height 6 cm, horizontal sampling interval 4 cm).We also designed some dynamic terrains, including two seesaws that can rotate when stepped on, and a floating blocks that will sink when stepped on (each block weighs 2 kg, connected to the fixed environment by a vertical slider joint with a stiffness of 2000 N/m and damping of 50 Ns/m).These terrains are lined up in a row so that the robot can run past them in a single experiment.
We use a simulated Unitree A1 robot [28] to conduct all the experiments, as shown in Figure 6.Video S1 contains a full record of running on different terrains.The Unitree A1 robot is a 12 kg small quadrupedal robot with a standing height of 30 cm.During simulation, we limited the torque output of each joint to match the capacity of the real robot (20 Nm for hip joints; 55 Nm for others).However, the noise and latency of the sensors and actuators were not modeled.During all experiments in this paper, the update rate of the control policy was set to 50 Hz, while the other parts of the control framework updated at 1000 Hz.The simulation timestep was 1 ms.The parameters for the reference trajectory were set as follows: h torso,ref = 0.3 m, T = 0.25 s, µ = 0.3, h foot = 12 cm, x foot,0 = 0 m, y foot,0 = 0.08 m.The lateral velocity and rotational velocity command were set to zero.The command forward velocity was 2.2 m/s when the robot traversed the environment.With a policy delay of 20 ms, the robot can achieve a speed of over 1.8 m/s when running across different terrains while only using proprioceptive sensors.The forward velocity during the experiment is shown in Figure 7.Although the robot encountered ground disturbance that drastically slowed it down, it maintained stability and recovered to normal speed in a short time.During the experiment, the robot experienced slipping foot, blocked swinging foot, and delayed contact.The policy succeeded in stabilizing the robot in these situations, as shown in Figures 8-10. .The front-left foot makes contact with the ground after more than half of the stance phase (at T = 0.02 s), then leaves the ground (at T = 0.04 s).This stance phase is less than 0.02 s (27% of planned stance phase).Video S3 records the complete motion of the delayed contact.

Locomotion Performance on Uneven Terrain
We evaluated our method's performance when running on an uneven terrain with random height.The result is compared with a baseline, which is a policy with PD parameters fixed to 28 Nm/rad and 0.7 Nms/rad, respectively.During this experiment, we trained two versions of the policy for each method.The first version lacks a policy update latency and the second one is trained and evaluated with 20 ms of latency.
Both methods have a high success rate when the commanded velocity is lower than 2.4 m/s and the terrain height is less than 0.08 m.However, as the velocity or terrain height increases, our method outperforms the baseline, as shown in Figure 11.Also, our method is less affected by control latency.When running on a 6 cm high uneven terrain, both methods have a similar performance in terms of velocity-tracking when the command velocity is lower than 2.5 m/s.However, as the command velocity increases, our method can track the command with less variance and fewer tracking errors, as shown in Figure 12.The baseline experienced a degradation in performance when there was a latency of 20 ms, while our method was less affected.
The better performance obtained by our framework under the presence of latency may be due to the utilization of the joint controllers running at a high frequency.Since the joint controllers are not affected by the latency, they can react to disturbances in time.With a properly set controller law, these controllers can keep the robot from falling until the policy can respond to the disturbance.

Conclusions and Discussion
We presented a control framework containing an RL-based control policy with a low update rate and model-based joint controllers with a high update rate.This control framework can achieve high-speed locomotion on uneven terrains.With a latency in policy updates, our control policy stabilized the robot following various disturbances by modifying joint control laws.By providing parameters for a control law, a high-level controller can define the behavior of joints and how they respond to disturbances, as long as the control law is simple enough to update at a high frequency.Since our framework can withstand latency and does not require in-time updates, it can be deployed on robot controllers with an inferior computational capacity.
Our method has the following points to be improved.

1.
We only used proprioceptive sensors, which could be combined with exteroception in the future for robots running over high obstacles, as was conducted in Miki et al. [18] and Zhuang et al. [29].

2.
The manually designed reference trajectory could be further optimized for higher speed.

3.
The linear feedback control laws could be replaced by non-linear control laws in future works.

4.
The stability of the controller under sensor noise is not analysed.The stability of a legged controller can be investigated via Poincare Map function [30], or by entropybased analysis [9].

5.
Our controller's robustness under sensor or actuator faults could be investigated in future works.The sensor fault could be detected [31] and the RL-based policy is proven to be able to adapt to actuator failures [8].6.
The range of joint controller stiffnesses and damping parameters was empirically set.How the range of these parameters affects the performance of our controller may be investigated in future works.

Figure 1 .
Figure 1.The proposed control framework.The input of the control policy is sampled at 50 Hz.

Figure 2 .
Figure 2. Top Left: Reference foot trajectory in hip frame.Hip frame is a frame fixed to torso frame.Its origin is at the hip joint, while its axis is parallel to the axis of torso frame.Top Right: Reference foot trajectory relative to the ground.Bottom: The scale of reference foot trajectory in the global frame, compared with the robot.

Figure 4 .
Figure 4. Three types of terrain in training environment: left: stairs; middle: lateral stairs; right: uneven terrain with random height.The obstacles are not displayed.

Figure 5 .
Figure 5. Environments' difficulty distribution.Left: Maximum difficulty distribution.The large dot in the top-right corner indicates all unlimited environments.Right: The distribution of environment difficulty at the end of the training.Those who reached at least one of the limits are colored red; the others are colored blue.

Figure 6 .
Figure 6.Robot running on different types of terrain with only proprioceptive sensor inputs.

Figure 7 .
Figure 7. Robot forward velocity during the experiment.

Figure 8 .Figure 9 .Figure 10
Figure 8.The rear-right leg slips on slippery ground (with the friction coefficient of 0.15).Its motion is indicated by the red arrow.Multiple snapshots of the motion in 0.05 s are stacked together to visualize the slipping.The duration of the slipping is about 66% of the stance phase.

Figure 11 .
Figure 11.Success rate when running on an uneven terrain with random height.

Figure 12 .
Figure 12.Achieved velocity against command velocity when running on uneven terrain with a height of 0.06 m.

Table 1 .
List of control policy inputs.

Table 2 .
Domain randomization applied to the environments.= max w lin r lin + w rot r rot + w yaw r yaw + w ori r ori + w h r h +w jpos r jpos + w torq r torq + w jerk r jerk + w vjerk r vjerk , 0 r

Table 3 .
Elements of reward function.