Adaptive Motion Skill Learning of Quadruped Robot on Slopes Based on Augmented Random Search Algorithm

: To deal with the problem of stable walking of quadruped robots on slopes, a gait planning algorithm framework for quadruped robots facing unknown slopes is proposed. We estimated the terrain slope by the attitude information measured by the inertial measurement unit (IMU) without relying on the robot vision. The crawl gait was adopted, and the center of gravity trajectory planning was carried out based on the stability criterion zero-moment point (ZMP). Then, the augmented random search (ARS) algorithm was used to modulate the parameters of the Bezier curve to realize the planning of the robot foot trajectory. Additionally, the robot can adjust the posture in real time to follow the desired joint angle, which realizes the adaptive adjustment of the robot’s posture during the slope movement. Simulation experiment results show that the proposed algorithm for slope gait planning can adaptively adjust the robot’s attitude and stably pass through the slope environment when the slope is unknown. for robots. ZMP refers to a point on the ground, which the component of the resultant moment of inertial force and gravity acting on the robot in the horizontal direction of this point is 0. The research shows that, when the robot moves, if the ZMP is located in the support polygon formed by the robot’s sole


Introduction
The quadruped robot greatly expands the motion range of the robot thanks to the design of the robot's bionic legs. For challenging terrains that wheeled and tracked robots cannot pass, quadruped robots have the potential to pass through. At present, it has broad development prospects and important research value in various fields, such as education, medical care, service, industry, and military [1]. However, compared with quadrupeds mammals in nature, there is still a big gap in the ability of quadruped robots to move and adjust in complex environments.
To make the quadruped robot move stably on challenging terrain, the stability of the robot is generally improved by selecting a static gait. Professor McGhee's team [2] firstly described the static gait of the quadruped robot and realized the stable walking of the quadruped robot's static gait. However, McGhee's study only considers the quadruped robot on a flat road, and cannot be applied to challenging terrain. Ref. [3] planned the static gait of the quadruped robot, which can be used for the robot to walk on irregular terrain. Ref. [4] designed a quadruped robot motion controller with layered control and realized the quadruped robot's movement on challenging terrain based on the selection of the best foothold. However, the controller needs to model the environment and has poor real-time performance. In addition, it is difficult to obtain the optimal strategy because there are a large number of parameters to be designed in the control system. Inspired by biology, the control method based on a central pattern generator (CPG) [5] helps to improve the stability and environmental adaptability of robot motion. J.C. Montesdeoca et al. [6] used the CPG model to develop a controller capable of generating omnidirectional motion and successfully applied it to a physical robot; Ref. [7] designed a quadruped robot foot trajectory based on CPG and then used the kinematics and dynamics model to realize the robot's movement on challenging terrain. However, the feedback is simply superimposed to the trajectories generated by the CPG, and thus the stability properties of CPGs are not exploited.
Gehring et al. [8] modeled the terrain as a plane and estimated the terrain information, and then adjusted the robot's strategy. Additionally, they successfully achieved a dynamic trot on the slope. We adopt an overall architecture similar to that of [8]; however, in this paper, we used a reinforcement learning algorithm to train a parameterized linear policy to reduce computation.
In recent years, with the development of artificial intelligence technology and the increase in computing resources, some intelligent algorithms, such as reinforcement learning and evolutionary strategy, have become more and more popular. Nate Kohl et al. [9] first applied reinforcement learning to the field of quadruped robot motion, using a policy gradient algorithm to optimize robot foot trajectory parameters. The most classic one is policies modulating trajectory generators [10], which combines open-loop leg trajectories with neural network feedback, and the trained robot can walk stably; however, these deepneural-network-based policies are computationally intensive and were unable to generalize to rough terrain walking. Ref. [11] incorporated proprioceptive feedback in locomotion control and trained the controller using reinforcement learning. The trained ANYmal robot can traverse various difficult terrains without explicit perception. However, the 12-layer deep neural network and the huge amount of data require expensive equipment to realize, and this work is almost non-transferable.
To deal with the problems existing in the above research, a typical environment of the slope with an unknown slope is selected, and a motion planning algorithm for quadruped robot is proposed. First, we initialized the simulation environment and used the joint angle data obtained from the robot leg motor encoder to estimate the slope of the terrain. Next, we planned the trajectory of the center of gravity to obtain the position of the next foothold, and then used the Bezier curve to generate the trajectory of the robot foot. The ARS algorithm receives data from sensors to guide the robot's motion, allowing the robot to adaptively adjust its attitude to follow changes in the desired attitude angle. The algorithm frame is shown in Figure 1. The specific implementation of the algorithm will be described in detail in the following sections. The main contributions of this paper can be summarized as follows: (1) An algorithm for estimating unknown terrain slope based on the coordinates of the foot endpoint is proposed; (2) We parameterized the foot end trajectory of the quadruped robot based on the Bezier curve to improve the training speed of the robot; (3) The ARS algorithm was used to optimize the trajectory of the foot end, and the adaptive adjustment of the robot posture is realized.
The algorithm is verified on the simulation platform, and the robot can follow the desired roll and pitch angles to achieve adaptive attitude adjustment and pass through the slope environment when the slope is unknown.
The remainder of this paper contains the following contents. Section 2 introduces robot model and terrain slope estimation algorithm. Section 3 introduces the whole control framework, including the trajectory planning of the center of gravity, the trajectory planning of the foot end, and the ARS algorithm. In Section 4, we verified the algorithm on the simulation platform and analyzed the simulation experiment results. Finally, Section 5 summarizes all the work and introduces future work.

Model of a Robot on a Slope
The terrain slope is one of the important parameters to characterize the terrain information. When facing the sloped terrain, the body of the quadruped robot needs to maintain a certain posture to adapt to the sloped terrain. Generally speaking, it can be divided into a posture parallel to the ground plane or a posture parallel to the inclined plane. When the body posture is parallel to the inclined plane, the motion trend of the front and rear legs of the robot is similar, and the leg movement space of the robot leaves a margin. Therefore, this paper adopts a posture parallel to the inclined plane to move on the inclined plane.
A diagram of the quadruped robot moving on the slope is shown in Figure 2. The four legs of the robot are defined as Left Fore (LF), Right Fore (RF), Left Hide (LH), and Right Hide (RH). A single-leg coordinate system is established at the hip joint, respectively, {LFH}, {RFH}, {LHH}, and {RHH}. The world reference coordinate system, {W}, is established with the projection of the robot's initial centroid coordinate position on the horizontal ground as the origin. The Z-axis is in the opposite direction of gravity, the X-axis points to the horizontal advance direction of the robot at the initial moment, and the Y-axis is determined by the right-hand rule. The position of the center of mass of the robot torso is defined as the body center of mass coordinate system, {B}, and the coordinate axis is parallel to the world reference coordinate system, {W}.

Terrain Slope Estimation
In this paper, the coordinate position of the foot end of the robot is used to estimate the terrain slope, as shown in Figure 3. Each leg of a quadruped robot has three degrees of freedom, and each degree of freedom corresponds to a motor. By reading the motor data at the robot joints, and using the robot's forward kinematics equation, we can obtain the coordinates of the robot's foot in the body center of mass coordinate system {B}: where p i is the position of the foot i, where i ∈ {LF, LH, RF, RH}, in the body center of mass coordinate system {B}, and θ r i , θ k i , θ p i are the angles of the abduction, hip, and knee actuators, respectively, of the leg, i. Additionally, f (· · · ) is the forward kinematics function, that is, given the angle of each joint of the robot, it calculates the coordinates of the foot end of the robot. After obtaining the coordinates of the foot end, the terrain slope can be estimated. According to the coordinates of the robot foot point, the slope normal vector, → n , can be obtained: Furthermore, we can obtain the desired pitch angle θ P and roll angle θ R of the robot.
is the unit normal vector of the slope, and R Y and R X are the rotation matrix.

Zero-Moment Point
In 1972, Vukobratovic first proposed the zero-moment point (ZMP) theory [12] as a dynamic stability criterion for robots. ZMP refers to a point on the ground, which the component of the resultant moment of inertial force and gravity acting on the robot in the horizontal direction of this point is 0. The research shows that, when the robot moves, if the ZMP is located in the support polygon formed by the robot's sole and the ground, then the stability of the robot's movement process will be guaranteed.
When the robot moves on the slope, the projection of the footfall point of the inclined plane on the horizontal plane will be reduced, so the area of the robot's stable support triangle will be reduced. As shown in Figure 4. Additionally, the projected position of the robot's center of gravity may fall outside the stable support triangle. Therefore, the quadruped robot should adjust the foot position to change the robot's posture to achieve stable walking on the slope. In this paper, the swing of the left front leg is analyzed, and the landing point of the robot on the inclined plane is projected to the world reference coordinate system, {W}. The triangle formed by the solid line is the stable support domain on the right, and the vertex of the stable support triangle is s, RH and RF. The shortest distance from ZMP to the three sides of the supporting triangle is defined as the stability margin. When the ZMP is located inside the supporting triangle, the stability margin is positive, the robot is in a stable walking state. The stability has the most maximum margin when the ZMP coincides with the center of the incircle of the supporting triangle. Assume that the projected coordinates of the three landing points of the robot in the world reference coordinate system, {W}, are W LH (x LH , y LH ), W RH (x RH , y RH ), and W RF (x RF , y RF ). The projected coordinates of the expected landing point of the LF leg in the world reference coordinate system, {W}, are W LF (x LF , y LF ). The diagram of the landing point of the quadruped robot in the world reference coordinate system, {W}, is shown in Figure 5.
First, we used the four endpoints to solve the equations of the straight line L LH−RF and the straight line L LF−RH c, and then we obtained the coordinates W s (x s , y s ) of the intersection of the two straight lines.
Therefore, according to the coordinates of the three vertices of the triangle ∆W s W RH W RF , the lengths of the three sides can be obtained as: Furthermore, the center of the incircle of the supporting triangle can be obtained as:

Center of Gravity Trajectory Generator
Assume that the coordinates of the center of gravity of the robot are (x C , y C , z C ). According to Equation (11), the desired coordinate position of ZMP can be obtained as (x I , y I ). According to the "cart-table" model proposed in [13], the equation between the barycentric coordinates and the ZMP coordinates can be obtained as follows: .. ..
where g is gravity acceleration and ..
x C and .. y C are the acceleration of the robot's center of mass along the X-axis and Y-axis, respectively.
The differential equations are solved as follows: where C 1 , C 2 , C 3 , and C 4 are constant coefficients. It is known that the position of the center of gravity in the quadruped support stage is the initial position, and to ensure that the trajectory of the center of gravity is continuous and the speed of the center of gravity has no sudden change, in the foot swing stage, the robot's torso remains stationary, and its speed and acceleration are both zero. Therefore, when we know the coordinates of the ZMP and the coordinates of the center of gravity at the initial moment, we can obtain the position of the center of gravity at the next moment. Then, according to the matrix transformation, the position of the coordinates of the desired landing point can be obtained.

Gait Planner
For quadruped robots, there are six non-singular static gaits [2]. This paper designs a crawl gait. Crawl gait has good stability when walking at low speed and has high adaptability to challenging terrain. The crawling gait is supported by at least three legs at any time. The leg swing sequence of the robot designed in this paper is: LF → LH → RF → RH. Considering that when the swinging leg of the robot switches from the left (right) side to the right (left) side, the stable support triangle of the robot switches from right to left. In this paper, the quadruped support phase is added to the swing cycle of the robot to ensure a smooth transition. We divided one motion cycle of a quadruped robot into six stages: Support Stage → LF swing → LH swing → Support Stage → RF swing → RH swing. The sequence diagram of the quadruped robot crawling gait is shown in Figure 6. Take the left swing leg exercise as an example, in which the right leg is always in a supported state. When the LF leg swings, the LH leg and the right leg form a supporting triangle. Additionally, when the LH leg swings, the LF leg and the right leg form a supporting triangle. The two supporting triangles have overlapping parts, which constitute the right stable support domain (the red triangle in Figure 7). The robot can maintain a stable motion state when the ZMP of the robot is in this region.

Foot Trajectory Planner
Foot trajectory planning is an important part of quadruped robot motion control. Refs. [14][15][16] used the CPG, interpolation curve, and sinusoidal parameterized gait control strategy, respectively, and successfully applied them to the quadruped robot. In this paper, the Bezier curve is used to plan a semi-elliptical trajectory to realize the parametric control of the foot trajectory of the quadruped robot. The semi-elliptical trajectory is shown in Figure 8. The Bezier curves, which are related to Bernstein polynomials, are named after the French engineer Pierre Bézier, who used it in the 1960s for designing curves for the bodywork of Renault cars. We chose to parameterize the end-foot trajectory of each foot by a Bezier curve. Bezier Curves have attractive properties over other alternatives, such as quintic splines, because they do not have self-intersecting curves, and are guaranteed to lie within the convex hull of the control points. This ensures that the trajectories are always in the workspace of the end-effector [17,18].
The shape of the Bezier curve is determined by the control points. A Bezier equation of degree n in the plane requires n + 1 control points that denoted by P i (i = 0, 1, · · · n). The curve starts at point P 0 and goes to point P n . Generally, it does not pass through the middle point, and the middle point only plays a guiding role. The higher-order Bezier curve equation is: (15) where t ∈ [0, 1] and B(t) are the Bezier curve polynomials. The crawling gait in this paper has a total of six stages and can be divided into two states: the stance state and the swing state. When the robot leg is swinging, the trajectory generator takes the robot's current footfall point and expected footfall point as input, and uses the Bezier curve to plan the swinging leg to generate a semi-elliptical trajectory, as shown in the following formula: where S tg is the trajectory generator equation. Our Bezier curves use 9 control points (see Table 1). p k (x, y) is the control point, which is controlled by the robot's current landing point, x p , y p , and the desired landing point, (x d , y d ). Given different landing points, different foot trajectories can be generated. After the calculation of inverse kinematics, the motor commands required by the 12 motors of the robot are obtained, to control the walking of the quadruped robot on the slope.

Training Algorithm
The stable walking problem of the quadruped robot on slopes in this paper can be regarded as a partially observable Markov decision problem (MDP). MDP is described for each time step t with a tuple, {s, a, p, r, y}, where s is the current robot state, a is the action applied, p is the transition probability function from the current state, s t , to the next state s t+1 , r is a reward value obtained due to the transition, and y ⊂ (0, 1) is a discount factor for the long-term reward [19].
The goal of reinforcement learning is to find an optimal strategy to maximize the cumulative reward value, as shown in Equation (17). In this paper, we train a linear policy a t = π(θ, o t ) to enable the robot to receive information in the environment, so that it can output the optimal action to maximize the cumulative reward.
where E is the expect reward, and T is the total timesteps. In this paper, the augmented random search (ARS) [20] algorithm is chosen to find a linear policy. ARS is a model-free reinforcement learning algorithm, which adopts a gradient-free policy gradient method, which greatly reduces the complexity of the algorithm and can achieve the training effect of other model-free reinforcement learning algorithms.
The ARS algorithm abandons the differential-based policy gradient method and uses the finite difference method to approximate the gradient from the sampling of multiple directional derivatives, so there is no differentiation process. We used Version V-1t of ARS from [20]. For a given strategy, the policy is perturbed both along the positive and negative directions, where α is the step size, and σ R is the standard deviation, and υ < 1 is a constant. δ (k) is a sample taken from a standard normal distribution. b is a constant which is the best b directions corresponding to the maximum returns. The reward function designed in this paper is used to guide the robot to adapt to the terrain slope and move in the forward direction. We obtain the following: where r is the cumulative reward, and x is the horizontal distance that the robot's center of gravity moves in the direction of motion. ∆ r P , ∆ r R , and ∆ r Y are the actual pitch angle, roll angle, and yaw angle of the quadruped robot during the movement process, respectively, which we can obtain this data from the IMU; ∆ P , ∆ R are the desired pitch and roll, respectively. The units of ∆ r P , ∆ r R , ∆ r Y , ∆ P , and ∆ R are the angles.

Simulation Environment
As shown in Figure 9, the simulation model of the robot is shown. Each leg of the robot has three degrees of freedom, namely the pitching hip joint, the pitching knee joint, and the rolling hip joint. The robot joints are controlled by high-performance motors, and the total instantaneous maximum power that can be reached is 18 kw. Laikago's torso is 0.55 m long and 0.35 m wide. When standing naturally, the height is 0.6 m, and the weight of the whole machine is about 24 kg. An inertial measurement unit (IMU) is installed on the trunk of the machine to detect the attitude of the robot and the speed and acceleration information of the trunk. A force sensor is installed at the foot-end of the robot to sense information [21]. Figure 10 shows the slope built on Pybullet [22] platform. In this simulation experiment, the algorithm is tested on 0 • slope (plane), 5 • slope, and 11 • slope, respectively.

Results and Discussion
To verify that parameterized foot trajectories can speed up robot learning, we compared the algorithm proposed in this paper with the reinforcement learning algorithm deep deterministic policy gradient (DDPG). DDPG is a class of model-free reinforcement learning algorithms that perform well in control tasks with continuous action spaces. Our algorithm uses the ARS algorithm to optimize foot trajectories parameterized by Bezier curves, while DDPG is trained from scratch.
We trained the robot for 400 epochs, and the maximum timestep of each epoch was 2000 steps. Figure 11 shows the average reward value of each round of the robot in the simulation environment. According to Figure 11a, our algorithm can converge in about 150 epochs, while the DDPG algorithm is still in the rising stage, and the policy is very unstable. Next, we trained the DDPG algorithm for 1000 epochs, as shown in Figure 11b, and found that the algorithm stabilized around 800 epochs. From the training results, it can be seen that the reinforcement learning algorithm trained from scratch needs to explore the environment at the beginning of training, and can only train the robot after accumulating data. However, our use of Bezier curves to parameterize the robot foot trajectory is to integrate the prior knowledge into the foot trajectory generator, which reduces the complexity of the policy learning problem and greatly reduces the training time. Figure 12 shows the changes of body pitch angle and roll angle in the walking simulation experiment of the robot facing a 0 • slope. The green line is the real slope of the terrain, and the orange line is the estimated value of the slope by the terrain estimation algorithm. Additionally, the blue line is the change of the pitch angle and roll angle of the robot obtained from the robot IMU. It can be seen from Figure 12 that the terrain estimation algorithm proposed in this paper can effectively estimate the slope, and the pitch angle and the roll angle of the robot can fluctuate within a certain range through algorithm learning, to ensure the stability of the robot in the process of moving. Table 2 shows the data analysis of the terrain estimation. The standard error between the estimated pitch angle and the true value is 0.000013, and the maximum deviation is 0.007399; additionally, the standard error between the estimated roll angle and the true value is 0.000108, and the maximum deviation is 0.007371. The effectiveness of the terrain estimation algorithm is fully illustrated.  Table 3 shows the data analysis of real pitch angle and roll angle during robot movement. The average swing angle of the robot is less than −0.003629 rad (<1 • ), and the max swing angle is less than 0.007026 rad (<2 • ). The robot can walk stably, which shows the effectiveness of the trajectory generator.  Figure 13 shows the horizontal distance that the robot's center of mass travels along the robot's motion direction in the 10 s time in the walking simulation experiment of the robot facing a 0 • slope. From this, it can be seen that the motion trajectory of the robot is continuous and moves forward.  Figure 14 shows a screenshot of the robot moving on the slope. According to the gait planning algorithm framework of the quadruped robot proposed in this paper, the robot can effectively predict the slope and adjust the robot's attitude to realize stable walking on the slope. Figures 15 and 16 show the changes of body pitch angle and roll angle during the walking simulation experiment of the robot facing a 5 • and 11 • slope, respectively. The green line is the true slope of the terrain, the orange line is the estimated value of the slope by the terrain slope estimation algorithm, and the blue line is the change of the pitch angle and roll angle of the robot obtained from the IMU.
According to Figures 15a and 16a, the curve is divided into three sections. The first section on the left shows that the robot is standing on the slope; the second segment indicates that the robot adaptively adjusts the attitude to follow the change of the estimated value according to the terrain slope estimation algorithm and ARS algorithm; in the third section, after a period of motion, the robot successfully walked to the platform at the top of the slope, which fully shows the feasibility and effectiveness of this research algorithm.    Figure 17 shows the horizontal distance traveled by the robot's center of mass along the robot's motion direction in the 10 s time in the simulation experiment of the robot facing the 5 • slope and the 11 • slope. Although, the increase in slope has a certain impact on the movement speed of the robot. However, the trajectory of the robot is continuous and moving forward.

Conclusions
In this paper, an adaptive motion skill learning of the quadruped robot on slopes was proposed to realize the stable walking of a quadruped robot on 5 • and 11 • slopes. First of all, a terrain slope estimation algorithm was proposed, in which the slope was estimated according to the robot foothold, to obtain the desired body attitude angle of the robot. Then, the augmented random search (ARS) algorithm was used for training a motion strategy of the quadruped robot so that the quadruped robot can adaptively adjust its attitude to follow the desired body attitude angle. In addition, the Bezier curve was used to parameterize the robot trajectory, which greatly sped up the learning speed. The feasibility and effectiveness of the algorithm framework were verified in the simulation environment. The robot can adaptively adjust its attitude to realize its motion on the slope and maintain good stability. The sim-to-real transition is very interesting and challenging work-to further validate our algorithm, we plan to deploy physical experiments in future work.