A Stability Training Method of Legged Robots Based on Training Platforms and Reinforcement Learning with Its Simulation and Experiment

This paper continues the proposed idea of stability training for legged robots with any number of legs and any size on a motion platform and introduces the concept of a learning-based controller, the global self-stabilizer, to obtain a self-stabilization capability in robots. The overall structure of the global self-stabilizer is divided into three modules: action selection, adjustment calculation and joint motion mapping, with corresponding learning algorithms proposed for each module. Taking the human-sized biped robot, GoRoBoT-II, as an example, simulations and experiments in three kinds of motions were performed to validate the feasibility of the proposed idea. A well-designed training platform was used to perform composite random amplitude-limited disturbances, such as the sagittal and lateral tilt perturbations (±25°) and impact perturbations (0.47 times the robot gravity). The results show that the proposed global self-stabilizer converges after training and can dynamically combine actions according to the system state. Compared with the controllers used to generate the training data, the trained global self-stabilizer increases the success rate of stability verification simulations and experiments by more than 20% and 15%, respectively.


Introduction
Compared with fix-based industrial robots, mobile robots have a wider application prospect because of their mobility and operational capacity. In particular, legged robots have a similar mechanism to animals and thus have a stronger adaptability to complex terrains than robots with other movements, such as wheeled and tracked robots. However, the practicality of legged robots is still lower than that of wheeled and tracked robots due to the difficulties in balance control and perturbation recovery.
The early studies mainly focused on the balance control of walking motions. Based on the Zero Moment Point (ZMP) force reflection control proposed by Vukobratovic [1], a variety of balance control methods such as body posture control [2], ZMP damping control [3] and landing point adjustment control [4] were proposed and deployed successively on ASIMO, Petman and other robots. Since then, researchers have started to consider the influence of perturbations and proposed corresponding control methods according to different types of perturbations. Successful results have been obtained for tilt ground [5][6][7], uneven ground [8], external force impact [9][10][11] and other perturbations.
The above balance controllers generate a planned response for a specific perturbation and then calculate the control outputs that enable the robot to track a determined trajectory by solving the dynamical model (or simplified model) of that robot. Thus, these controllers can be collectively defined as model-based balance controllers. Such controllers have achieved many successful results in structured environments such as laboratories, but their application is limited in unstructured, complex environments where the robot may be subject to multiple, mutually compounding, and unpredictable perturbations.
Consequently, more and more studies have paid attention to obtaining the selfstabilization capability of legged robots by using learning-based methods which are able to obtain the optimal mapping from the system state to the joint adjustment. The relevant literature on learning-based balance control methods is summarized in Table 1.  [14] Fuzzy reinforcement learning Two 2-d continuous space 1-d continuous space None Joao et al. [15] SVM + FNN 2-d continuous space 1-d continuous space None Li et al. [16] Fuzzy control + optimal control Two 3-d continuous space 3-d continuous space None Hwang et al. [17] Q-learning 82 discrete states 24 discrete actions Seesaw Hengst et al. [18] Q-learning 4-d continuous space 9 discrete actions None Hwang et al. [19,20] [22] Approximate Q-learning 66 discrete states 8 discrete actions None Li et al. [23] PPO 20-d continuous space 10-d continuous space Load of 15% total mass 1 short for 6-dimensional.
As shown in Table 1, some studies did not consider any perturbation, and the rest of them applied one kind of perturbation-generally a specific perturbation in a single direction. Moreover, the dimensions of the state/action space defined in existing studies are relatively low, which means that the learning process is carried out locally in the whole state space. In addition, the scope of state migration is relatively small when the applied perturbation is simple. Therefore, even though successful results can be obtained in the laboratory, state confusion is prone to occur in local state spaces determined by only partial state variables when the existing controllers face complex perturbations in reality, which leads to the failure in maintaining balance. In addition, the learning algorithms are applied without considering the curse of dimensionality when the number of state/action variables increases.
To address the above problems, the authors in [24] have proposed the idea of robot stability training-that is, to simulate composite perturbations by the random amplitudelimited motion of a six-degrees-of-freedom (DOF) training platform on which the robot is trained, and to obtain the self-stabilization capability by reinforcement learning with feature selection. A stability training simulation [25] of a bipedal robot was performed under randomly varying ground tilt perturbation, which preliminarily verified the feasibility of this idea. Relevant studies in medicine and biology also corroborate the practicability of stability training-for example, studies on movement disorder syndrome [26], stroke rehabilitation [27] and mice anatomy [28] have shown that training an organism with a moving platform can enhance or rebuild its balance.
In order to distinguish from the balance controllers learned under a single disturbance, the robot self-stabilizer trained on a 6-DOF motion platform called the global self-stabilizer, where "global" means that the training process has traversed all different kinds of environmental disturbances through the random amplitude-limited motion of the training platform. A robot self-stabilizer trained under such conditions can obtain robustness to any environmental perturbation, and after sufficient training, it can make the robot stable under any perturbation within its driving capability. Figure 1 compares the differences between the general balance controller of a legged robot and the global self-stabilizer in this study. robustness to any environmental perturbation, and after sufficient training, it can make the robot stable under any perturbation within its driving capability. Figure 1 compares the differences between the general balance controller of a legged robot and the global self-stabilizer in this study.  Figure 1a shows that a general robot balance controller uses a specific balance control law for different perturbations; the balance control of the robot is coupled with a specific motion, which means it is not universal.
As in Figure 1b, the global self-stabilizer is separated from the motion controller. The former finds the optimal joints' increments according to the internal state/action map; the latter only needs to generate the reference motion according to the given motion parameters and is not affected by the global self-stabilizer. Thus, the two tasks (motion and balance) are independent. Only the target motion and the driving capability of the robot are considered in the motion controller. In other words, the self-stabilization capability obtained is not limited to specific motions, and the global self-stabilizer, such as a cerebellum, can be applied to any motion under any perturbation after being sufficiently trained.
In this paper, the stability training system of a legged robot with multiple legs is established and a general hierarchical structure of the global self-stabilizer is designed. The task of the proposed global self-stabilizer will be divided into three subtasks: action selection, adjustment calculation and joint motion mapping. Each subtask will be learned in different state spaces. This paper is organized as follows: Section 2 describes the model of the training system and defines the state space of system variables and actions. Section 3 presents the three modules of the global self-stabilizer and their corresponding learning algorithms. Section 4 describes the simulated and experimental environments for stability training of a biped robot (GoRoBoT-II) and the balance controllers for generating training data. Sections 5 and 6 presents the simulation and experiment training processes and results. This paper is concluded in Section 7.

A General Model for Stability Training of Legged Robots
The basic idea of the legged robot stability training proposed by the authors is shown in Figure 2. During the training period, the robot stands on a training platform that performs a 6-DOF random amplitude-limited motion to simulate perturbations in the real world. The joint motion is generated by model-based balance controllers. The global selfstabilizer learns from the state transition data to obtain the optimal state/action mapping through reinforcement learning. After training, the converged global self-stabilizer can be used in uncertain environments to keep the robot stable.  Figure 1a shows that a general robot balance controller uses a specific balance control law for different perturbations; the balance control of the robot is coupled with a specific motion, which means it is not universal.
As in Figure 1b, the global self-stabilizer is separated from the motion controller. The former finds the optimal joints' increments according to the internal state/action map; the latter only needs to generate the reference motion according to the given motion parameters and is not affected by the global self-stabilizer. Thus, the two tasks (motion and balance) are independent. Only the target motion and the driving capability of the robot are considered in the motion controller. In other words, the self-stabilization capability obtained is not limited to specific motions, and the global self-stabilizer, such as a cerebellum, can be applied to any motion under any perturbation after being sufficiently trained.
In this paper, the stability training system of a legged robot with multiple legs is established and a general hierarchical structure of the global self-stabilizer is designed. The task of the proposed global self-stabilizer will be divided into three subtasks: action selection, adjustment calculation and joint motion mapping. Each subtask will be learned in different state spaces. This paper is organized as follows: Section 2 describes the model of the training system and defines the state space of system variables and actions. Section 3 presents the three modules of the global self-stabilizer and their corresponding learning algorithms. Section 4 describes the simulated and experimental environments for stability training of a biped robot (GoRoBoT-II) and the balance controllers for generating training data. Sections 5 and 6 presents the simulation and experiment training processes and results. This paper is concluded in Section 7.

A General Model for Stability Training of Legged Robots
The basic idea of the legged robot stability training proposed by the authors is shown in Figure 2. During the training period, the robot stands on a training platform that performs a 6-DOF random amplitude-limited motion to simulate perturbations in the real world. The joint motion is generated by model-based balance controllers. The global self-stabilizer learns from the state transition data to obtain the optimal state/action mapping through reinforcement learning. After training, the converged global self-stabilizer can be used in uncertain environments to keep the robot stable. Micromachines 2022, 13, x FOR PEER REVIEW 4 of 27 Figure 2. The basic idea of legged robot stability training and its application [25].

Environmental Disturbance Simulation Method based on Motion Platform
A dedicated 6-DOF serial-parallel mechanism motion platform [24,29] was designed in the authors' laboratory for generating composite perturbations during stability training. Its mechanism sketch is shown in Figure 3a. The reference frames ΣOB-xByBzB and ΣOP-xPyPzP are fixed to the ground and the platform, respectively. The motion of the moving platform can be represented by the displacements xP, yP, zP and 3-2-1 Euler angles θP1, θP2 and θP3 of the frame ΣOP with respect to frame ΣOB in Figure 3b. The pose vector can be expressed as XP = [xP, yP, zP, θP1, θP2, θP3] T . Point C represents the center of mass (CoM) of the trained robot.  Two forms of perturbations, ground tilt perturbations and inertial force/moment perturbations, can be generated by the above platform. If the training platform performs a random amplitude-limited motion, the generated tilt perturbation angle β, inertial force perturbation FP and inertial moment perturbation MP will also be randomly distributed within a certain range, thus enabling a comprehensive simulation of perturbations in the real world.

Environmental Disturbance Simulation Method based on Motion Platform
A dedicated 6-DOF serial-parallel mechanism motion platform [24,29] was designed in the authors' laboratory for generating composite perturbations during stability training. Its mechanism sketch is shown in Figure 3a. The reference frames ΣO B -x B y B z B and ΣO P -x P y P z P are fixed to the ground and the platform, respectively. The motion of the moving platform can be represented by the displacements x P , y P , z P and 3-2-1 Euler angles θ P1 , θ P2 and θ P3 of the frame ΣO P with respect to frame ΣO B in Figure 3b. The pose vector can be expressed as X P = [x P , y P , z P , θ P1 , θ P2 , θ P3 ] T . Point C represents the center of mass (CoM) of the trained robot.

Environmental Disturbance Simulation Method based on Motion Platform
A dedicated 6-DOF serial-parallel mechanism motion platform [24,29] was designed in the authors' laboratory for generating composite perturbations during stability training. Its mechanism sketch is shown in Figure 3a. The reference frames ΣOB-xByBzB and ΣOP-xPyPzP are fixed to the ground and the platform, respectively. The motion of the moving platform can be represented by the displacements xP, yP, zP and 3-2-1 Euler angles θP1, θP2 and θP3 of the frame ΣOP with respect to frame ΣOB in Figure 3b. The pose vector can be expressed as XP = [xP, yP, zP, θP1, θP2, θP3] T . Point C represents the center of mass (CoM) of the trained robot.  Two forms of perturbations, ground tilt perturbations and inertial force/moment perturbations, can be generated by the above platform. If the training platform performs a random amplitude-limited motion, the generated tilt perturbation angle β, inertial force perturbation FP and inertial moment perturbation MP will also be randomly distributed within a certain range, thus enabling a comprehensive simulation of perturbations in the real world.  Two forms of perturbations, ground tilt perturbations and inertial force/moment perturbations, can be generated by the above platform. If the training platform performs a random amplitude-limited motion, the generated tilt perturbation angle β, inertial force perturbation F P and inertial moment perturbation M P will also be randomly distributed within a certain range, thus enabling a comprehensive simulation of perturbations in the real world.

Model of the Training System
As shown in Figure 4, legged robots of any mechanical configurations and any size standing on the training platform can all be equated to a multi-branch chain rigid-body system with n 1 (n 1 ≥ 1) stance legs and n 2 swing legs (n 2 ≥ 0) if the motion in the air is not considered.

Model of the Training System
As shown in Figure 4, legged robots of any mechanical configurations and any size standing on the training platform can all be equated to a multi-branch chain rigid-body system with n1 (n1 ≥ 1) stance legs and n2 swing legs (n2 ≥ 0) if the motion in the air is not considered. The reference frame ΣOS-xSySzS is established at the center of the theoretical support zone, and the motion of frame ΣOS with respect to frame ΣOP can represent the change in the contact state of the robot's feet. In this study, situations in which the robot is completely in the air or the support foot slides on the training platform are not considered. Thus, only the 2-DOF flip motion of the theoretical support zone is analyzed, with the flip angles θS1 and θS2, respectively. Each swing leg can be viewed as an open chain mechanism with its root located at the torso. The swing leg reference frame ΣOFj-xFjyFjzFj is located at the center of the bottom surface of the j th swing foot (j = 1, 2...n2). The motion of the swing leg can be represented by the pose vector XFj-the pose of frame ΣOFj with respect to the torso frame ΣOT-xTyTzT.
To establish the system variable set for the above model, the variables that can be measured or estimated in this system are summarized in Table 2.  The reference frame ΣO S -x S y S z S is established at the center of the theoretical support zone, and the motion of frame ΣO S with respect to frame ΣO P can represent the change in the contact state of the robot's feet. In this study, situations in which the robot is completely in the air or the support foot slides on the training platform are not considered. Thus, only the 2-DOF flip motion of the theoretical support zone is analyzed, with the flip angles θ S1 and θ S2 , respectively. Each swing leg can be viewed as an open chain mechanism with its root located at the torso. The swing leg reference frame ΣO Fj -x Fj y Fj z Fj is located at the center of the bottom surface of the j th swing foot (j = 1, 2 . . . n 2 ). The motion of the swing leg can be represented by the pose vector X Fj -the pose of frame ΣO Fj with respect to the torso frame ΣO T -x T y T z T . To establish the system variable set for the above model, the variables that can be measured or estimated in this system are summarized in Table 2. Table 2. System variables of a general model for stability training of legged robots.

Category of System Variables Definition of Variable Symbolic Representation
Joint motion Angle, angular velocity and acceleration of joints Pose, velocity and acceleration of torso ..

X T j th swing foot motion
Pose, velocity and acceleration of j th foot ..

X Fj
CoM motion Pose, velocity and acceleration of CoM ..  For robots with any number of legs and any configuration, the system variable set can be constructed according to Table 2. In addition, the state variables corresponding to each action will be selected from the system variable set in the subsequent stability training.

Action Set of Legged Robot
The action set which stores the action variables and their adjustment equations is the discourse domain for the action selection. The action is considered as the active adjustment performed by the robot. So, after excluding the system variables that cannot be actively adjusted in the last two rows of Table 2, six types of actions are obtained: single-joint action, torso action, swing foot action, CoM action, inertial force/moment action and ZMP action (corresponding to the first six rows of Table 2, respectively).
In the stability training, the robot needs to accomplish three tasks simultaneously, i.e., tracking motion samples, resisting environmental (training platform) perturbations and avoiding joint limits. In the following, the six types of actions listed will be assigned to the three tasks mentioned above, and then the equation for action adjustment will be designed for each action. The parameters for each action are explained in Table 3.

Action Parameter Meaning
Single-joint action K 11 , K 12 , K 13 the compensation coefficients when the joint position, velocity and acceleration are close to the limit ε 11 , ε 12 , ε 13 the width of the neighborhood where the joint position, velocity and acceleration start to avoid the limit L (·) the compensation function for avoiding the joint limit (1) Single-joint action. When the robot's joint reaches its position limit, velocity limit or acceleration limit, the motion of the robot will be affected, so joint limit avoidance is required.
The angular acceleration .. θ k (k = 1, 2 . . . N J ) of the N J joints of the robot are taken as the action variables in the single-joint action so that the motion curves obtained by integrating the acceleration are smoother than those obtained by directly adjusting the position and velocity. The adjustment is calculated according to Equation (1). The compensation equation for the joint angular limit is calculated according to Equation (2). The compensation equations for joint velocity and acceleration are similar and will not be listed specifically.
(2) Torso action. This kind of action is used to bring the robot stance leg back to the preset motion sample after other adjustments. The action variable is chosen as ..
X T , and its adjustment is calculated using the PD control law shown in the following equation.
(3) Swing foot action. Similar to the torso action, for the n 2 swimming feet in the general model. The action variables are chosen as ..
X Fj (j = 1, 2 . . . n 2 ) and the adjustment is calculated using the PD control law shown in the following equation.
(4) CoM action. This type of action will directly adjust the robot CoM to keep balance on the moving platform. The action variable is chosen as the linear acceleration of the CoM. To keep the CoM above the stance legs, the adjustment is calculated according to the estimated position of the moving platform.
(5) Inertial force/moment action. The inertial forces and moments influenced by the motion of the limbs are taken as a class of actions to cope with the perturbations. The action variables are chosen as the inertial force F and the inertial moment M at the CoM. The kinetic energy attenuation method proposed by the authors of [11] is used here to keep the robot balanced. The adjustment is calculated as follows: (6) ZMP action. As a common control strategy in robot balance control, changing the ZMP position within the support zone through limb motion can be used as a class of action in response to perturbations. Therefore, the action variables are chosen as x ZMP and y ZMP . Using the pose balance control method based on the CP point proposed by the authors of [8], the ZMP adjustment is calculated with the following equation: The action set of legged robots can be written as: Although only one equation is given for the adjustment of each action in Q, different adjustments can be obtained by adjusting the 12 free parameters (K 11 , K 12 , K 13 , K 21 , etc.). The determination methods and specific values of these parameters will be illustrated in Section 4 with simulation examples.

Preprocessing and Structure of the Global Self-Stabilizer
Dimensionality reduction and discretization are required to enable the learning process to exponentially converge because the system space designed in Section 2.2 is a highdimensional continuous space. The system variable set listed in Table 2 is denoted as X = {x i |I = 1, 2 . . . N}, and the action set in 2.3 is denoted as Q = {q j |j = 1, 2 . . . m}. The global self-stabilizer in this study will establish the mapping from X to Q.
The RAFS feature selection method proposed by the authors in [30] will be used to reduce the dimensionality of the system space to obtain the state set S j = s jk k = 1, 2, · · · , N Sj ; s jk ∈ X -corresponding to each action q j and followed by the autonomic abstraction calculation of the state space based on the Gaussian basis functions proposed by the authors in [25]. The continuous state space corresponding to S j is then discretized into different Gaussian basis functions according to the maximum affiliation principle. The full set of Gaussian basis functions corresponding to S j can be expressed as Ψ j = {ψ jk = <µ jk , Σ jk >|k = 1, 2, . . . , N Bj }, where µ jk and Σ jk are the center vector and covariance matrix of the basis function ψ jk , respectively. With . . x N ] T denoting the vector in the system space and s j = [s j1 , s j2 . . . s jNSj ] T denoting the vector in the state space of action q j , the mapping of X to S j after feature selection can be expressed as: where W j is the N Sj × N selection matrix obtained from the feature selection calculation. The affiliation of the reduced-dimensional state vector s j to the basis function ψ jk can be expressed as: The N Sj -dimensional continuous state space corresponding to S j can thus be transformed into a discrete space with N Bj values. To facilitate the learning calculation of the global self-stabilizer in Section 2, a normalized affiliation function is also defined.
The legged robot's actions need to be executed by the joint motion, so the global self-stabilizer also needs to establish the mapping from Q to the joint angular accelera- Because the action variables in Q are all acceleration or force/moment, we can linearize the kinematic or dynamical equations of the system: where b j is the N J -dimensional joint motion mapping vector, which represents the projection of the action adjustment q j in the robot joint space. Combining the joint motion mapping vectors into the mapping matrix B = [b 1 , b 2 , . . . , b n ] T , Equation (12) can then be written as: In general, the number of actions m is greater than the robot DOF N J , so the matrix B is singular. Therefore, the action selection matrix A NJ×m is constructed, which has only one element of 1 in each row and the remaining elements of 0. Adding the action selection matrix A to Equation (13) gives: ∆ ..
According to Equation (14), the global self-stabilizer is divided into three modules in this study: action selection module, adjustment calculation module and joint motion mapping module, which are used to generate A, q and B, respectively. The specific structure of the global self-stabilizer is shown in Figure 5.
In general, the number of actions m is greater than the robot DOF NJ, so the matrix B is singular. Therefore, the action selection matrix ANJ×m is constructed, which has only one element of 1 in each row and the remaining elements of 0. Adding the action selection matrix A to Equation (13) gives: According to Equation (14), the global self-stabilizer is divided into three modules in this study: action selection module, adjustment calculation module and joint motion mapping module, which are used to generate A, q and B, respectively. The specific structure of the global self-stabilizer is shown in Figure 5.

Action Selection Module
The action selection module selects NJ actions in the action set and generates the action selection matrix A. Two main considerations are made when selecting the combination of actions: the value of the actions for the robot stability at the current state, and the influence of the actions on each other when combined.
The action value function is defined as VA(x). The mutual influence between actions is shown by the singularity of AB in Equation (14). For the action variables qi and qj, the mutual influence is quantified by the relative projection of the joint mapping vectors bi and bj (defined in Section 3.4):

Action Selection Module
The action selection module selects N J actions in the action set and generates the action selection matrix A. Two main considerations are made when selecting the combination of actions: the value of the actions for the robot stability at the current state, and the influence of the actions on each other when combined.
The action value function is defined as V A (x). The mutual influence between actions is shown by the singularity of AB in Equation (14). For the action variables q i and q j , the mutual influence c ij is quantified by the relative projection of the joint mapping vectors b i and b j (defined in Section 3.4): For any action selection of matrix A, we can define the action selection evaluation function as follows: where 1 is an all-one vector, and ω C is the weight of the action value and mutual influence. Action selection can be achieved by solving the optimization model shown in Equation (17).

Adjustment Calculation Module
There may be different formulas (or different parameters) for calculating the adjustment of the same action because the training data of the global self-stabilizer may have multiple sources (model-based controllers, motion capture data, etc.). Therefore, the task of the adjustment calculation module is to select the most valuable adjustment calculation formula for each action, then calculate and output action adjustment q.
Assuming that the j th action has n j (j = 1, 2, . . . , m) different adjustment formulas, the value functions of these formulas V jk (x) (j = 1, 2, . . . , m; k = 1, 2, . . . , n j ) are obtained by learning, and the formula with the largest V jk (x) is selected to calculate q j .
The action value function V Aj (x) can be determined by V jk (x): Both the action selection module and the adjustment calculation module need to determine the value function V jk (x) through learning. The learning of V jk (x) will be introduced below. The state transition of the training data at each moment can be extracted as a quintuple <x, I, ∆ .. θ, r, x >, where I is the activation flag matrix of the adjustment calculation formula, and the element I jk takes 1 when the adjustment q j is calculated by the k th formula. x is the system variable vector at the next moment. r is the immediate reward, considering the stability of the robot and the difference between the actual motion and the reference motion of the robot. The reward function r is defined according to Equation (19).
where θ i d is the joint angle of the i th joint in the motion sample; θ imax and θ imin are the positive and negative limit positions of the i th joint, respectively.
In this paper, ZMP is not the only criterion for determining stability. When ZMP is within the support zone, the robot is considered to be stable; when ZMP exceeds the support zone, the robot will start to flip along the boundary of the support zone. The robot is still considered to have the possibility of recovery when the flip angle is less than 45 • ; only after the flip angle exceeds 45 • is the robot considered to be in an irrecoverable unstable state.
For each training data, the value function Q(ψ ijk ) is updated by Q-learning.
where r jk is the reward function after assigning the immediate reward r to the k th adjustment calculation formula for the j th action, calculated according to Equation (21).

Joint Motion Mapping Module
The task of this module is to give the joint mapping matrix B based on the feedback of the system variable x. The radial basis function network (RBF network) will be used to train the mapping relation (x, ∆ .. θ) → q as an approximation to the system motion equations because it is difficult to obtain training data in the form of <x, B> directly. However, it is always possible to extract training data in the form of <x, q, ∆ .. θ> from the robot state transition. Then, the local linearized mapping matrix B is obtained by differentiating this RBF network.
This network is split into sub-networks with one single behavioral variable q i to reduce the complexity. In Equation (12), ignoring the effect of ∆ .. θ on b i , the mapping vector b i is considered as a function of x only. After performing the local linearization, q i can be calculated by the following equation. The RBF network structure is shown in Figure 6, where B i and v i are the weight matrix and bias vector connecting the input layer to the hidden layer, respectively; u ij (I = 1, 2, . . . , m; j = 1, 2, . . . , N i ) is the linear activation function; the connection weights of the hidden layer to the output layer are the affiliation function f ij (defined in Equation (11)). The output equation of the network is shown in Equation (23)

Joint Motion Mapping Module
The task of this module is to give the joint mapping matrix B based on the feed of the system variable x. The radial basis function network (RBF network) will be us train the mapping relation ( , Δ ̈) → as an approximation to the system motion e tions because it is difficult to obtain training data in the form of <x, B> directly. How it is always possible to extract training data in the form of <x, q, θ    > from the robot transition. Then, the local linearized mapping matrix B is obtained by differentiating RBF network.
This network is split into sub-networks with one single behavioral variable qi t duce the complexity. In Equation (12), ignoring the effect of θ    on bi, the mapping v bi is considered as a function of x only. After performing the local linearization, qi ca calculated by the following equation. The RBF network structure is shown in Figure 6, where Bi and vi are the weight trix and bias vector connecting the input layer to the hidden layer, respectively; u 1,2,…,m; j = 1,2,…, Ni) is the linear activation function; the connection weights of the den layer to the output layer are the affiliation function fij (defined in Equation (11)) output equation of the network is shown in Equation (23) Figure 6. Structure of the RBF network.
The basis functions of the above RBF network are evaluated only in the space te rized by to reduce the number of basis functions. This modified RBF network is eq alent to linear (first order) interpolation in the multi-dimensional space, which can prove the fitting accuracy. The basis functions of the above RBF network are evaluated only in the space tensorized by x to reduce the number of basis functions. This modified RBF network is equivalent to linear (first order) interpolation in the multi-dimensional space, which can improve the fitting accuracy.
Differentiating Equation (23), the equation for the mapping vector b i extracted from the RBF network is: The training of the designed RBF network is divided into two steps: (1) determination of the center and boundary of the basis function; (2) local training inside the basis function.
The center and boundary of the basis function are determined by the state space autonomic abstraction calculation based on the Gaussian base function [25] (feature selection is also required). For each RBF sub-network, the basis function set can be expressed as Ψ Bi = {ψ Bij |j = 1, 2, . . . , N i } after the autonomic abstraction calculation.
For the j th basis function of action q i , the following error function can be defined: The superscript (k) represents the k th training data, b Rij is the j th row of the weight matrix B i , and v ij is the j th element of the bias vector v i . For simplicity, f (s i (k) , W Bi , ψ Bij ) is abbreviated as f ijk . To minimize e ij , the following equations need to be solved: Solving Equation (26), the solution shown in Equation (27) can be obtained.
Where the definition of U,Û and q Li are shown in:

Stability Training System of Biped Robots
Taking the biped robot GoRoBoT-II as an example, the simulated and experimental stability training environment are established to validate the effectiveness of the proposed idea and the balance controllers for generating the training data are designed.

Simulation Environment
The biped robot used in this study is the bipedal part of the GoRoBoT-II robot designed by the author's laboratory. Its main mechanism parameters are shown in Table 4. The seven-bar multi-rigid-body model of the biped robot is shown in Figure 7. The reference frames and variables are defined according to the model in Section 2.2. In addition, the joint angles of the left and right legs are denoted as θ Li and θ Ri (i = 1, 2, . . . , 6), respectively.

Experiment Environment
The experimental system for stability training is shown in Figure 8, which includes the upper computer, motion platform, biped robot and protection device. The upper computer is a PC with a Windows operating system. The protection device is composed of a wire rope, a fixed pulley and a pull ring. When the robot is stable, the wire rope stays slack and does not affect the robot; when the robot is unstable, the experimenter pulls the protection rope tightly to prevent the robot from falling down.

Experiment Environment
The experimental system for stability training is shown in Figure 8, which includes the upper computer, motion platform, biped robot and protection device. The upper computer is a PC with a Windows operating system. The protection device is composed of a wire rope, a fixed pulley and a pull ring. When the robot is stable, the wire rope stays slack and does not affect the robot; when the robot is unstable, the experimenter pulls the protection rope tightly to prevent the robot from falling down. The training platform in the above system is a 2-DOF motion platform. The mechanism diagram and its main parameters are given in Figure 9. The motion platform can oscillate around the x-axis and y-axis, denoted by θP1 and θP2, respectively. The limits of oscillation amplitude, speed and acceleration are ±20°, ±40°/s and ±60°/s 2 , respectively, which meet the requirements for stability training. The training platform in the above system is a 2-DOF motion platform. The mechanism diagram and its main parameters are given in Figure 9. The motion platform can oscillate around the x-axis and y-axis, denoted by θ P1 and θ P2 , respectively. The limits of oscillation amplitude, speed and acceleration are ±20 • , ±40 • /s and ±60 • /s 2 , respectively, which meet the requirements for stability training. The training platform in the above system is a 2-DOF motion platform. The mechanism diagram and its main parameters are given in Figure 9. The motion platform can oscillate around the x-axis and y-axis, denoted by θP1 and θP2, respectively. The limits of oscillation amplitude, speed and acceleration are ±20°, ±40°/s and ±60°/s 2 , respectively, which meet the requirements for stability training. Each joint of the robot is driven by a Maxon RE35 DC servo motor. The transmission system consists of a synchronous belt drive (first stage) and a harmonic gear drive (second stage).
The motion control commands of the robot are generated by the upper computer, and the DC servo motors of each joint are position servos controlled by IPM100 controllers. In addition to the photoelectric encoders on the DC servo motors, the robot is equipped with a gyroscope (mounted on the torso) and force sensors (mounted under the Each joint of the robot is driven by a Maxon RE35 DC servo motor. The transmission system consists of a synchronous belt drive (first stage) and a harmonic gear drive (second stage).
The motion control commands of the robot are generated by the upper computer, and the DC servo motors of each joint are position servos controlled by IPM100 controllers. In addition to the photoelectric encoders on the DC servo motors, the robot is equipped with a gyroscope (mounted on the torso) and force sensors (mounted under the soles of the feet) to measure the acceleration and velocity of the torso as well as the contact forces, respectively.

Balance Controllers for Stability Training Data Generation
The model-based balance controllers used for training data generation can be obtained by combining actions in action set Q. The stance leg can follow the motion sample input when the behavior variable By replacing some elements of the above action vector with the variables of three types of actions-CoM action, inertial force/moment action and ZMP action-the balance adjustments can be achieved based on the input sample motions. A variety of legged robot balance controllers with different action combinations can be obtained. The three types of controllers are described in detail below.
(1) CoM adjustment balance controller. This controller maintains the robot's balance by keeping the robot's CoM above its support zone. The action variable that must be selected is X F . The former corresponds to adjusting the robot's CoM by translational motion of the torso, and the latter by the swing foot.
(2) Energy attenuation balance controller. This controller dissipates the system energy by making the inertial force and moment do negative work, thus achieving stabi-lization. The action variables to be selected are F and M. The action variables that can be removed are the torso acceleration ..

X T or the swing leg acceleration
.. X F , which correspond to the two ways of changing the inertial force and moment by the stance leg adjustment or the swing leg adjustment, respectively.
(3) ZMP adjustment balance controller. This controller keeps the robot's CP point in the center of the support zone by adjusting the ZMP. Therefore, the action variables that must be selected are x ZMP and y ZMP , and the substituted action variables can be ..
x T or X F , which is equivalent to the adjustment of the ZMP position by torso swing or swing leg kick. Table 5 summarizes the six balance controllers. During the stability training, the action selection matrix A is determined by the corresponding controllers used in Table 5; the joint mapping matrix B is calculated according to the kinematics and dynamics of the robot; the adjustment vector ∆q within each control cycle is calculated from the corresponding adjustment calculation formula (Equations (1)- (7)) according to the current state x; and the control output ∆ .. θ is solved by Equation (14). Furthermore, the state transition information <x, I, ∆ .. θ, r, x > generated by the above balance controller will be recorded to form the training data, and this data will be used for learning the three modules of the global self-stabilizer. When the position, velocity or acceleration of a joint enters its limit neighborhood (determined by ε J1 , ε J2 and ε J3 ), the joint limit will be avoided by the single-joint action, which is achieved by selecting one of the single-joint actions that has the largest influence coefficient (defined by Equation (15)).

Simulation Results
Here, the stability training data of the single-leg stance, double-leg stance and stepping will be generated within the simulation environment established in 4.1 using the modelbased balance controllers in 4.3 to train the global self-stabilizer, after which the stability verification simulation of the trained global self-stabilizer will be performed under the same conditions.

Stability Training in Simulation
In the stability training simulation, the motion platform applies two kinds of perturbations. The first one is time-varying ground tilt perturbation by the amplitude-limited random motion of the swing angle θ P1 and θ P2 (see Figure 7); the other is the impact perturbation by the sudden change of angular velocity based on the first one.
Three different sets of the control parameters in the adjustment amount (Equations (1)- (7)) are designed, corresponding to different response speeds. The specific values are given in Table 6, which were obtained from the simulation conducted before training. The superscript is used to indicate the level action that the variable takes, such as x ZMP (1). Three reference motions were used for the stability training simulation, i.e., single-leg stance, double-leg stance and stepping. The stepping motion has random landing points, and the motion samples were obtained by the planning method proposed in [31]. One hundred simulations were performed for each level of each balance controller under each perturbation condition in Adams, and 4000 system variable transition data were extracted from each simulation. The duration of each simulation was 20 s, and the control period was 5 ms. For the controllers of TC i , TE i and TB i (I = 1, 2, 3), a total of 1.2 × 10 6 transition data of system variables without impact and 8 × 10 5 with impact were obtained, respectively; for the controllers of FC i , FE i and FB i (I = 1, 2, 3), a total of 8 × 10 5 transition data of system variables without impact and 4 × 10 5 with impact were obtained, respectively. The maximum simulation success rates among all model-based controllers are shown in Table 8.
As a preparation for Q-learning and RBF network learning, feature selection and autonomic abstraction calculations were performed first, and the results are shown in Table 7. The value functions of the actions with different parameters share the same feature selection results, but the state space autonomic abstraction calculation is performed with different basis function distributions so that different parameters obtain different numbers of basis functions.
A total of 198 system variables were selected for the 40 functions in the above table. There were an average of five state variables per function from 113 system variables, which shows that the RAFS feature selection method effectively reduces the state space dimensionality of learning.
The 30 most-selected system variables are shown in Figure 10. The most-selected variables are joint angles of stance leg, followed by the position of CoM and the flip angle. Overall, the system variables related to robot CoM, platform swing angle, resultant force/moment and ZMP are all present in the top 30 most-selected variables. All of these are important variables or equilibrium criteria in biped robot balance control, which indicates that the RAFS feature selection method successfully selected system variables of significance.
The 40 functions in Table 7 were learned separately after the feature selection and state space autonomic abstraction calculation described above. The Q-learning of the action values was trained in a batch, with the amount of training data for each batch being 10,000, and the incremental threshold of the value function for iterative convergence set to 10 −5 ; the RBF network for joint motion mapping was trained according to Equation (27). The optimal solution was converged after performing one iteration on all training data.
x C , y C , . y C 8824 8937 9331 Joint mapping The 30 most-selected system variables are shown in Figure 10. The most-selected variables are joint angles of stance leg, followed by the position of CoM and the flip angle. Overall, the system variables related to robot CoM, platform swing angle, resultant force/moment and ZMP are all present in the top 30 most-selected variables. All of these are important variables or equilibrium criteria in biped robot balance control, which indicates that the RAFS feature selection method successfully selected system variables of significance. Figure 10. The 30 most-selected system variables.
The 40 functions in Table 7 were learned separately after the feature selection and state space autonomic abstraction calculation described above. The Q-learning of the action values was trained in a batch, with the amount of training data for each batch being 10,000, and the incremental threshold of the value function for iterative convergence set to 10 −5 ; the RBF network for joint motion mapping was trained according to Equation (27). The optimal solution was converged after performing one iteration on all training data.

Stability Verification Simulation of the Trained Global Self-Stabilizer
To verify the effectiveness of the trained global self-stabilizer, five hundred stability verification simulations were performed on the motion platform for each of the three robot motions, under the same simulation conditions and parameters as the training data generation. The success rates of the above verification simulations are presented in Table  8 and are compared with the highest success rate of the model-based balance controllers.

Stability Verification Simulation of the Trained Global Self-Stabilizer
To verify the effectiveness of the trained global self-stabilizer, five hundred stability verification simulations were performed on the motion platform for each of the three robot motions, under the same simulation conditions and parameters as the training data generation. The success rates of the above verification simulations are presented in Table 8 and are compared with the highest success rate of the model-based balance controllers. Table 8. Comparison of the simulation success rates of the trained global self-stabilizer and modelbased balance controllers.

Success Rate without Impact
Success Rate with Impact From the above table, it can be seen that the trained global self-stabilizer obtains stronger stability than the model-based balance controllers, with increases ranging from around 10% to 33%. The global self-stabilizer nearly doubles the success rate when the impact perturbations are applied.

Model-Based Controllers
The verification simulation results of the single-leg stance and the stepping will be analyzed next, because the double-leg stance is less challenging than others.
The ZMP curves in two single-leg stance simulations are given in Figure 11, respectively. Figure 11a depicts that the trained global self-stabilizer regulated the ZMP to the center of the support zone when no impact perturbation is applied. Figure 11b shows that the ZMP exceeded the support zone boundary with the farthest distance of 87.7 mm after the impact, and the global self-stabilizer reduced the ZMP's oscillation amplitude and finally recovered the flat-foot contact of the robot.
The joint angles in the same simulations are shown in Figure 12, wherein the joint limits are marked with horizontal lines. The moments of impact and restoration of equilibrium are also marked with vertical lines in Figure 12b. Figure 12a shows that the knee joints approach the joint limit between 15 s and 17 s, and the global self-stabilizer distributes the motion of the knee joints to the ankle joints.
Screenshots of the single-stance stability verification simulation with impact using the virtual prototype in Adams are shown in Figure 13.
The ZMP curves in two single-leg stance simulations are given in Figure 11, respectively. Figure 11a depicts that the trained global self-stabilizer regulated the ZMP to the center of the support zone when no impact perturbation is applied. Figure 11b shows that the ZMP exceeded the support zone boundary with the farthest distance of 87.7 mm after the impact, and the global self-stabilizer reduced the ZMP's oscillation amplitude and finally recovered the flat-foot contact of the robot. The joint angles in the same simulations are shown in Figure 12, wherein the joint limits are marked with horizontal lines. The moments of impact and restoration of equilibrium are also marked with vertical lines in Figure 12b. Figure 12a shows that the knee joints approach the joint limit between 15 s and 17 s, and the global self-stabilizer distributes the motion of the knee joints to the ankle joints. Screenshots of the single-stance stability verification simulation with impact using the virtual prototype in Adams are shown in Figure 13. The action-switching process of the global self-stabilizer is given in Figure 14. The switching of C x  , C y  , xZMP and yZMP without impact are given in Figure 14a,b. When the sagittal impact is applied, the global self-stabilizer will use FX and MY actions to replace Screenshots of the single-stance stability verification simulation with impact using the virtual prototype in Adams are shown in Figure 13. The action-switching process of the global self-stabilizer is given in Figure 14. The switching of C x  , C y  , xZMP and yZMP without impact are given in Figure 14a,    , respectively). The switching process is shown in Figure 14c,d. The action-switching process of the global self-stabilizer is given in Figure 14. The switching of ..
x C , .. y C , x ZMP and y ZMP without impact are given in Figure 14a,  In the case of single-leg stance, the global self-stabilizer dynamically mixed TC a TB controllers in the case of no impact; in the presence of impact, the global self-stabiliz combined the four types of controllers, TC, TB, FE and TE. The controller parameters we also adjusted according to the system state. The switching rules were implicitly contain in value functions obtained from training process, and the results are equivalent to e ploring different combinations of actions or parameters for calculating adjustments in d ferent locations of the system space. Therefore, the global self-stabilizer obtained stronger stability than the original controller used to generate the training data.
The simulation data of single-leg stance with impact were sampled using a Gaussi function (standard deviation 5 mm). The probabilities of the distribution of the simulati success rate with respect to the ZMP position and the support surface flip angle are show in Figure 15. From Figure 15a, it can be seen that the robot is basically guaranteed to stable when the ZMP is within the support zone. The area circled by the contour with 80% success rate is about 1.8 times the size of the support zone, indicating that the glob self-stabilizer makes it possible for the robot to recover its balance even when the ZMP out of the support zone. Figure 15b shows that the robot has 100% stability when the f angle θS1 is less than 6° and θS2 is less than 5°; the success rate of recovering balance gra ually decreases as the flip angle rises. Figure 15 also shows that the robot has strong robustness to resist sagittal disturbances than lateral disturbances in single-leg stance. In the case of single-leg stance, the global self-stabilizer dynamically mixed TC and TB controllers in the case of no impact; in the presence of impact, the global self-stabilizer combined the four types of controllers, TC, TB, FE and TE. The controller parameters were also adjusted according to the system state. The switching rules were implicitly contained in value functions obtained from training process, and the results are equivalent to exploring different combinations of actions or parameters for calculating adjustments in different locations of the system space. Therefore, the global self-stabilizer obtained a stronger stability than the original controller used to generate the training data.
The simulation data of single-leg stance with impact were sampled using a Gaussian function (standard deviation 5 mm). The probabilities of the distribution of the simulation success rate with respect to the ZMP position and the support surface flip angle are shown in Figure 15. From Figure 15a, it can be seen that the robot is basically guaranteed to be stable when the ZMP is within the support zone. The area circled by the contour with an 80% success rate is about 1.8 times the size of the support zone, indicating that the global self-stabilizer makes it possible for the robot to recover its balance even when the ZMP is out of the support zone. Figure 15b shows that the robot has 100% stability when the flip angle θ S1 is less than 6 • and θ S2 is less than 5 • ; the success rate of recovering balance gradually decreases as the flip angle rises. Figure 15 also shows that the robot has stronger robustness to resist sagittal disturbances than lateral disturbances in single-leg stance. The joint angles in one stepping simulation are shown in Figure 16 which depicts that the robot has periodic trajectories of joint angles. In addition, there are also irregular fluctuations due to the changing ground tilt perturbation imposed by the moving platform. The action switching in the sagittal and lateral planes are given in Figure 17. The action switching processes in two planes are similar. Where the CoM action is dominant during the double-legged stance period, the ZMP action is dominant during the singleleg stance period, and the inertial force action is used before and after the swing foot hits the ground. The joint angles in one stepping simulation are shown in Figure 16 which depicts that the robot has periodic trajectories of joint angles. In addition, there are also irregular fluctuations due to the changing ground tilt perturbation imposed by the moving platform. The joint angles in one stepping simulation are shown in Figure 16 which depicts that the robot has periodic trajectories of joint angles. In addition, there are also irregular fluctuations due to the changing ground tilt perturbation imposed by the moving platform. The action switching in the sagittal and lateral planes are given in Figure 17. The action switching processes in two planes are similar. Where the CoM action is dominant during the double-legged stance period, the ZMP action is dominant during the singleleg stance period, and the inertial force action is used before and after the swing foot hits the ground. The action switching in the sagittal and lateral planes are given in Figure 17. The action switching processes in two planes are similar. Where the CoM action is dominant during the double-legged stance period, the ZMP action is dominant during the single-leg stance period, and the inertial force action is used before and after the swing foot hits the ground.

Experiment Results
In this section, experiments are conducted firstly for stability training, followed by stability verification experiments using the trained global self-stabilizer to show the effects.

Stability Training Experiment
The global self-stabilizer obtained from the simulation was transplanted to the robot to reduce the wear of mechanical parts by frequent training experiments. The parameters that need to be transplanted include the parameter set Ψ of the basis function, the value function V and the connection weight matrix Hi of the RBF network.
The stability training experiments of three motions were performed using the bipedal part of the GoRoBoT-II robot. The total number of experiments and the number of successes for each motion under different perturbation conditions are given in Table 9. The transplanted global self-stabilizer was trained using the obtained experimental data, and the procedure and the parameters to be learned are similar to the simulation training in 5.1.

Stability Verification Experiment
For the three motions of double-leg stance, single-leg stance and stepping, twenty stability validation experiments were conducted, and the disturbances were generated according to the fourth, second, and first row parameters in Table 9, respectively. The corresponding success rates are 75%, 60% and 55%, respectively. Accordingly, the success rates of the model-based balance controller in the experiments were improved by 16.7%, 26.7% and 25.4%, respectively.
The distributions of the experimental data in the platform phase space are shown in Figure 18, where the unstable points indicate that the robot met the unstable condition in Equation (19) within 3 s, while the stable points indicate that the robot did not fall over within 3 s. The phase space of the motion platform was divided into the stable region, the unstable region and the transition region. It can be seen that the stable region of all three

Experiment Results
In this section, experiments are conducted firstly for stability training, followed by stability verification experiments using the trained global self-stabilizer to show the effects.

Stability Training Experiment
The global self-stabilizer obtained from the simulation was transplanted to the robot to reduce the wear of mechanical parts by frequent training experiments. The parameters that need to be transplanted include the parameter set Ψ of the basis function, the value function V and the connection weight matrix H i of the RBF network.
The stability training experiments of three motions were performed using the bipedal part of the GoRoBoT-II robot. The total number of experiments and the number of successes for each motion under different perturbation conditions are given in Table 9. The transplanted global self-stabilizer was trained using the obtained experimental data, and the procedure and the parameters to be learned are similar to the simulation training in 5.1.

Stability Verification Experiment
For the three motions of double-leg stance, single-leg stance and stepping, twenty stability validation experiments were conducted, and the disturbances were generated according to the fourth, second, and first row parameters in Table 9, respectively. The corresponding success rates are 75%, 60% and 55%, respectively. Accordingly, the success rates of the model-based balance controller in the experiments were improved by 16.7%, 26.7% and 25.4%, respectively.
The distributions of the experimental data in the platform phase space are shown in Figure 18, where the unstable points indicate that the robot met the unstable condition in Equation (19) within 3 s, while the stable points indicate that the robot did not fall over within 3 s. The phase space of the motion platform was divided into the stable region, the unstable region and the transition region. It can be seen that the stable region of all three motions is larger than the size of the unstable region and the transition region, indicating that the trained global self-stabilizer gained the ability to resist external perturbations. The ZMP curves that were obtained in three random experiments for each motion are given in Figure 19, which shows that the trained global stabilizer can restore balance even if the ZMP is out of the support zone. In addition, the corresponding experiment screenshots are shown in Figure 20. The ZMP curves that were obtained in three random experiments for each motion are given in Figure 19, which shows that the trained global stabilizer can restore balance even if the ZMP is out of the support zone. In addition, the corresponding experiment screenshots are shown in Figure 20. The ZMP curves that were obtained in three random experiments for each motion are given in Figure 19, which shows that the trained global stabilizer can restore balance even if the ZMP is out of the support zone. In addition, the corresponding experiment screenshots are shown in Figure 20. In summary, the trained global self-stabilizer obtained the self-stabilization capability to cope with the random amplitude-limited perturbations under different motions. In addition, the stabilization capability was stronger than that of the model-based balance controllers after the training process, which indicates that the global self-stabilizer extracted and generated the control strategy that was most beneficial to maintain the robot's balance based on the training data, and obtained a better state/action mapping. In summary, the trained global self-stabilizer obtained the self-stabilization capability to cope with the random amplitude-limited perturbations under different motions. In addition, the stabilization capability was stronger than that of the model-based balance controllers after the training process, which indicates that the global self-stabilizer extracted and generated the control strategy that was most beneficial to maintain the robot's balance based on the training data, and obtained a better state/action mapping.

Conclusions
A general model of a stability training system with a training platform is designed for legged robots with an arbitrary number of legs and an arbitrary configuration. The application of the proposed idea was given from three perspectives: system variable determination, action set construction and model-based controller designs for training data generation. A global self-stabilizer capable of learning from different sources of training data in a high-dimensional continuous system space was proposed to address the stability training problem of legged robots. The overall task of keeping the robot stable is broken down into three modules: action selection, adjustment calculation and joint motion mapping, in which the action selection and adjustment calculation modules use the Q-learning algorithm, and the joint motion mapping module uses a modified RBF network.
Stability training simulations and experiments of the global self-stabilizer were conducted by taking the bipedal robot, GoRoBoT-II, as an example (it should also be noted that the application of the proposed training method was not limited by the size of robot). The training data that were generated from 18 controllers were used for training the global

Conclusions
A general model of a stability training system with a training platform is designed for legged robots with an arbitrary number of legs and an arbitrary configuration. The application of the proposed idea was given from three perspectives: system variable determination, action set construction and model-based controller designs for training data generation. A global self-stabilizer capable of learning from different sources of training data in a high-dimensional continuous system space was proposed to address the stability training problem of legged robots. The overall task of keeping the robot stable is broken down into three modules: action selection, adjustment calculation and joint motion mapping, in which the action selection and adjustment calculation modules use the Q-learning algorithm, and the joint motion mapping module uses a modified RBF network.
Stability training simulations and experiments of the global self-stabilizer were conducted by taking the bipedal robot, GoRoBoT-II, as an example (it should also be noted that the application of the proposed training method was not limited by the size of robot). The training data that were generated from 18 controllers were used for training the global self-stabilizer.
Stability verification simulations and experiments were conducted for the trained global self-stabilizer, and the following conclusions can be obtained:

1.
Simulation verification showed that the success rates of the trained global selfstabilizer, in three kinds of motion, under different disturbances, were higher than that of the model-based balance controller, with an improvement of at least 9.4%.

2.
Experiment verification showed that the trained global self-stabilizer could keep the robot balanced under the random amplitude-limited tilt perturbation. The success rates of the stability verification experiments could reach 75%, 60% and 55%, respectively, which were higher than the success rates obtained using the modelbased balance controller during the training data generation (58.3%, 33.3% and 29.6%, respectively).

3.
The trained global self-stabilizer obtained different action combinations from the training data, and also continuously switched parameters according to the system state. This indicates that the designed global self-stabilizer was able to explore better state-action mapping from the training data and had the ability to learn and evolve continuously.
In summary, the proposed global self-stabilizer was able to accomplish the stability training task under compound perturbations and explore better action combinations from multiple different sources of training data. In the next step, we will put the trained global self-stabilizer into a real, unknown environment for further experiments.

Conflicts of Interest:
The authors declare no conflict of interest.