Q-Learning of Straightforward Gait Pattern for Humanoid Robot Based on Automatic Training Platform

Abstract: In this paper, an oscillator-based gait pattern with sinusoidal functions is designed and implemented on a field-programmable gate array (FPGA) chip to generate a trajectory plan and achieve bipedal locomotion for a small-sized humanoid robot. In order to enable the robot to walk straight, the turning direction is treated as a parameter of the gait pattern, and Q-learning is used to obtain a straightforward gait pattern. Moreover, an automatic training platform is designed so that the learning process is automated. In this way, the turning direction can be adjusted flexibly and efficiently under the supervision of the automatic training platform. The experimental results show that the proposed learning framework allows the humanoid robot to gradually walk straight in the automated learning process.


Introduction
Humanoid robots are an attractive topic in the field of robotics. A biped structure is designed for humanoid robots and is expected to facilitate human lives and even allow the robots to coexist with humans. Therefore, bipedal locomotion is an important ability of humanoid robots that is widely researched. Some gait patterns are motivated by biologically inspired control concepts to achieve bipedal locomotion. Rhythmic movements in animals are realized via an interaction between the dynamics of a musculoskeletal system and the rhythmic signals from central pattern generators (CPGs) [1,2]. In robotics, CPGs were formulated as a set of neural oscillators to produce the gait pattern of oscillations necessary for rhythmic movements [3,4]. Based on the neural oscillator, a set of coupled-phase oscillators were presented using sinusoidal functions for the gait pattern [5]. However, the neural oscillator and the coupled-phase oscillator are modulated in the joint space for each joint of the humanoid robot, resulting in too many parameters needing to be adjusted. Based on the Cartesian coordinate system, the simplified coupled linear oscillators were extended from the abovementioned methods to produce the gait pattern [6,7] with trajectory planning in the workspace [8,9]. The simplified coupled linear oscillators can be divided into a balance oscillator and two movement oscillators which have a direct correlation between the oscillator parameters and the gait pattern. The center of mass (CoM) trajectory can be designed through the balance oscillator and its oscillator parameters. Similarly, the left and right ankle trajectories can be designed through the movement oscillator and its oscillator parameters. Hence, these oscillator parameters all affect the gait pattern for the humanoid robot. 
This gait pattern for the humanoid robot can achieve high flexibility through adjustment of the parameters. An experimental platform was designed to implement the proposed method and achieve the desired behavior of the robot.

Small-Sized Humanoid Robot
A small-sized humanoid robot with 23 degrees of freedom (DOFs) was designed to imitate human movements. There were two DOFs in the head, four DOFs per arm, one DOF in the waist, and six DOFs per leg. The mechanism and dimensions of the small-sized humanoid robot are described in Figure 2. Its height and weight were 56.45 cm and 4.5 kg, respectively. The main hardware included 23 servo motors, one complementary metal-oxide-semiconductor (CMOS) sensor, one FPGA board, and one integrated circuit board. The specifications of the small-sized humanoid robot are shown in Table 1. The FPGA board contained an FPGA chip which was used as the main controller for the humanoid robot. The internal signals of the robot could be transferred into the FPGA chip through the integrated circuit board. Hence, the commands could be transmitted from the FPGA chip to all device components (i.e., the 23 servo motors) by using general-purpose input/output (GPIO) pins and the integrated circuit board. Notably, the FPGA chip offers the advantages of parallel processing and low power consumption. Therefore, the small-sized humanoid robot designed with this FPGA board had greater computing and real-time processing capability compared to the Darwin-OP robot [6] with an Arduino board.


Table 1 (excerpt): servo holding torque 6.0 N·m @ 12 V; no-load speed 63 rpm; resolution 0.088°; CMOS sensor 30 fps.
In this paper, trajectory planning was adopted to achieve the gait pattern of the humanoid robot.
Hence, inverse kinematics was applied to obtain the angle of each joint from the trajectory planning to implement bipedal locomotion. The geometric approach was used to solve the inverse kinematics. The coordinate systems of the humanoid robot, described in its sagittal plane and frontal plane, are shown in Figure 3 [10]. In the sagittal plane of the humanoid robot described in Figure 3a, the angles of the hip joint, knee joint, and ankle joint of the right (left) foot in the pitch axis are denoted as θ pit RH (θ pit LH ), θ pit RK (θ pit LK ), and θ pit RA (θ pit LA ), respectively. Based on the geometric approach, these angles can be obtained from the planned trajectories, where l t and l c are the lengths of the robot thigh and calf, respectively, and L x R (L x L ), L y R (L y L ), and L z R (L z L ) are the step length, step width, and lift height of the right (left) foot.
In the frontal plane of the humanoid robot described in Figure 3b, the angles of the hip joint and ankle joint of the right (left) foot in the roll axis are denoted as θ rol RH (θ rol LH ) and θ rol RA (θ rol LA ), respectively. Similarly, based on the geometric approach, θ rol RH , θ rol RA , θ rol LH , and θ rol LA can be obtained from the lateral trajectories, with θ rol LA = θ rol LH .
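As a concrete illustration of the geometric approach in the sagittal plane, the following is a minimal Python sketch of two-link inverse kinematics using the thigh and calf lengths l t and l c; the function name and the zero conventions of the angles are assumptions here, not the paper's definitions.

```python
import math

def sagittal_leg_ik(x, z, l_t, l_c):
    """Two-link planar IK sketch: given the ankle position (x, z)
    relative to the hip in the sagittal plane, return hip, knee, and
    ankle pitch angles in radians (0 = leg fully extended downward)."""
    d = math.hypot(x, z)
    d = max(min(d, l_t + l_c), 1e-9)  # clamp to the reachable workspace
    # Knee angle from the law of cosines.
    cos_knee = (l_t**2 + l_c**2 - d**2) / (2 * l_t * l_c)
    theta_knee = math.pi - math.acos(max(-1.0, min(1.0, cos_knee)))
    # Hip pitch: angle of the hip-ankle line plus the interior thigh angle.
    alpha = math.atan2(x, z)
    cos_beta = (l_t**2 + d**2 - l_c**2) / (2 * l_t * d)
    beta = math.acos(max(-1.0, min(1.0, cos_beta)))
    theta_hip = alpha + beta
    # Ankle pitch chosen so the sole stays parallel to the ground.
    theta_ankle = theta_knee - theta_hip
    return theta_hip, theta_knee, theta_ankle
```

For example, with l t = l c, placing the ankle directly below the hip at full extension yields zero for all three pitch angles.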

Automatic Training Platform
An automatic training platform with three degrees of freedom was designed to allow the robot to be trained in an automated learning process. The specifications of the automatic training platform are shown in Table 2. The main hardware included three servo motors, one personal computer (PC), two infrared sensors, and one CMOS sensor. The PC was used as the main controller for the automatic training platform. The mechanism dimensions of the automatic training platform are shown in Figure 4. Its length, width, and height were 243 cm, 124 cm, and 85 cm, respectively. The length and width of the training field were 238 cm and 119 cm, respectively. Two infrared sensors were used to measure the x-axis and y-axis distances of the robot in the training field. As shown in Figure 5, a unit coordinate of 17 × 17 cm 2 was considered to construct the coordinate of the training field in the horizontal plane of the automatic training platform. The measured information (d x , d y ) was transferred into a coordinate to represent the position of the robot in the training field. In addition, a blue round marker was placed above the humanoid robot, allowing the platform to follow and protect the robot. As shown in Figure 6, a traditional red-green-blue (RGB) image of the robot's mark was captured by the CMOS sensor and converted into a filtered image via hue-saturation-value (HSV) thresholding followed by dilation and erosion. Hence, the CMOS sensor could be applied to detect the robot so that the platform could move to follow and protect it. In this way, the humanoid robot could be protected and trained under the supervision of the automatic training platform. In this paper, robot detection was adopted to allow the platform to follow the humanoid robot.
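The mapping from the infrared measurements (d x , d y ) to the 17 × 17 cm 2 grid coordinate can be sketched in a few lines; the function name and the flooring convention are illustrative assumptions.

```python
def field_cell(d_x, d_y, unit=17.0):
    """Map the infrared distance measurements (d_x, d_y), in cm, to a
    discrete cell of the 17 x 17 cm^2 grid that covers the 238 x 119 cm
    training field. Cells are indexed from (0, 0) by flooring."""
    return int(d_x // unit), int(d_y // unit)
```

For instance, a robot measured at (35 cm, 10 cm) falls into cell (2, 0), and the far corner of the field maps to cell (13, 6).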
Hence, motion control was applied to keep the robot's mark in the central position of the image at all times to implement visual tracking. Velocity control was used for motion control because the automatic training platform was continuously operated to track the humanoid robot. Hence, the velocities of the x-axis, y-axis, and z-axis of the automatic training platform are denoted as ω x ATP , ω y ATP , and ω z ATP . In the image, the pixel errors in the x-axis and y-axis (x err , y err ) represent the horizontal distance between the robot's mark position and the central position, and the area of the robot's mark (area) represents the estimation of the vertical distance between the fixed CMOS position and the robot's mark; both could be obtained from the filtered image. In the horizontal motion control, the pixel errors were given as the input to a proportional-derivative controller to calculate the velocity. In the vertical motion control, the area of the robot's mark was given as the input for the constant velocity to decide its direction. Hence, ω x ATP , ω y ATP , and ω z ATP can be respectively determined from the pixel errors and the area, where K p and K d are the gains of the proportional and derivative controllers, respectively, ω C is the constant velocity, and area Min and area Max are the boundaries of the minimum and maximum area of the robot's mark.
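The tracking loop described above can be sketched as follows: PD control on the pixel errors for the horizontal axes, and a constant-velocity decision on the marker area for the vertical axis. All gains, area bounds, and the ascend/descend sign convention are illustrative assumptions, not the paper's values.

```python
class PlatformTracker:
    """Sketch of the platform's visual-servoing loop."""

    def __init__(self, kp=0.05, kd=0.01, w_c=2.0,
                 area_min=800, area_max=1500):
        self.kp, self.kd, self.w_c = kp, kd, w_c
        self.area_min, self.area_max = area_min, area_max
        self.prev_x_err = 0.0
        self.prev_y_err = 0.0

    def update(self, x_err, y_err, area):
        # Horizontal axes: proportional-derivative on pixel errors.
        w_x = self.kp * x_err + self.kd * (x_err - self.prev_x_err)
        w_y = self.kp * y_err + self.kd * (y_err - self.prev_y_err)
        self.prev_x_err, self.prev_y_err = x_err, y_err
        # Vertical axis: constant velocity toward the area band
        # (sign convention here is an assumption).
        if area < self.area_min:
            w_z = -self.w_c
        elif area > self.area_max:
            w_z = self.w_c
        else:
            w_z = 0.0
        return w_x, w_y, w_z
```

Each camera frame yields (x err , y err , area), and one `update` call returns the three axis velocities for the platform motors.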

System Overview
In order to allow the humanoid robot to learn a straightforward gait pattern in the automatic training platform, the proposed learning framework was developed using the system architecture illustrated in Figure 7 and described in Figure 8. The three modules (Q-learning algorithm, gait pattern, and inverse kinematics) were designed and implemented in the FPGA chip to speed up the learning process and to produce real-time bipedal locomotion. In addition, three additional modules (environmental information, robot detection, and motion control) were designed and implemented in the automatic training platform to assist and supervise the humanoid robot in the automatic learning process. Their functions are described below.
Firstly, the robot's mark was placed above it to be detected by the CMOS sensor. Pixel errors in the x-axis and y-axis and the area of the robot's mark (x err , y err , area) were obtained to follow the robot using the robot detection module. Secondly, the velocities ω required by the automatic training platform to control the motors and follow the robot were computed by the motion control module. Thirdly, when the humanoid robot walked with its mechanism error and motor backlash in the real environment, its position s in the training field could be obtained based on the measured data from the environmental information module via the x-axis and y-axis infrared sensors. Fourthly, the turning direction φ, a parameter of the gait pattern, could be calculated according to s to learn the straightforward gait pattern from the Q-learning algorithm module. Fifthly, the trajectory planning P, which depended on the turning direction φ, could be generated from the gait pattern module.
Finally, the angle of each joint θ was determined from the inverse kinematics module based on P so that the robot could exhibit bipedal locomotion. The process of the proposed automatic training platform is described in Figure 9, which consists of several states. In the beginning, the humanoid robot was suspended and then slowly lowered onto the training field, which served as the initial position (the start state), as shown in Figure 9a,b. Next, the straightforward gait pattern was learned while the automatic training platform followed the robot at the same time (the operation state), as shown in Figure 9c,d. Then, once the robot was in danger or once it reached the target region, the humanoid robot was pulled up by the automatic training platform (the end state), as shown in Figure 9e,f. Finally, the automatic training platform could return to the initial position and restart the learning process (the return state), as shown in Figure 9g,h.
The procedure of the proposed learning framework based on the automatic training platform can be described as follows:
Step 1: (Setting State) The robot's mark is put above the humanoid robot and is detected by a CMOS sensor installed on the automatic training platform.
Step 2: (Initial State) Pixel errors in the x-axis and y-axis and the area of the robot's mark (x err , y err , area) are obtained from the robot detection module to allow the platform to follow the robot.
Step 3: (Initial State) The velocities ω are determined from the motion control module to control the motors, allowing the automatic training platform to follow the robot.
Step 4: (Initial State) The position s of the humanoid robot in the training field is obtained from the environmental information module based on the measured data via the x-axis and y-axis infrared sensors.
Step 5: (Start State) The humanoid robot is suspended and then slowly placed on the training field, which serves as the initial position.
Step 6: (Operation State) The turning direction φ is calculated from the Q-learning algorithm module based on the position s to learn the straightforward gait pattern.
Step 7: (Operation State) The trajectory planning P, which depends on the turning direction φ, is generated from the gait pattern module.
Step 8: (Operation State) The angle of each joint θ is determined from the inverse kinematics module based on P, allowing the robot to exhibit bipedal locomotion.
Step 9: (End State) When the robot is in danger or when it reaches the target region, the humanoid robot is pulled up by the automatic training platform.
Step 10: (Return State) The automated training platform returns to Step 5 (Start State) and restarts the learning process.
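The ten steps above can be sketched as a small state machine; the state names follow the procedure, while the transition function and its arguments are illustrative assumptions.

```python
from enum import Enum, auto

class State(Enum):
    """Training-platform states named in Steps 1-10."""
    SETTING = auto()
    INITIAL = auto()
    START = auto()
    OPERATION = auto()
    END = auto()
    RETURN = auto()

def next_state(state, in_danger=False, reached_target=False):
    """Transition sketch for the automatic training platform."""
    if state is State.SETTING:
        return State.INITIAL      # Steps 1-2: marker detected
    if state is State.INITIAL:
        return State.START        # Steps 3-4: tracking and position ready
    if state is State.START:
        return State.OPERATION    # Step 5: robot lowered onto the field
    if state is State.OPERATION:
        if in_danger or reached_target:
            return State.END      # Step 9: pull the robot up
        return State.OPERATION    # Steps 6-8: keep walking and learning
    if state is State.END:
        return State.RETURN       # Step 10: go back to the start position
    return State.START            # restart the learning process
```

The operation state loops on Steps 6 to 8 (turning direction, trajectory planning, inverse kinematics) until the end condition of Step 9 triggers.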

Oscillator-Based Gait Pattern
In order to implement a flexible and adaptable gait pattern, oscillators were adopted for the humanoid robot in this paper. Hence, the legs of the humanoid robot and their coordinate system needed to be defined for the gait pattern, as shown in Figure 10a. P W = (P x W , P y W , P z W ) represents the position of the waist, which was considered to be the center of mass (CoM). P RA = (P x RA , P y RA , P z RA ) and P LA = (P x LA , P y LA , P z LA ) represent the positions of the right and left ankles, respectively. The right and left legs interchanged as the support leg to obtain the walking ability of the humanoid robot. Hence, the three-dimensional gait pattern could be described by the positions of the waist, and left and right ankles (P W , P LA , P RA ), as shown in Figure 10b. The standing posture of the robot and its leg parameters are shown in Figure 11, where d y is the distance between the waist P W and the hip, and d z is the distance between the hip and the ankle.


The humanoid robot was a high-dimensional complex structure; thus, three-dimensional trajectory planning P = (P W , P LA , P RA ) was generated by the oscillators based on the Cartesian coordinate system to simplify the gait pattern of the humanoid robot.
The oscillators could be divided into a balance oscillator and two movement oscillators, located at the CoM P W , and the left and right ankles (P LA , P RA ), respectively, to generate the trajectories. The purpose of the balance oscillator was to maintain the balance of the robot and to generate the CoM trajectory. The purpose of the movement oscillators was to support and move the body of the robot and to generate the left and right ankle trajectories. Since the gait pattern was a periodic behavior, a sinusoidal function was adopted for the oscillators, which was adjusted by the walking phase p to simplify the design method. The oscillators at the CoM P W , and left and right ankles (P LA , P RA ) are expressed as sinusoidal functions, where osc W , osc LA , and osc RA are the oscillators at the CoM, and left and right ankles, respectively, p 0 W , p 0 RA , and p 0 LA are the starting points of the CoM, and left and right ankles, respectively, and (ρ, ω, δ) are the amplitude, angular velocity, and phase shift of the oscillator parameters. All oscillators involved three axes of sub-oscillators (x-axis, y-axis, and z-axis) in three-dimensional space, and the two movement oscillators additionally included one sub-oscillator for the turning direction φ.
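Since each oscillator is a sinusoidal function of its starting point, amplitude, angular velocity, and phase shift, a generic sub-oscillator can be sketched as follows; the specific form, function names, and all numeric values are illustrative assumptions rather than the paper's parameter set.

```python
import math

def sub_oscillator(p0, rho, omega, delta, t):
    """Generic sub-oscillator: a sinusoid about the starting point p0
    with amplitude rho, angular velocity omega, and phase shift delta."""
    return p0 + rho * math.sin(omega * t + delta)

def com_trajectory(t, p0=(0.0, 0.0, 27.0), rho=(1.0, 2.0, 0.5),
                   omega=2.0 * math.pi, delta=(0.0, 0.0, 0.0)):
    """Balance-oscillator sketch for the CoM P_W: one sub-oscillator per
    axis (x, y, z). All numeric values here are placeholders."""
    return tuple(sub_oscillator(p0[i], rho[i], omega, delta[i], t)
                 for i in range(3))
```

Sampling `com_trajectory` over one walking period yields the periodic CoM sway; the ankle trajectories follow the same pattern with an extra sub-oscillator for the turning direction φ.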
The gait pattern could be described as three modes: starting mode, gait cycle mode, and ending mode, and each mode was divided into two phases. Hence, a complete walking process consisted of six phases: Phase 1-6 (p1-p6) [7], as shown in Figure 12. The leftmost (initial posture) and the rightmost (final posture) postures were both standing postures. In these six phases, the parameters of the CoM in terms of the x-axis, y-axis, and z-axis (S Wx , S Wy , H W ) were the same throughout the walking process.

Phase 1 (p1) and Phase 2 (p2) were classified as the starting mode, which only worked once at the beginning of the walking process. The CoM swung from the middle to the left, and both feet remained on the floor in Phase 1. The CoM swung from the left back to the middle, with the left foot still on the floor, and the right foot lifted a height H S R to move one step forward S S R in Phase 2.

Phase 3 (p3) and Phase 4 (p4) were classified as the gait cycle mode, which worked repeatedly in the middle of the walking process. The CoM swung in a circular motion on the right side, with the right foot on the floor, and the left foot lifted a height H G L to move one stride forward S G L in Phase 3. The CoM swung in a circular motion on the left side, with the left foot on the floor, and the right foot lifted a height H G R to move one stride forward S G R in Phase 4.

Phase 5 (p5) and Phase 6 (p6) were classified as the ending mode, which also only worked once at the end of the walking process. The CoM swung to the right side, with the right foot on the floor, and the left foot lifted a height H E L to move one step forward S E L in Phase 5. The CoM swung from the right back to the middle, with both feet on the floor, in Phase 6.

The turning direction φ was also involved in the designed gait pattern to increase the flexibility of the humanoid robot. When humans change direction, it is natural for them to rotate their legs.
Hence, the movement oscillators were related to the turning direction of the humanoid robot to generate the trajectories. The turning direction of the humanoid robot is shown in Figure 13, and it could also be assigned a starting mode, gait cycle mode, and ending mode, which in total contained six phases (p1-p6). If the left foot moved forward and the right foot was on the floor in the complete walking process, the turning left direction could be executed, as shown in Figure 13a. Similarly, if the right foot moved forward and the left foot was on the floor in the complete walking process, the turning right direction could be executed, as shown in Figure 13b. The turning direction was distributed to both feet, the moving foot and the foot on the floor, to rotate the legs (φ L , φ R ) in a ratio of three to seven, with the distribution mirrored between the turning left and turning right directions. In this way, the designated region could be effectively reached using the turning direction. The parameter set of the oscillator-based gait pattern with the period of a walking step T in the walking process is shown in Table 3. Trajectories and footprints with the turning direction are shown in Figure 14.
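The three-to-seven distribution of the turning direction between the two legs can be sketched as below; since the source gives only the ratio, which leg receives which share is an assumption here.

```python
def split_turning(phi, moving_share=0.3):
    """Distribute the turning direction phi between the moving leg and
    the support leg in a 3:7 ratio. The assignment of the smaller share
    to the moving leg is an assumption for illustration."""
    return moving_share * phi, (1.0 - moving_share) * phi
```

For a commanded turning direction of 10°, the two legs would rotate by 3° and 7° respectively under this split; mirroring the assignment gives the opposite turning direction.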

Learning the Straightforward Gait Pattern
In this paper, a flat terrain was adopted for the humanoid robot to learn the straightforward gait pattern. Most gait patterns are designed assuming an ideal situation, in which the mechanism and motors are working well. However, the long-term operation of the humanoid robot may result in mechanism error and motor backlash. Moreover, the real environment also causes the humanoid robot to exhibit some unexpected behaviors. As shown in Figure 15, the target region (yellow area) was placed in front of the robot, and the robot started from the initial position (green area). In an ideal situation, the humanoid robot could walk straight to reach the target region, as shown in Figure 15a. In a realistic situation, the humanoid robot could not walk straight and could not reach the target region, as shown in Figure 15b. Hence, the Q-learning algorithm was adopted to adjust the turning direction φ, allowing the robot to walk straight to reach the target region from the initial position according to the environmental information.
The Q-learning algorithm is a well-known model-free reinforcement learning method, and it employs the concept of the Markov decision process (MDP) with finite states and actions [15,22]. An optimal policy can be learned by using Q-learning to maximize the expected reward [14]. During the learning process, an action is taken by the agent, which interacts with the environment to move from one state to another.
After taking an action a in state s, the policy can be updated through an action-value function Q(s, a). A Q-table is composed of Q-values, which are designed and evaluated by the action-value function Q(s, a) for the agent. The Q-value for state s and action a is updated as follows [12,14,16]:

Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]

where α and γ are the learning rate and discount factor, respectively, r is the reward evaluated after taking action a in state s, s' is the next state after taking action a in state s, and max_{a'} Q(s', a') denotes the maximum future Q-value, while an ε-greedy policy is used to occasionally choose a random action. The pseudo-code of the Q-learning algorithm is shown in Table 4. Table 4. Pseudo-code of the Q-learning algorithm.
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_{a'} Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal

The proposed learning framework with the Q-learning algorithm is shown in Figure 16. The FPGA chip allowed the agent to learn the straightforward gait pattern, and the automatic training platform worked to follow and train the robot. In order to adjust the turning direction φ using the Q-learning algorithm, three elements of the Q-learning algorithm were defined and designed to update the Q-values of the Q-table: (1) state (s), the environmental information measured by the infrared sensors installed on the automatic training platform to give the position of the humanoid robot in the training field; (2) action (a), the turning direction φ selected according to state s for the gait pattern of the humanoid robot; (3) reward (r), the learning guideline dependent on state s and action a to strengthen or weaken the selected action. Figure 16. Proposed learning framework with Q-learning algorithm.
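As a concrete illustration of the update rule and ε-greedy selection above, the following Python sketch runs tabular Q-learning on a toy corridor task; the state count, actions, rewards, and hyperparameter values are illustrative placeholders, not the settings used on the robot.

```python
import random

def q_learning(n_states=6, n_actions=3, episodes=200,
               alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor: the agent starts in state 0;
    action 2 moves one state forward, the other actions stay in place, and
    reaching the last state ends the episode with reward +1."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:                      # episode ends at the goal
            if rng.random() < epsilon:               # epsilon-greedy exploration
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda i: Q[s][i])
            s2 = s + 1 if a == 2 else s              # toy transition model
            r = 1.0 if s2 == n_states - 1 else 0.0   # reward only at the goal
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
    return Q
```

After training, the greedy action in the start state is the forward action, mirroring how the robot's Q-table converges on the turning directions that keep it on course.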

State for the Straightforward Gait Pattern
In the learning process, the automatic training platform was adopted not only for supervision to protect the humanoid robot, but also to obtain the current environmental information of state s required by the Q-learning algorithm. As shown in Figure 17, there were 60 total states in the coordinate system of the training field. The green area denotes the initial position, i.e., the start point of the robot. The yellow region denotes the target region that needs to be reached from the initial position after passing the blue line, which denotes the target distance. Similarly, the red color denotes the danger regions, or the boundary of the automatic training platform, which the robot must not reach. These 60 states represent the current position of the robot in the training field. The states can be obtained as follows: where d_x and d_y are the x-axis and y-axis distances of the robot in the training field measured using the two infrared sensors.
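The mapping from the two measured distances to one of the 60 states can be sketched as a simple grid discretization; the 10x6 grid shape and the cell size below are assumptions for illustration, since the text states only that 60 states tile the training field.

```python
def state_index(d_x, d_y, cell=25.5, cols=10, rows=6):
    """Map the infrared-measured distances d_x, d_y (cm) to a discrete
    state in [0, rows*cols). Cell size and grid shape are assumed values."""
    col = min(int(d_x // cell), cols - 1)
    row = min(int(d_y // cell), rows - 1)
    return row * cols + col
```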


Action for the Straightforward Gait Pattern
In order to reach the target region from the initial position, the turning direction φ of the humanoid robot was designated as action a by the Q-learning algorithm. There were a total of nine actions that could be selected, as shown in Table 5. In addition to the value 0, four levels labeled minor (value 1), middle (value 2), major (value 4), and urgent (value 7) were designed to allow the robot to walk straight to the target region. These four levels included positive (+) and negative (−) values to realize the turning left direction and turning right direction for the robot, as shown in Figure 18, while the value 0 represented walking straight. However, only one action a could be selected based on the obtained state s to estimate an appropriate policy in the training field. Table 5. Actions of the Q-learning algorithm.

Figure 18. Turning direction with nine actions.
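The nine actions and the three-to-seven split of the selected turning direction between the two legs can be sketched as follows; which leg receives the 0.3 share is an assumption here, since the text states only the ratio.

```python
# The nine turning-direction values; the sign selects turning left or right.
ACTIONS = [-7, -4, -2, -1, 0, 1, 2, 4, 7]

def split_turning(phi, moving_share=0.3):
    """Distribute the turning direction phi between the moving foot and the
    foot on the floor in a 3:7 ratio (the assignment to feet is assumed)."""
    return moving_share * phi, (1.0 - moving_share) * phi
```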

Reward for the Straightforward Gait Pattern
In the learning process, after a selected action a is taken by the agent and applied in the environment, a reward r is returned to the agent. The learning guideline offered a reward to implement the straightforward gait pattern. If a good reward was returned, the selected action was strengthened; similarly, if a bad reward was returned, the selected action was weakened. Hence, the reward was used to update the policy. Positive and negative rewards were designated in the target region and danger region, respectively. In this way, the humanoid robot was attracted toward the target region or repelled from the danger region to achieve the straightforward gait pattern. In addition, the time of one learning process, t, was involved in the reward so that the humanoid robot would walk approximately in a straight line and reach the target region, as shown in Figure 19. The reward can be established as follows: where t is the time of one learning process and is greater than 0.
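The reward scheme described above can be sketched as follows; the magnitudes and the 1/t time weighting are illustrative assumptions, since the text specifies only a positive reward in the target region, a negative reward in the danger region, and a dependence on the episode time t > 0.

```python
def reward(in_target, in_danger, t):
    """Illustrative reward: positive in the target region (scaled down for
    slower episodes via 1/t), negative in the danger region, else zero.
    The constants are placeholders, not the paper's values."""
    if in_target:
        return 100.0 / t
    if in_danger:
        return -100.0
    return 0.0
```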



Experimental Results
The performance of the proposed learning framework is illustrated in this section. The straightforward gait pattern was learned for the humanoid robot using an FPGA chip and an automatic training platform in a training field. The real learning process of the proposed learning framework is demonstrated with four states in Figure 20. In the start state, the humanoid robot was suspended and then slowly lowered by the automatic training platform in the initial position, as shown in Figure 20a,b. In the operation state, the humanoid robot was followed by the automatic training platform when walking from the initial position to the front coordinate of the training field, as shown in Figure 20c,d. In the end state, the humanoid robot reached the end position and then was pulled up by the automatic training platform, as shown in Figure 20e,f. In the return state, the humanoid robot was returned by the automatic training platform to the initial position, as shown in Figure 20g,h. The turning direction was adjusted by the Q-learning algorithm and the walking path of the humanoid robot could also be recorded in this learning process.

Based on the proposed learning framework, a total of 594 episodes were executed to learn the straightforward gait pattern for the humanoid robot. The target region, centered at (229.5 cm, 59.5 cm), was located in front of the initial position (25.5 cm, 59.5 cm), where the humanoid robot began walking in each episode. The target distance was where the x-coordinate of the training field was 221 cm. In the learning process, an episode was terminated when the humanoid robot reached the danger region or the target region. The Q-table could be updated by selecting the turning direction according to the position of the robot in the training field. The walking paths of the humanoid robot in these 594 episodes were recorded to analyze the learning process, and they could be divided into three stages: (1) the initial stage, (2) the middle stage, and (3) the final stage.


Initial Stage of the Learning Process
Episodes 0 to 200 represented the initial stage of the learning process, as shown in Figure 21. Episode 0 shows that the humanoid robot could only walk in a straight line to approximately half of the target distance, as shown in Figure 21a. After a few learning processes, episode 81 shows that the humanoid robot could reach the target region, as shown in Figure 21b. However, most episodes in the initial stage, such as episodes 145 and 195, show that the humanoid robot still could not reach the target distance, as shown in Figure 21c,d.

Middle Stage of the Learning Process
Episodes 201 to 400 represented the middle stage of the learning process, as shown in Figure 22. Episode 247 shows that the humanoid robot could gradually reach over half of the target distance, as shown in Figure 22a. After a few learning processes, episode 290 shows that the humanoid robot could reach the target region, as shown in Figure 22b. However, most episodes in the middle stage, such as episodes 344 and 386, show that the humanoid robot still could not reach the target region, as shown in Figure 22c,d.

Final Stage of the Learning Process
Episodes 401 to 594 represented the final stage of the learning process, as shown in Figure 23. Episode 431 shows that the humanoid robot could gradually approach the target region, as shown in Figure 23a. After a few learning processes, episode 466 shows that the humanoid robot could reach the target region, as shown in Figure 23b. Moreover, most episodes in the final stage, such as episodes 546 and 594, show that the humanoid robot could not only reach the target region, but also walk approximately in a straight line, as shown in Figure 23c,d. Hence, the straightforward gait pattern was learned in this stage.
The recorded walking path could be analyzed based on the walking distance and the lateral offset. The walking distance was denoted by the horizontal length along the x-coordinate from the initial position to the end position. The lateral offset distance was the offset length compared with the straight line representing the walking distance. In the initial stage, the average walking distance was 95.4204 cm, which was far from the target region, and the average lateral offset distance was 22.8071 cm, which was also far from a straight line over this walking distance. In the middle stage, the average walking distance was 100.0183 cm, which approached the target region, and the lateral offset distance was 21.0969 cm, which also approached a straight line over this walking distance. In the final stage, the average walking distance was 148.7788 cm, which was closer to the target region, and the lateral offset distance was 14.8387 cm, which was closer to a straight line over this walking distance, within a unit coordinate of the training field. The detailed average experimental results for each stage are shown in Table 6. The final Q-table of the straightforward gait pattern is shown in Table 7.
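The two path metrics used in this analysis can be sketched as follows; treating the lateral offset as the maximum deviation from the starting y-coordinate is one plausible reading of the description, not the paper's stated formula.

```python
def path_metrics(path):
    """Given a recorded walking path as (x, y) points in cm, return the
    walking distance (x-span from start to end) and the lateral offset
    (maximum |y - y_start| along the path)."""
    x_start, y_start = path[0]
    walking_distance = path[-1][0] - x_start
    lateral_offset = max(abs(y - y_start) for _, y in path)
    return walking_distance, lateral_offset
```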

Conclusions
In this paper, the Q-learning algorithm was applied to learn a straightforward gait pattern for a humanoid robot based on an automatic training platform. There were four main contributions of this research. Firstly, an automatic training platform, which was an original idea, was proposed and implemented so that the humanoid robot could learn the straightforward walking gait in a real situation. Moreover, it could be used to reduce the required human effort and protect the humanoid robot in the training process. Secondly, a learning framework was proposed for the humanoid robot based on the proposed automatic training platform. Thirdly, an oscillator-based gait pattern was designed and combined with the proposed learning framework to reduce the number of learning parameters and speed up the learning process. Lastly, the Q-learning algorithm was applied in the proposed learning framework to allow the humanoid robot to learn the straightforward walking gait in a real situation. The proposed learning framework and automatic training platform were fully tested on a real small-sized humanoid robot, and an experiment was set up to verify their performance. In the learning process, the walking distance kept increasing, which shows that the humanoid robot could learn to walk toward the target region. Similarly, the lateral offset distance kept decreasing, which shows that the humanoid robot could walk in an increasingly straight pattern. From the experimental results of successful bipedal locomotion with a straightforward gait pattern, the feasibility of the proposed learning framework and automatic training platform was validated. Hence, the desired behavior could be learned by the intrinsically unstable humanoid robot using the proposed learning framework, which reduces human effort through the automated learning process based on the proposed automatic training platform.
The main purpose of this paper was to enable the robot to learn the straightforward gait pattern. Once the robot is able to walk straight, the gait can be combined with localization algorithms, such as simultaneous localization and mapping (SLAM) and particle filters, in the future. The successfully learned straightforward gait pattern can be used in the localization algorithm to enable the robot to actually reach a specified position. Moreover, deep reinforcement learning can be designed and deployed in the proposed learning framework via the FPGA chip.

Conflicts of Interest:
The authors declare no conflicts of interest.