Control of the Acrobot with Motors of Atypical Size Using Artificial Intelligence Techniques

An acrobot is a planar robot with a passive actuator in its first joint. The main purpose of this system is to make it rise from the rest position to the inverted pendulum position. This control problem can be divided in the swing-up issue, when the robot has to rise itself by swinging up as a human acrobat does, and the balancing issue, when the robot has to maintain itself in the inverted pendulum position. We have developed three controllers for the swing-up problem applied to two types of motors: small and large. For small motors, we used the State-Action-Reward-State-Action (SARSA) controller and the proportional–derivative (PD) controller with a trajectory generator. For large motors, we propose a new controller to control the acrobot—a pulse-width modulation (PWM) controller. All controllers except SARSA are tuned using a genetic algorithm.

The acrobot control problem is to make it rise from the rest position to the inverted pendulum position.As the first and second link of an acrobot are coupled, the control problem is split in two [3,4,[6][7][8][9]12]: the swing-up problem, in which the acrobot has to rise itself above one meter as a human acrobat does; the balancing problem, in which it has to maintain itself in the inverted pendulum position.Other researchers have solved the whole problem with only one controller [5,10,13].
For the swing-up problem, Duong [4] proposed to use a NeuroController, Spong [9] and Brown [3] designed a proportional-derivative (PD) controller with a trajectory generator, and Mahindrakar [7] used an energy-based controller.Ueda [12] proposed a decision-making algorithm with several limitations of memory to solve the swing-up issue on acrobots with the capabilities available in a small robot.However, Sutton [10] solved the problem using reinforcement learning.He proposed State-Action-Reward-State-Action (SARSA)-an algorithm that learns by a trial and error method.SARSA uses a table (Q(s, a)) to save the states s that the acrobot visited, the actions a that took on that state, and the performance of taking that action on that state.Nichols [8] compared three other methods of reinforcement learning to solve the swing-up issue: Continuous Actor Critic Learning Automaton (CACLA), NerlderMead-SARSA (NM-SARSA) and NerlderMead-SARSA.He claims that NerlderMead-SARSA is faster than many other controllers in the literature.The second joint is on blue.The angle of the first link is called q 1 and the angle of the second link is q 2 .The solutions proposed for the balancing problems use a linear-quadratic regulator (LQR) controller [3,4,7,9] or a fuzzy controller [3].Duong, Brown, Mahindrakar, and Spong [3,4,7,9] tuned the parameters of the controllers using a genetic algorithm.
The acrobot problem was solved by Duong [5], as an extension of his previous work [4] with a NeuroController, and by Zhang [15] which used a spiking neural network with an LQR.Both used a genetic algorithm to improve the performance of their algorithms.Horibe [6] also solved the acrobot problem by a feedback law that was obtained by numerically solving a Hamilton-Jacobi equation by the stable manifold method.
Those papers proposed different methods to solve the acrobot problem, avoiding the saturation of the motors.The controllers made by Sutton [10] (SARSA) and Spong [9] (PD controller) can work with motors with low maximum torque, but none of them tried to take advantage of the saturation of motors with high maximum torque.
In this paper, we propose two variants of the SARSA algorithm [10]: to initialize the algorithm with zeros or non-zero values and to introduce some error on the sensors of the acrobot.As the PD controller proposed by Spong [9] is not a traditional PD controller, we have also compared his PD controller with a canonical PD controller.Finally, we propose a novel pulse-width modulation (PWM) controller that uses the saturation of a large motor to control the acrobot in any operation point of the first joint.

Acrobot Model
All the controllers were tested on a simulated acrobot.The parameters of our acrobot are shown in Table 1.The equations of motion of the system are the same as those of a planar robot without the input torque on the first joint: where M(q) is the inertia matrix, C( q, q) is the acceleration of Coriolis, G(q) the gravitation terms, q1 and q2 are the accelerations (rad/s 2 ) of the first and second joint, q1 and q2 are the velocities (rad/s) of the first and second joint, q 1 and q 2 are the position (rad) of the first and second joint, and τ 2 is the input torque on the second joint.The other terms on the Equation (1) are shown in the Equations (2).
The control time used for all the controllers is T = 50 ms.The calculus of q 1 and q 2 is performed four times between one control input and the next one.The positions of the first and second joint (q 1 ,q 2 ) are between [0, 2π), the velocity of the first joint ( q1 ) is between [−4π, 4π], and the velocity of the second joint ( q2 ) between [−9π, 9π].
To control an acrobot with these parameters, a motor with a maximum torque of ±10 Nm can be used.We used smaller motors for the SARSA and the PD controller (±1 Nm), and larger for the PWM controller (±300 Nm).The sensors are simulated, with an added error of 5% of the range of each dimension (only used for the SARSA controller).The sensors are thus able to read the position and the velocities of the first and the second joint, but not the accelerations.

State-Action-Reward-State-Action (SARSA) Controller
SARSA [10] is a reinforcement learning algorithm that can learn through experience (on-line learning), which can take the best action at each moment to achieve its goal.Generally, it uses a table Q(s, a) that associates states and actions, which stores the weights of how good an action a is at a state s.The SARSA algorithm is shown in Algorithm 1.
To create Q(s, a), it is necessary to discretize the range of each dimension in parts, in order to make the infinite range of the values a finite range of possible states.The combination of the discretized dimensions is called "tiling".It is possible to use more than one tiling, changing the updating rule for: Sutton [10] used 48 tiles (12 with 4 dimensions, 12 with 3 dimensions, 12 with 2 dimensions, and 12 with 1 dimension).The position and the velocity range are split in six intervals, but the dimensions of the velocities are offset by a random fraction of interval, so they have seven intervals.
As we have used a small motor with a maximum torque of ±1 Nm, the only possible actions are {−1, 0, 1}.The reward r used is −1 until the end of the acrobot is above 1 m.For the election of the action a, a greedy policy (with = 0) is used, because a bad move could end a set of good moves.A learning rate (α) is equal to 0.2/48 because a low value (0.2) saves the old information learned while the system continues learning, and it is divided by 48 because we use 48 tiles.The discount factor is high (γ = 1) to search for long-term reward.
In Section 4.1, we have compared the results (Figure 2) obtained when Q(s, a) are initialized with zero (Figure 2a,b) or non-zero (Figure 2c,d) values, as two possible ways to initialize Q(s, a) arbitrarily, and when an error of 5% of the range on q 1 , q 2 , q1 , and q2 (Figure 2b,d) are introduced to the sensors.

Proportional-Derivative (PD) Controller
Spong [9] proposed a PD controller to solve the swing-up problem: where K p and K d are the proportional and derivative terms of the PD controller, and the terms d 22 , h 2 y φ 2 are: This equation is not exactly a PD controller.The canonical controller used is: If the torque compensation is taken apart: However, a PD controller uses the following formula: where e k = q d 2 − q 2 .When these two formulas are equated to compare if there are differences: We obtain that the last Equation ( 18) is not always true.In Equation ( 4), the desired velocity (right part of (18)) of the second joint qd 2 has a constant value (=0); meanwhile, the classic PD controller computes that value as the derivate of the error (left part of (18)).
The controller is tuned by a genetic algorithm.The fitness is the needed time of the acrobot to go above 1 m.
where y acrobot is the position on the y axis of the end of the acrobot, h is the threshold of one meter to achieve the goal of the swing-up problem, and t is the time when the acrobot go above the threshold.
If the acrobot controller cannot solve the swing-up problem, the fitness is very high.The best third part of each population (those which have less fitness) is maintained for the next generation (33% of elitism).A linear ranking selection (Equation (20)) is a selection operator that selects the parents of the reproduction operator using a probability that depends on their rank.This operator selects a third part of the population for reproduction, in order to obtain another third part of the next population.
where i is an individual, prob i is the probability of selection, I i is the fitness of the individual i, and λ is the number of individuals in the population.
The crossover operator used is a BLX-α with α BLX = 0.5.The last third part of the population is created randomly.
where i is the gene of an individual, H i is the range of the gene of the children, and a i and b i are the genes of the parents.The mutation operator is not being used because the elitism and the BLX-α operator gave enough diversity to the algorithm.
The population has 60 individuals, 40 iterations to converge, and 60 s to achieve the goal of going over one meter.Each individual has three genes that correspond to [α PD , K p , K d ], whose values are real, between ([0, 1],[0, 60],[0, 5]) and are initializated randomly.

Pulse-Width Modulation (PWM) Controller
The PWM controller is the new method that we propose to control an acrobot with a large motor in the shortest time.To make the new controller, we use a genetic algorithm with the same selection, crossover, and elitism operators that the genetic algorithm used to tune the PD controllers.The maximum torque of the large motor needed is at least ±200 Nm, but we have used one with ±300 Nm.The structure of this controller is composed of a torque compensation with four proportional integral (PI) controllers-one for each dimension.The formula is: where x are the dimensions [q 1 , q 2 , q1 , q2 ], K x p and K x i are the constant of the controller of the x dimension, and e x k is the error of the x dimension (equal to the desired value less the real value).The desired values for [q 2 , q1 , q2 ] are 0, while the value for q 1 could be any value (in this case, π).
The population of the genetic algorithm is coded in eight real parameters that are between [0, 10 5 ].This controller learned how to solve the balancing problem, so the acrobot starts on q 1 = pi − 0.01 and q 2 = 0.If the end of the acrobot falls below the origin (0 m), it means that the fitness is very high (10 6 ).From that number, we subtract the time that the acrobot was above that line.By using a linear ranking selection, the genetic algorithm will rank the individuals according to the following criteria: those who can stand above the base of the acrobot at all times would be considered as the best, and then those who can stand longer above the base of the acrobot would follow along.Those controllers that can solve the balancing problem have a fitness equal to: where x i j is the value of the axis x of the j th link at the i t h step of time, y i j is the value of the axis y of the j th link at the i t h step of time, x d j is the desired value of the axis x of the j th link, y d j is the desired value of the axis y of the j th link, F i is the input torque at the i t h step of time, and w j are the weights of preference to reduce.The desired position is x 1 = 0, y 1 = 1, x 2 = 0, and y 2 = 2.The weights used are [w 1 = 1, w 2 = 10, w 3 = 0.1, w 4 = 0.1, w 5 = 10 −5 ]/(number of steps).
This controller is capable of controlling the acrobot at any value of q 1 and maintaining it in that position.Thus, the swing-up problem can be solved without swinging-up the acrobot, being able to go above the threshold h having restrictions on q 1 .This paper proposes this controller as an emergency method for the robot if the path for going up is blocked.On the one hand, it could control the acrobot quickly to follow a reference for the first link and maintain it on any point on a dynamic equilibrium.On the other hand, this controller makes the acrobot have zero damping.Using this method to swing-up and then change to a linear-quadratic regulator (LQR) controller for the balancing problem is then recommended.

Results
We have made a simulated acrobot to test the controllers.The simulated acrobot was tested with the SARSA, PD, and PWM algorithms.On SARSA, we have compared how quickly the algorithm learns how to solve the swing-up problem with different variations.The PD proposed by Spong [9] and a technically correct PD controller were compared.The new controller that we propose (the PWM controller) was tested to observe the performance solving the swing-up problem.

SARSA
The SARSA algorithm was tested ten times in each experiment in order to obtain significant results of each one.Four experiments were compared: the SARSA algorithm with no error on sensors and having the matrix Q(s, a) initialized with zero values; with an added error of 5% to the output of the sensors; having the matrix Q(s, a) initialized with non-zero values; with an added error and having Q(s, a) initialized with non-zero values.The results are depicted in Figure 2.
Those experiments which have an added error (Figure 2b,d) do not converge, but in this case the algorithm is more robust to changes.Those controllers that do not have an added error (Figure 2a,c) converge without any noise.In the case of the random initialization (Figure 2c,d), it converges before, but it needs more steps to achieve the goal.Finally, the initialization with zero values (Figure 2a,b) is smoother and it takes less time to go above one meter.

PD Controller
The value of the gains for the PD controller of Spong [9] are α PD = 1.93,K p = −15.9289and K i = −14.7768.Although the range in which the population is generated by the genetic algorithm did not have negative values, the BLX-α operator found better values outside this range.
The behaviour of this controller is shown in Figure 3. Almost all the time, the torque is saturated (Figure 3g) because of the low maximum torque.The controller lasts 13.3 s (Figure 3f) to achieve the goal, but the balancing controller should begin to control the acrobot at 17.5 s, when the red lines of the first graphic (Figure 3f) are exceeded (those lines are at π − 0.8 and −π + 0.8).Fortunately, at 17.5 s the acrobot is in a good position to be controlled by the balancing controller.The conventional PD controller tuned by the genetic algorithm has the gains α PD = −0.98,K p = 8.65 and K i = −0.31,and the results are shown in Figure 4.
The first time the controller achieves the goal is at the 18th second (Figure 4f) , but the balancing controller cannot control on that point because the red line of the first graph (Figure 4a) has not been exceeded.The conventional PD controller cannot control the acrobot, nor can the PD controller of Spong [9] with such a low maximum torque.

PWM Controller
The gains obtained by the genetic algorithm are shown in Table 2.These gains have the same order of magnitude as the ones we expected.The negative gains K i are not usual in the classical control, but the genetic algorithm found that those are the best values for this problem.The result of this controller is shown in Figure 5.
The controller takes three seconds (Figure 5a) to surpass the red line, where the balancing controller starts to handle the acrobot.Additionally, the acrobot is at a good position for the balancing problem, as the end of the acrobot is close to the two meters on the x axis (Figure 5f) and the angle of the second link is close to 0 (Figure 5b).On the second link's position in the y axis (Figure 5f), most of the time the acrobot is above one meter, so this controller is also valid to solve the balancing problem.
Another advantage of this controller is its ability to maintain the acrobot on impossible points until now, as the q 1 = π/2 and q 2 = 0, on horizontal position (Figure 6).In this operation point, the acrobot cannot be motionless.The PWM controller can work with the two joints at the same time and control the acrobot in this point.The task of maintaining the acrobot horizontal is achieved, as shown in the graph of the position on the x axis of the first (Figure 6c) and second (Figure 6e) links.The amplitude of the oscillations can be reduced with a lower time of control, but cannot be canceled.

Conclusions
In this paper, we prove that the acrobot can be controlled with a very low actuation through SARSA and a modified PD, but not by a conventional PD controller.We have also explored two variants of SARSA, and we found that the algorithm with a low error is not able to converge, and the algorithm converges more quickly if the Q(s, a) is initialized randomly, but finds worse solutions than if the initialization is with zero values.
Furthermore, we have developed a new PWM controller to control the acrobot with a large motor.This controller has large oscillations, but it lets the acrobot solve the acrobot problem in a shorter time than many approaches from the literature.With this new controller, it is also possible to maintain the acrobot in new operation points.
Moreover, we developed a genetic algorithm to tune both PD and PWM controllers, and it found the values of the controller parameters that solve the swing-up problem.
As further work, we are exploring new controllers to improve the performance our novel PWM controller, reducing the oscillations and the maximum torque needed.

Figure 1 .
Figure 1.Image of a simulated acrobot.This robot has a passive actuator on the first joint (on red).The second joint is on blue.The angle of the first link is called q 1 and the angle of the second link is q 2 .

Algorithm 1 : 3 s 4 5 for each step of episode do 6
SARSA algorithm 1 Initialize Q(s, a) arbitrarily.2 for (each episode:) do Initialize Choose a for s using policy derived from Q Take action a, observe r, s 7 Choose a for s using policy derived from Q 8

Figure 3 .
Figure 3. Results of the proportional-derivative (PD) controller by Spong to the swing-up problem.In the first row, the positions of the angle of the first (a) and the second (b) joints are shown in radians.The red line in (a) is the moment when the other controller starts to control for the balancing problem.The red line in (b) is the desired angle of the second link.In the second row, there are the positions on x (c) and y (d) of the first link and on x (e) and y (f) of the second link.The red lines in (d) and (f) are the goal height (1 m) of the swing-up problem.In the last row, the input torque is represented (g).

Figure 4 .
Figure 4. Results of the conventional PD controller to the swing-up problem.In the first row, the positions of the angle of the first (a) and the second (b) links are shown in radians.The red line in (a) is the moment when the balancing controller starts to control for the balancing problem.The red line in (b) is the desired angle of the second link.In the second row, there are the positions on x (c) and y (d) of the first link and on x (e) and y (f) of the second link.The red lines in (d) and (f) are the goal height (t = 1 m) of the swing-up problem.In the last row, the input torque is represented (g).

Figure 5 .
Figure 5. Results of the pulse-width modulation (PWM) controller to solve the acrobot problem.In the first row, he positions of the angle of the first (a) and the second (b) links are shown in radians.The red line in (a) is the moment when the balancing controller starts to control for the balancing problem.The red line in (b) is the desired angle of the second link.In the second row, there are the positions on x (c) and y (d) of the first link and on x (e) and y (f) of the second link.The red lines in (d) and (f) are the goal height (1 m) of the swing-up problem.In the last row, the graph (g) shows in blue the applied torque of the motor, and in red the same torque after a low-pass filter (the average of 10 values).By this method, it is possible to see the effect of the PWM, changing the frequency of the input signal.

Figure 6 .
Figure 6.Results of the PWM controller to maintain the acrobot horizontal.In the first row, the positions of the angles of the first (a) and the second (b) links are shown in radians.The red line in (a) is the moment when the balancing controller starts to control for the balancing problem.The red line in (b) is the desired angle of the second link.In the second row, there are the positions on x (c) and y (d) of the first link and on x (e) and y (f) of the second link.The red lines in (d) and (f) are the goal height (1 m) of the swing-up problem.In the last row, the input torque is represented (g).

Table 1 .
Parameters of the acrobot.

Table 2 .
Value of the gains of the PWM controller.