A Wheeled Inverted Pendulum Learning Stable and Accurate Control from Demonstrations

: In order to enable robots to be more intelligent and ﬂexible, one way is to let robots learn human control strategy from demonstrations. It is a useful methodology, in contrast to traditional preprograming methods, in which robots are required to show generalizing capacity in similar scenarios. In this study, we apply learning from demonstrations on a wheeled, inverted pendulum, which realizes the balance controlling and trajectory following simultaneously. The learning model is able to map the robot position and pose to the wheel speeds, such that the robot regulated by the learned model can move in a desired trajectory and ﬁnally stop at a target position. Experiments were undertaken to validate the proposed method by testing its capacity of path following and balance guaranteeing.


Introduction
After decades of development, robots have become the partners of human beings not only in the workplace, but also at home. Many efforts have been made to bring robots closer to humans. Making robots more human-like means more than making them look like humans; more importantly, it means granting them human-like behaviors. The humanoid robot seems naturally friendly because of its anthropomorphic appearance. However, the two legs of a humanoid robot might encounter many difficulties in motion because of unprecedented disturbances in complex environments. In contrast, the wheeled inverted pendulum (WIP) robot is much easier to control with fewer actuators, and would be more stable and flexible than biped robots, which makes it promising for working with humans [1,2]. Balance control and motion planning are two important issues for WIP robots. Nevertheless, to systematically solve them both is not trivial [3,4].
In various tasks undertaken by WIP robots (including mobile robot motion planning, keeping balance, and even manipulation, grasping, and human-machine cooperation with loaded robotic hands, etc.), a pre-designed controller is usually a prerequisite for robots to execute designated motions in a structured environment [5][6][7]. However, coding for simply one task is a heavy workload, not to mention that all the tasks should be addressed simultaneously by designing controllers. Furthermore, due to the fixed pattern of a manually designed controller, it is difficult for a robot to generalize to slight changes of workspace or sudden external perturbations. In this context, a methodology is desirable if it enables the robot to extract the features of the human control strategy such that it can generalize to similar tasks under the same kinds of working configurations [8].
Learning from demonstration (LfD) is one such a methodology that satisfies the above-mentioned requirements [9,10]. Specifically, LfD consists of three procedures [11], which are orderly collecting human demonstrations for the task, learning model parameters from the demonstration examples, and reproducing with the learned control model. By straightforwardly feeding the robot with demonstrations related to specific tasks, a controller implicating the human control strategy can be automatically learned, which benefits the amateur users without professional knowledge of how to design a controller. Additionally, the learned model has generalization capacity to guarantee that the robot can adapt to similar tasks, therefore avoiding the trouble of controller redesigning [12,13].
In this study, we focus on applying LfD on a wheeled inverted pendulum ( Figure 1). A controller is automatically learned from demonstrations, which should be able to regulate the two-wheeled robot to move along a path similar to the demonstrated counterpart and to converge to a desired position. Additionally, the control process should be effective under both the nonholonomic constraints of two-wheeled mobile robots and the underactuated constraints that guarantee the balance of the wheeled inverted pendulum. In order to achieve the above-mentioned control requirements, we provide a learning framework based on our previous work of learning stable and accurate dynamic system for manipulators. Different from the previous work, the proposed learning framework can handle the nonholonomic control problem of mobile robots. Meanwhile, we propose an online, state-variable estimating method that is applied during the reproduction process. The method is necessary because the nonholonomic constraints of the mobile robot and the unactuated constraints possibly deviate the robot from the expected direction. In this situation, the proposed estimating method can adjust the extra-dimensional component in the previous work appropriately according to the real robot positions, such that the trajectory accuracy can still be kept on a nonholonomic mobile robot. The contributions in this paper can be, therefore, listed as follows: 1. We provide a multi-objective learning framework to model an accurate and stable controller for a wheeled inverted pendulum, which has to take into consideration both the nonholonomic constraints of mobile robots and the underactuated constraints for balance.
2. A real-time estimating method for the extra-dimensional component (used in the reproduction process) is proposed based on a proposed constrained optimization problem, which changes the situation that the extra-dimensional component in the previous work cannot be influenced by the change of the robot positions (which is important for nonholonomic mobile robots to maintain trajectory accuracy).
The remainder of this paper is structured as follows. Section 2 introduces the related work. Section 3 formulates the problem investigated in this paper. Specific descriptions of the proposed approach are given in Section 4. Simulations and experiments are provided to validate the proposed method in Section 5. Finally, Section 6 concludes the contribution of this paper.

Related Work
By conventional control methodologies, the robot task is usually encoded by manually designing a controller [14][15][16]. In order to acquire more complicated control strategy from humans automatically, learning approaches are applied to yield a controller in the data-driven fashion. Spline decomposition methods [17,18] are presented to construct the robot regulators by computing the point-wise averages of the demonstrated trajectories. These methods are efficient and simple to execute, but they depend mainly on the human heuristics for trajectory synthesis.
Statistical approaches [19,20] are also prevailing for modeling robot motions. These approaches usually need a reasonable heuristic for controller programming, therefore, being possibly influenced by external perturbations. These approaches are close but still unable to generalize to similar task cases in that the choice of heuristic is decided based on specific task parameters.
The dynamic movement primitives (DMP) were proposed in [21] to effectively address the instability problem. Its main idea is to approximate a nonlinear system by several linear counterparts. In contrast to its effect on the aspect of stabilization, DMP did not emphasize on the accuracy factor when modeling complicated tasks or handling complex scenarios.
Dynamic systems (DS) based methods [22,23] were developed as a promising alternative to traditional control approaches due to its convenience, which can extract human control strategies accurately and automatically from given demonstrations. Several methods choose to model dynamic systems without considering stability constraints, which makes them possibly unstable at a target position [24][25][26]. A stable estimator of a dynamic system (SEDS) was proposed in [27] with Gaussian mixture models (GMMs) to incorporate both accuracy and stability factors. SEDS uses the Lyapunov stability condition as a constraint to optimize the parameters in the control model. It effectively captures the dynamics features from human demonstrations whilst stabilizing the robot motions. Fast and stable learning of dynamic system method (FSM-DM) in [28] handles three factors for learning a dynamic system, which includes one more factor of learning speed in addition to accuracy and stability. FSM-DM is able to teach a system fast, which is advantageous in practice when efficiency is a important requirement.
In order to reduce the influence of the accuracy against the stability dilemma, the algorithm of control-Lyapunov-function-based dynamic movements (CLF-DM) was proposed in [29]. This method is divided into three steps. First, A parameterized Lyapunov function candidate is roughly taught to be consistent with the data. Second, the control features reflected by the demonstrations are learned to guarantee the accuracy factor. Third, an online correction technique is developed to stabilize the reproduced trajectory on the fly. The neurally imprinted stable vector fields (NIVF) technique was presented as a learning framework to incorporate stable vector fields into the extreme learning machine (ELM) [30], which reproduces more accurate motions and is finally stable at the target position. The approach proposed in [31] considered using ELM network to train a Lyapunov candidate based on a specific task. The learned Lyapunov candidate is consistent with the motion trajectories related to the task; therefore, making the stability constraint based on it more slack. Accordingly, the accuracy of the learned model is also less influenced by the stability factor. τ-SEDS was presented in [32] to handle the trade-off of stability and accuracy by means of diffeomorphism. It transforms the original robot motions into those in a new space, where the transformed motions are consistent with a quadratic Lyapunov candidate. The method (MIMS) in [33] presented extended the SEDS scheme under the manifold immersion and submersion. Though this method can further improve the situation of stability against the accuracy dilemma, the extra-dimensional component produced by the manifold immersion is time variant. Therefore, it possibly causes performance degeneration under external perturbations.
The main feature of this paper in contrast to the previous work, is to apply learning from demonstrations on a wheeled inverted pendulum, where more robot constraints including the nonholonomic and underactuated constraints need to be taken into consideration.

Problem Formulation
This section formulates the problem mathematically. The path planning of a wheeled inverted pendulum can be reduced to a control problem of a wheeled robot (Figure 2), which is given as where x and y describe the Cartesian position of the wheeled robot on its motion plane, and θ represents the included angle between the robot orientation and the x-axis. v represents the forward speed of the robot and ω represents the angular speed. The relationship between v, ω and v l , and v r (the speed of two wheels (Figure 2)) is as follows: where L represents the axis width between two wheels. Meanwhile, the wheeled inverted pendulum is controlled under the underactuated constraints given as where ϕ is the tilt angle, H is the distance between the wheel centroid and the mass, and g denotes the acceleration of gravity. The control diagram of a robot system scheme can be seen in Figure 3. In the phase of collecting demonstrations, a motion capturer is used to collect the position and pose of the wheeled inverted pendulum at a sampling interval of ∆T. The sequence of robot positions and poses as well as the tilt angle are recorded as x t,n , y t,n , θ t,n , ϕ t,n T n ,N t=0,n=1 . In the demonstration set, n indexes the demonstrated trajectories whose total number is N. t represents the tth sampling instant and T n represents the total sampling number in the nth demonstration. The first-order derivativesẋ t,n ,ẏ t,n ,θ t,n , andφ t,n corresponding to x t,n , y t,n , θ t,n , and ϕ t,n can be further computed as Specially,ẋ T n ,n =ẏ T n ,n =θ T n ,n = 0 and the demonstration have a common target position and pose; i.e., ∀i, j = 1...N, The tilt angle does not have to be zeros at last, but has to be a small number.
The demonstration generating process can be thought to result from a dynamic system as follows:  The purpose of this study was to teach a model to approximate the counterpart in Equation (5). Subsequently, we used the learned model to construct a mapping to actuate v l and v r for the nonholonomic mobile robot, which satisfies both the balance controlling and the path following requirements.

Learning Path Following and Balance Controlling Simultaneously
In this section, we introduce the proposed learning framework for a wheeled inverted pendulum, along with an online state-variable estimating method applied during the reproduction process.

Teaching the Control Model with the Nonholonomic and Underactuated Constraints
Instead of directly learning the controller of the wheeled inverted pendulum in the form of Equation (5), we chose to first teach the dynamic system and to subsequently transfer the learned dynamic system into a controller.
In order to let the dynamic system sufficiently capture the control features implicated in the demonstrated trajectories (i.e., the sequences of x and y), we added an extra-dimensional component z for the demonstrated trajectories based on our previous work [33], and therefore, acquired transformed trajectories denoted as x t,n , y t,n , z t,n , θ t,n , ϕ t,n T n ,N t=0,n=1 . Therefore, we taught a dynamic system denoted as where β are the parameters to be learned during the learning process. The first loss term for learning is given as Equation (7) to fit the dynamic system that generate the transformed demonstration examples x t,n , y t,n , z t,n , θ t,n , ϕ t,n T n ,N t=0,n=1 .
Based on Equation (2), we can acquire the predicted wheel velocitiesv l andv r .

Estimating the Extra-Dimensional Component by a Constrained Optimization Related to the Learned Dynamic System
In spite of the fact that we have incorporated the loss terms J 1 and J 2 for the learning process to handle the nonholonomic constraints and the underactuated constraints, the controller (Equations (12) and (13)) yielded by the learned dynamic system still possibly fails to regulate to expected x and y at some instances due to these two constraints. The extra-dimensional component z cannot be influenced by the deviations of x and y; therefore, the accumulation effect might reduce the accuracy. This section introduces an online state variable estimating method to overcome this situation.
When reproducing the learned dynamic system, the extra-dimensional component z 0 of the initial input for the learned dynamic system is computed by a constant ratio µ as in the following where µ is determined by The rule in Equation (15) to decide the constant ratio µ comes from a natural thought shown in Figure 4. The proposed estimating approach is mainly based on a constrained optimization problem related to the learned dynamic system. The general idea can be seen in Figure 5. Specifically, the learned dynamic system predicts the k + 1th state variable x k+1 ,ŷ k+1 ,ẑ k+1 ,θ k+1 ,φ k+1 T with the kth state variable x k , y k , z k , θ k , ϕ k T as model input.
Suppose that the robot's actual state variable is actually x k+1 , y k+1 ,ẑ k+1 , θ k+1 , ϕ k+1 T under the robot constraints and the external perturbations (with only predictedẑ unchanged because it is a virtual component and cannot be influenced by any external factors). z y x Figure 5. Adjusting the extra-dimensional component by a constructed constrained optimization problem. In (a), the learned dynamic system outputs the predicted robot position and the corresponding extra-dimensional component at k + 1th instant (with the state variable at the kth instant as system input). The external factors, including the nonholonomic constraints and underactuated constraints, influence the robot's movement, and the robot moves to an actual position deviating from the predicted one. In order to adjust the extra-dimensional component which cannot be influenced by the external factors, we suppose there is a similar robot motion trajectory (red curve on the x-y plane in (b)) passing near the current robot trajectory (blue curve on the x-y plane in (b)). The robot motion trajectory passes through actual robot position at the k + 1th instant, which is also yielded by the learned dynamic system while it starts from another start point. In (c), considering the two trajectories (red and blue curves in the x-y-z Cartesian space) are alike, the derivative directions at the reproduced points on them at the kth instant should be also similar. Using those as constraints and combining the learned dynamic system function, the previous point on the red marked trajectory at kth instant can be acquired by solving the constrained optimization in Equation (19). Inputting it to the learned dynamic system, the next state variable can be predicted, the extra-dimensional component of which is the value to be used for adjusting the current component at k + 1th instant.
In order to find an appropriate value to adjust the extra-dimensional componentẑ k+1 , we suppose the actual state variable x k+1 , y k+1 ,ẑ k+1 T is a direct output of the learned dynamic sub-system, without the influences of the external factors ( Figure 5). Additionally, we fixed the values of θ k and ϕ k ; i.e., considering them as constants at that instant. Therefore, the corresponding input of the sub-system where ∆t is the sampling instant during the reproduction process.
Meanwhile, x k I , y k I , z k

Experiment
In this section, we validate the proposed method on a real wheeled inverted pendulum with a motion capturer to observe the position and pose information (as shown in Figure 6). The wheeled inverted pendulum uses STM32 single chip computer as the processor to control the velocities of its two wheels. The Bluetooth interface was provided for communications between the STM32 single chip computer and the upper computer. We used the Vicon Tracker 3.4 motion capturer to observe the state of the wheeled inverted pendulum in real-time. The motion capturer can send the observed state to the local network that it is connected to. We, therefore, built software to receive the state observed from the motion capturer. After handling the date received through the algorithm in the upper computer, we subsequently send orders to the STM32 computer to actuate the wheeled inverted pendulum at an interval of 0.005 s. The effectiveness of the proposed algorithm was verified by applying it to the path following task. We collected three kinds of trajectories with three groups of demonstrations for each kind (Figure 7). The recorded demonstrated trajectories were used to train the dynamic system in Equation (6). The sampling interval was also set as 0.005 s.

 -shape
Fish-shape S-shape

Demonstrations
Target position Start position Figure 7. The demonstrations given by humans to teach a model that can regulate a wheeled inverted pendulum to undertake path following task.
Subsequently, we picked a demonstration from each kind of trajectory as the reference trajectory. We let the robot start from a different position to follow the reference trajectory.
The results by the proposed method and two traditional control methods [34,35] are shown in Figure 8. It can be seen that the proposed learning method can smoothly follow the reference trajectory and finally converge to the target position.

Reference trajectory
Target point Start point Proposed method Method in [34] Method in [35]  -shape Fish-shape S-shape Figure 8. The results by the proposed method and the methods in [34,35]. The upper row shows the general trajectory shape of the path following. The lower row records the path following results through MATLAB.
We also used the swept error area standard (SEA) [29] to evaluate the proposed method and the traditional methods [34,35] quantitatively. The SEA standard measures the difference between the generated and the reference trajectories by their included area ( Figure 9). The meaning of the SEA standard can be simply thought as a fact that the generated trajectory should be enough close to the reference counterpart, if the evaluated controller is effective in path following. The computing formula is given by where x k,n ,ŷ k,n T and x k,n , y k,n T are sampling points, respectively, from the nth generated and reference trajectories. A · is the area of the tetragon enclosed by the four adjacent points along the trajectories. The SEA results by the proposed method and the approaches in [34,35] can be seen in Table 1. From the results, we can see that the proposed method can better achieve the capacity of path following; i.e., the error between the generated trajectory by the learned controller and the reference trajectory is smaller. area enclosed by points x k ,ŷ k T and x k+1 ,ŷ k+1 T in the generated trajectory, and points x k , y k T and x k+1 , y k+1 T in the reference trajectory represent the error between the two sampling points of the generated trajectory and the counterparts of the reference trajectory. The sum of all such tetragon areas are, therefore, used to measure the error between the generated and reference trajectories.

Conclusions
This study investigated applying learning from demonstrations on a wheeled inverted pendulum. We took into consideration four aspects: the control under nonholonomic constraints, keeping the pendulum balanced, stability, and accuracy. We incorporated the nonholonomic and underactuated constraints into the learned dynamic system by adding corresponding loss terms for optimizing. Additionally, we proposed an online, state variable estimating method to adjust the extra-dimensional component in the reproduction period such that the accuracy can be further improved even under the robot constraints, including the nonholonomic and underactuated constraints. In future work, we plan to exploit deeper control features in demonstrations, and improve the controller more intelligently by adding more human-like constraints during the learning process.