Adaptive Sliding Mode Disturbance Observer and Deep Reinforcement Learning Based Motion Control for Micropositioners

The motion control of high-precision electromechanical systems, such as micropositioners, is challenging owing to inherent strong nonlinearity, sensitivity to external disturbances, and the difficulty of accurately identifying model parameters. To cope with these problems, this work investigates a disturbance observer-based deep reinforcement learning control strategy to realize highly robust and precise tracking performance. Reinforcement learning has shown great potential as an optimal control scheme; however, its application to micropositioning systems remains rare. Therefore, deep deterministic policy gradient (DDPG) embedded with an integral differential (ID) compensator is utilized in this work, which not only decreases the state error but also improves the transient response speed. In addition, an adaptive sliding mode disturbance observer (ASMDO) is proposed to further eliminate the collective effect caused by the lumped disturbances. The micropositioner controlled by the proposed algorithm tracks the target path precisely, with less than 1 μm error in both simulations and physical experiments, demonstrating the excellent performance and accuracy improvement of the controller.


Introduction
Micropositioning technologies based on smart materials have gained much attention in precision industries for numerous potential applications in optical steering, microassembly, nano-inscribing, cell manipulation, etc. [1][2][3][4][5][6][7]. One of the greatest challenges in this research field is the uncertainty produced by various factors such as the dynamic model, environmental temperature, sensor performance, and the actuators' nonlinear characteristics [8,9], which make the control of micropositioning systems a demanding problem.
To address the uncertainty problem, different kinds of control approaches have been developed, such as the PID control method [10], sliding mode control [11,12], and adaptive control [13]. In addition, many researchers have integrated these control strategies to further improve the control performance. Victor et al. proposed a scalable field-programmable gate array-based motion control system with a parabolic velocity profile [14]. A new seven-segment profile algorithm was developed by Jose et al. to improve the performance of the motion controller [15]. Combined with the backstepping strategy, Fei et al. proposed an adaptive fuzzy sliding mode controller in [16]. Based on the radial basis function neural network (RBFNN) and sliding mode control (SMC), Ruan et al. developed an RBFNN-SMC for nonlinear electromechanical actuator systems [17]. Gharib et al. designed a PID controller with a feedback linearization technique for path tracking control of a micropositioner [18]. Nevertheless, the performance and robustness of such model-based control strategies are still limited by the precision of the dynamics model. On the other hand, a sophisticated system model frequently leads to a complex control strategy. Although many researchers have considered uncertainties and disturbances, it remains difficult to model the system both precisely and comprehensively.
As the rapid development of artificial intelligence in recent years has thoroughly impacted the traditional control field, learning-based and data-driven approaches, especially reinforcement learning (RL) and neural networks, have become a promising research topic. Different from traditional control strategies that rely on assumptions about the dynamics model [19,20], reinforcement learning can learn the policy directly by interacting with the system. Back in 2005, Adda et al. presented a reinforcement learning algorithm for learning control of stochastic micromanipulation systems [21]. Li et al. designed a state-action-reward-state-action (SARSA) method using linear function approximation to generate an optimal path for the micropositioner [22]. However, reinforcement learning algorithms such as Q-learning [23] and SARSA [24] utilized in the aforementioned works are unable to deal with complex dynamics problems, especially continuous state-action spaces. With the spectacular improvement enjoyed by deep reinforcement learning (DRL), primarily driven by deep neural networks (DNN) [25], DRL algorithms such as the deep Q network (DQN) [26], policy gradient (PG) [27], deterministic policy gradient (DPG) [28], and deep deterministic policy gradient (DDPG) [29], with their ability to approximate the value function, have played an important role in continuous control tasks.
Latifi et al. introduced a model-free neural fitted Q iteration control method for micromanipulation devices, in which a DNN is adopted to represent the Q-value function [30]. Leinen introduced the concept of experience replay from DQN and the neural-network approximation of the value function into the SARSA algorithm for the control of a scanning probe microscope [31]. Both simulation and real experimental results have shown that such neural-network-based RL algorithms can outperform traditional control methods to some extent. However, due to the collective effects of disturbances generated by nonlinear systems and deviations in value functions [29,32,33], RL control methods can induce significant inaccuracies in tracking control tasks [34]. To improve anti-disturbance capability and control accuracy, disturbance rejection control [35], time-delay estimation based control [36], and disturbance observer-based controllers [37,38] have been proposed successively. To deal with this issue, a deep reinforcement learning controller integrated with an adaptive sliding mode disturbance observer (ASMDO) is developed in this work. Previous research on trajectory tracking control with DRL has shown that apparent state errors persist [39][40][41][42]. One of the main reasons is the inaccurate estimation of the action-value function in the DRL structure. As indicated in [43], even in elementary control tasks, accurate action values cannot be attained from a single action-value function; therefore, in this work, the DDPG algorithm is augmented with an integral differential compensator (DDPG-ID) to cope with this situation. In addition, a comparison of the reinforcement learning control method with common state-of-the-art control methods is listed in Table 1, which shows the pros and cons of these different methods.
In this study, deep reinforcement learning is leveraged into a novel optimal control scheme for complex systems. An anti-disturbance, stable, and precise control strategy is proposed for the trajectory tracking task of the micropositioner system. The contributions of this work are as follows: (1) A DDPG-ID algorithm based on deep reinforcement learning is introduced as the basic motion controller of the micropositioner system, which avoids the dependence of traditional control strategies on an accurate and comprehensive dynamic model; (2) To eliminate the collective effect caused by the lumped disturbances of the micropositioner system and the inaccurate estimation of the value function in deep reinforcement learning, an adaptive sliding mode disturbance observer (ASMDO) is proposed; (3) An integral differential compensator is introduced in DDPG-ID to compensate the feedback state of the system, which improves the accuracy and response time of the controller and further improves its robustness against external disturbances.
The manuscript is structured as follows. Section 2 presents the system description of the micropositioner. In Section 3, we develop a deep reinforcement learning control method combined with the ASMDO and the compensator, and the parameters of the DNNs are illustrated. Simulation parameters and tracking results are then given in Section 4, together with tracking experiments that further evaluate the performance of the proposed control strategy on the micropositioner. Lastly, conclusions are given in Section 5.

System Description
The basic structure of the micropositioner is shown in Figure 1, which consists of a base, a platform, and a kinematic device. The kinematic device is composed of an armature, an electromagnetic actuator (EMA), and a chain mechanism driven by the EMA. As shown in Figure 1, the structure contains mutually perpendicular compliant chains actuated by the EMA. The movement of the chain mechanism is in accordance with the working air gap y. The EMA generates the magnetic force T_m, which can be approximated by Equation (1), where k and p are constant parameters related to the electromagnetic actuator, I_c is the excitation current, and y is the working air gap between the armature and the EMA. The electrical model of the system is then given by Equation (2), where V_i is the input voltage of the EMA, R is the resistance of the coil, and H denotes the coil inductance, given by Equation (3), where H_1 is the coil inductance when the air gap is infinite and H_0 is the incremental inductance when the gap is zero. The motion equation for the micropositioner is expressed as Equation (4), where ι is the stiffness along the motion direction of the system and α_0 is the initial air gap. According to Equations (1)-(4), define x_1 = y, x_2 = ẏ, and x_3 = I_c as the state variables and the control input u = V_i. Then, the dynamics model of the electromagnetic actuator can be written in state-space form. Define the auxiliary variable involving Hm(x_1 + p)^2, with z_1 the system output.
In realistic engineering applications, uncertainties always exist in the system, so the system model in Equation (6) can be rewritten to include them, where f_0(x) and g_0(x) denote the nominal parts of the micropositioner system, ∆f(x) and ∆g(x) denote the modeling uncertainties, and d denotes the external disturbances.
where D is the lumped system disturbance. The following assumption is exploited [44]: Assumption 1. The lumped disturbance D is bounded, its upper bound is less than a fixed parameter β_1, and the derivative of D is unknown but bounded.
Remark 1. Assumption 1 is reasonable since all micropositioner platforms are accurately designed and their parameters identified, so all disturbances remain within a controllable domain.

Design of ASMDO and DDPG-ID Algorithm
In this section, the adaptive sliding mode disturbance observer (ASMDO) is introduced based on the dynamics of the micropositioner. Then, the DDPG-ID control method and pseudocode are given.

Design of Adaptive Sliding Mode Disturbance Observer
To develop the ASMDO, a virtual dynamic system is first designed, where η_i (i = 1, 2, 3) are auxiliary variables, D̂ is the estimate of the lumped disturbance, and ρ denotes the sliding mode term introduced below. Define a sliding variable S, where k_1 and k_2 are positive design parameters. The sliding mode term ρ is then designed such that λ_1 and λ_2 are positive design parameters with λ_2 ≥ β_1.
Choosing an unknown constant β_2 to represent the upper bound of Ḋ, the ASMDO is proposed such that k and λ_3 are positive design parameters and β̂_2 is the estimate of β_2, given by the adaptive law β̂̇_2 = −δ_0 β̂_2 + |ρ|, where δ_0 is a small positive number. The output D̂ of the ASMDO is then used as a compensation of the control input to eliminate the uncertainties generated by the system and the external disturbances.
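To illustrate how an observer of this general type behaves, the following minimal scalar sketch estimates an unknown disturbance acting on a single integrator. The plant, the gains, and the smoothed sign term (a tanh boundary layer instead of sign, and without the adaptive β̂_2 law) are illustrative assumptions for the demo, not the paper's exact design.

```python
import numpy as np

def run_smdo(T=5.0, dt=1e-3, lam1=20.0, lam2=1.0, k_obs=50.0):
    """Scalar sliding-mode disturbance observer sketch for x' = u + D."""
    n = int(T / dt)
    t = np.arange(n) * dt
    D = 0.5 * np.sin(2.0 * t)          # unknown lumped disturbance
    x, eta, Dhat = 0.0, 0.0, 0.0       # plant state, auxiliary state, estimate
    u = 0.0                            # open-loop input for the demo
    err = np.zeros(n)
    for i in range(n):
        x += dt * (u + D[i])           # plant step (forward Euler)
        S = x - eta                    # sliding variable
        rho = lam1 * S + lam2 * np.tanh(S / 0.01)  # smoothed sign term
        eta += dt * (u + Dhat + rho)   # virtual dynamic tracks the plant
        Dhat += dt * (k_obs * rho)     # disturbance estimate update
        err[i] = Dhat - D[i]
    return t, err

t, err = run_smdo()
# after the initial transient, the estimate follows the disturbance closely
```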
Choosing V_1 and V_2 as two Lyapunov functions and differentiating them with respect to time, it is easy to prove that both S and the disturbance estimation error exponentially converge to the equilibrium point, so the proof is not repeated here.

Design of DDPG-ID Algorithm for Micropositioner
The goal of reinforcement learning is to obtain a policy for the agent that maximizes the cumulative reward through interactions with the environment. The environment is usually formalized as a Markov decision process (MDP) described by a four-tuple (S, A, P, R), where S, A, P, and R represent the state space of the environment, the set of actions, the state transition probability function, and the reward function, respectively. At each time step t, the agent in the current state s_t ∈ S takes an action a_t ∈ A from the policy π(a_t|s_t); the agent then acquires a reward r_t ← R(s_t, a_t) and enters the next state s_{t+1} according to the state transition probability function P(s_{t+1}|s_t, a_t). Based on the Markov property, the Bellman equation of the action-value function Q_π(s_t, a_t), which is used for calculating the future expected reward, can be given as Q_π(s_t, a_t) = E[r_t + γ Q_π(s_{t+1}, a_{t+1})], where γ ∈ [0, 1] denotes the discount factor. In the trajectory tracking control task of the micropositioner, the state s_t is a state array describing the air gap y of the micropositioner at time t, and the action a_t is the voltage u applied by the controller to the micropositioner. As shown in Figure 2, DDPG is an actor-critic algorithm with an actor and a critic. The actor is responsible for generating actions and interacting with the environment, while the critic evaluates the performance of the actor and guides the action in the next state. The action-value function and the policy are parameterized by DNNs to handle the continuous states and actions of the micropositioner, with Q(s_t, a_t, w^Q) and π(s_t, w^µ), where w^Q and w^µ are the parameters of the action-value network and the policy network, respectively. With the neural network approximating the policy function, gradient-based updates are used to seek the optimal policy π.
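The role of the discount factor γ in the Bellman equation can be checked on a toy one-state, one-action MDP (a hypothetical self-loop, not the micropositioner): with reward r = 1 and γ = 0.9, the fixed point is Q = r/(1 − γ) = 10.

```python
def bellman_fixed_point(r=1.0, gamma=0.9, iters=1000):
    """Iterate the Bellman backup Q <- r + gamma * Q for a self-loop MDP."""
    q = 0.0
    for _ in range(iters):
        q = r + gamma * q  # one-step Bellman backup with discount gamma
    return q

print(bellman_fixed_point())  # converges to r / (1 - gamma) = 10
```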
DDPG-ID uses a deterministic policy π(s_t, w^µ) rather than a traditional stochastic policy π_{w^µ}(a_t|s_t); the output of the policy is the action a_t with the highest probability for the current state s_t, i.e., π(s_t, w^µ) = a_t. The policy gradient is given accordingly, where J(π) = E_π[∑_{t=1}^{T} γ^{t−1} r_t] is the expectation of the discounted accumulative reward, T denotes the final time of a whole episode, and ρ_π is the state distribution under the deterministic policy. The value function Q(s_t, a_t, w^Q) is updated by calculating the temporal-difference error (TD-error), defined as e_TD = r_t + γQ(s_{t+1}, π(s_{t+1})) − Q(s_t, a_t, w^Q), where r_t + γQ(s_{t+1}, π(s_{t+1})) represents the TD target value. By minimizing the TD-error, the parameters are updated via backpropagation through the neural network.
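The TD-error minimization above can be sketched with a toy linear value function on a hypothetical self-loop transition (the features, learning rate, and transition are assumptions for illustration, not the paper's networks): the semi-gradient update drives e_TD toward zero and Q toward its fixed point r/(1 − γ).

```python
import numpy as np

def td_learn(r=1.0, gamma=0.9, alpha=0.1, iters=1000):
    """Shrink e_TD = r + gamma*Q(s',a') - Q(s,a) with a linear
    Q(s,a) = w . phi(s,a) on a toy self-loop transition."""
    phi = np.array([1.0, 0.5])   # feature vector of (s_t, a_t)
    phi_next = phi               # self-loop: next state-action is identical
    w = np.zeros(2)
    for _ in range(iters):
        td = r + gamma * (w @ phi_next) - (w @ phi)  # TD-error
        w += alpha * td * phi    # semi-gradient update on the TD-error
    return w @ phi, td

q, td = td_learn()
# fixed point: Q = r / (1 - gamma) = 10, with the TD-error vanishing
```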
To avoid the convergence problem of a single network caused by the correlation between the TD target value and the current value [45,46], a target Q network Q_T(s_{t+1}, a_{t+1}, w^Q) is introduced in the critic to calculate the network portion of the TD target value, and an online Q network Q_O(s_t, a_t, w^Q) is used to calculate the current value. These two DNNs have the same structure. The actor likewise has an online policy network π_O(s_t, w^µ) to generate the current action and a target policy network π_T(s_t, w^µ) to provide the target action a_{t+1}; here w^µ and w^Q also denote the parameters of the target policy and target Q networks, respectively.
To improve stability and efficiency during RL training, the experience replay technique is utilized in this work: the transition experience (s_t, a_t, r_t, s_{t+1}) is saved into the experience replay buffer Ψ at each interaction with the environment for subsequent updates. At each training step t, a minibatch of M transitions (s_j, a_j, r_j, s_{j+1}) is sampled from the experience replay buffer to calculate the gradients and update the neural networks.
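A minimal sketch of the replay buffer Ψ described above, using a fixed-capacity deque and uniform minibatch sampling (the capacity and batch size here are illustrative, not the paper's values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience replay buffer for (s, a, r, s') tuples."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, m):
        return random.sample(self.buffer, m)  # minibatch without replacement

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=500)
for k in range(100):
    buf.push(k, 0.0, 1.0, k + 1)   # store 100 dummy transitions
batch = buf.sample(32)             # minibatch of M = 32 transitions
```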
An integral differential compensator is incorporated into the deep reinforcement learning structure to improve the accuracy and responsiveness of the tracking task, as shown in Figure 2. The integral portion of the state continuously increases the control input, which eventually reduces the tracking error. The differential part is integrated to suppress system oscillation and accelerate settling. The proposed compensator is designed as follows, where s_t^ID represents the compensator error at time t, y_d^t represents the desired trajectory at time t, ŷ_t is the measured air gap at time t, and y_e^t = y_d^t − ŷ_t is the error between them; α is the integral gain and β is the differential gain.
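A discrete-time sketch of the compensator, forming s_ID = α·∫y_e dt + β·dy_e/dt; the gains and step size are hypothetical, not the paper's tuned values:

```python
class IDCompensator:
    """Discrete integral-differential compensator:
    s_ID = alpha * integral(y_e) + beta * d(y_e)/dt."""
    def __init__(self, alpha, beta):
        self.alpha, self.beta = alpha, beta
        self.integ = 0.0
        self.prev_err = None

    def step(self, err, dt):
        self.integ += err * dt  # accumulate the integral of the error
        deriv = 0.0 if self.prev_err is None else (err - self.prev_err) / dt
        self.prev_err = err
        return self.alpha * self.integ + self.beta * deriv

comp = IDCompensator(alpha=0.5, beta=0.1)
out = [comp.step(1.0, 0.1) for _ in range(10)]
# constant error: the derivative vanishes after the first step and the
# integral grows linearly, so out[-1] = 0.5 * (10 * 0.1 * 1.0) = 0.5
```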
Then the state s_t at time t can be described accordingly, where ẏ_t and ẏ_d^t represent the derivatives of ŷ_t and y_d^t. The reward function r_t is designed to measure the tracking error (Equation (17)): a penalty is applied when y_e^t > 0.005, and rewards of +5, +10, and +18 are given when 0.003 < y_e^t ≤ 0.005, 0.001 < y_e^t ≤ 0.003, and y_e^t ≤ 0.001, respectively. As shown in Figure 3, the adaptive sliding mode disturbance observer (ASMDO) is embedded into DDPG-ID between the actor and the micropositioner system environment. The action a_t applied to the environment is expressed as a_t = π_O(s_t, w^µ) + D̂_t + N_t, where w^µ denotes the parameters of the online policy network π_O, D̂_t is the disturbance estimate for the micropositioner system at time t, and N_t is Gaussian noise for action exploration.
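The tiered reward of Equation (17) can be sketched as below. The penalty value for the largest-error tier did not survive in the source, so the −5 used here is purely an assumption.

```python
def reward(y_e):
    """Tiered tracking reward in the spirit of Eq. (17); y_e is the
    absolute tracking error. The top-tier penalty is an assumed value."""
    y_e = abs(y_e)
    if y_e > 0.005:
        return -5.0   # assumed penalty (exact value lost in the source)
    if y_e > 0.003:
        return 5.0
    if y_e > 0.001:
        return 10.0
    return 18.0       # tightest error band earns the largest reward
```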

Critic Update
After sampling M transitions (s_j, a_j, r_j, s_{j+1}) from the experience replay buffer Ψ, the Q value is calculated. The online Q network is responsible for calculating the current Q value as follows, where φ(s_j, a_j) represents the input of the online Q network, a feature vector consisting of state s_j and action a_j. The target Q network Q_T is defined with input φ(s_{j+1}, π_T(s_{j+1}, w^µ)), a feature vector consisting of state s_{j+1} and the target policy network output π_T(s_{j+1}, w^µ).
For the target policy network π_T, the corresponding equation is given accordingly. Then, we rewrite the target Q value Q_T, where r_j is the reward from the selected samples.
Since M transitions (s_j, a_j, r_j, s_{j+1}) are sampled from the experience buffer Ψ, the loss function for updating the critic is shown in Equation (23).
where L(w^Q) is the loss value of the critic. In order to smooth the target network update process, a soft update is applied instead of periodically copying the parameters, where τ is the update factor, usually a small constant. The diagram of the Q network is shown in Figure 4; it is a parallel neural network. The Q network includes both state and action portions, and its output value is based on state and action. The state portion consists of a state input layer, three fully connected layers, and two ReLU layers interleaved between the fully connected layers. The action portion contains an action input layer and a fully connected layer. The output layers of these two portions are combined and fed into the common part, which contains a ReLU layer and one output layer. The parameters of each layer in the Q network are shown in Table 2.
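The soft update above amounts to the Polyak rule w_T ← τ·w_O + (1 − τ)·w_T; a one-line sketch on dummy weight vectors (the weights and τ here are illustrative):

```python
import numpy as np

def soft_update(w_target, w_online, tau=0.01):
    """Polyak soft update: w_T <- tau * w_O + (1 - tau) * w_T."""
    return tau * w_online + (1.0 - tau) * w_target

w_t = np.zeros(3)                         # dummy target-network weights
w_o = np.ones(3)                          # dummy online-network weights
w_t = soft_update(w_t, w_o, tau=0.01)
# the target moves 1% of the way toward the online weights: [0.01, 0.01, 0.01]
```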

Actor Update
The output of the online policy network is the action applied to the system. Since a deterministic policy is used, the calculation of the policy gradient involves no integral over actions; instead, in contrast to the stochastic policy case, it contains the derivative of the value function Q_O with respect to the action a. The gradient formula can thus be rewritten as follows, where the weights w^µ are updated with the gradient backpropagation method. The target policy network is also updated in the soft update pattern, where τ is the update factor, usually a small constant. Figure 5 shows the diagram of the policy network in this paper, which contains a state input layer, a fully connected layer, a tanh layer, and an output layer. The parameters of each layer in the policy network are shown in Table 3.

The main steps of the training procedure (Algorithm 1) are as follows:

8: Initialize a noise process N for exploration
9: Initialize the ASMDO and ID compensator
10: Randomly initialize the micropositioner states
11: Receive the initial observation state s_1
12: for step = 1, T do
13: Select action a_t = π_O(s_t) + D̂_t + N_t
14: Use a_t to run the micropositioner system model
15: Process the errors with the integral differential compensator
16: Receive reward r_t and new state s_{t+1}
17: Store transition (s_t, a_t, r_t, s_{t+1}) in replay buffer Ψ
18: Randomly sample a minibatch of M transitions (s_j, a_j, r_j, s_{j+1}) from Ψ
19-20: Minimize the loss to update the online Q network
21: Use the sampled policy gradient to update the online policy network
22-23: Update the target networks
24: end for
end for
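The loop structure above can be sketched end to end on a hypothetical scalar plant with linear actor and critic. Everything here is an illustrative stand-in: the toy plant, the linear "networks" and their features, the learning rates, and the omission of the ASMDO and ID compensator. It demonstrates the flow of one DDPG-style training step (act with noise, store, sample, TD update, deterministic policy gradient, soft update), not the paper's implementation.

```python
import random
import numpy as np

rng = np.random.default_rng(0)
gamma, tau, lr = 0.9, 0.01, 1e-3
w_mu, w_q = 0.0, np.zeros(3)          # online actor / critic (linear)
w_mu_t, w_q_t = w_mu, w_q.copy()      # target networks
buffer, M = [], 32

def feats(s, a):
    return np.array([s, a, s * a])    # critic features for Q(s, a)

s = 1.0
for step in range(500):
    a = w_mu * s + rng.normal(0.0, 0.1)         # action + exploration noise
    s_next = 0.9 * s + 0.1 * a                  # toy plant step
    r = -abs(s_next)                            # reward: stay near zero
    buffer.append((s, a, r, s_next))            # store transition
    s = s_next
    if len(buffer) >= M:
        for (sj, aj, rj, sj1) in random.sample(buffer, M):
            a1 = w_mu_t * sj1                         # target action
            y = rj + gamma * (w_q_t @ feats(sj1, a1))  # TD target value
            td = y - w_q @ feats(sj, aj)              # TD-error
            w_q += lr * td * feats(sj, aj)            # critic update
            dq_da = w_q[1] + w_q[2] * sj              # dQ/da at a = pi(s)
            w_mu += lr * dq_da * sj                   # deterministic PG step
        w_q_t = tau * w_q + (1 - tau) * w_q_t         # soft updates
        w_mu_t = tau * w_mu + (1 - tau) * w_mu_t
```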

Simulation and Experimental Results
In this section, two kinds of periodic external disturbances were added to verify the practicability of the proposed ASMDO, and three distinct desired trajectories were utilized to evaluate the performance of the proposed deep reinforcement learning control strategy. A conventional DDPG algorithm and a well-tuned PID strategy were adopted for comparison.
To further verify the practical performance of the proposed algorithm, two different trajectories were introduced in the experiments.

Simulation Results
The parametric equations of the two kinds of periodic external disturbances are defined as d_1 = 0.1 sin(2πt) + 0.1 sin(0.5πt + π/3) and d_2 = 0.1 + 0.1 sin(0.5πt + π/3). Based on the micropositioner model proposed in [44], the effectiveness of the observer is presented in Figures 6 and 7. The disturbance estimation results from the proposed ASMDO are presented in Figures 6a and 7a; it can be seen that the observer tracks the given disturbance rapidly. The estimation errors are less than 0.01 mm in Figures 6b and 7b, which shows the effectiveness of the ASMDO as interference compensation.
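The two disturbance signals can be generated directly from their parametric equations; both are bounded by 0.2 in magnitude, consistent with the boundedness required by Assumption 1 (the time grid below is an arbitrary choice for the sketch):

```python
import numpy as np

# simulated time axis (arbitrary span/resolution for illustration)
t = np.linspace(0.0, 20.0, 20001)
d1 = 0.1 * np.sin(2 * np.pi * t) + 0.1 * np.sin(0.5 * np.pi * t + np.pi / 3)
d2 = 0.1 + 0.1 * np.sin(0.5 * np.pi * t + np.pi / 3)
# both disturbances stay within |d| <= 0.2, as Assumption 1 requires
```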
The dynamics model of the micropositioner is given in Section 2, and its basic system model parameters are taken from our previous research [44,47], as shown in Table 4. The DDPG baseline uses the same neural network structure and training parameters as DDPG-ID in this paper. The training parameters of DDPG-ID and DDPG are shown in Table 5.
The first desired trajectory designed for the tracking control simulation is a waved signal, whose parametric equation is defined according to the initial conditions. The training processes of both DDPG-ID and DDPG are run on the same model with stochastically initialized micropositioner states. During the training evaluation, a larger episode reward indicates a more accurate, lower-error control policy. As shown in Figure 8, DDPG-ID reaches the maximum reward score within fewer episodes than DDPG, which reveals that the DDPG-ID algorithm converges faster. Comparing Figure 8a with Figure 8b, the average reward of the DDPG-ID training process in the stable state is larger than that of DDPG, which further indicates that the policy learned by DDPG-ID performs better. The trained algorithms are then employed for tracking control simulations of the micropositioner system. The tracking results of the waved trajectory are shown in Figure 9. The RMSE, maximum, and mean values of the tracking errors for the three control methods are provided in Table 6. In terms of tracking accuracy, the trained DDPG-ID controller outperforms DDPG and PID, with a smaller state error and a smoother tracking trajectory. The tracking error of the DDPG-ID algorithm ranges from −8 × 10⁻⁴ to 9 × 10⁻⁴ mm, roughly half that of the DDPG policy. Meanwhile, the DDPG controller has a smaller tracking error than PID. A large oscillation is induced by the PID controller, which would affect the hardware to a certain extent in actual operation; this oscillating input signal is much larger than a normal control input signal, which typically ranges from 0 to 11 V. Given the characteristics of reinforcement learning, it is hard for a well-trained policy to generate such a shock signal.
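The RMSE, maximum, and mean error statistics reported in Table 6 can be computed as follows; the two-sample error trace used here is hypothetical, not the paper's data:

```python
import numpy as np

def tracking_metrics(err):
    """RMSE, MAX, and mean of the absolute tracking error."""
    err = np.asarray(err, dtype=float)
    rmse = np.sqrt(np.mean(err ** 2))
    return rmse, np.max(np.abs(err)), np.mean(np.abs(err))

# hypothetical error trace in mm, for illustration only
rmse, emax, emean = tracking_metrics([3e-4, -4e-4])
```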
Another tracking result, for a periodic trajectory, is illustrated in Figure 10, and the tracking error comparison of the three control methods is given in Table 7. The parametric equation of the periodic trajectory is defined as follows. As can be seen in these figures, the tracking error of DDPG-ID on the periodic trajectory is still smaller than that of the others, ranging from −1.6 × 10⁻⁴ to 9 × 10⁻⁴ mm. Similar to the previous waved trajectory, the control input based on DDPG shows better performance in terms of oscillations.
To further demonstrate the universality of the DDPG-ID policy, a periodic step trajectory is also utilized for comparison. The step signal, with a period of 8 s, is designed as the desired trajectory, as shown in Figure 11a. The well-tuned PID controller was also tested in this step trajectory simulation; since intense oscillations emerged and the PID results showed extremely poor performance, they are not shown in this paper.
According to Figure 11, the tracking result of the DDPG-ID algorithm remains stable, with the tracking error bounded within −2 × 10⁻⁴ to 9 × 10⁻⁴ mm, still about half of DDPG's. Due to the nature of the step signal, the state error becomes large during the step transitions; the errors of DDPG-ID and DDPG are observed to drop quickly after each transition. It can be seen from Table 8 that the errors of the DDPG-ID algorithm are substantially smaller than those of the DDPG algorithm. As for the control inputs, the value of DDPG still fluctuates considerably even after the state converges. According to the above simulation results, it can be concluded that the control policy of DDPG-ID successfully deals with the collective effect caused by disturbances and the inaccurate estimation in deep reinforcement learning, compared to DDPG. The comparison results also demonstrate the excellent control performance of the policy learned by the DDPG-ID algorithm.

Experimental Results
The speed, acceleration, and direction of the designed trajectories vary with time, which makes the experimental results more trustworthy. In each test, the EMA in the micropositioner is regulated to track the desired path of the working air gap.
As shown in Figure 12, a laser displacement sensor is utilized to detect the motion states. The DDPG-ID algorithm was then implemented through a SimLab board running Matlab-Simulink. The EMA controls the movement of the chain mechanism by executing the control signal from the analog output port of the SimLab board, while the analog input port of the SimLab board is connected to the signal output of the laser displacement sensor. Figure 13 shows the tracking experiment results for the waved trajectory. The platform reaches the starting point on a straight track at a speed of 5.6 µm/s. At time 5 s, it begins to track the desired waved trajectory over three periods, where the waved trajectory is described by y_d(t) = 28 + 25 sin(πt/10 + π/2). The tracking error fluctuates within ±1.5 µm, as demonstrated in Figure 13b; except for several particular points in time, the tracking errors remain within ±1 µm.
Another periodic trajectory tracking experiment was also executed. As shown in Figure 14, the desired periodic trajectory starts at time 5 s and is defined as y_d(t) = 35 − 25 sin(πt/7.5 − 2π/3) − 5 sin(πt/15 + π/6). The tracking error of the periodic trajectory still remains within ±1.5 µm.
The experimental results show that the proposed DDPG-ID algorithm is able to closely track the above two trajectories. Compared with the simulation results, the tracking error does not increase significantly and can be maintained between −1 µm and +1 µm.

Conclusions and Future Works
In this paper, a composite controller is developed based on an adaptive sliding mode disturbance observer and a deep reinforcement learning control scheme. The deep deterministic policy gradient is utilized to obtain optimal control performance. To improve tracking accuracy and transient response time, an integral differential compensator is applied during the learning process in the actor-critic framework. An adaptive sliding mode disturbance observer is developed to further reduce the influence of modeling uncertainty, external disturbances, and inaccurate value function estimation. In comparison with the existing DDPG and the most commonly used PID controller, the trajectory tracking results in simulation indicate the satisfactory performance and precision of the control policy based on the DDPG-ID algorithm. The tracking errors are less than 1 µm, which shows the significant tracking efficiency of the proposed methods. The experimental results also indicate the high accuracy and strong anti-interference capability of the proposed deep reinforcement learning control scheme. To further improve the tracking performance and realize micromanipulation tasks, future work will include specific operation experiments such as cell manipulation and micro-assembly.
Author Contributions: Writing-original draft preparation, S.L., R.X., X.X. and Z.Y.; writing-review and editing, S.L. and R.X.; data collection, S.L. and R.X.; visualization, S.L., R.X., X.X. and Z.Y.; supervision, Z.Y. All authors have read and agreed to the published version of the manuscript.

Nomenclature:
T_m: the magnetic force
y: the working air gap in the micropositioner
I_c: the excitation current in the micropositioner
EMA: the electromagnetic actuator
V_i: the input voltage of the electromagnetic actuator
R: the resistance of the coil in the micropositioner
H: the coil inductance in the micropositioner
u: the control input
D: the lumped system disturbance
ASMDO: adaptive sliding mode disturbance observer
s_t: the state at time t in reinforcement learning
a_t: the action at time t in reinforcement learning
r_t: the reward at time t in reinforcement learning
ReLU: rectified linear unit activation function
tanh: hyperbolic tangent activation function