A Novel Adaptive Neuro-Control Approach for Permanent Magnet Synchronous Motor Speed Control

This paper discusses a speed controller for permanent magnet synchronous motors (PMSMs) under the field oriented control (FOC) method. First, a novel adaptive neuro-control approach, single artificial neuron goal representation heuristic dynamic programming (SAN-GrHDP), is presented for the speed regulation of PMSMs. PI controllers are adopted for both current loops. Compared with the conventional single artificial neuron (SAN) control strategy, the proposed approach assumes an unknown mathematical model of the PMSM and guides the selection of the parameter K online. In addition, the proposed design develops an internal reinforcement learning signal to guide the dynamic optimal control of the PMSM during operation. Finally, nonlinear optimal control simulations and experiments on the speed regulation of a PMSM are implemented in Matlab2016a and on a TMS320F28335, a 32-bit floating-point digital signal processor (DSP), respectively. For a comparative study, the conventional SAN and SAN-GrHDP approaches are set up under identical conditions and parameters. Simulation and experimental results verify that the proposed controller can improve the speed control performance of PMSMs.


Introduction
Permanent magnet synchronous motors (PMSMs) have many advantages, such as high power density, simple structure, small volume, high efficiency and reliability. PMSMs are widely used in numerical control machine tools, aerospace and industrial robotic manipulators [1]. A PMSM is a typical nonlinear and strongly coupled system, with unpredictable external disturbances as well as internal parameter variations [2]. In recent years, various nonlinear control methods [3][4][5][6][7][8][9][10][11], such as fuzzy logic control, sliding mode control, neural network control, nonlinear optimal control, internal model control and adaptive control, have been used to meet the requirements of high reliability and performance in PMSM control [7][8][9][10]. Fuzzy logic control has been successfully applied to the speed control of PMSMs [12,13]. However, the fuzzy control membership function is mainly based on expert experience, which is difficult to obtain. Sliding mode control is a popular research topic because of its insensitivity to variations of the controlled object's parameters and to load disturbances [14,15]. Nevertheless, this control method suffers from chattering. Meanwhile, nonlinear optimal control has been put forward as a new PMSM control method [16]. However, the parameters of the PMSM must be sufficiently accurate, and the control results cannot adapt in time when the mechanical parameters of the PMSM change. In [17], Sun et al. proposed a novel control scheme combining the inverse system method and internal model control for a bearingless permanent magnet synchronous motor (BPMSM); however, to regulate the tracking and disturbance rejection properties, the values of the control parameter sets need to be adjusted separately [18].
Recently, adaptive dynamic programming (ADP) has attracted significantly increasing attention as a novel reinforcement learning approach. It can overcome the "curse of dimensionality" of conventional dynamic programming by approximating the cost function [19]. ADP can be categorized into three classical structures [20]: heuristic dynamic programming (HDP), dual heuristic dynamic programming (DHP), and globalized dual heuristic dynamic programming (GDHP). The main difference is that the critic network approximates the value function J in HDP, whereas it approximates the derivatives of the value function J in DHP. GDHP combines the benefits of HDP and DHP by approximating both the value function J and its derivatives.
In [21], a novel hierarchical ADP structure named goal representation heuristic dynamic programming (GrHDP) is proposed. Compared with the conventional ADP approach, it has an additional reference network that automatically builds an internal reinforcement signal to facilitate optimal learning and to control effectively and efficiently [22]. This hierarchical ADP approach has a superior learning performance over the traditional ones. The GrHDP approach is used in various fields of electrical engineering, such as power system stability control for a wind farm [23], power oscillation damping control for superconducting magnetic energy storage [24], and load frequency control for an islanded smart grid [25].
Meanwhile, the single artificial neuron (SAN) control approach has been used in many applications for its robust control in the presence of noise and uncertainties [26]. Generally speaking, traditional SAN control has been applied in engineering practice for a long time because of its good performance and easy implementation [27][28][29].
It has been pointed out that although the conventional SAN control approach can provide an online learning ability for PMSM parameter variation, it may not provide satisfactory load disturbance rejection. The reason is that the control effect of the SAN mainly depends on the parameter K (the neuron scale-up factor). The parameter K is very important to the control response performance, and its selection is very difficult in traditional SAN control approaches. The control system responds faster if the K value increases. However, if K is outside a certain range, it leads to instability of the system. Moreover, there is no profound theoretical background that can be used to tune the parameter K for complicated systems with uncertainties and disturbances. It is a new idea to use machine learning to adjust the K value of the SAN and make it applicable to PMSM control. At the same time, for ultimate convergence, the action network weights of the GrHDP approach usually need repeated online learning to reach optimal solutions of the Bellman equation. So far, articles about ADP approaches mostly focus on the simulation stage [23,24,[30][31][32][33][34][35][36].
To solve the above problems, a novel neuro-control framework using GrHDP and SAN is proposed in this paper. Moreover, an application study on a PMSM vector control system is also presented. The main contributions of this paper are summarized as follows:
(1) A novel adaptive neuro-controller, called single artificial neuron goal representation heuristic dynamic programming (SAN-GrHDP), based on SAN and GrHDP is proposed. Under this framework, the parameter K in the SAN is updated through a reference learning mechanism, which provides a sequential online control policy.
(2) The formula of the SAN-GrHDP approach is derived, and the reinforcement signal and learning process are designed for the vector control of a PMSM. Simulation studies have been carried out for the proposed approach. Simulation results demonstrate that the proposed controller has a higher
The remainder of the paper is organized as follows. Section 2 describes the servo control system of a PMSM as well as the modeling of the speed controller used in this paper. Section 3 illustrates the details of the SAN-GrHDP controller and the associated learning algorithm. In Section 4, the simulation of the speed control of the PMSM and the experimental setup based on SAN-GrHDP are presented. The results demonstrate the effectiveness of the proposed SAN-GrHDP by comparison with the conventional SAN control approach. Finally, Section 5 presents our conclusions and a few future study directions.

Model of Permanent Magnet Synchronous Motor Control System
Assuming that magnetic circuit saturation, hysteresis and eddy current losses are disregarded and that the magnetic field is sinusoidally distributed in space, a surface-mounted PMSM is considered as the controlled object. The model of a surface-mounted PMSM in d-q coordinates is given in [37,38], where $u_d$ and $u_q$ are the stator d- and q-axes voltages, $i_d$ and $i_q$ are the stator d- and q-axes currents, $L_d$ and $L_q$ are the stator d- and q-axes inductances, $n_p$ is the number of pole pairs, $R_s$ is the stator resistance, $\omega$ is the rotor angular velocity, $\psi_f$ is the flux linkage, $T_L$ is the load torque, and $B$ is the viscous friction coefficient.
Vector control strategies for PMSMs include $i_d = 0$ control, power factor $\cos\varphi = 1$ control, maximum torque control, maximum output power control, flux weakening control and so on. The $i_d = 0$ control approach, which has advantages such as small torque ripple and wide speed range, is the simplest vector control strategy and is used in this article. The field oriented control (FOC) diagram of the PMSM system with $i_d = 0$ control is shown in Figure 1. There are three controllers in the diagram: one speed tracking loop controller and two current tracking loop controllers. The d- and q-axes currents $i_d$, $i_q$ are calculated from the two-phase stationary coordinate currents $i_\alpha$, $i_\beta$ of the PMSM by the Park transform. Similarly, the currents $i_\alpha$, $i_\beta$ are obtained from the actual phase currents of the PMSM through the Clarke transform. The rotor angular velocity $\omega$ and rotor position $\theta$ are obtained from the encoder. Usually, the reference current value $i_q^*$ is determined by the speed loop controller output, and $i_d^*$ is set to zero.
Due to saturation phenomena of PMSMs, some values can depend on the operating point of the machine, such as rotor inductance and rotor resistance. This can affect the performance and the accuracy of a conventional controller. The SAN-GrHDP approach is a kind of machine learning algorithm (an ADP approach). When motor parameters change, the controller can learn from a complex, uncertain environment (the controlled plant) according to the optimal cost function, which is also the essence of the ADP method [19]. Compared with the traditional control approach, the SAN-GrHDP can realize self-regulation through the critic network and provide an online sequential control policy that is not sensitive to external load disturbances and parameter variations. This article mainly discusses the external load disturbance rejection capacity of the proposed control strategy. The current-loop sampling period is 200 µs, and the speed-loop sampling time is ten times that of the current loop. The current-loop controllers require a faster response; therefore, the inner current-loop controllers adopt traditional PI controllers. Here, the task is to design a speed controller based on the SAN-GrHDP approach.
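The coordinate transforms mentioned above can be illustrated with a minimal sketch. It assumes the amplitude-invariant form of the Clarke transform and a Park transform whose d-axis is at electrical angle `theta`; the function names and the helper `phase_currents_q_aligned` are hypothetical and are not taken from the paper.

```python
import math


def clarke(i_a, i_b, i_c):
    """Amplitude-invariant Clarke transform: phase currents -> (i_alpha, i_beta)."""
    i_alpha = (2.0 / 3.0) * (i_a - 0.5 * i_b - 0.5 * i_c)
    i_beta = (i_b - i_c) / math.sqrt(3.0)
    return i_alpha, i_beta


def park(i_alpha, i_beta, theta):
    """Park transform: stationary frame -> rotating d-q frame at electrical angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    i_d = i_alpha * c + i_beta * s
    i_q = -i_alpha * s + i_beta * c
    return i_d, i_q


def phase_currents_q_aligned(amplitude, theta):
    """Hypothetical test signal: balanced currents whose space vector lies on the q-axis."""
    shift = 2.0 * math.pi / 3.0
    return (-amplitude * math.sin(theta),
            -amplitude * math.sin(theta - shift),
            -amplitude * math.sin(theta + shift))
```

For balanced currents aligned with the q-axis, this sketch returns a d-axis current of zero and a q-axis current equal to the phase amplitude at any rotor angle, which is the operating condition that $i_d = 0$ control aims for.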

Single Artificial Neuron Goal Representation Heuristic Dynamic Programming Controller
Like the conventional GrHDP approach [21,39,40], the proposed SAN-GrHDP controller includes three approximation networks: an action network, a critic network, and a reference network. The critic network approximates the cost-to-go function in the Bellman equation by online learning. The reference network provides an adaptive internal reinforcement signal that helps the critic network to better approximate the value function. Compared with the classic ADP structure, the GrHDP approach has an additional reference network that generates an internal goal-representation signal to facilitate learning and optimization. It provides an effective way for the intelligent system to achieve its goals through the adaptive and automatic construction of internal goal representations [21]. Because of the added reference network, this structure also has some disadvantages, such as a complex structure and a high computation burden.
However, the action network of the conventional GrHDP approach is a backpropagation (BP) network that must be trained many times to ensure the convergence of its weights, so it is difficult to use the conventional GrHDP approach for real-time control, especially in PMSM speed control. In this article, the traditional GrHDP approach is improved by replacing the action network with the SAN control approach. Unlike in the conventional SAN control approach, the parameter K of the action network (SAN) is not fixed and can be updated in real time through interaction with the controlled object.
The schematic diagram of FOC with the proposed SAN-GrHDP is shown in Figure 2. The ultimate objective of the SAN-GrHDP controller is still to solve Bellman's optimality equation [20,22], so that the optimal control strategy can be achieved. In this equation, $J^*(x, u)$ is the immediate cost incurred by $u$ at the current time, $J^*(x', u')$ refers to the one-step future cost, $\alpha$ is a discount factor ($0 < \alpha < 1$), and $r(x, u)$ is the external reinforcement signal. Compared with the conventional SAN control approach, the SAN-GrHDP approach has two additional networks (i.e., the reference network and the critic network). The reference network is related to the primary reinforcement signal $r(t)$ and generates the internal reinforcement signal $S(t)$, which helps the critic network to better approximate the value function. The critic network generates the cost function $J(t)$ according to $S(t)$.
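For orientation, a discounted Bellman optimality equation in the symbols defined above is commonly written as follows. This is a standard textbook form given as an assumption, not a transcription of the equation in the paper; in particular, taking the minimum over the next action $u'$ is an assumed convention.

```latex
J^{*}(x, u) = \min_{u'} \left\{ r(x, u) + \alpha \, J^{*}(x', u') \right\}
```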

Learning and Adaptation of Reference Network
The structure of the reference network is shown in Figure 3. The reference network is designed with a three-layer nonlinear architecture (including one hidden layer).

Figure 3. Structure of the reference network (input layer, hidden layer, output layer).

Energies 2018, 11, 2355

In the feed-forward propagation of the reference network, a(t) is the input vector, which has four elements: the error value e(t) at time t, the error value e(t − 1) at time t − 1, the action value u(t) at time t, and the action value u(t − 1) at time t − 1. $q_i(t)$ is the input of the ith hidden node of the reference network, $p_i(t)$ is the corresponding output of that hidden node, $N_f$ is the total number of hidden nodes, and S(t) is the output of the reference network.
The error function of the reference network and the objective function to be minimized are defined as in [25]. The weight updating rules are obtained by backpropagation through the chain rule [25]; they give the weight adjustments $\Delta w_f(t)$ of the reference network for the hidden-to-output layer and for the input-to-hidden layer.
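A minimal sketch of such a three-layer reference network is given below. It assumes bipolar sigmoid activations on both the hidden nodes and the output, which is common in the GrHDP literature but is not confirmed by the text above; the class name, the uniform random initialisation range and the generic upstream gradient `dE_dS` are illustrative assumptions. In a GrHDP setting the upstream gradient would come from the critic network.

```python
import math
import random


def bipolar_sigmoid(x):
    """(1 - exp(-x)) / (1 + exp(-x)), written as tanh(x / 2) for numerical safety."""
    return math.tanh(0.5 * x)


class ReferenceNetwork:
    """Three-layer network: 4 inputs -> n_hidden bipolar-sigmoid nodes -> S(t)."""

    def __init__(self, n_in=4, n_hidden=8, seed=0):
        rng = random.Random(seed)
        # Input-to-hidden and hidden-to-output weights, initialised randomly.
        self.w1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]

    def forward(self, a):
        """a = [e(t), e(t-1), u(t), u(t-1)]; returns S(t)."""
        self.a = list(a)
        self.q = [sum(w * x for w, x in zip(row, self.a)) for row in self.w1]
        self.p = [bipolar_sigmoid(q) for q in self.q]
        self.k = sum(w * p for w, p in zip(self.w2, self.p))
        self.S = bipolar_sigmoid(self.k)
        return self.S

    def gradients(self, dE_dS):
        """Chain rule from dE/dS back to both weight layers (call after forward)."""
        dS_dk = 0.5 * (1.0 - self.S ** 2)
        g2 = [dE_dS * dS_dk * p for p in self.p]
        g1 = [[dE_dS * dS_dk * w2 * 0.5 * (1.0 - p ** 2) * x for x in self.a]
              for w2, p in zip(self.w2, self.p)]
        return g1, g2

    def update(self, dE_dS, learning_rate):
        """One gradient-descent step: w <- w - l_f * dE/dw."""
        g1, g2 = self.gradients(dE_dS)
        self.w2 = [w - learning_rate * g for w, g in zip(self.w2, g2)]
        self.w1 = [[w - learning_rate * g for w, g in zip(row, grow)]
                   for row, grow in zip(self.w1, g1)]
```

The default of eight hidden nodes follows the setting reported later in the paper; everything else is a design choice of this sketch.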

Learning and Adaptation of Critic Network
The structure of the critic network is shown in Figure 4. It is designed with a three-layer nonlinear architecture (with one hidden layer).

Figure 4. Structure of the critic network (input layer, hidden layer, output layer).

In the feed-forward propagation of the critic network, c(t) is the input vector, which has five elements: the internal reinforcement signal S(t) (produced by the reference network), the error value e(t) at time t, the error value e(t − 1) at time t − 1, the action value u(t) at time t, and the action value u(t − 1) at time t − 1. $z_l(t)$ is the input of the lth hidden node of the critic network, $y_l(t)$ is the corresponding output of that hidden node, $N_c$ is the total number of hidden nodes, and J(t) is the output of the critic network.
The error function of the critic network and the objective function to be minimized are defined as in [19,21]. The weight updating rules are obtained by backpropagation through the chain rule [21]; they give the weight adjustments $\Delta w_c(t)$ of the critic network for the hidden-to-output layer and for the input-to-hidden layer.
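The critic network can be sketched in the same way. The example below assumes a linear output node, bipolar sigmoid hidden nodes, and a temporal-difference error of the form $e_c(t) = \gamma J(t) - [J(t-1) - S(t)]$ with objective $E_c = \frac{1}{2} e_c^2$. These forms are commonly used in the GrHDP literature and are stated here as assumptions, since the corresponding equations are not reproduced in the text; all names are hypothetical.

```python
import math
import random


class CriticNetwork:
    """Three-layer network: 5 inputs -> n_hidden bipolar-sigmoid nodes -> linear J(t)."""

    def __init__(self, n_in=5, n_hidden=8, seed=1):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]

    def forward(self, c):
        """c = [S(t), e(t), e(t-1), u(t), u(t-1)]; returns J(t)."""
        self.c = list(c)
        z = [sum(w * x for w, x in zip(row, self.c)) for row in self.w1]
        self.y = [math.tanh(0.5 * zl) for zl in z]  # bipolar sigmoid
        self.J = sum(w * y for w, y in zip(self.w2, self.y))
        return self.J

    def dJ_dinput(self, index):
        """Sensitivity of J to one input, e.g. index 0 for S(t) or 3 for u(t)."""
        return sum(w2 * 0.5 * (1.0 - y ** 2) * row[index]
                   for w2, y, row in zip(self.w2, self.y, self.w1))

    def update(self, e_c, gamma, learning_rate):
        """Gradient step on E_c = 0.5 * e_c**2, using dE_c/dJ(t) = gamma * e_c."""
        dE_dJ = gamma * e_c
        new_w2 = [w - learning_rate * dE_dJ * y for w, y in zip(self.w2, self.y)]
        self.w1 = [[w - learning_rate * dE_dJ * w2 * 0.5 * (1.0 - y ** 2) * x
                    for w, x in zip(row, self.c)]
                   for row, w2, y in zip(self.w1, self.w2, self.y)]
        self.w2 = new_w2


def critic_error(J_now, J_prev, S_now, gamma):
    """Assumed temporal-difference error: gamma * J(t) - [J(t-1) - S(t)]."""
    return gamma * J_now - (J_prev - S_now)
```

The method `dJ_dinput` is included because the sensitivities of J to S(t) and to u(t) are the quantities that a K-update rule routed through the critic would need.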

Learning and Adaptation of Action Network
The structure of the action network (SAN) is shown in Figure 5. The SAN is employed as the controller, which differs from the traditional GrHDP, in which the action network is a BP network. The feed-forward propagation formulas of the SAN are introduced as follows [27]:

$$x_1(t) = e(t) - e(t-1)$$

where u(t) is the output of the action network (SAN), which is applied to the controlled object directly, and $\eta_P$ and $\eta_I$ are the proportional and integral learning rates, respectively.
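A minimal sketch of an incremental single-neuron controller of this kind is shown below. Only the first input $x_1(t) = e(t) - e(t-1)$ is taken from the text; the second input $x_2(t) = e(t)$, the supervised Hebb weight update, the weight normalisation and the optional output limit are assumptions based on the common textbook form of the single-neuron PI controller, and the class name is hypothetical.

```python
class SingleNeuronController:
    """Incremental PI-type single artificial neuron (SAN) with supervised Hebb learning."""

    def __init__(self, K, eta_p, eta_i, w_p=0.5, w_i=0.5, u_limit=None):
        self.K = K                      # neuron scale-up factor, K > 0
        self.eta_p, self.eta_i = eta_p, eta_i
        self.w = [w_p, w_i]             # connection weights
        self.u_limit = u_limit          # optional symmetric saturation (assumed)
        self.e_prev = 0.0
        self.u_prev = 0.0

    def step(self, e):
        """Compute u(t) from the current speed error e(t)."""
        x1 = e - self.e_prev            # from the source
        x2 = e                          # assumed integral-type input
        # Supervised Hebb rule (assumed form): dw_i = eta_i * e(t) * u(t-1) * x_i(t)
        self.w[0] += self.eta_p * e * self.u_prev * x1
        self.w[1] += self.eta_i * e * self.u_prev * x2
        norm = abs(self.w[0]) + abs(self.w[1])
        wn = [w / norm for w in self.w] if norm > 0.0 else [0.0, 0.0]
        # Incremental output scaled by K
        u = self.u_prev + self.K * (wn[0] * x1 + wn[1] * x2)
        if self.u_limit is not None:
            u = max(-self.u_limit, min(self.u_limit, u))
        self.e_prev, self.u_prev = e, u
        return u
```

In this form K multiplies the whole control increment, which is why a larger K gives a faster response and an excessive K can destabilise the loop.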
The parameter K, named the neuron scale-up factor (K > 0), is very important to the control response performance. The selection of K is very difficult in the traditional SAN control approach. The control system responds faster if the K value is greater. However, if K is outside a certain range, it leads to instability of the system.
The key point of the SAN-GrHDP approach is to use the approximate value function J from the critic network to adjust and optimize the K value. Define "0" as the reinforcement signal for "success" and "−1" for "failure", so $U_c(t)$ is set to "0" in the following studies.
For backpropagation, the error function $e_a(t)$ is defined as follows [19]:

$$e_a(t) = J(t) - U_c(t)$$

and the objective function to be minimized as:

$$E_a(t) = \frac{1}{2} e_a^2(t)$$

For backward propagation, the error function of the reference network is related not only to the primary reinforcement signal r(t) but also to the internal reinforcement signal S(t).
When backpropagating through the chain rule, the error function of the critic network involves the internal reinforcement signal S(t), and the signal S(t) from the reference network is related to the primary reinforcement signal r(t). The updating rule of the parameter K is therefore composed of two parts: one from the critic network path and the other from the reference network path.
In the detailed learning and adaptation formulas, $l_a(t)$ is the learning rate of the parameter K. Finally, the gradient descent rule is selected as the tuning method for the parameter K.
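The two-path structure of the K update can be sketched as a single gradient-descent step. The example assumes $E_a = \frac{1}{2}(J - U_c)^2$ and a chain rule of the form $\partial E_a / \partial K = e_a \, (\partial J/\partial u + \partial J/\partial S \cdot \partial S/\partial u) \, \partial u/\partial K$; the function name, argument names and the lower bound `K_min` that keeps K positive are hypothetical and are not taken from the paper.

```python
def update_K(K, J, dJ_du, dJ_dS, dS_du, du_dK, l_a, U_c=0.0, K_min=1e-6):
    """One gradient-descent step on K through the critic and reference paths."""
    e_a = J - U_c                          # action error, with U_c = 0 for "success"
    critic_path = dJ_du                    # direct effect of u(t) on J(t)
    reference_path = dJ_dS * dS_du         # effect of u(t) on J(t) via S(t)
    grad = e_a * (critic_path + reference_path) * du_dK
    return max(K - l_a * grad, K_min)      # assumed clamp so that K stays positive
```

As a usage illustration, with K = 1.0, J = 0.5, path sensitivities 0.2 and 0.4 × 0.5, ∂u/∂K = 2.0 and a learning rate of 0.1, the gradient is 0.5 × 0.4 × 2.0 = 0.4 and the updated value is 0.96.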

Reinforcement Signal Design of Speed Controller
The SAN-GrHDP controller is a real-time controller with immediate online learning from the surroundings, and its overall performance depends upon the design of the input, output and reinforcement signal.
The input signal of the controller is designed as follows:

$$e(t) = \omega^*(t) - \omega(t)$$

where $\omega(t)$ is the actual angular velocity of the PMSM (obtained from the encoder) at time t, and $\omega^*(t)$ is the reference angular velocity. The output signal of the controller is $i_q^*$. The cost-to-go function (reinforcement signal) is designed as follows:

$$r(t) = 0.98\,e(t) + 0.02\,e(t-1)$$

Conventional controller designs are primarily based on linear analysis tools such as eigenvalue analysis, Bode diagrams, Nyquist diagrams and so on. In contrast, the SAN-GrHDP relies on online learning to adjust its parameters so as to reduce the reinforcement signal. Owing to the approximation capability of the neural network, it is more likely to find the proper mapping between the input and output signals and so to withstand disturbances in the PMSM parameters. The critic network is used to approximate the cost-to-go function (reinforcement signal) r(t) in the Bellman optimality equation of dynamic programming [20], which is given in Equation (2). The reference network is integrated into the typical ADP structure to approximate an internal reinforcement signal S(t). The internal reinforcement signal interacts with the operation of the critic network [21] and can better facilitate optimal learning and control over time to accomplish goals [30].
It is known that the initial parameters are significant for the performance of the SAN-GrHDP controller. Table 1 shows the parameter settings of the proposed SAN-GrHDP approach, where $l_a(t)$ is the learning rate of the action network, $l_f(t)$ is the learning rate of the reference network, and $l_c(t)$ is the learning rate of the critic network. The learning rate of the reference network is usually set to the same value as that of the critic network. If these two learning rates are too large, the controller becomes unstable; if they are too small, the controller converges slowly. When training offline, these two learning rates can be set larger so that the weights of the two networks are obtained rapidly. After offline training, they can be set slightly lower, which enhances the stability of the controller. The selection of K is very difficult in the traditional SAN control approach. The control system responds faster if the K value is greater. However, if K is outside a certain range, it leads to instability of the system. The rate of variation of K is decided by the learning rate of the action network, which is usually set according to the experimental process. $\alpha$ is the discount factor of the reference network, and $\gamma$ is the discount factor of the critic network. The discount factor determines how much moment t affects the previous moment t − 1. If the discount factor is set too small, the effect of the reinforcement learning signal at the current moment is small; otherwise, the effect is large. The discount factors are usually set between 0.95 and 0.99. $N_f$ is the number of hidden nodes of the reference network, and $N_c$ is the number of hidden nodes of the critic network. Both are set to 8.
The more hidden nodes, the better the performance of the controller. However, more hidden nodes require a more powerful processor. According to experimental research, the number of hidden nodes is set to 8, so that the computing speed of the DSP28335 processor is acceptable. For a more detailed description of the process for setting the parameters of the ADP method, readers may refer to relevant works [22]. The ADP approach learns through interaction with the controlled object (the vector control system of the PMSM). Using the evaluation value J of the critic network, the gradient descent rule is applied to guide the selection of the SAN controller's K value; the detailed learning and adaptation are shown in Equations (27)-(33). The selection of the K value is used to promote the rapid convergence of the J value. The appropriate K value is selected and applied to the SAN (action network), and the optimal control value is output directly to the vector control system of the PMSM. The detailed calculation process is shown in Equations (23)-(26). The SAN-GrHDP optimal control output signal is the q-axis current reference value $i_q^*$ of the vector control system of the PMSM. The weights of the reference network and critic network in the SAN-GrHDP approach are initialized randomly. For comparative studies, the parameters of the SAN approach are set the same as those of the SAN-GrHDP approach.

Learning Process of Single Artificial Neuron Goal Representation Heuristic Dynamic Programming Speed Controller for Permanent Magnet Synchronous Motor
In the field oriented control system of a PMSM, the speed error is usually chosen as the input signal of the speed controller. In the SAN-GrHDP controller, the previous control output is used as a supplementary input signal, so the controller inputs are the error value e(t) at time t, the error value e(t − 1) at time t − 1, the error value e(t − 2) at time t − 2, and the previous control output value u(t − 1); the controller output is u(t). The controller parameters are updated accordingly by online learning. The data flowchart is shown in Figure 6, and the algorithm training process is described as follows:
(1) Initialize the parameters of the SAN-GrHDP, such as the neural network learning rates, the initial weight values of the neural networks, the discount factors and so on.
(2) Observe the speed error and obtain the control signal u(t), which is the q-axis current reference value for the control system of the PMSM.
(3) Calculate the internal reinforcement learning signal S(t) and the value function signal J(t).
(4) Retrieve the previous data S(t − 1) and J(t − 1), calculate the temporal difference errors and obtain the objective functions of the reference network and the critic network.
(5) Update the weights of the reference network and the critic network, and the K value of the action network (SAN).
(6) Repeat from step (2) when entering time step t + 1.
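The order of steps (2) to (6) can be sketched as a loop skeleton. The example assumes that step (1) has already been performed when the `agent` object is constructed, and it uses temporal-difference errors of the form $e_f = \alpha J(t) - [J(t-1) - r(t)]$ and $e_c = \gamma J(t) - [J(t-1) - S(t)]$, which are assumptions rather than the paper's equations; under these forms only J(t − 1) needs to be retrieved from the previous step. The `agent` interface (`act`, `reference`, `critic`, `learn`) and the two callbacks are hypothetical.

```python
def run_san_grhdp(read_speed, apply_current, agent, omega_ref, n_steps,
                  alpha=0.95, gamma=0.95):
    """Run steps (2)-(6) of the training process for n_steps speed-loop samples."""
    e_prev = u_prev = J_prev = 0.0
    history = []
    for _ in range(n_steps):
        # (2) Observe the speed error and produce the q-axis current reference.
        e = omega_ref - read_speed()
        u = agent.act(e, e_prev, u_prev)
        apply_current(u)
        # (3) Internal reinforcement signal S(t) and value estimate J(t).
        S = agent.reference([e, e_prev, u, u_prev])
        J = agent.critic([S, e, e_prev, u, u_prev])
        # (4) Temporal-difference errors (assumed GrHDP-style forms).
        r = 0.98 * e + 0.02 * e_prev
        e_f = alpha * J - (J_prev - r)
        e_c = gamma * J - (J_prev - S)
        # (5) Update the reference weights, the critic weights and K.
        agent.learn(e_f, e_c, J)
        # (6) Shift the stored values and move to the next time step.
        e_prev, u_prev, J_prev = e, u, J
        history.append((e, u, S, J))
    return history
```

Passing the plant as two callbacks keeps the skeleton independent of whether the speed comes from a simulation model or from an encoder on real hardware.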

Simulation and Experimental Results
The weights of the reference network and critic network in the SAN-GrHDP approach are initialized randomly. For comparative studies, the parameters of the SAN control approach are set the same as those of the SAN-GrHDP approach. To check the overall performance of the SAN-GrHDP control approach, simulations and experiments on the speed control system of a PMSM are carried out.

Simulation Results
To compare the disturbance rejection performance of both approaches, comparative simulations of the proposed SAN-GrHDP control approach and the traditional SAN control approach are implemented in Simulink, Matlab2016a (MathWorks, Natick, MA, USA). The parameters of the PMSM used in the simulation are listed in Table 2. The parameters of both current PI controllers are the same: the proportional coefficient is 9 and the integral coefficient is 3375. The saturation limit of the q-axis reference current $i_q^*$ is ±10 A. The initial load of the PMSM is 0.2 N·m. Figures 7 and 8 show the simulation responses under the SAN and SAN-GrHDP approaches in the presence of a load torque disturbance at 1300 rpm and 800 rpm, respectively. Figure 7a shows that the SAN-GrHDP-based controller gives the same settling time and the same overshoot as the SAN-based controller for a reference speed of 1300 rpm, and Figure 8a shows the same for a reference speed of 800 rpm. It can also be seen that, when a load torque of 0.5 N·m is applied at 0.1 s, the SAN-GrHDP approach has less speed fluctuation than the traditional SAN approach.
Figure 7b shows the q-axis current response under the SAN and SAN-GrHDP approaches in the presence of a load torque disturbance at 1300 rpm. The q-axis current $i_q$ is quite large at the moment the PMSM starts. The reference $i_q^*$ is much less than 10 A, which is the saturation limit of the output. Once the speed is steady, the actual q-axis current $i_q$ decreases to the reference q-axis current $i_q^*$. It can also be seen that, when a load torque of 0.5 N·m is applied at 0.1 s, the actual q-axis current $i_q$ of both approaches rises quickly under the sudden load disturbance. However, the SAN-GrHDP approach has less current fluctuation than the traditional SAN control approach. Figure 8b shows the q-axis current response under the SAN and SAN-GrHDP approaches in the presence of a load torque disturbance at 800 rpm. It can be seen from Figures 7b and 8b that, when the same load torque of 0.5 N·m is added suddenly at different speeds, the q-axis current response at 1300 rpm is the same as at 800 rpm.
The evolution of the neural network parameters in the SAN-GrHDP controller at 1300 rpm is presented in Figure 7c-e. Figure 7c shows the trajectory of the parameter K. At the load disturbance time (0.1 s), the neural network weights adapt dramatically, which is consistent with the large adjustments in the reinforcement signals shown in Figure 7d,e. The reason is that, despite the sudden load change, the controller keeps learning from its environment, so that it adapts its parameters to provide the most suitable control signal and returns the system to its normal working point. The evolution of the neural network parameters in the SAN-GrHDP controller at 800 rpm is presented in Figure 8c-e.

D-axis Inductance
Figure 7b shows that the q-axis current response under SAN and SAN-GrHDP approaches in the presence of load torque disturbance at 1300 rpm.It shows that the q-axis current q i is quite large at the moment of the start of PMSM.The * q i is much less than 10 A, which is the saturation limit of the output.As the speed is steady, the actual q-axis current q i decreases down to reference q-axis current * q i .It can also be seen that, when a load torque 0.5 N•m is applied at 0.1 s, the actual q-axis current q i of both approaches rise quickly under the sudden load disturbance impact.
However, the SAN-GrHDP approach has less current fluctuation than the traditional SAN control approach.Figure 8b shows that the q-axis current response under SAN and SAN-GrHDP approaches in the presence of load torque disturbance at 800 rpm.It can be seen from the Figures 7b  and 8b, when the same load torque 0.5 N•m is added suddenly at different speed, the q-axis current response at 1300 rpm is same as 800 rpm.
The evolution of the neural network parameters in the SAN-GrHDP controller at 1300 rpm is presented in Figure 7c-e. Figure 7c shows the trajectory of the parameter K. At the load disturbance time (0.1 s), the neural network weights adapt sharply, which is consistent with the large adjustments in the reinforcement signals shown in Figure 7d,e. The reason is that, after the sudden load change, the controller keeps learning from its interaction with the system and adapts its parameters so that the control signal returns the system to its normal operating point. The evolution of the neural network parameters in the SAN-GrHDP controller at 800 rpm is presented in Figure 8c-e.

Experimental Results
An experimental platform for a PMSM was built to evaluate the overall performance of the proposed SAN-GrHDP control approach. The configuration and the experimental test setup are shown in Figures 9 and 10, respectively. All the control algorithms, including the SVPWM technique, are implemented in the C language on the floating-point DSP TMS320F28335 with a clock frequency of 150 MHz. The current-loop sampling period is 200 µs, and the speed-loop sampling period is ten times that of the current loop. The saturation limit of the q-axis reference current is ±10 A. The PMSM is driven by an intelligent power module (IPM) PS21965, designed by the Mitsubishi Company (Tokyo, Japan). The phase currents are measured by Hall sensors, converted to voltages by sampling resistors, and digitized by an AD7606 converter. The rotor speed and absolute rotor position are measured by an incremental position encoder with 2500 lines. The speed and q-axis current signals are displayed on the oscilloscope through a DAC (AD5344) output.
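The timing and scaling figures quoted above can be checked with a short calculation. This is only a sketch: the sampling periods, the 2500-line encoder, and the ±10 A limit come from the text, whereas the ×4 quadrature decoding factor and all function and constant names are assumptions made for illustration.

```python
CURRENT_LOOP_PERIOD_S = 200e-6   # current-loop sampling period (from the text)
SPEED_LOOP_DIVIDER = 10          # speed loop runs ten times slower (from the text)
ENCODER_LINES = 2500             # incremental encoder resolution (from the text)
QUADRATURE_FACTOR = 4            # ASSUMPTION: x4 quadrature decoding
IQ_REF_LIMIT_A = 10.0            # q-axis reference current saturation (from the text)

SPEED_LOOP_PERIOD_S = CURRENT_LOOP_PERIOD_S * SPEED_LOOP_DIVIDER  # 2 ms
COUNTS_PER_REV = ENCODER_LINES * QUADRATURE_FACTOR                # 10000 under the assumption


def speed_rpm_from_counts(delta_counts: int) -> float:
    """Estimate speed from the encoder count change over one speed-loop period."""
    revolutions = delta_counts / COUNTS_PER_REV
    return revolutions / SPEED_LOOP_PERIOD_S * 60.0


def saturate_iq_ref(iq_ref: float) -> float:
    """Clamp the q-axis reference current to the +/-10 A limit."""
    return max(-IQ_REF_LIMIT_A, min(IQ_REF_LIMIT_A, iq_ref))
```

Under these assumptions, one encoder count per 2 ms speed-loop period corresponds to about 3 rpm, so a speed estimate obtained by simple count differencing would be quantized in steps of roughly that size.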
The parameters of both current PI controllers are the same: the proportional coefficient is 0.2 and the integral coefficient is 0.006. The parameters of the SAN are as follows: η_P = 0.05, η_I = 0.05. The initial value of the scale-up factor is K = 0.01. The parameters of SAN-GrHDP are shown in Table 1; the parameters of the action network are the same as those of the SAN control approach.
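To make the role of η_P, η_I and K concrete, a single-neuron speed controller can be sketched as follows. The update law is not restated in this section, so the sketch assumes the supervised Hebb rule with normalized weights that is commonly paired with single-neuron controllers; it may differ from the authors' implementation. The class name, the initial weights of 0.5, and the error scaling are hypothetical. In this sketch K is held fixed, as in the conventional SAN.

```python
class SingleNeuronSpeedController:
    """Incremental single-neuron controller with proportional and integral inputs.

    ASSUMPTION: supervised Hebb learning with normalized weights; this is a
    common textbook form, not necessarily the law used in the paper.
    """

    def __init__(self, eta_p=0.05, eta_i=0.05, k=0.01, limit=10.0):
        self.eta_p = eta_p      # proportional learning rate (from the text)
        self.eta_i = eta_i      # integral learning rate (from the text)
        self.k = k              # scale-up factor K, initial value from the text
        self.limit = limit      # saturation of the q-axis current reference, in A
        self.w_p = 0.5          # hypothetical initial weights
        self.w_i = 0.5
        self.prev_error = 0.0
        self.u = 0.0            # controller output: q-axis current reference

    def step(self, speed_ref: float, speed: float) -> float:
        e = speed_ref - speed
        x_p = e - self.prev_error   # proportional input (incremental form)
        x_i = e                     # integral input (incremental form)

        # Normalize the weights so that K alone sets the overall gain.
        norm = (abs(self.w_p) + abs(self.w_i)) or 1.0
        du = self.k * (self.w_p * x_p + self.w_i * x_i) / norm
        self.u = max(-self.limit, min(self.limit, self.u + du))

        # Supervised Hebb update: each weight moves with error * output * input.
        self.w_p += self.eta_p * e * self.u * x_p
        self.w_i += self.eta_i * e * self.u * x_i
        self.prev_error = e
        return self.u
```

Because the weights are normalized, the overall loop gain in this form is set by K, which is why a fixed K limits the conventional SAN and why the proposed approach adjusts K online.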
Figure 11 shows the experimental response curves of speed and i_q under a sudden load disturbance with the SAN control approach at 1300 rpm. Figure 12 shows the experimental response curves of speed and i_q under the same sudden load disturbance with the SAN-GrHDP approach at 1300 rpm. Figure 11 shows that the speed under the SAN approach fluctuates greatly when the load is added. It can be inferred from Figure 11 that the control effect of SAN can be improved by applying machine learning (GrHDP) to tune the K value. The proposed control strategy quickly stabilizes the speed when the load is added. Figures 13 and 14 show the comparative experimental response curves with the SAN and the proposed SAN-GrHDP approach at 800 rpm, respectively. The experimental results in Figures 13 and 14 are similar to those in Figures 11 and 12. The experimental results differ somewhat from the simulation results. The reason is that the PMSM model used in simulation is ideal and deviates from the practical system. In the experiments, the steady-state speed fluctuation is greater than in simulation. The proposed SAN-GrHDP approach is a kind of machine learning algorithm (adaptive dynamic programming, ADP). It learns by itself according to the characteristics of the environment. Therefore, the neural network weights obtained in the experiments differ from those obtained in simulation, which also contributes to the disparities between simulation and experimental results. Compared with the SAN control approach, the proposed SAN-GrHDP approach shows better disturbance rejection, with much less speed fluctuation and a shorter recovery time after a load disturbance.

Constants
∆w_f^(2): The weight adjustments of the reference network from the hidden to the output layer
∆w_f^(1): The weight adjustments of the reference network from the input to the hidden layer
x_c: Input vector of the critic network
z_l: lth hidden node input of the critic network
y_l: lth hidden node output of the critic network
∆w_c^(2): The weight adjustments of the critic network from the hidden to the output layer
l_c: Learning rate of the critic network

Figure 1. The field oriented control (FOC) diagram of the permanent magnet synchronous motor (PMSM) system using the i_d = 0 control approach.

Figure 2. Schematic diagram of FOC by the proposed single artificial neuron goal representation heuristic dynamic programming (SAN-GrHDP).

Figure 3. Schematic diagram of the reference network.

Figure 4. Schematic diagram of the critic network.

Figure 5. Schematic diagram of the action network (SAN).

Figure 8. Simulation responses under SAN and SAN-GrHDP approaches in the presence of load torque disturbance at 800 rpm. (a) Speed. (b) i_q. (c) K value of the SAN-GrHDP approach. (d) S value of the SAN-GrHDP approach. (e) J value of the SAN-GrHDP approach.

Figure 9. Configuration of the experimental system.

Figure 11. Experimental responses under SAN in the presence of load torque disturbance at 1300 rpm. (a) Speed; and (b) i_q.
∆w_c^(1): The weight adjustments of the critic network from the input to the hidden layer
η_P: Proportional learning rate of SAN
η_I: Integral learning rate of SAN
l_a: Learning rate of the parameter K
ω: Actual angular velocity of PMSM
ω*: Reference angular velocity of PMSM
l_f: Learning rate of the reference network

Moreover, comparative experiments with the original SAN and the SAN-GrHDP approaches are performed on the speed control of a PMSM under the same conditions and parameters. The experimental results verify that SAN-GrHDP improves the control effect by interacting with the controlled object and is considerably more robust than SAN to sudden load changes and load disturbances.

Table 1. Parameter settings of the SAN-GrHDP approach.

Table 2. Parameter settings of the PMSM.