Multivariable System Identification Method Based on Continuous Action Reinforcement Learning Automata

: In this work, a closed-loop identification method based on a reinforcement learning algorithm is proposed for multiple-input multiple-output (MIMO) systems. This method could be an attractive alternative solution to the problem that the current frequency-domain identification algorithms are usually dependent on the attenuation factor. With this method, after continuously interacting with the environment, the optimal attenuation factor can be identified by the continuous action reinforcement learning automata (CARLA)


Introduction
With the rapid development of modern industry, it becomes increasingly difficult for the traditional model control methods to properly control complex process due to the uncertainty, time-delay, multivariable coupling, and constraints between input and output.It is a challenge for the traditional identification methods to obtain optimal results, particularly in multivariable systems, due to their complex structure, various parameters, and time-varying in industrial applications.Methods for the identification of multivariable systems go back to the 1960s, but most identification methods require the observation to be noise-free.This situation together with their heavy computational cost makes them difficult to be applied in practice [1].In view of the above problems, many researchers proposed replacing the state space model with a polynomial matrix to describe the multivariable system.In 1975, Guidozi [2] proposed the equivalence relation between the observable canonical form of system state and the canonical form of an input-output difference equation, and established the mapping relation between the parameters of the state equation and the observed input-output data.Subsequently, some researchers proposed row subspace identification methods based on the Hankel matrix.In these methods, the first step is to obtain the augmented observability matrix (or state sequence) of the system, and then the parameter matrix of each subspace is calculated.The main representative methods include multivariable output error state space (MOESP) [3,4], numerical algorithms for subspace state-space system identification (N4SID) [5,6], and canonical variate analysis (CVA) [7,8].At first, these methods of subspace identification were difficult to apply online, and were very computationally intensive in recursive decomposition.With the development of technology, they have been widely applied in industrial practice due to the advantage that they do not depend on any prior knowledge in the identification of the multivariable system.However, the objective of multivariate identification methods based on state space models is not to find the optimal solution, but the Pareto suboptimal solutions, so there is substantial room for improvement in the performance of these methods.
Methods for the identification of multivariable systems based on state space have large computational demand, take a long time, and it is difficult for them to achieve the global optimal solution.Therefore, many researchers have focused on identification methods based on the transfer function matrix or other models (i.e., to transform the state space model into a transfer function model through the Laplace transform).Researchers have proposed several methods for the transfer function n matrix modeling of multivariable systems, such as the instrumental model method, the sub-sub model recursive method, the combined identification algorithm (CIA), the extended least squares method, and the multi-innovation recursive identification method, which have greatly promoted the direction of multivariable system identification based on the transfer function matrix model.With the instrumental model, the identification of the transfer function model can be decomposed into several sub-models, and it has attracted increasing attention.In [9], the least squares method was used to eliminate the bias in a multi-input single-output (MISO) system, but this method is still difficult to apply online, and it has harsh preconditions and assumptions.Ding et al. [10] proposed a bias-compensation-based recursive least-squares algorithm to solve the problem of the identification in non-stationary systems.On the basis of the nearest Kronecker product and low-rank approximation, Camelia et al. [11] proposed a low-complexity recursive least-squares (RLS) algorithm, which has good robustness against additive noise and good identification effect.The instrumental model [12] can solve the problem of the unknown information vector in the model.Du et al. [13] proposed a robust output error model identification method, and used the auxiliary model to estimate the noiseless output under random noise, which is suitable for the time-delay industrial process under load disturbance.In order to eliminate the interference of abnormal points in the observation, the multi-innovation concept was introduced to accelerate the convergence of the model [14].Based on the sub-model (MISO), Li et al. [15] proposed a sequential excitation method for a multiple-input multiple-output (MIMO) system.With sequential excitation signals, the multiple-input multiple-output (MIMO) system could be decomposed into several equivalent single-input single-output (SISO) systems in an open-loop control.
In recent years, the open-loop step response test has been a common method in the field of system identification, and is simple to operate and easy to implement.However, there are still some issues: (1) Open-loop identification is usually not permitted in field systems.Considering the factors of safety and economy, the output of the controller is required within a limited range once the system is running.(2) The input signal is usually limited to a step-function signal, and the test time is limited to the transition time.It is often affected by external disturbance and the change of internal working conditions.Therefore, authors in [16][17][18][19] proposed frequency response estimation methods for the identification of the closed-loop system.However, the existing methods still have some limitations (e.g., the accuracy depends on the choice of attenuation factor).The value of the attenuation factor is usually given according to prior knowledge.Jin et al. [20] proposed an intelligent searching algorithm to obtain the range of the attenuation factor, and achieved good results.However, for large-scale complex systems, it is time-varying and the attenuation factor easily falls into local premature convergence.
Learning automata (LA) have potential applications in system control.Li et al. [21] proposed a coral reef algorithm based on LA for the coverage control problem of heterogeneous directional sensor networks.Mohammed et al. [22] developed a fuzzy maximum power point tracking controller using the information collected by LA through the learning process.Thus, a novel method based on a reinforcement learning algorithm, namely continuous action reinforcement learning automata (CARLA), is presented here to solve the issue mentioned above.After continuously interacting with the environment, an optimal attenuation factor can be achieved by CARLA, and then the system parameters of the system can be estimated.Moreover, the proposed method can be used online and applied to time-varying systems due to its online learning ability.

Background
Frequency response methods can be applied to the parameter estimation of open-loop and closed-loop systems, but their accuracy depends on the value of the attenuation factor, which is usually obtained from prior knowledge.Some researchers [20,23] proposed the use of heuristic search algorithms (e.g., particle swarm optimization (PSO) and its improved methods) to obtain the optimal value.However, it is difficult to solve the issue of the on-line identification of time-varying systems.An online adaptive learning method is required to find the optimal solution of the parameters within the contiguous space.To solve these problems, a frequency response estimation (FRE) method based on continuous action reinforcement learning automata (CARLA-FRE) is proposed in this paper.

Basic Reinforcement Learning
As an important branch of machine learning, reinforcement learning (RL) [24] interacts with the environment actively and constantly, updates iterations based on feedback, and finally gives the optimal strategy.It contains the main elements of agent, state, action, and rewards, and its learning target is to obtain the optimal strategy to maximize the long-term cumulative rewards.As shown in Figure 1, the most important feature of reinforcement learning is the capability of autonomous and online learning without any prior knowledge and state transition probability.Firstly, the agent perceives the state of the environment and takes various exploratory actions according to the accumulative compensation.After taking the action, the environment undergoes a state transition and enters into a new state.At the same time, the behavior strategy is evaluated, and the feedback is returned to the learning system.After receiving the feedback of the reward or the punishment, the agent modifies its strategies continuously to meet the requirements of the environment, and the whole process is iteratively updated until the optimal strategy is obtained.

Background
Frequency response methods can be applied to the parameter estimation of open-loop and closed-loop systems, but their accuracy depends on the value of the attenuation factor, which is usually obtained from prior knowledge.Some researchers [20,23] proposed the use of heuristic search algorithms (e.g., particle swarm optimization (PSO) and its improved methods) to obtain the optimal value.However, it is difficult to solve the issue of the on-line identification of time-varying systems.An online adaptive learning method is required to find the optimal solution of the parameters within the contiguous space.To solve these problems, a frequency response estimation (FRE) method based on continuous action reinforcement learning automata (CARLA-FRE) is proposed in this paper.

Basic Reinforcement Learning
As an important branch of machine learning, reinforcement learning (RL) [24] interacts with the environment actively and constantly, updates iterations based on feedback, and finally gives the optimal strategy.It contains the main elements of agent, state, action, and rewards, and its learning target is to obtain the optimal strategy to maximize the long-term cumulative rewards.As shown in Figure 1, the most important feature of reinforcement learning is the capability of autonomous and online learning without any prior knowledge and state transition probability.Firstly, the agent perceives the state of the environment and takes various exploratory actions according to the accumulative compensation.After taking the action, the environment undergoes a state transition and enters into a new state.At the same time, the behavior strategy is evaluated, and the feedback is returned to the learning system.After receiving the feedback of the reward or the punishment, the agent modifies its strategies continuously to meet the requirements of the environment, and the whole process is iteratively updated until the optimal strategy is obtained.

Continuous Action Reinforcement Learning Automata (CARLA)
The target of system identification is to obtain the appropriate attenuation factor, which is a problem of finding the optimal action.Because the action space is continuous, a continuous action reinforcement learning automata (CARLA) [25,26] is proposed in this paper.Compared with other algorithms, CARLA can use the probability density function to select behavior in continuous space with a stochastic or unknown system model.The system learns interactively with the environment in a trial-and-error manner, and gets better behavior strategies by strengthening signals, increasing the probability of the action by strategy iteration, and finally obtains the optimal parameters online [27][28][29][30].
For CARLA, each action  is a mapping of a cumulative probability density function (CPDF), registered as ( ) With reinforcement signal β , the density functions are updated many times, and the optimal decision variables with the maximum of the corresponding CPDF is obtained.During

Continuous Action Reinforcement Learning Automata (CARLA)
The target of system identification is to obtain the appropriate attenuation factor, which is a problem of finding the optimal action.Because the action space is continuous, a continuous action reinforcement learning automata (CARLA) [25,26] is proposed in this paper.Compared with other algorithms, CARLA can use the probability density function to select behavior in continuous space with a stochastic or unknown system model.The system learns interactively with the environment in a trial-and-error manner, and gets better behavior strategies by strengthening signals, increasing the probability of the action by strategy iteration, and finally obtains the optimal parameters online [27][28][29][30].
For CARLA, each action x is a mapping of a cumulative probability density function (CPDF), registered as f (x).With reinforcement signal β, the density functions are updated many times, and the optimal decision variables with the maximum of the corresponding CPDF is obtained.During the process of the iteration, the reinforcement signal is determined by the evaluation function of the last iteration.Therefore, the whole process of learning and updating is always optimized in the direction of better results, and the ultimate goal is to achieve long-term rewards rather than one-step rewards, so as to ensure the global optimality.The learning process is as follows: CARLA algorithm 1: Initialize the probability density function f 0 (x i ): establish the uniform distribution of CPDFs according to the range of the parameter; 2: Actions selection: select actions (or parameters) randomly based on the CPDF value; 3: System evaluation: take the action, substitute parameters into the system to obtain the responding curve, and calculate the fitness function J(x i ); 4: Calculate the enhanced signal value β according to the value of the fitness function; 5: Update each CPDF value according to the enhanced signal value; 6: Update behavior parameters: introduce the normal random number generator to update the action parameters at the next moment; 7: If the stopping condition has not been reached, return to step 2 until the convergence condition is met.

Frequency Response Estimation Based on CARLA (CARLA-FRE)
In order to improve learning efficiency of the algorithm, the threshold value of attenuation factor obtained in [31] is used as the initial region of CARLA (i.e., x i ∈ [x i min , x i max ]).Then, using the powerful interactive learning ability of CARLA, the optimal parameters can be obtained online to improve the accuracy and effectiveness of frequency response estimation.The learning process is shown in Figure 2, and the details of the algorithm are as follows: (1) Selection of the test signal: The input functions with the characteristics of continuous secondorder derivability can be used as the excitation signals of frequency response estimation.Therefore, there are many choices of test signal, including the step signal r(t) = c, the pulse signal r(t) = 1/δ(t), the exponential attenuation signal r(t) = e −kt , and the composite function r(t) = te −kt .
(2) Selection of the test mode: This includes open-loop identification and closed-loop identification, according to the process requirements and environmental conditions.If the system is asymptotically stable and the external disturbance is small or the disturbance signal is regular, it may be a good choice to adopt a simple open-loop testing mode.If the process requires high safety and stability, large shutdown loss, and sensitive to external disturbance, it will be necessary to choose the closed-loop approach.
(3) Selection of the model structure: The establishment of a system model includes structure selection and parameter identification.The former selects the model structure according to the characteristics of the process, including model order and time delay.The latter estimates model parameters on the basis of the model structure.This paper focuses on the problem of model parameter estimation.
Steps ( 1)-(3) can be regarded as the preparation stage of the test, which can seriously affect the accuracy of system identification.
(4) Analysis of system parameters: This part focuses on the frequency response estimation method.The mathematical expressions of the relationship between system parameters and input-output can be given by analyzing the transfer function of the system.Then, the frequency response of the expression can be obtained by substituting s = a + jw, where a is the attenuation factor.With the appropriate attenuation factor a, the parameter estimation can be obtained.The traditional approach is to give the value of a directly according to experiential knowledge, or using a heuristic searching algorithm.However, in practice, the system often has time-varying characteristics, and the attenuation factor usually falls into local premature convergence.Therefore, an online learning method is required to acquire appropriate parameters adaptively.

The Applications of CARLA-FRE in MIMO Systems
In industrial applications, due to the complex structure, various parameters, and the coupling relationship between the loops of multivariable systems, the traditional identification methods are difficult to use effectively in multivariable systems.Li [15] proposed a multivariable system identification method based on sequential step signals.With the sequential step method, the multiple-input multiple-output (MIMO) system could be equivalently decomposed into several single-input single-output (SISO) systems.Then, the analytic expressions of model parameters of these sub-systems could be obtained by frequency response estimation.The simulation results reveal that the presented approach could match the requirements of identification accuracy both in square and non-square systems.(5) Online optimization of attenuation factor based on CARLA: a. Initialize the attenuation factor a(0).According to the literature, the range of the attenuation factor can be set as a ∈ [a min , a max ] in advance.For each iteration k, the action a(k) can be selected according to the probability density function f (a, k).In the beginning, the probability density function f (a, k) can be initialized as in Formula (1): b. Select the attenuation factor.Select the attenuation factor a according to the value of the probability distribution function (as Formulas ( 2) and (3))-the attenuation factor a with the maximum value of probability distribution function f (a, k) is the best. (2) c. Calculate the cost function J(a, k).According to the cost function, the cost function value under the current attenuation factor a will be calculated as Formula (4).It is the integral square error (ISE) of the real output and the estimated output of the system.
d. Calculate the enhancement signal β(k).By substituting the cost value J(a, k), average cost value J mean , and minimum cost value J min into Formula (6), the enhancement signal β(k) can be calculated to measure the performance of the evaluation.The closer the value of β(k) is to 1, the better the performance of the system under the attenuation factor.On the contrary, the closer the value of β(k) is to 0, the worse the performance of the system under the attenuation factor.
e. Update the value of the probability density function corresponding to the current attenuation factor a. After the system takes action to change the environment, it will give feedback to this policy, and update the probability density function corresponding to the iteration action by strengthening the signal.The probability density function f (a, k + 1) corresponding to the selected attenuation factor can be updated according to Formula (7): where H(x i , r)) is the Gaussian nearest neighbor function, whose value represents the possibility of action change.The specific formula is as follows: where g h and g w represent the height and width of the Gaussian distribution function, respectively.They determine the speed and depth of the learning process, respectively.According to previous work [32], g w = 0.02 and g h = 0.3 are usually selected if the sample number is 500 in the calculation of the strengthened signal, r is the action parameter, and a is the normalized factor, which can keep the value of the probability density function within [0,1].f.Update attenuation factor parameters.In order to solve the problem of large computations when updating the probability density function value and the probability distribution value in step b, this paper proposes an improved method as follows: where norm α(k), ωe (1−β(k)) is the normal-distribution random number simulator with α(k), ωe (1−β(k)) , and ω as the average, standard deviation, and learning rate factor, respectively.It can be seen that the learning time of the improved algorithm is linear with the number of iterations.Compared with the previous method using the integral operation to calculate the value of the probability distribution function, the calculation and learning time of the improved algorithm are significantly reduced without any loss of learning performance.g.Update the iterations to achieve the optimal attenuation factor a which makes the estimated output as close as possible to the real output.
(6) Substitute the learned optimal attenuation factor into the mathematical expression in step (4) to obtain the parameter estimation.After obtaining the system estimation model, it can be applied to the internal model control.

The Applications of CARLA-FRE in MIMO Systems
In industrial applications, due to the complex structure, various parameters, and the coupling relationship between the loops of multivariable systems, the traditional identification methods are difficult to use effectively in multivariable systems.Li [15] proposed a multivariable system identification method based on sequential step signals.With the sequential step method, the multiple-input multipleoutput (MIMO) system could be equivalently decomposed into several single-input single-output (SISO) systems.Then, the analytic expressions of model parameters of these sub-systems could be obtained by frequency response estimation.The simulation results reveal that the presented approach could match the requirements of identification accuracy both in square and non-square systems.

Closed-Loop Identification for Square Multivariate Systems
A multivariable square closed-loop control system n × n is depicted in Figure 3, where G s (s) is the controlled object, G c (s) is the distributed diagonal controller, R is the system input vector, Y is the system output vector, E is the deviation vector, and U is the controller output vector.
[ ] [ ]  According to the above, it can be known that So, we can get: Then, the MIMO system can be equivalently decomposed into several SISO systems as follows: According to the above, it can be known that So, we can get: Then, the MIMO system can be equivalently decomposed into several SISO systems as follows: A method to simplify the identification process of closed-loop systems was proposed in the literature, in which the system deviation could be regarded as the input signal, that is, r(t) could be replaced with e(t) = r(t) − y(t) of the identification process, and the closed-loop system was equivalent to the open-loop system.Therefore, Formula (20) can be further expressed as Formula (21): . . .
Processes 2019, 7, 546 Then, n step excitation signals are applied to the system successively and n groups of vector relationships can be obtained for each MISO system, depicted, for example, as Formula ( 22): . .
where r j i is the excitation signal of the input channel of the ith subsystem under the jth test, is the output signal of the output channel of the ith subsystem under the jth test, and then the deviation signal of the channel can be obtained, e j i = r j i − y j i , that is, the equivalent input signal in the identification process.
Therefore, the following expression can be obtained by matrix transformation: That is, where g ci e Since the closed-loop system is asymptotically stable and the input signal is continuously differentiable, it can be proved that F i is a non-singular square matrix.So, Formula (28) can be further obtained according to Formula (24): where F i −1 is the inverse matrix of F i , adj(•) is the adjoint matrix operator, and det(•) is the matrix determinant operator.According to Formula (28), the MIMO system identification problem can first be decomposed into several MISO subsystem problems, and then further decomposed into SISO identification problems.
G s ij is the transfer function of the decomposed equivalent subsystem (SISO) of the MISO system, and F * k j is the joint factor of the jth row of the F matrix under the kth test.For Formula (29), if the assumption that u ij = det(F i ), y ij = n k=1 F * k j y k i , u ij , and y ij can be regarded as the equivalent input and output of the SISO system identification problem, MISO can be further decomposed into several SISO identification problems.Considering that the presented CARLA-FRE is applicable for a variety of excitation signals, the parameters of the decomposed equivalent SISO system can be estimated using CARLA-FRE.Then, the identified SISO can be combined into a MIMO system according to Formulas (24) and (19) to complete the multivariable system identification.

Closed-Loop Identification for Non-Square Multivariate Systems
For multivariable non-square systems, due to the inconsistency between the input and output dimensions of the system, the inverse matrix of the equivalent matrix F does not exist, making the methods above difficult to apply directly.Based on the idea of the first method in [17], this paper adjusts the decentralized controller into a centralized controller, establishes an association between the input and output of each loop in the MIMO system, and then constructs a solvable matrix form.A multivariable non-square closed-loop control system of m × n, (m n) is shown in Figure 4, where G s (s) is the controlled object, G c (s) is the central controller, R is the system input vector, Y is the system output vector, E is the deviation vector, and U is the controller output vector.
is the transfer function of the decomposed equivalent subsystem (SISO) of the MISO system, and * kj F is the joint factor of the jth row of the F matrix under the kth test.
For Formula (29), if the assumption that det( ) , ij u , and ij y can be regarded as the equivalent input and output of the SISO system identification problem, MISO can be further decomposed into several SISO identification problems.Considering that the presented CARLA-FRE is applicable for a variety of excitation signals, the parameters of the decomposed equivalent SISO system can be estimated using CARLA-FRE.Then, the identified SISO can be combined into a MIMO system according to Formulas ( 24) and ( 19) to complete the multivariable system identification.

Closed-Loop Identification for Non-Square Multivariate Systems
For multivariable non-square systems, due to the inconsistency between the input and output dimensions of the system, the inverse matrix of the equivalent matrix F does not exist, making the methods above difficult to apply directly.Based on the idea of the first method in [17], this paper adjusts the decentralized controller into a centralized controller, establishes an association between the input and output of each loop in the MIMO system, and then constructs a solvable matrix form.
A multivariable non-square closed-loop control system of ,( ) m n m n × ≠ is shown in Figure 4, where ( ) G s is the central controller, R is the system input vector, Y is the system output vector, E is the deviation vector, and U is the controller output vector.In a non-square system structure, m times signals are applied to the input of the system in turn, and the system will output n response signals at a time, that is, m × n (m n) groups of equations are generated.The system can be decomposed and calculated according to Formulas ( 28) and ( 29), and the pseudo-inverse of the matrix is selected to replace it accordingly.

CARLA Algorithm Performance Verification
The continuous action reinforcement learning algorithm has strong online search and learning ability and can converge to the optimal value after full ergodic learning, without prior knowledge to set parameters.In order to test the identification ability of the CARLA algorithm, it was compared with the particle swarm optimization and the parallel diffuse algorithm (fireworks algorithm, FWA) by employing standard test functions.Here the Sphere, Rosenbrock, Griewank, Rastrigin, Ackley, and Schwefel's problem 22 functions were selected to illustrate the applicability and performance of the CARLA algorithm.
(1) Sphere function: and Schwefel's problem 22 functions were selected to illustrate the applicability and performance of the CARLA algorithm.
(1) Sphere function: ) as can be shown in Figure 5.The Sphere function has a unique global minimum value, which is obtained by the sum of squares when the minimum value is taken by independent variables with the same definition of domain.
(2) Rosenbrock function: (1 ) 100( ) The Sphere function has a unique global minimum value, which is obtained by the sum of squares when the minimum value is taken by independent variables with the same definition of domain.
(2) Rosenbrock function: As shown in Figure 6, Rosenbrock's global optimum lies in a smooth, narrow and parabolic valley.Due to the limited information, it is difficult to determine the search gradient and find the optimal solution.Therefore, it is often used to test the optimization performance of the non-convex function of the optimization algorithm, and the function can find the minimum value 0 at x * = (1, • • • , 1).
(3) Griewank function: where As shown in Figure 7, the Griewank function has many local minimum points, and the number is related to the dimension of variables.It can detect the ability of the algorithm to jump out of the local minimums and the global minimum value f(0) = 0, which is generally recognized as a difficult complex multimodal problem for the optimization algorithm.
As shown in Figure 6, Rosenbrock's global optimum lies in a smooth, narrow and parabolic valley.Due to the limited information, it is difficult to determine the search gradient and find the optimal solution.Therefore, it is often used to test the optimization performance of the non-convex function of the optimization algorithm, and the function can find the minimum value 0 at * = x  (1, ,1) .(3) Griewank function: min ( ) cos( ) 1 4000 where As shown in Figure 7, the Griewank function has many local minimum points, and the number is related to the dimension of variables.It can detect the ability of the algorithm to jump out of the local minimums and the global minimum value f(0) = 0, which is generally recognized as a difficult complex multimodal problem for the optimization algorithm.(4) Rastrigin function: where [ ] -5.12,5.12i x ∈ .
As shown in Figure 8, the Rastrigin function is a multi-peak function, and there are about { } -5.12, 5.12 1, 2 . Similar to the Griewank function, it is also a typical nonlinear multi-modal function, and the peak shape features ups, downs, and jumps, so it is difficult to optimize and find the global optimal value.(4) Rastrigin function: where x i ∈ [−5.12, 5.12].As shown in Figure 8, the Rastrigin function is a multi-peak function, and there are about 10 n local minimum points within the range of x i ∈ [−5.12, 5.12], i = 1, 2 • • • n .Similar to the Griewank function, it is also a typical nonlinear multi-modal function, and the peak shape features ups, downs, and jumps, so it is difficult to optimize and find the global optimal value.
As shown in Figure 8, the Rastrigin function is a multi-peak function, and there are about .Similar to the Griewank function, it is also a typical nonlinear multi-modal function, and the peak shape features ups, downs, and jumps, so it is difficult to optimize and find the global optimal value.(5) Ackley function: As shown in Figure 9, when the dimension of the Ackley function increases, its direction gradient and forward direction are various.This function can detect the global convergence speed of an algorithm.The function finds a minimum value 0 at x * = (0, • • • , 0).
Processes 2019, 9, x FOR PEER REVIEW 14 of 20 (5) Ackley function: As shown in Figure 9, when the dimension of the Ackley function increases, its direction gradient and forward direction are various.This function can detect the global convergence speed of an algorithm.The function finds a minimum value 0 at * = x  (0, ,0) .(X)  The parameter configuration is shown in Table 1.Particle swarm optimization (PSO), parallel diffuse algorithm (FWA), and continuous action reinforcement learning (CARLA) algorithms were respectively tested for the standard functions above.The results are shown in Table 2. FWA and CARLA algorithms had better search accuracy than the PSO algorithm, and both could accurately obtain function parameter estimation.The parameter configuration is shown in Table 1.Particle swarm optimization (PSO), parallel diffuse algorithm (FWA), and continuous action reinforcement learning (CARLA) algorithms were respectively tested for the standard functions above.The results are shown in Table 2. FWA and CARLA algorithms had better search accuracy than the PSO algorithm, and both could accurately obtain function parameter estimation.The Wood-Berry double distillation towel model is a classic multivariate model in industrial production.The identification of the Wood-Berry model has drawn the attention of academic and industrial researchers due to the complex characteristics of multi-parameters, large time-delay, and strong coupling relationships.Many works, including Li's sequence identification method based on step response (frequency response estimation, FRE) [33,34], Cheng's identification method based on heuristic search (NLG-particle swarm optimization, NPSO) algorithm [20], Cao's intelligent search algorithm based on parallel diffuse type algorithm (modified fireworks explosion optimization algorithm-FRE, MFA-FRE) [23], etc., have acquired many achievements.In this paper, some comparative simulations between the presented CARLA-FRE and the methods above were carried out to estimate the parameters of the Wood-Berry model (see Figure 11).The Wood-Berry double distillation towel model is a classic multivariate model in industrial production.The identification of the Wood-Berry model has drawn the attention of academic and industrial researchers due to the complex characteristics of multi-parameters, large time-delay, and strong coupling relationships.Many works, including Li's sequence identification method based on step response (frequency response estimation, FRE) [33,34], Cheng's identification method based on heuristic search (NLG-particle swarm optimization, NPSO) algorithm [20], Cao's intelligent search algorithm based on parallel diffuse type algorithm (modified fireworks explosion optimization algorithm-FRE, MFA-FRE) [23], etc., have acquired many achievements.In this paper, some comparative simulations between the presented CARLA-FRE and the methods above were carried out to estimate the parameters of the Wood-Berry model (see Figure 11).
The sequential step signal is: Four methods were used to respectively identify the mentioned multivariable closed-loop system, and the results are presented in Table 3.It can be seen that several methods achieved high estimation accuracy in the parameter identification of the Wood-Berry model.In addition, the CARLA-FRE method proposed in this paper can be implemented online and adjusted automatically with the change of objects.The sequential step signal is: Four methods were used to respectively identify the mentioned multivariable closed-loop system, and the results are presented in Table 3.It can be seen that several methods achieved high estimation accuracy in the parameter identification of the Wood-Berry model.In addition, the CARLA-FRE method proposed in this paper can be implemented online and adjusted automatically with the change of objects.Multivariable non-square systems exist widely in industrial production, and this paper takes the Shell model of a standard non-square model [35] as an example to test the proposed method.The closed-loop control of the Shell model is shown in Figure 12: algorithm-FRE).

Wood-Berry
Method in [26] (FRE) Multivariable non-square systems exist widely in industrial production, and this paper takes the Shell model of a standard non-square model [35] as an example to test the proposed method.The closed-loop control of the Shell model is shown in Figure 12: Here, the model transfer function matrix is:  The controller is the following formula: Gaussian white noises with a noise-to-signal ratio (NSR) of 10% and 20% were applied to verify the effectiveness of the algorithm, and the results were compared with the MAF-FRE method, as shown in Table 4.It can be seen that both methods achieved high estimation accuracy in the parameter identification of the Shell model, and the CARLA-FRE method proposed in this paper could better adapt to changes in the external environment.

Conclusions
This paper proposed a frequency response estimation method based on a continuous action reinforcement learning machine.It could solve closed-loop identification problems of multivariable square systems and non-square systems.Some comparative simulations between the presented method and existing methods were carried out.The classic Wood-Berry model (square system) and the Shell model (non-square system) were chosen to test the algorithms.From the results, it was found that the proposed method could not only achieve good identification accuracy, but also had stronger online learning ability and anti-interference ability.

Figure 2 .
Figure 2. Frequency response estimation method based on continuous action reinforcement learning automata (CARLA-FRE) flow chart.

Figure 3 .
Figure 3. Diagram of the multivariable square closed-loop control system.

Figure 3 .
Figure 3. Diagram of the multivariable square closed-loop control system.

Figure 4 .
Figure 4. Closed-loop control schematic diagram of a multivariable non-square system.

Figure 5 .
Figure 5. Three-dimensional graph of the Sphere function.

Figure 5 .
Figure 5. Three-dimensional graph of the Sphere function.

Figure 6 .
Figure 6.Three-dimensional graph of the Rosenbrock function.

Figure 6 .
Figure 6.Three-dimensional graph of the Rosenbrock function.Processes 2019, 9, x FOR PEER REVIEW 13 of 20

n 10 local
minimum points within the range of [ ]

Figure 8 .
Figure 8. Three-dimensional graph of the Rastrigin function.Figure 8. Three-dimensional graph of the Rastrigin function.

Figure 8 .
Figure 8. Three-dimensional graph of the Rastrigin function.Figure 8. Three-dimensional graph of the Rastrigin function.

Figure 9 .
Figure 9. Three-dimensional graph of the Ackley function.

Figure 9 .
Figure 9. Three-dimensional graph of the Ackley function.

Figure 11 .
Figure 11.Wood-Berry system closed-loop control diagram.The transfer function of the Wood-Berry model is:

Figure 12 .
Figure 12.The diagram of the Shell closed-loop control system.

Figure 12 .
Figure 12.The diagram of the Shell closed-loop control system.Here, the model transfer function matrix is: )

Table 1 .
Parameter configuration of the test functions.

Table 1 .
Parameter configuration of the test functions.

Table 2 .
Parameter configuration of the test function.PSO: particle swarm optimization.

Table 2 .
Parameter configuration of the test function.PSO: particle swarm optimization.

Table 4 .
Identification results of the Shell model.