Self-Tuning Two Degree-of-Freedom Proportional–Integral Control System Based on Reinforcement Learning for a Multiple-Input Multiple-Output Industrial Process That Suffers from Spatial Input Coupling

Abstract: Proportional–integral–derivative (PID) control remains the primary choice for industrial process control problems. However, owing to the increased complexity and precision requirements of current industrial processes, a conventional PID controller may provide only unsatisfactory performance, or the determination of the PID gains may become quite difficult. To address these issues, studies have suggested the use of reinforcement learning in combination with PID control laws. The present study aims to extend this idea to the control of a multiple-input multiple-output (MIMO) process that suffers from both physical coupling between inputs and a long input/output lag. We specifically target a thin film production process as an example of such a MIMO process and propose a self-tuning two-degree-of-freedom PI controller for the film thickness control problem. Theoretically, the self-tuning functionality of the proposed control system is based on the actor-critic reinforcement learning algorithm. We also propose a method to compensate for the input coupling. Numerical simulations are conducted under several likely scenarios to demonstrate the enhanced control performance relative to that of a conventional static-gain PI controller.


Introduction
Control system synthesis for industrial processes has long attracted much research attention. However, the increased complexity and granularity of industrial processes make control system design difficult or intractable even for experienced engineers. Despite the sophistication of modern industrial processes, proportional-integral-derivative (PID) controllers remain the primary choice for industrial process control [1,2]. PID controllers are known to be applicable to a wide variety of control systems [3]. However, this broad application spectrum effectively means that engineers must find their own ways to tune the PID parameters to satisfy their performance objectives [4]. The adaptive and/or automatic tuning of PID controller parameters has accordingly been regarded as a promising direction for providing feasible solutions to this problem.
Hägglund and Åström [4] proposed an auto-tuning scheme for PID controllers using artificially induced limit-cycle behavior. Papadopoulos et al. [5] proposed the automatic tuning of PID controllers based on the magnitude optimum criterion. Sarhadi et al. [6] applied adaptive PID control for model reference adaptive control of an autonomous underwater vehicle; they derived a PID parameter tuning law based on Lyapunov stability theory. These works are prime examples of automatic PID gain tuning, which can substitute for the manual tuning traditionally performed by highly experienced operators in the factory. In this study, we aim to provide an automatic self-tuning functionality for the PID controller to enable automatic high-quality film production.
We use the actor-critic RL algorithm to synthesize the self-tuning two-degree-of-freedom PI control system. We introduce radial basis function (RBF) networks for the actor and the critic independently. The critic networks are trained to approximate the value function of the states, whereas the actor networks learn to determine the internal policy parameters that maximize the return. We propose to introduce the spatial coefficients η for the spatial augmentation of the error and the reward, which enables appropriate learning in the presence of input coupling. The performance of the proposed control system is evaluated through numerical simulations under several likely scenarios in the operation of industrial processes. The results demonstrate the superiority of the proposed control system over the fixed-gain PI controller.
Although we concentrate on the development of the self-tuning two-DOF PI controller for the film production process in this study, the idea of the spatial augmentation of the error and the reward can be applied to other plants that also suffer from input coupling. Table 1 summarizes the number of documents published in the past twenty years that report the development of PID control systems for industrial processes. The figures in the table clearly show that there is a consistent and strong demand for PID controllers in process control problems, and the increasing trend suggests that PID controllers will continue to be used in various control problems in the future. The enhancement of PID control for process control problems developed in this article will remain valuable in this light.

Description of the Target Process
We aim to synthesize a control system for a thin plastic film fabrication process. The objective of the control system is to fabricate a plastic film that has a spatially uniform designated thickness. Figure 1 shows the control-related schematics of the film fabrication machine. The thickness of the fabricated film is controlled by a die aligned orthogonally to the flow of melted resin. The width of the die is 63 mm. Fifteen die-manipulating devices are aligned linearly on the die at a 4 mm interval, as shown in Figure 2. We can measure the thickness of a film with a spatial resolution of 1 mm, as shown in Figure 2; however, we cannot monitor the cross-sectional shape of the die. The displacement of each device can be controlled independently. We performed a step response experiment with a single manipulating device and measured its response. The transfer function of a manipulating device from its input u(t) to its displacement y(t) is accordingly identified to be: where u(t) takes a value within the interval [0, 100] and y(t) is measured in millimeters (mm). We hereafter assume that all manipulating devices are characterized by the transfer function (1) in their nominal operating condition. The actual cross-sectional shape of the die is determined by the displacements of the manipulating devices. The displacement of the die where a particular manipulating device is located is determined not only by the displacement of the corresponding device, but also by the displacements of other manipulating devices located nearby. Let x ∈ (0, 63] be a position on the film, and let c i (i = 1, 2, · · · , 15) be the location of the i-th manipulating device, as shown in Figure 2. We model the spatial coupling of the manipulating devices with the function Ψ x,i (i = 1, 2, · · · , N) given by: where d x,i = x − c i is the distance between the position x and the i-th manipulating device. The blue broken lines in Figure 3 show Ψ x,i for all i.
The displacement of the die at position x can be calculated as: where y i (t) represents the displacement of the i-th manipulating device at time t and N = 15 is the number of die-manipulating devices installed in the process. We calculated the steady-state deformation of the die by setting u i (t) = 100 (∀i) and holding these inputs until all the outputs y i (t) converged. The red solid plot in Figure 3 shows the resulting deformation of the die as calculated using Equation (3). This figure shows that the deformation at the left and right edges of the die cannot reach its maximum. Therefore, in some cases, the thickness of the fabricated film cannot be made constant if the reference thickness is too small. We accordingly define the control objective as regulating the thickness of the fabricated film corresponding to the region 13 < x < 51 of the die to be equal to its reference, denoted hereafter as s ref .
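To make the coupling model concrete, the following sketch implements the superposition in Equation (3) under stated assumptions: the influence function Ψ is taken to be Gaussian-shaped (the paper's exact form is not reproduced here), the device locations c_i are assumed to sit on a 4 mm pitch centered on the 63 mm die, and the spread `beta` is a hypothetical parameter.

```python
import numpy as np

N = 15                          # number of die-manipulating devices
c = 3.5 + 4.0 * np.arange(N)    # assumed device locations on the 63 mm die (4 mm pitch)
beta = 4.0                      # hypothetical spread of the coupling [mm]

def psi(x, i):
    """Influence of device i on die position x (assumed Gaussian shape)."""
    d = x - c[i]                # d_x,i = x - c_i
    return np.exp(-(d / beta) ** 2)

def die_displacement(x, y):
    """Equation (3): superposition of all device displacements at position x."""
    return sum(psi(x, i) * y[i] for i in range(N))

# Steady-state deformation with all devices at an equal displacement
y_ss = np.ones(N)
xs = np.linspace(0.5, 62.5, 63)
profile = np.array([die_displacement(x, y_ss) for x in xs])
# The profile droops near both edges, matching the observation that the
# edge deformation cannot reach its maximum.
```

Under these assumptions the computed profile reproduces the qualitative behavior of Figure 3: a flat central plateau with reduced deformation near both edges of the die.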

Structure of the Proposed Control System
Herein, we propose a two-degree-of-freedom (two-DOF) PI control system for the thin film fabrication process whose gains and feed-forward terms are tuned online by an actor-critic-type RL algorithm. We first define the signals used by the controller. Because we implemented the proposed control system on a digital computer, we hereafter use a discrete-time description for the signals included in the system. However, the dynamics of the manipulating devices are still described by a continuous-time s-domain transfer function because the manipulating devices should be considered continuous-time plants.
Since each manipulating device is known to suffer from a nominal lag of L = 95 s, we applied the Smith predictor structure to each manipulating device, as shown in Figure 4, wherein a f f , a P , and a I represent the feed-forward input, the proportional gain vector, and the integral gain vector, respectively. η denotes a spatial augmentation coefficient for the compensation of input coupling, as will be detailed later. Let P M (s) be the transfer function defined as: Let T s = 3 denote the sampling interval of the control system. Hereafter, we use the sampling number index m to describe the current signal values at t = mT s . Let s M [m] be the response of P M (s) to the input signal sequence u[m]. The signal s[m] to be fed to both the PI controller and the RL block is defined as: where L D = L/T s denotes the sampling number that approximates the nominal lag L. We define the signal s P [m] as the measured thickness of a film corresponding to a specific manipulating device. Because L D × T s ≠ L and we cannot measure the individual displacement of a manipulating device, we note that the feedback control structure shown in Figure 4 is not a perfect implementation of the Smith predictor. Figure 4. Proposed Smith predictor based on the two-DOF PI control system. Its feed-forward control and PI gains are tuned by the RL algorithm. We note that this figure represents the construction of a control system for a single manipulating device, but the reinforcement learning part takes care of the entire process to compensate for the interference between inputs.
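The signal construction above can be sketched in discrete time as follows. Only the lag L = 95 s and the sampling interval T_s = 3 s come from the text; the delay-free model P_M(s) = K/(Ts + 1) with gain `K` and time constant `T` is a hypothetical stand-in for transfer function (1).

```python
from collections import deque

Ts, L = 3.0, 95.0
L_D = int(round(L / Ts))            # lag in samples; L_D * Ts only approximates L
K, T = 1.0, 60.0                    # hypothetical model gain and time constant

def step_model(y, u):
    """One Euler step of the delay-free model P_M(s) = K / (T s + 1)."""
    return y + Ts * (K * u - y) / T

# Run the model on a constant input and form the Smith-predictor signal
sM_hist = deque([0.0] * (L_D + 1), maxlen=L_D + 1)
sM = 0.0
for m in range(200):
    sM = step_model(sM, 50.0)
    sM_hist.append(sM)              # sM_hist[0] is now s_M[m - L_D]
    sP = sM_hist[0]                 # plant output = delayed model (ideal case)
    s = sP + sM - sM_hist[0]        # s[m] = s_P[m] + s_M[m] - s_M[m - L_D]
# With a perfect model the delayed terms cancel, so s equals the undelayed
# model output s_M, and the PI controller can act as if there were no lag.
```

In the real system s_P[m] comes from the thickness measurement rather than a delayed copy of the model, so the cancellation is only approximate; this is one of the two imperfections noted in the text.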

Two-DOF PI Controller Tuned by RL Algorithm
We implemented the feedback control structure shown in Figure 4 for the control of each manipulating device. Hereafter, we denote the signal s[m] for the i-th manipulating device as s i [m] (i = 1, 2, · · · , N). Let: be the normalized tracking error variable for the i-th manipulating device. We accordingly define its summation as: The control law to be synthesized for the i-th manipulating device is described as: where a f f ,i represents the feed-forward control input, and the remaining terms constitute a PI controller for the i-th manipulating device. The error ε i and its summation ε Σ i that appear in (8) are the vectors defined as: and: respectively. ε i,j is a spatially augmented error defined as: where η i,j is a coefficient introduced so that ε j [m] (j = 1, 2, · · · , N) in (6) is taken into account in the control of the i-th manipulating device. We introduce this spatial error augmentation to compensate for the mechanical coupling of the manipulating devices, which would otherwise worsen the film thickness accuracy. With this setup, the PI gains a P,i and a I,i become vectors in R N .
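A minimal sketch of the control law (8), assuming a Gaussian-shaped η i,j and illustrative gain values (none of the numbers below come from the paper):

```python
import numpy as np

N = 15

def control_input(i, a_ff, a_P, a_I, eta, eps, eps_sum):
    """u_i = a_ff,i + a_P,i . (eta_i * eps) + a_I,i . (eta_i * eps_sum)."""
    eps_bar = eta[i] * eps          # spatially augmented error (vector in R^N)
    eps_bar_sum = eta[i] * eps_sum  # augmented error summation
    return a_ff[i] + a_P[i] @ eps_bar + a_I[i] @ eps_bar_sum

# Hypothetical Gaussian-shaped coupling: each device weights its own error most
eta = np.exp(-0.5 * (np.subtract.outer(np.arange(N), np.arange(N)) / 1.5) ** 2)
a_ff = np.full(N, 50.0)             # illustrative feed-forward inputs
a_P = np.full((N, N), 0.4)          # illustrative proportional gain vectors
a_I = np.full((N, N), 0.05)         # illustrative integral gain vectors
eps = np.zeros(N); eps[7] = 1.0     # only device 8 currently shows an error
eps_sum = np.zeros(N)

u7 = control_input(7, a_ff, a_P, a_I, eta, eps, eps_sum)
u0 = control_input(0, a_ff, a_P, a_I, eta, eps, eps_sum)
# Device 8 reacts strongly to its own error, while distant device 1 barely
# moves, because eta decays with inter-device distance.
```

The point of the vector-valued gains is visible here: every device's control input is a weighted sum over all N errors, with η shaping how much the neighbors contribute.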

Parameterized Policy and Policy Gradient RL for Self-Tuning Controller Parameters
The tuning of the controller parameters of a process is considered to be a highly empirical task that requires experienced operators to run the process stably under various uncertainties. We regard the control law defined in (8) as the deterministic policy and employ RL for the self-tuning of the parameters in this policy based on the observed control performance. The parameters to be tuned are a f f ,i , a P,i , a I,i , and η i,j . We include tunable parameters Θ in these quantities and update their values to maximize the expected return E{R t |Θ}. The learning scheme is classified as the policy-gradient-type actor-critic algorithm proposed by Kimura and Kobayashi [20].
We define the control actions for the i-th manipulating device included in the deterministic policy (8) as: and: where θ f f ,i,j , θ P,i,j , θ I,i,j , and θ g i are the internal policy parameters for the feedforward, proportional, and integral control actions and the spatial coefficient η i,j , respectively. G a f f , a f f ,max , a P,max , a I,max , and g max are suitably chosen constants. As each manipulating device has a long time constant and suffers from a large delay, the feedforward control term a f f ,i is used to facilitate a faster response.
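One plausible realization of such a bounded parameterization (the paper's exact mappings (12)-(14) are not reproduced here) is a sigmoid squashing that keeps each action inside its bound regardless of the internal parameter:

```python
import math

def squash(theta, a_max, gain=1.0):
    """Map an unbounded internal policy parameter onto the interval (0, a_max)."""
    return a_max / (1.0 + math.exp(-gain * theta))

a_ff = squash(0.8, 100.0)  # a feed-forward input bounded by a hypothetical a_ff,max = 100
a_p = squash(-1.2, 5.0)    # a proportional gain bounded by a hypothetical a_P,max = 5
```

Whatever value the actor network outputs for θ, the resulting action stays strictly within its saturation constant, which is the role the constants a f f ,max , a P,max , a I,max , and g max play in the policy.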

Critic Network
In this study, the states and actions of the RL algorithm must be assumed to change continuously. Therefore, we used RBF networks that map the observed signals to the quantities used in learning and control.
The critic network approximates the value function V(s[m]) for the film thickness measurement s[m]. Figure 5 shows the structure of the critic network. The k-th hidden layer node φ i,k (k = 1, 2, · · · , K i ; K i ≤ K max ) is described as: where µ i,k = [µ i,1,k , · · · , µ i,N,k ] T and σ i,k define the center and the standard deviation of the RBF, respectively. The parameter ρ controls the tailedness of the RBFs. The number of hidden layer nodes K i is increased dynamically up to the predefined maximal number K max to improve the precision of the approximation over a wide range of the state space. The value function V i (s[m]) is defined as: where w i,k is a weight. The parameters of the critic network, w i,k , µ i,k , and σ i,k , are updated so as to make the index: small. δ i is a temporal difference (TD) error defined as: and r i is an instantaneous reward calculated as: where SS i defines a stable hypersurface of the error such that SS i → 0 indicates ε i → 0. The gradient B of the hypersurface must be determined appropriately for this purpose. The update laws of the critic network parameters follow the policy-gradient algorithm and are specifically defined as: where α w , α µ , and α σ define the learning rates for w, µ, and σ, respectively, and D i,k is an eligibility trace defined as:
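The critic can be sketched as follows. The RBF value function and the eligibility-trace update mirror the structure described above, but the TD-error form, the discount and trace factors, and all numerical values are standard actor-critic choices assumed for illustration, not copied from the paper's equations:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 15, 8
mu = rng.uniform(60.0, 80.0, size=(K, N))   # RBF centers (hypothetical)
sig = np.full(K, 20.0)                      # RBF widths (hypothetical)
w = np.zeros(K)                             # output weights
D = np.zeros(K)                             # eligibility trace for w
alpha_w, gamma, lam = 0.05, 0.5, 0.5        # illustrative learning constants

def phi(s):
    """Hidden-layer RBF activations for a state vector s."""
    d2 = ((s - mu) ** 2).sum(axis=1)
    return np.exp(-d2 / (2.0 * sig ** 2))

def V(s):
    return w @ phi(s)

def td_update(s, r, s_next):
    """One critic update: TD error, trace decay, and weight adjustment."""
    global w, D
    delta = r + gamma * V(s_next) - V(s)
    D = gamma * lam * D + phi(s)
    w = w + alpha_w * delta * D
    return delta

s_now, s_nxt = np.full(N, 72.0), np.full(N, 71.0)
before = abs(td_update(s_now, 1.0, s_nxt))
for _ in range(200):
    td_update(s_now, 1.0, s_nxt)
after = abs(td_update(s_now, 1.0, s_nxt))
# Repeated updates on the same transition drive the TD error toward zero.
```

Only the weight update is shown; the paper additionally adapts the centers µ and widths σ with their own learning rates and traces, following the same gradient pattern.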

Actor Network
The actor networks are configured to approximate the internal policy parameters included in (12) and (13). Figure 6 shows the structures of the networks for the feedforward control action parameters θ f f ,i,j and the PI control action parameters θ P,i,j and θ I,i,j . The feedforward actor network receives the normalized state s and the normalized policy parameter a f f = [a f f ,1 , · · · , a f f ,N ] T as its input, whereas the PI actor networks use the normalized state s and the normalized control gains a * ,i = [a * ,i,1 , · · · , a * ,i,15 ] T ( * ∈ {P, I}) as their inputs. The k-th hidden layer node φ (k) * * ,i,j of the i-th actor network is defined by an RBF: where its argument x * * ,i is defined as: x f f ,i = [s 1 , · · · , s 15 , a f f ,1 , · · · , a f f ,15 ] T for the feedforward network and: x * ,i = [s 1 , · · · , s 15 , a * ,i,1 , · · · , a * ,i,15 ] T ( * ∈ {P, I}) for the PI controller networks. Both the feedforward and the PI actor networks aim to tune θ * ,i,j ( * ∈ { f f , P, I}) through their network parameters v, µ, and σ. We applied the policy-gradient method to the learning of the actor networks. The update laws of the RBF parameters were synthesized as: to maximize the expected return E{R t }, where * * represents either f f , P, or I. D v , D µ , and D σ are the eligibility traces characterized as: where α * * , α aµ , and α aσ are the corresponding learning rates.
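The actor update can be sketched in the same spirit. Here a Gaussian exploration term plays the role of the stochastic policy, and a toy reward −(θ − 1)² stands in for the TD error supplied by the critic so that the sketch is self-contained; all constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
K = 8
phi_x = rng.uniform(0.0, 1.0, K)    # fixed RBF activations for one input
v = np.zeros(K)                     # output weights of one actor network
Dv = np.zeros(K)                    # eligibility trace for v
alpha, gamma, lam, expl = 0.003, 0.95, 0.5, 1.0

for _ in range(800):
    mean = v @ phi_x                        # deterministic policy output
    theta = mean + expl * rng.normal()      # exploratory action
    delta = -(theta - 1.0) ** 2             # toy stand-in for the TD error
    # eligibility of v for this action: d log pi / d v for a Gaussian policy
    Dv = gamma * lam * Dv + (theta - mean) / expl**2 * phi_x
    v = v + alpha * delta * Dv              # climb the expected return
# On average the updates move the policy mean toward theta = 1, the action
# that maximizes the toy reward.
```

The same weight-trace-TD pattern is applied to v, µ, and σ of every actor network in the paper; only the output-weight update is shown here.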

Learning Spatial Coupling Coefficient η i,j
We introduced the spatial coupling coefficient η i,j to compensate for the mechanical coupling of the die-manipulating devices. Because the form of this coupling cannot be known prior to operation and may vary across consecutive runs, we tune the parameter g i included in η i,j to improve the control performance.
We also applied the actor-critic structure with the RBF network shown in Figure 7 to self-tune g i . Each node in the hidden RBF layer is fully connected to the inputs of the network. The critic network attempts to learn the value function V g i for η i,j , whereas the actor network tunes the internal policy parameter θ g i in (14) to minimize the performance index defined by J a g i = ε 2 i /2. Similar to the actor-critic networks for the two-DOF PI self-tuning control system, the learning targets V g i (for the critic) and θ g i (for the actor) are functions of the network parameters. The learning rules therefore target the network parameters, namely the centers and variances of the hidden-layer RBFs and the weights used to synthesize the network outputs.
The critic network parameters were updated to minimize the squared form J c g i = δ 2 g i /2 of a TD error δ g i defined as: where the instantaneous reward r g i is defined as: The parameters of the actor network are updated in the negative gradient direction of the performance index J a g i = ε 2 i /2 to make J a g i small. Because the resulting update laws for the actor-critic network of the internal policy parameter g i appear quite similar to those for the actor-critic RL of the two-DOF PI control system, they are omitted to avoid repetition. We finally summarize the calculation flow of our proposed control system in Algorithm 1; its key steps are to determine the internal policy parameters θ * * ,i and θ g,i using the RBF networks and then to determine the action parameters a f f ,i , a P,i , a I,i , and η i,j .

Performance Evaluation through Numerical Simulations
We performed numerical simulations under several likely scenarios to demonstrate the performance of the proposed control system. As a representative conventional process control technique, we synthesized a static PI controller using the Ziegler-Nichols (Z.N.) ultimate gain method. We applied the same set of gains to all N manipulating devices and configured the spatial coupling coefficient as: η i,j = 1 if i = j and η i,j = 0 otherwise (26) to evaluate how effectively η i,j compensates for the intrinsic mechanical coupling caused by manipulating devices other than the i-th one.
To quantify the control performance, we calculated the spatial root mean squared error (sRMSE), defined as: where thickness(x) is the film thickness measured at the labeled position x in Figure 2. Tables 2 and 3 respectively list the actor and critic network parameters used in the simulation.
Table 2. Parameters used to formulate the actor and critic networks for self-tuning the two-DOF PI control system.
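A sketch of the sRMSE computation, assuming the mean is taken over the controlled region 13 < x < 51 at the 1 mm measurement resolution (the paper's exact normalization may differ):

```python
import numpy as np

def srmse(thickness, s_ref):
    """thickness[k] is the measurement at position x = k + 1 on the 1 mm grid."""
    x = np.arange(1, 64)                  # labeled positions 1..63 mm
    region = (x > 13) & (x < 51)          # controlled region of the die
    err = thickness[region] - s_ref
    return float(np.sqrt(np.mean(err ** 2)))

flat = np.full(63, 70.0)                                   # uniform film
wavy = flat + np.where(np.arange(63) % 2 == 0, 0.5, -0.5)  # +/-0.5 ripple
print(srmse(flat, 70.0), srmse(wavy, 70.0))                # prints: 0.0 0.5
```

Restricting the mean to the controlled region is consistent with the earlier observation that the die edges cannot reach full deformation and are excluded from the control objective.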

Parameters for the spatial coupling coefficient η i,j :
α g c for the critic RBF network: 1.0 × 10 −3
α g c µ for the critic RBF units: 1.0 × 10 −4
α g c σ for the critic RBF units: 1.0 × 10 −4
α g a for the actor RBF network: 2.0 × 10 −2
α g a µ for the actor RBF units: 1.0 × 10 −4
α g a σ for the actor RBF units: 1.0 × 10 −4
g i (0) (initial value of g i ): 0.06
g max : 2

Other parameters:
ρ (controls the tailedness of the RBFs): 2.5
Gradient B of the hypersurface: 0.2

The sampling interval was set to 3 s in all simulation scenarios. We added random noise to the calculated thickness to simulate measurement noise. The noise was generated within a ±0.5% range of the reference thickness of the film.
In the following scenarios, we applied three different controllers: (1) the proposed self-tuning two-DOF PI controller; (2) the proposed self-tuning two-DOF PI controller, but with η i,j defined using (26); and (3) the static gain PI controller whose gains were determined using the Z.N. ultimate gain method. We uniformly applied the feedback control structure shown in Figure 4 to all three controller setups in all scenarios. We disabled only the RL calculations when obtaining control results with the static PI controller. For the self-tuning control simulations, we repeated the simulation with an identical initial thickness distribution for 40 episodes. The results corresponding to the 41st episode are shown below. We note that the number of RBF nodes in the hidden layer of the actor and critic networks was set to zero initially and increased automatically. We applied the algorithm for the automatic addition of RBF nodes proposed by Kamaya et al. [21] while making the necessary changes to adapt it to our MIMO control problem.

sRMSE Trajectories for a Fixed Reference Thickness
We first set the reference thickness to 70 and observed the transient changes in the sRMSE metrics. Figure 8 shows the result. The plots indicate that the proposed self-tuning controllers not only yielded much faster convergence, but also achieved significantly smaller sRMSE values than the conventional static PI controller. The figure also shows that incorporating the spatial coupling coefficient η i,j defined using (14) and the associated learning scheme resulted in a smaller sRMSE than that achieved with the decoupled self-tuning controller corresponding to η i,j defined using (26).

Response to Changes in Reference Film Thickness
Although the real film production process does not change the reference thickness within a single production batch, we changed the reference thickness from 70 to 65 at the 500th sampling step in the 41st episode to observe the response after completing 40 episodes with a constant reference thickness of 70. Figure 9 shows the result. The proposed self-tuning controllers again exhibited much smaller sRMSE metrics than the static PI controller. η i,j defined using (14) with the associated RL produced better accuracy than the SISO self-tuning controller, as was also observed in the previous scenario. Although the self-tuning controllers temporarily exhibited larger sRMSEs than the static PI controller after the reference thickness was altered, this was an incidental issue, as evidenced by the film thickness distributions corresponding to the Z.N. PI and the proposed controller in Figure 10. Since the film corresponding to the Z.N. PI controller had portions apparently thinner than 70 and closer to the new reference of 65, it temporarily exhibited a smaller sRMSE metric than the film generated by the proposed controller, whose thickness was uniformly close to 70. Our inference was further justified by an additional simulation in which the new reference was set to 75, which is larger than 70. The result is shown in Figure 11. The recovery of the sRMSE metric after the change of the reference thickness shown in Figures 9 and 11 revealed that the proposed controller exhibited a much faster response than the Z.N. PI controller.
The result in Figure 9 shows that using η i,j in (14) contributes to a smaller steady-state sRMSE.
We next show how our proposed controller reacted to the changes of the reference. Since the three left and the three right devices were excluded, as explained earlier in this manuscript, we provide the changes of the controller-related parameters of the fourth to twelfth manipulating devices.
All the quantities exhibited steep changes at the 500th step. Since 40 episodes of training were completed before applying this scenario, the feedforward control inputs appeared to be dominant in the control behavior, whereas small transient adjustments could be observed in a P and a I , as evidenced by the plots in Figure 12. Figure 13 shows the plots corresponding to only the fifth manipulating device. Regarding the changes of the PI controller gains, it is of technical interest that although a P,5,5 and a I,5,5 were the largest, the gains corresponding to their closest neighbors (a P,5, * and a I,5, * for * ∈ {4, 6}) exhibited a similar magnitude, indicating that the errors measured at the nearest neighboring devices were important for control under coupling.

sRMSE Trajectory under Plant Perturbation
We then introduced perturbations to the dynamics of the manipulating devices. Because the devices were modeled using first-order transfer functions with a lag, we randomly perturbed their DC gains and time constants; the perturbed parameters were kept within ±5% of their nominal values. We note that the perturbation was introduced only in the 41st episode in the self-tuning control scenarios. As shown in Figure 14, the proposed self-tuning controllers again exhibited much smaller sRMSE metrics than the static PI controller in this scenario. The use of the adaptive coefficient η i,j in the self-tuning control system resulted in a continuous improvement in the sRMSE metric after the 400th step, whereas the metric did not decrease with the SISO policy controller.

Disturbance Rejection
We know empirically that we should sometimes expect a disturbance that worsens the film thickness precision at the left and right edges. We modeled the disturbance as a perturbation of the thickness around the right edge, which was characterized by: and added it to the thickness after the 500th time step. We did not perturb the plant dynamics in this scenario, and the reference was set to 70 throughout the episode. Figure 15 shows the result. All three controllers suffered increased sRMSE metrics when the disturbance was introduced. However, the self-tuning controllers quickly rejected the disturbance, whereas the sRMSE metric of the static PI controller was still decreasing even 500 sampling steps later, indicating its very slow transient behavior. Notably, the self-tuning controller with η i,j defined using (14) showed faster convergence than the SISO self-tuning controller in this scenario, likely because of the spatially monotonic sign of the introduced disturbance.

Comparison with Online Tuning by Particle Swarm Optimization
To further illustrate the performance enhancement of our proposed control system, we conducted a comparative online tuning of our two-DOF PI controller parameters θ * * by particle swarm optimization (PSO). PSO is classified as a swarm intelligence algorithm; it can be applied to various global optimization problems and is known to exhibit fast convergence.
We tuned the parameters of the proposed two-DOF PI control system in the SISO setup (η i,j = 1 only if i = j; otherwise, it was set to zero). A particle includes all the policy parameters θ * * , which amounts to a point in a 45-dimensional space (there are 15 manipulating devices, each of which is assigned a two-DOF PI controller that has three parameters). We prepared 20 initial particles distributed within a ±35% range of the initial value. The initial velocities of the particles were randomly initialized within the interval [−0.5, 0.5]. We needed to define pbest and gbest to evaluate the fitness of the particles (please see [9] for details), and we used the sRMSE metric as the fitness evaluation. The updates of the particles were carried out every 100 steps within an episode, and we likewise performed 40 episodes for the PSO-based tuning. The reference was set to 70, identical to the value used for the numerical evaluation of the proposed control system. Figure 16 shows the changes of the sRMSE metrics corresponding to the controllers tuned by the four different methods. The figure shows the fast adaptation of PSO tuning; however, PSO tuning was ultimately outperformed by the proposed control system, which explicitly takes the input coupling into account, and PSO did not exhibit a significant performance improvement over our proposed control system even in the SISO setup. We conclude that the proposed control system exhibited not only improved steady-state thickness accuracy, but also a learning speed comparable to that of PSO.
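The comparison setup can be sketched with a minimal PSO loop. The dimensionality, swarm size, ±35% initialization, and initial velocity range follow the text; the fitness function is a stand-in quadratic with a known minimum rather than the sRMSE of a full process simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, N_PART = 45, 20                      # 15 devices x 3 parameters each
target = rng.uniform(-1.0, 1.0, DIM)      # hypothetical optimal parameters

def fitness(p):
    """Stand-in for the sRMSE metric; lower is better."""
    return float(np.sum((p - target) ** 2))

pos = target * rng.uniform(0.65, 1.35, size=(N_PART, DIM))  # +/-35% spread
vel = rng.uniform(-0.5, 0.5, size=(N_PART, DIM))            # initial velocities
pbest = pos.copy()
pbest_f = np.array([fitness(p) for p in pos])
gbest = pbest[np.argmin(pbest_f)].copy()
start = fitness(gbest)

w_in, c1, c2 = 0.7, 1.5, 1.5              # standard inertia/attraction weights
for _ in range(100):
    r1 = rng.random((N_PART, DIM)); r2 = rng.random((N_PART, DIM))
    vel = w_in * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    f = np.array([fitness(p) for p in pos])
    better = f < pbest_f
    pbest[better] = pos[better]; pbest_f[better] = f[better]
    gbest = pbest[np.argmin(pbest_f)].copy()
# The best-known fitness never increases, so gbest steadily approaches target.
```

In the actual comparison each fitness evaluation requires running the process for 100 sampling steps, which is why PSO's sample cost per update is much higher than that of the per-step RL updates.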

Conclusions and Future Work
This study proposes a self-tuning two-DOF PI control system for a MIMO film production process. The adaptive tuning laws of the controller parameters are synthesized based on the actor-critic-type RL algorithm. As the target process intrinsically suffers from spatial mechanical coupling, we introduce the tunable coefficient η i,j to improve the thickness control performance in the presence of spatial input coupling.
We conduct numerical simulations under several likely scenarios and confirm that the proposed controller achieves better control performance than the conventional static-gain PI controller, whose gains are determined using the Z.N. method, in all scenarios considered.
We observe transient oscillations in the sRMSE thickness error metrics of the proposed controller in almost all cases. We will continue to investigate the cause of this phenomenon and will try to synthesize an improved control system with a smaller oscillatory response.