Real-Time Parameter Identification for Forging Machine Using Reinforcement Learning

Identifying the parameters of a mechanism model under real-time operating conditions is challenging because uncertain disturbances arise from the deviation between the design requirements and the operational environment. In this paper, a novel approach based on reinforcement learning is proposed for forging machines; it obtains the optimal model parameters by applying the raw data directly instead of an observation window. The approach is an online parameter identification algorithm that works within one period and does not need labelled samples as a training database. It is robust against disturbances of unknown distribution in a dynamic process and, in particular, can adapt to a new process without historical data. The effectiveness of the algorithm is demonstrated and validated by a simulation in which the parameter values of a forging machine are acquired.


Introduction
Complex engineering systems place high requirements on system reliability, control, and production performance. A variety of technologies have been developed to support the monitoring, optimization, and control of complex industrial processes such as chemical processes, manufacturing systems, and power and energy systems [1][2][3]. The forging process, which enhances the mechanical properties of parts by compressing their microstructure [4], is widely applied in the fields of mining equipment, thermal, hydro, and wind power generation equipment, nuclear power equipment, petroleum, and so on. As the key equipment, a forging machine should provide a precise pressing speed with a huge force to meet the technological requirements of forging pieces. Therefore, the control of the forging machine is the guarantee of high forging quality. Control algorithms have progressed from conventional PID-based algorithms [5] to advanced model-based control algorithms, including sliding mode control [6,7], back-stepping control [8], and feedback linearization [9], in order to obtain higher performance. However, the effectiveness of these control algorithms strongly depends on the accuracy of the mechanism model. In [10,11], fuzzy-based control was proposed by using fuzzy rules instead of the mechanism model, but it cannot meet the requirement of high precision. It is worth pointing out that equivalent models, including regression models [12], neural networks [13], support vector machines [14], and so on [15], are alternatives to the mechanism model. These equivalent models overcome the difficulty of mechanical analysis, but at the cost of the model's extensibility and physical meaning. Up to now, the mechanism model is still a feasible choice for precision control of the forging machine.
The mechanism knowledge of the forging machine has been mastered based on the related principles such as fluid mechanics, dynamics, and machinery technology. For example, the dynamic behaviors of the forging machine were analyzed according to the mechanism model [16]. A focus of the mechanism model with known structure is to determine the parameters, which is often by the way of offline identification and online correction. Especially for a forging machine, most parameters come from the design handbook of forging machine [17] in which the values of parameters are recorded under the pre-set environment. The others are estimated based on the states of the forging machine by kinds of sensors. A number of offline identification methods such as least square method, maximum likelihood, Bayesian estimation, posteriori estimates, and minimizing maximum entropy were shown in reviews [18][19][20]. Reference [21] proposed to minimize the entropy of a kernel estimation, constructed from the residuals to deal with the case of not using the maximum likelihood estimation. In reference [22], a system parameter estimation method based on deconvolution of the system output process and explicit Levenberg optimization method was presented. Reference [23] presented a new derivative-free search method for finding models of acceptable data fit in a multidimensional parameter space and made use of the geometrical constructs known as Voronoi cells to derive the search in the parameter space. Reference [24] described a method for estimating the Nakagami distribution parameters by the moment method in which the distribution moments were replaced by their estimates. In order to trace the varying working parameters, the online estimated techniques were developed to improve the accuracy of model. The recursive parameter estimations were introduced to the linear model [25], the bilinear system [26], and the ARMA system [27]. 
In [28], an estimated noise transfer function was used to filter the input-output data of the Hammerstein system. By combining the key-term separation principle and the filtering theory, a recursive least squares algorithm and a filtering-based recursive least squares algorithm were addressed. Reference [29] proposed a parameter estimation algorithm using the simultaneous perturbation stochastic approximation (SPSA) to modify parameters with only two measurements of an evaluation function regardless of the dimension of the parameter. Reference [30] collected time-series data from an experimental paradigm involving repeated training and investigated the effect of various clustering methods on the parameter estimation. Reference [31] provided a servo press force by employing a novel dual-particle filter-based algorithm, achieving a maximum relative error in the force estimation of 3.6%.
A large amount of effective historical data is the foundation of parameter identification. Unfortunately, a forging machine often works on batch processes whose parameters differ from batch to batch and may even be unknown for new forging pieces. This means the parameters of the mechanism model for a forging machine need to be determined from as few data as possible. From the perspective of data effectiveness, the classical parameter identification methods, whether offline estimation or online correction, are based on the least squares concept with the assumption that the data follow a normal distribution. They need an appropriate observation window because the statistical characteristics are hidden in the collected data. However, differences in forging material quality and the variable pressure caused by changes in pipe diameter and flow rate lead to disturbances whose noise follows an unknown distribution. So it is a challenge to determine the parameters of a model for a forging machine online to meet the needs of a complex environment.
Reinforcement learning (RL), motivated by psychology, statistics, neuroscience, and computer science, is about learning from interaction how to behave in order to achieve a design goal [32][33][34]. It will get rid of the limitation of training samples by learning directly from the raw data online. Through the learning process, an optimal action will be achieved to respond to the states. By sensing the current states, the RL does not need the assumption of prior distribution of noise. By episodes training, the action will overcome the overfitting difficulty and become robust due to eliminating the disturbance gradually. If the parameters were taken as the actions, they would be determined by reinforcement learning without thinking about the assumptions and disadvantages of the methods. In the case of a forging machine, it is a feasible approach to find the optimal values of the model parameters in a new condition under disturbances. There are some mature algorithms in the RL family, such as Q-learning [35], actor-critic [36], and deep reinforcement learning [37]. In this study, the Q-learning algorithm is proposed to determine the model parameters under the Processes 2021, 9, 1848 3 of 19 working condition due to its simplicity. The contributions of this paper can be summarized as follows: (1) The parameters are identified only based on the information of one period, which is promising for online control. (2) The values of parameters are determined directly by raw data without any assumptions of noisy characteristics. (3) The parameters have strong stability through a number of training episodes, which resists the bad influence of disturbance of unknown law.
The rest of this paper is organized as follows. Section 2 gives the model of the pressing-down phase of the forging machine, which shows the state variables and the parameters. Section 3 describes the RL procedure and presents the proposed approach. In Section 4, the model parameters are identified by the proposed approach and comparisons are made with two classical methods. Finally, conclusions are drawn in Section 5.

The Model of the Pressing-Down in Forging Machine
Semisolid metal forming by constant-speed isothermal forging is an important forging technique, especially for light-weight alloy forming in the aerospace industry. The typical structure of the forging machine is illustrated in Figure 1, and the model has been built in our previous work [38]. It is repeated here for completeness. In the pressing-down phase, the behavior of the forging machine is determined by the oil pipe-line, the proportional servo valve, and the hydraulic cylinder; the auxiliary attachments are neglected.

The Oil Pipe-Line
The pressing speed in the pressing-down phase is always slow to meet the process requirements, so the oil works in the state of laminar flow. Taking a column of oil in the pipe as the object, the pressure balance equation is in the form of Formula (1).
The difference between the input volume and the output volume is equal to the sum of the oil compression volume and the pipe swelling volume. So the oil continuity equation is
where $q_1$ and $q_2$ are the oil flow in the pipe and the output oil flow of the proportional servo valve, $p_1$ and $p_s$ are the input pressure of the proportional servo valve and the output pressure of the constant-rate pump, $S_1$ and $l$ are the sectional area and the length of the oil pipe, and $K$ is the equivalent volumetric elastic modulus of the oil.

Proportional Servo Valve
The performance of the proportional servo valve lies between that of the servo valve and that of the proportional valve. It eliminates the dead band by means of a hydraulic pilot stage. The proportional servo valve is widely applied in ultra-low-speed hydraulic machines to control the oil flow to the hydraulic cylinder. The proportional servo valve is described as
where $\xi$ and $\omega_n$ are the damping ratio and the natural frequency of the proportional servo valve, respectively, $K_q = K_n\sqrt{(p_1 - p_2)/\Delta p_n}$ is used to compensate the error between the practical pressure and the nominal pressure, and $A$ is the opening of the proportional servo valve.

The Hydraulic Cylinder
The pipe-line between the proportional servo valve and the hydraulic cylinder is omitted due to its short distance. The oil continuity equation of the hydraulic cylinder has the form
where $S_2$ is the plunger's sectional area of the exporting cavity of the hydraulic cylinder, $v$ is the moving speed of the plunger, $\lambda_c$ is the leakage coefficient of the hydraulic cylinder, $p_2$ is the output pressure of the proportional servo valve, and $V_c$ is the oil volume of the upper cavity of the hydraulic cylinder. The dynamic equation of the plunger is obtained from the force analysis in the form
where $m$ is the mass of the slide block, $g$ is the acceleration of gravity, $B$ is the viscous damping coefficient, $F$ is the load resistance, and $p_3$ is the holding pressure of the slide block. According to the design of the forging machine, the holding force of the slide block is equal to the gravity of the slide block:
Formula (6) is simplified to Formula (8) by substituting Formula (7) into Formula (6):

The Model of the System as a Whole
Let dt , x 4 = q 2 , x 5 = p 2 , and x 6 = v. By integrating the subsystems together, the global forging machine model can be described in the state-space form

Remark 1.
In the model, most parameters, such as the length and the sectional area of the oil pipe, the mass of the slide block, and the rated flow gain, can be assigned values according to the design. Parameters whose values are influenced by the surroundings or the working conditions will cause model inaccuracy.

Reinforcement Learning
The basic frame of reinforcement learning is shown in Figure 2. At each time step k, the agent makes observations x(k) ∈ X and takes action u(k) ∈ U, and receives reward R(x(k + 1), x(k), u(k)) ∈ R.

The expected return that is received in the long run is described using the state-action value function V(x, u), under the condition of first taking an arbitrary action u ∈ U from a certain state x ∈ X and subsequently acting according to a certain control series π. So the value function V^π(x(k), u(k)) at time k is defined as
where γ ∈ [0, 1] is the discount factor. The value function V^π(x(k + 1), u(k)) at time k + 1 is defined as
According to the theory of dynamic programming
Unfortunately, the value functions V^π(x(k), u(k)) and V^π(x(k + 1), u(k)) cannot be obtained because no one knows the rewards after time k + 1. To remove this obstacle, the Q-function is designed, with Q(x(k), u(k)) and Q(x(k + 1), u(k)) replacing V^π(x(k), u(k)) and V^π(x(k + 1), u(k)), respectively
Let
The action u(k) is then optimized by driving δ toward zero. As an important member of the reinforcement learning family, the Q-algorithm is carried out in basic steps as Procedure 1 [30].
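For orientation, the relations referred to in this paragraph can be written in their usual textbook form. This is only a sketch: the abbreviation R(k) = R(x(k + 1), x(k), u(k)) is an assumption introduced here, and the expressions are the standard discounted-return, dynamic-programming, and temporal-difference forms rather than a verified copy of the original formulas.

```latex
\begin{align*}
V^{\pi}(x(k),u(k)) &= \sum_{i=0}^{\infty} \gamma^{i} R(k+i), \\
V^{\pi}(x(k),u(k)) &= R(k) + \gamma\, V^{\pi}(x(k+1),u(k)), \\
\delta &= R(k) + \gamma\, Q(x(k+1),u(k)) - Q(x(k),u(k)).
\end{align*}
```

The second line follows from the first by splitting off the i = 0 term, and the third line is the residual that remains when Q is substituted for V^π in the second line.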

Remark 2.
Procedure 1 uses only state information. One can obtain the optimal action online from two states, x(k) and x(k + 1), in the process of maximizing the value function. In this way, online control becomes possible because the approach does not require a sliding window length to be chosen.
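The point of Remark 2 can be illustrated with a minimal sketch of one tabular Q-learning update. Procedure 1 is not reproduced in the text, so this is the standard Q-learning update and not necessarily the authors' exact procedure; the learning rate `alpha`, the state labels, and the action list are hypothetical.

```python
from collections import defaultdict


def q_update(q, x, u, reward, x_next, actions, alpha=0.1, gamma=0.9):
    """One temporal-difference update of a tabular Q-function.

    Only the two consecutive states x and x_next are needed, so no
    sliding window of past samples has to be stored.
    """
    best_next = max(q[(x_next, a)] for a in actions)
    delta = reward + gamma * best_next - q[(x, u)]  # temporal-difference error
    q[(x, u)] += alpha * delta
    return delta


q = defaultdict(float)  # Q-table; every entry starts at 0.0
actions = [0, 1]        # hypothetical discrete action set
delta = q_update(q, x="s0", u=1, reward=1.0, x_next="s1", actions=actions)
```

Each call consumes a single transition, which is why the update can run sample by sample during operation.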

The Proposed Approach
The scheme of the proposed approach is shown in Figure 3. A model that contains the undetermined parameter p (p ∈ R^m) runs in parallel with the forging machine under the controller. The state variables of the model are recorded as x(k) and x(k + 1) at samples k and k + 1, which are connected by a delay link z^−1. The undetermined parameter p is regarded as the action of the Q-algorithm. Therefore, the Q-algorithm following Procedure 1 is applied to determine the parameter p based on x(k) and x(k + 1), and finally the optimal parameter p* is obtained when the algorithm converges.
To explicate the Q-algorithm for the acquisition of model parameters, the key concepts of the proposed Q-algorithm are illustrated as follows.
(i). Action space, reward, and value function
The action space is made up of the undetermined parameter p. The design values of the parameters are usually inconsistent with the working condition, which degrades the model accuracy. The goal is to determine their values in response to the surroundings.
The forging machine's velocity is specified as a constant pressing speed or a given speed curve within a certain temperature range according to the properties of the forging materials, so the reward R(k) is selected as the reciprocal of the change in the absolute error between the measured speed and the set speed at adjacent sampling times k and k + 1
where v(k) and $v_{set}(k)$ are the measured speed and the preset speed at sample k; v(k + 1) and $v_{set}(k + 1)$ are the measured speed and the preset speed at sample k + 1. Here, v is used instead of $x_6$, the sixth component of the state vector x, only to stress the physical meaning. Let s = [x; u], so the value functions V(s(k), p(k)) and V(s(k + 1), p(k + 1)) at samples k and k + 1 are defined by Formulas (15) and (16)

(ii). Q-function
The value functions V(s(k), p(k)) and V(s(k + 1), p(k + 1)) are replaced by the Q-function according to the Q-algorithm because the value functions cannot be obtained due to the unknown rewards after sample k. The early Q-function, which is applied to discrete spaces, is presented as a look-up table with states as rows and actions as columns. When the states or actions are continuous, their discretization leads to the curse of dimensionality, with exponentially increasing algorithm complexity and insufficient storage. Therefore, a parameterized function is proposed to fit the Q-function with the form
where f and θ are a parameterized mapping and the parameters, respectively. Taking the augmented vector [s; p] as the argument, an approximator $\hat{Q}$ is used to substitute for the unknown parameterized mapping, and there is
where $\varphi_i(s, a)$ is usually selected as the Gaussian radial kernel function due to its simplicity, whose form is
in which $s_i$ is the center of the i-th radial kernel function and $\sigma_i$ is the width of the i-th radial kernel function.
There are two ways to determine the action in RL. Exploitation is used to get the best action from the Q-function based on the rewards received. Exploration is used to escape the local optima of exploitation by choosing the action randomly. As a compromise between exploitation and exploration, the ε-greedy algorithm is adopted to evolve the action. The agent selects the action that maximizes the Q-value function with probability ε, which is usually large. Otherwise, with probability 1 − ε, it selects an action randomly from the action space, which ensures that unknown regions are explored. The form of the ε-greedy algorithm is
where p(k) and p(k + 1) are the acquired parameters at k and k + 1, respectively, Pr is the probability of selecting an action, and U is the action set.
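The Q-function approximation and the ε-greedy selection described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the kernel normalization exp(−d²/(2σ²)), the finite candidate list, and all function names are hypothetical. It follows the convention of this section, in which the greedy action is taken with probability ε.

```python
import math
import random


def gaussian_features(z, centers, widths):
    """Gaussian radial kernels phi_i(z) = exp(-||z - c_i||^2 / (2 * sigma_i^2))."""
    feats = []
    for c, sigma in zip(centers, widths):
        dist2 = sum((zj - cj) ** 2 for zj, cj in zip(z, c))
        feats.append(math.exp(-dist2 / (2.0 * sigma ** 2)))
    return feats


def q_hat(s, p, theta, centers, widths):
    """Linear-in-parameters approximation: sum_i theta_i * phi_i([s; p])."""
    z = list(s) + list(p)  # augmented vector [s; p]
    phi = gaussian_features(z, centers, widths)
    return sum(t * f for t, f in zip(theta, phi))


def select_parameter(s, candidates, theta, centers, widths, eps=0.9, rng=random):
    """Greedy choice with probability eps, uniformly random choice otherwise."""
    if rng.random() < eps:
        return max(candidates, key=lambda p: q_hat(s, p, theta, centers, widths))
    return rng.choice(candidates)
```

With a single kernel centered at the origin, for example, `q_hat([0.0], [0.0], [2.0], [[0.0, 0.0]], [1.0])` evaluates to the weight 2.0, and candidates farther from the center receive smaller values.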

(iv). The Process of Method
The proposed algorithm is summarized as Procedure 2. In this procedure, the inputs are the states x(k) and x(k + 1) and the control u(k), whose physical meanings are given in Section 2, and the output is the parameter p.

Procedure 2.
Step 1: Give a state x(k) and the control u(k), and then construct s according to s = [x; u].
Step 2: Select the parameters p(k) randomly.

(v). Convergence
The convergence of the Q-algorithm can be found in [35,36].

Case Studies
The forging machine is usually in good condition during its early life. At this stage, after careful debugging, the parameter values coincide with the design condition, except for the viscous damping coefficient B, which is easily influenced by the temperature and the working condition. As time elapses, leakage becomes the main uncertainty of the forging machine. A little leakage is permitted if it does not affect the work process. Nevertheless, the forging machine needs to be repaired if the leakage becomes large. Therefore, we chose the viscous damping coefficient B and the leakage coefficient $\lambda_c$ as the parameters to be identified. These two parameters cannot be measured, which makes their values unverifiable in practice. As a result, we conducted a simulation to verify the proposed method.

Data Source
The state-space model (9) was used to simulate a forging machine. The values of the model parameters according to the design condition are shown in Table 1. A controller is necessary for a forging machine to guarantee the quality of the pressing process; therefore, a PID controller was used in the simulation. We chose a PID controller because the focus here is on verifying the proposed method rather than on discussing the control method, and the PID controller is sufficient to provide the states and the control for the proposed approach. The data series were generated by solving the model (9) with ODE45, which applies a fourth-order Runge-Kutta algorithm to provide the candidate solution and a fifth-order Runge-Kutta algorithm to control the error. Noise with either a uniform distribution or a Gaussian distribution was added to these continuous sequences to simulate real data. The set speed was changed from 0.02 to 0.08, which is consistent with the requirement of a typical pressing process. A typical control process that includes a transition process and a stable process is shown in Figure 4. The subsequent simulation was carried out on the MATLAB R2011b platform with an Intel® Core™ 2 Duo CPU E7300 @ 2.66 GHz, 2.67 GHz computer.
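The noise-injection step can be sketched in a few lines. This is a minimal illustration, not the simulation used in the paper: it perturbs a hypothetical set-speed profile rather than a solution of model (9), and the noise level `scale` is an assumption because the text does not state one.

```python
import random


def noisy_series(clean, kind, scale, seed=0):
    """Return a copy of `clean` with additive noise of the given kind.

    kind  -- "uniform" (drawn from [-scale, scale]) or "gaussian" (std = scale)
    scale -- hypothetical noise level; the text does not specify one
    """
    rng = random.Random(seed)
    if kind == "uniform":
        draw = lambda: rng.uniform(-scale, scale)
    elif kind == "gaussian":
        draw = lambda: rng.gauss(0.0, scale)
    else:
        raise ValueError(f"unknown noise kind: {kind}")
    return [value + draw() for value in clean]


# Hypothetical set-speed profile stepping from 0.02 to 0.08.
set_speed = [0.02] * 50 + [0.08] * 50
uniform_data = noisy_series(set_speed, "uniform", scale=0.002)
gaussian_data = noisy_series(set_speed, "gaussian", scale=0.002)
```

Fixing the seed makes the two noisy data sets reproducible, which helps when comparing training runs under the two distributions.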

Acquisition of the Viscous Damping Coefficient
According to experiments, the viscous damping coefficient B usually lies in the range of 10-30 for this model. As a result, the value of 15 was chosen as the predetermined value and targeted by the proposed approach according to Procedure 2. The episode training process is shown in Figure 5, where the upper subgraph is with uniformly distributed noise and the lower subgraph is with Gaussian noise. The training time depends on the nature of the object and on the computer performance. To avoid time differences caused by different computer performance, we used the number of episodes as the index of training time.
Figure 5 shows there is a trial process at the beginning of training because there is no prior information on B. After a trial of about 3000 episodes, the best historical value of B, which is 20 for the upper subgraph and 15.0626 for the lower subgraph, appears during the process of seeking the best reward. After about 10,000 episodes, a better value of 14.5000 occurs for the upper subgraph. In contrast, the value of 15.0626 for the lower subgraph remains unchanged until the episodes terminate.
The viscous damping coefficient B was changed from 15 to 20 to test the proposed method. The episodes training process is shown under a uniform distribution (the above subgraph) and under a Gaussian distribution (the below subgraph). Figure 6 shows the training episodes process similar to Figure 5. It is also seen that the trial process of Figure 6 lasts about 3000 episodes.
In order to show the accuracy of parameter acquisition, the relative error δ between the estimated value $\hat{B}$ and the predetermined value $B_r$ is defined as $\delta = \frac{|\hat{B} - B_r|}{B_r} \times 100\%$, and the results are shown in Table 2. It is seen from Figures 5 and 6 that excellent results, with relative errors no greater than 5%, were obtained in the cases of noise with different distributions.
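As a worked check of the relative error, the sketch below assumes the usual definition, the absolute deviation divided by the predetermined value and expressed in percent, and applies it to the best values of B quoted above for Figure 5. The function name is hypothetical.

```python
def relative_error_percent(estimate, reference):
    """Relative error |estimate - reference| / |reference| in percent."""
    if reference == 0:
        raise ValueError("reference must be nonzero")
    return abs(estimate - reference) / abs(reference) * 100.0


# Best values of B quoted for Figure 5, against the predetermined value 15.
for estimate in (14.5, 15.0626):
    print(f"{estimate}: {relative_error_percent(estimate, 15.0):.2f}%")
```

Under this assumed definition the two values give about 3.33% and 0.42%, both within the 5% bound stated in the text.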
Further tests under the condition of oil leakage were done to verify the effectiveness of the proposed approach. For a forging machine, the leakage tends to saturate and is limited to a small value, so the leakage coefficient $\lambda_c$ was assumed to be a constant of 0.01 or 0.02. The episode training processes are shown in Figures 7-10, and the results are listed in Table 3. Table 3 shows that the viscous damping coefficient approaches the predetermined value $B_r$ under different leakage coefficients and different noise distributions, with a maximal relative error of less than 2%. The training time differs for different parameters, such as about 6000 episodes in Figure 7, about 4000 episodes in Figure 9, and about 3000 episodes in Figure 10. Sometimes the noise distribution also affects the training speed, as shown in Figure 8.

Acquisition of the Leakage Coefficient
Leakage, which is represented in the model by the leakage coefficient $\lambda_c$, becomes the main uncertainty as the forging machine ages. The leakage coefficient was predetermined as a constant of 0.01 or 0.02. The learning processes with uniform distribution and with Gaussian distribution are shown in Figures 11 and 12, respectively. As for the training time, it is affected by the noise distribution in Figure 11 and is about 5000 episodes in Figure 12.

The values of the leakage coefficient $\hat{\lambda}_c$ are acquired when the curve becomes stable. Here, the absolute error E, defined as $E = |\hat{\lambda}_c - \lambda_c|$ (22), was used to replace the former relative error because the value of the leakage coefficient is too small to serve as the denominator of a relative error, which would make the relative error misleading.
The results are listed in Table 4. Table 4 shows that the absolute errors are not more than 0.0015 in the cases of noise with different distributions.

Acquisition of the Viscous Damping Coefficient and the Leakage Coefficient
In order to test higher dimensionality of parameters, an experiment on acquiring concurrently the viscous damping coefficient and the leakage coefficient was done. The parameters of B and λ c were predetermined as 18 and 0.01, respectively. The learning processes with uniform distribution and with Gaussian distribution are shown in Figures 13 and 14, respectively, and the results are shown in Table 5, which shows both parameters can reach a good estimation concurrently in the cases of noisy conditions. Here, all the training times are less than 5000 episodes.
Figure 14. The episode training process of viscous damping coefficient and leakage coefficient subject to noise of Gaussian distribution.

Comparison with Other Methods
The well-known BP network approach and the sliding window correlation method were chosen for comparison with the proposed approach. A data series of 160 samples, produced by the model with a controller, was used as the data source for determining the parameters. This data series includes a transient process of 50 samples and a stable process of 110 samples, based on a viscous damping coefficient $B$ of 15.
The BP network is known for its strong nonlinear approximation ability and its good estimation in recursive problems, for which the length of the input time series needs to match the order of the system. Here, we focused on identifying the viscous damping coefficient $B$ within just one period. After several attempts, a 7-20-1 structure was chosen for the BP network, with an input of seven variables (six states and one control in the model of Section 2) and an output of the viscous damping coefficient $B$. It was trained by the back propagation algorithm on a training set of 2000 data points from different cases in which the set speed was varied from 0.02 to 0.08. The learning rate was 0.001. The well-trained BP network was used to estimate the values of the viscous damping coefficient, and the results are shown in Figure 15.
The values of the viscous damping coefficient from sample 1 to sample 160 estimated by the BP network and by the proposed approach are shown with the black curve and the red curve, respectively. The BP network approaches the true viscous damping coefficient in the stable process, but it performs poorly in the transient process. The proposed approach performs well throughout the whole process, reaching 15.0625 against the target of 15.0000.
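For readers who wish to reproduce a baseline of this kind, a 7-20-1 network trained by back propagation can be sketched as follows. This is only a minimal illustration under stated assumptions: the tanh hidden activation, the full-batch gradient descent, the weight initialization, and the synthetic training data are all assumptions, since the forging machine model of Section 2 is not reproduced here. Only the layer sizes, the 2000 training samples, and the learning rate of 0.001 are taken from the text.

```python
import numpy as np


def train_bp_network(X, y, hidden=20, lr=0.001, epochs=300, seed=0):
    """Train a one-hidden-layer network by full-batch back propagation."""
    rng = np.random.default_rng(seed)
    n, n_in = X.shape
    W1 = rng.normal(0.0, 0.5, (n_in, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.5, (hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Forward pass: tanh hidden layer (assumed), linear output.
        h = np.tanh(X @ W1 + b1)
        err = (h @ W2 + b2) - y
        losses.append(float(np.mean(err ** 2)))
        # Backward pass: gradients of the mean squared error.
        g_out = 2.0 * err / n
        g_W2 = h.T @ g_out
        g_b2 = g_out.sum(axis=0)
        g_z = (g_out @ W2.T) * (1.0 - h ** 2)
        g_W1 = X.T @ g_z
        g_b1 = g_z.sum(axis=0)
        # Gradient descent step with the learning rate from the text.
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses


def predict(params, X):
    """Evaluate the trained network on inputs X of shape (n, 7)."""
    W1, b1, W2, b2 = params
    return np.tanh(X @ W1 + b1) @ W2 + b2


# SYNTHETIC stand-in data: 2000 samples of seven inputs (six states and one
# control) and a normalized scalar target. Not the forging machine data.
rng = np.random.default_rng(1)
X_train = rng.uniform(-1.0, 1.0, (2000, 7))
w_hidden = rng.normal(0.0, 1.0, (7, 1))  # hypothetical generating weights
y_train = np.tanh(X_train @ w_hidden)

params, losses = train_bp_network(X_train, y_train)
print(f"MSE at first epoch: {losses[0]:.4f}, at last epoch: {losses[-1]:.4f}")
```

With such a small learning rate the loss decreases slowly, so many more epochs than shown would be needed before the network could be called well trained.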
The sliding window correlation method, a conventional parameter identification method for data series, was applied to estimate the values of the viscous damping coefficient by minimizing the sum of squared errors within each observation window. Because the data in the sliding window are affected by the disturbance, the statistical properties of the observation window are prone to change. Window lengths of 2, 5, 10 and 50 were chosen, and the results are shown in Figure 16.
It is seen from Figure 16 that the sliding window correlation method and the proposed approach have similar accuracy throughout the process from sample 1 to sample 160. However, the estimates of the sliding window correlation method fluctuate, depending on the window length: the shorter the sliding window, the more sensitive the result, and vice versa. In contrast, the proposed approach shows good stability owing to its episode training.
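The dependence of the fluctuation on the window length can be reproduced with a small sketch of a sliding-window least-squares estimator. The sketch assumes a simplified stand-in model, f = B v plus Gaussian noise, rather than the forging machine model of Section 2; the speed profile, the noise level, and all names are hypothetical. Only the 160 samples, the value B = 15, and the window lengths 2, 5, 10 and 50 are taken from the text.

```python
import numpy as np


def sliding_window_estimate(v, f, window):
    """Least-squares estimate of B in f = B * v for every full window.

    Minimizing sum((f - B * v)**2) over a window gives
    B = sum(v * f) / sum(v * v).
    """
    estimates = []
    for end in range(window, len(v) + 1):
        vw = v[end - window:end]
        fw = f[end - window:end]
        estimates.append(float(vw @ fw) / float(vw @ vw))
    return np.array(estimates)


B_TRUE = 15.0      # viscous damping coefficient used in the comparison
N_SAMPLES = 160    # length of the data series in the comparison

rng = np.random.default_rng(0)
# HYPOTHETICAL speed profile and noise level, for illustration only.
v = 0.05 + 0.01 * np.sin(np.linspace(0.0, 4.0 * np.pi, N_SAMPLES))
f = B_TRUE * v + rng.normal(0.0, 0.05, N_SAMPLES)

spread = {}
for window in (2, 5, 10, 50):
    b_hat = sliding_window_estimate(v, f, window)
    spread[window] = float(b_hat.std())
    print(f"window {window:2d}: mean = {b_hat.mean():.3f}, std = {spread[window]:.3f}")
```

A short window averages over few noisy samples, so its estimates scatter widely, whereas a long window is smoother but responds more slowly to changes and yields no estimate until the window is full.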
The advantages and disadvantages of the three methods are summarized in Table 6. The proposed approach can obtain a highly accurate viscous damping coefficient in both the steady state and the transient state within only one period. To the best of our knowledge, no other approach identifies model parameters with so little information, which is beneficial to online control. However, it is limited to slow processes of the forging machine because of its long training time, although some improvements have been made, such as eligibility traces and heuristic search. A hardware implementation of the proposed approach would be attractive for broader industrial processes.

Conclusions
In this paper, reinforcement learning has been applied to identify optimal parameter values online by directly using raw data in one period. Compared with the BP network approach, the proposed technique has good accuracy throughout the whole process. Compared with the sliding window correlation method, the proposed method has similar accuracy but a better ability to resist the influence of noise. As a result, the proposed approach has been demonstrated to be effective for online parameter identification in a simulation of the real-time process of a forging machine.

Acknowledgments: The authors would like to acknowledge the research support from the School of Electrical Engineering and Automation at Tianjin University, and the E&E faculty at the University of Northumbria at Newcastle.

Conflicts of Interest:
The authors declare no conflict of interest.

Nomenclature
Output pressure of proportional servo valve
$P_s$  Pressure of a constant rate pump output
$\Delta p_n$  Valve port pressure drop
$q_1$  Oil flow in pipe
$q_2$  Output oil flow of proportional servo valve
$R$  Intermediate coefficient
$S_1$  Sectional area of pipe
$S_2$  Plunger's sectional area of exporting cavity of hydraulic cylinder
$u$  Control voltage of proportional servo valve
$v$  Moving speed of plunger
$V_0$  Initial oil volume of upper cavity of hydraulic cylinder
$V_c$  Current oil volume of upper cavity of hydraulic cylinder