Analysis of Recurrent Neural Network and Predictions

This paper analyzes the operating principle and predicted values of the recurrent neural network (RNN), the most basic neural-network structure and the one best suited to data that change over time, among the many structures used for artificial intelligence (AI). In particular, an RNN in which all connections are symmetric is guaranteed to converge. The operating principle of an RNN is a composition of linear combinations of data with nonlinear activation functions. The linear combinations resemble the autoregressive moving-average (ARMA) method of statistical processing. However, the distortion introduced by the nonlinear activation function causes the RNN prediction to differ from the ARMA prediction. From this, we obtain the limit of the predicted value of an RNN and the range over which the prediction changes with the learning data. In addition to mathematical proofs, numerical experiments confirm our claims.


Introduction
Artificial intelligence (AI) is entering our daily lives. In the near future, machines may replace careers in a variety of fields: driverless cars will become commonplace, as will personal routine assistants, automatic response system (ARS) counsellors, and bank clerks. In the age of machines, it is natural to let machines do the work [1][2][3][4][5], which motivates studying the operating principle of the machine and the direction of its predictions. In this paper, we analyze the principles of operation and prediction of a recurrent neural network (RNN) [6][7][8].
The RNN is an AI methodology that handles incoming data in time order. It learns about changes over time and predicts them. This predictability is possible because of the recurrent structure, and it produces results similar to those of time-series analysis in general statistical processing [9][10][11][12]. In a time series, the predicted value is obtained by calculating the general term of a recurrence relation. The RNN calculation is very similar to that of a time series, but the activation function in a neural-network (NN) structure is nonlinear, so nonlinear effects appear in the prediction. For this reason, it is very difficult to find the predicted value of an RNN analytically. However, thanks to the advantages of the recurrent structure and the development of artificial-neural-network (ANN) calculation methods, the accuracy of predicted values keeps improving. This has led to further development of, and greater demand for, ANNs based on RNNs. For example, long short-term memory (LSTM), gated recurrent units (GRUs), and R-RNNs [13][14][15][16] all start from an RNN and are used in various fields. In other words, RNN-based neural networks are used to learn changes over time and to make the corresponding predictions.
There are not many papers attempting to interpret recurrent structures, and results are also lacking. The recurrent structure finds an expected value by being applied iteratively as data arrive in time order; the goal is to predict future values from past data. When a future value is unknown, it is natural to use the information one has to predict it. Such logical methods include the time-series method of statistical processing, which is a numerical method. The RNN structure closely resembles a combination of these methods. The autoregressive moving average (ARMA) in time-series analysis predicts future values through a recurrence relation built from a linear combination of historical data. More details can be found in [17,18]. Taylor-expanding an RNN under certain constraints likewise yields a linear combination of historical data, as in the time series; details are given in the text. From these results, this paper describes the range of the predicted value of an RNN. This paper is organized as follows. Section 2 introduces and analyzes the RNN and relates it to existing methods. Section 3 explains the change of the predicted value through the RNN. Section 4 confirms our claims through numerical experiments.

RNN and ARMA Relationship
In this section, we explain how an RNN works by interpreting its structure. In particular, the RNN is closely related to the ARMA format in statistical processing; more details can be found in [19][20][21]. This is explained through the following process.

RNN
In this section, we explain the basic RNN among the various modified RNNs; for convenience, RNN refers to the basic RNN. The RNN that we deal with is

y_t = w_1 h_t + b_y,

where t represents time, y_t is a predicted value, w_1 and b_y are real values, and h_t is the hidden layer. The hidden layer is computed by

h_t = tanh (w_2 x_t + w_3 h_{t-1} + b_h),

where x_t is input data, w_2, w_3, and b_h are real values, and h_{t-1} is the previous hidden layer. For machine learning, let LS be the set of learning data, and let κ > 2 be the size of LS. In other words, when the first departure time of the learning data is 1, we can write LS = {x_1, x_2, ..., x_κ}. Assuming that the initial condition of the hidden layer is 0 (h_0 = 0), we can compute y_t for each time t. Since x_t is data at time t and y_t is a predicted value, we want the equality y_t = x_{t+1} to hold. Because this equality does not hold exactly, an error occurs between y_t and x_{t+1}. So, let E_t = (y_t - x_{t+1})^2 and E = ∑_{t=1}^{κ-1} E_t. Machine learning based on an RNN is therefore the process of finding w_1, w_2, and w_3 (together with the biases b_y and b_h) that minimize the error value E. We used x_1, x_2, ..., x_{κ-1} in the learning data LS to find the parameters that minimize E, and used them to predict the values (y_κ, y_{κ+1}, ...) after time κ. More details can be found in [22][23][24][25].
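As a concrete illustration, the forward pass and error above can be sketched in a few lines of NumPy. The weight values and toy learning data here are hypothetical, chosen only to make the snippet runnable.

```python
import numpy as np

def rnn_forward(x, w1, w2, w3, b_y, b_h):
    """Scalar RNN: h_t = tanh(w2*x_t + w3*h_{t-1} + b_h), y_t = w1*h_t + b_y, h_0 = 0."""
    h, ys = 0.0, []
    for x_t in x:
        h = np.tanh(w2 * x_t + w3 * h + b_h)
        ys.append(w1 * h + b_y)
    return np.array(ys), h

def rnn_error(x, w1, w2, w3, b_y, b_h):
    """E = sum_t (y_t - x_{t+1})^2 over the learning set LS = {x_1, ..., x_kappa}."""
    ys, _ = rnn_forward(x[:-1], w1, w2, w3, b_y, b_h)
    return float(np.sum((ys - np.asarray(x[1:])) ** 2))

LS = [0.0, 0.1, 0.2, 0.3, 0.4]               # toy learning data
E = rnn_error(LS, 0.9, 0.4, 0.5, -0.1, 0.1)  # hypothetical parameters
```

Training then amounts to minimizing E over the parameters, e.g., by gradient descent.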

ARMA in Time Series
People have long wanted to predict stocks. This requires predictions from historical data, and various methods have been studied and utilized. The most widely and commonly used is the ARMA method, which was developed on the basis of statistics. This method simply expresses the value to be predicted as a linear combination of historical data:

x̂_{l+1} = C_0 x_l + C_1 x_{l-1} + · · · + C_κ x_{l-κ} + C_*,

where x_0, · · · , x_κ are given data, and we can calculate the predicted value x̂_{l+1} by calculating the values of C_0, · · · , C_κ, and C_*. To obtain these coefficients, there are various methods, such as optimization from numerical data values, Yule-Walker estimation, and correlation calculation. This recurrence relation is used to predict future values through the calculation of its general term. More details can be found in [17].
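A minimal sketch of this idea: fit the coefficients by least squares on lagged data (one of the estimation options mentioned above) and predict one step ahead. The order p = 2 and the toy data are illustrative assumptions.

```python
import numpy as np

def fit_ar(x, p):
    """Fit x_hat_{t+1} = C_1 x_t + ... + C_p x_{t-p+1} + C_* by least squares."""
    x = np.asarray(x, dtype=float)
    rows = [np.r_[x[t - p + 1:t + 1][::-1], 1.0] for t in range(p - 1, len(x) - 1)]
    A, b = np.array(rows), x[p:]
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return coef  # C_1..C_p, then the constant C_*

def predict_next(x, coef):
    """One-step-ahead prediction from the last p values."""
    p = len(coef) - 1
    return float(np.r_[np.asarray(x[-p:], dtype=float)[::-1], 1.0] @ coef)

x = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]   # toy data with a linear trend
coef = fit_ar(x, p=2)
x_next = predict_next(x, coef)        # the trend continues: 0.6
```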

RNN and ARMA
In an RNN, the hidden layer is constructed with the hyperbolic tangent function:

h_t = tanh (w_2 x_t + w_3 h_{t-1} + b_h).

Function tanh is expanded as

tanh (x) = x - x^3/3 + 2x^5/15 - · · · ,

where x is in [-π/2, π/2]. Using this fact and expanding h_t,

h_t = w_2 x_t + w_3 h_{t-1} + b_h + e_t,

where e_t is the error from truncating the expansion after the linear term. Since the same process is repeated for h_{t-1},

h_t = w_2 x_t + w_3 (w_2 x_{t-1} + w_3 h_{t-2} + b_h + e_{t-1}) + b_h + e_t.

Repeating this substitution,

h_t = w_2 x_t + w_3 w_2 x_{t-1} + w_3^2 w_2 x_{t-2} + w_3^3 w_2 x_{t-3} + · · · + (1 + w_3 + w_3^2 + · · ·) b_h + ẽ_t,

where ẽ_t collects the truncation errors. If w_3 is less than 0.1, the terms after the fourth order (w_3^4) are too small to affect the value to be predicted. Conversely, if w_3 is greater than 1, the value to be predicted increases exponentially. Under the assumption that we can expand the hyperbolic tangent function (tanh), condition w_3 < 1 must therefore hold. Since we can change only w_1, w_2, and w_3, the RNN prediction can be written as

y_t ≈ w_1 (w_2 x_t + w_3 w_2 x_{t-1} + w_3^2 w_2 x_{t-2} + w_3^3 w_2 x_{t-3} + w_3^4 w_2 x_{t-4}) + C,

for a constant C. This is an ARMA of order 5. More details can be found in [18]. This development was derived under the premise that the argument of the tanh function is smaller than a specific value (tanh(x) with |x| < π/2), and is therefore limited in terms of utilization.
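The truncation underlying this argument can be checked numerically: for small arguments, the fifth-order Taylor polynomial of tanh is already extremely close, which is what justifies the linearization above. The test range |x| ≤ 0.1 is an illustrative assumption.

```python
import numpy as np

def tanh_taylor5(x):
    """Fifth-order Taylor polynomial of tanh about 0 (series valid for |x| < pi/2)."""
    return x - x**3 / 3 + 2 * x**5 / 15

# On |x| <= 0.1 the neglected term is O(x^7), i.e., on the order of 1e-8 or less.
xs = np.linspace(-0.1, 0.1, 201)
max_err = float(np.max(np.abs(np.tanh(xs) - tanh_taylor5(xs))))
```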

Analysis of Predicted Values
From the above section, w_1, w_2, w_3, b_y, and b_h are fixed after learning. Beyond the learning data, the predicted value is fed back as the next input, so we obtain the sequence {y_κ} from the following equality:

y_{κ+m} = w_1 h_{κ+m} + b_y,  h_{κ+m} = tanh (θ h_{κ+m-1} + b),    (14)

where θ = w_1 w_2 + w_3 and b = w_2 b_y + b_h.

Theorem 1. Sequence {h_κ} is bounded and has a converging subsequence.

In order to see the change in the value of h_κ, if the limit of h_κ is h, Equation (14) is written as h = tanh (θh + b). Therefore, the value of h that satisfies this equation changes with the values of θ and b.
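The limit equation h = tanh (θh + b) can be explored numerically by simply running the recursion; for θ < 1 the map is a contraction, so the iteration settles on the unique fixed point. The values θ = 0.5, b = 0.2 are illustrative.

```python
import numpy as np

def fixed_point(theta, b, h0=0.0, n=200):
    """Iterate h <- tanh(theta*h + b) from h0 and return the final iterate."""
    h = h0
    for _ in range(n):
        h = np.tanh(theta * h + b)
    return float(h)

h_star = fixed_point(theta=0.5, b=0.2)
# h_star should satisfy h = tanh(0.5*h + 0.2) up to iteration tolerance.
residual = abs(h_star - np.tanh(0.5 * h_star + 0.2))
```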

Limit Points of Prediction Values
We now analyze the convergence value of the sequence. In order to study the convergence, we introduce the function

g(z) = tanh (θz + b) - z.

For calculational convenience, Equation (14) is rewritten as the iteration

z_{κ+1} = tanh (θ z_κ + b) = z_κ + g(z_κ),

where z_0 is an initial condition. If z_κ converges, its limit z* satisfies Equation (17) (z* = tanh (θz* + b)). Therefore, we have to look at the roots that satisfy the expression in Equation (17), that is, the zeros of g.
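A small root-counting sketch for g(z) = tanh (θz + b) - z: since |tanh| < 1, every root lies in (-1, 1), so a sign-change scan over a bracketing interval plus bisection finds them all. The grid width and the test values of (θ, b) are assumptions for illustration.

```python
import numpy as np

def roots_of_g(theta, b, lo=-2.0, hi=2.0, n=4001):
    """Locate the zeros of g(z) = tanh(theta*z + b) - z by scanning for
    sign changes and refining each bracket with bisection."""
    zs = np.linspace(lo, hi, n)
    g = np.tanh(theta * zs + b) - zs
    roots = []
    for i in range(n - 1):
        if g[i] == 0.0:
            roots.append(float(zs[i]))
        elif g[i] * g[i + 1] < 0:
            a, c = zs[i], zs[i + 1]
            for _ in range(60):   # bisection on the bracket [a, c]
                m = 0.5 * (a + c)
                if (np.tanh(theta * a + b) - a) * (np.tanh(theta * m + b) - m) <= 0:
                    c = m
                else:
                    a = m
            roots.append(0.5 * (a + c))
    return roots
```

For example, roots_of_g(0.5, 0.2) returns a single root (θ ≤ 1), while roots_of_g(1.3, 0.0) returns three.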
Theorem 3. If θ ≤ 1, then Equation (17) has exactly one solution.
Under the assumption that θ > 1, two values satisfying g′(z) = 0 necessarily exist, since g′(z) = θ sech^2 (θz + b) - 1. Therefore, assuming θ > 1, we find z_l and z_r satisfying θ sech^2 (θz_l + b) - 1 = θ sech^2 (θz_r + b) - 1 = 0 and, taking z_l < z_r, we have g(z_l) < g(z_r). From computing g′, we have g′(z) < 0 on z < z_l, g′(z) > 0 on z_l < z < z_r, and g′(z) < 0 on z_r < z; thus z_l is a local minimum and z_r a local maximum of g. Let b_l denote the value of b for which g(z_l) = 0, and b_r the value for which g(z_r) = 0; then b_r < b_l. We have

Theorem 4. Assume θ > 1. If b = b_l or b = b_r, then g has two zeros. If b_r < b < b_l, then g has three zeros. If b_l < b or b < b_r, then g has one zero.
Proof. This proof assumes that θ > 1. If b < b_r, then g(z_r) < 0, and hence g(z_l) < g(z_r) < 0. Since g(z) is monotonically decreasing on z < z_l, there exists a unique solution of g(z) = 0, and it lies on z < z_l. If b = b_r, then g(z_r) = 0. Since g(z_l) < g(z_r) = 0, there exists a unique solution of g(z) = 0 on z < z_l for the same reason. So, if b = b_r, we have two solutions: one with g(z) = 0 on z < z_l, and the other z_r itself. If b_r < b < b_l, we have g(z_l) < 0 and g(z_r) > 0. There are three solutions: g(z) = 0 on z < z_l, on z_l < z < z_r, and on z_r < z. If b = b_l, then g(z_l) = 0. Since g(z_r) > 0 and g is monotonically decreasing on z_r < z, there is a solution satisfying g(z) = 0 there. So, if b = b_l, we have two solutions: z_l itself and g(z) = 0 on z_r < z. If b_l < b, then g(z_l) > 0. Since g(z_r) > g(z_l) > 0 and g is decreasing on z_r < z, there is a unique solution with g(z) = 0 on z > z_r.
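The boundary values b_l and b_r admit a closed form: at a tangency, g(z) = 0 and g′(z) = 0 hold simultaneously, which with u = θz + b gives tanh u = ±sqrt(1 - 1/θ) and b = u - θ tanh u. The sketch below evaluates this for θ = 1.3, matching (up to rounding) the b = 0.101 used in the experiments of Section 4.

```python
import numpy as np

def tangency_b(theta):
    """For theta > 1, return (b_r, b_l): the b values at which
    z = tanh(theta*z + b) has a double root (g = g' = 0 simultaneously)."""
    s = np.sqrt(1.0 - 1.0 / theta)   # |tanh(u)| at the tangency; z_l = -s
    u = np.arctanh(s)
    b_l = -u + theta * s             # tangency at the left critical point z_l
    b_r = u - theta * s              # symmetric tangency at z_r
    return float(b_r), float(b_l)

b_r, b_l = tangency_b(1.3)           # approximately (-0.101, 0.101)
```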
In this section, we saw how the number of solutions of Equation (17) changes as the values of θ and b change. Next, the change of the sequence according to its initial condition, and according to the number of solutions of Equation (17), is explained. Figure 1 shows the regions of (θ, b) corresponding to each number of solutions.

Change of Prediction Values (Sequence)
We examined the number of solutions of g depending on the values of θ and b. In order to see the change of the predicted value according to the change of θ and b, Equation (14) was changed to z_{i+1} = tanh (θz_i + b), and sequence {z_i} was obtained. Sequences {z_i}, g, and h_κ have the following relationship: z_{i+1} = z_i + g(z_i) and z_0 = h_κ. Therefore, the predicted value y_{κ+m} was obtained by y_{κ+m} = w_1 h_{κ+m} + b_y with h_{κ+m} = z_m. By z_{i+1} = z_i + g(z_i), the zeros of g are the limit points of sequence {z_i}. One of the reasons we interpret the predictions is to identify their limiting behavior (the changing value of the predictions). We saw in the previous theorem the various cases in which function g vanishes; the change of the sequence according to initial condition z_0 in each case is explained below.
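The behavior described here can be reproduced directly. In the one-solution region with θ > 1 and b > b_l, every initial condition should reach the same limit; the parameter values below are illustrative.

```python
import numpy as np

def iterate(theta, b, z0, n=500):
    """Iterate z <- tanh(theta*z + b) n times from z0."""
    z = z0
    for _ in range(n):
        z = np.tanh(theta * z + b)
    return float(z)

# theta = 1.3 with b = 0.3 > b_l (~0.101): Equation (17) has a single root z*,
# and by Theorem 5 every initial condition is driven to it monotonically.
z_from_low = iterate(1.3, 0.3, -0.9)
z_from_high = iterate(1.3, 0.3, 0.9)
```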
Theorem 5. Assume θ > 1 and b_l < b, and let z* be the unique solution of g(z) = 0. Then sequence {z_i} converges to z*.

Proof. Under conditions θ > 1 and b_l < b, we have g(z) > 0 on z < z* and g(z) < 0 on z* < z. If z_0 < z*, then g(z_0) > 0 and, from computing, {z_i} is a monotonically increasing sequence bounded above by z*; so sequence {z_i} converges to z*. If z* < z_0, then g(z_0) < 0 and, from computing, {z_i} is a monotonically decreasing sequence; therefore sequence {z_i} converges to z*.

Theorem 6. Assume θ > 1 and b = b_l; then there exist two solutions z_l and z* (z_l < z*) that satisfy g(z) = 0. If z_0 ≤ z_l, sequence {z_i} converges to z_l. If z_l < z_0, sequence {z_i} converges to z*.

Proof. We have g(z) ≥ 0 on z < z*, so {z_i} is monotonically increasing there. If z_0 < z_l, then {z_i} is bounded above by z_l and converges to z_l; if z_l < z_0 < z*, then {z_i} converges to z*. On z* < z_0, g(z_0) < 0, so {z_i} is monotonically decreasing and converges to z*.

Theorem 7. Assume θ > 1 and b_r < b < b_l, and denote the three solutions of g(z) = 0 by z_l < z* < z_r. If z_0 < z*, {z_i} converges to z_l; if z_0 > z*, {z_i} converges to z_r, where z_0 is the initial condition.
Proof. From computing g(z), we have g(z) > 0 on z < z_l, that is, tanh (θz_i + b) > z_i for z_i < z_l. Therefore, for z_0 < z_l, sequence {z_i} is monotonically increasing and converges to z_l. Since g(z_l) = g(z*) = 0 and g has no zero between them, g(z) < 0 on z_l < z < z*. On z_l < z_0 < z*, we thus have g(z_i) = tanh (θz_i + b) - z_i < 0; sequence {z_i} is monotonically decreasing, and the convergence value is z_l. With the same calculation, g(z*) = g(z_r) = 0 and g(z) > 0 on z* < z < z_r, so g(z_i) = tanh (θz_i + b) - z_i > 0 on z* < z_0 < z_r; sequence {z_i} is monotonically increasing, and the convergence value is z_r. If z > z_r, g(z) < 0. Therefore, g(z_i) = tanh (θz_i + b) - z_i < 0 on z_0 > z_r; sequence {z_i} is monotonically decreasing, and the convergence value is z_r.

Theorem 8. Assume θ > 1 and b = b_r; then there exist two solutions z* and z_r (z* < z_r) that satisfy g(z) = 0. If z_r < z_0, sequence {z_i} converges to z_r. If z* < z_0 < z_r, sequence {z_i} converges to z*. If z_0 < z*, sequence {z_i} converges to z*.

Proof. If z_r < z_0, then g(z_0) < 0; sequence {z_i} is monotonically decreasing and converges to z_r. If z* < z_0 < z_r, then g(z_0) < 0; sequence {z_i} is monotonically decreasing and converges to z*. If z_0 < z*, then g(z_0) > 0; sequence {z_i} is monotonically increasing and converges to z*.

Theorem 9. Assuming θ > 1 and b < b_r, sequence {z_i} converges to z*, where z* satisfies g(z*) = 0.
For θ > 0, function tanh (θz + b) is an increasing function of z, and the sign of θz does not change. However, for θ < 0, function tanh (θz + b) is a decreasing function of z, and the sign of θz alternates.

Theorem 11. Assuming -1 < θ < 0, sequence {z_i} converges to z*, where z* satisfies g(z*) = 0.

Proof. By the mean value theorem,

|z_{i+1} - z_i| = |tanh (θz_i + b) - tanh (θz_{i-1} + b)| = |θ| sech^2 (θζ + b) |z_i - z_{i-1}| ≤ |θ| |z_i - z_{i-1}|,

where ζ is between z_{i-1} and z_i. Therefore, since |θ| < 1, sequence {z_i} is a Cauchy sequence that converges to z*.

Theorem 12. Assuming θ ≤ -1, sequence {z_i} converges to z*, where z* satisfies g(z*) = 0, or sequence {z_i} oscillates.
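The contrast between Theorems 11 and 12 is easy to observe numerically: for -1 < θ < 0 the iteration contracts, while for θ ≤ -1 it can settle into a period-two oscillation. The parameter choices below are illustrative.

```python
import numpy as np

def orbit_tail(theta, b, z0, n=400):
    """Iterate z <- tanh(theta*z + b) and return the last four iterates."""
    z, out = z0, []
    for i in range(n):
        z = np.tanh(theta * z + b)
        if i >= n - 4:
            out.append(float(z))
    return out

tail_conv = orbit_tail(-0.5, 0.2, 0.3)  # -1 < theta < 0: contraction, converges
tail_osc = orbit_tail(-3.0, 0.0, 0.3)   # theta <= -1: settles into a 2-cycle +-a
```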

Numerical Experiments
In this section, we present numerical results to verify the RNN analysis of the previous sections. As shown there, RNN predictions fall into three cases: in Case 1, Equation (17) has one solution; in Case 2, two solutions; and in Case 3, three solutions. In Cases 1 to 3, we checked the number of solutions of Equation (17) and the predicted values according to the initial conditions. In Cases 4 through 7, experiments were conducted on learning data that increase; that increase and then decrease; that decrease and then increase; and that oscillate. Each numerical experiment produced a figure in which (a) plots the RNN predictions and the learning data (the red curve is sin), (b) marks θ and b in the solution-number regions, and (c) plots z for Equation (17).

Case 1: One-Solution Case of Equation (17)

The situation with one solution was divided into the case where θ is less than 1 and the case where θ is greater than 1.
In Figure 2a, x_0 ∼ x_4 are the black stars and y_0 ∼ y_40 are the prediction values (blue line). Figure 2b shows θ and b (* = (θ, b)). Figure 2c shows the result of Equation (17); in it, * is z_0. From Figure 2, we see that, from the learning data, Equation (17) has one solution, initial value z_0 is 0.6, and z_40 is 0.5.

Case 2: Two-Solution Case of Equation (17)
In this situation, Equation (17) has two solutions, with (θ, b) = (1.3, 0.101). Let x_0 = 0, x_1 = 0.02, x_2 = 0.19, x_3 = 0.36, and x_4 = 0.5 be the learning data. Figure 4 shows the solution-number region and (θ, b) (black star). As shown in Figure 4, there are two solutions to Equation (17) from the learning data. In this situation, we conducted two experiments: in the first, the initial condition z_0 lies between z_l and z_r; in the second, z_0 is less than z_l. In the first case, by the proofs above, the limit of z_i must be z_r, and in the second case it must be z_l. This result was verified by the numerical experiments, confirming the theory of the previous section.
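This two-solution (tangency) experiment can be sketched as follows. Since b = 0.101 agrees to three decimals with the tangency value b_l for θ = 1.3, we recompute b_l exactly so that the double root is exact; interpreting 0.101 as the rounded tangency value is our assumption. Convergence to the double root is very slow (the derivative of the map there is 1), which is why many iterations are used on that side.

```python
import numpy as np

theta = 1.3
s = np.sqrt(1.0 - 1.0 / theta)   # the double (tangent) root sits at z = -s
b = -np.arctanh(s) + theta * s   # exact tangency value b_l, approx 0.10101

def iterate(z, n):
    for _ in range(n):
        z = np.tanh(theta * z + b)
    return float(z)

z_below = iterate(-0.9, 20000)   # z0 below the double root: creeps up to -s
z_above = iterate(0.0, 200)      # z0 above it: converges to the simple root
```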

First Case
In this case, we obtained w_1 = 0.9, w_2 = 0.4, w_3 = 0.94, b_y = -0.1, and b_h = 0.141. Therefore, θ = 1.3 and b = 0.101. The limit of y_t is y* ≈ 0.47. Figure 5a shows that x_0 ∼ x_4 are the black stars and y_0 ∼ y_40 are the prediction values (blue line). Figure 5b shows the result of Equation (17); in it, * is z_0, and z_40 is 0.71.

Case 3: Three-Solution Case of Equation (17)
In this situation, Equation (17) has three solutions. Figure 7 shows the solution-number region and (θ, b) (black star). As shown in Figure 7, there are three solutions from the learning data. In this situation, we conducted two experiments. For convenience, the three roots are denoted by z_l, z*, and z_r, respectively, as in the notation above. In the first experiment, the initial condition z_0 lies between z* and z_r; in the second, z_0 lies between z_l and z*. In the first case, by the proofs above, the limit of z_i must be z_r, and in the second case it must be z_l. This result was verified by the numerical experiments, confirming the theory of the previous section.
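A sketch of the three-solution case with illustrative values θ = 1.3, b = 0, which lies in the strip b_r < b < b_l: by symmetry the roots are z_l = -z_r and z* = 0, and the iteration should leave the unstable middle root toward whichever outer root is on the same side as z_0.

```python
import numpy as np

def iterate(theta, b, z, n=300):
    """Iterate z <- tanh(theta*z + b) n times."""
    for _ in range(n):
        z = np.tanh(theta * z + b)
    return float(z)

z_right = iterate(1.3, 0.0, 0.05)    # z0 > z* = 0: converges to z_r (~0.75)
z_left = iterate(1.3, 0.0, -0.05)    # z0 < z* = 0: converges to z_l = -z_r
```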
In Figure 8a, x 0 ∼ x 4 are the black stars and y 0 ∼ y 40 are the prediction values (blue line). Figure 8b shows the result of Equation (17). In Figure 8b, * is z 0 , and z 40 is 0.79.
In Figure 9a, x_0 ∼ x_4 are the black stars and y_0 ∼ y_40 are the prediction values (blue line). Figure 9b shows the result of Equation (17). In Figure 9b, * is z_0, and z_40 is -0.86.
In Figure 10a, x_0 ∼ x_4 are the black stars and y_0 ∼ y_40 are the prediction values (blue line). Figure 10b shows θ and b. Figure 10c shows the result of Equation (17). From θ and b, Equation (17) has one solution. As can be seen in Figure 10, the learning data increased, and the prediction converged to a specific value.
In Figure 11a, x_0 ∼ x_4 are the black stars and y_0 ∼ y_40 are the prediction values (blue line). Figure 11b shows θ and b. Figure 11c shows the result of Equation (17). From θ and b, Equation (17) has one solution. As can be seen in Figure 11, the training data increased and then decreased, and the prediction converged to a specific value; the average value of the learning data gave the predicted value.
In Figure 12a, x_0 ∼ x_4 are the black stars and y_0 ∼ y_40 are the prediction values (blue line). Figure 12b shows θ and b. Figure 12c shows the result of Equation (17). From θ and b, Equation (17) has one solution. As can be seen in Figure 12, the learning data decreased and then increased, and the prediction converged to a specific value; the average value of the learning data gave the predicted value.
In Figure 13a, x_0 ∼ x_4 are the green circles, y_0 ∼ y_4 are the black stars, and y_4 ∼ y_40 are the prediction values (blue line). In Figure 13a, the values of the learning data (x_t) and of the learning result (y_t) differ because the RNN structure was simple and sufficient learning was not achieved. In future work, we aim to study RNN structures that can learn such complex learning data well. Figure 13b shows θ and b. Figure 13c shows the result of Equation (17). For these θ and b, Equation (17) should have one solution. However, this contradicts the oscillating learning data, which alternate between 1 and -1 and therefore call for two limit values. As a result, the cost function only increased.

Conclusions
In this paper, we interpreted the structure underlying the RNN and, on this basis, found the principles by which the RNN predicts. A basic RNN works like a time series in a very narrow range of its variables. In the general range, a nonlinear activation function with specified maximum and minimum forces the function value into an iterative range. Because the function value is repeated within a certain range, the predicted value behaves like a fixed-point iteration. In other words, since we used the tanh activation function, the hidden value lies in the range -1 to 1, and the absolute value of the predicted value is bounded accordingly. As a result, as the prediction was iterated, it converged to a specific value. Through this paper, we found that the basic operating principle of an RNN combines the operating principle of the time series, which is linear analysis, with fixed-point iteration, which is nonlinear. In our numerical calculations, Equation (17) generally had a single solution. Therefore, the present structure could not handle numerical experiment Case 7 (oscillating learning data). To solve this problem, it is necessary to diversify the structure, increase the number of layers, and switch to a vector structure. Next, we aim to further study RNNs with vector structures.