Article

Analysis of Recurrent Neural Network and Predictions

Jieun Park, Dokkyun Yi and Sangmin Ji
1 Seongsan Liberal Arts College, Daegu University, Kyungsan 38453, Korea
2 Department of Mathematics, College of Natural Sciences, Chungnam National University, Daejeon 34134, Korea
* Author to whom correspondence should be addressed.
Symmetry 2020, 12(4), 615; https://doi.org/10.3390/sym12040615
Submission received: 14 March 2020 / Revised: 31 March 2020 / Accepted: 1 April 2020 / Published: 13 April 2020
(This article belongs to the Special Issue Discrete Mathematics and Symmetry)

Abstract: This paper analyzes the operating principle and the predicted values of the recurrent neural network (RNN), the most basic neural-network structure for handling data that change over time in various types of artificial intelligence (AI). In particular, an RNN in which all connections are symmetric is guaranteed to converge. The RNN operates by forming linear combinations of the data and composing them with a nonlinear activation function. The linear combination of data is similar to the autoregressive moving average (ARMA) method of statistical processing. However, the distortion caused by the nonlinear activation function in the RNN makes its predicted value differ from the ARMA prediction. From this analysis, we obtain the limit of the predicted value of an RNN and the range over which the prediction changes according to the learning data. In addition to mathematical proofs, numerical experiments confirm our claims.

1. Introduction

Artificial intelligence (AI) embodied in machines is entering our daily lives. In the near future, machines will take over jobs in a variety of fields, from driverless cars becoming commonplace to personal-routine assistants, automatic response system (ARS) counsellors, and bank clerks. In the age of machines, it is natural to let machines do such work [1,2,3,4,5], which makes it important to understand the operating principle of a machine and the direction of its predictions. In this paper, we analyze the principles of operation and prediction of the recurrent neural network (RNN) [6,7,8].
An RNN is an AI methodology that processes incoming data in time order; it learns temporal changes and predicts them. This predictive capability comes from the recurrent structure, which produces results similar to those of time-series methods in general statistical processing [9,10,11,12]. The predicted value of a time series is calculated from the general term of its recurrence relation. The RNN calculation is very similar to that of a time series, but the activation function in a neural-network (NN) structure is nonlinear, so nonlinear effects appear in the prediction. For this reason, it is very difficult to find the predicted value of an RNN in closed form. Nevertheless, owing to the advantages of the recurrent structure and the development of artificial-neural-network (ANN) training methods, the accuracy of predicted values keeps improving. This has led to further development of, and greater demand for, ANNs based on RNNs. For example, long short-term memory (LSTM), gated recurrent units (GRUs), and R-RNNs [13,14,15,16] all start from an RNN and are used in various fields. In other words, RNN-based neural networks are used to learn temporal changes and to make the corresponding predictions.
Few papers attempt to interpret the recurrent structure itself, and results are also lacking. The recurrent structure computes an expected value by being applied iteratively as data arrive in time order; that is, it predicts future values from past data. When a future value is unknown, it is natural to use the information one does know to predict it. Logical methods of this kind include the time-series methods of statistical processing as well as numerical methods, and the RNN structure closely resembles a combination of the two. The autoregressive moving average (ARMA) model in time-series analysis predicts future values through a recurrence relation built from a linear combination of historical data; more details can be found in [17,18]. Taylor expanding the RNN under certain constraints likewise yields a linear combination of historical data, as in the time series; more details are given in the text. From these results, this paper describes the range of the predicted value of an RNN.
This paper is organized as follows. Section 2 introduces and analyzes the RNN and relates it to existing methods. Section 3 explains the change of the predicted value through the RNN. Section 4 confirms our claims through numerical experiments.

2. RNN and ARMA Relationship

In this section, we explain how an RNN works by interpreting its structure. In particular, the RNN is based on the ARMA format in statistical processing; more details can be found in [19,20,21]. This is explained through the following process.

2.1. RNN

In this section, we describe the basic RNN among the various modified RNNs; for convenience, RNN refers to this basic form. The RNN that we deal with is
$$y_t = w_1 h_t + b_y,$$
where $t$ represents time, $y_t$ is the predicted value, $w_1$ is a real value, and $h_t$ is the hidden layer. The hidden layer is computed by
$$h_t = \tanh(w_2 x_t + w_3 h_{t-1} + b_h),$$
where $x_t$ is the input data, $w_2$ and $w_3$ are real values, and $h_{t-1}$ is the previous hidden layer. For machine learning, let $LS$ be the set of learning data and let $\kappa > 2$ be its size. In other words, when the first time of the learning data is 1, we can write $LS = \{x_1, x_2, \ldots, x_\kappa\}$. Assuming that the initial condition of the hidden layer is zero ($h_0 = 0$), we can compute $y_t$ for each time $t$. Since $x_t$ is the datum at time $t$ and $y_t$ is a predicted value, we want $y_t = x_{t+1}$ to hold. Because this equality does not hold in general, an error occurs between $y_t$ and $x_{t+1}$. So, let $E_t = (y_t - x_{t+1})^2$ and $E = \sum_{t=1}^{\kappa-1} E_t$. Machine learning based on the RNN is therefore the process of finding $w_1$, $w_2$, and $w_3$ that minimize the error value $E$. We used $x_1, x_2, \ldots, x_{\kappa-1}$ in the learning data $LS$ to find $w_1$, $w_2$, and $w_3$ that minimize the error $E$, and used them to predict the values ($y_\kappa, y_{\kappa+1}, \ldots$) after time $\kappa$. More details can be found in [22,23,24,25].
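For concreteness, the following NumPy sketch (our own illustration, not the authors' code; the parameter values are arbitrary) evaluates the scalar RNN above and the squared-error objective $E$:

```python
import numpy as np

def rnn_forward(x, w1, w2, w3, b_y, b_h, h0=0.0):
    """Scalar RNN: h_t = tanh(w2*x_t + w3*h_{t-1} + b_h), y_t = w1*h_t + b_y."""
    h, ys = h0, []
    for x_t in x:
        h = np.tanh(w2 * x_t + w3 * h + b_h)
        ys.append(w1 * h + b_y)
    return np.array(ys), h          # predictions and the last hidden state

def error(x, w1, w2, w3, b_y, b_h):
    """E = sum_t (y_t - x_{t+1})^2 over the learning data."""
    ys, _ = rnn_forward(x[:-1], w1, w2, w3, b_y, b_h)
    return float(np.sum((ys - x[1:]) ** 2))

# arbitrary (untrained) parameters on a toy learning set
x = np.array([0.0, 0.12, 0.23, 0.38, 0.5])
print(error(x, w1=0.5, w2=0.5, w3=0.1, b_y=0.0, b_h=0.0))
```

In practice, $w_1$, $w_2$, $w_3$, $b_y$, and $b_h$ would then be adjusted (for example by gradient descent on $E$, as in [23]) rather than fixed by hand.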

2.2. ARMA in Time Series

People have long wanted to predict stocks. This requires predictions from historical stock data, and various methods have been studied and utilized. The most widely and commonly used is the ARMA method, which was developed on the basis of statistics. This method simply forms a linear combination of historical data for the value to be predicted and computes the prediction on that basis:
$$\hat{x}_{\kappa+1} = C_0 x_\kappa + C_1 x_{\kappa-1} + C_2 x_{\kappa-2} + \cdots + C_\kappa x_0 + C,$$
where $x_0, \ldots, x_\kappa$ are the given data, and we can calculate the predicted value $\hat{x}_{\kappa+1}$ by determining the coefficients $C_0, \ldots, C_\kappa$ and $C$. There are various ways to obtain these coefficients, such as numerical optimization over the data values, Yule–Walker estimation, and correlation calculations. This equation is used to predict future values through the calculation of the general term of the recurrence relation. More details can be found in [17].
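As an illustration of one of the estimation routes mentioned above (a minimal least-squares sketch of our own, not the authors' implementation), the coefficients of an AR($p$) model can be estimated as follows:

```python
import numpy as np

def fit_ar(x, p):
    """Estimate AR(p) coefficients C_0..C_{p-1} and intercept C by least squares.
    Model: x_{t+1} ~ C_0*x_t + C_1*x_{t-1} + ... + C_{p-1}*x_{t-p+1} + C."""
    rows, targets = [], []
    for t in range(p - 1, len(x) - 1):
        rows.append(np.concatenate([x[t - p + 1:t + 1][::-1], [1.0]]))
        targets.append(x[t + 1])
    coeffs, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return coeffs[:-1], coeffs[-1]   # (C_0..C_{p-1}), intercept C

def predict_next(x, C, c0):
    """One-step prediction of x_{kappa+1} from the last p observations."""
    p = len(C)
    return float(np.dot(C, x[-1:-p - 1:-1]) + c0)

# toy usage on a short series
x = np.array([0.0, 0.1, 0.22, 0.35, 0.46, 0.55, 0.63])
C, c0 = fit_ar(x, p=2)
print(predict_next(x, C, c0))
```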

2.3. RNN and ARMA

In an RNN, the hidden layer is constructed with the hyperbolic tangent function,
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}.$$
The function tanh has the expansion
$$\tanh(x) = x - \frac{1}{3}x^3 + \frac{2}{15}x^5 - \frac{17}{315}x^7 + \cdots,$$
where $x$ is in $(-\pi/2, \pi/2)$. Using this fact and expanding $h_t$,
$$h_t = \tanh(w_2 x_t + w_3 h_{t-1}) = w_2 x_t + w_3 h_{t-1} + e_t,$$
where $e_t$ is an error term. Therefore, $y_t = w_1 w_2 x_t + w_1 w_3 h_{t-1} + w_1 e_t$.
Since the same expansion applies to $h_{t-1}$,
$$\begin{aligned}
y_t &= w_1 w_2 x_t + w_1 w_3 h_{t-1} + w_1 e_t \\
    &= w_1 w_2 x_t + w_1 w_3 \left( w_2 x_{t-1} + w_3 h_{t-2} + e_{t-1} \right) + w_1 e_t \\
    &= w_1 w_2 x_t + w_1 w_2 w_3 x_{t-1} + w_1 w_3^2 h_{t-2} + w_1 e_t + w_1 w_3 e_{t-1}.
\end{aligned}$$
Repeating this process,
$$y_t = w_1 w_2 x_t + w_1 w_2 w_3 x_{t-1} + w_1 w_2 w_3^2 x_{t-2} + w_1 w_3^3 h_{t-3} + w_1 e_t + w_1 w_3 e_{t-1} + w_1 w_3^2 e_{t-2}.$$
Therefore,
$$y_t = \sum_{k=0}^{t-1} \left( w_1 w_2 w_3^k x_{t-k} + w_1 w_3^k e_{t-k} \right) + w_1 w_3^t h_0.$$
If $w_3$ is less than 0.1, the terms beyond the fourth order ($w_3^4$) are too small to affect the predicted value. Conversely, if $w_3$ is greater than 1, the predicted value grows exponentially. Under the assumption that the hyperbolic tangent (tanh) can be expanded as above, $w_3$ must be less than 1. Since only $w_1$, $w_2$, and $w_3$ can be adjusted, the RNN can be written as
$$\begin{aligned}
y_t \approx{} & w_1 w_2 x_t + w_1 w_2 w_3 x_{t-1} + w_1 w_2 w_3^2 x_{t-2} + w_1 w_2 w_3^3 x_{t-3} + w_1 w_2 w_3^4 x_{t-4} \\
              & + w_1 e_t + w_1 w_3 e_{t-1} + w_1 w_3^2 e_{t-2} + w_1 w_3^3 e_{t-3} + w_1 w_3^4 e_{t-4}.
\end{aligned}$$
This is an ARMA model of order 5; more details can be found in [18]. This derivation relies on the premise that the argument of the tanh function is smaller than a specific value ($|x| < \pi/2$ in $\tanh(x)$), and is therefore limited in terms of applicability.
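To illustrate the approximation (our own numerical check under the stated small-argument, small-$w_3$ assumptions; the parameter values and input series are arbitrary), one can compare the exact tanh recursion with the truncated linear expansion:

```python
import numpy as np

def rnn_exact(x, w1, w2, w3):
    """y_t from h_t = tanh(w2*x_t + w3*h_{t-1}), y_t = w1*h_t (biases omitted as in the text)."""
    h, ys = 0.0, []
    for x_t in x:
        h = np.tanh(w2 * x_t + w3 * h)
        ys.append(w1 * h)
    return np.array(ys)

def rnn_linearized(x, w1, w2, w3, order=5):
    """Truncated expansion y_t ~ sum_{k<order} w1*w2*w3^k * x_{t-k} (error terms dropped)."""
    ys = []
    for t in range(len(x)):
        ys.append(sum(w1 * w2 * w3**k * x[t - k] for k in range(min(order, t + 1))))
    return np.array(ys)

x = 0.1 * np.sin(0.3 * np.arange(20))     # small inputs keep tanh close to its linear regime
diff = rnn_exact(x, 0.9, 0.9, 0.09) - rnn_linearized(x, 0.9, 0.9, 0.09)
print(np.max(np.abs(diff)))               # small: the ARMA(5)-like form is a good approximation
```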

3. Analysis of Predicted Values

From the above section, $w_1$, $w_2$, $w_3$, $b_y$, and $b_h$ are fixed. Feeding each prediction back in place of the unknown input, we obtain the sequence $\{y_\kappa\}$ from the following equalities:
$$y_{\kappa+1} = w_1 h_\kappa + b_y = w_1 \tanh\left( w_2 y_\kappa + w_3 h_{\kappa-1} + b_h \right) + b_y, \qquad h_\kappa = \tanh\left( \theta h_{\kappa-1} + b \right), \tag{14}$$
where $\theta = w_1 w_2 + w_3$ and $b = b_h + w_2 b_y$.
Theorem 1.
The sequence $\{h_\kappa\}$ is bounded and has a convergent subsequence.
Proof. 
Since $|\tanh| \le 1$, we have $|h_\kappa| \le 1$ for all $\kappa$. A bounded sequence has a convergent subsequence (by the Bolzano–Weierstrass theorem; see also [26]). □
To see how the value of $h_\kappa$ changes, note that if $h_\kappa$ converges to a limit $h$, Equation (14) becomes $h = \tanh(\theta h + b)$. Therefore, as the values of $\theta$ and $b$ change, the value of $h$ satisfying this equation changes.

3.1. Limit Points of Prediction Values

We now analyze the limiting values of the sequence. To study its convergence, we introduce the following functions:
$$y = x,$$
$$y = \tanh(\theta x + b).$$
For convenience of calculation, the fixed-point condition is written as
$$z = \tanh(\theta z + b), \tag{17}$$
where $z_0$ is an initial condition; if the corresponding iteration converges, its limit $z_\ast$ satisfies Equation (17) ($z_\ast = \tanh(\theta z_\ast + b)$). Therefore, we have to look at the roots of Equation (17).
Theorem 2.
Equation (17) has at least one solution.
Proof. 
Let $g(z) = \tanh(\theta z + b) - z$. The function $g$ is continuous and differentiable. Since $|\tanh| \le 1$, if $z < -2$, then $g(z) > 0$; if $z > 2$, then $g(z) < 0$. Therefore, by the intermediate value theorem, there exists at least one solution. □
Theorem 3.
If $\theta \le 1$, then Equation (17) has exactly one solution.
Proof. 
If $\theta \le 1$, then $g'(z) = \theta\,\mathrm{sech}^2(\theta z + b) - 1 \le 0$. Therefore, $g$ is a monotonically decreasing function. As a result, there exists only one solution satisfying $g = 0$. □
Under the assumption that $\theta > 1$, two values satisfying $g'(z) = 0$ necessarily exist. Therefore, assuming $\theta > 1$, we find $z_l$ and $z_r$ satisfying $\theta\,\mathrm{sech}^2(\theta z_l + b) - 1 = \theta\,\mathrm{sech}^2(\theta z_r + b) - 1 = 0$, and we have $g(z_l) < g(z_r)$, assuming $z_l < z_r$. From computing $g'$, we obtain $g'(z) < 0$ on $z < z_l$, $g'(z) > 0$ on $z_l < z < z_r$, and $g'(z) < 0$ on $z_r < z$. Setting $g(z_l) = 0$ and $g(z_r) = 0$, we obtain $b = b_l = \theta \tanh\!\big( (\mathrm{sech}^2)^{-1}(1/\theta) \big) - (\mathrm{sech}^2)^{-1}(1/\theta)$ and $b = b_r = (\mathrm{sech}^2)^{-1}(1/\theta) - \theta \tanh\!\big( (\mathrm{sech}^2)^{-1}(1/\theta) \big)$, respectively. From computing $\mathrm{sech}^2$, $b_r < b_l$ is obtained.
Theorem 4.
Assume $\theta > 1$. If $b = b_l$ or $b = b_r$, then $g$ has two solutions. If $b_r < b < b_l$, then $g$ has three solutions. If $b_l < b$ or $b < b_r$, then $g$ has one solution.
Proof. 
This proof assumes that $\theta > 1$. If $b < b_r$, then $g(z_r) < 0$, so $g(z_l) < g(z_r) < 0$. Since $g$ is monotonically decreasing on $z < z_l$, there exists a unique solution of $g(z) = 0$, and it lies in $z < z_l$. If $b = b_r$, then $g(z_r) = 0$ and $g(z_l) < g(z_r) = 0$; for the same reason there exists a unique solution of $g(z) = 0$ on $z < z_l$, so, if $b = b_r$, we have two solutions: one with $g(z) = 0$ on $z < z_l$ and the other $g(z_r) = 0$. If $b_r < b < b_l$, we have $g(z_l) < 0$ and $g(z_r) > 0$; there are three solutions, one with $g(z) = 0$ on $z < z_l$, one on $z_l < z < z_r$, and one on $z_r < z$. If $b = b_l$, then $g(z_l) = 0$; since $g(z_r) > 0$ and $g$ is monotonically decreasing on $z_r < z$, there is a solution satisfying $g(z) = 0$ on $z_r < z$, so, if $b = b_l$, we have two solutions: $g(z_l) = 0$ and $g(z) = 0$ on $z_r < z$. If $b_l < b$, then $g(z_l) > 0$; since $g(z_r) > g(z_l) > 0$ and $g$ is decreasing on $z_r < z$, there is a unique solution of $g(z) = 0$, and it lies on $z > z_r$. □
We have now seen how the number of solutions of Equation (17) changes as the values of $\theta$ and $b$ change. Next, we explain how the sequence behaves according to its initial condition and according to the number of solutions of Equation (17).
Figure 1 shows the boundary curves $b = b_l(\theta)$ and $b = b_r(\theta)$. If the point $(\theta, b)$ is contained in the white region, there is one solution; if it lies on the red curve, there are two solutions; if it is contained in the blue region, there are three solutions. In Section 4, we plot the point $(\theta, b)$ in this solution-number region to check the number of solutions in each case. A sketch of how the region can be computed is given after this paragraph.
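The following is a minimal sketch (ours, not the authors' plotting code) of how the boundary $b_l(\theta)$ and the solution count in Figure 1 can be computed:

```python
import numpy as np

def b_l(theta):
    """b_l(theta) = theta*tanh(v) - v with sech^2(v) = 1/theta, valid for theta > 1."""
    v = np.arccosh(np.sqrt(theta))   # sech^2(v) = 1/theta  <=>  cosh(v) = sqrt(theta)
    return theta * np.tanh(v) - v

def count_solutions(theta, b, grid=np.linspace(-5.0, 5.0, 200001)):
    """Count sign changes of g(z) = tanh(theta*z + b) - z on a fine grid."""
    g = np.tanh(theta * grid + b) - grid
    return int(np.sum(np.sign(g[:-1]) != np.sign(g[1:])))

theta = 2.0
print(b_l(theta), -b_l(theta))       # b_l and b_r = -b_l for this theta
print(count_solutions(theta, 0.1))   # 3: (2, 0.1) lies inside the three-solution region
print(count_solutions(theta, 1.0))   # 1: b > b_l, one solution
```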

3.2. Change of Prediction Values (Sequence)

We have examined the number of solutions of $g$ depending on the values of $\theta$ and $b$. To see how the predicted value changes with $\theta$ and $b$, Equation (14) is rewritten as $z_{i+1} = \tanh(\theta z_i + b)$, which generates the sequence $\{z_i\}$. The sequences $\{z_i\}$, $g$, and $h_\kappa$ are related by $z_{i+1} = z_i + g(z_i)$ and $z_0 = h_\kappa$. Therefore, the predicted value $y_{\kappa+m+1}$ is obtained from $y_{\kappa+m+1} = w_1 h_{\kappa+m} + b_y$ with $h_{\kappa+m} = z_m$. By $z_{i+1} = z_i + g(z_i)$, the solutions of $g$ are the limit points of the sequence $\{z_i\}$. One of the reasons for interpreting the predictions is to identify where they move and accumulate, that is, their limiting values. The previous theorems identified the various cases in which the function $g$ is zero; we now explain how the sequence behaves according to the initial condition $z_0$ in each case (a small numerical illustration follows below).
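The iteration itself is easy to reproduce; the following sketch (our illustration; the values of $\theta$, $b$, and $z_0$ are arbitrary) runs $z_{i+1} = \tanh(\theta z_i + b)$ and checks that the limit is a root of $g$:

```python
import numpy as np

def iterate(theta, b, z0, n=200):
    """Fixed-point iteration z_{i+1} = tanh(theta*z_i + b) starting from z0."""
    z = z0
    for _ in range(n):
        z = np.tanh(theta * z + b)
    return z

# Three-solution case (theta, b) = (2, 0.1): the limit depends on the initial condition.
for z0 in (-1.0, 0.5):
    z_inf = iterate(2.0, 0.1, z0)
    residual = np.tanh(2.0 * z_inf + 0.1) - z_inf    # g(z_inf), should be ~0
    print(z0, z_inf, residual)
```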
Theorem 5.
Assume $\theta > 1$ and $b_l < b$. Then the sequence $\{z_i\}$ converges to $z_\ast$, where $z_\ast$ satisfies $g(z_\ast) = 0$.
Proof. 
Under the conditions $\theta > 1$ and $b_l < b$, we have $g(z) > 0$ on $z < z_\ast$ and $g(z) < 0$ on $z_\ast < z$. If $z_0 < z_\ast$, then $g(z_0) > 0$ and, from the computation, $\{z_i\}$ is a monotonically increasing sequence bounded above by $z_\ast$, so it converges to $z_\ast$. If $z_\ast < z_0$, then $g(z_0) < 0$ and $\{z_i\}$ is a monotonically decreasing sequence, so it converges to $z_\ast$. □
Theorem 6.
Assume $\theta > 1$ and $b = b_l$. Then there exist two solutions $z_l$ and $z_\ast$ (with $z_l < z_\ast$) satisfying $g(z) = 0$. If $z_0 < z_l$, the sequence $\{z_i\}$ converges to $z_l$; if $z_l < z_0$, the sequence $\{z_i\}$ converges to $z_\ast$.
Proof. 
We have $0 \le g(z)$ on $z < z_\ast$, so $\{z_i\}$ is a monotonically increasing sequence there. If $z_0 < z_l$, $\{z_i\}$ is bounded above by $z_l$ and converges to $z_l$; if $z_l < z_0 < z_\ast$, $\{z_i\}$ converges to $z_\ast$. On $z_\ast < z_0$, we have $g(z_0) < 0$, so $\{z_i\}$ is a monotonically decreasing sequence and converges to $z_\ast$. □
Theorem 7.
Assume $\theta > 1$ and $b_r < b < b_l$, and denote the three solutions of $g(z) = 0$ by $z_l < z_\ast < z_r$. If $z_0 < z_\ast$, $\{z_i\}$ converges to $z_l$; if $z_0 > z_\ast$, $\{z_i\}$ converges to $z_r$, where $z_0$ is the initial condition.
Proof. 
From computing $g(z)$, we have $g(z) > 0$ on $z < z_l$, and $\tanh(\theta z_i + b) > z_i$ whenever $z_i < z_l$. Therefore the sequence $\{z_i\}$ is monotonically increasing and converges to $z_l$. From $g''(z) > 0$ ($g$ is convex there) and $g(z_l) = g(z_\ast) = 0$, we have $g(z) < 0$ on $z_l < z < z_\ast$. For $z_l < z_0 < z_\ast$ we have $g(z_i) = \tanh(\theta z_i + b) - z_i < 0$, so the sequence $\{z_i\}$ is monotonically decreasing and its limit is $z_l$. By the same calculation, $g$ is concave on $z_\ast < z < z_r$ and $g(z_\ast) = g(z_r) = 0$, so $g(z) > 0$ on $z_\ast < z < z_r$ and $g(z_i) = \tanh(\theta z_i + b) - z_i > 0$ for $z_\ast < z_0 < z_r$; the sequence $\{z_i\}$ is monotonically increasing and its limit is $z_r$. If $z > z_r$, then $g(z) < 0$; therefore $g(z_i) = \tanh(\theta z_i + b) - z_i < 0$ for $z_0 > z_r$, so the sequence $\{z_i\}$ is monotonically decreasing and its limit is $z_r$. □
Theorem 8.
Assume $\theta > 1$ and $b = b_r$. Then there exist two solutions $z_\ast$ and $z_r$ (with $z_\ast < z_r$) satisfying $g(z) = 0$. If $z_r < z_0$, the sequence $\{z_i\}$ converges to $z_r$; if $z_\ast < z_0 < z_r$, the sequence $\{z_i\}$ converges to $z_\ast$; if $z_0 < z_\ast$, the sequence $\{z_i\}$ converges to $z_\ast$.
Proof. 
If $z_r < z_0$, then $g(z_0) < 0$, so the sequence $\{z_i\}$ is monotonically decreasing and converges to $z_r$. If $z_\ast < z_0 < z_r$, then $g(z_0) < 0$, so the sequence is monotonically decreasing and converges to $z_\ast$. If $z_0 < z_\ast$, then $g(z_0) > 0$, so the sequence is monotonically increasing and converges to $z_\ast$. □
Theorem 9.
Assume $\theta > 1$ and $b < b_r$. Then the sequence $\{z_i\}$ converges to $z_\ast$, where $z_\ast$ satisfies $g(z_\ast) = 0$.
Proof. 
Under the conditions $\theta > 1$ and $b < b_r$, we have $g(z_r) < 0$. Therefore, if $z_\ast < z_0$, then $g(z_0) < 0$, so the sequence $\{z_i\}$ is monotonically decreasing and converges to $z_\ast$. If $z_0 < z_\ast$, then $g(z_0) > 0$, so the sequence is monotonically increasing and converges to $z_\ast$. □
Theorem 10.
Assume $0 \le \theta \le 1$. Then the sequence $\{z_i\}$ converges to $z_\ast$, where $z_\ast$ satisfies $g(z_\ast) = 0$.
Proof. 
Under the condition $0 \le \theta \le 1$, $g$ has a unique solution satisfying $g(z) = 0$. If $z_0 < z_\ast$, then $g(z_0) > 0$, so the sequence $\{z_i\}$ is monotonically increasing and converges to $z_\ast$. If $z_\ast < z_0$, then $g(z_0) < 0$, so the sequence is monotonically decreasing and converges to $z_\ast$. □
Under the condition $\theta > 0$, the function $\tanh(\theta z + b)$ is increasing in $z$, so the iterates approach the fixed point from one side and the sign of $z_i - z_\ast$ does not change. Under the condition $\theta < 0$, the function $\tanh(\theta z + b)$ is decreasing in $z$, so the sign of $z_i - z_\ast$ changes from step to step.
Theorem 11.
Assume $-1 < \theta < 0$. Then the sequence $\{z_i\}$ converges to $z_\ast$, where $z_\ast$ satisfies $g(z_\ast) = 0$.
Proof. 
By the mean value theorem,
$$|z_{i+1} - z_i| = \left| \tanh(\theta z_i + b) - \tanh(\theta z_{i-1} + b) \right| = |\theta|\,\mathrm{sech}^2(\theta \zeta + b)\, |z_i - z_{i-1}|,$$
where $\zeta$ is between $z_{i-1}$ and $z_i$. Therefore,
$$|z_{i+1} - z_i| \le |\theta|\, |z_i - z_{i-1}|.$$
Since $|\theta| < 1$, the sequence $\{z_i\}$ is a Cauchy sequence, and it converges to $z_\ast$. □
Theorem 12.
Assume $\theta \le -1$. Then the sequence $\{z_i\}$ either converges to $z_\ast$, where $z_\ast$ satisfies $g(z_\ast) = 0$, or it vibrates (oscillates without converging).
Proof. 
As in the previous proof,
$$|z_{i+1} - z_i| = \left| \tanh(\theta z_i + b) - \tanh(\theta z_{i-1} + b) \right| = |\theta|\,\mathrm{sech}^2(\theta \zeta + b)\, |z_i - z_{i-1}|,$$
where $\zeta$ is between $z_{i-1}$ and $z_i$. Therefore,
$$|z_{i+1} - z_i| \le |\theta|\,\mathrm{sech}^2(\theta \zeta + b)\, |z_i - z_{i-1}|.$$
If $|\theta|\,\mathrm{sech}^2(\theta \zeta + b) < 1$ along the iteration, the sequence $\{z_i\}$ is a Cauchy sequence that converges to $z_\ast$. If $|\theta|\,\mathrm{sech}^2(\theta \zeta + b) \ge 1$, the sequence $\{z_i\}$ vibrates. □
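A quick numerical illustration of Theorems 11 and 12 (our own example; the parameter values are arbitrary): for $|\theta| < 1$ the iteration settles to a fixed point, while for a strongly negative $\theta$ it settles into a two-cycle, i.e., it vibrates.

```python
import numpy as np

def tail(theta, b, z0, n=1000, keep=4):
    """Return the last few iterates of z_{i+1} = tanh(theta*z_i + b)."""
    z, history = z0, []
    for _ in range(n):
        z = np.tanh(theta * z + b)
        history.append(z)
    return [round(v, 4) for v in history[-keep:]]

print(tail(-0.5, 0.2, 0.3))   # |theta| < 1: tail is (nearly) constant -> convergence
print(tail(-3.0, 0.0, 0.3))   # theta <= -1: tail alternates between two values -> "vibration"
```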

4. Numerical Experiments

In this section, we present numerical results that confirm the RNN analysis of the previous section. As we saw there, RNN predictions fall into three cases: in Case 1, Equation (17) has one solution; in Case 2, Equation (17) has two solutions; and in Case 3, Equation (17) has three solutions. In Cases 1 to 3, we check the number of solutions of Equation (17) and the predicted values according to the initial conditions. In Cases 4 through 7, experiments were conducted for learning data that increase, increase and then decrease, decrease and then increase, and vibrate, respectively. Each numerical experiment produced a figure. In each figure, (a) plots the RNN predictions and the learning data (the red curve is $\sin$), (b) marks $\theta$ and $b$ in the solution-number region, and (c) shows the iteration of $z$ for Equation (17). A small harness for reproducing this kind of check is sketched below.
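The following harness is our own sketch (the parameter values shown are placeholders, not the trained values reported below, which would come from minimizing $E$); it maps trained parameters to $(\theta, b)$, counts the solutions of Equation (17), and rolls the prediction forward:

```python
import numpy as np

def analyze(w1, w2, w3, b_y, b_h, h_last, n_steps=40):
    """Compute (theta, b), count roots of g(z) = tanh(theta*z + b) - z, and roll predictions forward."""
    theta, b = w1 * w2 + w3, b_h + w2 * b_y
    grid = np.linspace(-5.0, 5.0, 200001)
    g = np.tanh(theta * grid + b) - grid
    n_roots = int(np.sum(np.sign(g[:-1]) != np.sign(g[1:])))
    z, y = h_last, None
    for _ in range(n_steps):
        z = np.tanh(theta * z + b)       # h_{kappa+m} = z_m
        y = w1 * z + b_y                 # y_{kappa+m+1} = w1 * h_{kappa+m} + b_y
    return theta, b, n_roots, y          # the last y approximates the prediction limit

# placeholder parameters and final hidden state
print(analyze(w1=0.9, w2=0.5, w3=0.4, b_y=0.1, b_h=0.05, h_last=0.2))
```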

4.1. Case 1: One-Solution Case of Equation (17)

The situation with one solution was divided into the case where θ is less than 1 and θ is greater than 1.

4.1.1. Theta < 1

Let $x_0 = 0$, $x_1 = 0.12$, $x_2 = 0.23$, $x_3 = 0.38$, and $x_4 = 0.5$; $x_0$–$x_4$ are the learning data. In this case, we obtained $w_1 = 0.9$, $w_2 = 0.9$, $w_3 = 0.09$, $b_y = 0.2$, and $b_h = 0.08$. Therefore, $\theta = 0.9$ and $b = 0.1$. The limit of $y_t$ is $y_\infty \approx 0.65$.
In Figure 2a, $x_0$–$x_4$ are the black stars and $y_0$–$y_{40}$ are the prediction values (blue line). Figure 2b shows $\theta$ and $b$ (the point $(\theta, b)$). Figure 2c shows the result for Equation (17); the asterisk marks $z_0$. From Figure 2, we see that, for this learning data, Equation (17) has one solution, the initial value $z_0$ is 0.6, and $z_{40}$ is 0.5.

4.1.2. Theta > 1

Let $x_0 = 0$, $x_1 = 0.03$, $x_2 = 0.15$, $x_3 = 0.33$, and $x_4 = 0.4$; $x_0$–$x_4$ are the learning data. In this case, we obtained $w_1 = 0.9$, $w_2 = 0.1$, $w_3 = 1.39$, $b_y = 0.2$, and $b_h = 0.18$. Therefore, $\theta = 1.3$ and $b = 0.2$. The limit of $y_t$ is $y_\infty \approx 0.64$.
Figure 3 shows results similar to those in Figure 2. Figure 3a shows $x_0$–$x_4$ and $y_4$–$y_{40}$ ($y_4$–$y_{40}$ are the prediction values). Figure 3b shows $\theta$ and $b$. Figure 3c shows the result for Equation (17).

4.2. Case 2: Two-Solution Case of Equation (17)

In this situation, Equation (17) has two solutions, with $(\theta, b) = (1.3, 0.101)$. Let $x_0 = 0$, $x_1 = 0.02$, $x_2 = 0.19$, $x_3 = 0.36$, and $x_4 = 0.5$; $x_0$–$x_4$ are the learning data. Figure 4 shows the solution-number region and $(\theta, b)$ (black star). As shown in Figure 4, Equation (17) has two solutions for this learning data. In this situation, we conducted two experiments. In the first case, the initial condition $z_0$ lies between $z_l$ and $z_r$; in the second case, the initial condition $z_0$ is less than $z_l$. In the first case, the limit value of $z_i$ must, by the proof, go to $z_r$, and in the second case, the limit value of $z_i$ must go to $z_l$. These results were verified by the numerical experiments, which confirm the theory of the previous section.

4.2.1. First Case

In this case, we obtained $w_1 = 0.9$, $w_2 = 0.4$, $w_3 = 0.94$, $b_y = 0.1$, and $b_h = 0.141$. Therefore, $\theta = 1.3$ and $b = 0.101$. The limit of $y_t$ is $y_\infty \approx 0.47$.
In Figure 5a, $x_0$–$x_4$ are the black stars and $y_0$–$y_{40}$ are the prediction values (blue line). Figure 5b shows the result for Equation (17); the asterisk marks $z_0$, and $z_{40}$ is 0.71.

4.2.2. Second Case

In this case, we obtained $w_1 = 0.6$, $w_2 = 6.5$, $w_3 = 2.6$, $b_y = 0.7$, and $b_h = 4.65$. Therefore, $\theta = 1.3$ and $b = 0.101$. The limit of $y_t$ is $y_\infty \approx 0.2$.
In Figure 6a, $x_0$–$x_4$ are the black stars and $y_0$–$y_{40}$ are the prediction values (blue line). Figure 6b shows the result for Equation (17); the asterisk marks $z_0$, and $z_{40}$ is −0.34.

4.3. Case 3: Three-Solution case of Equation (17)

In this situation, Equation (17) has three solutions, with $(\theta, b) = (2, 0.1)$. Let $x_0 = 0$, $x_1 = 0.01$, $x_2 = 0.16$, $x_3 = 0.37$, and $x_4 = 0.46$; $x_0$–$x_4$ are the learning data. Figure 7 shows the solution-number region and $(\theta, b)$ (black star). As shown in Figure 7, Equation (17) has three solutions for this learning data. In this situation, we conducted two experiments. For convenience, the three roots are denoted by $z_l$, $z_\ast$, and $z_r$, as in the notation above. In the first case, the initial condition $z_0$ lies between $z_l$ and $z_r$ (above $z_\ast$); in the second case, the initial condition $z_0$ lies between $z_l$ and $z_\ast$. In the first case, the limit value of $z_i$ must, by the proof, go to $z_r$, and in the second case, it must go to $z_l$. These results were verified by the numerical experiments, confirming the theory of the previous section.

4.3.1. First Case

In this case, we obtained $w_1 = 0.6$, $w_2 = 0.5$, $w_3 = 1.7$, $b_y = 0.1$, and $b_h = 0.15$. Therefore, $\theta = 2$ and $b = 0.1$. The limit of $y_t$ is $y_\infty \approx 0.58$.
In Figure 8a, $x_0$–$x_4$ are the black stars and $y_0$–$y_{40}$ are the prediction values (blue line). Figure 8b shows the result for Equation (17); the asterisk marks $z_0$, and $z_{40}$ is 0.79.

4.3.2. Second Case

In this case, we obtained $w_1 = 1.2$, $w_2 = 3$, $w_3 = 1.6$, $b_y = 0.1$, and $b_h = 0.2$. Therefore, $\theta = 2$ and $b = 0.1$. The limit of $y_t$ is $y_\infty \approx 1.03$.
In Figure 9a, $x_0$–$x_4$ are the black stars and $y_0$–$y_{40}$ are the prediction values (blue line). Figure 9b shows the result for Equation (17); the asterisk marks $z_0$, and $z_{40}$ is −0.86.

4.4. Case 4: Learning Data Increase

Let $x_0 = 0$, $x_1 = 0.15$, $x_2 = 0.3$, $x_3 = 0.45$, and $x_4 = 0.58$; $x_0$–$x_4$ are the learning data. In this case, we obtained $w_1 = 0.96$, $w_2 = 0.95$, $w_3 = 0.13$, $b_y = 0.24$, and $b_h = 0.08$. Therefore, $\theta = 1.04$ and $b = 0.15$. The limit of $y_t$ is $y_\infty \approx 0.93$.
In Figure 10a, $x_0$–$x_4$ are the black stars and $y_0$–$y_{40}$ are the prediction values (blue line). Figure 10b shows $\theta$ and $b$. Figure 10c shows the result for Equation (17). From $\theta$ and $b$, Equation (17) has one solution. As can be seen in Figure 10, the learning data increase and the prediction converges to a specific value.

4.5. Case 5: Learning Data Increase and Decrease

Let $x_0 = 0.95$, $x_1 = 0.98$, $x_2 = 1$, $x_3 = 0.98$, and $x_4 = 0.95$; $x_0$–$x_4$ are the learning data. In this case, we obtained $w_1 = 0.49$, $w_2 = 0.58$, $w_3 = 0.07$, $b_y = 0.67$, and $b_h = 0.2$. Therefore, $\theta = 0.21$ and $b = 0.6$. The limit of $y_t$ is $y_\infty \approx 0.97$.
In Figure 11a, $x_0$–$x_4$ are the black stars and $y_0$–$y_{40}$ are the prediction values (blue line). Figure 11b shows $\theta$ and $b$. Figure 11c shows the result for Equation (17). From $\theta$ and $b$, Equation (17) has one solution. As can be seen in Figure 11, the training data increase and then decrease, and the prediction converges to a specific value close to the average of the learning data.

4.6. Case 6: Learning Data Decrease and Increase

Let $x_0 = 0.95$, $x_1 = 0.98$, $x_2 = 1$, $x_3 = 0.98$, and $x_4 = 0.95$; $x_0$–$x_4$ are the learning data. In this case, we obtained $w_1 = 0.32$, $w_2 = 0.55$, $w_3 = 0.14$, $b_y = 0.58$, and $b_h = 0.28$. Therefore, $\theta = 0.06$ and $b = 0.47$. The limit of $y_t$ is $y_\infty \approx -0.97$.
In Figure 12a, $x_0$–$x_4$ are the black stars and $y_0$–$y_{40}$ are the prediction values (blue line). Figure 12b shows $\theta$ and $b$. Figure 12c shows the result for Equation (17). From $\theta$ and $b$, Equation (17) has one solution. As can be seen in Figure 12, the learning data decrease and then increase, and the prediction converges to a specific value close to the average of the learning data.

4.7. Case 7: Learning Data Vibrate

Let $x_0 = 1$, $x_1 = -1$, $x_2 = 1$, $x_3 = -1$, and $x_4 = 1$; $x_0$–$x_4$ are the learning data (the values alternate between 1 and −1). In this case, we obtained $w_1 = 0.5$, $w_2 = 11.74$, $w_3 = 5.15$, $b_y = 0$, and $b_h = 2.48$. Therefore, $\theta = 0.71$ and $b = 2.48$. The limit of $y_t$ is $y_\infty \approx 0.5$.
In Figure 13a, $x_0$–$x_4$ are the green circles, $y_0$–$y_4$ are the black stars, and $y_4$–$y_{40}$ are the prediction values (blue line). In Figure 13a, the values of the learning data ($x_t$) and of the learning results ($y_t$) differ because the RNN structure is simple and sufficient learning was not achieved; in future work, we aim to study RNN structures that can learn such complex data well. Figure 13b shows $\theta$ and $b$. Figure 13c shows the result for Equation (17). From $\theta$ and $b$, Equation (17) has one solution, and the prediction converges to a specific value. For these values of $\theta$ and $b$, the solution of Equation (17) must be unique; however, this contradicts the learning data, which require the two values 1 and −1. As a result, the cost function only increased.

5. Conclusions

In this paper, we interpreted the structure underlying the RNN and, on this basis, identified the principles by which the RNN predicts. A basic RNN works like a time series over a very narrow range of its variables. Over a general range, the nonlinear activation function, whose maximum and minimum are bounded, forces the function value into an iterative range. Because the function value is repeated within a certain range, the predicted value behaves like a fixed-point iteration. In other words, since we used the tanh activation function, the value lies in the range −1 to 1, and the absolute value of the predicted value in this range is less than 1. As a result, as the prediction is iterated, it converges to a specific value. Through this paper, we found that the basic operating principle of an RNN combines the operating principle of the time series, which is linear analysis, with fixed-point iteration, which is nonlinear. In general, Equation (17) had one solution in our numerical calculations. Therefore, the present structure could not handle the case of numerical experiment Case 7 (vibrating learning data). To solve this problem, it is necessary to diversify the structure, increase the number of layers, and switch to a vector structure. In future work, we aim to further study RNNs with vector structures.

Author Contributions

Conceptualization, J.P. and D.Y.; Data curation, J.P.; Formal analysis, D.Y.; Funding acquisition, D.Y.; Investigation, J.P.; Methodology, D.Y. and S.J.; Project administration, J.P. and D.Y.; Resources, J.P.; Software, S.J.; Supervision, S.J.; Validation, S.J.; Visualization, S.J.; Writing—original draft, D.Y.; Writing—review & editing, J.P. and S.J. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education, Science, and Technology (grant number NRF-2017R1E1A1A03070311).

Acknowledgments

We sincerely thank the anonymous reviewers whose suggestions helped to greatly improve and clarify this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536.
2. Werbos, P.J. Generalization of backpropagation with application to a recurrent gas market model. Neural Netw. 1988, 1, 339–356.
3. Schmidhuber, J. A Local Learning Algorithm for Dynamic Feedforward and Recurrent Networks. Connect. Sci. 1989, 1, 403–412.
4. Cho, K.; Merrienboer, B.V.; Bahdanau, D.; Bengio, Y. On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. arXiv 2014, arXiv:1409.1259.
5. Jin, Z.; Zhou, G.; Gao, D.; Zhang, Y. EEG classification using sparse Bayesian extreme learning machine for brain–computer interface. Neural Comput. Appl. 2018, 1–9.
6. Schmidhuber, J. A Fixed Size Storage O(n³) Time Complexity Learning Algorithm for Fully Recurrent Continually Running Networks. Neural Comput. 1992, 4, 243–248.
7. Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning (ICML 2013), Atlanta, GA, USA, 16–21 June 2013; pp. 1310–1318.
8. Cho, K.; Merrienboer, B.V.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078.
9. Dangelmayr, G.; Gadaleta, S.; Hundley, D.; Kirby, M. Time series prediction by estimating markov probabilities through topology preserving maps. In Applications and Science of Neural Networks, Fuzzy Systems, and Evolutionary Computation II; International Society for Optics and Photonics: Bellingham, WA, USA, 1999; Volume 3812, pp. 86–93.
10. Wang, P.; Wang, H.; Wang, W. Finding semantics in time series. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, Athens, Greece, 12–16 June 2011; pp. 385–396.
11. Afolabi, D.; Guan, S.; Man, K.L.; Wong, P.W.H.; Zhao, X. Hierarchical Meta-Learning in Time Series Forecasting for Improved Inference-Less Machine Learning. Symmetry 2017, 9, 283.
12. Xu, X.; Ren, W. A Hybrid Model Based on a Two-Layer Decomposition Approach and an Optimized Neural Network for Chaotic Time Series Prediction. Symmetry 2019, 11, 610.
13. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 1994, 5, 157–166.
14. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
15. Gers, F.A.; Schraudolph, N.N.; Schmidhuber, J. Learning Precise Timing with LSTM Recurrent Networks. J. Mach. Learn. Res. 2002, 3, 115–143.
16. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610.
17. Brockwell, P.J.; Davis, R. Introduction to Time-Series and Forecasting; Springer: New York, NY, USA, 2002.
18. Shumway, R.H.; Stoffer, D.S. Time Series Analysis and Its Applications; Springer: New York, NY, USA, 2000.
19. Elman, J.L. Finding structure in time. Cognit. Sci. 1990, 14, 179–211.
20. Rohwer, R. The moving targets training algorithm. In Advances in Neural Information Processing Systems 2; Touretzky, D.S., Ed.; Morgan Kaufmann: San Mateo, CA, USA, 1990; pp. 558–565.
21. Mueen, A.; Keogh, E. Online discovery and maintenance of time series motifs. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 25–28 July 2010; pp. 1089–1098.
22. Khaled, A.A.; Hosseini, S. Fuzzy adaptive imperialist competitive algorithm for global optimization. Neural Comput. Appl. 2015, 26, 813–825.
23. Kingma, D.P.; Ba, J.L. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference for Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015.
24. Zhang, Y.; Wang, Y.; Zhou, G.; Jin, J.; Wang, B.; Wang, X.; Cichocki, A. Multi-kernel extreme learning machine for EEG classification in brain-computer interfaces. Expert Syst. Appl. 2018, 96, 302–310.
25. Zhang, X.; Yao, L.; Wang, X.; Monaghan, J.; Mcalpine, D.; Zhang, Y. A Survey on Deep Learning based Brain Computer Interface: Recent Advances and New Frontiers. arXiv 2019, arXiv:1905.04149.
26. Yosida, K. Functional Analysis; Springer: New York, NY, USA, 1965.
Figure 1. Solution number region.
Figure 2. One-solution case of Equation (17) (θ < 1).
Figure 3. One-solution case of Equation (17) (θ > 1).
Figure 4. Solution number region in Case 2.
Figure 5. Two-solution case of Equation (17).
Figure 6. Two-solution case of Equation (17).
Figure 7. Solution number region in Case 3.
Figure 8. Three-solution case of Equation (17).
Figure 9. Three-solution case of Equation (17).
Figure 10. Learning data increase.
Figure 11. Learning data increase and decrease.
Figure 12. Learning data decrease and increase.
Figure 13. Learning data vibrate.
