The Application of Stock Index Price Prediction with Neural Network

Abstract: Stock index price prediction is an active topic in both academic and economic fields. The index price is hard to forecast because of its noise and uncertainty. With the development of computer science, neural networks have been applied in many industrial fields. In this paper, we compare four machine learning methods: three typical neural network models, namely Multilayer Perceptron (MLP), Long Short Term Memory (LSTM) and Convolutional Neural Network (CNN), and one attention-based neural network. The main task is to predict the next day's index price from historical data. The dataset consists of the SP500 index, CSI300 index and Nikkei225 index from three different financial markets, representing the most developed market, the developing market and the less developed market, respectively. Seven variables are chosen as inputs, covering daily trading data, technical indicators and macroeconomic variables. The results show that the attention-based model has the best performance among the alternative models. Furthermore, all the introduced models achieve better accuracy in the developed financial market than in the developing ones.


Introduction
The stock market is an essential component of a nation's economy and the place where most of the world's capital is exchanged. Its performance therefore has a significant influence on the national economy. It plays a crucial role in attracting distributed liquidity and savings and directing them into optimal paths, so that scarce financial resources can be allocated to the most profitable activities and projects [1]. Investors and speculators in the stock market aim to profit from the analysis of market information. Thus, one could take advantage of the financial market given proper models to predict the stock price and volatility, which are affected by macroeconomic factors as well as hundreds of other factors.
Stock index price prediction has always been one of the most challenging tasks for people working in finance and related fields because of the volatility and noise of the series [2]. How to improve the accuracy of stock index price prediction remains an open question. A time series can be defined as a chronological sequence of observations of some behavior or activity recorded at discrete, periodic intervals, in fields such as social science, finance, engineering, physics, and economics. The stock index price is a financial time series with a low signal-to-noise ratio [3] and a heavy-tailed distribution [4]. These features make it very hard to predict the trend of the index price. Time series prediction aims at building models that estimate future values given past values. Because in many cases the relationship between past and future observations is not deterministic, this amounts to expressing the conditional probability distribution as a function of the past observations [4,5]:

$$p(X_{t+d} \mid X_t, X_{t-1}, \ldots) = f(X_t, X_{t-1}, \ldots). \tag{1}$$

There are many traditional econometric models for forecasting time series, such as the Autoregressive (AR) model [6], the Autoregressive Moving Average (ARMA) model [7], the Generalized Autoregressive Conditional Heteroskedasticity (GARCH) model [8], the Vector Autoregressive model [9] and the Box-Jenkins method [10], which perform well in predicting stock prices. In the past decade, deep learning has developed rapidly in many fields. Deep learning can capture nonlinear relationships better than traditional methods, and the importance of nonlinear relationships in financial time series has gained increasing attention from researchers and financial analysts.
Therefore, researchers have started to analyze these nonlinear relationships using machine learning methods such as Artificial Neural Networks (ANNs) [2,11] and Support Vector Regression (SVR) [12,13], which have been applied to financial time series prediction with good predictive accuracy. However, the recent trend in the pattern recognition and machine learning communities holds that a deep nonlinear topology should be applied to time series prediction. Given the complexity of financial time series, combining deep learning methods with financial market prediction is regarded as one of the most attractive topics in stock market research [14][15][16]. The main aim of using deep learning is to design an appropriate neural network to estimate the nonlinear relationship represented by f in Equation 1. Nevertheless, this field remains relatively unexplored.
In this paper, we compare stock index price prediction using three typical machine learning methods and the Uncertainty-aware Attention (UA) mechanism [17] to test their prediction performance. We build four models to predict the stock index price: Multilayer Perceptron (MLP), Long Short Term Memory (LSTM), Convolutional Neural Network (CNN), and Uncertainty-aware Attention (UA).
A neural network is a massively parallel interconnected network of simple adaptive units whose organization can simulate the way biological nervous systems interact with real-world objects [18]. In biological neural networks, each neuron is connected to other neurons; when it is excited, it sends chemicals to the connected neurons, changing the potential within them. If the potential of a neuron exceeds a threshold, the neuron is activated and sends chemicals to other neurons. Abstracting this neuron model gives, for example, the M-P neuron model with one neuron and three input nodes. In Figure 1, the neuron receives input x transmitted through connections with weight w. It then compares the total input signal with the neuron threshold, and an activation function determines whether it is activated. Building on this neural architecture, artificial neural networks (ANNs) have developed rapidly in the past few decades. ANNs can learn and generalize from experience, building complex networks that handle the non-linear relationships in complex time series, which makes them good approximators in forecasting problems.

The Multilayer Perceptron (MLP) is a feedforward artificial neural network. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer [19]. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLP is trained with a supervised learning technique called backpropagation (BP) [20]. Its multiple layers and non-linear activations distinguish MLP from a linear perceptron and allow it to separate data that is not linearly separable [21].

Long Short Term Memory (LSTM) is a particular type of Recurrent Neural Network (RNN). LSTM was first proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber [22] and improved in 2000 by Felix Gers' team [23]. This kind of network performs well in learning long-term dependencies. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate [22,23]. The cell remembers values over given time intervals and the three gates regulate the flow of information into and out of the cell. LSTM is very popular in image processing, text classification, and time series prediction. LSTMs were developed to deal with the exploding and vanishing gradient problems that can be encountered when training traditional RNNs.

The Convolutional Neural Network (CNN), whose design follows the visual organization of the human brain, is usually applied in pattern recognition. A CNN consists of convolution layers, pooling layers and fully connected layers, with a nonlinear activation function between layers. The network uses a mathematical operation called convolution, a specialized linear operation, in place of general matrix multiplication. CNN has unique advantages in image processing because its local weight sharing reduces the complexity of the network.

UA is an attention-based model that uses RNNs as its basic structure. The network employs two RNNs that process the raw time series data and generate temporal attention and variable attention, over time and features respectively, using variational inference. This design can better reflect the connections among different variables.
In this work, the four methods are used to predict the stock index price, and we compare their performance.

Literature Review
The literature on time series prediction has a long history in finance and econometrics. This section introduces related work on machine learning approaches, whose rapid development within artificial intelligence inspired this paper.
As is known, both macroeconomic factors and the inherent dynamics of financial series can influence the stock index price. Xiong et al. applied a Long Short Term Memory neural network to model SP500 index volatility, with Google domestic trends as indicators of macroeconomic factors [24]. Forecasting the implied volatility is also crucial in stock price prediction. In addition to the high, low, open and close price data, they collected 25 domestic trends, which can be considered a representation of public interest, as input data. The group used the mutual information method to determine the observation interval and the normalization window size. They used the GARCH model as the benchmark for the LSTM model. The results show that the LSTM model performs better than the GARCH model.
Hiransha et al. applied four types of neural networks, namely MLP, RNN, LSTM, and CNN, to predict prices in two stock markets [25]. The data from the NSE market cover three sectors: Automobile, Banking and IT. The data from the NYSE market cover the Finance and Petroleum sectors. An ARIMA model was used in this article to compare linear and nonlinear models.
Model performance in the option trading markets also leaves room for improvement. While traditional closed-form models such as the Kemna-Vorst model and the Levy approximation are widely used and perform well, their accuracy can still be improved. Zhou and George proposed a model that improves the accuracy of option pricing by integrating a neural network with the Levy approximation [26]. First, the authors used the Antithetic Monte Carlo model to obtain a better forecasting result as the benchmark for learning. Second, the implied volatility of the Levy approximation model was modified with Monte Carlo prices. Third, a neural network was built to map the real volatility to the implied volatility.
The price in the stock market has already reflected all known information which means that the financial market is 'informationally efficient' [27] and the price is influenced by news or events.
Ding et al. applied techniques from Natural Language Processing (NLP) to stock market prediction [28]. In this article, the team proposed a novel deep neural network for event-driven stock market prediction. The experimental results show that the proposed model achieves an improvement of almost 6% over state-of-the-art baseline methods.
Bao et al. proposed a novel deep learning framework combining wavelet transforms (WT), stacked autoencoders (SAEs) and LSTM for stock price prediction [29]. The model has three stages. First, the WT method was used to eliminate noise by decomposing the stock price series. The Haar function was used as the wavelet basis function because of its ability to decompose financial time series while significantly reducing the processing time [30]. Second, after denoising the time series data, SAEs, which are trained layer by layer in an unsupervised manner, were applied to extract deep features from the input data. A single AE is a three-layer neural network with an input layer, a hidden layer and an output layer, also called the reconstruction layer. This architecture aims to minimize the error between the input layer and the reconstruction layer through the hidden layer, which is designed to capture the deep features. The authors used four single-layer autoencoders, and every AE was trained with the same gradient descent algorithm. Lastly, LSTM was introduced to forecast the stock price from the processed time series with high-level features, namely the output of the last hidden layer in the second stage. The ability of LSTM to mitigate vanishing gradients and learn long-term dependencies helps to predict the stock price better.
Chen et al. [31] proposed an eight-layer convolutional neural network. The main task of this work is binary classification rather than the prediction of exact values. The output of the model takes one of two values, one or zero, representing the direction of the stock price movement, up or down. They used 1D convolution since the stock price data form a 1D time series. The results show that CNN can perform as well as RNN in time series prediction.
Binkowski et al. proposed a novel convolutional network for predicting multivariate asynchronous time series called the Significance-Offset Convolutional Neural Network (SOCNN) [2]. This model is designed as a combination of the AR model and CNN. The authors were inspired by the gate mechanism in RNN. The output of the model is obtained as a weighted sum of adjusted coefficients, which depend on the input data and are parameterized through a convolutional network. The model architecture has two convolutional parts. One captures the local significance of observed data, while the other represents the predictors that are completely independent of position in time. The combination can learn both the information in the raw time series and the weights in the AR system.
A traditional CNN applied to time series might not properly reflect information in the time domain. Thus Lea et al. introduced a novel temporal model called Temporal Convolutional Networks (TCN), which uses a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection [32]. The main task is time series classification. The authors designed an encoder-decoder TCN structure with a hierarchy of temporal convolutional filters that can capture long-range temporal patterns. The parameters in the network are updated simultaneously at every time step, and the convolution is computed across time [32]. The proposed TCN model trains faster than LSTM-based recurrent neural networks.
Qin et al. proposed a dual-stage attention-based recurrent neural network (DA-RNN) to address the limits that traditional methods could not capture the long-term temporal dependencies appropriately and select the relevant driving series to make predictions [33]. In the first stage, the authors extract the endogenous series at each time step that affects the outcome most with a new attention mechanism by referring to the previous encoder hidden state. In the second stage, the model will automatically select relevant encoder hidden states with the designed temporal attention mechanism across all time steps. Through the two attention mechanisms in encoder and decoder separately, this model could learn the long-term temporal dependencies and choose the most related input variables of a time series.

Multilayer Perceptron
The Multilayer Perceptron (MLP) is a basic artificial neural network that consists of a finite number of consecutive layers. An MLP has at least three layers: an input layer, a hidden layer, and an output layer. In many cases, the MLP structure contains more hidden layers, which allows it to approximate solutions to many complex problems such as function fitting. An MLP can be thought of as a directed graph that maps a set of input vectors to a set of output vectors and consists of multiple layers of nodes, each connected to the next layer. These connections are generally called links or synapses [34]. Except for the input nodes, each node is a neuron with a nonlinear activation function. A supervised learning method called the backpropagation (BP) algorithm is often used to train MLP. MLP is a generalization of the perceptron that overcomes the perceptron's inability to classify linearly inseparable data. The layers of MLP are fully connected: every neuron in a layer is connected to all neurons in the previous layer, and each connection contributes a weighted term to the neuron's input sum. Figure 2 shows the framework of MLP model.

[Figure 2: framework of the MLP model, with inputs, four hidden layers, and an output.]

In detail, from the input layer to the first hidden layer, let $X$ denote the input and $H_1$ the first hidden layer; then $H_1 = f(W_1 X + B_1)$, where $W_1$ denotes the weight parameters, $B_1$ the bias parameters, and $f$ is usually a non-linear activation function. Each subsequent hidden layer repeats the same computation on the output of the previous layer, with its own weight and bias parameters. From the last hidden layer to the output layer, let $Y$ denote the output; then $Y = G(W_L H_L + B_L)$, where $H_L$ is the last hidden layer and $L$ is the number of hidden layers. The function $G$ is the output activation function, commonly Softmax. This builds the relationship between input and output. If the weight $w$ or bias $b$ of one or several neurons changes slightly, the output of those neurons changes, and the change is eventually reflected in the network output. Finally, the BP algorithm and an optimization algorithm are used to update the weights $W$ and biases $B$ to obtain the expected results.
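The layer-by-layer computation described above can be sketched in plain Python. This is only an illustration under stated assumptions: the input size of 7 (one day of features), the random initial weights, and the identity output in place of $G$ (chosen here because the target is a price) are not taken from the paper, and the helper names are hypothetical.

```python
import random

def relu(x):
    """Element-wise ReLU, standing in for the non-linear activation f."""
    return [max(0.0, v) for v in x]

def dense(x, weights, bias):
    """Compute W x + B for one layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (an arbitrary choice)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def mlp_forward(x, layers):
    """Apply H_l = f(W_l H_{l-1} + B_l) per hidden layer, then a linear output."""
    h = x
    for weights, bias in layers[:-1]:
        h = relu(dense(h, weights, bias))
    weights, bias = layers[-1]
    return dense(h, weights, bias)  # assumption: identity output activation

rng = random.Random(0)
sizes = [7, 70, 28, 14, 7, 1]  # assumed input size 7, four hidden layers, one output
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
prediction = mlp_forward([0.5] * 7, layers)
```

Training, that is, updating the weights with backpropagation and an optimizer, is deliberately left out of this sketch.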

Long Short Term Memory
Long Short Term Memory (LSTM) is a particular Recurrent Neural Network (RNN) that inherits the strengths of RNN and handles long-term dependencies better, and it has been applied in many fields [22]. Due to the vanishing gradient problem, RNN has difficulty in learning long-term dependencies [35]. LSTM is a practical solution for combating vanishing gradients by using memory cells [22,23]. All recurrent neural networks take the form of a chain of repeating neural cells. A traditional RNN has only one layer in each cell, whereas an LSTM cell has four layers that interact with each other in a particular way. The key to LSTM is the cell state $C_t$. The cell state is like a conveyor belt that runs through the chain with only linear interactions, so information can flow along it largely unchanged. LSTM can delete or modify information in the cell state, regulated by the gate mechanism. A gate is a way to let information pass selectively and consists of a sigmoid neural layer and a pointwise multiplication operation. There are three types of gates in the LSTM cell to protect and control the cell state. Every gate has the form $\sigma(W_i X + b_i)$. The sigmoid layer output lies in (0,1) and indicates how much of each component in $C_{t-1}$ should pass: an output of 0 lets nothing through, while an output of 1 lets everything through. Figure 3 shows the framework of LSTM model.

Forget Gate Layer
The first step of LSTM is to decide what information to discard from the cell state. This decision is made by a sigmoid layer called the 'forget gate layer'. Based on $h_{t-1}$ and $x_t$, it outputs a vector with entries between 0 and 1 for the cell state $C_{t-1}$. A value of 1 means keeping the information from $C_{t-1}$ completely, while 0 means dropping it completely. The value of the forget gate is denoted $f_t$.

Input Gate Layer
The second step is to decide what new information should be stored in the cell state, which consists of two parts:
(1) One sigmoid layer determines which values should be updated, represented as $i_t$.
(2) One tanh layer creates a new candidate value $\tilde{C}_t$ that will be added to the state information.
Following this, the two values $i_t$ and $\tilde{C}_t$ are combined to update the state. The state $C_t$ is obtained by adding the two terms $f_t * C_{t-1}$ and $i_t * \tilde{C}_t$, as in Equation 3:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t. \tag{3}$$

Here, the symbol $*$ represents the element-wise product.

Output Gate Layer
Lastly, the output is determined by the output gate layer, based on the cell state. The hidden state is $h_t = o_t * \tanh(C_t)$, the element-wise product of a tanh activation of $C_t$ and the output gate $o_t$, which captures the information of the previous hidden state $h_{t-1}$ and the input $x_t$.
One fully connected layer will be applied to generate the final predictions based on the hidden states h.
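The three gate steps above can be put together in a single-step sketch. It assumes the common formulation in which every gate acts on the concatenation of $h_{t-1}$ and $x_t$; the parameter layout (a dictionary of weight and bias pairs) and the function names are hypothetical, and no training is shown.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(weights, bias, vec):
    """Compute W vec + b, with `weights` given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step; `params` maps a gate name to (weights, bias)."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(v) for v in affine(*params["forget"], z)]       # forget gate
    i_t = [sigmoid(v) for v in affine(*params["input"], z)]        # input gate
    c_tilde = [math.tanh(v) for v in affine(*params["candidate"], z)]
    o_t = [sigmoid(v) for v in affine(*params["output"], z)]       # output gate
    # Cell update: C_t = f_t * C_{t-1} + i_t * C~_t (element-wise)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    # Hidden state: h_t = o_t * tanh(C_t)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy example: hidden size 2, input size 1, all parameters zero,
# so every gate outputs 0.5 and the candidate value is 0.
zero = ([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [0.0, 0.0])
params = {name: zero for name in ("forget", "input", "candidate", "output")}
h_t, c_t = lstm_step([1.0], [0.0, 0.0], [1.0, -2.0], params)
```

With all-zero parameters the forget gate halves the previous cell state, which makes the role of $f_t$ easy to see in isolation.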

Convolutional Neural Network
Convolutional Neural Network (CNN) is a typical deep neural network inspired by the human visual nervous system. CNN is usually applied in image processing. For multivariate time series prediction, CNN can be applied to the 2-D input time series data. CNN has two important properties that reduce the number of parameters. The first is local perception: each neuron perceives local information, and the local information is then combined at a higher level to obtain global information, as in the human brain. The second is weight sharing: each neuron acts as a filter that extracts local features from local input data, and the filter slides over the image with a fixed window size until all the data points in the image are covered. There are three main layers in a convolutional neural network. Figure 4 shows the framework of CNN model.

[Figure 4: framework of the CNN model. Labels: Input, Cov 1, Cov 2, Cov 3, drop 1, drop 2, Output.]

The convolution layer is the core of CNN. The convolution kernel, also called a filter, can be considered a small window that contains learned parameters in matrix form. The filter slides over the input data and captures local information by applying the convolution operation to each patch. The local information extracted by different convolution kernels is then combined to generate global information.
A pooling layer is added between consecutive convolution layers in a CNN structure. It is designed to gradually reduce the amount of data and parameters, which helps to avoid overfitting to some extent. Like the convolution layer, the pooling layer uses a small sliding window. The convolved features of a specific area are compressed into the maximum value or the mean value of that area.
A fully connected layer is applied after several convolution layers and pooling layers to obtain the prediction or classification results. It learns from the latent features produced by the preceding dimensionality reduction and feature extraction.
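The sliding-window behavior of the convolution and pooling layers can be illustrated on a short one-dimensional series. This is a minimal sketch, not the configuration used in the experiments: the kernel values, the window size and the toy price series are arbitrary, and the convolution is written as cross-correlation, which is how deep learning libraries usually implement it.

```python
def conv1d(series, kernel, bias=0.0):
    """Valid 1-D convolution: slide the kernel over every full window."""
    k = len(kernel)
    return [sum(kernel[j] * series[i + j] for j in range(k)) + bias
            for i in range(len(series) - k + 1)]

def max_pool1d(features, window):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(features[i:i + window])
            for i in range(0, len(features) - window + 1, window)]

prices = [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]
diff_kernel = [-1.0, 0.0, 1.0]            # same weights reused at every position
features = conv1d(prices, diff_kernel)    # [3.0, 5.0, 7.0, 9.0]
pooled = max_pool1d(features, 2)          # [5.0, 9.0]
```

The same three kernel weights are applied at every position, which is the weight sharing described in this section, and pooling halves the length of the feature sequence.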

Uncertainty-Aware Attention
Uncertainty-aware Attention (UA) was proposed in [17]. UA is an attention-based model that uses two RNNs to generate two types of attention. The attention mechanism was originally applied in image processing. It simulates the visual mechanism of the human brain, in which people focus on particular areas when they look at a whole picture. In other words, the brain does not attend to the whole picture evenly but assigns different weights to different areas. UA uses variational inference to learn this mechanism. Specifically, the two types of attention weights are assumed to follow Gaussian distributions with input-dependent noise and are produced by two RNNs. For each input, the model generates attention with small variance when it is confident about the contribution of the given features, and allocates noisy attention with large variance to uncertain features. With the input embeddings $v_1, \ldots, v_i$, the two attentions with respect to timesteps ($\alpha$) and features ($\beta$) are generated as follows:

$$\alpha_1, \ldots, \alpha_i = \mathrm{Softmax}(e_1, \ldots, e_i), \qquad \beta_j = \tanh(d_j) \quad \text{for } j = 1, \ldots, i.$$

The parameters of the two RNNs are represented as $\omega$, and $i$ is the timestep we are interested in. The two attentions are produced through the squashing functions Softmax and tanh, respectively, from the attention logits $e$ and $d$, which are generated from the RNN outputs $g$ and $h$. The context vector $c$ is then obtained as a weighted sum of the embeddings $v$ with the attentions $\alpha$ and $\beta$:

$$c = \sum_{j=1}^{i} \alpha_j \beta_j v_j.$$

One fully connected layer is applied to generate the final predictions: $\hat{y} = f_c(c)$, where $f_c$ represents the fully connected layer. Figure 5 shows the framework of UA model.
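The deterministic part of this computation, from attention logits to the context vector, can be sketched as follows. The sketch assumes that each $d_j$ is a vector with one entry per feature, applied element-wise to $v_j$; it takes the logits $e$ and $d$ as given rather than producing them with RNNs, and it omits the Gaussian sampling of the attention weights entirely. The function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scalars."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(e, d, v):
    """Return (alpha, c) with c = sum_j alpha_j * tanh(d_j) * v_j, element-wise."""
    alpha = softmax(e)                     # attention over timesteps
    size = len(v[0])
    c = [0.0] * size
    for a_j, d_j, v_j in zip(alpha, d, v):
        beta_j = [math.tanh(x) for x in d_j]   # attention over features
        for k in range(size):
            c[k] += a_j * beta_j[k] * v_j[k]
    return alpha, c

# Two timesteps with equal temporal logits and one feature each.
alpha, c = context_vector([0.0, 0.0], [[1.0], [1.0]], [[2.0], [4.0]])
```

Because tanh can return negative values, $\beta$ can flip the sign of a feature's contribution, whereas the Softmax weights $\alpha$ are positive and sum to one.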

Dataset
In this paper, we choose three stock indices from three different markets, which represent three different levels of market development, as our datasets. The datasets cover a period of about eight years, from July 2008 to September 2016. In the real world, the state of the market may influence the validity of the model predictions, so it is helpful to build the models on datasets from different market conditions. The first dataset is the CSI300 index from mainland China, which is considered a developing market. In this kind of market, institutions still have many issues to resolve in order to gradually improve the financial mechanism. The second is the S&P500 index from the US trading market at the New York Stock Exchange, which is considered the most advanced financial market, at the highest level of development. Thus we choose the S&P500 index to represent the most developed market. The third is the Nikkei225 index from the Tokyo financial market, which represents a state between the developed and the developing market. These three stock indices help us test the robustness of the model predictions.
For each of the three markets, we choose seven variables as inputs and divide them into three sets. Table 1 shows the details of the selected variables.
The first set is the historical trading data of each index, including open price, close price and trading volume. These variables represent the basic trading information. The second set is technical indicators, consisting of the Moving Average Convergence Divergence (MACD) indicator and the Average True Range (ATR) indicator. The third set is macroeconomic variables, including the exchange rate and the interest rate. In general, macroeconomic factors may have a significant impact on the performance of stock markets. Zhao concludes that fluctuations in the currency exchange rate can have a substantial influence on the trend of stock markets [36]. Therefore, the exchange rate and the interest rate are our first choice as extra inputs, since they may capture additional information for stock prediction. In this paper, the US dollar index, which plays a central role in the monetary market, is used as the proxy for the exchange rate. As for the interest rate, we select the Shanghai Interbank Offered Rate (SHIBOR) for the mainland China market, the Tokyo Interbank Offered Rate (TIBOR) for the Tokyo market and the Federal funds rate for the US.
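The two technical indicators can be computed from daily prices as sketched below. The paper does not state its indicator parameters, so this sketch assumes the conventional choices: MACD as the difference between 12-day and 26-day exponential moving averages of the close price, and ATR as a simple 14-day mean of the true range (Wilder's smoothing is a common alternative). ATR also requires daily high and low prices, which is an assumption about the available data.

```python
def ema(values, span):
    """Exponential moving average, seeded with the first value."""
    alpha = 2.0 / (span + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out

def macd_line(close, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA of the close price."""
    return [f - s for f, s in zip(ema(close, fast), ema(close, slow))]

def true_range(high, low, close):
    """Daily true range; the first day has no previous close."""
    tr = [high[0] - low[0]]
    for t in range(1, len(close)):
        tr.append(max(high[t] - low[t],
                      abs(high[t] - close[t - 1]),
                      abs(low[t] - close[t - 1])))
    return tr

def atr(high, low, close, period=14):
    """Simple moving average of the true range over `period` days."""
    tr = true_range(high, low, close)
    return [sum(tr[t - period + 1:t + 1]) / period
            for t in range(period - 1, len(tr))]

# Two-day toy example with a 2-day ATR window.
example_atr = atr([3.0, 5.0], [1.0, 4.0], [2.0, 4.5], period=2)
```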

Experimental Design
In this section, the performance of each model is discussed in terms of experimental design and evaluation.

Prediction Approach
The data is divided into two parts: a training part, used to train the model and update the model parameters, and a test part, used to evaluate the forecasts of the trained model. For all methods, 90% of the data are used for training and 10% for testing. Figure 6 shows the close price of the three market indices. According to this figure, the original time series of the three markets are not stationary. Deep learning methods do not require the time series to be stationary. For all methods in this paper, the index's close price of the next day is used as the label, and the seven features described in the Dataset section are the input variables.

For the MLP model, we use four hidden layers with 70, 28, 14 and 7 neurons, respectively, and one fully connected layer as the output layer. The activation function of the hidden layers is ReLU. In the LSTM model, the hidden layer has 140 units, and one fully connected layer serves as the output layer. Sigmoid and tanh functions are used in the LSTM cell. For the CNN model, we choose three convolution layers with 7, 5 and 3 channels, respectively, each with a kernel size of 3. We apply a dropout layer after each convolution layer. In the UA model, the hidden units of the two RNNs are set to 70, and the embedding size of $v$ is 7. One fully connected layer outputs the final predictions. Table 2 shows the hyper-parameters of each method.

For all methods, the time step is set to twenty and the learning rate is 0.01. We choose Mean Square Error (MSE) as the loss function and use the Adam algorithm to optimize it. The batch size is 64. All the input data are shuffled before training. The experiments for all methods were implemented in PyTorch.
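The construction of training samples from a time step of twenty and the 90/10 split can be sketched as follows. The sketch assumes that the split is chronological, with the last 10% of samples held out, which the text does not state explicitly; the function names and the synthetic data are hypothetical.

```python
def make_windows(features, close, time_step=20):
    """Pair each window of `time_step` consecutive days with the next day's close."""
    samples = []
    for end in range(time_step, len(features)):
        window = features[end - time_step:end]   # days end-time_step .. end-1
        samples.append((window, close[end]))     # label: close price on day `end`
    return samples

def chronological_split(samples, train_fraction=0.9):
    """Assumed split: earliest samples for training, latest for testing."""
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]

# Synthetic data: 120 days, 7 features per day.
days = 120
features = [[float(d)] * 7 for d in range(days)]
close = [float(d) for d in range(days)]
samples = make_windows(features, close)
train, test = chronological_split(samples)
```

Under this assumption, shuffling as described in the text would be applied to the training samples only, after the split, so that no test window is seen during training.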

Predictive Performance
In this part, we introduce the accuracy measurements used to compare the performance of the four models. We choose three classical indicators: Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and Correlation Coefficient (R). The indicators take their standard forms:

$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2},$$

$$R = \frac{\sum_{i=1}^{N}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2 \sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})^2}}.$$

In the above equations, $y_i$ is the original series that we want to forecast and $\bar{y}$ is its mean; $\hat{y}_i$ is the predicted series obtained from the model output and $\bar{\hat{y}}$ is its mean. $N$ represents the number of samples in one batch. Among the three indicators, MAPE measures the size of the error as the average error relative to the true values, RMSE is the square root of the mean squared error between the true and predicted values, and R measures the linear correlation between the two series. A model performs better with smaller MAPE and RMSE, and a larger R means that the predicted series is closer to the actual series.

Table 3 summarizes the experimental results of all methods on the three market index price datasets. The best performance is displayed in boldface. In general, the UA model outperforms the other three models on all metrics, which indicates that the attention-based model can achieve better performance in multivariate financial time series prediction. The attention mechanism can allocate more contribution to the variables that have more impact on the predictions. The MLP model has the worst performance on the three datasets, mainly because MLP is a simple neural network with only several fully connected layers, and this simple structure limits its predictions on multivariate time series. The LSTM model performs well on the CSI300 dataset, which represents the developing financial market, but less well on the other two datasets. On the CSI300 dataset, its RMSE and R are similar to those of the UA model and better than those of the MLP model. Figure 7 shows the predicted results of the four methods on the CSI300 dataset. The LSTM model shows smaller volatility at the end of the forecasting period. In Figure 7d, the UA model fits the original series better over the forecasting period. This suggests that an RNN-based structure can handle time series from less mature financial markets properly.

On the SP500 and Nikkei225 datasets, the CNN and UA models perform similarly. The UA model has smaller RMSE and MAPE and larger R on both the SP500 index and the Nikkei225 index, which represent the most developed and the less developed financial market, respectively. Figures 8 and 9 show the predicted results of the four methods on the SP500 dataset and the Nikkei225 dataset, respectively.

From Table 3 and Figures 7-9, we find that all four models perform differently across financial markets at different stages of development. For the UA model, the SP500 index has the smallest MAPE of 0.0067, compared with 0.0071 for the CSI300 index and 0.0091 for the Nikkei225 index. Similar results are found for R. The MLP, LSTM and CNN models all show a similar pattern to the UA model. The SP500 index is traded in the U.S. market, the most developed financial market, while the CSI300 index and the Nikkei225 index come from the developing and the less developed financial market, respectively. This finding may indicate that stocks in a developed financial market carry less noise and that prediction models perform better in a complete and mature financial system.
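The three indicators used in this section can be computed as in the sketch below. It assumes the standard definitions, with MAPE expressed as a fraction rather than a percentage (consistent with reported values such as 0.0067) and R as the Pearson correlation coefficient; the function names and sample numbers are illustrative only.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction of the true value."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error between true and predicted values."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def correlation(actual, predicted):
    """Pearson correlation coefficient R between the two series."""
    mean_a = sum(actual) / len(actual)
    mean_p = sum(predicted) / len(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
scores = (mape(actual, predicted), rmse(actual, predicted),
          correlation(actual, predicted))
```

MAPE is scale-free, which makes it the natural indicator for comparing indices quoted at very different price levels, while RMSE depends on the scale of each index.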

Conclusions
In this paper, four deep learning models, namely the MLP, LSTM, CNN and UA models, are used to predict the one-day-ahead closing price of three stock indices traded in different financial markets. We select the SP500 index traded in the U.S. financial market, the CSI300 index traded in the mainland China financial market and the Nikkei225 index traded in the Tokyo financial market. The SP500 index represents the most developed financial market with a sophisticated trading system, while the Nikkei225 and CSI300 indices represent the less developed financial market and the developing financial market, respectively. In each market, seven variables are selected as the inputs, including daily trading data, technical indicators and macroeconomic variables. In the MLP model, we use four hidden layers with 70, 28, 14 and 7 neurons, respectively. In the LSTM model, the hidden layer has 140 units and the output layer has one neuron. In the CNN model, three convolution layers with 7, 5 and 3 channels and a kernel size of 3 are chosen as the encoder, and a dropout layer is applied after each convolution layer. In the UA model, the hidden units of the two RNNs are set to 70, the embedding size of v is 7, and one fully connected layer is used to output the final predictions. The time step is set to twenty, and MAPE is used as the main predictive accuracy measurement. The results show that the UA model, which is an attention-based deep learning method, has the best performance in stock index prediction. Among the four alternative models, the UA model has the smallest MAPE on all three stock indices; it can capture non-linear relationships and allocate more contribution to the more important variables in financial time series, along with long-term prediction. Furthermore, the models perform differently in the three financial markets: all four models perform better in the most developed financial market, represented by the SP500 index, than in the developing financial market.
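The input setup summarized above (seven variables per day, a time step of twenty, and the next day's closing price as the target) can be illustrated with a sliding-window sketch. This is a minimal example under assumptions: the function name `make_windows`, the layout of `rows` as one list of seven values per trading day, and the use of column 0 as the closing price are all hypothetical and not taken from the paper.

```python
def make_windows(rows, target_col, time_step=20):
    """Build (inputs, targets) pairs for one-day-ahead prediction.

    rows       : list of per-day feature lists (assumed 7 values per day)
    target_col : index of the column to predict (assumed to be the close)
    time_step  : number of past days in each input window
    """
    inputs, targets = [], []
    for end in range(time_step, len(rows)):
        inputs.append(rows[end - time_step:end])   # days end-20 .. end-1
        targets.append(rows[end][target_col])      # value on day `end`
    return inputs, targets


# Toy data: 25 days, 7 variables; column 0 stands in for the closing price.
rows = [[float(day)] + [0.0] * 6 for day in range(25)]
X, y = make_windows(rows, target_col=0, time_step=20)

print(len(X), len(X[0]), len(X[0][0]))  # number of windows, time steps, variables
```

Each window holds 20 x 7 = 140 values. Whether a window is flattened (for example, for a fully connected network) or kept as a sequence depends on the model and is left open here.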

Future Work
Predicting financial time series is a difficult task because of their low signal-to-noise ratio. Neural networks have many advantages in explaining the non-linear relationships in time series. In the future, we might consider combining linear and non-linear models to build a new stock prediction model, for example by using an exponential smoothing method to fit the linear part of a financial time series and a neural network to fit the non-linear part. A dedicated neural network could also be used to estimate the coefficients of the exponential smoothing. In financial time series, many indicators might influence the trend of the stock price, so selecting proper indicators is another problem researchers may face. Recently, unsupervised learning has become popular in deep learning. An appropriate model could be constructed with an unsupervised learning method so that it extracts the vital information from many indicators, reducing both the dimension of the inputs and the number of parameters to train.
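One way to picture the proposed combination of linear and non-linear models is the decomposition below. It is only a sketch under assumptions: simple exponential smoothing stands in for "an exponential smoothing method", the smoothing coefficient `alpha` is fixed by hand rather than estimated by a network, and the neural network that would fit the residuals is omitted. The function name and sample values are hypothetical.

```python
def ses_one_step_forecasts(series, alpha):
    """Simple exponential smoothing, one-step-ahead.

    forecasts[t] is the forecast for series[t] made from data up to t-1.
    The first forecast is initialized to the first observation.
    """
    level = series[0]
    forecasts = [level]
    for value in series[:-1]:
        level = alpha * value + (1 - alpha) * level
        forecasts.append(level)
    return forecasts


# Hypothetical prices, for illustration only.
series = [10.0, 12.0, 11.0, 13.0]
forecasts = ses_one_step_forecasts(series, alpha=0.5)

# The residuals are what a non-linear model (e.g. a neural network)
# would be trained to predict in the hybrid scheme.
residuals = [actual - f for actual, f in zip(series, forecasts)]

# Final hybrid forecast = smoothing forecast + predicted residual.
# With perfect residual predictions, the original series is recovered.
recombined = [f + r for f, r in zip(forecasts, residuals)]

print(forecasts, residuals)
```

In such a scheme the quality of the hybrid forecast depends entirely on how well the residual model generalizes, since the smoothing step alone only tracks the slowly varying level.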