A Robust Regression-Based Stock Exchange Forecasting and Determination of Correlation between Stock Markets

Abstract: Knowledge-based decision support systems for financial management are an important part of investment plans. Investors are moving away from traditional investment areas such as banks because of their low return on investment. The stock exchange is presently one of the major areas for investment. Various non-linear and complex factors affect the stock exchange, so a robust stock exchange forecasting system remains an important need. Following this line of research, we evaluate the performance of a regression-based model to check its robustness over large datasets. We also evaluate the effect of top stock exchange markets on each other. We evaluate our proposed model on the top 4 stock exchanges: New York, London, NASDAQ and Karachi. We also evaluate our model on the top 3 companies: Apple, Microsoft, and Google. Historical data covering 20 years is gathered from Yahoo Finance; data of this size creates a Big Data problem. The performance of our system is evaluated on 1-step, 6-step, and 12-step forecasts. The experiments show that the proposed system produces excellent results, which are presented in terms of Mean Absolute Error (MAE) and Root Mean Square Error (RMSE).


Introduction
The stock market is a strong indicator of the economic conditions of a country. A stock exchange provides a neutral ground for brokers and companies to invest. People can invest their money and earn a large profit if they invest sensibly. Stock markets provide a better platform than traditional banking investments, since stock investments return more profit than bank deposits and bonds. However, the higher profits come with the higher risks associated with stock exchange rates. The stock exchange is influenced by non-linear and highly fluctuating factors [1]. These factors include the economic and political conditions of a country and public sentiment, and they cause stock rates to fluctuate over short time intervals. For this reason, investors and brokers buy and sell stocks within short time intervals. Predicting stock exchange prices by considering all dynamic factors is an important part of a business investment plan. Many researchers have explored time series analysis, machine learning methods, and technical analysis. Assisting investors with stock price prediction that makes effective use of the large volume of available data therefore remains a key research area [2].
Stock exchange prediction is meant to reduce risk and provide better investment plans. Stock exchange prediction and forecasting methods are categorized into two groups, namely statistical methods and computationally intelligent (AI) based methods. The first category includes Adaptive-Network-based Fuzzy Inference Systems (ANFIS) [3], autoregressive conditional heteroskedasticity (ARCH) [4], AutoRegressive Integrated Moving Average (ARIMA) [5] and Generalized Autoregressive Conditional Heteroskedasticity (GARCH) [6]. These methods work on the strong assumption that the data is linearly distributed. However, uncertainty and data complexity make it difficult to build a model on a strict linear assumption. Stock prices are also affected by factors such as investors' social networks, policy changes and economic conditions. Nevertheless, the internal rules of stock exchange data can be captured from historical data. The second category consists of artificial intelligence based prediction methods, which can learn these internal rules. These methods can predict results without any strict assumption about the data because of their underlying non-linear capabilities. Various machine learning methods have been used to predict stock exchange prices, including regression [3], Support Vector Machines (SVM) and Neural Networks [7].
Stock market time series forecasting is an interesting and open research area. Computationally intelligent algorithms are now widely used to forecast time series. However, a highly efficient stock exchange prediction model is yet to be designed. Existing work has another limitation: it does not consider the effect of top international stock markets on each other. A rise or fall in a top international stock market affects other stock exchanges, and such a rise or fall is usually due to external factors. Stock exchange prediction therefore depends on local factors and also on these international stock exchange markets. The robustness of prediction models also remains an open research question.
In this paper, we evaluate the robustness of a simple and efficient regression-based stock exchange prediction model. Regression-based models are more flexible and computationally efficient than statistical methods. A linear regression-based approach is used to predict stock exchange indices and company stock prices. We perform time series analysis for 4 stock exchanges (New York, London, NASDAQ, and Karachi) to evaluate the effectiveness of our regression-based model using data covering 20 years. We also evaluate our model on time series forecasting for 3 companies (Google, Microsoft, and Apple), and we calculate the effect of one stock exchange on another using a correlation factor. The results show that the proposed model provides very good results. In general, this paper offers the following contributions:
1. We propose and evaluate the robustness of regression-based time series analysis and forecasting;
2. We forecast the future values for 4 stock exchanges and 3 international companies;
3. We calculate the correlation between 4 stock exchanges for the rise and fall of stock indices.
The remainder of the paper is organized as follows: related work is presented in Section 2, Sections 3 and 4 present the methodology and results, respectively, followed by a conclusion.

Related Work
Various techniques have been used to analyze the stock market. These techniques fall under the categories of artificial intelligence systems, hybrids of artificial intelligence systems with trading rules, and machine learning techniques. Work done in each category is discussed in the sections below.

Artificial Intelligence Systems
Various artificial intelligence techniques have been devised in recent years to estimate stock market indices. Kodogiannis and Lolis in 2002 were the first to propose the Artificial Neural Network (ANN) to predict stock markets [8]. Later, Thirunavukarasu et al. [9] in 2009 also proposed a stock market prediction system using ANN. In 2014, Xi, Muzhou et al. proposed an ANN to predict stock market indices [10].
Support Vector Machines have also been of high interest in the research community. They were first introduced for stock market prediction in 2009 by Zhang and Shen [11] and were utilized by Wen, Yang, Song and Jia in 2010 [12]. Support Vector Machines were also utilized by Lin, Guo and Hu in 2013 [13] and by Yu, Chen and Zhang in 2014 [14]. More recently, in 2016, Gong et al. proposed them for stock market estimation [4].
In 2002, rough set theory was first utilized by Wang and Wang for stock market estimation and prediction [15]. Wang also utilized it in 2003 [16], and it was later adopted by Nair et al. in 2010 [17].
In 2007, Ives and Scandol proposed the use of Bayesian analysis [18], which was then used by Su and Peterman in 2012 [19], Ticknor in 2013 [20], and in 2015 by Miao, Wang and Xu [21], Wang et al. [22], and Peng et al. [5].
Among other artificial intelligence techniques, K-Nearest Neighbors (KNN) was proposed by Li, Sun and Sun in 2009 [23] and later utilized by Teixeira and De Oliveira in 2010 [24]. Particle Swarm Optimization (PSO) was applied by Fu-Yuan in 2008 [25] and by Shen, Zhang and Ma in 2009 [11]. Sorensen et al. in 2000 [26], Wu, Lin and Lin in 2006 [27] and Hu, Feng et al. in 2015 [28] used Decision Trees in their methodologies. Evolutionary learning algorithms such as the Genetic Algorithm have also been used by Hassan et al. in 2007 [29], followed by Huang and Wu in 2008 [30], and Rahman et al. in 2015 [6].

Artificial Intelligence Systems with Trading Rules
Artificial intelligence systems are often accompanied by trading rules in the development of autonomous and smart decision support systems. In 2015, Cervelló-Royo et al. proposed a trading rule that was not only beneficial but also adaptive to risk [31]. The rule was based on technical analysis and a combinational pattern, and it provides information about when to buy and sell, the amount of profit earned and the maximum loss that can be tolerated. Kim and Enke in 2016 devised a heuristic-based change trading system (RTCS) composed of various historical values generated using rough set analysis [32]. Their methodology was developed to cater to diverse market conditions. Podsiadlo and Rybinski in 2016 set out to determine experimentally the feasibility of using rough sets to build productive prediction models [33]. In 2016, Chiang et al. proposed a dynamic stock prediction system using Predicted Square Error (PSE) and a neural network [34]. Their method addressed the shortcomings that arise when ANN is applied on its own.

Artificial Intelligence Systems with Artificial Neural Network
In recent years, hybrid implementations of artificial intelligence systems with ANN have been a trend in the research community. The reason for such hybridization is the computational complexity introduced by the large dimensionality of neurons. Zhong and Enke in 2017 applied principal component analysis (PCA), along with variants such as Fast Robust Principal Component Analysis (FRPCA) and Kernel Principal Component Analysis (KPCA), to simplify and re-arrange the data [3]. The transformed data is then classified using ANN to predict the daily market returns. Gocken et al. in 2016 [7] devised a hybrid of a genetic algorithm (GA) and ANN to improve stock market estimation. Majhi et al. in 2009 [35] proposed a neural network variant for prediction of the Dow Jones Industrial Average (DJIA) and the Standard & Poor's (S&P) 500. They concluded that the Functional Link Artificial Neural Network (FLANN) is comparable to other ANN models while requiring less time during the training and testing phases.
Building on FLANN, Chakravarty and Dash [36] in 2012 developed a system to predict stock prices for the DJIA, the Bombay stock market and the S&P 500. Their results show that the fuzzy neural network based system produced better results than other systems. Dash and Bisoi in 2014 [37] proposed a hybrid approach based on a search optimization technique, which used a single-layer neural network.

Artificial Intelligence Systems with Support Vector Machines
Hybrid systems with ANN are quite successful in estimating the stock market; however, they suffer from over-fitting, local maxima, and convergence problems. To handle such issues, researchers use SVMs in hybrids with other techniques for stock market index estimation. The application of SVM not only minimizes the likelihood of over-fitting but also provides a global solution. In 2012, Huang proposed a hybrid methodology using both GA and Support Vector Regression (SVR) for stock prediction [38]. The Genetic Algorithm (GA) is mainly used to optimize the parameters of the model and to perform feature selection, so that optimal inputs are passed to the SVR model. The use of GA for feature selection is vital and helps the method to significantly outperform the benchmark schemes. Liu and Wang in 2013 [39] used a combined model of SVM and Decision Trees (DT) for stock forecasting, aiming to increase precision, recall, and F1 score. Their methodology was tested against techniques such as Bootstrap-SVM, Bootstrap-DT, and Back Propagation Neural Network (BPNN).
In 2015, Nayak et al. [40] proposed a hybrid framework combining SVM with KNN, which was used to predict the Indian stock exchange market. SVM was used to predict profit or loss, and the framework also estimated the stock value one day, one week and one month ahead. The model performed well for high-dimensional feature vectors and handled the error and performance of the classification methods. The SVM-KNN model outperformed the mentioned models by removing the need to tune multiple parameters for ANN and fuzzy-based models.

Proposed Methodology
The proposed methodology is step-wise explained in the sections below.

Data Collection
Historical data for stock exchange indices and individual companies can be fetched from Yahoo Finance and Google Finance. Yahoo Finance provides the facility to download this historical data between any two dates. To evaluate our proposed system, we trained and tested our technique on the publicly available stock exchange dataset from Yahoo Finance (https://finance.yahoo.com/). The site hosts data for multiple stock exchanges such as the Karachi, London, and New York Stock Exchanges. Stock market data from multiple top-ranked technology companies such as Microsoft, Google and Apple has also been used to test our proposed system. The collected data includes weekly stock market trends over a period of 20 years, from 10 July 1998 to 10 July 2018, which we treat as Big Data. Because we also want to study the dependency between stock markets, we used the same dates to download the dataset for each stock market. The attributes of the Yahoo Finance dataset are given in Table 1 and details of the historical data are presented in Table 2.

Data Pre-Processing
Time series data such as stock prices need to be pre-processed and checked for stationarity in order to avoid spurious regression. Most forecasting methods assume that the data is stationary. A stationary time series is one whose statistical properties, such as mean and standard deviation, do not change with time. Time series data such as stock prices are checked for stationarity using unit root tests such as the Augmented Dickey-Fuller (ADF) test. The collected stock market data is tested using the ADF test to find a unit root. Based on the significance value p, the result of the ADF test determines whether to accept or reject the null hypothesis that the data is non-stationary, i.e., that it has a unit root. A value of p less than 5% leads to the rejection of the null hypothesis.
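The idea behind the unit root check can be illustrated with a minimal sketch. It is not the procedure used in this paper: it implements the plain (non-augmented) Dickey-Fuller regression with a constant, and instead of a p-value it compares the t-statistic with an approximate large-sample 5% critical value of about −2.86. The function names and the synthetic random walk are assumptions made for illustration; a full ADF test with lag selection and p-values is available in statistical libraries such as statsmodels.

```python
import math
import random

def dickey_fuller_t(series):
    """t-statistic of gamma in: diff[t] = alpha + gamma * series[t-1] + e[t]."""
    x = series[:-1]                                   # lagged levels
    d = [b - a for a, b in zip(series, series[1:])]   # first differences
    n = len(d)
    mx, md = sum(x) / n, sum(d) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxd = sum((xi - mx) * (di - md) for xi, di in zip(x, d))
    gamma = sxd / sxx
    alpha = md - gamma * mx
    rss = sum((di - alpha - gamma * xi) ** 2 for xi, di in zip(x, d))
    se = math.sqrt(rss / (n - 2) / sxx)               # standard error of gamma
    return gamma / se

# Approximate 5% critical value (constant, no trend, large sample); an assumption
# used here in place of the p-value reported by a full ADF implementation.
CRITICAL_5PCT = -2.86

def is_stationary(series):
    """Reject the unit-root null when the statistic is below the critical value."""
    return dickey_fuller_t(series) < CRITICAL_5PCT

random.seed(0)
# Synthetic random walk standing in for 20 years of weekly closing prices.
walk = [100.0]
for _ in range(1039):
    walk.append(walk[-1] + random.gauss(0, 1))

# First difference of the series, the transformation applied before modeling.
diffed = [b - a for a, b in zip(walk, walk[1:])]
print("t-statistic (first difference):", round(dickey_fuller_t(diffed), 2))
print("first difference stationary:", is_stationary(diffed))
```

A series that fails the check in levels is differenced once and tested again, which mirrors the level versus first-difference comparison described above.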

Linear Regression
Trends in the stock market can be estimated using different regression models [41]: linear regression models, neural network based models and SVM-based regression.
Among these models, linear regression is used here because of its simplicity and robustness. In this methodology, we model the relationship between a single dependent variable and multiple independent variables. A regression model that deals with multiple independent variables is known as a multiple linear regression model [1]. Multiple linear regression generalizes simple linear regression in two ways: it allows the response to depend on multiple explanatory variables rather than one, and it allows the fitted relationship to take shapes other than a single straight line.
Let $y$ represent the dependent variable, which is linearly related to the $k$ independent variables $X_1, X_2, X_3, \ldots, X_k$ through the parameters $\beta_1, \beta_2, \beta_3, \ldots, \beta_k$, and is given as
$$ y = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon, $$
where the parameters $\beta_1, \beta_2, \beta_3, \ldots, \beta_k$ are the regression coefficients associated with $X_1, X_2, X_3, \ldots, X_k$ respectively, and $\varepsilon$ is the random error component representing the difference between the observed and fitted linear relationship. The $j$th regression coefficient $\beta_j$ gives the expected change in $y$ per unit change in the $j$th independent variable $X_j$. Assuming $E(\varepsilon) = 0$,
$$ E(y) = \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k. $$
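As an illustration of how the regression coefficients can be estimated, the following sketch solves the least-squares normal equations with plain Python. It is a minimal example under stated assumptions, not the implementation used in our experiments: the function names, the toy feature rows and the constant column of ones (added here to provide an intercept) are all hypothetical.

```python
def fit_linear_regression(X, y):
    """Least-squares coefficients beta solving (X^T X) beta = X^T y."""
    k = len(X[0])
    # Augmented normal-equation matrix [X^T X | X^T y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def predict(X, beta):
    """Fitted values: the sum of beta_j * X_j for every row."""
    return [sum(b * v for b, v in zip(beta, row)) for row in X]

# Made-up rows: [1 (intercept), open, high, low]; the target plays the role of close.
X = [[1.0, 10.0, 11.0, 9.0],
     [1.0, 11.0, 12.5, 10.5],
     [1.0, 12.0, 12.8, 11.0],
     [1.0, 11.5, 13.0, 11.2],
     [1.0, 13.0, 14.1, 12.4],
     [1.0, 12.5, 13.2, 11.9]]
true_beta = [0.5, 0.2, 0.5, 0.3]
y = predict(X, true_beta)          # noise-free target so the fit can be checked
beta = fit_linear_regression(X, y)
print([round(b, 4) for b in beta])
```

For real data, a numerically more stable solver (for example a QR or SVD based routine from a numerical library) would normally be preferred over the normal equations.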

Stock Exchange Interdependency
Stock exchange interdependency is another research problem addressed in this paper. Stock markets affect one another in multiple ways and for multiple reasons, chiefly the effect of currencies on stock markets, similar products listed in different stock markets, and the dependencies of economies on one another. The objective is to find the correlation between different stock exchange markets. The effect of international markets on each other is evaluated using Pearson correlation, and the results are shown in tabular as well as graphical form. The correlation between two time series can be calculated as:
$$ r = \frac{N \sum xy - \sum x \sum y}{\sqrt{\left[ N \sum x^2 - \left( \sum x \right)^2 \right] \left[ N \sum y^2 - \left( \sum y \right)^2 \right]}} $$
Here $N$, $\sum xy$, $\sum x$, $\sum y$, $\sum x^2$ and $\sum y^2$ represent the number of pairs of scores, the sum of the products of paired scores, the sum of the x scores, the sum of the y scores, the sum of the squared x scores and the sum of the squared y scores respectively.
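The Pearson correlation between two series can be computed directly from the sums named above. The sketch below is illustrative only; the function name and the short example series are assumptions, not data from our experiments.

```python
import math

def pearson_r(x, y):
    """Pearson correlation from N, sum(xy), sum(x), sum(y), sum(x^2), sum(y^2)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Two made-up index series that move together, and one that moves in reverse.
index_a = [1.0, 2.0, 3.0, 4.0]
index_b = [2.0, 4.0, 6.0, 8.0]
index_c = [8.0, 6.0, 4.0, 2.0]
print(pearson_r(index_a, index_b))   # perfectly positive relationship
print(pearson_r(index_a, index_c))   # perfectly negative relationship
```

In practice the two series must be aligned on the same dates before the sums are taken, which is why the same date range is downloaded for every market.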
Similarly, autocorrelation is a method to determine the correlation between successive values in the same data in order to measure the randomness within the data. In time series data, autocorrelation is calculated for different lags, estimating the dependency between instances separated by the respective lag. The value of autocorrelation lies between +1 and −1, where the extremes represent a strong correlation between the values of the dataset. To measure the randomness of the data, autocorrelation is calculated for a lag of 1, i.e., for successive values in the dataset.
The autocorrelation coefficient for $N$ observations is calculated as
$$ r_1 = \frac{\sum_{t=1}^{N-1} \left( x_t - \bar{x}_{(1)} \right) \left( x_{t+1} - \bar{x}_{(2)} \right)}{\sqrt{\sum_{t=1}^{N-1} \left( x_t - \bar{x}_{(1)} \right)^2 \sum_{t=1}^{N-1} \left( x_{t+1} - \bar{x}_{(2)} \right)^2}} $$
The values $\bar{x}_{(1)}$ and $\bar{x}_{(2)}$ are the mean values of the first $N − 1$ and last $N − 1$ observations respectively.
The autocorrelation coefficient for each stock market and company series was calculated using the above equation.
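A lag-1 autocorrelation of this kind can be sketched as the correlation between the first N − 1 and the last N − 1 observations, each centered on its own mean. The function name and the two short example series are hypothetical.

```python
import math

def lag1_autocorrelation(x):
    """Correlation between the first N-1 and the last N-1 observations."""
    head, tail = x[:-1], x[1:]
    mean_head = sum(head) / len(head)     # mean of the first N-1 values
    mean_tail = sum(tail) / len(tail)     # mean of the last N-1 values
    numerator = sum((a - mean_head) * (b - mean_tail) for a, b in zip(head, tail))
    denominator = math.sqrt(sum((a - mean_head) ** 2 for a in head)
                            * sum((b - mean_tail) ** 2 for b in tail))
    return numerator / denominator

trend = [1.0, 2.0, 3.0, 4.0, 5.0]            # steadily rising series
alternating = [1.0, -1.0, 1.0, -1.0, 1.0]    # sign flips every step
print(lag1_autocorrelation(trend))
print(lag1_autocorrelation(alternating))
```

A value near +1 indicates that each observation closely follows the previous one, while a value near 0 would indicate that successive values behave like random noise.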

Data Description, Preparation, and Multi-Step Prediction
Stock exchange data acquired from Yahoo Finance provides stock prices from different stock exchanges, namely the Karachi Stock Exchange (KSE), London Stock Exchange (LSE), New York Stock Exchange (NYSE) and NASDAQ. It also provides stock prices of multiple companies such as Microsoft (MSFT), Apple, and Google. Each record is characterized by the opening, closing, highest, and lowest values of the stock along with the number of shares traded. These attributes serve as the multiple independent variables, which are used to predict and forecast the closing value of the stock as the dependent variable. The dataset comprises historical stock market data for about 20 years. It is publicly available and covers stock market trends from Asia, Europe and America, which adds to its geographical and economic significance.
Taking the null hypothesis that the data is non-stationary, all the stock market data were tested for stationarity using the ADF test and were found to be stationary after taking the first difference. Table 3 shows the p-values for the original stock market data, which indicate a unit root, and the p-values for the first difference of the data. According to the ADF test, the significance value p was less than 5% at the first difference level, leading to the rejection of the null hypothesis. The stock market data was then used for further processing. The closing stock values are estimated using multi-step prediction, which measures the accuracy of the predicted outcomes at multiple steps into the future. A step is the time unit for which stock data is forecast. The 1, 6 and 12 steps thus correspond to forecasting 1, 6 and 12 time units into the future. Since the stock market data consists of weekly recorded prices, the 1, 6 and 12 step predictions forecast stock prices 1, 6 and 12 weeks ahead.
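One possible way to prepare data for multi-step prediction is sketched below. It assumes a direct strategy, in which the features of week t are paired with the closing value `step` weeks later, followed by a chronological 70%/30% split; whether the forecasts in this paper are produced directly or recursively is not specified here, so this choice, the function names and the toy values are assumptions for illustration only.

```python
def make_supervised(rows, closes, step):
    """Pair the features of week t with the closing value `step` weeks later."""
    features = rows[:len(rows) - step]
    targets = closes[step:]
    return features, targets

def chronological_split(X, y, train_fraction=0.7):
    """Keep time order: the earliest observations train, the latest ones test."""
    cut = int(len(X) * train_fraction)
    return X[:cut], y[:cut], X[cut:], y[cut:]

# Toy stand-in for 20 weekly records: one feature per week and a closing value.
rows = [[float(week)] for week in range(20)]
closes = [100.0 + week for week in range(20)]

for step in (1, 6, 12):
    X, y = make_supervised(rows, closes, step)
    X_train, y_train, X_test, y_test = chronological_split(X, y)
    print(step, len(X_train), len(X_test))
```

Splitting chronologically rather than at random avoids training on weeks that come after the weeks being predicted.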

Evaluation Metrics
The prediction performance of our proposed system is measured through multiple metrics [42][43][44][45][46][47][48]. The comparison is mainly based on the difference between the actual value and the predicted value [2]. These evaluation metrics are explained here:

Root Mean Squared Error-RMSE
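RMSE is a quadratic scoring rule used to measure the average magnitude of the estimation error in stock market trends [49]. The mathematical representation of RMSE is given below:
$$ \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( \mathrm{forecast}(t) - \mathrm{actual}(t) \right)^2} $$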

Mean Absolute Error-MAE
MAE is the average of the errors in the prediction of stock market indices [49]. The average error is calculated without considering the direction of the errors, and each difference has equal weight:
$$ \mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| \mathrm{forecast}(t) - \mathrm{actual}(t) \right| $$
In the equations above, $n$ represents the number of estimated values, and forecast(t) and actual(t) represent the estimated value and the actual value at time t respectively.
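Both metrics are straightforward to compute from paired forecast and actual values. The following sketch uses hypothetical function names and made-up values to show the difference between them.

```python
import math

def rmse(forecast, actual):
    """Root mean squared error: larger errors are penalized quadratically."""
    n = len(actual)
    return math.sqrt(sum((f - a) ** 2 for f, a in zip(forecast, actual)) / n)

def mae(forecast, actual):
    """Mean absolute error: every error has equal weight, direction is ignored."""
    n = len(actual)
    return sum(abs(f - a) for f, a in zip(forecast, actual)) / n

# Made-up forecast and actual closing values; the errors are 1, 0 and -2.
forecast = [2.0, 4.0, 6.0]
actual = [1.0, 4.0, 8.0]
print("MAE :", mae(forecast, actual))
print("RMSE:", rmse(forecast, actual))
```

Because RMSE squares each error before averaging, it is never smaller than MAE on the same data, and the gap widens when a few forecasts are far off.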

Results for NASDAQ
In this section, we present the stock forecasting results. Our first experiment was carried out on the NASDAQ stock exchange. In Figure 1 the prediction is done on "Close" for 12 steps ahead. The complete dataset covers 20 years; we trained our model on 70% of the data and tested it on the remaining 30%. The figure shows very good prediction results: the prediction is very close at the start and deviates somewhat towards the end of the data. We also carried out 1, 6 and 12 step ahead predictions, and the results are presented in Figure 1. The first figure shows the results on the training data for the original data and the 1, 6 and 12 step predictions, and these results are extremely good; the predictions are so close that the original and predicted values nearly coincide. The results on the testing data for the 1, 6 and 12 step predictions are shown in Figure 1c and are also very good, which demonstrates the performance of our proposed model. There is some error in the predictions on the testing data because the error propagates. The 1 step ahead prediction is closest to the original values, and the error grows for the 6 step and 12 step ahead predictions. In the last step, future values are predicted for the NASDAQ stock exchange: the next 12 steps are predicted, which can help investors examine the future patterns of the stock market.

Results for New York Stock Exchange
In our second experiment, we carried out an experiment for the NYSE; the results are shown in Figure 2.

Results for London Stock Exchange
In our third experiment, we carried out an experiment for the LSE. In Figure 3 the prediction is done on "Close" for 12 steps ahead. The complete dataset covers 20 years; we trained our model on 70% of the data and tested it on the remaining 30%. The figure shows even better prediction results on the test data: there is only a small deviation from the original data, and the results are almost the same as on the training data. There is one spike in the training data for which the prediction does not follow the data at that specific point. The reason is that the prediction is made from the previous pattern, and the previous data is uniform. The experiment is also carried out for 1, 6 and 12 step ahead predictions.

Results for Karachi Stock Exchange
In our fourth experiment, we carried out an experiment for the KSE. In Figure 4 the prediction is done on "Close" for 12 steps ahead. The complete dataset covers 20 years; we trained our model on 70% of the data and tested it on the remaining 30%. There is only a small deviation from the original data, and the results are almost the same as on the training data. The experiment is also carried out for 1, 6 and 12 step ahead predictions. The results are presented in Figure 4b,c for 1, 6 and 12 steps on training and testing data respectively. The deviations from the original data are almost the same for all forecasts, and the error is extremely small for all types of forecasting. In the last step, future values are predicted for the KSE: the next 12 steps are predicted, which can help investors examine the future patterns of the stock market.

Stock Prediction for Companies:
In our fifth experiment, we carried out an experiment for the top three companies: Microsoft, Apple, and Google. In this section, only the prediction results on test data and the 1, 6 and 12 step ahead predictions are presented.
In our last experiment, we find the correlation between different stock exchange markets. Other researchers have examined factors such as political events and sentiment analysis; in this paper, we examine the correlation between different stock exchange markets. All possible combinations of the 4 stock exchange markets were evaluated and the results are presented in Table 6. The correlation between the KSE and each of NASDAQ, the NYSE and the LSE is slightly negative, with values of −0.02, −0.019 and −0.025 respectively. These values are so close to zero that there is effectively no correlation between the KSE and these top 3 stock exchange markets. On the other hand, NASDAQ and the LSE have a positive correlation of 0.57, and the NYSE and the LSE have a correlation of 0.522. Interestingly, the NYSE and NASDAQ have a very strong positive correlation of 0.829, likely because both are top stock exchanges situated in the USA. The correlation results are shown in graphical form in Figure 8.

Robustness Analysis for the Proposed Method
Stock exchange prediction using linear regression is performed over a historical dataset of about 20 years. The proposed method is tested for robustness by varying the portion of the data used, in terms of years, and recording the corresponding computational time for both training and testing. The test covered regression-based forecasting over percentage-based portions of the data. We compared the training and testing times of linear regression and SVM-regression. Figure 9a,b shows the training and testing results, respectively, for stock markets; similarly, Figure 9c,d shows the training and testing results for the companies. The results are plotted in increasing order of data portion. They show that linear regression is considerably faster than SVM-regression in terms of computational time. The training and testing times also stay close to one another over the range of increasing data portions, which indicates the robustness of the method.
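The timing procedure can be sketched as follows. This is only an illustration under stated assumptions: it times a one-variable closed-form least-squares fit on synthetic data, whereas the experiments above compare multi-variable linear regression with SVM-regression; the function names, the data fractions and the synthetic series are all hypothetical.

```python
import time

def fit_simple(x, y):
    """Closed-form slope and intercept for one-variable least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return slope, mean_y - slope * mean_x

def time_training(x, y, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Record the fitting time for growing portions of the data."""
    timings = {}
    for fraction in fractions:
        cut = max(2, int(len(x) * fraction))
        start = time.perf_counter()
        fit_simple(x[:cut], y[:cut])
        timings[fraction] = time.perf_counter() - start
    return timings

# Synthetic stand-in for about 20 years of weekly observations (1040 weeks).
x = [float(week) for week in range(1040)]
y = [2.0 * value + 1.0 for value in x]
timings = time_training(x, y)
for fraction, seconds in timings.items():
    print(f"{fraction:.0%} of data: {seconds:.6f} s")
```

Single measurements of this kind are noisy, so averaging several repetitions per data portion gives a more reliable comparison between methods.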

Conclusions
Intelligent stock exchange prediction is an important aspect of business investment plans. Non-linear and complex factors make it difficult to predict stock exchange indices. We propose a regression-based model to predict stock exchange indices. The proposed model is trained on 20 years of historical data for 4 stock exchange markets: NASDAQ, New York, London, and Karachi. The model is also evaluated on the top 3 companies: Microsoft, Apple, and Google. The results show that the forecasts at different steps ahead are very close to the original data. Time series forecasts of future values are also presented in the paper. The correlation between different stock markets is also calculated and presented. The results show that there is effectively no correlation between the KSE and NASDAQ, London or New York, while these other 3 stock exchanges share a positive correlation with each other. The highest correlation, 0.829, is between NASDAQ and the NYSE. In the future, we plan to design a hybrid deep learning based model for stock exchange prediction.
For the NYSE, the prediction in Figure 2 is done on "Close" for 12 steps ahead. The complete dataset covers 20 years; we trained our model on 70% of the data and tested it on the remaining 30%. The figure shows even better prediction results on the test data: there is only a small deviation from the original data, and the results are almost the same as on the training data. The experiment is also carried out for 1, 6 and 12 step ahead predictions, and the results are presented in Figure 2. The first figure shows the results on the training data for the original data and the 1 step, 6 step, and 12 step predictions, and these results are extremely good; the predictions are so close that the original and predicted values nearly coincide. The results on the testing data for the 1, 6 and 12 step predictions are shown in Figure 2c and are excellent too. There are very small deviations from the original data for the 1, 6 and 12 step forecasts, and all the predictions are very similar to the original data. In the last step, future values are predicted for the NYSE: the next 12 steps are predicted, which can help investors examine the future patterns of the stock market.

Figure 2. (a) New York stock 12 steps prediction on test data (b) New York stock 1, 6 and 12-step prediction on training data (c) New York stock 1, 6 and 12-step prediction on test data.

For the LSE 1, 6 and 12 step ahead predictions, the first figure shows the results on the training data for the original data and the 1, 6 and 12 step predictions, and these results are extremely good; the predictions are so close that the original and predicted values nearly coincide. The results on the testing data for the 1, 6 and 12 step predictions are shown in Figure 3c and are also good. There are very small deviations from the original data for the 1, 6 and 12 step forecasts, and these deviations are almost the same for all three. In the last step, future values are predicted for the LSE: the next 12 steps are predicted, which can help investors examine the future patterns of the stock market.

Figure 3. (a) London stock 12 steps prediction on test data (b) London stock 1, 6 and 12-step prediction on training data (c) London 1, 6 and 12-step prediction on test data.
Figures 5-7 show the results for Microsoft, Apple, and Google (14 years maximum available data), respectively, for 12 step prediction. The results are excellent for Microsoft and Apple and show some minor deviations for Google. The experiment is also carried out for 1, 6 and 12 step ahead predictions, and the results are presented in the figures for 1, 6 and 12 steps on training and testing data respectively.

Figure 7. (a) Google 12 steps stock prediction on test data (b) Google stock 1, 6 and 12-step prediction on training data (c) Google stock 1, 6 and 12-step prediction on test data.

Figure 8. (a) Correlation between KSE and NASDAQ (b) Correlation between KSE and New York (NY) (c) Correlation between KSE and London Stock Exchange (LSE) (d) Correlation between NASDAQ and LSE (e) Correlation between NY and LSE (f) Correlation between NY and NASDAQ.

Figure 9. (a) Average Computational Training Times using Linear Regression and Support Vector Machines (SVM)-Regression for Stock Markets (b) Average Computational Testing Times using Linear Regression and SVM-Regression for Stock Markets (c) Average Computational Training Times using Linear Regression and SVM-Regression for Companies (d) Average Computational Testing Times using Linear Regression and SVM-Regression for Companies.

Table 1. Attributes of Yahoo Finance dataset.

Table 2. Training data information.

Table 3. Augmented Dickey-Fuller (ADF) test for stationarity of stock data.

Table 6. Correlation results for stock markets.