Exploring the Relationship among Predictability, Prediction Accuracy and Data Frequency of Financial Time Series

In this paper, we aim to reveal the connection between the predictability and the prediction accuracy of stock closing price changes at different data frequencies. To find out whether data frequency affects predictability, a new information-theoretic estimator P_lz, derived from the Lempel-Ziv entropy, is proposed here to quantify the predictability of five-minute and daily price changes of the SSE 50 index from the Chinese stock market. Furthermore, the prediction method EEMD-FFH that we proposed previously is applied to evaluate whether financial data with a higher sampling frequency lead to higher prediction accuracy. It turns out that intraday five-minute data are both more predictable and predicted more accurately than daily data, suggesting that the data frequency of stock returns affects both its predictability and its prediction accuracy, and that higher frequency data have higher predictability and higher prediction accuracy. We also perform linear regression for the two frequency data sets; the results show that predictability and prediction accuracy are positively related.


Introduction
As the most essential task of financial market analysis, price analysis has received more and more attention, even though support for the strong version of the efficient market hypothesis (EMH) [1][2][3] has decreased since the 1980s [4,5]. If the EMH were fully relevant to reality, a market would be very unpredictable, because investors could digest any new information instantly [6,7]. However, new evidence challenges the EMH with many empirical facts from observations, e.g., the leptokurtosis and fat tails of non-Gaussian return distributions, and especially the fractal market hypothesis (FMH) [8,9]. In addition, Beben and Orlowski [10] and Di Matteo et al. [11][12][13] found that emerging markets were likely to have a stronger degree of memory than developed markets, suggesting that emerging markets have a larger possibility of being predicted.
Traditionally, econometricians and econophysicists have been interested in the predictability of price changes, both in principle and in practice. The notion of predictability of a time series can be explained by the memory effects of its past values. Using entropy to measure the degree of randomness and the predictability of a series has been a topic for a long time; it goes back almost to the very beginning of the development of communication and information theory.
In this paper, we propose a new information-theoretic predictability estimator P_lz for financial time series, derived from the Lempel-Ziv estimator [14][15][16]. P_lz quantifies how much the past values reduce the uncertainty of the forthcoming values in a time series. We then use the prediction method EEMD-FFH [17] to explore the connection between the predictability and the prediction accuracy of financial time series.
Entropy and Entropy Rate

The Shannon entropy of a random variable x_t is defined as

H(x_t) = − Σ_{x_t ∈ Θ} p(x_t) log2 p(x_t),

where p(x_t) is the probability distribution of x_t and Θ is the sample space. Shannon also introduced the entropy rate, which generalizes the notion of entropy. For a stochastic process X = {X_t}, t = 1, 2, · · · , T, the entropy rate is given by

H(X) = lim_{T→∞} (1/T) H(X_1, X_2, · · · , X_T).

The right side can be interpreted as the average entropy of each random variable in the stochastic process. If the process satisfies the stationarity condition, the entropy rate can also be expressed as a conditional entropy,

H(X) = lim_{T→∞} H(X_T | X_1, X_2, · · · , X_{T−1}),

which denotes the uncertainty in a quantity at time T after having observed the complete history up to that point.
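As a minimal illustration of the Shannon entropy definition above, the plug-in entropy of a discrete-valued series can be computed directly from empirical frequencies (a sketch; the function name is ours):

```python
import numpy as np

def shannon_entropy(series):
    """Plug-in Shannon entropy (in bits) of a discrete-valued series."""
    _, counts = np.unique(series, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A balanced binary sequence has entropy of 1 bit per symbol.
x = np.array([0, 1] * 500)
H = shannon_entropy(x)
```

For a constant series the same function returns 0, matching the intuition that a deterministic signal carries no uncertainty.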

Entropy Rate Estimation
Entropy rate estimation has received increasing attention over the last 10 years, since the real entropy is known in very few isolated applications; one of the main reasons is the crucial practical importance of information-theoretic techniques in the neurosciences. Entropy rate estimators can be classified into two categories [20]:
i. The "plug-in" (also called maximum-likelihood) technique and its modifications. The main principle of these methods is to compute the empirical frequencies of different patterns in the data, and then calculate the entropy of the empirical distribution. Due to the cost of calculation and limits on the data size, the "plug-in" method cannot reveal signals with long-term time dependency.
ii. Estimators based on data compression methods, such as Lempel-Ziv (LZ) [14][15][16] and context-tree weighting (CTW) [21,22]. This kind of approach speeds up convergence and performs better at capturing long-term time dependency.
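The "plug-in" idea in category i. can be sketched in a few lines: count the empirical frequencies of length-k blocks and divide the entropy of that empirical distribution by k (the function name and the choice k = 3 are ours, for illustration only):

```python
import numpy as np
from collections import Counter

def plug_in_entropy_rate(symbols, k=3):
    """Estimate the entropy rate (bits/symbol) from the empirical
    frequencies of length-k blocks -- the 'plug-in' method."""
    blocks = [tuple(symbols[i:i + k]) for i in range(len(symbols) - k + 1)]
    counts = Counter(blocks)
    n = sum(counts.values())
    p = np.array([c / n for c in counts.values()])
    return -np.sum(p * np.log2(p)) / k

# i.i.d. uniform binary symbols: the true entropy rate is 1 bit/symbol.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20000)
h = plug_in_entropy_rate(x, k=3)
```

Note the limitation mentioned above: capturing dependencies longer than k requires exponentially more data, which is why compression-based estimators are preferred for long-range structure.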
In this study, we use two estimators, one from each category: the entropy difference [18], belonging to the "plug-in" class, and the new estimator we propose, P_lz, belonging to the Lempel-Ziv class.

D_norm Predictability Estimator
Consider a time series X = {x_t}, t = 1, 2, · · · , T. The entropy rate at time t for a stationary process is defined as H[x_t | x_1, x_2, · · · , x_{t−1}]. We assume that the underlying system can be approximated by a p-order Markov process, so that the value at the current moment is only related to the previous p moments. Hence, we can simplify the entropy rate:

H[x_t | x_1, x_2, · · · , x_{t−1}] ≈ H[x_t | x_{t−p}, · · · , x_{t−1}].

Taking the past values into account cannot increase the uncertainty of the time series; therefore

H[x_t | x_{t−p}, · · · , x_{t−1}] ≤ H[x_t].

Now we can define the entropy difference (ED) as D, which is the difference between the entropy and the entropy rate and is non-negative:

D = H[x_t] − H[x_t | x_{t−p}, · · · , x_{t−1}] ≥ 0.
The right side can be interpreted as the contribution of the past values to reducing the uncertainty at time t. If the underlying process is a random walk, then D = 0; that is to say, the past values provide no information about the current time. 0 < D ≤ H[x_t] indicates that the process has temporal autocorrelation; thus, the past values can help to improve the predictability at time t. Since the lower and upper bounds of D are fixed, we can normalize it:

D_norm = D / H[x_t], 0 ≤ D_norm ≤ 1.

D_norm measures the predictability of a time series. When D_norm tends to 0, the time series is unpredictable.
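A minimal sketch of the entropy-difference idea for the first-order case p = 1, using plug-in estimates (the helper names are ours; this is an illustration under our own simplifying assumptions, not the authors' exact implementation):

```python
import numpy as np
from collections import Counter

def entropy(p):
    """Entropy in bits of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def d_norm(symbols):
    """D_norm = (H[x_t] - H[x_t | x_{t-1}]) / H[x_t], under a
    first-order Markov approximation (p = 1)."""
    x = np.asarray(symbols)
    _, counts = np.unique(x, return_counts=True)
    H1 = entropy(counts / counts.sum())
    pairs = Counter(zip(x[:-1], x[1:]))
    pj = np.array(list(pairs.values()), dtype=float)
    pj /= pj.sum()
    H_joint = entropy(pj)        # H[x_{t-1}, x_t]
    H_cond = H_joint - H1        # H[x_t | x_{t-1}]
    return (H1 - H_cond) / H1 if H1 > 0 else 0.0

# Perfectly periodic series: the past determines the present, D_norm = 1.
periodic = [0, 1] * 1000
# i.i.d. coin flips: the past carries no information, D_norm ≈ 0.
rng = np.random.default_rng(1)
iid = rng.integers(0, 2, size=20000)
```

The two test cases mirror the repeated-pattern and pure-random rows of the numerical simulation discussed below.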
When D_norm ≈ 1, the time series can be predicted completely at time t. We carried out three numerical simulations applying D_norm to different kinds of time series. The data size was 10,000 points, and the results are shown in Table 1. The first row gives the entropy and D_norm of the deterministic time series 1, 1, . . . , 1. The simulation results are consistent with intuition: the uncertainty is 0 and the predictability is 1. The results for the repeated pattern indicate that its entropy reaches the upper bound of 1, yet the underlying time series can be predicted completely once past values are considered. For the pure random series, the predictability equals 0; that is, we cannot predict a purely random time series even if we consider its past values. The result depends on three factors: the sample size, the efficiency of the estimator and the quality of the random generator. Hence it is easy to understand why the estimated entropy is 0.9997 rather than exactly 1.

P_lz Predictability Estimator

After Kolmogorov in 1965 defined the complexity of a sequence as the size of the minimum binary code that produces it [23], complexity has been widely used to estimate the entropy rate. In 1977, Abraham Lempel and Jacob Ziv designed a practical algorithm, Lempel-Ziv [16], to measure complexity in the Kolmogorov sense, which can also identify the randomness of a time series. On this basis, many entropy rate estimators were derived. One of them was proposed by Kontoyiannis in 1998 [24] (denoted H_lz in the following). H_lz is widely used, and has been proved to have good statistical properties and better practical performance than other Lempel-Ziv estimators [24].
Consider a time series X = {x_t}, t = 1, 2, · · · , T. The H_lz estimator is defined as

H_lz = T log2 T / (Σ_{i=1}^{T} L_i),

where T is the size of the underlying time series and L_i is the length of the shortest substring starting from time i that does not appear as a contiguous substring among the prior values. It has been proved that H_lz converges to the entropy rate with probability one as T approaches infinity for a stationary ergodic Markov process.
We calculated the estimator H_lz for simulated time series generated uniformly at random from the alphabet {1, 2, 3, 4, 5, 6, 7, 8} with different data sizes T. The theoretical entropy rate of such a series is log2 8 = 3 bits. As shown in the left part of Figure 1, H_lz converges to this theoretical value as the data size T tends to infinity. In our study, we propose a new predictability estimator P_lz, derived from the estimator H_lz and defined as

P_lz = 1 − H_lz / Ŝ,

where Ŝ = log2 S and S is the number of distinct states of the symbolized data. The estimator H_lz is thereby normalized into the interval [0, 1]. We will introduce the detailed discretization method and its necessity in Section 3.
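A sketch of the H_lz and P_lz estimators under the match-length definition above (the symbol-to-character mapping and the truncation of L_i at the end of the series are our simplifying assumptions):

```python
import numpy as np

def lz_entropy_rate(symbols):
    """Lempel-Ziv entropy-rate estimate, H_lz = T*log2(T) / sum_i L_i,
    where L_i is the length of the shortest substring starting at i
    that does not occur among the prior values. L_i is truncated at
    the end of the series (a boundary convention of this sketch)."""
    # Map each distinct symbol to one character so substring search is fast;
    # assumes a small alphabet.
    alphabet = {v: chr(33 + j) for j, v in enumerate(dict.fromkeys(symbols))}
    s = ''.join(alphabet[v] for v in symbols)
    T = len(s)
    total = 0
    for i in range(T):
        k = 1
        while i + k <= T and s[i:i + k] in s[:i]:
            k += 1
        total += k
    return T * np.log2(T) / total

def p_lz(symbols):
    """Predictability P_lz = 1 - H_lz / log2(S), S = number of states."""
    S = len(set(symbols))
    if S < 2:
        return 1.0  # a constant series is trivially predictable
    return 1.0 - lz_entropy_rate(symbols) / np.log2(S)

rng = np.random.default_rng(0)
random_series = rng.integers(0, 2, size=1000).tolist()
periodic_series = [0, 1] * 500
```

On these toy inputs the periodic series scores near 1 and the random series near 0, as the definition intends; the estimator's slow convergence means the random-series value fluctuates around 0 for moderate T.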

EEMD-FFH Prediction Algorithm
In this paper we will use a particular method, EEMD-FFH [17], to find out whether the predictability of a time series is related to the prediction accuracy of a concrete forecasting algorithm.
The EEMD method, proposed by Huang et al. [27], decomposes a time series into a series of intrinsic mode functions (IMFs) and one residue, and has been widely used in various industries [25,26]. Based on the EEMD model, the hybrid prediction model EEMD-FFH [17,28] integrates MKNN (for predicting high-frequency IMFs), ARIMA (for predicting low-frequency IMFs) and quadratic regression (for the residue).
The operation steps of EEMD-FFH are as follows:
Step 1. Decompose the time series X(t) via EEMD.
Step 2. Use different models to predict the IMFs of different frequencies.
Step 3. Sum up the results to obtain the prediction value X̂(t).
In the experimental section of this article, we use this algorithm to predict daily financial data and five-minute high-frequency financial data.
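Since EEMD itself requires a dedicated implementation (e.g. the PyEMD package), the three steps above can only be sketched here with stand-in components: a moving-average split in place of the EEMD decomposition, and persistence/linear-trend models in place of MKNN, ARIMA and quadratic regression. All function names are ours and the models are deliberately simplistic:

```python
import numpy as np

def decompose(x, window=10):
    """Stand-in for Step 1: split the series into a smooth low-frequency
    component (edge-padded moving average) and a high-frequency
    remainder. A faithful implementation would use EEMD to obtain
    the full set of IMFs."""
    x = np.asarray(x, float)
    pad = (window // 2, window - 1 - window // 2)
    padded = np.pad(x, pad, mode='edge')
    low = np.convolve(padded, np.ones(window) / window, mode='valid')
    return x - low, low

def predict_next(x):
    """Steps 2-3: predict each component with a simple stand-in model
    and sum the component forecasts."""
    high, low = decompose(x)
    high_hat = high[-1]              # naive persistence for the fast part
    t = np.arange(len(low))
    a, b = np.polyfit(t, low, 1)     # linear trend for the slow part
    return high_hat + (a * len(low) + b)
```

The point of the sketch is purely structural: decompose, forecast each component with a model suited to its frequency content, then add the forecasts back together.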

Numerical Simulation
In this section, we consider a nonlinear system, the logistic map, to test the two predictability estimators mentioned above. Chaos in dynamical systems has been investigated over a long period of time. With the advent of fast computers, numerical investigations of chaos have increased considerably over the last two decades, and by now a lot is known about chaotic systems. One of the simplest and most transparent systems exhibiting an order-to-chaos transition is the logistic map [29], a discrete dynamical system defined by

x_{t+1} = p x_t (1 − x_t), 0 ≤ x_t ≤ 1.

Thus, given an initial value (seed) x_0, the series x is generated; the subscript t plays the role of discrete time. The behavior of the series as a function of the parameter p is interesting, and a thorough investigation of the logistic map has already been done [29]. Here, without going into a detailed discussion, we simply note that:
• The logistic map has x = 0 and x = (p − 1)/p as fixed points.
• For p < 1, x = 0 is an attractive (stable) fixed point: for any value of the seed x_0 between 0 and 1, x_t approaches 0 exponentially.
• For 1 < p < 3, x = (p − 1)/p is the attractive fixed point.
• For 3 < p < 3.56995, the logistic map shows interesting behavior such as repeated period doubling, the appearance of odd periods, etc.
• Most values of p beyond 3.56995 exhibit chaotic behavior.
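The map and its regimes listed above can be reproduced in a few lines (the helper name is ours; the parameter values follow the text):

```python
import numpy as np

def logistic_series(p, x0=0.5, n=100000):
    """Iterate the logistic map x_{t+1} = p * x_t * (1 - x_t)."""
    x = np.empty(n)
    x[0] = x0
    for t in range(n - 1):
        x[t + 1] = p * x[t] * (1 - x[t])
    return x

# p = 2.5 lies in the stable regime and converges to the fixed
# point (p - 1)/p = 0.6; p = 3.7 lies in the chaotic regime.
stable = logistic_series(2.5, n=10000)
chaotic = logistic_series(3.7, n=10000)
```

Iterating the stable case long enough lands exactly on the fixed point, while the chaotic case keeps fluctuating over a broad range of [0, 1].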
Here, we set p = 3.7 and the data length N = 10^5, with initial value x_0 = 0.5. As only one equation describes the logistic map, x_t exchanges no information with other variables. We added Gaussian white noise of different strengths to the original time series x_t to obtain a composite time series, y_t = x_t + λ ε_t, where ε_t is Gaussian white noise with zero mean and unit variance and λ ≥ 0 is a parameter that tunes the strength of the noise. Thus y_t is the real signal x_t corrupted by the external noise ε_t, and λ determines the signal-to-noise ratio: the larger the λ, the smaller the signal-to-noise ratio.
We used k-means clustering to discretize the noise-corrupted data into b distinct clusters, where b is a pre-defined parameter that determines the number of clusters. In Figure 2, we show the values of D_norm and P_lz for different numbers of bins b = 5, 6, · · · , 14, with the noise strength parameter λ running from 0.01 to 0.1 in steps of 0.01. The result indicates that the predictability of the time series decreases with increasing λ, as the signal-to-noise ratio becomes lower. D_norm and P_lz reach values close to 0.1 when λ = 0.1, so it is hard to predict the composite time series once enough Gaussian white noise is added. Moreover, the predictability of the logistic map shows no obvious dependence on the number of bins. This is consistent with [30], which found that the choice of bins is largely irrelevant to the estimation results. Here, the parameter b for the k-means clustering was 10; this experimental setup has been proven to be very efficient at revealing the randomness of the original data [31].
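The k-means symbolization step can be sketched with a plain one-dimensional Lloyd's iteration (a toy version of ours, for illustration; scikit-learn's KMeans would serve equally well in practice):

```python
import numpy as np

def kmeans_symbolize(x, b=10, iters=50, seed=0):
    """Discretize a real-valued series into b symbols via 1-D k-means:
    alternate assigning points to the nearest center and moving each
    center to the mean of its assigned points."""
    x = np.asarray(x, float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(x, size=b, replace=False)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(b):
            if np.any(labels == j):  # skip empty clusters
                centers[j] = x[labels == j].mean()
    return labels

rng = np.random.default_rng(42)
noisy = rng.normal(size=5000)
symbols = kmeans_symbolize(noisy, b=5)
```

The resulting integer labels are the symbolized series fed to the entropy estimators.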
From the above numerical simulations, we conclude that the two estimators perform well in estimating the randomness, or predictability, of a system (the estimated predictability of the time series decreases with increasing λ), so we proceeded to the following experiments on real financial data.

Data and Stock Selection
In order to assess whether five-minute high-frequency financial data or daily financial data have a larger possibility of being predicted, we estimated the entropy rate of the closing prices of the stocks that make up the SSE (Shanghai Securities Exchange) 50 index. This index uses scientific and objective methods to select the 50 most representative stocks in the Shanghai stock market, with large scale and good liquidity, as sample stocks, so as to comprehensively reflect the overall situation of the most influential group of leading enterprises in Shanghai's stock market. The data were obtained from http://www.10jqka.com.cn/ and were up to date as of 13 January 2019, going back 10 years. Only complete records, i.e., five-minute and daily data with valid values for all 50 stocks, were admitted; invalid values were filtered out. In principle, non-adjacent data may become adjacent because of this procedure, but the relatively small number of invalid values compared to valid values prevents a statistically significant impact [32]. The original closing prices of the SSE 50 stocks cannot reasonably be assumed stationary, a property essential for the validity of the forthcoming analysis. A classical solution to this problem is to define new variables that can be considered stationary, or at least asymptotically stationary [33]. The usual transforms for a raw time series X = {x_t}, t = 1, 2, · · · , T, are the returns, r_t = x_{t+1} − x_t, and the log-returns, r_t = log x_{t+1} − log x_t. The choice of variable does not affect the outcome of the present work; in fact, in the high-frequency regime they are approximately identical or proportional to each other [33]. We will use the log-returns in the forthcoming analysis.
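The log-return transform is a one-liner (the price values below are illustrative, not data from the paper):

```python
import numpy as np

def log_returns(prices):
    """Transform a (non-stationary) price series into log-returns,
    r_t = log(x_{t+1}) - log(x_t)."""
    prices = np.asarray(prices, dtype=float)
    return np.diff(np.log(prices))

p = np.array([100.0, 101.0, 100.5, 102.0])
r = log_returns(p)
```

A convenient property is that log-returns add over time: the sum of the series equals the log-return over the whole interval.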
The usual quantity employed to characterize the fluctuations in financial data is the so-called volatility, here defined as the average of the absolute log-returns over a time window,

vol_Δ(t) = (1/Δ) Σ_{t'=t}^{t+Δ−1} |r(t')|,

where the parameter Δ refers to the chosen length of the time window and τ (in our case always τ = 1 day or 5 min) denotes the basic time scale. The average values of the log-returns over the whole 50 stocks are ŝ_1day(t) ≈ ±1 × 10^−4 and ŝ_5min(t) ≈ ±1 × 10^−5, while the absolute log-returns, also interpretable as estimates of the daily and five-minute volatility, have mean values on the order of 6 × 10^−4. However, as is widely known, the strength of fluctuations in financial data is subject to long-term correlated oscillations. Still, in concordance with other authors [33], we assume a sufficiently long financial time series to be asymptotically stationary, i.e., leading to relevant results for the long-term statistical properties of the analyzed data. The distributions of the Shannon entropy for daily data and five-minute high-frequency data are shown in Figure 3, where µ and σ represent the mean and standard deviation respectively. The results show that the entropy of five-minute closing prices is lower than that of daily closing prices. This is not very surprising, since high entropy has been observed even for larger time scales [34]. We considered 20 stocks: the five highest-entropy and the five lowest-entropy stocks of the daily data, and the same selection for the five-minute high-frequency data. In order to eliminate the influence of multiple scales on the entropy calculation and stock selection, we used a coarse-graining algorithm, amplifying in different proportions (ranging from 0 to 20). Then we calculated the average value and the median value over the coarse-grained dataset to choose the high-entropy and low-entropy stocks. The stocks selected by average value and by median value were identical. After removing overlapping stocks, we obtained 20 stocks; the detailed calculation results are shown in Table 2.
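Under one reasonable reading of the windowed definition above, volatility is a moving average of absolute log-returns (the function name and default window are our assumptions):

```python
import numpy as np

def volatility(returns, delta=20):
    """Volatility as the moving average of absolute log-returns over a
    window of length delta; output has len(returns) - delta + 1 points."""
    r = np.abs(np.asarray(returns, dtype=float))
    kernel = np.ones(delta) / delta
    return np.convolve(r, kernel, mode='valid')
```

For a series of constant-magnitude returns the estimate reproduces that magnitude exactly, which is a quick sanity check on the window normalization.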
These 20 stocks are considered in the experiments in the next sections. To give evidence that the raw time series are not random, we compared the entropy of the raw time series with the entropy of randomly shuffled variants of the original data, a procedure also called surrogate testing. Such preprocessing destroys all potential correlations in the original time series; 100 shuffled time series were generated for each raw time series (before the homogeneous partitioning) and their average entropy was measured. The distributions of the shuffled data differ from those of the original time series, and the average entropy is much larger, as can be seen in Figure 4. This provides evidence that there are temporal dependencies in the data we analyzed, and it makes sense for us to calculate their degrees of predictability.

Figure 4. Distribution of the shuffled-data entropy of the stocks that make up the SSE 50 index (the left panel shows log-returns of shuffled daily closing prices, the right panel log-returns of shuffled five-minute closing prices). In every histogram, a normal distribution with the same mean and standard deviation is plotted. The entropy of the surrogate time series is much larger than that of the raw data shown in the middle box-plots.
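The surrogate test can be sketched as follows: shuffle the series to destroy temporal correlations and compare entropies; a correlated series should score below its shuffled surrogates. We use a pair-block plug-in entropy for illustration (helper names and the pair-block choice are ours):

```python
import numpy as np
from collections import Counter

def block_entropy_rate(x, k=2):
    """Plug-in entropy rate (bits/symbol) from length-k block frequencies."""
    blocks = Counter(tuple(x[i:i + k]) for i in range(len(x) - k + 1))
    p = np.array(list(blocks.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p)) / k

def surrogate_entropy(x, n_surrogates=100, seed=0):
    """Average block-entropy rate over randomly shuffled copies of x.
    Shuffling destroys temporal correlations, so correlated data
    should score lower than its surrogates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    vals = [block_entropy_rate(rng.permutation(x)) for _ in range(n_surrogates)]
    return float(np.mean(vals))

# A strongly autocorrelated series (long runs of 0s and 1s):
corr = np.repeat(np.arange(50) % 2, 40)
```

Comparing `block_entropy_rate(corr)` with `surrogate_entropy(corr)` reproduces the qualitative finding of Figure 4: the surrogates have markedly larger entropy.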

Estimating the Predictability of Different Frequency Time Series Based on D_norm and P_lz
In this section, we use the 20 stocks listed in Table 2 to calculate the predictability of the daily data and the five-minute high-frequency data respectively, based on D_norm and P_lz. The main question asked in this paper is whether daily price changes are more or less predictable than intraday (five-minute high-frequency) price changes. We use two predictability estimators to make the experiment more credible; the two estimators themselves are not compared in this paper.
We divide the 20 stocks obtained in Section 4.1 into four groups of five stocks each, as shown in Figure 5. Then for every group we calculate the predictability of the daily data and the five-minute high-frequency data respectively, based on D_norm and P_lz. In Figure 6, the left panels show the predictability of every group based on D_norm, and the right panels show the predictability of every group based on P_lz.

Figure 5. The 20 selected stocks obtained in Section 4.1, assigned to four groups to test whether daily price changes are more or less predictable than intraday (five-minute high-frequency) price changes.

Figure 6. The predictability of daily data and five-minute high-frequency data based on the predictability estimators D_norm and P_lz. Left panels show the D_norm results for groups 1-4, respectively; right panels show the P_lz results for groups 1-4, respectively.

Surprisingly, for every stock of every group the predictability of the five-minute high-frequency data is obviously much larger than that of the daily data, which means that five-minute high-frequency price changes are more predictable than daily price changes. The experimental results strongly suggest that the predictability of a time series is related to the frequency of the data itself. This raises another question: is the predictability of a time series related to the prediction accuracy of a particular algorithm? The next section focuses on this question.

Comparing the Prediction Accuracies of Different Frequency Time Series Based on EEMD-FFH
In the last section, the experimental results strongly suggested that five-minute high-frequency price changes are more predictable than daily price changes. Is it then possible that high-frequency time series, which are more predictable, also have higher prediction accuracy? In this section, we describe another experiment based on the EEMD-FFH algorithm to explore this.
In order to assess the performance of EEMD-FFH on data of different frequencies, we use the root mean squared error (RMSE) as an indicator:

RMSE = sqrt( (1/n) Σ_{t=1}^{n} (x_t − x̂_t)^2 ),

where x_t represents the raw data, x̂_t represents the predicted value, and n is the number of prediction points; a smaller RMSE means higher accuracy. Table 3 tabulates the average RMSE of the five stocks in every group. The last 200 points of every time series were predicted, and the set containing these 200 points was our testing set. For every group we can see that the five-minute high-frequency financial data have higher prediction accuracy than the daily data. We also show the RMSE of every group in Figure 7, where the obvious difference is more intuitive. To show the relationship between predictability and prediction accuracy, we conducted a correlation analysis: we calculated the Pearson and Spearman correlation coefficients between the predictability and the RMSE at the two frequencies, as shown in Table 4. These results reveal that a connection between these two concepts exists, and that they are negatively correlated. To explore the relationship further, we also performed linear regression for the two frequency data sets (each set includes the 20 stocks of the 4 groups). In this linear regression model, the predictability is the independent variable and the RMSE is the dependent variable. The scatter plots and fitted lines are shown in Figure 8. Every regression line shows that predictability and RMSE are negatively correlated; since a high RMSE denotes low prediction accuracy, this means that predictability and prediction accuracy are positively related.

Figure 8. The scatter plots and regression lines of daily data (left panels) and five-minute high-frequency data (right panels), based on the predictability estimators D_norm and P_lz. Every regression line shows that predictability and RMSE are negatively correlated; that is to say, predictability and prediction accuracy are positively related.
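The RMSE and the Pearson coefficient used in the correlation analysis are straightforward to compute (the sample predictability/error values below are illustrative, not the paper's results):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error; lower means higher accuracy."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def pearson(a, b):
    """Pearson correlation coefficient of two samples."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)))

# If predictability and RMSE are negatively correlated, then
# predictability and accuracy (= low RMSE) are positively related.
predictability = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
errors = np.array([0.9, 0.7, 0.6, 0.4, 0.2])  # illustrative values
```

On these illustrative values `pearson(predictability, errors)` is exactly −1, the perfectly negative relationship that the fitted regression lines approximate on the real data.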

In statistics, the predictability of a time series belongs to the field of time series analysis, which is distinct from the prediction accuracy achieved by a particular forecasting method. In this experiment, our goal was to see whether the two are related. The analysis results indicate that predictability tracks prediction accuracy closely: the five-minute high-frequency financial data have both higher predictability and higher prediction accuracy than the daily data.

Conclusions
In this paper, we introduced a new information-theoretic predictability estimator P_lz for financial time series, derived from the Lempel-Ziv estimator. P_lz quantifies how much the past values reduce the uncertainty of the forthcoming values in a time series. We limited ourselves to the stocks constituting the SSE 50 index, because they are primary components of the Chinese market, in an experiment exploring whether data frequency affects predictability. The results strongly suggest that five-minute high-frequency price changes are more predictable than daily price changes. Additionally, we used the prediction method EEMD-FFH to examine the connection between predictability and prediction accuracy. The empirical evidence suggests that there is a strong positive relationship between these two concepts; that is, higher frequency data have higher predictability and higher prediction accuracy.
Further studies should be performed to confirm whether these results are robust and valid for other stock markets as well. Another important direction is to determine whether different prediction methods change the finding that a strong positive relationship exists between predictability and prediction accuracy for financial data of different frequencies.