Compression-Based Methods of Time Series Forecasting

Abstract: Time series forecasting is an important research topic with many practical applications. As was shown earlier, the problems of lossless data compression and prediction are very similar mathematically. In this article, we propose several forecasting methods based on real-world data compressors. We consider predicting univariate and multivariate data, and describe how multiple data compressors can be combined into one forecasting method with automatic selection of the best algorithm for the input data. The developed forecasting techniques are not inferior to the known ones. We also propose a way to reduce the computation time of the combined method by using the so-called time-universal codes. To test the proposed techniques, we make predictions for real-world data such as sunspot numbers and some social indicators of Novosibirsk region, Russia. The results of our computations show that the described methods find non-trivial regularities in data, and time-universal codes can reduce the computation time without losing accuracy.


Introduction
The problem of time series forecasting is to estimate the future values of a process from a sequence of its observations. This task is important because it has many practical applications. Examples include predicting future stock prices and air temperature forecasting. Nowadays, there are many different approaches to solving this problem. Classical statistical models such as exponential smoothing and the autoregressive integrated moving average (ARIMA) model are very popular, highly accurate, and relatively easy to use. A detailed description of these methods can be found in [1,2]. Neural networks [3][4][5] are also widely used, especially on large datasets. However, there is no best method for all situations, and the development of new forecasting techniques remains relevant.
This work is based on an information-theoretic approach to time series forecasting. As was shown in [6], the problems of data compression and prediction are closely related, and an asymptotically optimal method for predicting stationary stochastic processes can be based on a universal code (see also [7]). In [8] it was shown how any lossless data compression algorithm can be used to predict finite-alphabet and real-valued time series.
In this paper, we do not focus on asymptotic properties of algorithms and consider using this approach in practical situations. The main contributions of the work are summarized as follows. First, we describe how to use arbitrary data compressors for time series forecasting. Such compressors are a promising tool for forecasting because many of them are implementations of universal codes with numerous modifications to increase the level of compression in real-world situations. It is important to note that modern data compression techniques are based on several different approaches to universal coding: the Burrows-Wheeler Transform (BWT) [9], the Prediction by Partial Matching (PPM) [10] algorithm, the Lempel-Ziv family of algorithms [11,12], among others. Some algorithms [13][14][15] search for a compact context-free grammar that unambiguously represents the sequence to be compressed; one can view this approach as a kind of artificial intelligence. In previous works, either only theoretical possibilities of using universal codes for prediction were described, without practical applications [8], or the use of a single universal code for prediction was presented [7].
Secondly, we propose an adaptive approach to time series forecasting, which is useful in situations when we do not know in advance which data compressor is optimal for a given time series.
Thirdly, we describe how the proposed techniques can be used to predict multivariate data. To test the proposed techniques, we make forecasts for real-world data such as sunspot numbers and some social indicators of Novosibirsk region, Russia.
It should be noted that our approach can be complemented with well-known time series transformation and adjustment techniques. For example, in this work, we use differencing and smoothing in the computational experiments.
The rest of the paper is structured as follows. In the next section, we describe the mathematical foundations of using arbitrary data compression techniques for forecasting finite-alphabet time series. After that, we describe generalizations of the model to real-valued and multivariate cases. Then we give some examples of the practical use of the presented techniques. Next, we propose an adaptive algorithm that can significantly reduce computation time when multiple data compressors are used. Further, we use the proposed algorithm to predict sunspot numbers. At the end of the paper, we describe the limitations of the proposed approach and make some conclusions.

Predicting Finite-Alphabet Time Series
Time series with finite alphabets are most convenient for forecasting using data compression algorithms. Suppose we have a sequence $X = x_1, x_2, \ldots, x_t$, $x_i \in A$, where $A$ is a finite set (an alphabet), and we want to give a prediction for $x_{t+1}, x_{t+2}, \ldots, x_{t+h}$, $h \in \mathbb{Z}^+$.
Denote by $A^n$ the set of all sequences of length $n$ over $A$ and let $A^* = \bigcup_{i=0}^{\infty} A^i$. A uniquely decodable data compression method (or code) $\varphi$ is a set of mappings $\varphi_n : A^n \to \{0,1\}^*$, $n = 1, 2, \ldots$, such that for any sequence of words $x_1, x_2, \ldots, x_m$, $x_i \in A^n$, $m \ge 1$, the sequence $\varphi_n(x_1), \varphi_n(x_2), \ldots, \varphi_n(x_m)$ can be uniquely decoded as $x_1, x_2, \ldots, x_m$. The compressed size (in bits) of a sequence $\alpha$ is denoted $|\varphi(\alpha)|$. We can obtain a probability distribution on $A^n$ from $\varphi$ by
$$P_\varphi(u) = \frac{2^{-|\varphi(u)|}}{\sum_{v \in A^n} 2^{-|\varphi(v)|}}. \qquad (1)$$
A code $\varphi$ is called universal if for any stationary and ergodic measure $P$
$$\lim_{n\to\infty} \frac{1}{n} |\varphi(x_1 \ldots x_n)| = H(P) \text{ with probability 1, and} \quad \lim_{n\to\infty} \frac{1}{n} E(|\varphi(x_1 \ldots x_n)|) = H(P),$$
where $H(P)$ is the entropy rate [16] of $P$ and $E(f)$ is the expected value of $f$. In [8] it was shown that if $P$ is a stationary and ergodic stochastic process and $\varphi$ is a universal code, then (1) is, in a certain sense, a nonparametric estimate of the unknown probability measure $P(X)$. More precisely, the following theorem was proved.
Theorem 1. If $P$ is a stationary ergodic measure and $\varphi$ is a universal code, then the following equalities hold:
$$\lim_{t\to\infty} \frac{1}{t}\left(\log P(x_1, \ldots, x_t) - \log P_\varphi(x_1, \ldots, x_t)\right) = 0 \text{ with probability 1,}$$
$$\lim_{t\to\infty} \frac{1}{t}\, E\!\left(\log \frac{P(x_1, \ldots, x_t)}{P_\varphi(x_1, \ldots, x_t)}\right) = 0.$$
Suppose that $A$ is a finite set of integers and let $P_\varphi(Y \mid X) = P_\varphi(XY) / \sum_{Z \in A^h} P_\varphi(XZ)$ denote the conditional probability of a continuation $Y \in A^h$. The marginal distribution for step $j$ is
$$P_\varphi(x_{t+j} = a \mid X) = \sum_{Y = (y_1, \ldots, y_{j-1}, a, y_{j+1}, \ldots, y_h) \in A^h} P_\varphi(Y \mid X).$$
We can give a point forecast as $\hat{x}_{t+j} = \sum_{a \in A} a\, P_\varphi(x_{t+j} = a \mid X)$, i.e., the mean of the marginal distribution for step $j$.
Example 1. Consider the following sequence: Suppose we want to make a forecast for its next two values using gzip [17] as $\varphi$. Assume that the alphabet $A$ of the underlying process is $\{0, 1\}$. We need to compress every sequence of the form $XZ$, where $Z \in A^2$. The compression results along with the calculated conditional probabilities $P_{\text{gzip}}(Z \mid X)$ are presented in Table 1.
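The computation of Example 1 can be sketched in Python with the standard-library gzip module standing in for $\varphi$. The binary sequence `X` below is an illustrative assumption, since the paper's actual sequence and Table 1 are not reproduced here:

```python
import gzip
from itertools import product

# Hypothetical binary sequence standing in for the paper's X.
X = "0101010101010101"
h = 2  # forecast horizon

def code_len_bits(s: str) -> int:
    # Compressed size in bits; gzip's constant header is the same for
    # every continuation, so it cancels in the normalized ratios.
    return 8 * len(gzip.compress(s.encode()))

# Weight each continuation Z in A^h by 2^{-|phi(XZ)|}, then normalize.
continuations = ["".join(z) for z in product("01", repeat=h)]
weights = {z: 2.0 ** -code_len_bits(X + z) for z in continuations}
total = sum(weights.values())
probs = {z: w / total for z, w in weights.items()}

for z in sorted(probs, key=probs.get, reverse=True):
    print(z, round(probs[z], 4))
```

For a periodic input like the one assumed here, the continuation that extends the period is typically compressed best and therefore receives the highest conditional probability.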
Suppose we have a finite set of data compression algorithms $F = \{\varphi_1, \varphi_2, \ldots, \varphi_k\}$. If we do not know which $\varphi \in F$ is the best predictor for a given series $X$, we can mix the conditional probability distributions yielded by the compressors from $F$:
$$P(Y \mid X) = \sum_{i=1}^{k} \omega_i P_{\varphi_i}(Y \mid X), \qquad (3)$$
where $\omega_i$ are non-negative weights with $\sum_{i=1}^{k} \omega_i = 1$. Note that (3) works in such a way that the highest probability is assigned to the $Y \in A^h$ for which $|\varphi_s(XY)| = \min_{Z \in A^h,\, \varphi \in F} |\varphi(XZ)|$ for some $\varphi_s \in F$, i.e., (3) selects the best compressor $\varphi_s$ "automatically".
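A hedged sketch of such a mixture with three standard-library compressors as the set $F$: the series, the equal weights, and the simplified normalization (over the $h$-step continuations only) are all illustrative assumptions, not the paper's exact computation.

```python
import bz2
import gzip
import lzma
from itertools import product

# Candidate compressors playing the role of F = {phi_1, ..., phi_k}.
compressors = [gzip.compress, bz2.compress, lzma.compress]
w = 1.0 / len(compressors)  # equal weights, as in the experiments

X = "0011001100110011"  # assumed example series
h = 2
continuations = ["".join(z) for z in product("01", repeat=h)]

def bits(compress, s: str) -> int:
    return 8 * len(compress(s.encode()))

# Unnormalized mixed probability: sum over F of w * 2^{-|phi(XZ)|}.
# The compressor with the smallest code length dominates the sum,
# so the mixture "selects" the best compressor automatically.
num = {z: sum(w * 2.0 ** -bits(c, X + z) for c in compressors)
       for z in continuations}
total = sum(num.values())
mixed = {z: num[z] / total for z in continuations}

best = max(mixed, key=mixed.get)
print(best, round(mixed[best], 4))
```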

Predicting Real-Valued Time Series
Many time series that can be found in practice are sequences of real numbers. To predict such a series using data compression algorithms, we need to convert it to a sequence of integers (with loss of information, obviously). This process of conversion is known as quantization. Consider a sequence $X = x_1, x_2, \ldots, x_t$, where $x_i \in \mathbb{R}$. Denote its minimal and maximal elements as $m = \min_{1 \le i \le t} x_i$ and $M = \max_{1 \le i \le t} x_i$, respectively. Probably the simplest way of conversion is to split $[m; M]$ into a finite number $n$ of disjoint numbered intervals $\{q_1, q_2, \ldots, q_n\}$ of equal length and replace each $x_i$ with its corresponding interval number: if $x_i \in q_j$, then replace $x_i$ with $j$. Later, we can perform the inverse conversion, replacing the indices, for example, with the medians of the corresponding intervals.

Now consider the question of how to select the number of intervals $n$. On the one hand, if this value is too small, some important regularities may be missed in the converted series. On the other hand, if $n$ is too large, a data compressor may not be able to capture the regularities due to noise in the data. One possible solution to this problem is to employ an approach similar to that used in (3). Let $n$ be some positive power of 2, so that $l = \log_2 n$ is an integer greater than zero. We can mix the probability distributions obtained using partitions into $2^k$ intervals, $1 \le k \le l$, with some weights; the partition that yields the smallest code length will have the greatest impact on the final result. Denote by $x_i^{[k]}$ the number of the interval that contains $x_i$ in the partition into $2^k$ intervals. Then the mixed probability distribution can be defined as
$$P(x_{t+1} = a \mid X) = \sum_{k=1}^{l} \omega_k P_\varphi\big(x_{t+1}^{[k]} = a^{[k]} \,\big|\, x_1^{[k]}, x_2^{[k]}, \ldots, x_t^{[k]}\big), \qquad (4)$$
where $A_k = \{0, 1, \ldots, 2^k - 1\}$ is the set of interval numbers in partition $k$ and $\omega_k$ are non-negative weights with $\sum_{k=1}^{l} \omega_k = 1$.
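The forward and inverse conversion described above can be sketched as follows; this is a minimal illustration assuming $M > m$ and using interval midpoints (the medians of uniform intervals) for the inverse step:

```python
def quantize(xs, n):
    """Map each value to the index of its interval in a partition of
    [m, M] into n equal-length intervals. Assumes max(xs) > min(xs)."""
    m, M = min(xs), max(xs)
    width = (M - m) / n
    # The maximum M falls exactly on the boundary; clamp it into the
    # last interval instead of producing index n.
    idx = [min(int((x - m) / width), n - 1) for x in xs]
    return idx, m, width

def dequantize(idx, m, width):
    """Inverse conversion: replace each index with the midpoint of
    its interval."""
    return [m + (j + 0.5) * width for j in idx]

xs = [0.0, 1.0, 2.5, 4.0]
idx, m, width = quantize(xs, 4)
print(idx)                          # interval numbers
print(dequantize(idx, m, width))    # reconstructed (lossy) values
```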

Predicting Multivariate Data
In some practical situations, it is required to predict several related series. In such cases, we can try to utilize the connection between them to improve the accuracy of our forecasts. For example, air temperature and barometric pressure are related, and it might make sense to predict them together. Many univariate time series forecasting techniques have generalizations to the multivariate (or vector) case [18][19][20][21].
The approach based on data compression algorithms can be used in the multivariate setting too. Suppose we have a vector time series $\bar{X} = \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_t$, where $\bar{x}_i \in \mathbb{R}^d$. Find the minimal and maximal values for each coordinate: $m_j = \min_{1 \le i \le t} x_{i,j}$ and $M_j = \max_{1 \le i \le t} x_{i,j}$, $1 \le j \le d$. As in the previous section, we split each interval $[m_j; M_j]$ into a finite number $n$ of intervals and thus obtain $n^d$ $d$-cubes. Then we need to number these cubes and replace each point $\bar{x}_i$ with the number of the cube it falls into. As a result, we get an integer sequence that can be predicted using the previously described methods.
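A sketch of the cube numbering, under the assumption that the per-coordinate interval indices are combined as digits of a base-$n$ number (one natural way to enumerate the $n^d$ cubes; the paper does not fix a particular numbering):

```python
def cube_index(point, mins, maxs, n):
    """Map a d-dimensional point to the number of the cube it falls
    into, treating per-coordinate interval indices as base-n digits."""
    idx = 0
    for x, m, M in zip(point, mins, maxs):
        width = (M - m) / n
        j = min(int((x - m) / width), n - 1)  # clamp the maximum
        idx = idx * n + j
    return idx

# Toy 2-dimensional series (an illustrative assumption).
series = [(0.0, 10.0), (1.0, 20.0), (2.0, 30.0), (3.0, 40.0)]
d = 2
mins = [min(p[j] for p in series) for j in range(d)]
maxs = [max(p[j] for p in series) for j in range(d)]

codes = [cube_index(p, mins, maxs, 4) for p in series]
print(codes)  # the integer sequence to be compressed and predicted
```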

Experiments
In this section, we use the proposed method to predict real-world data. Along with the point forecasts (which are the means of the corresponding distributions), we provide 95% confidence intervals for them. To estimate these intervals, we used the following strategy. Suppose that we want to make an $h$-step-ahead prediction for a time series $X = x_1, x_2, \ldots, x_t$. Let $s$ be $\lfloor t/2 \rfloor$. For each series $X_s, X_{s+1}, \ldots, X_{t-h}$, where $X_i = x_1, x_2, \ldots, x_i$, we made an $h$-step forecast $\hat{X}_i^h = \hat{x}_{i+1}, \hat{x}_{i+2}, \ldots, \hat{x}_{i+h}$ and calculated the residuals for each step: $r_j^{(i)} = x_{i+j} - \hat{x}_{i+j}$, $1 \le j \le h$. Then we computed the standard deviation $\sigma_j$ of the residuals at each step $j$ and used $\hat{x}_{t+j} \pm 1.96\,\sigma_j$ as the bounds of the corresponding interval. For example, suppose $X = 1.1, 1.2, 1.3, 1.4, 1.5$ and we want to make a 2-step forecast. As was explained previously, we make 2-step-ahead predictions for the series $X_2 = 1.1, 1.2$ and $X_3 = 1.1, 1.2, 1.3$.
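The interval-estimation strategy above can be sketched as follows. The naive last-value predictor is a stand-in assumption for the compression-based forecaster, and the half-width factor 1.96 corresponds to a 95% interval under a normality assumption on the residuals:

```python
import statistics

def naive_forecast(xs, steps):
    # Placeholder predictor: repeat the last observed value.
    return [xs[-1]] * steps

def interval_widths(xs, h):
    """Half-widths 1.96 * sigma_j of the 95% intervals, estimated from
    per-step residuals over the prefixes X_s, ..., X_{t-h}."""
    t = len(xs)
    s = t // 2
    residuals = [[] for _ in range(h)]
    for i in range(s, t - h + 1):
        pred = naive_forecast(xs[:i], h)
        for j in range(h):
            residuals[j].append(xs[i + j] - pred[j])
    sigmas = [statistics.pstdev(r) for r in residuals]
    return [1.96 * sd for sd in sigmas]

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(interval_widths(xs, 2))
```

For this linear toy series the last-value predictor's residuals are constant at each step, so the estimated widths collapse to zero; with a real predictor and noisy data the widths reflect the per-step forecast error.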
We mixed these compressors with the same weights, that is, $\omega_i = 1/8$ in (3). Several simple data transformation techniques were also used. First, to remove trends in the data, we took the first difference: $y_i = x_i - x_{i-1}$. Secondly, we used smoothing of the series.

Let us move on to forecasting some indicators of Novosibirsk region. In the first example, we predict the annual number of unemployed persons in the region. In Figure 1a, this series is shown along with our 4-step forecast and confidence intervals. Since this series is real-valued, we used partitions of its range of values into 2, 4 and 8 intervals in (4). According to our forecast, in the next four years the indicator will be higher than the current level. In the second example, we predict the average annual monetary income of the population in the region. The results are shown in Figure 1b. The forecast shows that in the next 4 years the trend will continue and by 2023 the indicator will be near 35,000 rubles per year.

Adaptive Method of Forecasting
As noted in Section 2.1, multiple data compression algorithms can be combined into one method of forecasting (almost) without loss of accuracy using (3). The only issue is computation time: we need to compress all sequences with every data compression algorithm that we include in our combination. In this section, we consider a way to significantly reduce computation time while maintaining near-optimal accuracy. This can be achieved using the so-called time-universal codes.

Time-Universal Codes
Suppose we want to compress a sequence $X = x_1, x_2, \ldots, x_n$, where $x_i$ belongs to some finite alphabet $A$. Also suppose we have a finite set of data compression algorithms $F = \{\varphi_1, \varphi_2, \ldots, \varphi_k\}$ and we want to compress $X$ with the $\varphi_s \in F$ that yields the smallest code length. The obvious way to find $\varphi_s$ is to compress $X$ with each $\varphi_i$ and then choose the best one. After that, we need to store $s$ along with $\varphi_s(X)$ in order to be able to decompress the original data. The binary representation $\langle q \rangle$ of any integer $q$ from 0 to $k-1$ requires no more than $\lceil \log_2 k \rceil$ bits, therefore our final code word would be $\langle s \rangle \varphi_s(X)$ with length $\lceil \log_2 k \rceil + |\varphi_s(X)|$ bits. This approach works well but requires additional time proportional to $k$. The goal is to compress $X$ with the best compressor using a relatively small amount of additional time.
Time-universal codes were proposed in [25]; they allow making the portion of extra time asymptotically as small as desired. Let $v_i$ be the time $\varphi_i$ spends on encoding a single letter of $X$, and let $v = \max_{1 \le i \le k} v_i$. Thus, an upper bound on the time needed to encode $X$ using any $\varphi \in F$ is $T = vn$. Denote by $\delta T$ the amount of extra time we have to select a close-to-optimal data compressor from $F$ for $X$, where $\delta$ is a positive constant. The total time of selection and compression is then no more than $T + \delta T = T(1 + \delta)$. Time-adaptive and time-universal codes are defined as follows.
Definition 1. Any method of data compression that encodes a sequence $x_1, x_2, \ldots, x_n$, $n > 0$, $x_i \in A$, by a binary word of length $\lceil \log_2 k \rceil + |\varphi_s(x_1, x_2, \ldots, x_n)|$ for some $\varphi_s \in F$, such that the time of encoding is not greater than $T(1 + \delta)$, is called a time-adaptive code and is denoted $\hat{\Phi}_{\text{compr}}^{\delta}$.

Definition 2. If for a time-adaptive code $\hat{\Phi}_{\text{compr}}^{\delta}$ the following equation is valid:
$$\lim_{n\to\infty} \frac{1}{n}\left(\big|\hat{\Phi}_{\text{compr}}^{\delta}(x_1, \ldots, x_n)\big| - \min_{\varphi \in F} |\varphi(x_1, \ldots, x_n)|\right) = 0,$$
this code is called time-universal.
In [25], the following simple algorithm was proposed and proved to yield a time-universal code: each compressor from $F$ is applied to a prefix of the sequence, whose length is chosen so that the selection stage fits into the extra time $\delta T$; the compressor that gives the smallest code length on the prefix is then used to compress the whole sequence.
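This selection-on-a-prefix scheme can be sketched as follows; the prefix-length rule, the candidate set, and the input data are illustrative assumptions rather than the exact algorithm of [25]:

```python
import bz2
import gzip
import lzma

# Candidate compressors playing the role of F.
compressors = {"gzip": gzip.compress, "bz2": bz2.compress,
               "lzma": lzma.compress}

def time_adaptive_compress(data: bytes, delta: float):
    """Try every compressor on a prefix whose length is scaled by the
    extra-time budget delta (split across the k candidates), then
    compress the whole sequence with the winner."""
    k = len(compressors)
    prefix = data[: max(1, int(len(data) * delta / k))]
    best = min(compressors,
               key=lambda name: len(compressors[name](prefix)))
    return best, compressors[best](data)

data = b"abracadabra" * 200  # assumed highly repetitive input
name, compressed = time_adaptive_compress(data, 0.8)
print(name, len(compressed))
```

The point of the scheme is that the selection stage touches only a fraction of the data, so the total running time stays close to that of a single compressor while the chosen algorithm is close to optimal on the full sequence.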
In this work, we implemented this algorithm and used it to predict real-world time series.

Experiments
In this section, we consider sunspot numbers forecasting. Sunspots are temporary spots on the Sun with reduced surface temperature that appear darker than the surrounding areas. The sunspot number time series is provided by the WDC-SILSO (World Data Center-Sunspot Index and Long-term Solar Observations) [26]. The daily, monthly, yearly and 13-month smoothed monthly total sunspot numbers can be found on the SILSO site http://www.sidc.be/silso/datafiles. Here we used the monthly mean total sunspot number series.
We accessed the SILSO site on 20 June 2020; at that time the series had 3257 observations. The entire series is shown in Figure 2. It has an obvious cyclical component, and it is known that the approximate cycle length is 11 years [27].
The Space Weather Services (SWS) of the Australian Bureau of Meteorology issues forecasts for the WDC-SILSO sunspot numbers and maintains an archive of these forecasts (http://listserver.ips.gov.au/pipermail/ips-ssn-predictions/), so we can compare our forecasts with them. In July 2015, SILSO adjusted the original data, and SWS moved to the new version of the data in its forecasts in February 2016. We started making predictions from that date: for each element with index greater than 3205 (which corresponds to January 2016), before adding it to the end of the series, we made a 4-step forecast. In order to select a (close to) optimal compressor for this series, we tried using from 10% to 100% of its values, with a step of 10%. For instance, when 10% of the values were used, $\delta$ was $8 \times 0.1 = 0.8$, since we had 8 algorithms. In Section 2.4, the maximal number of intervals we considered when quantizing was 8; in this section, we increased this value to 16. In everything else, we used the same methodology as in Section 2.4.
For brevity, let us denote the combination of all 8 algorithms, obtained using (3), as the joint method. Code lengths for different compressors, obtained by compressing the entire series (100% of the values), are presented in Table 2. We can see that by code length zstd is the best compressor for this series, and hence it will have the greatest contribution to the result of the joint method. For 10% through 40% of the series values, the best compressor by code length is ppmd; for 50-100%, zstd becomes the best. The code lengths for ppmd and zstd are shown in Table 3.

To assess the accuracy of our forecasts, we calculated the mean absolute error (MAE), defined as the mean of the absolute values of the residuals, for each step 1-4. The MAEs of ppmd's and zstd's forecasts are presented in Table 4. We can see that ppmd was a bit more accurate than zstd. We also tried using more values for prediction (from 2800 to 3205), and in that case zstd was slightly more accurate for steps 1-3 (the mean absolute errors are presented in Table 5); for step 4, ppmd was more accurate.

We ran the program 5 times using all 8 algorithms and 5 more times using the adaptive algorithm (50% of the values were used to select the best compressor). The average time required to compute one 4-step forecast was 9945.46 s in the former case and 564.63 s in the latter. Thus, we reduced the computation time by more than 17 times without loss of accuracy.

Model Limitations
The presented model has some limitations. First, it is suitable for predicting data with no trend pattern; but, as was shown in Section 2.4, this limitation can be overcome with additional techniques such as differencing or time series decomposition [28]. Secondly, the computational complexity of the model is high: if we make a forecast for $h$ steps ahead, $|A|^h$ sequences have to be compressed. This means the model cannot be directly applied to long-term forecasting or in settings where predictions must be computed very quickly. While sacrificing accuracy, we can still make long-term forecasts using the following approach. Suppose we have a series $X = x_1, x_2, \ldots, x_t$ and we want to predict its next $h$ values. For simplicity, assume that $t$ and $h$ are even numbers (the generalization is obvious). We can split $X$ into two separate time series $X_{\text{odd}} = x_1, x_3, \ldots, x_{t-1}$ and $X_{\text{even}} = x_2, x_4, \ldots, x_t$. Then we predict $x_{t+1}, x_{t+3}, \ldots, x_{t+h-1}$ using $X_{\text{odd}}$ and $x_{t+2}, x_{t+4}, \ldots, x_{t+h}$ using $X_{\text{even}}$. This allows us to reduce the number of sequences to compress from $|A|^h$ to $2|A|^{h/2}$. Clearly, we can go further and split $X$ into more than two series. Another obvious way to speed up computations is to choose a small $n$ when quantizing.
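The odd/even splitting can be sketched as follows; the last-value predictor is a stand-in assumption for the compression-based forecaster, and the interleaving order follows the assignment of steps to subseries described above:

```python
def forecast(xs, steps):
    # Placeholder for the compression-based predictor: repeat the
    # last observed value.
    return [xs[-1]] * steps

def split_forecast(xs, h):
    """Predict h future values by forecasting h/2 steps on the odd-
    and even-indexed subseries and interleaving the results."""
    assert h % 2 == 0 and len(xs) % 2 == 0
    odd = xs[0::2]   # x_1, x_3, ..., x_{t-1} (1-based indexing)
    even = xs[1::2]  # x_2, x_4, ..., x_t
    f_odd = forecast(odd, h // 2)    # x_{t+1}, x_{t+3}, ...
    f_even = forecast(even, h // 2)  # x_{t+2}, x_{t+4}, ...
    out = []
    for a, b in zip(f_odd, f_even):
        out.extend([a, b])
    return out

xs = [1, 2, 3, 4, 5, 6]
print(split_forecast(xs, 4))
```

Each subseries forecast compresses only $|A|^{h/2}$ continuations, which is where the claimed reduction from $|A|^h$ to $2|A|^{h/2}$ comes from.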

Discussion
In our opinion, the results of computations show that the proposed method has good accuracy. The adaptive approach to forecasting can significantly reduce computation time without loss in effectiveness if multiple compressors are used.
It is important to note that some time series can contain complex regularities. For example, the dynamics of financial time series can be influenced by the participants of the corresponding processes, and this can lead to the emergence of subtle patterns in the data. Another example is the time series arising in the study of space objects. Such series can contain various nonstationarities like attenuation and non-periodic changes. Thus, in our opinion, forecasting methods that allow finding unusual patterns may be of interest to many researchers. Patterns of the indicated types that occur in text data can be found by modern archivers. Therefore, we believe that the use of data compressors implicitly allows the use of various techniques of finding patterns, including artificial intelligence methods beyond neural networks.
In further research, a more elaborate strategy for optimal algorithms selection in the adaptive method can be developed. For example, one can use multidimensional optimization techniques for this purpose.

Conclusions
It turns out that from a mathematical point of view, data compression and prediction can be thought of together, allowing ideas from one area to be used in another. From a practical point of view, many of the data compression techniques implemented in software can be used in time series forecasting. Efficient data compression algorithms to be developed in the future can be combined with existing ones using the described approach.

Data Availability Statement: Publicly available datasets were analyzed in this study. The data can be found here: https://novosibstat.gks.ru/folder/31734 (average number of unemployed persons in Novosibirsk region), https://novosibstat.gks.ru/folder/31847 (average annual monetary income of the population in Novosibirsk region), http://www.sidc.be/silso/datafiles (monthly mean total sunspot numbers), http://listserver.ips.gov.au/pipermail/ips-ssn-predictions/ (SWS observed and predicted solar indices).

Conflicts of Interest:
The authors declare no conflict of interest.