Adaptive Online Learning for Time Series Prediction

We study the problem of predicting time series data with the autoregressive integrated moving average (ARIMA) model in an online manner. Existing algorithms require model selection, which is time consuming and ill-suited to the online setting. Using adaptive online learning techniques, we develop algorithms for fitting ARIMA models with as few hyperparameters as possible. We analyse the regret bounds of the proposed algorithms and examine their performance in experiments on both synthetic and real world datasets.


Introduction
An Autoregressive Integrated Moving Average (ARIMA) model, a cornerstone of time series analysis [1][2][3][4][5], specifies that the values of a time series depend linearly on their previous values and error terms. In recent years, online learning methods have been applied to estimating ARIMA models owing to their efficiency and scalability [6-9]. These methods rest on the fact that any ARIMA model can be approximated by a finite dimensional Autoregressive (AR) model, which can be fitted incrementally using online convex optimisation algorithms. However, to guarantee accurate predictions, these methods require a proper configuration of hyperparameters, such as the diameter of the decision set, the learning rate, the order of differencing, and the lag of the AR model. These hyperparameters must either be set according to prior knowledge about the dataset, which is difficult to obtain in practice, or be tuned on a previously collected dataset, which is notoriously expensive and ill-suited to the online setting.
Given a new time series prediction problem, tuning the hyperparameters of the online algorithms can thus negate the benefits of the online setting. To avoid this, we propose new algorithms for learning an ARIMA model online with as few hyperparameters as possible, while their performance can still be guaranteed in both theory and practice. We first add more "flavour of online learning" by considering an adversarial setting for multivariate time series, in which the values of a time series are taken from a vector space and the error terms are generated arbitrarily. We show that any ARIMA model with a fixed order of differencing can be approximated by an AR model of the same order of differencing with a large enough lag. We then propose new algorithms that learn the AR model adaptively without requiring any prior knowledge about the model parameters. For Lipschitz-continuous loss functions, we apply a new algorithm based on the adaptive follow the regularised leader (FTRL) framework [10] and show that it achieves a sublinear regret bound depending on the data sequence and the Lipschitz constant. We give special treatment to the commonly used squared error, which is not Lipschitz continuous: to obtain a data-dependent regret bound, we combine the polynomial regulariser [11] with the adaptive FTRL framework. Finally, to find a proper order and lag of the AR model in an online manner, we simultaneously maintain multiple AR models and apply an adaptive hedge algorithm to aggregate their predictions. In previous attempts [12,13], the exponentiated gradient algorithm was applied directly to aggregating the predictions, which not only requires tuning of the learning rate but also yields a regret bound depending on the loss incurred by the worst model. Our adaptive hedge algorithm is parameter-free and guarantees a regret bound depending on the time series sequence.
In addition to the theoretical analysis, we also demonstrate the performance of the proposed algorithm using both synthetic and real world datasets.
The rest of the paper is organised as follows. In Section 2, we review the existing work on the subject. Then we introduce the notation, learning model and formal description of the problem in Section 3. Next, we present and analyse our algorithms in Section 4. Section 5 demonstrates the empirical performance of the proposed methods. Finally, we conclude our work with some future research directions in Section 6.

Related Work
An ARIMA model can be fitted using statistical methods such as Recursive Least Squares and Maximum Likelihood Estimation, which not only rest on strong assumptions about the noise terms and the data generation [14][15][16], but also require solving non-convex optimisation problems [17]. Although these assumptions can be relaxed [17][18][19][20], the pre-trained models still cannot deal with concept drift [7]. Moreover, retraining is time consuming and memory intensive, especially for large-scale datasets. The idea of applying regret minimisation techniques to Autoregressive Moving Average (ARMA) prediction was first introduced in [6]. The authors propose online algorithms that incrementally produce predictions close to the values generated by the best ARMA model. This idea has been extended to ARIMA(p, q, d) models in [7] by learning an AR(m) model of the higher order differencing of the time series. Further extensions to multiple time series can be found in [8,9], while the problem of predicting time series with missing data has been addressed in [21].
To obtain accurate predictions, the lag of the AR model and the order of differencing have to be tuned, a problem that has been well studied in the offline setting. Some textbooks [15,22,23] recommend Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) for this task. Both require prior knowledge and strong assumptions about the variance of the noise [15], and both are time and space consuming, since they rely on numerical procedures such as cross-validation on previously collected datasets. Nevertheless, given a properly selected lag m and order d, online convex optimisation techniques such as Online Newton Step (ONS) or Online Gradient Descent (OGD) can be applied to fitting the model in the regret minimisation framework [6-9]. However, both algorithms introduce additional hyperparameters for controlling the learning rate and the numerical stability.
The idea of selecting hyperparameters for online time series prediction has been proposed in [12,13]. Regarding online AR predictors with different lags as experts, the authors aggregate over predictors by applying the multiplicative weights algorithm for prediction with expert advice. The proposed algorithm is not optimal for time series prediction, since the regret bound of the chosen algorithm depends on the largest loss incurred by the experts [24]. Furthermore, each individual expert still requires that the parameters be taken from a compact decision set, the diameter of which needs to be tuned in practice. A series of recent works on parameter-free online learning [25][26][27][28] has opened up the possibility of achieving sublinear regret without prior information about the decision set. Given an unconstrained online convex optimisation problem, these algorithms usually relax the problem by assuming a bound on the gradient and guarantee a regret bound depending on it. Unfortunately, a bound on the gradient is an unrealistic assumption for unbounded time series and the squared error loss.
Our idea is based on the combination of the adaptive FTRL framework [10] and the idea of handling relative Lipschitz continuous functions [11], which allows us to devise an online algorithm with a data-dependent regret upper bound. To aggregate the results, we propose an adaptive optimistic algorithm such that the overall regret depends on the data sequence instead of worst case loss.

Preliminary and Learning Model
Let $X_t$ denote the value of a time series observed at time $t$. We assume that $X_t$ is taken from a finite dimensional real vector space $\mathcal{X}$ with norm $\|\cdot\|$. We denote by $L(\mathcal{X}, \mathcal{X})$ the vector space of bounded linear operators from $\mathcal{X}$ to $\mathcal{X}$, and by $\|\alpha\|_{\mathrm{op}} = \sup_{x \in \mathcal{X}, x \neq 0} \frac{\|\alpha x\|}{\|x\|}$ the corresponding operator norm. An AR(p) model is given by
$$X_t = \sum_{i=1}^{p} \alpha_i X_{t-i} + \epsilon_t,$$
where $\alpha_i \in L(\mathcal{X}, \mathcal{X})$ is a linear operator and $\epsilon_t \in \mathcal{X}$ is some error term. The ARMA(p, q) model extends the AR(p) model by adding a moving average (MA) component as follows:
$$X_t = \sum_{i=1}^{p} \alpha_i X_{t-i} + \sum_{j=1}^{q} \beta_j \epsilon_{t-j} + \epsilon_t,$$
where $\epsilon_t \in \mathcal{X}$ is the error term and $\beta_j \in L(\mathcal{X}, \mathcal{X})$. We define the $d$-th order differencing of the time series as $\nabla^d X_t = \nabla^{d-1} X_t - \nabla^{d-1} X_{t-1}$ for $d \geq 1$ and $\nabla^0 X_t = X_t$. The ARIMA(p, q, d) model assumes that the $d$-th order differencing of the time series follows an ARMA(p, q) model. In this section, this general setting suffices for introducing the learning model. In the following sections, we fix a basis of $\mathcal{X}$ to obtain implementable algorithms, for which different kinds of norms and inner products for vectors and matrices are needed. We provide a table of the required notation in Appendix C.
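As a concrete illustration (our own sketch, not part of the paper's algorithms), the $d$-th order differencing operator and a one-step AR(p) prediction can be written as follows; the function names are ours.

```python
import numpy as np

def difference(x, d):
    """Apply the d-th order differencing operator to a sequence.

    x: array of shape (T, n) holding X_1, ..., X_T in R^n.
    Returns the differenced sequence of shape (T - d, n).
    """
    for _ in range(d):
        x = x[1:] - x[:-1]
    return x

def ar_predict(history, coeffs):
    """One-step AR(p) prediction: sum of alpha_i applied to X_{t-i}.

    history: the p most recent values, most recent first.
    coeffs:  the p operator matrices alpha_1, ..., alpha_p.
    """
    return sum(a @ x for a, x in zip(coeffs, history))
```

For instance, differencing the scalar sequence $0, 1, 4, 9$ once gives $1, 3, 5$, and twice gives $2, 2$.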
In this paper, we consider the setting of online learning, which can be described as an iterative game between a player and an adversary. In each round $t$ of the game, the player makes a prediction $\hat{X}_t$. Next, the adversary chooses some $X_t$ and reveals it to the player, who then suffers the loss $l(X_t, \hat{X}_t)$ for some convex loss function $l: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. The ultimate goal is to design a strategy for the player that minimises the cumulative loss $\sum_{t=1}^{T} l(X_t, \hat{X}_t)$ over $T$ rounds. For simplicity, we define $l_t(\hat{X}) := l(X_t, \hat{X})$. In classical textbooks on time series analysis, the signal is assumed to be generated by a model, based on which the predictions are made. In this paper, we make no assumptions about the data generation. Therefore, minimising the cumulative loss is in general impossible. An achievable objective is to keep a possibly small regret of not having chosen some ARIMA(p, q, d) model to generate the prediction $\hat{X}_t$. Formally, we denote by $\hat{X}_t(\alpha, \beta)$ the prediction of the ARIMA(p, q, d) model parameterised by $\alpha$ and $\beta$.¹ The cumulative regret over $T$ rounds is then given by
$$R_T(\alpha, \beta) = \sum_{t=1}^{T} l_t(\hat{X}_t) - \sum_{t=1}^{T} l_t(\hat{X}_t(\alpha, \beta)).$$
The goal of this paper is to design a strategy for the player such that the cumulative regret grows sublinearly in $T$. In the ideal case, in which the data are actually generated by an ARIMA process, the predictions generated by the player yield a small loss. Otherwise, the predictions are always close to those produced by the best ARIMA model, independent of the data generation. Following the adversarial setting in [6], we allow the sequences $\{X_t\}$, $\{\epsilon_t\}$ and the parameters $\alpha$, $\beta$ to be selected by the adversary. Without any restrictions on the model, this is no different from the impossible task of minimising the cumulative loss, since $\epsilon_{t-1}$ can always be selected such that $X_t = \hat{X}_t(\alpha, \beta)$ holds for all $t$.

¹ In this paper, we do not directly address the problem of cointegration, where the third term should be applied to a low rank linear operator.
Therefore, we make the following assumptions throughout this paper.

Assumption 1. $X_t = \epsilon_t + \hat{X}_t(\alpha, \beta)$, and there is some $R > 0$ such that $\|\epsilon_t\| \leq R$ for all $t = 1, \ldots, T$.

Since we are interested in competing against predictions generated by ARIMA models, we assume that $\epsilon_t$ is selected as if $X_t$ were generated by the ARIMA process. Furthermore, we assume that the norm $\|\epsilon_t\|$ is upper bounded within the $T$ iterations. Assumption 2 is a sufficient condition for the MA component to be invertible, which prevents it from going to infinity as $t \to \infty$ [23].
Our work is based on the fact that we can compete against an ARIMA(p, q, d) model by taking predictions from AR(m) model of the d-th order differencing for large enough m, which is shown in the following lemma, the proof of which can be found in Appendix A.
Lemma 1. Let $\{X_t\}$, $\{\epsilon_t\}$, $\alpha$ and $\beta$ be as assumed in Assumptions 1-2. Then there is some $\gamma \in L(\mathcal{X}, \mathcal{X})^m$ such that the prediction $\hat{X}_t(\gamma)$ of the corresponding integrated AR(m) model is close to the prediction $\hat{X}_t(\alpha, \beta)$ generated by the ARIMA process.

In previous works [6,7], the loss function $l_t$ is assumed to be Lipschitz continuous in order to control the difference in loss incurred by the approximation. In general, this does not hold for the squared error. However, from Assumption 1 and Lemma 1, it follows that both $\hat{X}_t(\alpha, \beta)$ and $\hat{X}_t(\gamma)$ lie in a compact set around $X_t$ with a bounded diameter. Given the convexity of $l$, which is locally Lipschitz continuous on a compact convex domain, we obtain a similar property
$$|l_t(\hat{X}_t(\gamma)) - l_t(\hat{X}_t(\alpha, \beta))| \leq L(X_t) \|\hat{X}_t(\gamma) - \hat{X}_t(\alpha, \beta)\|,$$
where $L(X_t)$ is some constant depending on $X_t$. For the squared error, it is easy to verify that the Lipschitz constant depends on $\|\nabla^d X_t\|$, the boundedness of which can reasonably be assumed. To avoid extraneous details, we simply add a third assumption.

Assumption 3. There is a compact convex set $\mathcal{K} \supseteq \{X_t\}_{t=1}^{T}$ such that $l_t$ is $L$-Lipschitz continuous on $\mathcal{K}$ for $t = 1, \ldots, T$.
The next corollary shows that the losses incurred by the ARIMA model and by its approximation are close, which allows us to take predictions from the approximation.

Corollary 1. Let $\{X_t\}$, $\{\epsilon_t\}$, $\alpha$, $\beta$ and $l$ be as assumed in Assumptions 1-3. Then there is some $\gamma \in L(\mathcal{X}, \mathcal{X})^m$ such that $|l_t(\hat{X}_t(\gamma)) - l_t(\hat{X}_t(\alpha, \beta))|$ is upper bounded by a constant for all $t = 1, \ldots, T$.

Algorithms and Analysis
From Corollary 1, it follows clearly that an ARIMA(p, q, d) model can be approximated by an integrated AR model with large enough m. However, neither the order of differencing d nor the lag m is known. To circumvent tuning them using a previously collected dataset, we propose a framework with a two-level hierarchical construction, which is described in Algorithm 1.
The idea is to maintain a master algorithm $M$ and a set of slave algorithms $\{A_m \,|\, m = 1, \ldots, K\}$. At each step $t$, the master algorithm receives predictions $\hat{X}^k_t$ from $A_k$ for $k = 1, \ldots, K$. It then produces a convex combination $\hat{X}_t = \sum_{k=1}^{K} w_{k,t} \hat{X}^k_t$ for some weight vector $w_t$ in the probability simplex $\Delta_K$. Next, it observes $X_t$ and computes the loss $l_t(\hat{X}^k_t)$ for each slave $A_k$, which is then used to update $A_k$ and $w_{t+1}$. Let $\{\hat{X}^k_t\}$ be the sequence generated by some slave $k$. We define the regret of not having chosen the prediction generated by slave $k$ as
$$R^M_T(k) = \sum_{t=1}^{T} l_t(\hat{X}_t) - \sum_{t=1}^{T} l_t(\hat{X}^k_t),$$
and the regret of slave $k$ as
$$R^k_T = \sum_{t=1}^{T} l_t(\hat{X}^k_t) - \sum_{t=1}^{T} l_t(\hat{X}_t(\gamma^k)),$$
where $\hat{X}_t(\gamma^k)$ is the prediction generated by an integrated AR model parameterised by $\gamma^k$. Let $A_k$ be some slave. Then the regret of this two-level framework can be decomposed as
$$\sum_{t=1}^{T} l_t(\hat{X}_t) - \sum_{t=1}^{T} l_t(\hat{X}_t(\alpha, \beta)) = R^M_T(k) + R^k_T + \sum_{t=1}^{T} \left( l_t(\hat{X}_t(\gamma^k)) - l_t(\hat{X}_t(\alpha, \beta)) \right).$$
For $\gamma^k$, $\alpha$ and $\beta$ satisfying the condition in Corollary 1, the last term above is upper bounded by a constant.
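The interaction described above can be sketched as follows. This is a toy illustration with hypothetical `predict`/`update` interfaces of our own; the paper's Algorithm 1 is the authoritative version.

```python
def run_two_level(slaves, master, stream, loss):
    """Sketch of the two-level framework: slaves predict, the master mixes.

    slaves: objects exposing predict() -> X_hat and update(x_t).
    master: object exposing weights w (on the simplex) and update(losses).
    stream: iterable of observed values X_t.
    loss:   convex loss function l(X_t, X_hat).
    """
    total = 0.0
    for x_t in stream:
        preds = [s.predict() for s in slaves]                # slave predictions
        x_hat = sum(w * p for w, p in zip(master.w, preds))  # convex combination
        total += loss(x_t, x_hat)                            # loss of the mixture
        master.update([loss(x_t, p) for p in preds])         # update w_{t+1}
        for s in slaves:
            s.update(x_t)                                    # slaves learn online
    return total
```

Any master that maintains simplex weights and any online AR slave can be plugged into this loop; the regret decomposition above is exactly the split between the mixture's loss, the best slave's loss, and the slave's own learning.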
If the regrets of the master and the slaves grow sublinearly in $T$, we achieve an overall sublinear regret upper bound, which is formally described in the following corollary.

Algorithm 1 Two-level framework
Input: $K$ instances of the slave algorithm $A_1, \ldots, A_K$; an instance of the master algorithm $M$.
Corollary 2. Let $A_i$ be an online learning algorithm competing against an AR($m_i$) model parameterised by $\gamma^i$ for $i = 1, \ldots, K$. For any ARIMA model parameterised by $\alpha$ and $\beta$, if there is a $k \in \{1, \ldots, K\}$ such that $\hat{X}_t(\gamma^k)$, $\hat{X}_t(\alpha, \beta)$ and $\{X_t\}$ satisfy Assumptions 1-3, then running Algorithm 1 with $M$ and $A_1, \ldots, A_K$ guarantees an overall regret that is sublinear in $T$ whenever the regrets of both the master and the slaves are sublinear.

Next, we design and analyse parameter-free algorithms for the slaves and the master. Given fixed $m$ and $d$, an integrated AR($m$) model can be treated as an ordinary linear regression model. In each iteration $t$, we select $\gamma_t = (\gamma_{1,t}, \ldots, \gamma_{m,t}) \in L(\mathcal{X}, \mathcal{X})^m$ and make the prediction $\hat{X}_t(\gamma_t)$. Since $l_t$ is convex, there is some subgradient $g_t \in \partial l_t(\hat{X}_t(\gamma_t))$ such that
$$l_t(\hat{X}_t(\gamma_t)) - l_t(\hat{X}_t(\gamma)) \leq \langle g_t, \gamma_t - \gamma \rangle.$$
Thus, we can cast the online linear regression problem as an online linear optimisation problem. Unlike the previous work, we focus on the unconstrained setting, where $\gamma_t$ is not picked from a compact decision set. In this setting, we can apply an FTRL algorithm with an adaptive regulariser. To obtain an efficient implementation, we fix a basis for both $\mathcal{X}$ and $\mathcal{X}^*$. We can now assume $\mathcal{X} = \mathcal{X}^* = \mathbb{R}^n$ and work with the matrix representation of $\gamma \in L(\mathcal{X}, \mathcal{X})$, so that the inner product above becomes the Frobenius inner product $\langle \cdot, \cdot \rangle_F$. It is well known that the Frobenius inner product can be considered as a dot product of vectorised matrices, with which we obtain the simple first order algorithm described in Algorithm 2 (ARIMA-AdaFTRL).

The cumulative regret of Algorithm 2 can be upper bounded using the following theorem.
Theorem 1. Let $\{X_t\}$ be any sequence of vectors taken from $\mathcal{X}$. Then Algorithm 2 guarantees a regret bound controlled by the running Lipschitz estimate $L_{T+1}$ and the magnitudes of the differenced data. In particular, for an $L$-Lipschitz loss function $l_t$, for which $L_{T+1}$ is upper bounded by $L$, we obtain a sublinear regret upper bound depending on the sequence of $d$-th order differences $\{\nabla^d X_t\}$. If $L$ is known, we can set $L_0 = L$; otherwise, picking $L_0$ arbitrarily from a reasonable range, e.g. $L_0 = 1$, does not have a devastating impact on the performance of the algorithm.
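To make the flavour of the slave update concrete, here is a heavily simplified one-dimensional sketch in the spirit of Algorithm 2. It replaces the paper's exact adaptive FTRL regulariser with a generic diagonal AdaGrad-style scaling plus a running Lipschitz estimate, so the constants and the update differ from the actual algorithm; function and variable names are ours.

```python
import numpy as np

def adaftrl_slave(diffed, m, L0=1.0):
    """Adaptive online fitting of an AR(m) model on a differenced 1-d series.

    diffed: the d-th order differenced series (1-d array).
    m:      lag of the AR model.
    L0:     initial guess for the Lipschitz constant (e.g. L0 = 1).
    """
    gamma = np.zeros(m)        # AR coefficients
    G = np.zeros(m)            # accumulated squared gradients per lag
    L = L0                     # running Lipschitz estimate
    losses = []
    for t in range(m, len(diffed)):
        x = diffed[t - m:t][::-1]              # lagged values, most recent first
        err = gamma @ x - diffed[t]            # prediction residual
        losses.append(0.5 * err ** 2)
        g = err * x                            # gradient of the squared error
        L = max(L, abs(err))                   # adapt the Lipschitz estimate
        G += g ** 2
        gamma -= g / (L * np.sqrt(G) + 1e-12)  # adaptive per-coordinate step
    return gamma, losses
```

On a well-behaved series the coefficients drift towards the generating AR parameters without any tuned learning rate, which is the behaviour the adaptive regulariser is designed to provide.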

Algorithms for Squared Errors
For the commonly used squared error, given by $l(X_t, \hat{X}_t) = \frac{1}{2} \|X_t - \hat{X}_t\|_2^2$, it can be verified that $g_t$ can be represented as a vector for all $t$. Algorithm 2 could fail due to its dependency on $\|g_t\|_2$, which can be made arbitrarily large by an adversarially selected $X_t$. To design a parameter-free algorithm for the squared error, we equip FTRL with the time-varying polynomial regulariser described in Algorithm 3.
Algorithm 3 ARIMA-AdaFTRL-Poly
Define $x_t = (\nabla^d X_{t-1} \cdots \nabla^d X_{t-m})$ and consider the matrix representation $\gamma_t = (\gamma_{1,t} \cdots \gamma_{m,t})$. Then we have $g_t = \gamma_t x_t - \nabla^d X_t$, and the upper bound of the regret can be rewritten accordingly. The idea of Algorithm 3 is to run the FTRL algorithm with a polynomial regulariser for increasing sequences $\{\lambda_t\}$ and $\{\eta_t\}$, which leads to the updating rule
$$\gamma_{t+1} = \frac{\theta_t}{\lambda_t c^2 + \eta_t},$$
for $c$ satisfying $\lambda_t c^3 + \eta_t c = \|\theta_t\|_F$. Since we have $\lambda_t \geq 0$ and $\eta_t > 0$ for $\theta_1 = 0$, such a $c$ exists and has a closed form expression. The computational complexity per iteration is linear in the dimension of $L(\mathcal{X}, \mathcal{X})^m$. The following theorem provides a regret upper bound for Algorithm 3.
Theorem 2. Let $\{X_t\}$ be any sequence of vectors taken from $\mathcal{X}$ and let $l$ be the squared error. We define $x_t = (\nabla^d X_{t-1} \cdots \nabla^d X_{t-m})$ and $\gamma = (\gamma_1 \cdots \gamma_m)$, the matrix representation of $\gamma_1, \ldots, \gamma_m \in L(\mathcal{X}, \mathcal{X})$. Then Algorithm 3 guarantees a data-dependent regret upper bound for all $\gamma \in L(\mathcal{X}, \mathcal{X})^m$.
For the squared error, Algorithm 3 does not require a compact decision set and ensures a sublinear regret bound that depends on the data sequence. Similarly to Algorithm 2, one can set $G_0$ according to prior knowledge about the bounds of the time series. Alternatively, we can simply set $G_0 = 1$ to obtain a reasonable performance.
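The scalar $c$ in the update rule of Algorithm 3 has a closed form because the map $c \mapsto \lambda_t c^3 + \eta_t c$ is strictly increasing for $\eta_t > 0$, so the cubic has exactly one real root (Cardano's formula). A small sketch with our own helper name:

```python
import numpy as np

def solve_scale(lam, eta, s):
    """Return the unique real root c of lam * c**3 + eta * c = s.

    lam >= 0 and eta > 0 as in Algorithm 3, with s = ||theta_t||_F.
    The left-hand side is strictly increasing in c, so the depressed
    cubic c**3 + p*c + q = 0 has exactly one real root.
    """
    if lam == 0.0:
        return s / eta
    p, q = eta / lam, -s / lam
    disc = np.sqrt(q * q / 4.0 + p ** 3 / 27.0)   # positive since p > 0
    return np.cbrt(-q / 2.0 + disc) + np.cbrt(-q / 2.0 - disc)
```

For example, with $\lambda_t = 2$, $\eta_t = 3$ and $\|\theta_t\|_F = 5$ the root is $c = 1$, since $2 \cdot 1 + 3 \cdot 1 = 5$.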

Online Model Selection using Master Algorithms
The straightforward choice for the master algorithm would be the exponentiated gradient algorithm for prediction with expert advice. However, this algorithm requires tuning of the learning rate and losses bounded by a small quantity, which cannot be assumed in our case. AdaHedge [29] avoids these problems; however, it yields a worst-case regret bound depending on the largest loss observed, which can be much worse than a data dependent regret bound.
Our idea is based on the adaptive optimistic follow the regularised leader (AO-FTRL) framework [10]. Given a sequence of hints $\{h_t\}$ and loss vectors $\{z_t\}$, AO-FTRL guarantees a regret bound related to $\sum_{t=1}^{T} \|z_t - h_t\|^2_t$ for some time varying norm $\|\cdot\|_t$. In our case, where the loss incurred by slave $k$ at iteration $t$ is $z_{k,t} = l(X_t, \hat{X}^k_t)$, we simply choose the hint $h_{k,t} = l(X_t - \nabla^d X_t, \hat{X}^k_t)$; note that $X_t - \nabla^d X_t$ depends only on $X_{t-1}, \ldots, X_{t-d}$ and is therefore available before $X_t$ is revealed. If $l$ is $L$-Lipschitz in its first argument, then we have $|z_{k,t} - h_{k,t}| \leq L \|\nabla^d X_t\|$, which leads to a data dependent regret bound. The resulting algorithm is described in Algorithm 4. Its regret is upper bounded by the following theorem, the proof of which is provided in Appendix B.
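A single weight update of an optimistic hedge master can be sketched as follows. This is a simplified step of our own: the learning rate is an explicit input here, whereas the paper's adaptive scheme tunes it from the observed deviations between losses and hints.

```python
import numpy as np

def ao_hedge_step(theta, hint, eta):
    """Optimistic hedge weights from accumulated losses plus a hint.

    theta: accumulated loss vector over the K slaves.
    hint:  guess of the upcoming loss vector z_t.
    eta:   learning rate (adaptively tuned in the paper).
    """
    scores = -eta * (np.asarray(theta) + np.asarray(hint))
    scores -= scores.max()          # stabilise the exponentials
    w = np.exp(scores)
    return w / w.sum()              # weights on the probability simplex
```

With a perfect hint the weights already discount a slave before its loss arrives; with a zero hint the step reduces to ordinary hedge.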
Theorem 3. Let $\{z_t\}$, $\{h_t\}$ and $\{w_t\}$ be as generated in Algorithm 4. Assume $l$ is $L$-Lipschitz in its first argument and convex in its second argument. Then for any sequence $\{X_t\}$ and slave algorithm $A_k$, the regret of the master is upper bounded in terms of the differenced data sequence.

Algorithm 4 ARIMA-AO-Hedge
Input: predictors $A_1, \ldots, A_K$ and the order $d$. Initialise $\theta_{k,1} = 0$ for $k = 1, \ldots, K$ and $\eta_1 = 0$.

By Corollary 2, combining Algorithm 4 with Algorithm 2 or 3 guarantees a data-dependent regret upper bound sublinear in $T$. Note that Algorithm 4 has an input parameter $d$, which can be adjusted according to prior knowledge of the dataset such that $\|\nabla^d X_t\|_2^2$ can be bounded by a small quantity. If no prior knowledge is available, we can set $d$ to the maximal order of differencing used in the slave algorithms. Arguably, Lipschitz continuity is not a reasonable assumption for the squared error; as before, we can assume that the loss function is locally Lipschitz, with a Lipschitz constant depending on the prediction. In the next section, we show the performance of Algorithm 4 in combination with Algorithms 2 and 3 in different experimental settings.

Experiments and Results
In this section, we carry out experiments on both synthetic and real-world data to show that the proposed algorithms can generate promising predictions without tuning hyperparameters.

Experiment Settings
The synthetic data are generated randomly. We run 20 trials for each synthetic experiment and average the results. For numerical stability, we scale the real world data so that the values lie between 0 and 10. Note that the range of the data is neither assumed nor used by the algorithms.
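The rescaling is a simple affine map; a sketch with our own helper name, assuming the minimum and maximum of the collected data are used:

```python
import numpy as np

def scale_to_range(data, lo=0.0, hi=10.0):
    """Affinely rescale data so that all values lie in [lo, hi]."""
    dmin, dmax = data.min(), data.max()
    return lo + (data - dmin) * (hi - lo) / (dmax - dmin)
```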

Setting 3: Time Varying Models
To obtain more adversarially selected time series values, we generate the first half of the values using a stationary 10-dimensional ARIMA(5, 2, 1) model and the second half using a stationary 10-dimensional ARIMA(5, 2, 0) model. The model parameters are drawn randomly.

Google Flu: Seasonality
We collect estimates of influenza activity in northern hemisphere countries, which have an obvious seasonal pattern. In this experiment, we examine the performance of the algorithms in handling regular and predictable changes that occur over a fixed period.

Electricity Demand: Trend and Seasonality
In this setting, we collect the monthly load, gross electricity production, net electricity consumption and gross demand in Turkey from 1976 to 2010. The dataset contains both trend and seasonality.

Experiments for the Slave Algorithms
We first fix $d = 1$ and $m = 16$ and compare our slave algorithms with ONS and OGD from [9] for the squared error $l_t(\hat{X}_t) = \frac{1}{2} \|X_t - \hat{X}_t\|_2^2$ and the Euclidean distance $l_t(\hat{X}_t) = \|X_t - \hat{X}_t\|_2$. ONS and OGD stack and vectorise the parameter matrices and incrementally update the vectorised parameter $u_t$ using the rules
$$u_{t+1} = \Pi^{A_t}_W\!\left(u_t - \eta A_t^{-1} g_t\right), \quad A_t = \sum_{s=1}^{t} g_s g_s^\top + \epsilon I \quad \text{(ONS)}, \qquad u_{t+1} = \Pi_W(u_t - \eta g_t) \quad \text{(OGD)},$$
respectively, where $g_t$ is the vectorised gradient at step $t$, $W$ is the decision set satisfying $\sup_{u \in W} \|u\|_2 \leq c$, and the operator $\Pi_W(v)$ projects $v$ onto $W$. We select a list of candidate values for each hyperparameter, evaluate their performance on the whole dataset, and select the configuration with the best performance for comparison. Since the synthetic data are generated randomly, we average the results over 20 trials for stability. The corresponding results are shown in Figures 1-6. To show the impact of the hyperparameters on the performance of the baseline algorithms, we also plot their performance using sub-optimal configurations. Note that, since the error term $\epsilon_t$ cannot be predicted, an ideal predictor would suffer an average error of at least $\|\epsilon_t\|_2^2$ and $\|\epsilon_t\|_2$ for the two loss functions. This quantity is known for the synthetic datasets and is plotted in the figures.
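For reference, one OGD baseline step with the Euclidean projection onto the ball of radius $c$ can be sketched as follows (our own illustration; the step size and the radius are exactly the hyperparameters the baselines must tune):

```python
import numpy as np

def ogd_step(u, g, eta, c):
    """One OGD step with projection onto W = {u : ||u||_2 <= c}.

    u:   vectorised parameter.
    g:   vectorised gradient g_t.
    eta: learning rate (tuned hyperparameter).
    c:   radius of the decision set (tuned hyperparameter).
    """
    v = u - eta * g
    norm = np.linalg.norm(v)
    return v if norm <= c else v * (c / norm)   # projection onto the ball
```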
In all settings, both AdaFTRL and AdaFTRL-Poly perform on par with well-tuned OGD and ONS, which can perform extremely badly under sub-optimal hyperparameter configurations. In the experiments on synthetic datasets, AdaFTRL suffers large losses at the beginning but generates accurate predictions after 2000 iterations. AdaFTRL-Poly has a more stable performance than AdaFTRL. In the experiment with the Google Flu data, all algorithms suffer huge losses around iteration 300 due to an abrupt change in the dataset. OGD and ONS with sub-optimal hyperparameter configurations, despite good performance on the first half of the data, generate very inaccurate predictions after the abrupt change. This could lead to a catastrophic failure in practice when certain patterns do not appear in the dataset collected for hyperparameter tuning. Our algorithms are more robust against this change and perform similarly to OGD and ONS with optimal hyperparameter configurations.

Conclusion
There are still several remaining issues that we want to address in future research.

First of all, it would be interesting to also develop a parameter-free algorithm for the cointegrated vector ARMA model. Secondly, we believe that the strong assumption on the error terms can be relaxed.

Appendix A
We prove Lemma 1 in this section. Consider the ARIMA model described in Section 3, and let $\hat{X}_t(\alpha, \beta)$ be the $t$-th value generated by this process.

The next lemma shows that this process approximates an ARIMA(p, q, d) process.

Lemma A1. For any $\alpha$, $\beta$ and $\{\epsilon_t\}$ satisfying Assumptions 1-2, the stated bound holds for $t = 1, \ldots, T$.

Proof. First of all, the claim holds for all $\tau \leq 0$. Assuming the claim holds for some $t$, the induction step follows, which concludes the induction and gives the claimed result.

Next, we recursively define the following process, where $\nabla^d X^m_t(\alpha, \beta) = \nabla^d X_t$ for $m \leq 0$. Let $\{X^m_t(\alpha, \beta)\}$ be the sequence generated by this process, where $Y_{t-i}$ is defined as in the proof of Lemma A1. From the assumption, we have $\|\nabla^d \hat{X}_t(\alpha, \beta) - \nabla^d X_t\| = \|\epsilon_t\| \leq R$, and, as proved in Lemma A1, $\|Y_t\| \leq R$ holds. Therefore, we obtain $\|Z^0_t\| \leq 2R$, which is the induction basis. Next, assume the claim holds for all $0, \ldots, m - 1$. Combining the induction hypothesis with the proof of the induction basis, $\|Z^m_t\|$ can be further bounded, which gives the claimed result.

This process of the $d$-th order differencing is actually an integrated AR($m + p$) process with order $d$, which is shown in the following lemma.

Lemma A3. For any data sequence $\{X^m_t(\alpha, \beta)\}$ generated by a process of the $d$-th order differencing given by (A1) and (A2), there is a $\gamma \in L(\mathcal{X}, \mathcal{X})^{m+p}$ such that the stated identity holds for all $t$.

Proof. Let $\{\nabla^d X^m_t(\alpha, \beta)\}$ be the sequence generated by (A1). We prove by induction on $m$ that, for all $\tilde{m} \leq m$, there is a $\gamma \in L(\mathcal{X}, \mathcal{X})^{\tilde{m}+p}$ such that the claim holds for all $\alpha$ and $\beta$. The induction basis follows directly from the definition. Assume that the claim holds for some $m$. Let $\alpha_i$ be the zero linear operator for $i > p$ and $\beta_i$ the zero linear operator for $i > q$. Then the induction step follows, where the second equality uses the recursive definition of the process, which gives the claimed result.

Appendix B

In this section, we prove the theorems in Section 4.
Proof of Theorem 1. The overall regret can be considered as the sum of the regrets $\sum_{t=1}^{T} \langle g_{i,t}, \gamma_{i,t} - \gamma_i \rangle$. Next, we analyse the regret for each $i = 1, \ldots, m$.
From the updating rule of $G_{i,t}$, we have $g_{i,t} = 0$ whenever $G_{i,t} = 0$. Let $t_0$ be the smallest index such that $G_{i,t_0} > 0$. Then, using the definition of the Frobenius norm, we obtain the per-coordinate bound, where the second inequality uses Lemma 4 in [28] and the last inequality follows from the fact that $\|g_{i,t}\|_F \leq L_t \|\nabla^d X_{t-i}\|_2 \leq L_{T+1} \|\nabla^d X_{t-i}\|_2$. Adding up over $i = 1, \ldots, m$ yields the claimed bound.

Proof of Theorem 2. Define $\psi_t(\gamma) = \frac{\lambda_t}{4} \|\gamma\|_F^4 + \frac{\eta_t}{2} \|\gamma\|_F^2$, and write $\tilde{\psi}_t(\gamma) = \frac{\lambda_t}{4} \|\gamma\|_F^4$ and $\hat{\psi}_t(\gamma) = \frac{\eta_t}{2} \|\gamma\|_F^2$. Define $v_t \in \partial \psi^*_{t+1}(\theta_t)$. From the properties of the Frobenius norm, the term corresponding to $\lambda_t = 0$ vanishes, where the second inequality uses the fact that $2ab - b^2 \leq a^2$. Let $t_0$ be the smallest index such that $\lambda_{t_0} > 0$; the corresponding sum is bounded using Lemma 4 in [28]. Similarly, let $t_1$ be the smallest index such that $\eta_{t_1} > 0$; the corresponding sum is bounded in the same way. Combining (A3), (A4), (A5) and (A6), we obtain the stated bound. For $\theta_1 = 0$, it is easy to verify that $\psi^*_1(\theta_1) \leq \langle w_1, \theta_1 \rangle_F \leq \frac{\|\theta_1\|_F^2}{\eta_1} \leq \|\theta_1\|_F$. Putting this into the inequality above, we obtain the claimed result.

Proof of Theorem 3. From the definition of $\psi_t$, it follows that $\psi_{T+1}(u) \leq \frac{\log K}{2} \sum_{t=1}^{T} \|z_t - h_t\|_\infty^2$ and $\psi^*_1(\theta_1) = 0$ hold. Define $v_t \in \partial \psi^*_t(\theta_{t+1})$. Next, we bound the third term, where the first inequality uses the fact that $\psi_t$ is $2\eta_t$ strongly convex w.r.t. $\|\cdot\|_1$. Adding up from $1$ to $T$ and combining the inequalities, we obtain a bound on $\sum_{t=1}^{T} l(X_t, \hat{X}_t)$, where the first inequality follows from Jensen's inequality. Furthermore, if $l$ is $L$-Lipschitz in its first argument, the claimed result follows.