Adaptive Online Learning for the Autoregressive Integrated Moving Average Models

Weijia Shao; Lukas Friedemann Radke; Fikret Sivrikaya; Sahin Albayrak

doi:10.3390/math9131523

¹ Faculty of Electrical Engineering and Computer Science, Technische Universität Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany

² GT-ARC Gemeinnützige GmbH, Ernst-Reuter-Platz 7, 10587 Berlin, Germany

* Author to whom correspondence should be addressed.

Mathematics 2021, 9(13), 1523; https://doi.org/10.3390/math9131523

This article belongs to the Special Issue Computational Optimizations for Machine Learning

Abstract

This paper addresses the problem of predicting time series data using the autoregressive integrated moving average (ARIMA) model in an online manner. Existing algorithms require model selection, which is time consuming and unsuitable for the setting of online learning. Using adaptive online learning techniques, we develop algorithms for fitting ARIMA models without hyperparameters. The regret analysis and experiments on both synthetic and real-world datasets show that the performance of the proposed algorithms can be guaranteed in both theory and practice.

Keywords: ARIMA model; time series analysis; online optimization; online model selection

1. Introduction

The autoregressive integrated moving average (ARIMA) model is an important tool for time series analysis [1], and has been successfully applied to a wide range of domains including the forecasting of household electric consumption [2], scheduling in smart grids [3], finance [4], and environmental protection [5]. It specifies that the values of a time series depend linearly on their previous values and error terms. In recent years, online learning (OL) methods have been applied to estimate the univariate [6,7] and multivariate [8,9] ARIMA models for their efficiency and scalability. These methods are based on the fact that any ARIMA model can be approximated by a finite dimensional autoregressive (AR) model, which can be fitted incrementally using online convex optimization algorithms. However, to guarantee accurate predictions, these methods require a proper configuration of hyperparameters, such as the diameter of the decision set, the learning rate, the order of differencing, and the lag of the AR model. Theoretically, these hyperparameters need to be set according to prior knowledge about the data generation, which is impossible to obtain. In practice, the hyperparameters are usually tuned to optimize the goodness of fit on unseen data, which requires numerical simulation (e.g., cross-validation) on a previously collected dataset. The numerical simulation is notoriously expensive, since it requires multiple training runs for each candidate hyperparameter configuration. Furthermore, a previously collected dataset containing ground truth is needed for validation of the fitted model, which is unsuitable for the online setting. Moreover, the expensive tuning process needs to be repeated regularly if the statistical properties of the time series change over time in an unforeseen way.

Given a new problem of predicting time series values, tuning the hyperparameters of the online algorithms can negate the benefits of the online setting. This paper addresses this problem in the online learning framework by proposing new parameter-free algorithms for learning ARIMA models, whose performance can still be guaranteed in both theory and practice. A naive attempt would be to directly apply parameter-free online convex optimization (PF-OCO) algorithms to the AR approximation. However, the theoretical performance of both the AR approximation and the parameter-free algorithms relies on bounded gradient vectors of the loss function, which is an unreasonable assumption for the widely used squared error with an unbounded domain.

The key contribution of this paper is the design of online learning algorithms for ARIMA models, avoiding regular and expensive hyperparameter tuning without damaging the power of the models. Our algorithms update the model incrementally with a computational complexity that is linearly related to the size of the model parameters and the number of candidate models in each iteration. To obtain a solid theoretical foundation, we first show that, for any locally Lipschitz-continuous function, ARIMA models with fixed order of differencing can be approximated using an AR model of the same order for a large enough lag. Based on this, new algorithms are proposed for learning the AR model adaptively without requiring any prior knowledge about the model parameters. For Lipschitz-continuous loss functions, we apply a new algorithm based on the adaptive follow the regularized leader (FTRL) framework [10] and show that our algorithm achieves a sublinear regret bound depending on the data sequence and the Lipschitz constant. The commonly used squared error requires special treatment because it is not Lipschitz continuous. To obtain a data-dependent regret bound, we combine a polynomial regularizer [11] with the adaptive FTRL framework. Finally, to find the proper order and lag of the AR model in an online manner, multiple AR models are simultaneously maintained, and an adaptive hedge algorithm is applied to aggregate their predictions. In the previous attempts [12,13] to solve this online model selection (OMS) problem, the exponentiated gradient (EG) algorithm has been directly applied to aggregate the predictions, which not only requires tuning the learning rate, but also yields a regret bound depending on the loss incurred by the worst model. Our adaptive hedge algorithm is parameter-free and guarantees a regret bound depending on the time series sequence. Table 1 provides a comparison of the online learning algorithms applied to the learning of the ARIMA models. In addition to the theoretical analysis, we also demonstrate the performance of the proposed algorithm using both synthetic and real-world datasets.

Table 1. Algorithms for online learning of ARIMA.

The rest of the paper is organized as follows. Section 2 reviews the existing work on the subject. The notation, learning model, and formal description of the problem are introduced in Section 3. Next, we present and analyze our algorithms in Section 4. Section 5 demonstrates the empirical performance of the proposed methods. Finally, we conclude our work with some future research directions in Section 6.

Algorithm 1 ARIMA-AdaFTRL.

Input: L_{1} > 0
Initialize θ_{i, 1} arbitrarily, η_{i, 1} = 0, G_{i, 0} = 0 for i = 1, \dots, m
for t = 1 to T do
    for i = 1 to m do
        G_{i, t} = max {G_{i, t - 1}, {∥ ▿^{d} X_{t - i} ∥}_{2}}
        η_{i, t} = {∥ θ_{i, 1} ∥}_{F} + \sqrt{\sum_{s = 1}^{t - 1} {∥ g_{i, s} ∥}_{F}^{2} + {(L_{t} G_{i, t})}^{2}}
        if η_{i, t} \neq 0 then
            γ_{i, t} = \frac{θ_{i, t}}{η_{i, t}}
        else
            γ_{i, t} = 0
        end if
    end for
    Play {\tilde{X}}_{t} (γ_{t})
    Observe X_{t} and g_{t} \in \partial l_{t} ({\tilde{X}}_{t} (γ_{t}))
    L_{t + 1} = max {L_{t}, {∥ g_{t} ∥}_{2}}
    for i = 1 to m do
        g_{i, t} = g_{t} ▿^{d} X_{t - i}^{⊤}
        θ_{i, t + 1} = θ_{i, t} - g_{i, t}
    end for
end for

Algorithm 2 ARIMA-AdaFTRL-Poly.

Input: G_{0} > 0
Initialize θ_{1} arbitrarily, G_{1} = max {G_{0}, {∥ ▿^{d} X_{0} ∥}_{2}, \dots, {∥ ▿^{d} X_{- m + 1} ∥}_{2}}
for t = 1 to T do
    η_{t} = {∥ θ_{1} ∥}_{F} + \sqrt{\sum_{s = 1}^{t - 1} {∥ ▿^{d} X_{s} x_{s}^{⊤} ∥}_{F}^{2} + {(G_{t} {∥ x_{t} ∥}_{2})}^{2}}
    λ_{t} = \sqrt{\sum_{s = 1}^{t} {∥ x_{s} ∥}_{2}^{4}}
    if {∥ θ_{t} ∥}_{F} \neq 0 then
        Select c \geq 0 satisfying λ_{t} c^{3} + η_{t} c = {∥ θ_{t} ∥}_{F}
        γ_{t} = \frac{c θ_{t}}{{∥ θ_{t} ∥}_{F}}
    else
        γ_{t} = 0
    end if
    Play {\tilde{X}}_{t} (γ_{t})
    Observe X_{t} and g_{t} = γ_{t} x_{t} - ▿^{d} X_{t}
    G_{t + 1} = max {G_{t}, {∥ ▿^{d} X_{t} ∥}_{2}}
    θ_{t + 1} = θ_{t} - g_{t} x_{t}^{⊤}
end for

Algorithm 3 ARIMA-AO-Hedge.

Input: predictors A_{1}, \dots, A_{K}, d
Initialize θ_{i, 1} = 0 for i = 1, \dots, K, η_{1} = 0
for t = 1 to T do
    Get prediction {\tilde{X}}_{t}^{i} from A_{i} for i = 1, \dots, K
    Set Y_{t} = \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1}
    Set h_{i, t} = l (Y_{t}, {\tilde{X}}_{t}^{i}) for i = 1, \dots, K
    if η_{t} = 0 then
        Set w_{i, t} = 1 for some i \in arg {max}_{j \in {1, \dots, K}} h_{j, t} and w_{j, t} = 0 for j \neq i
    else
        Set w_{i, t} = \frac{exp (η_{t}^{- 1} (θ_{i, t} - h_{i, t}))}{\sum_{j = 1}^{K} exp (η_{t}^{- 1} (θ_{j, t} - h_{j, t}))} for i = 1, \dots, K
    end if
    Predict {\tilde{X}}_{t} = \sum_{i = 1}^{K} w_{i, t} {\tilde{X}}_{t}^{i}
    Observe X_{t}, update A_{i}, and set z_{i, t} = l (X_{t}, {\tilde{X}}_{t}^{i}) for i = 1, \dots, K
    θ_{t + 1} = θ_{t} - z_{t}
    η_{t + 1} = \sqrt{\frac{1}{2 log K} \sum_{s = 1}^{t} {∥ h_{s} - z_{s} ∥}_{\infty}^{2}}
end for

2. Related Work

An ARIMA model can be fitted using statistical methods such as recursive least squares and maximum likelihood estimation, which are not only based on strong assumptions such as Gaussian distributed noise terms [18], linear dependencies [19], and data generated by a stationary process [20], but also require solving non-convex optimization problems [21]. Although these assumptions can be relaxed by considering non-Gaussian noise [22,23], non-stationary processes [24], or a convex relaxation [21], the pre-trained models still cannot deal with concept drift [7]. Moreover, retraining is time consuming and memory intensive, especially for large-scale datasets. The idea of applying regret minimization techniques to autoregressive moving average (ARMA) prediction was first introduced in [6]. The authors propose online algorithms incrementally producing predictions close to the values generated by the best ARMA model. This idea was extended to ARIMA (p, q, d) models in [7] by learning the AR (m) model of the higher-order differencing of the time series. Further extensions to multiple time series can be found in [8,9], while the problem of predicting time series with missing data was addressed in [25].

In order to obtain accurate predictions, the lag of the AR model and the order of differencing have to be tuned, which has been well studied in the offline setting. In some textbooks [20,26,27], Akaike’s Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are recommended for this task. Both require prior knowledge and strong assumptions about the variance of the noise [20], and are time and space consuming as they require numerical simulation such as cross-validation on previously collected datasets. Nevertheless, given a properly selected lag m and order d, online convex optimization techniques such as online Newton step (ONS) or online gradient descent (OGD) can be applied to fitting the model in the regret minimization framework [6,7,8,9]. However, both algorithms introduce additional hyperparameters to control the learning rate and numerical stability.

The idea of selecting hyperparameters for online time series prediction was proposed in [12,13]. Regarding the online AR predictor with different lags as experts, the authors aggregate over predictors by applying a multiplicative weights algorithm for prediction with expert advice. The proposed algorithm is not optimal for time series prediction, since the regret bound of the chosen algorithm depends on the largest loss incurred by the experts [28]. Furthermore, each individual expert still requires that the parameters are taken from a compact decision set, the diameter of which needs to be tuned in practice. A series of recent works on parameter-free online learning have provided possibilities of achieving sublinear regret without prior information on the decision set. In [14], the unconstrained online learning problem is modeled as a betting game, based on which a parameter-free algorithm is developed. The algorithm was further extended in [15], so a better regret bound can be achieved for strongly convex loss functions. However, the coin betting algorithm requires that the gradient vectors are normalized, which is unrealistic for unbounded time series and the squared error loss. In [16,17], the authors introduced parameter-free algorithms without requiring normalized gradient vectors. Unfortunately, the regret upper bounds of the proposed algorithms depend on the norm of the gradient vectors, which could be extremely large in our setting.

The main idea of the current work is based on the combination of the adaptive FTRL framework [10] and the idea of handling relative Lipschitz continuous functions [11], which makes it possible to devise an online algorithm with a data-dependent regret upper bound. To aggregate the results, an adaptive optimistic algorithm is proposed, such that the overall regret depends on the data sequence instead of the worst-case loss.

3. Preliminary and Learning Model

Let X_{t} denote the value observed at time t of a time series. We assume that X_{t} is taken from a finite dimensional real vector space X with norm ∥ \cdot ∥. We denote by L (X, X) the vector space of bounded linear operators from X to X and by {∥ α ∥}_{op} = {sup}_{x \in X, x \neq 0} \frac{∥ α x ∥}{∥ x ∥} the corresponding operator norm. An AR (p) model is given by

X_{t} = \sum_{i = 1}^{p} α_{i} X_{t - i} + ϵ_{t},

where α_{i} \in L (X, X) is a linear operator and ϵ_{t} \in X is an error term. The ARMA (p, q) model extends the AR (p) model by adding a moving average (MA) component as follows:

X_{t} = \sum_{i = 1}^{p} α_{i} X_{t - i} + \sum_{i = 1}^{q} β_{i} ϵ_{t - i} + ϵ_{t},

where ϵ_{t} \in X is the error term and β_{i} \in L (X, X). We define the d-th order differencing of the time series as ▿^{d} X_{t} = ▿^{d - 1} X_{t} - ▿^{d - 1} X_{t - 1} for d \geq 1 and ▿^{0} X_{t} = X_{t}. The ARIMA (p, q, d) model assumes that the d-th order differencing of the time series follows an ARMA (p, q) model. In this section, this general setting suffices for introducing the learning model. In the following sections, we fix the basis of X to obtain implementable algorithms, for which different kinds of norms and inner products for vectors and matrices are needed. We provide a table of required notation in Appendix C.

In this paper, we consider the setting of online learning, which can be described as an iterative game between a player and an adversary. In each round t of the game, the player makes a prediction {\tilde{X}}_{t}. Next, the adversary chooses some X_{t} and reveals it to the player, who then suffers the loss l (X_{t}, {\tilde{X}}_{t}) for some convex loss function l : X \times X \to R. The ultimate goal is to design a strategy for the player to minimize the cumulative loss \sum_{t = 1}^{T} l (X_{t}, {\tilde{X}}_{t}) of T rounds. For simplicity, we define

l_{t} : X \to R, X \mapsto l (X_{t}, X) .

In classical textbooks about time series analysis, the signal is assumed to be generated by a model, based on which the predictions are made. In this paper, we make no assumptions on the data generation. Therefore, minimizing the cumulative loss is generally impossible. An achievable objective is to keep the regret of not having chosen some ARIMA (p, q, d) model to generate the prediction {\tilde{X}}_{t} as small as possible. Formally, we denote by {\tilde{X}}_{t} (α, β) the prediction using the ARIMA (p, q, d) model parameterized by α and β, given by (in this paper, we do not directly address the problem of cointegration, where the third term should be applied to a low-rank linear operator):

{\tilde{X}}_{t} (α, β) = \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{q} β_{i} ϵ_{t - i} + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1} .    (1)
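
The third term of (1) undoes the differencing: by telescoping the definition of ▿^{d}, one has X_{t} = ▿^{d} X_{t} + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1}. The following minimal NumPy sketch checks this identity numerically. The function names `diff` and `integration_term` and the random test sequence are illustrative assumptions, not part of the paper.

```python
import numpy as np

def diff(X, d):
    """d-th order differencing along axis 0.

    Row k of the result is ▿^d X_t for the 0-based time index t = k + d.
    """
    D = np.asarray(X, dtype=float)
    for _ in range(d):
        D = D[1:] - D[:-1]
    return D

def integration_term(X, t, d):
    """sum_{i=0}^{d-1} ▿^i X_{t-1}, the third term of (1); requires t >= d."""
    total = np.zeros_like(np.asarray(X[0], dtype=float))
    for i in range(d):
        # diff(X, i)[k] is ▿^i X_{k+i}, so ▿^i X_{t-1} sits at row t-1-i
        total += diff(X, i)[t - 1 - i]
    return total

# Hypothetical data: 20 observations of a 3-dimensional series
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
d, t = 2, 10
reconstructed = diff(X, d)[t - d] + integration_term(X, t, d)
```

Under these assumptions, `reconstructed` coincides with `X[t]` up to floating-point error, which is why a predictor only has to learn the differenced value ▿^{d} X_{t}.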

The cumulative regret of T rounds is then given by

R_{T} (α, β) = \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t}) - \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t} (α, β)) .

The goal of this paper is to design a strategy for the player such that the cumulative regret grows sublinearly in T. In the ideal case, in which the data are actually generated by an ARIMA process, the prediction generated by the player yields a small loss. Otherwise, the predictions are always close to those produced by the best ARIMA model, independent of the data generation. Following the adversarial setting in [6], we allow the sequences {X_{t}}, {ϵ_{t}} and the parameters α, β to be selected by the adversary. Without any restrictions on the model, this is no different from the impossible task of minimizing the cumulative loss, since ϵ_{t - 1} can always be selected such that X_{t} = {\tilde{X}}_{t} (α, β) holds for all t. Therefore, we make the following assumptions throughout this paper:

Assumption 1. X_{t} = ϵ_{t} + {\tilde{X}}_{t} (α, β), and there is some R > 0 such that ∥ ϵ_{t} ∥ \leq R for all t = 1, \dots, T.

Assumption 2. The coefficients β_{i} satisfy \sum_{i = 1}^{q} {∥ β_{i} ∥}_{op} \leq 1 - ϵ for some ϵ > 0.

Since we are interested in competing against predictions generated by ARIMA models, we assume that ϵ_{t} is selected as if X_{t} is generated by the ARIMA process. Furthermore, we assume the norm ∥ ϵ_{t} ∥ is upper bounded within T iterations. Assumption 2 is a sufficient condition for the MA component to be invertible, which prevents it from going to infinity as t \to \infty [27].

Our work is based on the fact that we can compete against an ARIMA (p, q, d) model by taking predictions from an AR (m) model of the d-th order differencing for large enough m, which is shown in the following lemma, the proof of which can be found in Appendix A.

Lemma 1. Let {X_{t}}, {ϵ_{t}}, α, and β be as assumed in Assumptions 1 and 2. Then there is some γ \in L {(X, X)}^{m} with m \geq \frac{q log T}{log \frac{1}{1 - ϵ}} + p such that

∥ ▿^{d} {\tilde{X}}_{t} (γ) - ▿^{d} {\tilde{X}}_{t} (α, β) ∥ \leq {(1 - ϵ)}^{\frac{t}{q}} R + \frac{2 R}{T}

holds for all t = 1, \dots, T, where we define ▿^{d} {\tilde{X}}_{t} (γ) = \sum_{i = 1}^{m} γ_{i} ▿^{d} X_{t - i}.
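
To give a feeling for how large the lag in Lemma 1 has to be, the following small sketch evaluates the lower bound on m. The helper name `min_lag` and the example values are assumptions made for illustration only.

```python
import math

def min_lag(p, q, T, eps):
    """Smallest integer m with m >= q * log(T) / log(1 / (1 - eps)) + p.

    eps is the margin from Assumption 2; the sketch assumes 0 < eps < 1.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    return math.ceil(q * math.log(T) / math.log(1.0 / (1.0 - eps)) + p)

# Hypothetical example: p = 5, q = 2, horizon T = 10**4, margin eps = 0.5
m = min_lag(5, 2, 10**4, 0.5)
```

The bound grows only logarithmically in T, but it grows quickly as eps approaches 0, that is, as the MA part approaches the boundary of the condition in Assumption 2.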

As can be seen from the lemma, a prediction {\tilde{X}}_{t} (γ) generated by the process

{\tilde{X}}_{t} (γ) = \sum_{i = 1}^{m} γ_{i} ▿^{d} X_{t - i} + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1}

is close to the prediction {\tilde{X}}_{t} (α, β) generated by the ARIMA process. In the previous works [6,7], the loss function l_{t} is assumed to be Lipschitz continuous to control the difference of loss incurred by the approximation. In general, this does not hold for squared error. However, from Assumption 1 and Lemma 1, it follows that both {\tilde{X}}_{t} (α, β) and {\tilde{X}}_{t} (γ) lie in a compact set around X_{t} with a bounded diameter. Since the convex function l is locally Lipschitz continuous on a compact convex domain, we obtain a similar property:

l (X_{t}, {\tilde{X}}_{t} (γ)) - l (X_{t}, {\tilde{X}}_{t} (α, β)) \leq L (X_{t}) ∥ ▿^{d} {\tilde{X}}_{t} (γ) - ▿^{d} {\tilde{X}}_{t} (α, β) ∥,

where L (X_{t}) is some constant depending on X_{t}. For squared error, it is easy to verify that the Lipschitz constant depends on ∥ ▿^{d} X_{t} ∥, the boundedness of which can be reasonably assumed. To avoid extraneous details, we simply add the third assumption:

Assumption 3. Define the set \mathcal{X}_{t} = {X \in X | ∥ X - X_{t} ∥ \leq 4 R}. There is a compact convex set \mathcal{X} \supseteq ⋃_{t = 1}^{T} \mathcal{X}_{t} such that l_{t} is L-Lipschitz continuous in \mathcal{X} for t = 1, \dots, T.

The next corollary shows that the losses incurred by the ARIMA and its approximation are close, which allows us to take predictions from the approximation.

Corollary 1. Let {X_{t}}, {ϵ_{t}}, α, β, and l be as assumed in Assumptions 1–3. Then there is some γ \in L {(X, X)}^{m} with m \geq \frac{q log T}{log \frac{1}{1 - ϵ}} + p such that

\sum_{t = 1}^{T} (l_{t} ({\tilde{X}}_{t} (γ)) - l_{t} ({\tilde{X}}_{t} (α, β))) \leq L R (\frac{1}{1 - {(1 - ϵ)}^{\frac{1}{q}}} + 2) .

Proof. It follows from Assumption 1 and Lemma 1 that {\tilde{X}}_{t} (γ) and {\tilde{X}}_{t} (α, β) lie in the compact convex set of Assumption 3 for all t = 1, \dots, T. Together with Assumption 3, we obtain

\sum_{t = 1}^{T} (l_{t} ({\tilde{X}}_{t} (γ)) - l_{t} ({\tilde{X}}_{t} (α, β))) \leq L \sum_{t = 1}^{T} ∥ {\tilde{X}}_{t} (γ) - {\tilde{X}}_{t} (α, β) ∥ .

Applying Lemma 1, we obtain the claimed result. □
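
The final step of the proof can be spelled out as a geometric sum. The derivation below is a sketch that assumes q \geq 1 and ϵ \leq 1, so that the auxiliary ratio ρ (a symbol introduced here only for readability) lies in [0, 1); it also uses that the integration term cancels in {\tilde{X}}_{t} (γ) - {\tilde{X}}_{t} (α, β), so the norm equals the one bounded in Lemma 1.

```latex
\begin{aligned}
\sum_{t=1}^{T} \bigl( l_t(\tilde{X}_t(\gamma)) - l_t(\tilde{X}_t(\alpha,\beta)) \bigr)
  &\le L \sum_{t=1}^{T} \Bigl( (1-\epsilon)^{t/q} R + \frac{2R}{T} \Bigr) \\
  &\le L R \sum_{t=0}^{\infty} \rho^{t} + 2 L R,
     \qquad \rho = (1-\epsilon)^{1/q} \in [0, 1), \\
  &= L R \Bigl( \frac{1}{1 - (1-\epsilon)^{1/q}} + 2 \Bigr).
\end{aligned}
```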

4. Algorithms and Analysis

From Corollary 1, it follows clearly that an ARIMA(p,q,d) model can be approximated by an integrated AR model with large enough m. However, neither the order of differencing d nor the lag m is known. To circumvent tuning them using a previously collected dataset, we propose a framework with a two-level hierarchical construction, which is described in Algorithm 4.

Algorithm 4 Two-level framework.

Input: K instances of the slave algorithm A_{1}, \dots, A_{K}. An instance of master algorithm M.
for t = 1 to T do
    Get {\tilde{X}}_{t}^{i} from each A_{i}
    Get w_{t} \in Δ^{K} from M    ▹ Δ^{K} is the standard K-simplex
    Integrate the prediction: {\tilde{X}}_{t} = \sum_{i = 1}^{K} w_{t}^{i} {\tilde{X}}_{t}^{i}
    Observe X_{t}
    Define z_{t} \in R^{K} with z_{i, t} = l_{t} ({\tilde{X}}_{t}^{i})
    Update A_{i} using z_{i, t} for i = 1, \dots, K
    Update M using z_{t}
end for
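
The control flow of Algorithm 4 can be sketched in a few lines of Python. Everything below is a hypothetical illustration: the `predict`/`update`/`weights` interface, the toy experts `LastValue` and `ZeroPredictor`, and the `UniformMaster` placeholder are assumptions of this sketch, and the slaves are updated with the observation X_{t} itself rather than only with their loss value, since a gradient-based slave needs X_{t} to form its update.

```python
import numpy as np

def squared_error(x, x_hat):
    """l(X_t, prediction) = 0.5 * ||prediction - X_t||_2^2."""
    return 0.5 * float(np.sum((np.asarray(x_hat) - np.asarray(x)) ** 2))

def run_two_level(slaves, master, X, loss):
    """Run the two-level loop over the observations in X (one row per round)."""
    preds, total = [], 0.0
    for x_t in X:
        slave_preds = np.array([a.predict() for a in slaves])  # shape (K, n)
        w = master.weights()                                   # point in the simplex
        x_hat = w @ slave_preds                                # convex combination
        z = np.array([loss(x_t, p) for p in slave_preds])      # per-slave losses
        for a in slaves:
            a.update(x_t)
        master.update(z)
        total += loss(x_t, x_hat)
        preds.append(x_hat)
    return np.array(preds), total

class LastValue:
    """Toy slave: predicts the previously observed value."""
    def __init__(self, n):
        self.last = np.zeros(n)
    def predict(self):
        return self.last
    def update(self, x_t):
        self.last = np.asarray(x_t, dtype=float)

class ZeroPredictor:
    """Toy slave: always predicts zero."""
    def __init__(self, n):
        self.n = n
    def predict(self):
        return np.zeros(self.n)
    def update(self, x_t):
        pass

class UniformMaster:
    """Placeholder master that ignores the losses and averages uniformly."""
    def __init__(self, K):
        self.K = K
    def weights(self):
        return np.full(self.K, 1.0 / self.K)
    def update(self, z):
        pass

X = np.ones((5, 2))
preds, total = run_two_level([LastValue(2), ZeroPredictor(2)], UniformMaster(2), X, squared_error)
```

Any slave and master exposing the same three methods can be plugged into `run_two_level`; the uniform master only serves to keep the example short.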

The idea is to maintain a master algorithm M and a set of slave algorithms {A_{k} | k = 1, \dots, K}. At each step t, the master algorithm receives predictions {\tilde{X}}_{t}^{k} from A_{k} for k = 1, \dots, K. Then it comes up with a convex combination {\tilde{X}}_{t} = \sum_{i = 1}^{K} w_{t}^{i} {\tilde{X}}_{t}^{i} for some w_{t} \in Δ^{K} in the simplex. Next, it observes X_{t} and computes the loss l_{t} ({\tilde{X}}_{t}^{k}) for each slave A_{k}, which is then used to update A_{k} and w_{t + 1}. Let {{\tilde{X}}_{t}^{k}} be the sequence generated by some slave k. We define the regret of not having chosen the prediction generated by slave k as

R_{T} (k) = \sum_{t = 1}^{T} l_{t} (\sum_{i = 1}^{K} w_{t}^{i} {\tilde{X}}_{t}^{i}) - \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t}^{k}),

and the regret of the slave k as

R_{T} (A_{k}) = \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t}^{k}) - \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t} (γ_{k})),

where {\tilde{X}}_{t} (γ_{k}) is the prediction generated by an integrated AR model parameterized by γ_{k}. Let A_{k} be some slave. Then the regret of this two-level framework can be decomposed as

R_{T} (α, β) = R_{T} (k) + R_{T} (A_{k}) + \underbrace{\sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t} (γ_{k})) - \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t} (α, β))}_{Corollary 1} .

For γ_{k}, α, and β satisfying the condition in Corollary 1 (this is not a condition for the correctness of the algorithm: with more slaves, there are more α, β satisfying the condition, so we increase the freedom of the model by increasing the number of slaves), the marked term above is upper bounded by a constant, that is,

\sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t} (γ_{k})) - \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t} (α, β)) \in O (1) .

If the regrets of the master and the slaves grow sublinearly in T, we can achieve an overall sublinear regret upper bound, which is formally described in the following corollary.

Corollary 2. Let A_{i} be an online learning algorithm against an AR (m_{i}) model parameterized by γ^{i} for i = 1, \dots, K. For any ARIMA model parameterized by α and β, if there is a k \in {1, \dots, K} such that {\tilde{X}}_{t} (γ^{k}), {\tilde{X}}_{t} (α, β) and {X_{t}} satisfy Assumptions 1–3, then running Algorithm 4 with M and A_{1}, \dots, A_{K} guarantees

\sum_{t = 1}^{T} (l_{t} ({\tilde{X}}_{t}) - l_{t} ({\tilde{X}}_{t} (α, β))) \leq R_{T} (k) + R_{T} (A_{k}) + O (1) .

Next, we design and analyze parameter-free algorithms for the slaves and the master.

4.1. Parameter-Free Online Learning Algorithms

4.1.1. Algorithms for Lipschitz Loss

Given fixed m and d, an integrated AR (m) model can be treated as an ordinary linear regression model. In each iteration t, we select γ_{t} = (γ_{1, t}, \dots, γ_{m, t}) \in L {(X, X)}^{m} and make the prediction

{\tilde{X}}_{t} (γ_{t}) = \sum_{i = 1}^{m} γ_{i, t} ▿^{d} X_{t - i} + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1} .

Since l_{t} is convex, there is some subgradient g_{t} \in \partial l_{t} ({\tilde{X}}_{t} (γ_{t})) such that

l_{t} ({\tilde{X}}_{t} (γ_{t})) - l_{t} ({\tilde{X}}_{t} (γ)) \leq g_{t} (\sum_{i = 1}^{m} (γ_{i, t} - γ_{i}) ▿^{d} X_{t - i}),

for all γ \in L {(X, X)}^{m}. Define g_{i, t} : L (X, X) \to R, v \mapsto g_{t} (v ▿^{d} X_{t - i}). The regret can be further upper bounded by

\sum_{t = 1}^{T} (l_{t} ({\tilde{X}}_{t} (γ_{t})) - l_{t} ({\tilde{X}}_{t} (γ))) \leq \sum_{t = 1}^{T} \sum_{i = 1}^{m} g_{i, t} (γ_{i, t} - γ_{i}) .    (2)

Thus, we can cast the online linear regression problem as an online linear optimization problem. Unlike the previous work, we focus on the unconstrained setting, where γ_{t} is not picked from a compact decision set. In this setting, we can apply an FTRL algorithm with an adaptive regularizer. To obtain an efficient implementation, we fix a basis for both X and X_{*}. Now we can assume X = X_{*} = R^{n} and work with the matrix representation of γ \in L (X, X). It is easy to verify that (2) can be rewritten as

\sum_{t = 1}^{T} (l_{t} ({\tilde{X}}_{t} (γ_{t})) - l_{t} ({\tilde{X}}_{t} (γ))) \leq \sum_{t = 1}^{T} \sum_{i = 1}^{m} {⟨ g_{t} ▿^{d} X_{t - i}^{⊤}, γ_{i, t} - γ_{i} ⟩}_{F},

where {⟨ A, B ⟩}_{F} = tr (A^{⊤} B) is the Frobenius inner product. It is well known that the Frobenius inner product can be considered as a dot product of vectorized matrices, with which we obtain the simple first-order algorithm described in Algorithm 1 (the computational complexity per iteration depends linearly on the dimension of the parameter, i.e., O (n^{2} m)).

The cumulative regret of Algorithm 1 can be upper bounded using the following theorem.

Theorem 1. Let {X_{t}} be any sequence of vectors taken from X. Algorithm 1 guarantees

\sum_{t = 1}^{T} (l_{t} ({\tilde{X}}_{t} (γ_{t})) - l_{t} ({\tilde{X}}_{t} (γ))) \leq \sum_{i = 1}^{m} (\frac{{∥ γ_{i} ∥}_{F}^{2} L_{T + 1}}{2} + L_{T + 1} + \frac{L_{T + 1}^{2}}{L_{1}}) \sqrt{\sum_{t = 1}^{T} {∥ ▿^{d} X_{t - i} ∥}_{2}^{2}} + \sum_{i = 1}^{m} \frac{(L_{T + 1} G_{i, T + 1} + {∥ θ_{i, 1} ∥}_{F}) {∥ γ_{i} ∥}_{F}^{2} + {∥ θ_{i, 1} ∥}_{F}}{2} .

For an L-Lipschitz loss function l_{t}, in which L_{T + 1} is upper bounded by L, we obtain a sublinear regret upper bound depending on the sequence of d-th order differencing {▿^{d} X_{t}}. In case L is known, we can set L_{1} = L; otherwise, picking L_{1} arbitrarily from a reasonable range (e.g., L_{1} = 1) would not have a devastating impact on the performance of the algorithms.
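
A minimal NumPy sketch of the per-round computation in Algorithm 1 is given below, under several assumptions of this illustration: X = R^{n} with the standard basis, θ_{i, 1} = 0, the caller supplies the lagged differences ▿^{d} X_{t - 1}, \dots, ▿^{d} X_{t - m} and the integration term, and the caller computes a subgradient g_{t} of its own loss. The class and method names are hypothetical.

```python
import numpy as np

class ArimaAdaFTRL:
    """Sketch of Algorithm 1 for X = R^n with theta_{i,1} = 0 (an assumption)."""

    def __init__(self, n, m, L1=1.0):
        self.L = L1                          # running bound L_t on ||g_t||_2
        self.theta = np.zeros((m, n, n))     # theta_{i,t}, one n x n matrix per lag
        self.theta1_norm = np.zeros(m)       # ||theta_{i,1}||_F
        self.sq_sum = np.zeros(m)            # sum_{s<t} ||g_{i,s}||_F^2
        self.G = np.zeros(m)                 # G_{i,t}

    def predict(self, dX_lags, base):
        """dX_lags[i-1] = ▿^d X_{t-i}; base = sum_{i<d} ▿^i X_{t-1}."""
        self.dX_lags = np.asarray(dX_lags, dtype=float)
        self.G = np.maximum(self.G, np.linalg.norm(self.dX_lags, axis=1))
        eta = self.theta1_norm + np.sqrt(self.sq_sum + (self.L * self.G) ** 2)
        self.gamma = np.zeros_like(self.theta)
        nz = eta != 0
        self.gamma[nz] = self.theta[nz] / eta[nz][:, None, None]
        # sum_i gamma_{i,t} ▿^d X_{t-i} plus the integration term
        return np.einsum("ijk,ik->j", self.gamma, self.dX_lags) + base

    def update(self, g):
        """g is a subgradient of l_t at the prediction (vector of length n)."""
        g = np.asarray(g, dtype=float)
        self.L = max(self.L, float(np.linalg.norm(g)))
        grads = g[None, :, None] * self.dX_lags[:, None, :]   # g_t ▿^d X_{t-i}^T
        self.sq_sum += np.sum(grads ** 2, axis=(1, 2))
        self.theta -= grads

# Hypothetical two-round run with n = 2, m = 1 and d = 0 (so base = 0)
alg = ArimaAdaFTRL(n=2, m=1, L1=1.0)
p1 = alg.predict([[1.0, 0.0]], base=np.zeros(2))
alg.update([2.0, 0.0])
p2 = alg.predict([[1.0, 0.0]], base=np.zeros(2))
```

Each round costs O (n^{2} m) operations, matching the complexity stated in the text; no learning rate or decision-set diameter appears anywhere in the state.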

4.1.2. Algorithms for Squared Errors

For the commonly used squared error given by

l_{t} ({\tilde{X}}_{t} (γ_{t})) = \frac{1}{2} {∥ {\tilde{X}}_{t} (γ_{t}) - X_{t} ∥}_{2}^{2},

it can be verified that g_{t} can be represented as a vector g_{t} = \sum_{i = 1}^{m} γ_{i, t} ▿^{d} X_{t - i} - ▿^{d} X_{t} for all t. Existing algorithms, which have a regret upper bound depending on {∥ g_{t} ∥}_{2}, could fail since {∥ g_{t} ∥}_{2} can be made arbitrarily large by the adversarially selected data sequence X_{1}, \dots, X_{t}. To design a parameter-free algorithm for the squared error, we equip FTRL with a time-varying polynomial regularizer described in Algorithm 2.

Define x_{t} = (\begin{matrix} ▿^{d} X_{t - 1} \\ ⋮ \\ ▿^{d} X_{t - m} \end{matrix}) and consider the matrix representation γ_{t} = (\begin{matrix} γ_{1, t} & \dots & γ_{m, t} \end{matrix}). Then we have g_{t} = γ_{t} x_{t} - ▿^{d} X_{t}, and the upper bound of the regret can be rewritten as

\sum_{t = 1}^{T} (l_{t} ({\tilde{X}}_{t} (γ_{t})) - l_{t} ({\tilde{X}}_{t} (γ))) \leq \sum_{t = 1}^{T} {⟨ (γ_{t} x_{t} - ▿^{d} X_{t}) x_{t}^{⊤}, γ_{t} - γ ⟩}_{F} .

The idea of Algorithm 2 is to run the FTRL algorithm with a polynomial regularizer

\frac{λ_{t}}{4} {∥ γ ∥}_{F}^{4} + \frac{η_{t}}{2} {∥ γ ∥}_{F}^{2},

for increasing sequences {λ_{t}} and {η_{t}}, which leads to the update rule given by

γ_{t} = arg max_{γ \in L {(X, X)}^{m}} ({⟨ θ_{t}, γ ⟩}_{F} - \frac{λ_{t}}{4} {∥ γ ∥}_{F}^{4} - \frac{η_{t}}{2} {∥ γ ∥}_{F}^{2}) = \frac{c θ_{t}}{{∥ θ_{t} ∥}_{F}},

for c satisfying λ_{t} c^{3} + η_{t} c = {∥ θ_{t} ∥}_{F}. Since we have λ_{t} \geq 0 and η_{t} > 0 for θ_{1} \neq 0, c exists and has a closed-form expression. The computational complexity per iteration has a linear dependency on the dimension of L {(X, X)}^{m}. The following theorem provides a regret upper bound of Algorithm 2.

Theorem 2. Let {X_{t}} be any sequence of vectors taken from X and

l_{t} ({\tilde{X}}_{t} (γ)) = \frac{1}{2} {∥ X_{t} - {\tilde{X}}_{t} (γ) ∥}_{2}^{2} = \frac{1}{2} {∥ ▿^{d} X_{t} - ▿^{d} {\tilde{X}}_{t} (γ) ∥}_{2}^{2}

be the squared error. We define x_{t} = {(\begin{matrix} ▿^{d} X_{t - 1} & \dots & ▿^{d} X_{t - m} \end{matrix})}^{⊤} and γ = (\begin{matrix} γ_{1} & \dots & γ_{m} \end{matrix}), the matrix representation of γ_{1}, \dots, γ_{m} \in L (X, X). Then, Algorithm 2 guarantees

\sum_{t = 1}^{T} (l_{t} ({\tilde{X}}_{t} (γ_{t})) - l_{t} ({\tilde{X}}_{t} (γ))) \leq \frac{(\sqrt{m} G_{T + 1}^{2} + {∥ θ_{1} ∥}_{F}) {∥ γ ∥}_{F}^{2}}{2} + {∥ θ_{1} ∥}_{F} + (1 + \frac{{∥ γ ∥}_{F}^{4}}{4}) \sqrt{\sum_{t = 1}^{T} {∥ x_{t} ∥}_{2}^{4}} + (1 + \frac{G_{T + 1}}{G_{0}} + \frac{{∥ γ ∥}_{F}^{2}}{2}) \sqrt{\sum_{t = 1}^{T} {∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}}

for all γ \in L {(X, X)}^{m}.

For squared error, Algorithm 2 does not require a compact decision set and ensures a sublinear regret bound depending on the data sequence. Similar to Algorithm 1, one can set G_{0} according to the prior knowledge about the bounds of the time series. Alternatively, we can simply set G_{0} = 1 to obtain a reasonable performance.
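
The following NumPy sketch illustrates one way to implement the rounds of Algorithm 2 for X = R^{n} with θ_{1} = 0. It is an illustration under stated assumptions, not the authors' implementation: the names `solve_c` and `ArimaAdaFTRLPoly` are hypothetical, and although the text notes that c has a closed-form expression, the sketch finds the root of λ_{t} c^{3} + η_{t} c = {∥ θ_{t} ∥}_{F} with a few Newton steps instead, which avoids cancellation in the closed form.

```python
import numpy as np

def solve_c(lam, eta, r):
    """Root c >= 0 of lam*c**3 + eta*c = r, for lam, eta >= 0 (not both 0), r >= 0."""
    if lam == 0.0:
        return r / eta
    if eta == 0.0:
        return float(np.cbrt(r / lam))
    # Both r/eta and cbrt(r/lam) upper-bound the root; Newton decreases from there.
    c = min(r / eta, float(np.cbrt(r / lam)))
    for _ in range(50):
        step = (lam * c ** 3 + eta * c - r) / (3.0 * lam * c ** 2 + eta)
        c -= step
        if abs(step) <= 1e-15 * max(c, 1e-300):
            break
    return c

class ArimaAdaFTRLPoly:
    """Sketch of Algorithm 2 for X = R^n with theta_1 = 0 (an assumption)."""

    def __init__(self, n, m, G0=1.0):
        self.n, self.m = n, m
        self.theta = np.zeros((n, n * m))
        self.theta1_norm = 0.0
        self.G = G0
        self.sq_sum = 0.0    # sum_{s<t} ||▿^d X_s x_s^T||_F^2
        self.x4_sum = 0.0    # sum_{s<=t} ||x_s||_2^4

    def predict(self, x, base):
        """x stacks ▿^d X_{t-1}, ..., ▿^d X_{t-m}; base is the integration term."""
        self.x = np.asarray(x, dtype=float)
        xn = float(np.linalg.norm(self.x))
        # Keeps G_t >= norms of all lags seen so far, including the initial ones
        self.G = max(self.G, float(np.linalg.norm(self.x.reshape(self.m, self.n), axis=1).max()))
        self.x4_sum += xn ** 4
        eta = self.theta1_norm + np.sqrt(self.sq_sum + (self.G * xn) ** 2)
        lam = np.sqrt(self.x4_sum)
        r = float(np.linalg.norm(self.theta))
        if r > 0.0:
            self.gamma = solve_c(lam, eta, r) * self.theta / r
        else:
            self.gamma = np.zeros_like(self.theta)
        return self.gamma @ self.x + base

    def update(self, dX):
        """dX = ▿^d X_t, available once X_t has been observed."""
        dX = np.asarray(dX, dtype=float)
        g = self.gamma @ self.x - dX
        self.G = max(self.G, float(np.linalg.norm(dX)))
        # The Frobenius norm of the rank-one matrix dX x^T is ||dX||_2 * ||x||_2
        self.sq_sum += (np.linalg.norm(dX) * np.linalg.norm(self.x)) ** 2
        self.theta -= np.outer(g, self.x)

# Hypothetical scalar run with n = m = 1 and d = 0
alg = ArimaAdaFTRLPoly(n=1, m=1, G0=1.0)
first = alg.predict([1.0], base=0.0)
alg.update([2.0])
second = alg.predict([2.0], base=0.0)
```

In this toy run the second prediction uses λ_{2} = \sqrt{17} and η_{2} = \sqrt{20}, so the scalar c solving the cubic can be checked directly against the stored `alg.gamma`.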

4.2. Online Model Selection Using Master Algorithms

The straightforward choice of the master algorithm would be the exponentiated gradient algorithm for prediction with expert advice. However, this algorithm requires tuning of the learning rate and losses bounded by a small quantity, which cannot be assumed in our case. The AdaHedge algorithm [29] solves these problems. However, it yields a worst-case regret bound depending on the largest loss observed, which could be much worse than a data-dependent regret bound.

Our idea is based on the adaptive optimistic follow the regularized leader (AO-FTRL) framework [10]. Given a sequence of hints {h_{t}} and loss vectors {z_{t}}, AO-FTRL guarantees a regret bound related to \sum_{t = 1}^{T} {∥ z_{t} - h_{t} ∥}_{t}^{2} for some time-varying norm {∥ \cdot ∥}_{t}. In our case, where the loss incurred by a slave is given by l (X_{t}, {\tilde{X}}_{t}^{k}) at iteration t, we simply choose h_{k, t} = l (\sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1}, {\tilde{X}}_{t}^{k}). If l is L-Lipschitz in its first argument, then we have | z_{k, t} - h_{k, t} | \leq L ∥ ▿^{d} X_{t} ∥, which leads to a data-dependent regret. The obtained algorithm is described in Algorithm 3. Its regret is upper bounded by the following theorem, the proof of which is provided in Appendix B.

Theorem 3. Let {{\tilde{X}}_{t}}, {{\tilde{X}}_{t}^{k}}, {z_{t}}, {h_{t}}, and {w_{t}} be as generated in Algorithm 3. Assume l is L-Lipschitz in its first argument and convex in its second argument. Then for any sequence {X_{t}} and slave algorithm A_{k}, we have

R_{T} (k) \leq (\sqrt{2 log K} + \sqrt{\frac{8}{log K}}) \sqrt{\sum_{t = 1}^{T} L^{2} {∥ ▿^{d} X_{t} ∥}_{2}^{2}} .

By Corollary 2, combining Algorithm 3 with Algorithms 1 or 2 guarantees a data-dependent regret upper bound sublinear in T. Note that there is an input parameter d for Algorithm 3, which can be adjusted according to the prior knowledge of the dataset such that

{∥ ▿^{d} X_{t} ∥}_{2}^{2}

can be bounded by a small quantity. In case no prior knowledge can be obtained, we can set d to the maximal order of differencing used in the slave algorithms. Arguably, the Lipschitz continuity is not a reasonable assumption for squared error with unbounded domain. With a bounded

{∥ ▿^{d} X_{t} ∥}_{2}^{2}

, we can assume that the loss function is locally Lipschitz, but with a Lipschitz constant depending on the prediction. In the next section, we show the performance of Algorithm 3 in combination with Algorithms 1 and 2 in different experimental settings.
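The master update can be sketched in a few lines. The following is a minimal sketch, not the authors' Algorithm 3 verbatim: it assumes the entropic regularizer and the step size \eta_{t+1} = \sqrt{\sum_{s \leq t} \|z_s - h_s\|_\infty^2 / (2 \log K)} that are consistent with the bounds used in the proof of Theorem 3, and it assumes K \geq 2. The function names `optimistic_hedge_weights` and `run_master` are hypothetical.

```python
import numpy as np

def optimistic_hedge_weights(theta, hint, eta):
    """Weights w = grad psi*(theta - hint) for an entropic regularizer.

    For eta > 0 this is a softmax of (theta - hint) / eta. For eta == 0
    (no loss observed yet) the mass is spread uniformly over the maximizers.
    """
    score = theta - hint
    if eta <= 0.0:
        best = np.isclose(score, score.max())
        return best / best.sum()
    shifted = (score - score.max()) / eta  # shift for numerical stability
    w = np.exp(shifted)
    return w / w.sum()

def run_master(hints, losses):
    """hints, losses: arrays of shape (T, K) with K >= 2. Returns (T, K) weights."""
    T, K = losses.shape
    theta = np.zeros(K)   # negative cumulative loss vector
    sq_err = 0.0          # running sum of ||z_s - h_s||_inf^2
    weights = np.empty((T, K))
    for t in range(T):
        eta = np.sqrt(sq_err / (2.0 * np.log(K)))  # assumed step-size rule
        weights[t] = optimistic_hedge_weights(theta, hints[t], eta)
        theta = theta - losses[t]                  # theta_{t+1} = theta_t - z_t
        sq_err += np.max(np.abs(losses[t] - hints[t])) ** 2
    return weights
```

In this sketch the aggregated prediction at step t would be the weighted average of the slave predictions with weights `weights[t]`; the better the hints track the realized losses, the slower the step size grows.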

5. Experiments and Results

In this section, we carry out experiments on both synthetic and real-world data to show that the proposed algorithms can generate promising predictions without tuning hyperparameters.

5.1. Experiment Settings

The synthetic data were generated randomly. We run 20 trials for each synthetic experiment and average the results. For numerical stability, we scale the real-world data down so that the values are between 0 and 10. Note that the range of the data is not assumed or used in the algorithms.

Setting 1: Sanity Check

For a sanity check, we generate a stationary 10-dimensional

ARIMA (5, 2, 1)

process using randomly drawn coefficients.

Setting 2: Time-Varying Parameters

Aimed at demonstrating the effectiveness of the proposed algorithm in the non-stationary case, we generate the non-stationary 10-dimensional

ARIMA (5, 2, 1)

process using time-varying parameters. We draw

α_{1},

α_{2}

, and

β_{1},

β_{2}

randomly and independently, and generate data at iteration t with the

ARIMA (5, 2, 1)

model parameterized by

α_{t} = \frac{t}{10^{4}} α_{1} + (1 - \frac{t}{10^{4}}) α_{2}

and

β_{t} = \frac{t}{10^{4}} β_{1} + (1 - \frac{t}{10^{4}}) β_{2}

.
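The interpolation of the parameters can be illustrated with a small generator. This is a sketch under simplifying assumptions that are not taken from the paper: the coefficients are scalars per lag (rather than general linear operators), they are rescaled so that their absolute values sum to less than one to keep the differenced process bounded, and the noise is Gaussian. The names `draw_coeffs` and `generate_time_varying_arima` are hypothetical, and the orders follow the document's ARIMA(p, q, d) convention.

```python
import numpy as np

def draw_coeffs(rng, k, budget=0.9):
    """Draw k scalar coefficients whose absolute values sum to `budget` < 1."""
    c = rng.uniform(-1.0, 1.0, size=k)
    return budget * c / np.abs(c).sum()

def generate_time_varying_arima(T=10_000, p=5, q=2, d=1, dim=10, seed=0):
    """Return a (T, dim) series whose d-th difference follows a time-varying ARMA(p, q)."""
    rng = np.random.default_rng(seed)
    a1, a2 = draw_coeffs(rng, p), draw_coeffs(rng, p)
    b1, b2 = draw_coeffs(rng, q), draw_coeffs(rng, q)
    eps = rng.normal(scale=0.3, size=(T, dim))   # assumed noise distribution
    diff = np.zeros((T, dim))                    # the d-times differenced series
    for t in range(T):
        lam = t / T                              # interpolation weight t / T
        alpha = lam * a1 + (1.0 - lam) * a2      # alpha_t
        beta = lam * b1 + (1.0 - lam) * b2       # beta_t
        value = eps[t].copy()
        for i in range(1, p + 1):
            if t - i >= 0:
                value += alpha[i - 1] * diff[t - i]
        for i in range(1, q + 1):
            if t - i >= 0:
                value += beta[i - 1] * eps[t - i]
        diff[t] = value
    series = diff
    for _ in range(d):                           # integrate d times
        series = np.cumsum(series, axis=0)
    return series
```

With the default `T=10_000` the weight `lam` equals t / 10^4, which matches the interpolation formulas above.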

Setting 3: Time-Varying Models

To get more adversarially selected time series values, we generate the first half of the values using a stationary 10-dimensional

ARIMA (5, 2, 1)

model and the second half of the values using a stationary 10-dimensional

ARIMA (5, 2, 0)

model. The model parameters are drawn randomly.

Stock Data: Time Series with Trend

Following the experiments in [8], we collect the daily stock prices of seven technology companies from Yahoo Finance together with the S&P 500 index for over twenty years, which exhibit an obvious increasing trend and are believed to exhibit integration.

Google Flu Data: Time Series with Seasonality

We collect estimates of influenza activity of the northern hemisphere countries, which have an obvious seasonal pattern. In the experiment, we examine the performance of the algorithms for handling regular and predictable changes that occur over a fixed period.

Electricity Demand: Trend and Seasonality

In this setting, we collect monthly load, gross electricity production, net electricity consumption, and gross demand in Turkey from 1976 to 2010. The dataset contains both trend and seasonality.

5.2. Experiments for the Slave Algorithms

We first fix

d = 1

and

m = 16

and compare our slave algorithms with ONS and OGD from [9] for squared error

l_{t} ({\tilde{X}}_{t}) = \frac{1}{2} {∥ X_{t} - {\tilde{X}}_{t} ∥}_{2}^{2}

and Euclidean distance

l_{t} ({\tilde{X}}_{t}) = {∥ X_{t} - {\tilde{X}}_{t} ∥}_{2}

. ONS and OGD stack and vectorize the parameter matrices, and incrementally update the vectorized parameter respectively using the following rules

w_{t + 1} = Π_{W} (w_{t} - η {(\sum_{s = 1}^{t} g_{s} g_{s}^{⊤} + λ I)}^{- 1} g_{t})

and

w_{t + 1} = Π_{W} (w_{t} - η g_{t}),

where

g_{t}

is the vectorized gradient at step t,

W

is the decision set satisfying

{sup}_{u \in W} {∥ u ∥}_{2} \leq c

, and the operator

Π_{W} (v)

projects v into

W

. We select a list of candidate values for each hyperparameter, evaluate their performance on the whole dataset, and select the configuration with the best performance for comparison. Since the synthetic data are generated randomly, we average the results over 20 trials for stability. The corresponding results are shown in Figure 1, Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 (to amplify the differences of the algorithms, we use

log

plots for the y-axis for all settings; for the synthetic datasets, we also use

log

plots for the x-axis, so that the behavior of the algorithms in the first 1000 steps can be better observed). To show the impact of the hyperparameters on the performance of the baseline algorithms, we also plot their performance using sub-optimal configurations. Note that since the error term

ϵ_{t}

cannot be predicted, an ideal predictor would suffer an average error rate of at least

{∥ ϵ_{t} ∥}_{2}^{2}

and

{∥ ϵ_{t} ∥}_{2}

for the two kinds of loss function. This is known for the synthetic datasets and plotted in the figures.
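The two baseline update rules can be sketched as follows. This is an illustrative sketch rather than the implementation used in the experiments: it takes the decision set to be the Euclidean ball of radius c and uses the plain Euclidean projection in both rules, which is a simplification for ONS (the standard ONS analysis uses a projection in the norm induced by the accumulated matrix). The names `project_l2_ball`, `ogd_step`, and `ONS` are hypothetical.

```python
import numpy as np

def project_l2_ball(v, c):
    """Euclidean projection of v onto {u : ||u||_2 <= c}."""
    norm = np.linalg.norm(v)
    return v if norm <= c else v * (c / norm)

def ogd_step(w, g, eta, c):
    """One OGD update: w_{t+1} = Pi_W(w_t - eta * g_t)."""
    return project_l2_ball(w - eta * g, c)

class ONS:
    """ONS-style update with A_t = sum_{s<=t} g_s g_s^T + lam * I."""

    def __init__(self, dim, eta, lam, c):
        self.A = lam * np.eye(dim)
        self.eta = eta
        self.c = c

    def step(self, w, g):
        self.A += np.outer(g, g)                  # accumulate g_t g_t^T
        direction = np.linalg.solve(self.A, g)    # A_t^{-1} g_t
        return project_l2_ball(w - self.eta * direction, self.c)
```

Both rules expose the hyperparameters discussed in the text (η, λ, and the radius c), which is what makes the baselines sensitive to their configuration.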

Figure 1. Results for setting 1 (sanity check), using a stationary ARIMA(5,2,1) model.

Figure 2. Results for setting 2 (time-varying parameters), using a non-stationary ARIMA(5,2,1) model.

Figure 3. Results for setting 3 (time-varying models), using a combination of stationary ARIMA(5,2,1) and ARIMA(5,2,0) models.

Figure 4. Results for stock data.

Figure 5. Results for Google Flu data.

Figure 6. Results for electricity demand data.

In all settings, both AdaFTRL and AdaFTRL-Poly have a performance on par with well-tuned OGD and ONS, which can have extremely bad performance using sub-optimal hyperparameter configurations. In the experiments using synthetic datasets, AdaFTRL suffers large loss at the beginning while generating accurate predictions after 1000 iterations. The relative performances of the proposed algorithms after the first 1000 iterations compared to the best tuned baseline algorithms are plotted in Appendix D. AdaFTRL-Poly has more stable performance compared to AdaFTRL. In the experiment with Google Flu data, all algorithms suffer huge losses around iteration 300 due to an abrupt change in the dataset. OGD and ONS with sub-optimal hyperparameter configurations, despite good performance for the first half of the data, generate very inaccurate predictions after the abrupt change in the dataset. This could lead to a catastrophic failure in practice, when certain patterns do not appear in the dataset collected for hyperparameter tuning. Our algorithms are more robust against this change and perform similarly to OGD and ONS with optimal hyperparameter configurations.

5.3. Experiments for Online Model Selection

The performance of the two-level framework and Algorithm 3 for online model selection is demonstrated in Figure 7, Figure 8, Figure 9, Figure 10, Figure 11 and Figure 12. We simultaneously maintain 96

AR (m)

models of d-th-order differencing for

m = 1, \dots, 32

and

d = 0, \dots, 2

, which are updated by Algorithms 1 and 2 for squared error and Euclidean distance, respectively. The predictions generated by the AR models are aggregated using Algorithm 3 and the aggregation algorithm (AA) introduced in [13] with learning rate set to

\sqrt{T}

. We compare the average losses incurred by the aggregated predictions with those incurred by the best AR model. To show the impact of m and d, we also plot the average loss of some other sub-optimal AR models.

Figure 7. Model selection in setting 1.

Figure 8. Model selection in setting 2.

Figure 9. Model selection in setting 3.

Figure 10. Model selection for stock data.

Figure 11. Model selection for Google Flu.

Figure 12. Model Selection for electricity demand.

In all settings, AO-Hedge outperforms AA, although the differences are very slight in some of the experiments. We would like to stress again that the choice of the hyperparameters has a great impact on the performance of the AR model. In settings 1–3, the AR model with 0-th-order differencing has the best performance, although the data are generated using

d = 1

, which suggests that the prior knowledge about the data generation may not be helpful for the model selection in all cases. The experimental results also show that AO-Hedge has a performance similar to the best AR model.

6. Conclusions

We proposed algorithms for fitting ARIMA models in an online manner without requiring prior knowledge or tuning hyperparameters. We showed that the cumulative regret of our method grows sublinearly with the number of iterations and depends on the values of the time series. The comparison study on both synthetic and real-world datasets suggests that the proposed algorithms have a performance on par with the well-tuned state-of-the-art algorithms.

There are still several remaining issues that we want to address in future research. Firstly, it would be interesting to also develop a parameter-free algorithm for the cointegrated vector ARMA model. Secondly, we believe that the strong assumption on the

β

coefficient can be relaxed for multi-dimensional time series by generalizing Lemma 2 in [7]. Furthermore, we are also interested in applying online learning to other time series models such as the (generalized) ARCH model [30]. Finally, the proposed algorithms need to be empirically analyzed using more real-world datasets and loss functions, and compared with more recent predictive models such as recurrent neural networks and the models combining neural networks and ARIMA models [31].

Author Contributions

Conceptualization, W.S.; methodology, W.S. and L.F.R.; validation, W.S., L.F.R., and F.S.; formal analysis, W.S.; investigation, W.S. and L.F.R.; writing—original draft preparation, W.S. and L.F.R.; writing—review and editing, W.S., L.F.R., F.S., and S.A.; visualization, L.F.R.; supervision, F.S. and S.A. All authors have read and agreed to the published version of the manuscript.

Funding

We acknowledge support by the German Research Foundation and the Open Access Publication Fund of TU Berlin.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code for generating the synthetic datasets, the implementation of the algorithms, and the detailed information about our experiments are available on GitHub: https://github.com/OnlinePredictorTS/AOLForTimeSeries (accessed on March 2021). The stock data are collected from https://finance.yahoo.com/ (accessed on March 2021). The Google Flu data are available at https://github.com/datalit/googleflutrends/ (accessed on March 2021). The detailed information about the electricity demand can be found in [32].

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

We prove Lemma 1 in this section. Consider the ARIMA model given by

▿^{d} X_{t} (α, β) = \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{q} β_{i} ϵ_{t - i} + ϵ_{t}

with

▿^{d} X_{t} (α, β) = ▿^{d} X_{t}

for

t \leq 0

. Let

X_{t} (α, β) = ▿^{d} X_{t} (α, β) + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1}

be the t-th value generated by the ARIMA process. To prove Lemma 1, we generalize the proof provided in [6]. To remove the MA component, we first recursively define a growing process of the d-th-order differencing

▿^{d} X_{t}^{\infty} (α, β) = \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} - ▿^{d} X_{t - i}^{\infty} (α, β))

with

▿^{d} X_{t}^{\infty} (α, β) = ▿^{d} X_{t}

for

t \leq 0

. Let

X_{t}^{\infty} (α, β) = ▿^{d} X_{t}^{\infty} (α, β) + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1}

be the t-th value generated by this process.
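The reconstruction of a value from its d-th-order difference, X_t = ▿^{d} X_t + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1}, can be checked numerically. The sketch below assumes the usual recursive definition ▿^{0} X_t = X_t and ▿^{i} X_t = ▿^{i - 1} X_t - ▿^{i - 1} X_{t - 1}; the helper names `differences` and `reconstruct` are hypothetical.

```python
import numpy as np

def differences(x, d):
    """Return [D^0 x, ..., D^d x], each aligned with x along axis 0.

    Entries that are undefined (too few past values) are NaN.
    """
    out = [np.asarray(x, dtype=float)]
    for _ in range(d):
        prev = out[-1]
        nxt = np.full_like(prev, np.nan)
        nxt[1:] = prev[1:] - prev[:-1]
        out.append(nxt)
    return out

def reconstruct(x, d, t):
    """Rebuild x[t] from its d-th difference and the lower-order differences at t - 1."""
    diffs = differences(x, d)
    return diffs[d][t] + sum(diffs[i][t - 1] for i in range(d))
```

For any t ≥ d, `reconstruct(x, d, t)` agrees with `x[t]` up to floating-point error, which is why a predictor for the differenced series directly yields a predictor for the original series.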

The next lemma shows that it approximates an

ARIMA (p, q, d)

process.

Lemma A1.

For any α, β, and

{ϵ_{t}}

satisfying Assumptions 1 and 2, we have, for

t = 1, \dots, T

,

∥ X_{t}^{\infty} (α, β) - {\tilde{X}}_{t} (α, β) ∥ \leq {(1 - ϵ)}^{\frac{t}{q}} R .

Proof.

First of all, we have

\begin{matrix} X_{t}^{\infty} (α, β) - {\tilde{X}}_{t} (α, β) = & ▿^{d} X_{t}^{\infty} (α, β) - ▿^{d} {\tilde{X}}_{t} (α, β) \\ = & \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} - ▿^{d} X_{t - i}^{\infty} (α, β) - ϵ_{t - i}) \end{matrix}

for

t \geq 0

. Define

Y_{t} = ▿^{d} X_{t} - ▿^{d} X_{t}^{\infty} (α, β) - ϵ_{t}

. W.l.o.g. we can assume

∥ ϵ_{t} ∥ \leq R

for

t \leq 0

. Next, we prove by induction on t that

∥ Y_{τ} ∥ \leq {(1 - ϵ)}^{\frac{τ}{q}} R

holds for all

τ \leq t

. For the induction basis, we have

∥ Y_{τ} ∥ = ∥ - ϵ_{τ} ∥ \leq R

for all

τ \leq 0

. We assume the claim holds for some t, then we have

\begin{matrix} ∥ Y_{t + 1} ∥ = & ∥ ▿^{d} X_{t + 1} - ▿^{d} X_{t + 1}^{\infty} (α, β) - ϵ_{t + 1} ∥ \\ = & ∥ ▿^{d} X_{t + 1} - \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t + 1 - i} - \sum_{i = 1}^{q} β_{i} ϵ_{t + 1 - i} - ϵ_{t + 1} ∥ + ∥ \sum_{i = 1}^{q} β_{i} Y_{t + 1 - i} ∥ \\ \leq & \sum_{i = 1}^{q} ∥ Y_{t + 1 - i} ∥ {∥ β_{i} ∥}_{op} \\ \leq & {(1 - ϵ)}^{\frac{t + 1 - q}{q}} R \sum_{i = 1}^{q} {∥ β_{i} ∥}_{op} \\ \leq & {(1 - ϵ)}^{\frac{t + 1}{q}} R, \end{matrix}

which concludes the induction. Finally, we have

\begin{matrix} ∥ X_{t}^{\infty} (α, β) - {\tilde{X}}_{t} (α, β) ∥ = & ∥ \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} (α, β) - ▿^{d} X_{t - i}^{\infty} (α, β) - ϵ_{t - i}) ∥ \\ \leq & \sum_{i = 1}^{q} {∥ β_{i} ∥}_{op} ∥ Y_{t - i} ∥ \\ \leq & (1 - ϵ) {(1 - ϵ)}^{\frac{t - q}{q}} R \\ = & {(1 - ϵ)}^{\frac{t}{q}} R, \end{matrix}

which is the claimed result. □

Next, we recursively define the following process:

▿^{d} X_{t}^{m} (α, β) = \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} - ▿^{d} X_{t - i}^{m - i} (α, β)),

(A1)

where

▿^{d} X_{t}^{m} (α, β) = ▿^{d} X_{t}

for

m \leq 0

. Let

{X_{t}^{m} (α, β)}

be the sequence generated as follows:

\begin{matrix} X_{t}^{m} (α, β) = ▿^{d} X_{t}^{m} (α, β) + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1} . \end{matrix}

(A2)

We show in the next lemma that it is close to

{X_{t}^{\infty} (α, β)}

.

Lemma A2.

For any α, β,

{l_{t}}

, and

{ϵ_{t}}

satisfying Assumptions 1 and 2, we have

∥ X_{t}^{m} (α, β) - X_{t}^{\infty} (α, β) ∥ \leq \frac{2 R}{T},

for

m = \frac{q log T}{log \frac{1}{1 - ϵ}}

.

Proof.

Define

Z_{t}^{m} = ▿^{d} X_{t}^{m} (α, β) - ▿^{d} X_{t}^{\infty} (α, β)

. We prove by induction on m that

∥ Z_{t}^{\tilde{m}} ∥ \leq {(1 - ϵ)}^{\frac{\tilde{m}}{q}} 2 R

holds for all

t = 1, \dots, T

and

0 \leq \tilde{m} \leq m

. For

m = 0

, we have for

t = 1, \dots, T

\begin{matrix} ∥ Z_{t}^{0} ∥ = & ∥ ▿^{d} X_{t}^{0} (α, β) - ▿^{d} X_{t}^{\infty} (α, β) ∥ \\ = & ∥ ▿^{d} X_{t} - ▿^{d} X_{t}^{\infty} (α, β) ∥ . \end{matrix}

By the definition of the stochastic process

{▿^{d} X^{\infty} (α, β)}

, we have

\begin{matrix} - ▿^{d} X_{t} + ▿^{d} X_{t}^{\infty} (α, β) \\ = & - ▿^{d} X_{t} + \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} (α, β) - ▿^{d} X_{t - i}^{\infty} (α, β)) \\ = & - ▿^{d} X_{t} + \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{q} β_{i} ϵ_{t - i} + \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} (α, β) - ▿^{d} X_{t - i}^{\infty} (α, β) - ϵ_{t - i}) \\ = & ▿^{d} {\tilde{X}}_{t} (α, β) - ▿^{d} X_{t} + \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} (α, β) - ▿^{d} X_{t - i}^{\infty} (α, β) - ϵ_{t - i}) \\ = & ▿^{d} {\tilde{X}}_{t} (α, β) - ▿^{d} X_{t} + \sum_{i = 1}^{q} β_{i} Y_{t - i}, \end{matrix}

where

Y_{t - i}

is defined as in the proof of Lemma A1. From the assumption, we have

∥ ▿^{d} {\tilde{X}}_{t} (α, β) - ▿^{d} X_{t} ∥ = ∥ ϵ_{t} ∥ \leq R

, and, as we have proved in Lemma A1,

∥ Y_{t} ∥ \leq R

holds. Therefore, we obtain

∥ Z_{t}^{0} ∥ \leq 2 R

, which is the induction basis. Next, assume the claim holds for all

0, \dots, m - 1

. Then we have

\begin{matrix} ∥ Z_{t}^{m} ∥ = & ∥ \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} - ▿^{d} X_{t - i}^{m - i} (α, β) - ▿^{d} X_{t - i} + ▿^{d} X_{t - i}^{\infty} (α, β)) ∥ \\ \leq & ∥ \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i}^{\infty} (α, β) - ▿^{d} X_{t - i}^{m - i} (α, β)) ∥ \\ \leq & \sum_{i = 1}^{m} ∥ β_{i} (▿^{d} X_{t - i}^{\infty} (α, β) - ▿^{d} X_{t - i}^{m - i} (α, β)) ∥ \\ + \sum_{i = m + 1}^{q} ∥ β_{i} (▿^{d} X_{t - i}^{\infty} (α, β) - ▿^{d} X_{t - i}) ∥ \end{matrix}

From the induction hypothesis, we have

∥ ▿^{d} X_{t - i}^{\infty} (α, β) - ▿^{d} X_{t - i}^{m - i} (α, β) ∥ \leq {(1 - ϵ)}^{\frac{m - i}{q}} 2 R .

From the proof of the induction basis, we have

\sum_{i = m + 1}^{q} ∥ β_{i} (▿^{d} X_{t - i}^{\infty} (α, β) - ▿^{d} X_{t - i}) ∥ \leq 2 R \sum_{i = m + 1}^{q} {∥ β_{i} ∥}_{op} .

Therefore,

∥ Z_{t}^{m} ∥

can be further bounded using

\begin{matrix} ∥ Z_{t}^{m} ∥ \leq & 2 R \sum_{i = 1}^{m} {∥ β_{i} ∥}_{op} {(1 - ϵ)}^{\frac{m - i}{q}} + 2 R \sum_{i = m + 1}^{q} {∥ β_{i} ∥}_{op} \\ \leq & 2 R \sum_{i = 1}^{m} {∥ β_{i} ∥}_{op} {(1 - ϵ)}^{\frac{m - i}{q}} + 2 R \sum_{i = m + 1}^{q} {∥ β_{i} ∥}_{op} {(1 - ϵ)}^{\frac{m - i}{q}} \\ \leq & {(1 - ϵ)}^{\frac{m - q}{q}} 2 R \sum_{i = 1}^{q} {∥ β_{i} ∥}_{op} \\ \leq & {(1 - ϵ)}^{\frac{m}{q}} 2 R . \end{matrix}

Choosing

m \geq \frac{q log T}{log \frac{1}{1 - ϵ}} = q {log}_{1 - ϵ} {(T)}^{- 1}

, we have

\begin{matrix} ∥ X_{t}^{m} (α, β) - X_{t}^{\infty} (α, β) ∥ \leq & \frac{2 R}{T}, \end{matrix}

which is the claimed result. □
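To get a feeling for the size of the lag required by Lemma A2, the threshold m ≥ q log T / log(1 / (1 - ϵ)) can be evaluated directly. The helper name `required_lag` is hypothetical, and the numbers in the usage note are only an illustration, not values from the experiments.

```python
import math

def required_lag(q, T, eps):
    """Smallest integer m with m >= q * log(T) / log(1 / (1 - eps))."""
    if not (0.0 < eps < 1.0):
        raise ValueError("eps must lie in (0, 1)")
    if T < 1 or q < 0:
        raise ValueError("T must be >= 1 and q must be >= 0")
    return math.ceil(q * math.log(T) / math.log(1.0 / (1.0 - eps)))
```

For instance, with q = 2, T = 10^4, and ϵ = 0.1 the threshold evaluates to m = 175, and (1 - ϵ)^{m / q} is then at most 1 / T, which is exactly what turns the bound {(1 - ϵ)}^{m / q} 2 R into 2 R / T.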

This process of the d-th-order differencing is actually an integrated

AR (m + p)

process with order d, which is shown in the following lemma.

Lemma A3.

For any data sequence

{X_{t}^{m} (α, β)}

generated by a process of the d-th-order differencing given by (A1) and (A2) there is a

γ \in L {(X, X)}^{m + p}

such that

\sum_{i = 1}^{m + p} γ_{i} ▿^{d} X_{t - i} + \sum_{i = 0}^{d - 1} ▿^{i} X_{t - 1} = X_{t}^{m} (α, β)

holds for all t.

Proof.

Let

{▿^{d} X_{t}^{m} (α, β)}

be the sequence generated by (A1). We prove by induction on m that for all

\tilde{m} \leq m

there is a

γ \in L {(X, X)}^{\tilde{m} + p}

such that

▿^{d} X_{t}^{\tilde{m}} (α, β) = \sum_{i = 1}^{\tilde{m} + p} γ_{i} ▿^{d} X_{t - i}

holds for all α and β. The induction basis follows directly from the definition that

▿^{d} X_{t}^{0} (α, β) = \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} .

Assume that the claim holds for some m. Let

α_{i}

be the zero linear functional for

i > p

and

β_{i}

be the zero linear functional for

i > q

. Then we have

\begin{matrix} ▿^{d} X_{t}^{m + 1} (α, β) \\ = & \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{q} β_{i} (▿^{d} X_{t - i} - ▿^{d} X_{t - i}^{m + 1 - i} (α, β)) \\ = & \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{m + 1} β_{i} ▿^{d} X_{t - i} - \sum_{i = 1}^{m + 1} β_{i} ▿^{d} X_{t - i}^{m + 1 - i} (α, β) \\ = & \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{m + 1} β_{i} ▿^{d} X_{t - i} - \sum_{i = 1}^{m + 1} β_{i} \sum_{j = 1}^{m + 1 - i + p} γ_{j}^{m + 1 - i} ▿^{d} X_{t - i - j} \\ = & \sum_{i = 1}^{p} α_{i} ▿^{d} X_{t - i} + \sum_{i = 1}^{m + 1} β_{i} ▿^{d} X_{t - i} - \sum_{i = 1}^{m + p + 1} (\sum_{j = 1}^{m + 1} β_{j} \sum_{k = 1}^{i - j} γ_{k}^{m + 1 - j}) ▿^{d} X_{t - i}, \end{matrix}

where the second equality follows from the fact that

β_{i} (▿^{d} X_{t - i} - ▿^{d} X_{t - i}^{m + 1 - i} (α, β)) = 0

for

i > m + 1

, the third line uses the induction hypothesis and the last line is obtained by rearranging and setting

\sum_{i = m}^{n} a_{i} = 0

for

m > n

. The induction step is obtained by setting

γ_{i}^{m + 1} = α_{i} + β_{i} - \sum_{j = 1}^{m + 1} β_{j} \sum_{k = 1}^{i - j} γ_{k}^{m + 1 - j}

for

i = 1, \dots, m + p + 1

, and the claimed result follows. □

Finally, we prove Lemma 1 by combining the results.

Proof of Lemma 1.

From Lemmas A1, A2, and A3, there is some

γ \in L {(X, X)}^{m}

with

m \geq \frac{q log T}{log \frac{1}{1 - ϵ}} + p

such that

\begin{matrix} ∥ ▿^{d} X_{t} (γ) - ▿^{d} {\tilde{X}}_{t} (α, β) ∥ \\ = & ∥ ▿^{d} X_{t}^{m} (γ) - ▿^{d} {\tilde{X}}_{t} (α, β) ∥ \\ \leq & ∥ ▿^{d} X_{t}^{m} (γ) - ▿^{d} X_{t}^{\infty} (α, β) ∥ + ∥ ▿^{d} X_{t}^{\infty} (α, β) - ▿^{d} {\tilde{X}}_{t} (α, β) ∥ \\ \leq & {(1 - ϵ)}^{\frac{t}{q}} R + \frac{2 R}{T}, \end{matrix}

which is the claimed result. □

Appendix B

In this section, we prove the theorems in Section 4. The required notation is summarized in Appendix C. We apply some important properties of convex functions and their convex conjugate defined on a general vector space, which can be found in [17]. The proposed algorithms are instances of the adaptive optimistic follow the regularized leader (AO-FTRL) [10], which is described in Algorithm A1.

Algorithm A1 AO-FTRL.

Input: closed convex set

W \subseteq X

Initialize:

θ_{1}

arbitrary

for

t = 1

to T do

Get hint

h_{t}

w_{t} = ▿ ψ_{t}^{*} (θ_{t} - h_{t})

Observe

g_{t} \in X_{*}

θ_{t + 1} = θ_{t} - g_{t}

end for
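The loop of Algorithm A1 can be made concrete for one particular regularizer. The sketch below assumes ψ_t(w) = (η_t / 2) ∥w∥_2^2 on the whole space R^n, for which ▿ψ_t^{*}(θ) = θ / η_t; other regularizers only change the line computing w. The function name `ao_ftrl` and the callback `grad_fn` are hypothetical.

```python
import numpy as np

def ao_ftrl(grad_fn, hints, etas, dim):
    """Run AO-FTRL with quadratic regularizers psi_t(w) = eta_t / 2 * ||w||^2.

    grad_fn(w) returns the (sub)gradient g_t observed after playing w.
    hints and etas are sequences of equal length; etas must be positive
    and non-decreasing so that psi_t <= psi_{t+1}.
    """
    theta = np.zeros(dim)                 # theta_1 (chosen as zero here)
    iterates = []
    for h, eta in zip(hints, etas):
        w = (theta - np.asarray(h)) / eta  # w_t = grad psi_t^*(theta_t - h_t)
        g = np.asarray(grad_fn(w))         # observe g_t
        theta = theta - g                  # theta_{t+1} = theta_t - g_t
        iterates.append(w)
    return np.array(iterates)
```

If the hints coincide with the gradients that are about to be observed, the iterate already accounts for the upcoming loss, which is the mechanism behind the Bregman term B_{ψ_t^*}(θ_{t+1}, θ_t - h_t) becoming small.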

Lemma A4.

We run AO-FTRL with closed convex regularizers

ψ_{1}, \dots, ψ_{T}

defined on

W \subseteq X

satisfying

ψ_{t} (w) \leq ψ_{t + 1} (w)

for all

w \in W

and

t = 1, \dots, T

. Then, for all

u \in W

, we have

\sum_{t = 1}^{T} g_{t} (w_{t} - u) \leq ψ_{T + 1} (u) + ψ_{1}^{*} (θ_{1}) + \sum_{t = 1}^{T} B_{ψ_{t}^{*}} (θ_{t + 1}, θ_{t} - h_{t}),

where

B_{ψ_{t}^{*}} (θ_{t + 1}, θ_{t} - h_{t})

is the Bregman divergence associated with

ψ_{t}^{*}

.

Proof.

W.l.o.g. we assume

h_{T + 1} = 0

, since it is not involved in the algorithm. Then we have

\begin{matrix} \sum_{t = 1}^{T} (ψ_{t + 1}^{*} (θ_{t + 1} - h_{t + 1}) - ψ_{t}^{*} (θ_{t} - h_{t})) \\ = & ψ_{T + 1}^{*} (θ_{T + 1} - h_{T + 1}) - (θ_{1} - h_{1}) w_{1} + ψ_{1} (w_{1}) \\ \geq & (θ_{T + 1} - h_{T + 1}) u - ψ_{T + 1} (u) + h_{1} w_{1} - θ_{1} w_{1} + ψ_{1} (w_{1}) \\ \geq & θ_{T + 1} u - ψ_{T + 1} (u) + h_{1} w_{1} - sup_{w \in W} (θ_{1} w - ψ_{1} (w)) \\ = & - \sum_{t = 1}^{T} g_{t} u - ψ_{T + 1} (u) + h_{1} w_{1} - ψ_{1}^{*} (θ_{1}) . \end{matrix}

Furthermore, we have

\begin{matrix} ψ_{t + 1}^{*} (θ_{t + 1} - h_{t + 1}) - ψ_{t}^{*} (θ_{t} - h_{t}) \\ = & ψ_{t + 1}^{*} (θ_{t + 1} - h_{t + 1}) - ψ_{t}^{*} (θ_{t + 1}) + ψ_{t}^{*} (θ_{t + 1}) - ψ_{t}^{*} (θ_{t} - h_{t}) \\ \leq & (θ_{t + 1} - h_{t + 1}) w_{t + 1} - ψ_{t + 1} (w_{t + 1}) - θ_{t + 1} w_{t + 1} + ψ_{t} (w_{t + 1}) + ψ_{t}^{*} (θ_{t + 1}) - ψ_{t}^{*} (θ_{t} - h_{t}) \\ \leq & ψ_{t}^{*} (θ_{t + 1}) - ψ_{t}^{*} (θ_{t} - h_{t}) - h_{t + 1} w_{t + 1} \end{matrix}

Combining the inequalities above, rearranging and adding

\sum_{t = 1}^{T} ⟨ g_{t}, w_{t} ⟩

to both sides, we obtain

\begin{matrix} \sum_{t = 1}^{T} g_{t} (w_{t} - u) \\ \leq & ψ_{T + 1} (u) + ψ_{1}^{*} (θ_{1}) + \sum_{t = 1}^{T} (ψ_{t}^{*} (θ_{t + 1}) - ψ_{t}^{*} (θ_{t} - h_{t}) + g_{t} w_{t} - h_{t} w_{t}) \\ = & ψ_{T + 1} (u) + ψ_{1}^{*} (θ_{1}) + \sum_{t = 1}^{T} (ψ_{t}^{*} (θ_{t + 1}) - ψ_{t}^{*} (θ_{t} - h_{t}) - (θ_{t + 1} - θ_{t} + h_{t}) ▿ ψ_{t}^{*} (θ_{t} - h_{t})) \\ = & ψ_{T + 1} (u) + ψ_{1}^{*} (θ_{1}) + \sum_{t = 1}^{T} B_{ψ_{t}^{*}} (θ_{t + 1}, θ_{t} - h_{t}), \end{matrix}

which is the claimed result. □

Proof of Theorem 1.

First of all, since we have

\begin{matrix} \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t} (γ_{t})) - l_{t} ({\tilde{X}}_{t} (γ)) \leq & \sum_{t = 1}^{T} \sum_{i = 1}^{m} g_{i, t} (γ_{i, t} - γ_{i}) \\ = & \sum_{i = 1}^{m} (\sum_{t = 1}^{T} g_{i, t} (γ_{i, t} - γ_{i})), \end{matrix}

the overall regret can be considered as the sum of the regrets

\sum_{t = 1}^{T} g_{i, t} (γ_{i, t} - γ_{i})

. Next, we analyse the regret of each

i = 1, \dots, m

. Define

ψ_{i, t} (γ_{i}) = \frac{η_{i, t}}{2} {∥ γ_{i} ∥}_{F}^{2}

. It is easy to verify

γ_{i, t} \in \partial ψ_{i, t}^{*} (θ_{i, t})

for

t = 1, \dots, T

. Applying Lemma A4 with

h_{t} = 0

, we obtain

\begin{matrix} \sum_{t = 1}^{T} g_{i, t} (γ_{i, t} - γ_{i}) \leq ψ_{i, T + 1} (γ_{i}) + ψ_{i, 1}^{*} (θ_{i, 1}) + \sum_{t = 1}^{T} B_{ψ_{i, t}^{*}} (θ_{i, t + 1}, θ_{i, t}) . \end{matrix}

From the updating rule of

G_{i, t}

, we have

g_{i, t} = 0

for

G_{i, t} = 0

. Let

t_{0}

be the smallest index such that

G_{i, t_{0}} > 0

. Then we have

\sum_{t = 1}^{T} B_{ψ_{i, t}^{*}} (θ_{i, t + 1}, θ_{i, t}) = \sum_{t = t_{0}}^{T} B_{ψ_{i, t}^{*}} (θ_{i, t + 1}, θ_{i, t}) .

For

G_{i, t} > 0

,

ψ_{i, t}

is

η_{i, t}

-strongly convex with respect to

{∥ \cdot ∥}_{F}

. From the duality of strong convexity and strong smoothness (see Proposition 2 in [17]), we have

\sum_{t = t_{0}}^{T} B_{ψ_{i, t}^{*}} (θ_{i, t + 1}, θ_{i, t}) \leq \sum_{t = t_{0}}^{T} \frac{1}{2 η_{i, t}} {∥ g_{i, t} ∥}_{F}^{2} = \sum_{t = t_{0}}^{T} \frac{{∥ g_{i, t} ∥}_{F}^{2}}{2 \sqrt{\sum_{s = 1}^{t - 1} {∥ g_{i, s} ∥}_{F}^{2} + {(L_{t} G_{i, t})}^{2}}} .

From the definition of Frobenius norm, we have

{∥ g_{i, t} ∥}_{F}^{2} = {∥ h_{t} ▿^{d} X_{t - i}^{⊤} ∥}_{F}^{2} = {∥ h_{t} ∥}_{2}^{2} {∥ ▿^{d} X_{t - i} ∥}_{2}^{2} \leq \frac{{∥ h_{t} ∥}_{2}^{2}}{L_{t}^{2}} L_{t}^{2} G_{i, t}^{2} .

Then, we obtain

\begin{matrix} \sum_{t = t_{0}}^{T} \frac{{∥ g_{i, t} ∥}_{F}^{2}}{2 \sqrt{\sum_{s = 1}^{t - 1} {∥ g_{i, s} ∥}_{F}^{2} + {(L_{t} G_{i, t})}^{2}}} \leq & \sum_{t = t_{0}}^{T} \frac{max {1, \frac{{∥ h_{t} ∥}_{2}}{L_{t}}} {∥ g_{i, t} ∥}_{F}^{2}}{2 \sqrt{\sum_{s = 1}^{t} {∥ g_{i, s} ∥}_{F}^{2}}} \\ \leq & max {1, \frac{{∥ h_{1} ∥}_{2}}{L_{1}}, \dots, \frac{{∥ h_{T} ∥}_{2}}{L_{T}}} \sqrt{\sum_{t = 1}^{T} {∥ g_{i, t} ∥}_{F}^{2}} \\ \leq & (1 + \frac{L_{T + 1}}{L_{1}}) \sqrt{\sum_{t = 1}^{T} {∥ g_{i, t} ∥}_{F}^{2}} \\ \leq & (L_{T + 1} + \frac{L_{T + 1}^{2}}{L_{1}}) \sqrt{\sum_{t = 1}^{T} {∥ ▿^{d} X_{t - i} ∥}_{2}^{2}}, \end{matrix}

where the second inequality uses Lemma 4 in [17] and the last inequality follows from the fact that

{∥ g_{i, t} ∥}_{F} \leq L_{t} {∥ ▿^{d} X_{t - i} ∥}_{2} \leq L_{T + 1} {∥ ▿^{d} X_{t - i} ∥}_{2}

. Furthermore, we have

\begin{matrix} ψ_{i, T + 1} (γ_{i}) \leq & \frac{{∥ γ_{i} ∥}_{F}^{2}}{2} \sqrt{\sum_{t = 1}^{T} {∥ g_{i, t} ∥}_{F}^{2}} + \frac{L_{T + 1} G_{i, T + 1} {∥ γ_{i} ∥}_{F}^{2}}{2} \\ \leq & \frac{{∥ γ_{i} ∥}_{F}^{2} L_{T + 1}}{2} \sqrt{\sum_{t = 1}^{T} {∥ ▿^{d} X_{t - i} ∥}_{2}^{2}} + \frac{L_{T + 1} G_{i, T + 1} {∥ γ_{i} ∥}_{F}^{2}}{2}, \end{matrix}

and

ψ_{i, 1}^{*} (θ_{i, 1}) \leq \frac{{∥ θ_{i, 1} ∥}_{F}}{2}

. Adding up from 1 to m, we have

\begin{matrix} \sum_{t = 1}^{T} l_{t} ({\tilde{X}}_{t} (γ_{t})) - l_{t} ({\tilde{X}}_{t} (γ)) \\ \leq & \sum_{i = 1}^{m} (\frac{{∥ γ_{i} ∥}_{F}^{2} L_{T + 1}}{2} + L_{T + 1} + \frac{L_{T + 1}^{2}}{L_{1}}) \sqrt{\sum_{t = 1}^{T} {∥ ▿^{d} X_{t - i} ∥}_{2}^{2}} \\ + \sum_{i = 1}^{m} \frac{L_{T + 1} G_{i, T + 1} {∥ γ_{i} ∥}_{F}^{2} + {∥ θ_{i, 1} ∥}_{F}}{2} \end{matrix}

□
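The step from the adaptive step sizes to the square-root bound in the chain of inequalities above rests on the elementary inequality \sum_{t} a_t / (2 \sqrt{\sum_{s \leq t} a_s}) \leq \sqrt{\sum_{t} a_t} for non-negative a_t, which is the role played by the cited Lemma 4 in [17]. A quick numerical check, with the hypothetical helper name `adaptive_sum`, could look as follows.

```python
import numpy as np

def adaptive_sum(a):
    """Return sum_t a_t / (2 * sqrt(sum_{s<=t} a_s)), skipping leading zeros."""
    a = np.asarray(a, dtype=float)
    prefix = np.cumsum(a)
    mask = prefix > 0                      # terms with zero prefix contribute nothing
    return float(np.sum(a[mask] / (2.0 * np.sqrt(prefix[mask]))))
```

Here a_t stands for the squared gradient norm {∥ g_{i, t} ∥}_{F}^{2}; skipping the leading zeros mirrors the restriction of the sum to t ≥ t_0 in the proof.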

Proof of Theorem 2.

Define

ψ_{t} (γ) = \frac{λ_{t} {∥ γ ∥}_{F}^{4}}{4} + \frac{η_{t} {∥ γ ∥}_{F}^{2}}{2}

. First of all, it is easy to verify that

γ_{t} \in \partial ψ_{t}^{*} (θ_{t})

. Applying Lemma A4 with

h_{t} = 0

, we have

\begin{matrix} \sum_{t = 1}^{T} {⟨ g_{t} x_{t}^{⊤}, γ_{t} - γ ⟩}_{F} \leq & ψ_{T + 1} (γ) + ψ_{1}^{*} (θ_{1}) + \sum_{t = 1}^{T} B_{ψ_{t}^{*}} (θ_{t + 1}, θ_{t}) . \end{matrix}

(A3)

Define

v_{t} \in \partial ψ_{t}^{*} (θ_{t + 1})

. Then we have

\begin{matrix} B_{ψ_{t}^{*}} (θ_{t + 1}, θ_{t}) = & ψ_{t}^{*} (θ_{t + 1}) - ψ_{t}^{*} (θ_{t}) - {⟨ γ_{t}, θ_{t + 1} - θ_{t} ⟩}_{F} \\ = & {⟨ θ_{t + 1}, v_{t} ⟩}_{F} - ψ_{t} (v_{t}) - {⟨ θ_{t}, γ_{t} ⟩}_{F} + ψ_{t} (γ_{t}) - {⟨ γ_{t}, θ_{t + 1} - θ_{t} ⟩}_{F} \\ = & {⟨ θ_{t + 1}, v_{t} ⟩}_{F} - ψ_{t} (v_{t}) + ψ_{t} (γ_{t}) - {⟨ γ_{t}, θ_{t + 1} ⟩}_{F} \\ = & {⟨ θ_{t + 1}, v_{t} - γ_{t} ⟩}_{F} - ψ_{t} (v_{t}) + ψ_{t} (γ_{t}) \\ = & {⟨ g_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} - ψ_{t} (v_{t}) + ψ_{t} (γ_{t}) + {⟨ θ_{t}, v_{t} - γ_{t} ⟩}_{F} \\ = & {⟨ g_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} - B_{ψ_{t}} (v_{t}, γ_{t}) \\ = & {⟨ γ_{t} x_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} + {⟨ - ▿^{d} X_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} - B_{ψ_{t}} (v_{t}, γ_{t}) \\ = & {⟨ γ_{t} x_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} - B_{\tilde{ψ_{t}}} (v_{t}, γ_{t}) \\ + {⟨ - ▿^{d} X_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} - B_{\bar{ψ_{t}}} (v_{t}, γ_{t}), \end{matrix}

(A4)

where we define

{\tilde{ψ}}_{t} (γ) = \frac{λ_{t}}{4} {∥ γ ∥}_{F}^{4}

and

{\bar{ψ}}_{t} (γ) = \frac{η_{t}}{2} {∥ γ ∥}_{F}^{2}

. From the properties of the Frobenius norm, we have

\begin{matrix} {⟨ γ_{t} x_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} \leq & {∥ γ_{t} x_{t} x_{t}^{⊤} ∥}_{F} {∥ γ_{t} - v_{t} ∥}_{F} \\ \leq & {∥ x_{t} ∥}_{2}^{2} {∥ γ_{t} ∥}_{F} {∥ γ_{t} - v_{t} ∥}_{F} \end{matrix}

Following the idea of [33], we can upper bound

{∥ γ_{t} ∥}_{F}^{2} {∥ γ_{t} - v_{t} ∥}_{F}^{2}

as follows:

\begin{matrix} \frac{λ_{t}}{2} {∥ γ_{t} ∥}_{F}^{2} {∥ γ_{t} - v_{t} ∥}_{F}^{2} \\ = & \frac{λ_{t}}{2} {∥ γ_{t} ∥}_{F}^{2} ({∥ γ_{t} ∥}_{F}^{2} + {∥ v_{t} ∥}_{F}^{2} - 2 {⟨ γ_{t}, v_{t} ⟩}_{F}) \\ \leq & \frac{λ_{t}}{4} ({∥ γ_{t} ∥}_{F}^{4} + {∥ v_{t} ∥}_{F}^{4} - 2 {∥ γ_{t} ∥}_{F}^{2} {∥ v_{t} ∥}_{F}^{2}) + \frac{λ_{t}}{2} {∥ γ_{t} ∥}_{F}^{2} ({∥ γ_{t} ∥}_{F}^{2} + {∥ v_{t} ∥}_{F}^{2} - 2 {⟨ γ_{t}, v_{t} ⟩}_{F}) \\ = & \frac{λ_{t}}{4} {∥ v_{t} ∥}_{F}^{4} + \frac{3 λ_{t}}{4} {∥ γ_{t} ∥}_{F}^{4} - λ_{t} {∥ γ_{t} ∥}_{F}^{2} {⟨ γ_{t}, v_{t} ⟩}_{F} \\ = & \frac{λ_{t}}{4} {∥ v_{t} ∥}_{F}^{4} - \frac{λ_{t}}{4} {∥ γ_{t} ∥}_{F}^{4} + λ_{t} {∥ γ_{t} ∥}_{F}^{2} {⟨ γ_{t}, γ_{t} ⟩}_{F} - λ_{t} {∥ γ_{t} ∥}_{F}^{2} {⟨ γ_{t}, v_{t} ⟩}_{F} \\ = & \frac{λ_{t}}{4} {∥ v_{t} ∥}_{F}^{4} - \frac{λ_{t}}{4} {∥ γ_{t} ∥}_{F}^{4} - λ_{t} {∥ γ_{t} ∥}_{F}^{2} {⟨ γ_{t}, v_{t} - γ_{t} ⟩}_{F} \\ = & B_{\tilde{ψ_{t}}} (v_{t}, γ_{t}) \end{matrix}

Thus, for

λ_{t} \neq 0

, we have

\begin{matrix} {⟨ γ_{t} x_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} - B_{\tilde{ψ_{t}}} (v_{t}, γ_{t}) \leq & 2 \sqrt{\frac{{∥ x_{t} ∥}_{2}^{4}}{2 λ_{t}} B_{\tilde{ψ_{t}}} (v_{t}, γ_{t})} - B_{\tilde{ψ_{t}}} (v_{t}, γ_{t}) \\ \leq & \frac{{∥ x_{t} ∥}_{2}^{4}}{2 λ_{t}}, \end{matrix}

where the second inequality uses the fact that

2 a b - b^{2} \leq a^{2}

. Let

t_{0}

be the smallest index such that

λ_{t_{0}} > 0

. Then we have

\begin{matrix} \sum_{t = 1}^{T} ({⟨ γ_{t} x_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} - B_{\tilde{ψ_{t}}} (v_{t}, γ_{t})) \\ \leq & \sum_{t = t_{0}}^{T} \frac{{∥ x_{t} ∥}_{2}^{4}}{2 λ_{t}} \\ = & \sum_{t = t_{0}}^{T} \frac{{∥ x_{t} ∥}_{2}^{4}}{2 \sqrt{\sum_{s = 1}^{t} {∥ x_{s} ∥}_{2}^{4}}} \\ \leq & \sqrt{\sum_{t = 1}^{T} {∥ x_{t} ∥}_{2}^{4}}, \end{matrix}

(A5)

where the last inequality uses Lemma 4 in [17]. Similarly, let

t_{1}

be the smallest index such that

η_{t_{1}} > 0

. Then we obtain the upper bound

\begin{matrix} \sum_{t = 1}^{T} ({⟨ - ▿^{d} X_{t} x_{t}^{⊤}, γ_{t} - v_{t} ⟩}_{F} - B_{{\bar{ψ}}_{t}} (v_{t}, γ_{t})) \\ \leq & \sum_{t = 1}^{T} ({∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F} {∥ γ_{t} - v_{t} ∥}_{F} - B_{{\bar{ψ}}_{t}} (v_{t}, γ_{t})) \\ \leq & \sum_{t = t_{1}}^{T} (\sqrt{\frac{2 {∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}}{η_{t}} B_{{\bar{ψ}}_{t}} (v_{t}, γ_{t})} - B_{{\bar{ψ}}_{t}} (v_{t}, γ_{t})) \\ \leq & \sum_{t = t_{1}}^{T} (2 \sqrt{\frac{{∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}}{2 η_{t}} B_{{\bar{ψ}}_{t}} (v_{t}, γ_{t})} - B_{{\bar{ψ}}_{t}} (v_{t}, γ_{t})) \\ \leq & \sum_{t = t_{1}}^{T} \frac{{∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}}{2 η_{t}} \\ = & \sum_{t = t_{1}}^{T} \frac{{∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}}{2 \sqrt{\sum_{s = 1}^{t - 1} {∥ ▿^{d} X_{s} x_{s}^{⊤} ∥}_{F}^{2} + L_{t}^{2} {∥ x_{t} ∥}_{2}^{2}}} \\ \leq & max {1, \frac{{∥ ▿^{d} X_{1} x_{1}^{⊤} ∥}_{F}}{G_{1}}, \dots, \frac{{∥ ▿^{d} X_{T} x_{T}^{⊤} ∥}_{F}}{G_{T}}} \sum_{t = t_{1}}^{T} \frac{{∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}}{2 \sqrt{\sum_{s = 1}^{t} {∥ ▿^{d} X_{s} x_{s}^{⊤} ∥}_{F}^{2}}} \\ \leq & max {1, \frac{{∥ ▿^{d} X_{1} x_{1}^{⊤} ∥}_{F}}{G_{1}}, \dots, \frac{{∥ ▿^{d} X_{T} x_{T}^{⊤} ∥}_{F}}{G_{T}}} \sqrt{\sum_{t = 1}^{T} {∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}} \\ \leq & (1 + \frac{G_{T + 1}}{G_{1}}) \sqrt{\sum_{t = 1}^{T} {∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}} \end{matrix}

(A6)

Combining (A3)–(A6), we obtain

\begin{matrix} \sum_{t = 1}^{T} {⟨ g_{t} x_{t}^{⊤}, γ_{t} - γ ⟩}_{F} \leq & \frac{(\sqrt{m} G_{T + 1}^{2} + {∥ θ_{1} ∥}_{F}) {∥ γ ∥}_{F}^{2}}{2} + ψ_{1}^{*} (θ_{1}) + (1 + \frac{{∥ γ ∥}_{F}^{4}}{4}) \sqrt{\sum_{t = 1}^{T} {∥ x_{t} ∥}_{2}^{4}} \\ + (1 + \frac{G_{T + 1}}{G_{1}} + \frac{{∥ γ ∥}_{F}^{2}}{2}) \sqrt{\sum_{t = 1}^{T} {∥ ▿^{d} X_{t} x_{t}^{⊤} ∥}_{F}^{2}} . \end{matrix}

For

θ_{1} \neq 0

, it is easy to verify that

ψ_{1}^{*} (θ_{1}) \leq {⟨ w_{1}, θ_{1} ⟩}_{F} \leq \frac{{∥ θ_{1} ∥}_{F}^{2}}{η_{1}} \leq {∥ θ_{1} ∥}_{F}

. By putting this in the inequality above, we obtain the claimed result. □
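The regularizer in this proof combines a quartic term (λ / 4) ∥γ∥_F^4 and a quadratic term (η / 2) ∥γ∥_F^2, as in the definitions of \tilde{ψ}_t and \bar{ψ}_t above. Its conjugate gradient has no closed form as simple as in the quadratic case, but it reduces to a scalar equation: ▿ψ(γ) = (λ ∥γ∥_F^2 + η) γ = θ, so r = ∥γ∥_F solves λ r^3 + η r = ∥θ∥_F. The following sketch solves this equation by bisection; the function name `grad_conjugate` is hypothetical and this is not claimed to be how the authors' implementation computes the update.

```python
import numpy as np

def grad_conjugate(theta, lam, eta, iters=100):
    """Return gamma with (lam * ||gamma||_F^2 + eta) * gamma = theta."""
    theta = np.asarray(theta, dtype=float)
    s = np.linalg.norm(theta)             # Frobenius norm for matrices
    if s == 0.0:
        return np.zeros_like(theta)
    if lam <= 0.0 and eta <= 0.0:
        raise ValueError("at least one of lam, eta must be positive")
    # Upper bracket: each term alone already bounds the root from above.
    candidates = []
    if eta > 0.0:
        candidates.append(s / eta)
    if lam > 0.0:
        candidates.append((s / lam) ** (1.0 / 3.0))
    lo, hi = 0.0, min(candidates)
    for _ in range(iters):                # bisection on the monotone cubic
        mid = 0.5 * (lo + hi)
        if lam * mid ** 3 + eta * mid < s:
            lo = mid
        else:
            hi = mid
    r = 0.5 * (lo + hi)
    return theta / (lam * r ** 2 + eta)
```

Since the cubic is strictly increasing in r, the root is unique, and the returned γ is the unique maximizer defining ψ^{*}(θ).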

Proof of Theorem 3.

Define

ψ_{t} : Δ \to R, w \mapsto η_{t} \sum_{k \in I_{w}}^{K} w_{k} log w_{k} + η_{t} log K,

where

I_{w} = {i = 1, \dots, K | w_{i} \neq 0}

. It can be verified that

w_{t} \in \partial ψ_{t}^{*} (θ_{t})

. Applying Lemma A4, we obtain

\begin{matrix} \sum_{t = 1}^{T} z_{t}^{⊤} (w_{t} - u) \leq ψ_{T + 1} (u) + ψ_{1}^{*} (θ_{1}) + \sum_{t = 1}^{T} B_{ψ_{t}^{*}} (θ_{t + 1}, θ_{t} - h_{t}) . \end{matrix}

From the definition of

ψ_{t}

, it follows that

ψ_{T + 1} (u) \leq \sqrt{\frac{log K}{2} \sum_{t = 1}^{T} {∥ z_{t} - h_{t} ∥}_{\infty}^{2}}

and

ψ_{1}^{*} (θ_{1}) = 0

hold. Define

v_{t} \in \partial ψ_{t}^{*} (θ_{t + 1})

. Next, we bound the third term as follows:

\begin{matrix} B_{ψ_{t}^{*}} (θ_{t + 1}, θ_{t} - h_{t}) \\ = & ψ_{t}^{*} (θ_{t + 1}) - ψ_{t}^{*} (θ_{t} - h_{t}) - {(h_{t} - z_{t})}^{⊤} w_{t} \\ = & θ_{t + 1}^{⊤} v_{t} - ψ_{t} (v_{t}) - {(θ_{t} - h_{t})}^{⊤} w_{t} + ψ_{t} (w_{t}) - {(h_{t} - z_{t})}^{⊤} w_{t} \\ = & {(h_{t} - z_{t})}^{⊤} (v_{t} - w_{t}) - (ψ_{t} (v_{t}) - ψ_{t} (w_{t}) - {(θ_{t} - h_{t})}^{⊤} (v_{t} - w_{t})) \\ = & {(h_{t} - z_{t})}^{⊤} (v_{t} - w_{t}) - B_{ψ_{t}} (v_{t}, w_{t}) \\ = & {(h_{t} - z_{t})}^{⊤} (v_{t} - w_{t}) - η_{t + 1} {∥ v_{t} - w_{t} ∥}_{1}^{2} + η_{t + 1} {∥ v_{t} - w_{t} ∥}_{1}^{2} - B_{ψ_{t}} (v_{t}, w_{t}) \\ \leq & {(h_{t} - z_{t})}^{⊤} (v_{t} - w_{t}) - η_{t + 1} {∥ v_{t} - w_{t} ∥}_{1}^{2} + (η_{t + 1} - η_{t}) {∥ v_{t} - w_{t} ∥}_{1}^{2} \\ \leq & {∥ h_{t} - z_{t} ∥}_{\infty} {∥ v_{t} - w_{t} ∥}_{1} - η_{t + 1} {∥ v_{t} - w_{t} ∥}_{1}^{2} + 4 (η_{t + 1} - η_{t}) \\ \leq & \frac{{∥ h_{t} - z_{t} ∥}_{\infty}^{2}}{4 η_{t + 1}} + 4 (η_{t + 1} - η_{t}), \end{matrix}

where the first inequality uses the fact that

ψ_{t}

is

2 η_{t}

strongly convex w.r.t.

{∥ \cdot ∥}_{1}

. Adding up from 1 to T, we have

\begin{matrix} \sum_{t = 1}^{T} B_{ψ_{t}^{*}} (θ_{t + 1}, θ_{t} - h_{t}) \leq \sum_{t = 1}^{T} (\frac{{∥ h_{t} - z_{t} ∥}_{\infty}^{2}}{4 η_{t + 1}} + 4 (η_{t + 1} - η_{t})) \\ \leq \sqrt{\frac{log K}{2} \sum_{t = 1}^{T} {∥ h_{t} - z_{t} ∥}_{\infty}^{2}} + 4 η_{T + 1} \\ \leq \sqrt{\frac{log K}{2} \sum_{t = 1}^{T} {∥ h_{t} - z_{t} ∥}_{\infty}^{2}} + \sqrt{\frac{8}{log K} \sum_{t = 1}^{T} {∥ h_{t} - z_{t} ∥}_{\infty}^{2}} . \end{matrix}

Combining the inequalities, we obtain

\begin{matrix} \sum_{t = 1}^{T} l (X_{t}, \sum_{i = 1}^{K} w_{i, t} \tilde{X_{t}^{i}}) - \sum_{t = 1}^{T} l (X_{t}, \tilde{X_{t}^{k}}) \\ \leq & \sum_{t = 1}^{T} \sum_{i = 1}^{K} w_{i, t} l (X_{t}, \tilde{X_{t}^{i}}) - \sum_{t = 1}^{T} l (X_{t}, \tilde{X_{t}^{k}}) \\ = & \sum_{t = 1}^{T} w_{t}^{⊤} z_{t} - \sum_{t = 1}^{T} l (X_{t}, \tilde{X_{t}^{k}}) \\ \leq & (\sqrt{2 log K} + \sqrt{\frac{8}{log K}}) \sqrt{\sum_{t = 1}^{T} {∥ h_{t} - z_{t} ∥}_{\infty}^{2}}, \end{matrix}

where the first inequality follows from Jensen’s inequality. Furthermore, if l is L-Lipschitz in its first argument, then we have

{∥ h_{t} - z_{t} ∥}_{\infty} = max_{i \in {1, \dots, K}} | z_{i, t} - h_{i, t} | \leq L {∥ ▿^{d} X_{t} ∥}_{2} .

Finally, we obtain the regret upper bound

\begin{matrix} \sum_{t = 1}^{T} l (X_{t}, \sum_{i = 1}^{K} w_{i, t} \tilde{X_{t}^{i}}) - \sum_{t = 1}^{T} l (X_{t}, \tilde{X_{t}^{k}}) \leq (\sqrt{2 log K} + \sqrt{\frac{8}{log K}}) \sqrt{\sum_{t = 1}^{T} L^{2} {∥ ▿^{d} X_{t} ∥}_{2}^{2}}, \end{matrix}

which is the claimed result. □
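The update analysed in this proof can be sketched in a few lines of Python. The sketch is a reading of the proof and not a verbatim copy of the paper's algorithm: it assumes θ_{1} = 0, θ_{t + 1} = θ_{t} - z_{t}, weights proportional to exp((θ_{t} - h_{t}) / η_{t}), and the learning rate η_{t + 1} = \sqrt{\sum_{s \leq t} {∥ h_{s} - z_{s} ∥}_{\infty}^{2} / (2 log K)}, which is the value consistent with the last two steps of the summed bound. The function name, the random losses, and the zero hints are hypothetical.

```python
import numpy as np


def adaptive_optimistic_hedge(losses: np.ndarray, hints: np.ndarray) -> np.ndarray:
    """Return the (T, K) weights played against loss vectors z_t with hints h_t.

    Assumes K >= 2 so that log K > 0.
    """
    T, K = losses.shape
    theta = np.zeros(K)   # theta_1 = 0, hence psi_1^*(theta_1) = 0
    sq_sum = 0.0          # running sum of ||h_s - z_s||_inf^2
    weights = np.empty((T, K))
    for t in range(T):
        eta = np.sqrt(sq_sum / (2.0 * np.log(K)))
        score = theta - hints[t]
        if eta > 0.0:
            logits = (score - score.max()) / eta   # shifted for numerical stability
            w = np.exp(logits)
            w /= w.sum()
        else:
            # eta = 0: limit of the softmax, uniform mass on the maximisers.
            top = score == score.max()
            w = top.astype(float) / top.sum()
        weights[t] = w
        sq_sum += np.max(np.abs(hints[t] - losses[t])) ** 2
        theta = theta - losses[t]                  # theta_{t+1} = theta_t - z_t
    return weights


rng = np.random.default_rng(0)
T, K = 500, 5
losses = rng.uniform(0.0, 1.0, size=(T, K))   # hypothetical loss vectors z_t
hints = np.zeros((T, K))                      # no optimism: h_t = 0
weights = adaptive_optimistic_hedge(losses, hints)

# Linearised regret against the best single model, and the bound of the proof.
regret = np.sum(weights * losses) - losses.sum(axis=0).min()
bound = (np.sqrt(2 * np.log(K)) + np.sqrt(8 / np.log(K))) * np.sqrt(
    np.sum(np.max(np.abs(hints - losses), axis=1) ** 2)
)
print(f"regret = {regret:.2f}, bound = {bound:.2f}")
```

No learning rate is passed in: the step size is computed from the observed gaps between hints and losses, which is what makes the master algorithm tuning-free.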

Appendix C

We summarize the main notation used throughout the article in Table A1.

Table A1. Nomenclature.

$(X, ∥ \cdot ∥)$	finite-dimensional normed space
$(X_{*}, {∥ \cdot ∥}_{*})$	the dual space of $(X, ∥ \cdot ∥)$, equipped with the dual norm
$L (X, X)$	vector space of bounded linear operators
${∥ α ∥}_{op} = {sup}_{x \in X, x \neq 0} \frac{∥ α x ∥}{∥ x ∥}$	the operator norm of $α \in L (X, X)$
${∥ x ∥}_{2} = \sqrt{\sum_{i = 1}^{d} x_{i}^{2}}$	2-norm for $x \in R^{d}$
${∥ x ∥}_{1} = \sum_{i = 1}^{d} | x_{i} |$	1-norm for $x \in R^{d}$
${∥ x ∥}_{\infty} = max {| x_{1} |, \dots, | x_{d} |}$	max norm for $x \in R^{d}$
${⟨ A, B ⟩}_{F} = tr (A^{⊤} B)$	Frobenius inner product
${∥ A ∥}_{F} = \sqrt{{⟨ A, A ⟩}_{F}}$	Frobenius norm
$Δ^{d} : {x \in R^{d} | \sum_{i = 1}^{d} x_{i} = 1, x_{i} \geq 0}$	standard d-simplex
$ψ : W \to R$	closed convex function
$\partial ψ (w) = {g \in X_{*} | \forall v \in W . ψ (v) - ψ (w) \geq g (v - w)}$	the subdifferential of $ψ$ at w
$ψ^{*} : X_{*} \to R, θ \mapsto {sup}_{w \in W} θ w - ψ (w)$	convex conjugate of $ψ$
$B_{ψ} (u, v) = ψ (u) - ψ (v) - g (u - v)$, where $g \in \partial ψ (v)$	the Bregman divergence
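As a concrete instance of the last rows, consider the entropic regularizer ψ (w) = η \sum_{k} w_{k} log w_{k} + η log K on the simplex, which is the form used in the proof of Theorem 3. The Python sketch below evaluates its convex conjugate in closed form, recovers the maximizer as a softmax, and computes the Bregman divergence with the subgradient taken at the second argument, as in that proof. The function names and the numerical values of η, θ, and v are illustrative assumptions.

```python
import numpy as np


def psi(w, eta):
    """psi(w) = eta * sum_k w_k log w_k + eta * log K, with 0 log 0 := 0."""
    w = np.asarray(w, dtype=float)
    nz = w > 0
    return eta * np.sum(w[nz] * np.log(w[nz])) + eta * np.log(w.size)


def psi_conj(theta, eta):
    """Closed form of sup over the simplex of <theta, w> - psi(w)."""
    theta = np.asarray(theta, dtype=float)
    m = theta.max()  # shift for numerical stability
    return m + eta * np.log(np.mean(np.exp((theta - m) / eta)))


def grad_psi_conj(theta, eta):
    """The maximiser in the supremum: softmax(theta / eta)."""
    theta = np.asarray(theta, dtype=float)
    e = np.exp((theta - theta.max()) / eta)
    return e / e.sum()


def bregman(v, w, eta):
    """B_psi(v, w) using g = eta * (log w + 1), a subgradient of psi at w > 0."""
    g = eta * (np.log(w) + 1.0)
    return psi(v, eta) - psi(w, eta) - g @ (v - w)


eta = 0.7
theta = np.array([0.3, -1.2, 0.5, 0.0])
v = np.array([0.1, 0.2, 0.3, 0.4])
w_star = grad_psi_conj(theta, eta)

# Fenchel-Young: equality at the maximiser, inequality at any other point.
gap_at_max = psi_conj(theta, eta) - (theta @ w_star - psi(w_star, eta))
gap_at_v = psi_conj(theta, eta) - (theta @ v - psi(v, eta))

# For this regularizer the Bregman divergence is eta times the KL divergence.
kl = np.sum(v * np.log(v / w_star))
print(abs(gap_at_max) < 1e-12, gap_at_v >= 0.0,
      np.isclose(bregman(v, w_star, eta), eta * kl))
```

Note that psi_conj returns 0 at θ = 0, in line with the identity ψ_{1}^{*} (θ_{1}) = 0 used in the proof of Theorem 3.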

Appendix D

For the synthetic data, the relative performance of the proposed algorithms after the first 1000 iterations is plotted in Figure A1, Figure A2, and Figure A3. For each setting, we calculate the average loss after the first 1000 iterations and plot the difference between the average loss of each proposed algorithm and the average loss incurred by the best baseline algorithm.

Figure A1. Relative performance for setting 1.

Figure A2. Relative performance for setting 2.

Figure A3. Relative performance for setting 3.

Similarly, we plot the relative performance for the real-world data over the time horizon in Figure A4, Figure A5 and Figure A6.

Figure A4. Relative performance for stock data.

Figure A5. Relative performance for Google Flu.

Figure A6. Relative performance for electricity demand.
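The quantity shown in these figures can be computed along the following lines. This Python sketch rests on one possible reading of the description above: the running average of the per-step losses is started after a burn-in of 1000 iterations, and the best baseline is the one with the lowest average loss over the whole remaining window. The function name, the dictionary layout, and the constant toy losses are hypothetical.

```python
import numpy as np


def relative_performance(proposed, baselines, burn_in=1000):
    """Running average loss after `burn_in` steps, minus that of the best baseline.

    proposed:  dict name -> 1-D array of per-step losses (length T > burn_in)
    baselines: dict name -> 1-D array of per-step losses (same length)
    Returns (dict name -> difference curve, name of the best baseline).
    """
    def running_avg(losses):
        tail = np.asarray(losses, dtype=float)[burn_in:]
        return np.cumsum(tail) / np.arange(1, tail.size + 1)

    base_avg = {name: running_avg(l) for name, l in baselines.items()}
    # Assumption: "best" means lowest average loss at the end of the horizon.
    best = min(base_avg, key=lambda name: base_avg[name][-1])
    rel = {name: running_avg(l) - base_avg[best] for name, l in proposed.items()}
    return rel, best


# Toy data: constant losses, so the expected curves are easy to read off.
T = 1200
proposed = {"algo": np.full(T, 1.0)}
baselines = {"A": np.full(T, 2.0), "B": np.full(T, 3.0)}
rel, best = relative_performance(proposed, baselines)
print(best, rel["algo"][:3])
```

Negative values of a curve mean that the proposed algorithm incurs a lower average loss than the best baseline under this reading.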

References

1. Shumway, R.; Stoffer, D. Time Series Analysis and Its Applications: With R Examples; Springer Texts in Statistics; Springer: New York, NY, USA, 2010.
2. Chujai, P.; Kerdprasop, N.; Kerdprasop, K. Time series analysis of household electric consumption with ARIMA and ARMA models. In Proceedings of the International MultiConference of Engineers and Computer Scientists, Hong Kong, China, 13–15 March 2013; Volume 1, pp. 295–300.
3. Ghofrani, M.; Arabali, A.; Etezadi-Amoli, M.; Fadali, M.S. Smart scheduling and cost-benefit analysis of grid-enabled electric vehicles for wind power integration. IEEE Trans. Smart Grid 2014, 5, 2306–2313.
4. Rounaghi, M.M.; Zadeh, F.N. Investigation of market efficiency and financial stability between S&P 500 and London stock exchange: Monthly and yearly forecasting of time series stock returns using ARMA model. Phys. A Stat. Mech. Its Appl. 2016, 456, 10–21.
5. Zhu, B.; Chevallier, J. Carbon price forecasting with a hybrid Arima and least squares support vector machines methodology. In Pricing and Forecasting Carbon Markets; Springer: Berlin/Heidelberg, Germany, 2017; pp. 87–107.
6. Anava, O.; Hazan, E.; Mannor, S.; Shamir, O. Online learning for time series prediction. In Proceedings of the Conference on Learning Theory, Princeton, NJ, USA, 23–26 June 2013; pp. 172–184.
7. Liu, C.; Hoi, S.C.; Zhao, P.; Sun, J. Online ARIMA algorithms for time series prediction. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1867–1873.
8. Xie, C.; Bijral, A.; Ferres, J.L. Nonstop: A nonstationary online prediction method for time series. IEEE Signal Process. Lett. 2018, 25, 1545–1549.
9. Yang, H.; Pan, Z.; Tao, Q.; Qiu, J. Online learning for vector autoregressive moving-average time series prediction. Neurocomputing 2018, 315, 9–17.
10. Joulani, P.; György, A.; Szepesvári, C. A modular analysis of adaptive (non-) convex optimization: Optimism, composite objectives, variance reduction, and variational bounds. Theor. Comput. Sci. 2020, 808, 108–138.
11. Zhou, Y.; Sanches Portella, V.; Schmidt, M.; Harvey, N. Regret Bounds without Lipschitz Continuity: Online Learning with Relative-Lipschitz Losses. Adv. Neural Inf. Process. Syst. 2020, 33, 15823–15833.
12. Jamil, W.; Bouchachia, A. Model selection in online learning for times series forecasting. In UK Workshop on Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2018; pp. 83–95.
13. Jamil, W.; Kalnishkan, Y.; Bouchachia, H. Aggregation Algorithm vs. Average For Time Series Prediction. In Proceedings of the ECML PKDD 2016 Workshop on Large-Scale Learning from Data Streams in Evolving Environments, Riva del Garda, Italy, 23 September 2016; pp. 1–14.
14. Orabona, F.; Pál, D. Coin betting and parameter-free online learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 4–9 December 2016; pp. 577–585.
15. Cutkosky, A.; Orabona, F. Black-box reductions for parameter-free online learning in banach spaces. In Proceedings of the Conference on Learning Theory, Stockholm, Sweden, 6–9 July 2018; pp. 1493–1529.
16. Cutkosky, A.; Boahen, K. Online learning without prior information. In Proceedings of the Conference on Learning Theory, Amsterdam, The Netherlands, 7–10 July 2017; pp. 643–677.
17. Orabona, F.; Pál, D. Scale-free online learning. Theor. Comput. Sci. 2018, 716, 50–69.
18. Hamilton, J.D. Time Series Analysis; Princeton University Press: Princeton, NJ, USA, 1994; Volume 2.
19. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015.
20. Brockwell, P.J.; Davis, R.A. Time Series: Theory and Methods; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
21. Georgiou, T.T.; Lindquist, A. A convex optimization approach to ARMA modeling. IEEE Trans. Autom. Control 2008, 53, 1108–1119.
22. Lii, K.S. Identification and estimation of non-Gaussian ARMA processes. IEEE Trans. Acoust. Speech Signal Process. 1990, 38, 1266–1276.
23. Huang, S.J.; Shih, K.R. Short-term load forecasting via ARMA model identification including non-Gaussian process considerations. IEEE Trans. Power Syst. 2003, 18, 673–679.
24. Ding, F.; Shi, Y.; Chen, T. Performance analysis of estimation algorithms of nonstationary ARMA processes. IEEE Trans. Signal Process. 2006, 54, 1041–1053.
25. Yang, H.; Pan, Z.; Tao, Q. Online Learning for Time Series Prediction of AR Model with Missing Data. Neural Process. Lett. 2019, 50, 2247–2263.
26. Ding, J.; Noshad, M.; Tarokh, V. Order selection of autoregressive processes using bridge criterion. In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, NJ, USA, 14–17 November 2015; pp. 615–622.
27. Lütkepohl, H. New Introduction to Multiple Time Series Analysis; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005.
28. Steinhardt, J.; Liang, P. Adaptivity and optimism: An improved exponentiated gradient algorithm. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 1593–1601.
29. De Rooij, S.; Van Erven, T.; Grünwald, P.D.; Koolen, W.M. Follow the leader if you can, hedge if you must. J. Mach. Learn. Res. 2014, 15, 1281–1316.
30. Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. J. Econom. 1986, 31, 307–327.
31. Deng, Y.; Fan, H.; Wu, S. A hybrid ARIMA-LSTM model optimized by BP in the forecast of outpatient visits. J. Ambient. Intell. Humaniz. Comput. 2020.
32. Tutun, S.; Chou, C.A.; Canıyılmaz, E. A new forecasting framework for volatile behavior in net electricity consumption: A case study in Turkey. Energy 2015, 93, 2406–2422.
33. Lu, H. “Relative Continuity” for Non-Lipschitz Nonsmooth Convex Optimization Using Stochastic (or Deterministic) Mirror Descent. Informs J. Optim. 2019, 1, 288–303.

Figure 1. Results for setting 1 (sanity check), using a stationary ARIMA(5,2,1) model.

Figure 2. Results for setting 2 (time-varying parameters), using a non-stationary ARIMA(5,2,1) model.

Figure 3. Results for setting 3 (time-varying models), using a combination of stationary ARIMA(5,2,1) and ARIMA(5,2,0) models.

Figure 4. Results for stock data.

Figure 5. Results for Google Flu data.

Figure 6. Results for electricity demand data.

Figure 7. Model selection in setting 1.

Figure 8. Model selection in setting 2.

Figure 9. Model selection in setting 3.

Figure 10. Model selection for stock data.

Figure 11. Model selection for Google Flu.

Figure 12. Model selection for electricity demand.

Table 1. Algorithms for online learning of ARIMA.

Problem	Algorithm	Reference	Tuning-Free	Loss Function	Regret Dependence
OL for ARIMA	OGD	[6,7,8,9]	✗	any	largest gradient norm
OL for ARIMA	ONS	[6,7,8,9]	✗	exp-concave	largest gradient norm
PF-OCO	Coin Betting	[14,15]	✔	normalized gradient	gradient vectors
PF-OCO	FreeRex	[16]	✔	any	largest gradient norm
PF-OCO	SF-MD	[17]	✗	any	gradient vectors
PF-OCO	SOLO-FTRL	[17]	✔	any	largest gradient norm
OL for ARIMA	Algorithm 1	This Paper	✔	Lipschitz	data sequence
OL for ARIMA	Algorithm 2	This Paper	✔	squared error	data sequence
OMS for ARIMA	EG	[12,13]	✗	bounded	loss of the worst model
OMS for ARIMA	Algorithm 3	This Paper	✔	local Lipschitz	data sequence

For loss functions that are not Lipschitz continuous, the gradient norm can be unbounded, so algorithms whose performance depends on the gradient norm can fail unless further assumptions are made on the data generation. For OGD, the learning rate and the diameter of the decision set need to be tuned in practice. ONS has an additional hyperparameter controlling the numerical stability. When SF-MD is applied to ARIMA, the diameter of the model parameter set has to be tuned. To obtain optimal performance, the learning rate of EG has to be tuned.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
