# Adaptive Online Learning for the Autoregressive Integrated Moving Average Models


## Abstract


## 1. Introduction

**Algorithm 1** ARIMA-AdaFTRL.

**Input:** $L_1>0$. Initialize $\theta_{i,1}$ arbitrarily, $\eta_{i,1}=0$, $G_{i,0}=0$ for $i=1,\dots,m$.

- **for** $t=1$ **to** $T$ **do**
    - **for** $i=1$ **to** $m$ **do**
        - $G_{i,t}=\max\{G_{i,t-1},\Vert\nabla^d X_{t-i}\Vert_2\}$
        - $\eta_{i,t}=\Vert\theta_{i,1}\Vert_F+\sqrt{\sum_{s=1}^{t-1}\Vert g_{i,s}\Vert_F^2+(L_t G_{i,t})^2}$
        - **if** $\eta_{i,t}\neq 0$ **then** $\gamma_{i,t}=\frac{\theta_{i,t}}{\eta_{i,t}}$ **else** $\gamma_{i,t}=0$ **end if**
    - **end for**
    - Play $\tilde{X}_t(\gamma_t)$
    - Observe $X_t$ and $g_t\in\partial l_t(\tilde{X}_t(\gamma_t))$
    - $L_{t+1}=\max\{L_t,\Vert g_t\Vert_2\}$
    - **for** $i=1$ **to** $m$ **do**
        - $g_{i,t}=g_t(\nabla^d X_{t-i})^\top$
        - $\theta_{i,t+1}=\theta_{i,t}-g_{i,t}$
    - **end for**
- **end for**
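The loop above can be sketched in NumPy. This is a minimal illustration of Algorithm 1, not the authors' implementation: the class name, shapes (a $k$-dimensional differenced series with $m$ lag matrices), and the zero initialization of $\theta$ (which makes the $\Vert\theta_{i,1}\Vert_F$ term vanish) are our assumptions.

```python
import numpy as np

class ARIMAAdaFTRL:
    """Sketch of Algorithm 1 (ARIMA-AdaFTRL): m coefficient matrices
    gamma_i map differenced lag vectors to the next forecast."""

    def __init__(self, m, k, L1=1.0):
        self.m, self.k = m, k
        self.L = L1                                        # running Lipschitz bound L_t
        self.theta = [np.zeros((k, k)) for _ in range(m)]  # dual iterates theta_{i,t}
        self.G = np.zeros(m)                               # running data maxima G_{i,t}
        self.sq_grad = np.zeros(m)                         # sum_s ||g_{i,s}||_F^2

    def predict(self, lags):
        """lags[i]: the differenced observation at lag i+1."""
        self.lags = lags
        pred = np.zeros(self.k)
        for i in range(self.m):
            self.G[i] = max(self.G[i], np.linalg.norm(lags[i]))
            # theta_{i,1} = 0 here, so the ||theta_{i,1}||_F term vanishes
            eta = np.sqrt(self.sq_grad[i] + (self.L * self.G[i]) ** 2)
            gamma = self.theta[i] / eta if eta > 0 else np.zeros((self.k, self.k))
            pred += gamma @ lags[i]
        return pred

    def update(self, g_t):
        """g_t: (sub)gradient of the loss at the played prediction."""
        self.L = max(self.L, np.linalg.norm(g_t))
        for i in range(self.m):
            g_i = np.outer(g_t, self.lags[i])              # g_t (lag_i)^T
            self.sq_grad[i] += np.linalg.norm(g_i) ** 2
            self.theta[i] = self.theta[i] - g_i
```

Note that no step size is supplied: the per-lag scaling $\eta_{i,t}$ is built from observed gradients and data norms, which is what makes the method tuning-free.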

**Algorithm 2** ARIMA-AdaFTRL-Poly.

**Input:** $G_0>0$. Initialize $\theta_1$ arbitrarily, $G_1=\max\{G_0,\Vert\nabla^d X_0\Vert_2,\dots,\Vert\nabla^d X_{-m+1}\Vert_2\}$.

- **for** $t=1$ **to** $T$ **do**
    - $\eta_t=\Vert\theta_1\Vert_F+\sqrt{\sum_{s=1}^{t-1}\Vert\nabla^d X_s x_s^\top\Vert_F^2+(G_t\Vert x_t\Vert_2)^2}$
    - $\lambda_t=\sqrt{\sum_{s=1}^{t}\Vert x_s\Vert_2^4}$
    - **if** $\Vert\theta_t\Vert_F\neq 0$ **then**
        - Select $c\geq 0$ satisfying $\lambda_t c^3+\eta_t c=\Vert\theta_t\Vert_F$
        - $\gamma_t=\frac{c\,\theta_t}{\Vert\theta_t\Vert_F}$
    - **else** $\gamma_t=0$ **end if**
    - Play $\tilde{X}_t(\gamma_t)$
    - Observe $X_t$ and $g_t=\gamma_t x_t-\nabla^d X_t$
    - $G_{t+1}=\max\{G_t,\Vert\nabla^d X_t\Vert_2\}$
    - $\theta_{t+1}=\theta_t-g_t x_t^\top$
- **end for**
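The only nonstandard step in Algorithm 2 is selecting $c\geq 0$ with $\lambda_t c^3+\eta_t c=\Vert\theta_t\Vert_F$. Since the left-hand side is strictly increasing in $c$ (for $\lambda_t,\eta_t$ not both zero), the root is unique and a simple bisection finds it; the helper below is our sketch, with the function name and tolerance chosen for illustration.

```python
def poly_scaling(theta_norm, lam, eta):
    """Return the unique c >= 0 with lam*c**3 + eta*c = theta_norm.
    The left-hand side is strictly increasing in c, so a bracketing
    bisection is enough."""
    if theta_norm == 0.0:
        return 0.0
    if lam == 0.0 and eta == 0.0:
        raise ValueError("lam and eta cannot both be zero when theta_norm > 0")
    f = lambda c: lam * c ** 3 + eta * c
    hi = 1.0
    while f(hi) < theta_norm:      # grow the bracket until it contains the root
        hi *= 2.0
    lo = 0.0
    for _ in range(80):            # bisect to roughly machine precision
        mid = 0.5 * (lo + hi)
        if f(mid) < theta_norm:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

A closed-form solution via Cardano's formula is also possible; bisection is used here only because it is short and numerically robust.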

**Algorithm 3** ARIMA-AO-Hedge.

**Input:** predictors $\mathcal{A}_1,\dots,\mathcal{A}_K$, $d$. Initialize $\theta_{i,1}=0$ for $i=1,\dots,K$ and $\eta_1=0$.

- **for** $t=1$ **to** $T$ **do**
    - Get prediction $\tilde{X}_t^i$ from $\mathcal{A}_i$ for $i=1,\dots,K$
    - Set $Y_t=\sum_{i=0}^{d-1}\nabla^i X_{t-1}$
    - Set $h_{i,t}=l(Y_t,\tilde{X}_t^i)$ for $i=1,\dots,K$
    - **if** $\eta_t=0$ **then** set $w_{i,t}=1$ for some $i\in\arg\max_{j\in\{1,\dots,K\}}(\theta_{j,t}-h_{j,t})$ and $w_{j,t}=0$ for $j\neq i$
    - **else** set $w_{i,t}=\frac{\exp\left(\eta_t^{-1}(\theta_{i,t}-h_{i,t})\right)}{\sum_{j=1}^{K}\exp\left(\eta_t^{-1}(\theta_{j,t}-h_{j,t})\right)}$ for $i=1,\dots,K$
    - **end if**
    - Predict $\tilde{X}_t=\sum_{i=1}^{K}w_{i,t}\tilde{X}_t^i$
    - Observe $X_t$, update each $\mathcal{A}_i$, and set $z_{i,t}=l(X_t,\tilde{X}_t^i)$ for $i=1,\dots,K$
    - $\theta_{t+1}=\theta_t-z_t$
    - $\eta_{t+1}=\sqrt{\frac{1}{2\log K}\sum_{s=1}^{t}\Vert h_s-z_s\Vert_\infty^2}$
- **end for**
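The weight computation in Algorithm 3 is a softmax of the hint-corrected cumulative scores $\theta_{i,t}-h_{i,t}$, with a degenerate one-hot case when $\eta_t=0$. A minimal sketch, assuming the scores are plain NumPy vectors (the function name is ours):

```python
import numpy as np

def hedge_weights(theta, hints, eta):
    """Optimistic Hedge weights: softmax of (theta_i - h_i)/eta;
    falls back to a single expert when eta == 0."""
    s = theta - hints
    if eta == 0.0:
        w = np.zeros_like(s)
        w[np.argmax(s)] = 1.0          # follow the (hinted) leader
        return w
    z = s / eta
    z -= z.max()                       # shift before exp for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

The max-shift before exponentiation does not change the resulting distribution but prevents overflow when $\eta_t$ is small relative to the score gaps.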

## 2. Related Work

## 3. Preliminary and Learning Model

**Assumption 1.**

**Assumption 2.**

**Lemma 1.**

**Assumption 3.**

**Corollary 1.**

**Proof.**

## 4. Algorithms and Analysis

**Algorithm 4** Two-level framework.

**Input:** $K$ instances of the slave algorithm $\mathcal{A}_1,\dots,\mathcal{A}_K$; an instance of the master algorithm $\mathcal{M}$.

- **for** $t=1$ **to** $T$ **do**
    - Get $\tilde{X}_t^i$ from each $\mathcal{A}_i$
    - Get $w_t\in\Delta^K$ from $\mathcal{M}$ ▹ $\Delta^K$ is the standard $K$-simplex
    - Integrate the prediction: $\tilde{X}_t=\sum_{i=1}^{K}w_t^i\tilde{X}_t^i$
    - Observe $X_t$
    - Define $z_t\in\mathbb{R}^K$ with $z_{i,t}=l_t(\tilde{X}_t^i)$
    - Update $\mathcal{A}_i$ using $z_{i,t}$ for $i=1,\dots,K$
    - Update $\mathcal{M}$ using $z_t$
- **end for**
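One round of the two-level framework can be sketched as below. The toy `ConstantSlave` and `HedgeMaster` classes are our stand-ins (a real slave would be one of the parameter-free ARIMA learners, and the master one of the model-selection algorithms); only the round structure mirrors Algorithm 4.

```python
import numpy as np

class ConstantSlave:
    """Toy slave that always predicts a fixed value."""
    def __init__(self, c):
        self.c = c
    def predict(self):
        return self.c
    def update(self, loss):
        pass                           # a real slave would learn here

class HedgeMaster:
    """Toy master: multiplicative weights over K slaves."""
    def __init__(self, K, eta=0.5):
        self.w, self.eta = np.ones(K) / K, eta
    def weights(self):
        return self.w
    def update(self, losses):
        self.w = self.w * np.exp(-self.eta * losses)
        self.w /= self.w.sum()

def two_level_round(slaves, master, x_t):
    """One round of Algorithm 4 under squared loss."""
    preds = np.array([s.predict() for s in slaves])
    w = master.weights()               # point in the K-simplex
    x_hat = w @ preds                  # integrated prediction
    losses = (preds - x_t) ** 2        # z_{i,t}
    for s, z in zip(slaves, losses):
        s.update(z)
    master.update(losses)
    return x_hat
```

Running this with two constant slaves and a stream of ones shifts the master's weight toward the better slave, so the aggregated forecast converges to the accurate expert.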

**Corollary 2.**

### 4.1. Parameter-Free Online Learning Algorithms

#### 4.1.1. Algorithms for Lipschitz Loss

**Theorem 1.**

#### 4.1.2. Algorithms for Squared Errors

**Theorem 2.**

### 4.2. Online Model Selection Using Master Algorithms

**Theorem 3.**

## 5. Experiments and Results

### 5.1. Experiment Settings

### 5.2. Experiments for the Slave Algorithms

### 5.3. Experiments for Online Model Selection

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A

**Lemma A1.**

**Proof.**

**Lemma A2.**

Combining Lemmas **A1–A2**, we have

**Proof.**

**Lemma A3.**

**Proof.**

**Proof of Lemma 1.**

## Appendix B

**Algorithm A1** AO-FTRL.

**Input:** closed convex set $\mathcal{W}\subseteq\mathbb{X}$. Initialize $\theta_1$ arbitrarily.

- **for** $t=1$ **to** $T$ **do**
    - Get hint $h_t$
    - $w_t=\nabla\psi_t^*(\theta_t-h_t)$
    - Observe $g_t\in\mathbb{X}_*$
    - $\theta_{t+1}=\theta_t-g_t$
- **end for**
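For intuition, with the quadratic regularizer $\psi_t(w)=\frac{\eta_t}{2}\Vert w\Vert_2^2$ on $\mathcal{W}=\mathbb{R}^d$, the conjugate gradient is $\nabla\psi_t^*(u)=u/\eta_t$, so the AO-FTRL step reduces to a scaled, hint-shifted dual average. This one-liner is our illustration of that special case, not a general implementation:

```python
import numpy as np

def ao_ftrl_step(theta, hint, eta):
    """AO-FTRL prediction for psi_t(w) = (eta/2)*||w||_2^2, whose
    conjugate satisfies grad psi_t^*(u) = u / eta."""
    return (theta - hint) / eta if eta > 0 else np.zeros_like(theta)
```

Algorithm 1 is recovered from this template by choosing $\eta_t$ adaptively from past gradient and data norms.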

**Lemma A4.**

**Proof.**

**Proof of Theorem 1.**

**Proof of Theorem 2.**

**Proof.**

## Appendix C

| Notation | Description |
|---|---|
| $(\mathbb{X},\Vert\cdot\Vert)$ | finite-dimensional normed space |
| $(\mathbb{X}_*,\Vert\cdot\Vert_*)$ | the dual space with the dual norm of $(\mathbb{X},\Vert\cdot\Vert)$ |
| $\mathcal{L}(\mathbb{X},\mathbb{X})$ | vector space of bounded linear operators |
| $\Vert\alpha\Vert_{op}=\sup_{x\in\mathbb{X},x\neq 0}\frac{\Vert\alpha x\Vert}{\Vert x\Vert}$ | the operator norm of $\alpha\in\mathcal{L}(\mathbb{X},\mathbb{X})$ |
| $\Vert x\Vert_2=\sqrt{\sum_{i=1}^{d}x_i^2}$ | $\ell_2$ norm for $x\in\mathbb{R}^d$ |
| $\Vert x\Vert_1=\sum_{i=1}^{d}\vert x_i\vert$ | $\ell_1$ norm for $x\in\mathbb{R}^d$ |
| $\Vert x\Vert_\infty=\max\{\vert x_1\vert,\dots,\vert x_d\vert\}$ | max norm for $x\in\mathbb{R}^d$ |
| $\langle A,B\rangle_F=\operatorname{tr}(A^\top B)$ | Frobenius inner product |
| $\Vert A\Vert_F=\sqrt{\langle A,A\rangle_F}$ | Frobenius norm |
| $\Delta^d=\{x\in\mathbb{R}^d\mid\sum_{i=1}^{d}x_i=1,\ x_i\geq 0\}$ | standard $d$-simplex |
| $\psi:\mathcal{W}\to\mathbb{R}$ | closed convex function |
| $\partial\psi(w)=\{g\in\mathbb{X}_*\mid\forall v\in\mathcal{W}:\psi(v)-\psi(w)\geq g(v-w)\}$ | the subdifferential of $\psi$ at $w$ |
| $\psi^*:\mathbb{X}_*\to\mathbb{R},\ \theta\mapsto\sup_{w\in\mathcal{W}}\theta w-\psi(w)$ | convex conjugate of $\psi$ |
| $\mathcal{B}_\psi(u,v)=\psi(u)-\psi(v)-g(u-v)$, where $g\in\partial\psi(v)$ | the Bregman divergence |
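Two of the identities above are easy to sanity-check numerically: the Frobenius norm is induced by the Frobenius inner product, and $\langle A,B\rangle_F=\operatorname{tr}(A^\top B)$ equals the sum of elementwise products. A short NumPy check (random matrices of our choosing):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

inner = np.trace(A.T @ B)                        # <A, B>_F via the trace
assert np.isclose(inner, np.sum(A * B))          # elementwise form of the inner product
assert np.isclose(np.linalg.norm(A, "fro"),      # ||A||_F = sqrt(<A, A>_F)
                  np.sqrt(np.trace(A.T @ A)))
```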

## Appendix D

## References

1. Shumway, R.; Stoffer, D. *Time Series Analysis and Its Applications: With R Examples*; Springer Texts in Statistics; Springer: New York, NY, USA, 2010.
2. Chujai, P.; Kerdprasop, N.; Kerdprasop, K. Time series analysis of household electric consumption with ARIMA and ARMA models. In Proceedings of the International MultiConference of Engineers and Computer Scientists, Hong Kong, China, 13–15 March 2013; Volume 1, pp. 295–300.
3. Ghofrani, M.; Arabali, A.; Etezadi-Amoli, M.; Fadali, M.S. Smart scheduling and cost-benefit analysis of grid-enabled electric vehicles for wind power integration. *IEEE Trans. Smart Grid* **2014**, *5*, 2306–2313.
4. Rounaghi, M.M.; Zadeh, F.N. Investigation of market efficiency and financial stability between S&P 500 and London stock exchange: Monthly and yearly forecasting of time series stock returns using ARMA model. *Phys. A Stat. Mech. Its Appl.* **2016**, *456*, 10–21.
5. Zhu, B.; Chevallier, J. Carbon price forecasting with a hybrid ARIMA and least squares support vector machines methodology. In *Pricing and Forecasting Carbon Markets*; Springer: Berlin/Heidelberg, Germany, 2017; pp. 87–107.
6. Anava, O.; Hazan, E.; Mannor, S.; Shamir, O. Online learning for time series prediction. In Proceedings of the Conference on Learning Theory, Princeton, NJ, USA, 23–26 June 2013; pp. 172–184.
7. Liu, C.; Hoi, S.C.; Zhao, P.; Sun, J. Online ARIMA algorithms for time series prediction. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1867–1873.
8. Xie, C.; Bijral, A.; Ferres, J.L. NonSTOP: A nonstationary online prediction method for time series. *IEEE Signal Process. Lett.* **2018**, *25*, 1545–1549.
9. Yang, H.; Pan, Z.; Tao, Q.; Qiu, J. Online learning for vector autoregressive moving-average time series prediction. *Neurocomputing* **2018**, *315*, 9–17.
10. Joulani, P.; György, A.; Szepesvári, C. A modular analysis of adaptive (non-)convex optimization: Optimism, composite objectives, variance reduction, and variational bounds. *Theor. Comput. Sci.* **2020**, *808*, 108–138.
11. Zhou, Y.; Sanches Portella, V.; Schmidt, M.; Harvey, N. Regret bounds without Lipschitz continuity: Online learning with relative-Lipschitz losses. *Adv. Neural Inf. Process. Syst.* **2020**, *33*, 15823–15833.
12. Jamil, W.; Bouchachia, A. Model selection in online learning for time series forecasting. In *UK Workshop on Computational Intelligence*; Springer: Berlin/Heidelberg, Germany, 2018; pp. 83–95.
13. Jamil, W.; Kalnishkan, Y.; Bouchachia, H. Aggregation algorithm vs. average for time series prediction. In Proceedings of the ECML PKDD 2016 Workshop on Large-Scale Learning from Data Streams in Evolving Environments, Riva del Garda, Italy, 23 September 2016; pp. 1–14.
14. Orabona, F.; Pál, D. Coin betting and parameter-free online learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 4–9 December 2016; pp. 577–585.
15. Cutkosky, A.; Orabona, F. Black-box reductions for parameter-free online learning in Banach spaces. In Proceedings of the Conference on Learning Theory, Stockholm, Sweden, 6–9 July 2018; pp. 1493–1529.
16. Cutkosky, A.; Boahen, K. Online learning without prior information. In Proceedings of the Conference on Learning Theory, Amsterdam, The Netherlands, 7–10 July 2017; pp. 643–677.
17. Orabona, F.; Pál, D. Scale-free online learning. *Theor. Comput. Sci.* **2018**, *716*, 50–69.
18. Hamilton, J.D. *Time Series Analysis*; Princeton University Press: Princeton, NJ, USA, 1994; Volume 2.
19. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. *Time Series Analysis: Forecasting and Control*; John Wiley & Sons: Hoboken, NJ, USA, 2015.
20. Brockwell, P.J.; Davis, R.A. *Time Series: Theory and Methods*; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
21. Georgiou, T.T.; Lindquist, A. A convex optimization approach to ARMA modeling. *IEEE Trans. Autom. Control* **2008**, *53*, 1108–1119.
22. Lii, K.S. Identification and estimation of non-Gaussian ARMA processes. *IEEE Trans. Acoust. Speech Signal Process.* **1990**, *38*, 1266–1276.
23. Huang, S.J.; Shih, K.R. Short-term load forecasting via ARMA model identification including non-Gaussian process considerations. *IEEE Trans. Power Syst.* **2003**, *18*, 673–679.
24. Ding, F.; Shi, Y.; Chen, T. Performance analysis of estimation algorithms of nonstationary ARMA processes. *IEEE Trans. Signal Process.* **2006**, *54*, 1041–1053.
25. Yang, H.; Pan, Z.; Tao, Q. Online learning for time series prediction of AR model with missing data. *Neural Process. Lett.* **2019**, *50*, 2247–2263.
26. Ding, J.; Noshad, M.; Tarokh, V. Order selection of autoregressive processes using bridge criterion. In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, NJ, USA, 14–17 November 2015; pp. 615–622.
27. Lütkepohl, H. *New Introduction to Multiple Time Series Analysis*; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2005.
28. Steinhardt, J.; Liang, P. Adaptivity and optimism: An improved exponentiated gradient algorithm. In Proceedings of the International Conference on Machine Learning, PMLR, Beijing, China, 22–24 June 2014; pp. 1593–1601.
29. De Rooij, S.; Van Erven, T.; Grünwald, P.D.; Koolen, W.M. Follow the leader if you can, hedge if you must. *J. Mach. Learn. Res.* **2014**, *15*, 1281–1316.
30. Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. *J. Econom.* **1986**, *31*, 307–327.
31. Deng, Y.; Fan, H.; Wu, S. A hybrid ARIMA-LSTM model optimized by BP in the forecast of outpatient visits. *J. Ambient Intell. Humaniz. Comput.* **2020**.
32. Tutun, S.; Chou, C.A.; Canıyılmaz, E. A new forecasting framework for volatile behavior in net electricity consumption: A case study in Turkey. *Energy* **2015**, *93*, 2406–2422.
33. Lu, H. "Relative continuity" for non-Lipschitz nonsmooth convex optimization using stochastic (or deterministic) mirror descent. *INFORMS J. Optim.* **2019**, *1*, 288–303.

**Figure 2.** Results for setting 2 (time-varying parameters), using a non-stationary ARIMA(5,2,1) model.

**Figure 3.** Results for setting 3 (time-varying models), using a combination of stationary ARIMA(5,2,1) and ARIMA(5,2,0) models.

| Problem | Algorithm | Reference | Tuning-Free | Loss Function | Regret Dependence |
|---|---|---|---|---|---|
| OL for ARIMA | OGD | [6,7,8,9] | ✗ | any | largest gradient norm |
| OL for ARIMA | ONS | [6,7,8,9] | ✗ | exp-concave | largest gradient norm |
| PF-OCO | Coin Betting | [14,15] | ✔ | normalized gradient | gradient vectors |
| PF-OCO | FreeRex | [16] | ✔ | any | largest gradient norm |
| PF-OCO | SF-MD | [17] | ✗ | any | gradient vectors |
| PF-OCO | SOLO-FTRL | [17] | ✔ | any | largest gradient norm |
| OL for ARIMA | Algorithm 1 | This Paper | ✔ | Lipschitz | data sequence |
| OL for ARIMA | Algorithm 2 | This Paper | ✔ | squared error | data sequence |
| OMS for ARIMA | EG | [12,13] | ✗ | bounded | loss of the worst model |
| OMS for ARIMA | Algorithm 3 | This Paper | ✔ | local Lipschitz | data sequence |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Shao, W.; Radke, L.F.; Sivrikaya, F.; Albayrak, S.
Adaptive Online Learning for the Autoregressive Integrated Moving Average Models. *Mathematics* **2021**, *9*, 1523.
https://doi.org/10.3390/math9131523
