Minimum Message Length in Hybrid ARMA and LSTM Model Forecasting

Modeling and analysis of time series are important in applications including economics, engineering, environmental science, and social science. Selecting the best time series model, with accurate parameters, for forecasting is a challenging objective for scientists and academic researchers. Hybrid models combining neural networks and traditional Autoregressive Moving Average (ARMA) models are being used to improve the accuracy of modeling and forecasting time series. Most existing time series models are selected by information-theoretic approaches, such as AIC, BIC, and HQ. This paper revisits a model selection technique based on Minimum Message Length (MML) and investigates its use in hybrid time series analysis. MML is a Bayesian information-theoretic approach and has been used in selecting the best ARMA model. We utilize the long short-term memory (LSTM) approach to construct a hybrid ARMA-LSTM model and show that MML performs better than AIC, BIC, and HQ in selecting the model, both for traditional ARMA models (without LSTM) and for hybrid ARMA-LSTM models. These results held on simulated data and on both real-world datasets that we considered. We also develop a simple MML ARIMA model.


Introduction
Forecasting in time series is a difficult task due to the presence of trends and/or seasonal components. For example, economic time series data are highly impacted by seasonal factors and often show trends with long-run cycles. Such trends and seasonality are difficult to capture with the traditional Autoregressive Moving Average (ARMA) model [1]. The Bayesian Minimum Message Length (MML) principle [2], the Akaike Information Criterion (AIC) [3], Schwarz's Bayesian Information Criterion (BIC) [4], and Hannan-Quinn (HQ) [5] are often used in model selection for the ARMA model [6][7][8]. The models selected by MML87 [9] in ARMA time series have lower prediction errors than those from AIC, BIC, and HQ [10]. Schmidt previously showed that MML87 outperforms a variety of other (information-theoretic) approaches in ARMA time series modeling [11] (Chapters 5 to 8). In this paper, we extend the traditional ARMA time series model to form the hybrid ARMA-LSTM model by incorporating the long short-term memory (LSTM) neural network, in order to test the performance of MML in model selection. The results suggest that MML outperforms AIC, BIC, and HQ.
The ARIMA model is used with integer differencing to achieve stationarity if the time series is not stationary. A time series with seasonal components can be modeled using the family of seasonal ARIMA (or SARIMA) models. This ARIMA family has also been generalized to include long-memory time series using a suitable fractional order of differencing.

This section reviews the theory of Autoregressive Integrated Moving Average (ARIMA) modeling from Box and Jenkins (1970) [13,15]. Let {Y_t} be a homogeneous nonstationary time series and suppose that the d-th (d = 1, 2, ...) difference of the series is stationary and is given by X_t = (1 − B)^d Y_t, where B is the backshift operator. Then a stationary ARMA(p,q) model can be fitted for {X_t}, satisfying

X_t = c + φ_1 X_{t−1} + ... + φ_p X_{t−p} + ε_t + θ_1 ε_{t−1} + ... + θ_q ε_{t−q}, (1)

where {ε_t} ∼ WN(0, σ²). Let φ(B) = 1 − φ_1 B − ... − φ_p B^p and θ(B) = 1 + θ_1 B + ... + θ_q B^q be two polynomials of degree p and q, respectively, such that the zeros of φ(B) and θ(B) lie outside the unit circle. Then the ARMA(p,q) model in Equation (1) can be written in the compact form

φ(B) X_t = c + θ(B) ε_t. (2)

The corresponding ARIMA(p,d,q) model for the original series {Y_t} is then given by

φ(B) (1 − B)^d Y_t = c + θ(B) ε_t. (3)

ARIMA is a form of linear regression model built on lagged values of the time series and the corresponding residuals. When an ARIMA model fits the given data well, the resulting residuals should form a random scatter with constant mean and constant variance over time; see, for example, ref. [13]. If the ARIMA model is not well fitted, or an incorrect model has been fitted, the residuals will not show a random scatter and will instead exhibit autocorrelation. This reveals that information hidden in the data has not been completely captured by the fitted ARIMA model, and we consider refitting an alternative ARIMA model [22].
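To make the differencing step and the residual diagnostic concrete, the following is a minimal NumPy sketch (not the authors' code; the random-walk series and lag cutoff are illustrative): it applies (1 − B)^d via np.diff and computes sample autocorrelations, which should be near zero at all nonzero lags for the residuals of a well-fitted model.

```python
import numpy as np

def difference(y, d=1):
    """Apply (1 - B)^d, i.e., d-th order integer differencing."""
    return np.diff(y, n=d)

def sample_acf(x, max_lag=10):
    """Sample autocorrelations of a (residual) series; for a well-fitted
    ARMA model these should be close to zero at all nonzero lags."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

# Illustration: difference a random-walk-like series once, then inspect
# the autocorrelations of the differenced values.
rng = np.random.default_rng(0)
y = np.cumsum(rng.normal(size=200))   # nonstationary level series
x = difference(y, d=1)                # stationary increments
print(sample_acf(x, max_lag=5))
```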
The above family of ARIMA models is also capable of modeling a wide range of seasonal data with slight modifications. A seasonal extension of Model (3) can be written for time series data with seasonality m. Incorporating both the seasonal and nonseasonal components with additional polynomials gives the new model

φ(B) Φ(B^m) (1 − B)^d (1 − B^m)^D Y_t = c + θ(B) Θ(B^m) ε_t, (4)

where Φ(B^m) = 1 − Φ_1 B^m − ... − Φ_P B^{mP}, Θ(B^m) = 1 + Θ_1 B^m + ... + Θ_Q B^{mQ}, and D is the degree of seasonal differencing. For simplicity, this is written as ARIMA(p,d,q)×(P,D,Q)_m. Model (4) is known as the Seasonal ARIMA or SARIMA model. To estimate the parameters of Model (4), it is important to examine the autocorrelation function (ACF) plot of the data. The ACF provides an indication of the linear dependencies among the observations of the time series, which is related to the order of the model. In addition, the corresponding partial autocorrelation function (PACF) can be used to confirm the approximate order required in the model.
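As a small illustration of this order-identification step, the following sketch (assuming the statsmodels package is available; the AR(2) coefficients are purely illustrative) prints the sample ACF and PACF, whose cut-off and decay patterns suggest the approximate AR and MA orders.

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf  # assumed available

rng = np.random.default_rng(1)
# Toy AR(2) series: x_t = 0.5 x_{t-1} - 0.3 x_{t-2} + e_t
e = rng.normal(size=500)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.5 * x[t - 1] - 0.3 * x[t - 2] + e[t]

# A decaying ACF with a PACF that cuts off after lag p suggests AR(p);
# the mirror-image pattern (ACF cut-off, PACF decay) suggests MA(q).
print(np.round(acf(x, nlags=5), 3))
print(np.round(pacf(x, nlags=5), 3))
```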
In this study, we use non-seasonal ARIMA modeling because the non-seasonal degree of differencing d can be predetermined in practice. We consider stationary time series data. Assuming the data are generated from a zero-mean stationary ARMA(p,q) process with Gaussian errors, we use the fact that the distribution of the data is multivariate Gaussian with mean µ = 0.
Suppose that we have a sample of N observations y = (y_1, ..., y_N) generated through Model (2) with c = 0, and let β = (φ_1, ..., φ_p, θ_1, ..., θ_q, σ²) be the vector of all the parameters. Then the corresponding unconditional log-likelihood function, L(y | β), can be written as

L(y | β) = −(N/2) log(2πσ²) − (1/2) log|Σ| − (1/(2σ²)) y' Σ^{−1} y,

where |Σ| is the determinant of Σ and σ²Σ is the N × N theoretical autocovariance matrix of y.

Gegenbauer Autoregressive Moving Average (GARMA) models have been used to model a general family of time series with long memory and seasonal components. This family can be used in a wide variety of applications in finance, engineering, and weather forecasting. See, for example, ref. [23] for a comprehensive review and [24] and references therein for estimation methods together with applications.
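The unconditional log-likelihood above can be evaluated directly once the autocovariance matrix is available. The following is a small NumPy sketch (the AR(1) autocovariance used to build Σ is an illustrative special case, not part of the paper):

```python
import numpy as np

def arma_gaussian_loglik(y, Sigma, sigma2):
    """Unconditional log-likelihood of a zero-mean Gaussian ARMA sample y,
    where sigma2 * Sigma is the N x N theoretical autocovariance matrix."""
    N = len(y)
    _, logdet = np.linalg.slogdet(Sigma)
    quad = y @ np.linalg.solve(Sigma, y)
    return -0.5 * N * np.log(2 * np.pi * sigma2) - 0.5 * logdet - quad / (2 * sigma2)

# Illustration with an AR(1) process, where Sigma[i, j] = phi**|i-j| / (1 - phi**2).
phi, sigma2, N = 0.6, 1.0, 100
rng = np.random.default_rng(2)
idx = np.arange(N)
Sigma = phi ** np.abs(idx[:, None] - idx[None, :]) / (1 - phi ** 2)
y = rng.multivariate_normal(np.zeros(N), sigma2 * Sigma)
print(arma_gaussian_loglik(y, Sigma, sigma2))
```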

Minimum Message Length
The Bayesian information-theoretic Minimum Message Length (MML) principle [2,6,7,9,19,25] is based on coding theory and can be thought of in several equivalent ways. It can be thought of in terms of a transmitter encoding a two-part message and transmitting it to a receiver, where the first part of the message encodes the model and the second part encodes the data given the model. The length of the first part can be thought of as the complexity of the model, and the length of the second part (effectively, the statistical negative log-likelihood) is a measure of goodness of fit to the observed data. For example, with X = {A, B, C, D}, possible encodings would be, e.g., A = 00, B = 01, C = 10, and D = 11, or instead, e.g., A = 1, B = 01, C = 001, and D = 0001, with the length of a code word denoted I(·); e.g., with A = 00, I(A) = 2. The code length is typically (close to) the negative logarithm of the probability of the encoded event.
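The correspondence between probabilities and code lengths can be shown in a couple of lines (a sketch; the probabilities are illustrative):

```python
import math

def code_length(p):
    """Length (in bits) of an optimal code word for an event of probability p."""
    return -math.log2(p)

# With X = {A, B, C, D} and equal probabilities 1/4, each symbol costs 2 bits,
# matching the fixed-length encoding A = 00, B = 01, C = 10, D = 11.
print(code_length(1 / 4))   # 2.0
# A more probable symbol earns a shorter code word, e.g., p = 1/2 costs 1 bit.
print(code_length(1 / 2))   # 1.0
```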
MML thus gives a quantitative information-theoretic trade-off between model complexity (length of the first part of the message) and goodness of fit (length of the second part of the message) [26]. A smaller MML value (or, equivalently, a shorter message length) indicates a model that is less complex and well fitted to the data [6]. In practice, minimizing the message length can be expressed as

min_θ { I(θ) + I(y_N | θ) },

where I(θ) is the length of encoding the assertion (or model), and I(y_N | θ) is the length of encoding the detail (or data given the model). In MML, there is (Bayesian) prior knowledge (or a prior distribution), π, over the parameter space. Following Wallace and Freeman [9], MML has been shown to work well in time series models, such as autoregressive (AR) and moving average (MA) models [18,27,28]. We can thus estimate the parameters [7,9] by minimizing the message length

MessLen(β, y) = − log h_1(p) − log h_2(q) − log h_3(β) − log f(y_1, ..., y_N | β) − N log(ε) + (1/2) log|F(β)| + (k/2) log κ_k + k/2,

where ε is the measurement accuracy of the data, h_3(β) is the Bayesian prior distribution over the parameter set β (we model β using a uniform prior, h_3(β) = 1, over the stationarity region), h_1(p) = 2^{−(1+p)} and h_2(q) = 2^{−(1+q)} are the priors on the (nonnegative integer) orders p and q, k = p + q + 1 is the number of continuous-valued parameters, f(y_1, ..., y_N | β) is the standard statistical likelihood function, L = − log f, F(β) is the expected Fisher information matrix (of expected second-order partial derivatives of L) and is a function of the parameter set β, |F(β)| is its determinant, and κ_k is the lattice constant, which accounts for the expected error in the log-likelihood function of the ARMA model (Equation (6)) due to the quantization of the k-dimensional parameter space; it is bounded above by 1/12 and below by 1/(2πe). For example, κ_1 = 1/12, κ_2 = 5/(36√3), κ_3 = 19/(192 · 2^{1/3}), and κ_k → 1/(2πe) as k → ∞. Ignoring the − log h_1(p), − log h_2(q), and −N log(ε) terms, the message length for the ARMA model β can also be represented as

MessLen(β) = − log h_3(β) − log f(y_1, ..., y_N | β) + (1/2) log|F(β)| + (k/2) log κ_k + k/2.

MML87 is model invariant and avoids explicitly constructing the quantized parameter space [7-9,25]. It is used for model selection and parameter estimation by choosing the model that minimizes the message length.
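The simplified MML87 expression above can be written as a short function. The following sketch (with illustrative numbers; it omits the model-order priors and the −N log(ε) term, as in the simplified form, and uses the large-k bound 1/(2πe) for the lattice constant) compares two candidate orders by message length:

```python
import math

def mml87_message_length(neg_log_lik, log_prior, log_det_fisher, k):
    """Wallace-Freeman (MML87) message length (in nits): assertion + detail.
    neg_log_lik    : -log f(y | beta)
    log_prior      : log h_3(beta) at the estimate
    log_det_fisher : log |F(beta)|, expected Fisher information determinant
    k              : number of continuous parameters (here p + q + 1)
    """
    kappa_k = 1.0 / (2.0 * math.pi * math.e)   # large-k lattice constant bound
    assertion = -log_prior + 0.5 * log_det_fisher + 0.5 * k * math.log(kappa_k)
    detail = neg_log_lik + 0.5 * k
    return assertion + detail

# Model selection: keep the candidate with the shorter message (numbers illustrative).
msg_a = mml87_message_length(neg_log_lik=152.3, log_prior=0.0, log_det_fisher=6.1, k=3)
msg_b = mml87_message_length(neg_log_lik=150.9, log_prior=0.0, log_det_fisher=9.8, k=5)
print("prefer model A" if msg_a < msg_b else "prefer model B")
```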
Part of the reason for the above list is the universality of the MML approach [7] ( [19] Chapter 2) (seeking the single best theory) and that of the predictive approach (seeking a Bayesian weighted combination of theories) of Solomonoff [38,39] ([40] Section 3.1). The MML approach of Wallace and the algorithmic probability approach of Solomonoff both have many desirable properties, but they can be slow in practice, whereas deep learning often runs relatively quickly. This motivates us to combine these approaches, as we do using the deep learning approach of long short-term memory (LSTM). This gives us something of a combination of the simplicity and accuracy of MML and the speed of deep learning.
We note in passing that an earlier effort at combining MML with neural nets is [41]. We further note that some approaches to deep learning use a (suitably weighted) combination of a squared error term and a Kullback-Leibler divergence term. Given that squared error comes (or can come) from a Gaussian log-likelihood, this version of deep learning regularization bears similarities to D. F. Schmidt's MML approximation [11]. (The Wallace-Freeman MML approximation [9] has been modified to allow for cases when the Bayesian prior is not approximately constant over the relevant region. D. F. Schmidt's MML approximation, just discussed, is a further modification, and explicitly introduces a Kullback-Leibler divergence into the expression.) We also ask, for future work, whether our approach might be combined with graph neural networks [43] or (higher-dimensional) hyper-graph neural networks.

Long Short-Term Memory (LSTM)
With the development of computational power in electronic equipment, powerful computers provide many learning algorithms and approaches for time series forecasting [44][45][46]. Deep learning is one of the popular approaches of recent years; it provides a complex model that has at least the potential to capture (and often does capture) more general information from the predictors than a traditional model such as ARMA. Long short-term memory (LSTM) is a special kind of recurrent neural network introduced by Hochreiter and Schmidhuber in 1997 [47]. LSTM maintains two state vectors, the short-term state h_t and the long-term state c_t, and uses a gating mechanism that adds linear components from the previous layer in order to provide the long memory. LSTM has been widely used in time series forecasting because it is able to capture more information in the time series data, particularly in financial econometrics, where the price of financial assets depends on various factors that are difficult to represent with a linear model [46,48]. Each LSTM layer, including the cells of the forget gate, input gate, and output gate, is shown in Figure 1.

• Forget gate: The forget gate uses a sigmoid function σ(x) from Equation (10). Its output lies between 0 and 1 and determines how much information should be forgotten: a value close to 0 means more information is forgotten, and a value close to 1 means less information is forgotten.
• Input gate: The input gate also uses the sigmoid function. It controls how much of the candidate input g_t = tanh(W h_{t−1} + U x_t + b), computed with the tanh(x) function, should be remembered. The LSTM long-term state is updated element-wise as c_t = f_t ⊙ c_{t−1} + g_t ⊙ i_t, where ⊙ denotes element-wise multiplication (of two matrices of the same dimension), also known as the Hadamard product.
• Output gate: The output gate o_t controls how much of the long-term information c_t should be carried forward to the next layer, and it also contributes to the short-term state h_t. The result of the output gate function is again between 0 and 1, and the LSTM short-term state is computed element-wise as h_t = o_t ⊙ tanh(c_t).

An LSTM with more than one layer is shown in Figure 2; its structure enables the LSTM to capture both long-term and short-term information for forecasting. As usual, an LSTM is trained by backpropagation, as other neural network models are. An LSTM requires time series data to train the model, and the time series pattern is modeled in every layer of the network.
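The gate equations above can be written directly as one time-step update. The following NumPy sketch (weight shapes and initialization are illustrative, not a trained model) implements the forget, input, candidate, and output gates and the element-wise state updates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step with forget (f), input (i), candidate (g), and
    output (o) gates; W, U, b hold per-gate weights and biases in dicts."""
    f_t = sigmoid(W["f"] @ h_prev + U["f"] @ x_t + b["f"])   # what to forget
    i_t = sigmoid(W["i"] @ h_prev + U["i"] @ x_t + b["i"])   # what to write
    g_t = np.tanh(W["g"] @ h_prev + U["g"] @ x_t + b["g"])   # candidate values
    o_t = sigmoid(W["o"] @ h_prev + U["o"] @ x_t + b["o"])   # what to expose
    c_t = f_t * c_prev + i_t * g_t      # element-wise (Hadamard) long-term update
    h_t = o_t * np.tanh(c_t)            # short-term state
    return h_t, c_t

# Illustration with hidden size 4 and a scalar input.
rng = np.random.default_rng(3)
n_h, n_x = 4, 1
W = {k: rng.normal(scale=0.1, size=(n_h, n_h)) for k in "figo"}
U = {k: rng.normal(scale=0.1, size=(n_h, n_x)) for k in "figo"}
b = {k: np.zeros(n_h) for k in "figo"}
h, c = np.zeros(n_h), np.zeros(n_h)
h, c = lstm_step(np.array([0.5]), h, c, W, U, b)
print(h)
```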

Hybrid ARMA-LSTM Model
In recent years, LSTM and its variants, along with some hybrid models, have been thought by many to largely dominate the financial time series forecasting domain [46]. The LSTM is able to capture the dependency of residuals across time and is trained time step by time step [49]. In this paper, we use the Moving Average lag order q from the ARMA parameters selected by MML87, AIC, BIC, and HQ; if q = 0, then we only use ARMA to forecast the time series data, without LSTM. Our LSTM model is composed of a single input layer with an input shape equal to the MA order, returning sequences; the following LSTM layer also returns sequences, and the third LSTM layer, with the same number of units, is followed by a fourth, dense layer with one unit.
We developed Algorithm 1 based on [17], using a different loss function and activation function for the regression task. The hybrid ARMA-LSTM model trains the LSTM on the residuals from the ARMA model. (This is similar in spirit to the discussion in [7] (Section 5.1).) The simple point here is that the LSTM has at least the potential to find dependencies that the ARMA model (on its own) cannot express. In this paper, MML87, AIC, BIC, and HQ have been used to select the parameter orders of the ARMA model; so, this paper not only compares the errors of the hybrid ARMA-LSTM model with those of the single ARMA model but also compares the hybrid models obtained under the respective selections of MML87, AIC, BIC, and HQ. The forecast from the ARMA model is the fitted mean µ̂_{t+1}. Because information is hidden in the residuals from the ARMA model (in a similar vein to [7] (Section 5.1)), the forecast of the hybrid model will be

ŷ_{t+1} = µ̂_{t+1} + ê_{t+1},

where µ̂_{t+1} represents the linear modeling of the data from the ARMA model selected according to the information-theoretic criteria MML87, AIC, BIC, or HQ, and ê_{t+1} is the LSTM forecast of the residual term. If q = 0, the root mean squared error is calculated from the ARMA forecast alone.
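The hybrid procedure can be sketched as follows. This is a simplified sketch of the idea behind Algorithm 1, assuming statsmodels and TensorFlow/Keras are available; the layer sizes, epochs, and use of two (rather than three) LSTM layers are illustrative choices, not the authors' exact configuration.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA      # assumed available
from tensorflow.keras.models import Sequential     # assumed available
from tensorflow.keras.layers import LSTM, Dense

def make_windows(resid, lag):
    """Turn the ARMA residual series into (lag residuals -> next residual) pairs."""
    X = np.array([resid[i:i + lag] for i in range(len(resid) - lag)])
    y = resid[lag:]
    return X.reshape(-1, lag, 1), y

def hybrid_forecast(series, p, q, d=0, epochs=50):
    """One-step hybrid forecast: ARMA fitted mean plus an LSTM correction
    trained on the ARMA residuals."""
    arma = ARIMA(series, order=(p, d, q)).fit()
    mu_next = arma.forecast(1)[0]                   # linear part of the forecast
    if q == 0:                                      # no MA order: use ARMA alone
        return mu_next
    resid = np.asarray(arma.resid)
    X, y = make_windows(resid, lag=q)
    net = Sequential([
        LSTM(16, return_sequences=True, input_shape=(q, 1)),
        LSTM(16),
        Dense(1),
    ])
    net.compile(loss="mse", optimizer="adam")
    net.fit(X, y, epochs=epochs, verbose=0)
    e_next = net.predict(resid[-q:].reshape(1, q, 1), verbose=0)[0, 0]
    return mu_next + e_next                         # hybrid forecast y_hat_{t+1}
```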

Experiments
The experiments have been designed to compare the results of the ARMA model itself with the hybrid ARMA-LSTM model and also to compare different versions of the hybrid model with the parameters variously selected by MML87, AIC, BIC, and HQ. In order to analyze the accuracy of forecasting, we use the root mean squared error,

RMSE = sqrt( (1/T) Σ_{t=1}^{T} (y_t − ŷ_t)² ),

to compare the different results, where T stands for the forecast window size and ŷ_t is the forecast value; we use a rolling forecast in this experiment. To elaborate and clarify, for the financial data in Section 6.2, we do integer differencing with d = 1 to obtain stationarity before using the ARMA model and, as such, use an ARIMA or autoregressive integrated moving average model. We compare the performance of ARMA, ARMA-LSTM, and LSTM alone on simulated dataset(s) (Section 6.1) and also on real-world financial (Section 6.2) and air pollution (Section 6.3) datasets.
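The rolling-forecast RMSE described above can be sketched as a small helper that refits a model at each step of the forecast window (the naive last-value forecaster used in the usage line is only a placeholder):

```python
import numpy as np

def rolling_forecast_rmse(series, T, fit_and_forecast):
    """Rolling one-step forecasts over the last T points: refit on the data
    up to time t, forecast y_{t+1}, and report the RMSE over the window."""
    errors = []
    for t in range(len(series) - T, len(series)):
        y_hat = fit_and_forecast(series[:t])   # any model: ARMA, hybrid, LSTM
        errors.append(series[t] - y_hat)
    return float(np.sqrt(np.mean(np.square(errors))))

# Illustration with a naive "last value" forecaster standing in for a model.
rng = np.random.default_rng(4)
y = np.cumsum(rng.normal(size=150))
print(rolling_forecast_rmse(y, T=10, fit_and_forecast=lambda h: h[-1]))
```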
We argue elsewhere (([6] footnotes 75 and 76), ([25] Section 3), ([40] Section 4.1)) about various uniqueness and invariance properties of log-loss (or logarithmic loss). Squared error is a popular measure and can be viewed as a variant of log-loss, since it arises from a Gaussian negative log-likelihood.

Simulated Dataset(s)
In this section, we perform experiments using the various previously described modeling methods on simulated data, and we begin (in terms of LNPPP space ([6] Section 0.2.7)) by describing the experiments. We use a uniform distribution on [−0.9, 0.9] (from minimum −0.9 to maximum 0.9) to randomize the coefficients of the ARMA(p, q) models for the data simulation, using the arima.sim function in R, and reject any draws that fall outside the stationarity region. There are 5 × 2 = 10 different parameter sets, from p_1, ..., p_5 and q_1, q_2. The values in the tables are the average RMSE over 100 runs (with standard deviation in brackets) for the simulated dataset corresponding to the particular parameters. The datasets include N = 50, 100, 200, 300, and 500 time series data points, and the forecast windows have sizes T = 3, 10, 30, and 50. Table 1 shows the average RMSE obtained by training the LSTM alone (with different numbers of LSTM time steps) for the different forecast window sizes T. The results suggest that the LSTM alone does not work well on ARMA-simulated data. For convenience of reading, we have moved Tables A1-A8 to Appendix A; each value in Table A1 is the average RMSE of forecast errors over the datasets (with standard deviation in brackets). The bold entries indicate the smallest forecast errors among the different kinds of models. Tables A1-A4 provide a comparison of different forecast window sizes, with T = 3, 10, 30, and 50. Table A2 shows the average RMSE over the datasets for the different simulated ARMA parameter sets, with a forecast window of T = 10. Table A3 provides the corresponding comparison of root mean squared errors under the different criteria, for the different simulated datasets with a forecast window of T = 30.
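The simulation design (uniform coefficients on [−0.9, 0.9] with rejection of non-stationary draws) can be sketched as follows; the paper uses R's arima.sim, so this NumPy version is an illustrative equivalent rather than the authors' code, and the burn-in length is an assumption.

```python
import numpy as np

def draw_stationary_arma(p, q, rng):
    """Draw ARMA coefficients uniformly on [-0.9, 0.9], rejecting AR draws
    that imply a non-stationary process."""
    while True:
        phi = rng.uniform(-0.9, 0.9, size=p)
        theta = rng.uniform(-0.9, 0.9, size=q)
        # Stationary iff the roots of z^p - phi_1 z^(p-1) - ... - phi_p
        # all lie strictly inside the unit circle.
        if p == 0 or np.all(np.abs(np.roots(np.r_[1.0, -phi])) < 1.0):
            return phi, theta

def simulate_arma(N, phi, theta, sigma=1.0, rng=None, burn=200):
    """Simulate N observations from a zero-mean ARMA(p, q) process."""
    rng = rng or np.random.default_rng()
    p, q = len(phi), len(theta)
    e = rng.normal(scale=sigma, size=N + burn)
    x = np.zeros(N + burn)
    for t in range(max(p, q), N + burn):
        ar = phi @ x[t - p:t][::-1] if p else 0.0
        ma = theta @ e[t - q:t][::-1] if q else 0.0
        x[t] = ar + ma + e[t]
    return x[burn:]

rng = np.random.default_rng(5)
phi, theta = draw_stationary_arma(p=2, q=1, rng=rng)
y = simulate_arma(100, phi, theta, rng=rng)
print(phi, theta, y[:5])
```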
A large forecast window usually decreases the accuracy of a time series model. A window size of T = 50 (Data provided by Table A4) is 50% of the size of the in-sample set, and the MML87 hybrid model still outperforms its rivals. This indicates that the MML information criterion is efficient in model selection and that the algorithm of the hybrid model is also efficient in time series analysis, as the results for T = 50 in Table A4 show. Table 2 shows the average over the ten different parameter sets of the simulated datasets for forecast window sizes of T = 3 (Data provided by Table A1), 10 (Data provided by Table A2), 30 (Data provided by Table A3), and 50 (Data provided by Table A4), with an in-sample size of N = 100. MML87 outperforms the rival methods for the in-sample size of N = 100 in all cases of T = 3, 10, 30, and 50. MML87 not only considers the goodness of fit to the data but also considers the model complexity. Figure 3 shows that MML87 has a lower root mean squared error in most cases. The hybrid model selected by MML87 has the lowest error rate for T = 3, 10, and 30. These comparisons argue well for MML. The results of N = 100 with T = 50 seem to suggest that, for a large forecast window, the complex hybrid ARMA-LSTM model performs better than the simple time series model. Given that the simulated data were generated from an ARMA model, it is not immediately apparent why adding LSTM to produce a hybrid model should be advantageous in the case of larger datasets (although we would typically expect this when not dealing with data that are purely from an ARMA model). Table 3 shows the average over the ten different parameter sets of the simulated datasets for in-sample sizes of N = 50 (Data provided by Table A5), 200 (Data provided by Table A6), 300 (Data provided by Table A7), and 500 (Data provided by Table A8), with the forecast window size T = 10. Tables A5-A8 compare six different models or model selection techniques in terms of the RMSE of the datasets with N = 50, 200, 300, and 500 and forecast window size T = 10. AIC tends to overfit for small datasets, such as N = 50 (Data provided by Table A5 in Appendix A). As the amount of in-sample data increases, the RMSE of the hybrid ARMA-LSTM model decreases, because the larger amount of data helps the LSTM to train and fit an accurate model. Thus, the results show that the RMSE for the MML87 model is lower than for the other models for N = 100, 200, and 300. Because of the efficiency of MML87 in controlling model complexity, the model can avoid the overfitting problem for small datasets.
The hybrid model with LSTM overfits when the in-sample size is small, essentially because there is a larger number of parameters to estimate than in the pure ARMA model. On the other hand, the hybrid model tends to perform well for a large in-sample size, because the deep learning model generally benefits from a large in-sample size, such as N = 200 (Data provided by Table A6), 300 (Data provided by Table A7), and 500 (Data provided by Table A8).
For a small in-sample size, such as N = 50, BIC performs well with the hybrid ARMA-LSTM because BIC is able to select the model without overfitting. The MML87 hybrid has the smallest average RMSE for N = 100, 200, and 300 across the different randomized datasets. The hybrid models work efficiently when there is enough in-sample data; otherwise, they can also overfit small datasets. Meanwhile, comparing the RMSE from MML87-ARMA, AIC-ARMA, BIC-ARMA, and HQ-ARMA, the results favor MML87 over AIC, BIC, and HQ. MML87 performs well in time series model selection and is able to select ARMA models with lower forecasting errors. However, as noted earlier in this section, given that the simulated data were generated from an ARMA model, it is not immediately apparent why adding LSTM to produce a hybrid model should be advantageous in the case of larger datasets (although we would typically expect this when not dealing with data that are purely from an ARMA model). Figure 4 shows the comparison of RMSE for the in-sample sizes N = 50, 200, 300, and 500.

Financial Data-and Extension to ARIMA Models
Stock return prediction is one of the most popular research topics in economics and finance [51,52]. This section studies the performance of the hybrid model from MML87; the hybrid models from AIC, BIC, and HQ; and the ARIMA models selected by MML87, AIC, BIC, and HQ. The stock prices were selected from the components of the Dow Jones Industrial Average. Table 4 shows the characteristics of the selected stock prices, including mean, standard deviation, and partial autocorrelation. Stock prices are influenced by many factors [51,53,54]. However, traditional linear time series models are not able to take into account the effect of all those factors, thus requiring a more complex model to capture the information hidden in the residuals from the ARIMA model. The hybrid model with LSTM is able to model publicly available and other information, which we have no reason to believe is restricted to what a purely ARMA or ARIMA model can capture. In order to make the stock prices stationary for time series analysis, the ARIMA models use the parameter d = 1 (or, equivalently, first-order differencing). As the experimental results show, MML87 outperforms the other information-theoretic criteria AIC, BIC, and HQ in terms of lower root mean squared error for out-of-sample forecasting. Figure 5 shows the log prices of the stocks selected in this experiment. The hybrid model tends to outperform for a large forecast window size rather than a small one, because a large lookahead in forecasting has higher uncertainty; for much, or perhaps even most, of the financial industry, there is high volatility in long forecasts. The notion of semi-strong market efficiency suggests that the stock price fully and fairly reflects publicly available information over the time horizon of the forecast window and also reflects all past information (although by no means all authors agree with this in its entirety [55], partly due to principles of Solomonoff [39] and Wallace [7]). Thus, it is more likely that a complex model will at least be able to provide accurate predictions for T greater than 100. Table 5 shows that the MML models have lower RMSE in most cases for the different forecast window sizes in the financial data. Table 6 provides the average RMSE for the selected stocks for different sizes T of the forecast window (shown in different columns) and numbers of LSTM time steps (shown in different rows). The LSTM models are trained on data scaled to the range 0 to 1, and the LSTM model performs worse without scaling, which indicates that the LSTM neural network is scale-sensitive and that combining it with the traditional ARMA time series model makes the approach less sensitive to scaling [56]. The results from Table 6 suggest that the LSTM model alone (unenhanced by ARMA or ARIMA) is not particularly able to capture the time series pattern of the stock prices; its average RMSE figures are significantly higher than those of the traditional ARMA and hybrid ARMA-LSTM models. Figure 6 shows the comparison between the ARIMA model and the hybrid ARIMA-LSTM model in this experiment.

PM2.5 Pollution Data
In this section, we use environmental data of PM2.5 pollution levels in the city of Beijing, China, with ten sensors located in different areas. The data are hourly PM2.5 levels over 53 days in 2013.
We use the same data length and information-theoretic methods as in Section 6.2 in order to demonstrate the performance of MML compared to rival methods. Table 7 shows the comparison between MML, AIC, BIC, and HQ. The hourly PM2.5 data have a seasonality: the level of PM2.5 reaches its highest near midday and decreases to its lowest near midnight. The results suggest that MML is a good model selection technique in this case. Table 8 shows the LSTM model alone on the PM2.5 data, and the results suggest that the LSTM model (on its own, unenhanced by ARMA or ARIMA) performs comparatively well for the smaller forecast windows, such as T = 3, 5, and 10. The RMSEs for larger window sizes (T ≥ 50) are much larger for the LSTM than for the ARMA model and the hybrid ARMA-LSTM.

Table 8. LSTM for PM2.5 Beijing data in different time steps and forecast windows.

Conclusions
We have investigated time series modeling in the Minimum Message Length framework using Wallace and Freeman's (1987) approximation [9]. The hybrid ARMA-LSTM model has been compared with the traditional ARMA (Autoregressive Moving Average) time series model under the information-theoretic approaches AIC, BIC, HQ, and MML87. We performed experiments on simulated data and also on two real-world datasets (financial and environmental data). We conducted the experiments based on hybrid ARMA-LSTM (with LSTM) and ARMA without LSTM (long short-term memory). This could be broadly thought of as constituting two experiments each on three datasets, or six experiments in all. For each of the six experiments, the results show that MML87 outperforms the other information-theoretic criteria. The hybrid ARMA-LSTM model performs better than the traditional ARMA model, and the MML hybrid ARMA-LSTM model performed best out of everything considered. It is worth noting that the LSTM model alone with unscaled data performed worse than everything else considered. In summary, as the experimental results show, MML87 is able to select models with lower forecasting errors better than AIC, BIC, and HQ.

Acknowledgments: The authors thank three anonymous referees for their valuable comments and useful suggestions to improve the quality of this version of the paper. The (first two) authors would further like to thank the Department of Data Science and Artificial Intelligence, Faculty of IT, Monash University, for their support.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A

Table A1. Simulated data with N = 100 and T = 3 from Section 6.1.
Table A2. Simulated data with N = 100 and T = 10 from Section 6.1.

Table A7. Simulated data with N = 300 and T = 10 from Section 6.1.