Air Pollution Prediction Using an Ensemble of Dynamic Transfer Models for Multivariate Time Series

Entering a new era of big data, analysis of large amounts of real-time data is important, and air quality data as streaming time series are measured by several different sensors. To this end, numerous methods for time-series forecasting and deep-learning approaches based on neural networks have been used. However, they usually rely on a certain model with a stationary condition, and there are few studies of real-time prediction of dynamic massive multivariate data. Use of a variety of independent variables included in the data is important to improve forecasting performance. In this paper, we proposed a real-time prediction approach based on an ensemble method for multivariate time-series data. The suggested method can select multivariate time-series variables and incorporate real-time updatable autoregressive models in terms of performance. We verified the proposed model using simulated data and applied it to predict air quality measured by five sensors and failures based on real-time performance log data in server systems. We found that the proposed method for air pollution prediction showed effective and stable performance for both shortand long-term prediction tasks. In addition, traditional methods for abnormality detection have focused on present status of objects as either normal or abnormal based on provided data, we protectively predict expected statuses of objects with provided real-time data and implement effective system management in cloud environments through the proposed method.


Introduction
Massive real-time data storage and real-time data visualization are available in many industries, and have improved data analysis techniques for real-time data. In other words, among various versions of time-series analysis, the prediction in real time is the one of main interests in the field. Also, much log big data has been produced between web or mobile applications and server systems because of developments of web and IoT systems, etc. Thus, analyzing this kind of data is becoming increasingly important these days [1,2]. In particular, APM (Application Performance Management) is a real-time log big data analysis system that collects and manages performance information of a server system between usages of user applications and services of a server system such as web application server or data base server.
Various extended models such as VAR (Vector Autoregressive) and VARMA (Vector Autoregressive Moving Average) have been used for multivariate time-series analysis. VAR and VEC (Vector Error-correction) were exploited for long-term prediction based on multivariate time-series data [3], and VARMA was used for multivariate time-series data in financial services [4]. Study in [5] researched prediction performance of the VAR model using direct multi-step estimation for both stationary and non-stationary time series generated in economic activities, and study in [6] forecasted price of electricity by VAR model using fractional cointegration. One drawback of VAR model is that the number of parameters to be estimated can become large [7]. Recently, predictions for Fuzzy time series were performed using a multivariate heuristic model [8], and a new method using Fuzzy relation based on a neural network algorithm was suggested for high-dimensional time series data [9]. However, these models need to satisfy too many conditions and constraints; the VARMA model is slow with complicated data [10]. In addition, even if the models handle multivariate time-series data, they are usually not suitable to forecast a certain dependent variable of the data. Thus, given many situations, traditional single models for forecasting multivariate time series have limits.
As one possible approach to these problems, an ensemble method combining many models have been proposed for time-series data in diverse situations. In particular, as solutions based on neural network algorithm have been in fashion, [11] suggested a new type of ensemble method using nonlinear weights with ARIMA, ANN, and RNN models. Despite some advantages, however, the method is not adequate for real-time forecasting and has a limitation of increasing complexity as the number of models increases. Similarly, based on research using a neural network algorithm, these kinds of multivariate time-series forecasting methods are not suitable for real-time analysis because of long training time with high complexity [12]. Recently, for real-time prediction, the ELMK method combining ELM (Extreme Learning Machine) and kernel methods was proposed [13]. It showed increased real-time forecasting performance on non-stationary time series by applying a fixed memory-based prediction algorithm that eliminates training data of the past. In addition, [14] suggested an online prediction model by applying versions of Newton and gradient descent algorithms into a loss function of the ARMA model. Although these kinds of studies on real-time prediction of time series have been conducted, there are few studies focusing on real-time analysis for multivariate time series. A new real-time multivariate forecasting model is needed to replace the univariate one to handle many time-series features collected online.
In this paper, we propose multivariate ensemble method based on dynamic transfer model for stable real-time prediction and verify its performance by applying it to predict failures with performance log data generated in a server system. Time-series prediction algorithms using a transfer model are a form of lagged regression where input variables X t affect autoregressive input variables of response variables Y t . There are various ways of building and identifying lagged regression models according to relationships between input X t and output Y t . One disadvantage of a lagged regression model is that it is difficult to consider uncertainty of a dynamic input variable process [15]. Also, when one attempts to find lags from cross correlation between input and output variables, the model can be unclear and too empirical [16]. Although some solutions such as using Monte-Carlo-based analysis [17] have been proposed to solve this problem, we suggest a novel ensemble-based method. An ensemble approach incorporates various features from various locations and scales that can be beneficial in many fields [18]. The proposed method selects various input variables for each lagged regression and generate diverse dynamic transfer models based on the selected input variables, with autoregressive time-series analysis of output variables. We perform this approach to build a basis ensemble model with offline training data and then forecast response variables in real time by updating the basis model online. Since there are many dynamic transfer basis models generated from diverse combinations of input variables, autoregressive orders, and difference orders, and because the best model is selected from a set of bases by an ensemble method, we obtain the best offline model that is suitable in as many situations as possible. Next, we update weights for each basis model and regression coefficients in real time, which enables the proposed approach to be faster, more accurate, and more stable for predictions in many environments. We consistently manage the suggested model by periodically offline updating the basis model to reflect unpredictable environmental changes that are possible in industry. Moreover, we effectively use memory since online updating does not require much information.
We verified significant performance of the proposed real-time forecasting algorithm based on the ensemble of dynamic transfer models for multivariate time series by analyzing simulated dataset. Next, we suggested a way of using the proposed method for server performance data in server systems. For server system data, we predict failures with real-time data from a performance management system. Given plenty of multivariate applications, servers, and databases in the field of performance management, it is important not only to assess status of the present system, but also to correctly predict failures. The precautionary action with high prediction accuracy enables us to consistently operate services in a cloud environment of a server system because of directly allocating resources such as CPU or memory. We showed that the provided method can achieve good prediction performance with the real-time multivariate time-series data in a server system.

Preliminaries: ARIMA, Transfer Function, Ensemble
In this section, we provide a summary of classical time-series models and an ensemble model for the prediction model proposed in the next section. A time series is a sequence of quantitative observations at successive time points. We denote x t to be the observation at time t. A sophisticated, classical, and widely used model is autoregressive integrated moving average (ARIMA) with order p, d, and q, ARIMA(p, d, q): in which a t is the zero-mean white noise at time t and B is a backward shift operator. Autoregressive and moving average, ARMA, is a special case of ARIMA(p, d, q) with a difference order d of zero. ARMA is a statistical model for time series to describe a weakly stationary stochastic process with autoregression and moving average terms. The model ARIMA(p, d, q) is ARIMA(p, 0, q) for d-step differential process (1 − D) d x t . If the moving average polynomial, (1 + ∑ q i=1 φ i B), is invertible, the model can be represented as an infinite autoregressive model, A transfer function model represents the relation between input x t and output Y t that reflects input-output delays [19]. The use of moving average polynomials δ(B) and ω(B) for both x t and Y t in the model enables determination of the causal relationship between two time series: for delay b and noise n t , δ(B)Y t = ω(B)x t−b + n t . If multiple inputs, x 1,t , . . . , x m,t exist, the model is where n t is noise [20]. Modeling transfer functions with multiple inputs and reflecting nonlinearity between input and output are challenging. In light of this observation, we propose a dynamic transfer model. An ensemble method is a way of integrating multiple machine-learning algorithms to obtain better performance than one algorithm alone. A key component in ensembles is to have multiple base learners exceeding random algorithm and including diversity. To promote diversity in base learners and numerically avoid overfitting, bootstrap aggregating is widely adopted by using a randomly drawn subset of a training set. In bootstrap aggregating, an ensemble classifier f is composed of base learners,f i , i = 1, · · · , B, with associated weight w i as follows: Weight w i is often determined according to the performance off i .

Proposed Method
The proposed method for multivariate time series consists of two components, dynamic transfer model and an ensemble of dynamic transfer model. First, we describe the formulation of dynamic transfer model and point out both similarities and differences between the existing models and the proposed one. Then, we describe the integration of dynamic transfer models for reliable and accurate prediction in multivariate time series.

Dynamic Transfer Models
In this section, we explain a dynamic transfer model using multivariate time series and expound an ensemble of such models for more accurate and stable performance. We consider two versions, one that stores a fixed number of past observations and another that does not need to store past observations but update the model parameters on a realtime basis. The basic idea of a dynamic transfer model is to combine a dynamic lagged regression model with a transfer function model to allow the model to adapt its parameters in an online fashion.
Consider a time series Y t corresponding to a response variable and independent timeseries signals x 1,t , x 2,t , · · · , x m,t , t = 1, · · · , T. We start with a dynamic regression model for response time series Y t , of autoregressive order p, difference order d, and input selector I of m × 1, where I j is 1 if the corresponding x j,t is included and zero otherwise and t is white noise. For example, for p = 2, d = 1, and I = [ As shown in the example, for d ≤ p, the model with p, d, and I produces the same result as the model with p, 0, and I.
The model attempts to determine the effect of input signals, x j,t , j = 1, · · · , m, on response signal Y t in such a way that the dynamic response of Y t may vary in x j,t using input selector I. We maintain several input selectors, I ( ) , = 1, · · · , L, in an ensemble stage, which will be explained in the following section. The model can capture the effect of x j,t that declines gradually to zero as in lagged regression models. For example, if Y t is modeled by an infinite number of lagged values x 1,t , among m = 4 input signals, with an exponentially decaying effect.
Then the model is rewritten as Y t = w 1−βB x 1,t + (t) for |β| < 1 and is simplified as a lagged dependent variable model, which is an instance of the proposed model in (1) The dynamic transfer model shares similarity with an autoregressive moving average with external terms (ARMAX) model. ARMAX is an application of ARMA with external explanatory variables. However, we mention the following two differences between the two models. First, the proposed model considers input signal x j,t to be stochastic, whereas ARMAX models consider external inputs as deterministic. We separately build the timedependent model of input signal x j,t . Second, the model allows the dynamics of a response signal by selection of input signals, using input selector I, in various ways. This approach enables the model to generate several meaningful results that will be used and aggregated in the ensemble stage.
Indeed, the proposed model faces a few computational challenges. For model training, it attempts to find the relationship between response signal Y t and current input signal x j,t only at time t. Given the input signals as a stochastic process, we try to explain Y t using input signals at a certain fixed time, usually current time t. This approach is considered a model simplification, and will be compensated by the ensemble stage in the next section. Difference order is also challenging, and several strategies such as empirical analysis of autocorrelation, numerical search of the lowest standard deviation, and theoretical testing of stationarity can be adopted in practice. For model identifiability, we need to fix p, d, and I in the model. Identification of optimal parameters is nontrivial but challenging. For example, to find the optimal difference order d, one can adopt several strategies such as empirical analysis of autocorrelation, numerical search of the lowest standard deviation, and theoretical testing of stationarity. To cope with this challenge, we set difference order d from 1 to D, assessing its performance in the following ensemble stage. The same strategy is applied to p and I. For fixed p, d, and I in the model, the model is identifiable, and we estimate the parameters using least-squares. The estimates are consistent but possibly biased in some cases [21]. We similarly build an autoregressive model for input x j,t of autoregressive order p j and difference order d j , We represent the prediction of input signal x j,t+h , h ≥ 1 by available measurements up to current time t when predicting Y t+h in Equation (1). This procedure is commonly used to replace an unobserved, usually random input with its expectations from an auxiliary model [22]. We denote the h-step-ahead prediction of Y t+h using observations up to t byŶ t+h|t . For example, suppose the exemplary model in (2) is used, and the following autoregressive models for x 1,t and x 4,t were obtained, respectively: Then, the one-step-ahead prediction Y t+1|t is obtained as follows:

Ensemble of Dynamic Transfer Models
For accurate and reliable online prediction using multivariate time series, we propose an ensemble approach that generates numerous dynamic transfer models, described in the previous section and used as a base learner in the ensemble method, and select a few models that work well in predefined prediction tasks. Then, we update the selected models on a real-time basis using either a fixed amount of past data or recursive least-squares.
To build an ensemble model, we conduct the following three steps: Step 1.
Generate numerous candidate models for response Y ( ) t as in (1) with p , d , and I , = 1, . . . , L, using training observations.
Choose the top-K models as base learners among the L candidate models in terms of minimum prediction error.
Generate prediction models for input variables, x j,t , using training observations.
The above steps are performed in an offline training session for a h-step-ahead prediction task. Specifically, we consider models of autoregressive orders from 1 to P, difference orders from 1 to D, and input variables no greater than q among m variables, which brings the total number of considered models to PD ∑ q j ( m j ) in Step 1. If PD ∑ q j ( m j ) models are computationally feasible to handle, we set L = PD ∑ q j ( m j ); otherwise, L is a large number that can be handled computationally. It is possible that different models for x j,t are used in Step 3. We apply a model-weighting strategy based on prediction error [23]. For a candidate model Y ( ) t , the model-performance weight, w , in training is calculated and normalized as follows: in which parameter G is the number of time observations for performance evaluation in offline training. Since the weight is reciprocal to the sum of squared error, the larger is w , the better is the model. Then, we choose the top-K models,Ŷ t , · · · ,Ŷ (K) t as base learners in terms of w , and we build the final ensemble model as the following: Similarly, for a prediction model of each input x j,t in Step 3, we consider models of autoregressive orders from 1 to P , and difference orders from 1 to D and construct an ensemble of them similarly. In short, the offline training session produces K feasible models of input x j,t for the h-step-ahead prediction task, yieldingx j,t+h . ThenŶ t+1|t andŶ t+2|t that isŶ t+h|t in (1), are constructed usingx j,t+h .
When online streaming observations of Y t , x 1,t , x 2,t , · · · , x m,t , t > T are available, we update the coefficients (β 0 , β k , and α j in (1)) in each of the top-K models. We have two options in updating. One is to use a fixed number of past observations from t and refit each model to calculate the coefficients. This option is reasonable in that we cannot indefinitely store streaming data in memory due to fixed memory size. Denote this version as EDT-w (ensemble of dynamic transfer model using fixed window). The second option is to apply recursive least-squares to update the coefficients. The second option also makes sense in that not only does the recursive least-squares (RLS) approach enable fast updating, but also makes possible the use of all past observations without storing them in memory. Denote this version as EDT-r (ensemble of dynamic transfer model using RLS). Then we update model-performance weight w using the most recent prediction task as follows:

Experiments
For all experiments, we deployed six traditional time-series forecasting methods, ARIMA, ARIMAX, ANN (Artificial Neural Network), ANNX (Artificial Neural Network with extra predictor), RNN (Recurrent Neural Network), VAR (Vector Autoregressive), and the two proposed methods, EDT-w and EDT-r. ANN is a collection of nodes, also known as neurons, with a layer structure to map inputs to outputs. ANNX is an application of ANN with external explanatory variables. RNN is a class of artificial neural networks with internal, hidden and autoregressive, states, and VAR is a multivariate autoregressive version of ARIMA. For the proposed methods, we set the parameters as P = 9, D = 3, P = 5, D = 1, K = 40, and window size = 200. To compare short-term and long-term forecasting performance, we provide four forecasting errors for t + 1, t + 3, t + 6, and t + 12 in terms of RMSE (root mean square error) and repeat the computation three times. Also, computing time in seconds is provided as reference, and all of them are in a reasonable range. The computing time of only two proposed methods are different for t + 1, t + 3, t + 6, and t + 12 because they select the local optimal parameters while training depending on the time window.

Simulation 1
The first simulation dataset is obtained using VAR.sim function from package multivar in R that aims to generate VAR-based simulation data. Since real-life datasets in the later sections are eight-dimensional, we generated the same eight-dimensional time-series data, (x 1,t , . . . , x 8,t ), and set the last one, x 8,t as a target time-series, y t , for forecasting. One example is provided in Figure 1. The length of each time-series is 5000 with the first 4000 data used for training and the remaining 1000 data for testing. There is no anomaly in this first simulation with stationarity compared with simulations without stationarity in the next section. Outputs for the first simulation are shown in Table 1. For t + 1, VAR is the most likely to show the best performance since the first simulation data were generated by the VAR model of VAR.sim function in R. Indeed, VAR was superior for short-term prediction. However, the proposed EDT-w method dominates the other methods for t + 3, t + 6, and t + 12 with longer-term.

Simulation 2
We used arima.sim function from package stats in R to generate the second simulation dataset. First, we generated seven independent time-series, (x 1,t , . . . , x 7,t ), as a form of ARIMA(m, n, 0) where 1 ≤ m ≤ 5 and 0 ≤ n ≤ 2. We also pre-set a range of autoregressive coefficients to be between −1 and 1. Then, we defined y t as a target time-series that is dependent on (x 1,t , . . . , x 7,t ). We first generated independent time-series with the same rule for (x 1,t , . . . , x 7,t ) and define it as z t . Then, we added a linear combination of (x 1,t , . . . , x 7,t ) at time t with coefficients between −1 and 1 to z t and defined it as the final target time-series, y t . For example, y t as a form of ARIMA(2, 2, 0) can be repreresented as y t = z t + α 1 x 1,t + · · · + α 7 x 7,t where z t = β 1 z t−1 + β 2 z t−2 + θ 1 t−1 + θ 2 t−2 + t and −1 ≤ α, β, θ ≤ 1. As in the previous simulations, the length of each time-series is 5000 items, and the first 4000 were used for training and the later 1000 for testing. One example is visualized in Figure 2.  The output in Table 2 is very similar to that of the previous simulation because this second simulation data were created using arima.sim function in R. Thus, ARIMA shows the lowest error for t + 1 and t + 3. However, as in the previous case, the proposed EDT-w method is always better than the other methods for t + 6 and t + 12 with longer-term.  [24], registered at UCI Machine-Learning Repository, has 9358 samples of hourly averaged responses obtained by five metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. Originally, this was used to prove cross-sensitivities and improve sensor capabilities of concentration estimation [25]. For application to the proposed method, we selected eight columns, CO(GT), PT08.S1(CO), C6H6(GT), PT08.S2(NMHC), NOx(GT), PT08.S3(NOx), PT08.S4(NO2), and PT08.S5(O3), and set the C6H6(GT) representing hourly averaged Benzene concentration as a target value for time-series forecasting because this is one of the most important target values in the previous reference. This is visualized in Figure 3. We use four types of experiments, and the first 5000, 6000, 7000, and 8000 data are used as training data for each. Then, the next immediate 1000 data are used as testing data for each scenario. Missing data are originally tagged with a −200 value and were replaced with na.approx function in R.
Outputs for the air quality application are shown in Table 3. EDT-w always shows the best performance except for two cases for t + 1. In addition, forecasting performance in terms of RMSE is usually more than twice that of the other methods. Thus, this experiment provides evidence of the strength of the proposed method.
In conclusion, EDT-w usually shows the best performance for long-term time-series forecasting such as t + 6 and t + 12. Also, it shows not the best but very high performance even for short-term forecasting such as t + 1 and t + 3. Therefore, the proposed EDT-w can be a good option for accurate and stable forecasting performance for longer-term.

Application 2: APM Data
Various events occur while applications are in use. Modifying an existing application, sudden heavy inflow of users, and delayed processing are representative examples of such events. They can be stored in a form of log data in connected database or server systems, and APM (application performance management) is an analytic solution that can detect and resolve failure or trouble in web applications by analyzing such log data. The given APM dataset provides 54,995 items of time-series data collected every 10 s for eight log data of Used memory, System CPU, User CPU, Count, Num http, Num javad, Num jdbc, and Response, as in Table 4, and they are visually presented in Figure 4. Table 4. Summary of data attributes in the APM data.

Log Data Description
Used memory memory occupancy for a target application System CPU CPU occupancy for a target application User CPU CPU occupancy for all application Count The number of simultaneous access Num http The number of thread for a target application about http requests Num javad The number of thread for a target application about javad Num jdbc The number of thread for a target application about jdbc Response Response time of a target application to a request of users We set the last variable, Response, as a target time-series to predict and the remaining seven variables as its covariates. The first 5000 data items were used for training, and eight manually selected subsets of 1000 items of time-series data are used for testing. We divide Response into eight subsets for comparison: four are stationary according to a stationarity test and the other four are non-stationary. Response without any system failure or trouble is usually in the normal range regardless of time point.
Outputs for the APM application are shown in Table 5. APM based on time-series forecasting is reasonable because its computing time is usually less than 1 or 2 s, which is much smaller than the 10-second interval of collecting data. Next, ANN or VAR shows better performance for t + 1 and t + 3, while the proposed EDT-r and EDT-w do so for t + 6 and t + 12. The most interesting thing is that VAR shows extremely bad performance for t + 6 and t + 12, indicating that VAR is not good for time-series with anomalies, while the proposed EDT-r and EDT-w are very stable in such a situation. Lastly, the second and third non-stationary cases for t + 12 are the only two where EDT-r achieves the best performance.

Conclusions
For the prediction task for streaming multivariate time series, numerous models are possible, ranging from classical models to modern deep-learning models. Effective models need to be accurate in a certain h-time ahead task byŶ t+h and consistent and reliable in several h-time head tasks. Transfer models together with ARIMA are nonlinear, rather classical, and established forecasting models. We aimed to improve the models to make them dynamic by including variable selectors and stable by constructing an ensemble of them. Though many sub-models participate in the final model, each of which is easy to modify in the face of new online data. From a computational viewpoint, model building requires a comparable amount of time in comparison to several existing models such as VAR and RNN, and computation time will decrease if parallel processing is in use for the building of each sub-model. The prediction performance of the proposed model is promising and effective not only in two adopted models, but also in two real-life multivariate time series. For the prediction task of atmospheric pollution, the data are interrelated and stochastic by nature, so the need for multivariate time series, including external variables, escalates. The current study only included seven internal chemicals for the target variable of air pollution. In future, we hope to include external factors in the proposed model and to verify the model capacity. We plan to refine the model by adding a forgetting factor to updating sub-models and by reflecting theoretical aspects of the models.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author while restrictions may apply to the availability of the APM data.

Conflicts of Interest:
The authors declare that there is no conflict of interest regarding publication of this paper.