A Period-Aware Hybrid Model Applied for Forecasting AQI Time Series

Wang, Ping; Feng, Hongyinping; Zhang, Guisheng; Yu, Daizong

doi:10.3390/su12114730

¹ College of Resources and Environment, Shanxi University of Finance and Economics, Taiyuan 030006, China

² School of Mathematical Sciences, Shanxi University, Taiyuan 030006, China

³ School of Economics and Management, Shanxi University, Taiyuan 030006, China

* Author to whom correspondence should be addressed.

Sustainability 2020, 12(11), 4730; https://doi.org/10.3390/su12114730

Submission received: 30 April 2020 / Revised: 3 June 2020 / Accepted: 3 June 2020 / Published: 9 June 2020

Abstract

An accurate, reliable and stable air quality prediction system benefits public health and the management of the atmospheric environment; therefore, many models, individual or hybrid, have been applied to the prediction problem. However, many of these models either ignore or improperly extract the period information in air quality index (AQI) time series, which greatly impairs their learning efficiency. In this paper, a period extraction algorithm is proposed using a Luenberger observer, and a novel period-aware hybrid model combining the period extraction algorithm with traditional time series models is then built to forecast AQI time series with nonlinear and non-stationary noise. The hybrid model is implemented in multiple phases. In the first step, the Luenberger observer is used to estimate the implied period function in the one-dimensional AQI series, and the analyzed time series is then mapped to the period space through this function to obtain the period information sub-series of the original series. In the second step, the period sub-series is combined with the original input vector, as additional input components aligned by time point, to establish a new data set. Finally, the new data set containing period information is used to train the traditional time series prediction models. Both theoretical analysis and experimental results obtained on hourly AQI values of Beijing, Tianjin, Taiyuan and Shijiazhuang in North China show that the hybrid model with period information is more robust and more accurate than the traditional benchmark models.

Keywords: air quality index; time series forecasting; Luenberger observer; period-aware hybrid model

1. Introduction

The air quality index (AQI) is a dimensionless quantity that reflects the air quality of a specific area quantitatively [1]. It is a comprehensive index covering six air pollutants (PM₂.₅, PM₁₀, CO, O₃, SO₂, NO₂) and is calculated from the individual air quality indices (IAQI) with reference to the air quality standards (GB3095-2012) in China [2,3]. By definition, AQI values are assigned to classes corresponding to different air quality conditions, which gives the relevant government departments a basis for effectively reducing air pollution emissions and helps the public plan outdoor activities sensibly. Accurate and reliable prediction of air quality trends has become a significant focus in sustainable development, as it is closely tied to ecological protection and human health [1,4,5,6].

Because AQI time series are non-stationary, nonlinear and complex, accurate AQI prediction is difficult [5,7,8]. At present, hybrid forecast models composed of multiple single models are attracting increasing attention as a way to improve the accuracy and robustness of AQI models. Wang et al. [4] proposed a hybrid model based on multiple intelligent algorithms that integrates a decomposition technique with the extreme learning machine (ELM), and it performed well in AQI forecasting for Beijing and Shanghai. Wu and Lin [6] applied wavelet decomposition to generate sub-sequences which were smoothed by sample entropy as the input of least squares support vector machine (LSTM) to form the hybrid model for daily AQI forecasting. Jiang et al. [5] used the improved pigeon-inspired optimization (IPIO) method to optimize the parameters of ELM, which was applied to assemble the subseries, and then clustered the prediction results using K-means clustering combined with multidimensional scaling. This method was applied to Harbin’s AQI at different forecasting terms and showed better generalization ability than the benchmark models. Zhu et al. [7] proposed a complex network method based on phase space reconstruction theory to analyse the AQI fluctuation transmission properties of three cities in China. Zhu et al. [8] presented a novel optimal-combined AQI forecasting method that avoids the uncertainty and instability introduced by blindly combining models.

Moreover, AQI time series show obvious periodicity, which may stem from the periodicity of the factors affecting air quality (e.g., seasonal changes in meteorological factors or diurnal variation in pollution source emissions). Time series prediction can be understood as forecasting a sequence of future values from a sequence of past values. For a model to capture the periodic characteristics of a time series, the input sequence usually needs to be longer than the intrinsic period of the series. The model therefore needs a long-term memory that can reuse a specific input-output mechanism recording the potential periodicity [9]. However, the AQI prediction models mentioned above do not take periodic characteristics into consideration when establishing the input sequence [10], and experiments have shown that these approaches are unable to detect the periods underlying the time series [9], which reduces their prediction accuracy. To integrate periodic characteristics into the prediction model, Bernas and Płaczek [10] proposed a period-aware local model that detects the periodicity of the time series with the autocorrelation function. The disadvantage of this method is that the period detection algorithm is a linear function, which struggles to represent complex nonlinear correlation. Cinar et al. [9] suggested an extended attention model that uses the (relative) positions in the time series to reveal periodic changes and applied it to sequence-to-sequence recurrent neural networks (RNNs). Whether this method suits traditional machine learning models, however, remains to be explored. The use of period extraction methods in AQI prediction can improve prediction performance [9,10,11], but period extraction is still subject to several constraints. Firstly, few period extraction techniques consider whether they are appropriate for non-stationary nonlinear time series, and the related research lacks a corresponding theoretical basis. Secondly, the rationality of a hybrid model constructed from the extracted periodic information and a traditional model needs further verification. This paper addresses both problems.

In this paper, a novel period-aware hybrid model combining a Luenberger observer with traditional time series models is developed to predict AQI. In the hybrid model, the Luenberger observer is applied to detect the periodic features implied in the time series data and to generate the periodic information. The input sequence, composed of the periodic information and past observations, then serves as the input vector of a traditional time series prediction algorithm, which is trained on it to establish the period-aware hybrid forecasting model and to strengthen its ability to analyze and interpret periodic information. The main contributions of this study are as follows: (1) A period extraction algorithm based on the Luenberger observer is proposed, which takes full account of the dynamics and nonlinearity of potential periodic characteristics. (2) For the first time, a period-aware hybrid model combining the period extraction algorithm with traditional time series prediction models is established to improve the forecast accuracy of AQI series.

2. Data Description

The current AQI system in China, which is based on single-pollutant indicators, represents air quality assessment results comprehensively and reflects the actual air quality situation clearly and intuitively [7,12,13]. The AQI samples used in the simulation experiments come from four cities, Beijing (BJ), Tianjin (TJ), Taiyuan (TY) and Shijiazhuang (SJZ), shown in Figure 1. These cities are located in North China and have suffered many air pollution events owing to their terrain and large anthropogenic emissions, so taking them as research objects is of practical significance [14,15]. The research data are hourly AQI values in July 2017, which change markedly between day and night. Daily data have a coarser time scale than hourly observations, so some significant period features of AQI are likely to be hidden in them. Hourly AQI data are therefore well suited to exploring the predictive information implied by AQI period characteristics and to obtaining more convincing results. The few missing AQI values in the data set are filled by linear interpolation.
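The gap-filling step can be illustrated with a minimal sketch. The paper does not say which tool performed the interpolation, so the helper `fill_missing_linear` and the sample values below are hypothetical; the sketch assumes missing hours are marked as NaN and interpolates over the hour index with NumPy.

```python
import numpy as np

def fill_missing_linear(series):
    """Fill NaN gaps in a 1-D series by linear interpolation over the index."""
    values = np.asarray(series, dtype=float)
    idx = np.arange(values.size)
    known = ~np.isnan(values)
    # np.interp interpolates between known points and holds the nearest
    # known value constant beyond the first and last known points.
    return np.interp(idx, idx[known], values[known])

# Made-up hourly values with gaps (not taken from the AQI data set).
hourly_aqi = [80.0, np.nan, 90.0, 95.0, np.nan, np.nan, 110.0]
filled = fill_missing_linear(hourly_aqi)
```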

Table 1 presents the descriptive statistics of AQI in the study cities, calculated with Eviews 6. Each series consists of 744 AQI observations. The mean AQI values range from 78.71371 to 97.21505, which falls in AQI Grade II under the Chinese AQI standard. TJ has the smallest mean AQI among the study cities, meaning that its air quality is the best, probably because of its proximity to the sea. In comparison, SJZ has the largest mean and the most serious air pollution. Moreover, the larger standard deviation (Std.dev.) in Beijing indicates that its observations deviate more widely from their mean. Table 1 also shows that none of the series follows a normal distribution, as indicated by the small p-values.

3. Methods

The time series considered in this paper is a discrete univariate series, that is, a sequence of observations defined as $X = \{x_{1}, \dots, x_{t}, \dots, x_{n}\}$. The time interval between any two adjacent points in the AQI series is equal, so the time step is constant. The time series prediction problem can be stated as follows: at time point T, the future values $y_{T} = (y_{T + 1}, \dots, y_{T + i}, \dots, y_{T + T^{'}})$ are estimated by the prediction function $y = f(x)$ established from the historical sequence $x_{T} = (x_{T - w}, \dots, x_{T - j}, \dots, x_{T})$, where T denotes the T-th time point, $T^{'}$ is the forecast horizon (when $T^{'} > 1$, the problem is a multi-step prediction problem), and w represents the lag order of past observations.

3.1. Period Analysis

To determine whether the experimental time series contain periodic characteristics, a period analysis approach based on the Box methodology [16] is applied. When there are no missing values, the period of a time series can be mined with the autocorrelation function (ACF). Suppose $X = \{x_{1}, \dots, x_{t}, \dots, x_{n}\}$ is the time series, where $x_{t}$ represents the value of X at time t. If h denotes the lag $(h = 1, 2, \dots, T)$, the ACF of the given time series X at lag h is calculated by the following formulas:

$$ACF(X, h) = \frac{AVF(X, h)}{\max_{i \in \{1, 2, \dots, T\}} AVF(X, i)}, \tag{1}$$

$$AVF(X, h) = \sum_{i = \max(1, -h)}^{\min(n - h, n)} [x_{i + h} - \bar{x}] [x_{i} - \bar{x}], \tag{2}$$

where $x_{i}$ is the observed value of X at time i, $\bar{x}$ represents the mean of the time series, and $n = \mathrm{card}(X)$ is the cardinality (length) of the time series.
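Equations (1) and (2) translate directly into a few lines of NumPy. The sketch below is illustrative only: the function names and the synthetic sine series are assumptions, and the peak-picking helper `dominant_period` is not part of the paper. It picks the highest interior local maximum of the ACF, because for a short series the value at lag 1 can exceed the value at the true period.

```python
import numpy as np

def avf(x, h):
    """Sum of Equation (2) for a non-negative lag h."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    n = xc.size
    return float(np.sum(xc[h:] * xc[:n - h]))

def acf(x, max_lag):
    """Equation (1): AVF at lags 1..max_lag divided by the largest of them."""
    vals = np.array([avf(x, h) for h in range(1, max_lag + 1)])
    return vals / vals.max()

def dominant_period(x, max_lag):
    """Lag of the highest interior local maximum of the ACF (hypothetical helper)."""
    s = acf(x, max_lag)  # s[k] is the ACF value at lag k + 1
    peaks = [k for k in range(1, len(s) - 1) if s[k] > s[k - 1] and s[k] > s[k + 1]]
    if not peaks:
        return None
    return max(peaks, key=lambda k: s[k]) + 1

# Synthetic series with a 24-step cycle (made-up data, not the AQI series).
t = np.arange(240)
x = np.sin(2 * np.pi * t / 24)
scores = acf(x, 36)
period = dominant_period(x, 36)
```

On noisy data the list of local maxima grows quickly, which is the sensitivity to noise that motivates the observer-based approach.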

3.2. Period Extraction Algorithm Based on Luenberger Observer

Given the time series $x(t)$, $t = 1, \dots, n$, designing a forecasting mechanism for $x(t)$ is essentially a matter of exploiting prior knowledge about $x(t)$ itself. To make use of the prior knowledge separately, we divide the time series into two parts:

$$x(t) = d(t) + r(t), \tag{3}$$

where the dynamic prior information about $x(t)$ is contained in $d(t)$ and the other information is implied in the remainder $r(t)$. In the proposed model, the remainder $r(t)$ is treated by conventional time series forecasting methods and the part $d(t)$ is treated by the dynamic method. The main dynamic prior information considered in this paper is the periodicity of the time series. Suppose that $d \in L^{2}[0, T]$ is a periodic function with period T. Then it has the Fourier expansion

$$d(t) = \sum_{j = 0}^{\infty} \left(a_{j} \sin \frac{2 j \pi t}{T} + b_{j} \cos \frac{2 j \pi t}{T}\right), \tag{4}$$

where $a_{j}, b_{j}$, $j = 0, 1, \dots$ are the Fourier coefficients. When m is large enough, we have the approximation

$$d(t) \approx d_{m}(t) := \sum_{j = 0}^{m} \left(a_{j} \sin \frac{2 j \pi t}{T} + b_{j} \cos \frac{2 j \pi t}{T}\right). \tag{5}$$

Since $d_{m}$ is a harmonic signal, there must exist a pair $(G, C)$ by which $d_{m}$ can be represented dynamically:

$$\dot{v}(t) = G v(t), \quad d_{m}(t) = C v(t), \tag{6}$$

for some initial state $v(0)$. For example, we can choose G such that its eigenvalues satisfy

$$\sigma(G) = \left\{ \frac{2 j \pi}{T} i \;\middle|\; j = 1, 2, \dots, m \right\} \cup \{0\}. \tag{7}$$

Thanks to this dynamic representation, the dynamic prior knowledge can be used through the system in Equation (6). Indeed, if the state $v(t)$ of Equation (6) can be calculated for all time $t \geq 0$, then $d_{m}(t)$ is available for all time $t \geq 0$ as well. As a result, $d_{m}$ can be forecast by virtue of the dynamic prior knowledge about d. By general control theory, the state $v(t)$ can be obtained by the Luenberger observer.

$$\begin{cases} \dot{\hat{v}}(t) = G \hat{v}(t) + L [s(t) - C \hat{v}(t)], \\ \hat{d}_{m}(t) = C \hat{v}(t), \end{cases} \tag{8}$$

where $L \in R^{n \times 1}$ is chosen such that the matrix $G - LC$ is Hurwitz. Let $\tilde{v}(t) = v(t) - \hat{v}(t)$. Then the error is determined by

$$\dot{\tilde{v}}(t) = (G - LC) \tilde{v}(t), \tag{9}$$

which implies that

$$[d_{m}(t) - \hat{d}_{m}(t)] \to 0 \quad \text{as } t \to \infty. \tag{10}$$

So $\hat{d}_{m}(t)$ is an estimate of $d_{m}(t)$ when t is large enough.
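The step from Equation (8) to Equation (9) is not spelled out. The short derivation below fills it in under the assumption, not stated explicitly in the text, that the signal fed to the observer is exactly the harmonic part, $s(t) = C v(t) = d_{m}(t)$.

```latex
\begin{aligned}
\dot{\tilde{v}}(t) &= \dot{v}(t) - \dot{\hat{v}}(t) \\
                   &= G v(t) - G \hat{v}(t) - L\,[C v(t) - C \hat{v}(t)] \\
                   &= (G - LC)\,\tilde{v}(t), \\
d_{m}(t) - \hat{d}_{m}(t) &= C \tilde{v}(t) = C e^{(G - LC) t}\,\tilde{v}(0) \to 0
\quad \text{as } t \to \infty ,
\end{aligned}
```

where the limit holds because $G - LC$ is Hurwitz. If the observer were instead driven by the full series $s(t) = d_{m}(t) + r(t)$, the same steps would give $\dot{\tilde{v}}(t) = (G - LC)\tilde{v}(t) - L\,r(t)$, so the remainder would act as a forcing term on the estimation error.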

A significant advantage of the period extraction algorithm based on the dynamic method is that it is suitable for time series with non-stationary characteristics. In fact, we divide the series $x(t)$ such that $r(t)$ in Equation (3) is a stationary series, so that the conventional methods are available for forecasting it. Another advantage of the decomposition is its good compatibility with other prediction models: the periodic information obtained by the decomposition in Equation (3) can be used as input to conventional prediction models in vector form. We point out that only a few data are required to estimate the state v in Equation (8), provided that L is chosen such that $s(G - LC)$, the largest eigenvalue of $G - LC$, is sufficiently small. This means that the dynamic method can still work well even if little prior data is available.
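A minimal numerical sketch of Equations (6) to (8) follows. Several choices in it are assumptions made for illustration and are not taken from the paper: G is built from 2×2 rotation blocks (so its eigenvalues are 0 and the conjugate pairs $\pm 2 j \pi i / T$), the gain is $L = k C^{T}$ (which makes $G - LC$ Hurwitz for this skew-symmetric, observable pair), the observer is integrated with a fixed-step RK4 scheme, and the input is a noise-free synthetic harmonic signal.

```python
import numpy as np

def harmonic_pair(period, m):
    """Build (G, C) whose output C v(t) is a constant plus m harmonics of `period`."""
    size = 2 * m + 1
    G = np.zeros((size, size))
    C = np.zeros((1, size))
    C[0, 0] = 1.0                              # constant (j = 0) component
    for j in range(1, m + 1):
        w = 2 * np.pi * j / period
        r = 2 * j - 1
        G[r, r + 1], G[r + 1, r] = w, -w       # rotation block, eigenvalues +/- i w
        C[0, r] = 1.0
    return G, C

def run_observer(signal, G, C, gain, t_end, dt=0.1):
    """Integrate Equation (8) with RK4; return the times and the estimate C v_hat(t)."""
    L = gain * C.T                             # assumed gain choice: L = k C^T
    v_hat = np.zeros((G.shape[0], 1))

    def f(t, v):
        return G @ v + L * (signal(t) - (C @ v).item())

    times = np.arange(0.0, t_end, dt)
    est = np.empty(times.size)
    for i, t in enumerate(times):
        est[i] = (C @ v_hat).item()
        k1 = f(t, v_hat)
        k2 = f(t + dt / 2, v_hat + dt / 2 * k1)
        k3 = f(t + dt / 2, v_hat + dt / 2 * k2)
        k4 = f(t + dt, v_hat + dt * k3)
        v_hat = v_hat + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return times, est

# Synthetic signal with 24 h and 12 h components (made-up amplitudes).
def signal(t):
    return 90.0 + 20.0 * np.sin(2 * np.pi * t / 24) + 8.0 * np.cos(4 * np.pi * t / 24)

G, C = harmonic_pair(period=24.0, m=2)
times, est = run_observer(signal, G, C, gain=0.1, t_end=480.0)
tail = times >= 432.0
max_tail_error = float(np.max(np.abs(est[tail] - signal(times[tail]))))
```

In this setup a larger gain does not necessarily speed up convergence, since some eigenvalues of $G - LC$ move back toward the imaginary axis as k grows; the gain of 0.1 is simply a value that works for this toy signal.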

3.3. Period-Aware Hybrid Model

A new period-aware hybrid model is proposed by combining the period extraction algorithm with a traditional time series prediction model. The details of this model are described in this section. The traditional methods used in this paper include a statistical model, the autoregressive integrated moving average model (ARIMA), and two machine learning algorithms, the artificial neural network (ANN) and the support vector machine (SVM), all of which have been verified to handle time series well [3,17,18,19]. The novelty of the hybrid model lies mainly in the preprocessing stage, and its construction can be summarized as follows. Firstly, the period information of the one-dimensional time series is extracted by the period extraction algorithm based on the Luenberger observer. Then, the period information, itself a time series, is added to the input vector as additional feature components at the corresponding time points to build a new dataset, which incorporates the period information into the original data. Finally, the new dataset is used to train the traditional time series forecast models. The core of the hybrid model is therefore the validity of the period extraction, which fundamentally determines whether the generalization ability of the hybrid model can be improved. The modeling process for AQI time series is explained step by step in Table 2.
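The dataset construction described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the exact alignment of the period components is given in Table 2, so pairing each lagged observation with the period value at the same time point is one plausible reading; the data are synthetic; and an ordinary least-squares fit stands in for ARIMA, ANN or SVM. All function and variable names are hypothetical.

```python
import numpy as np

def build_period_aware_dataset(x, d, lag):
    """Pair each window of past observations with the period values at the same time points."""
    x = np.asarray(x, dtype=float)
    d = np.asarray(d, dtype=float)
    features, targets = [], []
    for t in range(lag, len(x)):
        # Row layout: [x_{t-lag}, ..., x_{t-1}, d_{t-lag}, ..., d_{t-1}], target x_t
        features.append(np.concatenate([x[t - lag:t], d[t - lag:t]]))
        targets.append(x[t])
    return np.array(features), np.array(targets)

# Synthetic stand-in data: 744 hourly points with a 24 h cycle plus noise.
rng = np.random.default_rng(0)
hours = np.arange(744)
d = 20.0 * np.sin(2 * np.pi * hours / 24)       # stand-in for the extracted period sub-series
x = 90.0 + d + rng.normal(scale=5.0, size=hours.size)

lag = 1                                          # history size reported for ANN and SVM
X, y = build_period_aware_dataset(x, d, lag)
n_train = 480 - lag                              # targets taken from the first 480 points
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

# Placeholder learner: least squares with an intercept (not one of the paper's models).
A_train = np.column_stack([X_train, np.ones(len(X_train))])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_pred = np.column_stack([X_test, np.ones(len(X_test))]) @ coef
```

Any regressor that accepts a feature matrix can replace the least-squares step; only the extra period columns distinguish the hybrid input from the plain lagged input.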

3.4. Performance Indices of Model’s Prediction Accuracy

In this paper, mean absolute error (MAE), root mean square error (RMSE), index of agreement (IA) and direction accuracy (DA) are used to quantitatively assess the performance of the different models. Generally, the smaller the RMSE and MAE values, the closer the forecast values are to the observations and the stronger the generalization capability of the forecast system. Conversely, larger values of IA and DA indicate more accurate predictions. The performance indices are calculated by Equations (11)–(14), where $y_{i}$ is the observation, $a_{i}$ represents the corresponding forecasting result at point i, and $\bar{y}$ indicates the average of the observed values.

$$MAE = \frac{1}{l} \sum_{i = 1}^{l} |a_{i} - y_{i}|, \tag{11}$$

$$RMSE = \sqrt{\frac{1}{l} \sum_{i = 1}^{l} (a_{i} - y_{i})^{2}}, \tag{12}$$

$$IA = 1 - \frac{\sum_{i = 1}^{l} (a_{i} - y_{i})^{2}}{\sum_{i = 1}^{l} (|a_{i} - \bar{y}| + |y_{i} - \bar{y}|)^{2}}, \tag{13}$$

$$DA = \frac{1}{l} \sum_{i = 1}^{l} w_{i}, \quad w_{i} = \begin{cases} 1, & \text{if } (y_{i + 1} - y_{i}) (a_{i + 1} - y_{i}) > 0, \\ 0, & \text{otherwise}. \end{cases} \tag{14}$$
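The four indices in Equations (11)–(14) can be written compactly with NumPy. The function names and sample values below are illustrative. One assumption is flagged in the code: Equation (14) refers to $y_{i+1}$, so only $l - 1$ consecutive pairs exist in a series of length l, and this sketch averages over those pairs.

```python
import numpy as np

def mae(y, a):
    """Equation (11): mean absolute error."""
    return float(np.mean(np.abs(a - y)))

def rmse(y, a):
    """Equation (12): root mean square error."""
    return float(np.sqrt(np.mean((a - y) ** 2)))

def index_of_agreement(y, a):
    """Equation (13): index of agreement."""
    y_bar = y.mean()
    denom = np.sum((np.abs(a - y_bar) + np.abs(y - y_bar)) ** 2)
    return float(1.0 - np.sum((a - y) ** 2) / denom)

def direction_accuracy(y, a):
    """Equation (14), averaged over the l - 1 available pairs (assumption)."""
    hits = (y[1:] - y[:-1]) * (a[1:] - y[:-1]) > 0
    return float(np.mean(hits))

# Made-up observations and forecasts for illustration.
y_obs = np.array([10.0, 12.0, 11.0, 15.0])
a_hat = np.array([10.0, 13.0, 12.0, 14.0])
results = {
    "MAE": mae(y_obs, a_hat),
    "RMSE": rmse(y_obs, a_hat),
    "IA": index_of_agreement(y_obs, a_hat),
    "DA": direction_accuracy(y_obs, a_hat),
}
```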

4. Results and Discussion

4.1. Period Analysis Results

For the period obtained by ACF analysis to be reliable, the time series should at least be weakly stationary, that is, its mean and covariance should be constant [10]. A non-stationary original series can be differenced to obtain a de-trended, stationary series. In our experiments, the number of lags is set to 200 to fully reflect the implied periodic characteristics of the data (the lags should cover at least one period). Figure 2 illustrates the ACF values of each AQI series and displays the different periods in each series. The ACF analysis shows that the real-world AQI time series have an obvious daily period of 24 observations collected at 1 h intervals. A relatively weak period of 12 h can also be noted in these series. The superposition of different periods in the same series makes period extraction more difficult. Moreover, the ACF is sensitive to noise, which is likely to produce local maxima that are wrongly detected as periods. Therefore, this paper uses the Luenberger observer, which has good anti-interference ability, to mine the period of the time series.

4.2. The Results of Period Extraction Algorithm

The $d_{m}$ series representing the period features extracted from the AQI time series by the period extraction algorithm based on the Luenberger observer (Equation (5)) are shown in Figure 3. The $d_{m}$ series can be understood as the best approximation of the period function by the mapping function, which decomposes the original series into the m-dimensional period space and then merges it into the input space. The error terms in the figures are the differences between the original series and the $d_{m}$ series. Intuitively, not all $d_{m}$ values are very small, because the $d_{m}$ series fit only the period features of the original series and the remaining information stays in the error term. This may result in some larger error values than with other fitting algorithms.

4.3. Results of the Period-Aware Hybrid Model

In the simulation experiments, the hourly AQI series for Beijing, Tianjin, Taiyuan and Shijiazhuang are used to verify the forecast accuracy of the period-aware hybrid method, and the ARIMA, ANN and SVM models serve as benchmarks to test the advantage of the hybrid model. All the time series are split by reserving the first 480 data points for training and validation and the last 264 data points as the test set.

In the modeling process, so that each model shows its best prediction ability, the parameters p, d and q of the ARIMA model are determined according to Akaike’s information criterion (AIC). The parameter d represents the order of differencing needed to ensure stationarity of the time series, p determines the order of autoregression, and q reflects the order of the moving average [20].

In general, the input vectors of the ANN and SVM models should reflect the essential characteristics of the time series and the application environment, and there is a tradeoff between the system’s complexity and its prediction ability. If the prediction system is to detect the potential periods of a time series from its input, the input vector should contain at least one period of history values. According to the analysis in Section 4.1, all AQI time series have a period of 24 data points, so the input vector of the ANN and SVM models would need a dimension of at least 24. However, it is difficult for a model to capture the periodic nature of the series from data within one or several periods, and doing so significantly increases the complexity of model design, so this approach is not suitable for integrating period information into machine learning models. Therefore, we use the partial autocorrelation function (PACF), which represents the regression on historical values, to explain the relationship between the prediction target and the historical values [21]. With the PACF results as reference, the history size of the ANN and SVM models is set to 1. The training-validation set is further divided by keeping the first 90% for training and the last 10% for validation. The corresponding parameters are optimized by cross-validation to prevent over-fitting. The logistic sigmoid function is selected as the transfer function of the neural networks, and a polynomial kernel is used as the mapping function of the SVM models. The number of input nodes equals the dimension of the input vector of the prediction system, the hidden layer consists of three sigmoid nodes, and the output layer contains one node that gives the prediction. These settings are intended to ensure good generalization ability and stability when the models are run. The prediction results and a comparative analysis of the different models for the AQI series of the different cities are presented in the following subsections.

4.3.1. Results of the ARIMA and PARIMA

The ARIMA and the hybrid model based on ARIMA require the experimental data to meet the stationarity requirement. Table 3 exhibits the results of unit root test including Augmented Dickey-Fuller (ADF) and Phillips-Perron test after 1st difference of AQI time series. According to the displayed results, we can draw the conclusion that the AQI data are provided with the stationarity property, which means that the AQI time series meet the requirements of ARIMA modeling.

Table 4 lists the comparison results for model’s performance indices between the prediction results of different traditional models (ARIMA, ANN and SVM) and those of corresponding period-aware hybrid models (PARIMA, PANN and PSVM). In general, the models proposed in this paper have smaller MAE, RMSE and larger IA and DA in all cities, which means that these models with period information are superior to the corresponding benchmark models.

Across the four data sets for Beijing, Tianjin, Taiyuan and Shijiazhuang, the prediction accuracy of the PARIMA models is significantly higher than that of ARIMA in the process of tuning the global models. For the Beijing AQI series, the MAE of the hybrid model PARIMA is 17.58% lower than that of the individual model ARIMA. Similarly, for the AQI series of Tianjin, Taiyuan and Shijiazhuang, the MAEs of the proposed models are far lower than those of the traditional method. The improvement in IA is smaller than that in the accuracy indices (MAE, RMSE) because the IA of the baseline ARIMA is already very good, which leaves little room for significant improvement. However, the DA results of the PARIMA models are always significantly better than those of ARIMA: for the four AQI time series, the hybrid models bring increases of 11.23%, 22.28%, 8.58%, and 23.63% over the individual models, respectively. A better direction index helps to improve the accuracy of AQI trend estimation, because it is an important basis for judging whether the direction of the time series changes. To show the advantages of the PARIMA model more directly, AQI scatterplots of the ARIMA and PARIMA models are presented in Figure 4, comparing the predicted values with the actual AQI values. The straight lines in different colors in the scatterplots are the linear regression lines of the predicted values of the ARIMA and PARIMA models, respectively. The $R^{2}$ values of the hybrid model in Beijing, Tianjin, Taiyuan and Shijiazhuang are 0.9736, 0.9732, 0.9816, and 0.9709, which indicates stronger fitting performance than the values 0.9444, 0.9343, 0.9517, and 0.9187 of the corresponding ARIMA models. The figures also show that the regression lines of the forecasts from the models with period information are closer to the diagonals that represent a perfect fit, that is, closer to the observed values. Moreover, the ARIMA models generally underestimate the AQI values, which the hybrid models effectively correct. From these performance indicators, it can be concluded that the accuracy, correlation and trend estimation of the PARIMA model are greatly improved compared with those of the ARIMA model. The hybrid algorithm improves the generalization ability of the most suitable method for time series analysis because it considers the periodicity of the time series.

4.3.2. Results of the ANN and PANN

To verify the applicability of the period extraction algorithm, a hybrid model PANN based on ANN is constructed. For all series except SJZ, PANN performs markedly better than ANN in terms of MAE and RMSE. For the SJZ series, PANN shows only a slight improvement, which may be due to the risk of the ANN model converging to a local optimum. The improvement of the hybrid model in MAE varies with the data set, from 1.56% (SJZ) to 10.07% (TY). Compared with ANN, the improvement of PANN in RMSE ranges from 1.29% (SJZ) to 8.93% (BJ). Overall, this analysis shows that the hybrid model PANN effectively improves the application of ANN to the prediction of time series with period characteristics, and supports the suitability and usefulness of the period extraction algorithm based on the Luenberger observer.

4.3.3. Results of the SVM and PSVM

The results of SVM and its corresponding hybrid model PSVM are illustrated in Table 4. We can analyze the prediction performance of the models in different aspects by focusing on each performance index separately, and then boil down to the overall generalization ability. As far as MAE is concerned, the results of the hybrid model PSVM for the four AQI series are 5.1685, 4.6354, 3.9572, and 6.2472, which is lower than the 5.6431, 4.8751, 4.3461, and 6.2472 obtained by the traditional model respectively. The results of RMSE, representing the prediction accuracy, also agree that the hybrid model is better than the traditional method. For IA, the SVM has shown very good results, indicating that the predicted values have a strong correlation with the actual values, while the period-aware hybrid model can further enhance the forecast performance for AQI series. As for the DA index, the BJ series has achieved the best improvement of 11.93%, and the SJZ series with the most unsatisfactory performance also can increase by 6.40%. It can be said that the period incorporating in the form of vector components is an effective way to improve the generalization ability of machine learning models, and the operating speed of the forecast system will not be affected due to the very limited increase of data sets.
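The MAE values quoted in this subsection can be turned into relative reductions with a short calculation. The sketch assumes the four values are listed in the order BJ, TJ, TY, SJZ, as elsewhere in the paper; the dictionary and function names are illustrative.

```python
# MAE values as quoted in the text (order assumed to be BJ, TJ, TY, SJZ).
svm_mae = {"BJ": 5.6431, "TJ": 4.8751, "TY": 4.3461, "SJZ": 6.2472}
psvm_mae = {"BJ": 5.1685, "TJ": 4.6354, "TY": 3.9572, "SJZ": 6.2472}

def relative_reduction(baseline, hybrid):
    """Percentage by which the hybrid MAE is lower than the baseline MAE."""
    return 100.0 * (baseline - hybrid) / baseline

reductions = {city: relative_reduction(svm_mae[city], psvm_mae[city]) for city in svm_mae}
for city, pct in reductions.items():
    print(f"{city}: {pct:.2f}% lower MAE with PSVM")
```

As quoted, the SJZ values for SVM and PSVM are identical, so the computed reduction for that series is zero; Table 4 should be consulted for the exact figures.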

4.3.4. Comparison of Hybrid Models’ Results

The proposed period-aware hybrid models incorporating period information yield more accurate predictions than the traditional models. The forecasting errors of the six models used in this paper are represented by boxplots in Figure 5, where the line in each box represents the median of the samples, the bottom and top of the box represent the 25th and 75th percentiles respectively, and outliers, defined as points more than 1.5 times the interquartile range beyond the box, are shown as red plus signs. The prediction errors of the period-aware hybrid models are more concentrated around zero, and their ranges of variation are significantly smaller than those of the traditional models. In addition, the hybrid models produce far fewer error outliers than the comparison models, and the error boxplots of the PARIMA model contain the fewest outliers. It is particularly noteworthy that the PARIMA model shows the best performance on each dataset, as shown in Table 4. Taking the BJ series as an example, the MAE of PARIMA is 4.2157, which is significantly lower than the MAE values of 6.5492 and 5.1685 for the PANN and PSVM models. Similarly, the RMSE of PARIMA is the best among the hybrid models. The comparison of the accuracy indicators of the different models in Figure 6 agrees with these conclusions and also indicates that the PARIMA models have the best prediction accuracy. Moreover, PARIMA also has larger values of IA and DA, which indicates that the relationship between the observations and the corresponding predictions is stronger and that the prediction of the time series trend is more accurate. This can be attributed to the fact that ARIMA, the base model, has good generalization ability because it considers the effect of historical errors on the forecast target in the modeling process. 
Although the period-aware hybrid models have some advantages over the corresponding basic models, the optimal hybrid model is PARIMA based ARIMA model, whose comprehensive performance is significantly better than other hybrid models. The machine learning models ANN and SVM are better at dealing with the complex functional relationship of high-dimensional nonlinear data, but for testing the effectiveness of the period extraction algorithm, the relatively simple one-dimensional AQI series are selected as the experimental data, which to a certain extent affects the performance of machine learning models. It is undeniable that our method of extracting periodic information can also be easily applied to high-dimensional time series. The above experimental results fully prove that the proposed period extraction algorithm of time series is very effective and suitable for most traditional AQI series prediction models.
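The outlier rule used in the boxplots of Figure 5 (points more than 1.5 times the interquartile range beyond the quartiles) can be reproduced as follows. The function name and the sample errors are made up for illustration, and NumPy's default linear-interpolation percentile is assumed; plotting tools may use slightly different quartile conventions.

```python
import numpy as np

def count_boxplot_outliers(errors, whisker=1.5):
    """Count points lying more than `whisker` * IQR outside the 25th/75th percentiles."""
    errors = np.asarray(errors, dtype=float)
    q1, q3 = np.percentile(errors, [25, 75])
    iqr = q3 - q1
    outside = (errors < q1 - whisker * iqr) | (errors > q3 + whisker * iqr)
    return int(outside.sum())

sample_errors = [-2.0, -1.0, 0.0, 1.0, 2.0, 15.0]   # made-up forecast errors
n_outliers = count_boxplot_outliers(sample_errors)
```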

5. Conclusions

Time series forecasting is a practical issue in atmospheric environment research. Periodicity is a useful characteristic of AQI time series, and whether it is exploited effectively largely determines the generalization ability of a prediction system. In this study, a novel period extraction algorithm based on the Luenberger observer is proposed, and a period-aware hybrid prediction model is constructed by combining the period extraction algorithm with traditional time series models (ARIMA, ANN and SVM). The Luenberger observer is used to approximate the period function and thereby separate the period information from the original time series, which theoretically guarantees that the period extraction algorithm is not limited to stationary time series and needs only a small training sample. To assess the validity and robustness of the period-aware hybrid forecasting system objectively, four AQI series gathered from cities in North China are used as experimental data. The experimental results show that the period-aware hybrid model captures the period information underlying the original time series and achieves better prediction accuracy than the traditional models, which indicates better generalization ability.

Four conclusions can be drawn from the experimental results: (1) The period information extracted by the period extraction algorithm is expressed as a sequence, which makes it convenient to combine with traditional time series models. (2) The prediction accuracy of the hybrid models improves on that of every benchmark model they are built on, which supports the validity and applicability of the period extraction algorithm. (3) PARIMA performs best among the hybrid models, which means that the forecast accuracy depends strongly on the performance of the benchmark model itself. It also suggests that the proposed period extraction algorithm is better suited to time series models with a linear structure, possibly because the nonlinearity of the period extraction algorithm complements the linear characteristics of the ARIMA model. (4) The period-aware hybrid model is suitable not only for AQI time series prediction but also for other time series with implied period characteristics, such as PM_{2.5} concentration, power load and rainfall.

In this paper, only time series with a single period are studied, so how to extend the period extraction algorithm to multi-period time series deserves further study. In future work, we will focus on this problem to make the period extraction algorithm more practical.

Author Contributions

Conceptualization, G.Z.; Data curation, D.Y.; Methodology, P.W.; Software, H.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 61873153).

Conflicts of Interest

The authors declare no conflict of interest.

Figure 1. Geographical distribution of study cities and air monitoring sites.

Figure 2. Autocorrelation function analysis of the periods in the AQI series.

Figure 3. The $d_{m}$ series and the error between $d_{m}$ and the observed values for BJ, TJ, TY and SJZ.

Figure 4. Scatterplots of AQI observations and predictions using the ARIMA and PARIMA models.

Figure 5. Boxplots of forecasting error.

Figure 6. The accuracy index results of different models.

Table 1. Descriptive statistics results of AQI.

AQI Data	Mean	Median	Maximum	Minimum	Std. Dev.	Skewness	Kurtosis	Jarque-Bera	Probability
BJ	87.31048	80.00000	204.0000	23.00000	40.46884	0.652530	2.757676	54.61900	0.000000
TJ	78.71371	71.00000	153.0000	34.00000	24.59191	0.790826	2.786825	78.95905	0.000000
TY	89.19355	89.00000	202.0000	26.00000	32.47751	0.379377	3.170667	18.74958	0.000085
SJZ	97.21505	94.50000	202.0000	29.00000	34.07915	0.409457	2.908065	21.05127	0.000027
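The Jarque-Bera column of Table 1 follows from the skewness and kurtosis columns through the standard formula JB = n/6 · (S² + (K − 3)²/4). The sketch below checks this relationship. The sample size `N_ASSUMED = 744` is an inference made here (it reproduces the reported statistics) rather than a figure quoted in this section, and the function name is illustrative.

```python
def jarque_bera(n, skewness, kurtosis):
    """Standard Jarque-Bera statistic from sample size, skewness and kurtosis."""
    return n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)


# Assumed sample size per series; inferred from Table 1, not stated in this section.
N_ASSUMED = 744

# (skewness, kurtosis, reported Jarque-Bera) copied from Table 1.
table1 = {
    "BJ": (0.652530, 2.757676, 54.61900),
    "TJ": (0.790826, 2.786825, 78.95905),
    "TY": (0.379377, 3.170667, 18.74958),
    "SJZ": (0.409457, 2.908065, 21.05127),
}

for city, (s, k, reported) in table1.items():
    computed = jarque_bera(N_ASSUMED, s, k)
    print(f"{city}: computed {computed:.3f}, reported {reported:.3f}")
```

The near-zero probabilities in Table 1 mean that normality is rejected for all four series, mainly because of their positive skewness.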

Table 2. The period-aware hybrid model.

Require: the AQI time series $X = \{x_{1}, \dots, x_{i}, \dots, x_{t}\}$, where $x_{i} = AQI_{i}$ at time $i$.

Ensure: the forecasting value $x_{t + 1}$ at time $t + 1$.

1: Form the set $\{(x_{i}^{'}, y_{i})\}_{i = 1}^{t}$ from the AQI time series $X$, where $x_{i}^{'} = (x_{i - 1}, \dots, x_{i - p})$ and $y_{i} = x_{i}$. This gives the training data set for the time series forecasting model (ARIMA, ANN or SVM). For the ARIMA model, Akaike’s information criterion (AIC) is used to determine the value of $p$, the number of historical lags, and the remaining parameters $d$ and $q$ of the model. For the ANN and SVM models, the lags of the historical values are determined from the partial autocorrelation function (PACF).

2: Apply the period extraction algorithm based on the Luenberger observer to the time series $X$ to obtain the series $d_{m_{i}}$, which represents the period information.

3: Construct the new training data set $\{({\tilde{x}}_{i}, y_{i})\}_{i = 1}^{t}$, where ${\tilde{x}}_{i} = (x_{i - 1}, \dots, x_{i - p}, d_{m_{i - 1}})$ and $d_{m_{i - 1}}$ comes from the previous step, so that the period information is integrated into the training data.

4: Similarly, using the extracted period information, build the vector ${\tilde{x}}_{t + 1} = (x_{t}, \dots, x_{t + 1 - p}, d_{m_{t}})$, which represents the system input at time $t + 1$.

5: Train the time series forecasting system on the new training set $\{({\tilde{x}}_{i}, y_{i})\}_{i = 1}^{t}$, where the vector ${\tilde{x}}_{i}$ is the input and $y_{i}$ is the output of the system. The relevant parameters of the prediction system are optimized in accordance with the principle of risk minimization.

6: Input the feature vector ${\tilde{x}}_{t + 1}$ into the trained time series prediction model. The output of the prediction system is the forecasting target $y_{t + 1}$ at time $t + 1$.

Repeat the above steps to obtain the prediction results $\{y_{t + 2}, \dots, y_{t + N}\}$.
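The data set construction in steps 1, 3 and 4 of Table 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the series are 0-indexed Python lists, and the period series `d_m` is taken as given, since the Luenberger observer extraction of step 2 is not reproduced here. Samples are formed only from index `p` onward so that every lag exists.

```python
def build_period_aware_dataset(x, d_m, p):
    """Build inputs (x[i-1], ..., x[i-p], d_m[i-1]) and targets x[i].

    x   : observed series (e.g., hourly AQI values)
    d_m : period-information series of the same length (assumed precomputed)
    p   : number of historical lags
    """
    if len(x) != len(d_m):
        raise ValueError("x and d_m must have the same length")
    inputs, targets = [], []
    for i in range(p, len(x)):
        lags = [x[i - k] for k in range(1, p + 1)]   # x[i-1], ..., x[i-p]
        inputs.append(lags + [d_m[i - 1]])           # append period feature
        targets.append(x[i])
    return inputs, targets


def next_input(x, d_m, p):
    """Input vector for the one-step-ahead forecast after the last observation."""
    t = len(x)
    return [x[t - k] for k in range(1, p + 1)] + [d_m[t - 1]]


# Toy values for illustration only, not data from the paper.
x = [10, 20, 30, 40, 50]
d_m = [1, 2, 3, 4, 5]
inputs, targets = build_period_aware_dataset(x, d_m, p=2)
print(inputs, targets, next_input(x, d_m, p=2))
```

Any regressor that accepts fixed-length feature vectors can then be trained on `inputs` and `targets`, which is how the same period feature is shared by the ANN and SVM variants.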

Table 3. Unit root test results for AQI.

AQI Data	ADF			Phillips–Perron		
	Statistic	Prob.	Critical Value (1%)	Statistic	Prob.	Critical Value (1%)
BJ	−17.29189 *	0.0000	−3.438936	−17.28091 *	0.0000	−3.438936
TJ	−15.42580 *	0.0000	−3.438960	−15.41498 *	0.0000	−3.438936
TY	−15.83096 *	0.0000	−3.438948	−14.57153 *	0.0000	−3.438936
SJZ	−24.68088 *	0.0000	−3.438936	−24.56396 *	0.0000	−3.438948

* Significance at the 1% level.

Table 4. Forecast accuracy of the period-aware hybrid models for AQI.

Dataset	Index	ARIMA	PARIMA	ANN	PANN	SVM	PSVM
BJ	MAE	5.1105	4.2157	7.1994	6.5492	5.6431	5.1685
	RMSE	7.4021	6.2486	9.3029	8.4713	8.0574	7.3530
	IA	0.9857	0.9907	0.9751	0.9795	0.9824	0.9853
	DA	0.7110	0.7909	0.5856	0.6312	0.5741	0.6426
TJ	MAE	4.5954	3.7715	5.2852	4.8915	4.8751	4.6354
	RMSE	6.3338	5.2479	7.1120	6.5920	6.8099	6.3021
	IA	0.9843	0.9902	0.9797	0.9825	0.9807	0.9837
	DA	0.6654	0.8137	0.5513	0.6084	0.5741	0.6160
TY	MAE	3.7335	2.7575	9.4205	8.4716	4.3461	3.9572
	RMSE	6.1442	4.1807	11.0859	10.1187	6.5090	6.1880
	IA	0.9876	0.9946	0.9522	0.9602	0.9853	0.9867
	DA	0.7529	0.8175	0.5970	0.6312	0.5894	0.6274
SJZ	MAE	5.7934	4.4954	8.5069	8.3741	6.3383	6.2472
	RMSE	8.9845	6.1776	11.1195	10.9753	9.2624	9.1298
	IA	0.9790	0.9909	0.9611	0.9624	0.9757	0.9765
	DA	0.6274	0.7757	0.6122	0.6236	0.5932	0.6312
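The four indices reported in Table 4 can be computed as in the sketch below. The MAE and RMSE definitions are standard. The forms used for IA (Willmott's index of agreement) and DA (share of steps whose predicted direction of change matches the observed one) are common definitions assumed here; the paper's Section 3.4 gives the exact formulas and may differ in detail.

```python
from math import sqrt


def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)


def rmse(obs, pred):
    """Root mean square error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def index_of_agreement(obs, pred):
    """Willmott's index of agreement (assumed definition of IA)."""
    mean_obs = sum(obs) / len(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - mean_obs) + abs(o - mean_obs)) ** 2
              for o, p in zip(obs, pred))
    return 1.0 - num / den


def direction_accuracy(obs, pred):
    """Share of steps where predicted and observed changes have the same sign
    (assumed definition of DA; changes are measured from the last observation)."""
    hits = sum(
        1
        for t in range(1, len(obs))
        if (obs[t] - obs[t - 1]) * (pred[t] - obs[t - 1]) > 0
    )
    return hits / (len(obs) - 1)


# Toy values for illustration only, not data from the paper.
obs = [10, 12, 11, 15]
pred = [10, 11, 12, 14]
print(mae(obs, pred), rmse(obs, pred),
      index_of_agreement(obs, pred), direction_accuracy(obs, pred))
```

Lower MAE and RMSE and higher IA and DA indicate better forecasts, which is the direction in which every hybrid model moves relative to its basic model in Table 4.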

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
