Prediction of Air Pollution Concentration Based on mRMR and Echo State Network

Abstract: Air pollution has become a global environmental problem because of its great adverse impact on human health and the climate. One way to address this problem is to monitor and predict the air quality index in an economical way. Accurate monitoring and prediction of air quality indicators, e.g., PM 2.5 concentration, is a challenging task. In order to accurately predict PM 2.5 time series, we propose a supplementary leaky integrator echo state network (SLI-ESN) in this paper. It adds a historical state term to the calculation of the leaky integrator reservoir, which strengthens the influence of the historical evolution of the state on the current state. Considering the redundancy and correlation between multivariable time series, the minimum redundancy maximum relevance (mRMR) feature selection method is introduced to reduce redundant and irrelevant information and increase computation speed. A variety of evaluation indicators are used to assess the overall performance of the proposed method. The effectiveness of the proposed model is verified by an experiment on Beijing PM 2.5 time series prediction. A comparison of learning times also shows the efficiency of the algorithm.


Introduction
With the rapid advancement of urbanization and industrialization, air quality has deteriorated severely, which has negatively affected the quality of the living environment and even hindered economic growth in some areas [1]. In particular, the inhalable particles produced by industrial pollution have small particle size, large diffusion area and strong activity; thus, they can enter the human body through the respiratory tract, with adverse effects on human health. The prediction of air pollutants therefore plays a crucial role in the early warning and control of environmental pollution [2]. Consequently, modeling and forecasting the air quality index (e.g., PM 2.5 concentration) has become an effective way to prevent and control air pollution, and it also provides a scientific basis for the development of effective measures [3]. The implementation of this idea can effectively reduce the health hazard of air pollution, thus achieving early warning and rational planning [4].
For a long time, many scholars have conducted in-depth research on air pollution. At the same time, a variety of predictive models have been proposed, such as autoregressive integrated moving average model [5], support vector machine [5], multiple linear regression model [6], neural networks [7,8], and so on [9]. All of them have been applied to predict air pollution concentration. Oprea et al. [10] applied artificial neural network and adaptive neuro-fuzzy inference system to predict PM 2.5 concentration. Deng et al. [11] proposed heterogeneous space-time artificial neural networks to deal with spatial heterogeneity, which has been applied to predict the concentration of

Feature Selection Method
Time series that affect air pollution are high-dimensional data, which not only contain rich information, but also include irrelevant or redundant factors. These irrelevant and redundant factors reduce the prediction accuracy and efficiency of the model. Thus, analyzing the relationships between variables and selecting valuable input variables are important for prediction.
Feature selection is the most typical data preprocessing method [21]. It consists of four parts: generation process, evaluation function, stop criterion and verification process. Currently, common feature selection algorithms include random forest (RF), correlation feature selection (CFS), the fast correlation-based filter (FCBF), mutual information (MI) [22], information gain (IG) [23], regularization models, relief-based algorithms, and genetic algorithms. However, these feature selection algorithms usually ignore the redundancy relationships between features. Random forests tend to over-fit on noisy regression problems and do not give continuous output. The CFS and FCBF feature selection methods are slow to compute and cannot handle large-scale data efficiently. MI and IG make no assumption about the type of data distribution but have high computational complexity for high-dimensional data. The regularization model selects features well on high-dimensional data whose dimensionality is much larger than the number of samples, but it does not perform satisfactorily on low-dimensional data. For the prediction of PM 2.5 concentration with high noise, selecting an appropriate feature selection method is very important. In this paper, minimum redundancy maximum relevance (mRMR) [20] is used for feature selection. After this, a set of the purest features can be obtained: redundant features are removed while the maximum relevance is guaranteed.

Echo State Network
The echo state network (ESN) proposed by Jaeger et al. [24] is a new type of recurrent neural network. It consists of an input layer, a reservoir and an output layer. The structure of ESN is shown in Figure 1. W_in is the input weight matrix that connects the inputs and the reservoir, W is the internal connection weight matrix of the reservoir, W_out is the output weight matrix that connects the reservoir and the output, and W_back is the output-to-reservoir feedback connection weight matrix, which is usually set as a zero vector.

Assume that the system receives a time series U(n) = [u_1(n), u_2(n), ..., u_K(n)]^T at time n, which is the input of the reservoir at that time. The target output of the network is y(n) = U(n + λ), and prediction over different time steps is realized by adjusting λ. The reservoir receives two inputs: U(n) from the input layer, and x(n − 1), the previous state of the reservoir. The matrices W and W_in are randomly generated and remain unchanged during training. The weight matrix W is a large-scale sparse matrix, in which the non-zero elements indicate activated neurons in the reservoir. The update formula of the current state of the reservoir is as follows:

x(n) = tanh(W_in U(n) + W x(n − 1)),

where x(n) ∈ R^(L×1) is the state of the reservoir at time n, and its initial state is a zero vector. The operator tanh(·) is the hyperbolic tangent activation function. As time grows without bound, the dependence of the current state on the initial state of the reservoir gradually decreases or even disappears [25]. The information in the reservoir is transferred to the output layer through a linear connection, and the output of the network is:

y(n) = W_out x(n).

Afterwards, Jaeger et al. [26] improved the ESN and proposed the leaky integrator ESN (LI-ESN) by introducing leaky integrator neurons into the reservoir. LI-ESN is a variant of ESN whose reservoir contains leaky integrator neurons. The update formula of the state of the reservoir in LI-ESN is:

x(n) = (1 − a) x(n − 1) + a tanh(W_in U(n) + W x(n − 1)),

where a ∈ [0, 1] is the leaking rate. A leaky neuron in the reservoir applies a leaky integration to its activation, partially remembers its previous activation, and retains the effect of the previous moment on the current state.

PM 2.5 Time Series Prediction Model
In order to accurately predict the concentration of PM 2.5 , this paper proposes a PM 2.5 time series prediction model. To handle the high-dimensional input variables, the mRMR method is utilized to select an optimal subset. Then, the phase space is reconstructed from the subset to obtain the evolution information of the time series. Finally, we predict the PM 2.5 time series with the supplementary leaky integrator echo state network (SLI-ESN) model proposed in this paper.

mRMR Feature Selection Method
In this paper, the mRMR method [20] is used for feature selection. Its basic idea is to maximize the correlation between the input features and the output, and to minimize the correlation among the input features, with the goal of finding the most representative feature subset. The correlation between features is evaluated by the mutual information method, and an appropriate feature subset is selected from the original feature set. The maximum relevance and minimum redundancy criteria of mRMR are calculated as follows:

max D(S, C),  D = (1/|S|) Σ_{f_i ∈ S} I(f_i; C),

min R(S),  R = (1/|S|²) Σ_{f_i, f_k ∈ S} I(f_i; f_k),

where f_i and f_k represent the ith and kth feature in the set S, respectively, and C denotes the target output. I(f_i; C) denotes the correlation function of the ith feature and the output C, and I(f_i; f_k) represents the correlation function of the ith and kth features. Mutual information is defined as follows:

I(x; y) = ∬ p(x, y) log( p(x, y) / (p(x) p(y)) ) dx dy,

where p(x), p(y) and p(x, y) are the probability densities. Therefore, the expression of mRMR is as follows:

max Φ(D, R),  Φ = D − R.   (7)

Applying the forward selection algorithm to solve the objective function (7), we obtain a ranking of the input features. Finally, the optimal input feature subset is obtained by cross-validation.

Phase Space Reconstruction
Phase space reconstruction is a very important step in time series analysis and prediction. In order to extract useful information from a time series, Takens [27] proposed the embedding theorem of phase space reconstruction. In actual calculation, since numerical differentiation is sensitive to error, the phase space reconstruction of a time series generally adopts the coordinate delay method. Its essence is to construct an m-dimensional vector from different delays of a one-dimensional time series. The calculation formula is as follows:

X_i(n) = [u_i(n), u_i(n − τ_i), ..., u_i(n − (m_i − 1)τ_i)],

where u_i(n), i = 1, 2, ..., K are the input time series, m_i is the embedding dimension, and τ_i is the delay time. The reconstructed phase space contains all the evolution information of the original system. The embedding theorem shows that for an infinitely long, noise-free one-dimensional time series of an m-dimensional chaotic attractor, a suitable phase space can be found as long as the embedding dimension satisfies m ≥ 2d + 1, where d is the dimension of the dynamic system [28]. However, real time series are finite-length sequences with noise. The embedding dimension and delay time cannot be selected arbitrarily, otherwise the quality of the reconstructed phase space will be seriously affected.
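The coordinate delay construction can be sketched directly. Each row of the result is one delay vector [u(n), u(n − τ), ..., u(n − (m − 1)τ)]:

```python
import numpy as np

def delay_embed(u, m, tau):
    """Coordinate-delay reconstruction of a 1-D series u into
    m-dimensional vectors [u(n), u(n - tau), ..., u(n - (m-1)*tau)]."""
    start = (m - 1) * tau                    # first index with a full history
    return np.column_stack([u[start - j * tau : len(u) - j * tau]
                            for j in range(m)])

# Example: m = 2, tau = 3 on a toy series.
E = delay_embed(np.arange(10.0), m=2, tau=3)   # shape (7, 2)
```

Applying this per variable and concatenating the columns explains the input dimensionality used later: with embedding dimensions [2, 2, 2, 4, 4] for the five selected variables, the reconstructed input has 2 + 2 + 2 + 4 + 4 = 14 dimensions.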

Supplementary Leaky Integrator Echo State Network
Echo state networks, including the model proposed in this paper, have the ability to memorize historical information. However, due to the unpredictability of chaotic time series, it is difficult to accurately retain information from long before the current state of the reservoir. Moreover, due to the complexity of the PM 2.5 time series, its current concentration may depend more strongly on the most recent states. Inspired by the leaky integrator reservoir, in which the update of the reservoir state depends strongly on the historical state, this paper improves the state update formula of the reservoir neurons by considering the impact of previous states on the current state, and proposes the SLI-ESN model. The state update formula of the reservoir, for n ≥ 2, is as follows:

x(n) = (1 − a − b) x(n − 1) + b x(n − 2) + a tanh(W_in U(n) + W x(n − 1)).

As in LI-ESN, the meaning of the attenuation parameter a is unchanged, and it satisfies 0 < a < 1. The parameter b is the supplement factor, and it also needs to satisfy 0 < b < 1. More importantly, like the side lengths of a triangle, each coefficient must be greater than 0, so a + b < 1 must hold. Because the current state of the reservoir depends more on the states of neighboring moments, this paper considers the influence of the first two historical moments; when n < 2, only the supplement term is dropped and the x(n − 1) coefficient reverts to (1 − a), as in LI-ESN. Setting W_back = 0, the output layer is calculated as follows:

y(n) = W_out x(n).

The mapping from the reservoir to the output layer is solved by ridge regression [29], a biased estimation regression method for collinear data analysis. It overcomes the ill-posed problem of the least squares solution and prevents over-fitting. The expression is as follows:

W_out = Y X^T (X X^T + k I)^(−1),

where k is the regularization parameter, I is the identity matrix, and x(n) and y(n) are the column vectors of X and Y, respectively.
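A minimal sketch of this reservoir update and the ridge-regression readout is given below. The update rule is our reconstruction of the description in the text (a weighting tanh, b weighting the supplementary x(n − 2) term, coefficients summing to one), and the sizes, scalings and parameter values are illustrative assumptions; states are stored as rows here, so the readout formula appears in transposed form:

```python
import numpy as np

rng = np.random.default_rng(0)
K, L = 14, 200                               # 14 reconstructed inputs; L is illustrative
W_in = rng.uniform(-0.1, 0.1, (L, K))
W = rng.uniform(-0.5, 0.5, (L, L))
W[rng.random((L, L)) > 0.05] = 0.0           # sparse internal weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))

def sli_esn_states(U, a=0.6, b=0.05):
    """Run the (reconstructed) SLI-ESN update over inputs U of shape (T, K):
    for n >= 2: x(n) = (1-a-b) x(n-1) + b x(n-2) + a tanh(W_in U(n) + W x(n-1)),
    falling back to the LI-ESN form before two past states exist.
    Requires a, b > 0 and a + b < 1."""
    T = U.shape[0]
    X = np.zeros((T, L))
    x_prev2 = x_prev = np.zeros(L)
    for n in range(T):
        z = np.tanh(W_in @ U[n] + W @ x_prev)
        if n >= 2:
            x = (1 - a - b) * x_prev + b * x_prev2 + a * z
        else:
            x = (1 - a) * x_prev + a * z     # LI-ESN form for n < 2
        x_prev2, x_prev = x_prev, x
        X[n] = x
    return X

def train_readout(X, Y, k=1e-4):
    """Ridge-regression readout; with states as rows this is
    W_out = Y^T X (X^T X + k I)^{-1}."""
    return Y.T @ X @ np.linalg.inv(X.T @ X + k * np.eye(X.shape[1]))
```

Since the three coefficients are positive and sum to one, each state is a convex combination of bounded terms, so the reservoir state remains in [−1, 1].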

Algorithm Flow
In order to accurately predict the PM 2.5 time series, this paper proposes the prediction structure based on feature extraction and improved echo state network model. The process is summarized as follows.

1. Feature selection: select the optimal subset from the original dataset based on the mRMR feature selection method.
2. Phase space reconstruction: reconstruct the phase space of the selected optimal subset based on Takens' theorem and form a new set of input features.
3. Data division: divide the data into training and testing sets according to a certain proportion.
4. Model training: train the SLI-ESN model using the ridge regression algorithm on the training set.
5. Prediction: predict the PM 2.5 time series using the SLI-ESN model on the testing set.
In order to briefly explain the prediction process of the proposed model, a schematic diagram of the method is shown in Figure 2.


Results and Discussion
In order to prove the validity and practicability of the proposed method, the SLI-ESN model was applied to predict the actually observed PM 2.5 time series of Beijing. At the same time, comparative experiments were carried out with ESN, LI-ESN [26], the extreme learning machine (ELM) [30], hierarchical ELM (H-ELM) [31] and the stacked auto-encoder (SAE) [32]. The experimental environment is a Windows 7 system running MATLAB 2016a, with an Intel i3 CPU clocked at 3.50 GHz and 6 GB of memory.
Firstly, the AQI dataset and evaluation indicators are explained. Secondly, the feature selection experiment based on mRMR is introduced in detail, and the optimal subset is obtained. Finally, the reconstructed features of the optimal subset are used for training the SLI-ESN model, and the validity of the proposed method in PM 2.5 time series prediction is verified by the predictive indicators.

Data Description
This paper selects 8759 samples of hourly air pollution data from January to December 2016 in Haidian District, Beijing. The dataset is from the US Embassy (Harvard University Geographic Analysis Center Dataverse), including the average concentration of PM 2.5 , PM 10 , NO 2 , CO, O 3 , and SO 2 per hour, and hourly temperature (T), pressure (P), humidity (H), wind speed (WS) and wind direction (WD).
Because PM 2.5 is particulate matter that can enter the lungs, it has a great impact on human health and the quality of the atmospheric environment. Therefore, predicting PM 2.5 concentration not only provides an estimate of environmental quality, but also contributes to better environmental governance and human health. This paper uses five predictive indicators to evaluate the prediction results of the PM 2.5 time series, namely the root mean square error (RMSE), normalized root mean square error (NRMSE), mean absolute error (MAE), symmetric mean absolute percentage error (SMAPE) and Pearson correlation coefficient (R).
In the definitions of these indicators, N is the number of samples, ŷ(t) is the predicted output, and y(t) is the target value.
Among the above evaluation indicators, the smaller the values of RMSE, NRMSE, MAE, and SMAPE are, the better the prediction results of the model are. R = 1 indicates that ŷ(t) and y(t) are linearly correlated, and R = 0 indicates there is no correlation. When R ∈ (0, 1), there is a correlation, and the larger the value is, the stronger the linear correlation is.
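The five indicators can be computed as follows in their common forms. Note that the normalization used for NRMSE (here the range of the observations) and the percentage scaling of SMAPE are conventions we assume, since the text does not state them:

```python
import numpy as np

def metrics(y_true, y_pred):
    """Common forms of the five evaluation indicators. NRMSE is
    normalized by the range of the observations (an assumption)."""
    err = y_pred - y_true
    rmse = np.sqrt(np.mean(err ** 2))
    nrmse = rmse / (y_true.max() - y_true.min())
    mae = np.mean(np.abs(err))
    # SMAPE in percent, symmetric in y_true and y_pred.
    smape = np.mean(2.0 * np.abs(err) / (np.abs(y_true) + np.abs(y_pred))) * 100
    r = np.corrcoef(y_true, y_pred)[0, 1]    # Pearson correlation coefficient
    return {"RMSE": rmse, "NRMSE": nrmse, "MAE": mae, "SMAPE": smape, "R": r}
```

A perfect prediction gives zero for the four error measures and R = 1, matching the interpretation above.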

Data Processing
The optimal subset selection of the original data was performed using mRMR. In this paper, the first 75% of the dataset is used as the training set, and the last 25% as the testing set. The optimal subset is selected based on the training set, and every model parameter in the simulation is obtained only from the training set.
Firstly, the PM 2.5 time series is chosen as the reference variable, and the other 10 variables are used as comparison variables. The data were quantitatively analyzed using mRMR to obtain the ranking result: PM 2.5 , CO, WS, PM 10 , H, WD, SO 2 , NO 2 , P, T, and O 3 . According to the influence of the different factors on PM 2.5 concentration, the correlation between the different variables and PM 2.5 is shown in Figure 3.
According to the mRMR method, the irrelevant and redundant variables in the original dataset are reduced. The selection result of the obtained optimal subset is shown in Figure 4.

According to the prediction results in Figure 4, when the predictor has 5 dimensions, the prediction error is the smallest, namely 9.199. Therefore, the optimal subset is PM 2.5 , CO, WS, PM 10 , and H. Phase space reconstruction is then performed on the selected optimal subset.
The delay time τ and embedding dimension m calculated by the C-C method [33] are shown in Table 1, where the 5 variables in the optimal subset are shown in bold. As shown in Table 1, the delay times for the obtained optimal subset are [8, 8, 6, 4, 4], and the embedding dimensions are [2, 2, 2, 4, 4] for PM 2.5 , PM 10 , CO, H and WS, respectively.

Experimental Results and Analysis
Phase space reconstruction is performed on the optimal subset to obtain a 14-dimensional reconstructed time series, which is used as the input of the prediction model. For the LI-ESN model, the reasonable range of the leaking rate a is (0, 1). For the SLI-ESN model, we use cross-validation to select the two parameters a and b. According to experience, the feasible and effective range of the supplement factor b is mainly within (0, 0.1). In this paper, ESN, LI-ESN, ELM, H-ELM and SAE are selected as comparison methods. The specific one-step (1 hour) prediction results are shown in Table 2.
It can be seen from Table 2 that the proposed method achieves better prediction results in one-step (1 hour) prediction. The one-step (1 hour) prediction result of PM 2.5 concentration is shown in Figure 5, and Figure 6 plots the fit of the predicted values to the actual data. It can be seen from the figures that the predictions have a good linear relationship with the actual values. SLI-ESN performs satisfactorily at peaks and undulating moments, which mainly depends on the full utilization of the historical state of the reservoir and the effective information obtained by phase space reconstruction.
At the same time, the simulation results of the five-step (5 hours) prediction are given in Figure 7. The prediction curve can track the original input well, and the medium-term prediction effect is also good. Table 3 gives the five-step (5 hours) prediction results. The ten-step (10 hours) prediction result of SLI-ESN is shown in Figure 8.
As seen in Figure 8, in some of the peaks, the prediction curve is still able to roughly fit the fluctuation trend of the original data, which is precisely because SLI-ESN makes full use of the role of historical information. Its comparison with other algorithms is shown in Table 4. The proposed algorithm achieves the optimal value on all four parameter indicators, except for SMAPE, which fully demonstrates the effectiveness of SLI-ESN in long-term prediction.

A longitudinal comparison across Tables 2-4 shows that the longer the prediction time is, the larger the error will be, which is consistent with the basic characteristics of chaotic time series. Comparing Tables 2-4, it can also be found that, among the three kinds of networks (SAE as the representative of deep learning, ELM and H-ELM as representatives of feedforward neural networks, and ESN as the representative of recurrent neural networks), the recurrent neural networks have the most satisfactory prediction performance. This demonstrates the validity of the reservoir structure in time series prediction.

In order to further illustrate the performance of the proposed method, the running times of all the comparison methods are shown in Table 5. The results show that the training times of the ESN-based and ELM-based models are much smaller than those of the deep learning models. This is mainly because the training process of deep learning consumes a lot of time, a time-consuming step that the other neural networks do not require. Moreover, SLI-ESN can complete training and testing within an acceptable time frame. This indicates that SLI-ESN achieves good results in both prediction accuracy and time consumption.

Table 5. Comparison of the running time of the six methods (seconds).

Conclusions
PM 2.5 is the main source of air pollution, and predicting PM 2.5 concentration is of great significance for protecting the environment. In order to improve prediction accuracy and reliability, it is important to preprocess the data to eliminate irrelevant and redundant variables before prediction. In this paper, mRMR is used to screen the original dataset to obtain the optimal subset, phase space reconstruction is performed on the optimal subset, and the reconstructed data are used as the new input time series of the SLI-ESN model for prediction. Experiments show the validity of the SLI-ESN model, which has high prediction accuracy in medium- and long-term prediction, good generalization performance and good application prospects.
Although this paper has achieved the desired results, there are still some issues to be addressed in future work. First of all, long-term predictions are not yet satisfactory; we want to extend the model to longer prediction intervals, such as one day, one week or one month. In addition, optimal subset selection and model optimization take a lot of time. In the future, we expect to simultaneously implement input variable selection and model optimization based on an optimization algorithm. Optimization objects include, but are not limited to, the input variables, model structure, and model parameters.
Funding: This research was funded by the National Natural Science Foundation of China (61773087).

Conflicts of Interest:
The authors declare no conflict of interest.