Prediction of PM2.5 Concentration on the Basis of Multi-Time Scale Fusion

Abstract: Long-term prediction of the hourly concentration of PM2.5 (particles in atmospheric suspension with effective dimensions equal to or lower than 2.5 microns) is of great significance for environmental protection and people's health. At present, the prediction of hourly PM2.5 concentration is mostly single-step prediction, that is, predicting the PM2.5 concentration at a single future time point from a period of historical data. In this paper, a model based on multi-time scale fusion is proposed and studied for both single-step and multi-step prediction. Experimental results show that the proposed model is better than stacked LSTM and CNN-LSTM in predicting the hourly PM2.5 concentration.


Introduction
At present, traditional PM2.5 concentration prediction research mainly relies on four conventional approaches that draw on meteorology, environmental science, mathematics, and computational science: empirical model prediction based on historical data and statistical methods, probability model prediction based on statistical and mathematical methods or models, prediction based on synthetic methods, and prediction based on conventional machine learning models. With the rapid development of deep learning, Fan et al. used a recurrent neural network model to predict the PM2.5 concentration 1 h ahead based on air quality and meteorological data from the past 48 h [1]. Qi et al. proposed GCN-LSTM, a model based on a Graph Convolutional Network and Long Short-Term Memory, and showed that it was superior to CNN (Convolutional Neural Network) and LSTM in predicting the air quality one hour ahead [2]. He et al. combined the wavelet transform with the LSTM (Long Short-Term Memory) model, took the daily average concentration as the input to predict the pollutant concentration of the next day, and showed that the proposed model was superior to MLR (Mixed Logistic Regression), LSTM, and WT-MLR (Mixed Logistic Regression based on Wavelet Transform) [3]. Huang and Kuo constructed APNet (Attention-based Parallel Networks) and showed through experiments that it was superior to CNN and LSTM in predicting the PM2.5 concentration one hour ahead [4].
However, the above studies focus only on single-step prediction of PM2.5 concentration and do not predict the concentration over a future period, that is, they do not make multi-step predictions. To solve this problem, this article builds a multi-time scale fusion model that combines a CNN-LSTM network with an attention mechanism and integrates features of multiple time scales. The aim is to accurately predict the PM2.5 value for each hour of a continuous future period. Experiments verify the validity and superiority of the proposed method.

LSTM (Long Short-Term Memory)
LSTM is a variant of the RNN (Recurrent Neural Network) designed to solve the long-term dependence problem that a plain RNN cannot solve. It effectively alleviates the vanishing-gradient problem that an RNN cannot avoid and can therefore predict time series better [5]. The architecture of an LSTM memory cell is shown in Figure 1. Each cell has three "gate" structures: the input gate, the forget gate, and the output gate. A chain of repeating cells forms the LSTM layer. The calculation process of the spatiotemporal feature matrix $X = [x_1, x_2, \ldots, x_t]$ in the LSTM layer is given in Equations (1)-(6). Equation (1) represents the forget gate, which decides what information should be discarded from the cell state: $h_{t-1}$ and $x_t$ are input into the forget gate, and its output value $f_t$ is calculated through the sigmoid activation function. Equations (2) and (3) represent the input gate, which decides what new information should be stored in the cell state: $h_{t-1}$ and $x_t$ are input into the input gate, and $i_t$ and the candidate state $\tilde{c}_t$ are obtained through the sigmoid and tanh activation functions, respectively. Equation (4) uses the outputs of the forget gate and the input gate to update the current cell state. Equations (5) and (6) together constitute the output of the current cell: first, $h_{t-1}$ and $x_t$ are input into the output gate, and its output $o_t$ is calculated through the sigmoid activation function; then the current cell output $h_t$ is obtained from the output of the output gate and the state of the current cell.

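The following Equations (1)-(6) describe the internal calculation process of an LSTM neural unit:
$$f_t = \sigma(w_f \cdot [h_{t-1}, x_t] + b_f) \quad (1)$$
$$i_t = \sigma(w_i \cdot [h_{t-1}, x_t] + b_i) \quad (2)$$
$$\tilde{c}_t = \tanh(w_c \cdot [h_{t-1}, x_t] + b_c) \quad (3)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \quad (4)$$
$$o_t = \sigma(w_o \cdot [h_{t-1}, x_t] + b_o) \quad (5)$$
$$h_t = o_t \odot \tanh(c_t) \quad (6)$$
where $f_t$ is the output of the forget gate, with value range (0,1); $i_t$ is the output of the input gate, with value range (0,1); $\tilde{c}_t$ is the candidate cell state; $c_{t-1}$ is the state of the previous cell; $c_t$ is the state of the current cell; $o_t$ is the output of the output gate, with value range (0,1); $h_{t-1}$ is the output of the previous cell; $h_t$ is the output of the current cell; $w_f$, $w_i$, $w_c$, and $w_o$ are the weight matrices applied to the concatenation of $h_{t-1}$ and the input vector $x_t$ at time step t; $b_f$, $b_i$, $b_c$, and $b_o$ are the bias vectors; $\sigma$ is the sigmoid activation function; tanh is the hyperbolic tangent function; $\odot$ stands for element-wise multiplication of the matrix; ⊗ stands for multiplication; ⊕ stands for the sum operation.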
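The cell update described by Equations (1)-(6) can be sketched in plain Python as follows. This is only an illustration under the assumption of the standard LSTM formulation, in which every gate applies its weight matrix to the concatenation of the previous output and the current input. The function name `lstm_step` and the `params` dictionary layout are hypothetical and are not taken from the paper's implementation.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Return weights @ vec + bias for plain Python lists."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell update; x_t, h_prev and c_prev are lists of floats."""
    z = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(params["w_f"], params["b_f"], z)]      # Eq. (1)
    i_t = [sigmoid(a) for a in affine(params["w_i"], params["b_i"], z)]      # Eq. (2)
    c_hat = [math.tanh(a) for a in affine(params["w_c"], params["b_c"], z)]  # Eq. (3)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]     # Eq. (4)
    o_t = [sigmoid(a) for a in affine(params["w_o"], params["b_o"], z)]      # Eq. (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                       # Eq. (6)
    return h_t, c_t
```

With all weights and biases set to zero, every gate outputs 0.5 and the candidate state is 0, so the new cell state is simply half of the previous one, which is a convenient sanity check.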

Ensemble Empirical Mode Decomposition (EEMD)
As a noise-assisted signal decomposition method, EEMD adds white noise to the original signal, performs EMD (Empirical Mode Decomposition) on it, and finally takes the ensemble average of the results of multiple decompositions [6].
The specific operation steps are as follows:
(1) Set the number of ensemble trials M;
(2) Add white noise $n_i(t)$ with a standard normal distribution to the original signal $x(t)$ to generate a new signal:
$$x_i(t) = x(t) + n_i(t)$$
where $n_i(t)$ is the i-th additive white noise sequence; $x_i(t)$ is the noise-added signal of the i-th trial, i = 1, 2, 3, ..., M;
(3) Perform EMD on the noise-added signal $x_i(t)$ to express it as the sum of its IMFs (Intrinsic Mode Functions):
$$x_i(t) = \sum_{j=1}^{J} c_{i,j}(t) + r_i(t)$$
where $c_{i,j}(t)$ is the j-th IMF obtained by decomposition after adding white noise for the i-th time; $r_i(t)$ is the residual term, which represents the average trend of the signal; and J is the number of IMFs;
(4) Repeat steps (2) and (3) M times, adding a white noise signal with a different amplitude each time, to obtain the set of IMFs $c_{1,j}(t), c_{2,j}(t), \ldots, c_{M,j}(t)$, where j = 1, 2, 3, ..., J;
(5) Based on the principle that the statistical average of uncorrelated sequences is zero, average the above IMFs over the ensemble to obtain the final IMFs, namely:
$$c_j(t) = \frac{1}{M} \sum_{i=1}^{M} c_{i,j}(t)$$
where $c_j(t)$ is the j-th IMF, i = 1, 2, ..., M, j = 1, 2, ..., J.
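The ensemble-averaging logic of steps (1)-(5) can be sketched as below. The EMD routine itself is not implemented here: it is passed in as the `decompose` argument, and the sketch assumes that it returns the same number of components on every trial. The names `eemd_average` and `toy_decompose` are hypothetical, and `toy_decompose` is only a stand-in used to exercise the averaging, not a real EMD.

```python
import random


def eemd_average(signal, decompose, trials, noise_std=1.0, seed=0):
    """Average the components of `trials` noise-added decompositions."""
    rng = random.Random(seed)
    sums = None
    for _ in range(trials):
        # Step (2): add Gaussian white noise to the original signal.
        noisy = [x + rng.gauss(0.0, noise_std) for x in signal]
        # Step (3): decompose the noisy signal with the supplied routine.
        components = decompose(noisy)
        if sums is None:
            sums = [[0.0] * len(signal) for _ in components]
        for acc, comp in zip(sums, components):
            for k, value in enumerate(comp):
                acc[k] += value
    # Step (5): ensemble average of each component.
    return [[v / trials for v in acc] for acc in sums]


def toy_decompose(series):
    """Stand-in for EMD: a 3-point smoothed part and the remainder."""
    n = len(series)
    slow = [(series[max(k - 1, 0)] + series[k] + series[min(k + 1, n - 1)]) / 3.0
            for k in range(n)]
    fast = [s - m for s, m in zip(series, slow)]
    return [fast, slow]
```

Because the added noise has zero mean, its contribution to the averaged components shrinks as the number of trials grows, which is the property step (5) relies on.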

Attention Mechanism
The attention mechanism mimics the internal process of biological observation behavior [7]. Its principle is to use a set of weights $\alpha_{T_t,T_e}, \ldots, \alpha_{T_t,T_s}$ to express the relevance between a certain time slice $x_{T_t}$ of the target sequence and the dependent sequence $x_{T_{s-e}} = [x_{T_e}, x_{T_e+1}, \ldots, x_{T_s}]$. Each element in $x_{T_t}$ and $x_{T_{s-e}}$ has the same dimension $d_x$. $x_{T_t}$ and $x_{T_{s-e}}$ are mapped to the parameter space:
$$\mathrm{Query} = x_{T_t} W_Q, \quad \mathrm{Key} = x_{T_{s-e}} W_K, \quad \mathrm{Value} = x_{T_{s-e}} W_V$$
where $W_Q$ is the $d_x \times d_q$ dimensional Query parameter matrix; $W_K$ is the $d_x \times d_k$ dimensional Key parameter matrix; $W_V$ is the $d_x \times d_v$ dimensional Value parameter matrix.
The attention mechanism is divided into three stages. In the first stage, the target slice $x_{T_t}$ of dimension $d_x$ is mapped to a Query of dimension $d_q$, $x_{T_{s-e}}$ is similarly mapped to a Key matrix with element dimension $d_k$ and a Value matrix with element dimension $d_v$, and the similarity between Query and Key is calculated. In the second stage, the raw scores of the first stage are normalized by Softmax to obtain the weights $\alpha_{T_t,T_{s-e}}$ of Value. In the third stage, Value is weighted and summed according to the weight coefficients to obtain the attention value.
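The three stages can be illustrated with a short sketch. The text does not say which similarity function is used in the first stage, so the dot product is assumed here, which requires the Query and Key dimensions to be equal. The function names are hypothetical and the code works on plain Python lists rather than on Keras tensors.

```python
import math


def project(vec, matrix):
    """Row vector of length d_x times a matrix with d_x rows."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(target, sequence, w_q, w_k, w_v):
    """Return (attention value, weights) for one target time slice."""
    query = project(target, w_q)
    keys = [project(x, w_k) for x in sequence]
    values = [project(x, w_v) for x in sequence]
    # Stage 1: similarity between Query and each Key (dot product assumed).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Stage 2: normalize the raw scores into weights with softmax.
    weights = softmax(scores)
    # Stage 3: weighted sum of the Value vectors.
    d_v = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
    return context, weights
```

The returned weights sum to one, so the attention value is a convex combination of the Value vectors, with the slices most similar to the target contributing most.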

Multi Time Scale Fusion Model
In this paper, the multi-time scale fusion model is applied to the prediction of the hourly PM2.5 concentration for the first time, and the model process is shown in Figure 2. EEMD (Ensemble Empirical Mode Decomposition) decomposes the original PM2.5 sequence into new sequences with different time scales. CNN-LSTM is employed to extract the characteristic information of the time series. The attention layer uses the attention mechanism to focus on important features and ignore unimportant ones, which improves the prediction accuracy.

The specific steps are as follows:
(1) Input the original PM2.5 sequence into the EEMD model and perform EEMD decomposition on the original PM2.5 concentration data. This is the first improvement made by the model in this paper over the CNN-LSTM model. Compared with the original sequence, the decomposed sequences express the periodicity of the original sequence more precisely and capture information at different time scales better.
(2) The original PM2.5 sequence and the decomposed PM2.5 sequences are each input into a CNN-LSTM network composed of two Conv1d layers and one LSTM layer for feature extraction. A convolutional neural network has excellent feature extraction and feature expression capabilities, and LSTM has natural advantages in processing time series, so CNN and LSTM are used in combination for feature extraction in this paper. The decomposed sequences are recombined into new sequences according to their time scales and, together with the original sequence, are used as the inputs of different network layers.
(3) The outputs of the different LSTM layers pass through the attention mechanism layer to produce the prediction results. The attention mechanism is the second improvement over CNN-LSTM. Through it, the more important feature information among the features of different time scales receives more attention, which improves the prediction accuracy.
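The recombination of decomposed sequences by time scale in step (2) might look like the sketch below. It assumes that recombining means summing the IMFs that belong to the same scale, which the text does not state explicitly, and it uses the grouping reported in this paper's EEMD results (IMF1-IMF4 hour, IMF5-IMF9 day, IMF10-IMF12 month, IMF13-IMF14 year). The names `SCALE_GROUPS` and `recombine_by_scale` are hypothetical.

```python
# 1-based IMF indices per time scale, following the grouping reported
# in the EEMD decomposition results of this paper.
SCALE_GROUPS = {
    "hour": range(1, 5),
    "day": range(5, 10),
    "month": range(10, 13),
    "year": range(13, 15),
}


def recombine_by_scale(imfs, groups=SCALE_GROUPS):
    """Sum the IMFs of each group; imfs[0] is IMF1, all of equal length."""
    length = len(imfs[0])
    return {
        name: [sum(imfs[idx - 1][k] for idx in indices) for k in range(length)]
        for name, indices in groups.items()
    }
```

Each resulting sequence, together with the original PM2.5 sequence, could then feed its own CNN-LSTM input, as described in step (2).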

Experimental Configuration and Data Set Description
The experiments in this paper use the TensorFlow + Keras framework and the Python 3.7 development language on a Windows system, with multiple Python library functions for code implementation and result analysis.
The data in this paper are monitoring data from ground stations in Harbin, mainly including AQI, PM2.5, PM10, O3, and other data. The update frequency is one hour, and the time span is from May 2014 to April 2021. The PM2.5 series is shown in Figure 3.

Figure 3. Changes in PM2.5 concentration over time.

Data Pre-Processing
In this paper, data pre-processing includes data cleaning and data normalization. During data cleaning, redundant data are removed. When pollutant data are missing, this paper replaces them with the 8 h moving average. After this processing, the short-term missing values that remain are filled by simple linear interpolation between adjacent values, and missing segments that are too long are deleted.
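The linear-interpolation part of the cleaning procedure can be sketched as follows. The paper does not state how long a gap may be before it counts as too long, so the `max_gap` threshold is a hypothetical parameter, and missing values are assumed to be represented by `None`. The 8 h moving-average replacement is not covered by this sketch.

```python
def fill_short_gaps(values, max_gap=3):
    """Linearly interpolate runs of None no longer than max_gap.

    Longer runs, and runs touching either end of the series, are left
    as None so that the caller can delete them.
    """
    filled = list(values)
    n = len(filled)
    k = 0
    while k < n:
        if filled[k] is not None:
            k += 1
            continue
        start = k
        while k < n and filled[k] is None:
            k += 1
        end = k  # first index after the gap
        gap = end - start
        if start > 0 and end < n and gap <= max_gap:
            left, right = filled[start - 1], filled[end]
            for j in range(start, end):
                filled[j] = left + (right - left) * (j - start + 1) / (gap + 1)
    return filled
```

For example, a two-hour gap between readings of 10 and 40 is filled with values close to 20 and 30, while a gap longer than `max_gap` stays marked as missing.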
Min-max normalization is used in this paper, as follows:
$$f' = \frac{f - f_{\min}}{f_{\max} - f_{\min}}$$
where $f$ is a sample value; $f'$ is its normalized value; $f_{\max}$ is the maximum value of the sample data; $f_{\min}$ is the minimum value of the sample data.
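A minimal sketch of min-max normalization and its inverse is given below. The function names are hypothetical, and mapping a constant series to zeros is a choice made here to avoid division by zero, not something the paper specifies.

```python
def min_max_normalize(values):
    """Scale values to [0, 1]; also return f_min and f_max for later use."""
    f_min, f_max = min(values), max(values)
    span = f_max - f_min
    if span == 0:
        # Constant series: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values], f_min, f_max
    return [(v - f_min) / span for v in values], f_min, f_max


def min_max_restore(scaled, f_min, f_max):
    """Map normalized values back to the original concentration scale."""
    return [s * (f_max - f_min) + f_min for s in scaled]
```

Keeping `f_min` and `f_max` from the training data makes it possible to convert model outputs back to concentration units before computing error metrics.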

EEMD Decomposition of PM2.5 Concentration
In this paper, the pre-processed time series of PM2.5 values is decomposed into 14 IMF series and one trend item, as shown in Figure 4.
For the period calculation of the IMF components, this paper uses the average period as the period of each component. The calculation results are shown in Table 1 below. According to the results, IMF1-IMF4 are at the hour scale, IMF5-IMF9 at the day scale, IMF10-IMF12 at the month scale, and IMF13-IMF14 at the year scale.
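The paper does not give the formula used for the average period. One common definition, assumed in the sketch below, divides the length of the series by the number of local maxima of the component. The function name `mean_period` and the `sample_interval_hours` parameter are hypothetical.

```python
def mean_period(component, sample_interval_hours=1.0):
    """Average period in hours: series length divided by number of peaks."""
    peaks = sum(
        1
        for k in range(1, len(component) - 1)
        if component[k - 1] < component[k] >= component[k + 1]
    )
    if peaks == 0:
        # No oscillation detected, for example in a monotonic trend item.
        return float("inf")
    return len(component) * sample_interval_hours / peaks
```

Under this definition, an hourly component that completes one oscillation per day has an average period of about 24 h and would fall into the day scale.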

Evaluation Index
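The following indicators are selected as the evaluation criteria in this paper:
(1) RMSE (Root Mean Square Error)
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{m=1}^{n} (y_m - y_m')^2}$$
where $y_m$ is the true value in the test set; $y_m'$ is the predicted value; n is the number of samples.
(2) MAE (Mean Absolute Error)
$$\mathrm{MAE} = \frac{1}{n} \sum_{m=1}^{n} |\hat{Y}_m - Y_m|$$
where $\hat{Y}$ is the predicted result; $Y$ is the true value.
(3) $R^2_{adj}$ (Adjusted R-Square)
$$R^2 = 1 - \frac{\sum_{m=1}^{n} (y_m - \hat{y}_m)^2}{\sum_{m=1}^{n} (y_m - \bar{y})^2}$$
$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where $y_m$ is the true value in the test set; $\hat{y}_m$ is the predicted value; $\bar{y}$ is the average of the true values in the test set; $R^2$ is R-Square; n is the number of samples; p is the number of features. $R^2_{adj}$ corrects $R^2$ for the number of samples and features. Its value normally lies between zero and one, and the larger the value of $R^2_{adj}$, the better the performance of the model.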
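The three indicators can be computed as in the sketch below, which assumes the standard definitions of RMSE, MAE, and adjusted R-Square. The function names are hypothetical, and in practice a library implementation could be used instead.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def adjusted_r2(y_true, y_pred, num_features):
    """Adjusted R-Square for n samples and num_features features."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - num_features - 1)
```

RMSE penalizes large errors more heavily than MAE, which is why the two are usually reported together when peak concentrations matter.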

Impact of Historical Time Windows on Model Performance
PM2.5 data are affected by a variety of related time series, but a change in any of these series does not immediately affect the PM2.5 concentration. In other words, the variable values at previous moments have a lag effect on the PM2.5 concentration at the next moment, and this effect may be strong in the short term and weak in the long term [8]. A smaller window size cannot guarantee sufficient long-term memory input for the LSTM model, while a larger window size increases the input of irrelevant information and the unnecessary computational complexity of the model [9]. To determine the appropriate historical time window, the window in this study starts from 12 h and is increased in steps of 12 h. The prediction target is the PM2.5 concentration 1 h ahead. The results are shown in Table 2 below. When the historical time window is 36 h, the RMSE, MAE, and R2 of the model in this paper are 9.66, 6.95, and 0.95, respectively, which are the best. For the LSTM model, the best RMSE, 14.0, is obtained with a 24 h window, while with a 36 h window the MAE is 7.63 and R2 is 0.89. For the CNN-LSTM model, with a 24 h window the RMSE is 13.66, the MAE is 9.88, and R2 is 0.91. The model in this paper is superior to the comparison models on every indicator: its RMSE is 31% lower than that of LSTM and 25% lower than that of CNN-LSTM, its MAE is 24% lower than LSTM and 22% lower than CNN-LSTM, and its R2 is 5% higher than LSTM and 3% higher than CNN-LSTM.
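How a historical time window turns an hourly series into training samples can be sketched as follows. The sketch assumes that a multi-step target is the vector of the next `horizon` hourly values, which is one common way to set up multi-step prediction and may differ from the paper's exact scheme. The function name `make_windows` is hypothetical.

```python
def make_windows(series, window, horizon=1):
    """Split an hourly series into (input window, target) pairs.

    window  - number of past hours fed to the model (e.g. 12, 24, 36)
    horizon - number of future hours to predict (1 for single-step)
    """
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        inputs.append(series[start:start + window])
        targets.append(series[start + window:start + window + horizon])
    return inputs, targets
```

With `window=36` and `horizon=1` this corresponds to the single-step setting that gave the best results for the proposed model, and a larger `horizon` gives a multi-step target.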

Performance Comparison of Multi-Step Prediction
In order to test the multi-step prediction performance of the model in this paper for the hourly PM2.5 concentration, experiments were carried out with the three models for prediction steps of 1 h, 4 h, 8 h, 12 h, and 24 h, and the results are shown in Table 3. It can be seen from Table 3 that:
(1) Each model achieves its best results when the prediction step size is one hour, and the evaluation indexes of the model proposed in this paper are better.
(2) As the prediction step size increases, the prediction accuracy decreases, but the evaluation indexes of the proposed model are superior to those of LSTM and CNN-LSTM at every prediction time scale. This indicates that the proposed model is effective in improving long-term prediction accuracy.
To display the forecast results intuitively, the forecast data from 26 February 2021 to 18 March 2021 are selected, as shown in Figures 5-9 below. Blue represents the real data, yellow the predicted value of the LSTM model, green the predicted value of the CNN-LSTM model, and red the predicted value of the model in this article. It can be seen from Figures 5 and 6 that when the prediction step length is short, the predictions and predicted trends of the other two models are already well consistent with the real data, but the model proposed in this article achieves better results. The proposed model is also superior to the other two models in peak prediction. It can be seen from Figures 7-9 that as the prediction duration increases, the accuracy of the peak prediction and of the predicted future trend decreases for every model. When the prediction time step is 24 h, the predicted trends of LSTM and CNN-LSTM start to run opposite to the real data, as shown by the predicted values between 400 h and 450 h in Figure 9, whereas the predictions and trends of the model in this article agree better with the real data. Therefore, the model in this article can better support the long-term forecasting of PM2.5.

Conclusions
The prediction of PM2.5 concentration is of great significance for people's daily life and environmental governance. Because feature information at different time scales influences the prediction results differently, a multi-time scale fusion model is proposed in this paper. The experimental results show that the proposed model is superior to the comparison models in both single-step and multi-step prediction, indicating that multi-time scale fusion is effective for long-term prediction. However, only the data of one site are used for the experiments in this paper, so the amount of data is small and the influence between sites is not taken into account. In the future, PM2.5 at adjacent stations will be studied and analyzed, and the prediction accuracy will be improved by studying the spatial correlation between stations.

Data Availability Statement:
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.
