A Simple Dendritic Neural Network Model-Based Approach for Daily PM 2.5 Concentration Prediction

Abstract: Air pollution in cities has a massive impact on human health, and an increase in fine particulate matter (PM 2.5 ) concentrations is the main reason for air pollution. Due to the chaotic and intrinsic complexities of PM 2.5 concentration time series, it is difficult to utilize traditional approaches to extract useful information from these data. Therefore, a neural model with a dendritic mechanism trained via the states of matter search algorithm (SDNN) is employed to conduct daily PM 2.5 concentration forecasting. Primarily, the time delay and embedding dimension are calculated via the mutual information-based method and the false nearest neighbours approach, respectively. Then, phase space reconstruction is performed to map the PM 2.5 concentration time series into a high-dimensional space based on the obtained time delay and embedding dimension. Finally, the SDNN is employed to forecast the PM 2.5 concentration. The effectiveness of this approach is verified through extensive experimental evaluations conducted on six real-world datasets collected in recent years. To the best of our knowledge, this study is the first attempt to utilize a dendritic neural model to perform real-world air quality forecasting. The extensive experimental results demonstrate that the SDNN offers very competitive performance relative to the latest prediction techniques.


Introduction
In recent years, with the development of the economy and urban industries, atmospheric pollution has increased and gained global attention. In particular, air pollution is increasingly serious and threatens our living environments and human health. In the air, high concentrations of fine particulate matter (PM 2.5 , i.e., fine aerosols with a particle size of less than or equal to 2.5 µm) are the main pollutants [1]. The composition of fine particles is very complex, contains various hazardous and toxic substances, and is difficult to control. To protect the environment and human health, many countries are incorporating environmental governance into their development strategies, and many observation stations have been built to monitor real-time PM 2.5 concentrations. Based on reliable and accurate forecasting values, announcing the concentration of pollutants days or hours in advance can help the public become aware of this hazard and make early-warning decisions. Therefore, PM 2.5 concentration prediction is very important for environmental management.
The precise forecasting of PM 2.5 concentrations is a challenging task due to their diverse impacts, irregular properties and chaotic nonlinear characteristics. As one of the most crucial methods for assessing air quality, PM 2.5 forecasting has become a major research focus in air pollution research. In addition, some researchers have begun to predict and analyse other specific pollutant concentrations, such as NO 2 [2,3], PM 10 [2,4], SO 2 [2], and the air quality index [5]. In general, the methods proposed for PM 2.5 forecasting can be mainly divided into deterministic approaches, statistical approaches and machine learning methods. Deterministic approaches are typically knowledge-based approaches that use chemical and physical theories to simulate the transformation and transportation of air pollutants for forecasting. However, relevant studies have verified that deterministic approaches have difficulty accurately predicting PM 2.5 concentrations since they cannot be used to describe the nonlinear relationships and time-varying characteristics of data [6]. In contrast, statistical approaches generally apply data and use regression methods and time series theory to explain the correlation between historical and future data. These methods are also considered simpler and more efficient than knowledge-based deterministic methods [7].
Because of the irregularity and non-linearity of PM 2.5 concentration data, these statistical methods cannot obtain sufficiently reliable and accurate prediction results to satisfy the requirements of practical applications. To overcome this limitation, various machine learning methods have recently been proposed for PM 2.5 concentration prediction, such as the random forest [8] and support vector regression (SVR) [9] methods. In addition, due to their assorted memory, self-learning, data-adaptable, and data-driven characteristics, artificial neural networks (ANNs), which can learn to accurately and reliably map the correlations between inputs and outputs, have attracted many researchers. However, it is difficult to select the most suitable ANN for different PM 2.5 concentration time series because each one has its own advantages and limitations. Accordingly, considering the calculation costs and feasibility of the method, we attempt to improve the PM 2.5 prediction performance using a very simple ANN named the dendritic neural network model (DNN), which was proposed in our previous studies [10]. The DNN uses a multiplicative operation to capture the nonlinear relationships between features. Compared to other ANNs, the DNN can be considered a more realistic neuron model, since it considers the nonlinear computation of synapses and dendritic structures, inspired by biological phenomena in neurons [11]. Such models have been successfully employed for various applications such as computer-aided medical diagnosis [12], time series prediction [13], and morphological hardware realization [14]. However, the original DNN and its simplified single-branch variant (S-DNN) are trained by an error back-propagation (BP) algorithm. Because the BP algorithm relies on gradient-descent information, it easily falls into local optima and suffers from sensitivity to initial conditions, overfitting and slow convergence.
These disadvantages largely limit the performance of the DNN and its variations. To overcome these issues, it is necessary to identify a more powerful learning algorithm to train the DNN. In this paper, a recently proposed heuristic optimization algorithm named the states of matter search (SMS) algorithm [15] is selected to optimize the weights and thresholds of the DNN, and the resulting model is utilized for PM 2.5 concentration time series prediction. The evolutionary process of the SMS can be divided into a gas state, a liquid state, and a solid state. In each state, the positions of the agents are updated based on the direction vector operator, the collision operator, and random behaviour. As a global search algorithm, the SMS algorithm offers powerful optimization abilities that can effectively avoid local optima during the training phase and significantly enhance the prediction accuracy of the DNN.
Real-world PM 2.5 concentration time series are one-dimensional, irregular and seemingly unpredictable; when they are mapped into a high-dimensional space with an appropriate time delay and embedding dimension, some of their intrinsic properties are revealed. Takens' theorem [16] underlies the commonly used phase space reconstruction (PSR) approach, which transforms these time series data into new high-dimensional embedding spaces while preserving the topological structure of the chaotic attractors. Therefore, we calculate the time delay using the mutual information (MI)-based method [17], and the embedding dimension is obtained by the false nearest-neighbour (FNN) approach [18]. Then, the PSR is performed depending on the time delay and embedding dimension, and the maximum Lyapunov exponent (MLE) is used to detect the predictability and chaotic properties [19]. Finally, the trained SDNN is used to forecast the PM 2.5 concentration. In our experiments, six PM 2.5 concentration datasets are used to evaluate the prediction performance of the SDNN. The SMS training results are compared to those of seven other optimization algorithms, and the prediction performance of the SDNN is compared to the results of several competitive forecasting approaches. To obtain reliable results, each experiment is independently performed 30 times. The experimental and statistical analysis results suggest that the SDNN can achieve very competitive prediction results. Moreover, to verify whether the proposed method can be applied to further time series predictions, we also discuss simulations on an openly available PM 2.5 dataset from the UCI Machine Learning Repository.
The main contributions of this study are as follows: (1) A more realistic SDNN that considers nonlinear computation in dendritic structures and synapses is applied to PM 2.5 concentration prediction for the first time. (2) To enhance the prediction stability and accuracy, a global optimization algorithm named the SMS is selected to train the SDNN. Experimental results show that compared to other state-of-the-art prediction approaches, the SDNN obtains prominently competitive performance for PM 2.5 concentration forecasting.
(3) The study shows that expanding the application scope of the DNN for prediction problems can help us better understand the capacities of the DNN.
The remainder of this paper is organized as follows. Section 2 introduces some related works on PM 2.5 concentration forecasting. Section 3 elaborates on the SDNN, the SMS algorithm and the relevant methods used to predict the PM 2.5 concentration time series in detail. Sections 4 and 5 present our parameter settings, experimental and statistical results, and a discussion, respectively. Section 6 draws conclusions.

Related Work
In the literature, various ANN architectures have provided strong advantages in PM 2.5 concentration forecasting, such as back-propagation (BP) neural networks [20], fuzzy neural networks [21] and long short-term memory (LSTM) neural networks [22]. Specifically, Xu and Yoneda employed the LSTM auto-encoder multi-task learning model for air quality prediction in [23]. The employment of a recurrent neural network (RNN) to forecast the air quality is presented in [24,25], and more RNN architectures for multisequence indoor PM 2.5 concentration prediction are compared and analysed in [25]. In [21], Lin et al. proposed a neuro-fuzzy modelling system for forecasting. In addition, several deep learning models have been successfully applied in air quality forecasting [26][27][28]. More references regarding the ANN-based PM 2.5 concentration prediction approaches can be found in [29][30][31][32][33][34][35].
In addition to the above methods, hybrid models are another popular choice for air quality prediction in the literature. Feng et al. proposed a hybrid model that combined a geographic model, wavelet transformation analysis and an ANN to enhance air quality forecasting accuracy [36]. The combination of an ANN and multiple linear and continuous regression models is introduced in [37]. Sun et al. developed a novel approach based on the least-squares SVM and the principal component analysis technique [38], and an integrated model composed of an SVM and an autoregressive integrated moving average model is presented in [5]. Liu et al. utilized a multi-resolution multi-objective ensemble model for PM 2.5 prediction [39]. Qi et al. integrated the LSTM and graph convolutional networks to model PM 2.5 forecasting [40]. Combined with feature extraction based on the ensemble empirical mode decomposition approach, Bai et al. applied the LSTM approach to PM 2.5 concentration prediction [22]. A hybrid model based on a BP neural network and a convolutional neural network makes accurate PM 2.5 predictions in [41]. A hybrid prediction model using land use regression and a chemical transport model can be found in [42]. Overall, although various machine learning techniques and hybrid methods are widely applied for air quality forecasting and can achieve satisfactory prediction performance to a certain degree, they incur large computational costs.

SDNN Structure
The original SDNN is inspired by the dendritic mechanism of biological neurons. It is composed of three layers: a synaptic layer, a dendritic layer and a soma layer. The weights and thresholds are trained by the optimization algorithm. The structural morphology of the SDNN is shown in Figure 1, which has M dendritic branches and n synapses per branch depending on the specific problem, where a 1 -a n are the attributes of the problem. Incoming signals a 1 -a n enter the dendritic structure through the synapses of the synaptic layer. Then, the results of each dendritic branch are collected and sent to the soma layer. A mathematical description of the SDNN is provided as follows.

Synapses
The synaptic layer represents the synaptic connections on the dendrites of a neuron, and each synapse receives the incoming signal from a feature attribute of the training data and transfers it to the next layer through a sigmoid function. The computation that describes the j-th (j = 1, 2, ..., M) branch receiving the i-th (i = 1, 2, ..., n) input is expressed as follows:

S i,j = 1 / (1 + exp(−K (w i,j a i − q i,j ))),

where S i,j is the result of the i-th synapse on the j-th dendritic branch and K is a positive constant. The synaptic parameters w i,j and q i,j must be trained by the training algorithm. According to q i,j and w i,j , the synaptic layer has four connection cases, which are illustrated in Figure 2. Moreover, the threshold α i,j of the synaptic layer is obtained from α i,j = q i,j /w i,j .
• Case 1 (Constant-1 connection): When q i,j < w i,j < 0 or q i,j < 0 < w i,j , the output of the synapse is always approximately 1 regardless of the input.
• Case 2 (Constant-0 connection): When 0 < w i,j < q i,j or w i,j < 0 < q i,j , the output is always approximately 0 regardless of the input.
• Case 3 (Inverse connection): When w i,j < q i,j < 0, the output is approximately 0 when a i > α i,j ; otherwise, the output tends to 1.
• Case 4 (Direct connection): When 0 < q i,j < w i,j , the output tends to 1 when a i > α i,j ; otherwise, the output is approximately 0.
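As an illustration, the synaptic computation and two of the connection cases can be sketched in Python, assuming the sigmoid form S i,j = 1/(1 + exp(−K(w i,j a i − q i,j ))) described above; the parameter values chosen below are illustrative only:

```python
import math

def synapse(a, w, q, K=5.0):
    """Synaptic output: S = 1 / (1 + exp(-K * (w*a - q)))."""
    return 1.0 / (1.0 + math.exp(-K * (w * a - q)))

# Direct connection: 0 < q < w, threshold alpha = q/w = 0.5
w, q = 5.0, 2.5
assert synapse(0.9, w, q) > 0.9   # a > alpha -> output near 1
assert synapse(0.1, w, q) < 0.1   # a < alpha -> output near 0

# Constant-1 connection: q < 0 < w -> output ~1 for any input in [0, 1]
assert all(synapse(a, 3.0, -2.0) > 0.95 for a in (0.0, 0.5, 1.0))
```

Inputs below the threshold α are suppressed and inputs above it are passed through, which is what allows each synapse to act as a soft logic gate.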


Dendrites
This layer performs a nonlinear operation on the incoming signals of each dendritic branch. The simplest multiplication operation plays a significant role in the processing and transmission of neural computation [43]; the output of the j-th branch is calculated by the following equation:

D j = ∏ (i=1 to n) S i,j ,

where D j is the result of the j-th dendritic branch.

Soma
The soma is the core part of the neuron. First, the soma layer accumulates the signals from all dendritic branches by a summation function. Then, a sigmoid function is commonly employed to represent the computational process of this layer, which can be described by the following equation:

Soma = 1 / (1 + exp(−K s (∑ (j=1 to M) D j − β))),

where β is a user-defined constant threshold, K s is an adjustable constant parameter, and Soma is the final output of the model.
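Putting the three layers together, a forward pass of the SDNN can be sketched as follows; this is a minimal illustration assuming the sigmoid synapse, multiplicative dendrite and sigmoid soma described above, with arbitrary parameter values:

```python
import math

def sdnn_forward(a, W, Q, K=5.0, Ks=5.0, beta=0.5):
    """Forward pass of a dendritic neuron with M = len(W) branches.

    W[j][i], Q[j][i] are the synaptic parameters of the i-th input on
    the j-th branch; a is the input feature vector.
    """
    branch_outputs = []
    for wj, qj in zip(W, Q):
        prod = 1.0
        for ai, w, q in zip(a, wj, qj):
            # synaptic layer: sigmoid of K * (w*a - q)
            prod *= 1.0 / (1.0 + math.exp(-K * (w * ai - q)))
        branch_outputs.append(prod)       # multiplicative dendrite
    v = sum(branch_outputs)               # summation over branches
    return 1.0 / (1.0 + math.exp(-Ks * (v - beta)))  # soma sigmoid

W = [[5.0, 5.0]]        # one branch, two synapses (direct connections)
Q = [[2.5, 2.5]]
assert sdnn_forward([1.0, 1.0], W, Q) > 0.9
assert sdnn_forward([0.0, 0.0], W, Q) < 0.1
```

Because each branch multiplies its synaptic outputs, a single near-zero synapse silences the whole branch, which is the nonlinear interaction between features that the model exploits.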

Training Algorithm
The multiplication operation applied on each dendritic branch makes the results of the SDNN extremely sensitive to each attribute. Moreover, the parameter space of the SDNN is large and complex. Thus, an optimization algorithm with powerful search ability is required to optimize the SDNN. In this study, a swarm-based optimization algorithm called the SMS algorithm is adopted as the training algorithm to optimize the parameters of the SDNN. The SMS algorithm is briefly described in this section.
The SMS algorithm emulates the states of matter phenomenon [15]: a population of optimization agents is described as molecules that interact with one another through evolutionary operators based on the physical principles of the thermal-energy motion ratio. The evolutionary process of the SMS algorithm can be divided into three phases: (1) a gas state, (2) a liquid state, and (3) a solid state. The agents have different exploration-exploitation energies in each stage. In the first (gas) state, agents experience severe collisions and motions at the beginning of the optimization process. The second (liquid) state restricts the collision and movement energy of the agents more than the gas state. The final (solid) state prevents individuals from moving freely due to the forces among them. The overall optimization process of the SMS algorithm is described in Figure 3.

In the evolution process, 50% of the iterations are allocated to the gas state, 40% to the liquid state, and 10% to the solid state. In this optimization algorithm, the agents are considered molecules whose positions change as the process iterates. The movement of these molecules is analogous to thermal motion and depends on three operators: (1) the direction vector operator, (2) the collision operator, and (3) random behaviour.

Direction Vector
First, the SMS algorithm randomly generates a position for each agent. The position P i of each agent is associated with a direction vector d i in the search space. As the process evolves, the direction vector operator models an attraction phenomenon by moving each molecule towards the current best particle. Thus, the direction vectors are iteratively updated as follows:

d i (t+1) = d i (t) · (1 − t/i max ) · 0.5 + (P best − P i )/||P best − P i ||,

where P best is the current best individual seen thus far and t and i max are the current iteration number and the maximum number of iterations, respectively. Once the direction has been determined, we can calculate the velocity vector as follows:

v i = d i · γ · ∑ (m=1 to n) (b m high − b m low ) / n,

where b m high and b m low are the upper and lower m-th parameter bounds, respectively, γ ∈ [0, 1], and n is the number of decision variables. Once the direction and velocity are obtained from these two equations, the new position of each molecule is calculated from:

p i,m (t+1) = p i,m (t) + v i,m · rand(0, 1) · α · (b m high − b m low ),

where α ∈ [0.5, 1] and rand(0, 1) is a random number between 0 and 1.
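A sketch of one SMS position update in Python follows; the scaling constants (e.g., the 0.5 decay factor) reflect our reading of the original SMS proposal and should be checked against [15] rather than taken as the paper's exact implementation:

```python
import numpy as np

def sms_move(P, d, P_best, t, t_max, b_low, b_high, gamma, alpha):
    """One SMS molecule update: the direction drifts toward the best
    molecule, the velocity scales with the search-space size, and the
    step is randomly modulated per dimension."""
    n = len(b_low)
    attract = (P_best - P) / (np.linalg.norm(P_best - P) + 1e-12)
    d_new = d * (1.0 - t / t_max) * 0.5 + attract   # direction vector
    v = d_new * gamma * np.sum(b_high - b_low) / n  # velocity vector
    step = v * np.random.rand(n) * alpha * (b_high - b_low)
    return np.clip(P + step, b_low, b_high), d_new  # stay feasible

np.random.seed(0)
b_low, b_high = np.zeros(3), np.ones(3)
P, d, P_best = np.zeros(3), np.zeros(3), np.ones(3)
P_new, d_new = sms_move(P, d, P_best, 1, 100, b_low, b_high, 0.8, 0.8)
assert np.all(P_new >= b_low) and np.all(P_new <= b_high)
assert np.all(P_new > P)   # moved toward the best molecule
```

Note how γ and α jointly throttle the step size, which is how the gas, liquid and solid states obtain progressively smaller movements.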

Collisions
The collision operator emulates the collision phenomenon: molecules interact with one another when the distances among them are shorter than a proximity collision radius, and the resulting exchanges provide diversity among individuals, which prevents premature convergence. The collision radius is defined as follows:

r = β · ∑ (m=1 to n) (b m high − b m low ) / n,

where β ∈ [0, 1]. If two molecules P i and P m have collided (||P i − P m || < r), the direction vectors of the two molecules, d i and d m , are modified by exchanging them (d i ↔ d m ).

Random Behaviour
The transition of molecules from one state to another commonly exhibits random behaviour. The SMS algorithm allows molecules to randomly change position by following a probabilistic criterion in the feasible space, which can be defined as follows:

p i,m (t+1) = b m low + rand(0, 1) · (b m high − b m low ) with probability H; otherwise, the position remains unchanged,

where H is a probability depending on the current SMS state and m ∈ {1, ..., n}.
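The collision and random-behaviour operators can be sketched as below; the proximity radius and probabilistic jump follow the description above, while the exact swap semantics for colliding molecules are an assumption on our part:

```python
import numpy as np

def collision_radius(b_low, b_high, beta):
    """Proximity radius below which two molecules are said to collide."""
    return beta * np.sum(b_high - b_low) / len(b_low)

def maybe_collide(P_i, P_m, d_i, d_m, r):
    """Swap the direction vectors of two molecules closer than r."""
    if np.linalg.norm(P_i - P_m) < r:
        d_i, d_m = d_m.copy(), d_i.copy()
    return d_i, d_m

def random_reposition(P, b_low, b_high, H):
    """With probability H, move the molecule to a random feasible point."""
    if np.random.rand() < H:
        return b_low + np.random.rand(len(b_low)) * (b_high - b_low)
    return P

r = collision_radius(np.zeros(2), np.ones(2), 0.6)      # r = 0.6
d_i, d_m = maybe_collide(np.zeros(2), np.full(2, 0.1),
                         np.array([1.0, 0.0]), np.array([0.0, 1.0]), r)
assert d_i[1] == 1.0 and d_m[0] == 1.0                  # vectors swapped
assert np.all(random_reposition(np.full(2, 0.3),
                                np.zeros(2), np.ones(2), 0.0) == 0.3)
```

A large H in the gas state yields frequent random jumps (exploration), while the small H of the solid state essentially freezes the molecules (exploitation).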
Based on the different states, the SMS algorithm controls the motion operators by adjusting the parameters γ, α, β, and H. The values of these parameters are provided in [15] and summarized in Table 1.

Time Delay and Embedding Dimensions
According to chaos theory, a PM 2.5 concentration time series can be mapped into a high-dimensional space by the PSR. To perform the PSR, time delay τ and embedding dimension m are necessary, which can be calculated by the MI approach and the FNN method, respectively. Then, the MLEs are calculated to detect the chaotic characteristics of the PM 2.5 concentration data.
The MI approach has gained broad acceptance as a metric of association between variables, since it measures both nonlinear and linear correlations. The time delay τ is employed to map the one-dimensional data to a higher-dimensional space. A suitable time delay value ensures that the reconstructed coordinates remain smooth and identifiable without being redundant [44]. According to information entropy theory, τ can be determined from the MI I(a t , a t+τ ), which is described as follows:

I(τ) = ∑ t P(a t , a t+τ ) log 2 [P(a t , a t+τ ) / (P(a t ) P(a t+τ ))],

where P(a t , a t+τ ) is the joint probability, and P(a t ) and P(a t+τ ) are the marginal probabilities of a t and a t+τ , respectively. The optimal τ is a positive integer determined by the first minimum of I(τ).
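The MI-based delay selection can be sketched with a simple histogram estimator; the bin count and the sine test series below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def mutual_information(x, tau, bins=16):
    """Estimate I(a_t; a_{t+tau}) from a 2-D histogram of the series."""
    a, b = x[:-tau], x[tau:]
    pxy, _, _ = np.histogram2d(a, b, bins=bins)
    pxy /= pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal P(a_t)
    py = pxy.sum(axis=0, keepdims=True)   # marginal P(a_{t+tau})
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def first_minimum_delay(x, max_tau=20):
    """Smallest tau at which I(tau) reaches its first local minimum."""
    mi = [mutual_information(x, t) for t in range(1, max_tau + 1)]
    for t in range(1, len(mi) - 1):
        if mi[t] < mi[t - 1] and mi[t] <= mi[t + 1]:
            return t + 1                  # taus are 1-based
    return int(np.argmin(mi)) + 1

x = np.sin(0.1 * np.arange(2000))         # toy series, period ~63 samples
assert mutual_information(x, 1) >= 0.0
assert 1 <= first_minimum_delay(x, 30) <= 30
```

The plug-in estimate is the MI of the empirical histogram distribution and is therefore non-negative by construction; on real PM 2.5 data, the bin count would need tuning.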
In addition, the FNN method is utilized to calculate the embedding dimension. An appropriate embedding dimension must preserve the behaviour of the original data and maintain the relevance among the data. The optimal m is also a positive integer and can be obtained from the first minimum of the FNN rate. This method employs two conditions to evaluate whether points are false neighbours, which are described as follows:
• Calculate the Euclidean distance D 1 between point a i and its nearest neighbour a NN j in dimension d. Extend both a i and a NN j from dimension d to d + 1, and compute the new Euclidean distance D 2 . If the stretching ratio √(D 2 ² − D 1 ²)/D 1 is greater than threshold µ, the points are considered false neighbours; otherwise, verify the second condition.
• If D 2 cannot satisfy the condition D 2 /δ pm ≤ V tol , the points are also considered false neighbours, where δ pm is the standard deviation of the PM 2.5 concentration time series and V tol is the positive threshold that describes the attractor size.
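A compact FNN sketch follows; the threshold values (µ = 15, V tol = 2) are common defaults from the FNN literature rather than the settings used in this study:

```python
import numpy as np

def embed(x, d, tau):
    """Delay-embed series x into d-dimensional points with lag tau."""
    N = len(x) - (d - 1) * tau
    return np.array([x[i : i + (d - 1) * tau + 1 : tau] for i in range(N)])

def fnn_fraction(x, tau, d, mu=15.0, vtol=2.0):
    """Fraction of false nearest neighbours going from dim d to d + 1."""
    Y1, Y2 = embed(x, d, tau), embed(x, d + 1, tau)
    n, sigma, false = len(Y2), np.std(x), 0
    for i in range(n):
        dist = np.linalg.norm(Y1[:n] - Y1[i], axis=1)
        dist[i] = np.inf
        j = int(np.argmin(dist))          # nearest neighbour in dim d
        d1, d2 = dist[j], np.linalg.norm(Y2[i] - Y2[j])
        if d1 == 0.0:
            continue
        # neighbour is false if lifting to d+1 stretches it too much
        if np.sqrt(abs(d2**2 - d1**2)) / d1 > mu or d2 / sigma > vtol:
            false += 1
    return false / n

x = np.sin(0.1 * np.arange(600))
f1, f2 = fnn_fraction(x, 16, 1), fnn_fraction(x, 16, 2)
assert 0.0 <= f1 <= 1.0 and 0.0 <= f2 <= 1.0
assert f2 < 0.5   # a sine already unfolds in two dimensions
```

In practice, m is taken as the first dimension at which this fraction drops to (near) its minimum.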

PSR and the MLE
According to long-term monitoring, real-world PM 2.5 concentration time series exhibit chaoticity and apparent unpredictability; hence, it is quite difficult to make accurate predictions. Nevertheless, regularity emerges when the series is reconstructed as points in phase space. The PSR technique applies a basic theory of chaotic dynamic systems that is widely utilized in the analysis of nonlinear systems [39]. It is confirmed that the PSR can expand the time series into a new space while preserving the topological structure of the chaotic attractors in the high-dimensional space. The crucial factors τ and m of the PSR for the real-world PM 2.5 concentration time series can be obtained using the above methods. Thus, the PSR points X(t) and target data T(t) can be expressed as follows:

X(t) = [a(t), a(t + τ), ..., a(t + (m − 1)τ)], T(t) = a(t + (m − 1)τ + 1),

where t = 1, 2, . . . , N − (m − 1)τ. The MLE is generally employed to confirm the properties of chaotic dynamics [45] and to estimate whether a sequence has chaotic characteristics. In general, the motion of a sequence is chaotic only if the value of the MLE is positive [45]. In this study, we use this approach to identify the chaotic characteristics of the PM 2.5 concentration time series. Following the successive-distance procedure described below, the MLE can be calculated as follows:

λ max = (1 / (t K − t 0 )) ∑ (k=1 to K) ln(L k / L k−1 ).

We assume that the initial time and phase point are t 0 and a(t 0 ), respectively. L 0 is the minimum distance from a neighbouring phase point. In addition, we set the distance L 0 (||a(t 1 ) − a(t 0 )||) to be larger than a positive threshold at time t 1 . L 0 is replaced with L 1 when the next distance L 1 to another phase point is greater than L 0 at time t 2 , and this computational process continues until the last phase point a N . As mentioned above, a dynamic system manifests chaotic characteristics when the MLE exceeds 0, and the value of the MLE is typically between 0 and 1 to enable long-term prediction [46]. Figure 4 shows the time delays and embedding dimensions of the six PM 2.5 concentration time series based on the MI approach and the FNN method, respectively.
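The reconstruction that feeds the SDNN can be sketched as a simple delay-embedding routine with one-step-ahead targets:

```python
import numpy as np

def phase_space(x, m, tau):
    """Map series x to m-dimensional phase points with lag tau; each
    point X[t] is paired with the next observation as target T[t]."""
    N = len(x) - (m - 1) * tau - 1
    X = np.array([x[t : t + (m - 1) * tau + 1 : tau] for t in range(N)])
    T = np.array([x[t + (m - 1) * tau + 1] for t in range(N)])
    return X, T

X, T = phase_space(np.arange(10.0), m=3, tau=2)
assert X.shape == (5, 3)
assert list(X[0]) == [0.0, 2.0, 4.0] and T[0] == 5.0
```

Each row of X is one input vector for the network and the corresponding entry of T is the value to be forecast.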
The results of the time delay and embedding dimensions are obtained for each training dataset, and the computational results of τ, m and MLE for all PM 2.5 concentration datasets are summarized in Table 2.

Experiments
In this study, all prediction models are evaluated on six PM 2.5 concentration datasets obtained from the Beijing Monitoring Center Station in China in eastern Asia. Predictions are made 1 day ahead.

Dataset Description
Our experiment uses six real-world daily PM 2.5 concentration datasets collected from the Ministry of Ecology and Environment of China over 4 years and 6 months (1 January 2016 to 30 June 2020). We select six datasets with 2-year terms and divide each into two subsets: a training set and a prediction set, containing approximately the first 75% and the last 25% of the data, respectively. The details of the experimental datasets are presented in Table 3.

Normalization
First, to improve the computation speed and reduce the computational complexity, we normalize all inputs to the range [0, 1] based on the following equation:

a norm = (a − a min ) / (a max − a min ),

where a min and a max are the minimal and maximal values of the original vector, respectively. Notably, normalization is performed during both the training and testing phases. In addition, the inverse normalization operation is performed on the outputs of the model.
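A minimal sketch of the min-max normalization and its inverse follows; note that, in practice, the minimum and maximum from the training data should also be applied to the test inputs to avoid leakage:

```python
def normalize(v):
    """Scale a vector to [0, 1]; return the scale for later inversion."""
    lo, hi = min(v), max(v)
    return [(a - lo) / (hi - lo) for a in v], lo, hi

def denormalize(scaled, lo, hi):
    """Invert the min-max scaling on model outputs."""
    return [s * (hi - lo) + lo for s in scaled]

scaled, lo, hi = normalize([10.0, 20.0, 40.0])
assert scaled[0] == 0.0 and scaled[-1] == 1.0
restored = denormalize(scaled, lo, hi)
assert all(abs(a - b) < 1e-9 for a, b in zip(restored, [10.0, 20.0, 40.0]))
```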

Parameter Settings
As mentioned above, three hyperparameters impact the performance of the SDNN: K, M, and β. K is a positive constant of the sigmoid function in the synaptic layer of the SDNN; M is the number of branches in the dendritic layer, which is commonly greater than the number of features; and β is the threshold of the soma layer.
In general, an exhaustive approach to determining these parameters is resource intensive. To achieve the best performance while simultaneously decreasing the material, labour, and time costs, Taguchi's method, which utilizes orthogonal arrays to find a reasonable parameter combination for each dataset, is used to reduce the number of experimental runs [47]. Accordingly, L16(4^3) orthogonal arrays are generated, which cover only 16 (of 64) experiments in the preliminary work. To achieve reliable average performance, we perform each experiment over 30 independent runs using Taguchi's method, and the experimental results are summarized in Table 4. Each dataset clearly corresponds to a set of optimal parameter combinations. In addition, the population size and maximum number of iterations are set to 50 and 1000, respectively.

Evaluation Criteria
To perform a comprehensive performance comparison, the performance of each approach is assessed by five commonly utilized metrics: the mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (MAPE), root mean squared error (RMSE), and correlation exponent of the prediction (CE), which are defined by the following formulas:
• MSE = (1/n) ∑ (i=1 to n) (f̂ i − f i )²
• MAE = (1/n) ∑ (i=1 to n) |f̂ i − f i |
• MAPE = (100%/n) ∑ (i=1 to n) |(f̂ i − f i )/f̂ i |
• RMSE = √[(1/n) ∑ (i=1 to n) (f̂ i − f i )²]
• CE = ∑ i (f̂ i − µ f̂ )(f i − µ f ) / √[∑ i (f̂ i − µ f̂ )² ∑ i (f i − µ f )²]
where f̂ i is the target vector, f i is the output of the utilized prediction model, n is the number of instances, and µ f̂ and µ f denote the corresponding mean values.
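The five metrics can be computed as below; the CE is implemented here as the Pearson correlation coefficient, which is our assumption about what the paper's "correlation exponent" denotes:

```python
import numpy as np

def metrics(target, pred):
    """Return (MSE, MAE, MAPE %, RMSE, CE) for one prediction run."""
    target = np.asarray(target, dtype=float)
    pred = np.asarray(pred, dtype=float)
    e = target - pred
    mse = float(np.mean(e ** 2))
    mae = float(np.mean(np.abs(e)))
    mape = float(100.0 * np.mean(np.abs(e / target)))
    rmse = float(np.sqrt(mse))
    ce = float(np.corrcoef(target, pred)[0, 1])  # Pearson correlation
    return mse, mae, mape, rmse, ce

mse, mae, mape, rmse, ce = metrics([1, 2, 3, 4], [2, 3, 4, 5])
assert mse == 1.0 and mae == 1.0 and rmse == 1.0
assert abs(ce - 1.0) < 1e-9          # perfectly linearly related series
```

A constant offset leaves the CE at 1 while the error metrics grow, which is why the four error measures and the correlation measure are reported together.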

Performance Comparison
In our study, seven optimization algorithms and eight prediction models are utilized as competitors of the SDNN. To achieve a reliable evaluation, the experiments for each approach and model are independently repeated 30 times. All experiments are performed on a PC equipped with a 3.80 GHz Intel(R) Core(TM) i7-10700K CPU and 32 GB of RAM using MATLAB R2018b.

Comparison with Other Optimization Algorithms
In this section, we compare the training performance of the SMS algorithm to that of seven other optimization algorithms: the genetic algorithm (GA) [48], cuckoo search (CS) [49], firefly algorithm (FA) [50], gravitational search algorithm (GSA) [51], adaptive differential evolution with an optional external archive (JADE) [52], adaptive differential evolution with linear population size reduction (L-SHADE) [53], and particle swarm optimization (PSO) [54]. To ensure the performance of these algorithms, the initial hyperparameters are obtained from the literature listed above. The maximum number of iterations is 1000, and the population size is set to 50 for all optimization algorithms. The optimization algorithms are separately employed to train the DNN for PM 2.5 concentration forecasting. For a fair comparison, we select the identical parameter combination of the SDNN for each dataset. The results achieved by these optimization algorithms on the six prediction problems over 30 runs are summarized in Table 5. The SMS algorithm achieves a smaller mean and a lower standard deviation of the MSE for most PM 2.5 concentration datasets, which implies that the SMS algorithm has more powerful optimization capabilities than the other methods. The exception is that L-SHADE provides the best performance on one of the datasets due to its powerful search ability. To further demonstrate the effectiveness of the SMS algorithm, a nonparametric statistical method called Friedman's test is used to detect significant differences among multiple groups. Friedman's test provides a list of ranks to evaluate the performance of all schemes, where a lower rank indicates better performance. The average ranks of the eight optimization algorithms on the six PM 2.5 concentration prediction problems are listed in Table 6, which shows that the SMS achieves the best performance (ranked 1st), while L-SHADE is the second-best method.
Moreover, unadjusted p-values (the probability of a false discovery) ignore the family-wise error rate in multiple pairwise comparisons. Therefore, a post hoc approach called the Bonferroni-Dunn procedure is used to adjust the p-values, yielding the p_bonf values. The corresponding significance level is set to 0.1. The resulting p_bonf values are presented in Table 6. These statistical results imply that the SMS algorithm is significantly better than the GA, CS, FA, GSA and JADE methods, while there is no significant difference between SMS and L-SHADE or between SMS and PSO. Since the SMS algorithm ranks better than the L-SHADE and PSO algorithms, it is the better choice for training the DNN model. In summary, the SMS algorithm shows obvious advantages over the other optimization algorithms in training the DNN for daily PM 2.5 concentration prediction.

Comparison with Other Prediction Approaches
The above experimental results show that the SMS algorithm is a promising learning algorithm for optimizing the SDNN with less prediction error and more stability than the other methods. We also compare the SDNN to eight other commonly applied prediction models: the multilayer perceptron (MLP) [55], classic DNN trained by the BP algorithm (DNN-BP), S-DNN, decision tree (DT) model, SVR with a linear kernel (SVR-L), SVR with a polynomial kernel (SVR-P), SVR with a radial basis function kernel (SVR-R), and LSTM model. For a fair comparison, the hyperparameters of all DNN-related models are determined by Taguchi's method, as for the SDNN. The initial hyperparameters of these prediction models for each dataset are presented in Table 7. Based on the PSR, the six one-dimensional PM 2.5 concentration time series are independently transformed into six high-dimensional training datasets, which are input into the SDNN for training. Figure 5 (left) shows the corresponding forecast PM 2.5 concentration obtained after the training process compared to the monitoring value, where the black and light blue lines represent the observed and predicted PM 2.5 concentrations, respectively. The observed and predicted time series are relatively close for each dataset. In addition, to examine the correlation between the observed and predicted data, scatter plots are shown in Figure 5 (right). The figure demonstrates that the distribution of the points converges very near the regression line for all PM 2.5 concentration data. Notably, the SDNN fails at a few valley and peak values, which can be confirmed from these scatter plots. Thus, the SDNN must still be improved to avoid overestimating or underestimating lower or higher PM 2.5 concentrations during air quality forecasting. To further verify the superiority of the SDNN in forecasting PM 2.5 concentrations, a quantitative evaluation is performed.
The SDNN is compared to the MLP, DNN-BP, S-DNN, DT, LSTM and SVR models with three different kernels. The overall performances of the prediction models, which are measured by the average value of five estimation metrics for 30 repeated experiments, are summarized in Tables 8 and 9. The optimal values are marked in bold. To detect significant differences between the SDNN and the other prediction models, the Wilcoxon signed-rank test, which is a nonparametric statistical test, is employed in this section. The p-values are calculated and presented on the right of each evaluation metric in Tables 8 and 9, where "-" denotes "not applicable". The significance level is set to 0.05 [56], which indicates that if the p-value exceeds 0.05, there is no significant difference between the two compared models. Otherwise, there are significant advantages over the competitor.
As illustrated in Tables 8 and 9, the MSE, MAE and CE of the SDNN are clearly better than those of the other prediction approaches for all datasets. The corresponding p-values imply that, on most of these evaluation metrics, the SDNN and its competitors differ significantly. The better forecasting performance of the proposed SDNN is thus evident. With respect to MAPE and RMSE, the SDNN performs better than the MLP, DNN-BP, S-DNN, SVR-P, SVR-R and LSTM methods for most of the prediction datasets but worse than the DT and SVR-L methods. Specifically, the DT model achieves the best MAPE on 5 of 6 datasets, and SVR-L achieves the best RMSE on 3 of 6 datasets. Surprisingly, these two relatively simple machine learning techniques outperform more complex approaches such as the LSTM and SDNN methods on these metrics; the advantages of the SDNN here are not significant, which may be attributed to the low smoothness of the airborne pollution data.
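For reference, the five evaluation metrics can be computed as below. The exact definition of CE is not restated in this section, so it is assumed here to be the Nash-Sutcliffe coefficient of efficiency, whose ideal value is 1 (consistent with "less than 0.8" being called not ideal); the other four definitions are standard.

```python
import numpy as np

def evaluation_metrics(observed, predicted):
    """Compute the five metrics used in the comparison tables.
    CE is assumed to be the Nash-Sutcliffe coefficient of efficiency."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    err = obs - pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / obs)) * 100.0   # observations assumed nonzero
    # CE -> 1 for a perfect forecast; <= 0 means no better than the mean.
    ce = 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "CE": ce}
```

MSE, RMSE and MAE are scale-dependent, MAPE is relative (in percent), and CE measures skill against the trivial mean predictor, which is why the three groups can rank the models differently.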
In general, the results in Tables 8 and 9 show that the SDNN has obvious advantages in PM 2.5 concentration prediction. The SDNN is more stable and robust than the other prediction approaches, since its complex dendritic structure can extract useful feature information and the nonlinear relationships between distinct features of the input datasets more deeply and effectively than its competitors can. To better demonstrate the integrated capabilities of the SDNN across the MSE, MAE, MAPE and RMSE values, stacked error bars are plotted in Figure 6. The SDNN achieves a lower error column than the other eight models for all PM 2.5 concentration datasets. This result confirms that the proposed SDNN exhibits effective predictive performance and strong robustness. According to the above experimental results, the SDNN achieves very competitive forecasting performance compared to the other prediction models and can be considered an efficient and effective PM 2.5 concentration forecasting approach. However, the CEs of all models are not ideal (less than 0.8), so there is still much room to improve the forecasting performance of machine learning approaches.

Extension
As presented above, the proposed SDNN can successfully predict PM 2.5 concentrations. In this section, the performance of the proposed algorithm is evaluated on an openly available PM 2.5 dataset from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.php, accessed on 1 January 2021), which contains the hourly PM 2.5 concentrations measured at the US Embassy in Beijing together with meteorological data from Beijing Capital International Airport. In addition, we compare the predictive performance of the SDNN with four other prediction approaches from the literature. Table 10 summarizes the comparison between the SDNN and the other prediction techniques on hourly PM 2.5 concentration prediction in terms of RMSE and MAE. To make these two evaluation metrics more intuitively comparable, inverse normalization is applied to both the RMSE and MAE. The best results are highlighted in bold, and all values are averages over the experimental results. It can be observed that the proposed SDNN obtains the best result on the UCI hourly PM 2.5 concentration time series dataset, ranking first among the five prediction techniques. Accordingly, it can be concluded that the overall performance of the SDNN is evidently better than those of the other prediction models.
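The inverse normalization of the error metrics can be sketched as follows, under the assumption (not restated in this section) that the series was min-max scaled before training. Because RMSE and MAE are scale-dependent, a linear rescaling of the data multiplies both by the data range, so they can be mapped back to µg/m³ without re-running the models.

```python
def denormalize_errors(rmse_norm, mae_norm, series_min, series_max):
    """Undo min-max scaling for scale-dependent error metrics.
    Assumes the series was normalized as (x - min) / (max - min),
    so absolute errors scale back by the factor (max - min)."""
    scale = series_max - series_min
    return rmse_norm * scale, mae_norm * scale
```

For example, with a hypothetical observed range of 0 to 200 µg/m³, a normalized RMSE of 0.1 corresponds to 20 µg/m³ in the original units.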

Conclusions
Predicting air quality is beneficial for the protection, early monitoring and governance of the environment. However, due to the characteristics of PM 2.5 motion, it is difficult to predict PM 2.5 concentrations with high accuracy and stability. In this paper, a novel SDNN is proposed to improve the accuracy of PM 2.5 concentration time series forecasting. The proposed SDNN is trained by the SMS global optimization algorithm owing to its powerful search abilities. To evaluate the effectiveness of the SDNN, six prediction datasets are adopted in our experiments. The MI and FNN approaches are employed to obtain the time delay and embedding dimensions, respectively. Then, the phase space is reconstructed based on these two factors, and the MLE is used to analyse the predictable limit and chaotic characteristics of the PM 2.5 concentration datasets. Finally, the prediction results of the SDNN on the regenerated datasets are compared to those of DNNs trained by six optimization algorithms and of eight commonly used prediction models. The experimental results and statistical analysis demonstrate that the SDNN dominates in terms of the four evaluation metrics. Thus, the proposed model can effectively enhance the stability and accuracy of PM 2.5 concentration predictions. Although the SDNN achieves competitive forecasting results, there is still much room to improve the forecasting performance of machine learning approaches in terms of the CE results. Moreover, this study employs only historical PM 2.5 concentrations as an influencing factor; more auxiliary information, such as weather conditions, economic factors and geographical positions, will be considered in our future studies. In addition, the SDNN will be applied to other real-world time series prediction problems, such as traffic flow forecasting and financial time series prediction.