Urban Air Quality Analysis and Forecast Based on Intelligent Algorithm with Parameter Optimization and Decision Rules

Featured Application: Air pollution has become an unavoidable reality in today's world. With the rapid development of various industries and motorized transportation, large amounts of harmful substances such as soot, sulfur dioxide, nitrogen oxides, carbon monoxide, and hydrocarbons are released into the atmosphere, where they persist for long periods at concentrations exceeding tolerable environmental limits. In this study, we investigated an intelligent algorithm that provides both parameter optimization and decision rules, and applied it to Beijing air quality data to analyze and forecast urban air quality.

Abstract: Air pollution has an ongoing devastating impact on the planet, damaging ecosystems, depleting natural resources, and endangering human health. This paper proposes a new intelligent algorithm that includes parameter optimization and decision rules to forecast and analyze urban air quality. In the analysis of 24-h daily air quality data provided by the Beijing Air Quality Monitoring Station, simulated annealing (SA) and a decision tree (DT) emerge as the key components. We show that in the investigated algorithm, SA and DT can be used to generate decision rules and achieve better classification accuracy, and that SA can be used to find the best parameter settings for the DT. Simulation results show that the classification accuracy of the proposed algorithm is higher than that of existing approaches.


Introduction
With growing global and public attention, how to monitor air quality scientifically and effectively, and how to further prevent and control air pollution, have become widely discussed topics. Air pollution is a complex problem characterized by the coexistence of multiple pollutants, multi-scale correlation, and multi-process evolution. To address it, it is particularly important to strengthen air quality monitoring and the information infrastructure for air quality management. Air quality can be improved effectively only on the basis of sound prediction, analysis, and research. How to make effective use of the real-time monitoring data from each city's automatic air monitoring stations, mine the information they contain, and use them to build a bridge for analyzing pollution problems [1], so as to improve air quality and people's living environment and to protect people's health, is an urgent problem.
The air quality index (AQI) is calculated by monitoring the concentrations of fine particulate matter (PM2.5), inhalable particulate matter (PM10), sulfur dioxide (SO2), nitrogen dioxide (NO2), ozone (O3), and carbon monoxide (CO). In recent years, owing to the increasing consumption of energy resources and the growth of cumulative emissions, air pollution has worsened considerably, and increasing attention has been paid to its study. To adapt to this global trend and create a good air environment, using data mining technology to build air quality analysis and forecast models has become an important topic [2][3][4][5][6][7][8][9].
Collecting AQI data is key for monitoring pollution problems. To address the AQI problem, various data mining approaches have been used, including artificial neural networks (ANN), genetic algorithms (GA), decision trees (DT), random forests (RF), and support vector machines (SVM) [10][11][12][13][14][15][16][17][18][19][20]. Each method takes a single basic point of view and provides a general performance analysis of air quality indicators, but it is difficult to identify the best method. Recent studies have proposed various intelligent systems, and the results seem applicable [21][22][23][24][25][26]. However, these methods share an important shortcoming: they cannot simultaneously provide parameter optimization for an algorithm and decision rules. For AQI evaluation, decision rules can be updated according to the datasets used in the evaluation process and can be used to predict new evaluation results. We therefore investigate an algorithm built around AQI decision rule establishment and parameter optimization, so that urban air quality forecast and analysis can rest on an intelligent algorithm with both capabilities. Specifically, we propose an intelligent algorithm that combines DT and simulated annealing (SA), in which DT generates decision rules, SA converges to a global optimum, and the parameters of DT are determined by SA. The rules extracted in this paper can be used to analyze collected information and then forecast a new AQI. In what follows, we review the decision tree in Section 2, introduce the proposed algorithm in Section 3, and present and discuss the simulation results in Section 4. Finally, we draw conclusions.

A Brief Description of the Decision Tree Algorithm
In our previous work [27,28], we applied the DT algorithm to anomaly intrusion detection and found it to have excellent classification performance. The DT has the advantages of intuitive expression and convenient operation and is widely used in research [29][30][31][32][33][34][35][36]. It consists of a root node, child nodes, and leaf nodes. After the structure is established, the data to be classified are tested starting from the root node. Depending on the data attributes, each sub-node selects a property and passes the sample to another sub-node, recursively, until a leaf is reached. The leaf nodes hold the classifications used for data prediction. When a DT is constructed, the attribute with the highest information gain ratio is chosen as the split attribute of the current node. As the recursion proceeds, the information gain ratio of the remaining attributes becomes smaller and smaller, and at the final stage the attribute with a relatively large information gain ratio is selected as the splitting attribute. The DT uses the Gini coefficient minimization criterion to perform feature selection and generate a binary tree [29,30]. The Gini coefficient minimization criterion is calculated as follows:

$$\mathrm{Gini}(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2$$

Here, $p_k$ is the probability that the selected sample belongs to the $k$th class, and $(1 - p_k)$ is the probability that it does not. For a given sample set $D$, the Gini index is:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{K} \left( \frac{|C_k|}{|D|} \right)^2$$

Here, $C_k$ is the subset of samples in $D$ belonging to the $k$th class, and $K$ is the number of classes. If the sample set $D$ is divided into two parts $D_1$ and $D_2$ according to whether a feature $A$ takes a certain value $a$, namely:

$$D_1 = \{ (x, y) \in D \mid A(x) = a \}, \qquad D_2 = D - D_1$$

then under the condition of feature $A$, the Gini index of set $D$ is defined as:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

The Gini index $\mathrm{Gini}(D)$ represents the uncertainty of the set $D$, and $\mathrm{Gini}(D, A)$ represents the uncertainty of the set $D$ after partitioning by $A = a$. The larger the Gini index, the greater the uncertainty of the sample set.
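The Gini index defined above can be illustrated with a minimal Python sketch. The function names, the toy data, and the use of a numeric threshold test (rather than the equality test A = a) are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from collections import Counter


def gini(labels):
    """Gini index of a collection of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(rows, labels, feature, threshold):
    """Weighted Gini index of D after splitting on rows[i][feature] < threshold.

    Assumption: a numeric threshold stands in for the test A = a in the text.
    """
    left = [y for x, y in zip(rows, labels) if x[feature] < threshold]
    right = [y for x, y in zip(rows, labels) if x[feature] >= threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


# Hypothetical toy data: four samples, two air quality levels.
rows = [{"PM2.5": 20}, {"PM2.5": 30}, {"PM2.5": 60}, {"PM2.5": 80}]
labels = [1, 1, 2, 2]

print(gini(labels))                             # 0.5 (two equally likely classes)
print(gini_split(rows, labels, "PM2.5", 49.5))  # 0.0 (both parts are pure)
```

A split that lowers the weighted Gini index, as the threshold 49.5 does on this toy set, is the kind of split the minimization criterion prefers.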
When the DT algorithm is used, the best combination of its two parameters, minimum case (M) and the pruning confidence factor (CF), differs from one problem or case to another [29]. In this paper, the SA algorithm is used to adjust these two parameters and determine their best combination, and thus the best solution of the problem.

The Proposed Algorithm
This paper proposes an algorithm for urban air quality forecast and analysis that is based on an intelligent algorithm with parameter optimization and decision rules. To verify the performance of the proposed algorithm, we use Beijing air quality data in CSV (comma-separated values) format [37]. Part of the original data is shown in Table 1. The seven features are listed in Table 2 [38]. As Table 2 shows, these pollutants cause poor air quality, affect the human living environment, and harm human health. Real-time historical AQI data were collected for the Dongcheng District of Beijing from 1 January 2017 to 6 October 2018, giving 11,270 AQI instances with seven features. Table 3 presents partial data for the resulting AQI.

Table 2. The seven features.

Feature: Description

PM2.5: Particles in the atmosphere with a diameter less than or equal to 2.5 µm, also known as fine particulate matter, have an important effect on air quality and human health.

PM10: Particulate matter in the atmosphere with a diameter of 10 µm or less is known as inhalable particulate matter. It can enter the lungs, and it has an important impact on air quality and human health.

SO2: Sulfur dioxide is one of the main atmospheric pollutants. When sulfur dioxide is dissolved in water, sulfurous acid is formed, the main component of acid rain.

NO2: Nitrogen dioxide comes mainly from high-temperature combustion processes, such as vehicle exhaust and boiler exhaust emissions. It is another cause of acid rain; it also reduces atmospheric visibility and contributes to the acidification and eutrophication of surface water.

O3: The increasing concentration of ozone in the troposphere has a detrimental effect on human health and plants. Ozone irritates the eyes and respiratory organs and, at above-normal levels, negatively affects lung function.

CO: Carbon monoxide easily combines with hemoglobin to form carboxyhemoglobin, which prevents hemoglobin from carrying oxygen and causes tissue suffocation and death. Carbon monoxide has toxic effects on all body tissue cells, especially the cerebral cortex.

AQI: As the AQI increases, air quality worsens and pollution becomes more serious.

According to the Environmental Air Quality Standards GB 3095-2012, discrete AQI data are classified into pollution levels one (excellent) through six (serious). The correspondence between AQI and air quality level is shown in Table 4. Table 5 presents partial data for air quality level.

Metropolis laid the groundwork for SA by proposing an importance sampling method, that is, accepting new states with a certain probability, which is known as the Metropolis criterion [39]. This is the basic idea of SA algorithms. Kirkpatrick et al. first proposed the simulated annealing algorithm in 1983 [40,41]. SA converges asymptotically to the optimal solution and is widely used to solve optimization problems.

In recent years, with the rapid growth of information, datasets have become far larger than traditional ones (big data). With such a large amount of AQI data, finding the useful information in it has become an important issue. DT presents the rules in the data as a tree structure, enabling analysts to understand and interpret the knowledge implicit in the data, and is widely used in various fields [30][31][32]. However, before a decision tree model is built, its parameters must be set, and they affect the result: if the parameter values are not adjusted properly, the classification result will be poor. Because the best values of the minimum case (M) and the pruning confidence factor (CF) of the DT differ from problem to problem, adjusting them manually is very time-consuming.
Therefore, this paper proposes an intelligent algorithm combining DT and SA, built around AQI decision rule establishment and parameter optimization, and uses it to predict and analyze urban air quality. The algorithm combines the advantages of DT and SA: DT generates decision rules, SA converges to the global optimum, and the DT parameters minimum case (M) and pruning confidence factor (CF) are determined by SA. The rules extracted in this paper can be used to analyze the collected information and then forecast a new AQI.

Figure 1 shows a flow chart of the proposed algorithm. The AQI dataset is pre-processed into training and testing data, initial values for the parameters are set, and an initial solution is then generated randomly. The proposed algorithm begins with four parameters, namely I_gen, T_0, T_f, and λ, where I_gen denotes the number of generations, T_0 the initial temperature, T_f the final temperature (the algorithm stops once the current temperature is lower than T_f), and λ the coefficient controlling the cooling rate. The current temperature T is initially set to T_0. A solution is represented as the seven features followed by the two variables M and CF, as shown in Table 6. An initial solution α is randomly generated according to this representation. In each generation, the next solution β is generated from α by randomly swapping the seven features and randomly generating the values of the two variables in the current solution. After I_gen generations, T is decreased according to the formula T ← λT, where 0 < λ < 1. Let obj(α) denote the testing accuracy of α, and let ∆ denote the difference between obj(α) and obj(β); that is, ∆ = obj(α) − obj(β).
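The solution representation and the move from α to β can be sketched as follows. This is a minimal illustration under stated assumptions: the tuple layout, the helper names, and the choice to swap exactly two feature positions per move are hypothetical, and the search ranges for M (2 to 100) and CF (0.01 to 0.5) are the ones reported with the simulation settings.

```python
import random

# The seven features of the dataset, followed in a solution by M and CF.
FEATURES = ["PM2.5", "PM10", "SO2", "NO2", "O3", "CO", "AQI"]
M_RANGE = (2, 100)       # search range for the minimum case M
CF_RANGE = (0.01, 0.5)   # search range for the pruning confidence factor CF


def random_solution(rng):
    """Initial solution alpha: a random feature order plus random M and CF."""
    order = FEATURES[:]
    rng.shuffle(order)
    return order, rng.randint(*M_RANGE), rng.uniform(*CF_RANGE)


def neighbour(solution, rng):
    """Next solution beta: swap two features (an assumption) and redraw M, CF."""
    order = solution[0][:]                    # copy so alpha is left untouched
    i, j = rng.sample(range(len(order)), 2)   # two distinct positions
    order[i], order[j] = order[j], order[i]
    return order, rng.randint(*M_RANGE), rng.uniform(*CF_RANGE)


def cool(temperature, lam):
    """Geometric cooling step T <- lambda * T, with 0 < lambda < 1."""
    return lam * temperature


rng = random.Random(0)
alpha = random_solution(rng)
beta = neighbour(alpha, rng)
print(alpha)
print(beta)
```

Keeping the feature order and the two DT parameters in one tuple lets a single move perturb both parts of the solution, as the description of β suggests.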
Given that ∆ > 0, the probability of replacing the current solution α with the next solution β is e^(−∆/T). This is accomplished by generating a random number r ∈ [0, 1] and replacing α with β if r < e^(−∆/T). If ∆ ≤ 0, the probability of replacing α with β is one. In the proposed algorithm, SA and DT are used together to optimize the parameters (M and CF), so as to increase the testing accuracy for the selected features, and to build the decision rules. The algorithm is repeated until T is lower than T_f, after which the best testing accuracy and the decision rules are reported.

The proposed approach uses accuracy based on the confusion matrix to test the performance of the classification method. The confusion matrix is shown in Table 7. TP, FP, FN, and TN denote true positives, false positives, false negatives, and true negatives, respectively. A prediction of the positive class is recorded as P (positive), and a prediction of the negative class is recorded as N (negative). When the predicted value agrees with or contradicts the actual value, it is recorded as T (true) or F (false), respectively.

The receiver operating characteristic (ROC) curve and the area under the curve (AUC) can also test the performance of classification results. A useful property of the ROC curve is that it remains unchanged when the distribution of positive and negative samples in the test set changes. Class imbalance often occurs in real datasets, that is, there are many more negative samples than positive samples (or vice versa), and the distribution of positive and negative samples in the test data may change over time. The area under the ROC curve is therefore used as the evaluation measure for imbalanced data, as it comprehensively describes the performance of a classifier under different decision thresholds. The AUC is calculated as the area under this curve.
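The two evaluation measures can be sketched in a few lines of Python. This is a hedged illustration, not the paper's implementation: accuracy is written with the usual definition (TP + TN) / (TP + FP + FN + TN), and the AUC is computed for a binary problem with the standard pairwise formulation (the fraction of positive/negative pairs in which the positive sample receives the higher score, ties counting one half). The function names and the example numbers are hypothetical.

```python
def accuracy(tp, fp, fn, tn):
    """Share of correct predictions in a binary confusion matrix."""
    return (tp + tn) / (tp + fp + fn + tn)


def auc(scores, labels):
    """Pairwise AUC for binary labels (1 = positive, 0 = negative).

    Counts, over all (positive, negative) pairs, how often the positive
    sample has the higher score; a tie contributes one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical confusion-matrix counts and classifier scores.
print(accuracy(tp=40, fp=5, fn=10, tn=45))                  # 0.85
print(auc([0.9, 0.8, 0.35, 0.6, 0.2], [1, 1, 1, 0, 0]))     # 5 of 6 pairs ordered correctly
```

Because the pairwise AUC depends only on how positives are ranked relative to negatives, it does not change when the ratio of positive to negative samples changes, which is the property that makes it suitable for imbalanced data.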

Simulation Results and Discussions
This study adopts 10-fold cross-validation to evaluate the results. The data were divided into 10 portions; nine portions were used as training data and the remaining one as testing data. To verify its performance, the proposed algorithm was compared with the RF and SVM approaches. The SVM is a learning system that uses a hypothesis space of linear functions in a high-dimensional feature space. The RF is an ensemble learning method for classification that constructs multiple decision trees at training time and outputs the class selected by the majority of the trees. The SA parameters were set as follows: number of generations I_gen = 5000, initial temperature T_0 = 100, final temperature T_f = 0.01, and cooling rate λ = 0.95 [42]. The search range of the DT parameter M was 2 to 100, and that of CF was 0.01 to 0.5.
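With these settings, the annealing loop can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: `toy_accuracy` is a hypothetical stand-in for "testing accuracy of a DT built with (M, CF)", the move simply redraws M and CF from their search ranges, and the demonstration uses a small number of generations per temperature so that it runs quickly.

```python
import math
import random


def simulated_annealing(objective, initial, neighbour, t0=100.0, tf=0.01,
                        lam=0.95, i_gen=5000, rng=None):
    """Maximise `objective` using the acceptance rule and cooling schedule
    described in the text. Returns (best solution, best score, levels)."""
    rng = rng or random.Random()
    current, current_score = initial, objective(initial)
    best, best_score = current, current_score
    temperature, levels = t0, 0
    while temperature >= tf:                  # stop once T is lower than T_f
        for _ in range(i_gen):                # I_gen generations per temperature
            candidate = neighbour(current, rng)
            candidate_score = objective(candidate)
            delta = current_score - candidate_score   # > 0: candidate is worse
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_score = candidate, candidate_score
                if current_score > best_score:
                    best, best_score = current, current_score
        temperature *= lam                    # T <- lambda * T
        levels += 1
    return best, best_score, levels


def toy_accuracy(solution):
    """Hypothetical objective standing in for DT testing accuracy."""
    m, cf = solution
    return 1.0 - ((m - 10) / 100.0) ** 2 - (cf - 0.25) ** 2


def random_parameters(_current, rng):
    """Redraw M and CF from the search ranges given in the text."""
    return rng.randint(2, 100), rng.uniform(0.01, 0.5)


best, score, levels = simulated_annealing(
    toy_accuracy, (50, 0.4), random_parameters, i_gen=50, rng=random.Random(0))
print(levels)        # 180 temperature levels from T_0 = 100 down to T_f = 0.01
print(best, score)
```

With T_0 = 100, T_f = 0.01, and λ = 0.95 the loop visits 180 temperature levels, so I_gen = 5000 corresponds to 900,000 candidate evaluations; when each evaluation means training a tree, this cost is the main practical consideration in choosing the schedule.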

Comparative Analysis of Classification Accuracy with Proposed Algorithm and Other Methods
The simulation results in Table 8 show the classification accuracy of the proposed algorithm and the other approaches on the training data. The proposed algorithm achieves a classification accuracy of 99.92%, which is better than that of the other approaches: decision tree (DT), random forest (RF), and support vector machine (SVM). In particular, the classification accuracy of DT alone is 95.34%, compared with 99.92% for the proposed algorithm, which indicates that SA complements DT: adjusting the parameters improves its accuracy. Because SA can jump out of a local optimum by accepting worse solutions with a certain probability, it effectively prevents the search process from becoming trapped in a local optimum. By adding SA to DT, the proposed algorithm uses this advantage to determine M and CF in DT effectively.

Analysis of Decision Rules Obtained from the Proposed Algorithm
The proposed algorithm yields a total of eight DT rules, which are shown in Table 9. PM2.5 and PM10 are the main factors affecting air quality. The DT splits on PM2.5 at the root node, indicating that PM2.5 is the most important indicator of air quality level.

Table 9. Decision rules obtained from the proposed algorithm.

No. Rules

1. When PM2.5 < 49.5 and PM10 < 35.5, the air quality level is 1. That is, the test starts from the root node with PM2.5 < 49.5, moves to the sub-node PM10 < 35.5, and finally reaches the leaf node of air quality level 1.

2. When PM2.5 < 49.5 and PM10 ≥ 35.5, the air quality level is 2. That is, the test starts from the root node with PM2.5 < 49.5, moves to the sub-node PM10 ≥ 35.5, and finally reaches the leaf node of air quality level 2.

3. When PM2.5 ≥ 49.5, PM10 < 74.5, and PM2.5 < 150.5, the air quality level is 2. That is, the test starts from the root node with PM2.5 ≥ 49.5, moves to the sub-node PM10 < 74.5, then moves recursively to the sub-node PM2.5 < 150.5, and finally reaches the leaf node of air quality level 2.

4. When PM2.5 ≥ 49.5, PM10 < 74.5, and PM2.5 ≥ 150.5, the air quality level is 3. That is, the test starts from the root node with PM2.5 ≥ 49.5, moves to the sub-node PM10 < 74.5, then moves recursively to the sub-node PM2.5 ≥ 150.5, and finally reaches the leaf node of air quality level 3.

5. When PM2.5 ≥ 49.5, PM10 ≥ 74.5, and PM10 < 114.5, the air quality level is 3. That is, the test starts from the root node with PM2.5 ≥ 49.5, moves to the sub-node PM10 ≥ 74.5, then moves recursively to the sub-node PM10 < 114.5, and finally reaches the leaf node of air quality level 3.

6. When PM2.5 ≥ 49.5, PM10 ≥ 74.5, PM10 ≥ 114.5, and PM10 < 149.5, the air quality level is 4. That is, the test starts from the root node with PM2.5 ≥ 49.5, moves to the sub-node PM10 ≥ 74.5, then moves recursively to the sub-nodes PM10 ≥ 114.5 and PM10 < 149.5, and finally reaches the leaf node of air quality level 4.

7. When PM2.5 ≥ 49.5, PM10 ≥ 74.5, PM10 ≥ 114.5, PM10 ≥ 149.5, and PM10 < 249.5, the air quality level is 5. That is, the test starts from the root node with PM2.5 ≥ 49.5, moves to the sub-node PM10 ≥ 74.5, then moves recursively to the sub-nodes PM10 ≥ 114.5, PM10 ≥ 149.5, and PM10 < 249.5, and finally reaches the leaf node of air quality level 5.

8. When PM2.5 ≥ 49.5, PM10 ≥ 74.5, PM10 ≥ 114.5, PM10 ≥ 149.5, and PM10 ≥ 249.5, the air quality level is 6. That is, the test starts from the root node with PM2.5 ≥ 49.5, moves to the sub-node PM10 ≥ 74.5, then moves recursively to the sub-nodes PM10 ≥ 114.5, PM10 ≥ 149.5, and PM10 ≥ 249.5, and finally reaches the leaf node of air quality level 6.
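The eight rules of Table 9 can be written as a single nested test. The sketch below is illustrative only: the function name is hypothetical, and the second test of rule 4 is read as PM2.5 ≥ 150.5, following that rule's walk-through of the tree.

```python
def air_quality_level(pm25, pm10):
    """Air quality level (1 to 6) predicted by the eight rules of Table 9."""
    if pm25 < 49.5:                       # root node
        return 1 if pm10 < 35.5 else 2    # rules 1 and 2
    if pm10 < 74.5:
        return 2 if pm25 < 150.5 else 3   # rules 3 and 4
    if pm10 < 114.5:
        return 3                          # rule 5
    if pm10 < 149.5:
        return 4                          # rule 6
    if pm10 < 249.5:
        return 5                          # rule 7
    return 6                              # rule 8


# Hypothetical readings, one per rule.
samples = [(30, 20), (30, 50), (60, 70), (160, 70),
           (60, 100), (60, 130), (60, 200), (60, 300)]
print([air_quality_level(pm25, pm10) for pm25, pm10 in samples])
# [1, 2, 2, 3, 3, 4, 5, 6]
```

Written this way, the rules make it visible that only PM2.5 and PM10 appear in the tree, and that once PM2.5 ≥ 49.5 and PM10 ≥ 74.5 the level is decided by PM10 alone.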

Analysis of Factors Influencing Air Quality
To illustrate the extent to which each factor influences air quality, Table 10 presents the IncNodePurity (increase in node purity) values of the influence factors. IncNodePurity is an evaluation measure that uses the non-negative sum of residuals to derive a value; this value indicates the extent to which each factor affects air quality. The influence factors are plotted in Figure 2. Table 10 and Figure 2 indicate that the influence factors rank as PM2.5 > PM10 > SO2 > CO > NO2 > O3. The IncNodePurity of PM2.5 is the largest, which suggests that PM2.5 has the greatest impact on air quality.

Air Quality Data Set Analysis of ROC and AUC
In our implementation, the area under the receiver operating characteristic (ROC) curve, that is, the AUC, is used to evaluate the performance of the proposed approach. The AUC ranges from 0 to 1, with larger values indicating better performance. As shown in Figure 3, the AUC for the air quality dataset is 0.968, demonstrating that the proposed algorithm performs well.

Conclusions
This paper has proposed an approach to urban air quality analysis and forecasting based on an intelligent algorithm with parameter optimization and decision rules. The proposed algorithm was tested on AQI data from Beijing. SA and DT were used together to achieve the best classification accuracy and to classify air quality with the obtained decision rules, and the combination was shown to be efficient for generating decision rules. In addition, the DT parameters minimum case (M) and pruning confidence factor (CF) were determined and applied automatically. This research provides a prediction model that can support efforts to improve air quality, and thereby to improve people's living environment and protect their health. In our implementation, the classification accuracy on the training data was 99.92%, the air quality impact factors ranked as PM2.5 > PM10 > SO2 > CO > NO2 > O3, and the AUC for the air quality dataset was 0.968. The simulation results show that the proposed algorithm performs better than the other approaches compared.
Further research will focus on the following aspects: (1) applying the simulated annealing algorithm to other data mining techniques (such as support vector machines and neural networks) to find the best parameters and improve their accuracy; (2) improving the algorithm, or combining it with the advantages of other algorithms, to conduct data mining and compare the results.