Air Quality Index and Air Pollutant Concentration Prediction Based on Machine Learning Algorithms

Abstract: Air pollution has become an important environmental issue in recent decades. Forecasts of air quality play an important role in warning people about and controlling air pollution. We used support vector regression (SVR) and random forest regression (RFR) to build regression models for predicting the Air Quality Index (AQI) in Beijing and the nitrogen oxides (NOX) concentration in an Italian city, based on two publicly available datasets. The root-mean-square error (RMSE), correlation coefficient (r), and coefficient of determination (R²) were used to evaluate the performance of the regression models. Experimental results showed that the SVR-based model performed better in the prediction of the AQI (RMSE = 7.666, R² = 0.9776, and r = 0.9887), and the RFR-based model performed better in the prediction of the NOX concentration (RMSE = 83.6716, R² = 0.8401, and r = 0.9180). This work also illustrates that combining machine learning with air quality prediction is an efficient and convenient way to solve some related environmental problems.


Introduction
Many environmental crises currently confront us: global warming, hazardous waste, resource depletion, air pollution, and many more [1][2][3][4]. Millions of people die every year from diseases caused by exposure to outdoor air pollution [5]. In China, the health risk from overexposure to particles is becoming an important public health concern [6]. China's capital, Beijing, has experienced rapid growth in its economy and energy consumption. Along with this rapid development, hazy weather caused by air pollution has become increasingly serious in Beijing. Polluted weather occurs often and lasts for a long time, although it has seemed to improve in the last two years [7].
The Air Quality Index (AQI) is an important indicator to reflect and evaluate air quality [8]. According to the Chinese Standard GB3095-2012 [9], the AQI is calculated from six major pollutants: fine particulate matter (PM2.5), ozone (O3), sulfur dioxide (SO2), inhalable particles (PM10), nitrogen dioxide (NO2), and carbon monoxide (CO) [10]. The AQI measures the overall quality of the air on a scale from 0 to 500 that is divided into six levels (good, moderate, lightly polluted, moderately polluted, heavily polluted, and severely polluted); these levels show the impact on human health and provide a good numerical reference for people's outdoor activities [11]. A low number means good air quality, while a higher number means worse air quality, which has ramifications for people's outdoor activities.
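The mapping from an AQI value to one of the six levels can be sketched as a simple lookup. This is a minimal illustration, not part of the study: the breakpoints (50, 100, 150, 200, 300) are an assumption based on the commonly published Chinese AQI bands and are not stated in the text above, and the function name `aqi_level` is hypothetical.

```python
from bisect import bisect_left

# Upper bounds of the first five levels. ASSUMPTION: these breakpoints follow
# the commonly published Chinese AQI bands; they are not given in the text.
_UPPER_BOUNDS = [50, 100, 150, 200, 300]

# The six level names listed in the text, from best to worst air quality.
_LEVELS = [
    "good",
    "moderate",
    "lightly polluted",
    "moderately polluted",
    "heavily polluted",
    "severely polluted",
]


def aqi_level(aqi):
    """Return the level name for an AQI value on the 0-500 scale."""
    if not 0 <= aqi <= 500:
        raise ValueError("AQI must lie between 0 and 500")
    # bisect_left counts how many upper bounds are strictly below the value,
    # which is exactly the index of the level the value falls into.
    return _LEVELS[bisect_left(_UPPER_BOUNDS, aqi)]
```

For example, under these assumed breakpoints `aqi_level(120)` falls into the "lightly polluted" band.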
Governing and solving air pollution problems is a long-term process. Air quality forecasting can help with the prevention of damage caused by air pollution. Therefore, it is necessary for air quality forecasting to occur in a timely manner, to allow government departments and the public to take protective measures and prevent serious pollution incidents. For instance, some plants in Beijing, such as coal-fired power plants and coking plants, have shut down temporarily as a result of the predicted air quality [12].
Machine learning is the scientific study of algorithms and statistical models that computer systems use to make predictions or decisions without being explicitly programmed to perform the task [13]. Machine learning has gained tremendous popularity for its powerful and fast predictions with big data [14]. Big data means enormous datasets, including masses of unstructured data that need more real-time analysis to help us to gain an in-depth understanding of their hidden values [15]. Some researchers have successfully applied machine learning algorithms to the short- and long-term prediction of air quality [16][17][18][19]. Pérez et al. predicted the hourly concentration of PM2.5 in Santiago using a multilayer neural network [20]. Focusing on the problem of poor prediction, the authors discussed using a larger dataset for improvements. Zhu et al. proposed two hybrid models (empirical mode decomposition (EMD)-support vector regression (SVR) hybrid and EMD-intrinsic mode functions (IMF) hybrid) to forecast AQI in Xingtai, with the EMD-SVR hybrid model achieving the highest overall accuracy of 80% [11]. However, in the experiment, the authors used only a single indicator (past AQI data) to predict the present AQI, and ignored the correlation between different pollutants (such as PM2.5 and PM10). Corani et al. used feed-forward neural networks to predict ozone and PM10 in Milan [21]. Although the predictions showed satisfactory reliability, the model still tended toward overfitting, as the authors reported. Biancofiore et al. used a recursive neural network model to forecast PM10 concentration 1, 2, and 3 days ahead [22]. The recursive neural network model correctly forecast 95% of the days. However, the percentage of false positives was 30%, which highlights the limits of the neural network model in simulating concentration peaks. Fuller et al. devised an empirical model to predict concentrations of PM10 at background and roadside locations in London [23]. However, the performance of the model relied heavily on the currently observed ratio between NOX and PM10.
Motivated by prior studies, we first explored the correlation of various air indicators, such as the AQI, the concentration of PM2.5, and the concentration of total nitrogen oxides (NOX). Second, we used support vector regression (SVR) and random forest regression (RFR) to build prediction models. Finally, we used RMSE, the correlation coefficient r, and the coefficient of determination R² to evaluate the performance of the regression models.
The rest of this paper is organized as follows. Details about the datasets and pattern recognition methods (SVR and RFR) are briefly reviewed in Section 2. Section 3 shows the results of the models' prediction, along with corresponding discussion. Conclusions are drawn in Section 4.

Area of Investigation and Datasets
In this study, two publicly available datasets were used. The first dataset, the Beijing Air Quality Dataset (December 2013 to August 2018), is from the Beijing Municipal Environmental Monitoring Center [24]. The dataset has 1738 instances. Each instance consists of hourly averaged AQI and the concentrations for PM2.5, O3, SO2, PM10, and NO2 in Beijing, provided by an officially certified analyzer.
The second dataset comes from an air quality recording that contains the responses of a gas multi-sensor device deployed in the field in an Italian city. The dataset [3,25,26] contains 9358 instances of hourly averaged responses from an array of five metal oxide chemical sensors embedded in an air quality chemical multi-sensor device. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of an air quality chemical sensor device deployed in the field. Hourly averaged concentrations for CO, non-methane hydrocarbons, benzene, NOX, and NO2 were provided by a co-located reference certified analyzer. Missing values are tagged with the value −200. For the purposes of this study, we focus on NOX prediction, using the second dataset. NOX is another important indicator for air quality evaluation. NOX emissions have been linked to acid rain, photochemical smog, and tropospheric ozone destruction [27]. Moreover, when nitrogen oxides are inhaled by the human body, they disrupt the alveolar structures and their function in the lungs, posing a great threat to human health [28].
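Because missing values in the second dataset are tagged with −200, they need to be masked before any statistics or models are computed. The following is a minimal sketch of one way to do this in plain Python; the helper names `mask_missing` and `drop_incomplete` are hypothetical, and dropping incomplete rows is only one possible policy (the text does not say how the missing values were handled).

```python
MISSING = -200  # sentinel used for missing values, as stated in the text


def mask_missing(rows, sentinel=MISSING):
    """Replace the sentinel with None so it cannot be mistaken for a reading."""
    return [[None if value == sentinel else value for value in row] for row in rows]


def drop_incomplete(rows):
    """Keep only rows in which every value is present.

    ASSUMPTION: discarding incomplete rows is an illustrative choice, not
    necessarily the preprocessing used in the study.
    """
    return [row for row in rows if all(value is not None for value in row)]


# Toy example with two columns (values are illustrative, not real data).
raw = [[2.6, 166.0], [-200, 103.0], [2.2, -200]]
clean = drop_incomplete(mask_missing(raw))
```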

Support Vector Machines (SVMs)
Support vector machines (SVMs) are supervised learning models, with associated learning algorithms that analyze the data used for classification and regression analyses [29,30].
A type of SVM for regression, support vector regression (SVR), was originally proposed by Vapnik and his coworkers [28]. In SVR, the set of training data includes predictor variables and observed response values. The goal is to find a function f(x) that deviates from y_n (the sample labels) by a value no greater than ε (the permitted deviation) for each training point x, while remaining as flat as possible. Therefore, SVR is also known as tube regression. Its schematic diagram is shown in Figure 1.
According to the literature on SVM linear regression [28], the solution to SVR is as follows:

$$f(x) = \sum_{n=1}^{N} \left(a_n - a_n^{*}\right) x_n^{T} x + b \qquad (1)$$

where $x$ is the input feature vector, $b$ is the distance parameter, $a_n$ and $a_n^{*}$ are the introduced Lagrange multipliers, and $N$ is the number of training points. However, some regression problems cannot be described adequately using a linear model. In that case, we can obtain a nonlinear SVR model by replacing the dot product $x_n^{T} x$ with a nonlinear kernel function $K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle$, where $\phi(x)$ is a transformation that maps $x$ to a high-dimensional space. Therefore, the final solution to nonlinear SVR can be obtained as:

$$f(x) = \sum_{n=1}^{N} \left(a_n - a_n^{*}\right) K(x_n, x) + b \qquad (2)$$
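To make the nonlinear SVR solution described above concrete, the sketch below evaluates the kernel expansion for a single input. It is illustrative only: it assumes the usual radial basis function form K(x1, x2) = exp(−γ‖x1 − x2‖²), takes the differences of the Lagrange multipliers as already known (training is not shown), and the names `rbf_kernel` and `svr_predict` are hypothetical.

```python
import math


def rbf_kernel(x1, x2, gamma):
    """RBF kernel. ASSUMPTION: the standard form exp(-gamma * ||x1 - x2||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x1, x2))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, b, gamma):
    """Evaluate f(x) = sum_n (a_n - a_n*) K(x_n, x) + b.

    dual_coefs[n] holds the difference (a_n - a_n*) for training point x_n.
    Only points with a nonzero difference (the support vectors) contribute.
    """
    return sum(
        coef * rbf_kernel(sv, x, gamma)
        for sv, coef in zip(support_vectors, dual_coefs)
    ) + b
```

When x coincides with a single support vector, the kernel equals 1 and the prediction reduces to that vector's coefficient plus b; far from every support vector, the kernel terms vanish and the prediction approaches b.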

Random Forest (RF)
Random forests (RFs), or random decision forests, are an ensemble learning method for classification, regression, and other tasks. An RF operates by constructing multiple decision trees at training time, and outputting the class representing the mode of classes (classification) or the mean prediction (regression) of individual trees [31].
The RF algorithm incorporates growing classification and regression trees (CARTs). Each CART is built using random vectors. For the RF-based model, the main parameters were the number of decision trees, as well as the number of features (NF) in the random subset at each node in the growing trees. During model training, the number of decision trees was determined first. A larger number of trees is better, but takes longer to compute. A lower NF leads to a greater reduction in variance, but a larger increase in bias. NF can be defined using the empirical formula NF = √M, where M denotes the total number of features [32].
RF can be applied to classification and regression problems, depending on whether the trees are classification or regression trees. The regression model is shown in Figure 2. Assuming that the model includes T regression trees (learners) for regression prediction, the final output of the regression model is

$$H(x) = \frac{1}{T} \sum_{i=1}^{T} h_i(x) \qquad (3)$$

where T is the number of regression trees, and $h_i(x)$ is the output of the i-th regression tree ($h_i$) on sample x. Therefore, the prediction of the RF is the average of the predicted values of all the trees.
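The averaging step and the empirical feature-count rule can be sketched as follows. This is a minimal illustration rather than a full random forest: each tree is represented by any callable that returns a number, tree construction is not shown, and rounding √M down to an integer is an assumption (the text does not specify a rounding rule). The names `forest_predict` and `default_n_features` are hypothetical.

```python
import math


def forest_predict(trees, x):
    """Average the outputs h_i(x) of all regression trees for one sample x."""
    predictions = [tree(x) for tree in trees]
    return sum(predictions) / len(predictions)


def default_n_features(total_features):
    """Empirical rule NF = sqrt(M).

    ASSUMPTION: the square root is rounded down, with a minimum of one feature.
    """
    return max(1, math.isqrt(total_features))


# Three stand-in "trees" (simple functions, purely illustrative).
toy_trees = [lambda x: x + 1.0, lambda x: x + 2.0, lambda x: x + 6.0]
toy_prediction = forest_predict(toy_trees, 10.0)
```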


Results and Discussion
In this work, SVR and RFs were used to build prediction models for the AQI of Beijing and the NOX of an Italian city, respectively. RMSE, r, and R² were used to evaluate the performance of the regression models. To obtain a good regression model, the following criteria could be used as references: (1) low RMSE, (2) high r, and (3) high R². The mathematical expressions of the parameters are shown in Equations (4)-(6), where y_i is the label of the i-th sample, ŷ_i is the predicted value of the i-th sample, and the superscript horizontal line indicates the average value. RMSE represents the sample standard deviation of the differences between the predicted values and the observed values. The correlation coefficient (r) quantifies the strength and direction of the linear relationship between the predicted and observed values. The coefficient of determination (R²) is the proportion of the variance in the dependent variable that is predictable from the independent variable(s) [33].
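The three evaluation metrics can be sketched in plain Python as follows. The sketch assumes the standard textbook definitions: RMSE as the square root of the mean squared difference, r as the Pearson correlation between observed and predicted values, and R² as one minus the ratio of the residual sum of squares to the total sum of squares. The function names are hypothetical, and the exact forms used in the study may differ.

```python
import math


def rmse(y, y_hat):
    """Root-mean-square error between observed y and predicted y_hat."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def pearson_r(y, y_hat):
    """Pearson correlation coefficient between observed and predicted values."""
    mean_y = sum(y) / len(y)
    mean_hat = sum(y_hat) / len(y_hat)
    covariance = sum((a - mean_y) * (b - mean_hat) for a, b in zip(y, y_hat))
    spread_y = sum((a - mean_y) ** 2 for a in y)
    spread_hat = sum((b - mean_hat) ** 2 for b in y_hat)
    return covariance / math.sqrt(spread_y * spread_hat)


def r_squared(y, y_hat):
    """Coefficient of determination. ASSUMPTION: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot
```

Note that a constant offset in the predictions leaves r unchanged but raises RMSE and lowers R², which is one reason to report all three together.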
We know that the AQI and the concentration of air pollutants (CAP) are affected by many factors. In particular, when gas diffusion conditions are poor, the concentration of various pollutants increases, so there are correlations between the concentrations of the pollutants. Before the regression model was established, we studied the correlation of various indicators.
Figure 3a,b shows the correlation between the 6 and 11 air pollution indicators, respectively; we found that these indicators are highly correlated. Therefore, when one or more indicators are missing, it is feasible to use the remaining indicators to predict the missing ones. Taking the AQI prediction and NOX concentration prediction as examples, we constructed a regression model to verify the feasibility of our proposed method.

Air Quality Index Prediction of Beijing
In this experiment, we took the AQI of Beijing as the regression target. The data from the first four years of the dataset were used to train the model, and the data from the last year were used for the model testing.
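A chronological split of this kind, in which earlier records train the model and later records test it, can be sketched as below. The record layout (a date followed by a target value), the cutoff date, and the function name `chronological_split` are all hypothetical; the text does not state the exact split date.

```python
from datetime import date


def chronological_split(records, cutoff):
    """Split time-stamped records into (train, test) around a cutoff date.

    Records dated before the cutoff go to training; the rest go to testing.
    Nothing is shuffled, so the test set always lies in the future of the
    training set.
    """
    train = [record for record in records if record[0] < cutoff]
    test = [record for record in records if record[0] >= cutoff]
    return train, test


# Toy records: (date, AQI). The values and the cutoff are illustrative only.
toy_records = [
    (date(2014, 1, 1), 80),
    (date(2016, 6, 1), 120),
    (date(2018, 3, 1), 60),
]
toy_train, toy_test = chronological_split(toy_records, date(2017, 9, 1))
```

Splitting by time rather than at random avoids evaluating the model on periods that are interleaved with its training data.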
For the SVR-based model training, the radial basis function (RBF) was chosen as the kernel function. The kernel parameter gamma (γ) and the penalty parameter (C) were selected by a grid search method. The experimental results showed that the best combination for the SVR model was C = 10 and γ = 0.1. Table 1 summarizes the statistical parameters of the SVR regression model for AQI prediction. The statistical parameters for the testing set were as follows: r = 0.9887, R² = 0.9776, and RMSE = 7.666. These results demonstrate that the SVR-based method was well suited to regression on the AQI.
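A grid search over C and γ simply evaluates every combination of candidate values and keeps the one with the lowest validation error. The sketch below shows that loop in isolation. The candidate grids and the `validation_error` callback are hypothetical placeholders: in practice the callback would train an SVR with the given parameters and return its error on held-out data, and the grids actually searched in the study are not stated.

```python
from itertools import product


def grid_search(c_values, gamma_values, validation_error):
    """Return (C, gamma, error) for the combination with the lowest error.

    validation_error(C, gamma) is a caller-supplied function; here it stands
    in for "train a model with these parameters and measure its error".
    """
    best = None
    for c, gamma in product(c_values, gamma_values):
        error = validation_error(c, gamma)
        if best is None or error < best[2]:
            best = (c, gamma, error)
    return best


def toy_error(c, gamma):
    """Stand-in error surface with its minimum at C = 10, gamma = 0.1."""
    return (c - 10) ** 2 + (gamma - 0.1) ** 2


# Hypothetical candidate grids, chosen only for illustration.
best_c, best_gamma, best_error = grid_search(
    [1, 10, 20, 100], [0.01, 0.1, 0.15, 1.0], toy_error
)
```

The cost grows with the product of the grid sizes, since every pair requires one full training run.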
In this experiment, for the RF-based model, 100 regression trees were used to build the regression model; NF was defined using the empirical formula (NF = √M) mentioned earlier. The statistical parameters (r = 0.9823, R² = 0.9633, and RMSE = 9.602) of the RFR-based model are also shown in Table 1.
As shown in Figure 4, the AQI of the testing samples was estimated using SVR and RFR, with the x-axis denoting the sequence number of the testing samples and the y-axis (target value) representing the AQI of Beijing. In Figure 4a,b, the line charts of the actual values and the values predicted using the two regression models show that both models reflect the AQI very well. In particular, the SVR-based model shows better performance.


NOX Prediction in an Italian City
In this experiment, we took the NOX concentration of an Italian city as the regression target. Data from the last three months were used to test the model, and the remaining data were used for the model training.
For the SVR-based model training, RBF was also chosen as the kernel function. The kernel parameter gamma (γ) and the penalty parameter (C) were also selected by a grid search method. The experimental results showed that the best combination for the SVR model was C = 20 and γ = 0.15. Table 1 summarizes the statistical parameters of the SVR regression model for the NOX prediction. In the testing set, the statistical parameters were as follows: r = 0.8923, R² = 0.7960, and RMSE = 94.4918.
For the RF-based model in our experiment, we also used 100 regression trees to build the regression model. The model met the criteria for good performance, with r = 0.9180, R² = 0.8401, and RMSE = 83.6716, which are also shown in Table 1. The results showed that the RFR-based method was better suited to regression on the NOX.
As shown in Figure 5, the NOX of the testing samples was estimated using SVR and RFR, with the x-axis denoting the sequence number of the testing samples and the y-axis (target value) representing the NOX of the Italian city. In Figure 5, the line chart of the actual values and the values predicted using different regression models demonstrates that the regression model based on RFR is a better predictor than the model based on SVR.


Conclusions
Accurate air quality forecasting has important theoretical and practical value for the public; without it, neither the government nor the public can effectively avoid the health damage caused by air pollution or improve the emergency response capability for heavy pollution days. In this study, we built regression models to predict air indicators based on machine learning algorithms, taking the AQI prediction of Beijing and the pollutant concentration prediction of an Italian city as examples.

Figure 3. The matrix of correlation coefficients for air pollution indicators of (a) Beijing and (b) an Italian city.

Figure 4. Scatterplots of the actual values and the predicted values: (a) SVR model, (b) RFR model.


Table 1. The statistical parameters of the experiments for the Air Quality Index (AQI) and total nitrogen oxide (NOX) prediction.
