Article

An Air Pollutant Forecast Correction Model Based on Ensemble Learning Algorithm

School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450002, China
*
Authors to whom correspondence should be addressed.
Electronics 2023, 12(6), 1463; https://doi.org/10.3390/electronics12061463
Submission received: 6 February 2023 / Revised: 13 March 2023 / Accepted: 15 March 2023 / Published: 20 March 2023
(This article belongs to the Special Issue Advanced Techniques in Computing and Security)

Abstract:
In recent years, air pollutants have become an important issue in meteorological research and an indispensable part of air quality forecasting. To improve the accuracy of the Chinese Unified Atmospheric Chemistry Environment (CUACE) model’s air pollutant forecasts, this paper proposes a solution based on ensemble learning. Firstly, the forecast results of the CUACE model and the corresponding monitoring data are extracted. Then, using feature analysis, we screen the correction factors that affect air quality. The random forest algorithm, XGBoost algorithm, and GBDT algorithm are employed to correct the prediction results of PM2.5, PM10, and O3. To further optimize the model, we introduce the grid search method. Finally, we compare and analyze the correction effect and determine the best correction model for the three air pollutants. This approach enhances the precision of the CUACE model’s forecast and improves our understanding of the factors that affect air quality. The experimental results show that the model has a better prediction error correction effect than the traditional machine learning statistical model. After the algorithm correction, the prediction accuracy of PM2.5 and PM10 is increased by 60%, and the prediction accuracy of O3 is increased by 70%.

1. Introduction

In recent years, worsening atmospheric pollution has damaged public health [1], and the demands on the accuracy and refinement of air quality forecasting have grown accordingly. With the rapid development of Internet technology, the accuracy of numerical weather prediction has also been increasing [2], making it a key method in current weather forecasting. In particular, the Chinese Unified Atmospheric Chemistry Environment (CUACE) air quality model developed by the Chinese Academy of Meteorological Sciences provides a common computing platform for forecasting air quality and atmospheric composition changes [3] with high prediction accuracy. Despite continuous improvements in numerical prediction techniques, air quality models still differ because of the large uncertainty in emission sources, initial conditions, and physical parameters. These discrepancies can undermine the credibility of prediction results, as different air quality models produce varying forecasts [4]. Forecast correction and ensemble prediction methods can, to a certain extent, effectively make up for the shortcomings of single-model prediction [5,6,7,8].
Zhengzhou lies in the North China Plain. As the core city of the Central Plains urban agglomeration, it emits large amounts of PM2.5, PM10, NOx, O3, and other pollutants, and its air pollution is serious. Moreover, Henan province's industrial structure is dominated by heavy industry, its energy mix relies on coal, its freight relies on highways, and the pollution arising from this extensive development mode is complex. In the past, the use of the CUACE model for predicting the atmospheric environment in Henan Province has received limited attention. Correction methods for the CUACE model have primarily focused on correcting systematic model deviation, and there has been little research on combining meteorological elements with machine learning algorithms for correction [9].
Previous research has shown that using an ensemble learning method in an air pollutant forecast correction model can enhance the accuracy of the automatic decision module [10]. Ensemble learning builds accurate models for a wide range of machine learning problems by combining multiple base classifiers. Such a model can solve many problems that a single model cannot, and it fits well into many mature machine learning frameworks [11]. An ensemble learning correction model can significantly reduce the generalization error, and a combined module with a stronger prediction correction effect can be selected to improve the robustness and accuracy of the whole model. Abstract interpretation and refinement techniques have also been applied to verify gradient-boosted decision trees and random forests [12].
An evaluation of the CUACE model's predictions shows that its forecasting skill for air quality varies by region. The ensemble learning correction model achieves a better correction effect and higher accuracy than traditional statistical correction methods such as linear regression correction and bias elimination correction [13]. Therefore, given the deviation between the CUACE model's predicted values of PM2.5, PM10, and O3 and the observed values, this paper applies the ensemble learning algorithm to air pollutant forecasting and constructs a forecast correction model by screening the factors that affect the forecast results.
The main contributions of our work are summarized as follows:
(1)
The random forest [14], the gradient boosting tree, and the XGBoost algorithm were used to construct the ensemble learning correction model [15,16]. The training model was built to correct predictions, which further improved air quality prediction accuracy.
(2)
The grid search data method [17] was applied to the model and automatically matched the relationship coefficients of each air pollutant prediction result’s influencing factors [18]. This resulted in a correction model based on ensemble learning that achieves better correction results.
The remainder of this article is organized as follows. Section 2 describes the related work of other scholars on prediction error correction and explains their contributions. In Section 3, the main research methods including bagging and boosting are briefly introduced. Section 4 describes the composition of the ensemble learning correction model. The results and evaluation are provided in Section 5. Finally, Section 6 presents the conclusion of this work.

2. Related Work

To further improve forecast quality, some scholars have applied a variety of bias correction methods to model outputs and achieved good results. Cheng et al. [19] used an adaptive partial least squares regression method to correct PM2.5 concentration predictions from the CUACE model. The correction significantly improved the accuracy of PM2.5 concentration prediction: the correlation coefficient between corrected and measured values rose markedly, the correction error fell, and the accuracy of the model reached 65–70%. In a separate study, Chen et al. [20] used a descending averaging method and a rolling bias correction method to correct the AQI forecast values of the CUACE model for each district and county of Ningbo. After correction, the accuracy in each district and county exceeded 60%, with no significant difference between the two methods. Both methods significantly reduced the three error indexes, namely standard error, normalized deviation, and average deviation, and brought the corrected forecasts closer to the actual situation. He et al. [21] used an error rolling linear regression correction method to test and correct the CUACE model's predictions of six pollutants in Lanzhou; this method improved the accuracy of pollutant level prediction by 8.7–75%.
Some scholars have used machine learning and deep learning techniques to revise numerical weather forecast results. Zhang et al. [22] used an ensemble deep learning method combining four popular machine learning statistical models—the random forest model, generalized linear model, gradient boosting model, and deep neural network model—to correct the error of the original PM2.5 forecasts from the CMAQ model. After correction, the coefficient of determination (R2) increased by 60–160% compared with the original forecast, and the root mean square error (RMSE) decreased by about 40%. Sun et al. [23] used three machine learning methods, LASSO regression, random forest, and deep learning, to correct the near-surface 10 m wind speed in North China predicted by the numerical weather prediction model of ECMWF. The results corrected by the three machine learning methods were compared with the results predicted by the ECMWF model, and all three corrections outperformed the raw ECMWF predictions. This experiment reflects the great advantages of machine learning methods in numerical weather prediction correction.
Summarizing the above related work, machine learning methods improve the correction of CUACE model forecasts considerably compared with traditional statistical algorithms. Table 1 summarizes these studies.

3. Methodology

Because existing error correction models correct forecasts only coarsely, they lack accuracy to a certain degree. To enhance the accuracy of air pollutant prediction and generate more accurate air quality forecasts, this paper employs an ensemble learning algorithm to improve the correspondence between the CUACE model's predictions and the observed data.

3.1. Bagging Ensemble Method

The random forest (RF) is a classifier that combines multiple decision trees to train on and predict samples. It uses resampling to draw multiple bootstrap samples from the original sample, models each with a decision tree, and then combines the predictions of the trees by voting to obtain the final result. Specifically, a traditional decision tree selects an optimal attribute from the full attribute set of the current node when choosing the splitting attribute (assuming there are d attributes). In an RF, for each node of a base decision tree, k attributes are first randomly selected from the node's attribute set, and the optimal splitting attribute is then chosen from this subset. In general, k = log2(d) is recommended for d features.
The random forest algorithm can process very high-dimensional data without feature selection. After the training is completed, key feature attributes can be selected. The training results can be visualized to facilitate the analysis of experimental results.
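As a hedged sketch (synthetic data and hyperparameters of our own choosing, not the paper's dataset), the k = log2(d) attribute-subsetting rule maps directly onto scikit-learn's `max_features="log2"` option:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the forecast/observation features (16 attributes, so
# each split draws log2(16) = 4 candidate attributes at random).
X, y = make_regression(n_samples=200, n_features=16, noise=0.1, random_state=0)

rf = RandomForestRegressor(n_estimators=100, max_features="log2", random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # training-set R^2
```

Each tree is grown on a bootstrap resample, and the forest averages the trees' outputs, mirroring the aggregation scheme described above.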

3.2. Boosting Ensemble Method

The Gradient Boosting Decision Tree (GBDT) is a gradient boosting algorithm built from multiple decision trees; it uses regression trees rather than classification trees. Each new round of training aims to improve on the previous result: each tree learns the residual of the sum of all previous trees' outputs, i.e., the difference between the actual value and the predicted value.
The gradient boosting tree is an optimization algorithm for solving the boosting tree when the loss function takes a general form; it is therefore called the gradient boosting algorithm. It uses an approximation based on the steepest descent method, the key being the negative gradient of the loss function evaluated at the current model, defined as Equation (1):
r_{mi} = -\left[ \frac{\partial L(y_i, f(x_i))}{\partial f(x_i)} \right]_{f(x) = f_{m-1}(x)}
This negative gradient serves as an approximation of the residual in the boosting tree algorithm for regression, and a regression tree is fitted to it. Different loss functions yield different gradient expressions, as shown in Table 2.
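For the squared loss, the negative gradient defined above reduces exactly to the residual y − f(x), so the boosting loop can be sketched by hand as follows (toy sine-wave data and hypothetical hyperparameters of our own choosing):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

# Gradient boosting by hand: for L = (y - f)^2 / 2 the negative gradient
# -dL/df equals the residual y - f, so each new tree is fitted to residuals.
f = np.full_like(y, y.mean())   # f_0: constant initial model
learning_rate = 0.1
for m in range(50):
    residual = y - f                                  # negative gradient at f_{m-1}
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    f += learning_rate * tree.predict(X)

print(np.mean((y - f) ** 2))  # training MSE falls as trees are added
```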
XGBoost (XGB) is an open-source machine learning project developed by Chen Tianqi et al. [24]; it is an extreme version of the GBDT. In contrast to the GBDT, it applies a second-order Taylor expansion to the loss, adds a penalty term, automatically handles missing-value features, and supports parallel processing.
XGB constitutes a great improvement on the gradient-boosting decision tree algorithm. The XGBoost algorithm adds a penalty term to the objective function to balance the complexity of the model. The objective function expression can be denoted by Equation (2):
Obj(\theta) = \sum_{i=1}^{m} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)
where Obj(θ) is the objective function; i indexes the samples, of which there are m in total; y_i is the observed value and \hat{y}_i the predicted value for sample i; L is the loss function; k indexes the decision trees, of which there are K in total; and \Omega(f_k) is the regularization term, whose expression is shown in Equation (3):
\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2
where T represents the number of leaf nodes in each decision tree model, γ and λ represent coefficients, and ω represents the set of scores of leaf nodes in each decision tree model.
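As a worked instance of the penalty term (leaf scores and coefficients are hypothetical, chosen only for illustration):

```python
import numpy as np

def xgb_regularization(leaf_scores, gamma=1.0, lam=1.0):
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, where T is the number
    of leaves and w the vector of leaf scores."""
    w = np.asarray(leaf_scores, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)

# A hypothetical tree with three leaves scoring 2, -1, and 0.5:
print(xgb_regularization([2.0, -1.0, 0.5]))  # 1*3 + 0.5*1*(4 + 1 + 0.25) = 5.625
```

Larger gamma penalizes trees with many leaves; larger lambda shrinks leaf scores toward zero.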
When training the model, an additive strategy is used: in round t, f_t is added to the model to minimize the objective function shown in Equation (4):
Obj^{(t)} = \sum_{i=1}^{n} l\left( y_i, \hat{y}_i^{(t-1)} + f_t(x_i) \right) + \Omega(f_t)
where Obj^{(t)} is the round-t objective function, \hat{y}_i^{(t-1)} is the prediction of the first t − 1 trees for sample x_i (a constant in round t), f_t(x_i) is the output of the t-th decision tree for x_i, and \Omega(f_t) is the regularization term of the t-th round.
Specifically, compared with the traditional GBDT algorithm, the XGBoost algorithm is more powerful and exposes more parameters. The booster parameter specifies the type of weak learner; this paper uses its default value. The learning_rate parameter sets the learning rate of the model; this paper uses a value of 0.01. As with the GBDT algorithm, this paper uses 100 decision trees and a maximum decision tree depth of 3 as the default values.

4. Ensemble Learning Correction Model

The model in this paper combines base classifier, correction factor screening, grid search data, analysis of the generalization error of the prediction model, and model tuning to form a learning correction model. The specific model is shown in Figure 1.

4.1. Base Classifier

Ensemble learning constructs a set of base classifiers from the training data and classifies by aggregating the output of each base classifier. The ensemble learning correction model uses three base classifiers: a Bayes classifier, a C4.5 decision tree classifier, and an MLR multiple linear regression classifier.
The model thus consists of three weak classifiers. The prediction correction opinions of the three base classifiers are evaluated and then integrated. The classifier integration of the ensemble learning correction model is shown in Figure 2.
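A minimal sketch of this integration, using scikit-learn stand-ins for the three base learners (BayesianRidge for the Bayes learner, DecisionTreeRegressor for C4.5, LinearRegression for MLR — these substitutions are our assumption, since the correction task here is regression):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=8, noise=0.2, random_state=0)

# VotingRegressor averages the three base learners' predictions, mirroring
# the aggregation of base-classifier opinions described above.
ensemble = VotingRegressor([
    ("bayes", BayesianRidge()),
    ("tree", DecisionTreeRegressor(max_depth=5, random_state=0)),
    ("mlr", LinearRegression()),
])
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))
```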

4.2. Correction Factor Screening

The ensemble learning correction model screens the factors that have the greatest impact on the forecast to construct the forecast correction model. Factor importance is measured by using the random forest to compute the out-of-bag data error.
The resulting order of factor importance is as follows: (1) the factors with the biggest impact on the PM2.5 and PM10 forecasts are visibility and relative humidity; (2) the factors with the biggest impact on the O3 forecast are temperature and mixed-layer height; (3) for the O3 forecast, the forecast values, starting times, and observed values have little effect on the model. The correction factor ranking for the ensemble learning correction model is shown in Table A1.
Based on the feature analysis, the factors with importance less than 0.02 are eliminated; considering that the prediction results of some factors in the actual business are not easy to obtain, the factors shown in Figure 3, Figure 4 and Figure 5 are finally determined by screening.
For different atmospheric pollutants, the importance of their characteristic factors may be different, and the results of selecting factors are also different.
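The screening step can be sketched as follows. Note that the paper ranks factors by out-of-bag permutation error, whereas this sketch uses scikit-learn's impurity-based `feature_importances_` as a stand-in, with the same 0.02 cutoff; the feature names are hypothetical echoes of Table A1 and the data are synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

names = ["VIS_fc", "RHU_fc", "T850_fc", "TEM_fc",
         "WIN_S_fc", "PRE_fc", "noise1", "noise2"]
X, y = make_regression(n_samples=300, n_features=8, n_informative=4,
                       random_state=0)

rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# Keep only factors whose importance reaches the 0.02 threshold.
kept = [n for n, imp in zip(names, rf.feature_importances_) if imp >= 0.02]
print(kept)
```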

4.3. Grid Search Data

The ensemble learning correction model uses grid search to exhaustively try the specified parameter values. The numbers of decision trees and the learning rates in the ensemble learning algorithm are arranged and combined, and all possible combinations are listed to generate a 'grid'. Each combination is then fed into the ensemble learning correction model, assigned to the different algorithms for training, and its performance is evaluated by cross-validation. After the fitting function has tried all parameter combinations, a suitable classifier is returned, automatically tuned to the best parameter combination.
The parameter values of the ensemble learning correction model are searched through a dedicated grid search library function, which can search the specified classifier parameters in detail. The cross-validation generator of the ensemble learning correction model defaults to five folds.
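In scikit-learn terms (the paper cites scikit-learn [9] but does not name the exact function, so this mapping is our assumption), the procedure corresponds to GridSearchCV with its default five folds; the grid below is a small illustrative one, not the paper's full grid:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=8, noise=0.3, random_state=0)

# Exhaustive search over tree counts and learning rates with 5-fold CV.
param_grid = {"n_estimators": [100, 200], "learning_rate": [0.1, 0.2]}
search = GridSearchCV(GradientBoostingRegressor(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```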

4.4. Optimizing Model

Ensemble learning correction model tuning is mainly based on the two parameters of the number of decision trees and the maximum depth of the tree. Because there are different parameters in different algorithm functions, they are tested separately. The optimal parameters of the corresponding algorithm functions are given below.
In the random forest algorithm, the value range of the decision tree is [100, 1100], the step length is 200, the value range of the maximum tree depth is [10, 40], and the step length is 5. The three elements of PM2.5, PM10, and O3 are trained and modeled separately. Each element has 60 models, and a total of 180 models are generated. The optimal matching parameters are shown in Table 3.
For the gradient boosting tree (GBDT) algorithm, the integrated learning correction model sets the learning rate range as [0.1, 0.3], the step size as 0.1, the value range of the decision tree as [100, 900], the step size as 100, the maximum tree depth as [5, 8], and the step size as 1. The three elements of PM2.5, PM10, and O3 were trained and modeled, respectively, with 540 models for each element, resulting in a total of 1620 models. The learning rate of the gradient boosting tree in the model is set to 0.1, and the optimal matching parameters are shown in Table 4.
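As a sanity check on the model count, the GBDT grid above has 3 × 9 × 4 = 108 parameter combinations, which accounts for the 540 models per element if each combination is fitted once per fold under five-fold cross-validation (that accounting is our assumption):

```python
from sklearn.model_selection import ParameterGrid

# The GBDT grid from Section 4.4: learning rate [0.1, 0.3] step 0.1,
# trees [100, 900] step 100, depth [5, 8] step 1.
grid = ParameterGrid({
    "learning_rate": [0.1, 0.2, 0.3],
    "n_estimators": list(range(100, 901, 100)),
    "max_depth": [5, 6, 7, 8],
})
print(len(grid), len(grid) * 5)  # 108 combinations -> 540 fits with 5-fold CV
```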
The XGB algorithm optimizes the three parameters of the number of decision trees, random sampling ratio, and learning rate random sampling column ratio. The optimal matching parameters are shown in Table 5.

5. Results and Evaluation

5.1. Experimental Environment

The experimental environment and its configuration are shown in Table 6.

5.2. Dataset

The data set used in the experiment consists of the CUACE model's forecast results for Zhengzhou City. The Zhengzhou air quality data set contains 7249 samples, and its columns include:
(1)
The actual elements of the reporting time: PM2.5 concentration, PM10 concentration, O3 concentration, air pressure, temperature, relative humidity, 10-m wind direction, 10-m wind speed, visibility, precipitation, 3-h pressure change, 24-h pressure change, 24-h temperature change, inversion, mixed layer height, 850 hPa temperature, 925 hPa wind speed.
(2)
The actual elements of the revised time: PM2.5 concentration, PM10 concentration, O3 concentration.
(3)
The forecast elements of the correction time: PM2.5 concentration, PM10 concentration, O3 concentration, air pressure, temperature, relative humidity, 10 m wind direction, 10 m wind speed, visibility, precipitation, 3 h variable pressure, 24 h variable pressure, 24 h variable temperature, temperature inversion, mixed layer height, 850 hPa temperature, 925 hPa wind speed.
The experiment randomly divided the data set according to the ratio of 8:2, with 80% as the training set and 20% as the test set.
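The 8:2 split can be reproduced with scikit-learn's train_test_split (the random seed is hypothetical; the feature values here are dummies standing in for the real columns):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 7249 samples, as in the Zhengzhou data set.
X = np.zeros((7249, 3))
y = np.zeros(7249)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))  # 5799 1450
```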

5.3. Data Pre-Processing

Based on the analysis of influencing factors using the attribute values of each meteorological observation, the pollutant concentration, air pressure, temperature, relative humidity, wind direction, wind speed, visibility, precipitation, pressure change, and temperature change are used as the inputs of the model.
(1)
For missing observations in the data set, such as data units carrying the sentinel value 999,999, the ensemble learning correction model uses mean filling, taking the mean of the adjacent (preceding and following) observations.
(2)
Data units with a wind direction value of 999,017 actually denote a calm (static wind) state, and the ensemble learning correction model deletes these records.
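Both cleaning rules can be sketched with pandas (column names are hypothetical): drop rows whose wind direction is the 999,017 calm-wind sentinel, then replace the 999,999 sentinel with the mean of its neighbours — linear interpolation reproduces that neighbour mean for a single gap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "PM25": [12.0, 999999.0, 18.0, 20.0],
    "wind_dir": [90.0, 180.0, 999017.0, 270.0],
})

# Rule (2): 999,017 marks the calm-wind state; delete those records.
df = df[df["wind_dir"] != 999017].reset_index(drop=True)

# Rule (1): treat 999,999 as missing, then fill from the adjacent observations.
df["PM25"] = df["PM25"].replace(999999.0, np.nan).interpolate(limit_direction="both")
print(df["PM25"].tolist())  # [12.0, 16.0, 20.0]
```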

5.4. Evaluating Indicator

The performance indicators used to evaluate each algorithm are the root mean square error (RMSE), the mean deviation (MD), and the coefficient of determination (R2). The RMSE measures the overall error between the predicted and observed values. The MD measures the magnitude and direction of the overall average deviation between the predicted and observed values. The closer the RMSE and MD are to 0, the better the forecast. The coefficient of determination R2 judges how well the machine learning model fits the test data set; the larger it is, the closer the algorithm's output is to reality.
The specific calculation formulas are given in Equations (5)–(7):
RMSE = \left[ \frac{1}{N} \sum_{i=1}^{N} (P_i - O_i)^2 \right]^{1/2}
MD = \frac{1}{N} \sum_{i=1}^{N} (P_i - O_i)
R^2 = 1 - \frac{\sum_{i=1}^{N} (O_i - P_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2}
where P_i is the predicted value, O_i is the observed value, \bar{O} is the average of the observed values over all sample points, and N is the total number of samples.
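Equations (5)–(7) translate directly into NumPy (the prediction and observation values below are toy numbers for illustration):

```python
import numpy as np

def rmse(p, o):
    p, o = np.asarray(p, float), np.asarray(o, float)
    return float(np.sqrt(np.mean((p - o) ** 2)))

def mean_deviation(p, o):
    p, o = np.asarray(p, float), np.asarray(o, float)
    return float(np.mean(p - o))

def r_squared(p, o):
    p, o = np.asarray(p, float), np.asarray(o, float)
    return float(1.0 - np.sum((o - p) ** 2) / np.sum((o - o.mean()) ** 2))

pred = [35.0, 80.0, 120.0]   # hypothetical corrected forecasts
obs = [30.0, 85.0, 125.0]    # hypothetical observations
print(rmse(pred, obs))       # 5.0
print(mean_deviation(pred, obs), r_squared(pred, obs))
```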

5.5. Comparative Experiment Settings

Different ensemble learning algorithms have different correction effects on different atmospheric pollutant forecasts, and the Bagging and Boosting algorithms handle their classifiers differently. In Bagging, there is no strong dependence between individual learners, so they can be generated simultaneously in parallel. In Boosting, there is a strong dependence between individual learners, so they must be generated serially. Therefore, to verify the accuracy and stability of the correction effect of the ensemble learning correction model, this paper selects the ET-BPNN correction model for experimental comparison: it corrects the same air pollutant types with the same evaluation indexes as the model in this paper, but its model tuning method differs.
The forecast and observation data of Zhengzhou City are used as the input of both the ensemble learning correction model and the ET-BPNN correction model, and the three main air pollutants, PM2.5, PM10, and O3, are each revised. The correlation coefficient, root mean square error, and average relative error of the two models during iteration are compared; the smaller the root mean square error, the better the model's correction effect.

5.6. Experimental Result Analysis

The ensemble learning correction model and the ET-BPNN algorithm correction model [25] are each trained on the training data set, and the obtained correlation coefficient, root mean square error, and average relative error are compared, as shown in Table 7.
It can be seen from Table 7 that the ensemble learning correction model has a more obvious correction effect than the ET-BPNN model, and the improvement is most marked for the particulate matter predictions. By selecting the best algorithm for each pollutant, the ensemble learning correction model reduces the standard deviation of the prediction results for PM2.5, PM10, and O3 by 33.4%, 2.7%, and 51.2%, respectively.
The three algorithms in the ensemble learning correction model—random forest, GBDT, and XGB—are trained on the training data set. The prediction results of the CUACE model are compared with the observation data both before and after correction by the ensemble learning correction model, and the comparison results are analyzed below (the solid line in each figure represents the observed values):
The RF algorithm for different air pollution correction effects is shown in Figure 6.
The GBDT algorithm for different air pollution correction effects is shown in Figure 7.
The XGB algorithm for different air pollution correction effects is shown in Figure 8.
Analyzing the comparison results across the three predicted pollutants, all three correction algorithms predict the O3 concentration well, while the XGBoost algorithm yields a comparatively low mean deviation and root mean square error for the particulate matter (PM2.5 and PM10) predictions. Overall, the XGBoost algorithm is more accurate than the other two algorithm models.
The average deviation, root mean square error, and determination coefficient of the three correction algorithms are each calculated. The average deviation and root mean square error are calculated before and after the correction. After the correction of the three machine learning algorithms, the standard errors of the three pollutant concentrations predicted by the CUACE model are generally reduced by 40–80%, and the average deviation is generally reduced by 60–70%. The statistical results are shown in Table 8, Table 9 and Table 10.
From the coefficient of determination (R2), the random forest algorithm, the GBDT algorithm, and the XGBoost algorithm have better correction effects on O3 prediction. The coefficients of determination of the three algorithm models are 0.987, 0.991, and 0.979, respectively. Compared with the other two algorithms, the XGBoost algorithm has a better correction effect on the prediction of PM2.5 and PM10 pollutant concentrations.

6. Conclusions

In this paper, combined with the actual needs of atmospheric pollutant forecasting, to improve accuracy, the main pollutants PM2.5, PM10, and O3 are used as the correction objects, and an ensemble learning correction model is constructed by using the ensemble learning algorithm for the three main pollutants. The experimental results show that the ensemble learning correction model has a good correction effect on the CUACE model forecast outcomes. The operational correction effect of the RF algorithm on particulate matter (PM2.5 and PM10) is better than that of GBDT and XGB, and the correction effect is closer to the actual situation, which can reduce the forecast deviation rate by 10–60%. The XGB model has the best correction effect on ozone (O3) forecast, improving the forecast effect by 5–70%.
In terms of factor importance, visibility and relative humidity have the greatest impact on the PM2.5 and PM10 forecasts, respectively, while temperature and mixed-layer height have the greatest impact on the O3 forecast. For the O3 forecast, the forecast value and the observed value at the initial time have little effect on the model.
The ensemble learning correction model shows an obvious overestimation when the actual pollutant concentration is low, and the accuracy of the air pollutant forecast also deviates differently across seasons. However, the model in this paper does not account for seasonal forecast deviation. Building a BP neural network model and other methods to study seasonal forecast bias, and thereby further improving the air pollutant forecast correction method, is the direction and content of future research.

Author Contributions

Conceptualization, J.M. and X.M.; methodology, J.M.; software, X.M.; validation, J.M., X.M. and C.Y.; formal analysis, L.X.; investigation, W.Z.; resources, J.M.; data curation, C.Y.; writing—original draft preparation, X.M.; writing—review and editing, J.M.; visualization, X.M.; supervision, X.L.; project administration, W.Z.; funding acquisition, J.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the key sub-project of the national key research and development plan 2020YFB1712401-1 and Zhengzhou collaborative innovation major project 20XTZX06013.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Correction factor filtering sort.
Sort | PM2.5 Factor | Ratio  | PM10 Factor | Ratio  | O3 Factor | Ratio
-----|--------------|--------|-------------|--------|-----------|-------
1    | VIS_fc       | 0.6321 | VIS_fc      | 0.3792 | TEM_fc    | 0.6078
2    | RHU_fc       | 0.1235 | RHU_fc      | 0.2551 | MixHeight | 0.1935
3    | T850_fc      | 0.0818 | TEM_24h     | 0.0640 | RHU_fc    | 0.0581
4    | T850         | 0.0552 | MixHeight   | 0.0441 | WIN_S_fc  | 0.0351
5    | T850_fc      | 0.0407 | bt_month    | 0.0416 | bt_month  | 0.0263
6    | TEM          | 0.0153 | T850        | 0.0407 | VIS_fc    | 0.0194
7    | WIN_S_fc     | 0.0103 | WIN_S_fc    | 0.0362 | O3_fc     | 0.0160
8    | bt_month     | 0.0099 | PRE_fc      | 0.0303 | T850_fc   | 0.0123
9    | TEM_fc       | 0.0043 | T850_fc     | 0.0302 | PRS_fc    | 0.0076
10   | PRE_fc       | 0.0039 | PM10_fc     | 0.0273 | WS925_fc  | 0.0062
Explanation: VIS_fc, actual value of visibility; TEM_fc, actual value of temperature; RHU_fc, actual value of relative humidity; MixHeight, mixing height value; T850, 850 hPa temperature value; WIN_S_fc, actual value of wind speed; PRS_fc, actual value of air pressure; PRE_fc, actual value of precipitation.

References

1. Jiang, Y.; Wu, X.-J.; Guan, Y.-J. Effect of ambient air pollutants and meteorological variables on COVID-19 incidence. Infect. Control. Hosp. Epidemiol. 2020, 41, 1011–1015.
2. Wardah, T.; Kamil, A.; Hamid, A.S.; Maisarah, W. Statistical verification of numerical weather prediction models for quantitative precipitation forecast. In Proceedings of the 2011 IEEE Colloquium on Humanities, Science and Engineering (CHUSER), Penang, Malaysia, 5–6 December 2011; pp. 88–92.
3. Wang, Y.F.; Ma, Y.J.; Quan, W.J.; Li, R.P. Research progress on application of air quality numerical forecast model in Northeast China. J. Meteorol. Environ. 2020, 36, 130–136.
4. Gong, S.L.; Zhang, X.Y. CUACE/Dust-an integrated system of observation and modeling systems for operational dust forecasting in Asia. Atmos. Chem. Phys. 2007, 7, 1061–1067.
5. Hólm, E.V.; Lang, S.T.; Fisher, M.; Kral, T.; Bonavita, M. Distributed Observations in Meteorological Ensemble Data Assimilation and Forecasting. In Proceedings of the 2018 21st International Conference on Information Fusion (FUSION), Cambridge, UK, 10–13 July 2018; pp. 92–99.
6. Yang, G.Y.; Shi, C.E.; Deng, X.L.; Zhai, J.; Huo, Y.F.; Yu, C.X.; Zhao, Q. Application of multi-model ensemble method in PM2.5 forecasting in Anhui Province. J. Environ. Sci. 2021, 41, 1–11.
7. Du, J. Status and Prospects of Ensemble Forecast. Appl. Meteorol. 2002, 13, 16–28.
8. Xu, J.W.; Yang, Y. Integrated Learning Methods: A Review. J. Yunnan Univ. (Nat. Sci. Ed.) 2018, 40, 1082–1092.
9. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2012, 12, 2825–2830.
10. Huang, F.L.; Xie, G.Q.; Xiao, R.L. Research on Ensemble Learning. In Proceedings of the International Conference on Artificial Intelligence and Computational Intelligence, Shanghai, China, 7–8 November 2009; pp. 249–252.
11. Yang, Y.L.; Wang, J.Y.; Zhao, W. The effect of CUACE model on heavy pollution weather forecast in Yinchuan. J. Ningxia Univ. (Nat. Sci. Ed.) 2022, 43, 215–219+224.
12. Krichen, M.; Mihoub, A.; Alzahrani, M.Y.; Adoni, W.Y.H.; Nahhal, T. Are Formal Methods Applicable to Machine Learning And Artificial Intelligence? In Proceedings of the 2022 2nd International Conference of Smart Systems and Emerging Technologies (SMARTTECH), Riyadh, Saudi Arabia, 9–11 May 2022; pp. 48–53.
13. Perales-Gonzalez, C.; Fernandez-Navarro, F.; Carbonero-Ruz, M.; Perez-Rodriguez, J. Global Negative Correlation Learning: A Unified Framework for Global Optimization of Ensemble Models. In IEEE Transactions on Neural Networks and Learning Systems; IEEE: Piscataway, NJ, USA, 2022; p. 99.
14. Fan, X.; Feng, Z.; Yang, X.; Xu, T.; Tian, J.; Lv, N. Haze weather recognition based on multiple features and Random Forest. In Proceedings of the International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), Jinan, China, 14–17 December 2018; pp. 485–488.
  14. Fan, X.; Feng, Z.; Yang, X.; Xu, T.; Tian, J.; Lv, N. Haze weather recognition based on multiple features and Random Forest. In Proceedings of the International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), Jinan, China, 14–17 December 2018; pp. 485–488. [Google Scholar]
  15. Liu, S.; Cui, Y.; Ma, Y.; Liu, P. Short-term Load Forecasting Based on GBDT Combinatorial Optimization. In Proceedings of the 2018 2nd IEEE Conference on Energy Internet and Energy System Integration (EI2), Beijing, China, 20–22 October 2018; pp. 1–5. [Google Scholar]
  16. Li, S.T.; Wang, X.R. Application of XGBoost model in prediction of novel coronavirus. J. Chin. Mini-Micro Comput. Syst. 2021, 42, 2465–2472. [Google Scholar]
  17. Cao, Y.; Wang, B.; Zhao, W.; Zhang, X.; Wang, H. Research on Searching Algorithms for Unstructured Grid Remapping Based on KD Tree. In Proceedings of the 2020 IEEE 3rd International Conference on Computer and Communication Engineering Technology (CCET), Beijing, China, 14–16 August 2020; pp. 29–33. [Google Scholar]
  18. He, S.; Wu, P.; Yun, P.; Li, X.; Li, J. An EM algorithm for target tracking with an unknow correlation coefficient of measurement noise. Meas. Sci. Technol. 2022, 33, 045110. [Google Scholar] [CrossRef]
  19. Lv, M.Y.; Cheng, X.H.; Zhang, H.D.; Diao, Z.G.; Xie, C.; Liu, C.; Jiang, Q. Research on the improved method of pollutant forecast bias correction of CUACE model based on adaptive partial least squares regression. J. Environ. Sci. 2018, 38, 2735–2745. [Google Scholar]
  20. Chen, L.; Yu, K.A.; Qin, B.B.; Li, Y.Y.; Fan, K.F.; Li, Q.B. Evaluation and correction analysis of air quality forecast in Ningbo based on CUACE model. Technol. Bull. 2022, 38, 26–31. [Google Scholar]
  21. He, J.M.; Liu, K.; Wang, Y.H.; Zhang, P.Y. Verification and correction of CUACE model in urban air quality forecast in Lanzhou. Drought Meteorol. 2017, 35, 495–501. [Google Scholar]
  22. Zhang, B.; Lv, B.L.; Wang, X.L.; Zhang, W.X.; Hu, Y.T. Using Ensemble Deep Learning to Correct Numerical Prediction Results of Air Quality—A Case Study of WuChangshi Urban Agglomeration in Xinjiang. J. Peking Univ. (Nat. Sci. Ed.) 2020, 56, 931–938. [Google Scholar]
  23. Sun, Q.D.; Jiao, R.L.; Xia, J.J.; Yan, Z.W.; Li, H.C.; Sun, J.H.; Wang, L.Z.; Liang, Z.M. Research on wind speed correction of numerical weather prediction based on machine learning. Meteorology 2019, 45, 426–436. [Google Scholar]
  24. Chen, T.Q.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (CoRR). arXiv 2016, arXiv:1603.02754. [Google Scholar]
  25. Xiao, Y. Research and application of air quality numerical forecast correction method based on multi-machine learning algorithm coupling. Environ. Sci. Res. 2022, 35, 2693–2701. [Google Scholar]
Figure 1. Ensemble learning correction model, including classifier integration, correction factor screening, grid search and model evaluation.
Figure 2. Base Classifier Integration.
Figure 3. PM2.5 Characteristic factor.
Figure 4. PM10 Characteristic factor.
Figure 5. O3 Characteristic factor.
Figure 6. RF algorithm correction results.
Figure 7. GBDT algorithm correction results.
Figure 8. XGB algorithm correction results.
Table 1. Summary of related work.

| Ref. No | Author | Method/Framework | Discussion |
|---|---|---|---|
| 1 | Cheng et al. (2018) [19] | Adaptive partial least squares regression (APLSR) | Real-time dynamic correction of the PM2.5 concentration predicted by the GRAPES-CUACE model, introducing measured PM2.5 concentrations and meteorological conditions and accounting for their influence in different seasons and regions. |
| 2 | Chen et al. (2022) [20] | Decaying averaging method and rolling bias correction method | After correcting the daily AQI forecast of the CUACE model with the decaying averaging and rolling bias correction methods, the standard error, mean bias, and normalized bias are all significantly reduced. |
| 3 | He et al. (2017) [21] | Rolling linear regression error correction | Analysis of the absolute errors of the corrected PM10, PM2.5, and NO2 concentrations showed that the corrected values were significantly lower, most notably for PM10; after adding a constant offset to the corrected results, the forecast accuracy rose above 66%. |
| 4 | Zhang et al. (2020) [22] | Ensemble deep learning model | Proposes a post-correction method based on ensemble deep learning to correct the PM2.5 forecasts of the original CMAQ model and to improve the spatial resolution of the CMAQ forecast. |
| 5 | Sun et al. (2019) [23] | LASSO regression, random forest, and deep learning models | The correction effects of the three machine learning algorithms all exceed those of the MOS method, showing the potential of machine learning for improving local weather forecasts. |
Table 2. GBDT loss function expressions.

| Setting | Loss Function $L(y_i, f(x_i))$ | Negative Gradient $-\partial L(y_i, f(x_i)) / \partial f(x_i)$ |
|---|---|---|
| Regression (squared error) | $\frac{1}{2}[y_i - f(x_i)]^2$ | $y_i - f(x_i)$ |
| Regression (absolute error) | $\lvert y_i - f(x_i) \rvert$ | $\operatorname{sign}[y_i - f(x_i)]$ |
| Regression (Huber) | Huber | $y_i - f(x_i)$ for $\lvert y_i - f(x_i) \rvert \le \delta_m$; $\delta_m \operatorname{sign}[y_i - f(x_i)]$ for $\lvert y_i - f(x_i) \rvert > \delta_m$, where $\delta_m = \alpha$th quantile of $\{\lvert y_i - f(x_i) \rvert\}$ |
| Classification | Deviance | $k$th component: $I(y_i = G_k) - p_k(x_i)$ |
Table 3. RF optimal parameters.

| Element | Max_Depth | N_Estimators | Best_Score |
|---|---|---|---|
| PM2.5 | 40 | 700 | 0.80 |
| PM10 | 25 | 900 | 0.81 |
| O3 | 35 | 1100 | 0.97 |
Table 4. GBDT optimal parameters.

| Element | Max_Depth | N_Estimators | Best_Score |
|---|---|---|---|
| PM2.5 | 5 | 900 | 0.88 |
| PM10 | 5 | 900 | 0.86 |
| O3 | 8 | 800 | 0.98 |
Table 5. XGB optimal parameters.

| Element | N_Estimators | Best_Score |
|---|---|---|
| PM2.5 | 300 | 0.90 |
| PM10 | 500 | 0.88 |
| O3 | 500 | 0.98 |
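Optimal parameter tables like these come from an exhaustive grid search, where `best_score_` is the mean cross-validated R² of the best parameter combination. A sketch with scikit-learn's GridSearchCV; the grid and data below are illustrative stand-ins, not the paper's settings:

```python
# Grid search over max_depth and n_estimators for a random forest,
# mirroring the Max_Depth / N_Estimators / Best_Score layout above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = 2.0 * X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=300)

param_grid = {"max_depth": [5, 10, 20], "n_estimators": [100, 200]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid, cv=3, scoring="r2")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

The same pattern applies to GBDT and XGBoost estimators; only the estimator object and the parameter grid change.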
Table 6. Experimental environment configuration.

| Experimental Environment | Configuration |
|---|---|
| Operating System | Windows 10 |
| CPU | Intel(R) Core(TM) i7-11800H |
| GPU | NVIDIA RTX 3060, 8 GB |
| Memory | 16.0 GB |
| Programming language | Python 3.6 |
Table 7. Comparisons of Ensemble learning correction model and ET-BPNN model.

| Air Pollutants | Model | R² | RMSE | MD |
|---|---|---|---|---|
| PM2.5 | ET-BPNN model | 0.73 | 18.23 | 13.31 |
| PM2.5 | Ensemble learning correction model | 0.80 | 17.43 | 9.09 |
| PM10 | ET-BPNN model | 0.63 | 26.34 | 18.82 |
| PM10 | Ensemble learning correction model | 0.63 | 30.51 | 17.71 |
| O3 | ET-BPNN model | 0.81 | 29.16 | 3.52 |
| O3 | Ensemble learning correction model | 0.99 | 5.58 | 1.59 |
Table 8. MD score.

| Types of Pollutants | CUACE | RF | GBDT | XGB |
|---|---|---|---|---|
| PM2.5 | 35.46 | 12.13 | 9.95 | 9.09 |
| PM10 | 51.09 | 20.33 | 19.70 | 17.71 |
| O3 | 41.00 | 17.71 | 1.59 | 2.19 |
Table 9. RMSE score.

| Types of Pollutants | CUACE | RF | GBDT | XGB |
|---|---|---|---|---|
| PM2.5 | 50.73 | 20.79 | 24.76 | 17.43 |
| PM10 | 75.54 | 31.85 | 45.93 | 30.51 |
| O3 | 57.55 | 5.58 | 7.49 | 7.49 |
Table 10. R² score.

| Types of Pollutants | CUACE | RF | GBDT | XGB |
|---|---|---|---|---|
| PM2.5 | 0.11 | 0.80 | 0.80 | 0.89 |
| PM10 | −0.81 | 0.63 | 0.56 | 0.74 |
| O3 | −0.04 | 0.99 | 0.98 | 0.99 |
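The three indicators in Tables 8 to 10 can be computed as below. MD is taken here as the mean absolute deviation between corrected forecast and observation; this is an assumption, since the excerpt does not define the abbreviation. The numbers are illustrative, not the paper's data:

```python
# Compute MD (assumed mean absolute deviation), RMSE, and R^2
# for a corrected forecast against observations.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

obs = np.array([35.0, 42.0, 28.0, 55.0, 60.0])  # observed concentrations (illustrative)
fc = np.array([30.0, 50.0, 25.0, 58.0, 52.0])   # corrected forecasts (illustrative)

md = np.mean(np.abs(fc - obs))                  # mean absolute deviation (assumed MD)
rmse = np.sqrt(mean_squared_error(obs, fc))     # root mean squared error
r2 = r2_score(obs, fc)                          # coefficient of determination
print(round(md, 2), round(rmse, 2), round(r2, 2))
```

Note that R² can be negative (as in the CUACE column of Table 10) when the forecast errors exceed the variance of the observations themselves.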

Ma, J.; Ma, X.; Yang, C.; Xie, L.; Zhang, W.; Li, X. An Air Pollutant Forecast Correction Model Based on Ensemble Learning Algorithm. Electronics 2023, 12, 1463. https://doi.org/10.3390/electronics12061463