Classification Prediction of PM10 Concentration Using a Tree-Based Machine Learning Approach

Shaziayani, Wan Nur; Ul-Saufie, Ahmad Zia; Mutalib, Sofianita; Mohamad Noor, Norazian; Zainordin, Nazatul Syadia

doi:10.3390/atmos13040538


by Wan Nur Shaziayani ¹, Ahmad Zia Ul-Saufie ¹,*, Sofianita Mutalib ¹, Norazian Mohamad Noor ² and Nazatul Syadia Zainordin ³

¹ Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam 40450, Selangor, Malaysia

² Faculty of Civil Engineering Technology, Universiti Malaysia Perlis, Kompleks Pengajian Jejawi 3, Arau 02600, Perlis, Malaysia

³ Department of Environment, Faculty of Forestry and Environment, Universiti Putra Malaysia, Seri Kembangan 43400, Selangor, Malaysia

* Author to whom correspondence should be addressed.

Atmosphere 2022, 13(4), 538; https://doi.org/10.3390/atmos13040538

Submission received: 16 February 2022 / Revised: 17 March 2022 / Accepted: 24 March 2022 / Published: 29 March 2022

(This article belongs to the Section Atmospheric Techniques, Instruments, and Modeling)


Abstract:

The prediction of PM₁₀ has received considerable attention because of the pollutant’s harmful effects on human health. Machine learning approaches have the potential to predict and classify future PM₁₀ concentrations accurately. In this study, three machine learning algorithms, namely decision tree (DT), boosted regression tree (BRT), and random forest (RF), were applied to predict PM₁₀ in Kota Bharu, Kelantan. The results of the three methods were compared to find the best method for predicting the next day’s PM₁₀ concentration from maximum daily data covering January 2002 to December 2017. To this end, 80% of the data were used for training and 20% for validation of the models. Model performance was assessed by accuracy, sensitivity, specificity, and precision for RF, BRT, and DT; the results indicate that the three models were developed effectively and are applicable to the prediction of other atmospheric environmental data. The best model for classifying the next day’s PM₁₀ concentration was the random forest classifier, with an accuracy of 98.37%, sensitivity of 97.19%, specificity of 99.55%, and precision of 99.54%, but the result of the boosted regression tree was not substantially different from that of the RF model, with an accuracy of 98.12%, sensitivity of 97.51%, specificity of 98.72%, and precision of 98.71%. The best model can assist local governments in providing early warnings to people who are at risk of acute and chronic health consequences from air pollution.

Keywords:

PM₁₀; prediction; decision tree; boosted regression tree; random forest

1. Introduction

Particulate matter is a well-known air contaminant that has been linked to a variety of negative health effects. Compared with the other criteria pollutants, particulate matter is the most prominent pollutant in Peninsular Malaysia, with the highest air pollutant index (API) value. According to one study [1], until 2017, PM₁₀ contributed the most to Malaysia’s API, whereas PM₂.₅ had a substantial impact on the API starting in the middle of 2017.

Particulate matter consists of microscopic liquid droplets or solids that can travel deep into the lungs of humans and cause serious health problems. Studies [2,3,4] have stated that particulate matter can cause lung or heart disease, asthma, irregular heartbeat, heart attacks, decreased lung function, and increased respiratory symptoms such as coughing, airway irritation, and breathing difficulties.

In other studies [5,6], it was stated that vehicle traffic is one of the main causes of PM₁₀ pollution in Malaysia due to primary and secondary emissions from exhausts, as well as suspended dust from the streets caused by circulation. In addition, power plants, open burning and wildfires, industrial facilities, and other sources also contribute to this pollution.

Several recent studies have developed models to predict PM₁₀ concentrations such as WRF-CMAQ and WRF-Chem. Weather research and forecasting (WRF) models combined with community multiscale air quality (CMAQ) models (WRF-CMAQ) were developed to predict the mesoscale meteorology condition and to simulate the regional air quality, respectively [7,8,9,10]. The WRF-CMAQ model system provides advances in terms of the capability to simulate complex atmospheric processes that transport and transform pollutants in a dynamic environment [11].

However, the interactions between air pollutants and meteorological conditions can decrease the accuracy of WRF-CMAQ models, especially during heavily polluted episodes. Ref. [12] reported that the WRF model reproduced low surface temperatures inaccurately and overestimated the wind speed as the pollution loading increased, which in turn affected the CMAQ model. Another prediction system built on WRF is the weather research and forecasting model coupled with chemistry (WRF-Chem). WRF-Chem incorporates complex gas-phase chemistry, aerosol treatments, and a photolysis scheme to investigate the influence of different chemical mechanisms on aerosol concentrations [13,14,15]. Chemical transport and transformations are incorporated into WRF so that the interactions between meteorology and chemistry can be investigated [16]. In contrast to the WRF-CMAQ model, the WRF-Chem system showed its worst performance during episodes dominated by coarse particles [17]. The WRF-Chem model was found to underestimate aerosol optical depth (AOD) because of the misinterpretation of the coarse particles.

Advanced statistical models based on machine learning (ML) techniques have been widely applied in the field of air quality modelling [18,19] because they can perform large and complex data analysis to predict a potential outcome efficiently [20]. Artificial intelligence is used in these techniques, which learn the patterns in the dataset and then construct and train a predictive model [21].

Decision tree (DT) analysis is a data mining and machine learning method that generates a tree-based model to classify cases or predict values of a dependent variable based on the values of independent variables. Ref. [22] introduced the term ‘classification and regression trees’ (CART) to refer to DT algorithms that can be used to solve classification or regression predictive modelling problems. The algorithm is traditionally known as ‘decision trees’, but on some platforms, such as R, it is referred to as ‘CART’. The DT technique was originally applied to statistics by [22]. Ref. [23] explained how the DT technique in machine learning was established, while the authors of [24] described decision trees from a statistical perspective.

Two well-known ensemble approaches are bagging [25] and boosting [26,27]. They were developed to improve CART’s stability by generating multiple tree models and integrating their outputs to obtain a final prediction. According to [25], the random forest (RF) algorithm applies the bagging approach to an ensemble of DTs: it trains numerous trees in parallel and uses the majority decision of the trees as the RF model’s final decision. The boosted regression tree (BRT) algorithm, described by [28], uses boosting to randomly resample training datasets (without replacement) and build a sequence of trees, with each new tree focusing on poorly fitted cases. As a rule of thumb, the authors of [28] suggest using 1000 trees. However, this is based on a detailed analysis of predictive stability for the specific dataset used in that paper.

Furthermore, there are various methods used by previous researchers to predict air pollution concentrations using classification-based predictions. For instance, a study by [29] was performed to determine the major pollutants present in India. The results showed that the DT algorithm provided the highest accuracy. A study conducted by [30] suggested that the multilabel classification be used to predict multiple pollutants since it computes more accurate posterior probabilities, which better supports the decision maker. Then, the authors of [31] stated that classification-based predictions can be measured by accuracy, and the results of this study showed that the multilayer perceptron gave the best results in predicting the levels of PM₁₀ concentration.

The aim of this study is to develop a new approach to predict and classify PM₁₀ concentration levels in Malaysia using machine learning tree-based techniques, i.e., decision tree (DT), boosted regression tree (BRT), and random forest (RF). This study also presents the results of the most relevant features in predicting PM₁₀ concentration levels in Kota Bharu, Kelantan.

2. Materials and Methods

2.1. Study Area

Kota Bharu is the royal seat and state capital of Kelantan, and the monitoring station is located at Sekolah Menengah Kebangsaan Tanjong Chat, Kelantan. The latitude and longitude of this station are 6°6′28.42″ N and 102°15′5.01″ E. This location is classified as an urban area by the Department of Environment (DoE), Malaysia. Its location on the peninsula’s east coast, facing the South China Sea and in the path of the cold surge, results in a harsh climate during November, December, and January due to the annual monsoon season. Since Kota Bharu is Kelantan’s capital, its population is higher than that of other parts of the state. The availability of jobs in this main city has resulted in rising population density and economic development, which might result in increased pollution.

2.2. Monitoring Records

To gain a better understanding of PM₁₀ variability, this study used eight maximum daily parameters across a sixteen-year period (2002–2017). Air pollutant parameters, namely carbon monoxide (CO, ppb), nitrogen dioxide (NO₂, ppb), sulphur dioxide (SO₂, ppb), particulate matter with an aerodynamic diameter of less than 10 µm (PM₁₀, µg/m³), and ozone (O₃, ppb), as well as meteorological parameters, namely wind speed (WS, km/h), relative humidity (RH, %), and temperature (T, °C), were used as predictors for the next day’s PM₁₀ concentration. WS, RH, T, and previous PM₁₀ concentrations were selected as independent variables since, according to [32,33,34], they had the most significant effect on future PM₁₀ concentrations. In addition, Ref. [35] found that other air pollutants, such as SO₂, NO₂, and CO, can also improve PM₁₀ concentration predictions. Furthermore, Ref. [36] found no significant correlation between PM₁₀ and wind direction; wind direction was therefore not used as a predictor.

This study used linear interpolation for missing data imputation. According to [37], this linear interpolation method estimates the missing data better for the air pollution data.
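The imputation step can be illustrated with a minimal sketch. It assumes that missing readings are marked as None and that each gap is filled on a straight line between the nearest observed values on either side; the function name and the sample values are hypothetical and are not taken from the study.

```python
def interpolate_linear(values):
    """Fill None gaps on a straight line between the nearest observed neighbours.

    Gaps at the very start or end of the series have only one neighbour
    and are left as None in this sketch.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        gap = right - left
        for i in range(left + 1, right):
            weight = (i - left) / gap  # fraction of the way from left to right
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled


# Hypothetical daily maximum PM10 readings with a two-day gap.
pm10 = [40.0, None, None, 70.0, 55.0]
filled = interpolate_linear(pm10)
print(filled)
```

The two missing days are replaced by values one third and two thirds of the way between 40 and 70.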

The monitoring records were obtained from the Department of Environment (DoE) of Malaysia’s Ministry of Environment and Water. Then, 80% of the monitoring data were used for model training, while 20% were used for model validation. In this study, class value 1 indicated a low level of air quality and was assigned when the concentration level did not exceed 100, which is the index value in the current 24 h PM₁₀ standard, while class value 2 indicated a high level of air quality and was assigned when the concentration exceeded 100. Table 1 shows the labelling for the next day’s PM₁₀ concentration. The threshold value was based on the Malaysia Ambient Air Quality Guideline 2019 [38].
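The labelling and the 80/20 split can be sketched as follows. The threshold of 100 and the class values 1 and 2 come from the text; the chronological ordering of the split, the function names, and the sample series are assumptions, since the study does not state how rows were assigned to the two parts.

```python
THRESHOLD = 100  # threshold stated in the text


def label_next_day(pm10_series, threshold=THRESHOLD):
    """Pair each day's PM10 value with the class of the following day.

    Class 1: next-day value does not exceed the threshold.
    Class 2: next-day value exceeds the threshold.
    """
    rows = []
    for today, tomorrow in zip(pm10_series, pm10_series[1:]):
        rows.append((today, 2 if tomorrow > threshold else 1))
    return rows


def split_80_20(rows):
    """Assumed chronological split: first 80% for training, last 20% for validation."""
    cut = int(len(rows) * 0.8)
    return rows[:cut], rows[cut:]


# Hypothetical daily maximum PM10 values.
series = [45, 60, 120, 98, 101, 100, 30, 55, 77, 110, 95]
rows = label_next_day(series)
train, valid = split_80_20(rows)
print(len(train), len(valid))
```

A value of exactly 100 falls into class 1 here, because the text assigns class 2 only when the threshold is exceeded.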

To avoid imbalanced classification in developing predictive models on these classification datasets, this study used the synthetic minority oversampling technique (SMOTE), as proposed by [39]. In the first step, the dataset was filtered to only consider the minority class, which was the high-level air quality, with 95 data points only. Following that, a search of the k-nearest neighbours was performed (k = 5). The algorithm then selected a random nearest neighbour for these data. A new data point was created, which was on the line between the two data points. The results before and after using the SMOTE technique are shown in Figure 1.
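The SMOTE steps described above can be sketched in plain Python. This is an illustration of the interpolation idea only, not the RapidMiner operator used in the study; the function name, the two-feature points, and the random seed are hypothetical, and k = 5 follows the text.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points from the minority class.

    For each new point: pick a minority sample, find its k nearest
    minority neighbours, pick one neighbour at random, and place the
    new point at a random position on the line between the two.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]  # exclude the base sample itself
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Hypothetical minority-class samples: (PM10, wind speed).
minority = [(105.0, 3.0), (110.0, 2.5), (120.0, 4.0), (130.0, 3.5),
            (150.0, 5.0), (160.0, 4.5), (198.0, 6.0)]
synthetic = smote(minority, n_new=10)
print(len(synthetic))
```

Because every synthetic point lies between two existing minority samples, the oversampled class stays within the range of the observed high-concentration days.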

2.3. Tree-Based Machine Learning Approaches

Three models were used to predict the next day’s PM₁₀ concentration from the maximum daily data in Kota Bharu, Kelantan: decision trees (DTs), random forest (RF), and boosted regression trees (BRTs). RapidMiner Studio was used for model development, prediction, and evaluation. The data from January 2002 to December 2017 were divided into two parts: 80% were used for training and 20% for validation of the models. The results of each model were then compared to find the best model for predicting PM₁₀ concentration. Table 2 shows the general model of DT, RF, and BRT for predicting next-day PM₁₀ concentration. The dependent variable was the next-day PM₁₀ concentration, PM₁₀(D+1), while the independent variables were the daily maximum CO, PM₁₀, NO₂, SO₂, O₃, T, RH, and WS, represented by CO(D), PM₁₀(D), NO₂(D), SO₂(D), O₃(D), T(D), RH(D), and WS(D).

2.3.1. Decision Tree

The decision tree (DT) model is a nonparametric method that can be used to predict a variety of quantitative and qualitative variables. In the form of a tree structure and with a reciprocal classification of data, a DT model illustrates direct and indirect correlations of numerous independent variables with a target (dependent) variable [40]. Variables on the upper branches of the tree structure have a greater impact on the prediction of the related class. For the DT model, four splitting criteria were tested: gain ratio, information gain, Gini index, and accuracy.
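Three of the four splitting criteria can be written down compactly using their textbook definitions. The sketch below is an illustration under that assumption and does not reproduce RapidMiner’s internal implementation; the accuracy criterion is omitted, and the class labels in the example are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted


def gain_ratio(parent, left, right):
    """Information gain divided by the entropy of the split sizes."""
    n = len(parent)
    split_info = -sum((len(s) / n) * math.log2(len(s) / n) for s in (left, right) if s)
    return information_gain(parent, left, right) / split_info if split_info else 0.0


# Hypothetical node with four class-1 and four class-2 days.
parent = [1, 1, 1, 1, 2, 2, 2, 2]
perfect = information_gain(parent, [1, 1, 1, 1], [2, 2, 2, 2])
mixed = information_gain(parent, [1, 1, 1, 2], [1, 2, 2, 2])
print(gini(parent), perfect, mixed)
```

A split that separates the two classes completely recovers the full entropy of the parent node, while the mixed split gains far less; a tree learner picks the candidate split with the highest score under the chosen criterion.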

2.3.2. Random Forests

Random forests (RFs) are a family of methods for assembling a collection (or forest) of decision trees [41]. According to [25], the random forest (RF) approach applies the bagging technique to an ensemble of DTs: it trains many trees concurrently, and the outputs of the trees are combined by majority vote to give the ensemble prediction of the target variable. The model also estimates the most relevant features by determining how much the prediction error increases when the data for a given variable are permuted while the rest of the data are left unchanged [42]. For the RF models, four splitting criteria were tested: gain ratio, information gain, Gini index, and accuracy.
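Majority voting and permutation-based feature relevance can be sketched as follows. The three hand-written rules stand in for trained trees and are purely hypothetical, as are the feature names and sample rows; the sketch also measures the error on a fixed set of rows, whereas random forest implementations typically use out-of-bag samples.

```python
import random
from collections import Counter


def majority_vote(trees, row):
    """Return the class predicted by most trees for one row."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


def error_rate(trees, rows, labels):
    """Fraction of rows for which the ensemble vote is wrong."""
    wrong = sum(majority_vote(trees, r) != y for r, y in zip(rows, labels))
    return wrong / len(rows)


def permutation_importance(trees, rows, labels, feature, seed=0):
    """Increase in error after shuffling one feature column across the rows."""
    rng = random.Random(seed)
    column = [r[feature] for r in rows]
    rng.shuffle(column)
    permuted = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
    return error_rate(trees, permuted, labels) - error_rate(trees, rows, labels)


# Hypothetical stand-ins for trained trees (class 2 = exceeds the threshold).
trees = [
    lambda r: 2 if r["pm10"] > 90 else 1,
    lambda r: 2 if r["pm10"] > 95 else 1,
    lambda r: 2 if r["ws"] < 5 and r["pm10"] > 80 else 1,
]
rows = [
    {"pm10": 120, "ws": 3, "o3": 20},
    {"pm10": 40, "ws": 12, "o3": 35},
    {"pm10": 110, "ws": 4, "o3": 10},
    {"pm10": 60, "ws": 9, "o3": 25},
]
labels = [2, 1, 2, 1]
baseline = error_rate(trees, rows, labels)
importance_o3 = permutation_importance(trees, rows, labels, "o3")
importance_pm10 = permutation_importance(trees, rows, labels, "pm10")
print(baseline, importance_o3, importance_pm10)
```

A feature that no tree uses, such as o3 in this toy ensemble, gets an importance of zero because shuffling it cannot change any vote.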

2.3.3. Boosted Regression Tree

Boosted regression tree (BRT) models are developed by integrating two algorithms: decision trees [11] are used to fit a series of single models, and then boosting [26] is used to aggregate their outputs to obtain the total prediction. A thorough description of the approaches can be found in [6,43]. Three tuning parameters must be controlled in the BRT model: the learning rate (lr), which is the shrinkage parameter used in each iteration to reduce the contribution of the tree; the tree complexity (tc), which is the maximum tree depth of variable interactions; and the number of trees (nt). In this study, default settings for lr (0.01), tc (5), and nt (1000) were used to fit BRT models in RapidMiner. To optimise the number of trees in BRT, the optimise parameters (grid) operator was used, which is a nested operator. The loss functions or distributions used in this study were the multinomial and Bernoulli distributions.
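The role of the learning rate and the number of trees can be illustrated with a deliberately simplified boosting loop for the Bernoulli loss. It is a sketch under several assumptions: each tree is a one-split stump on a single feature (depth 1 rather than tc = 5), leaf values are plain mean residuals, and no random subsampling is applied. The function names and the sample data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(xs, residuals):
    """Least-squares stump on one feature: (threshold, left_value, right_value)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lv, rv)
    return best[1:]


def boost(xs, ys, n_trees=1000, learning_rate=0.01):
    """Fit stumps one after another to the gradient of the Bernoulli loss (ys are 0/1)."""
    scores = [0.0] * len(xs)
    stumps = []
    for _ in range(n_trees):
        # Residual = observed class minus current predicted probability.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        t, lv, rv = fit_stump(xs, residuals)
        stumps.append((t, lv, rv))
        # Shrinkage: each tree contributes only learning_rate times its output.
        scores = [s + learning_rate * (lv if x <= t else rv) for x, s in zip(xs, scores)]
    return stumps


def predict_proba(stumps, x, learning_rate=0.01):
    score = sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)
    return sigmoid(score)


# Hypothetical data: today's PM10 and whether the next day exceeded the threshold.
xs = [30, 45, 60, 80, 105, 120, 150, 198]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
stumps = boost(xs, ys)
print(predict_proba(stumps, 150), predict_proba(stumps, 40))
```

With a small learning rate, each tree moves the prediction only slightly, which is why a large number of trees is needed before the predicted probabilities separate clearly.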

The DT, BRT, and RF models were run using RapidMiner software with the abovementioned independent variables (CO(D), PM₁₀(D), NO₂(D), SO₂(D), O₃(D), T(D), RH(D), WS(D)) and dependent variable (PM₁₀(D+1)), with two class labels.

2.4. Performance Measures

Several performance measures—namely, accuracy, sensitivity, specificity, and precision values—were used to evaluate each classification model in this study. The formula for the performance measures used in this study is shown in Equations (1)–(4).

Accuracy = \frac{TP + TN}{TP + FP + TN + FN}    (1)

Sensitivity = \frac{TP}{TP + FN}    (2)

Specificity = \frac{TN}{TN + FP}    (3)

Precision = \frac{TP}{TP + FP}    (4)

where TP is the true-positive, TN is the true-negative, FP is the false-positive, and FN is the false-negative value based on a confusion matrix. To support the accuracy values, we used sensitivity, specificity, and precision values as suggested by [44]. The overview of the experiments performed for this study is shown in Figure 2.
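The four measures can be computed directly from the confusion-matrix counts. In this minimal sketch, sensitivity uses the standard definition TP / (TP + FN); the function name and the counts in the example are hypothetical and are not results from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and precision from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # share of actual positives that were found
        "specificity": tn / (tn + fp),  # share of actual negatives that were found
        "precision": tp / (tp + fp),    # share of predicted positives that were correct
    }


# Hypothetical counts for a two-class validation set of 200 days.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Reporting sensitivity, specificity, and precision alongside accuracy shows whether a high accuracy is achieved on both classes or mainly on the larger one.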

3. Results and Discussion

3.1. Statistical Characteristics of PM₁₀

The descriptive statistics of all independent variables are shown in Table 3. The mean PM₁₀ concentration was 48.73 µg/m³, which is below the specified MAAQG yearly average of 50 µg/m³ [45]. The distribution was highly skewed for PM₁₀ (1.72), CO (16.44), NO₂ (1.22), SO₂ (19.54), RH (−5.92), and WS (31.08), since their skewness values were less than −1 or more than +1 [6]. The data also revealed the presence of extreme concentration levels in Kota Bharu during the 2002–2017 period.
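The skewness rule can be illustrated with a short sketch. The text does not state which skewness estimator was used, so the moment-based coefficient below is an assumption, and the sample values are hypothetical.

```python
def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def is_highly_skewed(values):
    """Rule from the text: skewness below -1 or above +1."""
    return abs(skewness(values)) > 1


symmetric = [1, 2, 3, 4, 5]
right_tailed = [40, 45, 50, 55, 198]  # one extreme reading pulls the tail to the right
print(skewness(symmetric), skewness(right_tailed))
```

A single extreme reading is enough to push a small sample past the +1 threshold, which is consistent with the outliers visible in long air quality records.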

The box plot in Figure 3 also shows that PM₁₀ concentrations had extreme values, as indicated by the many outliers. The highest extreme value is the maximum daily PM₁₀ reading in the 16-year period, which was 198 µg/m³.

3.2. Decision Tree (DT)

The performance values for each of the DT splitting criteria are shown in Table 4. Overall, the results indicate that all of the splitting criteria can be used to classify the air pollution dataset. Among them, information gain produced the highest accuracy, 96.52%, which was supported by sensitivity (95.78%), specificity (97.25%), and precision (97.21%).

Figure 4 shows the decision tree classifier with eight features included in the model. The most influential variable regarding PM₁₀ concentration levels was SO₂ for gain ratio (0.33) and accuracy (0.21), while the second-most influential parameter would be WS for gain ratio (0.31) and accuracy (0.20). For information gain and Gini indices, the most influential variable was WS.

3.3. Boosted Regression Tree (BRT)

The performance values for each of the BRT distributions are shown in Table 5. Overall, the results indicate that both distributions can be used to classify air pollution data sets. However, the best distribution was the multinomial distribution, which had the same accuracy (98.12%) as the Bernoulli distribution but was supported by sensitivity (97.51%), specificity (98.72%), and precision (98.71%).

The boosted regression tree classifier, which has eight features, is shown in Figure 5. The most influential (PM₁₀) and second-most influential (WS) variables were the same for both distributions. The parameter with the least influence on PM₁₀ was ozone concentration, with importance values of only 811.43 (multinomial) and 958.16 (Bernoulli). After optimising the number of trees using the optimise parameters (grid) operator, the best number of trees in this study was 500 for the multinomial distribution and 450 for the Bernoulli distribution.

3.4. Random Forest (RF)

Table 6 shows the performance numbers for each of the RF splitting criteria. Overall, the results show that the air pollution dataset can be classified using all of the splitting criteria. In comparison to the other splitting criteria, the information gain produced the highest accuracy value in terms of total performance, with an exceptional accuracy of 98.37%, which was confirmed by sensitivity (97.19%), specificity (99.55%), and precision (99.54%).

The random forest classifier, which has eight features, is shown in Figure 6. The most influential variable on PM₁₀ concentration was WS for all methods of splitting criteria in RF, with a gain ratio of 0.22, information gain of 0.22, Gini index of 0.24, and accuracy of 0.19. The temperature was the second-most influential factor.

3.5. Performance Comparison

According to these performance measures, the RF model outperformed the BRT and DT models; that is, in Kota Bharu, RF classified the next day’s PM₁₀ concentrations better than BRT and DT. The values of accuracy, sensitivity, specificity, and precision are shown in Table 7. Overall, the best model for classifying the next day’s PM₁₀ concentration was the random forest classifier, with WS, T, RH, NO₂, and PM₁₀ as the five most influential parameters. Table 8 compares these results with those of other researchers. The results of this study are quite similar to those of other researchers, and the accuracy of the RF model is higher than that of the other models. This shows that the RF model can be used to predict PM₁₀ concentrations since it improved the performance of the model.

4. Conclusions

Based on the results of this study, the random forest classifier performed best in classifying the next day’s PM₁₀ concentration, with an accuracy of 98.37%, sensitivity of 97.19%, specificity of 99.55%, and precision of 99.54%. However, the BRT performance was not substantially different from that of the RF model, with an accuracy of 98.12%, sensitivity of 97.51%, specificity of 98.72%, and precision of 98.71%.

Next, the wind speed was the most relevant feature to classifying the next day’s PM₁₀ concentration as indicative of a low or high level of air quality in RF and DT techniques, but for BRT, PM₁₀ was the most relevant feature.

Overall, this study’s findings show that machine learning algorithms can help classify the next day’s PM₁₀ concentration as indicative of a low or high level of air quality. The three classifiers applied in this research also covered the most relevant features in PM₁₀ concentration prediction. The best model (RF) is suitable to predict PM₁₀ concentration in Kota Bharu, Kelantan, for early warning systems and for local authorities to develop strategies to improve air quality.

A limitation of this study is that the model can only be used when the sources and conditions of PM₁₀ concentration remain constant. As a result, it may not be appropriate for other areas. Furthermore, a sudden forest fire or storm in a specific area will affect the PM₁₀ concentration.

Author Contributions

A.Z.U.-S., W.N.S. and S.M. designed the study concept and secured funding; A.Z.U.-S. is the project administrator; W.N.S. and A.Z.U.-S. performed the data analysis; W.N.S. and S.M. wrote the manuscript; A.Z.U.-S., N.M.N. and N.S.Z. reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

The research was funded by Malaysia’s Ministry of Education through the Fundamental Research Grant Scheme (FRGS/1/2019/WAB05/UITM/03/2).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data for this project are confidential but may be obtained with Data Use Agreements with the Department of Environment (DOE), Ministry of Environment and Water of Malaysia.

Acknowledgments

The authors thank Universiti Teknologi MARA for their support and also the Department of Environment Malaysia for providing air quality monitoring data.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

Department of Environment, Malaysia. Malaysia Environmental Quality Report 2016. Available online: https://www.doe.gov.my/wp-content/uploads/2021/08/EQR-2016-AIR-TANAH.pdf (accessed on 1 January 2022).
US EPA. Health and Environmental Effects of Particulate Matter (PM) 2015. Available online: https://www.epa.gov/pm-pollution/health-and-environmental-effects-particulate-matter-pm (accessed on 4 January 2022).
Hassan, N.A.; Hashim, Z.; Hashim, J.H. Impact of climate change on air quality and public health in urban areas. Asia Pac. J. Public Health 2016, 28, 385–485. [Google Scholar] [CrossRef]
Vinceti, M.; Malagoli, C.; Malavolti, M.; Cherubini, A.; Maffeis, G.; Rodolfi, R.; Heck, J.E.; Astolfi, G.; Calzolari, E.; Nicolini, F. Does maternal exposure to benzene and PM₁₀ during pregnancy increase the risk of congenital anomalies? A population-based case-control study. Sci. Total Environ. 2016, 541, 444–450. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Azmi, S.Z.; Latif, M.T.; Ismail, A.S.; Juneng, L.; Jemain, A.A. Trend and status of air quality at three different monitoring stations in the Klang Valley, Malaysia. Air Qual. Atmos. Health 2010, 3, 53–64. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Shaziayani, W.N.; Ul-Saufie, A.Z.; Ahmat, H.; Al-Jumeily, D. Coupling of Quantile Regression into Boosted Regression Trees (BRT) Technique in Forecasting Emission Model of PM₁₀ Concentration. Air Qual. Atmos. Health 2021, 14, 1647–1663. [Google Scholar] [CrossRef]
Byun, D.; Schere, K.L. Review of the governing equations, computational algorithms, and other components of the Models-3 Community Multiscale Air Quality (CMAQ) modeling system. Appl. Mech. Rev. 2006, 59, 51–77. [Google Scholar] [CrossRef]
Im, U.; Markakis, K.; Unal, A.; Kindap, T.; Poupkou, A.; Incecik, S.; Yenigun, O.; Melas, D.; Theodosi, C.; Mihalopoulos, N. Study of a winter PM episode in Istanbul using the high resolution WRF/CMAQ modeling system. Atmos. Environ. 2010, 44, 3085–3094. [Google Scholar] [CrossRef]
Hu, J.; Li, X.; Huang, L.; Ying, Q.; Zhang, Q.; Zhao, B.; Wang, S.; Zhang, H. Ensemble prediction of air quality using the WRF/CMAQ model system for health effect studies in China. Atmos. Chem. Phys. 2017, 17, 13103–13118. [Google Scholar] [CrossRef] [Green Version]
Vongruang, P.; Wongwises, P.; Pimonsree, S. Assessment of fire emission inventories for simulating particulate matter in Upper Southeast Asia using WRF-CMAQ. Atmos. Pollut. Res. 2017, 8, 921–929. [Google Scholar] [CrossRef]
Tan, J.; Zhang, Y.; Ma, W.; Yu, Q.; Wang, Q.; Fu, Q.; Zhou, B.; Chen, J.; Chen, L. Evaluation and potential improvements of WRF/CMAQ in simulating multi-levels air pollution in megacity Shanghai, China. Stoch. Environ. Res. Risk Assess. 2017, 31, 2513–2526. [Google Scholar] [CrossRef]
Zhang, H.; DeNero, S.P.; Joe, D.K.; Lee, H.H.; Chen, S.H.; Michalakes, J.; Kleeman, M.J. Development of a source oriented version of the WRF/Chem model and its application to the California regional PM 10/PM 2.5 air quality study. Atmos. Chem. Phys. 2014, 14, 485–503. [Google Scholar] [CrossRef] [Green Version]
Kumar, A.; Jime, R.; Belalca, L.C. Application of WRF-Chem model to simulate PM₁₀ concentration over Bogota. Aerosol Air Qual. Res. 2016, 16, 1206–1221. [Google Scholar] [CrossRef] [Green Version]
Jenkins, G.S.; Gueye, M. Annual and early summer variability in WRF-CHEM simulated West African PM₁₀ during 1960–2016. Atmos. Environ. 2022, 273, 118957. [Google Scholar] [CrossRef]
Casallas, A.; Celis, N.; Ferro, C.; López Barrera, E.; Peña, C.; Corredor, J.; Ballen Segura, M. Validation of PM₁₀ and PM_2.5 early alert in Bogotá, Colombia, through the modeling software WRF-CHEM. Environ. Sci. Pollut. Res. 2020, 27, 35930–35940. [Google Scholar] [CrossRef] [PubMed]
Grell, G.A.; Peckham, S.E.; Schmitz, R.; McKeen, S.A.; Frost, G.; Skamarock, W.C.; Eder, B. Fully coupled “online” chemistry within the WRF model. Atmos. Environ. 2005, 39, 6957–6975. [Google Scholar] [CrossRef]
Balzarini, A.; Pirovano, G.; Honzak, L.; Žabkar, R.; Curci, G.; Forkel, R.; Hirtl, M.; San Jose, R.; Tuccella, P.; Grell, G.A. WRF-Chem model sensitivity to chemical mechanisms choice in reconstructing aerosol optical properties. Atmos. Environ. 2015, 115, 604–619. [Google Scholar] [CrossRef]
Gagliardi, R.V.; Andenna, C. A Machine Learning Approach to Investigate the Surface Ozone Behavior. Atmosphere 2020, 11, 1173. [Google Scholar] [CrossRef]
Rybarczyk, Y.; Zalakeviciute, R. Machine Learning Approaches for Outdoor Air Quality Modelling: A Systematic Review. Appl. Sci. 2018, 8, 2570. [Google Scholar] [CrossRef] [Green Version]
Myers, K.D.; Knowles, J.W.; Staszak, D.; Shapiro, M.D.; Howard, W.; Yadava, M.; Zuzick, D.; Williamson, L.; Shah, N.H.; Banda, J.M.; et al. Precision screening for familial hypercholesterolaemia: A machine learning study applied to electronic health encounter data. Lancet Digit. Heal. 2019, 1, 393–402. [Google Scholar] [CrossRef] [Green Version]
Rosli, M.M.; Edward, J.; Onn, M.; Chua, Y.A.; Kasim, N.A.M.; Nawawi, H. Classifying Familial Hypercholesterolaemia: A Tree-based Machine Learning Approach. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 66–73. [Google Scholar] [CrossRef]
Breiman, L.; Friedman, J.H.; Olshen, R.; Stone, C.J. Classification and Regression Trees; Wadsworth: Belmont, CA, USA, 1984. [Google Scholar]
Quinlan, R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Mateo, CA, USA, 1993. [Google Scholar]
Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: New York, NY, USA, 2009. [Google Scholar]
Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. 2002, 38, 367–378.
Elith, J.; Leathwick, J.R.; Hastie, T. A Working Guide to Boosted Regression Trees. J. Anim. Ecol. 2008, 77, 802–813.
Akiladevi, R.; Nandhini, D.B.; Nivesh, K.V.; Nivetha, P. Prediction and Analysis of Pollutant using Supervised Machine Learning. Int. J. Recent Technol. Eng. 2020, 9, 50–54.
Giorgio, C.; Mauro, S. Air pollution prediction via multi-label classification. Environ. Model. Softw. 2016, 80, 259–264.
Akhtar, A.; Masood, S.; Gupta, C.; Masood, A. Prediction and analysis of pollution levels in Delhi using multilayer perceptron. Adv. Intell. Syst. Comput. 2018, 542, 563–572.
Grivas, G.; Chaloulakou, A. Artificial neural network models for prediction of PM₁₀ hourly concentrations, in the Greater Area of Athens, Greece. Atmos. Environ. 2006, 40, 1216–1229.
Elis, S.Z.N.; Ul-Saufie, A.Z.; Shaziayani, W.N.; Noor, N.M.; Zubir, N.A. Assessment of Ambient Air Pollution in Langkawi Island, Malaysia. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Kazimierz Dolny, Poland, 21–23 November 2019; Volume 551, p. 012123.
Mohamad, N.S.; Deni, S.M.; Ul-Saufie, A.Z. Application of the First Order of Markov Chain Model in Describing the PM₁₀ Occurrences in Shah Alam and Jerantut, Malaysia. Pertanika J. Sci. Technol. 2018, 26, 367–378.
Paschalidou, A.K.; Karakitsios, S.; Kleanthous, S.; Kassomenos, P.A. Hourly PM₁₀ Concentration in Cyprus through Artificial Neural Networks and Multiple Regression Models: Implications to Local Environmental Management. Environ. Sci. Pollut. Res. 2011, 18, 316–327.
Papanastasiou, D.K.; Kioutsoukis, M.D. Development and Assessment of Neural Network and Multiple Regression Models in Order to Predict PM₁₀ Levels in a Medium-Sized Mediterranean City. Water Air Soil Pollut. 2007, 182, 325–334.
Libasin, Z.; Suhailah, W.; Fauzi, W.M.; Ul-Saufie, A.Z.; Idris, N.A.; Mazeni, N.A. Evaluation of Single Missing Value Imputation Techniques for Incomplete Air Particulates Matter (PM₁₀) Data in Malaysia. Pertanika J. Sci. Technol. 2021, 29, 3099–3112.
Department of Environment, Malaysia. Malaysia Environmental Quality Report 2019. Available online: https://www.doe.gov.my/portalv1/en/ (accessed on 10 January 2022).
Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
Esfandiarpour-Boroujeni, I.; Shahini, S.M.; Shirani, H.; Mosleh, Z.; Bagheri, B.M.; Salehi, M.H. Comparison of error and uncertainty of decision tree and learning vector quantization models for predicting soil classes in areas with low altitude variations. CATENA 2020, 191, 104581.
Stafoggia, M.; Bellander, T.; Bucci, S.; Davoli, M.; Hoogh, K.D.; Donato, F.D.; Gariazzo, C.; Lyapustin, A.; Michelozzi, P.; Renzi, M.; et al. Estimation of daily PM₁₀ and PM₂.₅ concentrations in Italy, 2013–2015, using a spatiotemporal land-use random-forest model. Environ. Int. 2019, 124, 170–179.
Liaw, A.; Wiener, M. Classification and regression by random forest. R News 2002, 2, 18–22.
Shaziayani, W.N.; Ul-Saufie, A.Z.; Yusoff, S.A.M.; Ahmat, H.; Libasin, Z. Evaluation of boosted regression tree for the prediction of the maximum 24-hour concentration of particulate matter. Int. J. Environ. Sci. Dev. 2021, 12, 126–130.
Rosli, M.M.; Edward, J.; Onn, M. Precision screening for familial hypercholesterolaemia: A machine learning study applied to electronic health encounter data. Int. J. Adv. Comput. Sci. Appl. 2021, 9, 66–73.
Department of Environment, Malaysia. Malaysia Environmental Quality Report 2018; Ministry of Energy, Science, Technology, Environment and Climate Change, Malaysia: Kuala Lumpur, Malaysia, 2018.

Figure 1. Original dataset vs. synthetic minority oversampling technique (SMOTE) dataset.

Figure 2. Flowchart of steps in this study.

Figure 3. Box plot for PM₁₀ concentration.

Figure 4. Feature importance for eight parameters (decision tree (DT) model).

Figure 5. Feature importance for eight parameters (boosted regression tree (BRT) model).

Figure 6. Feature importance for eight parameters (random forest (RF) model).

Table 1. Class labels for the next day’s PM₁₀ concentration.

PM₁₀ Concentration Value	Class Label
Low level, PM₁₀ < 100 µg/m³	1
High level, PM₁₀ ≥ 100 µg/m³	2
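
The labelling rule in Table 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function name and the sample concentrations are assumptions, and only the threshold of 100 and the labels 1 and 2 come from the table.

```python
# Threshold and class labels follow Table 1; the names are illustrative.
PM10_THRESHOLD = 100.0


def pm10_class_label(pm10: float) -> int:
    """Return 1 when PM10 is below 100 and 2 when it is 100 or above."""
    return 1 if pm10 < PM10_THRESHOLD else 2


# Hypothetical concentrations, chosen to sit on both sides of the threshold.
labels = [pm10_class_label(value) for value in (48.73, 99.9, 100.0, 198.0)]
print(labels)  # [1, 1, 2, 2]
```

A value of exactly 100 falls into class 2 because the table defines the high class with "≥".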

Table 2. General model for the next day’s prediction.

Prediction	Models
PM_{10,D+1}	PM_{10,D+1} ~ DT(CO_D, PM_{10,D}, NO_{2,D}, SO_{2,D}, O_{3,D}, T_D, RH_D, WS_D)
	PM_{10,D+1} ~ BRT(CO_D, PM_{10,D}, NO_{2,D}, SO_{2,D}, O_{3,D}, T_D, RH_D, WS_D)
	PM_{10,D+1} ~ RF(CO_D, PM_{10,D}, NO_{2,D}, SO_{2,D}, O_{3,D}, T_D, RH_D, WS_D)
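
The general model in Table 2 pairs the eight predictors observed on day D with the PM₁₀ class of day D+1. The sketch below shows one way such training pairs could be assembled. It assumes a gap-free sequence of consecutive daily records; the dictionary keys, the function name, and the sample values are hypothetical, and the class threshold of 100 is the one given in Table 1.

```python
# Predictor order follows Table 2; the key spellings are an assumption.
PREDICTORS = ("CO", "PM10", "NO2", "SO2", "O3", "T", "RH", "WS")


def make_next_day_pairs(days, threshold=100.0):
    """Pair day-D predictors with the class label of day D+1's PM10."""
    pairs = []
    for today, tomorrow in zip(days, days[1:]):
        features = [today[name] for name in PREDICTORS]
        label = 1 if tomorrow["PM10"] < threshold else 2
        pairs.append((features, label))
    return pairs


# Three invented consecutive days, for illustration only.
days = [
    {"CO": 900.0, "PM10": 45.0, "NO2": 15.0, "SO2": 0.9, "O3": 29.0, "T": 31.0, "RH": 92.0, "WS": 9.5},
    {"CO": 1500.0, "PM10": 120.0, "NO2": 22.0, "SO2": 1.4, "O3": 35.0, "T": 33.0, "RH": 85.0, "WS": 6.0},
    {"CO": 950.0, "PM10": 60.0, "NO2": 14.0, "SO2": 0.8, "O3": 27.0, "T": 30.5, "RH": 93.0, "WS": 10.0},
]

pairs = make_next_day_pairs(days)
print([label for _, label in pairs])  # [2, 1]
```

The last day contributes no pair because its following day is not observed, so n daily records yield n − 1 training examples.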

Table 3. The descriptive statistics of independent variables (IVs).

IV	Minimum	Maximum	Mean	Standard Deviation	Skewness	Kurtosis
WS	0.01	360	9.50	8.04	31.08	1346.58
T	22.89	37.5	31.36	2.33	−0.63	0.21
RH	3.22	100.2	91.86	6.80	−5.92	59.73
SO₂	0	71.4	0.90	1.54	19.54	836.04
NO₂	0	63	15.15	6.23	1.22	4.16
CO	180	21712	926.37	475.30	16.44	689.80
O₃	0.9	69	29.21	10.99	0.23	−0.05
PM₁₀	12	198	48.73	18.11	1.72	6.22
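
As a reading aid for the skewness and kurtosis columns in Table 3, the following sketch computes both from central moments. It is an assumption that the reported kurtosis is excess kurtosis (zero for a normal distribution); the authors' software may also apply small-sample corrections that this sketch omits, so it is not expected to reproduce the table exactly. The sample values are invented.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis, without bias correction."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


# A symmetric sample has zero skewness.
symmetric = skewness_and_excess_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])

# One large outlier pulls the right tail out and gives positive skewness.
right_tailed = skewness_and_excess_kurtosis([1.0, 1.0, 1.0, 10.0])
print(symmetric, right_tailed)
```

Large positive values in both columns, as for WS, SO₂, and CO, point to a long right tail dominated by a few extreme observations.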

Table 4. The best splitting criteria for the decision tree.

Splitting Criteria	Accuracy	Sensitivity	Specificity	Precision
Gain ratio	95.81	92.59	99.04	98.98
Information gain	96.52	95.78	97.25	97.21
Gini index	96.52	95.02	98.02	97.69
Accuracy	93.39	95.53	91.25	91.61

Table 5. The best distribution for the boosted regression tree.

Distribution	Accuracy	Sensitivity	Specificity	Precision
Multinomial	98.12	97.51	98.72	98.71
Bernoulli	98.12	97.75	98.66	98.64

Table 6. The best splitting criteria for the random forest.

Splitting Criteria	Accuracy	Sensitivity	Specificity	Precision
Gain ratio	97.03	94.44	99.62	99.6
Information gain	98.37	97.19	99.55	99.54
Gini index	98.27	97	99.55	99.54
Accuracy	94.5	96.68	92.33	92.65

Table 7. Performance comparison for decision tree (DT), boosted regression tree (BRT), and random forest (RF).

Models	Accuracy	Sensitivity	Specificity	Precision
DT	96.52	95.78	97.25	97.21
RF	98.37	97.19	99.55	99.54
BRT	98.12	97.51	98.72	98.71
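
The four measures reported in Tables 4 to 7 are standard functions of a binary confusion matrix. The sketch below uses the usual definitions and returns percentages. Which class the authors treated as "positive" is not stated here, so that choice, the function name, and the counts are all assumptions made for illustration.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return accuracy, sensitivity, specificity and precision in percent."""
    total = tp + fn + tn + fp
    return {
        "accuracy": 100.0 * (tp + tn) / total,
        "sensitivity": 100.0 * tp / (tp + fn),  # true positive rate
        "specificity": 100.0 * tn / (tn + fp),  # true negative rate
        "precision": 100.0 * tp / (tp + fp),    # positive predictive value
    }


# Hypothetical counts for a balanced test set of 200 cases.
metrics = classification_metrics(tp=90, fn=10, tn=95, fp=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

When both classes have the same number of cases, accuracy equals the mean of sensitivity and specificity. The rows of Table 7 fit that pattern (for RF, (97.19 + 99.55) / 2 = 98.37), which would be consistent with evaluation on a class-balanced set.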

Table 8. Performance comparison and the results gained from other researchers.

Reference	Method	Accuracy	Sensitivity	Specificity	Precision
[31]	Multilayer Perceptron (MLP)	98.1	-	-	98
	Support vector machines (SVM)	92.5	-	-	92
	Naïve Bayes	91.25	-	-	90
[29]	Logistic Regression (LR)	98	97	98	97
	Naïve Bayes	97	98	97	93
	K-nearest Neighbour (KNN)	97	97	98	97

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

