The Role of Machine Learning in Enhancing Particulate Matter Estimation: A Systematic Literature Review
Abstract
1. Introduction
- RQ1: What are the benefits of using ML-based models to estimate PM concentrations?
- RQ2: What are the current solutions that employ ML-based models for estimating PM concentrations?
- RQ3: What are the research gaps and future directions for estimating PM concentrations using ML-based models?
- We present a Systematic Literature Review (SLR) of recent advances in applying ML models to improve the accuracy of PM2.5 and PM10 concentration estimates. This review covers studies published from 2018 to 2024, ranging from those focused on individual ML models to those exploring ensemble learning models.
- We explore the primary challenges of using a specific type of training dataset in ML-based PM estimation models.
- We provide a comprehensive assessment of the state of the art in leveraging ML for improved air quality monitoring and estimation of PM2.5 and PM10, using key quality criteria such as feature importance analysis, residual analysis, temporal and spatial consistency, and cross-validation.
- We outline future directions that could enhance the accuracy of PM2.5 and PM10 estimation.
2. Background
2.1. Air Pollution Modeling
2.1.1. Traditional Statistical Models
- Linear Mixed-Effect (LME) models: The ability of LME models to handle complex hierarchical data structures and account for both fixed and random effects makes them an effective tool for estimating PM concentrations. LME models incorporate fixed effects, which represent systematic influences of predictors such as geographical features or meteorological variables, along with random effects that capture variability at different levels of the data hierarchy, such as temporal or spatial variations. This approach enhances estimation accuracy by capturing the inherent variability and correlation structures within the data [23,37,38].
- Generalized Additive Models (GAMs): GAMs are a semi-parametric extension of generalized linear models (GLMs). They are particularly effective at capturing intricate, nonlinear, and non-monotonic relationships among variables. Specifically, GAMs are highly useful for estimating PM concentrations, modeling spatial patterns, and identifying key drivers of air pollution. Their ability to accommodate nonlinear dynamics helps address the intricacies of pollutant dispersion and chemical interactions that influence air quality. Typically, GAMs use an identity link function with a Gaussian error distribution, offering greater flexibility in modeling the relationships between predictors and PM levels, thereby improving interpretability [39,40]. Additionally, GAMs apply to both cross-sectional and longitudinal data, providing a comprehensive understanding of spatial and temporal variations. For example, cross-sectional data capture PM concentrations across various locations at a single point in time, while longitudinal data track PM levels at the same location over an extended period, allowing for the analysis of temporal trends and long-term changes.
- Spatio-Temporal Mixed Effect Model (STMEM): STMEMs are designed for data that vary across both space and time, making them highly effective for estimating PM concentrations. These models incorporate spatial and temporal correlations to account for geographic variability and time-based changes, such as seasonal patterns and pollution events. By using random effects to capture these dynamics, STMEMs offer a robust framework for analyzing complex environmental patterns and improving the accuracy of estimates, which supports public health and air quality management. However, their complexity can make implementation and interpretation challenging [41].
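As a hedged illustration of the fixed- and random-effects structure described for LME models above (not taken from any reviewed study), a model with a day-specific random intercept and slope for a single predictor could be written as follows. The choice of aerosol optical depth (AOD) as the predictor, the grouping by day $t$, and all symbols are assumptions made for this sketch.

```latex
\mathrm{PM}_{s,t} = (\beta_0 + u_t) + (\beta_1 + v_t)\,\mathrm{AOD}_{s,t} + \varepsilon_{s,t},
\qquad
\begin{pmatrix} u_t \\ v_t \end{pmatrix} \sim \mathcal{N}(\mathbf{0}, \Sigma),
\qquad
\varepsilon_{s,t} \sim \mathcal{N}(0, \sigma^2)
```

Here $\beta_0$ and $\beta_1$ are the fixed effects shared by all days, $u_t$ and $v_t$ are the random deviations for day $t$, and $\varepsilon_{s,t}$ is the residual error at site $s$ on day $t$.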
2.1.2. Machine Learning (ML)-Based Models
- Traditional Machine Learning (ML)-based Models: Different models offer an effective basis for air quality estimation. The effectiveness of these models depends on the quality of the input features. Some commonly used models include the following:
- Support Vector Machines (SVMs) model: SVMs estimate PM levels by leveraging their ability to find optimal hyperplanes in high-dimensional spaces. The process begins with collecting relevant input features that influence PM concentrations, such as meteorological data (temperature, humidity, wind speed), geographical information, and PM measurements. These features are then mapped into a higher-dimensional space using kernel functions, such as linear, polynomial, or radial basis function (RBF) kernels, which capture the underlying patterns in the data. The SVM algorithm minimizes estimation error while maximizing the margin between estimated values and actual PM concentrations by finding a hyperplane that fits the training data. The final output of an SVM model is a continuous numerical value that represents the estimated concentration of PM in the air [10,27,45] (see Figure 1).
Figure 1. General architecture of an SVM model [46].
- Decision Tree (DT) model: The DT algorithm aims to model PM concentrations using various independent variables (e.g., meteorological data and satellite observations). It follows a recursive partitioning process where the tree is made up of decision nodes and terminal leaves. For PM estimation, the algorithm uses standard deviation reduction to determine optimal splits, starting at the root node, based on the most significant variable affecting PM levels. Each split minimizes the sum of squared errors (SSE) to reduce estimation errors. This splitting continues until a termination criterion is met. The final nodes, known as leaf nodes, provide the estimated values for PM concentrations, allowing for effective air quality assessments [9,26,47,48,49]. Figure 2 illustrates the general structure of a standard DT.
Figure 2. General architecture of a DT model [48].
- K-Nearest Neighbor (KNN) model: The KNN model is widely used for estimating PM concentrations. It works by measuring the distance between data points using metrics such as Euclidean or Mahalanobis distance to identify the closest neighbors in the dataset. The choice of k, representing the number of nearest neighbors to consider, is crucial. Selecting the optimal k value helps mitigate overfitting while improving the model’s generalization capabilities. Common methods to determine the optimal value of k include cross-validation, grid search, and using the square root of N (where N is the total number of samples) [50]. When a new data point is introduced, KNN calculates its distance to all training data points to find the k nearest neighbors. The estimated PM concentration for the new point is then determined by averaging the concentrations of these neighbors. This approach effectively captures patterns in environmental data, enabling reliable PM level estimations based on historical observations and spatial relationships among data points [26,32,51,52,53]. Figure 3 illustrates the general structure of a KNN model.
Figure 3. General architecture of a KNN model [53].
- Artificial Neural Networks (ANNs) model: ANNs provide a robust framework for estimating PM concentrations, effectively capturing the complex, non-linear relationships inherent in air quality data. ANNs consist of an input layer that collects data from various sources, including meteorological variables and pollutant concentrations. These data pass through one or more hidden layers, where the model learns complex relationships between the inputs and the target output. The output layer generates a single value that indicates the estimated PM concentration for a given time and location (see Figure 4).
Figure 4. Basic structure of an ANN [54].
- DL-based Models: DL algorithms are well-suited for capturing complex, non-linear relationships. They are particularly effective for developing estimation models for PM concentrations. They can effectively analyze and interpret the relationships between meteorological data and PM levels as follows:
- Multi-layer Perceptron (MLP) Neural Network model: The MLP model is effective in estimating PM concentrations. It has a layered structure, consisting of multiple interconnected layers of neurons (See Figure 5). Each node processes input data via weighted connections and uses activation functions to introduce non-linearity. The structure contains an input layer that receives temporal variables (date and time) and meteorological variables (temperature, humidity, and wind speed) that act as explanatory variables. The hidden layers enable the model to learn complex patterns and relationships within the data, and the output layer produces the estimated PM concentration [49,55].
- Convolutional Neural Network (CNN) model: CNNs have been widely used in image data processing [56,57]. They have also improved the estimation accuracy of PM concentrations in different regions, such as the United States [28] and Kaohsiung [9]. CNNs use a structured approach, alternating between convolutional and pooling layers (See Figure 6). The convolutional layers extract spatial features from input data, including air quality measurements and meteorological variables. These layers apply filters or kernels to perform convolution operations, which produce feature maps that highlight important patterns associated with PM levels. The pooling layers reduce the size of the convolved features, decreasing the computational resources needed to process the data. This integration of convolutional and pooling layers enables CNNs to effectively learn and estimate PM concentrations from intricate environmental datasets [9,28,58].
- Deep belief-Back Propagation Network model: This prediction model integrates a deep belief neural network with a Back Propagation (BP) neural network. It is a hybrid approach that combines multiple unsupervised Restricted Boltzmann Machines (RBMs) with supervised BP networks to predict pollutant concentrations, specifically PM2.5 and PM10. As illustrated in Figure 7, this architecture comprises an input layer with 29 nodes dedicated to capturing relevant features of the PM, while the output layer consists of a single node that predicts concentration values. The total number of layers in the network is variable, denoted as n, allowing for flexibility in model complexity; each layer is formed by stacking RBMs followed by BP networks, which enhances the model’s ability to learn intricate patterns in the data [59].
- Ensemble Learning-based Models: Ensemble learning is an ML approach that combines multiple models to improve accuracy and reduce overfitting in estimating PM concentrations [42]. It includes techniques like bagging, boosting, and stacking. Bagging trains several models on different subsets of data and averages their estimations [60]. Boosting trains multiple models sequentially, with each new model correcting the errors of its predecessor [61]. Stacking uses different models and combines their outputs through a meta-learner for final estimations [62]. Examples of these models include the following:
- Random Forest (RF) model: RF is an effective ensemble learning model for estimating PM concentrations. This model generates multiple decision trees to improve the estimation accuracy. The input variables are usually meteorological and environmental parameters like temperature, humidity, wind speed, atmospheric pressure, and PM levels. Each decision tree is built using a bootstrap sample from the training dataset. This allows each tree to be trained on a unique subset of data. The remaining data are then used to estimate the error for that tree. At each node of the decision trees, it selects a random subset of independent variables to determine the best split, promoting tree diversity and reducing overfitting. The final PM concentration is estimated by averaging the outputs of all trees, providing a robust estimate that captures complex environmental interactions [23,49,53,63,64]. Figure 8 shows the general structure of a random forest regressor.
- Extreme Gradient Boosting (XGBoost) model: The XGBoost model is highly effective for estimating PM concentrations. The process begins by training an initial decision tree on a randomly chosen subset of data to estimate PM levels. The model then calculates the residuals, which represent the differences between the estimated and actual PM concentrations. These residuals are used to train the subsequent trees, with each new tree aiming to correct the errors of the previous ones. This iterative approach continues by updating the model parameters to enhance the objective function. The objective function is divided into two parts: the loss function (L), which measures estimation error, and a regularization term that penalizes complexity to prevent overfitting. By incorporating various input features, such as atmospheric data (temperature, humidity, and wind speed) and aerosol optical depth (AOD), XGBoost effectively captures the complex relationships and interactions influencing PM concentrations. The final PM estimation in XGBoost is calculated by summing the estimations from all individual trees in the ensemble [42,45,65,66,67]. This results in enhanced accuracy of estimations across different spatial and temporal contexts (See Figure 9).
- Light Gradient Boosting Machine (LightGBM) model: This model performs exceptionally well at modeling complex, non-linear relationships between PM concentrations and various environmental variables. The algorithm constructs a decision tree using input features such as traffic patterns, meteorological data, and PM measurements. It uses a gradient boosting approach, where each subsequent tree corrects the errors of the previous ones. LightGBM speeds up training by applying a histogram-based method that bins continuous features into discrete intervals to efficiently calculate potential split points. The splitting in LightGBM follows a leaf-wise approach, selecting the leaf node with the maximum gain to grow and prioritizing the most informative splits to reduce estimation error. This process continues until a stopping criterion, such as a set number of trees or achieving a sufficient level of accuracy, is reached. The final estimation for PM is calculated as the sum of the estimations from all the individual trees in the model [67,68,69]. Figure 10 shows the structure of the LightGBM model.
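The residual-fitting loop that the XGBoost and LightGBM descriptions above share can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of either library: it uses one predictor, depth-one trees (stumps), squared-error loss, and no regularization term. The function names, the `learning_rate` value, and the humidity and PM numbers are all hypothetical.

```python
from statistics import mean

def fit_stump(x, residuals):
    """Return (threshold, left_mean, right_mean) minimizing the SSE of the residuals.

    Assumes x contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:  # drop the max so the right side is never empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean

def fit_boosting(x, y, n_trees=20, learning_rate=0.3):
    """Start from the mean, then fit each new stump to the current residuals."""
    base = mean(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base, learning_rate, stumps

def predict_boosting(model, xi):
    """Final estimate: base value plus the scaled sum of all stump outputs."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, xi) for s in stumps)

humidity = [30, 40, 50, 60, 70, 80]        # hypothetical predictor values
pm = [12.0, 15.0, 14.0, 30.0, 33.0, 35.0]  # hypothetical PM observations
model = fit_boosting(humidity, pm)
baseline_sse = sum((y - mean(pm)) ** 2 for y in pm)
boosted_sse = sum((y - predict_boosting(model, x)) ** 2
                  for x, y in zip(humidity, pm))
```

Comparing `boosted_sse` with `baseline_sse` shows how much the sequence of stumps reduces the training error relative to predicting the mean. Production libraries add the regularization term, multi-feature trees, and many optimizations that this sketch leaves out.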
2.2. Model Evaluation Metrics
3. Methodology
3.1. Planning Phase
- Identify the need for the review
- Specifying the review objectives
3.2. Conducting Phase
- Step 1: Study selection: In the initial step, a search strategy was implemented to identify all relevant studies aligned with our research objectives. Specifically, a two-step procedure outlining the methodology for sourcing relevant literature using search terms was executed.
- Initially, three keyword groups were identified by taking into account alternative spellings of the terms using the following approach:
- Defining the keywords relevant to the expansive scope of the research, such as air pollutants, particulate matter estimation, PM2.5, and PM10.
- Specifying the keywords about enhancing the accuracy of PM2.5 and PM10 estimation using alternative technologies: artificial intelligence and machine learning.
- Narrowing down the research scope by selecting terms associated with the proposed solution type, such as Random Forest (RF), XGBoost, Convolutional Neural Networks (CNNs), Deep Learning (DL), Artificial Neural Networks (ANNs), and Support Vector Machines (SVMs).
- Second, ten digital libraries were chosen: Springer, MDPI, Elsevier, Aerosol and Air Quality Research, IOP Science, ACS Publication, Nature Environment and Pollution Technology (NEPT), Europe PMC, and Earth System Science Data (ESSD). Subsequently, the Boolean operators OR and AND were utilized to apply the keywords to these libraries. OR was employed between terms within each group, while AND connected keywords across different groups.
- Step 2: Filter the search results: During this step, the papers were refined from the search results to pinpoint thematically relevant studies essential for addressing the research questions of this SLR. Inclusion and exclusion criteria were established (see Table 3). The steps taken in the selection and filtration of this SLR are as follows:
- Implementing our inclusion and exclusion criteria.
- Eliminating any duplicate articles that have been found across multiple libraries.
- Identifying additional relevant articles by searching the reference lists of the selected articles.
- Step 3: Data extraction:
Study No. | Ref. | Year | Study Period | Study Location | Measured Parameter |
---|---|---|---|---|---|
S1 | [35] | 2018 | 2005–2015 | United States | |
S2 | [71] | 2018 | 2000–2015 | Butler, Hamilton, Warren, Clermont, Campbell, Kenton, Boone | |
S3 | [42] | 2018 | 2008–2017 | China | |
S4 | [72] | 2018 | 2014–2016 | China | |
S5 | [22] | 2019 | 2000–2015 | United States | |
S6 | [73] | 2019 | 2013–2015 | Italy | , |
S7 | [23] | 2020 | 1 July–30 June 2018 | Indo-Gangetic Plain | |
S8 | [28] | 2020 | 2011 | conterminous United States | |
S9 | [26] | 2020 | 2015–2017 | Beijing–Tianjin–Hebei (BTH) region | |
S10 | [74] | 2020 | 2005–2016 | Sweden | , , PM2.5–10 |
S11 | [75] | 2021 | 2008–2016 | Coastal site in the Eastern Mediterranean | |
S12 | [27] | 2021 | 2018–2019 | Malaysia | |
S13 | [76] | 2021 | 2016–2020 | Beijing | |
S14 | [1] | 2021 | 2018 | Thailand | |
S15 | [68] | 2021 | 2018 | China | |
S16 | [63] | 2021 | 2014–2018 | Texas | |
S17 | [43] | 2021 | 24 March–31 May 2020 | Kolkata metropolitan city | |
S18 | [77] | 2021 | 2014–2018 | Malaysia | , |
S19 | [10] | 2021 | February–May 2019 | Algiers | , , , and |
S20 | [78] | 2022 | 2018–2020 | Guanzhong Urban Agglomeration, China | |
S21 | [32] | 2022 | 2018 | Continental United States | |
S22 | [44] | 2022 | 2018–2019 | China | |
S23 | [79] | 2022 | 2018–2019 | China | |
S24 | [9] | 2023 | 2021 | Taiwan | |
S25 | [66] | 2023 | 2011–2020 | Thailand | |
S26 | [18] | 2023 | 2019 | India | |
S27 | [51] | 2024 | 2019–2021 | Tuzla Canton, Bosnia and Herzegovina (BiH) | |
S28 | [60] | 2024 | 2020 | Mexico City | |
S29 | [47] | 2024 | 2000–2019 | South Coast Air Basin of California | |
S30 | [80] | 2024 | 2014–2021 | China | |
S31 | [81] | 2024 | 2013–2021 | China | |
S32 | [82] | 2024 | 2020 | China |
Study No. | Methods | Evaluation Metrics | Estimation Target | Data |
---|---|---|---|---|
S1 | RF | OOB | Ground measurements of constituents, GEOS-Chem simulated constituents, meteorological data, land use and population data, spatial and temporal indicators. | |
S2 | RF | CV , RMSE, MAE | measurements, aerosol optical depth data, meteorological data, land use data, spatiotemporal features. | |
S3 | RF, generalized additive model and extreme gradient boosting, generalized additive ensemble model | CV , RMSE, MAE | measurements, MODIS AOD, meteorological data, land use data, Modern-Era Retrospective analysis for Research and Applications, Version 2 (MERRA-2) reanalysis data, visibility data. | |
S4 | RF | adjusted , RMSE, regression slope, coefficients | In situ measurements of , satellite-retrieved AOD data, meteorological data, land cover data, MODIS active fire data, high-resolution elevation data. | |
S5 | Ensemble learning model | 10-fold CV , RMSE, bias, slope | monitoring data, AOD measurements and related satellite data, meteorological conditions, land use variables, chemical transport model predictions. | |
S6 | RF | 10-fold CV , Root Mean Squared Percentage Error (RMSPE), intercepts, slope | , | PM monitored data, AOD data, meteorological parameters |
S7 | LME model, RF model | , RMSE, Relative Prediction Error (RPE), Mean Prediction Error (MPE), slope (b), and intercept (a) | Ground-based Measurements, MODIS MAIAC products, auxiliary data, meteorological data. | |
S8 | CNN | , RMSPE, MPE, slope | Ground-truth measurement data, MODIS AOD and GEOS-Chem AOD. | |
S9 | Decision tree, RF, bagging, GBRT, KNN, and Support Vector Regression (SVR) | Correlation coefficient (R), RMSE | Ground-level concentration, Himawari-8 AOD, AERONET AOD, GEOS-Chem AOD. | |
S10 | RF | CV | Satellite data, atmospheric composition variables, land use terms, meteorological parameters, population density | |
S11 | Pattern recognition neural network (PRNN) model | R, RMSE, relative mean bias (RMB), expected error (EE) envelope, mean square error (MSE), mean absolute percentage error (MAPE), mean absolute error (MAE) | Different gap-filled AOD datasets, observations, auxiliary data. | |
S12 | RF, SVR | , RMSE, MBE, Nash–Sutcliffe Efficiency (NSE) | Ground measured air pollutants, satellite AOD observations, meteorological parameters | |
S13 | Multilayer perceptron (MLP) neural network analysis | , | gaseous air pollutants, meteorological parameters, daily ambient data | |
S14 | Machine learning algorithm (MLA) | R, slope, intercept, bias, RMSE | MERRA-2 Reanalysis data, Surface data, meteorological parameters. | |
S15 | LightGBM model | , RMSE, MAE | Ground monitoring data, meteorological data | |
S16 | RF algorithm, multiple linear regression (MLR), mixed effects model (MEM) | CV R, mean absolute bias (MAB), mean bias (MB) | EPA surface data, Satellite AOD, meteorological data, MERRA-2 reanalysis data, elevation data, normalized difference vegetation Index (NDVI) 16-day data, land use variables | |
S17 | MLR, artificial neural network (ANN) models | , RMSE, MAE | concentration of data, daily meteorological data | |
S18 | Multiple linear regression (MLR), random forest regression (RFR), extra tree regression (ETR), decision tree regression with AdaBoost (BTR) | , RMSE | , | concentration, concentration, meteorological data. |
S19 | Hybrid dragonfly-SVM algorithm | , RMSE, MAE, MSE, NRMSE, MAPE % | , , , | The hourly data of conventional fractions (, , , and ), weather factors (temperature, pressure, and relative humidity) |
S20 | RF-XGBoost | , RMSE, MAE | Ground measurements, MODIS AOD, auxiliary data, meteorological conditions. | |
S21 | Regression models, stochastic gradient descent, k-nearest neighbor (KNN), adaptive boosting (AdaBoost), Gradient Boost (GB), Extreme Gradient Boost (XGB), SVM, RF | SR, RMSE, MB | MERRA-2 data, ground station data, meteorological and aerosol parameters | |
S22 | Deep Forest (DF) | CV , RMSE, MAE, EE | FY-4A TOAR data, hourly atmospheric observation data, meteorological parameters, geographic information, time variables. | |
S23 | DF | CV , RMSE, MAE, bias | and AOD data, auxiliary data | |
S24 | CNN–RF | , RMSE, MAE, MSE | Five meteorological parameters, four spatiotemporal elements, eight air pollution factors (CO, , NO, , , , ) | |
S25 | Multiple Linear Regression (MLR), RF, XGBoost, SVM | , RMSE | data, satellite data | |
S26 | Individual and stacking models (XGB, RF, LGBM, ridge, lasso) | , RMSE, MB, MAE | Ground data, MERRA-2 reanalysis data | |
S27 | XGBoost, KNN, and Naive Bayes (NB) | Accuracy, precision, and Area Under the ROC Curve (AUC) | concentration data, remote sensing data (USGS landsat 8 collection 2 tier 1 and real-time data raw scenes). | |
S28 | RF | Air pollutant concentration data, meteorological data | ||
S29 | Decision tree, RF, SVM, SVR, k-nearest neighbor, neural network, Gaussian process regression | , RMSE, cross-validation | meteorological factors, estimated emissions, large-scale climate indices | |
S30 | LSTM neural networks, RF regression models | , RMSE, MAE | data, MODIS AOD product, auxiliary data, meteorological variables, land use-related variables | |
S31 | ResNet model | , RMSE | Testbed dataset | |
S32 | Categorical Boosting (CatBoost) model | , RMSE | Geographical data, nighttime light data, meteorological data, aerosol optical depth products, ground-based measurements. |
- Step 4: Quality assessment: To evaluate the chosen studies based on the research questions, we selected a set of quality assessment metrics for the ML-based models used to estimate PM2.5 and PM10 concentrations. These metrics ensure the models are accurate, reliable, and suitable for practical applications, such as air quality monitoring and environmental management. The four quality assessment metrics are as follows:
- Feature importance analysis: determine which features—such as meteorological data, geographic information, and temporal factors—contribute most significantly to estimating PM2.5 and PM10 levels. Understanding feature importance supports model refinement and improves accuracy.
- Residual analysis: examine the residuals (differences between estimated and actual values) to determine patterns or biases in the estimations made by the model. This analysis can help identify areas where the model may be underperforming or where improvements can be made.
- Temporal and spatial consistency: verify that the model estimations match the data that have been observed in terms of both timing and space.
- Cross-validation: assess the model’s performance across several subsets of the dataset, which offers insight into its robustness.
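Two of the criteria above, residual analysis and cross-validation, can be combined in a short sketch. The example below is illustrative only: it assumes a single predictor, a simple least-squares line as the model, and a deterministic fold assignment. The AOD and PM values are hypothetical and all function names are introduced here for illustration.

```python
from math import sqrt
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for one predictor: y is approximated by a + b * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b

def k_fold_indices(n, k):
    """Assign sample i to fold i % k (no shuffling, so results are reproducible)."""
    return [[i for i in range(n) if i % k == fold] for fold in range(k)]

def cross_validate(x, y, k=3):
    """Return held-out residuals plus the cross-validated RMSE and R^2."""
    residuals = [0.0] * len(y)
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_x = [x[i] for i in range(len(x)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        a, b = fit_line(train_x, train_y)
        for i in test_idx:
            residuals[i] = y[i] - (a + b * x[i])  # observed minus estimated
    ss_res = sum(r ** 2 for r in residuals)
    ss_tot = sum((yi - mean(y)) ** 2 for yi in y)
    return residuals, sqrt(ss_res / len(y)), 1 - ss_res / ss_tot

aod = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]          # hypothetical predictor
pm = [11.0, 14.5, 19.0, 22.5, 27.0, 29.5, 35.0, 38.5, 42.0]  # hypothetical PM values
residuals, cv_rmse, cv_r2 = cross_validate(aod, pm)
```

Plotting `residuals` against time, location, or a predictor is one way to look for the systematic patterns or biases that residual analysis is meant to reveal.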
3.3. Reporting Phase
- Dissemination strategy identification: determining the most suitable approach to share the outcomes of our review with the relevant audience. This involves strategizing the best methods and channels for effectively disseminating the review findings.
- Report formatting: focusing on formatting the report to present our review findings in a clear, concise, and organized manner. This ensures the information is conveyed in a reader-friendly and easily understandable format.
- Report evaluation: to ensure the quality and effectiveness of the report, the evaluation process is conducted. This involves critically reviewing the content, coherence, and adherence to the objectives of the review. The goal is to validate the integrity and impact of the reported findings.
4. Analysis and Discussion
4.1. RQ1: What Are the Benefits of Using ML-Based Models to Estimate PM Concentrations?
4.2. RQ2: What Are the Existing Solutions to Estimate PM Concentrations Using ML-Based Models?
4.2.1. Traditional Machine Learning (ML)-Based Models
- Decision Tree (DT) model: the decision tree (DT) models offer several benefits for estimating PM concentrations, including:
- Support Vector Machine (SVM) model: The algorithms of SVMs have also been utilized to build models for estimating regions with high levels of PM concentration. SVMs are a useful technique for classification, pattern recognition, and functional regression problems [95]. SVMs are an excellent choice for modeling the complexities involved in PM concentrations because they can effectively handle variables with nonlinear relationships, such as geographic features, emissions, and weather conditions [47].
- Artificial Neural Networks (ANNs) model: ANNs are non-linear computational algorithms that simulate the natural neural network of the human nervous system to make decisions and arrive at conclusions [96]. Researchers have leveraged ANN algorithms as cost-effective methods for constructing models that estimate PM levels, calculating concentrations from easily sensed data [97,98].
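To make the layered structure of an ANN concrete, the following sketch runs a forward pass through one hidden layer and a single output node. It is an illustration only: the weights are hand-picked placeholders rather than trained values, and the three scaled input features are assumptions.

```python
from math import tanh

def dense(inputs, weights, biases, activation=None):
    """One fully connected layer: weighted sum per neuron, then optional activation."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * v for w, v in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs

def forward(features, params):
    """Input layer -> hidden layer (tanh) -> single linear output node."""
    hidden = dense(features, params["w_hidden"], params["b_hidden"], activation=tanh)
    return dense(hidden, params["w_out"], params["b_out"])[0]

# Hypothetical, untrained parameters: 3 inputs, 2 hidden neurons, 1 output.
params = {
    "w_hidden": [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]],
    "b_hidden": [0.0, 0.1],
    "w_out": [[10.0, 5.0]],
    "b_out": [20.0],
}
features = [0.2, 0.6, -0.1]  # assumed scaled temperature, humidity, wind speed
estimate = forward(features, params)
```

In practice the weights would be learned from monitoring data by backpropagation; only the flow of data through the layers is shown here.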
4.2.2. Deep Learning (DL)-Based Models
- Convolutional Neural Networks (CNNs) model: CNNs are designed to process grid-like data patterns. They excel in tasks like image classification and segmentation and can also handle time-series data, such as air quality measurements. Therefore, CNN algorithms are ideal for constructing estimation models [9,99].
- Pattern Recognition Neural Network (PRNN) model: The PRNN algorithm is a type of neural network that learns to find patterns in data and link those patterns to particular outcomes. When a PRNN is used to build a PM estimation model, it can identify patterns that correlate with PM levels by analyzing input data such as environmental and meteorological parameters. With the help of fresh data inputs, the network can estimate PM concentrations after learning these patterns during training, which makes it valuable for monitoring air quality [75].
- Residual Neural Network (ResNet) model: The ResNet model was capable of handling the inherent nonlinearity in atmospheric processes and demonstrated strong capabilities in estimating PM concentrations [100]. Its architecture, which utilized residual connections, allowed for improved feature extraction and adaptability to complex atmospheric data.
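The convolution and pooling operations that CNN-based estimators rely on can be shown on a one-dimensional series. This is a minimal sketch under assumptions: the kernel is fixed by hand rather than learned, and the hourly PM values are hypothetical.

```python
def conv1d(signal, kernel):
    """Valid (no padding) sliding dot product, as computed by a CNN convolutional layer."""
    width = len(kernel)
    return [
        sum(k * signal[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal) - width + 1)
    ]

def relu(values):
    """Keep positive responses and zero out the rest."""
    return [max(0.0, v) for v in values]

def max_pool1d(values, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

hourly_pm = [10.0, 12.0, 11.0, 25.0, 40.0, 38.0, 20.0, 15.0]  # hypothetical series
kernel = [-1.0, 0.0, 1.0]  # hand-picked filter that responds to rising concentrations
feature_map = relu(conv1d(hourly_pm, kernel))
pooled = max_pool1d(feature_map)
```

The feature map is largest where concentrations climb fastest, and pooling halves its length while keeping the strongest response in each window. A trained CNN would learn many such kernels from data.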
4.2.3. Ensemble Learning (EL)-Based Models
- Deep Forest (DF) model: DF models use decision trees to make independent estimations, which are then aggregated. These models also can identify the most influential features, aiding in understanding data relationships and improving the overall estimation of the model [101].
- Random Forest (RF) model: RF is an ensemble learning algorithm that builds multiple decision trees. This algorithm is used to build the estimation model. It enhances performance by introducing feature randomness and aggregating the outputs from each tree, leveraging their strengths while minimizing their shortcomings [66].
- Extreme Gradient Boosting (XGBoost) model: XGBoost algorithms are known for their superior data mining capabilities and high performance. Due to these strengths, they have been increasingly used to construct PM concentration estimation models. This has led to enhanced accuracy and reliability in these estimates [18].
- Light Gradient Boosting Machine (LightGBM) model: The LightGBM model employs a leaf-wise growth strategy with a depth constraint. It accelerates training by using a histogram-based algorithm, which reduces both training time and memory consumption. For these reasons, researchers have used the LightGBM algorithm to develop PM estimation models [69].
- Categorical Boosting (CatBoost) model: The CatBoost algorithm has gained popularity in environmental research for building PM estimation models. Its strength lies in handling regression problems with complex, periodic, non-stationary, and non-linear characteristics. These models also accommodate numerous features and noisy data, which helps achieve high accuracy [82,102].
- Noise, missing values, or inaccurate data could negatively impact model performance. Therefore, the model must be trained on high-quality, well-labeled, balanced data to generate accurate estimates.
- Important meteorological parameters, such as temperature, humidity, wind speed, and air pressure, were neglected when training EL-based models for PM concentration estimation.
- Using latitude and longitude as input features in models for estimating PM concentrations produced spatial discontinuity.
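The histogram-based split search mentioned for LightGBM in the list above can be illustrated as follows. This is a simplified sketch, not LightGBM's actual code: it assumes equal-width bins (real implementations choose bin boundaries differently), squared-error loss, and a single feature. The temperature and residual values are hypothetical.

```python
def build_histogram(feature, residuals, n_bins, lo, hi):
    """Accumulate per-bin residual sums and sample counts for one feature."""
    width = (hi - lo) / n_bins
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for value, res in zip(feature, residuals):
        b = min(int((value - lo) / width), n_bins - 1)  # clamp the top edge
        sums[b] += res
        counts[b] += 1
    return sums, counts

def best_bin_split(sums, counts):
    """Scan bin boundaries only and return (bin index, gain) of the best split."""
    total_sum, total_count = sum(sums), sum(counts)
    best_gain, best_bin = 0.0, None
    left_sum, left_count = 0.0, 0
    for b in range(len(sums) - 1):
        left_sum += sums[b]
        left_count += counts[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        right_sum = total_sum - left_sum
        # Reduction in squared error from splitting after bin b.
        gain = (left_sum ** 2 / left_count + right_sum ** 2 / right_count
                - total_sum ** 2 / total_count)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain

temperature = [5, 8, 12, 18, 22, 27, 31, 35]              # hypothetical feature
residuals = [4.0, 5.0, 3.0, 4.0, -4.0, -5.0, -3.0, -4.0]  # hypothetical residuals
sums, counts = build_histogram(temperature, residuals, n_bins=4, lo=0, hi=40)
split_bin, gain = best_bin_split(sums, counts)
```

Because candidate splits are evaluated only at bin boundaries, the cost of the search depends on the number of bins rather than the number of distinct feature values, which is the source of the speed and memory savings.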
4.2.4. Comparison of ML-Based Solutions for PM Estimation
4.3. RQ3: What Are the Research Gaps and Future Directions for Estimating PM2.5 and PM10 Concentrations Using ML-Based Models?
4.3.1. The Research Gap
4.3.2. Future Directions
- Balanced long-term historical PM dataset: Short-term datasets spanning minutes, hours, or days often caused overfitting, which decreased model accuracy. Additionally, they reduced the consistency of PM concentration estimates across periods and environmental conditions. Long-term trends were not captured, making it difficult to determine whether air quality was improving or worsening. These short datasets were also unsuitable for evaluating chronic PM exposure, which could have led to an underestimation of the long-term health risks associated with air pollution. Therefore, the development of an extensive dataset containing historical information on PM concentrations over several years or even decades is essential, since it records long-term changes in PM levels as well as trends and seasonal variations. Balanced long-term databases are necessary for epidemiological studies to evaluate population exposure to PM over an extended period, because imbalanced samples cause ML-based models to fail to provide accurate estimations of PM across the entire spatial domain. Furthermore, the inclusion of a comprehensive set of meteorological parameters, along with land use and land cover variables, is crucial for understanding PM concentration dynamics and enhancing the performance and applicability of the ML-based models. Meteorological factors such as boundary layer height can significantly influence PM concentrations by trapping pollutants near the surface during low-ventilation and inversion conditions. Similarly, changes in land use, such as urbanization or drought in arid and semi-arid regions, can enhance PM emissions. Addressing the limitations of existing datasets and developing comprehensive, balanced, long-term databases should be a high priority for the research community.
- Spatiotemporal modeling: Spatiotemporal modeling techniques will offer valuable insights into the patterns and trends of PM2.5 and PM10 concentrations. These techniques will reveal how PM levels vary across different locations and times. By incorporating spatial elements, the model will better account for regional differences in PM concentrations, such as higher levels near highways or industrial zones and lower levels in rural or forested areas. Additionally, including temporal features will enable the model to capture how PM concentrations change over time, accounting for factors such as daily cycles, seasonal variations, and long-term trends. Furthermore, using latitude and longitude as input features can introduce spatial discontinuity, so it is important to use climate similarity during training to better describe the spatial proximity of samples.
- Hybrid ML-based model: In heavily polluted metropolitan regions, PM concentration estimation solutions that use hybrid ML models are becoming more crucial. Hybrid models allow for a better understanding of the difficulties in metropolitan contexts, where pollution can fluctuate dramatically due to traffic congestion, industrial activity, and shifting weather patterns. These models provide a strong instrument for monitoring air quality, supporting efficient pollution control plans, and safeguarding public health. For instance, a hybrid dragonfly–SVM–RF model could substantially improve air quality monitoring and PM concentration estimation. The dragonfly algorithm would be used for optimization tasks such as feature selection or parameter tuning. By combining it with the estimation power of SVM and the ensemble capabilities of RF, the hybrid dragonfly–SVM–RF model may achieve superior accuracy in estimating PM concentrations compared to individual models. This combination may allow the model to capture non-linear relationships within the data, providing a more comprehensive analysis.
- Site-based and time-based cross-validation: In site-based cross-validation, a model will be trained using data from several monitoring stations and tested using data from additional, unseen stations. This technique will support the assessment of the model's generalizability to new locations. In time-based cross-validation, the model will be trained on data from specific years and tested on data from different years. This will ensure that the model estimates remain consistent over time and can accurately estimate future trends based on historical data.
Based on these directions, future research can further enhance the understanding and estimation of air pollution, ultimately supporting more effective air quality management.
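The site-based and time-based validation schemes described in the last direction can be illustrated with a minimal sketch. It uses only the Python standard library, and the record layout (`site`, `year`, `pm25`), the function names, and the toy data are assumptions made for illustration rather than anything taken from the reviewed studies.

```python
def site_based_folds(records, n_folds):
    """Yield (train, test) splits in which no station appears on both sides."""
    sites = sorted({r["site"] for r in records})
    if not 2 <= n_folds <= len(sites):
        raise ValueError("n_folds must be between 2 and the number of sites")
    for k in range(n_folds):
        held_out = set(sites[k::n_folds])  # stations reserved for testing in fold k
        train = [r for r in records if r["site"] not in held_out]
        test = [r for r in records if r["site"] in held_out]
        yield train, test


def time_based_split(records, first_test_year):
    """Train on years before first_test_year, test on that year and later."""
    train = [r for r in records if r["year"] < first_test_year]
    test = [r for r in records if r["year"] >= first_test_year]
    return train, test


# Hypothetical toy data: 4 stations x 3 years, one placeholder record each.
records = [
    {"site": site, "year": year, "pm25": 0.0}
    for site in ("A", "B", "C", "D")
    for year in (2019, 2020, 2021)
]

for train, test in site_based_folds(records, n_folds=2):
    train_sites = sorted({r["site"] for r in train})
    test_sites = sorted({r["site"] for r in test})
    print("site-based:", train_sites, "->", test_sites)

train, test = time_based_split(records, first_test_year=2021)
print("time-based:", len(train), "train records,", len(test), "test records")
```

In both schemes the split is made on the grouping variable (station or year) rather than on individual samples, so the test set contains only stations or years the model has never seen.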
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Gupta, P.; Zhan, S.; Mishra, V.; Aekakkararungroj, A.; Markert, A.; Paibong, S.; Chishtie, F. Machine learning algorithm for estimating surface PM2.5 in Thailand. Aerosol Air Qual. Res. 2021, 21, 210105.
- Air Pollution. 2024. Available online: https://www.who.int/health-topics/air-pollution#tab=tab_1 (accessed on 13 July 2024).
- Alamoudi, M.; Taylan, O.; Keshtegar, B.; Abusurrah, M.; Balubaid, M. Modeling sulphur dioxide (SO2) quality levels of Jeddah City using machine learning approaches with meteorological and chemical factors. Sustainability 2022, 14, 16291.
- Kampa, M.; Castanas, E. Human health effects of air pollution. Environ. Pollut. 2008, 151, 362–367.
- Kim, D.; Chen, Z.; Zhou, L.F.; Huang, S.X. Air pollutants and early origins of respiratory diseases. Chronic Dis. Transl. Med. 2018, 4, 75–94.
- Brunekreef, B.; Holgate, S.T. Air pollution and health. Lancet 2002, 360, 1233–1242.
- Cohen, A.J.; Ross Anderson, H.; Ostro, B.; Pandey, K.D.; Krzyzanowski, M.; Künzli, N.; Gutschmidt, K.; Pope, A.; Romieu, I.; Samet, J.M.; et al. The global burden of disease due to outdoor air pollution. J. Toxicol. Environ. Health Part A 2005, 68, 1301–1307.
- Künzli, N.; Tager, I.B. Air pollution: From lung to heart. Swiss Med. Wkly. 2005, 135, 697–702.
- Chen, M.H.; Chen, Y.C.; Chou, T.Y.; Ning, F.S. PM2.5 Concentration Prediction Model: A CNN–RF Ensemble Framework. Int. J. Environ. Res. Public Health 2023, 20, 4077.
- Ibrir, A.; Kerchich, Y.; Hadidi, N.; Merabet, H.; Hentabli, M. Prediction of the concentrations of PM1, PM2.5, PM4, and PM10 by using the hybrid dragonfly-SVM algorithm. Air Qual. Atmos. Health 2021, 14, 313–323.
- Valavanidis, A.; Fiotakis, K.; Vlachogianni, T. Airborne particulate matter and human health: Toxicological assessment and importance of size and composition of particles for oxidative damage and carcinogenic mechanisms. J. Environ. Sci. Health Part C 2008, 26, 339–362.
- Shaltout, A.A.; Boman, J.; Shehadeh, Z.F.; Dhaif-Allah, R.; Hemeda, O.; Morsy, M.M. Spectroscopic investigation of PM2.5 collected at industrial, residential and traffic sites in Taif, Saudi Arabia. J. Aerosol Sci. 2015, 79, 97–108.
- Aina, Y.A.; Van der Merwe, J.H.; Alshuwaikhat, H.M. Spatial and temporal variations of satellite-derived multi-year particulate data of Saudi Arabia: An exploratory analysis. Int. J. Environ. Res. Public Health 2014, 11, 11152–11166.
- Heisler, S.L.; Friedlander, S. Gas-to-particle conversion in photochemical smog: Aerosol growth laws and mechanisms for organics. Atmos. Environ. 1977, 11, 157–168.
- Carvalho, H. New WHO global air quality guidelines: More pressure on nations to reduce air pollution levels. Lancet Planet. Health 2021, 5, e760–e761.
- Sprigg, W.; Nickovic, S.; Galgiani, J.; Pejanovic, G.; Petkovic, S.; Vujadinovic, M.; Vukovic, A.; Dacic, M.; DiBiase, S.; Prasad, A.; et al. Regional dust storm modeling for health services: The case of valley fever. Aeolian Res. 2014, 14, 53–73.
- Haq, M.A. SMOTEDNN: A novel model for air pollution forecasting and AQI classification. Comput. Mater. Contin. 2022, 71, 1403–1425.
- Dhandapani, A.; Iqbal, J.; Kumar, R.N. Application of machine learning (individual vs stacking) models on MERRA-2 data to predict surface PM2.5 concentrations over India. Chemosphere 2023, 340, 139966.
- Mircea, M.; Calori, G.; Pirovano, G.; Belis, C. European Guide on Air Pollution Source Apportionment for Particulate Matter with Source Oriented Models and Their Combined Use with Receptor Models; Publications Office of the European Union: Luxembourg, 2020.
- Johnson, T.M.; Guttikunda, S.; Wells, G.J.; Artaxo, P.; Bond, T.C.; Russell, A.G.; Watson, J.G.; West, J. Tools for Improving Air Quality Management: A Review of Top-Down Source Apportionment Techniques and Their Application in Developing Countries; World Bank: Washington, DC, USA, 2011.
- Li, Y.; Yuan, S.; Fan, S.; Song, Y.; Wang, Z.; Yu, Z.; Yu, Q.; Liu, Y. Satellite remote sensing for estimating PM2.5 and its components. Curr. Pollut. Rep. 2021, 7, 72–87.
- Di, Q.; Amini, H.; Shi, L.; Kloog, I.; Silvern, R.; Kelly, J.; Sabath, M.B.; Choirat, C.; Koutrakis, P.; Lyapustin, A.; et al. An ensemble-based model of PM2.5 concentration across the contiguous United States with high spatiotemporal resolution. Environ. Int. 2019, 130, 104909.
- Mhawish, A.; Banerjee, T.; Sorek-Hamer, M.; Bilal, M.; Lyapustin, A.I.; Chatfield, R.; Broday, D.M. Estimation of high-resolution PM2.5 over the Indo-Gangetic Plain by fusion of satellite data, meteorology, and land use variables. Environ. Sci. Technol. 2020, 54, 7891–7900.
- Kaginalkar, A.; Kumar, S.; Gargava, P.; Niyogi, D. Review of urban computing in air quality management as smart city service: An integrated IoT, AI, and cloud technology perspective. Urban Clim. 2021, 39, 100972.
- Essamlali, I.; Nhaila, H.; El Khaili, M. Supervised Machine Learning Approaches for Predicting Key Pollutants and for the Sustainable Enhancement of Urban Air Quality: A Systematic Review. Sustainability 2024, 16, 976.
- Zuo, X.; Guo, H.; Shi, S.; Zhang, X. Comparison of six machine learning methods for estimating PM2.5 concentration using the Himawari-8 aerosol optical depth. J. Indian Soc. Remote Sens. 2020, 48, 1277–1287.
- Zaman, N.A.F.K.; Kanniah, K.D.; Kaskaoutis, D.G.; Latif, M.T. Evaluation of machine learning models for estimating PM2.5 concentrations across malaysia. Appl. Sci. 2021, 11, 7326.
- Park, Y.; Kwon, B.; Heo, J.; Hu, X.; Liu, Y.; Moon, T. Estimating PM2.5 concentration of the conterminous United States via interpretable convolutional neural networks. Environ. Pollut. 2020, 256, 113395.
- Chakma, A.; Vizena, B.; Cao, T.; Lin, J.; Zhang, J. Image-based air quality analysis using deep convolutional neural network. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3949–3952.
- Li, J.; Jin, M.; Li, H. Exploring spatial influence of remotely sensed PM2.5 concentration using a developed deep convolutional neural network model. Int. J. Environ. Res. Public Health 2019, 16, 454.
- Qadeer, K.; Rehman, W.U.; Sheri, A.M.; Park, I.; Kim, H.K.; Jeon, M. A long short-term memory (LSTM) network for hourly estimation of PM2.5 concentration in two cities of South Korea. Appl. Sci. 2020, 10, 3984.
- Sayeed, A.; Lin, P.; Gupta, P.; Tran, N.N.M.; Buchard, V.; Christopher, S. Hourly and Daily PM2.5 Estimations Using MERRA-2: A Machine Learning Approach. Earth Space Sci. 2022, 9, e2022EA002375.
- Shtein, A.; Kloog, I.; Schwartz, J.; Silibello, C.; Michelozzi, P.; Gariazzo, C.; Viegi, G.; Forastiere, F.; Karnieli, A.; Just, A.C.; et al. Estimating daily PM2.5 and PM10 over Italy using an ensemble model. Environ. Sci. Technol. 2019, 54, 120–128.
- Gu, Y. Estimating PM2.5 Concentrations Using 3 km MODIS AOD Products: A Case Study in British Columbia, Canada. Master’s Thesis, University of Waterloo, Waterloo, ON, Canada, 2019.
- Meng, X.; Hand, J.L.; Schichtel, B.A.; Liu, Y. Space-time trends of PM2.5 constituents in the conterminous United States estimated by a machine learning approach, 2005–2015. Environ. Int. 2018, 121, 1137–1147.
- Yu, W.; Li, S.; Ye, T.; Xu, R.; Song, J.; Guo, Y. Deep ensemble machine learning framework for the estimation of PM2.5 concentrations. Environ. Health Perspect. 2022, 130, 037004.
- LME. 2024. Available online: https://www.geeksforgeeks.org/linear-mixed-effects-models-lme-in-r/ (accessed on 13 July 2024).
- Lee, H.; Liu, Y.; Coull, B.; Schwartz, J.; Koutrakis, P. A novel calibration approach of MODIS AOD data to predict PM2.5 concentrations. Atmos. Chem. Phys. 2011, 11, 7991–8002.
- Yu, H.; Fotheringham, A.S.; Li, Z.; Oshan, T.; Kang, W.; Wolf, L.J. Inference in multiscale geographically weighted regression. Geogr. Anal. 2020, 52, 87–106.
- Zou, B.; Chen, J.; Zhai, L.; Fang, X.; Zheng, Z. Satellite based mapping of ground PM2.5 concentration using generalized additive modeling. Remote Sens. 2016, 9, 1.
- Unnithan, S.K.; Gnanappazham, L. Spatiotemporal mixed effects modeling for the estimation of PM2.5 from MODIS AOD over the Indian subcontinent. GISci. Remote Sens. 2020, 57, 159–173.
- Xiao, Q.; Chang, H.H.; Geng, G.; Liu, Y. An ensemble machine-learning model to predict historical PM2.5 concentrations in China from satellite data. Environ. Sci. Technol. 2018, 52, 13260–13269.
- Bera, B.; Bhattacharjee, S.; Sengupta, N.; Saha, S. PM2.5 concentration prediction during COVID-19 lockdown over Kolkata metropolitan city, India using MLR and ANN models. Environ. Chall. 2021, 4, 100155.
- Chen, B.; Song, Z.; Huang, J.; Zhang, P.; Hu, X.; Zhang, X.; Guan, X.; Ge, J.; Zhou, X. Estimation of atmospheric PM10 concentration in China using an interpretable deep learning model and top-of-the-atmosphere reflectance data from China’s new generation geostationary meteorological satellite, FY-4A. J. Geophys. Res. Atmos. 2022, 127, e2021JD036393.
- Maltare, N.N.; Vahora, S. Air Quality Index prediction using machine learning for Ahmedabad city. Digit. Chem. Eng. 2023, 7, 100093.
- Deo, R.C.; Wen, X.; Qi, F. A wavelet-coupled support vector machine model for forecasting global incident solar radiation using limited meteorological dataset. Appl. Energy 2016, 168, 568–593.
- Gao, Z.; Do, K.; Li, Z.; Jiang, X.; Maji, K.J.; Ivey, C.E.; Russell, A.G. Predicting PM2.5 levels and exceedance days using machine learning methods. Atmos. Environ. 2024, 323, 120396.
- Balogun, A.L.; Tella, A. Modelling and investigating the impacts of climatic variables on ozone concentration in Malaysia using correlation analysis with random forest, decision tree regression, linear regression, and support vector regression. Chemosphere 2022, 299, 134250.
- Méndez, M.; Merayo, M.G.; Núñez, M. Machine learning algorithms to forecast air quality: A survey. Artif. Intell. Rev. 2023, 56, 10031–10066.
- The Optimal Value of K in KNN. 2024. Available online: https://www.geeksforgeeks.org/how-to-find-the-optimal-value-of-k-in-knn/ (accessed on 4 October 2024).
- Ayinde, B.O.; Musa, M.R.; Ayinde, A.A.O. Application of machine learning models and landsat 8 data for estimating seasonal PM2.5 concentrations. Environ. Anal. Health Toxicol. 2024, 39, e2024011.
- Xiong, L.; Yao, Y. Study on an adaptive thermal comfort model with K-nearest-neighbors (KNN) algorithm. Build. Environ. 2021, 202, 108026.
- Balogun, A.L.; Tella, A.; Baloo, L.; Adebisi, N. A review of the inter-correlation of climate change, air pollution and urban sustainability using novel machine learning algorithms and spatial information science. Urban Clim. 2021, 40, 100989.
- Sánchez-Ruiz, F.J.; Hernandez, E.A.; Terrones-Salgado, J.; Quiroz, L.J.F. Evolutionary artificial neural network for temperature control in a batch polymerization reactor. Ingenius 2023, 79–89.
- Afan, H.A.; Ibrahem Ahmed Osman, A.; Essam, Y.; Ahmed, A.N.; Huang, Y.F.; Kisi, O.; Sherif, M.; Sefelnasr, A.; Chau, K.w.; El-Shafie, A. Modeling the fluctuations of groundwater level by employing ensemble deep learning techniques. Eng. Appl. Comput. Fluid Mech. 2021, 15, 1420–1439.
- LeCun, Y.; Bengio, Y. Convolutional networks for images, speech, and time series. Handb. Brain Theory Neural Netw. 1995, 3361, 1995.
- Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823.
- Ayturan, Y.A.; Ayturan, Z.C.; Altun, H.O. Air pollution modelling with deep learning: A review. Int. J. Environ. Pollut. Environ. Model. 2018, 1, 58–62.
- Tian, J.; Liu, Y.; Zheng, W.; Yin, L. Smog prediction based on the deep belief-BP neural network model (DBN-BP). Urban Clim. 2022, 41, 101078.
- Valencia, A.R.Z.; Rosales, A.A.R. Application of Random Forest in a Predictive Model of PM10 Particles in Mexico City. Nat. Environ. Pollut. Technol. 2024, 23, 711–724.
- Gui, K.; Che, H.; Zeng, Z.; Wang, Y.; Zhai, S.; Wang, Z.; Luo, M.; Zhang, L.; Liao, T.; Zhao, H.; et al. Construction of a virtual PM2.5 observation network in China based on high-density surface meteorological observations using the Extreme Gradient Boosting model. Environ. Int. 2020, 141, 105801.
- Chen, C.J.; Hua, Y.J.; Lin, Z.; Zhang, T.; Di, Z.M. Stacking machine learning model for estimating hourly PM2.5 in China based on Himawari 8 aerosol optical depth data. Sci. Total Environ. 2019, 697, 134021.
- Ghahremanloo, M.; Choi, Y.; Sayeed, A.; Salman, A.K.; Pan, S.; Amani, M. Estimating daily high-resolution PM2.5 concentrations over Texas: Machine Learning approach. Atmos. Environ. 2021, 247, 118209.
- Chen, Z.Y.; Zhang, T.H.; Zhang, R.; Zhu, Z.M.; Yang, J.; Chen, P.Y.; Ou, C.Q.; Guo, Y. Extreme gradient boosting model to estimate PM2.5 concentrations with missing-filled satellite data in China. Atmos. Environ. 2019, 202, 180–189.
- Mohammadi, A.; Karimzadeh, S.; Banimahd, S.A.; Ozsarac, V.; Lourenço, P.B. The potential of region-specific machine-learning-based ground motion models: Application to Turkey. Soil Dyn. Earthq. Eng. 2023, 172, 108008.
- Buya, S.; Usanavasin, S.; Gokon, H.; Karnjana, J. An Estimation of Daily PM2.5 Concentration in Thailand Using Satellite Data at 1-Kilometer Resolution. Sustainability 2023, 15, 10024.
- Ferreira, F.P.V.; Jeong, S.H.; Mansouri, E.; Shamass, R.; Tsavdaridis, K.; Martins, C.H.; De Nardin, S. Five Machine Learning Models Predicting the Global Shear Capacity of Composite Cellular Beams with Hollow-Core Units. Buildings 2024, 14, 2256.
- Zeng, Z.; Gui, K.; Wang, Z.; Luo, M.; Geng, H.; Ge, E.; An, J.; Song, X.; Ning, G.; Zhai, S.; et al. Estimating hourly surface PM2.5 concentrations across China from high-density meteorological observations by machine learning. Atmos. Res. 2021, 254, 105516.
- Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 52.
- Keele, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; Keele University: Keele, UK, 2007.
- Brokamp, C.; Jandarov, R.; Hossain, M.; Ryan, P. Predicting daily urban fine particulate matter concentrations using a random forest model. Environ. Sci. Technol. 2018, 52, 4173–4179.
- Chen, G.; Wang, Y.; Li, S.; Cao, W.; Ren, H.; Knibbs, L.D.; Abramson, M.J.; Guo, Y. Spatiotemporal patterns of PM10 concentrations over China during 2005–2016: A satellite-based estimation using the random forests approach. Environ. Pollut. 2018, 242, 605–613.
- Stafoggia, M.; Bellander, T.; Bucci, S.; Davoli, M.; De Hoogh, K.; De’Donato, F.; Gariazzo, C.; Lyapustin, A.; Michelozzi, P.; Renzi, M.; et al. Estimation of daily PM10 and PM2.5 concentrations in Italy, 2013–2015, using a spatiotemporal land-use random-forest model. Environ. Int. 2019, 124, 170–179.
- Stafoggia, M.; Johansson, C.; Glantz, P.; Renzi, M.; Shtein, A.; de Hoogh, K.; Kloog, I.; Davoli, M.; Michelozzi, P.; Bellander, T. A random forest approach to estimate daily particulate matter, nitrogen dioxide, and ozone at fine spatial resolution in Sweden. Atmosphere 2020, 11, 239.
- Tuygun, G.T.; Gündoğdu, S.; Elbir, T. Estimation of ground-level particulate matter concentrations based on synergistic use of MODIS, MERRA-2 and AERONET AODs over a coastal site in the Eastern Mediterranean. Atmos. Environ. 2021, 261, 118562.
- Liu, W.; Yang, Z.; Liu, Q. Estimations of ambient fine particle and ozone level at a suburban site of Beijing in winter. Environ. Res. Commun. 2021, 3, 081008.
- Djarum, D.H.; Ahmad, Z.; Zhang, J. Comparing Different Pre-processing Techniques and Machine Learning Models to Predict PM10 and PM2.5 Concentration in Malaysia. In Proceedings of the 3rd International Conference on Separation Technology: Sustainable Design in Construction, Materials and Processes, Johor, Malaysia, 15–16 August 2020; Springer: Singapore, 2021; pp. 353–374.
- Lin, L.; Liang, Y.; Liu, L.; Zhang, Y.; Xie, D.; Yin, F.; Ashraf, T. Estimating PM2.5 concentrations using the machine learning RF-XGBoost model in guanzhong urban agglomeration, China. Remote Sens. 2022, 14, 5239.
- Chen, B.; Song, Z.; Shi, B.; Li, M. An interpretable deep forest model for estimating hourly PM10 concentration in China using Himawari-8 data. Atmos. Environ. 2022, 268, 118827.
- Yang, Y.; Wang, Z.; Cao, C.; Xu, M.; Yang, X.; Wang, K.; Guo, H.; Gao, X.; Li, J.; Shi, Z. Estimation of PM2.5 concentration across china based on multi-source remote sensing data and machine learning methods. Remote Sens. 2024, 16, 467.
- Li, S.; Ding, Y.; Xing, J.; Fu, J.S. Retrieving Ground-Level PM2.5 Concentrations in China (2013–2021) with a Numerical Model-Informed Testbed to Mitigate Sample Imbalance-Induced Biases. Earth Syst. Sci. Data Discuss. 2024, 16, 3781–3793.
- Ding, Y.; Li, S.; Xing, J.; Li, X.; Ma, X.; Song, G.; Teng, M.; Yang, J.; Dong, J.; Meng, S. Retrieving hourly seamless PM2.5 concentration across China with physically informed spatiotemporal connection. Remote Sens. Environ. 2024, 301, 113901.
- Gupta, P.; Christopher, S.A. Particulate matter air quality assessment using integrated surface, satellite, and meteorological products: Multiple regression approach. J. Geophys. Res. Atmos. 2009, 114, 1–13.
- Zhang, T.; Liu, G.; Zhu, Z.; Gong, W.; Ji, Y.; Huang, Y. Real-time estimation of satellite-derived PM2.5 based on a semi-physical geographically weighted regression model. Int. J. Environ. Res. Public Health 2016, 13, 974.
- Liu, Y.; Paciorek, C.J.; Koutrakis, P. Estimating regional spatial and temporal variability of PM2.5 concentrations using satellite data, meteorology, and land use information. Environ. Health Perspect. 2009, 117, 886–892.
- Rao, P.; Niharika, V. A survey on air quality forecasting techniques. Int. J. Comput. Sci. Inf. Technol. 2014, 5, 812–816.
- Bilal, M.; Nichol, J.E.; Spak, S.N. A new approach for estimation of fine particulate concentrations using satellite aerosol optical depth and binning of meteorological variables. Aerosol Air Qual. Res. 2017, 17, 356–367.
- Chen, M.J.; Yang, P.H.; Hsieh, M.T.; Yeh, C.H.; Huang, C.H.; Yang, C.M.; Lin, G.M. Machine learning to relate PM2.5 and PM10 concentrations to outpatient visits for upper respiratory tract infections in Taiwan: A nationwide analysis. World J. Clin. Cases 2018, 6, 200.
- Azid, A.; Juahir, H.; Toriman, M.E.; Kamarudin, M.K.A.; Saudi, A.S.M.; Hasnam, C.N.C.; Aziz, N.A.A.; Azaman, F.; Latif, M.T.; Zainuddin, S.F.M.; et al. Prediction of the level of air pollution using principal component analysis and artificial neural network techniques: A case study in Malaysia. Water Air Soil Pollut. 2014, 225, 1–14.
- Zang, L.; Mao, F.; Guo, J.; Wang, W.; Pan, Z.; Shen, H.; Zhu, B.; Wang, Z. Estimation of spatiotemporal PM1.0 distributions in China by combining PM2.5 observations with satellite aerosol optical depth. Sci. Total Environ. 2019, 658, 1256–1264.
- Kujawska, J.; Kulisz, M.; Oleszczuk, P.; Cel, W. Machine learning methods to forecast the concentration of PM10 in Lublin, Poland. Energies 2022, 15, 6428.
- Kumar, S.; Mishra, S.; Singh, S.K. A machine learning-based model to estimate PM2.5 concentration levels in Delhi’s atmosphere. Heliyon 2020, 6, e05618.
- Liao, K.; Huang, X.; Dang, H.; Ren, Y.; Zuo, S.; Duan, C. Statistical approaches for forecasting primary air pollutants: A review. Atmosphere 2021, 12, 686.
- Hu, X.; Belle, J.H.; Meng, X.; Wildani, A.; Waller, L.A.; Strickland, M.J.; Liu, Y. Estimating PM2.5 concentrations in the conterminous United States using the random forest approach. Environ. Sci. Technol. 2017, 51, 6936–6944.
- Unik, M.; Sitanggang, I.S.; Syaufina, L.; Jaya, I.N.S. PM2.5 estimation using machine learning models and satellite data: A literature review. Int. J. Adv. Comput. Sci. Appl. 2023, 14, 359–370.
- Gao, S.; Zhao, H.; Bai, Z.; Han, B.; Xu, J.; Zhao, R.; Zhang, N.; Chen, L.; Lei, X.; Shi, W.; et al. Combined use of principal component analysis and artificial neural network approach to improve estimates of PM2.5 personal exposure: A case study on older adults. Sci. Total Environ. 2020, 726, 138533.
- Haiming, Z.; Xiaoxiao, S. Study on prediction of atmospheric PM2.5 based on RBF neural network. In Proceedings of the 2013 Fourth International Conference on Digital Manufacturing & Automation, Qingdao, China, 29–30 June 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1287–1289.
- Zheng, Y.; Liu, F.; Hsieh, H.P. U-air: When urban air quality inference meets big data. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago IL, USA, 11–14 August 2013; pp. 1436–1444.
- Kaushik, R.; Kumar, S.; Pooling, M. Image segmentation using convolutional neural network. Int. J. Sci. Technol. Res 2019, 8, 667–675.
- Tao, H.; Xing, J.; Zhou, H.; Pleim, J.; Ran, L.; Chang, X.; Wang, S.; Chen, F.; Zheng, H.; Li, J. Impacts of improved modeling resolution on the simulation of meteorology, air quality, and human exposure to PM2.5, O3 in Beijing, China. J. Clean. Prod. 2020, 243, 118574.
- Yan, X.; Zang, Z.; Jiang, Y.; Shi, W.; Guo, Y.; Li, D.; Zhao, C.; Husi, L. A Spatial-Temporal Interpretable Deep Learning Model for improving interpretability and predictive accuracy of satellite-based PM2.5. Environ. Pollut. 2021, 273, 116459.
- Hancock, J.T.; Khoshgoftaar, T.M. CatBoost for big data: An interdisciplinary review. J. Big Data 2020, 7, 94.
- Levy, R.C. The dark-land MODIS collection 5 aerosol retrieval: Algorithm development and product evaluation. In Satellite Aerosol Remote Sensing over Land; Springer: Berlin/Heidelberg, Germany, 2009; pp. 19–68.
Statistical Metrics | Abbreviation | Definition |
---|---|---|
Cross-validation | CV | Evaluates the ML model’s performance on unseen data. The available data are divided into multiple folds; one of these folds serves as a validation set, while the other folds are used to train the model [32,63]. |
Determination coefficient | R² | Shows how much of the variance in the dependent variable can be explained by the independent variables. The range of values is from 0 to 1, and higher values indicate a better fit [63]. |
Correlation coefficient | R | Clarifies the association between two variables [63]. |
Out-of-Bag | OOB | Refers to the portion of the original dataset not included in the bootstrap sample during the training of each model in an ensemble, which is then used to assess the model’s performance [35]. |
Spearman rank correlation coefficient | SR | Measures the strength and direction of the association between the observed values and the estimated values [32]. |
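
As a rough illustration of how three of the metrics in the table above can be computed, the following sketch implements the determination coefficient, the correlation coefficient, and the Spearman rank correlation using only the Python standard library. The function names and the sample values are assumptions made for illustration. The determination coefficient is written here as 1 − SS_res/SS_tot, which can fall below 0 when a model fits worse than the mean of the observations.

```python
import math


def r_squared(observed, estimated):
    """Determination coefficient, computed as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


def pearson_r(x, y):
    """Pearson correlation coefficient R between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_r(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


# Hypothetical observed and estimated concentrations (illustrative values only).
observed = [10.0, 20.0, 30.0, 40.0]
estimated = [12.0, 18.0, 33.0, 37.0]
print(round(r_squared(observed, estimated), 3))
print(round(pearson_r(observed, estimated), 3))
print(round(spearman_r(observed, estimated), 3))
```
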
Research Questions | Objectives |
---|---|
What are the benefits of using ML to estimate PM concentrations? | Identify and synthesize the key benefits of using ML techniques for estimating PM concentrations compared to traditional statistical methods. |
What are the current solutions that employ ML models for estimating the concentrations of PM? | Systematically review the current solutions and ML-based models that have been employed for estimating the concentrations of PM2.5 and PM10 in ambient air. |
What are the research gaps and future directions for estimating PM concentrations based on a machine learning model? | Analyze the research gaps critically and identify future directions for advancing the application of ML techniques to improve the estimation and monitoring of PM2.5 and PM10 levels. |
Inclusion Criteria | Exclusion Criteria |
---|---|
Include machine learning-based solutions to estimate particulate matter. | Exclude studies published more than six years ago. |
Included articles must primarily address the estimation of PM2.5 or PM10 concentrations. | Exclude studies that forecast particulate matter using ML-based models. |
Include ISI-indexed or Scopus-indexed articles. | Exclude books and theses. |
Library | Elsevier | MDPI | Springer | AAQR | IOP Science | ACS Publication | Wiley | NEPT | Europe PMC | ESSD | Total |
---|---|---|---|---|---|---|---|---|---|---|---|
Excluded studies | 784 | 79 | 180 | 46 | 30 | 303 | 227 | 78 | 2057 | 0 | 3754 |
Included studies | 15 | 5 | 3 | 1 | 1 | 3 | 2 | 1 | 1 | 1 | 32 |
Study No. | Feature Importance Analysis | Residual Analysis | Temporal and Spatial Consistency | Cross-Validation | Total |
---|---|---|---|---|---|
S1 | 1 | 1 | 1 | 1 | 4 |
S2 | 1 | 1 | 1 | 1 | 4 |
S3 | 1 | 1 | 1 | 1 | 4 |
S4 | 1 | 1 | 1 | 1 | 4 |
S5 | 1 | 1 | 1 | 1 | 4 |
S6 | 1 | 1 | 1 | 1 | 4 |
S7 | 1 | 1 | 1 | 1 | 4 |
S8 | 1 | 1 | 1 | 1 | 4 |
S9 | 1 | 1 | 1 | 1 | 4 |
S10 | 1 | 1 | 1 | 1 | 4 |
S11 | 1 | 1 | 1 | 0 | 3 |
S12 | 1 | 1 | 1 | 1 | 4 |
S13 | 1 | 1 | 1 | 0 | 3 |
S14 | 1 | 1 | 1 | 1 | 4 |
S15 | 1 | 1 | 1 | 1 | 4 |
S16 | 1 | 1 | 1 | 1 | 4 |
S17 | 1 | 1 | 1 | 0 | 3 |
S18 | 1 | 1 | 1 | 0 | 3 |
S19 | 1 | 1 | 1 | 0 | 3 |
S20 | 1 | 1 | 1 | 1 | 4 |
S21 | 1 | 1 | 1 | 1 | 4 |
S22 | 1 | 1 | 1 | 1 | 4 |
S23 | 1 | 1 | 1 | 1 | 4 |
S24 | 1 | 1 | 1 | 0 | 3 |
S25 | 1 | 1 | 1 | 0 | 3 |
S26 | 1 | 1 | 1 | 1 | 4 |
S27 | 1 | 1 | 1 | 0 | 3 |
S28 | 1 | 1 | 1 | 1 | 4 |
S29 | 1 | 1 | 1 | 1 | 4 |
S30 | 1 | 1 | 0 | 1 | 3 |
S31 | 1 | 1 | 1 | 0 | 3 |
S32 | 1 | 1 | 1 | 1 | 4 |
Total | 32 | 31 | 30 | 23 | Avg = 3.68 |
Ref. | PM Type | Location | Model | Accuracy | Strengths | Limitations |
---|---|---|---|---|---|---|
[26] | PM2.5 | BTH | DT Model | R = 0.854 | - Captured the complex relationships between AOD and PM2.5. | - Was not generalizable to other locations. - Reliance on AOD data limited the accuracy, especially in heavily polluted areas. |
[10] | | Algiers | Hybrid dragonfly–SVM model | R² = 0.98 | - Was a useful tool to help authorities anticipate critical air quality episodes in the absence of continuous monitoring. | - Lacked consideration for land use and seasonal effects. |
[26] | PM2.5 | BTH | SVM Model | R = 0.32 | - | - The lack of a uniform training dataset reduced the accuracy. |
[27] | PM2.5 | Malaysia | SVR Model | R² = 0.69 | - Overfitting was minimized by relying on the kernel function. | - Some biases and underestimations of peak values were present. - It was not generalizable to other locations. |
[47] | PM2.5 | South Coast Air Basin of California | SVR Model | R² = 0.94 | - Had high accuracy with low computational requirements. | - Did not accurately predict the extreme values. |
[43] | PM2.5 | Kolkata | ANN Model | R² = 0.69 | - A rational model for estimating spatiotemporal PM2.5 concentrations was developed. | - Lacked comprehensive spatial and temporal data coverage. |
Ref. | PM Type | Location | Model | Accuracy | Strengths | Limitations |
---|---|---|---|---|---|---|
[28] | | Conterminous United States | CNN model | = 0.84 | - CNN generated a smooth annual prediction map. | - Limited temporal scope. - The model was trained for one year and might not have reflected the most recent changes in PM concentrations. |
[75] | | Turkey | PRNN model | R = 0.74 | - It was capable of handling random variations. | - It was not generalizable to other locations. |
[81] | | China | ResNet Model | R = 0.61 | - Enhanced the estimation accuracy. - Mitigated biases induced by sample imbalance. | - The numerical model might have had uncertainties, which caused discrepancies with real observations. |
Ref. | PM Type | Location | Model | Accuracy | Strengths | Limitations |
---|---|---|---|---|---|---|
[44] | | China | DF | = 0.99 (annual averages) | - Achieved optimal hourly, daily, monthly, and annual averages. | - Potential biases. - Lower performance during summer and autumn. - The model performed poorly in areas with high surface pressure contributions. |
[79] | | China | DF | = 0.82–0.88 | - The model achieved results consistent with the PM measured by the ground station. | - Accuracy affected by high surface pressure. |
[35] | PM constituents | United States | RF | = 0.71–0.86 | - Captured long-term trends and spatial patterns at national and local scales. | - The estimation map had a 0.25° × 0.3125° spatial resolution and did not adequately capture local variations. |
[32] | | USA | RF | = 0.65 | - The RF model effectively estimated PM when compared with surface measurements. | - The model had limitations due to uncertain MERRA-2 emissions and insufficient satellite data. |
[71] | | Seven-county urban area | RF | = 0.91 | - The spatiotemporal RF model showed high accuracy and was useful for assessing exposure. | - RF was not generalizable to other locations. |
[74] | , , PM2.5–10 | Sweden | RF | = 0.64–0.77 | - The RF model demonstrated better performance in large cities. | - The spatial resolution of cloud cover data affected the model’s accuracy. |
[72] | | China | RF | = 0.78 | - RF showed high predictive ability and low bias. | - Missing AOD values affected the estimation accuracy. - The trained model lacked ground monitoring data to validate estimates. |
[80] | | China | RF | = 0.93 | - RF achieved higher accuracy and outperformed several regression models. | - Low data temporal resolution affected model accuracy. |
[42] | | China | Ensemble ML model | = 0.79 | - Accurate estimations were achieved at daily and monthly levels. - The model provided unbiased historical estimates. | - Incomplete satellite data coverage may have affected estimate accuracy. |
[73] | , | Italy | Five-stage RF | = 0.75–0.86 | - Captured most of the PM variability. | - Biases were observed in model estimations during summer and in southern Italy. - It was not generalizable to other cities. |
[27] | | Malaysia | RF | = 0.46–0.76 | - RF effectively represented PM values and temporal changes. | - It was not generalizable to other cities. - There were limitations in spatial coverage. |
[1] | | Thailand | RF | R = 0.95 | - Estimated PM with nearly zero mean bias. | - Did not explore the model’s capacity for long-term trends. - It was not generalizable to other cities. |
[66] | | Thailand | RF | = 0.71 | - PM data from the RF model can be used to analyze short- and long-term effects on population health. | - Cloud cover, complex surfaces, and missing values impacted model accuracy. |
[23] | | IGP region | RF | = 0.87 | - Outperformed the LME model across various timescales. | - Lack of historical data affected assessment of year-to-year variability. |
[63] | | Texas | RF | R = 0.83–0.90 | - High estimation accuracy with low MAB. | - Not generalizable to other locations. |
[60] | | Mexico | RF | = 0.804 | - The model outputs were very close to the real observed data. | - The accuracy of the model was influenced by the quality of the data used. - Not generalizable to other locations. |
[9] | | Kaohsiung | CNN-RF | = 0.93 | - CNN-RF model outperformed the single CNN and RF models. | - Limited geographical coverage and short-term trend analysis. |
[22] | | United States | Ensemble learning model | = 0.86 | - Provided a solid foundation for modeling PM. | - Used a 1 km × 1 km resolution, which may be inadequate for epidemiological applications. |
[51] | | Tuzla Canton, Bosnia and Herzegovina | XGBoost | R = 0.98 (Winter) | - Demonstrated the highest overall accuracy across all seasons. | - Potential bias. - Did not consider important predictors. - Not generalizable to other locations. |
[78] | | China | RF-XGBoost | = 0.93 | - Improved the estimation of ground-level PM concentrations. | - It tended to underestimate PM on high-pollution days and overestimate it on low-pollution days. |
[18] | | India | Stacking model | = 0.80 (hourly) | - The stacking model was applied regionally. | - Analysis was limited to a single year. |
[68] | | China | LightGBM | = 0.86 | - It achieved better hourly estimation results. | - Temporal limitations in assessing PM concentrations. |
[82] | | China | Wavelet-CatBoost | = 0.92 | - Achieved high estimation accuracy with low error. - Enhanced spatiotemporal connectivity. | - |
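
The Accuracy columns in the tables above report correlation-type scores, for example R = 0.854 for the DT model in [26]. As a minimal sketch, assuming that R denotes the Pearson correlation coefficient and using the coefficient of determination as a second, stricter score, the Python below illustrates why such scores are not directly interchangeable. The function names and the concentration values are hypothetical and are not taken from any reviewed study.

```python
import math

def pearson_r(observed, estimated):
    """Pearson correlation between observed and estimated values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(estimated) / n
    cov = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimated))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_e = sum((e - mean_e) ** 2 for e in estimated)
    return cov / math.sqrt(var_o * var_e)

def r_squared(observed, estimated):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_o = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, estimated))
    ss_tot = sum((o - mean_o) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical ground observations and model estimates (same units).
observed = [12.0, 35.0, 50.0, 80.0, 120.0]
# Estimates that track the observations perfectly but carry a constant +10 bias.
estimated = [22.0, 45.0, 60.0, 90.0, 130.0]

r = pearson_r(observed, estimated)
r2 = r_squared(observed, estimated)
print(f"Pearson R = {r:.3f}, coefficient of determination = {r2:.3f}")
```

In this toy case the correlation is essentially perfect because the bias does not change the linear relationship, while the coefficient of determination drops because it penalizes the offset. A study reporting R therefore cannot be ranked against one reporting a determination-type score without recomputing a common metric.
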
Strengths | Limitations |
---|---|
ML models can process extensive datasets and detect significant patterns from diverse variables, including meteorological parameters and ground observations. | ML models require extensive preprocessing and cleaning of raw data, which can be difficult and time-consuming. |
ML models showed increased accuracy in estimating PM concentrations compared with traditional statistical models. | ML models may overestimate or underestimate PM concentrations in some locations, especially in isolated or heavily polluted places where data are limited. |
ML models can aid in understanding the connections between various predictor variables and PM concentrations, thereby revealing the underlying mechanisms of air pollution. | ML models often require substantial computational resources and expertise for their execution and optimization. |
- | ML models require meticulous tuning and validation to prevent over-fitting or under-fitting, which can impact their generalizability and reliability. |
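
The validation requirement listed above is commonly met with k-fold cross-validation, in which each sample is held out exactly once and scored by a model fitted on the remaining folds. The following is a minimal sketch in plain Python, assuming a one-predictor least-squares line as a stand-in for an ML model. The helper names, the synthetic AOD-like predictor, and the synthetic PM-like target are all hypothetical illustrations rather than data or methods from the reviewed studies.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def cross_validated_rmse(xs, ys, k=5, seed=0):
    """RMSE computed only on samples the fitted model has not seen."""
    squared_errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        squared_errors.extend(
            (ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out
        )
    return (sum(squared_errors) / len(squared_errors)) ** 0.5

# Synthetic predictor and target with a deterministic +/-3 perturbation.
aod = [0.1 * i for i in range(1, 21)]
pm = [20.0 + 60.0 * x + (3.0 if i % 2 else -3.0) for i, x in enumerate(aod)]

folds = k_fold_indices(len(aod), k=5)
rmse = cross_validated_rmse(aod, pm, k=5)
print(f"5-fold cross-validated RMSE: {rmse:.2f}")
```

Random folds like these test performance on unseen samples from the same stations and period. Testing generalizability to other locations would instead require holding out whole stations or regions, which this sketch does not do.
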
Strengths | Limitations |
---|---|
DL models are useful for processing huge, complicated datasets and providing precise estimates of PM concentrations. | Optimizing the hyperparameters and architecture of DL models can be time-consuming. |
DL models have outperformed conventional ML models (such as SVM) at identifying complex patterns and generating accurate estimates of PM concentrations. | DL models often overfit, especially when dealing with noisy or sparse data, leading to poorer performance on new data and decreased model reliability. |
DL models can learn complex features and hierarchies from raw data, eliminating the need for manual feature engineering. | DL-based models can lead to substantial estimation biases if the training data are not balanced across space or time. |
Strengths | Limitations |
---|---|
Improved accuracy and robustness of the model by utilizing the strengths of multiple trees or models | Higher computational complexity and greater resource demands |
Improved generalization performance and decreased the risk of overfitting | Potential for greater model complexity and decreased interpretability |
Potentially more effective than individual models in some situations. | Risk of relying excessively on ensemble models and ignoring the advantages and disadvantages of individual models |
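
The first strength in the table above, combining multiple models, can be illustrated with simple prediction averaging. This is a minimal sketch assuming an unweighted mean of three member models; the member names, their predictions, and the observed values are made-up numbers, and the reviewed ensembles (RF, XGBoost, stacking) use more elaborate combination rules.

```python
def mse(predictions, targets):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical observed concentrations at five sites.
observed = [18.0, 42.0, 65.0, 30.0, 95.0]

# Hypothetical estimates from three separately trained member models.
member_predictions = {
    "model_a": [22.0, 40.0, 70.0, 26.0, 88.0],
    "model_b": [15.0, 47.0, 60.0, 33.0, 101.0],
    "model_c": [20.0, 38.0, 68.0, 34.0, 92.0],
}

# Unweighted ensemble: average the member estimates site by site.
ensemble = [sum(values) / len(values)
            for values in zip(*member_predictions.values())]

member_mse = {name: mse(preds, observed)
              for name, preds in member_predictions.items()}
ensemble_mse = mse(ensemble, observed)
mean_member_mse = sum(member_mse.values()) / len(member_mse)

for name, value in member_mse.items():
    print(f"{name}: MSE = {value:.2f}")
print(f"ensemble: MSE = {ensemble_mse:.2f}")
```

Because squared error is convex, the averaged estimate can never have a higher MSE than the average MSE of its members. In this toy data the members' errors partly cancel, so the ensemble also beats the best single member, but that outcome depends on the errors being weakly correlated and is not guaranteed in general.
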
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Alkhodaidi, A.; Attiah, A.; Mhawish, A.; Hakeem, A. The Role of Machine Learning in Enhancing Particulate Matter Estimation: A Systematic Literature Review. Technologies 2024, 12, 198. https://doi.org/10.3390/technologies12100198