Predicting Urban PM2.5 Dynamics with XGBoost: Insights from a Dense Mobile Monitoring Network in Malaysia

Mohammad Sham, Noraishah; Ismain, Siti Hazimah Ayu; Sazali, Siti Syakirin

doi:10.3390/atmos17050501


by Noraishah Mohammad Sham ¹,*, Siti Hazimah Ayu Ismain ² and Siti Syakirin Sazali ¹

¹ Environmental Health Research Centre, Institute for Medical Research, National Institutes of Health, Ministry of Health Malaysia, Shah Alam 40170, Selangor, Malaysia

² Centre of Studies for Surveying Science and Geomatics, Faculty of Architecture, Planning and Surveying, Universiti Teknologi MARA, Shah Alam 40450, Selangor, Malaysia

* Author to whom correspondence should be addressed.

Atmosphere 2026, 17(5), 501; https://doi.org/10.3390/atmos17050501

Submission received: 3 March 2026 / Revised: 6 April 2026 / Accepted: 9 April 2026 / Published: 14 May 2026

(This article belongs to the Special Issue Advances in Air Quality Monitoring and Source Apportionment)


Abstract

This study applies and evaluates established machine learning (ML) models for predicting monthly PM2.5 concentrations across the Greater Klang Valley (GKV), Malaysia, using one year of data collected from 36 mobile monitoring stations between July 2022 and June 2023. Daily PM2.5, temperature (T), relative humidity (RH), and station location (L) were aggregated to form monthly datasets. Exploratory analysis showed substantial temporal variability, with elevated PM2.5 levels during the southwest monsoon and reduced concentrations during the northeast monsoon due to enhanced rainfall washout. Three tree-based ML algorithms, decision tree (DT), random forest (RF), and Extreme Gradient Boosting (XGBoost), were developed following data cleaning, transformation, partitioning, and hyperparameter optimization via grid search. Model performance was evaluated using R², RMSE, MAE, and NAE. Across all months, XGBoost consistently outperformed DT and RF, achieving the highest R² values (0.214–0.559) and generally lower error metrics. Model performance varied seasonally, with the highest accuracy observed in March 2023 (R² = 0.559) and February 2023 (R² = 0.552), whereas November 2022 showed the weakest predictive capability. Feature-importance analysis revealed that temperature exerted the strongest influence during the southwest monsoon, while station location dominated predictions in several months, reflecting spatial heterogeneity likely associated with land-use and emission patterns. RH was most influential in September 2022, when low humidity coincided with higher PM2.5 levels. Comparison of predicted and observed values showed strong alignment except during extreme pollution events, where the model tended to underperform. Overall, the findings demonstrate that XGBoost provides a robust modeling framework for monthly PM2.5 prediction in the GKV and highlight the importance of incorporating meteorological and spatial drivers to improve localized air-quality assessments.

Keywords:

PM2.5; climatic; machine learning; prediction

1. Introduction

Fine particulate matter (PM2.5) has become a persistent environmental and public health concern in Malaysia due to its strong association with respiratory and cardiovascular diseases and its ability to penetrate deep into the lungs. Several studies have reported that PM2.5 concentrations in urban regions frequently remain elevated despite changes in human activity levels, reflecting the influence of persistent emission sources and complex atmospheric processes [1]. Long-term assessments further show substantial spatial and seasonal variability across the country, with recurring PM2.5 hotspots developing in densely populated metropolitan areas such as the Greater Klang Valley (GKV) [2,3].

Predicting PM2.5 accurately remains a challenge because pollutant levels are strongly modulated by meteorological factors such as temperature, relative humidity, wind speed, and rainfall. Prior research in Malaysia has shown that these meteorological drivers exert significant control on PM2.5 behavior at daily and seasonal scales, particularly across monsoonal transitions [4,5]. However, traditional statistical modeling approaches often struggle to represent the nonlinear interactions between pollutant emissions, atmospheric chemistry, and meteorological variability.

Machine learning (ML) methods have emerged as powerful alternatives for air-quality prediction because of their capability to capture complex nonlinear patterns. A recent systematic review highlights the rapid growth and variety of ML and deep-learning techniques applied to spatiotemporal air quality forecasting globally, while noting challenges in achieving consistent predictive accuracy across contexts [6]. Studies from Southeast Asia, such as PM2.5 prediction in Ho Chi Minh City using both ML and deep learning models, illustrate the potential of AI-based approaches in urban pollution forecasting [7]. Similarly, deep learning and statistical models have shown strong forecasting performance in Thai cities, underscoring the value of advanced algorithms in regional air quality applications [8]. Comparative evaluations in large metropolitan settings also demonstrate that gradient-boosted models such as XGBoost can outperform traditional ML alternatives for PM2.5 concentration prediction [8]. Recent Malaysian studies have demonstrated improved performance using ML models such as random forest and hybrid learning frameworks when predicting PM2.5 based on meteorological and pollutant inputs [9,10]. Regional evidence from neighboring Southeast Asian cities similarly highlights the effectiveness of ML and deep-learning techniques for modeling spatiotemporal pollution dynamics [11,12].

Despite these advancements, several gaps remain. Most existing studies rely on fixed ground-based monitoring stations and focus on short-term prediction (hourly or daily), with limited emphasis on monthly-scale forecasting that is useful for exposure assessment, policy planning, and seasonal management strategies. Furthermore, the potential of dense mobile air-monitoring networks, which offer broader spatial coverage than conventional monitoring stations, remains underexplored in the Malaysian context. To address these gaps, this study aims to develop and evaluate three tree-based ML models, namely decision tree (DT), random forest (RF), and Extreme Gradient Boosting (XGBoost), for predicting monthly PM2.5 concentrations across the GKV using one year of high-resolution data collected from 36 mobile monitoring stations. By integrating meteorological variables with spatially diverse observations, this study captures monsoonal influences, spatial heterogeneity, and the nonlinear drivers of PM2.5 variability. The findings are expected to enhance localized PM2.5 prediction capability in Malaysia and support data-driven air-quality management and public-health protection.

Beyond air quality prediction, machine learning techniques have also been extensively reviewed and applied in other atmospheric and meteorological contexts. For example, several systematic reviews have examined the use of machine learning, deep learning, and ensemble approaches for wind flow and wind power forecasting, highlighting their ability to capture complex nonlinear dynamics in wind patterns and outperform traditional statistical and physical models in many cases [13]. Ensemble learning methods, neural networks, and hybrid ML architectures have been shown to improve wind speed and power forecasts by effectively handling nonlinear relationships and temporal dependencies in large datasets [14].

In addition, neural network-based approaches have been developed to estimate atmospheric boundary layer wind profiles using observational data such as LiDAR measurements combined with physical model outputs, demonstrating the potential of ML to enhance boundary layer characterization and atmospheric structure estimation [15]. These developments reflect a broader trend toward integrating data-driven and hybrid modeling frameworks in atmospheric science, illustrating the growing relevance of ML approaches for a wide range of environmental and meteorological prediction tasks.

2. Materials and Methods

2.1. Research Workflow

The data preparation workflow, which comprised data acquisition, exploration, transformation, and partitioning, is shown in Figure 1. The significant variables identified were used to develop predictive tree-based models, namely decision tree (DT), random forest (RF), and Extreme Gradient Boosting (XGBoost), which were evaluated using performance indicators. The performance of these models was compared to determine the best model.

All data processing, analysis, and machine learning model development were carried out using Python (version 3.13). Key libraries included NumPy and pandas for data manipulation and preprocessing, scikit-learn for implementing and evaluating machine learning algorithms, and Matplotlib and Seaborn for data visualization. The workflow involved data cleaning, feature preparation, model training, and performance evaluation using standard metrics. The use of Python ensured a reproducible and efficient analytical pipeline for handling the dataset and conducting the study.

The study utilized a network of nine fixed air quality monitoring stations (AQMS) strategically located in residential, traffic, and industrial areas, complemented by 27 mobile stations to achieve comprehensive coverage of populated areas. The locations of the sensors were selected based on several criteria: broad area coverage, accessibility to the general public, placement in densely populated areas, and even spatial distribution to avoid clustering. Additionally, the network was designed to support subsequent data analytics and air quality predictions, with several stations placed within healthcare facilities in the GKV to facilitate integration with public health applications. This configuration provides a robust and representative dataset for the study while maintaining practical deployment considerations.

2.2. Data Preparation

This section covers data acquisition, data exploration, data transformation, and data partitioning. Data acquisition describes the data and parameters used in this study. Data exploration presents the descriptive analysis, and data transformation explains how the data were transformed before analysis. Finally, dataset partitioning outlines how the data were divided.

For data acquisition, field observations were conducted using stationary observation posts rather than route-based surveys. Specifically, observers were positioned at predefined fixed monitoring-station locations (L), where PM2.5 and meteorological data consisting of temperature (T) and relative humidity (RH) were recorded. PM2.5 served as the target variable, while T, RH, and L served as input variables. Data covering July 2022 to June 2023 were obtained from the Department of Environment (DOE) and Malaysia Meteorological Department (MetMalaysia) for each location, ensuring consistency in observation duration and conditions.

This stationary-post approach enabled controlled and repeatable data acquisition across locations, providing a reliable dataset for subsequent machine learning (ML) analysis. An exploratory data analysis (EDA) was performed to investigate the properties of the variables used as input for the model. This analysis was carried out using a Spearman correlation heatmap and descriptive statistics. The heatmap provides a visual summary of the strength and direction of monotonic associations between variables, while the descriptive statistics comprise the mean, median, and standard deviation to characterize central tendency and dispersion, as well as skewness to assess whether the variables follow a normal distribution.
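The exploratory step described above can be sketched with pandas as follows. This is a minimal illustration on synthetic data, not the study's code: the column names `pm25`, `temp`, and `rh`, the sample size, and the data-generating formula are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; column names and distributions are hypothetical.
rng = np.random.default_rng(0)
n = 200
temp = rng.normal(26.5, 0.8, n)
rh = rng.normal(74.0, 5.0, n)
# Right-skewed PM2.5 that rises with temperature (illustrative only).
pm25 = np.exp(0.3 * (temp - 26.5) + rng.normal(2.7, 0.4, n))
df = pd.DataFrame({"pm25": pm25, "temp": temp, "rh": rh})

# Spearman rank correlation matrix (the values a correlation heatmap would display).
spearman = df.corr(method="spearman")

# Central tendency, dispersion, and skewness for each variable.
summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "std": df.std(),
    "skew": df.skew(),
})

print(spearman.round(2))
print(summary.round(3))
```

A positive skewness together with a mean above the median is the pattern that would point to occasional high-concentration values in a variable.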

Missing-value detection on the retrieved data showed good completeness for all four variables across the 13,141 samples. However, the dataset contained outliers. These outliers were retained because they represent genuine environmental phenomena or rare events, which ensures that the models capture the full variability and potential extremes of environmental systems. The readings of each parameter were originally recorded on an hourly basis. To enable prediction of daily PM2.5 concentrations, the data were transformed from hourly to daily format, with daily values obtained by averaging the hourly measurements. This transformation ensures consistency between the input data and the modeling objective while reducing high-frequency variability and noise associated with short-term fluctuations (e.g., traffic or meteorological changes). The use of daily averages provides a more stable and representative measure of ambient air quality, facilitating the identification of underlying temporal patterns relevant for prediction. Furthermore, this approach is consistent with standard air quality assessment practices, where regulatory guidelines and health-based thresholds are typically defined based on 24 h mean PM2.5 concentrations, thereby enhancing the interpretability and comparability of the results. Next, an ordinal encoding technique was applied to transform the L variable into a structured numerical format suitable for analysis.
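The hourly-to-daily averaging and the ordinal encoding of station location can be illustrated as below. The station names, the toy values, and the alphabetical ordering of the integer codes are assumptions for this sketch; the study does not state how the codes were assigned.

```python
import numpy as np
import pandas as pd

# Two hypothetical stations with 48 hourly readings each (deterministic toy values).
start = pd.Timestamp("2022-07-01")
times = start + pd.to_timedelta(np.arange(48), unit="h")
frames = []
for station, base in [("Station B", 20.0), ("Station A", 12.0)]:
    frames.append(pd.DataFrame({
        "station": station,
        "time": times,
        "pm25": base + np.arange(48) % 24,
    }))
hourly = pd.concat(frames, ignore_index=True)

# Daily value = mean of that day's hourly readings, computed per station.
hourly["date"] = hourly["time"].dt.normalize()
daily = hourly.groupby(["station", "date"], as_index=False)["pm25"].mean()

# Ordinal encoding: map each station name to an integer code.
# Alphabetical ordering is an assumption made for this example.
codes = {name: i for i, name in enumerate(sorted(daily["station"].unique()))}
daily["L"] = daily["station"].map(codes)

print(daily)
```

Because the integer codes impose an arbitrary order on the stations, tree-based models are a reasonable fit for this encoding: they split on thresholds rather than assuming that the numeric distance between codes is meaningful.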

For this study, the data were randomly split into 80% for training and 20% for testing. This split, supported by [16] in the context of PM2.5 concentration prediction, has proven effective in providing sufficient data for model training while ensuring reliable evaluation. Evaluating on the held-out test set helps to verify that the model generalizes well to unseen data.
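A random 80/20 partition of this kind can be sketched with scikit-learn's `train_test_split`. The synthetic arrays and the `random_state` value are placeholders chosen for the example, not settings reported by the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))   # columns stand in for T, RH, and encoded L
y = rng.normal(size=100)        # stands in for PM2.5

# Shuffle the rows, then hold out 20% of them for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

print(X_train.shape, X_test.shape)
```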

2.3. Model Development

Tree-based models, namely DT, RF, and XGBoost, were selected for model development because of their capability to handle nonlinear relationships in environmental data. Additionally, DT offers a straightforward and interpretable structure, RF demonstrates strong robustness in handling noisy data, and XGBoost is known for its ability to capture complex interactions.

DT is a simple but powerful algorithm that partitions data at each node into subsets based on feature importance [17]. DT also represents all possible pathways in a decision-making process, where each internal node denotes a decision, each branch reflects the corresponding outcome, and each terminal leaf represents a final prediction. The tree construction begins with the root node and proceeds by splitting and generating additional nodes until the stopping criteria are met. RF is a widely used ensemble learning technique that is effective for both regression and classification tasks [18,19]. It is an advanced variant of bagging, where a large collection of decision trees is trained on different subsets of the data to collectively determine the final prediction. In this framework, sampling of data points is referred to as row sampling, while sampling of features is called column sampling, with each tree being built from a unique combination of both [20]. Specifically, in the RF model, every tree is grown using a random subset of the training data along with a random subset of the input features. The final model output is obtained by taking the average of predictions from all individual trees [21]. This strategy enhances the model’s robustness, providing reliable estimates even for previously unseen data. The RF prediction can be expressed as Equation (1) [20].

\mathrm{RF}(x) = \frac{1}{k} \sum_{i = 1}^{k} h_{i} (x)

(1)

where k is the number of trees in the ensemble and h_i (x) is the prediction from the i-th decision tree.
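The averaging in Equation (1) can be checked numerically: for a fitted scikit-learn `RandomForestRegressor`, the mean of the individual tree predictions should match the forest's own prediction. The synthetic data and the choice of 25 trees below are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(120, 3))
y = 10 + 5 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=120)

forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X, y)

# Prediction of each individual tree, stacked to shape (k, n_samples).
tree_preds = np.stack([tree.predict(X) for tree in forest.estimators_])

# Equation (1): average the k tree predictions.
manual_average = tree_preds.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X)))
```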

XGBoost is a highly powerful ML algorithm that builds upon decision trees as its fundamental predictive units. This technique constructs an ensemble of trees in a sequential manner, where each new tree is designed to correct the prediction errors of the previous ones. Compared to other ML models, XGBoost offers greater flexibility, owing to its extensive set of tunable hyperparameters. One of its key strengths is its ability to control overfitting through regularization, which not only improves model generalization but also speeds up the training process [15,22]. The XGBoost prediction can be expressed as Equation (2) [20].

\hat{Y}_{i}^{(t)} = \sum_{k = 1}^{t} f_{k} (x_{i}) = \hat{Y}_{i}^{(t - 1)} + f_{t} (x_{i})

(2)

where \hat{Y}_{i}^{(t)} is the prediction at iteration t, \hat{Y}_{i}^{(t - 1)} is the prediction from the previous iteration, f_{t} (x_{i}) is the new weak learner added at step t, and x_i is the input variable.
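The additive update in Equation (2) can be illustrated with a hand-written boosting loop. Note that this is plain gradient boosting with a squared-error loss built from scikit-learn decision trees, not the XGBoost library or its regularized objective. The learning rate, tree depth, number of rounds, and synthetic data are all assumptions for the sketch; the learning rate in particular is a shrinkage factor that does not appear in Equation (2).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(200, 1))
y = np.sin(2 * X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3   # hypothetical shrinkage factor
n_rounds = 30         # hypothetical number of boosting rounds

prediction = np.full_like(y, y.mean())          # initial constant prediction
mse_history = [np.mean((y - prediction) ** 2)]

for t in range(n_rounds):
    residual = y - prediction                   # errors of the current ensemble
    f_t = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residual)
    # Additive update: new prediction = previous prediction + (shrunken) new tree.
    prediction = prediction + learning_rate * f_t.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(mse_history[0], mse_history[-1])
```

Each round fits a small tree to the residuals left by the previous rounds, so the training error does not increase from one round to the next.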

Hyperparameter tuning is the process of identifying the most suitable values for the model parameters to enhance model performance. In this study, hyperparameter optimization was carried out using a grid search, a systematic approach that exhaustively explores all possible parameter combinations within a predefined search space. This method was applied to the tree-based models, namely DT, RF, and XGBoost. The optimal set of hyperparameters was determined by selecting the combination that minimized the validation error. The final hyperparameter settings for the tree-based models are presented in Table 1.
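A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`, shown here for a decision tree. The parameter grid, the five-fold cross-validation, the RMSE scoring, and the synthetic data are assumptions for illustration; they are not the search space or validation scheme reported in Table 1.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)
X = rng.uniform(size=(150, 3))
y = 15 + 8 * X[:, 0] + 4 * (X[:, 2] > 0.5) + rng.normal(scale=1.0, size=150)

# Hypothetical search space: 3 x 3 = 9 candidate combinations.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5, 10]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=5,
)
search.fit(X, y)

# Combination with the lowest cross-validated RMSE.
print(search.best_params_, -search.best_score_)
```

The same pattern applies to the other two models by swapping the estimator and the grid.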

2.4. Model Evaluation

The evaluation phase was essential for assessing the performance of the tree-based models and ensuring the required reliability. Four statistical indicators, namely the coefficient of determination (R²), root mean squared error (RMSE), mean absolute error (MAE), and normalized absolute error (NAE), were used to evaluate the agreement between estimated and observed PM2.5 concentrations as per Equations (3)–(6).

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(3)

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}

(4)

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - \hat{y}_{i}|

(5)

\mathrm{NAE} = \frac{\sum_{i = 1}^{n} |y_{i} - \hat{y}_{i}|}{\sum_{i = 1}^{n} y_{i}}

(6)

where y_i represents the actual (observed) value of PM2.5 concentrations, \hat{y}_i represents the predicted value, \bar{y} represents the mean of the actual values, and n represents the number of observations. An R² value closer to one indicates a better model, while RMSE, MAE, and NAE values closer to zero indicate a better model.
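The four indicators in Equations (3)–(6) can be computed directly with NumPy, as in the sketch below. The function name `evaluate` and the four-point toy dataset are hypothetical. The sketch normalizes NAE by the sum of the observed concentrations, which is an assumption about the intended normalization.

```python
import numpy as np

def evaluate(observed, predicted):
    """Return R2, RMSE, MAE, and NAE for paired observed/predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    error = observed - predicted
    r2 = 1.0 - np.sum(error ** 2) / np.sum((observed - observed.mean()) ** 2)
    rmse = np.sqrt(np.mean(error ** 2))
    mae = np.mean(np.abs(error))
    nae = np.sum(np.abs(error)) / np.sum(observed)  # normalized by observed total
    return {"R2": r2, "RMSE": rmse, "MAE": mae, "NAE": nae}

# Toy values in ug/m3 (not study data).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
scores = evaluate(observed, predicted)
print(scores)
```

For this toy case the absolute errors are 2, 2, 3, and 3, giving MAE = 2.5 and NAE = 10/100 = 0.1, while R² = 1 − 26/500 = 0.948.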

3. Results

Table 2 presents the measures of central tendency and dispersion of PM2.5, temperature (T), and relative humidity (RH) during the study period. According to the Malaysian Ambient Air Quality Standard (MAAQS-2020), the recommended limit for the one-month average of PM2.5 is 35 μg/m³ [23]. The mean monthly PM2.5 values at all stations were below this threshold. However, the mean PM2.5 concentrations exceeded the median values, indicating skewed distributions and the presence of extreme concentration events in certain months.

PM2.5 concentrations were highest in July 2022 during the southwest monsoon, reaching a maximum mean of 23.939 μg/m³ (SD = 9.946), and lowest in October 2022 during the northeast monsoon, with an average of 11.21 μg/m³ (SD = 0.804). Positive skewness values further indicate the occurrence of extreme events.

Temperature reached its highest level in May 2023, with a mean of 27.658 °C (SD = 0.180), while the lowest mean temperature was observed in December 2022 at 25.350 °C (SD = 0.041). Relative humidity was highest in November 2022, averaging 76.536% (SD = 5.293), and lowest in February 2023 at 72.248% (SD = 6.772). The non-normal distributions observed across variables support the use of tree-based models, which do not require normality assumptions.

Monthly variations in PM2.5 concentrations across 36 stations in the GKV are illustrated using a heatmap (Figure 2). Elevated PM2.5 levels were observed from July to September 2022 and May to June 2023. A noticeable reduction occurred in October 2022, followed by generally lower concentrations from November 2022 to March 2023. Overall, the lowest PM2.5 concentrations were recorded in October 2022, whereas higher values were observed around April 2023 during the inter-monsoon period.

Figure 3 and Figure 4 present the Spearman correlation matrices of PM2.5, temperature (T), and relative humidity (RH). Figure 3 shows the monthly correlations, which highlight temporal variations, while Figure 4 presents the overall relationship between variables. In both figures, the correlation coefficients between PM2.5 and the meteorological variables ranged from very weak to moderate.

Figure 3 illustrates weak to moderate positive correlations between PM2.5 and temperature in several months, including July 2022 (r = 0.25), August 2022 (r = 0.44), October 2022 (r = 0.23), January 2023 (r = 0.25), and April 2023 (r = 0.32). In contrast, a negligible negative correlation was observed in May 2023 (r = −0.05). Regarding relative humidity (RH), the correlation coefficients with PM2.5 were generally negligible to weak. Negligible to weak positive correlations were observed in November 2022 (r = 0.07), January 2023 (r = 0.03), February 2023 (r = 0.13), and May 2023 (r = 0.21). In contrast, weak negative correlations were found in August 2022 (r = −0.23), December 2022 (r = −0.14), and March 2023 (r = −0.12).

Figure 4 shows a weak positive correlation between PM2.5 and temperature (r = 0.24), while a negligible negative correlation was observed with relative humidity (r = −0.05). This indicates that higher temperature (T) was slightly associated with higher PM2.5, whereas higher relative humidity (RH) was only marginally associated with lower PM2.5.

Table 3 summarizes the performance of the decision tree (DT), random forest (RF), and XGBoost models using R², RMSE, MAE, and NAE. Across the evaluated months, the XGBoost model generally achieved better predictive performance than DT and RF.

Using XGBoost, the highest R² was observed in March 2023 (0.559), followed by February 2023 (0.552) and June 2023 (0.508), suggesting that while the model captures a substantial portion of the variability in daily PM2.5 concentrations, a considerable fraction remains unexplained. This level of performance is not unexpected in air quality modeling, where PM2.5 dynamics are influenced by complex and partially unobserved factors such as localized emission sources, atmospheric chemistry, and meteorological variability. Given these inherent uncertainties and the use of aggregated daily data, the observed R² is considered acceptable for capturing general temporal trends rather than precise point predictions. Therefore, the model should be interpreted as a tool for identifying broad patterns in PM2.5 variability rather than providing highly accurate forecasts. The lowest performance occurred in November 2022 (R² = 0.214). Among the error metrics, October 2022 had the lowest RMSE (4.692) and MAE (3.745), with an NAE of 0.364. The highest RMSE (8.313) and MAE (6.703) were observed in April 2023, with an NAE of 0.284.

Figure 5 presents the relative importance of predictors in the XGBoost model. Temperature was the dominant variable in July 2022 (0.42), August 2022 (0.50), October 2022 (0.39), and January 2023 (0.40). Location showed the highest influence in November 2022 (0.47), February 2023 (0.61), and March 2023 (0.63). Relative humidity was most influential in September 2022 (0.41).

Figure 6 illustrates the comparison between observed and predicted PM2.5 concentrations. The predicted values generally followed the observed trends closely. Deviations were mainly observed during extreme PM2.5 events.

4. Discussion

The descriptive analysis revealed that PM2.5 concentrations in the GKV remained below the MAAQS guideline but exhibited skewed distributions, indicating episodic pollution events. Higher PM2.5 levels during the southwest monsoon (June to September) in Figure 2 suggest the influence of regional transport and local anthropogenic activities such as biomass and open burning, consistent with findings by [24]. In contrast, reduced concentrations during the northeast monsoon can be attributed to enhanced rainfall and atmospheric washout processes [25].

The heatmap analysis further confirms a clear seasonal signal, with elevated PM2.5 during dry periods and lower values during wetter months. The inter-monsoon transition periods showed intermediate concentrations, reflecting mixed meteorological conditions that influence pollutant accumulation and dispersion.

Correlation analysis indicates that temperature and relative humidity have weak to moderate relationships with PM2.5. Positive associations between T and PM2.5 in most months imply that warmer conditions promote pollutant accumulation, likely due to reduced dispersion and enhanced secondary aerosol formation. The mixed behavior of RH suggests that humidity may both facilitate particle growth and enhance wet deposition depending on prevailing atmospheric conditions [26].

Model comparison demonstrates that XGBoost performs better than DT and RF for PM2.5 prediction in the GKV, supporting previous work by [27]. The lower predictive performance during the northeast monsoon, particularly in November 2022, may be related to rainfall variability and rapidly changing atmospheric conditions that introduce noise into the prediction process.

The variable importance results highlight the seasonal role of meteorology and spatial factors. Temperature dominated during the southwest monsoon, reinforcing its influence on PM2.5 formation and accumulation. Location emerged as a key predictor in several months, indicating the importance of spatial heterogeneity, possibly linked to industrial distribution, urban structure, and land-use patterns across the GKV. RH was influential in September 2022, suggesting that dry conditions during this period contributed to higher PM2.5 concentrations.

A limitation of this study is the exclusion of key meteorological variables such as precipitation, wind speed, and atmospheric boundary layer height, which are known to influence pollutant dispersion, removal, and accumulation. Due to the lack of consistent, high-resolution meteorological data aligned with the sampling locations and periods, these factors were not incorporated into the analysis. Nevertheless, the model relies on locally measured variables that reflect site-specific conditions, allowing it to capture relevant patterns in PM2.5 variability within the study context. While the absence of these meteorological parameters may limit the generalizability of the findings, the model remains effective for the available dataset. Future studies should integrate such parameters to enhance model robustness and provide a more comprehensive understanding of air quality dynamics.

Finally, although XGBoost effectively captured overall PM2.5 trends, its limitations in predicting extreme pollution events were evident. This behavior agrees with [20], who reported reduced model reliability under high PM2.5 concentrations. Future work may integrate additional predictors such as wind speed, boundary layer height, and emission inventories to improve performance under extreme conditions.

5. Conclusions

This study investigated the spatial–temporal variability of PM2.5 concentrations in the GKV using machine learning models integrated with meteorological and spatial predictors. The descriptive analysis showed that although monthly PM2.5 levels remained below the Malaysian Ambient Air Quality Standard, the distributions were skewed, indicating the occurrence of episodic pollution events. Seasonal patterns were evident, with higher PM2.5 concentrations during the southwest monsoon and lower levels during the northeast monsoon, highlighting the strong influence of regional climate conditions on air quality.

Correlation analysis revealed weak to moderate relationships between PM2.5 and meteorological factors, suggesting that temperature and relative humidity contribute to PM2.5 variability in a complex and season-dependent manner. Among the tested models, XGBoost consistently outperformed decision tree and random forest, demonstrating its robustness for PM2.5 prediction in the GKV. Variable importance analysis further indicated that temperature dominates during the southwest monsoon, while spatial factors, represented by station location, play a major role in explaining PM2.5 variability across several months.

Despite the strong overall performance, the model showed limitations in predicting extreme PM2.5 events, particularly during inter-monsoon periods. This highlights the need for incorporating additional atmospheric and emission-related predictors, such as wind speed, boundary layer height, and source inventories, in future work. Overall, the proposed framework provides valuable insights for air quality monitoring and supports evidence-based environmental and public health management in the GKV region.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, N.M.S., S.H.A.I. and S.S.S.; project administration, funding acquisition, N.M.S. All authors have read and agreed to the published version of the manuscript.

Funding

The research herein was funded by the Ministry of Health Malaysia, grant number NMRR-19-3747-52136.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Medical Research & Ethics Committee of Ministry of Health Malaysia (protocol code: NMRR-19-3747-52136 and dated 31 December 2020).

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used during the current study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors sincerely thank the Director-General of Health Malaysia for granting permission to submit this manuscript for publication and for supporting the dissemination of this research.

Conflicts of Interest

The authors declare no conflicts of interest regarding this study.

Abbreviations

The following abbreviations are used in this manuscript:

ML	Machine Learning
GKV	Greater Klang Valley
PM2.5	particles less than 2.5 micrometers in diameter
T	Temperature
RH	Relative Humidity
L	station Location
DT	Decision Tree
RF	Random Forest
XGBoost	Extreme Gradient Boosting
DOE	Department of Environment
MetMalaysia	Malaysia Meteorological Department
MAAQS-2020	Malaysian Ambient Air Quality Standard
R²	Coefficient of Determination
RMSE	Root Mean Squared Error
MAE	Mean Absolute Error
NAE	Normalized Absolute Error

Figure 1. Research workflow used in the current approach for PM2.5 predictions.

Figure 2. Heatmap of monthly PM2.5 in the Greater Klang Valley.

Figure 3. Monthly Spearman correlation matrix between parameters.

Figure 4. Overall Spearman correlation matrix between parameters.

Figure 5. Feature importance of XGBoost.

Figure 6. Observed vs. predicted PM2.5 concentrations for the XGBoost model.

Table 1. Hyperparameter optimization using a grid search method.

Model	Hyperparameter	2022						2023
		Jul	Aug	Sept	Oct	Nov	Dec	Jan	Feb	Mar	Apr	May	Jun
DT	Max Depth	5	5	5	20	10	20	15	30	20	10	10	5
	Min Samples Split	12	2	2	10	2	5	5	2	2	15	15	12
	Min Samples Leaf	5	10	10	1	10	2	1	1	1	5	5	5
RF	N Estimators	100	200	100	500	100	500	400	400	200	300	200	100
	Max Depth	5	10	15	30	30	15	30	10	20	10	10	30
	Min Samples Split	2	10	2	10	10	5	5	2	2	10	2	10
XGBoost	N Estimators	100	100	100	500	400	100	100	100	500	100	500	100
	Max Depth	5	5	5	5	5	5	5	5	5	5	5	5
	Learning Rate	0.1	0.1	0.2	0.01	0.01	0.1	0.1	0.1	0.01	0.1	0.01	0.2
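
As a rough illustration of how a grid search arrives at per-month settings like those in Table 1, the sketch below enumerates every combination in a candidate grid and keeps the one with the lowest validation error. It is a minimal sketch and not the study's code: the grid simply reuses the XGBoost values that appear in Table 1 (the actual search space is not reported here), and `toy_score` is a hypothetical stand-in for fitting a model and returning a cross-validated error.

```python
from itertools import product

# Assumed candidate grid: only the XGBoost values that appear in Table 1,
# not the search space actually used in the study.
PARAM_GRID = {
    "n_estimators": [100, 400, 500],
    "max_depth": [5],
    "learning_rate": [0.01, 0.1, 0.2],
}


def grid_search(param_grid, score_fn):
    """Return (best_params, best_score) over every combination in param_grid.

    score_fn takes a dict of hyperparameters and returns a validation error
    (lower is better), for example a cross-validated RMSE.
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:  # keep the first combination with the lowest error
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical error surface used only to make the sketch runnable."""
    return (abs(params["learning_rate"] - 0.1)
            + abs(params["n_estimators"] - 100) / 1000)


best_params, best_score = grid_search(PARAM_GRID, toy_score)
print(best_params, best_score)
```

In practice `toy_score` would be replaced by a function that trains the model on the training folds for one month and returns its validation error, which is why the selected values can differ from month to month.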

Table 2. Descriptive statistics of the parameters used in the study, by month.

Year	Month	PM2.5 (μg/m³)				T (°C)				RH (%)
		Mean	Median	Standard Deviation	Skewness	Mean	Median	Standard Deviation	Skewness	Mean	Median	Standard Deviation	Skewness
2022	Jul	23.9	23	9.9	0.235	26.9	26.9	1.3	0.114	71.5	71.5	6.9	−0.041
	Aug	19.9	19	9.9	0.303	26.1	26.1	1.2	0.067	73.8	74.0	6.1	−0.114
	Sept	16.9	15	8.8	0.531	26.1	26.1	1.1	−0.058	72.4	72.4	5.7	0.068
	Oct	11.2	10	6.6	0.804	25.8	25.7	1.1	0.267	74.6	74.7	5.4	−0.175
	Nov	13.4	12	7.7	0.693	25.6	25.6	1.0	0.246	76.6	76.7	5.3	−0.144
	Dec	13.3	11	8.8	0.803	25.4	25.3	1.1	0.041	74.8	75.2	6.6	−0.319
2023	Jan	15.2	13	9.7	0.692	25.4	25.5	1.0	−0.204	73.5	73.5	6.3	0.019
	Feb	13.7	12	8.5	0.664	25.7	25.7	1.0	−0.202	72.2	72.7	6.8	−0.246
	Mar	16.8	15	9.4	0.468	26.4	26.5	1.3	−0.498	70.7	70.3	7.9	0.160
	Apr	23.8	23	11.8	0.159	26.7	26.6	1.1	0.204	73.8	74.2	5.0	−0.105
	May	14.5	13	8.9	0.632	27.7	27.6	1.1	0.180	72.6	72.4	7.0	0.377
	Jun	18.9	18	10.3	0.442	27.3	27.3	1.2	0.593	72.3	72.2	5.6	−0.187

Table 3. Model performance for tree-based models.

Year	Month	Model	R²	RMSE (μg/m³)	MAE (μg/m³)	NAE
2022	July	DT	0.169	8.313	6.646	0.288
		RF	0.237	7.965	6.382	0.276
		XGBoost	0.322	7.511	5.933	0.257
	August	DT	0.197	8.258	6.849	0.345
		RF	0.312	7.644	6.264	0.316
		XGBoost	0.354	7.407	6.087	0.307
	September	DT	<0.1	8.645	7.250	0.435
		RF	0.228	7.884	6.585	0.395
		XGBoost	0.229	7.881	6.574	0.395
	October	DT	<0.1	5.858	4.351	0.423
		RF	0.360	4.699	3.761	0.366
		XGBoost	0.362	4.692	3.745	0.364
	November	DT	0.146	7.421	5.992	0.446
		RF	0.202	7.177	5.882	0.437
		XGBoost	0.214	7.122	5.821	0.433
	December	DT	<0.1	8.827	6.539	0.484
		RF	0.320	7.285	5.648	0.418
		XGBoost	0.319	7.285	5.443	0.403
2023	January	DT	<0.1	10.839	7.906	0.526
		RF	0.313	7.760	6.063	0.403
		XGBoost	0.367	7.450	5.849	0.389
	February	DT	0.234	7.269	5.243	0.396
		RF	0.547	5.588	4.239	0.320
		XGBoost	0.552	5.557	4.078	0.308
	March	DT	0.323	7.699	6.058	0.345
		RF	0.531	6.413	5.191	0.296
		XGBoost	0.559	6.219	5.041	0.287
	April	DT	0.322	9.583	7.608	0.323
		RF	0.418	8.876	7.115	0.302
		XGBoost	0.490	8.313	6.703	0.284
	May	DT	0.145	8.446	6.601	0.457
		RF	0.340	7.419	5.858	0.405
		XGBoost	0.351	7.359	5.831	0.403
	June	DT	0.311	8.808	7.0	0.361
		RF	0.495	7.543	6.001	0.310
		XGBoost	0.508	7.439	5.850	0.302

Note: Bold values show the highest value for each statistical indicator.
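
For readers who want to reproduce the four indicators in Table 3, the sketch below computes them from paired observed and predicted values. It is a minimal sketch under stated assumptions rather than the authors' code: R² is taken as 1 − SS_res/SS_tot, and NAE is assumed to be the sum of absolute errors divided by the sum of observed values, a common definition that this section does not spell out. The sample values are made up for illustration.

```python
import math


def regression_metrics(observed, predicted):
    """Return R², RMSE, MAE and NAE for paired observed/predicted values.

    Assumed definitions (not confirmed by the source):
      R²  = 1 - SS_res / SS_tot
      NAE = sum(|predicted - observed|) / sum(observed)
    """
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal in length")
    errors = [p - o for o, p in zip(observed, predicted)]
    mean_obs = sum(observed) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    abs_err = sum(abs(e) for e in errors)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": abs_err / n,
        "nae": abs_err / sum(observed),
    }


# Made-up PM2.5 values (μg/m³), for illustration only.
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Under these definitions a higher R² is better, while lower RMSE, MAE and NAE indicate a closer fit. RMSE and MAE carry the units of PM2.5 (μg/m³), whereas R² and NAE are dimensionless.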

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
