Fine particulate matter (PM2.5) is a critical environmental and health concern in northern Thailand, where haze episodes are strongly influenced by biomass burning, meteorological variability, and complex topography. This study aims to (1) analyze and select input variables for PM2.5 prediction by integrating WRF-Chem outputs, satellite data, and ground observations, and (2) evaluate the predictive performance of four machine learning (ML) algorithms (Random Forest (RF), XGBoost, CNN3D, and ConvLSTM) during the 2024 haze season (January–May). The dataset included hourly PM2.5 observations from 54 stations, WRF-Chem-simulated PM2.5 and meteorological variables, satellite-based fire data, and geographical data. To improve consistency with ground-based data, WRF-Chem PM2.5 values were bias-corrected for the training and validation phases prior to ML training. Among the Linear Regression, RF, XGBoost, Artificial Neural Network (ANN), and Convolutional Neural Network (CNN) models tested for bias correction, RF achieved the best performance (R = 0.78, RMSE = 29.28 µg/m³); the RF-corrected WRF-Chem PM2.5 was then used as an input to the forecasting stage. Variable selection was supported by correlation, variance inflation factor (VIF), feature importance, and SHAP analyses. The results indicate that RF provided the most reliable predictions, achieving a correlation of R = 0.867 and the lowest RMSE of 27.6 µg/m³ when using the SHAP+VIF-selected input set (seven variables: PM2.5_lag1, PM2.5_lag24, T2, RH2, Precip, Burned Area, NDVI). Notably, RF remained the top performer during high-pollution conditions, predicting PM2.5 more accurately than the other algorithms in the Air Quality Index (AQI) categories “Unhealthy for Sensitive Groups” (high) and “Unhealthy” (very high). Taken together, RF performed best in both the bias-correction and forecasting stages, with XGBoost ranked second, whereas CNN3D and ConvLSTM performed considerably worse. These findings emphasize the effectiveness of ensemble tree-based algorithms combined with bias-corrected WRF-Chem outputs and strategic variable selection in supporting accurate hourly PM2.5 predictions for air quality management in biomass burning regions.
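
The lagged inputs (PM2.5_lag1, PM2.5_lag24) and the two reported scores (R and RMSE) can be illustrated with a minimal sketch. It is not the study's code: the function names, the synthetic hourly series, and the persistence baseline are assumptions chosen only to show how such features and metrics are typically computed.

```python
import math


def make_lag_features(series, lags=(1, 24)):
    """Build lagged inputs: each row holds series[t - lag] for every lag.

    Hypothetical helper; the lags 1 and 24 mirror PM2.5_lag1 and PM2.5_lag24.
    """
    start = max(lags)  # first hour for which every lag is available
    rows = [[series[t - lag] for lag in lags] for t in range(start, len(series))]
    targets = list(series[start:])
    return rows, targets


def rmse(observed, predicted):
    """Root mean square error, in the units of the inputs (e.g. µg/m³)."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(observed, predicted):
    """Pearson correlation coefficient between observations and predictions."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


# Synthetic hourly series with a daily cycle (illustrative, not study data).
pm25 = [30.0 + 10.0 * math.sin(2 * math.pi * h / 24) + 0.1 * h for h in range(72)]

rows, targets = make_lag_features(pm25)

# Persistence baseline: predict each hour with the previous hour's value.
persistence = [row[0] for row in rows]

print("samples:", len(rows))
print("RMSE:", round(rmse(targets, persistence), 2))
print("R:", round(pearson_r(targets, persistence), 3))
```

A persistence baseline of this kind gives a simple reference point against which a trained model's R and RMSE can be compared.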