Air pollution caused by fine particulate matter (PM₂.₅) poses a serious public health threat in many South Asian megacities where monitoring networks remain limited. Lahore, Pakistan, frequently ranked among the world’s most polluted cities, still lacks a reliable short-term PM₂.₅ forecasting system. This study develops a performance-weighted ensemble machine learning framework that integrates satellite observations, meteorological reanalysis data, and ground monitoring measurements to improve daily PM₂.₅ prediction. Eleven predictor variables, including MODIS aerosol optical depth, Sentinel-5P trace gases (CO, NO₂, SO₂), and ERA5 meteorological parameters, were processed in a unified Google Earth Engine pipeline. Four tree-based machine learning algorithms (Random Forest, XGBoost, LightGBM, and CatBoost) were trained on daily observations from 2019 to 2023. Evaluation on an independent 2024 dataset showed strong predictive capability: Random Forest achieved R² = 0.77 (RMSE = 24.75 µg m⁻³), XGBoost R² = 0.76 (RMSE = 26.32 µg m⁻³), CatBoost R² = 0.73 (RMSE = 30.39 µg m⁻³), and LightGBM R² = 0.70 (RMSE = 32.75 µg m⁻³). To further enhance performance, the three best-performing models were combined into a weighted ensemble (Random Forest 0.5, XGBoost 0.3, CatBoost 0.2), which matched the best single-model R² and produced the lowest error (R² = 0.77; RMSE = 23.37 µg m⁻³). Paired t-tests and Diebold–Mariano tests confirmed that the ensemble significantly reduced forecast errors compared with the individual models. Feature importance analysis identified surface pressure, temperature, CO, and NO₂ as the most influential predictors of PM₂.₅ variability. The proposed framework demonstrates that combining satellite data, reanalysis meteorology, and ground observations through ensemble learning can provide accurate and scalable air quality forecasting for data-limited urban environments.
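The weighted ensemble step can be illustrated with a minimal sketch. Only the three weights (0.5, 0.3, 0.2) come from the abstract; the function names, the dictionary keys, and the toy concentration values below are hypothetical and are not the study's data. The abstract does not state how the weights were derived, so the sketch treats them as fixed constants and shows how a blended prediction, RMSE, and R² could be computed.

```python
from math import sqrt

# Ensemble weights as reported in the abstract (RF, XGBoost, CatBoost).
WEIGHTS = {"rf": 0.5, "xgb": 0.3, "cat": 0.2}


def weighted_ensemble(preds, weights=WEIGHTS):
    """Blend per-model daily predictions using fixed weights that sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n_days = len(next(iter(preds.values())))
    return [
        sum(weights[name] * preds[name][day] for name in weights)
        for day in range(n_days)
    ]


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (here µg m^-3)."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy daily PM2.5 values (made up for illustration only).
observed = [100.0, 150.0, 200.0, 250.0]
toy_preds = {
    "rf": [110.0, 140.0, 190.0, 260.0],
    "xgb": [90.0, 160.0, 210.0, 240.0],
    "cat": [120.0, 130.0, 220.0, 230.0],
}

blended = weighted_ensemble(toy_preds)
ensemble_rmse = rmse(observed, blended)
ensemble_r2 = r2(observed, blended)
```

In this toy case the individual models err in different directions, so the blend has a lower RMSE than any single model. That is the general motivation for ensembling, though the size of the gain on real data depends on how correlated the models' errors are.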