Explainable Machine Learning Reveals Distinct Air Pollution Profiles in Two Geographically Adjacent Cities
Abstract
1. Introduction
- In addition to traditional correlation analysis, a Dynamic Time Warping (DTW) distance matrix is calculated to measure how similar the pollutant time series are in shape and timing, and it is visualized as a hierarchical clustering dendrogram. This provides quantitative evidence on whether the pollution regimes of the two cities differ in their non-linear temporal variation.
- Random Forest (RF), XGBoost, and Support Vector Machine (SVM) models are trained to classify the city of origin from pollutant data, in order to assess how usable air quality measurements are for spatial identification. The classification performance indicates whether the air pollution profiles of the two cities are distinct enough to be separated reliably.
- SHAP (SHapley Additive Explanations) analysis is used to provide model-based interpretability by quantifying the contribution of individual pollutants to classification outcomes. The analysis identifies relative feature importance and captures interaction effects within the model, thereby enhancing transparency without implying direct physical or mechanistic relationships. This contributes to the development of interpretable data-driven frameworks for air pollution analysis.
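The DTW distance matrix named in the first contribution can be illustrated with a minimal sketch. It assumes the classic dynamic-programming formulation with an absolute-difference local cost and no warping window; the study's actual settings (normalization, window, library) are not stated here. The function names and the toy series are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic DTW: absolute-difference local cost, no window constraint (assumed)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = local + min(
                cost[i - 1][j],
                cost[i][j - 1],
                cost[i - 1][j - 1],
            )
    return cost[n][m]


def dtw_matrix(series: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """Pairwise DTW distances between all named series."""
    names = list(series)
    return {(p, q): dtw_distance(series[p], series[q]) for p in names for q in names}


# Hypothetical toy series, not data from the study.
toy = {
    "city_a": [0, 0, 1, 2, 1, 0],
    "city_b": [0, 1, 2, 1, 0, 0],  # same shape as city_a, shifted by one step
    "city_c": [2, 2, 2, 2, 2, 2],  # flat profile
}
distances = dtw_matrix(toy)
print(distances[("city_a", "city_b")])
print(distances[("city_a", "city_c")])
```

In this toy case the shifted series are at zero DTW distance even though their pointwise differences are not zero, which is the property that makes DTW a measure of shape similarity rather than of same-day agreement.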
Related Works
2. Materials and Methods
2.1. Dataset and Data Preprocessing
2.2. Dynamic Time Warping (DTW) Analysis
2.3. Random Forest (RF)
2.4. XGBoost
2.5. Support Vector Machine (SVM)
2.6. SHAP (SHapley Additive Explanations)
2.7. Hyperparameter Optimization
2.8. Evaluation Metrics
- TP (True Positive): Examples that the model predicted as positive and that are actually positive.
- FP (False Positive): Examples that the model predicted as positive but that are actually negative.
- TN (True Negative): Examples that the model predicted as negative and that are actually negative.
- FN (False Negative): Examples that the model predicted as negative but that are actually positive.
- Precision: The ratio of true positive predictions to all positive predictions, TP/(TP + FP). It shows how many of the examples predicted as positive are actually positive.
- Recall: The ratio of true positive predictions to all actual positive examples, TP/(TP + FN). It shows how well the model captures the positive examples.
- F-Measure: The harmonic mean of Precision and Recall, calculated as 2 × (Precision × Recall)/(Precision + Recall). It summarizes the balance between the two in a single score.
- ROC Area: The area under the ROC curve (AUC), which describes the trade-off between the true positive rate and the false positive rate across decision thresholds. AUC ≥ 0.9 is considered high performance.
- Accuracy: The ratio of correct predictions to all predictions, (TP + TN)/(TP + FP + TN + FN). It evaluates the overall success of the model.
- Confusion Matrix: A table of the number of correct and incorrect classifications for each class, which allows classification performance to be analyzed class by class.
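The definitions above can be made concrete with a short sketch. It computes the four confusion-matrix counts for one class treated as positive and derives Precision, Recall, F-Measure, and Accuracy from them. The function names and the toy labels are hypothetical and are not results from the study.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN with `positive` treated as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def classification_metrics(tp, fp, tn, fn):
    """Precision, recall, F-measure, and accuracy; 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical toy labels for illustration only.
y_true = ["Gaziantep", "Gaziantep", "Gaziantep", "Kilis",
          "Kilis", "Kilis", "Kilis", "Gaziantep"]
y_pred = ["Gaziantep", "Gaziantep", "Kilis", "Kilis",
          "Kilis", "Gaziantep", "Kilis", "Gaziantep"]

counts = confusion_counts(y_true, y_pred, positive="Gaziantep")
metrics = classification_metrics(*counts)
print(counts)
print(metrics)
```

In a two-class problem the per-class values for the other class are obtained by calling `confusion_counts` again with that class as `positive`.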
3. Results
3.1. Density and Correlation Analysis
3.2. Classification Results
3.3. DTW Distance Analysis
3.4. Explainability Analysis
4. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest










| Attribute | Unit | Range | Mean | Description |
|---|---|---|---|---|
| Date | n/a | 24 April 2021–9 August 2025 | n/a | Daily date of the measurements. Data were recorded simultaneously for both cities. |
| PM10 | µg/m³ | 2.53–658.6 | 56.11 | Particulate matter with a diameter of 10 micrometers or less. |
| SO2 | µg/m³ | 0.79–96.32 | 7.21 | Sulfur dioxide, produced by fossil fuel burning, industrial emissions, and heating activities. |
| CO | µg/m³ | 18.08–12,034.59 | 560.9 | Carbon monoxide, a colorless and odorless gas resulting from incomplete combustion. |
| O3 | µg/m³ | 6.23–167.93 | 38.13 | Tropospheric (ground-level) ozone, formed by photochemical reactions of nitrogen oxides and volatile organic compounds. Its concentration varies with sunlight. |
| City | n/a | Gaziantep: 1569 records; Kilis: 1569 records | n/a | Categorical (class) variable indicating the city of each observation. The study compares the air quality profiles of these two neighboring provinces. |

| Model | Class | Precision | Recall | F1-Score | Accuracy | F1-Score (Weighted) |
|---|---|---|---|---|---|---|
| RF | Gaziantep | 0.93 | 0.93 | 0.93 | 0.9299 | 0.9299 |
| RF | Kilis | 0.93 | 0.93 | 0.93 | | |
| XGBoost | Gaziantep | 0.93 | 0.92 | 0.93 | 0.9257 | 0.9257 |
| XGBoost | Kilis | 0.92 | 0.93 | 0.93 | | |
| SVM | Gaziantep | 0.88 | 0.85 | 0.87 | 0.8684 | 0.8683 |
| SVM | Kilis | 0.85 | 0.89 | 0.87 | | |
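
The "F1-Score (Weighted)" column in the classification table is a support-weighted average of the per-class F1 scores. A minimal sketch follows; the helper name is hypothetical, and equal class supports are an assumption, since the test-set sizes are not given in this section.

```python
def weighted_f1(per_class_f1, support):
    """Average per-class F1 scores, weighting each class by its number of samples."""
    total = sum(support.values())
    return sum(per_class_f1[c] * support[c] for c in per_class_f1) / total


# Per-class F1 values as reported for SVM; equal supports are assumed.
svm_f1 = {"Gaziantep": 0.87, "Kilis": 0.87}
assumed_support = {"Gaziantep": 1, "Kilis": 1}

svm_weighted = weighted_f1(svm_f1, assumed_support)
print(round(svm_weighted, 4))
```

Under that assumption the result agrees with the reported 0.8683 up to the two-decimal rounding of the per-class scores.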

| | G_PM10 | G_SO2 | G_CO | G_O3 | K_PM10 | K_SO2 | K_CO | K_O3 |
|---|---|---|---|---|---|---|---|---|
| G_PM10 | 0.0 | 706,201.0 | 745,879.0 | 828,986.0 | 822,127.0 | 854,691.0 | 800,104.0 | 785,270.0 |
| G_SO2 | 706,201.0 | 0.0 | 573,040.0 | 821,306.0 | 912,979.0 | 530,334.0 | 629,503.0 | 507,623.0 |
| G_CO | 745,879.0 | 573,040.0 | 0.0 | 849,330.0 | 821,695.0 | 948,980.0 | 1,088,776.0 | 755,274.0 |
| G_O3 | 828,986.0 | 821,306.0 | 849,330.0 | 0.0 | 660,658.0 | 670,698.0 | 671,914.0 | 554,497.0 |
| K_PM10 | 822,127.0 | 912,979.0 | 821,695.0 | 660,658.0 | 0.0 | 994,119.0 | 616,372.0 | 580,580.0 |
| K_SO2 | 854,691.0 | 530,334.0 | 948,980.0 | 670,698.0 | 994,119.0 | 0.0 | 717,057.0 | 621,264.0 |
| K_CO | 800,104.0 | 629,503.0 | 1,088,776.0 | 671,914.0 | 616,372.0 | 717,057.0 | 0.0 | 514,100.0 |
| K_O3 | 785,270.0 | 507,623.0 | 755,274.0 | 554,497.0 | 580,580.0 | 621,264.0 | 514,100.0 | 0.0 |
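
The DTW distance matrix above can be summarized with a short sketch that lists each series' nearest neighbor, the closest and farthest pairs, and the cross-city distance for each pollutant. The values are copied from the matrix; the prefixes G_ and K_ are taken to denote Gaziantep and Kilis, and the helper names are hypothetical. This is a reading aid, not the hierarchical clustering used in the study.

```python
names = ["G_PM10", "G_SO2", "G_CO", "G_O3", "K_PM10", "K_SO2", "K_CO", "K_O3"]
dist = [
    [0.0, 706201.0, 745879.0, 828986.0, 822127.0, 854691.0, 800104.0, 785270.0],
    [706201.0, 0.0, 573040.0, 821306.0, 912979.0, 530334.0, 629503.0, 507623.0],
    [745879.0, 573040.0, 0.0, 849330.0, 821695.0, 948980.0, 1088776.0, 755274.0],
    [828986.0, 821306.0, 849330.0, 0.0, 660658.0, 670698.0, 671914.0, 554497.0],
    [822127.0, 912979.0, 821695.0, 660658.0, 0.0, 994119.0, 616372.0, 580580.0],
    [854691.0, 530334.0, 948980.0, 670698.0, 994119.0, 0.0, 717057.0, 621264.0],
    [800104.0, 629503.0, 1088776.0, 671914.0, 616372.0, 717057.0, 0.0, 514100.0],
    [785270.0, 507623.0, 755274.0, 554497.0, 580580.0, 621264.0, 514100.0, 0.0],
]


def nearest_neighbours(names, dist):
    """For each series, return (closest other series, DTW distance to it)."""
    result = {}
    for i, name in enumerate(names):
        j = min((k for k in range(len(names)) if k != i), key=lambda k: dist[i][k])
        result[name] = (names[j], dist[i][j])
    return result


# All unordered pairs with their distance, sorted from closest to farthest.
pairs = sorted(
    ((dist[i][j], names[i], names[j])
     for i in range(len(names)) for j in range(i + 1, len(names)))
)
closest, farthest = pairs[0], pairs[-1]

# Distance between the two cities for the same pollutant.
index = {name: i for i, name in enumerate(names)}
cross_city = {
    p: dist[index["G_" + p]][index["K_" + p]] for p in ("PM10", "SO2", "CO", "O3")
}

nearest = nearest_neighbours(names, dist)
print(nearest)
print(closest, farthest)
print(cross_city)
```

Read this way, the largest distance in the matrix is between the CO series of the two cities, while the SO2 and O3 series of the two cities are comparatively close to each other.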
© 2026 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Aktürk, C. Explainable Machine Learning Reveals Distinct Air Pollution Profiles in Two Geographically Adjacent Cities. Appl. Sci. 2026, 16, 3784. https://doi.org/10.3390/app16083784

