Predicting Air Pollution in Metropolitan Lima Using Gaussian Naïve Bayes (2025): An Efficient Model for Urban Environmental Management
Abstract
1. Introduction
2. Materials and Methods
2.1. Methodological Framework: KDD-Based Data Science Engineering
2.2. Research Type, Approach and Design (KDD: Problem Understanding and Analytical Design)
2.3. Study Area and Data (KDD: Data Understanding and Selection)
2.4. Predictor and Target Variables (KDD: Feature Selection and Representation)
2.5. Data Collection Technique and Preprocessing (KDD: Data Cleaning and Transformation)
- Validity checks: records with physically implausible values (e.g., negative pollutant concentrations) were removed.
- Missing data handling: observations with missing target labels were discarded; records with missing predictor values were removed when missingness exceeded predefined thresholds.
- Outlier treatment: extreme values were identified using an interquartile range (IQR) criterion and excluded from the analysis.
- Feature encoding: temporal categorical variables were encoded numerically for model compatibility.
- Air Quality Index calculation: the AQI was not directly available in the raw dataset and was therefore computed from pollutant concentrations using the standard piecewise linear interpolation approach.
- Data leakage control: all preprocessing steps were fitted exclusively on the training subset and subsequently applied to the held-out test subset.
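The preprocessing steps listed above can be illustrated with a minimal sketch. It is not the authors' pipeline: the breakpoint table, the function names (`sub_index`, `aqi_from_pollutants`, `iqr_fences`, `keep_inliers`), the 1.5 IQR multiplier, the use of one shared breakpoint table for every pollutant, and taking the AQI as the maximum sub-index are all assumptions made for illustration. The interpolation itself follows the usual piecewise linear form, I = (I_hi - I_lo) / (C_hi - C_lo) * (C - C_lo) + I_lo.

```python
from statistics import quantiles

# Hypothetical breakpoint table: (C_lo, C_hi, I_lo, I_hi) for each index band.
# Placeholder numbers for illustration only, not the values used in the study.
BREAKPOINTS = [
    (0.0, 50.0, 0, 50),
    (50.0, 100.0, 51, 100),
    (100.0, 200.0, 101, 200),
]


def sub_index(conc, breakpoints=BREAKPOINTS):
    """Piecewise linear interpolation of one pollutant concentration."""
    if conc < 0:
        # Validity check: negative concentrations are physically implausible.
        raise ValueError("negative concentration")
    for c_lo, c_hi, i_lo, i_hi in breakpoints:
        if c_lo <= conc <= c_hi:
            return (i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo
    # Above the last band: cap at the top index value (an assumption).
    return float(breakpoints[-1][3])


def aqi_from_pollutants(concentrations):
    """Overall index taken as the largest sub-index (assumed aggregation rule)."""
    return max(sub_index(c) for c in concentrations.values())


def iqr_fences(train_values, k=1.5):
    """Compute outlier fences from the training subset only (leakage control)."""
    q1, _, q3 = quantiles(train_values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def keep_inliers(values, fences):
    """Apply previously fitted fences to any subset (training or test)."""
    lo, hi = fences
    return [v for v in values if lo <= v <= hi]


# Usage: fit the fences on training data, then reuse them on the test data.
train_pm10 = [10, 12, 11, 13, 12, 11, 10, 14, 300]
test_pm10 = [9, 15, 40]
fences = iqr_fences(train_pm10)
print(keep_inliers(train_pm10, fences))
print(keep_inliers(test_pm10, fences))
print(aqi_from_pollutants({"PM10": 25.0, "NO2": 75.0}))
```

Fitting the fences on the training subset and passing them unchanged to the test subset mirrors the leakage-control rule stated in the list: no statistic is ever estimated from held-out data.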
2.6. Analytical Modeling: Gaussian Naïve Bayes (KDD: Data Mining and Modeling)
2.7. Evaluation Metrics and Performance Analysis (KDD: Evaluation and Interpretation)
Decision Criteria and Hypothesis Testing
2.8. KDD Presentation Phase
2.9. Ethical Considerations
2.10. Use of Generative Artificial Intelligence (GenAI)
3. Results
3.1. Overall Evaluation Framework
3.2. PM10 Concentration
3.3. PM2.5 Concentration
3.4. NO2 Concentration
3.5. Air Quality Index (AQI)
3.6. Global Model Performance and General Hypothesis
4. Discussion
5. Conclusions
- This study demonstrates that the Gaussian Naïve Bayes (GNB) model provides accurate, stable, and well-calibrated predictions of air pollution levels in Metropolitan Lima for the 2020–2025 period. The global performance metrics (accuracy, precision, recall, and F1-score ≈ 0.925), together with a low average Brier Score (≈0.023), confirm both strong classification capability and reliable probabilistic estimation.
- The model showed robust performance in predicting particulate matter concentrations. For PM10, classification accuracy exceeded 93%, with misclassifications largely restricted to adjacent concentration levels, reflecting effective discrimination under relatively stable urban pollution conditions. For PM2.5, accuracy, precision, recall, and F1-score all remained high (≈0.918), despite the greater variability and localized emission sources associated with fine particulate matter.
- For NO2, the GNB model achieved reliable predictive performance (accuracy, precision, recall, and F1-score ≈ 0.913), capturing rapid concentration transitions driven by urban traffic dynamics. Misclassifications were primarily confined to neighboring categories, indicating appropriate generalization across gaseous pollution levels.
- The strongest results were obtained for the Air Quality Index (AQI), where the model achieved accuracy and F1-score values above 0.93 and an exceptionally low Brier Score (≈0.010). This highlights the effectiveness of GNB in handling multiclass, ordered air-quality categories and producing well-calibrated probabilistic outputs suitable for risk communication.
- Overall, the findings confirm that Gaussian Naïve Bayes represents a computationally efficient and interpretable modeling approach that balances simplicity with high predictive performance. Its low computational cost and stable calibration make it particularly suitable for operational air-quality monitoring, early warning systems, and evidence-based decision-making in resource-constrained urban environments.
- This study contributes a structured and reproducible engineering-oriented framework for air-quality prediction, in which KDD serves as the backbone for predictive knowledge generation. By framing the model as an explicit knowledge artifact rather than a black-box predictor, the proposed approach facilitates transparency, scalability, and transferability to other urban contexts. This perspective supports the development of low-latency, interpretable decision-support tools aligned with sustainable urban environmental management.
- By aligning air pollution prediction with Sustainable Development Goal 13 (Climate Action), this study contributes a practical and scalable methodological framework that supports urban environmental management and strengthens resilience strategies in densely populated metropolitan areas.
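To make the link between Gaussian Naïve Bayes and the Brier Score reported above concrete, the following is a minimal standard-library sketch. It is not the study's implementation: the toy data, the class labels, the function names (`fit_gnb`, `predict_proba`, `brier_score`), the variance floor, and the choice to average the squared error over classes are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_gnb(X, y, var_floor=1e-9):
    """Estimate a prior plus per-feature Gaussian mean and variance per class."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    model = {}
    for label, rows in groups.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The small floor avoids division by zero for constant features.
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_proba(model, row):
    """Posterior class probabilities under the feature-independence assumption."""
    log_post = {}
    for label, (prior, means, variances) in model.items():
        total = math.log(prior)
        for x, m, v in zip(row, means, variances):
            total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        log_post[label] = total
    top = max(log_post.values())  # subtract the maximum for numerical stability
    unnorm = {c: math.exp(lp - top) for c, lp in log_post.items()}
    norm = sum(unnorm.values())
    return {c: u / norm for c, u in unnorm.items()}


def brier_score(probas, labels, classes):
    """Squared error against one-hot labels, averaged over classes and samples."""
    total = 0.0
    for p, y in zip(probas, labels):
        total += sum((p[c] - (1.0 if c == y else 0.0)) ** 2 for c in classes) / len(classes)
    return total / len(labels)


# Toy data (hypothetical): [concentration, encoded hour] -> pollution level.
X_train = [[10, 1.0], [12, 2.0], [11, 1.5],
           [30, 5.0], [32, 6.0], [31, 5.5],
           [60, 9.0], [62, 10.0], [61, 9.5]]
y_train = ["good"] * 3 + ["moderate"] * 3 + ["poor"] * 3

model = fit_gnb(X_train, y_train)
probas = [predict_proba(model, row) for row in X_train]
print(predict_proba(model, [11.0, 1.5]))
print(brier_score(probas, y_train, ["good", "moderate", "poor"]))
```

A Brier Score near zero means the predicted probabilities sit close to the observed outcomes, which is the sense in which the conclusions describe the model as well calibrated; accuracy alone would not capture this.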
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AQI | Air Quality Index |
| GNB | Gaussian Naïve Bayes |
| PM10 | Particulate Matter ≤ 10 µm |
| PM2.5 | Particulate Matter ≤ 2.5 µm |
| NO2 | Nitrogen Dioxide |
| WHO | World Health Organization |
| SDG | Sustainable Development Goal |
| ML | Machine Learning |
Appendix A
Data Mining Engineering (KDD)

Appendix B
Appendix B.1. Low-Fidelity Prototype

Appendix B.2. Dashboards in Looker Studio

Appendix B.3. Predictive System



Performance metrics for PM10

| Metric | Obtained Value | Reference Scale | Evaluation |
|---|---|---|---|
| Accuracy | 0.931 | Excellent ≥ 0.80 | Excellent |
| Precision | 0.933 | Excellent ≥ 0.80 | Excellent |
| Recall | 0.931 | Excellent ≥ 0.80 | Excellent |
| F1-Score | 0.934 | Excellent ≥ 0.80 | Excellent |

Performance metrics for PM2.5

| Metric | Obtained Value | Reference Scale | Evaluation |
|---|---|---|---|
| Accuracy | 0.918 | Excellent ≥ 0.80 | Excellent |
| Precision | 0.918 | Excellent ≥ 0.80 | Excellent |
| Recall | 0.918 | Excellent ≥ 0.80 | Excellent |
| F1-Score | 0.918 | Excellent ≥ 0.80 | Excellent |

Performance metrics for NO2

| Metric | Obtained Value | Reference Scale | Evaluation |
|---|---|---|---|
| Accuracy | 0.913 | Excellent ≥ 0.80 | Excellent |
| Precision | 0.913 | Excellent ≥ 0.80 | Excellent |
| Recall | 0.913 | Excellent ≥ 0.80 | Excellent |
| F1-Score | 0.913 | Excellent ≥ 0.80 | Excellent |

Performance metrics for the Air Quality Index (AQI)

| Metric | Obtained Value | Reference Scale | Evaluation |
|---|---|---|---|
| Accuracy | 0.931 | Excellent ≥ 0.80 | Excellent |
| Precision | 0.932 | Excellent ≥ 0.80 | Excellent |
| Recall | 0.932 | Excellent ≥ 0.80 | Excellent |
| F1-Score | 0.932 | Excellent ≥ 0.80 | Excellent |

Global model performance

| Metric | Obtained Value | Reference Scale | Evaluation |
|---|---|---|---|
| Accuracy | 0.925 | Excellent ≥ 0.80 | Excellent |
| Precision | 0.925 | Excellent ≥ 0.80 | Excellent |
| Recall | 0.925 | Excellent ≥ 0.80 | Excellent |
| F1-Score | 0.925 | Excellent ≥ 0.80 | Excellent |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Gavidia, A.; Dominguez, A.; Flores-Chacón, E. Predicting Air Pollution in Metropolitan Lima Using Gaussian Naïve Bayes (2025): An Efficient Model for Urban Environmental Management. Sustainability 2026, 18, 5748. https://doi.org/10.3390/su18115748

