Predicting Air Pollution in Metropolitan Lima Using Gaussian Naïve Bayes (2025): An Efficient Model for Urban Environmental Management

Gavidia, Aimee; Dominguez, Aldair; Flores-Chacón, Erick

doi:10.3390/su18115748

by Aimee Gavidia *, Aldair Dominguez * and Erick Flores-Chacón *

Faculty of Engineering and Architecture, Cesar Vallejo University, Lima 15434, Peru

* Authors to whom correspondence should be addressed.

Sustainability 2026, 18(11), 5748; https://doi.org/10.3390/su18115748 (registering DOI)

Submission received: 10 December 2025 / Revised: 22 January 2026 / Accepted: 10 February 2026 / Published: 5 June 2026

(This article belongs to the Section Air, Climate Change and Sustainability)

Abstract

Air pollution episodes in Metropolitan Lima pose persistent challenges for urban health protection and timely environmental decision-making. However, many machine learning approaches for air-quality prediction remain difficult to operationalize due to high latency, extensive hyperparameter tuning, and limited interpretability. This study addresses this gap by adopting an engineering-driven predictive knowledge modeling approach grounded in the Knowledge Discovery in Databases (KDD) framework to evaluate an efficient probabilistic classifier, Gaussian Naïve Bayes (GNB), for predicting regulatory air-quality categories in Metropolitan Lima. A total of 768,185 hourly observations from SENAMHI monitoring stations covering the 2020–2025 period were analyzed, considering PM₁₀, PM_2.5, and NO₂ concentrations together with the Air Quality Index (AQI). Data were preprocessed through validity checks, explicit outlier handling, and categorical encoding based on regulatory thresholds, while a time-based train–test split preserved temporal structure and prevented data leakage. The proposed model achieved strong predictive performance (global accuracy ≈ 0.925) and excellent probabilistic calibration (overall Brier Score ≈ 0.023; AQI Brier Score ≈ 0.010). These results demonstrate that GNB provides a robust, interpretable, and computationally efficient solution for operational air-quality management and early warning support, contributing to evidence-based urban environmental decision-making aligned with Sustainable Development Goal 13 (Climate Action).

Keywords:

Gaussian Naïve Bayes; air pollution; PM₁₀; PM_2.5; NO₂; Air Quality Index (AQI); environmental prediction; data mining; urban sustainability; climate action

1. Introduction

Air pollution is one of the most critical environmental and public health challenges worldwide, contributing to millions of premature deaths, increased respiratory and cardiovascular diseases, and severe economic losses [1,2]. International organizations such as the World Health Organization (WHO) report that over 90% of the global urban population is exposed to particulate matter levels that exceed recommended standards, posing significant risks to human health and compromising progress toward the Sustainable Development Goals (SDGs), particularly SDG 13 (Climate Action) [3,4].

In Latin America, several major cities frequently exceed permissible levels of PM₁₀, PM_2.5 and NO₂ due to rapid urban expansion, increased vehicular traffic, industrial activity and limited environmental regulation [5]. Within this regional context, Metropolitan Lima represents one of the most critical cases, exhibiting high vehicular emissions, complex meteorological and topographic conditions, and persistent patterns of particulate matter concentrations above national and international limits [6].

Official monitoring reports indicate that annual mean concentrations of PM_2.5 in Lima frequently exceed both national air-quality standards and the WHO guideline of 15 µg/m³, with wintertime peaks substantially above this threshold, particularly between June and September [6,7]. Similarly, PM₁₀ concentrations regularly surpass the 24 h regulatory limit during periods of low wind speed and atmospheric stability, while spatial analyses reveal marked intra-urban variability, with several districts experiencing recurrent high-pollution episodes driven primarily by vehicular emissions and resuspended road dust [7,8]. These quantitative patterns confirm that air pollution in Lima is not episodic but structural and seasonal, underscoring the urgent need for effective predictive tools to support environmental management and protect public health.

Advances in data science and artificial intelligence have enabled the use of machine learning models for air pollution prediction, trend analysis, and early warning system development [9]. Commonly applied techniques include neural networks, support vector machines, decision trees, Random Forest, and ensemble models, which have shown promising results in several regions [10,11,12,13,14,15]. However, despite their predictive accuracy, these models often face challenges related to interpretability, hyperparameter tuning complexity, and computational cost, limiting their operational adoption in public institutions that require fast, reliable, and low-cost predictive solutions [16,17].

In contrast, low-complexity probabilistic models such as Naïve Bayes and its Gaussian variant have shown promise in environmental classification tasks due to their fast training, interpretability, and stability, even when handling noisy or moderately correlated data [18]. The Gaussian Naïve Bayes (GNB) model assumes conditional normality of continuous predictors and conditional independence among variables given the class label. Although these assumptions are often moderately violated in atmospheric datasets, previous studies have demonstrated that GNB remains robust under such conditions, achieving competitive performance with substantially lower computational requirements compared to more complex approaches.

A further methodological distinction in air-quality modeling concerns the difference between the estimation of continuous pollutant concentrations and the probabilistic prediction of regulatory air-quality categories, such as those defined by the Air Quality Index (AQI). While continuous predictions are valuable for scientific analysis, categorical predictions aligned with regulatory thresholds are more directly actionable for public health alerts, early warning systems, and operational environmental management. Nevertheless, relatively few studies in Latin American urban contexts, and none to date in Metropolitan Lima, have focused on probabilistically calibrated, multi-pollutant classification models evaluated using metrics such as Brier Score, reliability curves, and extended confusion matrices.

In this context, the present study evaluates the performance of the Gaussian Naïve Bayes model for predicting air-quality categories associated with PM₁₀, PM_2.5, NO₂, and AQI in Metropolitan Lima, using a large-scale historical dataset covering the 2020–2025 period. By emphasizing probabilistic calibration, computational efficiency, and operational feasibility, this research contributes evidence-based insights that support urban environmental management and decision-making aligned with SDG 13 (Climate Action).

2. Materials and Methods

2.1. Methodological Framework: KDD-Based Data Science Engineering

In this study, the scientific method, within the solution variable, is operationalized through the Knowledge Discovery in Databases (KDD) methodology, conceptualized as an engineering-driven predictive knowledge modeling (EDPKM) approach. Under this perspective, KDD is employed as a structured engineering process that integrates data analytics, data preprocessing, and predictive modeling to systematically transform environmental data into explicit and operational knowledge artifacts. The development of the predictive model is therefore framed as a controlled knowledge-generation activity, aligned with the theory of explicit knowledge creation proposed by Nonaka and Takeuchi. As a result, air-quality prediction is treated as an engineered transformation of information into actionable predictive knowledge, ensuring methodological transparency, reproducibility, and scalability of the proposed approach [19,20,21]. A detailed schematic representation of the adopted KDD engineering workflow is provided in Appendix A.

2.2. Research Type, Approach and Design (KDD: Problem Understanding and Analytical Design)

This study is applied in nature, as it addresses a practical problem related to forecasting air-quality conditions in Metropolitan Lima [22]. A quantitative approach was adopted, based on the analysis of numerical environmental data to identify temporal patterns and evaluate predictive performance. The research followed a non-experimental, longitudinal predictive design, suitable for modeling the temporal evolution of air pollution using historical air-quality records [23]. From a methodological and engineering perspective, the study was conducted at a predictive level, employing statistical and machine learning techniques within a structured Knowledge Discovery in Databases (KDD) process. This process-oriented approach supports the systematic development, validation, and evaluation of predictive models aimed at operational air-quality assessment and early warning systems [24].

2.3. Study Area and Data (KDD: Data Understanding and Selection)

The study area comprises Metropolitan Lima, characterized by high population density, heavy vehicular traffic, and meteorological conditions that frequently hinder pollutant dispersion. A total of 768,185 hourly records from meteorological and air-quality monitoring stations operated by the National Service of Meteorology and Hydrology of Peru (SENAMHI) were analyzed, covering the 2020–2025 period.

In accordance with the data understanding and selection stages of the KDD methodology, the entire dataset was retained to preserve temporal representativeness, capture seasonal and diurnal variability, and minimize selection bias. Although a reference sample size was initially estimated using the finite population formula (n = 36,686), the predictive modeling phase leveraged the full dataset to exploit the informational value inherent in large-volume, high-frequency observations. From a Big Data analytics perspective, this decision aligns with volume-oriented analytical strategies that prioritize comprehensive data utilization over subsampling when computationally feasible [20].

Given the low computational complexity and scalability of the Gaussian Naïve Bayes classifier, processing the complete dataset was technically viable and enabled more stable, robust, and generalizable parameter estimation across spatial and temporal contexts [3]. All computations were performed on a workstation equipped with an Intel® Core™ i7 processor (Intel Corporation, Santa Clara, CA, USA), 16 GB RAM, running Windows 11 Pro (Microsoft Corporation, Redmond, WA, USA).

All environmental measurements were obtained from official open-access repositories provided through the Peruvian National Open Data Platform, which ensures transparency, traceability, and reproducibility of the data used in this study (https://www.datosabiertos.gob.pe/ (accessed on 15 January 2026)).

2.4. Predictor and Target Variables (KDD: Feature Selection and Representation)

The modeling framework focused on the probabilistic classification of air-quality categories. Target variables included regulatory categories associated with PM₁₀, PM_2.5, and NO₂, as well as the Air Quality Index (AQI), defined according to national and international air-quality standards [25,26,27].

Predictor variables were selected following the feature selection and representation stages of the KDD process and were derived from spatiotemporal and contextual attributes available in the monitoring dataset. These included the hour of measurement (HORA), geographic coordinates (LONGITUD and LATITUD), station altitude (ALTITUD), and administrative location descriptors such as department, province, district, and UBIGEO code. Temporal context was incorporated through the reference timestamp (FECHA_CORTE), enabling the representation of diurnal and calendar-related patterns in air-quality conditions [28,29,30].

Lagged predictors were not incorporated, as the objective of the study was to perform same-hour probabilistic classification aligned with operational air-quality monitoring and real-time early warning requirements, thereby avoiding dependence on historical pollutant values not available in real-time operational settings.

2.5. Data Collection Technique and Preprocessing (KDD: Data Cleaning and Transformation)

A documentary analysis technique was employed, using registry records from official monitoring stations as the primary data source [31]. Data preprocessing followed a reproducible and engineering-oriented pipeline aligned with the data cleaning and transformation stages of the KDD methodology, ensuring data quality prior to modeling [32,33,34]. The operational implementation of this cleaning and transformation module is illustrated in Appendix B.3. The preprocessing steps included:

-: Validity checks: records with physically implausible values (e.g., negative pollutant concentrations) were removed.
-: Missing data handling: observations with missing target labels were discarded; records with missing predictor values were removed when missingness exceeded predefined thresholds.
-: Outlier treatment: extreme values were identified using an interquartile range (IQR) criterion and excluded from the analysis.
-: Feature encoding: temporal categorical variables were encoded numerically for model compatibility.
-: Air Quality Index calculation: the AQI was not directly available in the raw dataset and was therefore computed from pollutant concentrations using the standard piecewise linear interpolation approach:

\mathrm{AQI} = \frac{I_{high} - I_{low}}{C_{high} - C_{low}} (C - C_{low}) + I_{low}

where C denotes the observed pollutant concentration, C_{low} and C_{high} correspond to the concentration breakpoints surrounding C, and I_{low} and I_{high} represent the associated AQI values.

-: Data leakage control: all preprocessing steps were fitted exclusively on the training subset and subsequently applied to the test subset.
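The piecewise linear interpolation used for the AQI can be sketched as follows. This is a minimal illustration, not the authors' implementation: the breakpoint table `BREAKPOINTS` and the function name `sub_index` are hypothetical, and the numeric breakpoints are placeholders rather than the regulatory values applied in the study.

```python
# Hypothetical breakpoint table for illustration only; these are NOT the
# regulatory breakpoints used in the study.
# Each row: (C_low, C_high, I_low, I_high)
BREAKPOINTS = [
    (0.0, 50.0, 0.0, 50.0),
    (50.0, 100.0, 51.0, 100.0),
    (100.0, 200.0, 101.0, 150.0),
]


def sub_index(c, breakpoints=BREAKPOINTS):
    """Map a concentration c to an index value by linear interpolation
    between the breakpoints (C_low, C_high) that surround it."""
    if c < 0:
        raise ValueError("concentration must be non-negative")
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= c <= c_high:
            # AQI = (I_high - I_low) / (C_high - C_low) * (C - C_low) + I_low
            return (i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low
    raise ValueError("concentration is above the highest breakpoint")


# Example: a concentration in the third (placeholder) segment.
example_value = sub_index(150.0)
```

A concentration that falls exactly on a shared boundary is assigned to the lower segment here; a production version would need to follow whatever boundary convention the applicable standard defines.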

2.6. Analytical Modeling: Gaussian Naïve Bayes (KDD: Data Mining and Modeling)

The Gaussian Naïve Bayes (GNB) classifier estimates posterior class probabilities using Bayes’ theorem:

P(y = k \mid x) \propto P(y = k) \prod_{j=1}^{p} P(x_{j} \mid y = k)

where x represents the vector of predictor variables and y denotes the air-quality class [29]. For continuous predictors, class-conditional likelihoods are modeled as Gaussian distributions:

P(x_{j} \mid y = k) = \frac{1}{\sqrt{2\pi\sigma_{kj}^{2}}} \exp\left(-\frac{(x_{j} - \mu_{kj})^{2}}{2\sigma_{kj}^{2}}\right)

Parameters \mu_{kj} and \sigma_{kj}^{2} are estimated from the training data using maximum likelihood [8].
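For reference, combining the two expressions above and taking logarithms gives the form in which the classification rule is usually evaluated in practice, since sums of log-terms are numerically safer than products of small probabilities. The derivation below is a standard identity rather than something stated in the source; the constant term collects the normalizing evidence, which does not depend on the class k.

```latex
\log P(y = k \mid x)
  = \log P(y = k)
  - \frac{1}{2} \sum_{j=1}^{p}
    \left[ \log\left(2\pi\sigma_{kj}^{2}\right)
         + \frac{(x_{j} - \mu_{kj})^{2}}{\sigma_{kj}^{2}} \right]
  + \text{const}

\hat{y} = \arg\max_{k} \; \log P(y = k \mid x)
```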

The model was implemented in Python (Python Software Foundation, Wilmington, DE, USA), version 3.11, using the GaussianNB classifier from the scikit-learn library (version 1.4.2; scikit-learn Developers, Paris, France). Data manipulation was performed using pandas (version 2.2.1; pandas Development Team, USA) and numerical operations were conducted using NumPy (version 1.26.4; NumPy Developers, USA). All experiments were executed within a Jupyter Notebook environment (version 6.5.4, Project Jupyter, USA). Training was performed using the standard fit(x_train, y_train) procedure on the training subset. Default prior handling was used (priors = None), allowing class prior probabilities to be estimated from the data. Numerical stability was ensured through the default variance smoothing parameter (var_smoothing = 1 × 10^{-9}), which adds a small fraction of the largest feature variance to all feature variances to avoid numerical instability [6].
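The training setup described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the data are randomly generated toy values, the `category` labels are synthetic, and the 80/20 chronological cut is an assumption because the split ratio is not stated in the text. Only the column names HORA, LATITUD, LONGITUD, and ALTITUD and the GaussianNB settings come from the description.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Toy data standing in for the monitoring records (all values are synthetic).
rng = np.random.default_rng(0)
n = 1000
timestamps = pd.Timestamp("2020-01-01") + pd.to_timedelta(np.arange(n), unit="h")
df = pd.DataFrame({
    "timestamp": timestamps,
    "LATITUD": rng.normal(-12.05, 0.05, n),
    "LONGITUD": rng.normal(-77.04, 0.05, n),
    "ALTITUD": rng.normal(150.0, 40.0, n),
})
df["HORA"] = df["timestamp"].dt.hour
# Synthetic two-class label, used only so that the example runs end to end.
df["category"] = np.where(df["ALTITUD"] + rng.normal(0, 10, n) > 150, "High", "Low")

# Time-based split: the earliest records train the model, the latest test it.
df = df.sort_values("timestamp").reset_index(drop=True)
cut = int(len(df) * 0.8)  # assumed ratio; not specified in the paper
train, test = df.iloc[:cut], df.iloc[cut:]

features = ["HORA", "LATITUD", "LONGITUD", "ALTITUD"]
x_train, y_train = train[features], train["category"]
x_test, y_test = test[features], test["category"]

# Defaults made explicit: priors estimated from data, default variance smoothing.
model = GaussianNB(priors=None, var_smoothing=1e-9)
model.fit(x_train, y_train)

proba = model.predict_proba(x_test)  # one column per class, in model.classes_ order
pred = model.predict(x_test)         # class with the highest posterior probability
```

Splitting by position after sorting on the timestamp keeps every test record later than every training record, which is the property a time-based split relies on to avoid leakage.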

2.7. Evaluation Metrics and Performance Analysis (KDD: Evaluation and Interpretation)

Model performance was evaluated using accuracy, precision, recall and F1-score, as recommended by prior environmental prediction studies [35]. Probabilistic calibration was assessed using the multiclass Brier Score, computed as the mean squared difference between predicted class probabilities and observed class indicators [36,37]. Reliability curves were used to analyze calibration behavior across probability bins, while confusion matrices were employed to examine class-wise classification performance and misclassification patterns [38].
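A minimal sketch of the multiclass Brier Score follows. Two conventions are in common use (summing or averaging the squared errors over classes), and the text does not say which was applied; the version below averages over classes and also returns the class-wise values, which is an assumption. The function name `multiclass_brier` is hypothetical.

```python
import numpy as np


def multiclass_brier(proba, y_true, classes):
    """Return (class-wise Brier scores, their mean).

    proba   : array of shape (n_samples, n_classes) with predicted probabilities
    y_true  : sequence of observed class labels
    classes : labels in the same order as the columns of proba
    """
    proba = np.asarray(proba, dtype=float)
    classes = list(classes)
    # One-hot indicators of the observed classes.
    onehot = np.zeros_like(proba)
    for row, label in enumerate(y_true):
        onehot[row, classes.index(label)] = 1.0
    # Mean squared difference per class, then averaged across classes.
    per_class = np.mean((proba - onehot) ** 2, axis=0)
    return per_class, float(per_class.mean())


per_class, overall = multiclass_brier(
    [[0.9, 0.1], [0.2, 0.8]], ["Low", "High"], classes=["Low", "High"]
)
```

Lower values are better: a score of 0 means every predicted probability matched the observed outcome exactly.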

Consistent with the interpretation stage of the KDD framework, each observation was assigned to the air-quality category associated with the highest posterior probability.

Decision Criteria and Hypothesis Testing

The Gaussian Naïve Bayes (GNB) model was evaluated using standard classification metrics, including accuracy (A), precision (P), recall (R), and F1-score (F), together with the percentage of correctly classified cases derived from the confusion matrix (C), the multiclass Brier Score (B), and the number of reliability points located near the diagonal of the reliability curve (L).

In accordance with the decision rule defined in this study, the alternative hypothesis (H₁) was accepted—and the null hypothesis (H₀) was rejected—when all predefined performance conditions were simultaneously satisfied: A ≥ 0.80, P ≥ 0.80, R ≥ 0.80, F ≥ 0.80, C ≥ 80%, B ≤ 0.10, and L ≥ 3. This conjunctive decision criterion ensures that model acceptance reflects not only classification accuracy but also probabilistic reliability and calibration quality.

This rule was applied consistently to each specific predictive objective (PM₁₀, PM_2.5, NO₂, and AQI), as well as to the global model performance assessment.
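The conjunctive decision rule can be written as a single check, sketched below. The thresholds are those stated above; the class name `Metrics`, the function name `accept_h1`, and the example values are illustrative placeholders (in particular, the number of reliability points is not reported numerically in the text).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    accuracy: float          # A
    precision: float         # P
    recall: float            # R
    f1: float                # F
    correct_pct: float       # C, percentage of correctly classified cases
    brier: float             # B, multiclass Brier Score
    reliability_points: int  # L, points near the reliability diagonal


def accept_h1(m: Metrics) -> bool:
    """H1 is accepted only if every condition holds simultaneously."""
    return all([
        m.accuracy >= 0.80,
        m.precision >= 0.80,
        m.recall >= 0.80,
        m.f1 >= 0.80,
        m.correct_pct >= 80.0,
        m.brier <= 0.10,
        m.reliability_points >= 3,
    ])


# Placeholder values for illustration only.
example = Metrics(0.925, 0.925, 0.925, 0.925, 92.0, 0.023, 3)
decision = accept_h1(example)
```

Because the rule is conjunctive, a single failing condition (for example a Brier Score above 0.10) is enough to retain the null hypothesis regardless of how high the accuracy is.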

2.8. KDD Presentation Phase

As part of the knowledge presentation and deployment stage of the KDD process, all analytical results, confusion matrices, and reliability curves were generated using a dedicated data analysis and visualization interface developed specifically for this study. The software integrates model execution, probabilistic evaluation, and iconographic visualization to support interpretability, transparency, and reproducibility of the predictive knowledge generated. The initial structural workflow and user interaction design were conceptualized through a low-fidelity prototype (Appendix B.1). The complete source code and interactive outputs are publicly available at https://gnb.aldairdominguez.me/ (accessed 25 January 2026), enabling independent verification and reuse of the proposed engineering-driven predictive framework.

Exploratory and monitoring dashboards were developed using Looker Studio to support interactive environmental data analysis prior to modeling (Appendix B.2).

2.9. Ethical Considerations

This study relied exclusively on open-access environmental data provided by SENAMHI and did not involve personal information or human subjects; therefore, informed consent was not required. Data usage complied with Peru’s digital governance framework established by Supreme Decree No. 029-2021-PCM and adhered to the principles of transparency, integrity, and responsible reuse of public information outlined in Directive No. 003-2024-AGN [39,40]. All datasets were used solely for academic purposes in accordance with national and international open-science standards.

2.10. Use of Generative Artificial Intelligence (GenAI)

The authors declare that generative artificial intelligence tools were used exclusively for language editing assistance, writing refinement, content organization, and coherence checking. No GenAI tools were used to generate, modify, or synthesize data, figures, results, statistical analyses, or scientific interpretations.

3. Results

3.1. Overall Evaluation Framework

The performance of the Gaussian Naïve Bayes (GNB) model was evaluated using standard classification metrics, including accuracy, precision, recall, and F1-score. In addition, probabilistic performance was assessed using confusion matrices, Brier Score, and reliability curves to evaluate calibration quality and predictive uncertainty [37].

Class support was explicitly considered in the evaluation to account for potential imbalance across air-quality categories. For all pollutants and the AQI, the number of samples per class was reported in the corresponding confusion matrices, ensuring that performance metrics were interpreted in the context of their empirical frequency [27].

To assess model stability, performance consistency was verified across all pollutants and classes through the joint analysis of global metrics, class-wise confusion matrices, and probabilistic calibration indicators. The narrow dispersion of class-wise Brier Scores indicates low predictive variance and stable probabilistic behavior [38].
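As a rough illustration of how a reliability curve and the count of points near its diagonal might be obtained, the sketch below uses scikit-learn's `calibration_curve` on synthetic one-vs-rest probabilities. The data are randomly generated, and the tolerance of 0.05 used to decide whether a point is "near" the diagonal is an assumption, since no numeric tolerance is given in the text.

```python
import numpy as np
from sklearn.calibration import calibration_curve

# Synthetic, well-calibrated probabilities for a single class (one-vs-rest).
rng = np.random.default_rng(42)
p = rng.uniform(0.0, 1.0, 5000)                       # predicted probabilities
y = (rng.uniform(0.0, 1.0, 5000) < p).astype(int)     # outcomes drawn to match p

# Observed frequency and mean predicted probability in each probability bin.
prob_true, prob_pred = calibration_curve(y, p, n_bins=10)

TOLERANCE = 0.05  # assumed definition of "near the diagonal"
near_diagonal = int(np.sum(np.abs(prob_true - prob_pred) <= TOLERANCE))
```

For a multiclass model, the same computation would be repeated per class, using that class's column of predicted probabilities and a binary indicator of whether the class was observed.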

3.2. PM₁₀ Concentration

The GNB model demonstrated strong predictive performance for PM₁₀ concentration classification. An overall accuracy of 0.931 was achieved, with precision, recall, and F1-score all exceeding 0.93 (Table 1).

The confusion matrix (Figure 1) exhibits a dominant diagonal structure, with correct classification rates ranging from 92% to 94% across the Low-, Medium-, and High-PM₁₀ categories. Misclassifications were limited and primarily occurred between adjacent classes, indicating adequate discrimination among pollution levels.

The average Brier Score for PM₁₀ was 0.023, with class-wise values ranging from 0.022 to 0.024, reflecting high-quality probabilistic calibration. The reliability curve (Figure 2) shows close alignment with the diagonal, confirming consistency between predicted probabilities and observed frequencies.

3.3. PM_2.5 Concentration

For PM_2.5 concentration, the GNB model achieved an accuracy, precision, recall, and F1-score of 0.918 (Table 2), indicating stable and consistent classification performance.

As shown in the confusion matrix (Figure 3), correct classification rates exceeded 91% across all PM_2.5 classes, with misclassification rates below 5% and restricted to neighboring categories. This pattern highlights the robustness of the model when handling fine particulate matter concentrations.

The average Brier Score for PM_2.5 was 0.029, with class-wise values between 0.027 and 0.030, remaining well within acceptable calibration thresholds. The corresponding reliability curve (Figure 4) confirms appropriate probabilistic calibration.

3.4. NO₂ Concentration

The prediction of NO₂ concentration also yielded favorable results. The GNB model achieved accuracy, precision, recall, and F1-score values of 0.913 (Table 3).

The confusion matrix (Figure 5) shows classification accuracies slightly above 91% for the Low-, Medium-, and High-NO₂ categories, with errors limited to adjacent classes. These results indicate consistent generalization across nitrogen dioxide concentration levels.

The average Brier Score was 0.031, with minimal variation among classes, further supporting reliable probability estimates. The reliability curve (Figure 6) confirms good agreement between predicted probabilities and observed outcomes.

3.5. Air Quality Index (AQI)

For the Air Quality Index (AQI), the GNB model was evaluated across six ordered categories: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous.

The model achieved an accuracy of 0.931, with precision, recall, and F1-score values above 0.93 (Table 4). The confusion matrix (Figure 7) displays strong diagonal dominance, with correct classification rates between 92% and 94.5% across all AQI categories. Misclassifications were infrequent and mainly occurred between adjacent AQI levels.

The average Brier Score for AQI was 0.010, indicating excellent probabilistic calibration. The reliability curve (Figure 8) further confirms that predicted probabilities closely match empirical frequencies.

3.6. Global Model Performance and General Hypothesis

Finally, the global evaluation of the GNB model, considering all pollutants jointly, yielded accuracy, precision, recall, and F1-score values of 0.925 (Table 5). The global confusion matrix (Figure 9) shows correct classification rates above 92% for Low, Medium, and High pollution levels, with errors confined to neighboring categories.

The overall average Brier Score was 0.023, with consistent class-wise values, indicating stable and well-calibrated probabilistic predictions. The global reliability curve (Figure 10) confirms strong agreement between predicted probabilities and observed frequencies.

From an engineering perspective, the observed performance stability across pollutant categories and the strong probabilistic calibration indicate that the predictive model behaves consistently as a reproducible knowledge artifact generated through a structured KDD-based workflow. The concentration of misclassifications between adjacent air-quality categories further reflects the model’s sensitivity to regulatory threshold boundaries rather than random predictive instability.

4. Discussion

The results obtained for the 2020–2025 period demonstrate that the Gaussian Naïve Bayes (GNB) model provides consistent, stable, and well-calibrated predictions across all evaluated pollutants, supporting its suitability as an operational air-quality prediction tool for Metropolitan Lima. The combination of strong classification metrics and low Brier Scores not only indicates high predictive accuracy but also reliable probabilistic estimation, which is essential for environmental decision-making and early warning systems.

For PM₁₀, the model achieved accuracy, precision, recall, and F1-score values above 0.93, accompanied by a low average Brier Score (≈0.023). The diagonal-dominant confusion matrix and the concentration of misclassifications between adjacent categories suggest that GNB effectively discriminates between PM₁₀ concentration levels. This behavior is consistent with previous studies reporting relatively stable PM₁₀ dynamics driven by vehicular traffic, resuspension processes, and persistent meteorological conditions in large Latin American metropolitan areas [13,14,16]. The comparatively smooth temporal behavior of PM₁₀ favors the conditional independence and Gaussian distribution assumptions of the GNB model, contributing to its strong performance.

In the case of PM_2.5, the model achieved slightly lower—but still robust—performance (metrics ≈ 0.918) and an average Brier Score of approximately 0.029. This result aligns with the existing literature indicating that PM_2.5 concentrations are more difficult to predict due to their sensitivity to combustion-related sources, micro-scale traffic conditions, and localized atmospheric processes [15,17]. Minor deviations observed in the reliability curve can be attributed to short-term emission peaks and complex street-level dynamics. Nevertheless, the observed calibration and accuracy levels remain sufficient for practical applications, and performance could be further enhanced by incorporating higher-resolution traffic flow data or meteorological predictors.

For NO₂, the GNB model obtained accuracy and F1-score values around 0.913, with an average Brier Score close to 0.031. Misclassifications were primarily concentrated between the Medium and High categories, reflecting the rapid temporal variability of NO₂ concentrations driven by fluctuations in vehicular activity and urban mobility patterns. These findings are consistent with studies reporting reliable NO₂ prediction performance in dense urban environments, where short-term changes in traffic intensity play a dominant role [14,15]. Despite these dynamics, GNB maintained stable calibration, demonstrating robustness in handling rapidly changing gaseous pollutants.

The strongest performance was observed for the Air Quality Index (AQI), with accuracy and F1-score values exceeding 0.93 and a notably low average Brier Score (≈0.010). The near-perfect alignment of the reliability curve with the diagonal indicates excellent probabilistic calibration, even when classifying six ordered AQI categories. This outcome can be explained by the integrative nature of AQI, which aggregates multiple pollutants into standardized thresholds, reducing noise and facilitating classification. Similar findings have been reported in studies that emphasize the effectiveness of composite air-quality indices for risk communication and public health decision support [13,17,18].

From a global perspective, the integrated model achieved consistent performance (metrics ≈ 0.925; Brier Score ≈ 0.023), confirming that GNB effectively combines multi-pollutant information and generalizes well to unseen data. Compared with more complex approaches such as deep neural networks, ensemble models, or hybrid architectures, GNB offers a favorable balance between predictive accuracy, interpretability, and computational efficiency [41,42]. This balance is particularly relevant in urban contexts such as Metropolitan Lima, where environmental monitoring agencies may face constraints in technical infrastructure, data availability, and operational resources.

Despite its strengths, this study has certain limitations. The conditional independence assumption inherent to GNB may not fully capture interactions among pollutants and meteorological variables, particularly under extreme pollution episodes. Additionally, the use of fixed AQI thresholds may introduce discretization effects that smooth abrupt concentration changes. Future research could address these limitations by integrating hybrid or hierarchical models, incorporating spatial dependencies, or evaluating real-time forecasting scenarios.

Overall, the findings confirm that Gaussian Naïve Bayes constitutes a robust, low-complexity, and well-calibrated approach for air pollution prediction, offering a practical solution for early warning systems, environmental management, and evidence-based policymaking in densely populated urban environments aligned with Sustainable Development Goal 13 (Climate Action).

The results support the validity of adopting an engineering-driven predictive knowledge modeling approach, in which the Knowledge Discovery in Databases (KDD) framework structures the transformation of air-quality data into explicit and operational predictive knowledge. Rather than optimizing model complexity, the proposed approach prioritizes reproducibility, interpretability, and computational efficiency—key requirements for real-time air-quality monitoring and decision-support systems. In this context, the Gaussian Naïve Bayes classifier demonstrates that probabilistic models with transparent assumptions can achieve competitive performance while remaining suitable for operational deployment.

5. Conclusions

This study demonstrates that the Gaussian Naïve Bayes (GNB) model provides accurate, stable, and well-calibrated predictions of air pollution levels in Metropolitan Lima for the 2020–2025 period. The global performance metrics (accuracy, precision, recall, and F1-score ≈ 0.925), together with a low average Brier Score (≈0.023), confirm both strong classification capability and reliable probabilistic estimation.
The model showed robust performance in predicting particulate matter concentrations. For PM₁₀, classification accuracy exceeded 93%, with misclassifications largely restricted to adjacent concentration levels, reflecting effective discrimination under relatively stable urban pollution conditions. For PM_2.5, performance remained consistently high (≈0.918), despite the greater variability and localized emission sources associated with fine particulate matter.
For NO₂, the GNB model achieved reliable predictive performance (metrics ≈ 0.913), capturing rapid concentration transitions driven by urban traffic dynamics. Misclassifications were primarily confined to neighboring categories, indicating appropriate generalization across gaseous pollution levels.
The best-calibrated results were obtained for the Air Quality Index (AQI), where the model achieved accuracy and F1-score values above 0.93, comparable to those for PM₁₀, and an exceptionally low Brier Score (≈0.010), well below the overall average (≈0.023). This highlights the effectiveness of GNB in handling multiclass, ordered air-quality categories and producing well-calibrated probabilistic outputs suitable for risk communication.
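For reference, the Brier Score can be computed directly from predicted class probabilities. The sketch below uses one common multiclass form, the squared error against one-hot outcomes averaged over samples and classes. This section does not state which multiclass convention was applied, so that choice, the function name `brier_score`, and the toy inputs are assumptions made for illustration.

```python
def brier_score(y_true, proba):
    """Mean squared difference between predicted probabilities and one-hot outcomes.

    y_true: list of integer class indices (0-based).
    proba:  list of per-sample probability lists, one entry per class.
    Averaging over classes as well as samples is an assumed convention;
    0.0 corresponds to perfect, fully confident predictions.
    """
    n_samples = len(y_true)
    n_classes = len(proba[0])
    total = 0.0
    for label, row in zip(y_true, proba):
        for k, p in enumerate(row):
            outcome = 1.0 if k == label else 0.0
            total += (p - outcome) ** 2
    return total / (n_samples * n_classes)


# Invented three-class example: confident and mostly correct probabilities.
y_true = [0, 1, 2, 1]
proba = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
    [0.3, 0.6, 0.1],
]
print(brier_score(y_true, proba))
```

Because the score penalizes confident errors more heavily than hesitant ones, it complements accuracy when predicted probabilities are used for risk communication.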
Overall, the findings confirm that Gaussian Naïve Bayes represents a computationally efficient and interpretable modeling approach that balances simplicity with high predictive performance. Its low computational cost and stable calibration make it particularly suitable for operational air-quality monitoring, early warning systems, and evidence-based decision-making in resource-constrained urban environments.
This study contributes a structured and reproducible engineering-oriented framework for air-quality prediction, in which KDD serves as the backbone for predictive knowledge generation. By framing the model as an explicit knowledge artifact rather than a black-box predictor, the proposed approach facilitates transparency, scalability, and transferability to other urban contexts. This perspective supports the development of low-latency, interpretable decision-support tools aligned with sustainable urban environmental management.
By aligning air pollution prediction with Sustainable Development Goal 13 (Climate Action), this study contributes a practical and scalable methodological framework that supports urban environmental management and strengthens resilience strategies in densely populated metropolitan areas.

Author Contributions

Conceptualization, A.G., A.D. and E.F.-C.; methodology, A.D. and E.F.-C.; software, A.D.; validation, A.D. and A.G.; formal analysis, A.D.; investigation, A.D. and A.G.; resources, A.G.; data curation, A.D.; writing – original draft preparation, A.D.; writing – review and editing, A.G. and E.F.-C.; visualization, A.D.; supervision, E.F.-C.; project administration, A.G. All authors have read and agreed to the published version of the manuscript.

Funding

Institutional funding from Universidad César Vallejo is currently under administrative review.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The raw environmental and meteorological data used for model training and evaluation were obtained from the open-data platform of the National Meteorology and Hydrology Service of Peru (SENAMHI) and are accessible through the Peruvian open data portal at https://www.datosabiertos.gob.pe/dataset/monitoreo-de-los-contaminantes-del-aire-en-lima-metropolitana-servicio-nacional-de (accessed on 15 January 2026). Processed datasets and analysis scripts generated in the current study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors express their gratitude to the National Meteorology and Hydrology Service of Peru (SENAMHI) for providing open-access environmental datasets that made this research possible. The authors also acknowledge the administrative and academic support of Cesar Vallejo University during the development of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AQI	Air Quality Index
GNB	Gaussian Naïve Bayes
PM₁₀	Particulate Matter ≤ 10 µm
PM₂.₅	Particulate Matter ≤ 2.5 µm
NO₂	Nitrogen Dioxide
WHO	World Health Organization
SDG	Sustainable Development Goal
ML	Machine Learning

Appendix A

Data Mining Engineering (KDD)

The Knowledge Discovery in Databases (KDD) process is a systematic framework for extracting valid, novel, and useful patterns from large datasets [19]. In this study, the KDD process was adopted as an engineering guideline to structure the data pipeline that supports the Gaussian Naïve Bayes (GNB) model for air pollution prediction in Metropolitan Lima.
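Two of the pipeline stages used in this study, categorical encoding by thresholds and a time-based train–test split, can be sketched as follows. This is a simplified illustration under stated assumptions: the cut points in `EXAMPLE_THRESHOLDS` are hypothetical and are not the regulatory thresholds applied in the study, the 80/20 split fraction is likewise assumed, and the record layout (a dictionary with a `timestamp` key) is invented for the example.

```python
from bisect import bisect_right

# Hypothetical cut points in µg/m³, for illustration only.
EXAMPLE_THRESHOLDS = [50.0, 100.0, 150.0]


def encode_category(value, thresholds=EXAMPLE_THRESHOLDS):
    """Map a concentration to an ordinal category index (0 = lowest).

    A value equal to a cut point is assigned to the higher category,
    which is a design choice of this sketch.
    """
    return bisect_right(thresholds, value)


def time_based_split(records, train_fraction=0.8):
    """Sort records chronologically and cut once.

    No test record is earlier than any training record, so the model is
    never evaluated on observations that precede its training data.
    """
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Invented hourly records; ISO 8601 strings sort chronologically.
records = [
    {"timestamp": "2024-01-01T03:00", "pm10": 160.0},
    {"timestamp": "2024-01-01T00:00", "pm10": 35.0},
    {"timestamp": "2024-01-01T02:00", "pm10": 110.0},
    {"timestamp": "2024-01-01T01:00", "pm10": 75.0},
    {"timestamp": "2024-01-01T04:00", "pm10": 42.0},
]
train, test = time_based_split(records)
for r in train + test:
    print(r["timestamp"], encode_category(r["pm10"]))
```

A random shuffle before splitting would mix future and past observations, which is the leakage that a chronological cut is meant to avoid.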

Appendix B

Appendix B.1. Low-Fidelity Prototype

The figure illustrates the structural workflow and user interaction architecture of the data analysis and predictive modeling interface developed for this study. The system is organized into sequential modules that guide the user from authentication to probabilistic model evaluation and visualization.

Appendix B.2. Dashboards in Looker Studio

The figure presents the Environmental Analysis Dashboard designed to monitor and explore air-quality conditions in Metropolitan Lima. This interface functions as an interactive exploratory data analysis module within the KDD framework, supporting data understanding and interpretation prior to predictive modeling.

Appendix B.3. Predictive System

The figure illustrates the Climate Data Management module of the Climate Analytics platform, corresponding to the Data Cleaning and Transformation stages of the Knowledge Discovery in Databases (KDD) process. This interface supports the structured preparation of environmental datasets prior to predictive modeling.

References

1. Henninger, E.; Smith, E.K. Beyond the haze: Decomposing the effect of economic inequality on global air quality from 2000 to 2020. Ecol. Econ. 2024, 222, 108210.
2. Chaurasiya, M.; Kumar, S.; Bhatt, K.; Sharma, S. The interplay of SDGs and climate action: A quantitative analysis of regional income influences on SDG 13 progress. Phys. Chem. Earth Parts A/B/C 2025, 139, 103939.
3. García-García, J.A.; Reding-Bernal, A.; López-Alvarenga, J.C. Cálculo del tamaño de la muestra en investigación en educación médica. Investig. Educ. Méd. 2013, 2, 217–224.
4. Vu, B.N.; Tapia, V.; Ebelt, S.; Gonzales, G.F.; Liu, Y.; Steenland, K. The association between asthma emergency department visits and satellite-derived PM₂.₅ in Lima, Peru. Environ. Res. 2021, 199, 111226.
5. Cummings, L.E.; Stewart, J.D.; Kremer, P.; Shakya, K.M. Predicting citywide distribution of air pollution using mobile monitoring and three-dimensional urban structure. Sustain. Cities Soc. 2022, 76, 103510.
6. Mondal, C.; Uddin, M.J. Classification of short-term flood events using stochastic variable selection and Gaussian Naïve Bayes classifier: A case study of Sirajganj district, Bangladesh. Heliyon 2025, 11, e41941.
7. Yang, Z.; Lau, Y.; Kanrak, M. Pollution prevention of vessels in the greater bay area: A practical contribution of port state control inspection system towards carbon neutralisation using a tree augmented naive bayes approach. J. Clean. Prod. 2023, 423, 138651.
8. Venkata, P.; Pandya, V. Data mining model and Gaussian Naive Bayes based fault diagnostic analysis of modern power system networks. Mater. Today Proc. 2022, 62, 7156–7161.
9. Manish Lad, A.; Mani Bharathi, K.; Akash Saravanan, B.; Karthik, R. Factors affecting agriculture and estimation of crop yield using supervised learning algorithms. Mater. Today Proc. 2022, 62, 4629–4634.
10. Gnecco, V.M.; Kousis, I.; Pigliautile, I.; Pisello, A.L. Decoding Living Lab sensing system through Bayesian networks: The preferable working space targeting comfort and productivity. J. Build. Eng. 2025, 101, 111913.
11. Shang, Y. Prevention and detection of DDOS attack in virtual cloud computing environment using Naive Bayes algorithm of machine learning. Meas. Sens. 2024, 31, 100991.
12. Phruksahiran, N. Improvement of air quality index prediction using geographically weighted predictor methodology. Urban Clim. 2021, 38, 100890.
13. Moretti-Villegas, L.F.; Tafur-Anzualdo, V.I.; Valiente-Saldaña, Y.M. Contaminación del aire en la ciudad de Lima, Perú. Rev. Arbitr. Interdiscip. Koin. 2023, 8, 822–830.
14. Gómez Peláez, L.M.; Santos, J.M.; de Almeida Albuquerque, T.T.; Reis, N.C.; Andreão, W.L.; de Fátima Andrade, M. Air quality status and trends over large cities in South America. Environ. Sci. Policy 2020, 114, 422–435.
15. Ndiaye, A.; Shen, Y.; Kyriakou, K.; Karssenberg, D.; Schmitz, O.; Flückiger, B.; de Hoogh, K.; Hoek, G. Hourly land-use regression modeling for NO₂ and PM₂.₅ in the Netherlands. Environ. Res. 2024, 256, 119233.
16. Mangones, S.C.; Cuéllar-Álvarez, Y.; Rojas-Roa, N.Y.; Osses, M. Addressing urban transport-related air pollution in Latin America: Insights and policy directions. Lat. Am. Transp. Stud. 2025, 3, 100033.
17. Shetty, S.; Hamer, P.D.; Stebel, K.; Kylling, A.; Hassani, A.; Berntsen, T.K.; Schneider, P. Daily high-resolution surface PM₂.₅ estimation over Europe by ML-based downscaling of the CAMS regional forecast. Environ. Res. 2025, 264, 120363.
18. Alnowaiser, K.; Alarfaj, A.A.; Alabdulqader, E.A.; Umer, M.; Cascone, L.; Alankar, B. IoT based smart framework to predict air quality in congested traffic areas using SV-CNN ensemble and KNN imputation model. Comput. Electr. Eng. 2024, 118, 109311.
19. Llatas, C.; Soust-Verdaguer, B.; Torres, L.C.; Cagigas, D. Application of Knowledge Discovery in Databases (KDD) to environmental, economic, and social indicators used in BIM workflow to support sustainable design. J. Build. Eng. 2024, 91, 109546.
20. Grander, G.; Silva, L.F.D.; Gonzalez, E.D.R.S.; Penha, R. Framework for Structuring Big Data Projects. Electronics 2022, 11, 3540.
21. La Organización Creadora de Conocimiento: Cómo las Compañías Japonesas Crean la Dinámica de la Innovación. Universidad Granada. Available online: https://granatensis.ugr.es/discovery/fulldisplay/alma991003128989704990/34CBUA_UGR:VU1 (accessed on 9 January 2026).
22. Higashide, N.; Zhang, Y.; Asatani, K.; Miura, T.; Sakata, I. Quantifying advances from basic research to applied research in material science. Technovation 2024, 135, 103050.
23. Su, X.; Shang, S.; Xu, Z.; Qian, H.; Pan, X. Assessment of Dependent Performance Shaping Factors in SPAR-H Based on Pearson Correlation Coefficient. Comput. Model. Eng. Sci. 2023, 138, 1813–1826.
24. Tieppo, E.; Nievola, J.C.; Barddal, J.P. Adaptive learning on hierarchical data streams using window-weighted Gaussian probabilities. Appl. Soft Comput. 2024, 152, 111271.
25. Moreno, R.; Nery, A.; Zamora, R.; Lora, Á.; Galán, C. Contribution of urban trees to carbon sequestration and reduction of air pollutants in Lima, Peru. Ecosyst. Serv. 2024, 67, 101618.
26. Romero, Y.; Diaz, C.; Meldrum, I.; Arias Velasquez, R.; Noel, J. Temporal and spatial analysis of traffic-related pollutant under the influence of the seasonality and meteorological variables over an urban city in Peru. Heliyon 2020, 6, e04029.
27. Gond, A.K.; Jamal, A.; Verma, T. Developing a machine learning model using satellite data to predict the Air Quality Index (AQI) over Korba Coalfield, Chhattisgarh (India). Atmos. Pollut. Res. 2025, 16, 102398.
28. Berrar, D. Bayes’ Theorem and Naive Bayes Classifier. In Encyclopedia of Bioinformatics and Computational Biology; Ranganathan, S., Gribskov, M., Nakai, K., Schönbach, C., Eds.; Academic Press: Oxford, UK, 2019; pp. 403–412. ISBN 978-0-12-811432-2.
29. Islam, R.; Devnath, M.K.; Samad, M.D.; Jaffrey Al Kadry, S.M. GGNB: Graph-based Gaussian naive Bayes intrusion detection system for CAN bus. Veh. Commun. 2022, 33, 100442.
30. Arshad, A.; Jabeen, M.; Ubaid, S.; Raza, A.; Abualigah, L.; Aldiabat, K.; Jia, H. A novel ensemble method for enhancing Internet of Things device security against botnet attacks. Decis. Anal. J. 2023, 8, 100307.
31. Ji, W.; Wang, C.; Chen, H.; Liang, Y.; Wang, S. Predicting post-stroke cognitive impairment using machine learning: A prospective cohort study. J. Stroke Cerebrovasc. Dis. 2023, 32, 107354.
32. Otsu, T.; Taniguchi, G. Kolmogorov–Smirnov type test for generated variables. Econ. Lett. 2020, 195, 109401.
33. Just, M.; Schubert, P.; Blatt, J.; Delfmann, P. Data Preprocessing for Cross-System Analysis: The DaProXSA Approach. Procedia Comput. Sci. 2024, 239, 1635–1644.
34. Lydersen, S. Statistical review: Frequently given comments updated. Ann. Rheum. Dis. 2025, 84, 660–663.
35. Cabot, J.H.; Ross, E.G. Evaluating prediction model performance. Surgery 2023, 174, 723–726.
36. Dimitriadis, T.; Gneiting, T.; Jordan, A.I.; Vogel, P. Evaluating probabilistic classifiers: The triptych. Int. J. Forecast. 2024, 40, 1101–1122.
37. Conciatori, M.; Valletta, A.; Segalini, A. Improving the quality evaluation process of machine learning algorithms applied to landslide time series analysis. Comput. Geosci. 2024, 184, 105531.
38. Gehringer, C.K.; Martin, G.P.; Van Calster, B.; Hyrich, K.L.; Verstappen, S.M.M.; Sergeant, J.C. How to develop, validate, and update clinical prediction models using multinomial logistic regression. J. Clin. Epidemiol. 2024, 174, 111481.
39. Decreto Supremo N.° 029-2021-PCM. Available online: https://www.gob.pe/es/institucion/pcm/normas-legales/1705101-029-2021-pcm (accessed on 3 June 2025).
40. Resolución de Secretaría General N.° 000039-2024-AGN/SG. Available online: https://www.gob.pe/institucion/agn/normas-legales/5371925-000039-2024-agn-sg (accessed on 3 June 2025).
41. Onah, J.O.; Abdulhamid, S.M.; Abdullahi, M.; Hassan, I.H.; Al-Ghusham, A. Genetic Algorithm based feature selection and Naïve Bayes for anomaly detection in fog computing environment. Mach. Learn. Appl. 2021, 6, 100156.
42. Paneru, S.; Xu, X.; Wang, J.; Chi, G.; Hu, Y. Assessing building thermal resilience in response to heatwaves through integrating a social vulnerability lens. J. Build. Eng. 2024, 98, 111219.

Figure 1. Confusion matrix by class for PM₁₀.

Figure 2. Reliability curve for PM₁₀.

Figure 3. Confusion matrix by class for PM₂.₅.

Figure 4. Reliability curve for PM₂.₅.

Figure 5. Confusion matrix by class for NO₂.

Figure 6. Reliability curve for NO₂.

Figure 7. Confusion matrix by class for AQI.

Figure 8. Reliability curve for AQI.

Figure 9. Confusion matrix by global model class.

Figure 10. Global GNB model reliability curve.

Table 1. Performance metrics for PM₁₀.

Metrics	Obtained Value	Reference Scale	Evaluation
Accuracy	0.931	Excellent ≥ 0.80	Excellent
Precision	0.933	Excellent ≥ 0.80	Excellent
Recall	0.931	Excellent ≥ 0.80	Excellent
F1-Score	0.934	Excellent ≥ 0.80	Excellent

Table 2. Performance metrics for PM₂.₅.

Metrics	Obtained Value	Reference Scale	Evaluation
Accuracy	0.918	Excellent ≥ 0.80	Excellent
Precision	0.918	Excellent ≥ 0.80	Excellent
Recall	0.918	Excellent ≥ 0.80	Excellent
F1-Score	0.918	Excellent ≥ 0.80	Excellent

Table 3. Performance metrics for NO₂.

Metrics	Obtained Value	Reference Scale	Evaluation
Accuracy	0.913	Excellent ≥ 0.80	Excellent
Precision	0.913	Excellent ≥ 0.80	Excellent
Recall	0.913	Excellent ≥ 0.80	Excellent
F1-Score	0.913	Excellent ≥ 0.80	Excellent

Table 4. Performance metrics for AQI.

Metrics	Obtained Value	Reference Scale	Evaluation
Accuracy	0.931	Excellent ≥ 0.80	Excellent
Precision	0.932	Excellent ≥ 0.80	Excellent
Recall	0.932	Excellent ≥ 0.80	Excellent
F1-Score	0.932	Excellent ≥ 0.80	Excellent

Table 5. General performance metrics.

Metrics	Obtained Value	Reference Scale	Evaluation
Accuracy	0.925	Excellent ≥ 0.80	Excellent
Precision	0.925	Excellent ≥ 0.80	Excellent
Recall	0.925	Excellent ≥ 0.80	Excellent
F1-Score	0.925	Excellent ≥ 0.80	Excellent
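The four metrics reported in Tables 1–5 can all be derived from a multiclass confusion matrix. The sketch below shows one way to do so with support-weighted averaging across classes. The averaging scheme is an assumption, since it is not stated in this section, and the function name `weighted_metrics` and the example matrix are invented for illustration. Under this scheme, weighted recall equals accuracy by construction.

```python
def weighted_metrics(cm):
    """Accuracy and support-weighted precision, recall, and F1.

    cm[i][j] is the number of samples with true class i predicted as class j.
    """
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for k in range(n_classes):
        support = sum(cm[k])                                  # true members of class k
        predicted = sum(cm[i][k] for i in range(n_classes))   # predicted as class k
        p_k = cm[k][k] / predicted if predicted else 0.0
        r_k = cm[k][k] / support if support else 0.0
        f_k = 2 * p_k * r_k / (p_k + r_k) if (p_k + r_k) else 0.0
        weight = support / total
        precision += weight * p_k
        recall += weight * r_k
        f1 += weight * f_k

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented three-class confusion matrix; errors fall mostly in adjacent classes.
example_cm = [
    [90, 10, 0],
    [5, 80, 15],
    [0, 8, 92],
]
for name, value in weighted_metrics(example_cm).items():
    print(f"{name}: {value:.3f}")
```

With macro averaging, in which each class receives equal weight regardless of its support, recall and accuracy generally differ, so the choice of averaging scheme matters when comparing reported values.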
