Article

Predicting Air Pollution in Metropolitan Lima Using Gaussian Naïve Bayes (2025): An Efficient Model for Urban Environmental Management

by Aimee Gavidia *, Aldair Dominguez * and Erick Flores-Chacón *
Faculty of Engineering and Architecture, Cesar Vallejo University, Lima 15434, Peru
* Authors to whom correspondence should be addressed.
Sustainability 2026, 18(11), 5748; https://doi.org/10.3390/su18115748 (registering DOI)
Submission received: 10 December 2025 / Revised: 22 January 2026 / Accepted: 10 February 2026 / Published: 5 June 2026
(This article belongs to the Section Air, Climate Change and Sustainability)

Abstract

Air pollution episodes in Metropolitan Lima pose persistent challenges for urban health protection and timely environmental decision-making. However, many machine learning approaches for air-quality prediction remain difficult to operationalize due to high latency, extensive hyperparameter tuning, and limited interpretability. This study addresses this gap by adopting an engineering-driven predictive knowledge modeling approach grounded in the Knowledge Discovery in Databases (KDD) framework to evaluate an efficient probabilistic classifier—Gaussian Naïve Bayes (GNB)—for predicting regulatory air-quality categories in Metropolitan Lima. A total of 768,185 hourly observations from SENAMHI monitoring stations covering the 2020–2025 period were analyzed, considering PM10, PM2.5, NO2 concentrations, and the Air Quality Index (AQI). Data were preprocessed through validity checks, explicit outlier handling, and categorical encoding based on regulatory thresholds, while a time-based train–test split preserved temporal structure and prevented data leakage. The proposed model achieved strong predictive performance (global accuracy ≥ 0.925) and excellent probabilistic calibration (overall Brier Score ≈ 0.023; AQI Brier Score ≈ 0.010). These results demonstrate that GNB provides a robust, interpretable, and computationally efficient solution for operational air-quality management and early warning support, contributing to evidence-based urban environmental decision-making aligned with Sustainable Development Goal 13 (Climate Action).

1. Introduction

Air pollution is one of the most critical environmental and public health challenges worldwide, contributing to millions of premature deaths, increased respiratory and cardiovascular diseases, and severe economic losses [1,2]. International organizations such as the World Health Organization (WHO) report that over 90% of the global urban population is exposed to particulate matter levels that exceed recommended standards, posing significant risks to human health and compromising progress toward the Sustainable Development Goals (SDGs), particularly SDG 13 (Climate Action) [3,4].
In Latin America, several major cities frequently exceed permissible levels of PM10, PM2.5 and NO2 due to rapid urban expansion, increased vehicular traffic, industrial activity and limited environmental regulation [5]. Within this regional context, Metropolitan Lima represents one of the most critical cases, exhibiting high vehicular emissions, complex meteorological and topographic conditions, and persistent patterns of particulate matter concentrations above national and international limits [6].
Official monitoring reports indicate that annual mean concentrations of PM2.5 in Lima frequently exceed both national air-quality standards and the WHO guideline of 15 µg/m3, with wintertime peaks substantially above this threshold, particularly between June and September [6,7]. Similarly, PM10 concentrations regularly surpass the 24 h regulatory limit during periods of low wind speed and atmospheric stability, while spatial analyses reveal marked intra-urban variability, with several districts experiencing recurrent high-pollution episodes driven primarily by vehicular emissions and resuspended road dust [7,8]. These quantitative patterns confirm that air pollution in Lima is not episodic but structural and seasonal, underscoring the urgent need for effective predictive tools to support environmental management and protect public health.
Advances in data science and artificial intelligence have enabled the use of machine learning models for air pollution prediction, trend analysis, and early warning system development [9]. Commonly applied techniques include neural networks, support vector machines, decision trees, Random Forest, and ensemble models, which have shown promising results in several regions [10,11,12,13,14,15]. However, despite their predictive accuracy, these models often face challenges related to interpretability, hyperparameter tuning complexity, and computational cost, limiting their operational adoption in public institutions that require fast, reliable, and low-cost predictive solutions [16,17].
In contrast, low-complexity probabilistic models such as Naïve Bayes and its Gaussian variant have shown promise in environmental classification tasks due to their fast training, interpretability, and stability, even when handling noisy or moderately correlated data [18]. The Gaussian Naïve Bayes (GNB) model assumes conditional normality of continuous predictors and conditional independence among variables given the class label. Although these assumptions are often moderately violated in atmospheric datasets, previous studies have demonstrated that GNB remains robust under such conditions, achieving competitive performance with substantially lower computational requirements compared to more complex approaches.
A further methodological distinction in air-quality modeling concerns the difference between the estimation of continuous pollutant concentrations and the probabilistic prediction of regulatory air-quality categories, such as those defined by the Air Quality Index (AQI). While continuous predictions are valuable for scientific analysis, categorical predictions aligned with regulatory thresholds are more directly actionable for public health alerts, early warning systems, and operational environmental management. Nevertheless, relatively few studies in Latin American urban contexts, and none to date in Metropolitan Lima, have focused on probabilistically calibrated, multi-pollutant classification models evaluated using metrics such as Brier Score, reliability curves, and extended confusion matrices.
In this context, the present study evaluates the performance of the Gaussian Naïve Bayes model for predicting air-quality categories associated with PM10, PM2.5, NO2, and AQI in Metropolitan Lima, using a large-scale historical dataset covering the 2020–2025 period. By emphasizing probabilistic calibration, computational efficiency, and operational feasibility, this research contributes evidence-based insights that support urban environmental management and decision-making aligned with SDG 13 (Climate Action).

2. Materials and Methods

2.1. Methodological Framework: KDD-Based Data Science Engineering

In this study, the scientific method, within the solution variable, is operationalized through the Knowledge Discovery in Databases (KDD) methodology, conceptualized as an engineering-driven predictive knowledge modeling (EDPKM) approach. Under this perspective, KDD is employed as a structured engineering process that integrates data analytics, data preprocessing, and predictive modeling to systematically transform environmental data into explicit and operational knowledge artifacts. The development of the predictive model is therefore framed as a controlled knowledge-generation activity, aligned with the theory of explicit knowledge creation proposed by Nonaka and Takeuchi. As a result, air-quality prediction is treated as an engineered transformation of information into actionable predictive knowledge, ensuring methodological transparency, reproducibility, and scalability of the proposed approach [19,20,21]. A detailed schematic representation of the adopted KDD engineering workflow is provided in Appendix A.

2.2. Research Type, Approach and Design (KDD: Problem Understanding and Analytical Design)

This study is applied in nature, as it addresses a practical problem related to forecasting air-quality conditions in Metropolitan Lima [22]. A quantitative approach was adopted, based on the analysis of numerical environmental data to identify temporal patterns and evaluate predictive performance. The research followed a non-experimental, longitudinal predictive design, suitable for modeling the temporal evolution of air pollution using historical air-quality records [23]. From a methodological and engineering perspective, the study was conducted at a predictive level, employing statistical and machine learning techniques within a structured Knowledge Discovery in Databases (KDD) process. This process-oriented approach supports the systematic development, validation, and evaluation of predictive models aimed at operational air-quality assessment and early warning systems [24].

2.3. Study Area and Data (KDD: Data Understanding and Selection)

The study area comprises Metropolitan Lima, characterized by high population density, heavy vehicular traffic, and meteorological conditions that frequently hinder pollutant dispersion. A total of 768,185 hourly records from meteorological and air-quality monitoring stations operated by the National Service of Meteorology and Hydrology of Peru (SENAMHI) were analyzed, covering the 2020–2025 period.
In accordance with the data understanding and selection stages of the KDD methodology, the entire dataset was retained to preserve temporal representativeness, capture seasonal and diurnal variability, and minimize selection bias. Although a reference sample size was initially estimated using the finite population formula (n = 36,686), the predictive modeling phase leveraged the full dataset to exploit the informational value inherent in large-volume, high-frequency observations. From a Big Data analytics perspective, this decision aligns with volume-oriented analytical strategies that prioritize comprehensive data utilization over subsampling when computationally feasible [20].
Given the low computational complexity and scalability of the Gaussian Naïve Bayes classifier, processing the complete dataset was technically viable and enabled more stable, robust, and generalizable parameter estimation across spatial and temporal contexts [3]. All computations were performed on a workstation equipped with an Intel® Core™ i7 processor (Intel Corporation, Santa Clara, CA, USA), 16 GB RAM, running Windows 11 Pro (Microsoft Corporation, Redmond, WA, USA).
All environmental measurements were obtained from official open-access repositories provided through the Peruvian National Open Data Platform, which ensures transparency, traceability, and reproducibility of the data used in this study (https://www.datosabiertos.gob.pe/ (accessed on 15 January 2026)).

2.4. Predictor and Target Variables (KDD: Feature Selection and Representation)

The modeling framework focused on the probabilistic classification of air-quality categories. Target variables included regulatory categories associated with PM10, PM2.5, and NO2, as well as the Air Quality Index (AQI), defined according to national and international air-quality standards [25,26,27].
Predictor variables were selected following the feature selection and representation stages of the KDD process and were derived from spatiotemporal and contextual attributes available in the monitoring dataset. These included the hour of measurement (HORA), geographic coordinates (LONGITUD and LATITUD), station altitude (ALTITUD), and administrative location descriptors such as department, province, district, and UBIGEO code. Temporal context was incorporated through the reference timestamp (FECHA_CORTE), enabling the representation of diurnal and calendar-related patterns in air-quality conditions [28,29,30].
Lagged predictors were not incorporated, as the objective of the study was to perform same-hour probabilistic classification aligned with operational air-quality monitoring and real-time early warning requirements, thereby avoiding dependence on historical pollutant values not available in real-time operational settings.

2.5. Data Collection Technique and Preprocessing (KDD: Data Cleaning and Transformation)

A documentary analysis technique was employed, using registry records from official monitoring stations as the primary data source [31]. Data preprocessing followed a reproducible and engineering-oriented pipeline aligned with the data cleaning and transformation stages of the KDD methodology, ensuring data quality prior to modeling [32,33,34]. The operational implementation of this cleaning and transformation module is illustrated in Appendix B.3. The preprocessing steps included:
- Validity checks: records with physically implausible values (e.g., negative pollutant concentrations) were removed.
- Missing data handling: observations with missing target labels were discarded; records with missing predictor values were removed when missingness exceeded predefined thresholds.
- Outlier treatment: extreme values were identified using an interquartile range (IQR) criterion and excluded from the analysis.
- Feature encoding: temporal categorical variables were encoded numerically for model compatibility.
- Air Quality Index calculation: the AQI was not directly available in the raw dataset and was therefore computed from pollutant concentrations using the standard piecewise linear interpolation approach:
$$\mathrm{AQI} = \frac{I_{high} - I_{low}}{C_{high} - C_{low}} \, (C - C_{low}) + I_{low}$$
where $C$ denotes the observed pollutant concentration, $C_{low}$ and $C_{high}$ correspond to the concentration breakpoints surrounding $C$, and $I_{low}$ and $I_{high}$ represent the associated AQI values.
- Data leakage control: all preprocessing steps were fitted exclusively on the training subset and subsequently applied to the test subset.
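The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the 80/20 split fraction, the 1.5 × IQR fence multiplier, the record field names, and the breakpoint table are all hypothetical placeholders, since the text does not report them.

```python
from statistics import quantiles

# Hypothetical breakpoints (C_low, C_high, I_low, I_high) for illustration only;
# the regulatory table used in the study is not reproduced in the text.
BREAKPOINTS = [
    (0.0, 25.0, 0, 50),
    (25.0, 50.0, 50, 100),
    (50.0, 100.0, 100, 150),
]


def aqi_from_concentration(c, breakpoints=BREAKPOINTS):
    """Piecewise linear interpolation between the breakpoints surrounding c."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= c <= c_high:
            return (i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low
    raise ValueError(f"concentration {c} is outside the breakpoint table")


def chronological_split(records, train_fraction=0.8):
    """Time-based split: the earliest records train, the latest ones test."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def iqr_fences(train_values, k=1.5):
    """Outlier fences estimated from training values only (leakage control)."""
    q1, _, q3 = quantiles(train_values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy hourly PM2.5 records (made-up values).
records = [
    {"timestamp": f"2024-01-01T{h:02d}:00", "pm25": v}
    for h, v in enumerate([10, 12, 14, 16, 18, 20, 22, 24, 90, 30])
]
train, test = chronological_split(records)
low, high = iqr_fences([r["pm25"] for r in train])

# The fences fitted on the training subset are applied unchanged to the test subset.
test_clean = [r for r in test if low <= r["pm25"] <= high]
test_aqi = [aqi_from_concentration(r["pm25"]) for r in test_clean]
```

The point of the design is that the fences never see test-period values, so the filter applied to later data is fixed at training time.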

2.6. Analytical Modeling: Gaussian Naïve Bayes (KDD: Data Mining and Modeling)

The Gaussian Naïve Bayes (GNB) classifier estimates posterior class probabilities using Bayes’ theorem:
$$P(y = k \mid x) \propto P(y = k) \prod_{j=1}^{p} P(x_j \mid y = k)$$
where $x$ represents the vector of the predictor variables and $y$ denotes the air-quality class [29]. For continuous predictors, class-conditional likelihoods are modeled as Gaussian distributions:
$$P(x_j \mid y = k) = \frac{1}{\sqrt{2\pi\sigma_{kj}^{2}}} \exp\left(-\frac{(x_j - \mu_{kj})^{2}}{2\sigma_{kj}^{2}}\right)$$
Parameters $\mu_{kj}$ and $\sigma_{kj}^{2}$ are estimated from the training data using maximum likelihood [8].
The model was implemented in Python (Python Software Foundation, Wilmington, DE, USA), version 3.11, using the GaussianNB classifier from the scikit-learn library (version 1.4.2; scikit-learn Developers, Paris, France). Data manipulation was performed using pandas (version 2.2.1; pandas Development Team, USA) and numerical operations were conducted using NumPy (version 1.26.4; NumPy Developers, USA). All experiments were executed within a Jupyter Notebook environment (version 6.5.4, Project Jupyter, USA). Training was performed using the standard fit(x_train, y_train) procedure on the training subset. Default prior handling was used (priors=None), allowing class prior probabilities to be estimated from the data. Numerical stability was ensured through the default variance smoothing parameter (var_smoothing = 1 × 10⁻⁹), which adds a small fraction of the largest feature variance to all feature variances to avoid numerical instability [6].
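To make the two equations of this section concrete, the following is a from-scratch sketch of Gaussian Naïve Bayes in standard-library Python. It is not the scikit-learn implementation used in the study; the function names `fit_gnb` and `predict_proba` and the toy data are illustrative assumptions. Priors are estimated from class frequencies, and the smoothing term is taken as `var_smoothing` times the largest overall feature variance, following the way scikit-learn documents that parameter.

```python
import math
from collections import defaultdict


def fit_gnb(X, y, var_smoothing=1e-9):
    """Estimate class priors and per-class feature means and variances (MLE)."""
    n_features = len(X[0])
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)

    # Smoothing term: a fraction of the largest overall feature variance.
    overall_var = []
    for j in range(n_features):
        col = [row[j] for row in X]
        mean = sum(col) / len(col)
        overall_var.append(sum((v - mean) ** 2 for v in col) / len(col))
    eps = var_smoothing * max(overall_var)

    model = {}
    for label, rows in by_class.items():
        means, variances = [], []
        for j in range(n_features):
            col = [r[j] for r in rows]
            mu = sum(col) / len(col)
            means.append(mu)
            variances.append(sum((v - mu) ** 2 for v in col) / len(col) + eps)
        model[label] = (len(rows) / len(X), means, variances)
    return model


def predict_proba(model, x):
    """Posterior P(y = k | x), computed in log space and then normalized."""
    log_post = {}
    for label, (prior, means, variances) in model.items():
        lp = math.log(prior)
        for xj, mu, var in zip(x, means, variances):
            lp += -0.5 * math.log(2 * math.pi * var) - (xj - mu) ** 2 / (2 * var)
        log_post[label] = lp
    peak = max(log_post.values())  # subtract the maximum before exponentiating
    unnorm = {k: math.exp(v - peak) for k, v in log_post.items()}
    total = sum(unnorm.values())
    return {k: v / total for k, v in unnorm.items()}


# Toy data: two made-up predictors and two made-up classes.
X_toy = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5], [8.0, 9.0], [9.0, 8.0], [8.5, 8.5]]
y_toy = ["Low", "Low", "Low", "High", "High", "High"]
model = fit_gnb(X_toy, y_toy)
probs = predict_proba(model, [2.0, 2.0])
predicted = max(probs, key=probs.get)
```

Working in log space avoids underflow when the product of many small likelihoods is formed, which matters more as the number of predictors grows.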

2.7. Evaluation Metrics and Performance Analysis (KDD: Evaluation and Interpretation)

Model performance was evaluated using accuracy, precision, recall and F1-score, as recommended by prior environmental prediction studies [35]. Probabilistic calibration was assessed using the multiclass Brier Score, computed as the mean squared difference between predicted class probabilities and observed class indicators [36,37]. Reliability curves were used to analyze calibration behavior across probability bins, while confusion matrices were employed to examine class-wise classification performance and misclassification patterns [38].
Consistent with the interpretation stage of the KDD framework, each observation was assigned to the air-quality category associated with the highest posterior probability.
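The calibration metrics described above can be sketched as follows. This is an illustration under assumptions: the text does not state whether the multiclass Brier Score is summed or averaged over classes, so the sketch computes a one-vs-rest score per class and averages them, and the number of bins (`n_bins=5`) is a placeholder. The probabilities and labels are made up.

```python
def brier_per_class(probs, labels, classes):
    """One-vs-rest Brier score per class: mean of (p_k - o_k)^2 over samples."""
    scores = {}
    for k in classes:
        squared = [
            (p[k] - (1.0 if y == k else 0.0)) ** 2 for p, y in zip(probs, labels)
        ]
        scores[k] = sum(squared) / len(squared)
    return scores


def reliability_points(probs, labels, target, n_bins=5):
    """(mean predicted probability, observed frequency) for each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p[target] * n_bins), n_bins - 1)
        bins[idx].append((p[target], 1.0 if y == target else 0.0))
    return [
        (sum(q for q, _ in b) / len(b), sum(o for _, o in b) / len(b))
        for b in bins
        if b
    ]


def map_class(p):
    """Assign the category with the highest posterior probability."""
    return max(p, key=p.get)


# Made-up posterior probabilities and true labels for four observations.
probs = [
    {"Low": 0.9, "High": 0.1},
    {"Low": 0.8, "High": 0.2},
    {"Low": 0.3, "High": 0.7},
    {"Low": 0.1, "High": 0.9},
]
labels = ["Low", "Low", "High", "Low"]

scores = brier_per_class(probs, labels, ["Low", "High"])
mean_brier = sum(scores.values()) / len(scores)
points = reliability_points(probs, labels, "Low")
assigned = [map_class(p) for p in probs]
```

A well-calibrated model yields reliability points close to the diagonal, that is, pairs whose two coordinates are nearly equal.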

Decision Criteria and Hypothesis Testing

The Gaussian Naïve Bayes (GNB) model was evaluated using standard classification metrics, including accuracy (A), precision (P), recall (R), and F1-score (F), together with the percentage of correctly classified cases derived from the confusion matrix (C), the multiclass Brier Score (B), and the number of reliability points located near the diagonal of the reliability curve (L).
In accordance with the decision rule defined in this study, the alternative hypothesis (H1) was accepted—and the null hypothesis (H0) was rejected—when all predefined performance conditions were simultaneously satisfied: A ≥ 0.80, P ≥ 0.80, R ≥ 0.80, F ≥ 0.80, C ≥ 80%, B ≤ 0.10, and L ≥ 3. This conjunctive decision criterion ensures that model acceptance reflects not only classification accuracy but also probabilistic reliability and calibration quality.
This rule was applied consistently to each specific predictive objective (PM10, PM2.5, NO2, and AQI), as well as to the global model performance assessment.
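The conjunctive rule can be written as a single check. The thresholds below are those stated in the text; the function name `accept_h1` and the example metric values, in particular the value of L, are hypothetical and serve only to show how the rule behaves.

```python
# Minimum thresholds from the decision rule (C is expressed in percent).
MINIMUMS = {"A": 0.80, "P": 0.80, "R": 0.80, "F": 0.80, "C": 80.0, "L": 3}
MAX_BRIER = 0.10


def accept_h1(metrics):
    """Accept H1 only if every condition holds simultaneously."""
    meets_minimums = all(metrics[name] >= bound for name, bound in MINIMUMS.items())
    return meets_minimums and metrics["B"] <= MAX_BRIER


# Illustrative values only; L = 4 is an assumed count of near-diagonal points.
example = {"A": 0.93, "P": 0.93, "R": 0.93, "F": 0.93, "C": 92.0, "B": 0.023, "L": 4}

accepted = accept_h1(example)
rejected_on_brier = accept_h1({**example, "B": 0.15})
rejected_on_points = accept_h1({**example, "L": 2})
```

Because the rule is a conjunction, a single failing condition (for example a Brier Score above 0.10) leads to retaining H0 even when accuracy is high.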

2.8. KDD Presentation Phase

As part of the knowledge presentation and deployment stage of the KDD process, all analytical results, confusion matrices, and reliability curves were generated using a dedicated data analysis and visualization interface developed specifically for this study. The software integrates model execution, probabilistic evaluation, and iconographic visualization to support interpretability, transparency, and reproducibility of the predictive knowledge generated. The initial structural workflow and user interaction design were conceptualized through a low-fidelity prototype (Appendix B.1). The complete source code and interactive outputs are publicly available at https://gnb.aldairdominguez.me/ (accessed 25 January 2026), enabling independent verification and reuse of the proposed engineering-driven predictive framework.
Exploratory and monitoring dashboards were developed using Looker Studio to support interactive environmental data analysis prior to modeling (Appendix B.2).

2.9. Ethical Considerations

This study relied exclusively on open-access environmental data provided by SENAMHI and did not involve personal information or human subjects; therefore, informed consent was not required. Data usage complied with Peru’s digital governance framework established by Supreme Decree No. 029-2021-PCM and adhered to the principles of transparency, integrity, and responsible reuse of public information outlined in Directive No. 003-2024-AGN [39,40]. All datasets were used solely for academic purposes in accordance with national and international open-science standards.

2.10. Use of Generative Artificial Intelligence (GenAI)

The authors declare that generative artificial intelligence tools were used exclusively for language editing assistance, writing refinement, content organization, and coherence checking. No GenAI tools were used to generate, modify, or synthesize data, figures, results, statistical analyses, or scientific interpretations.

3. Results

3.1. Overall Evaluation Framework

The performance of the Gaussian Naïve Bayes (GNB) model was evaluated using standard classification metrics, including accuracy, precision, recall, and F1-score. In addition, probabilistic performance was assessed using confusion matrices, Brier Score, and reliability curves to evaluate calibration quality and predictive uncertainty [37].
Class support was explicitly considered in the evaluation to account for potential imbalance across air-quality categories. For all pollutants and the AQI, the number of samples per class was reported in the corresponding confusion matrices, ensuring that performance metrics were interpreted in the context of their empirical frequency [27].
To assess model stability, performance consistency was verified across all pollutants and classes through the joint analysis of global metrics, class-wise confusion matrices, and probabilistic calibration indicators. The narrow dispersion of class-wise Brier Scores indicates low predictive variance and stable probabilistic behavior [38].
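As a small illustration of how class support and class-wise correct-classification rates are read from a confusion matrix, consider the sketch below. The counts are invented for the example and are not the values reported in the figures; rows are assumed to hold the true classes and columns the predicted classes.

```python
def class_rates(confusion, classes):
    """Per-class support and correct-classification rate (rows = true classes)."""
    summary = {}
    for i, name in enumerate(classes):
        support = sum(confusion[i])
        summary[name] = {
            "support": support,
            "correct_rate": confusion[i][i] / support,
        }
    return summary


def overall_accuracy(confusion):
    """Share of all samples that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Invented counts for three ordered categories.
classes = ["Low", "Medium", "High"]
confusion = [
    [92, 7, 1],
    [4, 93, 3],
    [0, 6, 94],
]

summary = class_rates(confusion, classes)
accuracy = overall_accuracy(confusion)
```

Reporting the support next to each rate shows whether a high rate rests on many samples or only a few.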

3.2. PM10 Concentration

The GNB model demonstrated strong predictive performance for PM10 concentration classification. An overall accuracy of 0.931 was achieved, with precision, recall, and F1-score all exceeding 0.93 (Table 1).
The confusion matrix (Figure 1) exhibits a dominant diagonal structure, with correct classification rates ranging from 92% to 94% across the Low-, Medium-, and High-PM10 categories. Misclassifications were limited and primarily occurred between adjacent classes, indicating adequate discrimination among pollution levels.
The average Brier Score for PM10 was 0.023, with class-wise values ranging from 0.022 to 0.024, reflecting high-quality probabilistic calibration. The reliability curve (Figure 2) shows close alignment with the diagonal, confirming consistency between predicted probabilities and observed frequencies.

3.3. PM2.5 Concentration

For PM2.5 concentration, the GNB model achieved an accuracy, precision, recall, and F1-score of 0.918 (Table 2), indicating stable and consistent classification performance.
As shown in the confusion matrix (Figure 3), correct classification rates exceeded 91% across all PM2.5 classes, with misclassification rates below 5% and restricted to neighboring categories. This pattern highlights the robustness of the model when handling fine particulate matter concentrations.
The average Brier Score for PM2.5 was 0.029, with class-wise values between 0.027 and 0.030, remaining well within acceptable calibration thresholds. The corresponding reliability curve (Figure 4) confirms appropriate probabilistic calibration.

3.4. NO2 Concentration

The prediction of NO2 concentration also yielded favorable results. The GNB model achieved accuracy, precision, recall, and F1-score values of 0.913 (Table 3).
The confusion matrix (Figure 5) shows classification accuracies slightly above 91% for the Low-, Medium-, and High-NO2 categories, with errors limited to adjacent classes. These results indicate consistent generalization across nitrogen dioxide concentration levels.
The average Brier Score was 0.031, with minimal variation among classes, further supporting reliable probability estimates. The reliability curve (Figure 6) confirms good agreement between predicted probabilities and observed outcomes.

3.5. Air Quality Index (AQI)

For the Air Quality Index (AQI), the GNB model was evaluated across six ordered categories: Good, Moderate, Unhealthy for Sensitive Groups, Unhealthy, Very Unhealthy, and Hazardous.
The model achieved an accuracy of 0.931, with precision, recall, and F1-score values above 0.93 (Table 4). The confusion matrix (Figure 7) displays strong diagonal dominance, with correct classification rates between 92% and 94.5% across all AQI categories. Misclassifications were infrequent and mainly occurred between adjacent AQI levels.
The average Brier Score for AQI was 0.010, indicating excellent probabilistic calibration. The reliability curve (Figure 8) further confirms that predicted probabilities closely match empirical frequencies.

3.6. Global Model Performance and General Hypothesis

Finally, the global evaluation of the GNB model, considering all pollutants jointly, yielded accuracy, precision, recall, and F1-score values of 0.925 (Table 5). The global confusion matrix (Figure 9) shows correct classification rates above 92% for Low, Medium, and High pollution levels, with errors confined to neighboring categories.
The overall average Brier Score was 0.023, with consistent class-wise values, indicating stable and well-calibrated probabilistic predictions. The global reliability curve (Figure 10) confirms strong agreement between predicted probabilities and observed frequencies.
From an engineering perspective, the observed performance stability across pollutant categories and the strong probabilistic calibration indicate that the predictive model behaves consistently as a reproducible knowledge artifact generated through a structured KDD-based workflow. The concentration of misclassifications between adjacent air-quality categories further reflects the model’s sensitivity to regulatory threshold boundaries rather than random predictive instability.

4. Discussion

The results obtained for the 2020–2025 period demonstrate that the Gaussian Naïve Bayes (GNB) model provides consistent, stable, and well-calibrated predictions across all evaluated pollutants, supporting its suitability as an operational air-quality prediction tool for Metropolitan Lima. The combination of strong classification metrics and low Brier Scores not only indicates high predictive accuracy but also reliable probabilistic estimation, which is essential for environmental decision-making and early warning systems.
For PM10, the model achieved accuracy, precision, recall, and F1-score values above 0.93, accompanied by a low average Brier Score (≈0.023). The diagonal-dominant confusion matrix and the concentration of misclassifications between adjacent categories suggest that GNB effectively discriminates between PM10 concentration levels. This behavior is consistent with previous studies reporting relatively stable PM10 dynamics driven by vehicular traffic, resuspension processes, and persistent meteorological conditions in large Latin American metropolitan areas [13,14,16]. The comparatively smooth temporal behavior of PM10 favors the conditional independence and Gaussian distribution assumptions of the GNB model, contributing to its strong performance.
In the case of PM2.5, the model achieved slightly lower—but still robust—performance (metrics ≈ 0.918) and an average Brier Score of approximately 0.029. This result aligns with the existing literature indicating that PM2.5 concentrations are more difficult to predict due to their sensitivity to combustion-related sources, micro-scale traffic conditions, and localized atmospheric processes [15,17]. Minor deviations observed in the reliability curve can be attributed to short-term emission peaks and complex street-level dynamics. Nevertheless, the observed calibration and accuracy levels remain sufficient for practical applications, and performance could be further enhanced by incorporating higher-resolution traffic flow data or meteorological predictors.
For NO2, the GNB model obtained accuracy and F1-score values around 0.913, with an average Brier Score close to 0.031. Misclassifications were primarily concentrated between the Medium and High categories, reflecting the rapid temporal variability of NO2 concentrations driven by fluctuations in vehicular activity and urban mobility patterns. These findings are consistent with studies reporting reliable NO2 prediction performance in dense urban environments, where short-term changes in traffic intensity play a dominant role [14,15]. Despite these dynamics, GNB maintained stable calibration, demonstrating robustness in handling rapidly changing gaseous pollutants.
The strongest performance was observed for the Air Quality Index (AQI), with accuracy and F1-score values exceeding 0.93 and a notably low average Brier Score (≈0.010). The near-perfect alignment of the reliability curve with the diagonal indicates excellent probabilistic calibration, even when classifying six ordered AQI categories. This outcome can be explained by the integrative nature of AQI, which aggregates multiple pollutants into standardized thresholds, reducing noise and facilitating classification. Similar findings have been reported in studies that emphasize the effectiveness of composite air-quality indices for risk communication and public health decision support [13,17,18].
From a global perspective, the integrated model achieved consistent performance (metrics ≈ 0.925; Brier Score ≈ 0.023), confirming that GNB effectively combines multi-pollutant information and generalizes well to unseen data. Compared with more complex approaches such as deep neural networks, ensemble models, or hybrid architectures, GNB offers a favorable balance between predictive accuracy, interpretability, and computational efficiency [41,42]. This balance is particularly relevant in urban contexts such as Metropolitan Lima, where environmental monitoring agencies may face constraints in technical infrastructure, data availability, and operational resources.
Despite its strengths, this study has certain limitations. The conditional independence assumption inherent to GNB may not fully capture interactions among pollutants and meteorological variables, particularly under extreme pollution episodes. Additionally, the use of fixed AQI thresholds may introduce discretization effects that smooth abrupt concentration changes. Future research could address these limitations by integrating hybrid or hierarchical models, incorporating spatial dependencies, or evaluating real-time forecasting scenarios.
Overall, the findings confirm that Gaussian Naïve Bayes constitutes a robust, low-complexity, and well-calibrated approach for air pollution prediction, offering a practical solution for early warning systems, environmental management, and evidence-based policymaking in densely populated urban environments aligned with Sustainable Development Goal 13 (Climate Action).
The results support the validity of adopting an engineering-driven predictive knowledge modeling approach, in which the Knowledge Discovery in Databases (KDD) framework structures the transformation of air-quality data into explicit and operational predictive knowledge. Rather than optimizing model complexity, the proposed approach prioritizes reproducibility, interpretability, and computational efficiency—key requirements for real-time air-quality monitoring and decision-support systems. In this context, the Gaussian Naïve Bayes classifier demonstrates that probabilistic models with transparent assumptions can achieve competitive performance while remaining suitable for operational deployment.

5. Conclusions

  • This study demonstrates that the Gaussian Naïve Bayes (GNB) model provides accurate, stable, and well-calibrated predictions of air pollution levels in Metropolitan Lima for the 2020–2025 period. The global performance metrics (accuracy, precision, recall, and F1-score ≈ 0.925), together with a low average Brier Score (≈0.023), confirm both strong classification capability and reliable probabilistic estimation.
  • The model showed robust performance in predicting particulate matter concentrations. For PM10, classification accuracy exceeded 93%, with misclassifications largely restricted to adjacent concentration levels, reflecting effective discrimination under relatively stable urban pollution conditions. For PM2.5, performance remained consistently high (≈0.918), despite the greater variability and localized emission sources associated with fine particulate matter.
  • For NO2, the GNB model achieved reliable predictive performance (metrics ≈ 0.913), capturing rapid concentration transitions driven by urban traffic dynamics. Misclassifications were primarily confined to neighboring categories, indicating appropriate generalization across gaseous pollution levels.
  • The strongest results were obtained for the Air Quality Index (AQI), where the model achieved accuracy and F1-score values above 0.93 and an exceptionally low Brier Score (≈0.010). This highlights the effectiveness of GNB in handling multiclass, ordered air-quality categories and producing well-calibrated probabilistic outputs suitable for risk communication.
  • Overall, the findings confirm that Gaussian Naïve Bayes represents a computationally efficient and interpretable modeling approach that balances simplicity with high predictive performance. Its low computational cost and stable calibration make it particularly suitable for operational air-quality monitoring, early warning systems, and evidence-based decision-making in resource-constrained urban environments.
  • This study contributes a structured and reproducible engineering-oriented framework for air-quality prediction, in which KDD serves as the backbone for predictive knowledge generation. By framing the model as an explicit knowledge artifact rather than a black-box predictor, the proposed approach facilitates transparency, scalability, and transferability to other urban contexts. This perspective supports the development of low-latency, interpretable decision-support tools aligned with sustainable urban environmental management.
  • By aligning air pollution prediction with Sustainable Development Goal 13 (Climate Action), this study contributes a practical and scalable methodological framework that supports urban environmental management and strengthens resilience strategies in densely populated metropolitan areas.
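
The Brier Score cited in the conclusions measures how close predicted class probabilities are to the observed outcomes. The sketch below is a minimal illustration using the classical multiclass definition (mean over samples of the squared distance to the one-hot true class). This section does not state which variant the authors used; some implementations also divide by the number of classes, which rescales the value. The function name, category labels, and probabilities are hypothetical.

```python
from typing import Sequence


def multiclass_brier(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean over samples of the squared distance between the predicted
    probability vector and the one-hot encoding of the true class."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    total = 0.0
    for p, y in zip(probs, labels):
        # Target is 1.0 for the true class index and 0.0 for every other class.
        total += sum((p_k - (1.0 if k == y else 0.0)) ** 2 for k, p_k in enumerate(p))
    return total / len(labels)


# Hypothetical predictions over three ordered categories
# (0 = good, 1 = moderate, 2 = poor); not data from the study.
probs = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.1, 0.9],
    [0.6, 0.4, 0.0],
]
labels = [0, 1, 2, 1]

score = multiclass_brier(probs, labels)
print(f"Brier score: {score:.3f}")
```

A score of 0 corresponds to fully confident, fully correct probabilities. In this toy example the confident miss on the last sample contributes most of the total.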

Author Contributions

Conceptualization, A.G., A.D. and E.F.-C.; methodology, A.D. and E.F.-C.; software, A.D.; validation, A.D. and A.G.; formal analysis, A.D.; investigation, A.D. and A.G.; resources, A.G.; data curation, A.D.; writing – original draft preparation, A.D.; writing – review and editing, A.G. and E.F.-C.; visualization, A.D.; supervision, E.F.-C.; project administration, A.G. All authors have read and agreed to the published version of the manuscript.

Funding

Institutional funding from Universidad César Vallejo is currently under administrative review.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The raw environmental and meteorological data used for model training and evaluation were obtained from the open-data platform of the National Meteorology and Hydrology Service of Peru (SENAMHI) and are accessible through the Peruvian open data portal at https://www.datosabiertos.gob.pe/dataset/monitoreo-de-los-contaminantes-del-aire-en-lima-metropolitana-servicio-nacional-de (accessed on 15 January 2026). Processed datasets and analysis scripts generated in the current study are available from the corresponding author upon reasonable request.

Acknowledgments

The authors thank the National Meteorology and Hydrology Service of Peru (SENAMHI) for providing the open-access environmental datasets that made this research possible. The authors also acknowledge the administrative and academic support of Cesar Vallejo University during the development of this study.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

AQI | Air Quality Index
GNB | Gaussian Naïve Bayes
PM10 | Particulate Matter ≤ 10 µm
PM2.5 | Particulate Matter ≤ 2.5 µm
NO2 | Nitrogen Dioxide
WHO | World Health Organization
SDG | Sustainable Development Goal
ML | Machine Learning

Appendix A

Data Mining Engineering (KDD)

The Knowledge Discovery in Databases (KDD) process is a systematic framework for extracting valid, novel, and useful patterns from large datasets [19]. In this study, the KDD process was adopted as an engineering guideline to structure the data pipeline that supports the Gaussian Naïve Bayes (GNB) model for air pollution prediction in Metropolitan Lima.
[Figure i001: image not available in this text version]
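
As a rough illustration of how the modeling stage of such a pipeline could look, the sketch below pairs a time-based train/test split with a from-scratch Gaussian Naïve Bayes classifier. It is not the authors' implementation: the class `TinyGaussianNB`, the helper `time_based_split`, the 80/20 cut, the variance floor, and all sample readings are assumptions made for this example.

```python
import math
from collections import defaultdict


def time_based_split(rows, train_fraction=0.8):
    """Sort by timestamp and cut once, so every test row is later than every training row."""
    ordered = sorted(rows, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


class TinyGaussianNB:
    """Hypothetical minimal Gaussian Naive Bayes: one mean and variance per feature per class."""

    def fit(self, X, y, var_floor=1e-9):
        grouped = defaultdict(list)
        for features, label in zip(X, y):
            grouped[label].append(features)
        self.classes_ = sorted(grouped)
        self.params_ = {}
        for c in self.classes_:
            cols = list(zip(*grouped[c]))
            means = [sum(col) / len(col) for col in cols]
            # var_floor keeps the variance strictly positive for constant features.
            variances = [
                sum((v - m) ** 2 for v in col) / len(col) + var_floor
                for col, m in zip(cols, means)
            ]
            self.params_[c] = (math.log(len(grouped[c]) / len(y)), means, variances)
        return self

    def predict_proba(self, x):
        log_post = []
        for c in self.classes_:
            log_prior, means, variances = self.params_[c]
            log_lik = sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(x, means, variances)
            )
            log_post.append(log_prior + log_lik)
        # Subtract the maximum before exponentiating for numerical stability.
        top = max(log_post)
        weights = [math.exp(lp - top) for lp in log_post]
        total = sum(weights)
        return [w / total for w in weights]

    def predict(self, x):
        p = self.predict_proba(x)
        return self.classes_[p.index(max(p))]


# Hypothetical rows: (hour index, PM10, PM2.5, category); deliberately out of order.
rows = [
    (4, 85.0, 42.0, "poor"), (0, 20.0, 8.0, "good"), (7, 90.0, 45.0, "poor"),
    (1, 22.0, 9.0, "good"), (9, 82.0, 41.0, "poor"), (2, 80.0, 40.0, "poor"),
    (3, 25.0, 10.0, "good"), (8, 23.0, 8.5, "good"), (5, 78.0, 38.0, "poor"),
    (6, 21.0, 9.5, "good"),
]

train, test = time_based_split(rows)
model = TinyGaussianNB().fit([list(r[1:3]) for r in train], [r[3] for r in train])
predictions = [model.predict(r[1:3]) for r in test]
print(predictions)
```

Splitting on time rather than at random means the model is always evaluated on observations that come after everything it was trained on, which is what keeps future information from leaking into training.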

Appendix B

Appendix B.1. Low-Fidelity Prototype

The figure illustrates the structural workflow and user interaction architecture of the data analysis and predictive modeling interface developed for this study. The system is organized into sequential modules that guide the user from authentication to probabilistic model evaluation and visualization.
[Figure i002: image not available in this text version]

Appendix B.2. Dashboards in Looker Studio

The figure presents the Environmental Analysis Dashboard designed to monitor and explore air-quality conditions in Metropolitan Lima. This interface functions as an interactive exploratory data analysis module within the KDD framework, supporting data understanding and interpretation prior to predictive modeling.
[Figure i003: image not available in this text version]

Appendix B.3. Predictive System

The figure illustrates the Climate Data Management module of the Climate Analytics platform, corresponding to the Data Cleaning and Transformation stages of the Knowledge Discovery in Databases (KDD) process. This interface supports the structured preparation of environmental datasets prior to predictive modeling.
[Figures i004a and i004b: images not available in this text version]

References

  1. Henninger, E.; Smith, E.K. Beyond the haze: Decomposing the effect of economic inequality on global air quality from 2000 to 2020. Ecol. Econ. 2024, 222, 108210.
  2. Chaurasiya, M.; Kumar, S.; Bhatt, K.; Sharma, S. The interplay of SDGs and climate action: A quantitative analysis of regional income influences on SDG 13 progress. Phys. Chem. Earth Parts A/B/C 2025, 139, 103939.
  3. García-García, J.A.; Reding-Bernal, A.; López-Alvarenga, J.C. Cálculo del tamaño de la muestra en investigación en educación médica. Investig. Educ. Méd. 2013, 2, 217–224.
  4. Vu, B.N.; Tapia, V.; Ebelt, S.; Gonzales, G.F.; Liu, Y.; Steenland, K. The association between asthma emergency department visits and satellite-derived PM2.5 in Lima, Peru. Environ. Res. 2021, 199, 111226.
  5. Cummings, L.E.; Stewart, J.D.; Kremer, P.; Shakya, K.M. Predicting citywide distribution of air pollution using mobile monitoring and three-dimensional urban structure. Sustain. Cities Soc. 2022, 76, 103510.
  6. Mondal, C.; Uddin, M.J. Classification of short-term flood events using stochastic variable selection and Gaussian Naïve Bayes classifier: A case study of Sirajganj district, Bangladesh. Heliyon 2025, 11, e41941.
  7. Yang, Z.; Lau, Y.; Kanrak, M. Pollution prevention of vessels in the greater bay area: A practical contribution of port state control inspection system towards carbon neutralisation using a tree augmented naive bayes approach. J. Clean. Prod. 2023, 423, 138651.
  8. Venkata, P.; Pandya, V. Data mining model and Gaussian Naive Bayes based fault diagnostic analysis of modern power system networks. Mater. Today Proc. 2022, 62, 7156–7161.
  9. Manish Lad, A.; Mani Bharathi, K.; Akash Saravanan, B.; Karthik, R. Factors affecting agriculture and estimation of crop yield using supervised learning algorithms. Mater. Today Proc. 2022, 62, 4629–4634.
  10. Gnecco, V.M.; Kousis, I.; Pigliautile, I.; Pisello, A.L. Decoding Living Lab sensing system through Bayesian networks: The preferable working space targeting comfort and productivity. J. Build. Eng. 2025, 101, 111913.
  11. Shang, Y. Prevention and detection of DDOS attack in virtual cloud computing environment using Naive Bayes algorithm of machine learning. Meas. Sens. 2024, 31, 100991.
  12. Phruksahiran, N. Improvement of air quality index prediction using geographically weighted predictor methodology. Urban Clim. 2021, 38, 100890.
  13. Moretti-Villegas, L.F.; Tafur-Anzualdo, V.I.; Valiente-Saldaña, Y.M. Contaminación del aire en la ciudad de Lima, Perú. Rev. Arbitr. Interdiscip. Koin. 2023, 8, 822–830.
  14. Gómez Peláez, L.M.; Santos, J.M.; de Almeida Albuquerque, T.T.; Reis, N.C.; Andreão, W.L.; de Fátima Andrade, M. Air quality status and trends over large cities in South America. Environ. Sci. Policy 2020, 114, 422–435.
  15. Ndiaye, A.; Shen, Y.; Kyriakou, K.; Karssenberg, D.; Schmitz, O.; Flückiger, B.; de Hoogh, K.; Hoek, G. Hourly land-use regression modeling for NO2 and PM2.5 in the Netherlands. Environ. Res. 2024, 256, 119233.
  16. Mangones, S.C.; Cuéllar-Álvarez, Y.; Rojas-Roa, N.Y.; Osses, M. Addressing urban transport-related air pollution in Latin America: Insights and policy directions. Lat. Am. Transp. Stud. 2025, 3, 100033.
  17. Shetty, S.; Hamer, P.D.; Stebel, K.; Kylling, A.; Hassani, A.; Berntsen, T.K.; Schneider, P. Daily high-resolution surface PM2.5 estimation over Europe by ML-based downscaling of the CAMS regional forecast. Environ. Res. 2025, 264, 120363.
  18. Alnowaiser, K.; Alarfaj, A.A.; Alabdulqader, E.A.; Umer, M.; Cascone, L.; Alankar, B. IoT based smart framework to predict air quality in congested traffic areas using SV-CNN ensemble and KNN imputation model. Comput. Electr. Eng. 2024, 118, 109311.
  19. Llatas, C.; Soust-Verdaguer, B.; Torres, L.C.; Cagigas, D. Application of Knowledge Discovery in Databases (KDD) to environmental, economic, and social indicators used in BIM workflow to support sustainable design. J. Build. Eng. 2024, 91, 109546.
  20. Grander, G.; Silva, L.F.D.; Gonzalez, E.D.R.S.; Penha, R. Framework for Structuring Big Data Projects. Electronics 2022, 11, 3540.
  21. La Organización Creadora de Conocimiento: Cómo las Compañías Japonesas Crean la Dinámica de la Innovación—Universidad Granada. Available online: https://granatensis.ugr.es/discovery/fulldisplay/alma991003128989704990/34CBUA_UGR:VU1 (accessed on 9 January 2026).
  22. Higashide, N.; Zhang, Y.; Asatani, K.; Miura, T.; Sakata, I. Quantifying advances from basic research to applied research in material science. Technovation 2024, 135, 103050.
  23. Su, X.; Shang, S.; Xu, Z.; Qian, H.; Pan, X. Assessment of Dependent Performance Shaping Factors in SPAR-H Based on Pearson Correlation Coefficient. Comput. Model. Eng. Sci. 2023, 138, 1813–1826.
  24. Tieppo, E.; Nievola, J.C.; Barddal, J.P. Adaptive learning on hierarchical data streams using window-weighted Gaussian probabilities. Appl. Soft Comput. 2024, 152, 111271.
  25. Moreno, R.; Nery, A.; Zamora, R.; Lora, Á.; Galán, C. Contribution of urban trees to carbon sequestration and reduction of air pollutants in Lima, Peru. Ecosyst. Serv. 2024, 67, 101618.
  26. Romero, Y.; Diaz, C.; Meldrum, I.; Arias Velasquez, R.; Noel, J. Temporal and spatial analysis of traffic – Related pollutant under the influence of the seasonality and meteorological variables over an urban city in Peru. Heliyon 2020, 6, e04029.
  27. Gond, A.K.; Jamal, A.; Verma, T. Developing a machine learning model using satellite data to predict the Air Quality Index (AQI) over Korba Coalfield, Chhattisgarh (India). Atmos. Pollut. Res. 2025, 16, 102398.
  28. Berrar, D. Bayes’ Theorem and Naive Bayes Classifier. In Encyclopedia of Bioinformatics and Computational Biology; Ranganathan, S., Gribskov, M., Nakai, K., Schönbach, C., Eds.; Academic Press: Oxford, UK, 2019; pp. 403–412. ISBN 978-0-12-811432-2.
  29. Islam, R.; Devnath, M.K.; Samad, M.D.; Jaffrey Al Kadry, S.M. GGNB: Graph-based Gaussian naive Bayes intrusion detection system for CAN bus. Veh. Commun. 2022, 33, 100442.
  30. Arshad, A.; Jabeen, M.; Ubaid, S.; Raza, A.; Abualigah, L.; Aldiabat, K.; Jia, H. A novel ensemble method for enhancing Internet of Things device security against botnet attacks. Decis. Anal. J. 2023, 8, 100307.
  31. Ji, W.; Wang, C.; Chen, H.; Liang, Y.; Wang, S. Predicting post-stroke cognitive impairment using machine learning: A prospective cohort study. J. Stroke Cerebrovasc. Dis. 2023, 32, 107354.
  32. Otsu, T.; Taniguchi, G. Kolmogorov–Smirnov type test for generated variables. Econ. Lett. 2020, 195, 109401.
  33. Just, M.; Schubert, P.; Blatt, J.; Delfmann, P. Data Preprocessing for Cross-System Analysis: The DaProXSA Approach. Procedia Comput. Sci. 2024, 239, 1635–1644.
  34. Lydersen, S. Statistical review: Frequently given comments updated. Ann. Rheum. Dis. 2025, 84, 660–663.
  35. Cabot, J.H.; Ross, E.G. Evaluating prediction model performance. Surgery 2023, 174, 723–726.
  36. Dimitriadis, T.; Gneiting, T.; Jordan, A.I.; Vogel, P. Evaluating probabilistic classifiers: The triptych. Int. J. Forecast. 2024, 40, 1101–1122.
  37. Conciatori, M.; Valletta, A.; Segalini, A. Improving the quality evaluation process of machine learning algorithms applied to landslide time series analysis. Comput. Geosci. 2024, 184, 105531.
  38. Gehringer, C.K.; Martin, G.P.; Van Calster, B.; Hyrich, K.L.; Verstappen, S.M.M.; Sergeant, J.C. How to develop, validate, and update clinical prediction models using multinomial logistic regression. J. Clin. Epidemiol. 2024, 174, 111481.
  39. Decreto Supremo N.° 029-2021-PCM. Available online: https://www.gob.pe/es/institucion/pcm/normas-legales/1705101-029-2021-pcm (accessed on 3 June 2025).
  40. Resolución de Secretaría General N.° 000039-2024-AGN/SG. Available online: https://www.gob.pe/institucion/agn/normas-legales/5371925-000039-2024-agn-sg (accessed on 3 June 2025).
  41. Onah, J.O.; Abdulhamid, S.M.; Abdullahi, M.; Hassan, I.H.; Al-Ghusham, A. Genetic Algorithm based feature selection and Naïve Bayes for anomaly detection in fog computing environment. Mach. Learn. Appl. 2021, 6, 100156.
  42. Paneru, S.; Xu, X.; Wang, J.; Chi, G.; Hu, Y. Assessing building thermal resilience in response to heatwaves through integrating a social vulnerability lens. J. Build. Eng. 2024, 98, 111219.
Figure 1. Confusion matrix by class for PM10.
Figure 2. Reliability curve for PM10.
Figure 3. Confusion matrix by class for PM2.5.
Figure 4. Reliability curve for PM2.5.
Figure 5. Confusion matrix by class for NO2.
Figure 6. Reliability curve for NO2.
Figure 7. Confusion matrix by class for AQI.
Figure 8. Reliability curve for AQI.
Figure 9. Confusion matrix by global model class.
Figure 10. Global GNB model reliability curve.
Table 1. Performance metrics for PM10.
Metric | Obtained Value | Reference Scale | Evaluation
Accuracy | 0.931 | Excellent ≥ 0.80 | Excellent
Precision | 0.933 | Excellent ≥ 0.80 | Excellent
Recall | 0.931 | Excellent ≥ 0.80 | Excellent
F1-Score | 0.934 | Excellent ≥ 0.80 | Excellent
Table 2. Performance metrics for PM2.5.
Metric | Obtained Value | Reference Scale | Evaluation
Accuracy | 0.918 | Excellent ≥ 0.80 | Excellent
Precision | 0.918 | Excellent ≥ 0.80 | Excellent
Recall | 0.918 | Excellent ≥ 0.80 | Excellent
F1-Score | 0.918 | Excellent ≥ 0.80 | Excellent
Table 3. Performance metrics for NO2.
Metric | Obtained Value | Reference Scale | Evaluation
Accuracy | 0.913 | Excellent ≥ 0.80 | Excellent
Precision | 0.913 | Excellent ≥ 0.80 | Excellent
Recall | 0.913 | Excellent ≥ 0.80 | Excellent
F1-Score | 0.913 | Excellent ≥ 0.80 | Excellent
Table 4. Performance metrics for AQI.
Metric | Obtained Value | Reference Scale | Evaluation
Accuracy | 0.931 | Excellent ≥ 0.80 | Excellent
Precision | 0.932 | Excellent ≥ 0.80 | Excellent
Recall | 0.932 | Excellent ≥ 0.80 | Excellent
F1-Score | 0.932 | Excellent ≥ 0.80 | Excellent
Table 5. General performance metrics.
Metric | Obtained Value | Reference Scale | Evaluation
Accuracy | 0.925 | Excellent ≥ 0.80 | Excellent
Precision | 0.925 | Excellent ≥ 0.80 | Excellent
Recall | 0.925 | Excellent ≥ 0.80 | Excellent
F1-Score | 0.925 | Excellent ≥ 0.80 | Excellent
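
Metrics of the kind listed in Tables 1–5 can be derived from a confusion matrix. The sketch below assumes support-weighted averaging across classes, which this text does not confirm as the authors' scheme. Under that assumption recall equals accuracy by construction, which is consistent with Tables 1, 2, 3 and 5. The helper `adjacent_error_share` is a hypothetical way to quantify how many misclassifications fall into a neighboring ordered category. The matrix values are invented for illustration.

```python
def weighted_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    precision = recall = f1 = 0.0
    for i in range(k):
        tp = cm[i][i]
        support = sum(cm[i])                      # true samples of class i
        predicted = sum(cm[r][i] for r in range(k))  # samples predicted as class i
        p_i = tp / predicted if predicted else 0.0
        r_i = tp / support if support else 0.0
        f_i = 2 * p_i * r_i / (p_i + r_i) if (p_i + r_i) else 0.0
        weight = support / total                  # support-weighted average
        precision += weight * p_i
        recall += weight * r_i
        f1 += weight * f_i
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def adjacent_error_share(cm):
    """Fraction of misclassified samples that land exactly one ordered category away."""
    k = len(cm)
    errors = sum(cm[i][j] for i in range(k) for j in range(k) if i != j)
    adjacent = sum(cm[i][j] for i in range(k) for j in range(k) if abs(i - j) == 1)
    return adjacent / errors if errors else 0.0


# Invented 3-class confusion matrix with ordered categories (rows = true, columns = predicted).
cm = [
    [50, 4, 1],
    [4, 30, 2],
    [0, 1, 8],
]

m = weighted_metrics(cm)
share = adjacent_error_share(cm)
print({name: round(value, 3) for name, value in m.items()}, round(share, 3))
```

For ordered air-quality categories, a high adjacent-error share means that most mistakes are off by a single level rather than confusing the extremes.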
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
