A Hybrid Machine Learning Framework for Multi-Pollutant Air Quality Assessment in Urban Environments

Mustafa, Muzzamil; Akhtar, Maaz; Ahmad, Ashfaq; Javaid, Fahad; Haldar, Barun; Nisar, Badil

doi:10.3390/su18042148

by Muzzamil Mustafa ¹, Maaz Akhtar ²,*, Ashfaq Ahmad ³, Fahad Javaid ⁴, Barun Haldar ² and Badil Nisar ⁵

¹ Department of Information Engineering, Computer Science and Mathematics, Università degli Studi dell’Aquila, 67100 L’Aquila, Italy

² Industrial Engineering Department, College of Engineering, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432, Saudi Arabia

³ Department of Artificial Intelligence, University of Management and Technology, Lahore 54000, Pakistan

⁴ Department of Information and Works, Government College Women University Sialkot, Sialkot 51310, Pakistan

⁵ Saed Azka Limited Company, Al-Andalus District, Jeddah 23325, Saudi Arabia

* Author to whom correspondence should be addressed.

Sustainability 2026, 18(4), 2148; https://doi.org/10.3390/su18042148

Submission received: 31 January 2026 / Revised: 16 February 2026 / Accepted: 19 February 2026 / Published: 22 February 2026

(This article belongs to the Special Issue Advances in Sustainable Climate Change Adaptation Research and Technology)

Abstract

Urban air quality assessment is central to environmental sustainability and public health management. This study presents a structured comparative evaluation of Random Forest (RF), Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and Bidirectional LSTM (Bi-LSTM) models for pollutant-driven air quality classification under the Indian National Air Quality Index (NAQI) framework defined by Central Pollution Control Board (CPCB) guidelines. To ensure a fair comparison, multi-pollutant data from Indian urban monitoring stations were preprocessed under a common class-balancing and validation protocol. RF achieved the highest overall accuracy (0.9971) on the held-out set, followed by Bi-LSTM (0.9615), LSTM (0.9495), and SVM (0.9442). Although the ensemble method separated the classes very well, in line with the threshold-based NAQI structure, Bi-LSTM was more stable at boundary-sensitive transitions between adjacent severity classes. Calibration analysis (multiclass Brier score: 0.08) showed consistent probabilistic behavior, and SHAP-based interpretation identified physically meaningful pollutant drivers. The results clarify the suitability of the compared models for structured AQI classification and provide a reproducible assessment framework for NAQI-based evaluation.

Keywords:

air quality index; green smart cities; multi-pollutant modeling; machine learning; time-aware validation; environmental sustainability

1. Introduction

Air quality is an important element of environmental sustainability and a key determinant of human health outcomes worldwide. Rapid industrialization, urbanization, and population growth have degraded ambient air conditions in most regions. Elevated levels of particulate matter (PM) and toxic gases are linked to respiratory illnesses, heart disease, neurological effects, and permanent organ damage [1]. The Air Quality Index (AQI) offers a unified system for communicating air quality and the associated health hazards. Higher AQI values indicate worsening pollution and greater health risk, especially for vulnerable groups [1]. Among the key pollutants, fine particulate matter (PM_2.5) is particularly alarming because it consists of small particles and follows complicated formation mechanisms with nonlinear spatial and temporal dynamics, which makes it difficult to predict with conventional statistical models [2]. Conventional air quality monitoring techniques, typically based on manual data collection and rudimentary statistical analysis, are limited in timeliness, scalability, and predictive capability [3]. Secondary urban pollutants and non-combustion sources also worsen air quality in densely populated areas and, in combination with meteorological factors, create complex multi-pollutant mixtures [4]. To address these issues, machine learning (ML) has emerged as an effective data-driven way to model nonlinear relationships among pollutants, meteorological factors, and emission sources. Combining Internet of Things (IoT) sensing infrastructure with ML systems makes near real-time environmental control achievable and supports decision-making in smart cities [4,5].

Although these technological advances offer a promising route to proactive environmental management, predictive models must be evaluated under a uniform AQI classification system before they can be deployed successfully.

1.1. Environment Changes in Smart Cities

The adoption of communication and sensing technology has turned urban areas into smart cities. Interconnected infrastructures let cities observe environmental and infrastructural conditions in real time, enabling continuous data collection and large-scale data processing for informed decision-making. Together they form a networked system that improves city operations across transportation, energy, population, and environmental safety.

Recent innovations have shown that intelligent systems can make energy consumption, infrastructure, medical services, city administration, and relationships with citizens more efficient. These systems are interconnected and interact to create responsive, adaptive urban environments. Their application has also expanded into the environmental sector, where sensor platforms are being deployed to address pollution more effectively [6].

Urbanization has become a major concern, particularly in rapidly developing urban areas. Growing industrialization and vehicle emissions are the largest sources of high concentrations of particulate matter, ozone, and nitrogen dioxide. These pollutants have been linked to respiratory disease, heart disease, and premature death. People with asthma and other chronic diseases are especially vulnerable when exposed to them [6]. To help overcome these challenges, intelligent environmental modeling procedures are being integrated into city planning and health policy. The accurate prediction, early warning, and actionable pollution-reduction information that such data-intensive systems provide align with the goals of sustainable urban development [7].

1.2. Problem Definition

Rapid population growth and industrialization have exacerbated environmental quality challenges throughout the world. Pollution originates from diverse sources, most notably airborne and industrial emissions. These pollutants are highly harmful to human health and ecosystem sustainability, and environmental management remains a key concern in emerging smart and green city development. Recent research has largely focused on individual pollutant sources without sufficiently examining the joint impact of multi-source air pollutants, which limits predictive and preventive measures. This creates a need for a hybrid model that examines the concentrations of primary pollutants across multiple regions.

This study aims to develop a hybrid predictive framework using machine learning techniques to evaluate the collective effect of urban air pollutants on environmental quality. By modeling pollutant interactions and forecasting their combined impact, the proposed approach supports intelligent emission control and contributes to the realization of green smart cities.

1.3. Research Contributions

While numerous studies have applied machine learning techniques to AQI prediction, inconsistencies remain in evaluation design, metric reporting, and comparative interpretation across classical and deep learning approaches. In this context, the present study provides a structured comparative assessment of Random Forest, Support Vector Machine, LSTM, and Bi-LSTM models for pollutant-driven NAQI category classification under a unified preprocessing and validation framework. Rather than proposing a new algorithm, the contribution lies in clarifying model behavior across varying data complexity groups, reporting macro- and weighted-average metrics alongside overall accuracy, and analyzing boundary-sensitive misclassification patterns inherent to ordinal NAQI categories.

2. Literature Review

This section reviews related work on smart cities and green environments, organized by the main technologies used, including machine learning and deep learning, and notes the limitations of that work. Articles published between 2016 and 2023 were considered; earlier articles were excluded.

Air pollution is a critical environmental challenge aggravated by global industrialization, urbanization, and expanding vehicular networks. Exposure to fine particulate matter, especially PM_2.5, is linked to respiratory disease and long-term health risks, which has prompted a surge of predictive research. Machine learning models have become powerful tools for predicting pollutant levels and outperform traditional approaches because they capture nonlinear and time-dependent behavior.

A recent study showed that Long Short-Term Memory combined with a Stacked Autoencoder (LSTM-SAE) achieved high accuracy, up to 91.22%, for air quality prediction in smart urban systems [8]. Maralinga suggested a general ML framework to predict the AQI in smart cities and explained how ML models adapt to a variety of environmental data and urban settings [9]. Complementary research has highlighted the role of classification algorithms, such as Gaussian Naive Bayes and ensemble approaches, in tracking AQI trends and predicting environmental risk; these perform well with imbalanced datasets and dynamic air quality conditions [10].

Beyond model development, researchers have studied the structural features of urban settings. Compact cities are often promoted as more sustainable than unplanned urban sprawl, but in many cases they have very little green cover. Artmann et al. proposed an indicator-based conceptual framework that integrates smart growth and green infrastructure to enable more ecologically balanced urban planning [11]. The interplay between anthropogenic activity and air quality was also evident during the COVID-19 pandemic, when global mobility restrictions led to significant decreases in pollution levels. Such findings support the need to combine behavioral, policy, and technological interventions in adaptive urban management [11]. At the regional scale, Malaysia presents a case of compounded air pollution problems. Ozone exposure resulting from industrial emissions and traffic pollutants has been associated with elevated risks of hospitalization and death. Comparative evaluations of air quality policies in Malaysia and internationally reveal a lack of research on localized modeling, policy formulation, and predictive environmental analytics [12].

PM_2.5 has become a leading pollutant of concern because of its microscopic size and its physiological consequences. These small particles can bypass the defenses of the upper respiratory tract, reach the alveoli, and enter the bloodstream, where they may cause systemic inflammation, worsen lung disease, and lead to cardiovascular dysfunction. In response, machine learning models have been developed to improve the predictive accuracy and spatio-temporal resolution of air quality assessment. Algorithms such as K-Nearest Neighbors, Random Forest, Gradient Boosting, and AdaBoost show great potential for predicting pollutant concentrations under different environmental and meteorological conditions, providing a data-driven basis for policy action and urban planning [13].

While PM_2.5 has received considerable attention due to its strong association with adverse health outcomes, standardized AQI frameworks such as NAQI incorporate multiple pollutants, including PM₁₀, NO₂, O₃, SO₂, and CO. These pollutants interact with meteorological conditions and emission sources in complex and nonlinear ways, influencing AQI categorization across severity levels. Recent research has increasingly emphasized integrated multi-pollutant modeling rather than reliance on a single dominant contaminant, particularly within structured AQI classification systems.

Real-world applications of deep learning models have also delivered promising results. In a study conducted in Chennai, LSTM and Support Vector Regression were applied to meteorological variables and urban data to predict the AQI, yielding improved AQI categorization and better support for urban planning [14]. Similarly, another study used a hybrid CNN-LSTM to forecast spatio-temporal air pollutant levels using publicly accessible data from Barcelona, Istanbul, and Kocaeli. The model performed well relative to traditional ML approaches, particularly in densely populated regions [15].

Supervised learning algorithms are still used in many scenarios for AQI prediction and pollutant classification. Models trained on measurements of NO₂, carbon monoxide, and sulfur dioxide have shown good predictive results. The root mean square error (RMSE) is often used as the evaluation metric, with lower RMSE indicating more accurate estimation of the Air Quality Index for various pollutant groups [16].

Smart cities aim to improve quality of life by incorporating contemporary technologies that support environmental sustainability, energy savings, and smart governance. The deployment of intelligent systems has enabled real-time environmental surveillance, resource optimization, and adaptive policy formation. Reinforcement learning and other machine learning methods are now used to support dynamic decision-making in areas such as energy distribution, traffic flow management, and emission mitigation strategies [17].

The future of smart cities depends on how ICT systems are linked to traditional infrastructure to create adaptive and inclusive governance models. Major objectives defined in research agendas include promoting innovation, urban equity, and citizen involvement in planning. These goals are underpinned by six central research missions that aim to transform governance, infrastructural sustainability, and citizen involvement in urban areas [18]. As the number of smart devices grows, so does the amount of environmental data they produce. Deep learning models are now commonly used to extract actionable information from these high-volume data streams, for example on traffic congestion. A recent framework linked human traffic volume to ambient air pollution concentration by applying seven regression models and comparing their predictive quality. Such hybrid solutions provide effective instruments for environmental planning and congestion reduction [19].

Significance of Multi-Pollutant Consideration in Air Quality Prediction

Air quality is determined by a complex mixture of pollutants that differ in chemical behavior, emission sources, and atmospheric lifetime. Single-pollutant models simplify such complex cases but can yield incorrect predictions and misguided interventions. For example, PM_2.5 (fine particulate matter), NO₂, O₃, SO₂, CO, and volatile organic compounds (VOCs) often interact in the air, leading to the formation of secondary pollutants such as smog and ground-level ozone.

Consequently, multi-pollutant modeling is necessary for forecasting air quality. It enables the identification of synergistic and antagonistic interactions between pollutants, captures the effect of compounds on the population and climate, and supports stronger pollution-reduction policies. The hybrid framework proposed in the current research considers various sources of pollution, especially air emissions, to enhance the accuracy and resolution of AQI predictions in smart cities. Figure 1 indicates that existing AQI classification work underreports temporal robustness, calibration, and ordinal error structure.

3. Methodology

This section describes the proposed hybrid machine learning workflow for AQI category prediction under the Central Pollution Control Board (CPCB) National Air Quality Index (NAQI) guidelines. The overall experimental pipeline, including data preprocessing, validation design, model training, hyperparameter tuning, and final evaluation, is summarized in Figure 2.

In Figure 2, the decision node labeled “Meets Threshold?” refers to validation performance exceeding predefined criteria during hyperparameter tuning. Specifically, models were required to satisfy target macro-F1 performance and acceptable calibration behavior on the validation split. If the performance did not meet these criteria, hyperparameters were iteratively adjusted before proceeding to the final evaluation on the held-out test set. This threshold mechanism ensured consistent model selection under the unified validation framework. The methodology encompasses the following stages.
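The “Meets Threshold?” check described above can be illustrated with a minimal Python sketch. The threshold values (`min_macro_f1`, `max_brier`) and all function names are hypothetical, since the text does not state the exact criteria; the multiclass Brier score is taken here as the mean squared distance between predicted probabilities and one-hot labels, which is one common convention.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

def multiclass_brier(y_true, proba):
    """Mean over samples of the squared error against one-hot labels."""
    onehot = np.eye(proba.shape[1])[y_true]
    return float(np.mean(np.sum((proba - onehot) ** 2, axis=1)))

def meets_threshold(y_true, proba, min_macro_f1=0.90, max_brier=0.10):
    """Return (passed, macro_f1, brier); both cut-offs are assumed values."""
    y_pred = proba.argmax(axis=1)
    f1 = macro_f1(y_true, y_pred, proba.shape[1])
    brier = multiclass_brier(y_true, proba)
    return bool(f1 >= min_macro_f1 and brier <= max_brier), f1, brier

# Toy validation split with three classes (illustrative values only).
y_val = np.array([0, 1, 2, 2])
proba_val = np.array([
    [0.9, 0.05, 0.05],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.2, 0.6],
])
passed, f1_val, brier_val = meets_threshold(y_val, proba_val)
```

In a tuning loop, a configuration for which `passed` is false would be adjusted and re-evaluated on the validation split before any test-set evaluation.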

3.1. Data Preparation and Modeling

This study employs an extensive dataset of both real-time and historical measurements of urban air pollution. The most important variables are PM_2.5, PM₁₀, NO₂, SO₂, and CO concentrations. Meteorological characteristics (temperature, humidity, wind speed) and geospatial information are also included to improve predictive performance. Preprocessing involves normalization, imputation of missing values, and noise filtering. Pollutant thresholds follow the CPCB National Air Quality Index (NAQI), which divides pollution impact into six categories: Good, Satisfactory, Moderate, Poor, Very Poor, and Severe.

The data are divided into training and testing sets in a 2:1 ratio. All models in this paper are supervised classification algorithms that predict discrete AQI categories. No regression-based prediction of numerical AQI values is performed. Accordingly, classification-based learning strategies and evaluation measures are adopted throughout the experimental analysis.

3.2. Dataset Description and Sample Data

This research uses a publicly available Kaggle dataset, Air Quality Data in India, which can be accessed at https://www.kaggle.com/rohanrao/air-quality-data-in-india (accessed on 30 September 2025). The dataset covers air quality in 26 cities in India between 2015 and 2020, giving broad spatial and temporal coverage. It contains more than 29,500 samples and 16 features, including pollutant concentrations, meteorological variables, and AQI values, for use in supervised machine learning.

The variables in the dataset include the main pollutants, PM_2.5, PM₁₀, NO, NO₂, NOx, SO₂, CO, O₃, and NH₃, as well as the volatile organic compounds benzene, toluene, and xylene. Each record is labeled with its AQI value and its category in the AQI_Bucket column. AQI_Bucket comprises six classes defined by CPCB National Air Quality Index (NAQI) standards: Good, Satisfactory, Moderate, Poor, Very Poor, and Severe. These labels serve as the output classes for the classification task in this work. Extensive preprocessing was carried out to ensure data quality and a homogeneous representation. Missing numeric values were filled by mean substitution, and rows with null AQI values were removed. Label encoding was used to convert categorical variables, e.g., “City” and “AQI_Bucket”, into numeric form for model training. The initial dataset was also imbalanced across classes; to address this, the SMOTE (Synthetic Minority Oversampling Technique) algorithm was applied, yielding a balanced dataset suitable for machine learning. After preprocessing, 24,850 clean and balanced records were retained for training and evaluation.
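The cleaning steps described above (removing rows with null AQI, mean substitution, and label encoding) can be sketched in Python with pandas. This is an illustration on toy values, not the exact pipeline: the column names are assumed to follow the Kaggle file, the order of the steps and the ordinal encoding of AQI_Bucket are our assumptions, and the SMOTE balancing step is omitted.

```python
import numpy as np
import pandas as pd

# Toy records; values are illustrative only.
raw = pd.DataFrame({
    "City": ["Ahmedabad", "Delhi", "Delhi", "Ahmedabad"],
    "PM2.5": [80.0, np.nan, 120.0, 60.0],
    "NO2": [30.0, 40.0, np.nan, 20.0],
    "AQI": [150.0, 210.0, np.nan, 95.0],
    "AQI_Bucket": ["Moderate", "Poor", None, "Satisfactory"],
})

# Ordinal order of the six NAQI categories.
NAQI_ORDER = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

def preprocess(df):
    # 1. Remove rows whose AQI is null (they carry no usable label).
    out = df.dropna(subset=["AQI"]).copy()
    # 2. Fill missing pollutant values with the column mean.
    pollutant_cols = [c for c in out.columns if c not in ("City", "AQI", "AQI_Bucket")]
    out[pollutant_cols] = out[pollutant_cols].fillna(out[pollutant_cols].mean())
    # 3. Encode categorical columns as integers.
    out["City_code"] = out["City"].astype("category").cat.codes
    out["AQI_Bucket_code"] = out["AQI_Bucket"].map(
        {name: i for i, name in enumerate(NAQI_ORDER)}
    )
    return out

clean = preprocess(raw)
```

Encoding AQI_Bucket by severity order rather than alphabetically keeps adjacent integer codes aligned with adjacent NAQI categories, which is convenient when inspecting boundary errors.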

Visual exploration of the dataset revealed that cities such as Ahmedabad consistently exhibited high pollutant concentrations, particularly PM_2.5, NO₂, and SO₂, often leading to AQI classifications in the Very Poor or Severe categories. These trends are illustrated later in the paper with a sample portion of the dataset focusing on the pollutant levels and AQI categories observed in Ahmedabad.

The processed working dataset consists of records from various Indian cities (e.g., Ahmedabad, Visakhapatnam), with predictors comprising pollutants such as PM_2.5, PM₁₀, NO₂, NH₃, CO, SO₂, O₃, benzene, toluene, and xylene, together with the available meteorology (temperature, relative humidity, wind). Variables were aligned to a common observation time; short gaps in the records were filled by time-aware interpolation, long gaps by model-based imputation (trained on the training window only), and the features were normalized with training-set statistics. To enhance robustness, we screened for unit inconsistencies and sensor spikes, flagged extreme sensor values, and winsorized or masked the affected entries; duplicate and clearly invalid rows were removed. To avoid label leakage, the numeric AQI was not included as a model input, and the model target was AQI_Bucket (NAQI categories only).
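The gap handling and normalization described above can be sketched as follows. The cut-off separating short from long gaps (`MAX_GAP`) is a hypothetical value, the data are toy values, and the model-based imputation of long gaps is not shown; long gaps are simply left missing here.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2018-12-28", periods=8, freq="D")
pm25 = pd.Series(
    [50.0, np.nan, 70.0, np.nan, np.nan, np.nan, np.nan, 120.0],
    index=dates, name="PM2.5",
)

MAX_GAP = 2  # hypothetical: gaps of up to 2 consecutive days count as "short"

def fill_short_gaps(s, max_gap):
    """Time-interpolate only NaN runs no longer than max_gap observations."""
    is_na = s.isna()
    run_id = is_na.ne(is_na.shift()).cumsum()           # label consecutive runs
    run_len = is_na.groupby(run_id).transform("sum")    # NaN count per run
    filled = s.interpolate(method="time", limit_area="inside")
    # Keep observed values and short-gap fills; restore NaN for long gaps.
    return filled.where(~is_na | (run_len <= max_gap))

filled = fill_short_gaps(pm25, MAX_GAP)

# Standardize with statistics from the training window only (here: up to 2018).
train_mask = filled.index.year <= 2018
mu = filled[train_mask].mean()
sigma = filled[train_mask].std()
z = (filled - mu) / sigma
```

Computing `mu` and `sigma` on the training window alone keeps later observations from influencing the scaling applied to earlier ones.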

Missing values in pollutant variables were limited to a small proportion of the dataset (less than 3%). Mean imputation was applied to maintain dataset continuity while preserving overall distributional characteristics. Since the proportion of missing data was minimal, more complex imputation strategies were not required, and distributional stability was verified through summary statistics before and after preprocessing.

3.3. Pairwise Relationship Between Pollutants

Figure 3a,b show Seaborn pair plots depicting how the key pollutants in the dataset relate to one another, with color coding for the AQI_Bucket categories. These plots give a visual impression of how pollutants relate to each other and how they influence the different levels of air quality severity. PM_2.5 and PM₁₀ show a strong positive correlation, with dense clusters trending upward, which is reasonable because they are commonly produced by the same sources (vehicles and industries). Similarly, the relationships among SO₂, NO₂, NOx, and NO appear approximately linear, which further confirms the interdependency of the nitrogen oxides.

The scatter plots of CO, SO₂, and O₃ show more dispersed relationships, indicating only moderate correlation with PM. The VOCs benzene, toluene, and xylene are less linearly associated with the major pollutants but nevertheless cluster in the higher AQI buckets (Very Poor and Severe), which points to their cumulative role in urban air quality degradation. The color distribution of the plots shows how combinations of pollutants contribute to AQI severity. Severe or Very Poor records tend to appear at the high ends of PM_2.5, PM₁₀, and NO, which may be the major drivers of very high AQI scores. These observations support the multi-pollutant modeling approach: air quality severity cannot be ascribed to a single pollutant and instead results from the interaction of multiple key factors.

3.4. NAQI Air Quality Guidelines and Classification

This study adopts the Indian National Air Quality Index (NAQI) as defined by the Central Pollution Control Board (CPCB). The NAQI comprises six health-risk categories: Good (0–50), Satisfactory (51–100), Moderate (101–200), Poor (201–300), Very Poor (301–400), and Severe (401–500). Pollutant-specific concentration breakpoints are used to compute sub-indices and the overall AQI category. All AQI_Bucket labels in this work are mapped strictly to the CPCB/NAQI categories and thresholds. WHO air-quality values are referenced only as concentration guidelines and are not used as an AQI scheme; the US EPA AQI is not used. The pollutant breakpoints employed for labeling are shown in Table 1, and their exact sources are given in Appendix A.
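The category ranges listed above can be expressed as a small lookup, sketched here in Python. The function name is hypothetical, and two edge-case choices are our assumptions: fractional values between bands (e.g., 50.5) are assigned to the higher category, and values above 500 are treated as Severe. The per-pollutant sub-index computation from the Table 1 breakpoints is not reproduced.

```python
from bisect import bisect_left

# Upper bound of each NAQI category, in increasing order of severity.
NAQI_UPPER_BOUNDS = [50, 100, 200, 300, 400, 500]
NAQI_LABELS = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

def naqi_bucket(aqi):
    """Map a numeric AQI value to its NAQI category label."""
    if aqi < 0:
        raise ValueError("AQI must be non-negative")
    # bisect_left finds the first category whose upper bound is >= aqi.
    idx = bisect_left(NAQI_UPPER_BOUNDS, aqi)
    return NAQI_LABELS[min(idx, len(NAQI_LABELS) - 1)]

labels = [naqi_bucket(v) for v in (42, 100, 101, 250, 350, 480)]
```

Such a mapping is only used to derive or verify labels; the numeric AQI itself is not a model input.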

All AQI figures in this study are presented with reference to the CPCB/NAQI category thresholds. In Figure 3, we overlay shaded horizontal bands corresponding to NAQI categories (Good: 0–50; Satisfactory: 51–100; Moderate: 101–200; Poor: 201–300; Very Poor: 301–400; Severe: 401–500) to visually anchor air-quality levels. These breakpoints match the CPCB/NAQI standard used for labeling and classification (Table 1). For city-level temporal trend analysis, monthly and yearly aggregated AQI values were examined to identify seasonal and long-term patterns across cities. The resulting visualizations and trend comparisons are presented in the Results section.

The AQI color-coded classification system, summarized in Table 2, not only communicates pollution levels but also serves as a public health advisory tool. Its intuitive scale, from Green (Good) to Maroon (Severe), enables quick understanding of potential health risks. This system is crucial for decision-making, especially for sensitive groups and urban populations, as it translates complex pollutant data into actionable information that guides personal behavior and regulatory responses.

3.5. Feature Set and Leakage Control

The original dataset (India, 2015–2020; Kaggle) contains raw pollutant concentrations (PM_2.5, PM₁₀, NO, NO₂, NOx, CO, SO₂, O₃, NH₃) and meteorological variables (temperature, relative humidity, wind speed, etc.), as well as pre-computed AQI and AQI_Bucket labels. Since AQI_Bucket is derived directly from AQI values using CPCB breakpoints, including AQI as a feature would constitute label leakage. To prevent this, we excluded the numeric AQI field from all model inputs. Only raw pollutant concentrations and meteorological variables were used as predictors.

The classification target is the AQI_Bucket label (Good, Satisfactory, Moderate, Poor, Very Poor, and Severe). The final feature set includes:

Pollutants: PM_2.5, PM₁₀, NO, NO₂, NOx, CO, SO₂, O₃, NH₃.
Meteorological variables: temperature, relative humidity, wind speed (and others as available per station).

No derived or label fields were included as inputs.
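The leakage control above amounts to building the predictor matrix from an explicit allow-list. A minimal sketch follows; the pollutant column names are assumed to match the Kaggle file, and the meteorological column names are hypothetical placeholders.

```python
import pandas as pd

POLLUTANTS = ["PM2.5", "PM10", "NO", "NO2", "NOx", "CO", "SO2", "O3", "NH3"]
METEOROLOGY = ["Temperature", "RelativeHumidity", "WindSpeed"]  # hypothetical names
LEAKY_FIELDS = ["AQI"]   # numeric AQI determines AQI_Bucket, so it must not be an input
TARGET = "AQI_Bucket"

def split_features_target(df):
    """Return (X, y) using only allow-listed predictors that exist in df."""
    features = [c for c in POLLUTANTS + METEOROLOGY if c in df.columns]
    # Guard against an allow-list that was edited to include a label field.
    leaked = set(features) & set(LEAKY_FIELDS + [TARGET])
    if leaked:
        raise ValueError(f"label leakage via columns: {sorted(leaked)}")
    return df[features], df[TARGET]

sample = pd.DataFrame({
    "City": ["Delhi"], "PM2.5": [10.0], "NO2": [5.0],
    "AQI": [40.0], "AQI_Bucket": ["Good"],
})
X_sample, y_sample = split_features_target(sample)
```

An allow-list is preferred here over dropping known leaky columns, because any new derived field added to the data later is excluded by default.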

3.6. Pollutant Correlation Analysis

To examine inter-feature relationships prior to model training, a correlation analysis was conducted among the selected pollutants and the AQI/NAQI target variable. The resulting heatmap is shown in Figure 3. Figure 3a presents pairwise correlations among pollutant features, while Figure 3b illustrates correlations between key pollutants and the AQI/NAQI category. These correlations reflect statistical co-occurrence patterns and do not imply causal emission-sector attribution.
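As an illustration of this kind of analysis, the sketch below computes a pairwise Pearson matrix among pollutants and a rank (Spearman) correlation between each pollutant and an ordinally encoded category. The data are toy values, and the choice of Spearman for the ordinal target is our assumption; the text does not state which coefficient was used.

```python
import pandas as pd

# Toy data; bucket_code is an ordinal encoding (0 = Good ... 5 = Severe).
toy = pd.DataFrame({
    "PM2.5": [20.0, 45.0, 90.0, 150.0, 260.0],
    "PM10": [40.0, 80.0, 160.0, 250.0, 400.0],
    "O3": [60.0, 30.0, 50.0, 20.0, 40.0],
    "bucket_code": [0, 1, 2, 3, 4],
})
pollutant_cols = ["PM2.5", "PM10", "O3"]

# (a) Pairwise linear correlation among pollutant features.
pollutant_corr = toy[pollutant_cols].corr(method="pearson")

# (b) Rank correlation between each pollutant and the ordinal category.
target_corr = toy.corr(method="spearman")["bucket_code"].drop("bucket_code")
```

A heatmap of `pollutant_corr` can then be drawn with any plotting library; the coefficients describe co-occurrence only and carry no causal interpretation.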

3.7. City-Level Pollutant Concentration and Feature Grouping

To examine pollutant behavior at the city level and support structured feature engineering, major pollutants were analyzed based on their concentration profiles and inter-correlations. Twelve key pollutants were selected for modeling: PM_2.5, PM₁₀, NO, NO₂, NOx, CO, SO₂, O₃, NH₃, benzene, toluene, and xylene. These pollutants were chosen due to their established contribution to air quality degradation and their relevance within the NAQI framework.

For modeling purposes, the selected features were categorized into three groups based on emission source similarity and correlation structure. The first group (PM_2.5, PM₁₀, NO, NO_x) represents pollutants primarily associated with combustion and transport-related emissions. The second group (NH₃, CO, O₃) includes gases commonly linked to agricultural and residential activities. The third group (benzene, toluene, xylene) comprises volatile organic compounds typically emitted from industrial solvents and solid waste processes.

Each pollutant group was used as input for training and evaluating Random Forest, Support Vector Machine (SVM), LSTM, and Bi-LSTM models. This grouping strategy enabled assessment of model behavior under source-oriented feature subsets while maintaining consistency within the unified validation framework.
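The group-wise evaluation can be sketched as a loop over named feature subsets. The example below uses scikit-learn's `RandomForestClassifier` on synthetic data with a toy binary label, purely to show the mechanics; the group keys, column names, and hyperparameters are hypothetical, and the SVM, LSTM, and Bi-LSTM models are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Feature groups as described in the text (column names assumed).
FEATURE_GROUPS = {
    "combustion_transport": ["PM2.5", "PM10", "NO", "NOx"],
    "agricultural_residential": ["NH3", "CO", "O3"],
    "vocs": ["Benzene", "Toluene", "Xylene"],
}

# Synthetic stand-in data: the toy label depends only on PM2.5.
rng = np.random.default_rng(0)
all_cols = [c for cols in FEATURE_GROUPS.values() for c in cols]
X = pd.DataFrame(rng.random((200, len(all_cols))), columns=all_cols)
y = (X["PM2.5"] > 0.5).astype(int)
X_train, X_test = X.iloc[:150], X.iloc[150:]
y_train, y_test = y.iloc[:150], y.iloc[150:]

def evaluate_groups(make_model):
    """Fit a fresh model on each feature group and record test accuracy."""
    scores = {}
    for name, cols in FEATURE_GROUPS.items():
        model = make_model()
        model.fit(X_train[cols], y_train)
        scores[name] = model.score(X_test[cols], y_test)
    return scores

group_scores = evaluate_groups(
    lambda: RandomForestClassifier(n_estimators=50, random_state=0)
)
```

Passing a model factory rather than a fitted model ensures that each group is evaluated with an independently initialized estimator under the same split.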

3.8. Meteorological Features and Preprocessing

In addition to pollutant concentrations, the models use meteorological variables available in the dataset: temperature, relative humidity, and wind speed (and wind direction where present). All features were standardized (train statistics only) and aligned at the observation timestamp; the AQI numeric was excluded to prevent label leakage. To capture short-term dynamics, we included lagged meteorology (t − 1, t − 3 h or previous day for daily models) where available.
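Lagged meteorological features of this kind can be built with a per-city shift, as in the following sketch. The column names are assumptions, and the lag is expressed in rows, so it equals "previous day" only when each city has one record per day with no missing dates.

```python
import pandas as pd

obs = pd.DataFrame({
    "City": ["A", "A", "A", "B", "B", "B"],
    "Date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"] * 2),
    "Temperature": [30.0, 31.0, 32.0, 20.0, 21.0, 22.0],
})

def add_lags(df, cols, lags, group_col="City", time_col="Date"):
    """Add <col>_lag<k> columns, shifting within each city in time order."""
    out = df.sort_values([group_col, time_col]).reset_index(drop=True)
    for col in cols:
        for lag in lags:
            # Grouping prevents values from one city leaking into another.
            out[f"{col}_lag{lag}"] = out.groupby(group_col)[col].shift(lag)
    return out

lagged = add_lags(obs, ["Temperature"], lags=[1])
```

The first record of each city has no predecessor, so its lagged value is missing and must be imputed or dropped before training.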

3.9. Time-Aware Train/Validation/Test Splitting

To avoid temporal leakage, we adopted a time-blocked partition by year: train = 2015–2018, validation = 2019, test = 2020. Models were fitted on the training block and tuned on the validation block, and all final numbers are reported on the held-out 2020 test block. For robustness, we additionally report per-city walk-forward (rolling-origin) validation and a random 5-fold CV baseline for comparability with prior work; however, temporal results are the primary benchmark for deployment.

Evaluation metrics: Beyond overall accuracy, we report per-class precision/recall/F1 and support, macro-F1 and weighted-F1, balanced accuracy, Cohen’s κ, and the Matthews correlation coefficient (MCC). For probabilistic outputs, we assess calibration (reliability curves; Brier score) and discrimination (one-vs.-rest AUROC and AUPRC). We also provide city- and season-conditioned performance and confusion matrices, highlighting misclassification between adjacent ordinal classes (e.g., Very Poor vs. Severe).
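The time-blocked partition by year (2015–2018 for training, 2019 for validation, 2020 for testing) can be sketched in a few lines of pandas. The `Date` column name and the function name are assumptions, and the toy frame holds one record per year for illustration.

```python
import pandas as pd

def time_blocked_split(df, date_col="Date"):
    """Split by calendar year: train 2015-2018, validation 2019, test 2020."""
    year = pd.to_datetime(df[date_col]).dt.year
    train = df[(year >= 2015) & (year <= 2018)]
    val = df[year == 2019]
    test = df[year == 2020]
    return train, val, test

toy_records = pd.DataFrame({
    "Date": ["2015-06-01", "2016-06-01", "2017-06-01",
             "2018-06-01", "2019-06-01", "2020-06-01"],
    "PM2.5": [90.0, 80.0, 85.0, 75.0, 70.0, 55.0],
})
train_df, val_df, test_df = time_blocked_split(toy_records)
```

Because every validation and test record is later than every training record, any scaler, imputer, or resampler should be fitted on `train_df` only.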

4. Results

4.1. Exploratory Data Analysis (EDA)

4.1.1. Yearly Variation in AQI

To investigate how air quality varied over time, annual AQI distributions for 2015–2020 are summarized using boxplots. For this purpose, AQI values were randomly sampled from the dataset for each year, capturing a representative distribution of air quality levels. These values were compiled into a new DataFrame, and a boxplot was generated using the Seaborn visualization library (version 0.13.2) to illustrate the year-wise variation in AQI. As shown in Figure 4, the x-axis denotes the years, while the y-axis represents the sampled AQI values. The boxplots provide a statistical summary of the AQI distribution each year, with the central line indicating the median, the box covering the interquartile range (IQR), and the whiskers extending to reflect overall data spread. A comparison of these distributions suggests relatively stable AQI behavior across most years, with 2017 standing out due to a higher median and a wider IQR, indicating more frequent occurrences of poor air quality during that year.

This visualization approach is particularly useful for identifying temporal patterns, potential outlier years, and long-term trends in air quality. It supports a broader understanding of how pollution levels fluctuate annually and complements more detailed pollutant-specific or location-based analyses.

The highest mean AQI is observed in 2015, indicating relatively poor air quality conditions during that year. Although there is a sharp decline in 2016, the mean AQI rises again in 2017, suggesting a temporary regression in environmental control or seasonal effects. From 2017 onward, the mean AQI values exhibit a gradual downward trend, implying progressive improvement in air quality over the subsequent years, consistent with observed regulatory changes during that period. The spike in 2017 and consistently high values in 2015 stand out as critical points that may warrant further investigation into regional emission sources or climatic anomalies during those periods. Overall, this analysis provides useful temporal insight into urban air quality dynamics and reinforces the importance of year-wise environmental monitoring for policymaking and predictive modeling.

To understand long-term changes in air quality, the mean concentrations of major pollutants were aggregated annually for 2015–2020 and visualized through a multi-line chart, as shown in Figure 4. Each line represents the average yearly concentration of a specific pollutant, enabling a comparative analysis of pollution trends over time. The chart reveals that PM₁₀ and PM₂.₅ consistently recorded the highest concentration levels throughout the observed period, with PM₁₀ showing a significant decline from around 150 µg/m³ in 2015 to below 110 µg/m³ in 2020. A similar downward trend is observed for NO₂ and NOₓ, reflecting potential improvements in vehicular emission controls or regulatory measures during this timeframe.

Interestingly, pollutants such as SO₂ and CO display relatively stable concentration levels, indicating minimal fluctuation across years. Similarly, benzene, toluene, and xylene, commonly associated with industrial solvents and vehicular exhaust, exhibit only marginal changes, suggesting persistent low-level emission sources. The year 2020, in particular, marks the lowest concentration levels for most pollutants relative to prior years, which may reflect the impact of mobility restrictions and reduced industrial activity during the COVID-19 pandemic.

Yearly behavior is summarized using actual aggregated statistics from the daily time series. For each calendar year (2015–2020), we compute the median AQI, the IQR, and a 95% CI of the median via bootstrap. These statistics show elevated medians in 2015–2017, followed by improvements post-2017 and a further drop in 2020. The visualization (median with IQR band and CI whiskers) is provided in Figure 4a.

As shown in Figure 4a–d, yearly AQI behavior is summarized using actual aggregated statistics rather than simulated or randomly sampled values. For each year (2015–2020), we computed the median AQI, interquartile range (IQR), and 95% confidence intervals from the daily time series. These statistics provide a more reliable picture of temporal variability and long-term trends. Notably, median AQI values were highest during 2015–2017, reflecting multiple seasonal pollution peaks, and declined significantly after 2017, consistent with national emission-control efforts and policy interventions. The 2020 lockdown period shows a distinct drop in median and IQR across all cities.
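The yearly statistics described above (median, IQR, and a bootstrap 95% CI of the median) can be sketched as follows. The text does not specify the bootstrap variant, so a percentile bootstrap is assumed here; the function names, the `Date` and `AQI` column names, and the number of resamples are illustrative choices.

```python
import numpy as np
import pandas as pd


def median_iqr_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Return the median, IQR, and a percentile-bootstrap CI for the median."""
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]  # drop missing AQI values
    rng = np.random.default_rng(seed)
    # Resample with replacement and recompute the median for each resample.
    idx = rng.integers(0, len(x), size=(n_boot, len(x)))
    boot_medians = np.median(x[idx], axis=1)
    q1, q3 = np.percentile(x, [25, 75])
    low, high = np.percentile(boot_medians, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return {
        "median": float(np.median(x)),
        "iqr": float(q3 - q1),
        "ci_low": float(low),
        "ci_high": float(high),
    }


def yearly_summary(df, date_col="Date", aqi_col="AQI"):
    """Apply median_iqr_ci to each calendar year of a daily AQI series."""
    years = pd.to_datetime(df[date_col]).dt.year
    rows = {year: median_iqr_ci(group[aqi_col]) for year, group in df.groupby(years)}
    return pd.DataFrame.from_dict(rows, orient="index")
```

The resulting table has one row per year and can be plotted directly as a median line with an IQR band and CI whiskers.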

4.1.2. Pollutant Correlation Matrix

The heatmap in Figure 4 represents the correlation matrix between major air pollutants, providing valuable insight into how pollutant concentrations vary in relation to one another. Correlation values range from −1 to 1, with darker red indicating stronger positive relationships and deep blue representing weak or no correlation.

The strongest correlations are observed between PM₂.₅ and PM₁₀ (r = 0.68), and between NO, NO₂, and NOₓ, particularly NO and NOₓ (r = 0.81), which is expected due to their shared emission sources from vehicular and industrial combustion. Similarly, PM₂.₅ exhibits moderate correlations with NO (0.57) and NOₓ (0.53), indicating that particulate and gaseous pollutants often co-occur in polluted urban environments. NO₂ and SO₂ also show a moderate positive relationship (r = 0.48), suggesting simultaneous emission in certain industrial processes. In contrast, pollutants such as NH₃, O₃, and xylene display relatively weak correlations with most other variables, suggesting more independent behavior or source specificity. Notably, benzene and toluene exhibit a strong correlation (r = 0.72), reflecting their typical co-emission from vehicle exhaust and industrial solvents.

Contrary to expectations, O₃ shows only weak correlations with NOₓ compounds. This is consistent with the complex, nonlinear formation of ozone through photochemical reactions, which depends on atmospheric conditions and precursor interactions.

4.2. Range of AQI Values in Different Cities of India

The analysis of AQI distribution across major Indian cities, as shown in Figure 4, reveals significant variation in pollution levels nationwide. Ahmedabad stands out, with consistently high AQI values, indicating persistent and severe air quality issues.

Figure 4 highlights the need for region-specific pollution control strategies, especially in cities with consistently high or highly variable AQI levels. It also justifies the use of a multi-pollutant machine learning model to capture and predict these complex urban air quality dynamics.

4.3. Comparative Model Performance

Among all the models tested, the Random Forest algorithm achieved the highest training accuracy at 100%, followed by LSTM (99.95%), Bi-LSTM (99.89%), and SVM (98.14%). On the validation set, Random Forest also led with an accuracy of 99.7%, while Bi-LSTM reached 99.67%, LSTM 99.61%, and SVM 98.10%. Random Forest likewise achieved the highest overall test accuracy among the evaluated models. This observation is supported by confusion matrices and model behavior analysis, shown in Figure 5. The class labels 0 through 5 represent categorical pollution severity levels, ranging from Good to Severe, as defined in Section 3.1. Performance evaluation is conducted exclusively using classification metrics, including accuracy, precision, recall, F1-score, and confusion matrices, which are appropriate for multi-class AQI category prediction tasks.

4.3.1. Random Forest Performance

The Random Forest model was configured with 100 decision trees and a random seed of 42. As summarized in Figure 5c and Table 3, the error structure shows strong diagonal dominance, with most misclassifications occurring between adjacent NAQI categories. However, some validation misclassifications, e.g., 91 instances of class 5 predicted as class 3, highlight areas where model generalization could be improved.
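A configuration matching the two reported settings (100 trees, seed 42) might look as follows in scikit-learn. The library choice is an assumption, all other hyperparameters are left at their defaults, and the features and labels are synthetic stand-ins rather than the study's pollutant data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in: 12 numeric features and six ordinal classes (labels 0..5)
# derived from thresholds on a linear combination of two features.
X = rng.normal(size=(600, 12))
y = np.digitize(X[:, 0] + 0.5 * X[:, 1], bins=[-1.5, -0.5, 0.0, 0.5, 1.5])

# Settings reported in the text: 100 trees and a random seed of 42.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X[:480], y[:480])

val_accuracy = rf.score(X[480:], y[480:])  # accuracy on the held-back rows
```

Fixing `random_state` makes the bootstrap samples and feature subsampling repeatable, which matters when confusion-matrix counts are compared across runs.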

4.3.2. Support Vector Machine Results

SVM was trained using GridSearchCV to explore a hyperparameter grid that included the regularization parameter ‘C’, kernel types (‘linear’, ‘rbf’), and gamma values. The best configuration identified was C = 10, kernel = ‘rbf’, and gamma = 0.001, which produced a validation accuracy of 98.10%. Despite being outperformed by deep learning models, SVM maintained reasonable performance and faster training times, especially in low-resource settings.
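The grid search described above can be sketched with scikit-learn as follows. Only the best configuration (C = 10, RBF kernel, gamma = 0.001) is reported in the text, so the candidate values below are assumed for illustration; the data are synthetic, and the plain 3-fold split is a simplification of the validation protocol.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = np.digitize(X[:, 0], bins=[-0.5, 0.5])  # three synthetic classes

# Assumed candidate values; gamma is ignored by the linear kernel.
param_grid = {
    "svc__C": [0.1, 1, 10],
    "svc__kernel": ["linear", "rbf"],
    "svc__gamma": [0.001, 0.01, 0.1],
}

# Standardize features inside the pipeline so scaling is refit on each fold.
pipeline = make_pipeline(StandardScaler(), SVC())
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X, y)

best = search.best_params_
```

To tune on a fixed validation year rather than on shuffled folds, the `cv` argument can be given a `PredefinedSplit` that marks the 2019 rows as the single validation fold.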

4.3.3. LSTM Model Evaluation

The LSTM model employed three stacked LSTM layers, each followed by a dropout layer with a rate of 0.2 to prevent overfitting. A dense output layer with six units and softmax activation was used for multi-class prediction. The model was trained using the Adam optimizer (learning rate = 0.001), over 35 epochs, with a batch size of 32. The class-wise confusion pattern is summarized in Figure 5c, showing that most errors occur between adjacent high-pollution categories. Confusion matrices indicate consistent performance, though class 5 exhibited moderate confusion with class 3, as evidenced by 27 misclassified instances.

4.3.4. Bi-LSTM Model Evaluation

The Bi-LSTM model used bidirectional LSTM layers, each paired with a dropout rate of 0.2. The Bi-LSTM outputs were flattened and passed through three dense layers with 512, 256, and 128 units, using ReLU activations. The final output layer again used softmax for six-class classification. Training was performed under the same conditions as LSTM. As reflected in Table 3, Bi-LSTM achieved strong generalization, misclassifying fewer samples than LSTM; e.g., only 35 instances of class 5 were misclassified as class 3 in the validation set, compared to 42 in LSTM.

4.4. Evaluation Using Confusion Matrices

The confusion matrices provided in Figure 5c offer a visual assessment of each model’s prediction accuracy across classes. While all models demonstrated strong performance, Random Forest had the most dominant diagonal across matrices, indicating the highest classification precision. LSTM and Bi-LSTM exhibited minor off-diagonal errors, mostly between classes 3, 4, and 5, which are often close in pollutant concentration thresholds. To complement the accuracy metrics, precision, recall, and F1-scores were also computed for all models but are not displayed here due to space limitations. These metrics provided additional insight into the models’ performance, particularly for minority or borderline pollution categories.

A complete analysis of model behavior under realistic time-blocked testing is given in Figure 5a–d. The confusion matrix (Figure 5c) reveals that most errors occur between adjacent ordinal classes, especially between Very Poor and Severe and between Moderate and Poor, reflecting the overlap of pollutant concentration ranges at the class boundaries. Misclassification between distant categories is uncommon, suggesting consistent overall discrimination. The reliability curve (Figure 5b) shows slight overconfidence in the higher probability bins, with a multiclass Brier score of around 0.08, indicating moderate but not ideal calibration. The macro-F1 heatmap (Figure 5d) shows systematic differences between models and cities: Bi-LSTM consistently has the highest scores, followed by LSTM, RF, and SVM, and the largest improvements occur in cities with a pronounced seasonal or temporal trend.

The confusion matrix for the time-blocked test set (2020) covers the six CPCB/NAQI classes: Good, Satisfactory, Moderate, Poor, Very Poor, and Severe. Most misclassifications occur between adjacent ordinal classes (e.g., Very Poor vs. Severe, Moderate vs. Poor); errors between distant classes are uncommon, as expected from the overlapping pollutant ranges near AQI thresholds. The reliability curve plots predicted class confidence against observed frequency for the same test set, with the diagonal indicating perfect calibration; the curve shows slight overconfidence at higher probabilities. The multiclass mean Brier score is 0.08, indicating moderate calibration that could be improved by post hoc calibration or threshold tuning. Macro-F1 was also analyzed by model and city under the temporal split: Bi-LSTM is consistently at the top, particularly in cities with stronger seasonal dynamics, followed by LSTM, RF, and SVM.
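For reference, the calibration quantities discussed above can be computed as in the following sketch. The text does not state which normalization of the multiclass Brier score was used; the version below (squared error summed over classes, averaged over samples) is one common convention and is an assumption, as are the function names.

```python
import numpy as np


def multiclass_brier(proba, y_true):
    """Mean over samples of the squared distance between probabilities and one-hot labels."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    onehot = np.eye(proba.shape[1])[y_true]
    return float(np.mean(np.sum((proba - onehot) ** 2, axis=1)))


def reliability_table(proba, y_true, n_bins=10):
    """Per confidence bin: (lower edge, upper edge, mean confidence, accuracy, count)."""
    proba = np.asarray(proba, dtype=float)
    y_true = np.asarray(y_true, dtype=int)
    conf = proba.max(axis=1)  # confidence of the predicted class
    correct = (proba.argmax(axis=1) == y_true).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.digitize(conf, edges[1:-1])  # bin index in 0..n_bins-1
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            rows.append((float(edges[b]), float(edges[b + 1]),
                         float(conf[mask].mean()), float(correct[mask].mean()),
                         int(mask.sum())))
    return rows
```

A bin whose mean confidence exceeds its accuracy lies below the diagonal of the reliability curve, which is the overconfidence pattern described for the higher probability bins.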

Moreover, the models are highly discriminative, with a macro-averaged AUROC of 0.91 and a macro-averaged AUPRC of 0.78, indicating that they separate classes well even under the more demanding temporal split. These trends are further supported by per-class precision, recall, and F1-scores, which show that the remaining errors occur mostly at class boundaries; further accuracy gains may be achievable through targeted threshold tuning or ordinal-sensitive approaches.
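Macro-averaged one-vs.-rest AUROC and AUPRC of the kind reported above can be obtained as in this sketch, which assumes scikit-learn and at least three classes, each present in the evaluated labels; the helper name `macro_ovr_scores` is illustrative.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.preprocessing import label_binarize


def macro_ovr_scores(y_true, proba, classes):
    """Macro-averaged one-vs.-rest AUROC and AUPRC over the given class labels."""
    # One indicator column per class: class k versus all other classes.
    y_bin = label_binarize(y_true, classes=classes)
    auroc = roc_auc_score(y_bin, proba, average="macro")
    auprc = average_precision_score(y_bin, proba, average="macro")
    return float(auroc), float(auprc)
```

Macro-averaging weights every class equally, so rare categories such as Severe influence the score as much as frequent ones.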

4.5. Performance Metric Comparison

To supplement the confusion matrix-based evaluation, Table 3 and Table 4 present a comparative analysis of the four machine learning models using standard classification metrics: accuracy, precision, recall, and F1-score. These metrics were computed using macro-averaging across all six NAQI severity classes, derived from the training and validation results previously illustrated in Figure 5a–d.
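Macro-averaging computes each metric per class and then takes the unweighted mean. A minimal sketch from a confusion matrix follows, assuming rows are true classes and columns are predicted classes; the function name and the zero-score convention for empty classes are illustrative choices.

```python
import numpy as np


def macro_metrics(cm):
    """Accuracy and macro precision/recall/F1 from a confusion matrix (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    pred_totals = cm.sum(axis=0)  # samples predicted as each class
    true_totals = cm.sum(axis=1)  # samples truly in each class
    # Classes with no predictions or no true samples receive a score of 0.
    precision = np.divide(tp, pred_totals, out=np.zeros_like(tp), where=pred_totals > 0)
    recall = np.divide(tp, true_totals, out=np.zeros_like(tp), where=true_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": float(tp.sum() / cm.sum()),
        "macro_precision": float(precision.mean()),
        "macro_recall": float(recall.mean()),
        "macro_f1": float(f1.mean()),
    }
```

For a two-class matrix `[[8, 2], [1, 9]]`, the per-class recalls are 0.8 and 0.9, so the macro recall is 0.85, which here coincides with the accuracy only because the two classes are equally sized.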

Training vs. generalization behavior. Table 3 reports both training and held-out test performance. Random Forest achieves perfect training accuracy (1.0000), which reflects the high separability of NAQI categories in the pollutant feature space. Importantly, the test accuracy remains extremely high (0.9971), indicating strong generalization rather than memorization. For deep models, training accuracy remains below 1.0 (LSTM: 0.9738; Bi-LSTM: 0.9814), suggesting regularization effects from dropout and stochastic optimization. The modest performance gap between training and testing sets across models indicates stable generalization. Residual errors are concentrated between adjacent NAQI severity categories, consistent with the ordinal threshold-based definition of the target labels rather than arbitrary misclassification.

Table 4 summarizes the comparative performance and deployment suitability of the four evaluated models. The macro-F1 heatmap (Figure 5d) indicates that Bi-LSTM attains the highest scores across cities, followed by LSTM, RF, and SVM. This ranking is statistically confirmed across cities and temporal folds in the critical difference diagram (Figure 5), with Bi-LSTM significantly ahead of the rest at p < 0.05. The architectural characteristics of each model match its observed strengths: RF performs strongly on tabular pollutant data, with low latency and high explainability, and is therefore a sound edge-deployment baseline. SVM is a compact and robust model for low-dimensional or embedded contexts. LSTM and Bi-LSTM are proficient at capturing temporal and seasonal variations, particularly for O₃ and wintertime PM₂.₅ peaks, where lagged dynamics are a factor.

4.6. Group-Based Accuracy Assessment

To evaluate model robustness across varying data distributions, group-based evaluations were conducted for three representative subsets of the dataset: Group 1, Group 2, and Group 3. Each group reflects a stratified partition of the original dataset, created based on pollutant characteristics and category distribution to ensure diversity in feature space and class composition. Performance was assessed using accuracy, precision, recall, and F1-score, with a focus on class-specific metrics to identify localized model strengths and weaknesses. All values are reported in decimal format and derived using macro-averaging, unless otherwise stated.

4.6.1. Random Forest: Group-Wise Evaluation

Across the three evaluated groups, Random Forest demonstrated consistently strong classification performance. In Group 1, the model achieved a validation accuracy of 0.971 and a macro F1-score of 0.97. Despite these high-level metrics, notable class-level confusion was observed, particularly for class 5, where 41 instances were misclassified as class 4 and 18 as class 3. This suggests potential feature overlaps among adjacent pollution categories. In Group 2, the model showed improved generalization, reaching a validation accuracy of 0.986 and a macro F1-score of 0.98. The confusion matrix revealed minimal off-diagonal entries, indicating stable performance in a moderately complex data landscape. Group 3 yielded near-ceiling results, with a validation accuracy of 0.994 and a macro F1-score of 0.99, as all six classes were predicted with consistently high precision and recall. This reflects a highly separable feature space, well-suited to the Random Forest model’s decision-tree structure.

4.6.2. LSTM: Group-Wise Evaluation

The LSTM model demonstrated strong and consistent performance across all three groups, with noticeable improvements as data complexity decreased. In Group 1, the model attained a validation accuracy of 0.936 and a macro F1-score of 0.93. However, class-level analysis revealed a slight reduction in performance for class 2 (F1 = 0.92) and class 5 (F1 = 0.90) relative to the other categories, likely due to temporal pattern ambiguities or overlapping features. In Group 2, the LSTM model improved, reaching a validation accuracy of 0.948 and a macro F1-score of 0.94. In this group, classes 4 and 5 were predicted with consistently high recall and precision, highlighting the model’s capacity to capture sequential dependencies when class boundaries are moderately distinct. In Group 3, the model achieved its best results, with a validation accuracy of 0.961 and a macro F1-score of 0.95, and classified all classes with high consistency.

4.6.3. Bi-LSTM: Group-Wise Evaluation

The Bi-LSTM model demonstrated consistently high performance across all groups, with subtle advantages in sequence understanding compared to the standard LSTM. In Group 1, it achieved a validation accuracy of 0.944 and a macro F1-score of 0.94, slightly outperforming LSTM in class 2 (F1 = 0.95 vs. 0.92) while maintaining similar scores across other classes. Nonetheless, class 5 showed mild confusion, reflected in an F1-score of 0.91, with remaining confusion concentrated in class 5 relative to adjacent categories. In Group 2, Bi-LSTM reached a validation accuracy of 0.956 and a macro F1-score of 0.95, with strong precision (0.97) achieved in class 2, underscoring its effectiveness in reducing uncertainty through forward and backward temporal learning. In Group 3, the model delivered near-ceiling classification performance, with validation accuracy and macro F1-score reaching 0.969 and 0.96, respectively, slightly exceeding the LSTM while remaining marginally below the Random Forest. In Group 3, Bi-LSTM achieved the highest validation accuracy and macro F1-score among the three groups, with minimal off-diagonal confusion. Group 3 showed stable model behavior, with all classifiers achieving consistently high metrics.

4.6.4. SVM: Group-Wise Evaluation

The SVM model exhibited stable but comparatively moderate classification performance across the three evaluated groups. In Group 1, the model achieved a validation accuracy of 0.918 and a macro F1-score of 0.91. Class-level analysis indicated performance degradation in class 1 (F1 = 0.88) and class 5 (F1 = 0.85), primarily due to misclassifications between adjacent severity categories. SVM errors were mainly observed in borderline cases where adjacent NAQI categories share overlapping pollutant ranges. In Group 2, the model demonstrated improved generalization, reaching a validation accuracy of 0.936 and a macro F1-score of 0.93. Confusion was reduced compared to Group 1, although mild off-diagonal entries persisted between Moderate and Satisfactory classes. In Group 3, SVM attained a validation accuracy of 0.951 and a macro F1-score of 0.94, reflecting improved class separability as data complexity decreased. However, unlike tree-based and bidirectional deep models, SVM continued to exhibit minor instability at class boundaries. Overall, SVM provided reliable performance but did not match the robustness of Random Forest or Bi-LSTM in handling ordinal NAQI class transitions.

4.7. Cross-Model Insights

Across the three groups, most misclassifications occurred between adjacent NAQI categories, particularly involving class 5. These group-wise results are reported to compare model behavior under different pollutant-feature subsets. Bi-LSTM achieved higher class-wise precision than LSTM in Groups 1–2, the more difficult groups, while Random Forest produced the highest overall accuracy across groups and provided a consistently strong baseline. The LSTM model struck a balance between simplicity and performance. These results suggest that a hybrid ensemble, combining Random Forest for feature-based learning and Bi-LSTM for temporal reasoning, could be well suited to complex smart city pollutant classification systems. NO₂ and meteorological variable trends are consistent with winter inversion episodes and traffic intensity.

The global SHAP summary shows mean |SHAP| values across all cities. PM₂.₅ and PM₁₀ are the dominant drivers of AQI class predictions, followed by NO₂ and meteorology (temperature, wind speed). These align with NAQI formulation and typical urban co-emission/dispersion patterns.

Per-city SHAP summary for Delhi: In addition to PM₂.₅/PM₁₀, NO₂ and meteorological factors (temperature, wind speed) exert a stronger influence, consistent with wintertime inversions and traffic-related co-emissions in Delhi.

Figure 6 presents SHAP value summaries that identify the most influential features driving AQI class predictions. At the global level, PM₂.₅ and PM₁₀ dominate, with NO₂ and meteorology (temperature, wind speed) also contributing substantially. Because these importance rankings align with established atmospheric science, they provide diagnostic confirmation that the models rely on physically meaningful pollutant drivers rather than spurious correlations. This strengthens interpretability and deployment trust; it is not a claim of novel environmental discovery.

It is important to note that SHAP values indicate features’ contributions to model predictions and reflect statistical associations within the dataset; they do not establish causal relationships between pollutants and environmental or health outcomes.

Role of meteorology and spatial effects:

SHAP summaries (Figure 6a,b) indicate that temperature and wind speed contribute meaningfully alongside PM₂.₅/PM₁₀ and NO₂, consistent with dispersion and photochemical controls. The influence is strongest in winter for cities prone to inversions. Because the current models are local to stations/cities, long-range transport is only implicitly captured. We therefore specify a next-step upgrade to a graph-based, spatio-temporal architecture ingesting exogenous forecasts (temperature, wind, PBL height) to encode upwind influences and improve multi-hour alerting.

COVID-19 policy periods:

To contextualize the observed 2020 decreases, we annotate the national COVID-19 lockdown phases on the time-series plots (Figure 6): Phase 1 (25 March–14 April 2020), Phase 2 (15 April–3 May), Phase 3 (4–17 May), Phase 4 (18–31 May). AQI reductions coincide with these periods across cities, followed by partial rebounds as mobility resumed.

Model Configurations and Rationale:

Hyperparameters were selected on the 2019 validation block to avoid temporal leakage. RF: number of trees and max depth chosen via validation curves balancing accuracy and latency. SVM: kernel (RBF), C, and γ tuned by grid search to control capacity on standardized features. LSTM/Bi-LSTM: hidden units (moderate width), sequence length (to capture weekly/seasonal effects), dropout (regularization), and learning rate selected by learning curves with early stopping; Bi-LSTM was used where bidirectional context improved validation F1. The final settings are summarized in a concise table.

4.8. Comparative Review of Existing AQI Prediction Models

Synthesis and positioning. Prior studies (e.g., [17,18,19,20,21,22]) report encouraging headline accuracies across neural and tree-based models, but most rely on overall accuracy without class-wise or calibration metrics, use non-temporal or unspecified validation, and often target single pollutants or proxy sensor features rather than operational AQI category use. In contrast, our study is designed for deployment realism: we predict NAQI categories using multi-pollutant + meteorological inputs while excluding the numeric AQI to prevent leakage; we adopt time-blocked validation (train 2015–2018, validate 2019, test 2020); and we report per-class precision/recall/F1 (with supports), macro/weighted-F1, balanced accuracy, κ, MCC, macro-AUROC/AUPRC, and calibration (reliability curves, Brier). We also provide city/season-conditioned results and analyze adjacent-class errors (e.g., Very Poor vs. Severe), offering a clearer view of robustness than accuracy alone and directly addressing the gaps noted in [17,18,19,20,21,22].

Beyond these points, the proposed framework integrates multiple pollutant categories, implements four different ML models, and conducts group-based evaluations to test generalizability across varying data subsets. It provides class-wise F1-scores and macro-averaged precision/recall, and explores model strengths across structured vs. ambiguous data groups, offering a comprehensive approach for smart city air quality monitoring that addresses the limitations highlighted in the previous literature.

5. Discussion

The current work is a comparative analysis of classical machine learning and deep sequential models for NAQI-based air quality classification under a single validation framework. In all experiments, Random Forest achieved the highest overall classification accuracy, whereas Bi-LSTM showed better stability at the boundaries between adjacent pollution categories. The results support previous findings that ensemble tree-based approaches remain very competitive on structured pollutant data, whilst deep sequential architectures bring incremental benefits where temporal dynamics matter. Notably, the consistently high performance across models stems from the direct relationship between pollutant concentrations and NAQI thresholds, which calls for careful validation design and attention to boundary sensitivity in regulatory air quality classification.

The comparative behavior observed in this study is consistent with the broader air-quality prediction literature, in which tree-based models often perform well on structured pollutant datasets, whereas sequence models can add value when temporal dependencies are informative. For example, ref. [12] reported that Random Forest-based systems can achieve competitive performance in predicting urban air quality from urban sensing data, and emphasized the suitability of ensemble decision trees for multivariate pollution inputs [12]. Likewise, survey-level evidence indicates that both classical ML and deep learning are widely used for air-quality forecasting and classification in smart-city sensing, with performance depending heavily on data structure, feature design, and validation strategy rather than on the choice of model alone [10,18]. Deep sequential and hybrid models (including CNN-LSTM variants) have repeatedly been shown to improve predictive robustness when temporal patterns and nonlinear interactions play a major role, which suggests that LSTM-family models are strong competitors in air-quality tasks [11]. Recent studies also indicate that ML models can effectively predict the AQI of Indian cities in particular, which further supports the view that high predictive performance is attainable when AQI classes are derived from pollutants and a sound training/validation split is used [8,19]. Together, these comparisons position the current results, obtained under a single validation framework, against established findings from general smart-city air-quality prediction studies and India-specific AQI studies [10,12,18,19].

Across cities and seasons, the comparative results under time-blocked evaluation show a consistent pattern. Bi-LSTM attains the strongest overall discrimination when temporal and seasonal structures (e.g., winter inversions, weekday–weekend cycles) materially influence category transitions, while RF remains a robust, explainable baseline on tabular features with low latency. City-to-city variation is well explained by meteorology (temperature, wind speed/direction) and measurement density, which affect both dispersion and the stability of class boundaries. Figure 6 and Table 5 highlight that most residual errors occur near adjacent NAQI categories (Very Poor–Severe, Moderate–Poor), which is consistent with the overlapping concentration ranges at those thresholds.

From a decision perspective, miscalibration primarily affects high-risk category alerts (Very Poor and Severe). Overconfidence in these bins may increase unnecessary warnings, while under-confidence may delay public health interventions. In our experiments, recalibration (via threshold tuning on the validation block) slightly reduced overconfidence without materially changing the model ranking (Bi-LSTM > LSTM > RF > SVM). Thus, calibration refinement improves alert reliability but does not alter comparative conclusions.

5.1. High Accuracy Interpretation

The high classification accuracy of the models should be interpreted in the context of the regulatory NAQI-based categorization system. Given that NAQI classes are derived directly from the pollutant concentration limits stipulated by CPCB regulations, high predictive ability is expected when the predictor variables are the underlying pollutant measurements. This has been observed in earlier AQI modeling, where categorical prediction tasks were found to be highly accurate when threshold-based labeling is applied to structured pollutant data [8,12,19].

Given the transparent validation design and boundary-sensitive evaluation, the current findings highlight the significance of methodological rigor. Time-aware data splitting, leakage control, macro-averaged measures, and calibration assessment were used to ensure that the reported performance reflects realistic deployment conditions rather than information leakage or advantages from arbitrary partitioning. The high overall accuracy reported here agrees with the previous literature, but the results need to be considered jointly with class-boundary behavior, confusion patterns, and calibration reliability. This research thus contributes mainly through rigor in evaluation and reproducibility rather than through algorithmic innovation.

5.2. Operational Implications and Uncertainties

For deployment, (i) RF at the edge (CPU only) would be preferred for fast, explainable alerts and continuous monitoring, and (ii) Bi-LSTM run centrally should be used for horizon expansion, scenario testing, and seasonal adjustment. The lead times and alert levels mentioned were obtained experimentally and should be regarded as indicative, not as guaranteed operational levels. False-alarm behavior and retraining schedules would need to be validated per city and aligned with policy before real-world deployment. The major uncertainties are non-stationarity (policy/mobility changes), sensor coverage, and derived labels (AQI categories). Routine calibration monitoring (reliability/Brier) and scheduled retraining (monthly–quarterly, with seasonal retuning) mitigate drift. Planned extensions include exogenous forecasts (temperature, wind, PBL height) and spatio-temporal graph modeling to encode upwind transport and inter-city coupling.

6. Limitations

Although landmark results have been achieved, there are still a number of constraints. The data is limited to Indian cities, and regional differences in the source of emissions, climate and population density may interfere with model generalizability. To be used more widely, transfer learning or retraining on geographically distributed data is required. The other limitation is that there is no dynamic modeling of the emission of pollutants without considering spatial autocorrelation or meteor forecasts. Spatial–temporal graph models or attention-based variants of LSTM that can learn time and place dependencies should be studied in the future.

Hybrid ensembling, i.e., stacking RF with Bi-LSTM outputs using a meta-classifier, may also enhance performance by fusing feature-based and sequence-based information. Reliability measures or uncertainty estimation (e.g., Monte Carlo dropout) would also help to increase trust in high-stakes applications.
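
As a rough illustration of the uncertainty estimation mentioned above, the sketch below covers only the aggregation step of Monte Carlo dropout. It assumes that T class-probability vectors have already been obtained for one input by running a network T times with dropout left active; the function name and the sample values are hypothetical.

```python
import math


def summarize_mc_samples(sampled_probs):
    """Aggregate T stochastic forward passes for a single input.

    sampled_probs : list of T probability vectors, one per pass.
    Returns (predicted class index, mean probabilities, predictive entropy).
    """
    t = len(sampled_probs)
    n_classes = len(sampled_probs[0])
    # Predictive mean: average the class probabilities over the T passes.
    mean = [sum(s[k] for s in sampled_probs) / t for k in range(n_classes)]
    # Predictive entropy (in nats): higher values mean less certainty.
    entropy = -sum(p * math.log(p) for p in mean if p > 0)
    predicted = max(range(n_classes), key=lambda k: mean[k])
    return predicted, mean, entropy


# Hypothetical samples for one observation near a category boundary.
samples = [[0.6, 0.4], [0.4, 0.6], [0.55, 0.45]]
print(summarize_mc_samples(samples))
```

An alerting system could, for example, route predictions with high predictive entropy to manual review instead of issuing an automatic alert.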

The sample employed in this research is limited to Indian cities operating under the NAQI. Climatic regimes, emission structures, monitoring densities, and regulatory thresholds vary widely across countries and regions. Therefore, the model results reported in this work cannot be directly applied to other geographical settings without further training or cross-validation. Evaluating robustness under different atmospheric and regulatory conditions would require cross-regional evaluation, domain adaptation, or transfer learning methods.

7. Conclusions and Future Works

This study presented a structured comparative evaluation of Random Forest, SVM, LSTM, and Bi-LSTM models for NAQI-based air quality classification under a time-aware validation framework. The results indicate that ensemble tree-based models perform strongly, while bidirectional sequential architectures are more stable at boundary-level transitions between adjacent severity categories when the pollutant data are structured. These findings support other research showing that model performance in AQI classification depends heavily on the validation strategy, feature design, and regulatory threshold structure rather than on architectural novelty. The main contribution of this research is its methodological rigor: leakage control, temporal splitting, macro-averaged evaluation measures, calibration assessment, and group-wise pollutant analysis. Such a formal appraisal method enhances interpretability and deployment readiness in CPCB/NAQI regulatory environments.
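
The time-aware validation idea can be sketched as a split by calendar cutoffs, so that every validation and test record is strictly later than the training data. This is a minimal sketch; the record layout, function name, and cutoff dates are illustrative assumptions and are not the split used in the study.

```python
from datetime import date


def time_aware_split(records, val_start, test_start):
    """Split records chronologically using two cutoff dates.

    records    : iterable of dicts with a "date" key (datetime.date)
    val_start  : first date that belongs to the validation set
    test_start : first date that belongs to the test set
    """
    train = [r for r in records if r["date"] < val_start]
    val = [r for r in records if val_start <= r["date"] < test_start]
    test = [r for r in records if r["date"] >= test_start]
    return train, val, test


# Hypothetical monthly records; the "aqi" values are placeholders.
records = [
    {"date": date(year, month, 1), "aqi": 100 + month}
    for year in (2018, 2019, 2020)
    for month in range(1, 13)
]
train, val, test = time_aware_split(records, date(2019, 1, 1), date(2020, 1, 1))
print(len(train), len(val), len(test))
```

Unlike a random shuffle, this split never lets observations from a later period inform a model that is evaluated on an earlier one, which is one of the leakage paths the evaluation protocol is meant to close.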

Future studies should extend the framework to geographically heterogeneous areas, incorporate spatial transfer learning, and test probabilistic alert thresholds under real-time deployment conditions. Domain adaptation and source-apportionment-informed modeling strategies could further improve cross-regional generalizability [26,27,28,29,30,31,32,33,34].

Author Contributions

Conceptualization, M.M. and M.A.; methodology, M.M. and M.A.; software, M.M.; validation, M.M., A.A. and F.J.; formal analysis, M.M. and B.H.; investigation, M.M., A.A. and F.J.; data curation, M.M. and B.N.; writing—original draft preparation, M.M.; writing—review and editing, M.A., A.A., F.J., B.H. and B.N.; visualization, M.M. and B.H.; supervision, M.A.; project administration, M.A. and B.H.; funding acquisition, M.A. and B.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2602).

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

The dataset is publicly available at https://www.kaggle.com/datasets/rohanrao/air-quality-data-in-india (accessed on 30 September 2025).

Acknowledgments

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2602). The authors gratefully acknowledge this institutional support.

Conflicts of Interest

Author Badil Nisar is employed by Saed Azka Limited Company, Jeddah, Saudi Arabia. This affiliation is declared for transparency. The authors confirm that this employment had no role in the study design, data collection, analysis, interpretation of results, or preparation of the manuscript. The remaining authors declare that they have no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

SVM	Support Vector Machine
LSTM	Long Short-Term Memory
Bi-LSTM	Bi-directional Long Short-Term Memory
RF	Random Forest

Appendix A

Forest and Climate Change, Government of India. (Original 2014; subsequent CPCB circulars maintain the same category ranges and pollutant breakpoints.)
Indian Institute of Tropical Meteorology (IITM), Ministry of Earth Sciences. SAFAR—Air Quality Index: Technical Brochure. Pune, India.
World Health Organization (WHO). WHO Global Air Quality Guidelines: Particulate Matter (PM₂.₅ and PM₁₀), Ozone, Nitrogen Dioxide, Sulfur Dioxide and Carbon Monoxide. 2021. (Cited only for concentration guidelines.)

References

Aruna Kumari, N.S.; Ananda Kumar, K.S. Prediction of Air Quality in Industrial Area. In Proceedings of the 5th IEEE International Conference on Recent Trends on Electronics, Information & Communication Technology (RTEICT), Bengaluru, India, 12–13 November 2020.
Gupta, S.; Mohta, Y.; Heda, K.; Armaan, R.; Valarmathi, B.; Arulkumaran, G. Prediction of Air Quality Index Using Machine Learning Techniques: A Comparative Analysis. J. Environ. Public Health 2023, 2023, 4916267.
Bekkar, A.; Hssina, B.; Douzi, S.; Douzi, K. Air-Pollution Prediction in Smart City, Deep Learning Approach. J. Big Data 2021, 8, 161.
Almalki, F.A.; Alsamhi, S.H.; Sahal, R.; Hassan, J.; Hawbani, A.; Rajput, N.S.; Breslin, J.G. Green IoT for Eco-Friendly and Sustainable Smart Cities: Future Directions and Opportunities. Mob. Netw. Appl. 2023, 28, 178–202.
Liu, L.; Zhang, Y. Smart Environment Design Planning for Smart City Based on Deep Learning. Sustain. Energy Technol. Assess. 2021, 47, 101425.
Mahalingam, U.; Elangovan, K.; Dobhal, H.; Valliappa, C.; Shrestha, S.; Kedam, G. A Machine Learning Model for Air Quality Prediction for Smart Cities. In Proceedings of the International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET); IEEE: Piscataway, NJ, USA, 2019; pp. 452–457.
Artmann, M.; Kohler, M.; Meinel, G.; Gan, J.; Ioja, I.-C. How Smart Growth and Green Infrastructure Can Mutually Support Each Other—A Conceptual Framework for Compact and Green Cities. Ecol. Indic. 2019, 96, 10–22.
Kumar, K.; Pande, B.P. Air Pollution Prediction with Machine Learning: A Case Study of Indian Cities. Int. J. Environ. Sci. Technol. 2023, 20, 5333–5348.
Usmani, R.S.A.; Saeed, A.; Abdullahi, A.M.; Pillai, T.R.; Jhanjhi, N.Z.; Hashem, I.A.T. Air Pollution and Its Health Impacts in Malaysia: A Review. Air Qual. Atmos. Health 2020, 13, 1093–1118.
Iskandaryan, D.; Ramos, F.; Trilles, S. Air Quality Prediction in Smart Cities Using Machine Learning Technologies Based on Sensor Data: A Review. Appl. Sci. 2020, 10, 2401.
Gilik, A.; Ogrenci, A.S.; Ozmen, A. Air Quality Prediction Using CNN+LSTM-Based Hybrid Deep Learning Architecture. Environ. Sci. Pollut. Res. 2022, 29, 11920–11938.
Yu, R.; Yang, Y.; Yang, L.; Han, G.; Move, O.A. RAQ—A Random Forest Approach for Predicting Air Quality in Urban Sensing Systems. Sensors 2016, 16, 86.
Ullah, Z.; Al-Turjman, F.; Mostarda, L.; Gagliardi, R. Applications of Artificial Intelligence and Machine Learning in Smart Cities. Comput. Commun. 2020, 154, 313–323.
Batty, M.; Axhausen, K.W.; Giannotti, F.; Pozdnoukhov, A.; Bazzani, A.; Wachowicz, M.; Ouzounis, G.; Portugali, Y. Smart Cities of the Future. Eur. Phys. J. Spec. Top. 2012, 214, 481–518.
Kok, I.; Simsek, M.U.; Ozdemir, S. A Deep Learning Model for Air Quality Prediction in Smart Cities. In Proceedings of the IEEE International Conference on Big Data; IEEE: Piscataway, NJ, USA, 2017; pp. 1983–1990.
Shahid, N.; Shah, M.A.; Khan, A.; Maple, C.; Jeon, G. Towards Greener Smart Cities and Road Traffic Forecasting Using Air Pollution Data. Sustain. Cities Soc. 2021, 72, 103062.
Wardana, I.N.K.; Gardner, J.W.; Fahmy, S.A. Optimising Deep Learning at the Edge for Accurate Hourly Air Quality Prediction. Sensors 2021, 21, 1064.
Liang, Y.-C.; Maimury, Y.; Chen, A.H.-L.; Juarez, J.R.C. Machine Learning-Based Prediction of Air Quality. Appl. Sci. 2020, 10, 9151.
Natarajan, S.K.; Shanmurthy, P.; Arockiam, D.; Balusamy, B.; Selvarajan, S. Optimized Machine Learning Model for Air Quality Index Prediction in Major Cities in India. Sci. Rep. 2024, 14, 6795.
Salem, M.; Shawabkeh, A.; Rodan, A. Benzene Air Pollution Monitoring Model Using ANN and SVM. In Proceedings of the Fifth HCT Information Technology Trends (ITT); IEEE: Piscataway, NJ, USA, 2018; pp. 197–204.
Mejía Martínez, N.; Montes, L.M.; Mura, I.; Franco, J.F. Machine Learning Techniques for PM10 Levels Forecast in Bogotá. In Proceedings of the International Conference on Artificial Intelligence Workshops (ICAIW); IEEE: Piscataway, NJ, USA, 2018.
Pasupuleti, V.R.; Uhasri; Kalyan, P.; Srikanth; Reddy, H.K. Air Quality Prediction of Data Log by Machine Learning. In Proceedings of the 6th International Conference on Advanced Computing and Communication Systems (ICACCS); IEEE: Piscataway, NJ, USA, 2020.
Pan, S.J.; Yang, Q. A Survey on Transfer Learning. IEEE Trans. Knowl. Data Eng. 2010, 22, 1345–1359.
Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A Survey of Transfer Learning. J. Big Data 2016, 3, 9.
Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. arXiv 2017, arXiv:1707.01926.
Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18); International Joint Conferences on Artificial Intelligence Organization: Stockholm, Sweden, 2018; pp. 3634–3640.
Hersbach, H.; Bell, B.; Berrisford, P.; Hirahara, S.; Horányi, A.; Muñoz-Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Schepers, D.; et al. The ERA5 Global Reanalysis. Q. J. R. Meteorol. Soc. 2020, 146, 1999–2049.
Frank, E.; Hall, M. A Simple Approach to Ordinal Classification. In Machine Learning: ECML 2001; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2001; Volume 2167, pp. 145–156.
Gal, Y.; Ghahramani, Z. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv 2015, arXiv:1506.02142.
Lakshminarayanan, B.; Pritzel, A.; Blundell, C. Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles. arXiv 2016, arXiv:1612.01474.
Abdelmalek, M.M.; Mahmoud, H.; Shokry, H. Prognosis of Air Quality Index and Air Pollution Using Machine Learning Techniques. Sci. Rep. 2025, 15, 25890.
Tırınk, S. Machine Learning-Based Forecasting of Air Quality Index under Long-Term Environmental Patterns: A Comparative Approach with XGBoost, LightGBM, and SVM. PLoS ONE 2025, 20, e0334252.
Kaviani Rad, A.; Nematollahi, M.J.; Pak, A.; Mahmoudi, M. Predictive Modeling of Air Quality in the Tehran Megacity via Deep Learning Techniques. Sci. Rep. 2025, 15, 1367.
Guan, X.; Mo, X.; Li, H. A Novel Spatio-Temporal Graph Convolutional Network with Attention Mechanism for PM2.5 Concentration Prediction. Mach. Learn. Knowl. Extr. 2025, 7, 88.

Figure 1. AQI classification.

Figure 2. Workflow of proposed methodology.

Figure 3. Correlation analysis (air pollutants and AQI). (a) presents pairwise correlations among pollutant features, and (b) illustrates correlations between key pollutants and the AQI/NAQI category.

Figure 4. Statistical features and AQI scatter plots.

Figure 5. Trend analysis and model behavior.

Figure 6. Global SHAP comparison: Delhi vs. all cities.

Table 1. CPCB/NAQI pollutant breakpoints.

NAQI Category (AQI)	PM₁₀ (24 h)	PM₂.₅ (24 h)	NO₂ (24 h)	O₃ (8 h)	CO (8 h, mg/m³)	SO₂ (24 h)	NH₃ (24 h)	Pb (24 h)	Color
Good (0–50)	0–50	0–30	0–40	0–50	0–1.0	0–40	0–200	0–0.5	Green
Satisfactory (51–100)	51–100	31–60	41–80	51–100	1.1–2.0	41–80	201–400	0.6–1.0	Light Green
Moderate (101–200)	101–250	61–90	81–180	101–168	2.1–10.0	81–380	401–800	1.1–2.0	Yellow
Poor (201–300)	251–350	91–120	181–280	169–208	10.1–17.0	381–800	801–1200	2.1–3.0	Orange
Very Poor (301–400)	351–430	121–250	281–400	209–748 *	17.1–34.0	801–1600	1201–1800	3.1–3.5	Red
Severe (401–500)	≥430	≥250	≥400	≥748 *	≥34.1	≥1600	≥1800	≥3.5	Maroon
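
As an illustration of how the breakpoints in Table 1 translate into derived category labels, the following sketch looks up the category of each pollutant from the upper band limits and reports the worst one, following the NAQI convention that the overall index is driven by the worst sub-index. Only five pollutants are included, the handling of values exactly on a band limit is an assumption, and the official data-availability rules are not implemented; all names are hypothetical.

```python
from bisect import bisect_left

CATEGORIES = ["Good", "Satisfactory", "Moderate", "Poor", "Very Poor", "Severe"]

# Upper concentration limit of the first five bands, copied from Table 1.
# Anything above the last limit falls in the open-ended "Severe" band.
UPPER_BREAKPOINTS = {
    "PM10": [50, 100, 250, 350, 430],
    "PM2.5": [30, 60, 90, 120, 250],
    "NO2": [40, 80, 180, 280, 400],
    "CO": [1.0, 2.0, 10.0, 17.0, 34.0],  # mg/m³
    "SO2": [40, 80, 380, 800, 1600],
}


def pollutant_category_index(pollutant, concentration):
    """Index into CATEGORIES for one pollutant concentration.

    A value equal to an upper limit stays in the lower band (assumption).
    """
    return bisect_left(UPPER_BREAKPOINTS[pollutant], concentration)


def overall_category(concentrations):
    """Worst (highest-severity) category across the supplied pollutants."""
    worst = max(
        pollutant_category_index(name, value)
        for name, value in concentrations.items()
    )
    return CATEGORIES[worst]


# Hypothetical station reading.
reading = {"PM2.5": 95, "NO2": 35, "CO": 1.5}
print(overall_category(reading))
```

Because the label is a deterministic function of the pollutant concentrations, a classifier that receives the same concentrations as features is largely learning these thresholds, which is relevant when interpreting very high accuracies.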

Table 2. WHO Air Quality Index ranges.

Daily AQI Color	Levels of Concern	Values of Index	Description of Air Quality
Green	Good	0 to 50	The air quality is deemed satisfactory, with minimal to no risk from air pollution.
Yellow	Moderate	51 to 100	The air quality is deemed acceptable, although there could be a risk for certain individuals, especially those particularly sensitive to air pollution.
Orange	Unhealthy for Sensitive Groups	101 to 150	Individuals in sensitive groups might encounter health effects, but the impact on the general public is expected to be lower.
Red	Unhealthy	151 to 200	Some members of the general public may experience health effects, with more severe impacts on those in sensitive groups.
Purple	Very Unhealthy	201 to 300	A health alert indicates an increased risk of health effects for everyone.
Maroon	Hazardous	301 and higher	A health warning under emergency conditions suggests a heightened likelihood of adverse effects on everyone.

Table 3. Result obtained by applying machine learning models.

Model	Testing Accuracy	Training Accuracy	Precision (Weighted)	Recall (Weighted)	F1-Score (Weighted)
Random Forest	0.9971	1.00	0.9972	0.9971	0.9971
SVM (RBF)	0.9442	0.9583	0.9464	0.9442	0.9445
LSTM	0.9495	0.9738	0.9563	0.9495	0.9510
Bi-LSTM	0.9615	0.9615	0.9654	0.9615	0.9622

Table 4. Model-selection table (for smart city deployment).

Model	Typical Strengths	Latency/Hardware	Explainability	Best for Pollutants	Notes
RF	Strong on tabular data, nonlinear interactions	Low latency, runs on CPU; good for edge	High (feature importance, SHAP)	PM₂.₅, PM₁₀, NO₂	Excellent baseline for real-time deployment
SVM	Compact in low-dimensional spaces	Very low latency, lightweight	Moderate (support vectors)	Background pollutants, limited sensors	Useful for embedded/low power
LSTM	Captures temporal patterns, lag effects	Moderate latency (requires sequential processing)	Low–Moderate	O₃, temporal patterns	Best when diurnal/weekly cycles are important
Bi-LSTM	Best at long-range dependencies	Higher latency, GPU recommended	Low–Moderate	O₃, seasonal transitions, PM₂.₅ peaks	Highest accuracy; good for central servers

Table 5. Results obtained in the previous literature.

Author	Technique	Prediction Performance	Pollutants	Areas
[20]	Neural Network	Accuracy: 92.3%	SO₂, NO₂, O₃, CO, PM₂.₅, PM₁₀	Republic of Macedonia
[21]	RNN	Accuracy: 80.27%	CO, NO₂, O₃, SO₂, PM₂.₅, PM₁₀	Atlanta—Sandy Springs
[22]	Neural Network	Accuracy: 99.56%	Air Temperature, Humidity, MQ-2, MQ-135, MQ-5	General
[23]	Artificial Neural Network	MRE: −0.16	Benzene Concentration	General
[24]	Random Forest	Accuracy: Between 70% and 90%	PM₁₀, Wind Speed, Wind Direction, Temperature	Bogotá
[25]	Random Forest	Accuracy: 79%	CO, SO₂, SO₃, Temperature, Wind Speed, Humidity, Wind Direction	General

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
