Stacking Ensemble Learning and SHAP-Based Insights for Urban Air Quality Forecasting: Evidence from Shenyang and Global Implications

Xu, Zhaoxin; Zhang, Huajian; Zhai, Andong; Kong, Chunyu; Zhang, Jinping

doi:10.3390/atmos16070776

Open Access Article

by Zhaoxin Xu ¹, Huajian Zhang ¹, Andong Zhai ², Chunyu Kong ¹ and Jinping Zhang ³,*

¹ School of Mechanical and Power Engineering, Shenyang University of Chemical Technology, Shenyang 110142, China

² Department of Electrical Engineering and Information Technology, Shandong University of Science and Technology, Qingdao 266042, China

³ School of Mechanical and Electrical Engineering, Yunnan Open University, Kunming 650223, China

* Author to whom correspondence should be addressed.

Atmosphere 2025, 16(7), 776; https://doi.org/10.3390/atmos16070776

Submission received: 21 May 2025 / Revised: 15 June 2025 / Accepted: 18 June 2025 / Published: 24 June 2025

(This article belongs to the Section Air Quality)

Abstract

Air pollution poses a significant global challenge, impacting human health and environmental sustainability worldwide. Accurate air quality forecasting is essential for effective mitigation strategies, particularly in rapidly urbanizing regions. This study focuses on Shenyang, China, as a representative case to analyze air quality dynamics and develop a high-precision forecasting tool. Using a comprehensive six-year dataset (2020–2025) of daily air quality and meteorological measurements, a rigorous preprocessing pipeline was applied to ensure data integrity. Five gradient-boosted decision-tree models were trained and combined through a ridge-regularized stacking ensemble to enhance the predictive accuracy. The ensemble achieved an R² of 94.17% and a mean absolute percentage error of 7.79%, outperforming individual models. The feature importance analysis revealed that ozone, PM₁₀, and PM_2.5 concentrations are the dominant drivers of daily air quality fluctuations. The resulting forecasting system delivers robust, interpretable predictions across seasonal variations, offering a valuable decision support tool for urban air quality management. This framework demonstrates how advanced machine learning techniques can be applied in a Chinese urban context to inform global air pollution mitigation efforts.

Keywords: AQI prediction model; Kalman filtering; feature engineering; K-fold cross validation; multi-model stacking fusion

1. Introduction

The reliable prediction of urban air quality is critical for public health management. Traditional physics-based chemical-transport models suffer from coarse spatial resolution and heavy computational demands, motivating the shift toward data-driven approaches. Recent studies have demonstrated the effectiveness of machine learning for AQI forecasting. For example, Ravindiran et al. showed that a LightGBM model could reduce daily-AQI MAPE below 10% for Visakhapatnam [1]. A hybrid SSA–BiLSTM–LightGBM model achieved similar precision in Shijiazhuang [2]. In Beijing, lag-aware BiLSTM stacks captured the winter heating cycle and further reduced short-term error [3]. Even when meteorological drivers were added, boosted trees often remained the strongest single learners in multi-city evaluations [4].

As single-model performance plateaus, ensemble meta-learning has emerged as a frontier for air quality forecasting. A six-learner stacking ensemble in Hyderabad reduced the PM_2.5 error by about five percentage points relative to its best base model [5]. Error-compensated selective stacking achieved larger gains across several Chinese megacities [6]. In Macau, blending boosting, bagging, and kernel methods under a linear meta-learner consistently improved AQI forecasts without excessive costs [7]. An automated pipeline demonstrated that such ensembles can be retrained and served in real time [8].

Moreover, a recent comparative analysis of multiple deep learning models for monthly ambient PM_2.5 forecasting in Dezhou City highlighted the strong performance of hybrid sequence architectures [9], and a hybrid wavelet-based deep learning model in Guangzhou City achieved highly accurate daily PM_2.5 predictions with robust generalization [10].

Feature engineering has advanced in parallel with algorithm development. For instance, variational mode decomposition (VMD) combined with graph attention networks and BiLSTM has reduced peak-pollution errors during dust storms. A quantum-behaved PSO–LSTM model with attention mechanisms has stabilized forecasts under abrupt synoptic changes [11]. Lasso-based lag selection methods identify the most informative historical features for sub-hourly PM_2.5 prediction [12]. Additionally, coupling Himawari-8 satellite reflectance with LightGBM yields 1 km PM_2.5 forecasts with cross-validated R² ≈ 0.87 [13]. Probabilistic LightGBM frameworks can even output prediction intervals for risk-aware planning [14].

Recent graph-based architectures have also shown promise. Gao et al. proposed a temporal graph network that uncovers pollutant interactions and dynamic station links [15]. Li Y. et al. developed a dynamic graph-attention network for city-wide AQI forecasting [16]. Chen K. et al. introduced an adaptive-attention graph-convolutional network with an interpretable tree structure [17]. Park et al. designed a multi-head-attention GCN–LSTM for tiered-warning tasks with severe class imbalance [18]. Spatio-temporal Transformers promise further gains by mitigating permutation biases on long sequences [19].

Despite these advances, two key gaps remain. First, most studies treat data cleaning as a preprocessing black box, even though robust noise reduction filters can markedly improve data quality [20]. Second, many models are trained on records shorter than three years, limiting their reliability in regions with sharply seasonal emissions, such as cold-temperate industrial basins [21].

To address these gaps, we compiled a six-year daily AQI and meteorology dataset for Shenyang, a cold-temperate industrial city, covering the period from 2020 to 2025. We developed a leakage-free pipeline in which raw data were trimmed, winsorized, denoised using sliding-median filters and Kalman smoothing, and then feature-engineered with calendar cyclicals, multi-day lags, and rolling-window statistics. Five boosted learners were trained on the cleaned data, and their predictions were fused by a ridge-regularized linear meta-regressor. Gain, permutation, and SHAP diagnostics accompanied every forecast, resulting in a system that is accurate, interpretable, and robust across the full seasonal spectrum of this high-latitude industrial region.

2. Materials and Methods

2.1. Data Source

The Air Quality Index (AQI) is a quantitative metric representing air pollution severity and associated health impacts. This study investigated six primary air pollutants (PM_2.5, PM₁₀, O₃, CO, NO₂, and SO₂) collected from the RESSET industry database. The full dataset spanned 1 January 2020 to 12 April 2025; data from 1 January 2020 to 31 December 2023 were used for model development, and data from 1 January 2024 to 12 April 2025 were reserved as an independent test set.

2.2. Background and Related Work

All experiments ran on a 16-core Intel Xeon CPU (2.3 GHz) with 16 GB RAM. Offline training, including hyperparameter tuning for five base learners, required approximately 60 min. Online inference entails feature processing (<5 ms), parallel base-learner predictions (~200 ms), and Ridge aggregation (<10 ms), for a total latency of ~0.21 s per sample. Multithreaded execution processes over 5000 records per minute on a 16-core node, supporting continuous 24/7 rolling forecasts and spatial grid predictions. The containerized service requires only a few hundred megabytes of memory and one to two CPU cores, enabling millisecond-scale responses, high scalability, and real-time AQI forecasting.

Figure 1 illustrates the linear relationships between key pollutants and AQI [22]. A strong positive correlation exists between AQI and PM_2.5 with r = 0.79 and between AQI and PM₁₀ with r = 0.80, identifying these particulate matter species as principal drivers of air quality deterioration. PM_2.5 and PM₁₀ show a high inter-correlation of r = 0.86, indicating their concentrations often rise in concert. The remaining pollutants fall below these correlations but still demonstrate meaningful links to AQI.

Figure 2 illustrates the joint temporal evolution of AQI alongside six key pollutants over 2020–2025 in Shenyang. Each 3D panel plots days since 2020-01-01 against AQI and pollutant concentration (PM_2.5, PM₁₀, NO₂, CO, SO₂, and O₃), revealing clear downward trends, pronounced seasonal cycles, and the close coupling between pollutant peaks and AQI spikes. The combined plot highlights distinct pollutant behaviors and their collective impact on air quality across all seasons.

2.3. Data Preprocessing and Outlier Mitigation

2.3.1. Outlier Mitigation via IQR and Three-Sigma Truncation

After data import and preliminary checks using Python 3.11, we confirmed that all non-O₃ pollutant measurements (PM_2.5, PM₁₀, CO, SO₂, and NO₂) were complete and obtained directly from the Shenyang Environmental Meteorological Bureau’s daily aggregate records. Missing values were identified solely in the O₃ series (≈1.8% of days), and these gaps were subsequently imputed using time-series linear interpolation, which weights adjacent known data points according to their temporal proximity. Finally, the interpolated value at time t is calculated by:

y (t) = y_{0} + \frac{t - t_{0}}{t_{1} - t_{0}} (y_{1} - y_{0})

(1)

where t₀ and t₁ are the nearest known timestamps surrounding t, and y₀ and y₁ are the values observed at those timestamps.
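As a minimal sketch of Equation (1), the snippet below fills interior gaps in a daily series by weighting the two surrounding known values by temporal proximity. The function names and the O₃ readings are illustrative assumptions, not taken from the study's code or data.

```python
def interpolate_linear(t, t0, y0, t1, y1):
    """Eq. (1): value at time t between known points (t0, y0) and (t1, y1)."""
    if t1 == t0:
        raise ValueError("t0 and t1 must differ")
    weight = (t - t0) / (t1 - t0)          # 0 at t0, 1 at t1
    return y0 + weight * (y1 - y0)


def fill_gaps(days, values):
    """Fill None entries that have a known value on both sides.

    `days` are integer day offsets. Leading or trailing gaps stay None,
    because Eq. (1) needs a known neighbour on each side.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            filled[i] = interpolate_linear(
                days[i], days[left], values[left], days[right], values[right]
            )
    return filled


# Hypothetical O3 readings with a two-day gap (values are made up).
days = [0, 1, 2, 3, 4]
o3 = [60.0, None, None, 90.0, 80.0]
print(fill_gaps(days, o3))
```

Gaps at the very start or end of a series are deliberately left unfilled here; how the study treated such edge cases is not stated.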

To mitigate the adverse impact of outliers and noise on model training, the interquartile range method was first used to identify extreme values. Denoting the first and third quartiles by Q₁ and Q₃, and the interquartile range by IQR = Q₃ − Q₁, the lower and upper bounds are defined as:

\mathrm{Lower}_{IQR} = Q_{1} - 1.5 \, IQR

(2)

\mathrm{Upper}_{IQR} = Q_{3} + 1.5 \, IQR

(3)

Values falling outside these bounds are winsorized, i.e.,

x_{winsorized} = \begin{cases} \mathrm{Lower}_{IQR}, & x < \mathrm{Lower}_{IQR}, \\ x, & \mathrm{Lower}_{IQR} \leq x \leq \mathrm{Upper}_{IQR}, \\ \mathrm{Upper}_{IQR}, & x > \mathrm{Upper}_{IQR}. \end{cases}

(4)

To further guard against extreme deviations, the three-sigma rule was applied: with μ and σ denoting the sample mean and standard deviation, any value outside

[μ - 3 σ, μ + 3 σ]

(5)

is likewise truncated to the nearest boundary.
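The two truncation rules can be sketched as follows, with all bounds estimated on the training sample and then reused on the test sample. The sample values, the quartile convention (`method="inclusive"`), and the choice to compute the three-sigma bounds after winsorizing are assumptions made for illustration; the text does not specify them.

```python
import statistics


def iqr_bounds(train):
    """Eqs. (2)-(3): IQR fences estimated from the training sample only."""
    q1, _, q3 = statistics.quantiles(train, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def sigma_bounds(train, k=3.0):
    """Eq. (5): mean +/- k standard deviations of the training sample."""
    mu = statistics.fmean(train)
    sigma = statistics.stdev(train)
    return mu - k * sigma, mu + k * sigma


def clip(values, lower, upper):
    """Eq. (4): replace out-of-range values by the nearest bound."""
    return [min(max(v, lower), upper) for v in values]


train = [35, 40, 42, 45, 48, 50, 52, 55, 60, 400]   # made-up PM2.5 values
test = [30, 47, 500]

lo, hi = iqr_bounds(train)
train_w = clip(train, lo, hi)
lo3, hi3 = sigma_bounds(train_w)        # assumption: 3-sigma after winsorizing
test_w = clip(clip(test, lo, hi), lo3, hi3)
print(lo, hi, test_w)
```

Because the test values are clipped with training-derived bounds, no statistic of the evaluation period enters the cleaning step.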

2.3.2. Kalman Filtering and Smoothing Procedure

Median filtering within a sliding window, complemented by Kalman smoothing, was utilized to suppress severe spikes and reduce random noise while preserving underlying trends. The robustness of median filtering effectively managed isolated anomalies, whereas Kalman filtering adaptively smoothed dynamic fluctuations in the time series. Consequently, these preprocessing steps enhanced data cleanliness and stability, significantly improving model robustness against extreme values and anomalies. Importantly, all interpolation and filtering parameters (mean, variance, quartiles, etc.) were strictly derived from the training set and uniformly applied to the test set to prevent any leakage of future data information [23,24,25].

The Kalman filtering procedure includes the following key steps in Table 1:

Figure 3a–g overlay raw, truncated, and Kalman-smoothed series for PM_2.5, PM₁₀, NO₂, CO, SO₂, O₃, and AQI. Winter-to-spring peaks and extreme spikes, marked by the × symbol, are prominent in all panels. Winsorization removed these outliers, and Kalman smoothing (red) preserved the coherent seasonal cycles that were masked by noise in the raw series (blue). The fluctuations that persist in particulate and gaseous pollutants reflect real pollution driven by traffic peaks and winter heating. This comparison demonstrates that the two-stage pipeline suppresses anomalies while retaining genuine environmental patterns, improving data fidelity for subsequent forecasting.
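The median-then-Kalman stage described in this subsection can be illustrated with a small sketch. It assumes a one-dimensional local-level (random-walk) state model with a backward Rauch-Tung-Striebel pass; the window length and the noise variances `q` and `r` are placeholders, since the state model and parameters actually used in the study are not reproduced here.

```python
import statistics


def median_filter(x, window=5):
    """Centred sliding median; the window shrinks at the series edges."""
    half = window // 2
    return [
        statistics.median(x[max(0, i - half): i + half + 1])
        for i in range(len(x))
    ]


def kalman_smooth(z, q=1.0, r=25.0):
    """Local-level Kalman filter followed by a backward smoothing pass.

    Assumed model: x_t = x_{t-1} + w_t (variance q), z_t = x_t + v_t
    (variance r). Both variances are illustrative, not the paper's values.
    """
    n = len(z)
    x_pred, p_pred = [0.0] * n, [0.0] * n    # one-step-ahead mean / variance
    x_filt, p_filt = [0.0] * n, [0.0] * n    # filtered mean / variance
    x_filt[0], p_filt[0] = z[0], r           # start from the first observation
    for t in range(1, n):
        x_pred[t] = x_filt[t - 1]            # predict: random walk keeps the mean
        p_pred[t] = p_filt[t - 1] + q
        gain = p_pred[t] / (p_pred[t] + r)   # Kalman gain
        x_filt[t] = x_pred[t] + gain * (z[t] - x_pred[t])
        p_filt[t] = (1.0 - gain) * p_pred[t]
    x_smooth = x_filt[:]
    for t in range(n - 2, -1, -1):           # backward (smoothing) pass
        c = p_filt[t] / p_pred[t + 1]
        x_smooth[t] = x_filt[t] + c * (x_smooth[t + 1] - x_pred[t + 1])
    return x_smooth


raw = [10.0, 12.0, 9.0, 11.0, 30.0, 10.0, 11.0]   # made-up series with one spike
print(kalman_smooth(median_filter(raw, window=3)))
```

A larger `r` relative to `q` produces a smoother output; in a leakage-free setting these variances would be chosen on the training period only.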

2.4. Feature Engineering and Standardization Treatment

Temporal features significantly enhance AQI forecasting beyond noise and outlier mitigation. Season indicators capture periodic peaks, for example, elevated PM_2.5 in winter heating and increased O₃ in summer. Day-type labels, such as workdays, weekends, and holidays, encode intra-week emission patterns, while calendar features, such as years, months, and days, reflect long- and short-term trends [26,27]. Incorporating these features improves model generalization, reduces the misclassification of anomalies, and permits the quantification of each temporal factor’s contribution to predictions.

Accurate AQI forecasting requires modeling temporal dependencies that are not captured by current values alone. To address this, lag features such as AQI(t − 1), AQI(t − 3), and AQI(t − 7) and rolling statistics such as the seven-day moving average and standard deviation are introduced [28,29]. Lagged features capture autocorrelation by referencing recent or periodic values, while rolling statistics smooth short-term noise, highlight trend direction, and quantify local variability. Together, they provide a multiscale temporal context that enhances model robustness to anomalies and improves generalization under varying meteorological and social conditions.

Different lag lengths serve distinct roles. Lag-1 reflects short-term persistence and daily meteorological influence. Lag-3 captures mid-range effects and synoptic variability, such as two to three day air mass changes. Lag-7 reveals weekly cycles tied to human activity patterns, including workdays and traffic intensity. In time-series modeling, lag features are constructed to incorporate historical observations and improve the model’s sensitivity to dynamic changes. The standard lag feature definitions based on AQI are formulated as follows:

AQI_{lag}(t) = AQI(t - k), \quad k \geq 1

(6)

To quantify temporal dynamics in AQI, two statistical features based on a 7-day sliding window are introduced:

Seven-day rolling mean measures the average AQI over the current and preceding six days, smoothing short-term fluctuations and capturing weekly trends [30,31]. The feature is defined as follows:

AQI_{rollmean7}(t) = \frac{1}{7} \sum_{i=0}^{6} AQI(t - i)

(7)

Seven-day rolling standard deviation measures the variability in AQI within the same window; larger values indicate more volatile weekly air quality. The feature is defined as follows:

AQI_{rollstd7}(t) = \sqrt{\frac{1}{7} \sum_{i=0}^{6} (AQI(t - i) - AQI_{rollmean7}(t))^{2}}

(8)
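The lag and rolling-window features of Equations (6)–(8) can be sketched in a few lines. The helper names and the AQI values are made up; days without a full history are marked `None`, which is one possible convention rather than the study's documented choice.

```python
import math


def lag(series, k):
    """Eq. (6): AQI_lag(t) = AQI(t - k); None for the first k days."""
    if k < 1:
        raise ValueError("k must be >= 1")
    return [None] * k + series[:-k]


def rolling_mean_std(series, window=7):
    """Eqs. (7)-(8): mean and 1/window standard deviation over days t-6..t."""
    means, stds = [], []
    for t in range(len(series)):
        if t < window - 1:                     # not enough history yet
            means.append(None)
            stds.append(None)
            continue
        w = series[t - window + 1: t + 1]      # current day and preceding six
        m = sum(w) / window
        means.append(m)
        stds.append(math.sqrt(sum((v - m) ** 2 for v in w) / window))
    return means, stds


aqi = [50, 60, 70, 80, 90, 100, 110, 120]      # made-up daily AQI values
lag1, lag7 = lag(aqi, 1), lag(aqi, 7)
means, stds = rolling_mean_std(aqi)
print(lag1[1], lag7[7], means[6], stds[6])
```

As written in Equations (7) and (8), the window includes day t itself. If AQI(t) is also the forecast target, the window would need to be shifted back by one day to stay leakage-free; the equations do not settle this point.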

After preprocessing, we constructed a heatmap of the top 20 engineered air quality features in Shenyang from 2020 to 2025. Each row corresponds to a specific lagged or original pollutant variable; the columns in Figure 4 show maximum, 75th percentile, median, 25th percentile, minimum, and variance. Color intensity encodes the magnitude of each statistic, highlighting variability and range differences across pollutants.

To ensure clean and consistent input for model training, all features were standardized using Z-score normalization, which transforms each value by subtracting the feature mean and dividing by the feature standard deviation. The normalization and the sample standard deviation it relies on are expressed as follows:

z_{i} = \frac{x_{i} - μ}{σ}

(9)

σ = \sqrt{\frac{1}{n - 1} \sum_{i = 1}^{n} {(x_{i} - μ)}^{2}}

(10)

To prevent data leakage, strict separation between training and test sets was enforced during preprocessing. All transformation parameters, including those for data cleaning, feature engineering, and scaling, were derived exclusively from the training data. These parameters were then applied to the test set without modification. This protocol ensured that no information from the evaluation set influenced model training, thereby guaranteeing fair and reproducible performance metrics.
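A minimal sketch of this train-only protocol for the scaling step: the mean and the n − 1 standard deviation of Equation (10) are estimated on the training values and then applied unchanged to the test values. The class name and the numbers are hypothetical.

```python
import statistics


class TrainOnlyScaler:
    """Z-score scaler whose parameters come from the training data only."""

    def fit(self, train):
        self.mu = statistics.fmean(train)
        self.sigma = statistics.stdev(train)   # n - 1 denominator, as in Eq. (10)
        return self

    def transform(self, values):
        """Eq. (9): z = (x - mu) / sigma, using the stored training statistics."""
        return [(v - self.mu) / self.sigma for v in values]

    def inverse_transform(self, z):
        """Map standardized values back to the original scale."""
        return [self.mu + self.sigma * v for v in z]


train = [10.0, 20.0, 30.0]     # made-up training feature values
test = [40.0, 15.0]            # made-up test feature values

scaler = TrainOnlyScaler().fit(train)    # fitted once, on training data
z_test = scaler.transform(test)          # test data never influences mu or sigma
print(z_test, scaler.inverse_transform(z_test))
```

The same pattern (fit on training data, apply to test data) carries over to the clipping bounds and filter parameters mentioned above.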

3. Model

3.1. Stacking

Stacking is an ensemble learning technique that integrates multiple base models to enhance predictive performance, as shown in Figure 5. Each base learner is trained on the same input data and generates predictions, which are then used as input features for a higher-level model called the meta-learner [32,33]. The meta-learner is trained to optimally combine these predictions and produce the final output.
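To make the two-level structure concrete, the toy sketch below stacks two deliberately simple base learners under a ridge meta-learner. Everything here is an illustrative assumption: straight-line regressors stand in for the five boosted models, the ridge problem is solved in closed form for two inputs, and the meta-learner is trained on out-of-fold base predictions (a common practice that the text does not explicitly confirm).

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b * x; returns a predictor function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def fit_ridge2(p1, p2, ys, alpha=1.0):
    """Ridge regression on two centred inputs, solved as a 2x2 linear system."""
    n = len(ys)
    m1, m2, my = sum(p1) / n, sum(p2) / n, sum(ys) / n
    s11 = sum((u - m1) ** 2 for u in p1) + alpha
    s22 = sum((v - m2) ** 2 for v in p2) + alpha
    s12 = sum((u - m1) * (v - m2) for u, v in zip(p1, p2))
    s1y = sum((u - m1) * (y - my) for u, y in zip(p1, ys))
    s2y = sum((v - m2) * (y - my) for v, y in zip(p2, ys))
    det = s11 * s22 - s12 * s12              # positive because alpha > 0
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    b0 = my - w1 * m1 - w2 * m2
    return lambda u, v: b0 + w1 * u + w2 * v


def stack_fit(x1, x2, y, k=3, alpha=1.0):
    """Train base learners, collect out-of-fold predictions, fit the meta-learner."""
    n = len(y)
    oof1, oof2 = [0.0] * n, [0.0] * n
    for f in range(k):
        held = set(range(f * n // k, (f + 1) * n // k))
        tr = [i for i in range(n) if i not in held]
        g1 = fit_line([x1[i] for i in tr], [y[i] for i in tr])
        g2 = fit_line([x2[i] for i in tr], [y[i] for i in tr])
        for i in held:                        # predictions on unseen rows only
            oof1[i], oof2[i] = g1(x1[i]), g2(x2[i])
    meta = fit_ridge2(oof1, oof2, y, alpha)
    f1, f2 = fit_line(x1, y), fit_line(x2, y)  # refit base learners on all rows
    return lambda a, b: meta(f1(a), f2(b))


# Made-up data: the target depends on x1 only, so learner 1 is the informative one.
x1 = list(range(12))
x2 = [(3 * i) % 7 for i in range(12)]
y = [2 * a + 1 for a in x1]
model = stack_fit(x1, x2, y)
print(round(model(5, 1), 2))
```

The meta-learner ends up putting almost all of its weight on the informative base learner, which is the behaviour the ridge-regularized combination is meant to provide.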

3.2. K-Fold Cross Validation Optimization

To ensure reliable evaluation and mitigate biases from random data partitioning, fifteen widely used machine learning models were benchmarked using ten-fold cross-validation [34]. The models comprised support vector classifier, decision tree, adaptive boosting, random forest, extremely randomized trees, gradient boosting, multilayer perceptron neural network, k-nearest neighbors, logistic regression, linear discriminant analysis, ridge regression, support vector regression, light gradient boosting machine, extreme gradient boosting, and categorical boosting. Due to space constraints, detailed model specifications were omitted.

Given the temporal nature of AQI data, a time series split was adopted to ensure that each validation set chronologically followed its corresponding training set. The dataset was first divided into training and test sets. The training set was then partitioned into K folds, with one fold serving as the validation set and the remaining K − 1 folds used for model fitting in each iteration. This sequential procedure, illustrated in Figure 4, reduces overfitting risk and yields a robust estimate of model generalization. A ten-fold configuration was selected to balance computational efficiency and evaluation fidelity.

K-fold cross-validation is a standard approach for evaluating a model’s generalization performance. The procedure consists of the following steps:

1. The dataset is partitioned into K equally sized folds (at random in the generic procedure; in chronological order for the time series split adopted here).

2. For each iteration k = 1, 2, …, K, the kth fold is designated as the validation set, while the remaining K − 1 folds are used for training.

3. The model is trained and evaluated across all K folds, and the average validation error is computed as the final cross-validation estimate of model performance:

Err_{k} = \frac{1}{|D_{k}|} \sum_{(x_{i}, y_{i}) \in D_{k}} L(y_{i}, \hat{f}^{(-k)}(x_{i}))

(11)

CV_{Error} = \frac{1}{K} \sum_{k=1}^{K} Err_{k} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|D_{k}|} \sum_{(x_{i}, y_{i}) \in D_{k}} L(y_{i}, \hat{f}^{(-k)}(x_{i}))

(12)
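Equations (11) and (12) can be sketched with a chronological splitter. The expanding-window scheme below is one way to guarantee that every validation block follows its training data; it is an assumption, since the exact fold layout used in the study is not specified. The "model" is a trivial mean predictor chosen only to keep the example short.

```python
def expanding_window_folds(n_samples, k):
    """Yield (train_idx, val_idx) with validation always after training in time."""
    block = n_samples // (k + 1)
    for fold in range(1, k + 1):
        train_end = fold * block
        val_end = n_samples if fold == k else train_end + block
        yield list(range(train_end)), list(range(train_end, val_end))


def cv_error(y, fit, loss, k):
    """Eqs. (11)-(12): mean loss per fold, then the average over the K folds."""
    fold_errors = []
    for train_idx, val_idx in expanding_window_folds(len(y), k):
        model = fit([y[i] for i in train_idx])
        err_k = sum(loss(y[i], model(i)) for i in val_idx) / len(val_idx)
        fold_errors.append(err_k)
    return sum(fold_errors) / k


def fit_mean(train):
    """Stand-in learner: always predicts the training mean."""
    m = sum(train) / len(train)
    return lambda i: m


y = [float(v) for v in range(1, 13)]          # made-up, steadily rising series
cv = cv_error(y, fit_mean, lambda a, b: abs(a - b), k=3)
print(cv)
```

Swapping `fit_mean` for a real learner and the absolute-error loss for any other L gives the cross-validation estimate used to compare candidate models.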

3.3. Hyperparameter Optimization

We performed an extensive hyperparameter optimization for all seven models, including the stacking meta-learner, by employing a randomized search in conjunction with ten-fold cross-validation on the training set. For each learner, fifty candidate configurations were sampled at random, as shown in Figure 6, and the configuration yielding the lowest mean absolute error averaged across the ten folds was selected. This systematic approach ensured that each algorithm was evaluated under equivalent conditions and that the resulting hyperparameters were robust against overfitting [35,36,37,38,39].

The parameter ranges are detailed in Table A1 in Appendix A. These ranges were guided by both established literature practices and practical considerations of computational efficiency. For the same number of samples, randomized search covers a broader portion of the hyperparameter space than an exhaustive grid search, thereby reducing the risk of entrapment in local optima; combined with ten-fold cross-validation, this strategy further validates the stability of the selected configurations. Learning rates between 0.01 and 0.1 strike a balance between convergence speed and stability, while tree-based learners require depth and leaf limits that control complexity without unduly increasing variance. For random forest, limiting the number of estimators and random feature subsets preserves diversity without imposing prohibitive training times. Linear models such as ridge regression and the meta-learner hinge primarily on regularization strength, with solver selection and intercept fitting providing additional numerical stability.
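The sampling loop behind such a search can be sketched as follows. Only the learning-rate range (0.01 to 0.1) and the budget of fifty candidates come from the text; the depth range, the seed, and the scoring function are hypothetical stand-ins, and in the study the score would be the ten-fold cross-validated MAE of the model built from each configuration.

```python
import random


def randomized_search(space, score, n_iter=50, seed=0):
    """Sample n_iter configurations and keep the one with the lowest score."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    best_cfg, best_score = None, float("inf")
    for _ in range(n_iter):
        cfg = {name: draw(rng) for name, draw in space.items()}
        s = score(cfg)
        if s < best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score


# Search space: each entry draws one value from its range.
space = {
    "learning_rate": lambda rng: rng.uniform(0.01, 0.1),   # range from the text
    "max_depth": lambda rng: rng.randint(3, 8),            # hypothetical range
}


def toy_score(cfg):
    """Stand-in for cross-validated MAE (purely illustrative)."""
    return (cfg["learning_rate"] - 0.05) ** 2 + 0.01 * abs(cfg["max_depth"] - 5)


best, best_score = randomized_search(space, toy_score, n_iter=50, seed=0)
print(best, best_score)
```

Because each candidate is drawn independently, the budget of fifty evaluations stays fixed however many hyperparameters are added to `space`, which is the practical advantage over an exhaustive grid.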

3.4. Model Performance Test

To assess the predictive performance, three core error metrics were employed: mean squared error (MSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) [40,41], defined as follows:

MSE = \frac{1}{N} \sum_{i=1}^{N} (y_{i} - \hat{y}_{i})^{2}

(13)

where y_{i} denotes the observed value, \hat{y}_{i} is the predicted value, and N is the sample size. MSE penalizes larger deviations more heavily due to the squared term, making it sensitive to outliers and suitable for evaluating the model’s ability to detect extreme pollution events.

MAE = \frac{1}{N} \sum_{i=1}^{N} |y_{i} - \hat{y}_{i}|

(14)

MAE captures the average magnitude of prediction errors, regardless of the direction. It is more robust to outliers than MSE and reflects the model’s typical error across daily pollutant fluctuations.

MAPE = \frac{100\%}{N} \sum_{i=1}^{N} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|

(15)

MAPE quantifies prediction errors as a percentage of the actual values, allowing for a scale-invariant comparison across pollutants.
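For reference, the three metrics can be computed as in the sketch below. MAPE divides each error by the observed value, in line with its description as a percentage of the actual values; the observed and predicted numbers are made up.

```python
def mse(y_true, y_pred):
    """Mean squared error between observed and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, relative to the observed values.

    Observed values must be non-zero, otherwise the ratio is undefined.
    """
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(y_true, y_pred)) / len(y_true)


observed = [100.0, 200.0]      # made-up AQI observations
predicted = [110.0, 180.0]     # made-up forecasts
print(mse(observed, predicted), mae(observed, predicted), mape(observed, predicted))
```

The squared term makes the 20-unit miss dominate MSE, whereas MAE and MAPE weight both errors in proportion to their size.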

3.5. Feature Importance Analysis Method

Global SHAP (SHapley Additive exPlanations) analysis is a model-agnostic interpretability method that quantifies the overall importance of each feature by aggregating its contributions across all predictions [42]. Based on Shapley values from cooperative game theory, it fairly attributes a feature’s influence by averaging its marginal contributions across all possible feature combinations. In global analysis, the mean absolute SHAP values are computed for each feature across the dataset, providing a consistent and theoretically grounded measure of overall feature impact on model predictions.
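The averaging over feature combinations can be made explicit with a brute-force sketch for a tiny model. This is for illustration only: it enumerates every subset, which is feasible for a handful of features, whereas practical SHAP implementations rely on faster estimators. The toy predictor, its coefficients, and the zero baseline used for "absent" features are all assumptions.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x; absent features take baseline values."""
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def global_importance(predict, rows, baseline):
    """Mean absolute Shapley value per feature across all rows."""
    totals = [0.0] * len(baseline)
    for x in rows:
        for i, v in enumerate(shapley_values(predict, x, baseline)):
            totals[i] += abs(v)
    return [t / len(rows) for t in totals]


def toy_model(f):
    """Made-up predictor with one interaction term (not the paper's model)."""
    return 0.5 * f[0] + 0.3 * f[1] + 0.01 * f[0] * f[2]


rows = [[80.0, 60.0, 30.0], [40.0, 90.0, 10.0]]   # hypothetical feature rows
baseline = [0.0, 0.0, 0.0]
print(global_importance(toy_model, rows, baseline))
```

For each row the values sum to the difference between the prediction at that row and the prediction at the baseline, which is the additivity property that makes the attribution consistent.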

4. Results

4.1. Model Comparison

The comparative evaluation employing ten-fold cross-validation across fifteen classifiers demonstrated that the boosting-based algorithms consistently achieved superior accuracy, as shown in Figure 7. Gradient boosting attained an accuracy of 93.3%, while XGBoost, LightGBM, and CatBoost each surpassed 92%, thereby confirming their capacity for effective nonlinear modeling and strong generalization. Linear approaches, such as logistic regression and linear discriminant analysis, performed less favorably, as they proved ill-suited to capturing intricate data patterns. K-nearest neighbors and support vector machines suffered from sensitivity to high-dimensional noise, whereas random forest maintained a robust performance at approximately 90.1% through bagging and random feature selection. An ensemble constructed via majority voting further elevated the accuracy to roughly 93.0%, underscoring that the combination of boosting strategies with model aggregation constitutes the most resilient framework for AQI classification.

4.2. Model Refinement

Post-optimization, each model was independently evaluated on the test set. Performance improvements were observed across all five tree-based learners, establishing a stronger individual baseline for ensemble construction. The final tuned parameters and corresponding validation metrics are summarized in Table 2 and Figure 8.

Forecasting air quality during extreme weather is critical since these events coincide with pronounced air quality deterioration and elevated health risks. Accurate forecasts empower early warning systems to issue timely advisories to sensitive populations, guide targeted traffic restrictions and industrial emissions controls, optimize medical resource deployment, and support the development of resilient long-term climate adaptation strategies. By combining robust predictive performance with interpretable insights, our framework strengthens urban resilience, safeguards public health, and informs proactive environmental policymaking.

Figure 9 compares the true and predicted AQI trajectories of our stacking ensemble under extreme weather conditions from 1 January 2024 to 12 April 2025. The close alignment of the blue prediction curve with the black observation curve—including during rapid pollution spikes caused by inversion events—demonstrates the model’s ability to capture both routine fluctuations and severe anomalies. Quantitatively, the ensemble achieves an MAE of 9.08 AQI units, a MAPE of 6.50 percent, and an MSE of 262.66, reflecting high absolute and relative accuracy even when pollution peaks threaten public health.

4.3. Model Learning Curve Comparison

After hyperparameter tuning, the model’s capacity and regularization are fixed, placing it in an optimized state. To evaluate the effect of the training set size on generalization, learning curves were plotted under fixed parameter settings, thus avoiding confounding from mis-tuned configurations. This enables a clearer diagnosis of whether high error arises from underfitting, overfitting, or data insufficiency. As shown in Figure 10, the model performance stabilizes once the training set exceeds ~400 samples.

4.4. Comparative Evaluation of Model Performance

In air quality forecasting, MSE emphasizes accuracy in high-concentration episodes (e.g., PM_2.5 peaks), while MAE captures consistency across typical daily variations. The combination of MSE and MAE provides a comprehensive view of both extreme-event sensitivity and general performance.

Model performance metrics were computed using Python and Excel-based workflows. The resulting error values are summarized in Figure 8 and Table 3.

4.5. Convergence Curves in Ensemble and Baseline Models

A convergence curve is a visual representation that traces the change in a model’s loss function value over successive training iterations. By plotting these curves, researchers can intuitively assess each algorithm’s learning speed, optimization efficiency, and training stability. Convergence curves guide model tuning and architecture selection by revealing how quickly a model approaches its minimum loss and how smoothly it stabilizes thereafter.

As shown in Figure 11, ensemble tree-based methods (XGBoost, CatBoost, gradient boosting decision tree, and random forest) and the stacking model all exhibit steep downward trends in loss during the early iterations, demonstrating strong fitting power and rapid convergence. Their curves gradually flatten as they near optimal performance, indicating stable training. Ridge regression, by contrast, produces an almost flat line—reflecting its non-iterative, closed-form solution and its limitations in capturing complex patterns. Importantly, the stacking model combines the strengths of multiple base learners to achieve loss reductions on par with or better than individual models, resulting in superior generalization and robustness. The use of a logarithmic loss axis further amplifies subtle differences in convergence speed at low loss values, underscoring stacking’s advantage in effectively integrating diverse learners during iterative training.

4.6. Interpretable SHAP-Driven Linear Approximation

The Global SHAP analysis confirmed that AQI predictions are predominantly driven by the combined effects of three key pollutants—PM₁₀, PM_2.5, and O₃—and by descriptors of temporal fluctuation. By integrating outputs from diverse base learners, the stacking framework not only synthesizes complementary predictive signals but also models the complex interplay between time-dependent features and pollutant concentrations. This dual capability ensures exceptional predictive performance while maintaining interpretability, thereby bolstering the model’s reliability and credibility. From the mean SHAP importance plot, O₃, PM₁₀, PM_2.5, CO at seven-day lag, and SO₂ at seven-day lag emerged as the top contributors to AQI forecasts, motivating the derivation of a concise, interpretable linear approximation for AQI estimation.

AQI_{pred} = β_{0} + β_{1} \cdot O_{3}^{8h} + β_{2} \cdot PM_{10} + β_{3} \cdot PM_{2.5} + β_{4} \cdot CO_{lag7} + β_{5} \cdot SO_{2,lag7} + ε

(16)

where β₀ denotes the intercept, β₁ through β₅ are the feature weights, and ε is the residual error term.

Figure 12 provides a multifaceted SHAP-based interpretation of our stacking ensemble’s AQI forecasts. In subfigure (a), the mean absolute SHAP values rank PM_2.5, PM₁₀, and O₃ as the three most influential drivers, each contributing substantially more than all other variables combined. The global summary in subfigure (b) then maps the full distribution of SHAP values for each feature: high PM_2.5 and PM₁₀ concentrations consistently push the predicted AQI upward (deep pink points to the right), while low values drive predictions downward (blue points to the left). CO and SO₂ at seven-day lag appear as secondary drivers, indicating that past pollutant levels also inform short-term forecasts.

Subfigures (c) and (d) delve deeper into the nonlinear dependence of O₃ and PM₁₀, respectively. Both dependence plots exhibit clear inflection points: once O₃ exceeds approximately 1.5 ppm or PM₁₀ surpasses around 75 µg/m³, their marginal impact on AQI accelerates sharply. The color gradient on these plots further reveals interaction effects—for example, the influence of O₃ is amplified when contemporaneous PM_2.5 is also high. Finally, modest but non-negligible SHAP contributions from temporal descriptors (year, month, weekday) confirm that seasonal and weekly cycles are captured by the model. Together, these visualizations demonstrate not only which factors dominate AQI predictions, but also how their effects evolve across different pollutant regimes and over time, underpinning the ensemble’s robust and interpretable performance.

5. Discussion

This study demonstrates that boosting-based tree ensembles, particularly gradient boosting, XGBoost, LightGBM, and CatBoost, consistently outperform linear, distance-based, and kernel classifiers in daily AQI classification for Shenyang. After hyperparameter tuning, gradient boosting reached an accuracy of 93.9 percent, while the stacked ensemble pushed performance to 94.2 percent and delivered the lowest error metrics, with MSE equal to 67.44 and MAE equal to 5.01. These results echo earlier international research that highlights the capacity of gradient-boosted models to capture complex nonlinear relationships between pollutants and meteorological factors. The learning curve analysis revealed that model generalization stabilizes once the training set exceeds roughly four hundred samples, indicating that large high-resolution datasets improve performance but yield diminishing returns beyond a critical size.

Two working hypotheses guided the research: first, that advanced ensemble methods would surpass single models in predictive accuracy and, second, that particulate matter and ozone would remain the dominant drivers of AQI irrespective of seasonal variability. Both were confirmed. Voting and stacking ensembles reduced individual model errors by exploiting complementary strengths, and the SHAP analysis identified PM₁₀, PM_2.5, and eight-hour ozone as decisive features, with seven-day lags of CO and SO₂ also contributing, underscoring the temporal complexity of pollutant formation and dispersion.

The interpretable SHAP-driven linear approximation offers municipal authorities a transparent decision-support tool capable of issuing early warnings and informing interventions such as traffic control and industrial emission cuts, supporting China’s drive for cleaner urban air and offering a template for other rapidly industrializing regions. The leakage-free preprocessing pipeline developed in this paper provides a replicable framework for global time-series air quality modelling, safeguarding against optimistic bias.

Limitations include the focus on a single city, the reliance on surface meteorological variables, and the computational demands of real-time deployment, suggesting future work on multi-city validation, the incorporation of upper-air or satellite observations, hybrid physical–statistical modelling, and cost–benefit analyses of forecast-driven interventions. Overall, this study confirms that advanced boosting ensembles combined with rigorous preprocessing and interpretable modelling deliver a powerful and transparent approach to urban air quality forecasting that can be adapted well beyond the industrial hub of northeastern China.

6. Conclusions

The proposed stacking ensemble outperformed all individual models in AQI forecasting. On the held-out test set, the ensemble achieved R² ≈ 94.17% and MAPE ≈ 7.79%, the best performance among all learners, together with the lowest MSE (67.44) and MAE (5.01) in Table 3, covering both extreme pollution events and routine daily variations. By fusing five complementary gradient-boosting models with a ridge-regularized linear meta-learner, the ensemble captured complex nonlinear pollutant patterns more accurately than any single model and reduced the predictive error below the levels achieved by the individual boosted models. A rigorous preprocessing pipeline, involving interquartile range and three-sigma outlier trimming, sliding-median filtering, and Kalman smoothing, ensured that the training data were clean and representative of true environmental variability. These steps yielded data of high fidelity, so that the ensemble generalized well to unseen samples and maintained a consistent performance across the entire seasonal spectrum. The learning curve analysis showed that the prediction error stabilized once ≈400 days of data were available, confirming that the data volume was sufficient for robust training.
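
The stacking idea summarized above can be sketched with scikit-learn's `StackingRegressor`. This is a simplified illustration under stated assumptions, not the authors' implementation: only two scikit-learn base learners stand in for the five boosted models, the data are synthetic, and the ridge penalty `alpha=1.0` and the five-fold setting are arbitrary choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the engineered AQI features.
X, y = make_regression(
    n_samples=400, n_features=8, n_informative=4, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# Level-0 learners: their out-of-fold predictions become meta-features.
base_learners = [
    ("gbdt", GradientBoostingRegressor(n_estimators=50, random_state=0)),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
]

# Level-1 learner: ridge regression combines the base predictions linearly.
stack = StackingRegressor(
    estimators=base_learners,
    final_estimator=Ridge(alpha=1.0),
    cv=5,
)
stack.fit(X_train, y_train)

r2 = r2_score(y_test, stack.predict(X_test))
meta_weights = stack.final_estimator_.coef_  # one weight per base learner
```

Inspecting `meta_weights` shows how strongly the meta-learner relies on each base model, which is one reason a linear meta-learner is convenient for interpretation.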

The developed model offers enhanced interpretability, as the global SHAP analysis confirmed that AQI predictions are primarily influenced by a select set of pollutants, most notably PM₁₀, PM_2.5, and ozone, alongside their temporally lagged values. These insights align closely with established pollutant dynamics, including elevated PM_2.5 concentrations during winter heating periods and increased ozone levels associated with summertime photochemical processes. Leveraging these findings, a concise and interpretable linear approximation of AQI was derived. Moreover, feature importance analyses through gain and permutation metrics independently validated the critical roles of these pollutants, reinforcing confidence in the model’s robustness and decision logic.
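
The cross-check between gain-type and permutation importance mentioned above can be sketched as follows. The example is illustrative only: the data are synthetic, the target weights (0.5, 0.3, 0.2) are hypothetical, the feature labels are placeholders, and scikit-learn's impurity-based `feature_importances_` is used as a stand-in for a gain metric.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 600
feature_names = ["PM10", "PM2.5", "O3_8h", "noise"]
X = rng.uniform(0.0, 100.0, size=(n, 4))
# Synthetic target with hypothetical weights; the last column is irrelevant.
y = 0.5 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2] + rng.normal(0.0, 1.0, n)

# Chronological-style split: first 450 rows for training, the rest held out.
X_train, X_test = X[:450], X[450:]
y_train, y_test = y[:450], y[450:]

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)

# Impurity-based importance, normalized by scikit-learn to sum to one.
gain_like = model.feature_importances_

# Permutation importance: drop in held-out score when one column is shuffled.
perm = permutation_importance(
    model, X_test, y_test, n_repeats=10, random_state=0
)

ranking = [feature_names[i] for i in np.argsort(perm.importances_mean)[::-1]]
```

When both measures rank the same features at the top, as they should in this constructed example, the agreement lends some support to the attribution, which is the logic of the validation described in the text.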

The proposed modeling framework delivers exceptionally accurate, robust, and transparent air quality forecasts, carrying significant public health implications. Improved reliability in AQI predictions facilitates the early identification of pollution events, thereby enabling the more timely implementation of targeted interventions such as emission reduction initiatives and health advisories. Integrating extensive historical datasets with advanced ensemble modeling techniques substantially enhances the accuracy and reliability of urban air quality forecasts, thereby supporting policymakers in formulating more effective and proactive pollution mitigation strategies.

Author Contributions

Conceptualization, J.Z. and A.Z.; Methodology, Z.X. and A.Z.; Software, Z.X. and A.Z.; Validation, A.Z.; Formal analysis, H.Z. and C.K.; Investigation, C.K.; Resources, C.K.; Data curation, Z.X.; Writing—original draft, Z.X.; Writing—review & editing, J.Z.; Visualization, Z.X. and H.Z.; Supervision, H.Z.; Project administration, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Hyperparameter ranges for each model.

Model	Hyperparameter Ranges
LightGBM	learning_rate [0.01, 0.1]; num_leaves [31, 127]; max_depth [5, 15]; n_estimators [100, 500]
XGBoost	learning_rate [0.01, 0.1]; max_depth [3, 10]; subsample [0.6, 1.0]; colsample_bytree [0.6, 1.0]; n_estimators [100, 500]
CatBoost	learning_rate [0.01, 0.1]; depth [4, 10]; l2_leaf_reg [1, 10]; iterations [100, 500]
GBDT (Gradient Boosting)	learning_rate [0.01, 0.1]; max_depth [3, 10]; n_estimators [100, 500]; subsample [0.6, 1.0]
Random Forest	n_estimators [100, 500]; max_depth [None, 15]; min_samples_split [2, 10]; max_features ["sqrt", "log2"]
Ridge Regression	alpha [0.1, 10]; solver ["auto", "saga"]; fit_intercept [True, False]
Stacking Meta-Learner (Ridge)	alpha [0.01, 1.0]; fit_intercept [True, False]
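
A randomized search over the GBDT ranges in Table A1 could look like the sketch below. The sampling distributions (log-uniform for the learning rate, uniform otherwise), the number of sampled configurations, the three-fold time-ordered split, and the synthetic data are assumptions for illustration; the table specifies only the ranges.

```python
from scipy.stats import loguniform, randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Small synthetic problem so that the search finishes quickly.
X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=0)

# Ranges follow the GBDT row of Table A1; the distributions are assumed.
param_distributions = {
    "learning_rate": loguniform(0.01, 0.1),
    "max_depth": randint(3, 11),        # integers 3..10 inclusive
    "n_estimators": randint(100, 501),  # integers 100..500 inclusive
    "subsample": uniform(0.6, 0.4),     # uniform on [0.6, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                           # number of sampled configurations
    cv=TimeSeriesSplit(n_splits=3),     # validation folds always follow training folds
    scoring="neg_mean_absolute_error",
    random_state=0,
)
search.fit(X, y)
best = search.best_params_
```

Note that `scipy.stats.uniform(loc, scale)` covers `[loc, loc + scale]` and `randint(low, high)` excludes `high`, which is why the upper arguments differ from the table entries.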

Table A2. List of air quality and meteorological features used in the model.

Feature name	Feature index
PM_2.5	Feature 1
PM₁₀	Feature 2
NO₂	Feature 3
CO	Feature 4
SO₂	Feature 5
O₃	Feature 6
AQI_lag1	Feature 7
PM_2.5_lag1	Feature 8
PM₁₀_lag1	Feature 9
NO₂_lag1	Feature 10
CO_lag1	Feature 11
SO₂_lag1	Feature 12
O₃_lag1	Feature 13
AQI_lag3	Feature 14
PM_2.5_lag3	Feature 15
PM₁₀_lag3	Feature 16
NO₂_lag3	Feature 17
CO_lag3	Feature 18
SO₂_lag3	Feature 19
O₃_lag3	Feature 20
AQI_lag7	Feature 21
PM_2.5_lag7	Feature 22
PM₁₀_lag7	Feature 23
NO₂_lag7	Feature 24
CO_lag7	Feature 25
SO₂_lag7	Feature 26
O₃_lag7	Feature 27
AQI_roll7_mean	Feature 28
AQI_roll7_std	Feature 29
PM_2.5_roll7_mean	Feature 30
PM_2.5_roll7_std	Feature 31
PM₁₀_roll7_mean	Feature 32
PM₁₀_roll7_std	Feature 33
NO₂_roll7_mean	Feature 34
NO₂_roll7_std	Feature 35
CO_roll7_mean	Feature 36
CO_roll7_std	Feature 37
SO₂_roll7_mean	Feature 38
SO₂_roll7_std	Feature 39
O₃_roll7_mean	Feature 40
O₃_roll7_std	Feature 41
year	Feature 42
month	Feature 43
day	Feature 44
is_weekend	Feature 45
is_holiday	Feature 46
is_workday	Feature 47
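
The lag, rolling, and calendar features listed in Table A2 can be generated along the following lines with pandas. This is a sketch under assumptions: the helper name `add_lag_and_rolling` is hypothetical, the rolling statistics are computed over the seven preceding days only (whether the original windows include the current day is not stated), and `is_holiday` and `is_workday` are omitted because they require an external holiday calendar.

```python
import numpy as np
import pandas as pd

def add_lag_and_rolling(df, columns, lags=(1, 3, 7), window=7):
    """Add lagged, rolling, and calendar features to a daily DataFrame."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
        # Shift by one day first so the window only covers past values.
        past = out[col].shift(1)
        out[f"{col}_roll{window}_mean"] = past.rolling(window).mean()
        out[f"{col}_roll{window}_std"] = past.rolling(window).std()
    out["year"] = out.index.year
    out["month"] = out.index.month
    out["day"] = out.index.day
    out["is_weekend"] = (out.index.dayofweek >= 5).astype(int)
    return out

# Toy daily series; the values are placeholders, not measurements.
dates = pd.date_range("2020-01-01", periods=30, freq="D")
raw = pd.DataFrame(
    {"AQI": np.arange(30, dtype=float), "CO": np.linspace(0.5, 1.5, 30)},
    index=dates,
)
features = add_lag_and_rolling(raw, ["AQI", "CO"])
```

The first rows of the result contain missing values wherever a lag or window reaches before the start of the series; these rows would normally be dropped before training.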

References

Ravindiran, G.; Hayder, G.; Kanagarathinam, K.; Alagumalai, A.; Sonne, C. Air quality prediction by machine learning models: A predictive study on the Indian coastal city of Visakhapatnam. Chemosphere 2023, 338, 139518.
Zhang, X.; Jiang, X.; Li, Y. Prediction of air quality index based on the SSA-BiLSTM-LightGBM model. Sci. Rep. 2023, 13, 5550.
Kong, M.; Li, J.; Liu, Q. SSA–BiLSTM–LightGBM combined model for AQI forecasting in Beijing. Sci. Rep. 2023, 13, 9872.
Zhao, C.; Lin, Z.; Yang, L.; Jiang, M.; Qiu, Z.; Wang, S.; Gu, Y.; Ye, W.; Pan, Y.; Zhang, Y.; et al. A study on the impact of meteorological and emission factors on PM2.5 concentrations based on machine learning. J. Environ. Manag. 2025, 376, 124347.
Ravindiran, G.; Karthick, K.; Rajamanickam, S.; Datta, D.; Das, B.; Shyamala, G.; Hayder, G.; Maria, A. Ensemble stacking of machine learning models for air quality prediction for Hyderabad city in India. iScience 2025, 28, 111894.
Peng, T.; Xiong, J.; Sun, K.; Qian, S.; Tao, Z.; Nazir, M.S.; Zhang, C. Research and application of a novel selective stacking ensemble model based on error compensation and parameter optimization for AQI prediction. Environ. Res. 2024, 247, 118176.
Tian, H.; Kong, H.; Wong, C. A Novel Stacking Ensemble Learning Approach for Predicting PM_2.5 Levels in Dense Urban Environments Using Meteorological Variables: A Case Study in Macau. Appl. Sci. 2024, 14, 5062.
Yang, J.; Ke, H.; Gong, S.; Wang, Y.; Zhang, L.; Zhou, C.; Mo, J.; You, Y. Enhanced Forecasting and Assessment of Urban Air Quality by an Automated Machine Learning System: The AI-Air. Earth Space Sci. 2025, 12, e2024EA003942.
He, Z.; Guo, Q. Comparative Analysis of Multiple Deep Learning Models for Forecasting Monthly Ambient PM_2.5 Concentrations: A Case Study in Dezhou City, China. Atmosphere 2024, 15, 1432.
He, Z.; Guo, Q.; Wang, Z.; Li, X. A Hybrid Wavelet-Based Deep Learning Model for Accurate Prediction of Daily Surface PM_2.5 Concentrations in Guangzhou City. Toxics 2025, 13, 254.
Chen, Y.; Lee, J.; Park, S. Attention-hybrid QPSO-LSTM for AQI prediction in Seoul. J. Big Data 2024, 11, 57.
Zhang, J.; Liu, Q.; Chen, X. PM_2.5 concentration prediction with LASSA-optimized LightGBM. Atmosphere 2024, 15, 1612.
Hu, R.; Xia, Y.; Guo, Z. Mapping PM_2.5 from Himawari-8 reflectance via LightGBM. Atmos. Environ. 2024, 330, 120560.
Li, P.; Chen, G.; Shen, C.; Dong, L.; Cai, C. KSC-ConvLSTM: A hybrid deep-learning air-pollution prediction approach based on neighbourhood selection and spatio-temporal attention. Sci. Rep. 2025, 15, 3685.
Li, Y.; Wang, Q.; Zhang, R. Multi-scale spatiotemporal graph-attention network for PM_2.5 forecasting. Inf. Sci. 2024, 640, 121072.
Chen, K.; Li, G.; Huang, H. Spatiotemporal adaptive attention GCN for city-level AQI. Sci. Rep. 2024, 14, 1654. Available online: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10432486/ (accessed on 10 December 2024).
Park, J.; Oh, S.; Lee, H. Multi-head attention GCN-LSTM model for tiered AQI early warning. Atmosphere 2024, 15, 622. Available online: https://www.mdpi.com/2073-4433/15/4/418 (accessed on 10 December 2024).
Xu, Q.; Zhou, H.; Li, S. Explainable machine-learning approaches for AQI forecasting. Aerosol Air Qual. Res. 2023, 23, 230151.
Samad, A.; Garuda, S.; Vogt, U.; Yang, B. Air pollution prediction using machine learning techniques—An approach to replace existing monitoring stations with virtual monitoring stations. Atmos. Environ. 2023, 310, 119987.
Mampitiya, L.; Rathnayake, N.; Leon, L.P.; Mandala, V.; Azamathulla, H.M.; Shelton, S.; Hoshino, Y.; Rathnayake, U. Machine-learning techniques to predict air quality using twelve parameters. Environments 2023, 10, 141.
Zhai, B.; Chen, J. Development of a stacked ensemble model for forecasting and analyzing daily average PM_2.5 concentrations in Beijing, China. Sci. Total Environ. 2018, 635, 644–658.
Friendly, M. A brief history of the cluster heat map. Comput. Stat. Data Anal. 2008, 52, 328–351.
Hunt, B.R.; Kostelich, E.J.; Szunyogh, I. Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter. Phys. D Nonlinear Phenom. 2007, 230, 112–126.
Bokani, A.; Yadegaridehkordi, E.; Kanhere, S.S. LSTM-H: A Hybrid Deep Learning Model for Accurate Livestock Movement Prediction in UAV-Based Monitoring Systems. Drones 2025, 9, 346.
Ma, X.; Zhou, P.; He, X. Advances in Multi-Source Navigation Data Fusion Processing Methods. Mathematics 2025, 13, 1485.
Jayaraman, S.; Nathezhtha, T.; Abirami, S.; Sakthivel, G. Enhancing urban air quality prediction using time-based-spatial forecasting framework. Sci. Rep. 2025, 15, 4139.
Chen, C.W.S.; Chiu, L.M. Ordinal time series forecasting of the air quality index. Entropy 2021, 23, 1157.
Shi, J.; Jain, M.; Narasimhan, G. Time series forecasting using various deep learning models. arXiv 2022, arXiv:2204.11115. Available online: https://arxiv.org/abs/2204.11115 (accessed on 8 January 2025).
Lu, J.; Wu, S.; Qin, Z.; Wu, D.; Yang, G. Frequency-aware attention-LSTM for PM_2.5 time series forecasting. arXiv 2025, arXiv:2503.24043. Available online: https://arxiv.org/abs/2503.24043 (accessed on 8 January 2025).
Huang, C.-Y.; Kuo, Y.-H. A rolling forecast approach for next six-hour air quality index track. J. Comput. Sci. 2020, 46, 101103.
Garg, S.; Jindal, H. Evaluation of Time Series Forecasting Models for Estimation of PM_2.5 Levels in Air. In Proceedings of the 2021 6th International Conference for Convergence in Technology (I2CT), Maharashtra, India, 10 May 2021; pp. 1–8.
Breiman, L. Stacked regressions. Mach. Learn. 1996, 24, 49–64.
Wolpert, D.H. Stacked generalization. Neural Netw. 1992, 5, 241–259.
Allen, D.M. The relationship between variable selection and data augmentation and a method for prediction. Technometrics 1974, 16, 125–127.
Bergstra, J.; Bengio, Y. Random search for hyper-parameter optimization. J. Mach. Learn. Res. 2012, 13, 281–305.
Snoek, J.; Larochelle, H.; Adams, R.P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems; Pereira, F., Burges, C.J.C., Bottou, L., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2012; Volume 25, pp. 2951–2959.
Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 3146–3154.
Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
Probst, P.; Wright, M.N.; Boulesteix, A. Hyperparameters and tuning strategies for random forest. WIREs Data Min. Knowl. Discov. 2019, 9, e1301.
Willmott, C.J.; Matsuura, K. Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Clim. Res. 2005, 30, 79–82.
Makridakis, S.; Hibon, M. The M3-Competition: Results, conclusions and implications. Int. J. Forecast. 2000, 16, 451–476.
Molnar, C. Interpretable Machine Learning; Lean Publishing: Victoria, BC, Canada, 2020; Available online: https://christophm.github.io/interpretable-ml-book/ (accessed on 8 January 2025).

Figure 1. Heatmap of factors related to air quality in Shenyang.

Figure 2. Temporal trends of AQI and pollutants. Subfigures (a,b) show that PM_2.5 and PM₁₀ concentrations peak strongly in spring and autumn, driving corresponding AQI spikes, while (c) highlights CO’s sharp winter surges linked to heating emissions. Panel (d) demonstrates O₃’s pronounced summer maxima under intense sunlight, inversely correlated with AQI lows, and (e) reveals SO₂’s generally low baseline punctuated by sporadic industrial spikes. In (f), NO₂ exhibits clear morning-rush and winter peaks, reflecting traffic and boundary-layer dynamics. Subfigure (g) traces the AQI series itself, showing an overall slight improvement since 2021, and (h) overlays all pollutant–AQI trajectories for direct comparison of their relative magnitudes, seasonality, and combined influence on air quality.

Figure 3. Before and after comparison of outlier detection and processing. Subfigure (a) shows PM_2.5 with frequent fine-particle spikes and pronounced spring and autumn peaks; truncation removes extreme excursions and the Kalman filter uncovers the underlying cyclical trend. In (b), PM₁₀ exhibits larger amplitude spikes during the same seasons, which are effectively tracked by the smoother. Panel (c) depicts NO₂, highlighting strong wintertime elevations and maintaining diurnal and weekly variation patterns after smoothing. Subfigure (d) displays CO’s sharp cold-season emission events alongside an overall downward trend. In (e), SO₂ remains low year-round but features isolated industrial spikes that are accurately flagged as outliers. The eight-hour rolling mean of ozone in (f) shows strong summer maxima, with the smoothed curve emphasizing a long-term upward baseline. Finally, (g) integrates all species into the AQI, where truncation and smoothing together reveal a subtle improvement in overall air quality since early 2021, despite intermittent pollution episodes.

Figure 4. Descriptive statistics heatmap of engineered air quality features.

Figure 5. Stacking ensemble architecture.

Figure 6. Randomized hyperparameter sampling for ensemble learners. The color of each marker encodes the number of estimators used in the ensemble, with darker hues indicating lower estimator counts and lighter hues corresponding to higher counts. This dual encoding allows simultaneous visualization of three hyperparameters (learning rate, max depth, number of estimators) and facilitates rapid comparison of model complexity across sampled configurations.

Figure 7. Ten-fold cross-validation results.

Figure 8. Scatter plots of model predictions vs. true AQI.

Figure 9. True and predicted AQI trajectories under extreme weather conditions.

Figure 10. Learning curves for six regressors.

Figure 11. Convergence curves of different models.

Figure 12. Comprehensive SHAP interpretation of the stacking ensemble model.

Table 1. Key equations for Kalman filtering and smoothing procedures.

Step	Description	Equation
(1)	State prediction using A and B	$\hat{x}_{k \mid k-1} = A \hat{x}_{k-1 \mid k-1} + B u_{k}$
(2)	Covariance prediction using A and Q	$P_{k \mid k-1} = A P_{k-1 \mid k-1} A^{T} + Q$
(3)	Kalman gain calculation with R	$K_{k} = P_{k \mid k-1} H^{T} (H P_{k \mid k-1} H^{T} + R)^{-1}$
(4)	State update with observation residuals	$\hat{x}_{k \mid k} = \hat{x}_{k \mid k-1} + K_{k} (z_{k} - H \hat{x}_{k \mid k-1})$
(5)	Covariance update using Kalman gain and H	$P_{k \mid k} = (I - K_{k} H) P_{k \mid k-1}$
(6)	Smoothing gain backward adjustment	$G_{k} = P_{k \mid k} A^{T} P_{k+1 \mid k}^{-1}$
(7)	Backward state smoothing refinement	$\hat{x}_{k \mid N} = \hat{x}_{k \mid k} + G_{k} (\hat{x}_{k+1 \mid N} - \hat{x}_{k+1 \mid k})$
(8)	Symmetrical moving average smoothing	$\tilde{x}_{t} = \frac{1}{2M + 1} \sum_{i = -M}^{M} x_{t + i}$
(9)	Causal moving average (real-time)	$\tilde{x}_{t} = \frac{1}{M} \sum_{i = 0}^{M - 1} x_{t - i}$
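
Steps (1) to (7) of Table 1 can be traced in a short scalar implementation. The sketch assumes a local-level model with A = H = 1 and no control input (so the B u_k term vanishes); the noise variances Q and R, the initial covariance, and the synthetic signal are illustrative values, not those used in the study.

```python
import numpy as np

def kalman_rts_smooth(z, A=1.0, H=1.0, Q=0.01, R=1.0, x0=None, P0=1.0):
    """Scalar Kalman filter followed by a backward (RTS) smoothing pass."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    x_pred = np.zeros(n)
    P_pred = np.zeros(n)
    x_filt = np.zeros(n)
    P_filt = np.zeros(n)
    x_prev = z[0] if x0 is None else x0
    P_prev = P0
    for k in range(n):
        # Steps (1)-(2): predict state and covariance.
        x_pred[k] = A * x_prev
        P_pred[k] = A * P_prev * A + Q
        # Step (3): Kalman gain.
        K = P_pred[k] * H / (H * P_pred[k] * H + R)
        # Steps (4)-(5): correct with the observation residual.
        x_filt[k] = x_pred[k] + K * (z[k] - H * x_pred[k])
        P_filt[k] = (1.0 - K * H) * P_pred[k]
        x_prev, P_prev = x_filt[k], P_filt[k]
    # Steps (6)-(7): backward pass from the last time step.
    x_smooth = x_filt.copy()
    for k in range(n - 2, -1, -1):
        G = P_filt[k] * A / P_pred[k + 1]
        x_smooth[k] = x_filt[k] + G * (x_smooth[k + 1] - x_pred[k + 1])
    return x_filt, x_smooth

# Synthetic seasonal signal with additive measurement noise.
rng = np.random.default_rng(0)
truth = 80.0 + 20.0 * np.sin(np.linspace(0.0, 4.0 * np.pi, 200))
noisy = truth + rng.normal(0.0, 5.0, size=200)
filtered, smoothed = kalman_rts_smooth(noisy, Q=1.0, R=25.0)
```

The forward pass uses only past observations and could run in real time, whereas the backward pass needs the full series and therefore applies to offline cleaning of historical data.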

Table 2. Comparison of six classifiers and the stacking ensemble before and after hyperparameter tuning.

Classifier	Accuracy (Before Tuning)	Accuracy (After Tuning)
LightGBM	92.43%	93.38%
XGBoost	92.85%	93.21%
CatBoost	92.12%	92.78%
Gradient Boosting	93.32%	93.89%
Random Forest	90.13%	91.78%
Ridge Regression	90.33%	91.07%
Stacking	93.89%	94.17%

Table 3. Model comparison and evaluation.

Model	R²	MSE	MAE	MAPE (%)
LightGBM	90.38%	104.57	5.753	8.71
XGBoost	93.21%	73.744	5.474	8.69
CatBoost	92.78%	78.434	5.570	8.96
Gradient Boosting	93.89%	105.020	7.619	9.03
Random Forest	93.18%	89.235	5.253	8.09
Stacking	94.17%	67.442	5.008	7.79
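
For reference, the four metrics in Table 3 can be computed as in the sketch below, using their standard definitions. The function name `regression_report` and the toy values are assumptions for illustration. R² is returned as a fraction, whereas Table 3 reports it as a percentage.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Return R2 (fraction), MSE, MAE, and MAPE (percent)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    # MAPE is undefined when a true value is zero; AQI values are positive.
    mape = 100.0 * np.mean(np.abs(err) / np.abs(y_true))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"R2": r2, "MSE": mse, "MAE": mae, "MAPE": mape}

# Toy values chosen so the arithmetic is easy to follow by hand.
report = regression_report(
    [50.0, 100.0, 150.0, 200.0],
    [55.0, 90.0, 150.0, 210.0],
)
# MSE = 56.25, MAE = 6.25, MAPE = 6.25 %, R2 = 0.982
```

Because MSE squares the errors, it penalizes occasional large misses more heavily than MAE, so the two metrics need not rank models identically.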

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Xu, Z.; Zhang, H.; Zhai, A.; Kong, C.; Zhang, J. Stacking Ensemble Learning and SHAP-Based Insights for Urban Air Quality Forecasting: Evidence from Shenyang and Global Implications. Atmosphere 2025, 16, 776. https://doi.org/10.3390/atmos16070776

