Article

A Hybrid STL-Based Ensemble Model for PM2.5 Forecasting in Pakistani Cities

1
Department of Statistics, University of Sindh, Hyderabad 76080, Pakistan
2
Department of Statistics, Quaid-i-Azam University, Islamabad 45320, Pakistan
3
Department of Mathematics and Statistics, College of Science, Imam Mohammad Ibn Saud Islamic University (IMSIU), Riyadh 11432, Saudi Arabia
4
Department of Statistics, University of Peshawar, Peshawar 25120, Pakistan
5
Department of Statistics, Federal University of Bahia, Salvador 40170-110, Brazil
*
Author to whom correspondence should be addressed.
Symmetry 2025, 17(11), 1827; https://doi.org/10.3390/sym17111827 (registering DOI)
Submission received: 24 September 2025 / Revised: 22 October 2025 / Accepted: 26 October 2025 / Published: 31 October 2025
(This article belongs to the Special Issue Unlocking the Power of Probability and Statistics for Symmetry)

Abstract

Air pollution, particularly fine particulate matter (PM2.5), poses severe risks to human health and the environment in densely populated urban areas. Accurate short-term forecasting of PM2.5 concentrations is therefore crucial for timely public health advisories and effective mitigation strategies. This work proposes a hybrid approach that combines machine learning models with STL decomposition to provide precise short-term PM2.5 predictions. Daily PM2.5 series from four major Pakistani cities (Islamabad, Lahore, Karachi, and Peshawar) are first pre-processed to handle missing values, outliers, and variance instability. The data are then decomposed via seasonal-trend decomposition using Loess (STL), which explicitly exploits the symmetric and recurrent structure of seasonal patterns. Each decomposed component (trend, seasonality, and remainder) is modeled independently using an ensemble of statistical and machine learning approaches. Forecasts are combined through a weighted aggregation scheme that balances bias–variance trade-offs and preserves distributional consistency. The final recombined forecasts provide one-day-ahead PM2.5 predictions with associated uncertainty measures. Model evaluation employs multiple statistical accuracy metrics, distributional diagnostics, and out-of-sample validation. The results demonstrate that the proposed framework consistently outperforms conventional benchmark models, yielding robust, interpretable, and probabilistically coherent forecasts. This study demonstrates how decomposition of periodic and recurrent seasonal structure, combined with probabilistic ensemble methods, enhances the statistical modeling of environmental time series, offering actionable insights for urban air quality management.

1. Introduction

Breathing clean air is essential to human health, as air quality directly impacts the respiratory system and overall well-being. Meteorological and atmospheric factors have a significant impact on air quality and introduce fluctuations in PM2.5 concentrations. Primary NOx, SO2, and VOC emissions combine through photochemical reactions in the presence of sunlight to generate secondary aerosols, while weather factors, including temperature, humidity, and boundary layer height, also affect dispersion. Knowing these mechanisms provides a physical foundation for the temporal patterns. Prolonged exposure to polluted air poses significant health risks beyond the respiratory tract, including increased mortality, the onset of chronic illnesses, disabilities, and substantial socioeconomic burdens on health systems worldwide. Consequently, air quality monitoring and forecasting have become critical components of public health strategies at both national and global levels [1,2,3].
Over the past few decades, rapid population growth, urbanization, and deforestation have contributed to elevated levels of air pollution [4,5,6,7]. Air pollution is generally defined as the presence of harmful chemical, physical, or biological substances in the atmosphere. Among various pollutants, particulate matter (PM) remains one of the most widely used and reliable indicators of air quality. PM is a complex mixture of solid and liquid particles suspended in the air, often including nitrogen oxides (NOx), carbon oxides (COx), sulfur oxides (SOx), ozone (O3), volatile organic compounds (VOCs), and secondary aerosols. Based on aerodynamic diameter, PM is typically classified into PM10 (particles smaller than 10 µm) and PM2.5 (particles smaller than 2.5 µm) [8].
The health risks associated with particulate matter depend strongly on particle size. PM10 is primarily deposited in the upper respiratory tract and digestive system. By contrast, PM2.5 particles, due to their smaller size, can penetrate deep into the lungs and cross into the bloodstream. This enables them to exert systemic effects, including increased risks of cardiovascular disease, ischemic heart disease, renal disorders, diabetes, and endocrine dysfunctions, as well as neurological conditions such as Alzheimer’s disease, Parkinson’s disease, and multiple sclerosis [9,10]. The World Health Organization (WHO) estimates that exposure to fine particulate matter contributes to approximately 7 million deaths annually, making air pollution one of the leading global health risk factors. In Europe alone, PM2.5 is responsible for nearly 904,000 premature deaths per year, with projections indicating significant increases by 2050 [11]. These alarming statistics underscore the urgency of developing effective monitoring, forecasting, and intervention strategies.
Air pollution dynamics are influenced not only by emissions but also by meteorological and climatic conditions, such as wind speed, temperature, relative humidity, rainfall, atmospheric pressure, and ultraviolet (UV) radiation duration. Consequently, robust air quality monitoring and forecasting systems must integrate both pollution sources and meteorological drivers to provide accurate and reliable predictions [12]. From a probabilistic and statistical perspective, such systems must also capture underlying structures in the data, including symmetries in seasonal patterns, cyclical fluctuations, and approximate symmetry of error distributions.
Pakistan, like many other developing countries, faces significant challenges due to deteriorating air quality. Studies suggest that nearly two-thirds of the country experiences unhealthy air, primarily driven by elevated PM2.5 concentrations, which impose a substantial burden on public health. Respiratory diseases have the most immediate impact, with 57% of chronic obstructive pulmonary disease (COPD) cases and 40% of lower respiratory infections attributed to poor air quality. Beyond respiratory health, prolonged exposure to PM2.5 has also been linked to non-respiratory conditions such as diabetes, cardiovascular disorders, and ischemic heart disease [13]. These findings highlight the need for statistically rigorous and symmetry-aware forecasting frameworks to support evidence-based interventions and policy design.
Air quality forecasting, particularly of PM2.5 concentrations and the Air Quality Index (AQI), has gained significant attention in environmental health research. Accurate forecasts enable the timely issuance of health advisories, informed policy decisions, and the efficient implementation of mitigation strategies. Over the past two decades, a wide range of modeling techniques has been employed, spanning traditional statistical methods (e.g., autoregressive and regression-based models) to advanced machine learning (ML) and deep learning (DL) approaches, each with its own strengths and limitations across different spatiotemporal contexts.
For instance, Mahajan et al. [14] assessed time series forecasting models to predict PM2.5 concentrations in Taichung City, Taiwan. Using hourly data from Air Box microsensor devices, they compared Autoregressive Integrated Moving Average (ARIMA), Neural Network Autoregression (NNAR), and Holt-Winters (HW) models. Based on performance metrics such as the root mean square error (RMSE) and the mean absolute error (MAE), the NNAR model achieved the best predictive accuracy across monitoring stations. Similarly, Mani and Viswanadhapalli [15] investigated AQI forecasting in Chennai, India, using daily data from the Central Pollution Control Board (CPCB) between 2018 and 2020. After applying preprocessing techniques such as missing value imputation and variance stabilization, they demonstrated that ARIMA outperformed multiple linear regression (MLR) in providing stable and reliable forecasts.
In recent years, deep learning methods have emerged as particularly powerful for capturing complex nonlinear dynamics in air quality data. Cassano et al. [16] employed recurrent neural networks (RNNs), including long short-term memory (LSTM) and gated recurrent unit (GRU) architectures, for modeling air quality in Italy’s Apulia region. Leveraging pollutant and meteorological data from the Regional Environmental Protection Agency (ARPA), they demonstrated superior performance of RNNs, particularly in station-specific forecasts. Similarly, Hossain et al. [17] proposed a hybrid GRU–LSTM model for predicting AQI in Dhaka and Chattogram, Bangladesh, based on three years of data (2017–2020). Their hybrid approach consistently outperformed standalone GRU and LSTM models, as measured via a lower RMSE, the MAE, and the mean squared error (MSE). In Beijing, Zhang et al. [18] introduced a hybrid convolutional neural network–LSTM (CNN-LSTM) model that effectively captures both spatial and temporal dependencies, surpassing traditional models such as ARMA, SARIMA, and individual RNN variants.
Efforts to optimize model efficiency have also included the use of advanced machine learning techniques. Liu et al. [19] developed a genetic algorithm-based kernel extreme learning machine (GA-KELM), trained on real-world air quality data from 2019 to 2021. Their model outperformed support vector regression (SVR), Deep Belief Networks (DBN-BP), and the Community Multiscale Air Quality (CMAQ) model in both accuracy and computational speed. However, it required manual tuning of hidden parameters. In the Pakistani context, Iftikhar et al. [20] introduced three ensemble-based models—ESMT, ESME, and ESMV—for AQI forecasting using five years of PM2.5 data from major cities. Their results indicated that ensemble approaches, particularly ESMV, consistently outperformed benchmark models, although the omission of meteorological and gaseous predictors revealed a scope for improvement.
Taken together, these studies reflect a global shift toward hybrid and ensemble modeling approaches in air quality forecasting. From a statistical standpoint, such approaches exploit symmetries in temporal structures (e.g., seasonal recurrences and cyclic regularities) while accommodating asymmetries in pollutant spikes and distributional tails. While traditional models, such as ARIMA and MLR, continue to provide valuable baselines, neural network–based methods—especially those integrating LSTM, GRU, CNN, and optimization algorithms—demonstrate strong potential for addressing the nonlinear and multifaceted nature of air pollution. For environmental health applications, the adoption of symmetry-aware probabilistic and statistical models holds promise for more accurate, interpretable, and actionable forecasts, ultimately supporting improved public health outcomes and sustainable urban air quality management. However, this study defines symmetry as balanced error behavior and recurrent seasonal regularity, rather than as precise temporal symmetry, in contrast to previous works.
The rest of this paper is organized as follows. Section 2 presents the proposed method, including data preprocessing, STL decomposition, ensemble model construction, and the probabilistic–statistical framework for forecasting. Section 3 reports the experimental setup, evaluation metrics, and comparative results with benchmark models, highlighting improvements in accuracy, robustness, and symmetry-based statistical interpretation. Section 4 provides an in-depth discussion of the findings, their implications for environmental health management, and comparisons with prior studies. Finally, Section 6 concludes the paper by summarizing the main contributions, outlining policy implications, and suggesting avenues for future research.

2. Methods and Materials

This study proposes a probabilistic decomposition-based ensemble forecasting framework for predicting daily PM2.5 concentrations in four major Pakistani megacities—Islamabad, Lahore, Karachi, and Peshawar—over a one-day horizon. Rapid variations in emissions, boundary-layer dynamics, and climatic factors influencing short-term dispersion and accumulation are the causes of PM2.5 volatility. The approach is motivated by the theme of symmetry; structural decomposition reveals balanced patterns (trend, seasonality, irregularity), and ensemble integration exploits the complementary strengths of machine learning models through statistical aggregation.
The methodology is organized into three stages: data preprocessing, probabilistic decomposition, and component-wise forecasting with symmetric reconstruction.
  • Data Preprocessing: The observed daily PM2.5 concentrations (2019–2023) are preprocessed to ensure reliability. Missing observations are imputed using the Multiple Imputation by Chained Equations (MICE) procedure, which draws imputations from conditional distributions to preserve statistical symmetry and variability in the data (Section 2.1).
  • Probabilistic Decomposition: Seasonal-trend decomposition based on Loess (STL) is applied to partition the original series into three additive components: trend ($T_t$), seasonal ($S_t$), and remainder ($R_t$). This decomposition isolates the deterministic and stochastic parts of the signal [21].
  • Component Forecasting and Temporal Reconstruction: Each component is independently modeled using four probabilistic machine learning algorithms—support vector regression (SVR), extreme gradient boosting (XGBoost), the Neural Network Autoregressive model (NNETAR), and extreme learning machine (ELM). Their forecasts are symmetrically aggregated to reconstruct the final PM2.5 prediction (Section 2.3).
This design enhances robustness by aligning probabilistic inference with symmetry-based decomposition.

2.1. Multiple Imputation via Chained Equations (MICE)

Let the dataset be $Y = \{Y_1, Y_2, \ldots, Y_p\}$, where each variable $Y_j$ may contain missing values. The MICE algorithm generates $m$ plausible completed datasets by iteratively imputing missing entries from predictive distributions conditioned on the observed data:
$$Y_j^{(\mathrm{mis})} \sim \mathrm{PMM}\left(Y_j^{(\mathrm{obs})} \mid Y_{-j}\right),$$
where predictive mean matching (PMM) preserves the distributional consistency between observed and imputed values. The multiple imputations
$$\{\hat{Y}^{(1)}, \hat{Y}^{(2)}, \ldots, \hat{Y}^{(m)}\}$$
represent posterior samples, thereby accounting for uncertainty in a symmetric Bayesian framework [22].
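As a rough illustration of predictive mean matching for a single variable, the sketch below assumes one fully observed predictor `x` (hypothetical; the text does not state which predictors enter the chained equations). It fits an ordinary least-squares line on the observed pairs and replaces each missing value with an observed value drawn from the `k` donors whose fitted values are closest. It omits the parameter-draw step used by full MICE implementations, so it should be read as a simplification under these assumptions, not as the procedure used in this study.

```python
import random

def pmm_impute(x, y, k=3, seed=0):
    """Impute None entries of y by predictive mean matching on predictor x."""
    rng = random.Random(seed)
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)

    # Ordinary least squares fitted on the observed pairs only.
    mean_x = sum(xi for xi, _ in obs) / n
    mean_y = sum(yi for _, yi in obs) / n
    sxx = sum((xi - mean_x) ** 2 for xi, _ in obs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in obs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    def predict(xi):
        return intercept + slope * xi

    completed = []
    for xi, yi in zip(x, y):
        if yi is not None:
            completed.append(yi)
            continue
        target = predict(xi)
        # Donors: observed cases whose predicted means are closest to the target.
        donors = sorted(obs, key=lambda pair: abs(predict(pair[0]) - target))[:k]
        # The imputed value is an actually observed value, never a model output.
        completed.append(rng.choice(donors)[1])
    return completed

# Made-up toy data, not the study series.
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [52.0, 61.0, None, 80.0, 95.0, None, 110.0, 123.0]
y_completed = pmm_impute(x_demo, y_demo)
```

Calling `pmm_impute` with `m` different seeds would give `m` completed series, which mirrors the multiple-imputation idea in a very reduced form.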

2.2. Mathematical Formulation of Forecasting Models

Each decomposed component ($T_t$, $S_t$, $R_t$) is modeled separately using the following algorithms, with each providing probabilistic estimates respecting data symmetry.

2.2.1. Support Vector Regression (SVR)

SVR constructs a symmetric $\varepsilon$-insensitive loss function, balancing under- and over-estimation:
$$f(x) = w^{T} \phi(x) + b,$$
$$\min_{w,\, b,\, \xi_i,\, \xi_i^{*}} \ \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*}),$$
subject to the symmetric deviation constraints
$$y_i - f(x_i) \le \varepsilon + \xi_i, \quad f(x_i) - y_i \le \varepsilon + \xi_i^{*}, \quad \xi_i,\ \xi_i^{*} \ge 0.$$
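The symmetry of the ε-insensitive loss can be checked numerically. The short sketch below evaluates the loss for a prediction inside the tube and for equal over- and under-predictions outside it. The values and the default `eps=0.1` are illustrative assumptions, not settings reported in this study.

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the eps-tube, linear outside it, identical for over- and under-prediction."""
    return max(0.0, abs(y_true - y_pred) - eps)

inside = eps_insensitive_loss(50.0, 50.05)  # error smaller than eps
over = eps_insensitive_loss(50.0, 52.0)     # over-prediction by 2
under = eps_insensitive_loss(50.0, 48.0)    # under-prediction by 2
```

At the optimum of the constrained problem, the slack variables $\xi_i$ and $\xi_i^{*}$ take exactly these values for under- and over-prediction, respectively.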

2.2.2. Extreme Gradient Boosting (XGBoost)

XGBoost ensembles regression trees via additive updates under a regularized objective:
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F},$$
$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t),$$
where the regularization term $\Omega(f_t)$ ensures balanced complexity–fit symmetry.
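For reference, the regularizer in the standard XGBoost formulation is usually written as below, where $T$ is the number of leaves of the tree $f$ and $w_j$ is the weight of leaf $j$. The penalties $\gamma$ and $\lambda$ are hyperparameters; their values are not reported here, so this shows the general form only, not this study's configuration.

```latex
\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^{2}
```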

2.2.3. Neural Network AutoRegressive Model (NNETAR)

NNETAR introduces nonlinear mapping through symmetric activation functions $h(\cdot)$:
$$y_t = \alpha_0 + \sum_{i=1}^{k} \alpha_i\, h\left(\sum_{j=1}^{p} \beta_{ij}\, y_{t-j} + \beta_{i0}\right) + \varepsilon_t.$$
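The forward pass of this single-hidden-layer autoregression can be written out directly. The sketch below computes a one-step point forecast from $p$ lagged values, with the noise term $\varepsilon_t$ omitted. The logistic activation and all weights are assumptions chosen for illustration; the text specifies neither the activation nor the fitted parameters.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def nnar_forecast(lags, alpha0, alpha, beta, beta0, h=logistic):
    """One-step forecast: alpha0 + sum_i alpha[i] * h(sum_j beta[i][j] * lags[j] + beta0[i]).

    lags[j] holds y_{t-(j+1)}, so lags[0] is the most recent value.
    """
    hidden = [
        h(sum(b_ij * y_lag for b_ij, y_lag in zip(beta_i, lags)) + b_i0)
        for beta_i, b_i0 in zip(beta, beta0)
    ]
    return alpha0 + sum(a_i * h_i for a_i, h_i in zip(alpha, hidden))

# Hypothetical weights for p = 2 lags and k = 2 hidden nodes (not fitted values).
y_hat = nnar_forecast(
    lags=[0.4, 0.1],
    alpha0=0.2,
    alpha=[1.0, -0.5],
    beta=[[0.8, 0.3], [-0.6, 0.2]],
    beta0=[0.0, 0.1],
)
```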

2.2.4. Extreme Learning Machine (ELM)

ELM achieves computational symmetry by randomly assigning input weights and estimating output weights via a pseudoinverse:
$$\hat{y} = H\beta, \quad \beta = H^{\dagger} Y,$$
$$H_{ij} = g(w_j^{T} x_i + b_j).$$
In this work, the balanced structural behavior of PM2.5 time series over equivalent temporal intervals is indicated by temporal symmetry. Auto-correlation pattern stability and the almost zero skewness of residuals derived from STL-decomposed components are used to measure symmetry. The directional neutrality between positive and negative forecast deviations is maintained by using point estimates from each model, rather than a weighted aggregate. The theoretical idea of temporal equilibrium in environmental dynamics is reflected in this method’s preservation of structural stability and cyclical regularity.

2.3. Proposed STL–Decomposition Forecasting Framework

By combining point estimates from several base learners, the proposed ensemble framework produces probabilistic projections. The combined outputs capture the uncertainty in model predictions by producing a central tendency (mean forecast).

2.3.1. STL Decomposition

The observed PM2.5 series $y_t$ is decomposed as follows:
$$y_t = T_t + S_t + R_t,$$
with symmetric additive partitioning.

2.3.2. Component Forecasting

For each component $C \in \{T, S, R\}$ and model $i \in \{\mathrm{SVR}, \mathrm{XGB}, \mathrm{NNETAR}, \mathrm{ELM}\}$:
$$\hat{C}_t^{(i)} = f_{\mathrm{Model}_i}^{(C)}(X_t).$$

2.3.3. Final Symmetric Reconstruction

The overall PM2.5 forecast is reconstructed by summing the aggregated component forecasts:
$$\hat{y}_t = \hat{T}_t + \hat{S}_t + \hat{R}_t.$$
This ensures structural symmetry between decomposition and reconstruction.
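The model labels reported in the results, such as {SVRt + ELMr + NNARs}, suggest that each candidate forecast pairs one model per component. Under that reading, which is an interpretation and not a stated procedure, the reconstruction can be sketched by enumerating all 4 × 4 × 4 = 64 combinations, summing the three component forecasts, and ranking by RMSE. All forecast values below are made up.

```python
from itertools import product
from math import sqrt

MODELS = ["SVR", "XGB", "NNAR", "ELM"]

# Hypothetical two-day forecasts per component and model (made-up numbers).
trend = {"SVR": [100.0, 101.0], "XGB": [99.0, 102.0], "NNAR": [100.5, 100.5], "ELM": [100.0, 100.0]}
seasonal = {"SVR": [5.0, -4.0], "XGB": [4.0, -5.0], "NNAR": [5.5, -4.5], "ELM": [6.0, -3.0]}
remainder = {"SVR": [1.0, 0.0], "XGB": [0.0, 1.0], "NNAR": [0.5, 0.5], "ELM": [1.5, -0.5]}
observed = [107.0, 96.0]

def rmse(actual, predicted):
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def reconstruct(t_hat, s_hat, r_hat):
    """y_hat_t = T_hat_t + S_hat_t + R_hat_t, element by element."""
    return [t + s + r for t, s, r in zip(t_hat, s_hat, r_hat)]

ranking = []
for mt, mr, ms in product(MODELS, repeat=3):
    y_hat = reconstruct(trend[mt], seasonal[ms], remainder[mr])
    label = f"{mt}t + {mr}r + {ms}s"  # same naming pattern as the result tables
    ranking.append((rmse(observed, y_hat), label))

ranking.sort()  # smallest RMSE first
best_rmse, best_label = ranking[0]
```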

2.4. Evaluation Metrics

To assess forecasting accuracy, five symmetric and probabilistic metrics are employed (Table 1). These metrics quantify both scale-dependent error (MAE, RMSE) and relative symmetric performance (sMAPE, MDA), while the correlation coefficient r captures probabilistic dependence [23,24,25,26].
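Since the definitions in Table 1 are not reproduced here, the sketch below uses common textbook forms of the five metrics. In particular, the sMAPE denominator (mean of absolute actual and forecast values) and the MDA convention (direction of change relative to the previous actual value) are assumptions; other variants exist and may differ from those in Table 1. The sample values are made up.

```python
from math import sqrt

def mae(y, f):
    return sum(abs(a - b) for a, b in zip(y, f)) / len(y)

def rmse(y, f):
    return sqrt(sum((a - b) ** 2 for a, b in zip(y, f)) / len(y))

def smape(y, f):
    """Percent; denominator is the mean of |y| and |f| (one common variant)."""
    return 100.0 / len(y) * sum(abs(a - b) / ((abs(a) + abs(b)) / 2) for a, b in zip(y, f))

def sign(v):
    return (v > 0) - (v < 0)

def mda(y, f):
    """Share of steps where the forecast change from y[t-1] has the same sign as the actual change."""
    hits = sum(sign(y[t] - y[t - 1]) == sign(f[t] - y[t - 1]) for t in range(1, len(y)))
    return hits / (len(y) - 1)

def pearson_r(y, f):
    my, mf = sum(y) / len(y), sum(f) / len(f)
    cov = sum((a - my) * (b - mf) for a, b in zip(y, f))
    return cov / sqrt(sum((a - my) ** 2 for a in y) * sum((b - mf) ** 2 for b in f))

# Made-up values for illustration only.
actual = [100.0, 110.0, 105.0, 120.0]
forecast = [98.0, 108.0, 107.0, 118.0]
scores = {
    "MAE": mae(actual, forecast),
    "RMSE": rmse(actual, forecast),
    "sMAPE": smape(actual, forecast),
    "MDA": mda(actual, forecast),
    "r": pearson_r(actual, forecast),
}
```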

3. Results

Table 2 presents detailed information on the daily PM2.5 datasets collected from four major Pakistani cities—Islamabad, Lahore, Karachi, and Peshawar—covering the period from 2019 to 2023. Each dataset contains 1461 daily entries, ensuring temporal uniformity across locations. The number of observed (non-missing) data points varies slightly, with Islamabad recording 1430, Lahore 1418, Karachi 1426, and Peshawar 1433 valid observations. Consequently, the proportion of missing values remains relatively small, ranging between 1.92 % (Peshawar) and 2.94 % (Lahore). These missing observations were subsequently imputed using the MICE algorithm, which preserves the underlying statistical distribution by generating plausible values from conditional probability models. The low percentage of imputation ensures high data reliability and symmetry of information across all cities, thereby enhancing the robustness of subsequent inferential analysis and comparative modeling.
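The reported missing-value shares follow directly from the counts above; a quick check, using only the figures quoted in this paragraph:

```python
TOTAL_DAYS = 1461  # daily entries per city, as stated above
observed = {"Islamabad": 1430, "Lahore": 1418, "Karachi": 1426, "Peshawar": 1433}

# Percentage of missing days per city, rounded to two decimals.
missing_pct = {
    city: round(100 * (TOTAL_DAYS - n_obs) / TOTAL_DAYS, 2)
    for city, n_obs in observed.items()
}
```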
Table 3 provides a comprehensive set of descriptive statistics for each city, including raw values and their logarithmic transformations. All variable units are listed in Table 3; the variances are expressed in squared concentration units, (µg/m³)², which is why they are numerically large: they indicate dispersion rather than magnitude. The raw distributions reveal heterogeneity across cities: Lahore exhibits the widest range (minimum 5.00, maximum 503.00) with the largest variance (7257.72) and standard deviation (85.19), suggesting pronounced dispersion and potential outliers. Peshawar records the overall maximum PM2.5 value (557.00), reflecting extreme pollution episodes and elevated variability. In contrast, Islamabad and Karachi exhibit comparatively moderate levels (means of 112.34 and 109.97, respectively) and smaller variances, indicating more stable pollution dynamics. From a probabilistic perspective, skewness and kurtosis statistics highlight the symmetry properties of the distributions. Islamabad and Karachi exhibit relatively balanced distributions (skewness of 0.43 and 0.67, respectively, and kurtosis of 2.94 and 2.76), indicating near-symmetric patterns with moderate tails. Conversely, Lahore and Peshawar show stronger positive skewness (1.12 and 1.44) and higher kurtosis (4.18 and 8.60), consistent with asymmetric and leptokurtic behavior, where extreme events occur more frequently than in Gaussian processes.
Logarithmic transformation introduces statistical symmetry by stabilizing variance and reducing extreme values, as evidenced by smaller variances and lower dispersion in log-space. The negative skewness of the log-transformed variables further suggests that the transformation mitigates right-skewness, restoring distributional balance. This aligns with the symmetry theme by demonstrating how transformations recover equilibrium in statistical moments and improve interpretability for probabilistic modeling. However, the results of augmented Dickey–Fuller (ADF) and differenced ADF (D_ADF) tests confirm strong stationarity across all series, with test statistics remaining strongly negative. This indicates that the PM2.5 series, once detrended or differenced, attain stable statistical symmetry in their stochastic structure, a crucial prerequisite for time series forecasting.
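The effect of the logarithm on skewness can be illustrated with a small right-skewed sample. The values below are made up, not the study data, and the skewness estimator is the simple moment ratio without small-sample correction, which may differ from the estimator behind Table 3.

```python
from math import log

def skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5 (no small-sample correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up right-skewed concentrations in µg/m³ with one extreme episode.
raw = [5.0, 20.0, 40.0, 60.0, 80.0, 100.0, 150.0, 250.0, 503.0]
logged = [log(v) for v in raw]

skew_raw = skewness(raw)     # positive: long right tail
skew_log = skewness(logged)  # pulled toward, and here past, zero
```

In this toy sample the transformed skewness is negative, the same qualitative pattern described above for the log-transformed series.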
The concept of symmetry, central to the theme of this study, is examined both statistically and temporally within the PM2.5 time series. From a statistical perspective, symmetry refers to the balanced distribution of concentration values around the central tendency. This property is assessed with descriptive statistics such as skewness and kurtosis (Table 3). The raw series depart from this balance because of high-pollution episodes; however, the logarithmic transformation substantially reduces skewness and excess kurtosis across all cities, thereby improving the symmetry of the distributions. This enhanced statistical symmetry stabilizes the variance and mitigates the influence of outliers, improving the performance of subsequent machine learning models.
From a temporal perspective, symmetry manifests through the repetitive, cyclic patterns captured via the STL decomposition (Figure 1). The seasonal component exhibits a nearly symmetric oscillation around the long-term trend, reflecting recurring emission cycles and meteorological regularities. The decomposition thereby isolates symmetric temporal fluctuations from irregular residual variations, which enables more interpretable and stable forecasting. In addition, the proposed STL-based hybrid framework exploits these symmetrical characteristics through its ensemble design. Models with distinct inductive biases—SVR and ELM for linear relationships, and NNAR for nonlinear dynamics—are integrated to maintain equilibrium between underfitting and overfitting tendencies. This balanced modeling structure effectively enforces a probabilistic form of symmetry in the error distribution, as reflected in the nearly unbiased residuals and improved directional accuracy (MDA) across all cities. Collectively, these findings confirm that both statistical and temporal symmetry are intrinsic to the PM2.5 series and are strategically leveraged within the proposed hybrid framework to enhance predictive precision and interpretability.
In air quality, temporal symmetry refers to recurrent seasonal or diurnal patterns that exhibit steady structural repetition over time. Figure 1 presents seasonal subseries plots to evaluate temporal symmetry in PM2.5: Karachi and Peshawar show inconsistencies that are later captured in the STL remainder components, while Lahore and Islamabad show moderately similar patterns. A comparative assessment of the original and MICE-imputed time series of PM2.5 concentrations for the four major cities, Islamabad, Karachi, Lahore, and Peshawar, together with their corresponding STL decompositions, is provided in Figure 1. The imputed series, as displayed in Figure 1a,c,e,g, exhibit a high degree of concordance with the original data, successfully preserving underlying temporal patterns and abrupt fluctuations. This alignment demonstrates that the MICE procedure accurately restored missing values while maintaining the statistical symmetry of the series, i.e., the balance between smooth cyclic components and irregular stochastic variations. Here, the term “volatile environments” refers to urban contexts characterized by high temporal variability and abrupt fluctuations in PM2.5 concentrations, typically driven by short-term anthropogenic and meteorological influences such as irregular industrial emissions, traffic congestion peaks, biomass burning, dust storms, and temperature inversions. Notably, the imputation preserved the integrity of extreme values in such volatile environments, particularly in Lahore and Peshawar, where sudden pollution spikes reflect episodic urban activity and meteorological shocks. In contrast, relatively stable environments like Islamabad exhibit smoother and more symmetric seasonal cycles with limited short-term disturbances.
Recognizing these differences is essential, as hybrid ensemble models must adapt to both symmetric and asymmetric fluctuation patterns to maintain forecasting accuracy across diverse urban conditions.
The subsequent STL decompositions, shown in Figure 1b,d,f,h, further validate the reliability of the imputation step. In each case, the decomposition successfully disentangled the trend, seasonal, and remainder components, ensuring that the structural symmetry between long-term drift, cyclic oscillations, and irregular shocks was preserved. Islamabad and Lahore reveal smooth seasonal cycles and steady trend evolution, suggesting symmetric and stable temporal dynamics. By contrast, Karachi and Peshawar exhibit greater short-term variability, implying asymmetries introduced via external factors such as industrial activity or meteorological dispersion effects.
Taken together, these results confirm that the MICE-based imputation not only filled missing values without distorting the temporal structure but also maintained the probabilistic and structural symmetry required for reliable downstream modeling. The preservation of trend–seasonal balance and the accurate representation of extreme events ensures that the reconstructed series provides a statistically consistent foundation for subsequent hybrid ensemble forecasting of PM2.5 concentrations. However, it is acknowledged that validating the imputation using the same dataset may not provide a fully independent measure of reliability. Therefore, in this study, the evaluation of MICE-imputed PM2.5 values emphasizes internal consistency, rather than external validation. Specifically, we assessed the reliability of the imputation by comparing the statistical distributions, temporal dynamics, and decomposed STL components between the original and imputed series. The close alignment of these characteristics suggests that the MICE procedure effectively preserves the intrinsic temporal and probabilistic structure of the data without introducing artificial bias. Nevertheless, future studies could further strengthen imputation validation by employing external datasets, such as concurrent meteorological observations or adjacent monitoring station records, to independently assess imputation fidelity.

3.1. Islamabad

For Islamabad, several statistical indicators—MAE, RMSE, correlation, sMAPE, and MDA—were employed to evaluate the STL-decomposed forecasting frameworks. The comparative results, summarized in Table 4, reveal that the four leading hybrid configurations were as follows: {SVRt + ELMr + NNARs}, {NNARt + ELMr + NNARs}, {ELMt + ELMr + NNARs}, and {XGBt + ELMr + NNARs}. Among these, the {SVRt + ELMr + NNARs} model emerged as the most effective, attaining the lowest sMAPE (7.56) and RMSE (7.91), the highest MDA (0.838), and a strong correlation (0.969). This balanced performance illustrates its ability to capture both the symmetry of forecast alignment and probabilistic directional consistency. The near-identical accuracy of the second- and third-ranked models, {NNARt + ELMr + NNARs} and {ELMt + ELMr + NNARs}, highlights the robustness of ELM and NNAR integration. Finally, the {XGBt + ELMr + NNARs} specification also performed competitively (RMSE: 7.99; correlation: 0.969), underscoring the utility of boosting strategies for trend representation. Overall, these findings confirm that statistical symmetry between error minimization and directional forecasting can be effectively achieved using ensemble architectures.

3.2. Karachi

For Karachi, a consistent performance pattern emerges across STL-based hybrid configurations, as detailed in Table 5. The top four models were identified as follows: {ELMt + ELMr + NNARs}, {NNARt + ELMr + NNARs}, {SVRt + ELMr + NNARs}, and {XGBt + ELMr + NNARs}. For the most effective among these, {ELMt + ELMr + NNARs}, the lowest RMSE (7.652) and MAE (6.005) were reported, together with a favorable sMAPE (7.590) and high correlation (0.975). By employing ELM for both trend and residual components and NNAR for seasonality, this model effectively balanced nonlinear and linear dynamics, reflecting probabilistic symmetry in capturing both smooth cycles and stochastic fluctuations. The next two models, {NNARt + ELMr + NNARs} and {SVRt + ELMr + NNARs}, attained slightly higher RMSE values but exhibited identical correlation (0.975) and MDA (0.798), demonstrating equivalent directional symmetry. The {XGBt + ELMr + NNARs} model also achieved stable correlation and MDA scores, though with modestly higher error magnitudes. These outcomes reinforce the effectiveness of combining statistical learning algorithms (ELM, SVR, NNAR, XGBoost) in decomposition-based hybrid structures for air quality forecasting.

3.3. Lahore

In Lahore, model evaluation emphasized RMSE due to its sensitivity to large deviations, which is particularly relevant, given the volatility of PM2.5 levels in the region. As reported in Table 6, the {ELMt + ELMr + NNARs} configuration yielded the lowest RMSE (12.475), alongside the lowest MAE (10.120) and sMAPE (8.340). Despite exhibiting a slightly lower MDA (0.808) than its nearest rival, this model demonstrated probabilistic superiority by consistently reducing both absolute and relative forecast errors. The second-best specification, {SVRt + ELMr + NNARs}, achieved a competitive RMSE (12.521), while recording the highest MDA (0.818) and a correlation of 0.979. This indicates enhanced directional symmetry despite marginally higher error variance. The other contenders, {XGBt + ELMr + NNARs} and {NNARt + ELMr + NNARs}, posted RMSE values of 12.556 and 12.819, respectively, but they retained comparable correlation and MDA values. Collectively, these results highlight the probabilistic trade-off between minimizing error magnitudes and maximizing directional alignment, reinforcing the reliability of ELM-based ensemble schemes in volatile urban environments.

3.4. Peshawar

For Peshawar, four STL-hybrid frameworks, each employing SVR for the residual component, were evaluated (Table 7). The leading model, {SVRt + SVRr + NNARs}, reported the lowest RMSE (13.585), MAE (10.871), and a competitive sMAPE (9.902), confirming its accuracy in capturing both absolute and relative forecast structures. The {ELMt + SVRr + NNARs} variant achieved a nearly identical RMSE (13.620) and MAE (10.898) while outperforming in terms of MDA (0.779), suggesting enhanced directional symmetry despite slightly higher error levels. The remaining specifications, {NNARt + SVRr + NNARs} and {XGBt + SVRr + NNARs}, achieved RMSE scores of 13.621 and 13.651, respectively, but they did not surpass the SVR-dominated frameworks in error minimization. These findings demonstrate that using SVR consistently for both trend and residual components yields robust probabilistic accuracy, while NNAR contributes additional symmetry in capturing seasonality.
A comparative visualization of original (imputed) and predicted PM2.5 levels for the four cities is presented in Figure 2. For Islamabad (Figure 2a), forecasts nearly overlap with the imputed series, showing symmetric alignment with seasonal cycles. Karachi (Figure 2b) exhibits higher short-term variability; the models slightly underestimate extreme spikes but capture the general probabilistic distribution. In Lahore (Figure 2c), high oscillations and peak concentrations reveal a challenge in capturing extreme values, highlighting the asymmetry introduced by outliers. Finally, Peshawar (Figure 2d) demonstrates the accurate detection of a significant concentration peak, indicating strong responsiveness to abrupt shifts. Taken together, the results suggest that the models exhibit statistical symmetry and robustness in relatively stable contexts (Islamabad, Peshawar). In contrast, adaptive or hybrid refinements are necessary for more volatile settings (Lahore, Karachi) to enhance extreme-value representation and probabilistic reliability.

4. Discussion

This study used four machine learning algorithms to forecast PM2.5 concentrations in four major Pakistani cities (Peshawar, Lahore, Karachi, and Islamabad) and compared the forecasts with the imputed (actual) series. SVR, ELM, NNAR, and XGBoost were used in the proposed hybrid framework. Performance measures, including RMSE, MAE, correlation, sMAPE, and MDA, together with time series plots (actual vs. predicted), were used to assess the efficiency of these models empirically and visually. Additionally, this section evaluates in detail the best-performing PM2.5 forecasting model proposed in this study and compares it with baseline methods and the best models found in the literature. To enhance air quality predictions and management, this section also outlines potential avenues for future research and provides practical policy recommendations.

4.1. Comparative Studies with Benchmark Models and Existing Literature

To assess the robustness and practical relevance of the proposed hybrid forecasting framework, a comparative evaluation against four widely used benchmark models—support vector regression (SVR), extreme gradient boosting (XGBoost), neural network autoregression (NNAR), and extreme learning machine (ELM)—was conducted across the four urban centers of Islamabad, Karachi, Lahore, and Peshawar. The performance was examined using a comprehensive suite of accuracy metrics, including the mean absolute error (MAE), the root mean squared error (RMSE), Pearson correlation, the symmetric mean absolute percentage error (sMAPE), and the mean directional accuracy (MDA). These measures collectively capture both point-wise accuracy and directional predictive reliability, thereby addressing the probabilistic and distributional symmetry underlying forecast errors.
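For readers who wish to reproduce the accuracy measures, the following minimal sketch implements the five metrics as defined in Table 1 using only the Python standard library. The function names and the toy series are illustrative assumptions and are not data or code from this study.

```python
import math


def mae(y, yhat):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(y, yhat)) / len(y)


def rmse(y, yhat):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(y, yhat)) / len(y))


def pearson_r(y, yhat):
    """Pearson correlation between observations and forecasts."""
    my, mf = sum(y) / len(y), sum(yhat) / len(yhat)
    cov = sum((a - my) * (f - mf) for a, f in zip(y, yhat))
    var_y = sum((a - my) ** 2 for a in y)
    var_f = sum((f - mf) ** 2 for f in yhat)
    return cov / math.sqrt(var_y * var_f)


def smape(y, yhat):
    """Symmetric MAPE in percent; assumes |y_t| + |yhat_t| > 0 for every t."""
    terms = (abs(a - f) / ((abs(a) + abs(f)) / 2) for a, f in zip(y, yhat))
    return 100.0 * sum(terms) / len(y)


def mda(y, yhat):
    """Share of steps where observed and forecast changes have the same sign."""
    hits = sum(
        1
        for t in range(1, len(y))
        if (y[t] - y[t - 1]) * (yhat[t] - yhat[t - 1]) > 0
    )
    return hits / (len(y) - 1)


# Made-up toy values, not data from the study.
y_obs = [100.0, 120.0, 90.0, 110.0]
y_fit = [105.0, 115.0, 95.0, 100.0]
print(mae(y_obs, y_fit), rmse(y_obs, y_fit), mda(y_obs, y_fit))
```

MAE, RMSE, and sMAPE summarize the size of the errors, whereas MDA only checks whether the forecast moves in the same direction as the observations, which is why the two groups of measures can rank models differently.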
Table 8 presents a detailed comparative analysis. In Islamabad, the proposed model demonstrates substantial improvements over conventional baselines, achieving an RMSE of 7.912 and a high correlation of 0.969, compared with the nearest competing model (ELM), which yields an RMSE of 21.563 and a correlation of only 0.667. This reflects the ability of the proposed hybrid decomposition–ensemble strategy to capture symmetric seasonal patterns and reduce asymmetric error deviations, thereby yielding more reliable forecasts. Similarly, in Karachi, where pollution exhibits dynamic fluctuations, the proposed model attains an RMSE of 7.652 and an exceptionally high correlation of 0.975. This outperformance demonstrates how the hybrid model maintains structural balance between the trend and remainder components, aligning with the symmetry properties inherent to complex urban time series.
Lahore poses the most challenging forecasting environment due to its highly volatile and heterogeneous PM2.5 patterns. While benchmark models yield inconsistent results and a larger dispersion of errors, the proposed approach achieves an RMSE of 12.475 and a correlation coefficient of 0.979. The symmetric formulation of sMAPE in this case is particularly valuable, as it penalizes over- and under-predictions in a balanced manner, underscoring the framework’s superior capability to model asymmetric distributions. In Peshawar, despite extreme episodes of pollution, the proposed model again exhibits resilience, recording an RMSE of 13.585 and a correlation coefficient of 0.837, while also achieving the highest MDA, highlighting its predictive stability in the face of directional changes.
A broader comparison with the existing literature suggests that classical machine learning models (e.g., SVR, XGBoost) often fail to capture the nonlinearities and seasonal symmetries inherent to PM2.5 series, resulting in higher error magnitudes and reduced correlation. By contrast, the proposed hybrid approach systematically integrates decomposition, statistical learning, and ensemble weighting to exploit both temporal symmetry and probabilistic consistency. The consistent superiority across all four cities supports the model’s adaptability, interpretability, and robustness. Our methodology also ensures that positive and negative errors are treated equally; this should not, however, be interpreted as evidence of temporal symmetry in the data. The gains are attributable to decomposition and ensemble integration rather than to symmetry characteristics. More importantly, the integration of symmetric error metrics, such as sMAPE and balanced correlation analysis, aligns with the objectives of the Symmetry special issue, illustrating how probabilistic and statistical perspectives on symmetry enhance predictive performance and strengthen decision-making in urban air quality management. Overall, the hybrid STL–ensemble technique provides a data-driven approach to managing urban air quality while improving forecasting accuracy and interpretability.

4.2. Comparative Analysis with Prior Studies

Beyond the benchmark comparisons, the performance of the proposed hybrid STL-based forecasting framework was evaluated against results reported in prior studies on PM2.5 prediction. This additional comparison provides a broader perspective on its relative efficiency and highlights how probabilistic and statistical considerations of symmetry in forecast errors contribute to improved accuracy.
An ensemble technique reported in [20] using the same dataset achieved RMSE values of 20.32, 48.31, 22.99, and 35.06 for Islamabad, Lahore, Karachi, and Peshawar, respectively. By contrast, the proposed hybrid model achieved significantly lower RMSEs of 7.91, 12.48, 7.65, and 13.59 for the same cities. These results represent error reductions of approximately 61%, 74%, 67%, and 61%, respectively, demonstrating the proposed framework’s capacity to capture underlying structural patterns with a higher degree of symmetry and reduced variability in forecast deviations.
Similarly, in [27], the reported RMSE values were 16.566 (Islamabad), 20.835 (Lahore), 22.084 (Karachi), and 18.743 (Peshawar). Compared with these values, the proposed model, which obtained RMSEs of 7.912, 12.475, 7.652, and 13.585, reduced the error by approximately 52%, 40%, 65%, and 28%, respectively. These improvements indicate not only higher predictive accuracy but also a more stable and symmetric distribution of residuals, which is critical for enhancing generalization across heterogeneous urban air quality contexts.
In another study, ref. [28] employed an artificial neural network (ANN) with 20 neurons and five independent variables to predict PM2.5 in Islamabad, obtaining an RMSE of 9.82. By comparison, the proposed STL-based model, which relies solely on intrinsic PM2.5 dynamics without exogenous predictors, achieved a lower RMSE of 7.912, representing a 19% improvement. This reinforces the advantage of decomposition-based models in capturing temporal symmetries without heavy reliance on external covariates.
Furthermore, ref. [29] utilized ANN models incorporating meteorological and historical information, producing RMSE values of 18 for Karachi and 39 for Lahore. In contrast, the proposed model achieved RMSEs of 7.652 and 12.475 for Karachi and Lahore, respectively, representing improvements of over 57% and 68%. This outcome emphasizes the robustness of the proposed approach in capturing both short-term fluctuations and the seasonal symmetries inherent in PM2.5 data.
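The percentage improvements quoted in this subsection follow from the relative reduction 100 × (RMSE_reference − RMSE_proposed) / RMSE_reference. The short sketch below recomputes them from the RMSE pairs stated in the text. The function name `pct_reduction` and the dictionary layout are illustrative assumptions.

```python
def pct_reduction(reference_rmse, proposed_rmse):
    """Relative RMSE reduction of the proposed model, in percent."""
    return 100.0 * (reference_rmse - proposed_rmse) / reference_rmse


# (reference RMSE, proposed RMSE) pairs as quoted in the text.
comparisons = {
    "[20] Islamabad": (20.32, 7.91),
    "[20] Lahore": (48.31, 12.48),
    "[20] Karachi": (22.99, 7.65),
    "[20] Peshawar": (35.06, 13.59),
    "[27] Islamabad": (16.566, 7.912),
    "[27] Lahore": (20.835, 12.475),
    "[27] Karachi": (22.084, 7.652),
    "[27] Peshawar": (18.743, 13.585),
    "[28] Islamabad": (9.82, 7.912),
    "[29] Karachi": (18.0, 7.652),
    "[29] Lahore": (39.0, 12.475),
}

reductions = {k: pct_reduction(ref, new) for k, (ref, new) in comparisons.items()}
for label, value in reductions.items():
    print(f"{label}: {value:.1f}%")
```

Across these comparisons, the smallest reduction is obtained against [28] for Islamabad and the largest against [20] for Lahore.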
A more advanced deep learning approach in [30], which integrated CNN-LSTM with Multi-Fractal Detrended Fluctuation Analysis (MF-DFA), reported an RMSE of 11.732 using combined meteorological and air quality data. While effective, the proposed hybrid STL-based framework still outperformed this model, achieving an average RMSE of about 10.41 across the four cities (the mean of 7.912, 12.475, 7.652, and 13.585), despite utilizing only the PM2.5 series as input. This highlights the effectiveness of decomposition and ensemble strategies in enhancing forecast precision while preserving statistical symmetry in error dynamics.
Therefore, the comparative evidence across diverse studies consistently demonstrates the superiority of the proposed framework in terms of error reduction, robustness, and adaptability. By integrating decomposition, ensemble learning, and symmetric error measures such as sMAPE, the approach systematically balances over- and under-predictions, reduces asymmetry in residual distributions, and achieves higher probabilistic consistency. Meteorological factors such as temperature, humidity, and wind speed are excluded from this study; this design decision improves the models’ applicability in low-resource settings where such data are difficult to obtain or inconsistent. Future iterations of the suggested framework, however, might include these exogenous factors to investigate how they affect the interpretability and accuracy of the model.

5. Policy Implications and Future Studies

The implications of this work extend beyond methodological contributions to environmental policy, public health, and sustainable urban management. Accurate short-term forecasts of PM2.5 concentrations are crucial for issuing timely health advisories, managing traffic, controlling emissions, and implementing regulatory interventions. For urban planners, such forecasts can inform land-use strategies, infrastructure development, and mitigation policies aimed at reducing long-term exposure to hazardous pollutants. The demonstrated reliability of the proposed framework across multiple cities underscores its scalability for nationwide deployment. Unlike models that require extensive meteorological datasets, this decomposition–ensemble method provides a cost-effective and practical solution for resource-constrained regions, such as Pakistan. When integrated into regulatory monitoring systems, the model can enhance compliance mechanisms, improve environmental governance, and foster evidence-based decision-making.
Despite its strong performance, several avenues for future research remain open. Incorporating exogenous covariates, such as meteorological conditions, socioeconomic activity, or industrial emissions, could further enhance predictive accuracy, particularly during extreme pollution episodes. Future studies could also extend this symmetry-aware ensemble with more sophisticated symmetry measures and Bayesian probabilistic augmentation. Exploring alternative decomposition techniques, such as Variational Mode Decomposition (VMD), Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN), or Empirical Wavelet Transform (EWT), may capture asymmetries and nonlinearities more effectively in highly volatile environments. Extending the model toward probabilistic forecasting frameworks could also quantify uncertainty, providing policymakers with confidence intervals and risk assessments rather than point estimates alone. In conclusion, the proposed hybrid ensemble model represents a statistically rigorous, symmetry-aware, and interpretable solution for urban air quality forecasting. Its adaptability, accuracy, and efficiency make it a valuable tool for policymakers, environmental researchers, and urban planners seeking proactive, data-driven approaches to sustainable air quality management.

6. Conclusions

This study has introduced a novel hybrid ensemble framework for forecasting PM2.5 concentrations in four major Pakistani cities (Islamabad, Lahore, Karachi, and Peshawar) by leveraging seasonal-trend decomposition using Loess (STL) in combination with support vector regression (SVR), extreme gradient boosting (XGBoost), an extreme learning machine (ELM), and neural network autoregression (NNAR). The probabilistic and statistical synergy between decomposition and ensemble modeling enabled the framework to capture both linear and nonlinear dynamics while preserving the temporal symmetry of the underlying air quality processes. The empirical evaluation demonstrated that the proposed approach consistently achieved superior forecasting performance compared to the standalone benchmark models (SVR, XGBoost, NNAR, and ELM). With an average RMSE of about 10.41 across all locations, the model also compared favorably with the prior literature, including CNN-LSTM with MF-DFA (RMSE 11.73) and traditional ANN-based systems (RMSEs of 18 and 39 for Karachi and Lahore, respectively). Relative to the prior studies considered, the reduction in RMSE ranged from 19% to 74%, underscoring both the robustness and efficiency of the proposed model. Importantly, these results were achieved without incorporating exogenous meteorological covariates, highlighting the model’s capacity to operate in data-constrained environments. The decomposition of the PM2.5 series into trend, seasonal, and remainder components enhanced interpretability by revealing the probabilistic balance and structural symmetry between persistent trends, cyclic variations, and irregular shocks. Visual diagnostics and statistical accuracy measures confirmed that the hybrid ensemble not only reduced errors but also improved directional accuracy, thereby reinforcing its practical relevance. Overall, this study contributes to the literature on environmental forecasting by demonstrating how probability-driven decomposition and symmetry-aware hybridization can lead to both accuracy and interpretability in complex time series modeling.
In addition to offering a reliable and understandable method for short-term PM2.5 forecasting, the framework may be expanded to include other environmental time series.

Author Contributions

Conceptualization, methodology, and software, H.I.; validation, H.I., A.F.H., M.Q., and P.C.R.; formal analysis, H.I. and A.F.H.; investigation, H.I., A.F.H., M.Q., and P.C.R.; resources, M.Q., A.F.H. and P.C.R.; data curation, H.I. and M.Q.; writing—original draft preparation, H.I., A.F.H., M.Q., and P.C.R.; writing—review and editing, H.I., M.Q., P.C.R., and A.F.H.; visualization, A.F.H., P.C.R., and M.Q.; supervision, P.C.R. and H.I.; project administration, A.F.H. and P.C.R.; funding acquisition, A.F.H. and P.C.R. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported and funded by the Deanship of Scientific Research at Imam Mohammad Ibn Saud Islamic University (IMSIU) (grant number IMSIU-DDRSP2502).

Data Availability Statement

The data used in this study are available at https://www.iqair.com/pakistan (accessed on 20 June 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Maio, S.; Sarno, G.; Tagliaferro, S.; Pirona, F.; Stanisci, I.; Baldacci, S.; Viegi, G. Outdoor air pollution and respiratory health. Int. J. Tuberc. Lung Dis. 2023, 27, 7–12.
  2. Wu, C.; Wang, R.; Lu, S.; Tian, J.; Yin, L.; Wang, L.; Zheng, W. Time-Series Data-Driven PM2.5 Forecasting: From Theoretical Framework to Empirical Analysis. Atmosphere 2025, 16, 292.
  3. Li, J.; Liang, L.; Lyu, B.; Cai, Y.S.; Zuo, Y.; Su, J.; Tong, Z. Double Trouble: The Interaction of PM2.5 and O3 on Respiratory Hospital Admissions. Environ. Pollut. 2023, 338, 122665.
  4. Destiartono, M.E.; Hartono, D. Does Rapid Urbanization Drive Deforestation? Evidence From Southeast Asia. Econ. Dev. Anal. J. 2022, 11, 442–453.
  5. Oyetunji, P.; Ibitoye, O.; Akinyemi, G.; Fadele, O.; Oyediji, O. The effects of population growth on deforestation in Nigeria: 1991–2016. J. Appl. Sci. Environ. Manag. 2020, 24, 1329–1334.
  6. Song, J.; Ma, C.; Ran, M. AirGPT: Pioneering the Convergence of Conversational AI with Atmospheric Science. NPJ Clim. Atmos. Sci. 2025, 8, 179.
  7. Liu, Y.; Chen, B.; Zheng, Y.; Cheng, L.; Li, G.; Lin, L. ODMixer: Fine-Grained Spatial-Temporal MLP for Metro Origin-Destination Prediction. IEEE Trans. Knowl. Data Eng. 2025, 37, 5508–5522.
  8. Harrison, R.M. Airborne particulate matter. Philos. Trans. R. Soc. A 2020, 378, 20190319.
  9. Kyung, S.Y.; Jeong, S.H. Particulate-matter related respiratory diseases. Tuberc. Respir. Dis. 2020, 83, 116.
  10. Pryor, J.T.; Cowley, L.O.; Simonds, S.E. The physiological effects of air pollution: Particulate matter, physiology and disease. Front. Public Health 2022, 10, 882569.
  11. Tarín-Carrasco, P.; Im, U.; Geels, C.; Palacios-Peña, L.; Jiménez-Guerrero, P. Contribution of fine particulate matter to present and future premature mortality over Europe: A non-linear response. Environ. Int. 2021, 153, 106517.
  12. Zhang, Y. Dynamic effect analysis of meteorological conditions on air pollution: A case study from Beijing. Sci. Total Environ. 2019, 684, 178–185.
  13. Fatima, M.; Butt, I.; Nasar-u Minallah, M.; Atta, A.; Cheng, G. Assessment of air pollution and its association with population health: Geo-statistical evidence from Pakistan. Geogr. Environ. Sustain. 2023, 16, 93–101.
  14. Mahajan, S.; Chen, L.J.; Tsai, T.C. An empirical study of PM2.5 forecasting using neural network. In Proceedings of the 2017 IEEE SmartWorld, Ubiquitous Intelligence & Computing, Advanced & Trusted Computed, Scalable Computing & Communications, Cloud & Big Data Computing, Internet of People and Smart City Innovation (SmartWorld/SCALCOM/UIC/ATC/CBDCom/IOP/SCI), San Francisco, CA, USA, 4–8 August 2017; IEEE: New York, NY, USA, 2017; pp. 1–7.
  15. Mani, G.; Viswanadhapalli, J.K.; Stonier, A.A. Prediction and forecasting of air quality index in Chennai using regression and ARIMA time series models. J. Eng. Res. 2022, 10, 179–194.
  16. Cassano, F.; Casale, A.; Regina, P.; Spadafina, L.; Sekulic, P. A recurrent neural network approach to improve the air quality index prediction. In Ambient Intelligence–Software and Applications–, 10th International Symposium on Ambient Intelligence; Springer: Cham, Switzerland, 2020; pp. 36–44.
  17. Hossain, E.; Shariff, M.A.U.; Hossain, M.S.; Andersson, K. A novel deep learning approach to predict air quality index. In Proceedings of the International Conference on Trends in Computational and Cognitive Engineering: Proceedings of TCCE 2020; Springer: Singapore, 2020; pp. 367–381.
  18. Zhang, J.; Li, S. Air quality index forecast in Beijing based on CNN-LSTM multi-model. Chemosphere 2022, 308, 136180.
  19. Liu, C.; Pan, G.; Song, D.; Wei, H. Air quality index forecasting via genetic algorithm-based improved extreme learning machine. IEEE Access 2023, 11, 67086–67097.
  20. Iftikhar, H.; Qureshi, M.; Zywiołek, J.; López-Gonzales, J.L.; Albalawi, O. Short-term PM2.5 forecasting using a unique ensemble technique for proactive environmental management initiatives. Front. Environ. Sci. 2024, 12, 1442644.
  21. Wen, Q.; Gao, J.; Song, X.; Sun, L.; Xu, H.; Zhu, S. RobustSTL: A robust seasonal-trend decomposition algorithm for long time series. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 5409–5416.
  22. Van Buuren, S.; Groothuis-Oudshoorn, K. mice: Multivariate imputation by chained equations in R. J. Stat. Softw. 2011, 45, 1–67.
  23. Iftikhar, H.; Zafar, A.; Turpo-Chaparro, J.E.; Canas Rodrigues, P.; López-Gonzales, J.L. Forecasting Day-Ahead Brent Crude Oil Prices Using Hybrid Combinations of Time Series Models. Mathematics 2023, 11, 3548.
  24. Aamir, M.; Iftikhar, H.; Nasir, J.; Rodrigues, P.C.; Alharbi, A.A.; Allohibi, J. A Novel Hybrid LMD-SPF Forecasting Framework for Financial Time Series: Evidence from Gold Returns. AIMS Math. 2025, 10, 21875–21901.
  25. Cuba, W.M.; Huaman Alfaro, J.C.; Iftikhar, H.; López-Gonzales, J.L. Modeling and Analysis of Monkeypox Outbreak Using a New Time Series Ensemble Technique. Axioms 2024, 13, 554.
  26. Carbo-Bustinza, N.; Iftikhar, H.; Belmonte, M.; Cabello-Torres, R.J.; De La Cruz, A.R.H.; López-Gonzales, J.L. Short-Term Forecasting of Ozone Concentration in Metropolitan Lima Using Hybrid Combinations of Time Series Models. Appl. Sci. 2023, 13, 10514.
  27. Ahmed, M.; Xiao, Z.; Shen, Y. Estimation of ground PM2.5 concentrations in Pakistan using convolutional neural network and multi-pollutant satellite images. Remote Sens. 2022, 14, 1735.
  28. Sadiq, N.; Uddin, Z. Modeling of PM2.5 concentrations using artificial neural networks: A case study of Islamabad. Glob. NEST J. 2024, 27, 06865.
  29. Sadiq, N.; Uddin, Z. Prediction of PM2.5 via precursor method using meteorological parameters. EQA-Int. J. Environ. Qual. 2025, 67, 45–50.
  30. Pak, U.; Kim, H.; Jong, U.; Hyon, R.; Kim, J.; Kim, K.; Kim, K. A deep learning approach via multifractal detrended fluctuation analysis for PM2.5 prediction. J. Atmos. Sol.-Terr. Phys. 2025, 268, 106444.
Figure 1. Comparison of original and STL-decomposed PM2.5 series for four cities.
Figure 2. Original vs. fitted PM2.5 values for four monitoring stations.
Table 1. Evaluation metrics and their mathematical formulations.
Metric | Formula
Mean Absolute Error (MAE) | $\mathrm{MAE} = \frac{1}{n}\sum_{t=1}^{n} \lvert y_t - \hat{y}_t \rvert$
Root Mean Squared Error (RMSE) | $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n} (y_t - \hat{y}_t)^2}$
Correlation Coefficient (r) | $r = \frac{\sum_{t=1}^{n} (y_t - \bar{y})(\hat{y}_t - \bar{\hat{y}})}{\sqrt{\sum_{t=1}^{n} (y_t - \bar{y})^2 \, \sum_{t=1}^{n} (\hat{y}_t - \bar{\hat{y}})^2}}$
Symmetric Mean Absolute Percentage Error (sMAPE) | $\mathrm{sMAPE} = \frac{100\%}{n}\sum_{t=1}^{n} \frac{\lvert y_t - \hat{y}_t \rvert}{(\lvert y_t \rvert + \lvert \hat{y}_t \rvert)/2}$
Mean Directional Accuracy (MDA) | $\mathrm{MDA} = \frac{1}{n-1}\sum_{t=2}^{n} I\left[(y_t - y_{t-1})(\hat{y}_t - \hat{y}_{t-1}) > 0\right]$
Table 2. Details about the datasets considered in this work.
Attribute | Islamabad | Lahore | Karachi | Peshawar
Data period | 2019–2023 | 2019–2023 | 2019–2023 | 2019–2023
Total | 1461 | 1461 | 1461 | 1461
Observed | 1430 | 1418 | 1426 | 1433
Missing | 31 | 43 | 35 | 28
Imputed (%) | 2.12 | 2.94 | 2.40 | 1.92
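The missing counts and imputed shares in Table 2 follow directly from the total and observed counts. The brief sketch below recomputes them as missing = total − observed and imputed (%) = 100 × missing / total. The variable names are illustrative assumptions.

```python
# Counts copied from Table 2.
total = 1461
observed = {"Islamabad": 1430, "Lahore": 1418, "Karachi": 1426, "Peshawar": 1433}

# Missing days and the share of the series that had to be imputed.
missing = {city: total - n for city, n in observed.items()}
imputed_pct = {city: round(100 * m / total, 2) for city, m in missing.items()}

for city in observed:
    print(f"{city}: missing={missing[city]}, imputed={imputed_pct[city]:.2f}%")
```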
Table 3. Descriptive statistics of PM2.5 (raw and logarithmic values).
City/Statistic | Minimum | Maximum | 25% | 50% | 75% | Mean | Variance | Std. Dev.
Islamabad | 10.00 | 270.00 | 84.00 | 106.00 | 139.00 | 112.34 | 1576.95 | 39.71
Lahore | 5.00 | 503.00 | 128.00 | 162.00 | 218.00 | 183.15 | 7257.72 | 85.19
Karachi | 9.00 | 277.00 | 77.00 | 97.00 | 143.00 | 109.97 | 1795.17 | 42.37
Peshawar | 16.00 | 557.00 | 112.00 | 142.00 | 168.00 | 146.01 | 2662.09 | 51.60
City/Statistic | Skewness | Kurtosis | ADF Statistic | D_ADF Statistic | log(Min) | log(Max) | log(25%) | log(50%)
Islamabad | 0.43 | 2.94 | −4.05 | −16.97 | 2.30 | 5.60 | 4.43 | 4.66
Lahore | 1.12 | 4.18 | −4.24 | −16.12 | 1.61 | 6.22 | 4.85 | 5.09
Karachi | 0.67 | 2.76 | −3.66 | −16.82 | 2.20 | 5.62 | 4.34 | 4.57
Peshawar | 1.44 | 8.60 | −4.68 | −17.38 | 2.77 | 6.32 | 4.72 | 4.96
City/Statistic | log(75%) | log(Mean) | log(Variance) | log(Std. Dev.) | log(Skewness) | log(Kurtosis) | log(ADF) | log(D_ADF)
Islamabad | 4.93 | 4.65 | 0.15 | 0.39 | −1.03 | 6.23 | −4.54 | −17.42
Lahore | 5.38 | 5.10 | 0.25 | 0.50 | −1.43 | 11.44 | −5.02 | −15.85
Karachi | 4.96 | 4.62 | 0.16 | 0.40 | −0.66 | 6.43 | −4.34 | −16.57
Peshawar | 5.12 | 4.92 | 0.13 | 0.36 | −0.92 | 7.95 | −5.37 | −16.20
Table 4. Islamabad: accuracy mean error.
STL Model Combinations | MAE | RMSE | Correlation | sMAPE | MDA
SVRt + SVRr + SVRs | 10.324 | 12.832 | 0.784 | 11.661 | 0.495
SVRt + SVRr + XGBs | 11.895 | 14.498 | 0.721 | 13.262 | 0.475
SVRt + SVRr + NNARs | 6.584 | 8.518 | 0.918 | 7.562 | 0.737
SVRt + SVRr + ELMs | 7.773 | 9.979 | 0.897 | 8.931 | 0.737
SVRt + XGBr + SVRs | 11.955 | 14.759 | 0.716 | 13.498 | 0.485
SVRt + XGBr + XGBs | 12.516 | 15.667 | 0.677 | 14.082 | 0.475
SVRt + XGBr + NNARs | 10.163 | 12.828 | 0.794 | 11.332 | 0.646
SVRt + XGBr + ELMs | 10.980 | 13.437 | 0.772 | 12.335 | 0.667
SVRt + NNARr + SVRs | 12.845 | 15.729 | 0.690 | 14.062 | 0.485
SVRt + NNARr + XGBs | 13.910 | 16.944 | 0.639 | 15.216 | 0.485
SVRt + NNARr + NNARs | 9.158 | 11.467 | 0.832 | 10.347 | 0.707
SVRt + NNARr + ELMs | 10.031 | 12.450 | 0.795 | 11.160 | 0.616
SVRt + ELMr + SVRs | 11.099 | 13.445 | 0.780 | 12.443 | 0.566
SVRt + ELMr + XGBs | 12.897 | 15.526 | 0.697 | 14.216 | 0.535
SVRt + ELMr + NNARs | 6.277 | 7.912 | 0.969 | 7.559 | 0.838
SVRt + ELMr + ELMs | 8.134 | 10.035 | 0.951 | 9.589 | 0.869
XGBt + SVRr + SVRs | 10.321 | 12.854 | 0.784 | 11.656 | 0.495
XGBt + SVRr + XGBs | 11.904 | 14.524 | 0.720 | 13.267 | 0.475
XGBt + SVRr + NNARs | 6.612 | 8.548 | 0.918 | 7.597 | 0.737
XGBt + SVRr + ELMs | 7.808 | 10.017 | 0.897 | 8.976 | 0.737
XGBt + XGBr + SVRs | 11.973 | 14.790 | 0.716 | 13.512 | 0.485
XGBt + XGBr + XGBs | 12.543 | 15.702 | 0.677 | 14.108 | 0.475
XGBt + XGBr + NNARs | 10.200 | 12.861 | 0.794 | 11.372 | 0.646
XGBt + XGBr + ELMs | 11.006 | 13.479 | 0.772 | 12.365 | 0.667
XGBt + NNARr + SVRs | 12.843 | 15.733 | 0.690 | 14.056 | 0.485
XGBt + NNARr + XGBs | 13.908 | 16.953 | 0.639 | 15.207 | 0.485
XGBt + NNARr + NNARs | 9.162 | 11.469 | 0.832 | 10.349 | 0.707
XGBt + NNARr + ELMs | 10.040 | 12.463 | 0.795 | 11.172 | 0.616
XGBt + ELMr + SVRs | 11.131 | 13.492 | 0.780 | 12.477 | 0.566
XGBt + ELMr + XGBs | 12.936 | 15.572 | 0.697 | 14.255 | 0.535
XGBt + ELMr + NNARs | 6.333 | 7.986 | 0.969 | 7.631 | 0.838
XGBt + ELMr + ELMs | 8.194 | 10.106 | 0.951 | 9.660 | 0.869
NNARt + SVRr + SVRs | 10.321 | 12.847 | 0.784 | 11.657 | 0.495
NNARt + SVRr + XGBs | 11.901 | 14.515 | 0.720 | 13.265 | 0.475
NNARt + SVRr + NNARs | 6.603 | 8.539 | 0.918 | 7.586 | 0.737
NNARt + SVRr + ELMs | 7.797 | 10.005 | 0.897 | 8.962 | 0.737
NNARt + XGBr + SVRs | 11.967 | 14.780 | 0.716 | 13.507 | 0.485
NNARt + XGBr + XGBs | 12.535 | 15.691 | 0.677 | 14.100 | 0.475
NNARt + XGBr + NNARs | 10.189 | 12.851 | 0.794 | 11.359 | 0.646
NNARt + XGBr + ELMs | 10.998 | 13.466 | 0.772 | 12.356 | 0.667
NNARt + NNARr + SVRs | 12.843 | 15.731 | 0.690 | 14.057 | 0.485
NNARt + NNARr + XGBs | 13.909 | 16.950 | 0.639 | 15.209 | 0.485
NNARt + NNARr + NNARs | 9.161 | 11.468 | 0.832 | 10.348 | 0.707
NNARt + NNARr + ELMs | 10.037 | 12.459 | 0.795 | 11.168 | 0.616
NNARt + ELMr + SVRs | 11.121 | 13.477 | 0.780 | 12.466 | 0.566
NNARt + ELMr + XGBs | 12.924 | 15.557 | 0.697 | 14.243 | 0.535
NNARt + ELMr + NNARs | 6.316 | 7.963 | 0.969 | 7.609 | 0.838
NNARt + ELMr + ELMs | 8.175 | 10.084 | 0.951 | 9.638 | 0.869
ELMt + SVRr + SVRs | 10.321 | 12.847 | 0.784 | 11.657 | 0.495
ELMt + SVRr + XGBs | 11.901 | 14.516 | 0.721 | 13.265 | 0.475
ELMt + SVRr + NNARs | 6.603 | 8.539 | 0.918 | 7.586 | 0.737
ELMt + SVRr + ELMs | 7.797 | 10.005 | 0.897 | 8.962 | 0.737
ELMt + XGBr + SVRs | 11.967 | 14.780 | 0.716 | 13.507 | 0.485
ELMt + XGBr + XGBs | 12.535 | 15.691 | 0.677 | 14.100 | 0.475
ELMt + XGBr + NNARs | 10.189 | 12.851 | 0.794 | 11.359 | 0.646
ELMt + XGBr + ELMs | 10.998 | 13.466 | 0.772 | 12.356 | 0.667
ELMt + NNARr + SVRs | 12.843 | 15.731 | 0.690 | 14.057 | 0.485
ELMt + NNARr + XGBs | 13.908 | 16.950 | 0.639 | 15.209 | 0.485
ELMt + NNARr + NNARs | 9.161 | 11.468 | 0.832 | 10.348 | 0.707
ELMt + NNARr + ELMs | 10.037 | 12.459 | 0.795 | 11.168 | 0.616
ELMt + ELMr + SVRs | 11.121 | 13.477 | 0.780 | 12.466 | 0.566
ELMt + ELMr + XGBs | 12.924 | 15.558 | 0.697 | 14.243 | 0.535
ELMt + ELMr + NNARs | 6.316 | 7.963 | 0.969 | 7.609 | 0.838
ELMt + ELMr + ELMs | 8.176 | 10.085 | 0.951 | 9.638 | 0.869
Table 5. Karachi: accuracy mean error.
STL Model Combinations | MAE | RMSE | Correlation | sMAPE | MDA
SVRt + SVRr + SVRs | 9.708 | 13.652 | 0.851 | 11.775 | 0.556
SVRt + SVRr + XGBs | 10.303 | 14.726 | 0.823 | 12.521 | 0.545
SVRt + SVRr + NNARs | 6.891 | 8.351 | 0.951 | 8.629 | 0.677
SVRt + SVRr + ELMs | 7.990 | 10.388 | 0.930 | 10.000 | 0.697
SVRt + XGBr + SVRs | 11.387 | 15.763 | 0.794 | 13.759 | 0.444
SVRt + XGBr + XGBs | 11.464 | 16.549 | 0.775 | 13.922 | 0.414
SVRt + XGBr + NNARs | 10.197 | 12.346 | 0.881 | 12.880 | 0.545
SVRt + XGBr + ELMs | 11.039 | 13.740 | 0.850 | 13.681 | 0.566
SVRt + NNARr + SVRs | 13.456 | 17.774 | 0.733 | 16.221 | 0.434
SVRt + NNARr + XGBs | 14.150 | 18.863 | 0.708 | 16.982 | 0.404
SVRt + NNARr + NNARs | 9.866 | 13.149 | 0.861 | 12.202 | 0.515
SVRt + NNARr + ELMs | 10.824 | 14.342 | 0.833 | 13.212 | 0.505
SVRt + ELMr + SVRs | 10.456 | 14.052 | 0.851 | 12.702 | 0.616
SVRt + ELMr + XGBs | 11.420 | 15.276 | 0.813 | 13.898 | 0.596
SVRt + ELMr + NNARs | 6.073 | 7.706 | 0.975 | 7.694 | 0.798
SVRt + ELMr + ELMs | 8.224 | 10.161 | 0.962 | 10.443 | 0.869
XGBt + SVRr + SVRs | 9.829 | 13.718 | 0.851 | 11.955 | 0.556
XGBt + SVRr + XGBs | 10.405 | 14.787 | 0.823 | 12.652 | 0.545
XGBt + SVRr + NNARs | 7.001 | 8.476 | 0.951 | 8.781 | 0.677
XGBt + SVRr + ELMs | 8.120 | 10.532 | 0.930 | 10.199 | 0.697
XGBt + XGBr + SVRs | 11.412 | 15.818 | 0.794 | 13.783 | 0.444
XGBt + XGBr + XGBs | 11.534 | 16.601 | 0.775 | 14.011 | 0.414
XGBt + XGBr + NNARs | 10.280 | 12.427 | 0.881 | 12.960 | 0.545
XGBt + XGBr + ELMs | 11.118 | 13.846 | 0.850 | 13.765 | 0.566
XGBt + NNARr + SVRs | 13.454 | 17.792 | 0.732 | 16.176 | 0.434
XGBt + NNARr + XGBs | 14.157 | 18.879 | 0.707 | 16.953 | 0.404
XGBt + NNARr + NNARs | 9.903 | 13.185 | 0.861 | 12.217 | 0.515
XGBt + NNARr + ELMs | 10.862 | 14.406 | 0.833 | 13.258 | 0.505
XGBt + ELMr + SVRs | 10.641 | 14.199 | 0.851 | 12.971 | 0.616
XGBt + ELMr + XGBs | 11.543 | 15.411 | 0.813 | 14.061 | 0.596
XGBt + ELMr + NNARs | 6.406 | 7.990 | 0.975 | 8.182 | 0.798
XGBt + ELMr + ELMs | 8.540 | 10.422 | 0.962 | 10.885 | 0.869
NNARt + SVRr + SVRs | 9.703 | 13.653 | 0.851 | 11.768 | 0.556
NNARt + SVRr + XGBs | 10.298 | 14.728 | 0.823 | 12.515 | 0.545
NNARt + SVRr + NNARs | 6.892 | 8.351 | 0.950 | 8.631 | 0.677
NNARt + SVRr + ELMs | 7.992 | 10.388 | 0.930 | 10.001 | 0.697
NNARt + XGBr + SVRs | 11.387 | 15.763 | 0.794 | 13.759 | 0.444
NNARt + XGBr + XGBs | 11.465 | 16.551 | 0.775 | 13.923 | 0.414
NNARt + XGBr + NNARs | 10.199 | 12.345 | 0.881 | 12.883 | 0.545
NNARt + XGBr + ELMs | 11.043 | 13.739 | 0.850 | 13.685 | 0.566
NNARt + NNARr + SVRs | 13.452 | 17.777 | 0.733 | 16.217 | 0.434
NNARt + NNARr + XGBs | 14.148 | 18.867 | 0.708 | 16.981 | 0.404
NNARt + NNARr + NNARs | 9.869 | 13.152 | 0.861 | 12.210 | 0.515
NNARt + NNARr + ELMs | 10.828 | 14.344 | 0.833 | 13.218 | 0.505
NNARt + ELMr + SVRs | 10.453 | 14.051 | 0.851 | 12.698 | 0.616
NNARt + ELMr + XGBs | 11.419 | 15.276 | 0.813 | 13.898 | 0.596
NNARt + ELMr + NNARs | 6.074 | 7.707 | 0.975 | 7.693 | 0.798
NNARt + ELMr + ELMs | 8.225 | 10.161 | 0.962 | 10.445 | 0.869
ELMt + SVRr + SVRs | 9.685 | 13.644 | 0.851 | 11.739 | 0.556
ELMt + SVRr + XGBs | 10.283 | 14.721 | 0.823 | 12.495 | 0.545
ELMt + SVRr + NNARs | 6.871 | 8.332 | 0.950 | 8.601 | 0.677
ELMt + SVRr + ELMs | 7.971 | 10.364 | 0.930 | 9.968 | 0.697
ELMt + XGBr + SVRs | 11.386 | 15.756 | 0.794 | 13.758 | 0.444
ELMt + XGBr + XGBs | 11.452 | 16.544 | 0.775 | 13.907 | 0.414
ELMt + XGBr + NNARs | 10.184 | 12.332 | 0.881 | 12.868 | 0.545
ELMt + XGBr + ELMs | 11.028 | 13.721 | 0.850 | 13.669 | 0.566
ELMt + NNARr + SVRs | 13.454 | 17.777 | 0.733 | 16.228 | 0.434
ELMt + NNARr + XGBs | 14.153 | 18.867 | 0.708 | 16.997 | 0.404
ELMt + NNARr + NNARs | 9.867 | 13.150 | 0.861 | 12.215 | 0.515
ELMt + NNARr + ELMs | 10.825 | 14.336 | 0.833 | 13.215 | 0.505
ELMt + ELMr + SVRs | 10.421 | 14.026 | 0.851 | 12.650 | 0.616
ELMt + ELMr + XGBs | 11.395 | 15.253 | 0.813 | 13.866 | 0.596
ELMt + ELMr + NNARs | 6.005 | 7.652 | 0.975 | 7.590 | 0.798
ELMt + ELMr + ELMs | 8.164 | 10.110 | 0.962 | 10.357 | 0.869
Table 6. Lahore: accuracy mean error.

| STL Model Combinations | MAE | RMSE | Correlation | MAPE | MDA |
|---|---|---|---|---|---|
| SVRt + SVRr + SVRs | 19.053 | 24.054 | 0.748 | 14.824 | 0.616 |
| SVRt + SVRr + XGBs | 21.790 | 26.955 | 0.672 | 16.767 | 0.586 |
| SVRt + SVRr + NNARs | 12.631 | 15.424 | 0.933 | 9.701 | 0.747 |
| SVRt + SVRr + ELMs | 15.634 | 19.518 | 0.892 | 11.846 | 0.778 |
| SVRt + XGBr + SVRs | 22.790 | 28.856 | 0.626 | 17.178 | 0.515 |
| SVRt + XGBr + XGBs | 24.153 | 30.100 | 0.574 | 18.292 | 0.495 |
| SVRt + XGBr + NNARs | 18.411 | 23.224 | 0.768 | 13.741 | 0.636 |
| SVRt + XGBr + ELMs | 19.984 | 25.868 | 0.703 | 14.877 | 0.586 |
| SVRt + NNARr + SVRs | 23.766 | 31.430 | 0.556 | 17.875 | 0.424 |
| SVRt + NNARr + XGBs | 26.387 | 33.244 | 0.482 | 19.905 | 0.394 |
| SVRt + NNARr + NNARs | 18.575 | 24.197 | 0.747 | 14.001 | 0.596 |
| SVRt + NNARr + ELMs | 20.327 | 27.171 | 0.669 | 15.286 | 0.596 |
| SVRt + ELMr + SVRs | 18.447 | 23.318 | 0.772 | 14.463 | 0.576 |
| SVRt + ELMr + XGBs | 21.436 | 26.400 | 0.685 | 16.568 | 0.586 |
| SVRt + ELMr + NNARs | 10.151 | 12.521 | 0.979 | 8.386 | 0.818 |
| SVRt + ELMr + ELMs | 13.442 | 16.592 | 0.954 | 10.614 | 0.879 |
| XGBt + SVRr + SVRs | 18.844 | 23.947 | 0.749 | 14.680 | 0.626 |
| XGBt + SVRr + XGBs | 21.556 | 26.764 | 0.674 | 16.599 | 0.586 |
| XGBt + SVRr + NNARs | 12.456 | 15.187 | 0.934 | 9.638 | 0.747 |
| XGBt + SVRr + ELMs | 15.331 | 19.194 | 0.895 | 11.686 | 0.778 |
| XGBt + XGBr + SVRs | 22.638 | 28.818 | 0.626 | 17.053 | 0.515 |
| XGBt + XGBr + XGBs | 23.991 | 29.978 | 0.575 | 18.148 | 0.505 |
| XGBt + XGBr + NNARs | 18.306 | 23.132 | 0.769 | 13.674 | 0.636 |
| XGBt + XGBr + ELMs | 19.871 | 25.682 | 0.705 | 14.803 | 0.586 |
| XGBt + NNARr + SVRs | 23.664 | 31.344 | 0.557 | 17.755 | 0.424 |
| XGBt + NNARr + XGBs | 26.198 | 33.086 | 0.484 | 19.726 | 0.394 |
| XGBt + NNARr + NNARs | 18.557 | 24.042 | 0.748 | 13.962 | 0.596 |
| XGBt + NNARr + ELMs | 20.180 | 26.935 | 0.671 | 15.153 | 0.596 |
| XGBt + ELMr + SVRs | 18.549 | 23.524 | 0.772 | 14.551 | 0.576 |
| XGBt + ELMr + XGBs | 21.482 | 26.486 | 0.687 | 16.612 | 0.586 |
| XGBt + ELMr + NNARs | 10.349 | 12.819 | 0.979 | 8.641 | 0.808 |
| XGBt + ELMr + ELMs | 13.565 | 16.660 | 0.956 | 10.807 | 0.879 |
| NNARt + SVRr + SVRs | 19.050 | 24.055 | 0.748 | 14.823 | 0.616 |
| NNARt + SVRr + XGBs | 21.787 | 26.949 | 0.672 | 16.766 | 0.586 |
| NNARt + SVRr + NNARs | 12.625 | 15.414 | 0.933 | 9.701 | 0.747 |
| NNARt + SVRr + ELMs | 15.619 | 19.502 | 0.892 | 11.840 | 0.778 |
| NNARt + XGBr + SVRs | 22.790 | 28.865 | 0.625 | 17.176 | 0.515 |
| NNARt + XGBr + XGBs | 24.154 | 30.102 | 0.573 | 18.289 | 0.505 |
| NNARt + XGBr + NNARs | 18.410 | 23.228 | 0.768 | 13.739 | 0.636 |
| NNARt + XGBr + ELMs | 19.984 | 25.866 | 0.702 | 14.878 | 0.586 |
| NNARt + NNARr + SVRs | 23.755 | 31.423 | 0.556 | 17.865 | 0.424 |
| NNARt + NNARr + XGBs | 26.373 | 33.232 | 0.482 | 19.893 | 0.394 |
| NNARt + NNARr + NNARs | 18.571 | 24.181 | 0.747 | 13.997 | 0.596 |
| NNARt + NNARr + ELMs | 20.313 | 27.151 | 0.669 | 15.274 | 0.596 |
| NNARt + ELMr + SVRs | 18.465 | 23.344 | 0.771 | 14.479 | 0.576 |
| NNARt + ELMr + XGBs | 21.444 | 26.416 | 0.685 | 16.577 | 0.586 |
| NNARt + ELMr + NNARs | 10.175 | 12.556 | 0.979 | 8.414 | 0.808 |
| NNARt + ELMr + ELMs | 13.464 | 16.609 | 0.954 | 10.638 | 0.879 |
| ELMt + SVRr + SVRs | 19.108 | 24.090 | 0.748 | 14.863 | 0.616 |
| ELMt + SVRr + XGBs | 21.823 | 26.991 | 0.672 | 16.786 | 0.586 |
| ELMt + SVRr + NNARs | 12.657 | 15.460 | 0.933 | 9.736 | 0.747 |
| ELMt + SVRr + ELMs | 15.688 | 19.563 | 0.892 | 11.871 | 0.778 |
| ELMt + XGBr + SVRs | 22.829 | 28.950 | 0.625 | 17.188 | 0.515 |
| ELMt + XGBr + XGBs | 24.198 | 30.111 | 0.573 | 18.295 | 0.505 |
| ELMt + XGBr + NNARs | 18.443 | 23.259 | 0.768 | 13.724 | 0.636 |
| ELMt + XGBr + ELMs | 19.981 | 25.840 | 0.703 | 14.875 | 0.586 |
| ELMt + NNARr + SVRs | 23.774 | 31.435 | 0.555 | 17.883 | 0.424 |
| ELMt + NNARr + XGBs | 26.429 | 33.288 | 0.481 | 19.947 | 0.394 |
| ELMt + NNARr + NNARs | 18.601 | 24.215 | 0.747 | 14.018 | 0.596 |
| ELMt + NNARr + ELMs | 20.354 | 27.201 | 0.668 | 15.306 | 0.596 |
| ELMt + ELMr + SVRs | 18.493 | 23.369 | 0.771 | 14.511 | 0.576 |
| ELMt + ELMr + XGBs | 21.508 | 26.491 | 0.686 | 16.593 | 0.586 |
| ELMt + ELMr + NNARs | 10.287 | 12.720 | 0.978 | 8.566 | 0.808 |
| ELMt + ELMr + ELMs | 13.507 | 16.634 | 0.956 | 10.854 | 0.879 |
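
The row labels in these tables follow a regular pattern: four base learners (SVR, XGB, NNAR, ELM) are assigned to each of three STL components, giving 4 × 4 × 4 = 64 combinations per city. The following Python sketch is illustrative only and is not the authors' code. It assumes the suffixes t, r, and s denote the trend, remainder, and seasonal components, and the function names and the dictionary layout are hypothetical. It enumerates the labels in table order and picks the combination with the smallest error among three rows copied from Table 6.

```python
from itertools import product

# Base learners as abbreviated in the tables. The suffixes t, r, s are assumed
# to mark the trend, remainder, and seasonal components, in the label order
# used in the tables.
BASE_MODELS = ["SVR", "XGB", "NNAR", "ELM"]


def combination_labels(models=BASE_MODELS):
    """Return every trend/remainder/seasonal label, e.g. 'SVRt + SVRr + SVRs'."""
    return [f"{t}t + {r}r + {s}s" for t, r, s in product(models, repeat=3)]


def best_combination(results, metric="MAE"):
    """Return the label with the smallest value of an error metric.

    `results` maps a label to a dict of metric values (hypothetical layout).
    """
    return min(results, key=lambda label: results[label][metric])


labels = combination_labels()

# Three rows copied from Table 6 (Lahore), used here only for illustration.
lahore_subset = {
    "SVRt + ELMr + NNARs": {"MAE": 10.151, "RMSE": 12.521},
    "XGBt + ELMr + NNARs": {"MAE": 10.349, "RMSE": 12.819},
    "NNARt + ELMr + NNARs": {"MAE": 10.175, "RMSE": 12.556},
}
best = best_combination(lahore_subset, metric="MAE")
print(len(labels), best)
```

Selecting on a single error metric is a simplification. For metrics where larger is better (correlation, MDA), the comparison would need to be reversed.
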
Table 7. Peshawar: accuracy mean error.

| STL Model Combinations | MAE | RMSE | Correlation | MAPE | MDA |
|---|---|---|---|---|---|
| SVRt + SVRr + SVRs | 14.137 | 18.369 | 0.661 | 12.713 | 0.515 |
| SVRt + SVRr + XGBs | 15.908 | 20.258 | 0.592 | 14.253 | 0.485 |
| SVRt + SVRr + NNARs | 10.871 | 13.585 | 0.837 | 9.902 | 0.768 |
| SVRt + SVRr + ELMs | 12.182 | 15.462 | 0.797 | 11.046 | 0.798 |
| SVRt + XGBr + SVRs | 15.897 | 20.602 | 0.568 | 14.159 | 0.465 |
| SVRt + XGBr + XGBs | 17.526 | 22.244 | 0.510 | 15.572 | 0.444 |
| SVRt + XGBr + NNARs | 13.278 | 16.853 | 0.744 | 11.940 | 0.687 |
| SVRt + XGBr + ELMs | 14.132 | 18.316 | 0.692 | 12.642 | 0.657 |
| SVRt + NNARr + SVRs | 16.391 | 21.641 | 0.511 | 14.658 | 0.515 |
| SVRt + NNARr + XGBs | 17.749 | 23.133 | 0.455 | 15.867 | 0.414 |
| SVRt + NNARr + NNARs | 13.147 | 15.772 | 0.760 | 11.926 | 0.657 |
| SVRt + NNARr + ELMs | 14.343 | 18.145 | 0.667 | 12.877 | 0.606 |
| SVRt + ELMr + SVRs | 14.838 | 18.868 | 0.686 | 13.389 | 0.586 |
| SVRt + ELMr + XGBs | 16.750 | 21.260 | 0.596 | 14.952 | 0.515 |
| SVRt + ELMr + NNARs | 10.913 | 13.851 | 0.886 | 10.153 | 0.808 |
| SVRt + ELMr + ELMs | 13.431 | 16.237 | 0.862 | 12.321 | 0.909 |
| XGBt + SVRr + SVRs | 14.167 | 18.415 | 0.660 | 12.743 | 0.515 |
| XGBt + SVRr + XGBs | 15.952 | 20.314 | 0.592 | 14.288 | 0.485 |
| XGBt + SVRr + NNARs | 10.916 | 13.651 | 0.836 | 9.942 | 0.768 |
| XGBt + SVRr + ELMs | 12.258 | 15.547 | 0.796 | 11.118 | 0.808 |
| XGBt + XGBr + SVRs | 15.929 | 20.660 | 0.567 | 14.184 | 0.465 |
| XGBt + XGBr + XGBs | 17.565 | 22.311 | 0.510 | 15.594 | 0.444 |
| XGBt + XGBr + NNARs | 13.324 | 16.926 | 0.743 | 11.978 | 0.687 |
| XGBt + XGBr + ELMs | 14.213 | 18.407 | 0.690 | 12.711 | 0.657 |
| XGBt + NNARr + SVRs | 16.378 | 21.658 | 0.511 | 14.641 | 0.515 |
| XGBt + NNARr + XGBs | 17.776 | 23.162 | 0.454 | 15.881 | 0.424 |
| XGBt + NNARr + NNARs | 13.180 | 15.799 | 0.759 | 11.952 | 0.667 |
| XGBt + NNARr + ELMs | 14.393 | 18.192 | 0.666 | 12.925 | 0.606 |
| XGBt + ELMr + SVRs | 14.913 | 18.945 | 0.686 | 13.455 | 0.586 |
| XGBt + ELMr + XGBs | 16.826 | 21.343 | 0.595 | 15.013 | 0.515 |
| XGBt + ELMr + NNARs | 10.973 | 13.959 | 0.886 | 10.211 | 0.808 |
| XGBt + ELMr + ELMs | 13.509 | 16.357 | 0.860 | 12.391 | 0.909 |
| NNARt + SVRr + SVRs | 14.140 | 18.392 | 0.660 | 12.717 | 0.515 |
| NNARt + SVRr + XGBs | 15.918 | 20.279 | 0.592 | 14.261 | 0.485 |
| NNARt + SVRr + NNARs | 10.898 | 13.621 | 0.836 | 9.925 | 0.768 |
| NNARt + SVRr + ELMs | 12.206 | 15.489 | 0.797 | 11.071 | 0.808 |
| NNARt + XGBr + SVRs | 15.898 | 20.630 | 0.567 | 14.160 | 0.465 |
| NNARt + XGBr + XGBs | 17.531 | 22.270 | 0.510 | 15.572 | 0.455 |
| NNARt + XGBr + NNARs | 13.313 | 16.891 | 0.743 | 11.971 | 0.687 |
| NNARt + XGBr + ELMs | 14.158 | 18.348 | 0.691 | 12.667 | 0.657 |
| NNARt + NNARr + SVRs | 16.368 | 21.624 | 0.511 | 14.614 | 0.515 |
| NNARt + NNARr + XGBs | 17.743 | 23.110 | 0.455 | 15.856 | 0.424 |
| NNARt + NNARr + NNARs | 13.132 | 15.763 | 0.759 | 11.904 | 0.667 |
| NNARt + NNARr + ELMs | 14.317 | 18.129 | 0.666 | 12.846 | 0.606 |
| NNARt + ELMr + SVRs | 14.778 | 18.808 | 0.686 | 13.330 | 0.586 |
| NNARt + ELMr + XGBs | 16.678 | 21.196 | 0.595 | 14.932 | 0.515 |
| NNARt + ELMr + NNARs | 10.869 | 13.848 | 0.886 | 10.146 | 0.808 |
| NNARt + ELMr + ELMs | 13.413 | 16.212 | 0.860 | 12.308 | 0.909 |
| ELMt + SVRr + SVRs | 14.159 | 18.377 | 0.660 | 12.726 | 0.515 |
| ELMt + SVRr + XGBs | 15.926 | 20.285 | 0.592 | 14.267 | 0.485 |
| ELMt + SVRr + NNARs | 10.916 | 13.631 | 0.836 | 9.933 | 0.768 |
| ELMt + SVRr + ELMs | 12.206 | 15.478 | 0.797 | 11.063 | 0.808 |
| ELMt + XGBr + SVRs | 15.898 | 20.619 | 0.567 | 14.151 | 0.465 |
| ELMt + XGBr + XGBs | 17.531 | 22.251 | 0.510 | 15.562 | 0.455 |
| ELMt + XGBr + NNARs | 13.313 | 16.872 | 0.743 | 11.962 | 0.687 |
| ELMt + XGBr + ELMs | 14.158 | 18.332 | 0.691 | 12.657 | 0.657 |
Table 8. Comparative performance of the proposed hybrid framework versus benchmark models across four Pakistani cities.

| City | Model | MAE | RMSE | Correlation | MAPE | MDA |
|---|---|---|---|---|---|---|
| Islamabad | SVR | 15.997 | 21.342 | 0.511 | 14.829 | 0.525 |
| Islamabad | XGBoost | 17.646 | 24.453 | 0.455 | 15.062 | 0.641 |
| Islamabad | NNAR | 15.990 | 21.674 | 0.760 | 11.995 | 0.658 |
| Islamabad | ELM | 17.646 | 21.563 | 0.667 | 12.771 | 0.602 |
| Islamabad | Proposed Model | 6.277 | 7.912 | 0.969 | 7.559 | 0.838 |
| Karachi | SVR | 15.990 | 23.342 | 0.768 | 14.297 | 0.808 |
| Karachi | XGBoost | 17.646 | 25.334 | 0.808 | 15.945 | 0.809 |
| Karachi | NNAR | 16.089 | 22.641 | 0.646 | 14.798 | 0.652 |
| Karachi | ELM | 14.968 | 26.631 | 0.694 | 13.883 | 0.748 |
| Karachi | Proposed Model | 6.005 | 7.652 | 0.975 | 7.590 | 0.798 |
| Lahore | SVR | 35.268 | 50.999 | 0.715 | 21.673 | 0.686 |
| Lahore | XGBoost | 38.099 | 53.427 | 0.893 | 28.006 | 0.705 |
| Lahore | NNAR | 34.840 | 49.683 | 0.872 | 21.456 | 0.778 |
| Lahore | ELM | 33.966 | 48.304 | 0.656 | 21.178 | 0.795 |
| Lahore | Proposed Model | 10.120 | 12.475 | 0.979 | 8.340 | 0.808 |
| Peshawar | SVR | 27.265 | 33.442 | 0.610 | 15.716 | 0.544 |
| Peshawar | XGBoost | 28.278 | 36.530 | 0.736 | 17.940 | 0.687 |
| Peshawar | NNAR | 24.320 | 31.316 | 0.619 | 17.642 | 0.657 |
| Peshawar | ELM | 23.239 | 31.241 | 0.591 | 17.050 | 0.665 |
| Peshawar | Proposed Model | 10.871 | 13.585 | 0.837 | 9.902 | 0.768 |
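
For readers who wish to compute comparable scores on their own forecasts, the five column headings above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. MAPE is taken in percent, the correlation is taken to be Pearson's, and MDA (mean directional accuracy) is assumed to be the share of days on which the forecast moves in the same direction as the observation relative to the previous observed value. The function name `accuracy_measures` and the sample values are hypothetical.

```python
import math


def _sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def accuracy_measures(actual, forecast):
    """Compute MAE, RMSE, Pearson correlation, MAPE (%), and MDA."""
    n = len(actual)
    if n != len(forecast) or n < 2:
        raise ValueError("need two equally long series with at least 2 points")

    errors = [a - f for a, f in zip(actual, forecast)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE in percent; assumes no observed value is zero (PM2.5 is positive).
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n

    # Pearson correlation; assumes neither series is constant.
    mean_a = sum(actual) / n
    mean_f = sum(forecast) / n
    cov = sum((a - mean_a) * (f - mean_f) for a, f in zip(actual, forecast))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_f = sum((f - mean_f) ** 2 for f in forecast)
    corr = cov / math.sqrt(var_a * var_f)

    # MDA (assumed definition): direction of change relative to the previous
    # observed value, compared between observation and forecast.
    hits = sum(
        _sign(actual[t] - actual[t - 1]) == _sign(forecast[t] - actual[t - 1])
        for t in range(1, n)
    )
    mda = hits / (n - 1)

    return {"MAE": mae, "RMSE": rmse, "Correlation": corr,
            "MAPE": mape, "MDA": mda}


# Made-up observed and predicted PM2.5 values, for illustration only.
observed = [100.0, 110.0, 105.0, 120.0]
predicted = [98.0, 112.0, 108.0, 115.0]
scores = accuracy_measures(observed, predicted)
print(scores)
```

Under this reading, lower MAE, RMSE, and MAPE and higher correlation and MDA indicate a better forecast, which is how the rows of Table 8 are compared.
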
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Qureshi, M.; Hashem, A.F.; Iftikhar, H.; Rodrigues, P.C. A Hybrid STL-Based Ensemble Model for PM2.5 Forecasting in Pakistani Cities. Symmetry 2025, 17, 1827. https://doi.org/10.3390/sym17111827
