1. Introduction
Air quality in China has deteriorated markedly amid rapid economic growth, and PM2.5 pollution has become increasingly severe. These fine particles carry a multitude of harmful substances, remain suspended for long durations, and can travel considerable distances. They therefore act as carriers of various harmful pollutants in the air, significantly degrading air quality and visibility, while also posing serious threats to human health, including impaired renal function, coronary artery disease, asthma, lung cancer, and other respiratory diseases [1,2,3,4]. Therefore, building a high-precision PM2.5 concentration prediction model is of key significance for achieving proactive warnings of air pollution and precise management and control. An accurate prediction model can quantify the spatiotemporal evolution patterns of air pollutants, providing quantitative decision-making support for the sustainable management of urban ecological environments. Its prediction results can directly support practical actions such as the dynamic formulation of pollution control measures, the scientific optimization of emission reduction plans, and the efficient allocation of environmental governance resources. Research on PM2.5 concentration prediction has been validated and applied in practical scenarios such as urban air environment monitoring and regional sustainable development assessment, both domestically and internationally, becoming an important technological support for the promotion of sustainable development in the ecological environment field.
At present, approaches to forecasting PM2.5 concentrations generally fall into two broad categories: mechanistic and machine learning models. Mechanistic approaches rely on fundamental physical and chemical principles. They simulate the diffusion, transport, and deposition processes of pollutant plumes, with the ultimate goal of estimating ambient pollutant concentrations [5,6,7,8]. These methods provide good interpretability. However, such models require prior knowledge of emission sources, meteorological conditions, and geographical features, and rely on empirical parameters and assumptions to estimate future concentration changes, which limits their applicability to specific scenarios [9]. For example, AERMOD is an empirical model primarily designed for simulating and predicting small-scale pollutant dispersion [5]. Third-generation air quality models require complete emission inventories and accurate meteorological fields as inputs, which renders them inapplicable in most cases [6]. Although mechanism-based methods can simulate the physical evolution and interregional transport of atmospheric components, their prediction accuracy is largely constrained by the difficulty of obtaining precise input data as well as sufficient knowledge of emission sources, reaction mechanisms, and chemical kinetics [10,11].
Compared with mechanism-based approaches, machine learning methods are more effective at capturing the nonlinear characteristics of PM2.5 and have become a research hotspot worldwide. Traditional machine learning encompasses a wide range of predictive models, among which Support Vector Machines (SVMs) stand out as representative and widely applied approaches. By employing Particle Swarm Optimization (PSO) to optimize a hybrid kernel SVM, Zhang et al. developed a PM2.5 prediction model that achieved strong accuracy and efficiency [12]. The field of PM2.5 prediction has also seen the application of the Extreme Learning Machine (ELM), a rapid learning algorithm for Single Hidden Layer Feedforward Neural Networks (SLFNs), whose optimized variants have demonstrated promising predictive performance [13,14]. Studies further indicate that meteorological factors, such as surface pressure, precipitation, and temperature, as well as other pollutants including PM10, CO, and SO2, play significant roles in influencing PM2.5 concentration levels [15,16]. Although traditional machine learning methods exhibit certain capabilities in handling nonlinear data, they are limited in extracting deeper representations from large-scale datasets [17]. Building on traditional machine learning, deep learning provides a new paradigm for atmospheric time series prediction by enabling hierarchical feature learning from large inputs. The deep learning field encompasses a variety of specialized architectures, notably Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Generative Adversarial Networks (GANs). RNNs have demonstrated particular advantages in time series modeling [18]. LSTM, the predominant RNN variant, is extensively applied to PM2.5 prediction [19,20]. By introducing memory cells and gating mechanisms, LSTM effectively addresses the long-term dependency problem and retains temporal information. Designed as a streamlined alternative to LSTM, the Gated Recurrent Unit (GRU) reduces model complexity and accelerates training, which has led to its wide adoption in PM2.5 prediction tasks [21,22,23,24]. However, the relationship between PM2.5 concentrations and meteorological factors is characterized by a dynamic interplay, in which complexity stems from the interconnectedness of multiple variables rather than from any single factor. Considered independently, these factors cannot fully capture the interactive effects of multiple variables on PM2.5 concentrations, which limits prediction accuracy.
To elevate the accuracy of predictive models, numerous scholars have explored a range of avenues, such as optimizing the internal architecture of existing models and integrating heterogeneous model frameworks into a unified hybrid system. A prominent example of such optimized architectures is the Bidirectional Long Short-Term Memory (BiLSTM) network, an advanced iteration of the traditional LSTM that mitigates predictive deviations by capturing sequential features in both forward and backward directions [25]. Other studies have focused on parameter optimization of LSTM models using optimization algorithms and their improved versions, including the Genetic Algorithm (GA) [26], Quantum Particle Swarm Optimization (QPSO) [27], Bayesian Optimization (BO) [28], and the Whale Optimization Algorithm (WOA) [29]. These algorithms perform global searches to automatically determine optimal hyperparameters (e.g., training epochs, learning rate, and batch size), thereby avoiding inefficient manual tuning and improving stability and accuracy. In addition, hybrid neural networks have been developed to exploit the complementary strengths of different architectures, improving accuracy, generalization, adaptability to complex data, and robustness against noise. For example, the fusion of convolutional and recurrent networks has emerged as a viable strategy to enhance the efficiency of prediction tasks in deep learning applications [30,31,32]. Ding et al. [33] proposed a hybrid model integrating LSTM and a weighted Random Forest (RF) for PM2.5 prediction. In this framework, RF was employed to assess the importance of input variables (e.g., temperature, wind speed, and historical PM2.5 concentrations) and to identify and select the most relevant features. A fully connected network (FCN) was then employed to assign weights to these features, quantifying their relative influence. Finally, the LSTM processed the weighted feature sequence to capture long-term temporal dependencies, enabling it to predict PM2.5 concentrations over the subsequent six hours.
In order to optimize model performance and improve prediction accuracy, previous studies have explored model enhancements and the fusion of different neural networks. Although some progress has been achieved, limitations remain in capturing complex feature relationships within the data. With the widespread application of attention mechanisms across various fields [34,35], the combination of data preprocessing and machine learning has provided new opportunities for improving prediction performance. Through probabilistic weighting, attention mechanisms extract long-term dependencies and minimize information loss, thereby enhancing overall predictive accuracy. It has been confirmed that PM2.5 prediction models combined with attention mechanisms achieve improved accuracy and generalization [36,37,38]. For example, aiming at two-day-ahead forecasts of PM2.5, Zhang et al. [39] developed a hybrid CNN–BiLSTM–attention model, designed as a multi-modal fusion system that synthesizes strategies from convolutional, recurrent, and attention-based neural processing. Experiments on data from Shunyi District, Beijing, show that this model outperforms Lasso regression, ridge regression, and XGBoost in both short-term and long-term predictions. It is noteworthy that integrating an attention mechanism into a model can improve prediction accuracy, but it requires calculating attention weights at each time step [25,38,40].
In addition, the integration of data preprocessing techniques has been shown to improve PM2.5 prediction accuracy. This is achieved by simplifying raw sequences and decomposing them into distinct subsequences, each bearing different feature information [37]. Notably, researchers have developed hybrid models by fusing neural networks with Empirical Mode Decomposition (EMD), its derivatives, and Variational Mode Decomposition (VMD); these models have delivered significant improvements in PM2.5 prediction accuracy [37,38,41,42]. To address the challenges of PM2.5 concentration prediction within 0–24 h, Teng et al. [43] introduced a hybrid model that incorporates EMD and Sample Entropy (SE) into a BiLSTM framework. Experimental results indicated that this model outperformed single deep learning models, with short-term (within 6 h) prediction accuracy improved by at least 50%. These studies indicate that employing hybrid data decomposition techniques, coupled with further refinement through secondary decomposition, can achieve more detailed feature extraction and thus enhance the model's predictive accuracy [44].
Overall, despite the significant progress made in PM2.5 prediction, numerous limitations remain unresolved in current research. For PM2.5 concentration time series characterized by seasonality and influenced by multiple complex factors, existing models fail to effectively address non-stationarity and lack refined decomposition of fluctuating components, often leading to mode mixing. Additionally, most models have a limited ability to focus on key information such as meteorological factors, face difficulties in multi-source data fusion, and some hybrid models suffer from structural redundancy. Although attention mechanisms have been introduced, they are often simply combined with existing models without fully exploiting their advantages. Techniques such as data decomposition and attention can provide models with higher-quality inputs and enable efficient fusion of multi-source data, collectively enhancing the capacity of a model to process sophisticated data structures; nevertheless, substantial improvements are still required to realize deep collaboration among these techniques and to enhance real-time prediction. Therefore, building a real-time PM2.5 concentration prediction model that can accurately track dynamic atmospheric changes while offering high adaptability, high interpretability, and high precision is not only a practical requirement for precise air pollution control and sustainable ecological management, but also a key scientific challenge that urgently needs to be addressed.
To address these challenges, this paper proposes a high-accuracy PM2.5 concentration prediction method, with the technical process shown in Figure 1. Using PM2.5 monitoring data from Guangzhou and Shenzhen from 2020 to 2022 as the research object, after completing data supplementation, the seasonal variation characteristics of PM2.5 concentration are first analyzed. The correlations of pollutant factors (CO and other co-occurring pollutants) and meteorological factors (precipitation, temperature, atmospheric pressure, etc.) with PM2.5 concentration are explored, thereby revealing the driving mechanisms of PM2.5 concentration changes. Subsequently, the preprocessed dataset is divided and normalized. Employing the "decomposition–prediction–reconstruction" modeling approach, the data are input into the OVMD–PeepholeLSTM–attention model (hereafter referred to as PeepholeLSTM-OA) to obtain the final PM2.5 concentration predictions. Finally, quantitative evaluation indicators such as MAE, RMSE, and R2 are calculated to comprehensively assess the predictive performance of the model.
3. Research Methods
In the current research on time series prediction of PM2.5 concentrations, the original monitoring data are highly non-stationary and significantly affected by noise, and it is difficult for a single deep learning model to accurately capture the key features and dynamic relationships of the sequence. To further improve the accuracy and robustness of prediction models and make them more applicable to practical regional air pollution control needs, this paper proposes a PM2.5 concentration prediction model, PeepholeLSTM-OA. This model integrates Optimal Variational Mode Decomposition (OVMD), the Peephole Long Short-Term Memory network (PeepholeLSTM), and an attention mechanism (AM). OVMD effectively mitigates the non-stationarity and noise interference in the original data, and the attention mechanism is introduced into the PeepholeLSTM model, enabling the model to dynamically focus on the parts of the input sequence most relevant to the current output, thereby enhancing model performance.
3.1. Peephole Long Short-Term Memory Network (Peephole LSTM)
Peephole Long Short-Term Memory (PeepholeLSTM) is a variant of Long Short-Term Memory networks (LSTMs) [49]. It extends and optimizes the traditional LSTM, thereby enhancing its capability to process complex data. The structure of the PeepholeLSTM unit is illustrated in Figure 5. The PM2.5 concentration time series exhibits obvious long-term dependencies and phased abrupt changes. In actual monitoring data, under a relatively stable background level, PM2.5 concentrations often show short-term spikes or rapid accumulation followed by slow decay, driven by changes in meteorological conditions (such as stagnant weather or temperature inversion) and sudden surges in anthropogenic emissions, and there is usually a long interval between such high-pollution events. This characteristic means that, when updating the gating state, a model needs to accurately perceive whether the current change is sufficient to break the existing pollution accumulation state. The standard LSTM gating mechanism relies only on the current input and the previous hidden state. In long-span sequences, the influence of the cell state on gating decisions is indirect, which can easily lead to insufficient perception of the pollution accumulation level; in particular, near concentration spikes or turning points, gate updates may lag. PeepholeLSTM introduces direct access to the previous cell state in the input gate, forget gate, and output gate, enabling gating decisions to explicitly sense the current PM2.5 accumulation level and evolution trend. The update formulas for the gating unit states are provided in Equations (2)–(4).
where f_t, i_t, and o_t denote the forget gate, input gate, and output gate at time step t, respectively; σ(·) is the Sigmoid activation function; W represents the weight matrix corresponding to each gate; c_{t−1} is the previous cell state; h_{t−1} is the previous hidden state; x_t is the current input; and b denotes the bias vector.
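As a concrete illustration of the gating logic above, the following NumPy sketch implements a single PeepholeLSTM step. It is a minimal illustration under stated assumptions, not the study's implementation: the weight layout (each gate reading a concatenation of cell state, hidden state, and input) and the function names are ours, and the output gate here peeks at the updated cell state, as in the classic peephole formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def peephole_lstm_step(x_t, h_prev, c_prev, W, b):
    """One PeepholeLSTM step: the gates 'peek' at the cell state, so gating
    decisions directly sense the accumulated level (e.g., pollution buildup)."""
    zf = np.concatenate([c_prev, h_prev, x_t])   # forget/input gates see c_{t-1}
    f_t = sigmoid(W["f"] @ zf + b["f"])          # forget gate
    i_t = sigmoid(W["i"] @ zf + b["i"])          # input gate
    g_t = np.tanh(W["c"] @ np.concatenate([h_prev, x_t]) + b["c"])  # candidate
    c_t = f_t * c_prev + i_t * g_t               # cell-state update
    zo = np.concatenate([c_t, h_prev, x_t])      # output gate sees the NEW cell state
    o_t = sigmoid(W["o"] @ zo + b["o"])          # output gate
    h_t = o_t * np.tanh(c_t)                     # hidden state
    return h_t, c_t
```

With hidden size H and input size D, W["f"], W["i"], and W["o"] have shape (H, 2H + D) and W["c"] has shape (H, H + D); a full sequence is processed by iterating this step over time.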
3.2. Attention Mechanism
The core operation of the attention mechanism is computing a weight vector that quantifies the importance of each input sequence element to the current output, essentially implementing a weighted processing procedure. Its core idea is to enable the model to dynamically focus on the parts of the input sequence that are most relevant to the current output, by increasing the weight of important sequence elements and reducing the weight of unimportant ones, thereby improving the model's performance. Especially in time series tasks, the attention mechanism helps the model automatically select the historical data most relevant to the prediction, reducing reliance on all inputs. At the same time, it can alleviate the issues of uneven temporal contributions and redundant multivariate input information in the formation process of PM2.5, thereby enhancing the model's ability to capture key stages of pollution evolution. Combined with the cumulative memory characteristics of PeepholeLSTM, the attention mechanism further strengthens the model's capability to represent critical moments and key driving factors, which partly explains the improvement in the model's predictive performance. Standard attention mechanisms typically involve three vectors: the Query vector (Q), which represents the current task's focus information; the Key vector (K), which indexes input sequence elements for matching with Q; and the Value vector (V), which contains each element's actual information and is aggregated via weighted summation. The attention mechanism can generally be divided into the following steps:
- (1)
Calculate the attention scores. The attention score is derived from a similarity function between Q and K. Common methods for calculating similarity are shown in Table 1.
- (2)
Calculate attention weights. To obtain the final attention weights, the raw attention scores are normalized through the application of a softmax function. These weights represent the relative importance of each input element to the current output.
- (3)
Weighted Summation. Use the attention weights to perform a weighted sum of the Value vectors (V), resulting in the final attention value.
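The three steps above can be sketched compactly with scaled dot-product similarity (one of the common similarity choices; the scaling by the key dimension and the function names are our illustrative assumptions, not prescribed by the original text):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over the last axis."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Steps (1)-(3): similarity scores -> softmax weights -> weighted sum of V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (1) attention scores (scaled dot product)
    weights = softmax(scores)         # (2) normalized attention weights
    return weights @ V, weights       # (3) weighted summation of the Values
```

When all keys are equally similar to the query, the weights are uniform and the output reduces to the mean of the Value vectors, which is a useful sanity check.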
3.3. Optimal Variational Mode Decomposition (OVMD)
The Variational Mode Decomposition (VMD) algorithm is an adaptive signal processing and modal decomposition method that can decompose a time series into
K intrinsic mode functions (IMFs) with different central frequencies and bandwidths, thereby achieving effective separation of the original signal [
50]. The expression of an amplitude–frequency modulated signal is given as follows:
u_k(t) = A_k(t) cos(φ_k(t)), k = 1, 2, …, K,
where K denotes the number of decomposition modes, u_k(t) represents the modal component, A_k(t) is its amplitude, and φ_k(t) denotes the instantaneous phase of the mode component u_k(t).
Through continuous iterative updates of modal components and central frequencies, it decomposes the original data into K intrinsic mode functions (IMFs) with distinct frequency characteristics. The efficacy of VMD and the subsequent predictive accuracy of the model are largely influenced by the choice of K. An insufficient number of decomposed modes will lead to insufficient decomposition and ineffective identification of data patterns. Conversely, an excessive modal decomposition can result in mode mixing between adjacent components. Therefore, in this study, the Optimal Variational Mode Decomposition (OVMD) algorithm was employed to decompose the monitoring data.
Unlike VMD, which requires empirical setting of hyperparameters, the OVMD algorithm determines the optimal value of K by analyzing the distribution of the center frequencies of the IMF components. In this way, the decomposition avoids both insufficient separation and mode mixing. The update step size τ is a core hyperparameter used in the OVMD decomposition process to control the convergence speed of signal decomposition iterations and to balance decomposition accuracy against computational efficiency. Its value directly affects the Residual Evaluation Index (REI), which in turn determines the rationality of the OVMD decomposition results: if τ is too large, it can lead to insufficient decomposition and larger residuals, while if it is too small, it increases the computational load and may cause over-decomposition. After the optimal K is obtained, the OVMD method reconstructs the IMF components generated under different step sizes and calculates the corresponding REI. The step size that yields the minimum REI value is selected as the optimal parameter, with τ searched over the range [0, 1].
In this study, during the sample construction phase for PM2.5 concentration prediction, the original dataset was first divided into a training set and a test set according to time order. OVMD decomposition was performed only on the training set to determine the optimal number of modes K, the update step τ, and the other core parameters. Subsequently, the decomposition parameters determined from the training set were directly applied to the independent test set for decomposition and subsequent prediction. A sliding window method was used throughout both the training and test sets to partition input and output sequences. The length of the sliding window was determined based on data characteristics and model input requirements, and the window moved continuously during iterations, which effectively reduced the impact of boundary effects on the error of real-time prediction results. The main decomposition procedure of OVMD is as follows:
- (1)
The optimal number of decomposition modes (K) is determined by iteratively computing and analyzing the distribution of modal center frequencies across candidate K values.
- (2)
The modal component sequences are reconstructed, and the Residual Evaluation Index (REI) between the reconstructed sequence and the original sequence is calculated to determine the optimal step size. The calculation of REI is given in Equation (8).
where U denotes the number of decomposed modes, f represents the original signal, and N is the total number of signal samples.
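The τ selection procedure can be sketched as follows. This is a sketch under stated assumptions: `vmd_decompose` is a hypothetical placeholder for any VMD implementation returning a (K, N) array of modes, and the REI is implemented here as the mean absolute residual between the reconstruction (sum of IMFs) and the original signal, which is one plausible reading of Equation (8).

```python
import numpy as np

def rei(imfs, f):
    """Residual Evaluation Index: mean absolute residual between the
    reconstructed signal (sum of IMFs) and the original signal f
    (one plausible reading of Equation (8))."""
    return np.mean(np.abs(imfs.sum(axis=0) - f))

def select_tau(f, K, vmd_decompose, taus=None):
    """Search tau over [0, 1] in steps of 0.01 and keep the value that
    minimizes the REI. `vmd_decompose(f, K, tau) -> (K, N) array` is a
    placeholder for a concrete VMD routine."""
    taus = np.arange(0.0, 1.01, 0.01) if taus is None else taus
    scores = [rei(vmd_decompose(f, K, tau), f) for tau in taus]
    return taus[int(np.argmin(scores))]
```

In practice the same grid search is run once on the training set, and the selected (K, τ) pair is then reused to decompose the test set.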
3.4. Model Construction
The PeepholeLSTM-OA model adopts a “decomposition–prediction–reconstruction” framework, as illustrated in
Figure 6.
In the decomposition stage, OVMD is employed to decompose the monitoring data, transforming the complex time series into multiple relatively stationary subsequences (IMFs). This adaptive decomposition alleviates non-stationarity and noise in the raw data, thereby providing high-quality inputs for subsequent prediction. In the prediction stage, PeepholeLSTM is used as the base learner. Compared with the traditional LSTM, PeepholeLSTM introduces direct connections from the cell state to the input, forget, and output gates, which enables the network to capture long-term dependencies more accurately within the gating mechanism and enhances its ability to model complex time series. Furthermore, an attention mechanism is incorporated into the PeepholeLSTM framework. The attention mechanism assigns weights to the contributions of different IMF components, allowing the model to focus on critical information while preserving more detailed data features, thus providing a solid foundation for performance improvement. In the reconstruction stage, the prediction results of all IMF components are aggregated through weighted summation to generate the complete PM2.5 concentration sequence. Through this "decomposition–prediction–reconstruction" process, the PeepholeLSTM-OA model achieves comprehensive extraction and utilization of multi-scale features, effectively addressing the non-linearity, non-stationarity, and multiple spatiotemporal characteristics of PM2.5 concentration series. Consequently, the model improves forecasting performance and evaluation metrics, offering a more reliable technical framework for air quality prediction and early warning. The specific prediction procedure is detailed as follows:
- (1)
PM2.5 monitoring data were processed for missing value imputation, feature selection, and normalization. These data, together with pollutant and meteorological factors, were used as model inputs (x), and the dataset was split into a training set and a test set in an 8:2 ratio.
- (2)
The training set time series were decomposed using OVMD. The optimal number of modes K was determined by iteratively evaluating the center frequencies of the IMF components, and the update step size τ was set by minimizing the REI. After obtaining the OVMD decomposition parameters, the test set was decomposed into K intrinsic mode functions (IMF 1, IMF 2, …, IMF K) with different frequency characteristics. Each component captured different features, such as trends, periodic variations, and high-frequency fluctuations in the original data, thus achieving data decomposition.
- (3)
Each IMF subsequence was predicted separately by a PeepholeLSTM network, which leveraged the preceding 12 h of data to forecast the hourly PM2.5 concentration at the subsequent time step, yielding prediction results for each IMF component. In this framework, the PeepholeLSTM captured trend and seasonal patterns while learning nonlinear temporal dependencies and long-term correlations.
- (4)
An attention mechanism was applied to assign weights to each IMF prediction according to its importance, generating a weighted output.
- (5)
The PeepholeLSTM-OA model predictions were fused and reconstructed on the test set. Model performance was evaluated using the mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R2).
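The sample construction and reconstruction logic of the procedure above can be sketched as follows. This is an illustrative sketch: `predictors` stands in for the fitted per-IMF PeepholeLSTM–attention learners (any object with a `.predict` method), and plain summation is used for the reconstruction step, whereas the full model additionally applies attention-derived weights.

```python
import numpy as np

def sliding_windows(series, lookback=12):
    """Sample construction: the previous 12 h form the input window and
    the next hour is the prediction target."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + lookback] for i in range(len(series) - lookback)])
    y = series[lookback:]
    return X, y

def predict_and_reconstruct(imfs, predictors, lookback=12):
    """Predict each IMF with its own model and sum the component forecasts
    to reconstruct the PM2.5 series."""
    parts = []
    for imf, model in zip(imfs, predictors):
        X, _ = sliding_windows(imf, lookback)
        parts.append(np.asarray(model.predict(X)))
    return np.sum(parts, axis=0)   # reconstruction by (weighted) summation
```

Because every window shares the same lookback, the per-IMF forecasts are aligned in time and can be summed elementwise.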
3.5. Error Evaluation Indices
To evaluate model performance, the coefficient of determination (R2) was used to measure regression fitting accuracy. This metric ranges from 0 to 1, with values close to 1 signifying strong alignment between observed and predicted data. The MAE and RMSE were also computed to quantify the PeepholeLSTM-OA model's predictive performance and generalization capability for PM2.5 concentrations. The formulas for R2, MAE, and RMSE are shown in Equations (9)–(11), respectively, as follows:
where n denotes the number of data samples, ŷ_i represents the predicted value, y_i denotes the observed value, and ȳ is the mean of the observed values.
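The three evaluation indices are straightforward to compute; a minimal NumPy sketch of Equations (9)–(11):

```python
import numpy as np

def mae(y, yhat):
    """Mean absolute error."""
    return np.mean(np.abs(np.asarray(y) - np.asarray(yhat)))

def rmse(y, yhat):
    """Root mean square error."""
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(yhat)) ** 2))

def r2(y, yhat):
    """Coefficient of determination: 1 minus the ratio of the residual
    sum of squares to the total sum of squares."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot
```

Lower MAE and RMSE and an R2 closer to 1 indicate better agreement between predicted and observed concentrations.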
4. Analysis of Prediction Results
For neural network and machine learning models, parameter selection significantly affects training cost and predictive performance. Proper parameter settings can enhance model training efficiency, improve prediction accuracy, and reduce the risk of overfitting. Among network parameters, the number of neurons is critical: increasing the number of neurons can improve the model's fitting ability for complex time series, but excessive neurons may lead to overfitting and higher computational cost. Epoch denotes the number of times the entire training set is processed; an appropriate value balances generalization and computational efficiency. Batch size refers to the number of samples used to update the model in each iteration, and powers of two (e.g., 32, 64, 128) are commonly chosen to optimize computational performance. In this study, the neural network models (LSTM, GRU, and PeepholeLSTM) were configured with a single hidden layer containing 50 neurons and a fully connected output layer, and were trained with the Adam optimizer at a learning rate of 0.001. To reduce overfitting, a dropout rate of 0.2 and an L2 weight decay coefficient of 1 × 10−4 were applied. The ReLU activation function and the mean squared error (MSE) loss function were used, with 100 training epochs and a batch size of 64. In addition, an early stopping strategy was adopted: when the validation loss does not decrease for 10 consecutive epochs, training is terminated automatically to avoid ineffective training and overfitting.
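The early stopping rule just described reduces to a small piece of bookkeeping; the sketch below illustrates only the stopping logic (the function name and the 1-based epoch convention are ours, not taken from the original implementation):

```python
def early_stop_epoch(val_losses, patience=10):
    """Return the 1-based epoch after which training stops: the first epoch
    at which the validation loss has not improved for `patience` consecutive
    epochs. Returns len(val_losses) if the criterion never triggers."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, since_best = loss, 0   # new best: reset the patience counter
        else:
            since_best += 1              # no improvement this epoch
            if since_best >= patience:
                return epoch             # patience exhausted: stop training
    return len(val_losses)
```

In frameworks such as Keras this corresponds to an `EarlyStopping` callback monitoring the validation loss with `patience=10`.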
For the support vector regression (SVR) model, the kernel type is the most critical parameter, as it maps input data into a high-dimensional feature space to handle nonlinear relationships. The kernel bandwidth (gamma) controls the influence range of individual samples: a large gamma may cause overfitting, whereas a small gamma produces a smoother kernel, making the model more sensitive to overall trends. The penalty parameter (C) regulates the tolerance for training errors; a large C can lead to overfitting, while a small C may result in underfitting. In this study, the SVR model was configured with a radial basis function (RBF) kernel, C set to 100, and gamma set to 0.1. The model parameters are shown in
Table 2.
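For reference, the stated SVR configuration maps directly onto scikit-learn; the sketch below uses toy random data purely for illustration (the lagged-window shape is an assumption, not the study's dataset):

```python
import numpy as np
from sklearn.svm import SVR

# SVR configured as in this study: RBF kernel, penalty C = 100, kernel bandwidth gamma = 0.1
svr = SVR(kernel="rbf", C=100, gamma=0.1)

# Toy usage: fit on random lagged windows (samples x 12 lags) and predict
X = np.random.default_rng(0).random((50, 12))
y = X[:, -1]               # illustrative target: the most recent lag
svr.fit(X, y)
pred = svr.predict(X)
```

A larger C forces the fit closer to the training points (risking overfitting), while a smaller gamma widens each sample's influence and smooths the regression surface.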
Data from the Guangzhou and Shenzhen stations were separately input into the SVR, LSTM, GRU, PeepholeLSTM, and PeepholeLSTM–attention models for training and PM2.5 concentration prediction. The first 80% of the data were used for model training to capture temporal patterns and trends, while the remaining 20% were reserved as the test set to evaluate model performance. A time step of 12 was applied, using the previous 12 h to predict the PM2.5 concentration in the 13th hour. The fitting between the predicted and observed values for both stations is shown in Figure 7. Due to the large number of hourly data points, the comparison plot is not visually clear; therefore, 24 h averaged PM2.5 concentrations were calculated to generate a daily comparison plot. To further illustrate differences among models, enlarged plots of six representative weeks are additionally provided.
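The 24 h averaging used for the daily comparison plots can be expressed compactly (a sketch; the function name is ours):

```python
import numpy as np

def daily_means(hourly, hours_per_day=24):
    """Collapse an hourly PM2.5 series into daily averages for plotting
    (drops a trailing partial day, if any)."""
    hourly = np.asarray(hourly, dtype=float)
    n_days = len(hourly) // hours_per_day
    trimmed = hourly[: n_days * hours_per_day]
    return trimmed.reshape(n_days, hours_per_day).mean(axis=1)
```

Each output point is the mean of one full day of hourly values, which smooths the curves enough to compare the models visually.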
Figure 7 visualizes the predictive performance of the different models by plotting the actual against the predicted PM2.5 concentrations on the test sets. The black curve represents observed values, while the other colors indicate predictions from the SVR, LSTM, GRU, PeepholeLSTM, and PeepholeLSTM–attention models. Overall, predictions from all five models approximately match the observed values for both the Guangzhou and Shenzhen stations, capturing the general trends of PM2.5 variations. However, the SVR model exhibits noticeably poorer performance than the deep learning models. Although SVR captures the overall trend, its predictions at peak PM2.5 values often deviate substantially from the observations and may even show trends opposite to the actual data. The LSTM, GRU, and PeepholeLSTM models demonstrate similar performance, effectively reflecting the true trends and outperforming SVR. Nevertheless, these models still show some discrepancies at peak PM2.5 values. Introducing the attention mechanism into the PeepholeLSTM model (PeepholeLSTM–attention) leads to a further improvement in predictive performance.
Although the combined PeepholeLSTM–attention model improved predictive performance, peak values were still not well captured. Therefore, OVMD was applied to decompose the time series, and the resulting IMF components were predicted using the PeepholeLSTM–attention model. The final prediction of PM2.5 concentrations was obtained by reconstructing the predicted results of the IMFs.
OVMD was first applied to the Guangzhou station data. As shown in Figure 8, when the number of modes K reached 7, the center frequencies of the IMFs for both the Guangzhou and Shenzhen stations became stable, and further increases in K did not lead to significant changes. Thus, K = 7 was selected as the optimal decomposition number.
The optimal update step τ was determined by minimizing the REI value. τ was made to traverse the interval [0, 1] with a search step of 0.01, and the Residual Evaluation Index (REI) was calculated for each candidate value. By comparing the REI values across the entire interval, the τ that minimizes the REI was selected as the optimal update step for OVMD decomposition. This ensures that the decomposition process is efficient and the decomposition results are accurate, providing high-quality feature inputs for subsequent model training. As shown in Figure 9, the REI reached its minimum at τ = 0.85 for Guangzhou and τ = 0.88 for Shenzhen.
Accordingly, for both stations, OVMD was set to K = 7, with τ = 0.85 for Guangzhou and τ = 0.88 for Shenzhen. The final decomposition results are shown in Figure 10. As can be seen from the figure, at both the Guangzhou and Shenzhen stations, the waveform fluctuations of the IMF1 component are the densest and markedly stronger than those of the subsequent components, a typical high-frequency characteristic. This component has the fastest amplitude changes and the highest frequency, reflecting the short-term, rapid random fluctuations and high-frequency variations caused by sudden pollution events in the PM2.5 data. The waveform fluctuations of the IMF7 component are minimal, with overall changes being gentle, showing the long-term variation trend of PM2.5 concentration, a typical low-frequency trend characteristic. The fluctuation frequencies of the other components lie between the high-frequency and trend components, containing certain short-term fluctuations as well as some trend changes, making them transitional mid-frequency components.
After OVMD decomposition, seven IMF components were obtained for each station, and each was predicted individually by feeding it sequentially into the PeepholeLSTM–attention model with a time step of 12. The PM2.5 concentration predictions for the two sites improved greatly, especially in capturing the actual trends and peak positions more accurately. The prediction results for each IMF component are shown in Figure 11. The results indicate that the model has fully grasped the characteristic patterns of each IMF component (including the long-term stable trend of the low-frequency components and the short-term fluctuation features of the high-frequency components) and can accurately capture the variation details of each component. This enables precise prediction of the actual trends and peak positions of PM2.5 concentrations, and also validates the rationality and effectiveness of combining OVMD decomposition with the PeepholeLSTM–attention model.
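The per-component setup, a sliding window of 12 past values used to predict the next value of each IMF, can be sketched with plain NumPy windowing (the PeepholeLSTM–attention model itself is omitted here):

```python
import numpy as np

def make_windows(series, step=12):
    # Build (input window, next value) training pairs for one IMF series,
    # matching the time step of 12 used for the per-IMF predictors.
    X = np.stack([series[i:i + step] for i in range(len(series) - step)])
    y = series[step:]
    return X, y

imf = np.arange(20.0)              # stand-in for one decomposed IMF series
X, y = make_windows(imf, step=12)
print(X.shape, y.shape)            # (8, 12) (8,)
print(y[0])                        # 12.0: the value right after the first window
```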
To further reveal how the attention mechanism functions in predicting each IMF component, and to clarify the model's focus on features at different time steps of the input sequence, this study also visualizes the attention weight heatmaps produced during the prediction of each modal component. Since the high-frequency IMFs directly reflect the core fluctuation patterns of pollution events, accurately capturing their changes is key to improving prediction accuracy. Therefore, this paper presents only the attention weight heatmap for the high-frequency IMF1 component, which exhibits strong predictive performance (Figure 12). The analysis shows that the model exhibits markedly differentiated attention patterns across the input time-step features of the different modal components; notably, the IMF components carry the highest weights compared with the other 11 features, and this attention pattern aligns closely with the physical significance of each IMF component. For pollutants like PM2.5, which have strong temporal inertia, the model inherently assigns higher attention weights to time steps closer to the prediction moment, while the weights for more distant time steps decay rapidly. Combined with the historical PM2.5 data, when sudden changes in PM2.5 concentration occur (such as unexpected heavy pollution episodes or abrupt drops in concentration), the attention weights exhibit noticeable 'abnormal peak shifts'.
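The recency pattern described above, higher weights near the prediction moment, can be illustrated with a softmax over per-step attention scores; the scores below are synthetic, not values taken from Figure 12:

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())   # subtract max for numerical stability
    return e / e.sum()

# Synthetic scores that rise toward the prediction moment, mimicking the
# temporal-inertia pattern the heatmap shows for PM2.5.
scores = np.linspace(0.0, 2.0, 12)      # one score per input time step
weights = softmax(scores)
print(int(weights.argmax()))            # 11: the most recent step dominates
print(round(float(weights.sum()), 6))   # 1.0: the weights form a distribution
```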
The reconstructed PM2.5 concentration predictions obtained by combining all IMF components are presented in Figure 13. The results demonstrate superior overall predictive performance: the PeepholeLSTM-OA model closely approximates the observed data at both stations, accurately fitting abrupt changes as well as peak and trough values, thereby achieving accurate prediction of PM2.5 concentration variations.
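Reconstruction is simply the pointwise sum of the predicted IMF series; a minimal sketch with made-up numbers:

```python
import numpy as np

# Rows = predicted IMF components (three shown for brevity; the paper uses
# seven), columns = forecast time steps; the PM2.5 forecast is the
# column-wise sum of the component predictions.
pred_imfs = np.array([[1.0, 2.0, 3.0],
                      [0.5, 0.5, 0.5],
                      [0.1, -0.1, 0.0]])
forecast = pred_imfs.sum(axis=0)
print(forecast)  # [1.6 2.4 3.5]
```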
To comprehensively assess model performance in PM2.5 concentration prediction, we not only visualized the prediction outputs to gauge their ability to track concentration fluctuations but also employed evaluation metrics to quantify the true effectiveness of these predictions. These metrics more directly reflect model stability, sensitivity, and fitting effectiveness. The MAE, RMSE, and R² for each model are presented in Table 3. The results consistently indicate that the deep learning approaches (LSTM, GRU, and PeepholeLSTM) outperform SVR for the Shenzhen case, consistent with the results at Guangzhou, showing that deep learning approaches are better suited to long time-series prediction tasks than traditional machine learning methods. Among the individual models, GRU achieved results comparable to LSTM, while PeepholeLSTM performed best, a pattern observed consistently at both the Guangzhou and Shenzhen sites. When the attention mechanism was integrated into the PeepholeLSTM, the combined model (PeepholeLSTM–attention) improved further on all three metrics, though the gains were marginal. Finally, with the introduction of the OVMD decomposition algorithm, the PeepholeLSTM-OA model achieved significant improvements over the single PeepholeLSTM: at the Guangzhou site, the MAE decreased by about 39%, the RMSE by about 45%, and R² increased by 0.0457; at the Shenzhen site, the MAE decreased by about 45%, the RMSE by about 51%, and R² increased by 0.0765. These results indicate that the model's stability, sensitivity to large errors, and fitting capability were substantially enhanced.
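The three metrics follow their standard definitions and can be computed as below; the observed and predicted values are illustrative, not the paper's data:

```python
import numpy as np

def mae(y, p):
    return np.mean(np.abs(y - p))

def rmse(y, p):
    return np.sqrt(np.mean((y - p) ** 2))

def r2(y, p):
    # Coefficient of determination: 1 minus residual over total sum of squares.
    ss_res = np.sum((y - p) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([30.0, 35.0, 40.0, 50.0])   # illustrative observed PM2.5 (μg/m³)
p = np.array([29.0, 36.0, 39.0, 52.0])   # illustrative predictions
print(mae(y, p))                          # 1.25
print(round(float(rmse(y, p)), 4))        # 1.3229
print(round(float(r2(y, p)), 4))          # 0.968
```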
To gain a more intuitive understanding of the prediction error distribution of the PeepholeLSTM-OA model, this study plotted the raw error distribution (Figure 14). According to the statistics, the error range for the Guangzhou station is [−11.56, 7.60] μg/m³, with a standard deviation of 1.675 μg/m³ and a mean of −0.791 μg/m³; for the Shenzhen station, the error range is [−14.41, 12.66] μg/m³, with a standard deviation of 1.413 μg/m³ and a mean of −0.105 μg/m³. From this, it can be concluded that the model delivers excellent PM2.5 concentration prediction performance.
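The reported error statistics (range, standard deviation, mean bias) reduce to the following NumPy calls; the error values here are made up, not the Figure 14 data:

```python
import numpy as np

# Illustrative prediction errors (predicted - observed) in μg/m³.
errors = np.array([-1.2, 0.4, -0.8, 2.1, -1.0])
print(errors.min(), errors.max())      # the error range endpoints
print(round(float(errors.std()), 3))   # population std (np.std defaults to ddof=0)
print(round(float(errors.mean()), 3))  # mean bias; negative = underprediction on average
```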
In summary, the OVMD–PeepholeLSTM–attention model proposed in this paper performs excellently in predicting PM2.5 concentrations at both sites. All evaluation metrics surpass those of the comparison models, with particularly significant advantages in capturing sudden concentration changes, peak positions, and long-term trends, thereby validating the rationality and effectiveness of the model's design.
To further verify that these strong performance metrics are genuine, assess the model's generalization ability, and rule out the risk of overfitting, this study conducted K-fold cross-validation experiments. A 5-fold strategy was used: the original training set was randomly divided into five equally sized subsets, and in each round four subsets served as the training set and one as the validation set. The training and validation process was repeated five times, and the evaluation metrics across the five validation runs were used to assess the model's stability and to rule out random errors arising from a single training session. The experimental results are shown in Table 4. The 5-fold cross-validation results show that the model's average R² across the five validations exceeds 0.94, and the average MAE and RMSE differ little from the single-training results, with no significant fluctuations, indicating that the model is stable and that random errors from a single training session have been ruled out.
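The 5-fold split can be sketched without any ML library; the random seed and fold handling below are assumptions for illustration:

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    # Randomly permute the sample indices, then split them into k near-equal
    # folds, mirroring the paper's 5-fold cross-validation setup.
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), k)

folds = kfold_indices(100, k=5)
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # fit the model on train_idx, evaluate MAE/RMSE/R² on val_idx here
print([len(f) for f in folds])  # [20, 20, 20, 20, 20]
```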
The analysis of the above experiments indicates that the prediction results of the PeepholeLSTM-OA model are highly reliable. It performs excellently in terms of stability, sensitivity to anomalous data, and the ability to uncover data patterns, making it practically valuable for predicting changes in PM2.5 concentrations.