A Comparative Study of CNN-sLSTM-Attention-Based Time Series Forecasting: Performance Evaluation on Data with Symmetry and Asymmetry Phenomena
Abstract
1. Introduction
1.1. Literature Review
- PatchTST [8] segments time series into interpretable patches, reducing computational complexity while preserving long-range dependencies.
- iTransformer [9] inverts attention to model cross-variate dependencies, demonstrating superior performance on multivariate meteorological data.
- TimesNet [10] converts 1D series into 2D temporal matrices, enabling simultaneous analysis of periodicity and trends.
1.2. Technical Challenges
- Spatiotemporal Coupling: Temperature variations are influenced by complex interactions between local microclimates and regional weather systems. This requires models to simultaneously capture spatial correlations and temporal dynamics—a capability lacking in most existing architectures, especially when handling both symmetric spatial patterns (e.g., uniform temperature distribution in a stable weather system) and asymmetric spatial patterns (e.g., uneven temperature gradients in a storm front).
- Long-Range Dependencies: Meteorological phenomena often exhibit periodic patterns spanning days to seasons. Traditional LSTMs suffer from performance degradation with sequences longer than 100 steps, while transformer models become prohibitively expensive for extended time series, hindering the modeling of long-term symmetric cycles and asymmetric long-term trends.
- Non-Stationarity: Temperature data frequently contains abrupt shifts caused by weather fronts, seasonal transitions, or extreme events. This challenges models’ ability to adapt to distributional shifts without catastrophic forgetting.
- Multi-Scale Feature Integration: Effective temperature prediction requires integrating fine-grained (hourly) and coarse-grained (daily) patterns, a task complicated by the varying importance of different temporal scales across prediction horizons.
- Computational Efficiency: Real-world applications demand models that can process large volumes of sensor data in near real-time, posing a direct conflict with the high computational requirements of state-of-the-art attention-based architectures.
- Noise Interference Challenge: Temperature time-series data is inherently susceptible to noise interference from sensor errors, environmental disturbances (e.g., sudden wind gusts affecting surface temperature readings), or incomplete data acquisition—issues that exacerbate the difficulty of modeling symmetric (e.g., diurnal temperature cycles) and asymmetric (e.g., heatwave spikes) patterns. Accurate modeling of such noisy, complex temperature data requires addressing two core challenges simultaneously: first, enhancing robustness against asymmetric and random noise interference to avoid distorting key temporal patterns; second, ensuring efficient capture of critical structural features (e.g., stable symmetric cycles, abrupt asymmetric anomalies) amid noise. While this dual challenge has been extensively explored in fields such as hyperspectral remote sensing and high-dimensional computer vision—where structural feature processing and stable representation methods have been developed to handle noisy high-dimensional data—these insights remain underutilized in temperature time-series prediction. Adapting such cross-domain strategies could provide valuable support for improving the noise resistance of temperature forecasting models [14,15,16].
1.3. Contributions
- Stable LSTM (sLSTM) Architecture: A modified LSTM incorporating exponential gating and layer normalization is proposed, which extends the effective gradient propagation threshold from 100 to over 500 time steps. This enhancement enables robust modeling of multi-periodic temperature patterns and long-term degenerative trends. The model therefore primarily targets medium-to-long-term forecasting, on which it performs well (a minimal sketch of such a cell is given after this list).
- Collaborative Feature Fusion Framework: A hybrid CNN-sLSTM-Attention model is developed, integrating convolutional layers for local spatial feature extraction, sLSTM for capturing long-range temporal dependencies, and a lightweight attention mechanism that dynamically emphasizes critical weather patterns without quadratic complexity.
- Cross-Scenario Validation: Comprehensive evaluations are conducted across six diverse datasets (urban, agricultural and industrial temperature monitoring) with varying temporal resolutions (from hourly to weekly). The results demonstrate consistent improvements: a 33% reduction in RMSE compared to traditional LSTMs, 10× faster convergence than LSTM models, and superior performance during extreme weather events.
- Ablation Studies: Systematic experiments were performed to verify the contribution of each component. The findings show that sLSTM reduces prediction error by 15–47% for long sequences, while the attention mechanism improves the detection of extreme temperature events.
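To make the sLSTM modification concrete, the following is a minimal PyTorch sketch of a layer-normalized recurrent cell with exponential input and forget gates and a normalizer state. It is an illustrative simplification under our own assumptions (the class name `SLSTMCell`, the clamping constant, and the normalizer update are ours), not the authors' implementation.

```python
# Minimal sketch of a layer-normalized sLSTM-style cell with exponential gating.
import torch
import torch.nn as nn


class SLSTMCell(nn.Module):
    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        # One linear map producing pre-activations for the i, f, z (cell input), o gates.
        self.proj = nn.Linear(input_size + hidden_size, 4 * hidden_size)
        self.norm = nn.LayerNorm(4 * hidden_size)  # stabilizes gate pre-activations
        self.hidden_size = hidden_size

    def forward(self, x, state):
        h, c, n = state                       # hidden, cell, and normalizer states
        pre = self.norm(self.proj(torch.cat([x, h], dim=-1)))
        i_pre, f_pre, z_pre, o_pre = pre.chunk(4, dim=-1)

        i = torch.exp(i_pre.clamp(max=10.0))  # exponential input gate (clamped for stability)
        f = torch.exp(f_pre.clamp(max=10.0))  # exponential forget gate
        z = torch.tanh(z_pre)                 # candidate cell input
        o = torch.sigmoid(o_pre)              # standard output gate

        c = f * c + i * z                     # un-normalized cell state
        n = f * n + i                         # normalizer tracks accumulated gate mass
        h = o * (c / (n + 1e-6))              # normalized hidden state
        return h, (h, c, n)


# Usage: roll the cell over a long sequence of shape (batch, time, features).
if __name__ == "__main__":
    cell, B, H = SLSTMCell(input_size=8, hidden_size=32), 4, 32
    x = torch.randn(B, 500, 8)
    state = tuple(torch.zeros(B, H) for _ in range(3))
    for t in range(x.size(1)):
        h, state = cell(x[:, t], state)
    print(h.shape)  # torch.Size([4, 32])
```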
1.4. Limitations
- Spatial Generalization: The current model achieves optimal performance when deployed in regions with climatic characteristics similar to those of the training data. However, its performance degrades in environments with rapidly changing climates or in geographically distinct areas.
- Extreme Event Handling: Although the model’s performance in this regard is improved compared to the baseline method, it still struggles with unprecedented extreme weather events (e.g., record-breaking heatwaves). This is primarily due to a scarcity of training data for such anomalous events.
- Computational Trade-offs: While the attention mechanism is lightweight compared to Transformer models, it still introduces additional computational overhead relative to vanilla LSTMs. This constraint may limit the model’s deployment in resource-constrained edge devices [17].
- Feature Dependency: The model's performance depends on access to a diverse set of meteorological features (e.g., humidity, wind speed). Consequently, it is less effective in regions with limited sensor coverage or incomplete historical meteorological data.
2. Related Work
2.1. CNN Model Research
- Sparse Connections: Each neuron is connected only to a local region of the previous layer through its receptive field.
- Weight Sharing: Convolutional kernels are shared across the input, maintaining translational invariance.
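As a brief illustration of these two properties (not taken from the paper), a single 1D convolutional layer in PyTorch connects each output to only a local temporal window and reuses the same kernel weights at every position:

```python
# Sparse connections and weight sharing in a 1D convolution.
import torch
import torch.nn as nn

conv = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
x = torch.randn(8, 1, 96)          # (batch, channels, time steps)
y = conv(x)                        # (8, 16, 96): each output depends on a 3-step local window
print(conv.weight.shape)           # torch.Size([16, 1, 3]) -- the same kernels slide over all 96 positions
```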
2.2. sLSTM Model Research
2.3. Attention Mechanisms
3. Research Methodology
3.1. CNN-sLSTM-Attention Model Principle
3.1.1. Comparative Architectural Analysis
3.1.2. CNN and sLSTM Feature Interaction
3.1.3. sLSTM and Attention Gradient Synergy
3.2. Forecast Framework
3.2.1. Data Preprocessing
- Sliding Window: A sliding window of fixed length (in time steps) is used to construct input sequences.
- Dataset Split: The data is partitioned into training, validation, and test sets with an 80%–10%–10% ratio.
- Normalization: Min-max scaling is applied independently to each sliding window.
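The following is a minimal preprocessing sketch under the choices listed above (window construction, chronological 80%–10%–10% split, per-window min-max scaling). The function names, the toy series, and the window length of 96 steps are our own placeholders, not values from the paper.

```python
# Sliding-window construction with per-window min-max scaling and a chronological split.
import numpy as np


def make_windows(series: np.ndarray, window: int):
    """Build (input window, next-step target) pairs; scale each window to [0, 1]."""
    X, y = [], []
    for start in range(len(series) - window):
        w = series[start:start + window]
        lo, hi = w.min(), w.max()
        scale = (hi - lo) if hi > lo else 1.0
        X.append((w - lo) / scale)                       # min-max applied per sliding window
        y.append((series[start + window] - lo) / scale)  # target scaled with the same window stats
    return np.stack(X), np.array(y)


series = np.sin(np.linspace(0, 60, 2000)) + 0.1 * np.random.randn(2000)  # toy series
X, y = make_windows(series, window=96)

# Chronological 80/10/10 split (no shuffling, so the test set follows the training period).
n = len(X)
i_tr, i_va = int(0.8 * n), int(0.9 * n)
X_train, y_train = X[:i_tr], y[:i_tr]
X_val, y_val = X[i_tr:i_va], y[i_tr:i_va]
X_test, y_test = X[i_va:], y[i_va:]
print(X_train.shape, X_val.shape, X_test.shape)
```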
3.2.2. Model Training
- Early Stopping: Training is halted if no improvement in validation loss is observed for a set number of consecutive epochs (a patience-based sketch follows this list).
- Dropout: A dropout rate of 0.2 is applied to the sLSTM layers to regularize the model.
- Batch Size: Models are trained using batches of 64 samples.
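A minimal patience-based early-stopping sketch consistent with the settings above (dropout 0.2, batch size 64). The simple feed-forward stand-in model, the patience of 10 epochs, and the random tensors are our own assumptions for illustration, not the paper's configuration.

```python
# Training loop with patience-based early stopping and best-checkpoint restoration.
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(96, 128), nn.ReLU(), nn.Dropout(0.2), nn.Linear(128, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

X_train, y_train = torch.randn(1024, 96), torch.randn(1024, 1)   # stand-in tensors
X_val, y_val = torch.randn(128, 96), torch.randn(128, 1)
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)

best_val, best_state, patience, bad_epochs = float("inf"), None, 10, 0
for epoch in range(200):
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()

    model.eval()
    with torch.no_grad():                       # gradients disabled for validation
        val_loss = loss_fn(model(X_val), y_val).item()

    if val_loss < best_val:
        best_val, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:              # stop after `patience` epochs without improvement
            break

model.load_state_dict(best_state)               # restore the best checkpoint
```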
3.2.3. Evaluation Protocol
- Validation Loss Monitoring: The Root Mean Square Error (RMSE) on the validation set is monitored during training: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$.
- Test Set Metrics: The following metrics are computed on the test set:
- Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left| y_i - \hat{y}_i \right|$
- Coefficient of Determination ($R^2$): $R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of the observed values.
- Pearson Correlation Coefficient ($r$): $r = \frac{\sum_{i=1}^{n}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}\sqrt{\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2}}$. This metric quantifies the linear correlation between the predictions and the actual observations.
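A compact NumPy sketch of these test-set metrics (array and function names are ours):

```python
# RMSE, MAE, R^2, and Pearson correlation on a held-out test set.
import numpy as np


def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    pearson = np.corrcoef(y_true, y_pred)[0, 1]      # linear correlation between predictions and observations
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "Pearson": pearson}


print(evaluate(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.1, 1.9, 3.2, 3.8])))
```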
4. Research Process
4.1. Data Overview
4.2. Sliding Window Analysis
4.3. Sensitivity Analysis
4.3.1. Hyperparameter Grid Tuning
Hyperparameter Grid and Model Structure Definition
- CNN Layer: Extracts local features through two convolutional layers (channel dimensions: 1 → 16 → 32) with max-pooling (stride 2), reducing the temporal length to one-quarter of the original.
- sLSTM Layer: Processes the CNN outputs using a custom sLSTMCell enhanced with layer normalization to stabilize the learning of long-term dependencies.
- Multi-head Attention Layer: Computes temporal attention weights through independent heads and averages the results to emphasize critical time steps.
- Fully Connected Layer: Produces single-step predictions. All architectural hyperparameters are adjustable via the grid search parameters.
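Based on the layer description above, the following is a minimal PyTorch sketch of how the four blocks can be wired together. It is illustrative only: the custom sLSTMCell is replaced by a layer-normalized `nn.LSTM` stand-in, the attention block uses PyTorch's `nn.MultiheadAttention` rather than the paper's head-averaging scheme, and the class and argument names are our own.

```python
import torch
import torch.nn as nn


class CNNsLSTMAttention(nn.Module):
    """Illustrative wiring of CNN -> recurrent block -> attention -> FC head."""

    def __init__(self, kernel_size=3, hidden_size=128, num_heads=4, dropout=0.2):
        super().__init__()
        pad = kernel_size // 2
        # Two conv layers (channels 1 -> 16 -> 32), each followed by stride-2 max pooling,
        # so the temporal length is reduced to one quarter of the input window.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size, padding=pad), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size, padding=pad), nn.ReLU(), nn.MaxPool1d(2),
        )
        # Stand-in for the custom layer-normalized sLSTMCell described in the paper.
        self.rnn = nn.LSTM(input_size=32, hidden_size=hidden_size, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, dropout=dropout,
                                          batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)          # single-step prediction head

    def forward(self, x):                            # x: (batch, window_length)
        z = self.cnn(x.unsqueeze(1))                 # (batch, 32, window_length // 4)
        z = z.transpose(1, 2)                        # (batch, time, features) for the RNN
        h, _ = self.rnn(z)
        h = self.norm(h)
        a, _ = self.attn(h, h, h)                    # temporal self-attention over hidden states
        return self.fc(a.mean(dim=1))                # pool attended steps, predict one value


if __name__ == "__main__":
    model = CNNsLSTMAttention(kernel_size=7, hidden_size=256, num_heads=1, dropout=0.1)
    print(model(torch.randn(64, 96)).shape)          # torch.Size([64, 1])
```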
Grid Search and Model Evaluation
- Iterate through all configurations generated by ParameterGrid, initializing a separate model instance for each unique combination.
- Train each model using the Adam optimizer (learning rate = 0.0001) for 50 epochs, minimizing the mean squared error (MSE) loss.
- Record the training and validation loss for every epoch, ensuring gradient computation is disabled during the validation phase.
- Extract the final validation MSE and the total parameter count for each configuration, storing the results in a structured DataFrame for subsequent analysis.
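A sketch of this grid-search loop, assuming the `CNNsLSTMAttention` class from the earlier sketch is in scope and using random stand-in tensors; for brevity only the final validation MSE (rather than the full per-epoch loss curves) is recorded per configuration.

```python
# Grid search over the candidate hyperparameters with sklearn's ParameterGrid.
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import ParameterGrid
from torch.utils.data import DataLoader, TensorDataset

param_grid = {"kernel_size": [3, 5, 7], "hidden_size": [64, 128, 256],
              "num_heads": [1, 4], "dropout": [0.1, 0.3, 0.5]}

X_train, y_train = torch.randn(1024, 96), torch.randn(1024, 1)       # stand-in data
X_val, y_val = torch.randn(128, 96), torch.randn(128, 1)
train_loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)
loss_fn = nn.MSELoss()

records = []
for cfg in ParameterGrid(param_grid):
    model = CNNsLSTMAttention(**cfg)                                  # fresh model per configuration
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(50):                                           # fixed 50-epoch budget
        model.train()
        for xb, yb in train_loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    model.eval()
    with torch.no_grad():                                             # gradients disabled during validation
        val_mse = loss_fn(model(X_val), y_val).item()
    records.append({**cfg, "val_mse": val_mse,
                    "n_params": sum(p.numel() for p in model.parameters())})

results = pd.DataFrame(records).sort_values("val_mse")                # structured results for analysis
print(results.head())
```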
4.3.2. Visualization Analysis
- National-illness (low-frequency): Best results (MSE = 0.1403) with kernel size 7 and hidden size 256, with hidden size having a greater impact than kernel size.
- ETTh1 (high-frequency): Optimal performance (MSE = 0.0045) achieved with kernel size 7 and hidden size 256, showing significant improvement with larger hidden dimensions.
- Weather (multivariate): Minimum MSE (0.3019) observed with kernel size 3 and hidden size 256, demonstrating different kernel size preferences compared to other datasets.
- Greater sensitivity to sLSTM hidden size than CNN kernel size.
- Consistent performance improvement with increasing hidden size (64 → 256) reducing MSE by 37–52%.
- More modest gains from kernel size adjustments (3 → 7), improving MSE by 5–10%.
4.3.3. Dataset Parameter Configuration
4.4. Data Segmentation
5. Research Results and Analysis
5.1. Ablation Study Performance
| Dataset | Metric | LSTM | sLSTM | CNN-LSTM-FC | CNN-sLSTM-FC | CNN-sLSTM-Attention | CNN-SsLSTM-Attention |
|---|---|---|---|---|---|---|---|
| Temperature | R² | 0.912 | 0.701 | 0.912 | 0.84 | 0.905 | 0.942 |
| | Pearson | 0.966 | 0.918 | 0.967 | 0.955 | 0.962 | 0.971 |
| | RMSE | 0.648 | 0.954 | 0.518 | 0.595 | 0.436 | 0.454 |
| | MAE | 5.205 | 7.376 | 4.313 | 4.69 | 3.332 | 3.524 |
| Traffic | R² | 0.912 | 0.474 | 0.892 | 0.915 | 0.897 | 0.918 |
| | Pearson | 0.955 | 0.806 | 0.958 | 0.964 | 0.954 | 0.964 |
| | RMSE | 0.357 | 0.824 | 0.373 | 0.318 | 0.315 | 0.324 |
| | MAE | 2.309 | 6.021 | 2.29 | 1.805 | 1.923 | 1.986 |
| Exchange Rate | R² | 0.937 | 0.797 | 0.922 | 0.932 | 0.964 | 0.991 |
| | Pearson | 0.971 | 0.929 | 0.967 | 0.982 | 0.98 | 0.982 |
| | RMSE | 0.177 | 0.360 | 0.223 | 0.119 | 0.127 | 0.105 |
| | MAE | 1.375 | 2.8 | 1.743 | 0.906 | 0.972 | 0.775 |
| National-illness | R² | −0.438 | −4.14 | −0.186 | −2.16 | 0.214 | −0.054 |
| | Pearson | 0.545 | 0.018 | 0.53 | 0.632 | 0.554 | 0.515 |
| | RMSE | 2.139 | 3.732 | 1.638 | 3.154 | 2.060 | 1.931 |
| | MAE | 20.336 | 34.318 | 14.611 | 29.049 | 19.047 | 17.439 |
| ETTh1 | R² | 0.609 | −0.391 | 0.916 | 0.918 | 0.906 | 0.906 |
| | Pearson | 0.854 | 0.597 | 0.958 | 0.959 | 0.955 | 0.955 |
| | RMSE | 0.614 | 0.415 | 0.138 | 0.134 | 0.134 | 0.134 |
| | MAE | 4.842 | 3.196 | 0.945 | 0.884 | 0.913 | 0.884 |
| Weather | R² | 0.983 | 0.780 | 0.981 | 0.983 | 0.983 | −0.722 |
| | Pearson | 0.991 | 0.894 | 0.991 | 0.992 | 0.992 | 0.934 |
| | RMSE | 0.205 | 0.735 | 0.216 | 0.205 | 0.203 | 0.213 |
| | MAE | 1.227 | 5.387 | 1.369 | 1.267 | 1.264 | 1.392 |
5.2. Model Performance Evaluation
5.3. Predictive Performance Analysis
5.3.1. Temperature Prediction Performance
5.3.2. National-Illness Prediction Performance
5.3.3. Exchange Rate Prediction Performance
5.4. Attention Mechanism Analysis
6. Conclusions and Prospects
6.1. Summary of Research Results
- Collaborative Feature Fusion Framework: The proposed model achieves synergistic integration of convolutional neural networks (CNN), stable long short-term memory (sLSTM), and attention mechanisms, moving beyond simple module stacking. The CNN extracts local spatiotemporal features, the sLSTM captures long-term dependencies through exponential gating, and the attention mechanism dynamically prioritizes critical timesteps. This “local-temporal-key” hierarchical optimization explicitly maps spatial features to temporal dependencies, effectively overcoming the feature dilution problem prevalent in traditional CNN-LSTM architectures.
- Performance and Efficiency Gains: Extensive evaluations on six diverse datasets demonstrate consistent improvements: a significant reduction in RMSE compared to conventional LSTMs for temperature prediction, 25× faster convergence than transformer-based models in traffic flow forecasting, and substantial error reduction for sequences exceeding 300 timesteps. The model exhibits exceptional robustness in handling both symmetric periodic patterns (e.g., temperature cycles) and asymmetric noisy data (e.g., traffic flow peaks, disease phase shifts), addressing key challenges in time series analysis.
- Interpretability and Theoretical Foundations: Comprehensive ablation studies, sensitivity analyses, and visual analytics validate the model’s effectiveness while elucidating its underlying mechanisms. The exponential gating in sLSTM amplifies strongly dependent features, layer normalization stabilizes gradient propagation, and attention gradients guide cell state updates toward critical temporal regions. These insights establish theoretical foundations for stable temporal modeling while providing practical guidance for real-world implementation.
6.2. Research Limitations and Future Directions
- Adaptability to Non-Stationary Data: The model exhibits performance degradation when handling datasets with abrupt distribution shifts or multi-phase patterns, as observed in the national-illness dataset. This highlights a limitation in modeling non-stationary sequences. Future work could incorporate adaptive normalization techniques and concept drift detection mechanisms to improve robustness in dynamically changing environments.
- Generalization Under Data Scarcity: The experimental evaluation relies on Min–Max normalization and contains limited examples of extreme events, such as unprecedented heatwaves or pandemic surges. This may affect real-world generalization. Future studies should explore robust scaling methods and integrate adversarial training with synthetically generated extreme events to enhance model resilience.
- Computational Efficiency: The integration of CNN and attention mechanisms increases model complexity, leading to training times 3–5× longer than those of vanilla LSTM models. This poses a challenge for resource-constrained applications. Future efforts should focus on model compression techniques, such as architectural pruning and quantization, to enable efficient deployment on edge devices.
- Architectural Efficiency: Investigate depthwise-separable convolutions, sparse attention mechanisms, and knowledge distillation to reduce computational overhead while maintaining performance, particularly for edge computing scenarios.
- Multimodal Integration: Incorporate complementary data sources—such as meteorological, geographical, and socioeconomic features—using cross-modal attention to improve contextual modeling and interpretability.
- Adaptive Learning Strategies: Develop dynamic hyperparameter optimization frameworks and meta-learning approaches to enhance adaptability to non-stationary data distributions and concept drift.
- Explainability Enhancements: Leverage attention weight visualization, Shapley value analysis, and counterfactual explanations to increase model transparency and facilitate adoption by domain experts.
- Application Expansion: Extend the framework to new domains, including financial volatility forecasting, industrial predictive maintenance, and online continual learning under streaming data constraints.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chen, W.G.; Teng, L.; Liu, J. Genetic algorithm optimized SVM model for transformer winding hotspot temperature prediction. Trans. China Electrotech. Soc. 2014, 29, 44–51. [Google Scholar]
- Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice, 2nd ed.; OTexts: Melbourne, Australia, 2018. [Google Scholar]
- Greff, K.; Srivastava, R.K.; Koutník, J.; Steunebrink, B.R.; Schmidhuber, J. LSTM: A Search Space Odyssey. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2222–2232. [Google Scholar] [CrossRef]
- Zhou, Z.K.; Yang, L.C.; Wang, Z. Remaining Useful Life Prediction of AeroEngine using CNN-LSTM and mRMR Feature Selection. In Proceedings of the 4th International Conference on System Reliability and Safety Engineering, Guangzhou, China, 28–30 October 2022; pp. 41–45. [Google Scholar]
- Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–15. [Google Scholar]
- Tay, Y.; Dehghani, M.; Bahri, D.; Metzler, D. Efficient Transformers: A Survey. ACM Comput. Surv. 2022, 55, 1–28. [Google Scholar] [CrossRef]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems 30; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 5998–6008. [Google Scholar]
- Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Liu, Y.; Hu, T.; Zhang, H.; Wu, H.; Wang, S.; Ma, L.; Long, M. iTransformer: Inverted Transformers Are Effective for Time Series Forecasting. arXiv 2023, arXiv:2310.06625. [Google Scholar]
- Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. In Proceedings of the 11th International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Zhou, Z.H. Machine Learning; Springer: Singapore, 2021. [Google Scholar]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
- Zhang, S.; Wu, Y.; Che, T.; Lin, Z.; Memisevic, R.; Salakhutdinov, R.R.; Bengio, Y. Architectural complexity measures of recurrent neural networks. In Advances in Neural Information Processing Systems 29; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 1–9. [Google Scholar]
- Xu, S.; Ke, Q.; Peng, J.; Cao, X.; Zhao, Z. Pan-Denoising: Guided Hyperspectral Image Denoising via Weighted Represent Coefficient Total Variation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5528714. [Google Scholar] [CrossRef]
- Xu, S.; Cao, X.; Peng, J.; Ke, Q.; Ma, C.; Meng, D. Hyperspectral Image Denoising by Asymmetric Noise Modeling. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5545214. [Google Scholar] [CrossRef]
- Xu, S.; Zhao, Z.; Cao, X.; Peng, J.; Zhao, X.; Meng, D.; Zhang, Y.; Timofte, R.; Gool, L.V. Parameterized Low-Rank Regularizer for High-dimensional Visual Data. Int. J. Comput. Vis. 2025. [Google Scholar] [CrossRef]
- Han, S.; Mao, H.; Dally, W.J. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep Learning. Nature 2015, 521, 436–444. [Google Scholar] [CrossRef] [PubMed]
- Beck, M.; Pöppel, K.; Spanring, M.; Auer, T.; Prudnikova, O.; Kopp, M.; Klambauer, G.; Brandstetter, J.; Hochreiter, S. xLSTM: Extended Long Short-Term Memory. arXiv 2024, arXiv:2405.04517. [Google Scholar]
- Lim, B.; Zohren, S. Time Series Forecasting with Deep Learning: A Survey. Philos. Trans. R. Soc. A 2021, 379, 20200209. [Google Scholar] [CrossRef] [PubMed]
- Prechelt, L. Early Stopping—But When. In Neural Networks: Tricks of the Trade; Orr, G.B., Müller, K.R., Eds.; Springer: Berlin/Heidelberg, Germany, 1998; pp. 55–69. [Google Scholar]
- Kong, Y.; Wang, Z.; Nie, Y.; Zhang, Q.; Li, D. Unlocking the Power of LSTM for Long Term Time Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 22–27 February 2025; Volume 39, pp. 11968–11976. [Google Scholar]
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
- Mariappan, Y.; Ramasamy, K.; Velusamy, D. An Optimized Deep Learning Based Hybrid Model for Prediction of Daily Average Global Solar Irradiance Using CNN-SLSTM Architecture. Sci. Rep. 2025, 15, 10761. [Google Scholar] [CrossRef] [PubMed]
- Zhao, C.Y.; Huang, X.Z.; Li, Y.X.; Liu, Z.Q. A Novel Cap-LSTM Model for Remaining Useful Life Prediction. IEEE Sens. J. 2021, 21, 23498–23509. [Google Scholar] [CrossRef]
| Dataset | Features | Timesteps | Granularity |
|---|---|---|---|
| Weather | 21 | 52,696 | 10 min |
| Exchange-rate | 6 | 7589 | 1 day |
| National-illness | 7 | 966 | 1 week |
| ETTh1 | 7 | 17,420 | 1 h |
| Delhi Temperature | 4 | 1463 | 1 day |
| Traffic | 862 | 17,544 | 1 h |
| Hyperparameter | Candidate Values | Rationale |
|---|---|---|
| CNN kernel size | 3, 5, 7 | To capture short-, medium-, and long-range local temporal features, respectively |
| sLSTM hidden size | 64, 128, 256 | To progressively increase the capacity for temporal information processing |
| Attention heads | 1, 4 | To compare the performance of single-head versus multi-head attention mechanisms |
| Dropout rate | 0.1, 0.3, 0.5 | To evaluate different regularization strengths for preventing overfitting |
| Model Type | Parameter Strategy | Specific Configuration |
|---|---|---|
| Proposed Model | Dataset-specific optimal parameters | ETTh1: K7-H256-A1-D0.5; National-illness: K7-H256-A1-D0.1; Weather: K3-H256-A1-D0.3 |
| Baseline Models | Industry-standard parameters | LSTM hidden size = 128, CNN kernel = 3 |
| Component Models | Aligned with proposed model | Matching sLSTM dimensions for fair comparison |
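The shorthand in the table above (e.g., “K7-H256-A1-D0.5”) encodes kernel size, hidden size, attention heads, and dropout rate. A possible decoding into constructor arguments for the model sketch given earlier; the mapping is our reading of the table, and the dictionary name is ours.

```python
# Dataset-specific optimal configurations decoded from the K-H-A-D shorthand.
DATASET_CONFIGS = {
    "ETTh1":            {"kernel_size": 7, "hidden_size": 256, "num_heads": 1, "dropout": 0.5},
    "National-illness": {"kernel_size": 7, "hidden_size": 256, "num_heads": 1, "dropout": 0.1},
    "Weather":          {"kernel_size": 3, "hidden_size": 256, "num_heads": 1, "dropout": 0.3},
}
# Example: model = CNNsLSTMAttention(**DATASET_CONFIGS["ETTh1"])
```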