Open Access Article

Peer-Review Record

An Approach to Spatiotemporal Air Quality Prediction Integrating SwinLSTM and Kriging Methods

Sustainability 2025, 17(7), 2918; https://doi.org/10.3390/su17072918

by Jiangquan Xie¹, Fan Liu², Shuai Liu¹ and Xiangtao Jiang¹,*

Reviewer 1: Anonymous

Reviewer 2: Bacos Ioan-Bogdan

Reviewer 3: Anonymous

Submission received: 18 February 2025 / Revised: 14 March 2025 / Accepted: 17 March 2025 / Published: 25 March 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This study addresses air pollution prediction using a deep learning-based approach. It first applies kriging interpolation to meteorological and air pollutant data to obtain spatial distributions. Then, a Swin-LSTM model, combining Swin-Transformer for feature extraction with LSTM for temporal learning, captures long-range dependencies more effectively than traditional CNNs.

Here are some detailed comments on the sections that concern the methodology as presented in the manuscript.

Introduction

The introduction is well-written, and the context of the work is clear. The only point of concern is that lines 77-79 are unclear. Do you mean that predicting air quality over time and space is similar to predicting the next frames in a video, where patterns from past frames are used to anticipate future ones?

Main method

On line 239, it would be helpful to specify which variogram models were used, such as spherical or linear.
On line 240 it would be helpful to add a reference to cosine positional encoding.

Dataset collection and preprocessing

It's unclear whether the 29 monitoring stations also collected meteorological variables. Please specify the source of this kind of variable.
Figure 6 is very important, but it could benefit from including the unit of measurement on the X-axis. Furthermore, it’s mentioned that the results show the PM2.5 indicator distribution predicted by the SwinLSTM model aligns more closely with the actual values, but it’s not clear what the "true value" subplot refers to. Is it the result of kriging, followed by an evaluation of the difference from the model output? It would be helpful to clarify this in the figure caption or description for better understanding. In this sense, Figure 7 is much clearer.
Why is Station 1 used as a benchmark? Is it a particularly relevant station, such as one located in an urban traffic area?
Table 1 reports the PM10 index prediction comparison; for consistency with the figures, it would be important to include the PM2.5 table as well.

Conclusion

The conclusions could be expanded. For example, a brief mention of considering geographical factors such as terrain in Kriging should be addressed, as well as including "hyperlocal" sources like emissions and traffic. Finally, it is unclear how the other mentioned pollutants are utilized, as only the PM results are reported. Has the same approach been used, for example, for gaseous pollutants?

Author Response

Responses to the Editor’s and Reviewers’ Comments on the Paper Entitled “An Approach for Spatio-Temporal Air Quality Prediction Integrating Swin-LSTM and Kriging Methods”

(No.: Sustainability - 3510189)

Jiangquan Xie, Fan Liu, Shuai Liu, Xiangtao Jiang

The authors would like to thank the Editor and the anonymous reviewers for their valuable comments and suggestions.

Responses to Reviewer #1:

Comment 1.0:

Responses:

Thanks very much for your encouraging comments.

Comment 1.1:

Introduction

Responses:

Thank you to the reviewers for their valuable scientific advice. The authors would like to convey that spatio-temporal prediction of air quality is conceptually similar to video frame prediction, in that both require the extraction of spatio-temporal dynamic patterns from historical data to predict future states.

Therefore, we have modified lines 77-79 to clarify this analogy using a clearer and more detailed formulation that emphasizes the similarities between the two tasks: both require understanding spatio-temporal evolution patterns, extracting patterns from continuous time series, and predicting future spatial distributions. The modification expresses our intention more clearly and helps the reader to understand the theoretical rationale for our use of visual modeling techniques for air quality prediction.

Comment 1.2:

Main method

On line 239, it would be helpful to specify which variogram models were used, such as spherical or linear.

On line 240 it would be helpful to add a reference to cosine positional encoding.

Responses:

As suggested by the reviewers, the paper now gives specific details on the choice of variogram model and adds a reference for cosine positional encoding.

For the variogram model, we explicitly state in line 239 that the spherical variogram model is used in this study and explain the reason for choosing this model. Meanwhile, we add reference [28] on cosine positional encoding in line 240, citing the work of Lopez-Avila et al. (2024), “Positional encoding is not the same as context: a study on positional encoding for Sequential recommendation.” These changes enhance the completeness and reproducibility of the method description.

The updated content is as follows:

Initially, based on the location information of monitoring stations, a two-dimensional meteorological data distribution image X_t for each time step t is obtained through kriging interpolation. For the variogram modeling in this study, a spherical model was selected after comparing the fitness with exponential and Gaussian alternatives, as it best captured the spatial correlation structure of the air quality data in our study region while maintaining computational efficiency. Subsequently, the image is segmented into non-overlapping patches. After applying cosine positional encoding [28] to each patch, they are input into the SwinLSTM Block for spatial feature extraction.

And add it to the list of references:

[28] Lopez-Avila, Alejo, et al. "Positional encoding is not the same as context: A study on positional encoding for Sequential recommendation." arXiv preprint arXiv:2405.10436 (2024).
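
The patching and positional-encoding step described in the updated text can be illustrated with a minimal sketch. It is not the authors' implementation: the patch size, the toy 4x4 grid, and all function names are hypothetical, and because the manuscript does not spell out the exact form of its cosine positional encoding, the standard sine/cosine encoding from the Transformer literature is assumed here. A real SwinLSTM would also project each patch through a learned embedding layer, which this sketch omits.

```python
import math

def split_into_patches(image, patch):
    """Split a 2D grid (list of rows) into non-overlapping patch x patch tiles.

    Returns flattened tiles in row-major patch order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("grid size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches

def sinusoidal_encoding(num_positions, dim):
    """Assumed encoding: sine on even dimensions, cosine on odd dimensions."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def encode_patches(image, patch):
    """Add a positional code to every flattened patch."""
    patches = split_into_patches(image, patch)
    codes = sinusoidal_encoding(len(patches), patch * patch)
    return [[v + p for v, p in zip(tile, code)]
            for tile, code in zip(patches, codes)]

# Toy 4x4 "interpolated field" split into four 2x2 patches (hypothetical data).
grid = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
tokens = encode_patches(grid, 2)
```

Each element of `tokens` is one flattened patch with its position code added, which is the form in which patches would be passed on to a block for spatial feature extraction.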

Comment 1.3:

It's unclear whether the 29 monitoring stations also collected meteorological variables. Please specify the source of this kind of variable.

Responses:

We thank the reviewers for their careful review and questions. Regarding the data sources of meteorological variables, we have described them in Section 4.1, “Dataset collection and preprocessing” of the original paper. In that section, it is described that air quality data were collected for three years (from January 1, 2020 to December 31, 2022) from 29 monitoring stations in four cities around Dongting Lake (Changsha, Yueyang, Changde, and Yiyang), which recorded local air quality indicators including PM2.5, PM10, SO₂ concentration, NO₂ concentration, O₃ concentration, CO concentration, and AQI index on an hourly basis.

In addition to air pollutant data, these stations also collect data on basic meteorological variables, including temperature, humidity, wind speed and direction. In the beginning of Chapter 3, “Main Methods”, we mentioned that “Initially, based on the location information of monitoring stations, a two-dimensional meteorological data distribution image X_t for each time step t is obtained through kriging interpolation.”, where ‘meteorological data’ refers to the meteorological variables collected from these monitoring stations.

Comment 1.4:

Figure 6 is very important, but it could benefit from including the unit of measurement on the X-axis. Furthermore, it’s mentioned that the results show the PM2.5 indicator distribution predicted by the SwinLSTM model aligns more closely with the actual values, but it’s not clear what the "true value" subplot refers to. Is it the result of kriging, followed by an evaluation of the difference from the model output? It would be helpful to clarify this in the figure caption or description for better understanding. In this sense, Figure 7 is much clearer.

Responses:

Based on the reviewers' suggestions, the authors expanded the caption of Fig. 6 in the revised version. The caption now states that the unit of concentration is μg/m³, explains that the “true value” subplot is a two-dimensional PM2.5 distribution obtained by kriging interpolation of the measurements from the actual monitoring stations, explains that the “predicted value” subplot shows the model's prediction of the future PM2.5 distribution based on historical data, and describes how the error map is calculated and what MAEP means. The revised figure caption is given below:

Figure 6. Comparison of the effects of SwinLSTM and ConvLSTM in the 24-hour PM2.5 concentration prediction task (concentration unit: μg/m³). The left column (“True”) shows the reference 2D PM2.5 spatial distribution obtained through kriging interpolation of actual monitoring station measurements, representing the ground truth. The middle column (“Output”) displays the model-predicted PM2.5 spatial distribution based on historical data. The right column (“MAE”) visualizes the pixel-wise mean absolute error between predicted and true values, with MAEP indicating the percentage of mean absolute error relative to the true concentration range.

These additional notes will help the reader better understand the specific meaning of the subfigures in Figure 6 and their assessment methods, thus improving the clarity of the overall study presentation.

Comment 1.5:

Why is Station 1 used as a benchmark? Is it a particularly relevant station, such as one located in an urban traffic area?

Responses:

Thanks to the reviewers for raising this important issue. The selection of Station 1 as a benchmark does need to be made more explicit.

Station 1 was chosen as a benchmark because of its representative urban characteristics and data integrity. Specifically, Station 1 is located in the central business district of Changsha, which is a typical urban traffic-dense area surrounded by several major arterial roads and large commercial facilities. This site is not only significantly affected by transportation emissions, but also reflects the overall air quality conditions in the built-up area of the city, and is therefore an important reference for public health and policy making.

In addition, Station 1 has the most complete data record (missing rate <0.5%) over the entire study period, which makes time series comparisons more reliable. The site also exhibited significant daily variation and seasonal patterns, making it ideal for evaluating the model's ability to capture changes on different time scales.

Comment 1.6:

Table 1 reports the PM10 index prediction comparison; for consistency with the figures, it would be important to include the PM2.5 table as well.

Responses:

The authors present in Figures 6 and 7 mainly the PM2.5 concentration predictions, while Table 1 provides a comparison of predictions for the PM10 metric. This inconsistency may be confusing to the reader. The reason for this arrangement is that we wish to demonstrate the generality and effectiveness of the model in multiple pollutant prediction tasks through different visualizations.

Comment 1.7:

Conclusion

Responses:

Thanks to the reviewers for their valuable suggestions on the conclusion section. A few points regarding the extension of the conclusions are explained below:

Regarding the consideration of geographic factors in kriging interpolation: Although we have briefly mentioned “consideration of geographic factors such as topography, lakes, and elevation” in the “Future Work” section of the conclusion, it is true that we could have further clarified how these factors can be integrated into the kriging interpolation process. In fact, we plan to use the co-kriging method in our future research, using digital elevation models and land use data as auxiliary variables, to improve the accuracy of spatial estimation, especially in complex terrain areas.
Regarding “hyperlocal” emission sources: We believe that the reviewer's suggestion to include localized emission source data such as traffic flow, industrial emission points and construction activities is very valuable. These high-resolution emission data can indeed serve as important complementary information to compensate for the lack of reliance on monitoring station data alone. In future studies, we plan to integrate these data in two ways: either as additional input features for deep learning models or as covariates for synergistic kriging interpolation.
Regarding the utilization of other pollutants: In this study, we did apply the same methodological framework for gaseous pollutants such as SO₂, NO₂, and O₃, but due to space constraints, we mainly report detailed results for PM₂.₅ and PM₁₀.

Specifically, for the kriging interpolation process, integrating digital elevation models and land use data could significantly improve the spatial estimation accuracy by accounting for the influence of topographical features on pollutant dispersion. Additionally, incorporating "hyperlocal" emission sources such as traffic density data, industrial facility locations, and construction activity information would address a critical limitation of the current approach, which primarily relies on monitoring station data and may miss important localized pollution sources. These high-resolution emission data could be incorporated either as additional input features to the deep learning model or as covariates in an advanced co-kriging interpolation framework.

The updated conclusions are presented below:

This research addresses the crucial environmental monitoring issue of regional air quality spatiotemporal prediction by proposing an innovative deep learning-based method. This method first uses kriging interpolation to extend multi-dimensional meteorological and pollutant indicators recorded at monitoring stations to two-dimensional distribution images of the entire region. It then employs the proposed SwinLSTM model to predict future regional air quality in both spatial and temporal dimensions.

Experimental results from the Dongting Lake area in China demonstrate that the proposed SwinLSTM model not only achieves excellent results in short-term PM2.5 and PM10 prediction tasks, significantly outperforming the current mainstream ConvLSTM model, but also shows more pronounced advantages in medium to long-term prediction tasks. This proves that the proposed model can better capture the spatio-temporal correlations and evolution patterns of regional air quality.

Despite the promising results, several limitations of this study should be acknowledged. Data availability constraints restricted the analysis to a three-year period and specific pollutants; longer time series and additional pollutants would enhance model robustness. The model's performance is dependent on the specific geographical and meteorological conditions of the Dongting Lake region, potentially limiting direct transferability to regions with substantially different characteristics without retraining or adaptation. The computational resources required for the SwinLSTM architecture may present implementation challenges for real-time prediction systems with limited processing capabilities.

Future work will explore more comprehensive and efficient spatiotemporal feature fusion methods, such as considering geographical factors like terrain, lakes, and altitude, to enable the model to better capture geographical conditions affecting air quality dispersion. Cross-attention mechanisms for multimodal fusion will be used to model spatiotemporal features. Additionally, incorporating forecast data from numerical weather models into input features will be attempted to further improve prediction fidelity. Furthermore, empirical studies on meteorological predictions will be conducted over larger regional scales, and predictions for more meteorological indicators will be made. This will provide technical support for formulating more comprehensive environmental control policies, demonstrating important theoretical value and practical significance across a wider range of fields.

In the end, we would like to thank the reviewers for their careful reading of our paper and for their valuable suggestions for revision, which make it possible to present our paper better.

Author: Jiangquan Xie, Fan Liu, Shuai Liu, Xiangtao Jiang

March 10

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

This paper presents a well-structured and insightful study on spatio-temporal air quality prediction, integrating advanced deep learning techniques with kriging interpolation.

The methodology is clearly explained, and the experimental results convincingly demonstrate the effectiveness of the proposed SwinLSTM model. The writing is precise, and the discussion is well-grounded in relevant literature, making the study a valuable contribution to the field.
The balance between technical depth and clarity is commendable, making the paper accessible to both specialists and a broader audience.

That being said, I do have a few minor suggestions for improvement:

1. A stronger introduction would benefit from a short paragraph clearly outlining the research gap. Right now, the paper does a great job summarizing existing methods, but it’s not entirely clear what’s missing from the literature or why a new approach is needed. Adding a brief discussion on what previous studies haven’t addressed and how this work fills that gap would make the contribution stand out more clearly.

2. While the paper provides a thorough technical explanation of kriging interpolation, it would be helpful to include a brief discussion on why this particular method was chosen over other spatial interpolation techniques. A few sentences comparing its advantages (such as accuracy, adaptability, or computational efficiency) to alternative methods would suffice.

3. A brief clarification on whether the mean absolute error in Figure 6 accounts for potential biases in the model would be useful.

4. I think the conclusion would benefit from a short section discussing the study’s limitations. While the results are promising, it would be helpful to acknowledge any constraints (data availability, potential biases, or the model’s applicability to different regions).

Author Response

Responses to Reviewer #2:

Comment 1.0:

This paper presents a well-structured and insightful study on spatio-temporal air quality prediction, integrating advanced deep learning techniques with kriging interpolation. The methodology is clearly explained, and the experimental results convincingly demonstrate the effectiveness of the proposed SwinLSTM model. The writing is precise, and the discussion is well-grounded in relevant literature, making the study a valuable contribution to the field. The balance between technical depth and clarity is commendable, making the paper accessible to both specialists and a broader audience. That being said, I do have a few minor suggestions for improvement.

Responses:

Thanks very much for your encouraging comments.

Comment 1.1:

A stronger introduction would benefit from a short paragraph clearly outlining the research gap. Right now, the paper does a great job summarizing existing methods, but it's not entirely clear what's missing from the literature or why a new approach is needed. Adding a brief discussion on what previous studies haven't addressed and how this work fills that gap would make the contribution stand out more clearly.

Responses:

Thank you to the reviewers for their valuable scientific advice. Based on your suggestions, a new section has been added to the introduction of the revised manuscript that clearly describes the shortcomings of existing studies and how this study fills this research gap. The limitations of existing methods in dealing with spatial heterogeneity and temporal continuity simultaneously are pointed out, as well as the fact that most existing models fail to fully utilize the spatial relationships between monitoring sites. By integrating the local-global processing capability of Swin Transformer and the temporal modeling advantage of LSTM, along with the spatial analysis function of Kriging interpolation, this study proposes a novel approach to address these shortcomings. Modified introduction content:

Despite the progress made by deep learning models in regional air quality prediction, challenges remain in effectively capturing the complex spatiotemporal correlations between meteorological conditions and air pollutant concentrations. A critical gap in existing research is the difficulty in simultaneously addressing spatial heterogeneity and temporal continuity in air quality data. Most current models either focus predominantly on temporal patterns using recurrent architectures or prioritize spatial relationships using convolutional approaches, but rarely integrate both dimensions optimally. Additionally, the majority of existing models fail to fully leverage the spatial relationships between monitoring stations, particularly in regions with sparse monitoring networks. These limitations result in prediction inaccuracies, especially for medium to long-term forecasting and in areas with complex terrain or meteorological conditions.

The spatiotemporal air quality prediction task is akin to video prediction, requiring the extraction of temporal and spatial dynamic patterns from continuous time series data. In air quality prediction, the focus is on the diffusion and evolution trends of pollutant concentrations across spatial and temporal dimensions. In the field of computer vision (CV), the Swin Transformer has recently demonstrated outstanding performance in various vision tasks by effectively integrating local spatial information and global context through its innovative shifted window attention mechanism, which computes self-attention within local windows [18][19]. Inspired by this, this study proposes a novel spatiotemporal air quality prediction method that integrates kriging interpolation with a SwinLSTM model. First, kriging interpolates the collected one-dimensional station data onto a two-dimensional spatial plane. Then, a prediction network integrating Swin-Transformer and LSTM modules is constructed to capture more comprehensive spatiotemporal dependencies in the air quality prediction task.

Comment 1.2:

While the paper provides a thorough technical explanation of kriging interpolation, it would be helpful to include a brief discussion on why this particular method was chosen over other spatial interpolation techniques. A few sentences comparing its advantages (such as accuracy, adaptability, or computational efficiency) to alternative methods would suffice.

Responses:

As suggested by the reviewers, the authors have included in the manuscript a comparison of the specific reasons for choosing kriging interpolation over other spatial interpolation techniques. The modification details the advantages of kriging interpolation over methods such as inverse distance weighting (IDW), spline interpolation, and natural neighbor interpolation, particularly in terms of spatial heterogeneity, prediction error estimation, robustness to outliers, and the ability to account for spatial structure. This additional note helps readers understand the rationale for our method selection and emphasizes the applicability of kriging interpolation in air quality prediction application scenarios. The updated content is as follows:

Due to the often spatially unbalanced distribution of monitoring stations, it is necessary to employ reasonable spatial interpolation techniques to estimate the spatial distribution across the entire study area based on data from known monitoring sites [20]. Among these methods, kriging interpolation has been widely applied in this field owing to its rigorous mathematical foundation and the property of providing the best linear unbiased estimate [21][22]. Unlike simpler techniques such as Inverse Distance Weighting (IDW) that assume values are solely determined by distance, or spline interpolation that focuses on smoothness, kriging was selected for this research due to several key advantages. First, kriging accounts for both the distance and direction of measured points, making it particularly suitable for handling the spatial heterogeneity common in air pollutant distribution. Second, it provides not only predicted values but also error estimates (prediction variance), offering valuable uncertainty quantification not available with deterministic methods. Third, kriging demonstrates superior robustness against outliers compared to polynomial-based methods, an important consideration given the occasional anomalous readings in air quality monitoring data. Finally, kriging's ability to incorporate spatial structure through variogram analysis allows it to adapt to different spatial correlation patterns across various pollutants and geographical regions.
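
The contrast drawn above between kriging and IDW can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the sill, range, and nugget values, the toy station coordinates and readings, and every function name are assumptions for illustration. In practice the variogram parameters would be fitted to the empirical variogram of the monitoring data.

```python
import math

def spherical_variogram(h, sill=1.0, rng=10.0, nugget=0.0):
    """Spherical semivariogram: rises to nugget + sill at distance rng."""
    if h == 0.0:
        return 0.0
    if h >= rng:
        return nugget + sill
    ratio = h / rng
    return nugget + sill * (1.5 * ratio - 0.5 * ratio ** 3)

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def ordinary_kriging(points, values, target, **vario):
    """Return (estimate, kriging variance) at target."""
    n = len(points)

    def gamma(p, q):
        return spherical_variogram(math.dist(p, q), **vario)

    # Kriging system: variogram matrix bordered by the unbiasedness constraint.
    a = [[gamma(points[i], points[j]) for j in range(n)] + [1.0]
         for i in range(n)]
    a.append([1.0] * n + [0.0])
    b = [gamma(p, target) for p in points] + [1.0]
    solution = solve(a, b)
    weights, lagrange = solution[:n], solution[n]
    estimate = sum(w * z for w, z in zip(weights, values))
    variance = sum(w * g for w, g in zip(weights, b[:n])) + lagrange
    return estimate, variance

def idw(points, values, target, power=2.0):
    """Inverse distance weighting: deterministic, with no variance estimate."""
    weights = []
    for p, z in zip(points, values):
        d = math.dist(p, target)
        if d == 0.0:
            return z
        weights.append(1.0 / d ** power)
    return sum(w * z for w, z in zip(weights, values)) / sum(weights)

# Four hypothetical stations at the corners of a square, target at the centre.
stations = [(0.0, 0.0), (0.0, 4.0), (4.0, 0.0), (4.0, 4.0)]
readings = [30.0, 40.0, 50.0, 60.0]
kriged, kriging_variance = ordinary_kriging(stations, readings, (2.0, 2.0))
weighted = idw(stations, readings, (2.0, 2.0))
```

In this symmetric toy layout both methods return the same central estimate, but only the kriging call also returns a variance, which is the uncertainty quantification referred to in the text. At a station location the kriging estimate reproduces the reading and the variance drops to zero.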

Comment 1.3:

A brief clarification on whether the mean absolute error in Figure 6 accounts for potential biases in the model would be useful.

Responses:

Thanks for your careful review. The mean absolute error (MAE) displayed in Figure 6 accounts for potential biases in the model by capturing both overestimation and underestimation equally through absolute difference calculations. Unlike metrics that may allow positive and negative errors to cancel each other out, MAE considers the magnitude of errors regardless of direction, thereby revealing systematic biases in prediction patterns. The MAE percentage (MAEP) further normalizes these values relative to the magnitude of the true concentrations, enabling fair comparisons across different concentration levels and prediction horizons.
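
The point about error cancellation can be made concrete with a small sketch. The numbers and function names are hypothetical, and the MAEP definition used here (MAE as a percentage of the mean true concentration) is an assumption based on the description above, not a formula taken from the manuscript. The signed mean error is included only for comparison.

```python
def mean_absolute_error(pred, true):
    """Average magnitude of the errors, ignoring their sign."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)

def mean_bias_error(pred, true):
    """Average signed error; positive and negative errors can cancel."""
    return sum(p - t for p, t in zip(pred, true)) / len(true)

def mae_percentage(pred, true):
    """Assumed MAEP: MAE as a percentage of the mean true concentration."""
    return 100.0 * mean_absolute_error(pred, true) / (sum(true) / len(true))

# Hypothetical PM2.5 values in ug/m3.
true = [40.0, 50.0, 60.0, 50.0]
mixed = [50.0, 40.0, 70.0, 40.0]   # errors +10, -10, +10, -10
high = [50.0, 60.0, 70.0, 60.0]    # errors all +10

mae_mixed = mean_absolute_error(mixed, true)
mae_high = mean_absolute_error(high, true)
bias_mixed = mean_bias_error(mixed, true)
bias_high = mean_bias_error(high, true)
maep_high = mae_percentage(high, true)
```

For the `mixed` predictions the signed errors cancel to zero while the MAE stays at 10 μg/m³, so the MAE does not hide errors of opposite sign. The MAE is the same for both prediction sets, however, so a signed measure reported next to it is what shows the direction of any consistent over- or underestimation.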

Comment 1.4:

I think the conclusion would benefit from a short section discussing the study's limitations. While the results are promising, it would be helpful to acknowledge any constraints (data availability, potential biases, or the model's applicability to different regions).

Responses: Based on the reviewers' suggestions, the limitations of this study are discussed in detail in the conclusion section. Limitations in data availability, model dependence on specific geographic and meteorological conditions, applicability challenges in areas with different air pollution levels and sparse monitoring networks, and computational resource requirements are clearly identified. The limitations are updated below:

Despite the promising results, several limitations of this study should be acknowledged. Data availability constraints restricted the analysis to a three-year period and specific pollutants; longer time series and additional pollutants would enhance model robustness. The model's performance is dependent on the specific geographical and meteorological conditions of the Dongting Lake region, potentially limiting direct transferability to regions with substantially different characteristics without retraining or adaptation. The computational resources required for the SwinLSTM architecture may present implementation challenges for real-time prediction systems with limited processing capabilities.

In the end, we would like to thank the reviewers for their careful reading of our paper and for their valuable suggestions for revision, which make it possible to present our paper better.

Author: Jiangquan Xie, Fan Liu, Shuai Liu, Xiangtao Jiang

March 10

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

Summary of the Manuscript

This manuscript presents a novel deep-learning-based approach for spatio-temporal air quality prediction. The methodology integrates kriging interpolation for spatial distribution estimation of meteorological and pollutant indicators and a Swin-LSTM model incorporating Swin-Transformer feature extraction for learning correlations from meteorological data and historical air quality records. The model aims to overcome traditional CNN limitations in capturing long-range spatial dependencies. The study uses data from 29 stations around China’s Dongting Lake to predict PM2.5 and PM10 levels for 1, 6, and 24-hour periods. Results indicate that the proposed Swin-LSTM architecture outperforms the ConvLSTM model, with an R-squared improvement of 5% on average.

Major Recommendations for Improvement

Use of Chemical Formula Notation: The manuscript should correctly format chemical formulas using subscripts (e.g., CO₂, SO₂, NO₂) to conform to standard scientific writing conventions.
Avoid Direct Forms: The use of direct personal pronouns such as "we" should be replaced with impersonal or passive voice constructions to maintain a formal and objective tone. For example, instead of "We applied kriging interpolation," consider "Kriging interpolation was applied."
Improve Manuscript Organization: Clearly separate the Results and Discussion sections. The current structure blends these aspects, making it difficult for readers to distinguish between findings and their interpretation.
Strengthen the Discussion: The discussion section lacks depth and should provide a more rigorous comparison with previous studies. Highlight the novelty of the Swin-LSTM model by explicitly comparing performance metrics with existing literature. Consider addressing:
- How does the proposed model improve upon existing methodologies?
- Are there specific conditions where the model performs significantly better or worse?
- What insights can be drawn from the performance variations?
Limitations and Future Work: Explicitly discuss the limitations of the study, including:
- Any assumptions made regarding the input data (e.g., quality, completeness, representativeness).
- Possible biases introduced by using kriging interpolation.
- Computational constraints or scalability concerns.
- Generalizability of the model to different geographic regions.
Enhance Conclusion Clarity: The conclusion should be structured using bullet points to improve readability. It should concisely summarize key findings, limitations, and future directions.
Additional Considerations:
- Model Validation: More details on the model validation process should be provided, including any cross-validation techniques used.
- Uncertainty Quantification: Discuss how prediction uncertainties were addressed or estimated.
- Real-World Application: Provide insights into how the proposed model can be integrated into existing air quality monitoring frameworks or decision-making processes.

Author Response

Responses to the Editor’s and Reviewers’ Comments on the Paper Entitled “An Approach for Spatio-Temporal Air Quality Prediction Integrating Swin-LSTM and Kriging Methods”

(No.: Sustainability - 3510189)

Jiangquan Xie, Fan Liu, Shuai Liu, Xiangtao Jiang

The authors would like to thank the Editor and the anonymous reviewers for their valuable comments and suggestions.

Responses to Reviewer #3:

Comment 2.0:

This manuscript presents a novel deep-learning-based approach for spatio-temporal air quality prediction. The methodology integrates kriging interpolation for spatial distribution estimation of meteorological and pollutant indicators and a Swin-LSTM model incorporating Swin-Transformer feature extraction for learning correlations from meteorological data and historical air quality records. The model aims to overcome traditional CNN limitations in capturing long-range spatial dependencies. The study uses data from 29 stations around China's Dongting Lake to predict PM2.5 and PM10 levels for 1, 6, and 24-hour periods. Results indicate that the proposed Swin-LSTM architecture outperforms the ConvLSTM model, with an R-squared improvement of 5% on average.

Responses:

Thanks for summarizing our work and providing valuable comments.

Comment 2.1:

Use of Chemical Formula Notation: The manuscript should correctly format chemical formulas using subscripts (e.g., CO₂, SO₂, NO₂) to conform to standard scientific writing conventions.

Responses:

Thanks to the reviewers for their valuable comments. The authors have thoroughly checked the full text and normalized all chemical formulas in accordance with scientific writing conventions to ensure that all chemical molecular formulas for carbon dioxide (CO₂), sulphur dioxide (SO₂), nitrogen dioxide (NO₂), and so on, are in the correct subscript format. This modification contributes to the professionalism and readability of the paper and meets the standard requirements for academic publishing.

Comment 2.2:

Avoid Direct Forms: The use of direct personal pronouns such as "we" should be replaced with impersonal or passive voice constructions to maintain a formal and objective tone. For example, instead of "We applied kriging interpolation," consider "Kriging interpolation was applied."

Responses:

Based on the reviewers' suggestions, the authors have revised the full paper by replacing all expressions directly using the first-person pronoun “we” with impersonal or passive voice constructions in order to maintain the formality and objectivity of the paper. For example, “We applied kriging interpolation” in the original text has been changed to “Kriging interpolation was applied”, and similar expressions have been adjusted accordingly throughout the text. These modifications have helped to make the presentation of the research results more formal and objective, which is in line with the standard requirements of academic writing. The specific changes in the text are as follows:

The remainder of this paper is organized as follows: Chapter 2 presents the kriging interpolation algorithm, LSTM structure, and Swin Transformer structure relevant to the current research. In Chapter 3, the proposed SwinLSTM and its application to the air quality prediction task are described in detail. In Chapter 4, prediction experiments with multiple indicators and time horizons are conducted using three years of meteorological data from China's Dongting Lake region to demonstrate the effectiveness of the model. Chapter 5 presents the paper's conclusion and future work.

Comment 2.3:

Improve Manuscript Organization: Clearly separate the Results and Discussion sections. The current structure blends these aspects, making it difficult for readers to distinguish between findings and their interpretation.

Responses:

Based on the reviewers' comments, the structure of the paper has been completely revised to clearly separate the “Results” and “Discussion” sections in order to improve the organization and readability of the paper.

In the revised structure, the “Results” section now focuses on the objective presentation of experimental data and model performance metrics, including quantitative analyses of prediction results on different time scales, comparisons with benchmark models, and model performance in predicting different pollutants. The new separate “Discussion” section focuses on the interpretation and implications of these results, including analysis of the mechanisms behind model performance, comparison of the results with the existing literature, theoretical explanations of the model's strengths, and analysis of application scenarios. This restructuring allows readers to more clearly distinguish between findings and their interpretations, while also conforming to the standard format of scientific papers.

Comment 2.4:

Strengthen the Discussion: The discussion section lacks depth and should provide a more rigorous comparison with previous studies. Highlight the novelty of the Swin-LSTM model by explicitly comparing performance metrics with existing literature. Consider addressing:

How does the proposed model improve upon existing methodologies?

Are there specific conditions where the model performs significantly better or worse?

What insights can be drawn from the performance variations?

Responses:

Thanks to the reviewers for their valuable suggestions. We have fully strengthened and deepened the discussion section to make it more rigorous and in-depth. A detailed comparative analysis with existing studies has been added, with data tables clearly demonstrating how the SwinLSTM model in this paper compares with other mainstream methods in the literature in terms of key performance metrics. This comparison not only highlights the innovation and superiority of the model, but also provides a quantitative basis to illustrate its degree of improvement.

It also details how the SwinLSTM model radically improves existing methods by combining the local-global feature extraction capabilities of Swin Transformer with the temporal modeling advantages of LSTM. For the variation of the model's performance under different conditions, the differences in its performance in short-term (1-6 hours) and medium- to long-term (12-24 hours) forecasts are analyzed, as well as its adaptability to different pollutant concentration levels and complex meteorological conditions. In addition, important insights that can be derived from these performance variations are explored in depth, including the mechanisms by which the model captures spatial heterogeneity and temporal continuity, its ability to predict extreme events, and potential directions for optimization. These enhancements add depth and comprehensiveness to the discussion section, not only illuminating the innovative contributions of this study, but also providing valuable guidance for future research. Additional discussion based on reviewer suggestions:

4.5. Discussion

The superior performance of the SwinLSTM model can be attributed to several key innovations that address limitations in existing methodologies. Unlike traditional LSTM or GRU models that only capture temporal patterns at individual stations, our approach integrates spatial interpolation with advanced spatiotemporal modeling. The performance gap is particularly evident in medium to long-term forecasting, where our model maintains high accuracy while pure time-series approaches deteriorate rapidly. This improvement stems from SwinLSTM's ability to model the complex diffusion patterns of pollutants across space. Second, compared to CNN-based architectures like ConvLSTM, the SwinLSTM model demonstrates superior feature extraction capabilities. The shifted window attention mechanism enables both local detail preservation and global context integration, overcoming the fixed receptive field constraints of conventional convolutions. This is particularly important for capturing distant correlations in air quality patterns that may be influenced by regional meteorological phenomena. Third, the hierarchical feature representation in Swin Transformer components allows the model to detect multiscale spatiotemporal dependencies that single-scale models often miss. This is reflected in the 6.47% improvement in R² value for 24-hour predictions compared to ConvLSTM, highlighting SwinLSTM's enhanced capability to maintain prediction accuracy over longer horizons.

The superior long-term prediction capability of SwinLSTM suggests that attention mechanisms more effectively capture persistent spatiotemporal dependencies compared to convolutional approaches. This has significant implications for air quality management, as more accurate 24-hour forecasts provide authorities with crucial lead time for implementing mitigation measures. The model's ability to capture peak pollution events more accurately than comparison models indicates that the hierarchical representation learning in Swin Transformer components successfully models the complex, non-linear relationships that drive extreme pollution episodes. This capability is particularly valuable for public health applications, as high-concentration episodes pose the greatest health risks.

The observed performance variations also provide guidance for model deployment strategies. For instance, the model could be dynamically configured to emphasize different aspects of its architecture based on prediction horizon, potentially emphasizing the LSTM components for short-term predictions and the Swin Transformer components for longer-term forecasts. These insights not only validate the theoretical advantages of the proposed SwinLSTM architecture but also offer practical guidance for both model refinement and operational implementation in air quality management systems.
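
For readers unfamiliar with the shifted window mechanism referred to in the discussion, the following is a minimal illustrative sketch in plain Python, not the authors' implementation. It assumes a small 4×4 grid of cell indices, a window size of 2, and a cyclic shift of half a window; the function names `cyclic_shift` and `window_partition` are hypothetical. It shows only the partitioning step, not the attention computation.

```python
from typing import List

Grid = List[List[int]]


def cyclic_shift(grid: Grid, shift: int) -> Grid:
    """Roll rows and columns by `shift` cells toward the top-left, wrapping around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)] for r in range(h)]


def window_partition(grid: Grid, window: int) -> List[Grid]:
    """Split a grid into non-overlapping window x window tiles in row-major order."""
    h, w = len(grid), len(grid[0])
    if h % window or w % window:
        raise ValueError("grid size must be a multiple of the window size")
    return [
        [row[c0:c0 + window] for row in grid[r0:r0 + window]]
        for r0 in range(0, h, window)
        for c0 in range(0, w, window)
    ]


# Toy 4x4 grid whose values are simply the cell indices 0..15 (an assumption for illustration).
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
regular = window_partition(grid, window=2)
shifted = window_partition(cyclic_shift(grid, shift=1), window=2)
```

In this toy case the first shifted window contains one cell from each of the four regular windows, which is how alternating regular and shifted partitions lets information cross window boundaries.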

Comment 2.5:

Limitations and Future Work: Explicitly discuss the limitations of the study, including: Any assumptions made regarding the input data (e.g., quality, completeness, representativeness).

Possible biases introduced by using kriging interpolation

Computational constraints or scalability concerns

Generalizability of the model to different geographic regions.

Responses:

Following the reviewers' suggestions, the authors systematically analyze four main areas of limitations. First, assumptions related to the input data are discussed, including issues of monitoring station data quality, completeness, and representativeness, and how these factors may affect the model performance. Second, potential biases that may be introduced by the kriging interpolation method are explored in depth, especially the spatial estimation uncertainty in regions with sparse station distribution. Third, computational constraints and scalability considerations of the SwinLSTM model are analyzed, including resource requirements during training and deployment. Finally, the model's applicability and generalizability limitations in different geographic regions are assessed, especially when applied to regions with significantly different environmental characteristics. The updated conclusions are presented below:

Despite the promising results, several limitations of this study should be acknowledged. Data availability constraints restricted the analysis to a three-year period and specific pollutants; longer time series and additional pollutants would enhance model robustness. The model's performance is dependent on the specific geographical and meteorological conditions of the Dongting Lake region, potentially limiting direct transferability to regions with substantially different characteristics without retraining or adaptation. The computational resources required for the SwinLSTM architecture may present implementation challenges for real-time prediction systems with limited processing capabilities.

Comment 2.6:

Enhance Conclusion Clarity: The conclusion should be structured using bullet points to improve readability. It should concisely summarize key findings, limitations, and future directions.

Responses:

The authors have reorganized the conclusion section to improve readability by using clearer descriptions. The revised conclusion section now contains clearly separated subsections: main study, experimental summary, study limitations, and future research directions.

This structured presentation not only improves the visual clarity of the conclusions, but also enables readers to more effectively grasp the core contributions, existing limitations, and future perspectives of this study. Each point has been carefully distilled to ensure that it is concise and informative. Special care has been taken to retain all key information from the original conclusions while avoiding redundancy.

Comment 2.7:

Additional Considerations:

Model Validation: More details on the model validation process should be provided, including any cross-validation techniques used.

Uncertainty Quantification: Discuss how prediction uncertainties were addressed or estimated.

Real-World Application: Provide insights into how the proposed model can be integrated into existing air quality monitoring frameworks or decision-making processes.

Responses:

Thanks to the reviewers for these important considerations. The authors have supplemented and revised the paper accordingly to the suggestions.

About the model validation process

The authors have added detailed descriptions of the cross-validation techniques, data partitioning strategies, and hyperparameter optimization methods employed. We describe the sliding time-window cross-validation method used, which is better suited to the characteristics of spatio-temporal data and can more accurately assess the predictive performance of the model in practical applications.
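
As an illustration of what a sliding time-window cross-validation split can look like, here is a minimal sketch in plain Python. It is not taken from the manuscript: the function name `sliding_window_splits` and the parameters `train_size`, `test_size`, and `step` are assumptions chosen for the example, and the manuscript's actual window lengths are not stated in this response.

```python
from typing import Iterator, List, Tuple


def sliding_window_splits(
    n_samples: int, train_size: int, test_size: int, step: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs; each test block directly follows its training block in time."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        start += step  # slide the whole window forward in time


# Hypothetical example: 10 time steps, 4 for training, 2 for testing, sliding by 2.
splits = list(sliding_window_splits(n_samples=10, train_size=4, test_size=2, step=2))
```

Because every test index is later than every training index in the same fold, the model is never evaluated on data that precedes what it was trained on, which is the property that makes this scheme preferable to random k-fold splitting for time-ordered data.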

On Uncertainty Quantification

New content has been added to discuss how prediction uncertainty is estimated. It describes how to quantify the uncertainty of prediction results by combining the prediction variance of kriging interpolation with the integration technique of deep learning models, and analyzes how uncertainty varies across time scales and spatial locations. This part provides a more comprehensive assessment of model reliability.
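
One simple way such a combination could be formed is sketched below. This is an assumption-laden illustration, not the authors' method: it assumes that the "integration technique" refers to an ensemble of model runs, that the kriging variance and the ensemble spread are expressed in the same units, and that the two error sources are independent so their variances add. The function name `combine_uncertainty` is hypothetical.

```python
import math
from statistics import mean, pvariance
from typing import List, Tuple


def combine_uncertainty(ensemble_preds: List[float], kriging_variance: float) -> Tuple[float, float]:
    """Return (mean prediction, total standard deviation) for a single grid cell.

    Assumes independent error sources, so the ensemble variance and the
    kriging variance are simply summed (an illustrative simplification).
    """
    model_variance = pvariance(ensemble_preds)  # spread across ensemble members
    total_variance = model_variance + kriging_variance
    return mean(ensemble_preds), math.sqrt(total_variance)


# Hypothetical values: three ensemble predictions for one cell, kriging variance of 4.0.
pred, std = combine_uncertainty([30.0, 32.0, 34.0], kriging_variance=4.0)
```

In practice the interpolation error propagates through the network in a non-linear way, so a plain sum of variances should be read as a rough first approximation only.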

On practical application aspects

We add a new section that explores in detail how the proposed model can be integrated into existing air quality monitoring frameworks and decision-making processes, and discusses the computational resource requirements and feasibility considerations for model deployment.

In the end, we would like to thank the reviewers for their careful reading of our paper and for their valuable suggestions for revision, which make it possible to present our paper better.

Author: Jiangquan Xie, Fan Liu, Shuai Liu, Xiangtao Jiang

March 10

Author Response File: Author Response.pdf
