Vision-Based Prediction of Flashover Using Transformers and Convolutional Long Short-Term Memory Model
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript "Vision-based Prediction of Flashover using Transformers and Convolutional Long Short-term Memory Model" predicts the growth of room fires by analyzing spatiotemporal infrared thermal imaging data obtained from full-scale room fire tests. It aims to provide a vision-based intelligent solution for future fire growth prediction tasks. The manuscript has certain practical and theoretical significance in terms of its topic, but the overall content structure of the paper is somewhat disorganized. The manuscript validates existing SwinLSTM-B and SwinLSTM-D models using self-built datasets, and the innovative thinking behind the proposed ideas is relatively simple.
1. The existing work on flashover growth prediction is not specifically listed. The achievements of previous studies and the problems that still exist are not discussed, nor is it clear which issues this manuscript aims to solve.
2. The logical progression between the subsections in Chapter 2, “Materials and Methods,” is somewhat unclear.
3. In Section 3.2, it states, “Therefore, for simplification, we use SwinLSTM from hereon interchangeably with the SwinLSTM-D,” but no comparative experiments are provided to explain whether alternating between SwinLSTM and SwinLSTM-D truly simplifies the task. Why were SwinLSTM and SwinLSTM-D chosen for interchangeability, rather than alternating between SwinLSTM-B and SwinLSTM-D variants?
4. In Section 3.2, it states, “Table 2 also illustrates the comparison size of different LSTM-based models used in this study. Looking at the table, it is clear that SwinLSTM achieved a better MSE loss value (definition of MSE provided in Equation 3) with the cost of slower performance in comparison to the other two models,” but SwinLSTM is not listed in Table 2. In Chapter 3, the distinction between “SwinLSTM” and its variants “SwinLSTM-B” and “SwinLSTM-D” is not clearly made, leading to unclear expression.
5. There is a typo in the introduction section of the manuscript where “ConvLST” is used incorrectly.
6. All images, tables, and equations in the manuscript should be accompanied by detailed explanations within the same chapter. The paper structure is disorganized.
7. There is no detailed explanation for Table 1. A detailed format check of the manuscript is necessary.
Comments for author File: Comments.pdf
Author Response
1 Answer to Reviewer 1 Comments:
1.1 Summary of the Paper
This paper explores the application of Long Short-Term Memory (LSTM) network models for predicting the growth of room flash fires by training on initial frames of spatiotemporal infrared thermal imaging data to forecast future frames. The study also demonstrates that SwinLSTM, an enhanced LSTM model incorporating transformer components, achieves significantly higher accuracy in predicting room fire flashover compared to ConvLSTM.
1.2 Overall comments
This paper is well-organized and clearly written. The introduction effectively presents the background and motivation for the study and Section two provides a detailed overview of the various network structures used, making it easy for readers to follow and understand the methodology. The paper is technically sound overall; however, I believe the section on numerical experiments requires substantial revision to more effectively support the claims made throughout.
1.3 Concerns
1.3.1 Data preparation
(i) In Line 210, did you use varying image resolutions during training? It seems that in Line 220, only 64*64 resolution images were used. Listing multiple resolutions in Line 210 is confusing if only one size was actually used. Similarly, Line 216 lists different frame lengths, but it's unclear if all were utilized or just one, e.g., 20 in Line 220. In addition, Figure 3 displays 95 seconds of frames, but it appears that only 20 seconds of frames were used, as mentioned in Line 218.
Answer: Thank you so much for your great comment. We resolved the issue of multiple values being listed, and we added further explanation of Figure 3 to clarify this.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(ii) In Line 256, you mention using 300 testing samples, but in Line 220, it states that the testing data constitutes 10% of the total dataset, which would amount to 3388 * 10% (or approximately 339 samples). This discrepancy needs clarification to ensure consistency regarding the number of testing samples used.
Answer: Thank you so much for your great comment. As you correctly noted, we had reported only approximate numbers; we have corrected this for consistency.
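The arithmetic behind this comment can be checked with a short sketch. The helper name `holdout_counts` and the round-to-nearest rule are assumptions made for illustration only; the record does not state how the manuscript actually rounds or partitions its 3388 samples.

```python
def holdout_counts(total, test_fraction):
    """Return (remaining_count, test_count) for a simple holdout split.

    Assumption: the test count is rounded to the nearest whole sample.
    """
    test_count = round(total * test_fraction)
    return total - test_count, test_count


# Figures quoted in the reviewer's comment: 3388 samples, 10% held out.
remaining_n, test_n = holdout_counts(3388, 0.10)
print(remaining_n, test_n)
```

Under these assumptions the held-out set has 339 samples rather than the 300 reported, which is the discrepancy the reviewer points out.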
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.3.2 Results presentations
(i) In Line 245, you mention that the pretrained model "performs more effectively," but I did not see any comparison between using and not using the pretrained model. Including a direct comparison would strengthen this claim. Besides, Table 1 is not referenced in the text.
Answer:
Thank you so much for your great comment. In practice, training without pre-trained weights produced low-quality frames, since the model is large relative to our dataset. The original SwinLSTM article also reported its results after using pre-trained weights. We believe that training the SwinLSTM model from scratch would require a significantly larger amount of data. Therefore, the pre-trained weights trained on the Moving MNIST dataset are mandatory for this application study. The main goal of our work is not to demonstrate the power of ConvLSTM models or to compare them. We selected one of the recent and successful ConvLSTM models, SwinLSTM, and our goal was to show that the ConvLSTM architecture can generate future frames for the sake of flashover prediction in room fire tests. An extensive study is needed to compare ConvLSTM models. To fix this issue, we rewrote the paragraph.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(ii) Some results/discussions appear scattered throughout the paper without clear organization. For instance, Table 2 is referenced both in Line 243 and Line 296. Also, Line 263 asks the reader to compare Figure 5 and Figure A4, but it would be more effective to present them together for easier comparison.
Answer:
Thank you so much for your great comment. We fixed this issue.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(iii) Can Figure 6 (and Figure 7 as well) be split into two separate figures—one for MSE/MAE and another for SSIM? This would make the presentation of results clearer, as it would allow each metric to be examined individually without overloading a single figure with too much information.
It would be clearer if the authors made the titles in Figure 5 explicit, for example by labeling each frame "GT frame: 5sec" rather than "GT_0".
The text in Figure 1 is not clear enough; a higher-resolution figure would help.
Answer: Thank you so much for your great comment. We fixed these issues.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(iv) Some sentences would benefit from additional clarification. For example, in Line 217, when you say "Based on the results of our hyperparameter tuning algorithms," it would be helpful to provide more details about the specific tuning process used and the rationale behind it. It would also be useful to address whether the results might vary with other datasets, as this would help readers understand the generalizability. Also, regarding the caption of Table 2, "on a random selected test IR video data": did you use a single example to present the error, rather than averaging over the entire test set?
Answer: Thank you so much for your great comment. We fixed these issues.
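The distinction raised in this comment, between reporting the error on one selected sample and averaging over the whole test set, can be illustrated with a minimal sketch. The function names and the toy pixel values below are hypothetical and are not taken from the manuscript; frames are flattened to short lists purely for readability.

```python
def mse(pred, target):
    """Mean squared error between two flattened frames."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean absolute error between two flattened frames."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mean_over_test_set(metric, preds, targets):
    """Average a per-sample metric over every sample in the test set."""
    scores = [metric(p, t) for p, t in zip(preds, targets)]
    return sum(scores) / len(scores)


# Toy data (made up): three frames of four pixels each.
targets = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
preds = [[0.0, 0.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5], [0.0, 0.0, 0.0, 1.0]]

single_sample_mse = mse(preds[0], targets[0])           # one selected sample
test_set_mse = mean_over_test_set(mse, preds, targets)  # whole test set
test_set_mae = mean_over_test_set(mae, preds, targets)
print(single_sample_mse, test_set_mse, test_set_mae)
```

In this toy case the selected sample happens to be predicted perfectly, so its error is zero while the test-set average is not, which is why a single-sample figure can misrepresent overall performance.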
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
(v) Could you elaborate on how to validate the claim in the abstract using your numerical results? The improvement shown in table 2 is very marginal.
"Our findings reveal that SwinLSTM, an enhanced version of LSTM combined with Transformers for computer vision purposes, predicts the occurrence of room fire flashover with significantly higher accuracy in comparison to its previous counterparts, such as Native Convolutional LSTM (ConvLSTM)"
Answer: Thank you so much for your great comment. We fixed these issues by reducing the emphasis on the quantitative results. Your comment about marginal improvements is totally valid, and one of our main goals here is to show that ConvLSTM (especially SwinLSTM) can be used for the prediction of next frames in vision-based fire safety applications.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.4 Minor
(i) Line 70: "ConvLST" should be "ConvLSTM"
(ii) The main focus of this paper is on the use of LSTM-based convolution/transformer models. It would be more comprehensive if the author could review more related papers, such as (1) Deep Reinforced Attention Learning for Quality-Aware Visual Recognition, ECCV 2020; (2) DIANet: Dense-and-Implicit Attention Network, AAAI 2022; (3) Self-Attention ConvLSTM for Spatiotemporal Prediction, AAAI 2020.
Answer: Thank you so much for your great comment. We fixed these issues.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Summary of the Paper:
This paper explores the application of Long Short-Term Memory (LSTM) network models for predicting the growth of room flash fires by training on initial frames of spatiotemporal infrared thermal imaging data to forecast future frames. The study also demonstrates that SwinLSTM, an enhanced LSTM model incorporating transformer components, achieves significantly higher accuracy in predicting room fire flashover compared to ConvLSTM.
Overall comments:
This paper is well-organized and clearly written. The introduction effectively presents the background and motivation for the study and Section two provides a detailed overview of the various network structures used, making it easy for readers to follow and understand the methodology. The paper is technically sound overall; however, I believe the section on numerical experiments requires substantial revision to more effectively support the claims made throughout.
Concerns:
(1) Data preparation:
(i) In Line 210, did you use varying image resolutions during training? It seems that in Line 220, only 64*64 resolution images were used. Listing multiple resolutions in Line 210 is confusing if only one size was actually used. Similarly, Line 216 lists different frame lengths, but it's unclear if all were utilized or just one, e.g., 20 in Line 220. In addition, Figure 3 displays 95 seconds of frames, but it appears that only 20 seconds of frames were used, as mentioned in Line 218.
(ii) In Line 256, you mention using 300 testing samples, but in Line 220, it states that the testing data constitutes 10% of the total dataset, which would amount to 3388 * 10% (or approximately 339 samples). This discrepancy needs clarification to ensure consistency regarding the number of testing samples used.
(2) Results presentations
(i) In Line 245, you mention that the pretrained model "performs more effectively," but I did not see any comparison between using and not using the pretrained model. Including a direct comparison would strengthen this claim. Besides, Table 1 is not referenced in the text.
(ii) Some results/discussions appear scattered throughout the paper without clear organization. For instance, Table 2 is referenced both in Line 243 and Line 296. Also, Line 263 asks the reader to compare Figure 5 and Figure A4, but it would be more effective to present them together for easier comparison.
(iii) Can Figure 6 (and Figure 7 as well) be split into two separate figures—one for MSE/MAE and another for SSIM? This would make the presentation of results clearer, as it would allow each metric to be examined individually without overloading a single figure with too much information.
It would be clearer if the authors made the titles in Figure 5 explicit, for example by labeling each frame "GT frame: 5sec" rather than "GT_0".
The text in Figure 1 is not clear enough; a higher-resolution figure would help.
(iv) Some sentences would benefit from additional clarification. For example, in Line 217, when you say "Based on the results of our hyperparameter tuning algorithms," it would be helpful to provide more details about the specific tuning process used and the rationale behind it. It would also be useful to address whether the results might vary with other datasets, as this would help readers understand the generalizability. Also, regarding the caption of Table 2, "on a random selected test IR video data": did you use a single example to present the error, rather than averaging over the entire test set?
(v) Could you elaborate on how to validate the claim in the abstract using your numerical results? The improvement shown in table 2 is very marginal.
"Our findings reveal that SwinLSTM, an enhanced version of LSTM combined with Transformers for computer vision purposes, predicts the occurrence of room fire flashover with significantly higher accuracy in comparison to its previous counterparts, such as Native Convolutional LSTM (ConvLSTM)"
Minor:
(i) Line 70: "ConvLST" should be "ConvLSTM"
(ii) The main focus of this paper is on the use of LSTM-based convolution/transformer models. It would be more comprehensive if the author could review more related papers, such as (1) Deep Reinforced Attention Learning for Quality-Aware Visual Recognition, ECCV 2020; (2) DIANet: Dense-and-Implicit Attention Network, AAAI 2022; (3) Self-Attention ConvLSTM for Spatiotemporal Prediction, AAAI 2020.
Author Response
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 Answer to Reviewer 2 Comments:
The manuscript "Vision-based Prediction of Flashover using Transformers and Convolutional Long Short-term Memory Model" predicts the growth of room fires by analyzing spatiotemporal infrared thermal imaging data obtained from full-scale room fire tests. It aims to provide a vision-based intelligent solution for future fire growth prediction tasks. The manuscript has certain practical and theoretical significance in terms of its topic, but the overall content structure of the paper is somewhat disorganized. The manuscript validates existing SwinLSTM-B and SwinLSTM-D models using self-built datasets, and the innovative thinking behind the proposed ideas is relatively simple.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.1 Comment 1
The existing work on flashover growth prediction is not specifically listed. The achievements of previous studies and the problems that still exist are not discussed, nor is it clear which issues this manuscript aims to solve.
Answer: Thank you so much for your great comment. We fixed this issue by adding recent review works about challenges in the literature on flashover prediction.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.2 Comment 2
The logical progression between the subsections in Chapter 2, “Materials and Methods,” is somewhat unclear.
Answer: Thank you so much for your great comment. We fixed this issue.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.3 Comment 3
In Section 3.2, it states, “Therefore, for simplification, we use SwinLSTM from hereon interchangeably with the SwinLSTM-D,” but no comparative experiments are provided to explain whether alternating between SwinLSTM and SwinLSTM-D truly simplifies the task. Why were SwinLSTM and SwinLSTM-D chosen for interchangeability, rather than alternating between SwinLSTM-B and SwinLSTM-D variants?
Answer: Thank you so much for your great comment. We fixed this issue by adding more details about SwinLSTM-B and SwinLSTM-D. SwinLSTM-B has one ConvLSTM cell, whereas SwinLSTM-D has a network structure containing many ConvLSTM cells. For this reason, the performance of SwinLSTM-D would be much higher, owing to its larger and more extensive network, than that of SwinLSTM-B, which has only one cell. Our goal was not to compare SwinLSTM-B and SwinLSTM-D; we simply followed the original study, which introduced these two models consecutively for an easier understanding of the SwinLSTM architecture.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.4 Comment 4
In Section 3.2, it states, “Table 2 also illustrates the comparison size of different LSTM-based models used in this study. Looking at the table, it is clear that SwinLSTM achieved a better MSE loss value (definition of MSE provided in Equation 3) with the cost of slower performance in comparison to the other two models,” but SwinLSTM is not listed in Table 2. In Chapter 3, the distinction between “SwinLSTM” and its variants “SwinLSTM-B” and “SwinLSTM-D” is not clearly made, leading to unclear expression.
Answer: Thank you so much for your great comment. We fixed this issue by adding details.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.5 Comment 5
There is a typo in the introduction section of the manuscript where “ConvLST” is used incorrectly.
Answer: Thank you so much for your great comment. We fixed this issue.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.6 Comment 6
All images, tables, and equations in the manuscript should be accompanied by detailed explanations within the same chapter. The paper structure is disorganized.
Answer: Thank you so much for your great comment. We fixed this issue by adding explanations and moving tables, images, and equations to the places where they are discussed.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1.7 Comment 7
There is no detailed explanation for Table 1. A detailed format check of the manuscript is necessary.
Answer: Thank you so much for your great comment. We fixed this issue.
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
- Line 253 indicates that for simplification, "SwinLSTM" and "SwinLSTM-D" will be used interchangeably. However, Table 1 uses "SwinLSTM-B." To enhance the consistency and readability of the paper, it is recommended to modify the expression in Table 1 accordingly.
- Place the figures and their corresponding descriptions in the same section to improve the coherence of the paper. Please check and correct the correspondence between all figures and their descriptions in the manuscript.
- There are errors in the citation of references in the manuscript. Please review and correct them carefully.
Author Response
Comment 1: Line 253 indicates that for simplification, "SwinLSTM" and "SwinLSTM-D" will be used interchangeably. However, Table 1 uses "SwinLSTM-B." To enhance the consistency and readability of the paper, it is recommended to modify the expression in Table 1 accordingly.
Thank you so much for your comment. We fixed this issue.
Comment 2: Place the figures and their corresponding descriptions in the same section to improve the coherence of the paper. Please check and correct the correspondence between all figures and their descriptions in the manuscript.
Thank you so much for your comment. This is mainly a result of the LaTeX template optimizing placement to save space. We changed this to address your comment, but it comes at the cost of less efficient use of space in the manuscript.
Comment 3: There are errors in the citation of references in the manuscript. Please review and correct them carefully.
We reviewed all references and fixed the issues we found. All references were either generated automatically from the DOIs available in the original works or entered manually. We expect that the references will be reviewed again during the publication stage.
Reviewer 2 Report
Comments and Suggestions for Authors
The author has addressed my concerns effectively. Thank you for the clarifications and revisions.
Author Response
Thank you so much.