Advancing Early Wildfire Detection: Integration of Vision Language Models with Unmanned Aerial Vehicle Remote Sensing for Enhanced Situational Awareness
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript titled "Advancing Early Wildfire Detection: Integration of Vision Language Models with UAV Remote Sensing for Enhanced Situational Awareness" seeks to establish the use of drones for identifying wildfire events quickly and effectively using vision language models. The authors found that the vision language models were able to describe information about the forest itself, along with fire state and type, fairly effectively, reaching accuracy over 75%. I believe the science and results are worth publishing, but there are a few comments and suggested changes to address first. I appreciate the many tests and results presented. The fact that you compared your results to many other studies was wonderful. Thank you for allowing me to read such an interesting article.
- The introduction is far too short. Perhaps the beginning of section 2 should become the introduction. The introduction should establish the basics and background of the study, including the why. There should certainly be more cited research to back up what you are attempting to do, why it is unique, and how it is scientifically important. You address these things across sections 1 and 2 to some degree, but it should coalesce into something together.
- There is no statement of research question or research objectives. This should be clear for the reader in section 1.
- Figures are fantastic!
- You discuss the results a good bit in section 3, which should take place in the discussion. I suggest finding a better balance with that.
Otherwise, great job.
Author Response
Thank you for taking the time to review our paper! We appreciate your comments and suggestions, which we believe will significantly enhance the quality and impact of our work. Below, we address each of your points in detail.
Comment 1: The introduction is far too short. Perhaps the beginning of section 2 should become the introduction. The introduction should establish the basics and background of the study, including the why. There should certainly be more cited research to back up what you are attempting to do, why it is unique, and how it is scientifically important. You address these things across sections 1 and 2 to some degree, but it should coalesce into something together.
Response 1:
Thank you for the feedback. We have added a "Related Work" section with more cited research. You can find this starting on page 4, line 97.
The first two chapters have also been restructured according to your suggestion, to separate the background and introduction more clearly from the methodology used.
Comment 2: There is no statement of research question or research objectives. This should be clear for the reader in section 1.
Response 2:
Thank you for this suggestion. We have now included the specific research objectives in Section 1 (page 2, line 40).
Comment 4: You discuss the results a good bit in section 3, which should take place in the discussion. I suggest finding a better balance with that.
Response 4: We restructured the Results and Discussion sections to address this.
Reviewer 2 Report
Comments and Suggestions for Authors
The paper features early wildfire detection and characterization using fine-tuned open Vision Language Models (VLMs) trained on custom image datasets collected by camera-carrying UAVs. Overall, the paper is well presented and its methodology is sound. It also demonstrates that the implemented VLM performs better than the state-of-the-art closed-weight VLMs. The results of the paper can be further enhanced if the model is tested with a wider dataset, one that also contains vital information about the fire and not just the smoke detection. Also, it would be good to see these models performing on an onboard computer of a drone, as pointed out by the authors themselves.
Author Response
Thank you for dedicating your time to review our paper! Your comments and suggestions are highly appreciated.
Comment 1: The results of the paper can be further enhanced if the model is tested with a wider dataset, one that also contains vital information about the fire and not just the smoke detection.
Response 1: We did evaluate with our own dataset that contains such information. We added a table on page 8 (Table 2) to compare the datasets and environments used in this work.
Comment 2: Also, it would be good to see these models performing on an onboard computer of a drone, as pointed out by the authors themselves.
Response 2:
Thank you for pointing this out. We added a comparison of inference times on different hardware, including the Jetson Orin NX used as the UAV onboard computer.
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes a forest fire early detection and situational awareness system based on the integration of Vision-Language Models (VLMs) with UAV remote sensing. The authors not only built a complete UAV perception and communication platform, but also designed a structured fire description system, and conducted a systematic evaluation and fine-tuning of several mainstream VLM models. Experimental results show that their fine-tuned model (ForestFireVLM-7B) significantly outperforms both untuned models and closed-source large models in multiple dimensions. I believe this work is a valuable attempt to apply multimodal AI in emergency response, demonstrating a certain degree of innovation, completeness, and engineering practicality.
(1) The introduction lacks sufficient discussion; the background, significance, current status, and existing challenges are not clearly articulated.
(2) It is recommended to add a "Related Work" section, introducing recent deep learning methods for disaster scenarios and the latest large language model applications, such as: Landslide extraction from aerial imagery considering context association characteristics; Localization, balance and affinity: A stronger multifaceted collaborative salient object detector in remote sensing images; A cross-view intelligent person search method based on multi-feature constraints; Building height extraction from high-resolution single-view remote sensing images using shadow and side information; BBEMRF-Net: boundary enhancement and multiscale refinement fusion for building extraction from remote sensing imagery.
(3) The current model performs recognition and inference only based on static images, lacking the utilization of temporal features such as fire progression or smoke spread.
(4) All image annotations were manually labeled by the authors, who are not fire service professionals, which may introduce subjective bias. The logic and scientific validity of dataset construction should be clarified.
(5) A comprehensive logic flowchart or technical pipeline diagram of the proposed method should be included to enhance clarity.
(6) The application scenarios of the method should be further expanded by selecting different environments to verify the generalization capability of the approach.
Author Response
We are grateful for the time you dedicated to reviewing our manuscript. Your comments and suggestions are highly appreciated, and we have responded to each of your points in detail below.
Comment 1: The introduction lacks sufficient discussion; the background, significance, current status, and existing challenges are not clearly articulated.
Response 1:
We have restructured the first two paragraphs to separate the background, the current status, and the methodology used more clearly.
Comment 2: It is recommended to add a "Related Work" section, introducing recent deep learning methods for disaster scenarios and the latest large language model applications, such as: Landslide extraction from aerial imagery considering context association characteristics; Localization, balance and affinity: A stronger multifaceted collaborative salient object detector in remote sensing images; A cross-view intelligent person search method based on multi-feature constraints; Building height extraction from high-resolution single-view remote sensing images using shadow and side information; BBEMRF-Net: boundary enhancement and multiscale refinement fusion for building extraction from remote sensing imagery.
Response 2:
We have now included a "Related Work" section in our paper, starting on page 4, line 97. Thank you for the suggested papers. Although they cover very interesting topics, we do not think they are relevant to this paper in particular. We will include them in future work more closely aligned with the topics discussed there.
Comment 3: The current model performs recognition and inference only based on static images, lacking the utilization of temporal features such as fire progression or smoke spread.
Response 3: This is true; we added further clarification in the "Limitations and future work" section.
Comment 4: All image annotations were manually labeled by the authors, who are not fire service professionals, which may introduce subjective bias. The logic and scientific validity of dataset construction should be clarified.
Response 4: The same applies to this comment; we also added further clarification in the "Limitations and future work" section.
Comment 5: A comprehensive logic flowchart or technical pipeline diagram of the proposed method should be included to enhance clarity.
Response 5: We added a flowchart for the full application on page 17 (Figure 8).
Comment 6: The application scenarios of the method should be further expanded by selecting different environments to verify the generalization capability of the approach.
Response 6: We did evaluate with our own dataset, which contains diverse environments. To make this clearer, we added a table on page 8 (Table 2) comparing the datasets and environments used in this work.