Peer-Review Record

Learning Spatio-Temporal Attention Based Siamese Network for Tracking UAVs in the Wild

Remote Sens. 2022, 14(8), 1797; https://doi.org/10.3390/rs14081797
by Junjie Chen 1,†, Bo Huang 1,†, Jianan Li 1, Ying Wang 1, Moxuan Ren 1 and Tingfa Xu 1,2,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 24 January 2022 / Revised: 22 March 2022 / Accepted: 29 March 2022 / Published: 8 April 2022
(This article belongs to the Topic Big Data and Artificial Intelligence)

Round 1

Reviewer 1 Report

This paper presents a learning spatio-temporal attention based Siamese network for tracking UAVs in the wild.

Revise the English grammar. In the text, avoid using the first person "we".

For a better presentation of the experiments, if possible, the authors could generate videos of the results shown in Figure 7 and share them on YouTube or another similar platform.

Author Response

Comments:

This paper presents a learning spatio-temporal attention based Siamese network for tracking UAVs in the wild.

Revise the English grammar. In the text, avoid using the first person "we".

Response: Thanks for your comments. We have carefully revised the English grammar and reduced the use of “we”.

For a better presentation of the experiments, if possible, the authors could generate videos of the results shown in Figure 7 and share them on YouTube or another similar platform.

Response: Thanks for your comments. We have generated a video of the comparison results shown in Figure 7 (Figure 6 in the revised version) and posted it at https://youtu.be/_l4hP1ZWG3w (the link has been added to the caption of Figure 6 in the revised version). In the video, the red rectangle is the prediction of SiamSTA, the green rectangle is the manually labeled ground-truth UAV state, the yellow rectangle is the prediction of the DiMP algorithm [1], the pink rectangle is the prediction of the baseline algorithm SiamRCNN [2], and the blue rectangle is the prediction of SiamRPN++ [3].
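For readers who wish to reproduce such comparison overlays, the sketch below shows one way to draw the color-coded rectangles with OpenCV. It is a minimal illustration, not the authors' code; the COLORS mapping and the draw_boxes helper are hypothetical names introduced here.

```python
import cv2

# Hypothetical helper mirroring the video's color convention (BGR order).
COLORS = {
    "SiamSTA":   (0, 0, 255),     # red: SiamSTA prediction
    "GT":        (0, 255, 0),     # green: manually labeled ground truth
    "DiMP":      (0, 255, 255),   # yellow: DiMP prediction
    "SiamRCNN":  (203, 192, 255), # pink: baseline SiamRCNN prediction
    "SiamRPN++": (255, 0, 0),     # blue: SiamRPN++ prediction
}

def draw_boxes(frame, boxes):
    """Overlay per-tracker boxes on a frame.

    boxes: dict mapping a tracker name in COLORS to (x, y, w, h) in pixels.
    """
    for name, (x, y, w, h) in boxes.items():
        cv2.rectangle(frame, (x, y), (x + w, y + h), COLORS[name], 2)
    return frame
```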

[1] Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning Discriminative Model Prediction for Tracking. ICCV, 2019.

[2] Voigtlaender, P.; Luiten, J.; Torr, P.H.; Leibe, B. Siam R-CNN: Visual Tracking by Re-Detection. CVPR, 2020.

[3] Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks. CVPR, 2019.

Reviewer 2 Report

This paper proposes a spatio-temporal attention based Siamese method called SiamSTA, which alternates between local search and wide-range re-detection to robustly track drones in the wild. A two-stage re-detection network is used in the proposed method to predict the target state from the template of the first frame and the prediction results of previous frames. Furthermore, in case the target is lost from local regions due to fast movement, a third-stage re-detection module is also proposed. Experiments on three anti-UAV datasets verify the effectiveness of the proposed method.

The paper is generally well-written and the reviewer has almost no comments. However, the reviewer hopes that the authors can discuss the performance of the proposed algorithm in the following conditions as well.

+ Weather conditions like rainy and/or cloudy days, at daytime and nighttime.

+ Detection range (distance from UAVs)

+ The existences of multiple UAVs

Author Response

Comments:

This paper proposes a spatio-temporal attention based Siamese method called SiamSTA, which alternates between local search and wide-range re-detection to robustly track drones in the wild. A two-stage re-detection network is used in the proposed method to predict the target state from the template of the first frame and the prediction results of previous frames. Furthermore, in case the target is lost from local regions due to fast movement, a third-stage re-detection module is also proposed. Experiments on three anti-UAV datasets verify the effectiveness of the proposed method. The paper is generally well-written and the reviewer has almost no comments.

Response: Thanks for your comments. We are extremely grateful for your approval and support of our work.

However, the reviewer hopes that the authors can discuss the performance of the proposed algorithm in the following conditions as well.

+ Weather conditions like rainy and/or cloudy days, at daytime and nighttime.

Response: Thanks for your comments. Weather conditions have a significant impact on tracking performance. Our evaluation datasets include tracking scenarios at daytime and nighttime, as well as on cloudy and foggy days. Since these datasets do not assign a weather-condition challenge attribute to each tracking video, we cannot quantitatively evaluate the performance of our algorithm under each condition, but we provide visualizations of the tracking scenarios in the figure below (pictures from the Anti-UAV [1] dataset; the left side shows the first frame of the video sequence in the visible-light scene, and the right side shows the infrared image captured at the same moment from the same shooting angle). As can be seen, infrared imaging retains good image quality in foggy and cloudy weather and at nighttime, making it more suitable for anti-UAV tasks than visible-light imaging. SiamSTA performs well in various tracking scenarios such as nighttime (first row), daytime (second row), fog (third row), and clouds (last row).

We have added this figure (Figure 8) to the new version of the paper and analyze the performance of the algorithm under different weather conditions in Section 4.5.

The figure can be found in reviewer2.pdf.

 

+ Detection range (distance from UAVs)

Response: Thanks for your comments. In the three evaluated datasets, the detection range varies from 0.1 km to 2.5 km. Our algorithm captures UAVs well at both short range (large target size) and long range (small target size, easily overwhelmed by cluttered background). We have added this information to Section 4.1.3 in the revised version of the paper.

+ The existences of multiple UAVs

Response: Thanks for your comments. In the three evaluated datasets, there is only one UAV target per video, so our tracker focuses on single-object tracking challenges. In future work, we will explore applying our algorithm to multi-UAV tracking tasks.

[1] Jiang, N.; Wang, K.; Peng, X.; Yu, X.; Wang, Q.; Xing, J.; Li, G.; Guo, G.; Zhao, J.; Han, Z. Anti-UAV: A Large Multi-Modal Benchmark for UAV Tracking. arXiv preprint arXiv:2101.08466, 2021.

 


Reviewer 3 Report

The paper is interesting and well structured.

I have a few doubts:

  1. There is a figure at the beginning of the introduction. I am not sure this is appropriate; usually there are no figures in the introduction or related works sections.
  2. What is the robustness of this method?
  3. What specific conditions must the images and objects (UAVs) fulfill for this method? I mean picture size, UAV platform size, etc.

Author Response

Comments:

The paper is interesting and well structured.

Response: Thanks for your comments. We deeply appreciate your recognition of our paper and your valuable comments.

I have a few doubts:

There is a figure at the beginning of the introduction. I am not sure this is appropriate; usually there are no figures in the introduction or related works sections.

Response: Thanks for your comments. In the original manuscript, we placed Figure 1 in the introduction to highlight the tracking effect of our algorithm. Based on your valuable comment, we have removed Figure 1 in the revised version.

What is the robustness of this method?

Response: Thanks for your comments. The robustness of the method refers to the ability of the algorithm to accurately predict both the location and the size of the target when dealing with tracking challenges such as occlusion, scale variation, thermal crossover, and target loss.
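For readers unfamiliar with how such accuracy is typically quantified, the sketch below computes the intersection-over-union (IoU) between a predicted box and the ground truth, the standard overlap measure underlying tracking benchmarks. It is an illustrative example, not the exact evaluation protocol of the datasets used in the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes in pixels.

    A tracker that predicts both location and size accurately yields a
    high IoU against the ground-truth box; IoU drops toward 0 when the
    target is lost. (Illustrative sketch, not the benchmark's exact code.)
    """
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0
```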

What specific conditions must the images and objects (UAVs) fulfill for this method? I mean picture size, UAV platform size, etc.

Response: Thanks for your comments. Our method can predict the target state continuously in subsequent frames as long as the position and size of the target are given in the first frame of the video sequence. In addition, the algorithm places no requirements on picture size or UAV platform size, because it preprocesses the search image and normalizes it to a fixed size, which ensures the generality of the algorithm.
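The normalization step described above is the standard crop-and-resize preprocessing used by Siamese trackers. The sketch below illustrates the general idea; the function name, the context margin of 2.0, and the 255-pixel output size are assumed conventions for this example, not the paper's exact settings.

```python
import cv2

def normalize_search_region(image, box, context=2.0, out_size=255):
    """Crop a square search region around the last known target box and
    resize it to a fixed resolution, so the network sees a constant input
    size regardless of image or UAV size. Illustrative sketch only.

    image: H x W x 3 frame; box: (cx, cy, w, h) in pixels.
    """
    cx, cy, w, h = box
    size = int(max(w, h) * context)           # square crop with context margin
    x0, y0 = int(cx - size / 2), int(cy - size / 2)
    # Pad with the mean color so crops near the image border stay valid.
    pad = max(0, -x0, -y0,
              x0 + size - image.shape[1], y0 + size - image.shape[0])
    if pad > 0:
        mean = image.mean(axis=(0, 1)).tolist()
        image = cv2.copyMakeBorder(image, pad, pad, pad, pad,
                                   cv2.BORDER_CONSTANT, value=mean)
        x0, y0 = x0 + pad, y0 + pad
    crop = image[y0:y0 + size, x0:x0 + size]
    return cv2.resize(crop, (out_size, out_size))
```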

In the evaluation datasets, the picture sizes include 640×512, 640×480 and 1280×720 (in pixels). The UAV size in the image ranges from 5×5 to 60×90 pixels, covering a variety of UAV platforms such as the DJI Phantom 4 (196×289.5×289.5 mm), DJI Mavic Air (168×184×64 mm), DJI Spark (143×143×55 mm), DJI Mavic Pro (322×242×84 mm), DJI Inspire (438×451×301 mm) and Parrot. We have added this information to Section 4.1.3 in the revised version.

Reviewer 4 Report

Very good study... thank you.

Just one very minor point: some of the acronyms were used before their meanings were given, e.g., TIR, RPN.

Author Response

Comments:

Very good study... thank you.

Response: Thank you for your kind recognition of our work.

Just one very minor point: some of the acronyms were used before their meanings were given, e.g., TIR, RPN.

Response: Thanks for your comments. We have provided additional explanations of the acronyms in the revised version, e.g., Thermal Infrared (TIR) and Region Proposal Network (RPN).
