Peer-Review Record

Long-Tailed Object Detection for Multimodal Remote Sensing Images

Remote Sens. 2023, 15(18), 4539; https://doi.org/10.3390/rs15184539
by Jiaxin Yang, Miaomiao Yu, Shuohao Li, Jun Zhang * and Shengze Hu
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 12 August 2023 / Revised: 7 September 2023 / Accepted: 12 September 2023 / Published: 15 September 2023

Round 1

Reviewer 1 Report

The paper deals with object detection in multimodal remote sensing images. It aims to improve the performance of object detectors (YOLOv8s in this work) in the presence of a long-tailed class distribution by defining (a) a Dynamic Feature Fusion Module, (b) an Instance Balanced Mosaic augmentation, and (c) a Class Balanced BCE Loss.

The overall impression is that the paper is clear, easy to understand, and well motivated. The novelty is average, in the sense that it combines existing techniques rather than introducing new ones, but the overall framework represents added value to the literature in this field. The results are convincing.

My only concern is with the Dynamic Feature Fusion Module: how is this module affected by geometric misalignments between the multimodal images? Robustness tests would have been a nice addition. I kindly invite the authors to at least discuss this point more thoroughly in the paper, or in the conclusions as future work.

Minor points:

- In Equation (22), please remove the comma; it looks like a subscript and might mislead the reader.

- The confusion matrices in Figures 7 and 8 should use a larger font. In their current form, the class names and labels are hard to read.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

1. Line 242: why was the Sobel operator chosen to process the image? The authors should conduct experiments to support this choice. (A generic Sobel sketch is given after this list for reference.)

2. Tables 2, 3, and 4: why did the authors choose different models to evaluate the proposed method on the different datasets? Moreover, why do Tables 3 and 4 report mAP50:95, while Table 2 only reports mAP50?

3. The authors should compare different data augmentation methods.

4. The authors should show some examples of recognition errors and analyze their causes.
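For reference on comment 1: this record does not reproduce the manuscript's actual Sobel usage, so the following is only a minimal, generic sketch of Sobel gradient-magnitude extraction, assuming a single-channel image and NumPy/SciPy; the function and variable names are illustrative, not the authors' code.

```python
import numpy as np
from scipy.ndimage import convolve

def sobel_gradient_magnitude(gray: np.ndarray) -> np.ndarray:
    """Generic Sobel edge map; illustrative only, not the paper's code."""
    kx = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=np.float32)  # horizontal-gradient kernel
    ky = kx.T                                      # vertical-gradient kernel
    img = gray.astype(np.float32)
    gx = convolve(img, kx)                         # responds to vertical edges
    gy = convolve(img, ky)                         # responds to horizontal edges
    return np.hypot(gx, gy)                        # per-pixel gradient magnitude
```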

Minor editing of English language required

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

The paper addresses the poor performance of long-tailed object detection and single-modal detection in remote sensing imagery, and introduces a multi-modal long-tailed object detection method that combines visible-light and infrared remote sensing images. The method comprises a Dynamic Feature Fusion Module, Instance Balanced Mosaic data augmentation, and a Class Balanced BCE Loss; through the collaboration of these three modules, detection on multi-modal long-tailed datasets is achieved. Experimental results on three public benchmark datasets demonstrate that the proposed method achieves certain performance improvements. However, I have some questions and concerns, listed below.

1. Is the "stitching point" in line 304 the same as the "stitching point PK" in line 300? Please explain the "stitching point" in line 304.

2. Line 205 mentions the impact of "crop" on remote sensing image object detection; please explain whether IBM implements measures to address the issues caused by "crop".

3. In line 341, it may be necessary to clarify that γ is a hyperparameter. (A generic sketch of a class-balanced BCE loss with such a γ is given after this list.)

4. The caption of Figure 6 should explain why the LLVIP dataset is not included in the statistics, to help readers understand.

5. The "baseline method" in Figure 8 should be specified with the specific model name for better comparison with the performance of the proposed model.

6. The explanation of the subpar mAP50:95 performance of the IBM and CBB modules (lines 492 to 444) may not be sufficient.

7. After adding CBB in Table 5, the proposed model shows a significant improvement in mAP50 but only a slight increase of 0.1 in mAP50:95. This raises the concern that CBB performs poorly, or even has a negative effect, on stricter evaluation metrics such as AP75.

8. In line 501, "Figure8" may be incorrect and should possibly be changed to "Figure9".

9. In Table 5, it may be appropriate to bold the value "67.6" in the mAP50:95 column.

10. In Figure 3, the "RGB" illustration uses red, green, and yellow squares, which may be easier to understand with red, green, and blue. The same issue exists in Figure 4.

11. Please explain why more advanced remote sensing object detection methods from recent years are not included in the single-modal part of the experimental comparison.
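For context on comments 3 and 7: the record does not give the paper's actual Class Balanced BCE formula, so the sketch below only illustrates one common class-balanced weighting scheme, inverse class frequency raised to a hyperparameter γ, in PyTorch; the weighting rule and all names here are assumptions, not the authors' definition.

```python
import torch
import torch.nn.functional as F

def class_balanced_bce(logits: torch.Tensor,
                       targets: torch.Tensor,
                       class_counts: torch.Tensor,
                       gamma: float = 0.5) -> torch.Tensor:
    """Generic class-balanced BCE; w_c proportional to (1/n_c)^gamma is an
    illustrative assumption, not the paper's formula."""
    # Rarer classes (small n_c) receive larger weights.
    weights = class_counts.float().pow(-gamma)
    weights = weights / weights.mean()  # normalize to keep the loss scale stable
    # logits, targets: (batch, num_classes); weights broadcast over the batch.
    return F.binary_cross_entropy_with_logits(
        logits, targets, weight=weights.unsqueeze(0))
```

In this sketch, a larger γ up-weights tail classes more aggressively, and γ = 0 recovers plain BCE.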


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

I have no questions.

 Minor editing of English language required

Reviewer 3 Report

Thank you for the careful revision and hard work. The manuscript is qualified for publication.
