Review Reports - Cross-Modality Data Augmentation for Aerial Object Detection with Representation Learning

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper presents a novel approach to improving RGB-T (visible-infrared) object detection through cross-modality data augmentation and representation learning. RGB-T object detection leverages both visible and infrared images for better detection performance under varying lighting conditions. The key contributions of the paper are:

1. Cross-Modality Data Augmentation: Based on masked image modeling, the paper proposes a feature-space data augmentation method that reconstructs images by integrating visible and infrared modalities, thereby enhancing dataset diversity and model generalization.

2. Full-Scale Mosaic Data Augmentation: The paper optimizes the Mosaic data augmentation method, which is widely used in object detection tasks, by incorporating a multi-scale training strategy. This optimization is particularly effective in aerial imagery and accelerates network convergence.

3. Complementarity between Data-Space and Feature-Space Augmentation: The paper explores the complementarity between traditional data-space augmentation methods (like Mosaic) and the proposed feature-space augmentation methods. It aims to combine the strengths of both approaches to achieve superior performance.

Experimental results validate the effectiveness of the proposed methods, showing significant improvements in RGB-T object detection, especially in scenarios with limited data. However, the following concerns need to be addressed before the paper can be accepted for publication:

1. All comparison methods in the experiments perform data augmentation in the data space. It is recommended to include methods that perform data augmentation in the feature space for a more comprehensive comparison. Additionally, the methods selected for comparison do not seem to include advances from the past two years.

2. While the paper uses unimodal data augmentation methods like MixUp as baselines, the multimodal DroneVehicle dataset comprises paired infrared and visible images. A more detailed explanation is needed of how these unimodal techniques are adapted to the multimodal setting. Specifically, is it feasible to directly superimpose infrared and visible images for augmentation purposes?

3. RGB-T object detection algorithms often assume precise alignment between infrared and visible image pairs. Given the slight misalignments present in the DroneVehicle dataset, the paper should elaborate on the strategies employed to mitigate this issue when using MMRotate, a single-modality detector. Furthermore, a more in-depth discussion is needed of how MMRotate was adapted to the multimodal detection task and which specific modifications were made to accommodate the characteristics of multimodal data.

4. Are there other data augmentation methods based on MAE? If so, they should be included in the related work section with a clear comparison to the proposed method.

5. The comparative experiments in the paper are somewhat limited. It is recommended to conduct more extensive experiments on a wider range of datasets and with multiple evaluation metrics to demonstrate the advantages of the proposed method.
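To make concern 2 concrete, the sketch below shows one way a unimodal technique such as MixUp could be carried over to paired data: two different scenes are blended within each modality using a single shared coefficient, so the visible and infrared images stay paired. This is an illustrative assumption, not the procedure used in the manuscript; the function name `mixup_pair`, the dictionary keys, the flat-list image representation, and the default `alpha` are all hypothetical.

```python
import random


def mixup_pair(sample_a, sample_b, alpha=1.5, rng=None):
    """Blend two RGB-T samples with one shared MixUp coefficient.

    Each sample is a dict with hypothetical keys:
      "rgb"   -> flat list of visible pixel values
      "ir"    -> flat list of infrared pixel values (same length)
      "boxes" -> list of box annotations
    """
    rng = rng or random.Random()
    # One coefficient for both modalities keeps the pair consistent.
    lam = rng.betavariate(alpha, alpha)

    def blend(x, y):
        # Pixel-wise convex combination of two images of equal size.
        return [lam * p + (1.0 - lam) * q for p, q in zip(x, y)]

    return {
        "rgb": blend(sample_a["rgb"], sample_b["rgb"]),
        "ir": blend(sample_a["ir"], sample_b["ir"]),
        # For detection, the annotations of both scenes are kept.
        "boxes": sample_a["boxes"] + sample_b["boxes"],
        "lam": lam,
    }
```

Directly superimposing the infrared image on the visible image of the same scene would be a different operation (cross-modal blending), and the manuscript should state which of the two was used for the baselines.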

Comments on the Quality of English Language

The language must be improved.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

1. In the contribution section, you mentioned that you proposed certain methods and achieved specific goals, but I think it is necessary to explain what problems these methods are proposed for and why they are proposed.

2. Figure 4 is a visualization of the experimental results, but some of the labels are too small and blurry to be recognized. It is recommended to adjust the font size of the labels appropriately. In addition, the line color of some detection boxes is similar to the background and is not obvious enough. It is recommended to adjust the color and thickness of the detection boxes appropriately.

3. The comparison networks used in the experiments are relatively old. It is recommended to consider related research from the past three years so that the experimental results are current and relevant.

4. Figure 5 shows the effects of different object filtering and editing methods, but the effects are not very obvious. I suggest marking different parts with circles or highlighting them in other ways to enhance the contrast effect.

5. It is recommended that the authors add more related papers from the past three years to the references.

[1] Li Y, Yang Y, An Y, et al. LARS: Remote Sensing Small Object Detection Network Based on Adaptive Channel Attention and Large Kernel Adaptation[J]. Remote Sensing, 2024, 16(16): 2906.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

The work done is appreciable. I recommend adding the following:

A discussion of the method's limited applicability in certain contexts is recommended. Not all data types lend themselves well to data augmentation, especially when each individual data item represents a unique entity or when the generated variations would be unrealistic.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

Comments and Suggestions for Authors

Perhaps you should pay a little more attention to real-world applications rather than to demonstrating a general understanding of the technique.

Comments for author File: Comments.pdf

Author Response

Please see the attachment.

Author Response File: Author Response.docx