by Dmytro Borovyk, Oleksander Barmak, Pavlo Radiuk et al.

Reviewer 1: Anonymous
Reviewer 2: Tim Morris
Reviewer 3: Anonymous
Reviewer 4: Jaeone Lee

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper proposes a hierarchical deep-learning model that achieves target detection through the cascading of Faster R-CNN, YOLO, and FT-Transformer. The approach shows innovation for precise object recognition in complex UAV-based monitoring environments. Experiments on the VisDrone2019 dataset demonstrate good F1-score performance. I believe the methodology and experiments in this paper are complete, and the results are reliable. However, the following issues need to be further addressed:

1. The multi-layer structure increases network complexity. For UAV-based deployment, computational resources are usually limited. The experimental environment around line 571 was implemented on a high-performance GPU. How can the model be adapted to the computational performance of edge devices? Or how do the authors consider this issue for practical applications?

2. The diagram in Figure 1 does not fully cover the innovations of the paper or the multi-layer structure and specific methods mentioned in the abstract. I suggest the authors further improve it to better reflect the logic of the multi-layer structure and the adopted methods.

3. I recommend adding ablation experiments. Since the proposed method mainly relies on cascading different models, more comprehensive and detailed module substitution experiments are necessary.

4. The discussion on the scientific validity of the multi-layer structure is somewhat insufficient. Around line 770, the authors mention this issue, but there is a lack of supporting data. I believe this significantly impacts the scientific soundness of the proposed method. The elimination of cascading errors should be further clarified in the conclusion section.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The aim of this research is to enhance the classification accuracy of similar targets in real-time UAV images. The authors propose a novel, hierarchical deep learning model to recognise targets in UAV imagery.


The proposed architecture includes a combination of:

  • Faster R-CNN to propose targets,
  • YOLO for granular feature extraction, and
  • an FT-Transformer for fine-grained classification.


The literature review gives examples of the problems encountered specifically in UAV images, example applications, and a summary of the architectures used in this type of application.


The major sections of the paper describe the hierarchical model, the methodology for selecting optimal feature extraction and classification modules, the experimental validation and results, including comparison with a baseline model and the state of the art, and a discussion and critique of the work.


The hierarchical model performs coarse-to-fine feature extraction and classification. More stages can be added as required: one class is extracted at each level, so level i classifies class i vs. everything else. Everything else gets passed to level i+1, so one can add another layer at the end to recognise an additional class.


The process is initialised by an object detection stage (level 0). The object proposals go to the classifiers. At each level the system performs feature extraction and classification. Feature extraction uses information derived at all levels of a CNN, not just the final level, and there is a method for choosing which levels are to be used. The concatenation of these features should maximise inter-class distance (the two classes here being the one to be recognised and everything else). The FT-Transformer is used to classify; a separate one is trained at each level.
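For concreteness, a minimal sketch of this cascade logic follows. The helper names (`detect`, `extract`, `classify`) and level objects are hypothetical placeholders standing in for the paper's Faster R-CNN proposer, YOLO-based feature extractors, and per-level FT-Transformers; this is not the authors' actual implementation.

```python
# Minimal sketch of the coarse-to-fine cascade described above.
# All helper objects are hypothetical placeholders, not the authors' code.

def classify_cascade(image, detector, levels):
    """detector: level-0 object proposer (e.g., a Faster R-CNN wrapper).
    levels: ordered list of (feature_extractor, ft_transformer, class_name);
    level i separates class i from 'everything else', and 'everything else'
    is passed on to level i + 1."""
    results = []
    for proposal in detector.detect(image):              # level 0: object proposals
        label = "unrecognised"                           # fell through every level
        for extractor, transformer, class_name in levels:
            features = extractor.extract(proposal)       # concatenated multi-level CNN features
            if transformer.classify(features):           # binary: class_name vs. the rest
                label = class_name
                break                                    # recognised; stop descending
        results.append((proposal, label))
    return results
```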


Specific comments

Figure 3

Change the line for the loop: instead of the line from the for statement to the end, this should be from step 6 back to the for statement.


Figure 4

Can there really be an infinite number of classifiers? In theory yes, in practice what limits this number?


The paper illustrates a proof-of-concept application that identifies three classes: people, trucks, and other vehicles.

How does this scale as more classes are added?

How does the order of recognising targets affect efficiency? Can one recognise the “cheapest” class first?
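One standard way to think about this question: if level i costs c_i to evaluate and catches a fraction p_i of the objects, a pairwise exchange argument says to run levels in ascending order of c_i / p_i, ignoring any effect of the ordering on accuracy. A small illustrative sketch, with hypothetical costs and frequencies not taken from the paper:

```python
# Hypothetical per-level costs (ms) and expected class fractions (illustrative only).
levels = [
    ("people",         4.0, 0.55),   # (class, cost_ms, expected_fraction)
    ("trucks",         9.0, 0.10),
    ("other_vehicles", 6.0, 0.35),
]

# Exchange argument: level i should precede level j when c_i * p_j < c_j * p_i,
# i.e., sort ascending by cost / hit-rate ("cheapest per recognised object" first).
ordered = sorted(levels, key=lambda lvl: lvl[1] / lvl[2])
print([name for name, _, _ in ordered])   # -> ['people', 'other_vehicles', 'trucks']
```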


Are paragraphs starting on lines 513 and 516 repeating information?


Lines 610, 618

Can one really say the performance is “exceptionally strong” or “outstanding” without comparing it to other systems’ results?


Tables 5 and 6

Where did the F1 score for the present system come from? 94.9% does not appear in the earlier tables, nor can it be computed from the values that are presented.

Can the table give standard deviations? Only if these are given can one claim a “significant” improvement over the baseline.

How much are these results skewed by the imbalances between classes?

The performance quoted is quite an improvement over the baseline, but less of an improvement over the SOTA.
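On the earlier request for standard deviations: one common way to back a “significant” claim is to repeat training over several random seeds and run a paired test on the per-run F1 scores. A minimal sketch with hypothetical numbers, not the paper's results:

```python
import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-seed F1 scores -- placeholders, not the paper's results.
f1_proposed = np.array([0.948, 0.951, 0.947, 0.950, 0.949])
f1_baseline = np.array([0.925, 0.928, 0.924, 0.926, 0.927])

print(f"proposed: {f1_proposed.mean():.3f} +/- {f1_proposed.std(ddof=1):.3f}")
print(f"baseline: {f1_baseline.mean():.3f} +/- {f1_baseline.std(ddof=1):.3f}")

t_stat, p_value = ttest_rel(f1_proposed, f1_baseline)   # paired across seeds
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```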


Figure 10a and b

Can the authors explain in the caption the significance of the blue bounding boxes.


The system includes multiple networks for identifying potential objects, and multiple levels of feature extraction and classification. Can the authors comment on the size of the system and the training and inference times? This is hinted at in the Discussion (line 727).


line 767

Claims a statistically significant improvement – this is not demonstrated in the paper.


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

General comments


This study presents a novel hierarchical deep learning architecture for accurate object detection in UAV imagery, integrating Faster R-CNN, specialized YOLO models, and an FT-Transformer to address inter-class ambiguity through a multi-level classification cascade. Experimental results on a complex dataset show the proposed model achieves a 94.9% F1-score, a 2.44% improvement over a traditional non-hierarchical baseline, demonstrating its effectiveness for precise object recognition in UAV-based monitoring. The article is highly engaging for readers, but certain aspects lack clarity, particularly the use of the COCO dataset, which it is recommended to replace with datasets specific to UAVs (UAVDT, the Stanford Drone Dataset, or UAV123). The detailed suggestions are as follows:


(1) The logic and written expression of the manuscript

  1. In line 73, where CNN is first introduced, the full term is not provided, yet in line 81, the full term "Convolutional Neural Network" is given. The full term and its abbreviation should be provided at the first mention, with the abbreviation used consistently thereafter.


  2. The formulas presented in the article, if not originally developed herein, should include relevant references to facilitate reader comprehension.


  3. In lines 380 and 443 of the manuscript, "regions of interest (ROIs)" is mentioned. The full term "Regions of Interest" should be provided at the first mention, with the abbreviation "ROIs" used consistently thereafter.


  4. The section "2.4. Construction of a Multi-Level Target Detection Model in Imagery" states: "In this study, to validate the model and method, it is proposed to build the architecture of a multi-level target detection model based on empirical considerations. In future studies, an appropriate method for this purpose is planned to be proposed." This content feels somewhat abrupt. It is recommended to merge this section with Section 2.5 for improved coherence.


  5. Lines 434–445 of the manuscript contain references to “4,” “5,” and “6,” which appear disconnected from the surrounding text. Please verify and align with the context.


  6. In lines 481–484 of the manuscript, the text “At the second classification layer, the goal is to divide objects of the ‘vehicle’ class into two subclasses: trucks (T) and other vehicles (O). At the second classification level, the task is to divide the class ‘vehicles’ into two subclasses: ‘Trucks’ (T) and ‘Other vehicles’ (O).” appears to be repeated consecutively. What is the author’s intention with this repetition?

(2) Figures and tables

  1. In Figure 5, the left box and arrow in the "First Level" interfere with each other. It is recommended to increase the vertical spacing to avoid this interference.
  2. The font size of Figure 7 in the manuscript could be larger to make it easier for readers to see.

(3) Scientific soundness of the manuscript

  1. Figure 1 and Figures 3–6 appear somewhat redundant, particularly Figure 6, which seems to be a composite of Figures 4 and 5. Additionally, the overall description in the article lacks logical clarity. The authors are recommended to revise the manuscript to enhance logical coherence.


  2. Figures 1–5 in the article provide schematic diagrams and principles of the multi-level object classification approach. However, Figure 6 only depicts Level 1, Level 2, and Level 3. The manuscript does not clearly state or emphasize why three levels were chosen over other possibilities, such as two or five levels. As this represents a key innovation of the study, the authors are encouraged to dedicate sufficient space to explain and highlight this point.


  3. Table 6 demonstrates the superiority of the proposed method compared to other approaches; Figure 8 gives a correct example, while Figure 10 illustrates the limitations of the proposed method. The authors could include additional examples similar to those in Figure 11, showcasing cases where the proposed method correctly identifies objects while other methods fail, to intuitively highlight its advantages.


  4. The manuscript further validates the approach using the Common Objects in Context (COCO) dataset. However, the general-purpose nature of COCO is not well suited to the theme of the journal Drones or to this manuscript. It is recommended that the authors consider using UAVDT (the UAV Detection and Tracking dataset, released by the Institute of Automation, Chinese Academy of Sciences, in 2018), the Stanford Drone Dataset (released by Stanford University in 2016), or the UAV123 dataset (released by KAUST in 2016), as these datasets are more relevant and better demonstrate the superiority of the proposed method than COCO.


Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Dear authors,

I have some thoughts and suggestions to offer regarding the paper “Hierarchical Deep-Learning Model for Identifying Similar Targets in UAV Imagery” written by Borovyk D. et al. This paper proposes a novel hierarchical deep learning architecture designed to detect and classify visually similar targets in UAV imagery. The proposed method, unlike conventional DL methods that use monolithic detection systems, can break down a complex multi-class problem into a sequence of manageable, fine-grained classification stages. The authors have invested a lot of time demonstrating the superiority of the proposed approach in identifying and classifying similar targets in UAV imagery. In this regard, the paper's ideas and results can make a significant contribution to the development of related fields.

Although this manuscript is excellent in all aspects, there are some minor and major errors. These errors are crucial because they cannot be overlooked if the manuscript is to be released in its current format. My comments are written on separate pages, which won't be a huge burden on your editing efforts.

Best regards,

Reviewer


Common comments:

Citing Formulas

If formulas are taken from other literature, please provide the relevant references.


Additional comments:

 Line 12: Highlights

This section is unnecessary because it has already been included in the main text. Furthermore, a glance at the paper's format in Drones will prove that this section is inconsequential.

 Line 170: Compression of background theory explanation

Chapter 2.2 provides a very detailed and logical explanation of the theoretical basis for the ML model, which I highly commend. However, I'm concerned that the explanation is too long and could easily bore readers. Therefore, I would appreciate it if you could shorten this section by leaving only the essential points and providing relevant references.

 Line 434: Correcting numbering error

Please correct the numbering error at this line.

 Line 465: Unit in Table 1

Please consider whether you need to add units for the inter-class distances.

Line 482: Remove the repeated sentence

Please remove the duplicated sentence.

Line 530: Full name for abbreviation MLRS

Please provide the full name of the abbreviation that first appears in the text.

Line 531: Complete the sentence

Please complete the sentence by filling in the missing word or phrase after ‘in’.

Line 544: 2.6 Evaluation metrics

Precision, Recall, and F1-score are essential factors in evaluating the method proposed in this study. Why are the formulas for these metrics not described, while only the formula for calculating the average (Formula 14), which requires no explanation, is presented?
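For reference, these are the standard definitions the comment refers to, written in terms of true positives (TP), false positives (FP), and false negatives (FN); they are textbook formulas, not quoted from the manuscript:

$$
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
$$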

Line 579: Basic information about VisDrone2019 dataset

For the convenience of readers, please provide some basic information about the drone images. This is especially important, as image resolution plays a crucial role in object detection and classification.

Line 592: Figure 8(a)

Zooming in on this, it looks vaguely like a person, but it is hard to determine for sure. If there is only one individual, please change it to 'person' instead of 'persons'.

Line 607: SD in Table 2

I would appreciate it if you could explain how and from what you calculated the SD. I am also interested in the true value (mean) for calculating the residuals.
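For reference, the usual unbiased sample estimate over $n$ repeated runs $x_1, \dots, x_n$ with mean $\bar{x}$, which is presumably what the comment asks the authors to specify, is:

$$
\mathrm{SD} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2 }.
$$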

Line 643, 708: Choosing an appropriate word

Can one use “significantly” for a simple 2.4% accuracy increase, or should alternative words (“slightly” or “noticeably”) be considered?

Line 781: Problem with accessing the supplementary materials.

The webpage you indicated could not be found. The URL may be incorrect, or the webpage may have been moved.

– The End –

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

I think the authors have already addressed the issues I raised quite well. However, before publication, it is recommended that the authors add more photos of test results from different scenarios in the experimental section to enhance readability and readers' intuitive understanding. I think the work of this article is meaningful.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

The authors responded to all my questions and made revisions. In particular, they added validation using the UAV123 dataset. I also recommend adding comparisons with other methods on the UAV123 dataset to highlight the superiority of their approach, rather than just presenting their own method.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 4 Report

Comments and Suggestions for Authors

Dear authors,

I sincerely appreciate your prompt and extensive work in revising, updating, and supplementing the text. I am of the opinion that my comments have been properly addressed and it is safe to publish this revised manuscript in its present format.

Best regards,

Reviewer

Author Response

Please see the attachment.

Author Response File: Author Response.pdf