Peer-Review Record

Improving YOLOv7-Tiny for Infrared and Visible Light Image Object Detection on Drones

Remote Sens. 2023, 15(13), 3214; https://doi.org/10.3390/rs15133214
by Shuming Hu, Fei Zhao *, Huanzhang Lu, Yingjie Deng, Jinming Du and Xinglin Shen
Submission received: 17 April 2023 / Revised: 17 June 2023 / Accepted: 18 June 2023 / Published: 21 June 2023

Round 1

Reviewer 1 Report

The paper mainly focuses on the hard-to-detect-object and class-imbalance problems of the object detection task on drone images. To address these problems, the paper proposes several improvements, such as aspect ratio anchor boxes and a hard sample mining loss function, and implements them on the popular YOLOv7-tiny model. Overall, the improved algorithm achieves better results on the DroneVehicle dataset in both RGB and infrared image formats. The paper is well designed overall, and the improvements are clearly stated. However, it has some issues that need to be addressed.

1. The paper title "Improving YOLOv7-tiny for Infrared and Visible Light Image Object Detection on Drones" is not appropriate. From the paper, all improvements stated by the authors are specifically designed for the DroneVehicle dataset used in the paper: the aspect ratio anchor boxes are set for objects in that dataset, and so is the HSM loss function. Either modify the title to state "Improving ...... on the DroneVehicle Dataset" or add more datasets to validate the method.

2. Line 74, line 141, line 207, line 214, etc.: no reference is specified. None of the figures in the paper are referenced correctly.

3. Please move the Section 3.1 dataset introduction into the Introduction section.

The quality of the writing is good overall.

Author Response

Response to Reviewer 1 Comments

We appreciate the precious time you spent reviewing our paper and providing valuable comments. Your insightful comments have led to improvements in the current version. The authors have carefully considered the comments and tried our best to address every one of them. We hope that the manuscript, after our careful revisions, meets your high standards. We also welcome any further constructive comments.

 

We provide the point-by-point responses below. All modifications are made through “Track Changes”.

 

Point 1: The paper title "Improving YOLOv7-tiny for Infrared and Visible Light Image Object Detection on Drones" is not appropriate. From the paper, all improvements stated by the authors are specifically designed for the DroneVehicle dataset used in the paper: the aspect ratio anchor boxes are set for objects in that dataset, and so is the HSM loss function. Either modify the title to state "Improving ...... on the DroneVehicle Dataset" or add more datasets to validate the method.

 

Response 1: Thank you for your valuable comments. We analyzed our algorithm in combination with the DroneVehicle dataset in order to describe the details of our current method more clearly. The anchor box design method proposed in this paper targets the object size characteristics and natural scene differences of the drone view, while the HSM loss optimizes the network from the perspective of mining hard samples. These problems are widely present in the drone view, but in the paper we only analyzed the characteristics of the objects in the DroneVehicle dataset to explain why our method is effective. We agree with your valuable suggestion and will verify our algorithm on other datasets in our future work.

 

Point 2: Line 74, line 141, line 207, line 214, etc.: no reference is specified. None of the figures in the paper are referenced correctly.

 

Response 2: Thank you for your careful review, and we are very sorry. We have checked the original manuscript we submitted, and the references there display correctly. We have fixed the issue in the latest submitted manuscript.

 

Point 3: Please move the Section 3.1 dataset introduction into the Introduction section.

 

Response 3: Thanks for this good suggestion. We introduced the dataset in Section 3.1 because we considered it part of the experimental environment description, but we have moved this part to the Introduction section as you suggested.

 

Thank you again for your positive comments and valuable questions, which have helped improve the quality of our manuscript. If there are any other modifications we could make, we would be glad to make them, and we truly appreciate your help.

Author Response File: Author Response.docx

Reviewer 2 Report

In section 3.3.1, Table 2 presents the Average Precision (AP) values for different categories. From the distribution of the previous dataset, it can be seen that there is a significant imbalance between the sample categories in the dataset, with the most samples belonging to the "car" category and relatively fewer samples for "truck," "Freight-car," "bus," and "van." It is worth considering analyzing the reasons for the significant differences in AP values between these three categories and the other two.

 

In the Focal Loss expression, the typical values of α and γ are 0.25 and 2, respectively. Please cite the corresponding reference.

 

In the HSM loss, please describe the reason for setting the weight threshold of samples with high confidence to 0.5.

 

In the HSM loss expression, the modulation factor is increased by 0.1 for balance. Please describe in detail the reason for the increase of 0.1.

 

Please give the horizontal and vertical coordinates of Figure 5 completely, and briefly explain the advantages of the HSM loss curve.

 

In section 3.3.1, the article states that "Furthermore, from the AP of each category, we can see that the detection performance of categories with fewer samples is significantly improved, which proves that the proposed method of clustering separately for each category can alleviate the impact of data long-tail distribution." However, the "bus" category with fewer samples did not show a significant improvement in AP value despite changes in prior information (anchor boxes). It is worth considering analyzing the reasons for this. 

 

Figures 6 and 7 present the detection results of YOLOv7-tiny and of the proposed model. It is worth considering comparing the prediction results in Figure 6 with the ground truth to demonstrate the superiority of the proposed model.

 

Table 6 provides the detection results of different models on the dataset, while Table 7 provides indicators such as parameter size, computational cost, and speed. The data in these tables should be analyzed. Meanwhile, the accuracy differences between the models should be explained in terms of the principles of the models and the dataset.

For section 2.1 line 141, section 2.2 lines 207, 214, 221, 224, and 230, section 3.3.1 lines 383, 384, 393, 410, and 416, section 3.3.2 lines 435, 452, and 458, and section 3.4 lines 481, 489, 492, and 504 where "Error! Reference source not found." appears, please consider addressing this issue to make the article more complete.

 

Please consider adjusting the column width and formatting of Table 1 to make it more aesthetically pleasing.

 

It is recommended to add statistics on the number of targets in the different categories of visible light images to Figure 4.

 

Figure 7 seems to have reversed the type explanations for images (a) and (b); please make corrections.

 

 

Please pay attention to the verb tense usage rules in the full text.

 

Please re-examine the formatting of figure references in text paragraphs.

 

Author Response

Response to Reviewer 2 Comments

We appreciate the precious time you spent reviewing our paper and providing valuable comments. Your insightful comments have led to improvements in the current version. The authors have carefully considered the comments and tried our best to address every one of them. We hope that the manuscript, after our careful revisions, meets your high standards. We also welcome any further constructive comments.

 

We provide the point-by-point responses below. All modifications are made through “Track Changes”.

 

Point 1: In section 3.3.1, Table 2 presents the Average Precision (AP) values for different categories. From the distribution of the previous dataset, it can be seen that there is a significant imbalance between the sample categories in the dataset, with the most samples belonging to the "car" category and relatively fewer samples for "truck," "Freight-car," "bus," and "van." It is worth considering analyzing the reasons for the significant differences in AP values between these three categories and the other two.  

 

Response 1: Thank you for your valuable comments and for your interest in our work. In this paper, we analyzed the samples and found that 'car' achieved the highest AP owing to its large number of samples, whereas the other four categories had fewer samples and lower APs. However, we also observed that 'truck', 'freight car', and 'van' performed worse than 'bus', whose AP was close to that of 'car'. The main reason for this phenomenon is that 'car' and 'bus' share similar shapes and features, while the other three categories are large vehicles that differ significantly from 'car'. Our algorithm is based on a Convolutional Neural Network (CNN), which is characterized by many shared weights. As a result, many convolutional kernels in the network respond mainly to features related to 'car'. This gives 'bus' an advantage: it benefits from the network feature bias caused by 'car' and achieves a higher AP. On the other hand, 'truck' and 'freight car' are very similar to each other yet belong to different categories, which leads to mutual interference and lower APs for both. In fact, a large part of this problem can be attributed to the imbalance of sample sizes among the categories. We plan to conduct further research on this issue in our future work and will update the analysis accordingly in the revised manuscript.

 

Point 2: In the Focal Loss expression, the typical values of α and γ are 0.25 and 2, respectively. Please cite the corresponding reference.

 

Response 2: We appreciate your valuable comments. In the focal loss paper, the authors performed extensive experiments to determine the optimal values of α and γ and reported that 0.25 and 2 were the best choices; we therefore followed their settings in our paper. We have cited the focal loss paper as a reference in the revised manuscript.
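For reference, these values come from the original focal loss formulation, FL(p_t) = −α_t(1 − p_t)^γ log(p_t). Below is a minimal PyTorch sketch using the cited defaults α = 0.25 and γ = 2; it is an illustrative reimplementation for readers, not the authors' code.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Standard focal loss: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)              # confidence of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # The modulation factor (1 - p_t)**gamma lies in [0, 1], so easy
    # samples (high p_t) contribute the least to the total loss.
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```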

 

Point 3: In the HSM loss, please describe the reason for setting the weight threshold of samples with high confidence to 0.5.

 

Response 3: HSM loss is an improvement over focal loss that assigns more loss to hard samples. We assume that, for a positive sample, the lower the predicted confidence, the more difficult it is to detect. Instead of applying a decay factor less than 1 to all samples as focal loss does, we enhance the learning of hard samples and adjust the decay factor according to their confidence. We define hard samples as those with confidence below 0.5, which is consistent with our prediction threshold: we only treat predictions with confidence greater than 0.5 as true detections, so the loss function aligns with the prediction criterion. Our visualization results show that some objects previously missed at confidence below 0.5 are successfully detected with confidence above 0.5 after using our loss function, which demonstrates that the network learns better from hard samples. We have added more details about our loss function in the revised manuscript.

 

Point 4: In the HSM loss expression, the modulation factor is increased by 0.1 for balance. Please describe in detail the reason for the increase of 0.1.

 

Response 4: As we explained in Section 4.2 of the paper, focal loss multiplies the cross-entropy loss by a modulation factor between 0 and 1. As described in Response 3, we extended this factor to range from 0 to 4. However, we encountered the problem that positive samples with high confidence and negative samples with low confidence decayed too much. For instance, when the confidence of a positive sample exceeded 0.9, its loss was decayed by a factor of about 100, which significantly reduced its contribution to the network loss and made it harder for the network to predict objects with high confidence. We believed this was not optimal for the learning process, so we added a constant term of 0.1 to bound the decay (at most about 10 times); this way, the network can still learn from objects with relatively high confidence. We did not evaluate the network solely by the precision and recall at a fixed confidence of 0.5, but rather by the overall mAP metric, which is why we introduced this constant term. Regarding its specific value, we chose it based on intuition and experience rather than extensive parameter tuning; we considered it a reasonable trade-off between avoiding excessive decay and preserving the hard-sample enhancement effect. Our main point is that setting an effective lower bound on the decay is beneficial for HSM loss. We have added more details about our loss function in the revised manuscript.
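To make Responses 3 and 4 concrete, the sketch below shows a modulation factor with the stated properties: it spans roughly 0 to 4 rather than the [0, 1] range of focal loss, it crosses 1 near a confidence of 0.5 so that hard samples (below the 0.5 threshold) receive amplified loss, and the +0.1 constant floors it so that easy samples are decayed by at most a factor of about 10. The functional form (2(1 − p_t))^γ + 0.1 is our assumption for illustration only; the exact HSM loss formula is the one given in the manuscript.

```python
import torch

def hsm_modulation(p_t, gamma=2.0, floor=0.1):
    """Hypothetical HSM-style modulation factor (assumed form, illustration only).

    Properties matching Responses 3 and 4:
      - spans (0, 4] before the constant term (focal loss uses [0, 1]);
      - equals ~1 at p_t = 0.5, so samples below the 0.5 confidence
        threshold (hard samples) receive an amplified loss;
      - the +0.1 floor bounds the decay of easy samples at roughly 10x.
    """
    return (2.0 * (1.0 - p_t)) ** gamma + floor

p_t = torch.tensor([0.1, 0.5, 0.9, 0.99])
print(hsm_modulation(p_t))  # tensor([3.3400, 1.1000, 0.1400, 0.1004])
```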

 

Point 5: Please give the horizontal and vertical coordinates of Figure 5 completely, and briefly explain the advantages of the HSM loss curve.

 

Response 5: We sincerely apologize for our oversight. We did not include the coordinate axes in the previous figure because the loss value is not directly tied to a physical quantity. We have corrected this in the revised manuscript and added further explanation, as described in our responses to Points 3 and 4.

 

Point 6: In section 3.3.1, the article states that "Furthermore, from the AP of each category, we can see that the detection performance of categories with fewer samples is significantly improved, which proves that the proposed method of clustering separately for each category can alleviate the impact of data long-tail distribution." However, the "bus" category with fewer samples did not show a significant improvement in AP value despite changes in prior information (anchor boxes). It is worth considering analyzing the reasons for this.

 

Response 6: This problem is similar to Point 1. With the original clustering method, 'car' has the most samples and, as shown in Figure 3(b), most anchor boxes tend toward this category during clustering: the anchor boxes obtained are closest in size to 'car'. The 'bus' category is slightly smaller than 'car', while the other categories differ more in size. 'Bus' therefore benefits from the network and anchor box bias brought by 'car', so its AP is higher. As shown in the ablation results in Table 2, the AP of 'car' even decreases after using the method proposed in this paper. In addition, since the AP baseline of 'bus' is higher, the improvement for 'bus' is smaller even though it has fewer samples. We will revise this together with Point 1 and supplement the analysis in the revised manuscript.
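As an illustration of the per-category clustering discussed in Responses 1 and 6, the sketch below clusters the (width, height) pairs of each category separately and pools the centroids, so that minority classes contribute anchors matched to their own sizes instead of being dominated by 'car'. It uses Euclidean k-means from scikit-learn for brevity; YOLO-style anchor clustering typically uses a 1 − IoU distance, and the authors' exact procedure may differ.

```python
import numpy as np
from sklearn.cluster import KMeans

def anchors_per_category(boxes_by_class, k_per_class=3, seed=0):
    """Cluster (w, h) box sizes per category, then pool the centroids.

    boxes_by_class: dict mapping class name -> (n, 2) array of
    (width, height) in pixels. An illustrative helper, not the paper's code.
    """
    anchors = []
    for _cls, wh in boxes_by_class.items():
        km = KMeans(n_clusters=k_per_class, n_init=10, random_state=seed).fit(wh)
        anchors.append(km.cluster_centers_)
    pooled = np.vstack(anchors)
    # Sort pooled anchors by area, since YOLO assigns anchors to heads by scale.
    return pooled[np.argsort(pooled.prod(axis=1))]
```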

 

Point 7: Figures 6 and 7 present the detection results of YOLOv7-tiny and of the proposed model. It is worth considering comparing the prediction results in Figure 6 with the ground truth to demonstrate the superiority of the proposed model.

 

Response 7: Thank you for your suggestion. We will include a comparison between the predicted results and the ground truth in the revised manuscript to demonstrate the superiority of our proposed method.

 

Point 8: Table 6 provides the detection results of different models on the dataset, while Table 7 provides indicators such as parameter size, computational cost, and speed. The data in these tables should be analyzed. Meanwhile, the accuracy differences between the models should be explained in terms of the principles of the models and the dataset.

 

Response 8: Thank you for your suggestion. We will supplement the analysis and explanation of these data in the revised manuscript.

 

Point 9: For section 2.1 line 141, section 2.2 lines 207, 214, 221, 224, and 230, section 3.3.1 lines 383, 384, 393, 410, and 416, section 3.3.2 lines 435, 452, and 458, and section 3.4 lines 481, 489, 492, and 504 where "Error! Reference source not found." appears, please consider addressing this issue to make the article more complete.

 

Response 9: Thank you for your careful review. We have checked the original manuscript we submitted, and the references there display correctly. We have fixed the issue in the latest submitted manuscript.

 

Point 10: Please consider adjusting the column width and formatting of Table 1 to make it more aesthetically pleasing.

 

Response 10: Thank you for your careful review. We will adjust the table in the revised manuscript to make it more aesthetically pleasing.

 

Point 11: It is recommended to add statistics on the number of targets in the different categories of visible light images to Figure 4.

 

Response 11: Thank you for your valuable comments. We will add statistics on the number of targets in the different categories of visible light images to Figure 4 in the revised manuscript.

 

Point 12: Figure 7 seems to have reversed the type explanations for images (a) and (b); please make corrections.

 

Response 12: Thank you for your careful review, and we are very sorry. We will correct the type explanations of images (a) and (b) in Figure 7 in the revised manuscript.

 

Point 13: Please pay attention to the verb tense usage rules in the full text.

 

Response 13: Thank you for your comments. We will check the verb tense usage throughout the text and correct any errors.

 

Point 14: Please re-examine the formatting of figure references in text paragraphs.

 

Response 14: Thank you for your comments. We will check the formatting of figure references in text paragraphs and correct them.

 

Thank you again for your positive comments and valuable questions, which have helped improve the quality of our manuscript. If there are any other modifications we could make, we would be glad to make them, and we truly appreciate your help.

 

 

Author Response File: Author Response.docx

Reviewer 3 Report


Comments for author File: Comments.pdf


Author Response

Response to Reviewer 3 Comments

We appreciate the precious time you spent reviewing our paper and providing valuable comments. Your insightful comments have led to improvements in the current version. The authors have carefully considered the comments and tried our best to address every one of them. We hope that the manuscript, after our careful revisions, meets your high standards. We also welcome any further constructive comments.

 

We provide the point-by-point responses below. All modifications are made through “Track Changes”.

 

Point 1: Some of the annotations, such as those in Figures 2 and 5, should be more detailed so that the article's expression is more accurate and clear.

 

Response 1: Thank you for your suggestions on the annotation details in Figures 2 and 5. We have supplemented and improved the annotations in both figures based on your comments, making the expression of the article more accurate and clear. The specific modifications are as follows:

 

- In Figure 2, we have added a description and introduction of the network structure.

 

- In Figure 5, we have added labels and units to the coordinate axes and explained the advantages of the loss function.

 

We believe these modifications will improve the quality and readability of the article. Thank you again for the valuable feedback.

 

Point 2: Object detection datasets from the drone perspective are not scarce, but only a few contain both infrared and visible modalities at the same time. This could be explained more clearly when introducing the datasets.

 

Response 2: Thank you for your valuable comments. We have revised this section to introduce the drone-perspective datasets more accurately. We have also added two more datasets of infrared images from the drone perspective, making this section more complete.

 

Point 3: Please unify the professional terminology: use "oriented bounding boxes" for rotated object detection, whereas "rotated bounding boxes" appears in some parts of the text.

 

Response 3: Thank you for your careful review and valuable feedback, and we are sorry for the inconsistency. We have corrected it and reviewed the entire text to avoid similar errors.

 

Point 4: This method achieves good performance. However, the conclusion anticipates that fusion detection will be carried out in the future. The motivation should not be that the dataset happens to contain two modalities; rather, future fusion work should be justified in terms of the benefits that fusion brings.

 

Response 4: Thanks for this good suggestion. We fully agree and have revised this section to frame the future work from the perspective of the benefits of fusion. We believe your suggestion will be helpful for our work.

 

Thank you again for your positive comments and valuable questions, which have helped improve the quality of our manuscript. If there are any other modifications we could make, we would be glad to make them, and we truly appreciate your help.

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

I have no further comments about the manuscript.

Please check whether the abbreviation OBB in line 45 is directly relevant to the subject of that sentence.

Author Response

Response to Reviewer 2 Comments

We appreciate the precious time you spent reviewing our paper and providing valuable comments. Your insightful comments have led to improvements in the current version. The authors have carefully considered the comments and tried our best to address every one of them. We hope that the manuscript, after our careful revisions, meets your high standards. We also welcome any further constructive comments.

 

We provide the point-by-point responses below. All modifications are made through “Track Changes”.

 

Point 1: Please check whether the abbreviation OBB in line 45 is directly relevant to the subject of that sentence.

 

Response 1: Thank you for your comments, and we apologize for our negligence. OBB is the abbreviation for oriented bounding boxes, and we mistakenly placed it after "oriented object detection". Thank you for your careful review; we have made the correction in the revised manuscript.

 

 

Thank you again for your positive comments and valuable questions, which have helped improve the quality of our manuscript. If there are any other modifications we could make, we would be glad to make them, and we truly appreciate your help.

 

 

Author Response File: Author Response.docx
