Long-Distance Person Detection Based on YOLOv7
Round 1
Reviewer 1 Report
Summary: This article presents a modified version of the YOLOv7 algorithm for detecting tiny people. Because the objects are small and the background occupies most of each image, small object detection is a specialized task. Handling different poses is a further challenge. Based on these specific requirements, a new architecture for the one-stage YOLO algorithm is proposed, comprising a tiny object detection head, an attention mechanism, and convolutional attention. The proposed solution improves the overall detection performance of YOLOv7 by 2.4 percentage points. This article contributes in three areas:
1. Problem definition, a review of current research in the area, and the definition of specific requirements for tiny person detection.
2. Model specification based on YOLOv7 with a modified detection head, attention mechanism, and convolutional attention.
3. Definition of an augmentation mechanism specific to tiny person detection.
The article provides an excellent overview of the small object detection problem, a comprehensive list of related research papers, a method for model preparation, experimental setup, and an evaluation of the proposed model.
Citations and resources: The reference for the well-known YOLOv5 model is missing. Since several YOLOv5 versions exist, the version used should be cited explicitly (line 125). All necessary datasets and evaluation algorithms are referenced and defined.
Manuscript: The article is well-organized and provides all the necessary information for understanding the problem, methodology, and application. Experimental design adheres to standards established for similar types of research and yields structured results.
General comments:
- Since the TinyPerson dataset contains 1610 tagged images, dividing the data into a training set of 49.3 percent (794 images) and a testing set of roughly 50 percent (816 images) seems somewhat wasteful. Is there a particular reason for this choice? I suggest detailing the rationale for such a split. (Lines 265-267)
- Please check the punctuation throughout the entire article so that every comma and period is followed by a space.
- Formula (9) contains a comma whose role is somewhat unclear; it could be misread as the notation FP'. Please place the formulas for P and R on two separate lines.
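The two numerical points in the comments above can be checked with a short sketch. It assumes the standard definitions P = TP / (TP + FP) and R = TP / (TP + FN), which are presumed to be what Formula (9) expresses since the formula itself is not reproduced here. The function names and the TP/FP/FN counts are hypothetical and chosen only for illustration.

```python
from typing import Tuple


def split_percentages(n_train: int, n_test: int) -> Tuple[float, float]:
    """Return the training and testing shares of the dataset, in percent."""
    total = n_train + n_test
    return 100.0 * n_train / total, 100.0 * n_test / total


def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP). Standard definition, assumed to match Formula (9)."""
    detections = tp + fp
    return tp / detections if detections else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN). Standard definition, assumed to match Formula (9)."""
    ground_truth = tp + fn
    return tp / ground_truth if ground_truth else 0.0


if __name__ == "__main__":
    # Image counts quoted in the comment on lines 265-267 of the manuscript.
    train_pct, test_pct = split_percentages(794, 816)
    print(f"train: {train_pct:.1f}%  test: {test_pct:.1f}%")

    # Hypothetical counts, for illustration only.
    print(f"P = {precision(tp=80, fp=20):.3f}")
    print(f"R = {recall(tp=80, fn=40):.3f}")
```

Writing P and R as two separate functions mirrors the request to put the two formulas on separate lines, so that the comma between them cannot be mistaken for part of the notation.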
Specific comments:
Lines 95-96: This appears to be a fragment of some other text. Please fix it.
Line 125: Please provide a reference for the YOLOv5 version mentioned.
Lines 237-244: The sentence is very complicated and difficult to understand. Please rephrase this part so that it is easier to follow.
Reproducibility: To ensure reproducibility, the algorithm architecture, modifications, datasets, and experimental results are defined.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
This paper shows that, compared with the baseline model YOLOv7, the detection accuracy of the proposed method on the TinyPerson dataset improves from 7.1% to 9.5%, and the detection speed reaches 208 frames per second.
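To put the figures quoted above in context, the following minimal sketch derives the absolute gain, the relative gain, and the per-frame latency from them. The function names are hypothetical, and the latency assumes that frames are processed one at a time at a constant rate, which the report does not state.

```python
def absolute_gain(baseline_pct: float, improved_pct: float) -> float:
    """Difference between the two accuracies, in percentage points."""
    return improved_pct - baseline_pct


def relative_gain(baseline_pct: float, improved_pct: float) -> float:
    """Improvement relative to the baseline accuracy, in percent."""
    return 100.0 * (improved_pct - baseline_pct) / baseline_pct


def ms_per_frame(fps: float) -> float:
    """Average time per frame in milliseconds, assuming a constant frame rate."""
    return 1000.0 / fps


if __name__ == "__main__":
    # Values quoted in the report: 7.1% -> 9.5% accuracy, 208 frames per second.
    print(f"absolute gain: {absolute_gain(7.1, 9.5):.1f} percentage points")
    print(f"relative gain: {relative_gain(7.1, 9.5):.1f}%")
    print(f"latency: {ms_per_frame(208):.2f} ms per frame")
```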
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
The manuscript presents a very interesting topic with a large number of applications.
There are no major issues with the English language.
The title should be changed. "Tiny people" could be read as referring to physically small people, such as children, rather than to people who appear small in images. Consider replacing it with “long distance people detection…” or something similar.
Background theory and applied methodology are explained in detail.
Since this is a scientific manuscript, it should be written in the third person. Expressions like “we” and “our” must be avoided.
Acronyms and abbreviations must be spelled out in full on their first appearance in the text (e.g., GPU, CNN, YOLO, ELAN, …).
Tables and figures should be mentioned in a sentence before they appear in the document (e.g., Figure 1, Figure 2, Table 2).
Line 95 is out of context.
In line 114, the authors state “The algorithm of Anchor-free is not suitable for detecting tiny objects in this paper, so the Anchor-based algorithm is chosen.” Please explain why.
In line 271, the authors write “As training tiny objects requires more time, we set the number of epochs and batch size to 1000 and 32, respectively.” Were these values obtained by trial and error? Can you justify them?
In Table 2, the authors compare the performance of several mainstream approaches. However, the selected approaches were developed for general-purpose detection and not for this specific task, so it is to be expected that the proposed approach performs better.
For a fair performance evaluation, the proposed approach should be compared with other approaches developed for long-distance human detection, such as those presented in the following papers and similar work.
T. Liu, H. Y. Fu, Q. Wen, D. K. Zhang and L. F. Li, "Extended faster R-CNN for long distance human detection: Finding pedestrians in UAV images," 2018 IEEE International Conference on Consumer Electronics (ICCE), Las Vegas, NV, USA, 2018, pp. 1-2, doi: 10.1109/ICCE.2018.8326306.
Y. Zhu, J. Yang, X. Xieg, Z. Wang and X. Deng, "Long-distance infrared video pedestrian detection using deep learning and background subtraction," Journal of Physics: Conference Series, vol. 1682, 012012, 2020, doi: 10.1088/1742-6596/1682/1/012012.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Thank you for your efforts to address all my comments/suggestions.
The revised version has been much improved.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf