A Multi-Scale Traffic Object Detection Algorithm for Road Scenes Based on Improved YOLOv5
Round 1
Reviewer 1 Report
This paper is based on traffic object detection and is designed on top of the YOLOv5s algorithm.
1. The authors provide good motivation for their work on why traditional convolutional neural networks are not enough for this task.
2. The methods section provides a good explanation of the data used to run the experiments. The authors also provide good justification for each of the layers and other parameters used in their model, which is a great aspect of such non-theoretical work.
3. Overall a good paper with good grammar, presentation and formatting.
Author Response
Dear Reviewer,
Thank you very much for your comments and suggestions, as well as your recognition of our work. We have further optimized and improved the paper, and the modified parts are marked in red font. The improvements we made are as follows:
- We further polished the wording of the paper to make it clearer and easier to read.
- In the Introduction, we supplemented the discussion of the corresponding research status and challenges, and added several references.
- We supplemented the experimental section.
- We made other changes to reduce the paper's repetition rate.
Best regards,
Mr. Ang Li
Reviewer 2 Report
The introduction of the paper focuses more on related work than on explaining the introduced method and the challenges it addresses. The paper can be improved by emphasizing these points; the related work section itself is brief (and a bit outdated). It is also not discussed why this specific detector is utilized. In the introduction, it is mentioned that four different detection heads are used for different object sizes. It is not clear whether all four heads are proposed by the authors or whether only a small-object detection head is added. If three heads are already there for different scales, then the architecture is already multi-scale, yet the authors seem to claim they make it multi-scale. The terminology is also confusing: tiny, extremely small, small, etc. These words need to be well defined and used consistently.
It was not clear what CARAFE is until Section 3.4.2, and it initially seemed to be a method proposed by the authors. However, CARAFE (Content-Aware ReAssembly of FEatures) is an existing method that the authors used together with the Feature Pyramid Network. Even the figure from the CARAFE paper is copied directly without a reference to the original paper (Fig. 4).
The explanation of SPD-Conv is also not clear. Isn't it the same as channel-wise convolution on downsampled channels?
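For context on the reviewer's question: SPD-Conv, as described in the literature the paper builds on, is a space-to-depth rearrangement followed by a non-strided convolution over all channels, so it is not a channel-wise (depthwise) convolution on downsampled maps; no pixels are discarded and the subsequent convolution mixes all channels. A minimal NumPy sketch of the space-to-depth step (illustrative, not the paper's implementation):

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Rearrange each scale x scale spatial block into channels.

    x: array of shape (C, H, W) with H, W divisible by scale.
    Returns shape (C * scale**2, H // scale, W // scale).
    Unlike strided convolution or pooling, no information is lost.
    """
    c, h, w = x.shape
    x = x.reshape(c, h // scale, scale, w // scale, scale)
    x = x.transpose(0, 2, 4, 1, 3)  # (C, s, s, H/s, W/s)
    return x.reshape(c * scale * scale, h // scale, w // scale)

x = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
y = space_to_depth(x)
print(y.shape)  # (8, 2, 2)
```

In SPD-Conv this rearrangement is followed by an ordinary (stride-1) convolution whose kernels span all `C * scale**2` channels, which is the cross-channel mixing that a depthwise convolution would lack.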
The names of the different proposed components (NAM, SPD, etc.) are not defined, or the definition and full form appear only after the abbreviation is first used. This confuses the reader: what does NAM stand for? Also, CBAM?!
It is not specified if the dataset is publicly available or if the authors have the intention to make it available. This information should be provided. It is also not explained why benchmark datasets have not been tried in evaluations. To have a better evaluation of the method, such experiments can be provided.
In Table 1, the performance of the detector with different components used separately is compared, but there is no comparison when all components are used together (later, we will see that there is an ablation study as well). It is also difficult and confusing to keep track of one-letter abbreviations and their combinations in the experiments, such as S, F, C, FCS, etc.
Table 4 shows that, with all the bells and whistles, the proposed method gains only 0.1% mAP over the baseline YOLOv5x (85.3 vs. 85.4), which suggests that adding all the components did not help. Moreover, despite all these additional computations, the number of FLOPs decreases in the proposed model relative to the original (205 to 32). How this happened needs to be elaborated.
Author Response
Dear Reviewer,
Thank you very much for your comments and suggestions. They are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope meet with your approval. Revised portions are marked in red in the paper. The main corrections and our responses to the reviewer's comments are as follows:
Responses to the reviewer's comments:
- We have improved the Introduction, describing in detail the state of research on object detection in traffic scenes and the deficiencies of traditional object detection methods, of CNN-based object detection methods, and of the original YOLO detection model, and adding relevant references. The YOLOv5 detector contains three detection heads, for small, medium, and large objects; we added an additional head for extremely small objects, which is highlighted in the author contributions and indicated by a red dotted line in Figure 2. For object sizes, we now use a unified vocabulary: extremely small, small, medium, and large.
- In the author contributions, we have improved the explanation of the several innovations, added their full names and the improvements they bring, and polished the wording in the subsequent sections to make them clearer and easier to understand. We redrew the CARAFE schematic diagram using examples from our own datasets, as shown in Figure 4.
- Following your suggestion, we have added comparative experiments on the two public datasets COCO and VOC to the experimental section on the attention mechanism, as shown in Tables 3 and 4. These two datasets are widely used for experimental validation of attention-mechanism modules.
- The abbreviations F, C, S, N, etc. are now explained at the beginning of Section 4.3. In Table 1, we compare the influence of each improvement separately on the various detected objects, mainly to highlight each improvement's effect on the detection accuracy of extremely small objects. The comparison with all components used together is included in the subsequent ablation experiments.
- Finally, we would like to explain the comparative experiments between methods. According to network depth and width, the YOLOv5 family is divided into YOLOv5n (0.33*0.25), YOLOv5s (0.33*0.50), YOLOv5m (0.67*0.75), YOLOv5l (1.0*1.0), and YOLOv5x (1.33*1.25). These variants differ only in network size; their network structure is exactly the same. As the network size increases, the model's parameters and FLOPs grow dramatically, resulting in extremely slow training and detection, so the smaller parameter count and FLOPs of our model are not caused by our improvements. We chose YOLOv5s, which balances speed and precision, as the baseline model, and the final accuracy improvement is also reported against YOLOv5s. The improved YOLOv5s now matches the accuracy of the larger YOLOv5m while remaining lightweight. We previously applied our improvements to YOLOv5m, but the gains reached a bottleneck and were not obvious compared with YOLOv5s, so we gave up on using a larger network. Please understand.
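As a rough illustration of the scaling described above, one can use a first-order cost model (our own assumption, not figures from the paper): convolutional FLOPs grow roughly with the square of the width multiplier, and layer count grows with the depth multiplier, so relative cost is approximately depth_mult * width_mult**2.

```python
# First-order cost model for YOLOv5 compound scaling (illustrative
# assumption): conv FLOPs scale with the square of the width
# multiplier, and the number of layers scales with the depth one.
variants = {
    "n": (0.33, 0.25),
    "s": (0.33, 0.50),
    "m": (0.67, 0.75),
    "l": (1.00, 1.00),
    "x": (1.33, 1.25),
}

def relative_cost(depth_mult, width_mult):
    return depth_mult * width_mult ** 2

base = relative_cost(*variants["s"])
for name, (d, w) in variants.items():
    print(f"YOLOv5{name}: ~{relative_cost(d, w) / base:.1f}x the cost of YOLOv5s")
```

Under this simplified model YOLOv5x costs well over ten times YOLOv5s; measured FLOP ratios are smaller because not every layer scales with both multipliers, but the point stands that the FLOP gap in the comparison table reflects variant size, not the proposed modifications.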
Best regards,
Mr. Ang Li
Reviewer 3 Report
A multi-scale traffic object detection algorithm for road scenes based on improved YOLOv5
“Traffic object detection in road scenes is a key part of intelligent transport systems and autonomous driving, and achieving the detection of tiny traffic objects has always been a difficult task.” Kindly rephrase it.
The Introduction fails to motivate readers: are the problems, the motivation for the work, and the challenges discussed?
Kindly rewrite the Introduction.
Contributions part needs improvement:
“A more lightweight upsampling operator, CARAFE, is used in the feature fusion stage, which effectively aggregates the image contextual information, increases the image perceptual field, and takes up only a small amount of computational resources.”
This can be written as:
A new CARAFE lightweight upsampling operator is designed and used for fusion, which improves ……
Similarly other contributions can also be rewritten.
Kindly cite the following papers in the work:
“EnsembleNet: A hybrid approach for vehicle detection and estimation of traffic density based on faster R-CNN and YOLO models”
SDN-Based Traffic Monitoring in Data Center Network Using Floodlight Controller
A confusion matrix needs to be presented.
The dataset used is missing.
Kindly describe the dataset.
Author Response
Dear Reviewer,
Thank you very much for your comments and suggestions. They are all valuable and very helpful for revising and improving our paper, and they provide important guidance for our research. We have studied the comments carefully and made corrections that we hope meet with your approval. Revised portions are marked in red in the paper. The main corrections and our responses to the reviewer's comments are as follows:
Responses to the reviewer's comments:
- We have made language improvements to the part of the Abstract you pointed out.
- We have improved the Introduction, describing in detail the state of research on object detection in traffic scenes and the deficiencies of traditional object detection methods, of CNN-based object detection methods, and of the original YOLO detection model, and adding relevant references, including the papers you suggested citing.
- We have made language improvements to the author contributions you pointed out.
- The confusion matrix has been added.
- We refined the description of the datasets we built in Section 3.1.
Best regards,
Mr. Ang Li
Round 2
Reviewer 2 Report
The authors improved the paper based on the reviewer comments.
Reviewer 3 Report
My comments have been addressed.