Article
Peer-Review Record

Object Detection of Road Assets Using Transformer-Based YOLOX with Feature Pyramid Decoder on Thai Highway Panorama

by Teerapong Panboonyuen 1,*, Sittinun Thongbai 2, Weerachai Wongweeranimit 3, Phisan Santitamnont 3,4, Kittiwan Suphan 2 and Chaiyut Charoenphon 4,*
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Information 2022, 13(1), 5; https://doi.org/10.3390/info13010005
Submission received: 15 November 2021 / Revised: 17 December 2021 / Accepted: 22 December 2021 / Published: 25 December 2021
(This article belongs to the Special Issue Deep Learning and Signal Processing)

Round 1

Reviewer 1 Report

This research is interesting and has good application value. Regarding this manuscript, I have the following comments:

  1. In Fig. 4, I think it would be better to visually show the multi-scale information or the features obtained by the FPN.
  2. On page 2, there are YOLOv5S, YOLOv5M, and YOLOv5L, but in Section 5 there are YOLO-V5-L, YOLO-V5-S, and YOLO-V5-M; are they the same? Besides, these three methods should be described in detail, including their similarities and differences.
  3. Besides YOLO, there are also other methods that address the multi-scale object detection problem, such as anchor-based Faster R-CNN or anchor-free CenterNet. I think more comparisons should be performed in the experimental part.
  4. For YOLO, the model has to be trained on samples, so how many training samples are there? How were they obtained, and how were they labeled? Detailed information about the training and test samples should be described.
  5. For Fig. 7, it seems that the test samples come from videos. So, are the test samples continuous or sampled? If sampled, how was the sampling ratio set?
  6. How about the detection time for one frame? Perhaps the time cost should be listed.

Author Response

We appreciate your detailed comments and suggestions. You have identified some important points, which we hope to clarify and address here and in our revision via the Google Doc link below.

Responding to reviewers' comments (Panboonyuen, Information 2021): https://docs.google.com/document/d/1u3S_GWk9fIvWb5SgNyjaLQnt0KjZq2vgzYGbUPGVR8M/edit?usp=sharing

Revised Version (Paper) with highlights: https://drive.google.com/drive/folders/12czs2UAzZHd_c586UwfNTsX6DrHrAiag?usp=sharing

Thank you very much for the constructive feedback. 

Q1: This research is interesting and has good application value. Regarding this manuscript, I have the following comments: In Fig. 4, I think it would be better to visually show the multi-scale information or the features obtained by the FPN.
We sincerely appreciate your additional feedback.
A1: We have revised Fig. 4 by splitting it into two figures: Fig. 4 now shows the multi-scale information, and Fig. 5 shows the features obtained by the FPN.

Q2: On page 2, there are YOLOv5S, YOLOv5M, and YOLOv5L, but in Section 5 there are YOLO-V5-L, YOLO-V5-S, and YOLO-V5-M; are they the same? Besides, these three methods should be described in detail, including their similarities and differences.
A2: Yes, they are the same family of models, just in different sizes (S, M, and L). We have revised the text to describe these methods in detail, including their similarities and differences.


Q3: Besides YOLO, there are also other methods that address the multi-scale object detection problem, such as anchor-based Faster R-CNN or anchor-free CenterNet. I think more comparisons should be performed in the experimental part.
A3: Thank you for your suggestion to compare against these baselines (anchor-based Faster R-CNN and anchor-free CenterNet). We have run a new experiment and added both methods to our baselines in Table 1. We agree that seeing the differences with more models is very interesting.

Q4: For YOLO, the model has to be trained on samples, so how many training samples are there? How were they obtained, and how were they labeled? Detailed information about the training and test samples should be described.
A4: Thank you very much for your helpful feedback, and sorry for any lack of clarity. As mentioned in the general rebuttal, we agree that the number of training samples was unclear, so we have created and added Table 1 (numbers of training, validation, and testing sets) in Section 5.

Q5: For Fig. 7, it seems that the test samples come from videos. So, are the test samples continuous or sampled? If sampled, how was the sampling ratio set?
A5: The test samples are continuous.

Q6: How about the detection time for one frame? Perhaps the time cost should be listed.
A6: We fully agree with your comment. We have added the per-frame detection time to Table 1. To calculate frames per second, we take the number of processed frames and divide it by the seconds elapsed.
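
For illustration, a minimal sketch of that calculation (the `measure_fps` helper and the `detector` callable are hypothetical names for this example, not from the paper):

```python
import time

def measure_fps(detector, frames):
    """Estimate frames per second: frames processed divided by seconds elapsed."""
    start = time.perf_counter()
    for frame in frames:
        detector(frame)  # hypothetical per-frame detection call
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed
```

The same ratio applies whether the frames come from a live stream or a pre-recorded video, as long as only the detection time is measured.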

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper proposed an approach for object detection using a combination of a vision transformer, a feature pyramid network, and YOLOX. The authors have used this combination to improve the performance of a four-class classification problem. The results are compared to those from the YOLOv5 M, S, and L models. Using a larger number of parameters, the performance is better than that of these models. It is not clear what the cost of the larger number of parameters is in terms of computation or processing time. Also, for fairness of comparison, would it be possible to use the same number of parameters for both the proposed and the SOTA methods?

 

Additionally I have the following suggestions/comments:

 

    • It is advised to rewrite the abstract and conclusion, especially paying attention to clarity of thought and language.
    • It is required to rephrase the following sentences because they are difficult to read:

"Identifying road asset objects in Thailand highway monitoring image sequences is essential for intelligent traffic monitoring and administration of the highway."

"A more distant road surface may usually be evaluated from an eye-observing angle."

"At this viewing angle, the vehicle’s object size varies enormously, and the detection accuracy of a small item far away from the road is low."

"Among modern Convolutional Neural Networks (ConvNet/CNNs), there are many techniques, e.g., dynamic heads with attentions [1], dual attention [2], self-attention [3] have gained increasing attention due to their capability".

It may also be necessary to rewrite this sentence: "Still, all suffer from accuracy performance issues."

    • The following statements do not connect with each other:

With the widespread use of traffic surveillance cameras, an extensive library of traffic video footage has been available for examination. A more distant road surface may usually be evaluated from an eye-observing angle.

    • “In the face of complicated camera scenarios, it’s critical to address and implement the difficulties listed above successfully.” Where is the list of difficulties? It was not mentioned earlier and is therefore confusing. The same applies to this statement: “We will focus on the challenges mentioned earlier in this post and provide a suitable solution.”
    • Write a sentence explaining what multi-object tracking and asset object counting are, or perhaps add references.
    • What is meant by “exact object detection”?
    • First define the abbreviation, then use the abbreviated form:

For example, “Conv” blocks are used before convolution (Conv) is defined at its first instance. The same is true for SOTA.

    • You mentioned that mobile mapping systems and artificial intelligence were applied to collect Thailand road data. It would be more interesting if there were further explanation of how the information was collected.
    • It is mentioned that 2D images are down-sampled with H x W x 3 dimensions, but this is not stated clearly. Does H stand for height and W for width? I think 3 is the channel size. It is advised to state this clearly; I suggest rewriting such information.
    • It seems that the mathematical expressions are obtained from elsewhere. If this is true, a reference is required; otherwise, the authors should introduce these expressions properly from basics.
    • Figure 6 is not readable. Labels are too small.
    • It is better if the performance is compared using other evaluation measures, namely F1-measure and recall.

Author Response

We appreciate your detailed comments and suggestions. You have identified some important points, which we hope to clarify and address here and in our revision via the Google Doc link below.

Responding to reviewers' comments (Panboonyuen, Information 2021): https://docs.google.com/document/d/1u3S_GWk9fIvWb5SgNyjaLQnt0KjZq2vgzYGbUPGVR8M/edit?usp=sharing

Revised Version (Paper) with highlights: https://drive.google.com/drive/folders/12czs2UAzZHd_c586UwfNTsX6DrHrAiag?usp=sharing

Thank you very much for the constructive feedback. 

Q1: This paper proposed an approach for object detection using a combination of a vision transformer, a feature pyramid network, and YOLOX. The authors have used this combination to improve the performance of a four-class classification problem. The results are compared to those from the YOLOv5 M, S, and L models. Using a larger number of parameters, the performance is better than that of these models. It is not clear what the cost of the larger number of parameters is in terms of computation or processing time. Also, for fairness of comparison, would it be possible to use the same number of parameters for both the proposed and the SOTA methods?
A1: We appreciate your detailed comments and suggestions; you have identified some essential points, which we hope to clarify and address here and in our revision. Sorry for any lack of clarity. As mentioned in the general rebuttal, we agree that this was unclear, so we have created and added Table 1 (numbers of training, validation, and testing sets) in Section 5 and have also revised our baselines in Table 1. We agree that seeing the differences with more models would be very interesting.


Q2: It is advised to rewrite the abstract and conclusion, especially paying attention to clarity of thought and language.
A2: We have strictly followed all suggestions. If there are any further improvements needed, please let us know. Moreover, we have fully revised our paper with a native English speaker to improve the paper's writing quality significantly.

Q3: What is meant by “exact object detection”?
A3: We have clarified this wording in the Introduction section.

Q4: You mentioned that mobile mapping systems and artificial intelligence were applied to collect Thailand road data. It would be more interesting if there were further explanation of how the information was collected.
A4: We have revised this part by adding an explanation of how the data were collected to Section 3 (Road Asset Data Set).

Q5: It is mentioned that 2D images are down-sampled with H x W x 3 dimensions, but this is not stated clearly. Does H stand for height and W for width? I think 3 is the channel size. It is advised to state this clearly; I suggest rewriting such information.
A5: Yes, H stands for height, W stands for width, and 3 is the number of channels (bands).
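
For illustration, a minimal sketch of that H x W x 3 convention (the file name and the use of NumPy/Pillow here are assumptions for the example, not from the paper):

```python
import numpy as np
from PIL import Image

# A 2D RGB image follows the H x W x 3 convention:
# H = height in pixels, W = width in pixels, 3 = color channels (bands).
image = np.asarray(Image.open("panorama_frame.jpg").convert("RGB"))
height, width, channels = image.shape
print(height, width, channels)  # e.g. 2048 4096 3
```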

Q6: It seems that the mathematical expressions are obtained from elsewhere. If this is true, a reference is required; otherwise, the authors should introduce these expressions properly from basics.
A6: We agree with the reviewer that a reference is required. We have revised the text to read: "Our Transformer-based YOLOX follows a sequence-to-sequence vector with transformers from [36]."

** Zheng, Sixiao, et al. "Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

Q7: Figure 6 is not readable. Labels are too small.
A7: We fully agree with your comment on the figures, specifically the text in Figure 6. We have redrawn the figure and improved its quality, and the text in Figure 6 is now easy to read.

Q8: It is better if the performance is compared using other evaluation measures, namely F1-measure and recall.
A8: We have added the F1 score to Table 1 of our experiments.
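
For reference, a minimal sketch of how these measures are computed from detection counts (the example counts are placeholders, not results from the paper):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(tp=90, fp=10, fn=20))  # illustrative counts only
```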

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

All my concerns have been addressed; I think this manuscript can be accepted.

Reviewer 2 Report

The article has improved after revision. Before final submission, please pay some more attention to the language. Otherwise, the article is suitable for publication.
