Review
Peer-Review Record

A Systematic Survey of Transformer-Based 3D Object Detection for Autonomous Driving: Methods, Challenges and Trends

by Minling Zhu *, Yadong Gong, Chunwei Tian and Zuyuan Zhu
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 11 June 2024 / Revised: 7 August 2024 / Accepted: 20 August 2024 / Published: 22 August 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The authors of the article titled "A Systematic Survey of Transformer-Based 3D Object Detection for Autonomous Driving: Methods, Challenges and Trends" provide a review of the technical nuances of Transformer architectures and their practical applications across diverse scenarios in autonomous driving technology. While the topic is of great interest, there are several issues the authors have to address before the paper is suitable for publication.

Major concerns:

1. Normally in a systematic review, the authors opt for the PRISMA guidelines, with an entire section dedicated to the search strategy, where the authors explain how they reached the final set of papers they analyse and discuss, as well as the time frame of the analysis. I would strongly advise the authors to include this analysis, since it is of paramount importance to guide the readers.

2. In Section 2, Subsection 2.1 on data sources, the paragraph where the authors mention sensor fusion (e.g., lines 121 and 122: "Multi-modal 3D object detection methods, through feature fusion, achieve superior performance by leveraging features from various sensors") should be supported by references (the same applies to Table 1). Several works support this statement; for example, see:

"Alaba SY, Gurbuz AC, Ball JE. Emerging Trends in Autonomous Vehicle Perception: Multimodal Fusion for 3D Object Detection. World Electric Vehicle Journal. 2024; 15(1):20. https://doi.org/10.3390/wevj15010020"

"M. Oliveira, R. Cerqueira, J. R. Pinto, J. Fonseca and L. F. Teixeira, "Multimodal PointPillars for Efficient Object Detection in Autonomous Vehicles," in IEEE Transactions on Intelligent Vehicles, doi: 10.1109/TIV.2024.3409409"

"C. Pereira, R. P. M. Cruz, J. N. D. Fernandes, J. R. Pinto and J. S. Cardoso, "Weather and Meteorological Optical Range Classification for Autonomous Driving," in IEEE Transactions on Intelligent Vehicles, doi: 10.1109/TIV.2024.3387113"

"Chitta, K., Prakash, A., Jaeger, B., Yu, Z., Renz, K., & Geiger, A. (2022). Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence."

3. Figures 2 and 3 belong to other works and are not referenced. This is a major flaw that should be corrected by re-drawing the figures yourselves and citing them as based on the original papers.

4. Subsection 2.3, concerning the datasets, should be better organized; a table would be appreciated. Moreover, more information should be included, such as the features of the sensors (e.g., a Velodyne with 64 lasers). Also, the search should be more robust, as some datasets are missing here (e.g., Cityscapes, SYNTHIA, ApolloScape). An illustration of the kind of table intended is given below.
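As an illustrative sketch of such a table (sensor and annotation details recalled from the public dataset papers; they should be verified against the originals):

```
Dataset      LiDAR sensor(s)                         Cameras             3D annotations
KITTI        Velodyne HDL-64E (64 beams)             2 stereo pairs      3 evaluated classes (car, pedestrian, cyclist)
nuScenes     1 spinning LiDAR (32 beams), 5 radars   6 surround cameras  23 annotated / 10 detection classes
Waymo Open   1 mid-range + 4 short-range LiDARs      5 cameras           4 classes (vehicle, pedestrian, cyclist, sign)
```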

5. Subsection 2.4, on metrics, should be improved. The formula for mAP is missing, and mTP is composed of five metrics that should be properly described instead of just mentioned; see the sketch below.
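For reference, a sketch of the standard nuScenes-style definitions the authors could include (these follow the public benchmark, not the paper under review):

```latex
% mAP averages the average precision over the class set C and the
% center-distance matching thresholds D = {0.5, 1, 2, 4} meters:
\[
\mathrm{mAP} = \frac{1}{|\mathbb{C}|\,|\mathbb{D}|}
  \sum_{c \in \mathbb{C}} \sum_{d \in \mathbb{D}} \mathrm{AP}_{c,d}
\]
% The five true-positive (mTP) metrics are the mean Average Translation,
% Scale, Orientation, Velocity and Attribute Errors (mATE, mASE, mAOE,
% mAVE, mAAE); together with mAP they form the nuScenes Detection Score:
\[
\mathrm{NDS} = \frac{1}{10}\Big[\, 5\,\mathrm{mAP} +
  \sum_{\mathrm{mTP} \in \mathbb{TP}} \big(1 - \min(1, \mathrm{mTP})\big) \Big]
\]
```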

6. In Figure 4 of Section 3, the models within each category do not entirely match the text. For example, you mention PointNet++ and PointPillars in Section 3.2, but I cannot find them in that category in Figure 4. Consistency here needs to be improved.

7. In Table 3, I would like the authors to explain why they present the detection performance on the validation set and not on the test set, given that some models report results on the test set.

8. In Subsection 4.1.3, on inference performance, the authors should also mention the importance of adopting edge devices for real-time inference and the efforts that already exist in the state of the art.

9. In Subsection 4.2.1, on collaborative perception, the authors should enrich the section by adding more references. Here are some references the authors could also draw on:

"Han, Y., Zhang, H., Li, H., Jin, Y., Lang, C., & Li, Y. (2023). Collaborative perception in autonomous driving: Methods, datasets, and challenges. IEEE Intelligent Transportation Systems Magazine."

"Malik S, Khan MJ, Khan MA, El-Sayed H. Collaborative Perception—The Missing Piece in Realizing Fully Autonomous Driving. Sensors. 2023; 23(18):7854. https://doi.org/10.3390/s23187854"

10. The final section, the proposal of a novel 3D object detection framework, is highly questionable, as there are already several proposals of this kind. The authors should check the following references:

"Yang, J., Desai, K., Packer, C., Bhatia, H., Rhinehart, N., McAllister, R., & Gonzalez, J. (2024). CARFF: Conditional Auto-encoded Radiance Field for 3D Scene Forecasting. arXiv preprint arXiv:2401.18075."

"Hu, B., Huang, J., Liu, Y., Tai, Y. W., & Tang, C. K. (2023). Nerf-rpn: A general framework for object detection in nerfs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 23528-23538)."

"Kerr, J., Kim, C. M., Goldberg, K., Kanazawa, A., & Tancik, M. (2023). Lerf: Language embedded radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 19729-19739)."

"Sun, J., Xu, Y., Ding, M., Yi, H., Wang, C., Wang, J., ... & Schwager, M. (2023). NeRF-Loc: Transformer-based object localization within neural radiance fields. IEEE Robotics and Automation Letters."

I would strongly recommend rewriting this part of the paper where the authors claim to propose a novel framework, grounding it in the current state of the art and acknowledging that the trends in using NeRFs already follow these directions.

Minor concerns:

1. In the introduction, consider the following reference:

"Mao, J., Shi, S., Wang, X., & Li, H. (2023). 3D object detection for autonomous driving: A comprehensive survey. International Journal of Computer Vision131(8), 1909-1963."

2. In the introduction, "autonomous driving" could be abbreviated as AD, since the term is used often in the manuscript.

3. Figure 1 does not respect the margins (in a printed version, part of the figure would be cut).

4. At the beginning of Subsection 2.1, I suspect something is missing: starting with "Table 1 illustrates" does not seem appropriate. In the same line, the authors mention "... advantages and disadvantages of these common sensors". Which sensors?

5. Table 1 is a bit confusing; I found it hard to understand the row division of each entry. Bullet points in the advantages, disadvantages, and application-in-autonomous-driving columns would aid understanding.

6. In line 450, "... as shown in Figure ??": which figure?

7. Often throughout the text, the authors write "... Chen et al. and Jiang et al. ..." (e.g., line 517) without any reference. There are more such cases; the authors should correct this.

8. The paper is a review-type article. The article-type label (the first item, even before the title) should be corrected from "Article" to "Review".


Comments on the Quality of English Language

Minor changes to the English should be considered. For example, avoid phrasings such as "Qi et al. think that ..." (line 491); it does not sound scientifically appropriate.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The paper is very well written and comprehensively reviews and analyzes various aspects of Transformer-based 3D object detection for autonomous driving. It provides a thorough overview of input data and methodologies and introduces a novel taxonomy for Transformer-based approaches.

However, several issues must be addressed before the paper can be considered for publication. If the following concerns are adequately resolved, this reviewer believes that the paper's contributions will be significant for the field of Transformer-based 3D object detection.

Figure 4: As the core diagram representing the overview of Transformer-based 3D object detection methods, Figure 4 is somewhat cluttered and confusing. It is recommended to reorganize and reconstruct this figure for greater clarity.

Section 3.1: The Image-based 3D object detection methods are categorized into three types, but the section does not include a discussion on models based on stereo images. It is suggested to either merge this category into multi-view images or provide a detailed analysis of stereo image-based methods.

Section 3.1.2: The discussion mentions that both LSS- and Transformer-based methods construct the BEV view from surround-view images. The BEV-view transformation is a critical component of the entire BEV paradigm; therefore, it is recommended to expand the analysis of the advantages and disadvantages of these two approaches, for example along the lines sketched below.
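To make the contrast concrete, here is a minimal sketch of the two paradigms (purely illustrative; the shapes, module names, and simplifications are assumptions and do not come from the paper under review or from the original LSS/BEVFormer code):

```python
import torch
import torch.nn as nn

B, N, C, H, W = 1, 6, 64, 16, 44   # batch, cameras, channels, feature map H/W
D = 32                             # discrete depth bins (LSS-style)
n_query = 50 * 50                  # flattened 50 x 50 BEV grid

feats = torch.randn(B, N, C, H, W)  # per-camera image features

# LSS-style: predict a per-pixel depth distribution, then "lift" each pixel's
# feature into a camera frustum via an outer product over the depth bins.
depth_head = nn.Conv2d(C, D, kernel_size=1)
depth_prob = depth_head(feats.flatten(0, 1)).softmax(dim=1)       # (B*N, D, H, W)
frustum = depth_prob.unsqueeze(2) * feats.flatten(0, 1).unsqueeze(1)
# frustum: (B*N, D, C, H, W); a real pipeline would now "splat" these frustum
# points onto the BEV grid with camera intrinsics/extrinsics (voxel pooling),
# so BEV quality depends directly on the predicted depth distribution.

# Transformer-style (BEVFormer-like): learnable BEV queries attend to image
# features via cross-attention; geometry enters through positional encodings
# or sampling offsets rather than an explicit depth estimate.
bev_queries = nn.Parameter(torch.randn(n_query, B, C))
cross_attn = nn.MultiheadAttention(embed_dim=C, num_heads=4)
img_tokens = feats.permute(1, 3, 4, 0, 2).reshape(N * H * W, B, C)
bev_feat, _ = cross_attn(bev_queries, img_tokens, img_tokens)     # (n_query, B, C)

print(frustum.shape, bev_feat.shape)
```

The trade-off worth analysing is that the LSS branch makes depth errors explicit and compounding, while the attention branch spreads the geometric burden over learned queries at a higher memory and compute cost.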

Section 4.3: The discussion on the novelty of the Transformer model requires revision. Firstly, the high-level information presented in Subsection 4.3.1 does not contribute significant improvements and increases the computational burden. For TPVFormer, the success should be attributed more to the 3D occupancy representation, as replicating TPVFormer's benefits from height information is challenging.

Section 5: The newly proposed framework based on NeRF and Transformer lacks distinctiveness. Please elucidate the differences and advantages of using NeRF for feature extraction compared with traditional ViT/ResNet backbones. It is recommended to substantiate NeRF's advantages through more comprehensive metrics.

English Language and Style: The language and style are generally fine, with only minor spell checks required.

CONCLUSIONS: The conclusion section needs to be expanded as it currently feels like an afterthought. The authors are advised to highlight the important findings and provide a more thorough discussion of the implications of their work.

By addressing these points, the paper will be substantially improved, making its contributions clearer and more impactful.

Comments on the Quality of English Language

I am not able to evaluate the quality of the English language.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

A very relevant and well-written survey of the literature landscape on Transformer-based 3D object detection for autonomous driving. The contribution begins with an overview of the commonly used sensor modalities and a refresher on the operating principle of the Transformer architecture. The text then delves into the different 3D object detection methods based on single-view / multi-view cameras as well as lidar point clouds, including a summary of the results achieved by different models on the nuScenes dataset. The article ends with a list of challenges, trends, and a suggestion by the authors for a future research direction.

While the contribution provides sufficient breadth and depth on the discussed topics, the following points could make it more useful to the community:

- Table 3: Sort the methods by one of the metrics. Add a column noting the computational complexity for training / inference for each method (when not exactly known from the original paper, include an estimate).

- While lidar currently plays an important role in self-driving, imaging radar is poised to replace it due to its weather robustness, long range, and low cost. A few comments on the possibility of camera-radar fusion and the corresponding literature would be very interesting.

Minor problems:

- Line 109: Please review the sentence about LiDAR.

- Line 113: "from the Table 1" --> "from Table 1"

- Line 258: Remove "commonly".

- Line 271: "pedestrian, pedestrian" --> "pedestrian"

- Line 450: Figure ??

- Line 493: "Method alignes" --> "The method aligns"

- Line 526: "pole representation" --> "polar representation"

- Line 594: "a object" --> "an object"

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

After careful review, the authors have addressed a great part of the comments. However, some points still need to be improved:

1. The search-strategy part needs to be improved. The authors improved the clarity of the paper by adding a screening procedure; however, some information is still missing, such as which databases were used to gather the papers and on which criteria the final selection was based. Inside Section 3, I insist that there should be a dedicated section called "Search Strategy". Moreover, the quality of the figure should be improved by using a vector format such as SVG; tools such as draw.io allow figures to be saved in vector formats.


2. Figures 2 and 3 still need to be referenced. These figures are based on other papers and should be referenced.


3. Regarding the datasets (Subsection 2.3): although the descriptions are improved, if the authors decide to give a brief description of the datasets, all of them should be covered.


4. Considering Table 3.1, I apologize if I was not clear enough. The correct approach here is to present the test results, not the validation results. The authors still report the validation results from those papers, but the test results should be reported instead; a quick review of some of the papers shows that they report test results as well. If, by any chance, no test results are available and only validation results exist, the authors can include them, but with a footnote stating that they were obtained on the validation set. The table caption, however, should refer to test results, and the reported results should come from the test set, not the validation set.

5. In Section 4.1.3, some references on inference performance are missing.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

No further comments.

Author Response

Dear Reviewer,

Thank you for reviewing our manuscript and providing valuable feedback. We are pleased to hear that you have no further comments. Your insights have significantly contributed to improving our paper.

Thank you again for your time and assistance.

Best regards,
Yadong Gong

Round 3

Reviewer 1 Report

Comments and Suggestions for Authors

The authors addressed all the comments well. No further revisions to add.
