An Improved Lightweight Network Using Attentive Feature Aggregation for Object Detection in Autonomous Driving
Abstract
1. Introduction
- A new lightweight object detector that can be trained on a single-GPU system.
- The trained weights are then deployed on the NXP BlueBox 2.0 [16] for real-time vehicular object detection and frames-per-second (FPS) evaluation.
- Section 2 provides an overview of object detection models and related work.
- Section 3 describes the methodology of the proposed MobDet3 object detection network.
- Section 4 provides an overview of NXP BlueBox 2.0, an advanced automotive development platform used in the experiments.
- Section 5 evaluates the MobDet3 object detector by benchmarking it on different datasets and on NXP BlueBox 2.0.
- Section 6 concludes the paper.
2. Background and Literature Review
2.1. Environmental Awareness for Autonomous Vehicles: The Role of Visual Perception
2.2. Advancing Autonomous Driving: The Significance of Real-Time Object Detection
2.3. Exploring the Inner Workings of Object Detectors
- Neck (feature aggregator):
- Head (predicts bounding boxes and class labels):
3. MobDet3
3.1. Backbone
MobileNetV3 CNN
- 1. Depthwise Convolution: a lightweight convolution that spatially filters each input channel independently, represented mathematically by Equation (1) and graphically in Figure 5.
- 2. Pointwise Convolution: a more substantial 1 × 1 convolution that recombines the filtered channels to generate new features, represented mathematically by Equation (2) and graphically in Figure 6 (a minimal sketch of the combined operation follows this list).
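To make the two-stage factorization concrete, below is a minimal PyTorch sketch of a depthwise-separable block. It is illustrative only: the layer names, the single BatchNorm placement, and the Hardswish activation are simplifications of the actual MobileNetV3 block, which normalizes and activates after both stages and adds squeeze-and-excitation in some layers.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 spatial filtering (Equation (1)) followed by a
    1x1 pointwise feature-mixing convolution (Equation (2))."""

    def __init__(self, in_ch: int, out_ch: int, stride: int = 1):
        super().__init__()
        # groups=in_ch gives each input channel its own 3x3 filter.
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        # The 1x1 convolution recombines channels to generate new features.
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.Hardswish()  # MobileNetV3 favors hard-swish in deeper stages

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Example: halve the spatial resolution of a 224x224 RGB input, 3 -> 16 channels.
block = DepthwiseSeparableConv(3, 16, stride=2)
out = block(torch.randn(1, 3, 224, 224))  # shape: (1, 16, 112, 112)
```

Relative to a standard k × k convolution, the separable form costs roughly (1/C_out + 1/k²) of the multiply-accumulate operations (k = 3 here), which is what makes MobileNet-style backbones attractive for embedded targets.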
3.2. Neck
3.2.1. FPN and PAN
3.2.2. SPP
3.3. Head
3.4. Bounding Box Regression
4. NXP BlueBox 2.0 with RTMaps: A Development Platform for Automotive High-Performance Computing (AHPC)
4.1. A Brief Overview of NXP BlueBox 2.0
4.1.1. Computer Vision Processing with S32V234 (S32V)
4.1.2. High Performance Computing with LS2084A (LS2)
4.1.3. Radar Information Processing with S32R274 (S32R)
4.2. Real-Time Multi-Sensor Applications (RTMaps)
5. An Analysis of Experimental Results
5.1. Computer Vision Datasets
5.1.1. BDD100K Dataset
5.1.2. Microsoft COCO Dataset
5.2. Evaluation Metrics for the Model
- The overlap between the predicted bounding box and the ground-truth bounding box is computed with GIoU, as defined in (5). When the computed GIoU value exceeds a predetermined threshold, the detector is considered to have successfully detected the target object in the image (a minimal GIoU sketch follows this list).
- Once an object has been confirmed as detected in step 1, the class labels of the predicted and ground-truth bounding boxes are matched.
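For reference, here is a minimal PyTorch sketch of the GIoU computation introduced by Rezatofighi et al.; it assumes valid corner-format (x1, y1, x2, y2) boxes and illustrates the metric rather than reproducing the paper's Equation (5) verbatim. torchvision.ops.generalized_box_iou provides a batched, pairwise equivalent.

```python
import torch

def giou(box_a: torch.Tensor, box_b: torch.Tensor) -> torch.Tensor:
    """Elementwise GIoU for boxes given as (x1, y1, x2, y2); returns values in (-1, 1]."""
    # Intersection rectangle (clamped so disjoint boxes contribute zero area).
    ix1 = torch.max(box_a[..., 0], box_b[..., 0])
    iy1 = torch.max(box_a[..., 1], box_b[..., 1])
    ix2 = torch.min(box_a[..., 2], box_b[..., 2])
    iy2 = torch.min(box_a[..., 3], box_b[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)

    area_a = (box_a[..., 2] - box_a[..., 0]) * (box_a[..., 3] - box_a[..., 1])
    area_b = (box_b[..., 2] - box_b[..., 0]) * (box_b[..., 3] - box_b[..., 1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box C; the GIoU penalty grows as the boxes drift apart.
    cx1 = torch.min(box_a[..., 0], box_b[..., 0])
    cy1 = torch.min(box_a[..., 1], box_b[..., 1])
    cx2 = torch.max(box_a[..., 2], box_b[..., 2])
    cy2 = torch.max(box_a[..., 3], box_b[..., 3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    return iou - (c_area - union) / c_area

# A prediction shifted slightly off its ground truth still scores close to 1.
pred = torch.tensor([10.0, 10.0, 50.0, 50.0])
gt   = torch.tensor([12.0, 12.0, 52.0, 52.0])
print(giou(pred, gt))  # ~0.82
```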
5.3. Experimental Configuration
- 3× NVIDIA Tesla V100 GPUs
- 20-core Intel Xeon Gold 6248 CPU
- 1.92 TB Solid-State Drive (SSD)
- 768 GB of RAM
- PyTorch 2.0.0
- CUDA 11.8
- Python 3.9
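The snippet below is a minimal, non-authoritative sanity check (assuming the PyTorch/CUDA stack listed above) for verifying that the runtime matches this configuration before launching training:

```python
import torch

print(torch.__version__)          # expected: 2.0.0
print(torch.version.cuda)         # expected: 11.8
print(torch.cuda.is_available())  # True once the V100 GPUs are visible
print(torch.cuda.device_count())  # number of GPUs PyTorch can use
```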
5.4. Experimental Results
5.5. Limitations of the Proposed Methodology
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Pang, Y.; Cao, J. Deep Learning in Object Detection. In Deep Learning in Object Detection and Recognition; Jiang, X., Hadid, A., Pang, Y., Granger, E., Feng, X., Eds.; Springer: Singapore, 2019; pp. 19–57. ISBN 978-981-10-5152-4.
2. Rosenblatt, F. The Perceptron—A Perceiving and Recognizing Automaton; Cornell Aeronautical Laboratory: Ithaca, NY, USA, 1957.
3. Berners-Lee, C.M. Cybernetics and Forecasting. Nature 1968, 219, 202–203.
4. Aizenberg, I.N.; Aizenberg, N.N.; Vandewalle, J. Multi-Valued and Universal Binary Neurons; Springer US: Boston, MA, USA, 2000; ISBN 978-1-4419-4978-3.
5. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507.
6. Bengio, Y.; Senecal, J.-S. Adaptive Importance Sampling to Accelerate Training of a Neural Probabilistic Language Model. IEEE Trans. Neural Netw. 2008, 19, 713–722.
7. Ranzato, M.A.; Poultney, C.; Chopra, S.; Cun, Y. Efficient Learning of Sparse Representations with an Energy-Based Model. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2006; Volume 19.
8. Yu, D.; Deng, L. Deep Learning and Its Applications to Signal and Information Processing [Exploratory DSP]. IEEE Signal Process. Mag. 2011, 28, 145–154.
9. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
10. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
11. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
13. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
14. Yu, F.; Chen, H.; Wang, X.; Xian, W.; Chen, Y.; Liu, F.; Madhavan, V.; Darrell, T. BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020.
15. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014.
16. Cureton, C.; Douglas, M. Bluebox Deep Dive—NXP’s AD Processing Platform; NXP: Eindhoven, The Netherlands, 2019.
17. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014.
18. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Volume 9905, pp. 21–37.
19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
20. Dai, J.; Li, Y.; He, K.; Sun, J. R-FCN: Object Detection via Region-Based Fully Convolutional Networks. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016.
21. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
22. Kalgaonkar, P.; El-Sharkawy, M. CondenseNeXt: An Ultra-Efficient Deep Neural Network for Embedded Systems. In Proceedings of the 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 27–30 January 2021; pp. 0524–0528.
23. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
24. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
25. Du, X.; Lin, T.-Y.; Jin, P.; Ghiasi, G.; Tan, M.; Cui, Y.; Le, Q.V.; Song, X. SpineNet: Learning Scale-Permuted Backbone for Recognition and Localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020.
26. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer International Publishing: Cham, Switzerland, 2014; pp. 346–361.
27. Liu, S.; Huang, D.; Wang, Y. Receptive Field Block Net for Accurate and Fast Object Detection. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 404–419.
28. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
29. Li, H.; Xiong, P.; An, J.; Wang, L. Pyramid Attention Network for Semantic Segmentation. arXiv 2018, arXiv:1805.10180.
30. Liu, S.; Huang, D.; Wang, Y. Learning Spatial Fusion for Single-Shot Object Detection. arXiv 2019, arXiv:1911.09516.
31. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. CenterNet: Keypoint Triplets for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019.
32. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017.
33. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015.
34. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into High Quality Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017.
35. YOLOv5 Documentation. Available online: https://docs.ultralytics.com/ (accessed on 23 September 2022).
36. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
37. Yang, T.-J.; Howard, A.; Chen, B.; Zhang, X.; Go, A.; Sandler, M.; Sze, V.; Adam, H. NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
38. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018.
40. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2011–2023.
41. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 3–19.
42. RTMaps Development and Validation Environment for Multisensor Applications (ADAS/AD, Robotics, etc.). Available online: https://www.dspace.com/shared/data/pdf/2021/dSPACE-RTMaps_Product-Information_2021-12_E.pdf (accessed on 15 May 2023).
43. Padilla, R.; Passos, W.L.; Dias, T.L.B.; Netto, S.L.; da Silva, E.A.B. A Comparative Analysis of Object Detection Metrics with a Companion Open-Source Toolkit. Electronics 2021, 10, 279.
44. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019.
45. Stewart, C.A.; Welch, V.; Plale, B.; Fox, G.; Pierce, M.; Sterling, T. Indiana University Pervasive Technology Institute. 2017.
Name of the Dataset | Data Type | Number of Classes | Train Images | Test Images | Resolution of Images | Class Overlap
---|---|---|---|---|---|---
BDD100K | Road | 10 | 70,000 | 20,000 | 1280 × 720 | True
COCO | Common | 80 ¹ | 118,287 | 5000 | Multi-scale | True
Architecture Name | Description |
---|---|
MobDet3 | The full architecture proposed in this paper. |
MobDet3 + FPN | MobDet3 using only the FPN network as its neck. |
MobDet3-SAM | MobDet3 without spatial attention modules. |
ShuffDet + FPAN | MobDet3 with a ShuffleNet backbone and a PAN network. |
ShuffDet + FPN | MobDet3 with a ShuffleNet backbone and an FPN network. |
SqDet + FPAN | MobDet3 with a SqueezeNet backbone and a PAN network. |
SqDet + FPN | MobDet3 with a SqueezeNet backbone and an FPN network. |
Architecture Name | Backbone | Mean Precision | mAP | AP50 | AP75 |
---|---|---|---|---|---|
MobDet3 | MobileNetV3 | 58.30% | 31.30% | 45.36% | 33.94% |
MobDet3 + FPN | MobileNetV3 | 54.68% | 30.21% | 41.85% | 32.79% |
MobDet3-SAM | MobileNetV3 | 56.23% | 30.76% | 42.46% | 33.05% |
ShuffDet + FPAN | ShuffleNet | 54.92% | 27.02% | 39.11% | 29.28% |
ShuffDet + FPN | ShuffleNet | 50.81% | 25.83% | 37.81% | 28.02% |
SqDet + FPAN | SqueezeNet | 51.61% | 26.66% | 38.52% | 28.90% |
SqDet + FPN | SqueezeNet | 49.89% | 25.47% | 37.15% | 27.59% |
Architecture Name | Backbone | Mean Precision | mAP | AP50 | AP75 |
---|---|---|---|---|---|
MobDet3 | MobileNetV3 | 56.08% | 51.68% | 72.67% | 56.29% |
MobDet3 + FPN | MobileNetV3 | 54.45% | 50.58% | 70.91% | 54.65% |
MobDet3-SAM | MobileNetV3 | 54.87% | 50.91% | 71.58% | 55.07% |
ShuffDet + FPAN | ShuffleNet | 49.87% | 45.53% | 64.45% | 52.57% |
ShuffDet + FPN | ShuffleNet | 48.28% | 45.25% | 63.22% | 47.47% |
SqDet + FPAN | SqueezeNet | 45.75% | 45.09% | 63.16% | 47.20% |
SqDet + FPN | SqueezeNet | 45.80% | 44.66% | 62.48% | 46.80% |
Architecture Name | Inference Time on NXP BlueBox 2.0 |
---|---|
MobDet3 | 88.92 FPS |
MobDet3 + FPN | 90.21 FPS |
MobDet3-SAM | 87.47 FPS |
ShuffDet + FPAN | 79.30 FPS |
ShuffDet + FPN | 80.92 FPS |
SqDet + FPAN | 75.29 FPS |
SqDet + FPN | 78.08 FPS |