A Marine Organism Detection Framework Based on Dataset Augmentation and CNN-ViT Fusion
Abstract
1. Introduction
2. Related Works
2.1. Target Detection Algorithm
2.2. Data Augmentation
3. Method
3.1. Data Augmentation
3.1.1. Random Expansion of Small Objects
3.1.2. Non-Overlapping Filling of Scarce Samples
3.2. Target Detection
3.2.1. Backbone of Feature Extraction Network
3.2.2. Deformable Convolution
3.2.3. Trident Block
3.2.4. Loss Function
4. Experiment
4.1. Experimental Details
4.2. Ablation Experiments
4.2.1. Random Expansion of Small Objects
4.2.2. Non-Overlapping Filling of Scarce Samples
4.2.3. Ablation Experiment of Each Module in the Proposed Model
4.3. Comparative Experiments
4.3.1. Detection Performance Comparison
4.3.2. Detection Real-Time Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Teng, B.; Zhao, H. Underwater target recognition methods based on the framework of deep learning: A survey. Int. J. Adv. Robot. Syst. 2020, 17, 1729881420976307.
- Qi, J.; Gong, Z.; Xue, W.; Liu, X.; Yao, A.; Zhong, P. An Unmixing-Based Network for Underwater Target Detection From Hyperspectral Imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 5470–5487.
- Rova, A.; Mori, G.; Dill, L.M. One fish, two fish, butterfish, trumpeter: Recognizing fish in underwater video. In Proceedings of the IAPR Conference on Machine Vision Applications (MVA 2007), Tokyo, Japan, 16–18 May 2007; pp. 404–407.
- Yuan, F.; Huang, Y.; Chen, X.; Cheng, E. A Biological Sensor System Using Computer Vision for Water Quality Monitoring. IEEE Access 2018, 6, 61535–61546.
- Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
- Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016.
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016.
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525.
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
- GitHub. YOLOv5. Available online: https://github.com/ultralytics/yolov5 (accessed on 28 May 2021).
- Li, H.; Zhuang, P.; Wei, W.; Li, J. Underwater Image Enhancement Based on Dehazing and Color Correction. In Proceedings of the 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), Xiamen, China, 16–18 December 2019; pp. 1365–1370.
- Luo, W.; Duan, S.; Zheng, J. Underwater Image Restoration and Enhancement Based on a Fusion Algorithm With Color Balance, Contrast Optimization, and Histogram Stretching. IEEE Access 2021, 9, 31792–31804.
- Inzartsev, A.V.; Pavin, A.M. AUV Cable Tracking System Based on Electromagnetic and Video Data. In Proceedings of the OCEANS 2008—MTS/IEEE Kobe Techno-Ocean, Kobe, Japan, 8–11 April 2008; pp. 1–6.
- Liu, R.; Fan, X.; Zhu, M.; Hou, M.; Luo, Z. Real-World Underwater Enhancement: Challenges, Benchmarks, and Solutions Under Natural Light. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4861–4875.
- Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296.
- Chen, Y.; Zhang, P.; Li, Z.; Li, Y.; Zhang, X.; Qi, L.; Sun, J.; Jia, J. Dynamic Scale Training for Object Detection. arXiv 2020.
- Chen, Y.; Zhang, P.; Li, Z.; Li, Y.; Zhang, X.; Meng, G.; Xiang, S.; Sun, J.; Jia, J. Stitcher: Feedback-driven Data Provider for Object Detection. arXiv 2020, arXiv:2004.12432.
- Shrivastava, A.; Gupta, A.; Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769.
- Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 318–327.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
- Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
- Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv 2020, arXiv:2005.14165.
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv 2019, arXiv:1910.10683.
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image Is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
- Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2020, arXiv:2010.04159.
- Yeh, C.H.; Lin, C.H.; Kang, L.W.; Huang, C.H.; Wang, C.C. Lightweight Deep Neural Network for Joint Learning of Underwater Object Detection and Color Conversion. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6129–6143.
- Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale Match for Tiny Person Detection. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020.
- Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
- Li, Y.; Chen, Y.; Wang, N.; Zhang, Z.X. Scale-Aware Trident Networks for Object Detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 6053–6062.
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
- Zhang, H.; Chang, H.; Ma, B.; Wang, N.; Chen, X. Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 260–275.
- Han, F.; Yao, J.; Zhu, H.; Wang, C. Marine Organism Detection and Classification from Underwater Vision Based on the Deep CNN Method. Math. Probl. Eng. 2020, 2020, 3937580.
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861.
- Zhang, Y.-F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and Efficient IOU Loss for Accurate Bounding Box Regression. Neurocomputing 2021, 506, 146–157.
- Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Zitnick, C.L. Microsoft COCO: Common Objects in Context; Springer International Publishing: Berlin/Heidelberg, Germany, 2014.
- Zhang, X.; Fang, X.; Pan, M.; Yuan, L.; Zhang, Y.; Yuan, M.; Lv, S.; Yu, H. A Marine Organism Detection Framework Based on the Joint Optimization of Image Enhancement and Object Detection. Sensors 2021, 21, 7205.
- Jia, J.; Fu, M.; Liu, X.; Zheng, B. Underwater Object Detection Based on Improved EfficientDet. Remote Sens. 2022, 14, 4487.
| Hyperparameter | Value | Description |
|---|---|---|
| hsv_h | 0.015 | image HSV-Hue augmentation (fraction) |
| hsv_s | 0.7 | image HSV-Saturation augmentation (fraction) |
| hsv_v | 0.4 | image HSV-Value augmentation (fraction) |
| degrees | 0.0 | image rotation (+/− deg) |
| translate | 0.1 | image translation (+/− fraction) |
| scale | 0.5 | image scale (+/− gain) |
| shear | 0.0 | image shear (+/− deg) |
| perspective | 0.0 | image perspective (+/− fraction), range 0–0.001 |
| flipud | 0.0 | image flip up-down (probability) |
| fliplr | 0.5 | image flip left-right (probability) |
| mosaic | 1.0 | image mosaic (probability) |
| mixup | 0.0 | image mixup (probability) |
| expand | 1.0 | label bbox expansion (pixels) |
| labelbalance | true | balance the number of labels (true/false) |
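These settings map one-to-one onto a YOLOv5-style hyperparameter file. Below is a minimal sketch (not the authors' released code; the file name hyp.marine.yaml is a placeholder) of how they could be written out; the expand and labelbalance keys are the paper's custom additions and are assumed to be consumed by a modified dataloader rather than stock YOLOv5.

```python
# Sketch: export the augmentation settings above as a YOLOv5-style
# hyperparameter YAML. Requires PyYAML.
import yaml

hyp = {
    "hsv_h": 0.015,       # HSV-Hue augmentation (fraction)
    "hsv_s": 0.7,         # HSV-Saturation augmentation (fraction)
    "hsv_v": 0.4,         # HSV-Value augmentation (fraction)
    "degrees": 0.0,       # rotation (+/- deg)
    "translate": 0.1,     # translation (+/- fraction)
    "scale": 0.5,         # scale (+/- gain)
    "shear": 0.0,         # shear (+/- deg)
    "perspective": 0.0,   # perspective (+/- fraction), range 0-0.001
    "flipud": 0.0,        # flip up-down (probability)
    "fliplr": 0.5,        # flip left-right (probability)
    "mosaic": 1.0,        # mosaic (probability)
    "mixup": 0.0,         # mixup (probability)
    # Custom keys from the paper's augmentation scheme; assumed to be
    # read by a modified dataloader, not by stock YOLOv5.
    "expand": 1.0,        # bbox expansion for small objects (pixels)
    "labelbalance": True, # non-overlapping filling of scarce samples
}

with open("hyp.marine.yaml", "w") as f:
    yaml.safe_dump(hyp, f, sort_keys=False)
```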
| Method | Pretrained Model | Recall | Precision | mAP@0.5 | mAP@0.5:0.95 | F1-Score |
|---|---|---|---|---|---|---|
| YOLOv5 | yolov5x6 | 0.7344 | 0.5975 | 0.6542 | 0.3536 | 0.6589 |
| YOLOv5 + Trident [1, 2, 3] | yolov5x6 | 0.7492 | 0.6134 | 0.6829 | 0.3623 | 0.6745 |
| YOLOv5 + Trident [1, 1, 1] | yolov5x6 | 0.7371 | 0.6375 | 0.6734 | 0.3571 | 0.6837 |
| YOLOv5 + Trident [2, 2, 2] | yolov5x6 | 0.7149 | 0.6081 | 0.6647 | 0.3469 | 0.6572 |
| YOLOv5 + Trident [3, 3, 3] | yolov5x6 | 0.7067 | 0.5747 | 0.6314 | 0.3294 | 0.6339 |
| YOLOv5 + ViT | yolov5x6 | 0.7476 | 0.6036 | 0.6633 | 0.3614 | 0.6679 |
| YOLOv5 + ViT + Trident [1, 2, 3] | yolov5x6 | 0.7380 | 0.6307 | 0.6762 | 0.3651 | 0.6801 |
| YOLOv5 + ViT + DfConv | yolov5x6 | 0.7596 | 0.6095 | 0.6783 | 0.3686 | 0.6763 |
| Proposed model (YOLOv5 + ViT + Trident [1, 2, 3] + DfConv) | yolov5x6 | 0.7621 | 0.6422 | 0.6848 | 0.3717 | 0.6970 |
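In the Trident rows, the bracketed triple gives the dilation rate of each of the three trident branches (so [1, 2, 3] assigns a different receptive field per branch, while [1, 1, 1] makes the branches identical). The F1-Score column is the harmonic mean of precision and recall, which lets the table be sanity-checked directly; a minimal sketch:

```python
# Consistency check for the F1-Score column: F1 = 2PR / (P + R),
# the harmonic mean of precision (P) and recall (R).
def f1_score(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f"{f1_score(0.5975, 0.7344):.4f}")  # 0.6589 -> YOLOv5 baseline row
print(f"{f1_score(0.6422, 0.7621):.4f}")  # 0.6970 -> proposed model row
```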
| Method | Pretrained Model | Recall | Precision | mAP@0.5 | mAP@0.5:0.95 |
|---|---|---|---|---|---|
| Zhang et al. [38] | ResNet-50 + Cascade | 0.6629 | 0.5612 | 0.6791 | 0.4142 |
| EDR-D0 [39] | EfficientNet-B0 | 0.6120 | 0.6267 | 0.6443 | 0.3374 |
| SSD | VGG-16 | 0.5420 | 0.5257 | 0.5676 | 0.2927 |
| YOLO-v3 | DarkNet-53 | 0.6434 | 0.5596 | 0.5713 | 0.3248 |
| YOLO-v4 | CSPDarkNet-53 | 0.6767 | 0.6024 | 0.6374 | 0.3427 |
| YOLO-v5 | yolov5-l | 0.7344 | 0.5975 | 0.6542 | 0.3536 |
| Proposed model | yolov5-l | 0.7621 | 0.6422 | 0.6848 | 0.3717 |
| Proposed framework (data augmentation + proposed model) | yolov5-l | 0.8853 | 0.8917 | 0.9249 | 0.8091 |
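The two mAP columns differ only in the IoU threshold used to decide whether a predicted box matches a ground-truth box: mAP@0.5 accepts a match at IoU ≥ 0.5, while mAP@0.5:0.95 averages AP over the ten thresholds 0.50, 0.55, ..., 0.95 (the COCO protocol). A minimal illustrative IoU sketch for corner-format boxes, not tied to any particular detector:

```python
# Intersection-over-union for boxes in (x1, y1, x2, y2) corner format.
def iou(box_a, box_b):
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(round(iou((0, 0, 10, 10), (5, 5, 15, 15)), 4))  # 0.1429

# COCO-style thresholds for mAP@0.5:0.95
thresholds = [0.5 + 0.05 * i for i in range(10)]  # 0.50, 0.55, ..., 0.95
```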
| Method | Parameters | FPS (PC) | FPS (Jetson Xavier NX) |
|---|---|---|---|
| SSD | 33.0 M | 15.82 | 20.32 |
| YOLO-v3 | 61.5 M | 14.83 | 17.43 |
| YOLO-v4 | 52.5 M | 15.01 | 18.95 |
| YOLO-v5 | 47.0 M | 15.34 | 17.08 |
| Proposed framework | 64.2 M | 14.66 | 17.46 |
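FPS figures of this kind are typically obtained by averaging the forward-pass time over many frames. As a rough illustration (not the authors' benchmark script; measure_fps is a hypothetical helper), throughput for a PyTorch model can be estimated as follows:

```python
# Sketch: estimate inference FPS by timing repeated forward passes
# on a fixed-size dummy input.
import time
import torch

def measure_fps(model, img_size=640, warmup=10, iters=100, device="cuda"):
    model = model.to(device).eval()
    x = torch.randn(1, 3, img_size, img_size, device=device)
    with torch.no_grad():
        for _ in range(warmup):       # warm-up passes stabilize caches/clocks
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()  # flush queued GPU work before timing
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return iters / elapsed            # frames per second

# Example (hypothetical): fps = measure_fps(my_model, device="cuda")
```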
Jiang, X.; Zhang, Y.; Pan, M.; Lv, S.; Yang, G.; Li, Z.; Liu, J.; Yu, H. A Marine Organism Detection Framework Based on Dataset Augmentation and CNN-ViT Fusion. J. Mar. Sci. Eng. 2023, 11, 705. https://doi.org/10.3390/jmse11040705