MVDCNN: A Multi-View Deep Convolutional Network with Feature Fusion for Robust Sonar Image Target Recognition
Highlights
- This study proposes a Multi-View Deep Convolutional Neural Network based on feature fusion. By leveraging shared-weight backbone networks to extract discriminative features and adopting cross-view max-pooling to fuse complementary features (e.g., texture and shape), the network constructs a comprehensive feature space for underwater targets, enabling robust fine-grained recognition in complex marine environments.
- This study designs a data augmentation method based on the principles of multi-view sonar imaging. Given a set of single-view samples of a target and a chosen number of views, the method enumerates every combination of that size drawn from the target's original single-view images. It then screens these combinations within a defined azimuth range to select qualified multi-view training samples, ultimately constructing a dedicated dataset for multi-view sonar image target recognition.
- Compared to single-view models, the proposed architecture achieves more confident category predictions and superior class separation in the feature space, significantly enhancing underwater target recognition accuracy. On the Custom Side-Scan Sonar Image Dataset and the Nankai Sonar Image Dataset, the two-view MVDCNN achieves average accuracies of 94.72% and 97.24%, gains of 7.93 and 5.05 percentage points over the single-view baselines; the three-view MVDCNN further raises the average accuracies to 96.60% and 98.28%. The architecture also significantly reduces the misclassification rate of small-sample categories.
- The designed data augmentation method effectively mitigates the data scarcity problem in Sonar Automatic Target Recognition by systematically generating multi-view training samples, thereby lowering the barrier to its deployment in real-world scenarios.
Abstract
1. Introduction
- (1) This study proposes a modular and extensible MVDCNN framework. Its core innovation lies in decomposing the multi-view recognition pipeline into four dedicated modules: input reshaping, feature extraction, feature fusion, and classification. This design allows the feature extraction module to seamlessly integrate various pre-trained CNN or Transformer backbones, enabling the framework to evolve with advances in foundation models while leveraging their learned generic visual features. Compared to single-view baselines, MVDCNN achieves a more comprehensive and robust target understanding through feature-level fusion.
- (2) This study introduces a data augmentation technique from the field of multi-view SAR image recognition, effectively alleviating the critical bottleneck of scarce training data in sonar ATR. Addressing the challenge of limited open-source multi-view sonar image datasets, this method leverages geometric modeling to systematically generate numerous qualified multi-view training pairs from limited single-view images by combining and screening samples within a defined azimuth range of the same target.
- (3) This study establishes a multidimensional, visualization-supported model evaluation and mechanism analysis methodology. Beyond analyzing traditional performance metrics such as accuracy, precision, and recall, this research employs Kernel Density Estimation and t-SNE visualization. This approach not only quantitatively validates the significant improvements in classification accuracy and prediction confidence achieved through multi-view fusion but also qualitatively reveals its intrinsic mechanism of enhancing inter-class separation and intra-class compactness from the perspective of feature space distribution, providing clear evidence for the method’s effectiveness.
2. Materials and Methods
- Input Reshaping: Single-channel sonar images are adapted to pre-trained backbones via tensor reshaping and channel replication.
- Feature Extraction: Shared-weight backbone networks (e.g., ResNet and Transformer) extract high-level features from each view.
- Feature Fusion: Feature maps are aggregated by applying max-pooling across the view dimension, retaining the most activated features.
- Classification: A lightweight fully-connected head with Softmax projects the fused feature vector into the probability space for the final decision.
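The four modules listed above can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: `toy_backbone`, the weight values, and all function names are hypothetical stand-ins, and a real system would use a pre-trained ResNet or Transformer in place of the toy feature extractor. The sketch only aims to show that one backbone is applied to every view (weight sharing) and that fusion is an element-wise maximum over the view dimension.

```python
import math
from typing import Callable, List, Sequence

Image = List[List[float]]   # single-channel image as rows of pixel values
Vector = List[float]


def replicate_channels(image: Image, channels: int = 3) -> List[Image]:
    """Input reshaping: copy one channel into `channels` identical channels."""
    return [[row[:] for row in image] for _ in range(channels)]


def toy_backbone(x: List[Image]) -> Vector:
    """Hypothetical stand-in for a pre-trained backbone: two hand-made features."""
    pixels = [p for row in x[0] for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def fuse_views_max(features: Sequence[Vector]) -> Vector:
    """Feature fusion: element-wise max over the view dimension."""
    return [max(values) for values in zip(*features)]


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused: Vector, weights: List[Vector], bias: Vector) -> Vector:
    """Classification head: one fully connected layer followed by softmax."""
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def mvdcnn_forward(views: Sequence[Image],
                   backbone: Callable[[List[Image]], Vector],
                   weights: List[Vector], bias: Vector) -> Vector:
    # The same `backbone` processes every view, i.e. its weights are shared.
    features = [backbone(replicate_channels(view)) for view in views]
    return classify(fuse_views_max(features), weights, bias)


# Two made-up 2x2 views of the same target and a two-class head.
view_a = [[0.1, 0.2], [0.3, 0.4]]
view_b = [[0.9, 0.0], [0.1, 0.2]]
probs = mvdcnn_forward([view_a, view_b], toy_backbone,
                       weights=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
```

Because the fusion step takes a maximum rather than a concatenation, the fused vector has the same length for two, three, or more views, so the same classification head can be reused when the number of views changes.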
2.1. Multi-View Sample Augmentation Method
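As a rough illustration of the combine-then-screen idea behind this augmentation method, the sketch below enumerates all k-view combinations of one target's single-view samples and keeps those whose azimuths fall within a given span. The names `Sample`, `multi_view_combinations`, and `max_span_deg`, the use of degrees, and the specific screening rule (largest minus smallest azimuth, with no wrap-around at 360°) are assumptions made for this example and are not taken from the source.

```python
from dataclasses import dataclass
from itertools import combinations
from math import comb
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Sample:
    image_id: str        # hypothetical identifier of a single-view image
    azimuth_deg: float   # assumed viewing azimuth in degrees


def multi_view_combinations(samples: Sequence[Sample], k: int,
                            max_span_deg: float) -> List[Tuple[Sample, ...]]:
    """Return every k-view combination whose azimuth span is within the limit."""
    kept = []
    for combo in combinations(samples, k):
        azimuths = [s.azimuth_deg for s in combo]
        # Assumed screening rule: simple span, no wrap-around handling.
        if max(azimuths) - min(azimuths) <= max_span_deg:
            kept.append(combo)
    return kept


samples = [Sample("a", 0.0), Sample("b", 10.0),
           Sample("c", 20.0), Sample("d", 90.0)]
pairs = multi_view_combinations(samples, k=2, max_span_deg=25.0)
total = comb(len(samples), 2)   # candidate pairs before screening
```

In this toy data there are 6 candidate pairs and 3 of them pass the 25° screen, which shows how a handful of single-view images can yield several multi-view training samples.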
2.2. The Proposed MVDCNN Framework
2.2.1. Input Reshaping Module
2.2.2. Feature Extraction Module
2.2.3. Feature Fusion Module and Classification Module
2.3. Visualization Methods
2.3.1. t-Distributed Stochastic Neighbor Embedding
2.3.2. Kernel Density Estimation
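For readers unfamiliar with the technique, a one-dimensional Gaussian kernel density estimate can be written in a few lines. The sketch below is the generic textbook form, not the authors' code; the function name `gaussian_kde_1d`, the fixed bandwidth, and the confidence values are illustrative assumptions. Applied to predicted-class confidences, a density that concentrates near 1.0 would indicate more confident predictions.

```python
import math
from typing import Sequence


def gaussian_kde_1d(data: Sequence[float], x: float, bandwidth: float) -> float:
    """Density at x as the average of Gaussian kernels centred on the data points."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(data) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - d) / bandwidth) ** 2) for d in data)


# Made-up top-class confidences for two hypothetical models.
single_view_conf = [0.55, 0.62, 0.70, 0.81, 0.93]
multi_view_conf = [0.90, 0.94, 0.96, 0.98, 0.99]

density_single = gaussian_kde_1d(single_view_conf, x=0.95, bandwidth=0.05)
density_multi = gaussian_kde_1d(multi_view_conf, x=0.95, bandwidth=0.05)
```

The bandwidth controls how smooth the estimated curve is; comparing two models is only meaningful when both curves use the same bandwidth.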
2.4. Dataset
Partition Strategy
2.5. Experimental Setup
2.5.1. Baseline and MV-YOLO
2.5.2. Experimental Environment and Hyperparameters
2.5.3. Evaluation Metrics
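The metrics named in this section can be computed directly from label lists. The following is a minimal sketch using the standard definitions of accuracy and of support-weighted precision, recall, and F1; the function name and the toy labels are hypothetical, and the source does not state which implementation the authors used.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def classification_metrics(y_true: Sequence[Hashable],
                           y_pred: Sequence[Hashable]) -> Dict[str, float]:
    """Return accuracy and support-weighted precision, recall, and F1."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    totals = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for cls, count in support.items():
        tp = correct[cls]
        precision = tp / predicted[cls] if predicted[cls] else 0.0
        recall = tp / count
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        weight = count / n         # class share of the evaluated samples
        totals["precision"] += weight * precision
        totals["recall"] += weight * recall
        totals["f1"] += weight * f1
    return {"accuracy": sum(correct.values()) / n, **totals}


# Toy labels: three samples of a large class, one of a small class.
y_true = ["tire", "tire", "tire", "net"]
y_pred = ["tire", "tire", "net", "net"]
metrics = classification_metrics(y_true, y_pred)
```

Support weighting means that large classes dominate the averaged scores, so per-class values remain worth inspecting for small-sample categories.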
2.5.4. Ablation Study of Weight Sharing Mechanism
3. Results
3.1. Convergence Behavior
3.2. Overall Classification Performance
3.3. Feature Distribution Optimization
3.4. Prediction Uncertainty Reduction
3.5. View-Invariant Feature Capture
3.6. Small-Sample Robustness Enhancement
4. Discussion
4.1. Mechanism and Effectiveness of the Multi-View Feature Fusion
4.2. Comparison Between MVDCNN and MV-YOLO
4.3. Mechanism for Robustness Enhancement of Small-Sample Categories
4.4. Limitations and Future Work
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| ATR | Automatic Target Recognition |
| MVDCNN | Multi-View Deep Convolutional Neural Network |
| NKSID | Nankai Sonar Image Dataset |
| CSSID | Custom Side-Scan Sonar Image Dataset |
| UXO | Unexploded Ordnance |
| CNN | Convolutional Neural Network |
| ViT | Vision Transformer |
| SwinT | Swin Transformer |
| GANs | Generative Adversarial Networks |
| KDE | Kernel Density Estimation |
| t-SNE | t-Distributed Stochastic Neighbor Embedding |
| VAF | View Attention Fusion |
| VAM | View Attention Module |

| Dataset | Class | Training Set | Testing Set | Sum | Dataset Total |
|---|---|---|---|---|---|
| CSSID | Cone | 35 | 15 | 50 | 174 |
| CSSID | Cylinder | 35 | 15 | 50 | |
| CSSID | Globe | 16 | 8 | 24 | |
| CSSID | Shipwreck | 35 | 15 | 50 | |
| NKSID | Big propeller | 142 | 61 | 203 | 2617 |
| NKSID | Cylinder | 201 | 87 | 288 | |
| NKSID | Fishing net | 14 | 6 | 20 | |
| NKSID | Floats | 665 | 286 | 951 | |
| NKSID | Iron pipeline | 78 | 34 | 112 | |
| NKSID | Small propeller | 65 | 29 | 94 | |
| NKSID | Soft pipeline | 80 | 35 | 115 | |
| NKSID | Tire | 583 | 251 | 834 | |

| Model Category | Specific Model | Learning Rate | Batch Size | Epochs |
|---|---|---|---|---|
| YOLO Series | YOLOv8m | | 8 | 20 |
| YOLO Series | YOLO11 series | | 8 | 20 |
| ResNet | ResNet-34/50/101 | | 16 | 10 |
| Transformer | SwinT-tiny | | 16 | 10 |
| Transformer | ViT-base | | 16 | 10 |

| Dataset | Backbone | Single-View (%) | 2V-DCNN (%) | 3V-DCNN (%) |
|---|---|---|---|---|
| CSSID | ResNet34 | 90.57 | 94.34 (+3.77) | 98.11 (+7.54) |
| CSSID | ResNet50 | 88.68 | 92.45 (+3.77) | 94.34 (+5.66) |
| CSSID | ResNet101 | 92.45 | 94.34 (+1.89) | 94.34 (+1.89) |
| CSSID | SwinT-tiny | 92.45 | 98.11 (+5.66) | 100.0 (+7.55) |
| CSSID | ViT-base | 69.81 | 94.34 (+24.53) | 96.23 (+26.42) |
| NKSID | ResNet34 | 91.51 | 97.47 (+5.96) | 98.86 (+7.35) |
| NKSID | ResNet50 | 89.48 | 97.59 (+8.11) | 97.85 (+8.37) |
| NKSID | ResNet101 | 92.65 | 97.21 (+4.56) | 97.59 (+4.94) |
| NKSID | SwinT-tiny | 94.42 | 98.99 (+4.57) | 99.87 (+5.45) |
| NKSID | ViT-base | 92.90 | 94.93 (+2.03) | 97.21 (+4.31) |

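
The average accuracies quoted in the Highlights can be reproduced from this table. The short check below averages the five backbones per dataset; the variable names are illustrative. It also suggests, as an observation rather than something the source states, that the reported two-view gains are differences between means already rounded to two decimals.

```python
from statistics import mean

# Accuracy (%) per backbone: ResNet34, ResNet50, ResNet101, SwinT-tiny, ViT-base.
accuracy = {
    "CSSID": {
        "single": [90.57, 88.68, 92.45, 92.45, 69.81],
        "two_view": [94.34, 92.45, 94.34, 98.11, 94.34],
        "three_view": [98.11, 94.34, 94.34, 100.0, 96.23],
    },
    "NKSID": {
        "single": [91.51, 89.48, 92.65, 94.42, 92.90],
        "two_view": [97.47, 97.59, 97.21, 98.99, 94.93],
        "three_view": [98.86, 97.85, 97.59, 99.87, 97.21],
    },
}

summary = {}
for dataset, columns in accuracy.items():
    means = {name: round(mean(values), 2) for name, values in columns.items()}
    # Two-view gain over the single-view baseline, in percentage points,
    # computed from the rounded means.
    means["two_view_gain"] = round(means["two_view"] - means["single"], 2)
    summary[dataset] = means

for dataset, means in summary.items():
    print(dataset, means)
```

The large CSSID gain is driven mainly by the ViT-base row, whose single-view accuracy is far below that of the other backbones.
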
| Dataset | Model | Single-View (%) | 2-View (%) | 3-View (%) |
|---|---|---|---|---|
| CSSID | YOLOv8m | 73.58 | 86.79 (+13.21) | 94.34 (+20.76) |
| CSSID | YOLO11n | 84.91 | 92.45 (+7.54) | 94.34 (+9.43) |
| CSSID | YOLO11m | 86.79 | 92.45 (+5.66) | 92.45 (+5.66) |
| CSSID | YOLO11x | 84.91 | 92.45 (+7.54) | 90.57 (+5.66) |
| NKSID | YOLOv8m | 88.47 | 91.51 (+3.04) | 95.18 (+6.71) |
| NKSID | YOLO11n | 87.83 | 91.38 (+3.55) | 94.17 (+6.34) |
| NKSID | YOLO11m | 88.59 | 90.87 (+2.28) | 94.04 (+5.45) |
| NKSID | YOLO11x | 84.79 | 90.49 (+5.70) | 93.79 (+9.00) |

| Evaluation Metric | Shared Weights | Non-Shared Weights | Improvement |
|---|---|---|---|
| Accuracy (%) | 96.4 | 88.7 | +7.7 percentage points |
| Weighted F1 Score | 0.962 | 0.883 | +0.079 |
| Parameters (M) | 27.5 | 82.6 | −66.7% |
| Training Time (s/epoch) | 45.2 | 52.3 | −13.6% |
| Parameter Efficiency | 3.51 | 1.07 | 3.28× (+228%) |

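
The derived rows of the weight-sharing ablation table can be recomputed from its raw rows. The sketch below assumes that parameter efficiency means accuracy in percent divided by the parameter count in millions; the source does not define the term here, but this reading matches the tabulated values 3.51 and 1.07. The helper names are illustrative.

```python
# Raw rows of the weight-sharing ablation table.
shared = {"accuracy": 96.4, "params_m": 27.5, "sec_per_epoch": 45.2}
non_shared = {"accuracy": 88.7, "params_m": 82.6, "sec_per_epoch": 52.3}


def relative_change(new: float, old: float) -> float:
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


def parameter_efficiency(row: dict) -> float:
    """Assumed definition: accuracy (%) per million parameters."""
    return row["accuracy"] / row["params_m"]


# Absolute accuracy difference, in percentage points.
accuracy_gain_pp = shared["accuracy"] - non_shared["accuracy"]
param_change = relative_change(shared["params_m"], non_shared["params_m"])
time_change = relative_change(shared["sec_per_epoch"], non_shared["sec_per_epoch"])
eff_shared = parameter_efficiency(shared)
eff_non_shared = parameter_efficiency(non_shared)
eff_ratio = eff_shared / eff_non_shared   # how many times more efficient

print(f"accuracy gain: {accuracy_gain_pp:.1f} points")
print(f"parameters: {param_change:.1f}%  training time: {time_change:.1f}%")
print(f"efficiency: {eff_shared:.2f} vs {eff_non_shared:.2f} (ratio {eff_ratio:.2f})")
```

Under this definition the shared-weight model is roughly 3.3 times as parameter-efficient as the non-shared one, because it reaches higher accuracy with about one third of the parameters.
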
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Fan, Y.; Peng, C.; Zhang, P.; Zhang, Z.; Zhang, G.; Tang, J. MVDCNN: A Multi-View Deep Convolutional Network with Feature Fusion for Robust Sonar Image Target Recognition. Remote Sens. 2026, 18, 76. https://doi.org/10.3390/rs18010076

