MVDCNN: A Multi-View Deep Convolutional Network with Feature Fusion for Robust Sonar Image Target Recognition

Fan, Yue; Peng, Cheng; Zhang, Peng; Zhang, Zhisheng; Zhang, Guoping; Tang, Jinsong

doi:10.3390/rs18010076


by Yue Fan ^1,2,†, Cheng Peng ^1,†, Peng Zhang ^3,*, Zhisheng Zhang ^1, Guoping Zhang ^2 and Jinsong Tang ^1

^1 Naval University of Engineering, Wuhan 430033, China
^2 Central China Normal University, Wuhan 430079, China
^3 National University of Defense Technology, Changsha 410073, China
^* Author to whom correspondence should be addressed.
^† These authors contributed equally to this work.

Remote Sens. 2026, 18(1), 76; https://doi.org/10.3390/rs18010076

Submission received: 29 October 2025 / Revised: 8 December 2025 / Accepted: 19 December 2025 / Published: 25 December 2025

(This article belongs to the Special Issue Underwater Remote Sensing: Status, New Challenges and Opportunities)


Highlights

What are the main findings?

This study proposes a Multi-View Deep Convolutional Neural Network based on feature fusion. By leveraging shared-weight backbone networks to extract discriminative features and adopting cross-view max-pooling to fuse complementary features (e.g., texture and shape), the network constructs a comprehensive feature space for underwater targets, enabling robust fine-grained recognition in complex marine environments.
This study designs a data augmentation method based on the principles of multi-view sonar imaging. Given a number of single-view samples $n_{i}$ and a number of views $k > 1$, the method systematically combines the original single-view images of a target using the combination formula $C_{n_{i}}^{k}$. It then screens these combinations within a defined azimuth range to select qualified multi-view training samples, ultimately constructing a dedicated dataset for multi-view sonar image target recognition.

What is the implication of the main finding?

Compared to single-view models, the proposed architecture achieves more confident category predictions and superior class separation in the feature space, significantly enhancing underwater target recognition accuracy (on the Custom Side-Scan Sonar Image Dataset and Nankai Sonar Image Dataset, the two-view MVDCNN achieves an average accuracy of 94.72% and 97.24%, with increases of 7.93% and 5.05% compared with single-view baselines; the three-view MVDCNN further improves the average accuracy to 96.60% and 98.28%), while significantly reducing the misclassification rate of small-sample categories.
The designed data augmentation method effectively mitigates the data scarcity problem in Sonar Automatic Target Recognition by systematically generating multi-view training samples, thereby lowering the barrier to its deployment in real-world scenarios.

Abstract

Automatic Target Recognition (ATR) in single-view sonar imagery is severely hampered by geometric distortions, acoustic shadows, and incomplete target information due to occlusions and the slant-range imaging geometry, which frequently give rise to misclassification and hinder practical underwater detection applications. To address these critical limitations, this paper proposes a Multi-View Deep Convolutional Neural Network (MVDCNN) based on feature-level fusion for robust sonar image target recognition. The MVDCNN adopts a highly modular and extensible architecture consisting of four interconnected modules: an input reshaping module that adapts multi-view images to match the input format of pre-trained backbone networks via dimension merging and channel replication; a shared-weight feature extraction module that leverages Convolutional Neural Network (CNN) or Transformer backbones (e.g., ResNet, Swin Transformer, Vision Transformer) to extract discriminative features from each view, ensuring parameter efficiency and cross-view feature consistency; a feature fusion module that aggregates complementary features (e.g., target texture and shape) across views using max-pooling to retain the most salient characteristics and suppress noisy or occluded view interference; and a lightweight classification module that maps the fused feature representations to target categories. Additionally, to mitigate the data scarcity bottleneck in sonar ATR, we design a multi-view sample augmentation method based on sonar imaging geometric principles: this method systematically combines single-view samples of the same target via the combination formula and screens valid samples within a predefined azimuth range, constructing high-quality multi-view training datasets without relying on complex generative models or massive initial labeled data. 
Comprehensive evaluations on the Custom Side-Scan Sonar Image Dataset (CSSID) and Nankai Sonar Image Dataset (NKSID) demonstrate the superiority of our framework over single-view baselines. Specifically, the two-view MVDCNN achieves average classification accuracies of 94.72% (CSSID) and 97.24% (NKSID), with relative improvements of 7.93% and 5.05%, respectively; the three-view MVDCNN further boosts the average accuracies to 96.60% and 98.28%. Moreover, MVDCNN substantially elevates the precision and recall of small-sample categories (e.g., Fishing net and Small propeller in NKSID), effectively alleviating the class imbalance challenge. Mechanism validation via t-Distributed Stochastic Neighbor Embedding (t-SNE) feature visualization and prediction confidence distribution analysis confirms that MVDCNN yields more separable feature representations and more confident category predictions, with stronger intra-class compactness and inter-class discrimination in the feature space. The proposed MVDCNN framework provides a robust and interpretable solution for advancing sonar ATR and offers a technical paradigm for multi-view acoustic image understanding in complex underwater environments.

Keywords: sonar ATR; multi-view; feature fusion; feature extraction; CNN; transformer; data augmentation; confidence distribution analysis; t-SNE


1. Introduction

Sonar imaging, which uses acoustic pulses to visualize underwater targets in complex environments, is a pivotal underwater detection technology. An advanced development in this field is Sonar Automatic Target Recognition (ATR), which plays an indispensable role in missions ranging from unexploded ordnance (UXO) clearance [1] and shipwreck localization [2] to seabed sediment classification [3,4].

The widespread success of deep learning in optical [5,6,7] and synthetic aperture radar (SAR) [8,9,10] image recognition, characterized by its high automation and accuracy, presents a promising paradigm for sonar ATR. Consequently, current sonar ATR research has largely focused on enhancing single-view image performance. For instance, Chen et al. [11] introduced a dual-attention mechanism and a weighted multi-scale fusion module into YOLOv7 to improve small-target recognition, while Cao et al. [12] enhanced YOLOv8 with a depth-wise convolution module to address issues like limited target pixels and imbalanced class distribution. Other approaches [13,14] have concentrated on improving image quality and integrating distinctive shadow/highlight features to boost detection capability with limited data.

However, these single-view approaches are fundamentally limited by the inherent information incompleteness of a single perspective: sonar’s slant-range imaging geometry introduces geometric distortions and occluded shadows, rendering single-view classification inherently ambiguous and error-prone [15]. While multi-view fusion offers a potential solution, existing attempts in sonar ATR, such as the image-level fusion by Williams et al. [16], are suboptimal. Their method of directly concatenating raw images can lead to feature space inconsistency and fails to leverage shared intermediate features across views.

In contrast, feature-level fusion strategies prevalent in 3D reconstruction [17,18] and SAR [10] employ shared-weight Convolutional Neural Network (CNN) branches to extract and fuse high-level features, proving more robust to viewpoint variations. Inspired by these insights, this study presents a Multi-View Deep Convolutional Neural Network (MVDCNN) that leverages feature-level fusion.

Moreover, existing data augmentation methods for sonar ATR can be categorized into two types, both of which have notable limitations. The first type includes conventional augmentation techniques for single-view images, such as geometric rotation, pixel-level noise injection, and contrast adjustment [19,20]. These methods only expand samples within a single-view dimension and fail to introduce cross-view complementary features, thus remaining unable to break through the essential limitation of information incompleteness in single-view imaging.

The second type encompasses a few multi-view sample construction schemes: some rely on manual multiple acquisitions of multi-view sequences for the same target, which are constrained by the high cost and complex environment of underwater detection, making large-scale sample accumulation impractical; others randomly concatenate single-view images of different targets, ignoring the azimuth geometric constraints of sonar imaging, resulting in samples lacking physical rationality and prone to introducing invalid noise that interferes with model training. Additionally, certain methods [21,22] based on Generative Adversarial Networks (GANs) can generate pseudo-multi-view samples but require large amounts of initial labeled data and tend to produce false features that deviate from real acoustic characteristics, exhibiting poor adaptability to small-sample categories.

The SAR ATR domain has established multi-view augmentation methods worthy of reference [23]. The multi-view sonar image sample augmentation method specifically designed in this paper achieves breakthroughs in both technical principle and practical effectiveness. Based on the azimuth geometric model of sonar imaging, this method systematically combines single-view samples of the same target using combination formulas and screens valid samples within a preset azimuth range. It not only realizes the transformation from limited single-view data to large-scale high-quality multi-view samples but also ensures the physical rationality of the generated samples in terms of acoustic imaging. Meanwhile, this method does not rely on large-scale initial samples or complex generative models, can directly alleviate the feature sparsity problem of small-sample categories, and significantly reduces the cost of manual acquisition and screening, providing reliable data support for the training of multi-view sonar ATR models.

The main contributions of this paper are as follows:

(1): This study proposes a modular and extensible MVDCNN framework. Its core innovation lies in decomposing the multi-view recognition pipeline into four dedicated modules: input reshaping, feature extraction, feature fusion, and classification. This design allows the feature extraction module to seamlessly integrate various pre-trained CNN or Transformer backbones, enabling the framework to evolve with advances in foundation models while leveraging their learned generic visual features. Compared to single-view baselines, MVDCNN achieves a more comprehensive and robust target understanding through feature-level fusion.
(2): This study introduces a data augmentation technique from the field of multi-view SAR image recognition, effectively alleviating the critical bottleneck of scarce training data in sonar ATR. Addressing the challenge of limited open-source multi-view sonar image datasets, this method leverages geometric modeling to systematically generate numerous qualified multi-view training pairs from limited single-view images by combining and screening samples within a defined azimuth range of the same target.
(3): This study establishes a multidimensional, visualization-supported model evaluation and mechanism analysis methodology. Beyond analyzing traditional performance metrics such as accuracy, precision, and recall, this research employs Kernel Density Estimation and t-SNE visualization. This approach not only quantitatively validates the significant improvements in classification accuracy and prediction confidence achieved through multi-view fusion but also qualitatively reveals its intrinsic mechanism of enhancing inter-class separation and intra-class compactness from the perspective of feature space distribution, providing clear evidence for the method’s effectiveness.

The remainder of this paper is structured as follows. Section 2 details the proposed framework, including the problem formulation and the multi-view sample augmentation method. Section 3 presents the experimental results, including ablation studies and comparative analyses, to validate the framework’s effectiveness and superiority. Finally, Section 4 concludes the paper and outlines future research directions.

2. Materials and Methods

In this section, we present the geometric model for multi-view imaging and describe a data augmentation strategy tailored for multi-view sonar images. We then elaborate on the design principles and functional roles of the four core modules in MVDCNN. As depicted in Figure 1, the MVDCNN architecture consists of the following key components:

Input Reshaping: Single-channel sonar images are adapted to pre-trained backbones via tensor reshaping and channel replication.
Feature Extraction: Shared-weight backbone networks (e.g., ResNet and Transformer) extract high-level features from each view.
Feature Fusion: Feature maps are aggregated by applying max-pooling across the view dimension, retaining the most activated features.
Classification: A lightweight fully-connected head with Softmax projects the fused feature vector into the probability space for the final decision.

2.1. Multi-View Sample Augmentation Method

In practical underwater missions, sonar imaging systems typically capture images of the same target at identical pitch angles but varying azimuth angles [24]. In this context, the viewing angle is defined as the azimuth between the sonar target line and geographical north. A single sonar system can only acquire images from one viewing angle per pass. To construct a multi-view sonar image dataset, the same sonar should be used to conduct multiple single-view observations of the same target, with the resulting images then randomly combined to form a multi-view sonar image dataset for target recognition.

Figure 2 illustrates the geometric model of multi-view sonar imaging. Given a viewing angle interval $\theta$ and the number of viewpoints $k$ ($k > 1$), the imaging sonar acquires a series of images of the underwater target from different viewing angles. The multi-view sonar ATR system can extract enhanced target classification information from these multi-view sonar images.

In theory, the multi-view sonar imaging geometric model illustrated in Figure 2 can generate an infinite number of multi-view sonar image sample combinations. However, constrained by the conditions and costs associated with sonar data acquisition, it is challenging to collect large quantities of original multi-view sonar training samples based on this model in practical applications.

Therefore, this paper adapts the multi-view sample augmentation method from the field of SAR image recognition [10] for generating multi-view sonar training samples, effectively mitigating data scarcity and small-sample problems [25].

The process begins with a single acquisition session to collect raw sonar images of underwater targets with consistent resolution within the azimuth range from $0^{\circ}$ to $360^{\circ}$. Let $X_{raw} = \{X_{1}, X_{2}, \dots, X_{C}\}$ represent the collection of raw sonar samples for all target categories, where $C$ is the number of categories. Define $X_{i} = \{x_{1}, x_{2}, \dots, x_{n_{i}}\}$ as the set of $n_{i}$ target samples belonging to the category label $y_{i} \in \{1, 2, \dots, C\}$, where each image $x_{j} \in X_{i}$ has an observation azimuth angle of $\varphi(x_{j})$.

Given the total number of single-view samples $n_{i}$ and the number of viewing angles $k > 1$, the total number of possible view combinations for sonar images of the same target category is as follows:

$$C_{n_{i}}^{k} = \frac{n_{i}!}{k!\,(n_{i} - k)!} \tag{1}$$

The sonar images within each viewing angle combination $\{x_{i_{1}}, x_{i_{2}}, \dots, x_{i_{k}}\}$ are sorted according to their observation azimuth angles, either in ascending order, $\varphi(x_{i_{1}}) < \varphi(x_{i_{2}}) < \dots < \varphi(x_{i_{k}})$, or in descending order, $\varphi(x_{i_{1}}) > \varphi(x_{i_{2}}) > \dots > \varphi(x_{i_{k}})$. Finally, Equation (2) is used to screen the multi-view sonar sample combinations of target $y_{i}$ that fall within the same viewing angle interval $\theta$; the retained combinations subsequently serve as training samples for MVDCNNs.

$$|\varphi(x_{i_{j}}) - \varphi(x_{i_{k}})| \leq \theta \tag{2}$$

Figure 3 illustrates a concrete example of the multi-view sample augmentation method with $k = 3$ views: this approach can transform six original single-view sonar samples into nine distinct three-view sonar sample combinations.
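The combine-then-screen procedure of Equations (1) and (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_multiview_samples`, the `(image_id, azimuth)` tuple layout, and the example azimuths are all assumptions, and the screen is applied to the span between the smallest and largest azimuth of each sorted combination without handling the wrap-around at 360 degrees.

```python
from itertools import combinations
from math import comb


def build_multiview_samples(samples, k, theta):
    """Return the k-view combinations of one target whose azimuth span is <= theta.

    samples: list of (image_id, azimuth_deg) tuples for a single target category
             (hypothetical layout chosen for this sketch).
    k:       number of views per multi-view sample, k > 1.
    theta:   viewing angle interval in degrees.
    """
    if k < 2:
        raise ValueError("k must be greater than 1")
    # Sorting first means every combination is already in ascending azimuth order.
    ordered = sorted(samples, key=lambda s: s[1])
    kept = []
    for combo in combinations(ordered, k):
        span = combo[-1][1] - combo[0][1]  # largest minus smallest azimuth
        if span <= theta:
            kept.append(combo)
    return kept


# Six single-view images of one target, with made-up azimuths 10 degrees apart.
views = [(f"img_{a:03d}", float(a)) for a in (0, 10, 20, 30, 40, 50)]
total = comb(len(views), 3)                      # Equation (1): 20 candidate triples
kept = build_multiview_samples(views, k=3, theta=30.0)
print(total, len(kept))                          # 20 candidates, 10 pass the screen
```

The number of retained combinations depends entirely on the azimuths and on $\theta$, so the count printed here is specific to these made-up angles and is not meant to reproduce Figure 3.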

2.2. The Proposed MVDCNN Framework

2.2.1. Input Reshaping Module

Existing pre-trained backbone networks are predominantly designed for processing single-view optical images, with their input data format typically represented as 4D tensors $[B, C, H, W]$, where the channel dimension $C = 3$. However, multi-view sonar image training samples possess a 5D tensor structure $[B, V, C, H, W]$ (where $C = 1$). Therefore, as illustrated in Figure 4, we merge the batch dimension $B$ and view dimension $V$, reshaping the training samples into the format $[B \times V, C, H, W]$. Additionally, following the process shown in Figure 5, we replicate the single-channel grayscale images into three identical channels to achieve channel dimension compatibility for the input multi-view sonar images.
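The two reshaping steps can be illustrated with plain nested Python lists standing in for tensors. This is only a sketch under that assumption: the helper names `merge_batch_and_view`, `replicate_channels`, and `shape` are hypothetical, and in a tensor library such as PyTorch the same effect would normally be obtained with a single reshape followed by a channel repeat.

```python
def shape(t):
    """Return the dimensions of a regular nested list, e.g. [B, V, C, H, W]."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return dims


def merge_batch_and_view(batch):
    """[B][V][C][H][W] -> [B*V][C][H][W]; the views of one sample stay contiguous."""
    return [view for sample in batch for view in sample]


def replicate_channels(images, out_channels=3):
    """[N][1][H][W] -> [N][out_channels][H][W] by copying the grayscale channel."""
    return [
        [[row[:] for row in img[0]] for _ in range(out_channels)]
        for img in images
    ]


# Toy batch: B = 2 samples, V = 3 views, C = 1 channel, H = W = 2 pixels.
batch = [
    [[[[float(10 * b + v)] * 2 for _ in range(2)]] for v in range(3)]
    for b in range(2)
]
merged = merge_batch_and_view(batch)      # shape [6, 1, 2, 2]
backbone_input = replicate_channels(merged)  # shape [6, 3, 2, 2]
print(shape(batch), shape(merged), shape(backbone_input))
```

Keeping the views of each sample contiguous in the merged batch is what allows the view dimension to be restored later by a simple reshape back to $[B, V, D]$.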

2.2.2. Feature Extraction Module

The primary objective of the feature extraction module is to derive highly discriminative feature representations from each view of the sonar images. To fully leverage the generic visual features learned by existing deep models on large-scale datasets and overcome data scarcity, this module adopts a transfer learning strategy. It integrates various pre-trained CNN and Transformer architectures as backbone networks. Specifically, the module supports multiple models, including the ResNet, Vision Transformer (ViT), and Swin Transformer (SwinT), which have demonstrated exceptional performance in natural image recognition tasks. This selection covers two main paradigms: CNNs that excel at capturing local spatial features and Transformers that are adept at modeling long-range dependencies among features.

The module is designed following the principles of modularity and extensibility. A unified model loader is employed to instantiate different backbone networks and adapt their classification heads according to task requirements. For pre-trained models, we retain the weights of their convolutional layers and Transformer encoders, replacing only the final fully connected classification layer to accommodate the specific number of target categories in our task. This strategy not only enables the model to transfer rich hierarchical feature representations from natural images but also allows it to quickly adapt to the unique characteristics of sonar images (such as low contrast, geometric distortions, and acoustic shadows) through fine-tuning.

As described in Section 2.2.1, during the feature extraction process, the input multi-view sonar image tensor is first reshaped into a 4D tensor $[B \times V, C, H, W]$. This tensor is then fed into the backbone network for forward propagation. Each view’s image is processed independently by the same shared-weight backbone network, ensuring consistency in feature extraction. Through weight sharing, we significantly reduce the number of model parameters, improve training efficiency, and enhance the model’s generalization capability across different views.

The final layers of the backbone network serve as feature extractors, outputting a fixed-dimensional feature vector for each view. Specifically, for ResNet, we remove the final fully connected layer and extract features after global average pooling; for Transformer-based models (such as ViT and SwinT), we utilize the class token or features after global pooling. Upon completion of feature extraction, the module outputs a 3D feature tensor $F_{view} \in \mathbb{R}^{B \times V \times D}$, where $D$ represents the feature dimension. This structured output explicitly preserves the view dimension information, preparing the data for subsequent cross-view fusion in the feature fusion module.

2.2.3. Feature Fusion Module and Classification Module

The feature fusion module aggregates feature representations from multiple views through max-pooling to generate robust and highly discriminative target representations. The workflow consists of two main stages:

First, feature reshaping and view dimension restoration are performed. The feature tensor output from the feature extraction module has the format $[B \times V, D]$. We reshape it into a 3D tensor $[B, V, D]$, restoring the explicit structure of the view dimension. This step enables an independent analysis of the $V$ view features for each sample.

Then, multi-view feature fusion is conducted. Max-pooling is applied along the view dimension $V$ to select the most significant feature responses from the multi-view features. The fused feature $F_{fused} \in \mathbb{R}^{B \times D}$ is calculated as

$$F_{fused}[i, d] = \max_{1 \leq v \leq V} F[i, v, d], \quad \forall i \in \{1, \dots, B\},\ d \in \{1, \dots, D\} \tag{3}$$

This strategy essentially functions as a “feature selector”: for each feature channel d, it retains only the strongest response value across V views. This approach effectively suppresses interference from noisy views while preserving the target’s common characteristics (such as shape contours and key textures).

Finally, the fused feature $F_{fused}$ is mapped to the category space through the classification decision module:

$$y = \mathrm{Softmax}(W_{c} F_{fused} + b_{c}) \tag{4}$$

where $W_{c} \in \mathbb{R}^{C \times D}$ represents the classification weight matrix, $b_{c} \in \mathbb{R}^{C}$ is the corresponding bias vector, and $C$ denotes the number of classes.
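Equations (3) and (4) can be traced on a toy example with plain Python lists. The sketch below is illustrative only: the helper names (`restore_view_dim`, `fuse_max`, `softmax`, `classify`) and all numeric values are assumptions, and the per-view feature vectors are assumed to be stored contiguously per sample, as produced by merging the batch and view dimensions.

```python
import math


def restore_view_dim(flat_features, num_views):
    """[B*V][D] -> [B][V][D], assuming the V views of a sample are contiguous."""
    if len(flat_features) % num_views != 0:
        raise ValueError("feature count is not a multiple of the number of views")
    return [
        flat_features[i:i + num_views]
        for i in range(0, len(flat_features), num_views)
    ]


def fuse_max(view_features):
    """Equation (3): element-wise maximum over the view dimension, [B][V][D] -> [B][D]."""
    return [[max(column) for column in zip(*views)] for views in view_features]


def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, bias):
    """Equation (4): Softmax(W_c f + b_c) for every fused vector f; weights is [C][D]."""
    return [
        softmax([sum(w * x for w, x in zip(row, f)) + b for row, b in zip(weights, bias)])
        for f in fused
    ]


# B = 2 samples, V = 2 views, D = 3 features (made-up numbers).
flat = [[1.0, 5.0, 2.0], [3.0, 0.0, 4.0], [0.0, 0.0, 0.0], [-1.0, 2.0, -3.0]]
fused = fuse_max(restore_view_dim(flat, num_views=2))
print(fused)  # [[3.0, 5.0, 4.0], [0.0, 2.0, 0.0]]

W_c = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # C = 2 classes
b_c = [0.0, 0.0]
print(classify(fused, W_c, b_c))
```

In the second sample the all-zero view contributes nothing wherever the other view has a positive response, which is the "feature selector" behavior described above; note that a negative response is likewise overridden by the zero from the other view.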

2.3. Visualization Methods

To quantitatively and intuitively analyze the feature representation capability and prediction decision characteristics of the MVDCNN, this study introduces two visualization methods, namely t-Distributed Stochastic Neighbor Embedding (t-SNE) and Kernel Density Estimation (KDE), which are used for feature space distribution analysis and prediction confidence distribution characterization, respectively, to provide clear evidence for the model’s effectiveness. The specific principles and implementations are as follows.

2.3.1. t-Distributed Stochastic Neighbor Embedding

t-SNE is a nonlinear dimensionality reduction algorithm whose core objective is to map high-dimensional feature spaces to 2D or 3D low-dimensional spaces while preserving the local neighborhood structure of samples in the high-dimensional space; global distances between clusters are preserved only loosely. Its implementation process consists of two steps.

First, a high-dimensional spatial probability distribution is constructed. For each sample $i$ in the high-dimensional feature set, the Gaussian conditional probability $p_{j|i}$ between that sample and every other sample $j$ is calculated to characterize the probability that sample $i$ considers sample $j$ a neighbor. Because $p_{j|i}$ decays with distance, it is significantly higher for nearby samples, which for discriminative features are typically samples of the same class, than for samples of different classes.

Second, low-dimensional probability space distribution matching is performed. Random coordinates are initialized for each sample in the low-dimensional space, and the joint probability $q_{ij}$ between samples is calculated via the t-distribution. The Kullback–Leibler divergence between the high-dimensional distribution $P$ and the low-dimensional distribution $Q$ is then minimized, achieving low-dimensional visualization of high-dimensional features.
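The two distributions and the objective described above can be written out on a toy example. The sketch below is a simplification under stated assumptions: it uses one fixed Gaussian bandwidth `sigma` for every point, whereas t-SNE normally tunes a per-point bandwidth from a perplexity setting, and it only evaluates the Kullback–Leibler objective for given embeddings instead of running the gradient descent that minimizes it. All function names and coordinates are hypothetical.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def conditional_p(points, sigma=1.0):
    """p_{j|i}: Gaussian neighbor probabilities in the high-dimensional space."""
    n = len(points)
    rows = []
    for i in range(n):
        w = [
            0.0 if j == i else math.exp(-sq_dist(points[i], points[j]) / (2 * sigma ** 2))
            for j in range(n)
        ]
        total = sum(w)
        rows.append([x / total for x in w])
    return rows


def joint_p(points, sigma=1.0):
    """Symmetrized joint distribution P with p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = len(points)
    c = conditional_p(points, sigma)
    return [[(c[i][j] + c[j][i]) / (2 * n) for j in range(n)] for i in range(n)]


def joint_q(embedding):
    """Joint distribution Q from a Student-t kernel with one degree of freedom."""
    n = len(embedding)
    w = [
        [0.0 if i == j else 1.0 / (1.0 + sq_dist(embedding[i], embedding[j])) for j in range(n)]
        for i in range(n)
    ]
    total = sum(sum(row) for row in w)
    return [[x / total for x in row] for row in w]


def kl_divergence(P, Q):
    """KL(P || Q), skipping zero entries of P (the diagonal)."""
    return sum(
        p * math.log(p / q)
        for row_p, row_q in zip(P, Q)
        for p, q in zip(row_p, row_q)
        if p > 0.0
    )


# Two tight clusters in 3D (made-up features) and two candidate 2D embeddings.
features = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0), (5.1, 5.0, 5.0)]
faithful = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]   # keeps neighbors together
scrambled = [(0.0, 0.0), (5.0, 5.0), (0.1, 0.0), (5.1, 5.0)]  # separates neighbors
P = joint_p(features)
print(kl_divergence(P, joint_q(faithful)) < kl_divergence(P, joint_q(scrambled)))  # True
```

The embedding that keeps each pair of neighbors together yields the smaller divergence, which is the quantity t-SNE drives down when it optimizes the low-dimensional coordinates.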

2.3.2. Kernel Density Estimation

The confidence distribution curve plotted by KDE not only reflects the certainty of the model’s classification results but also deeply reveals the robustness and discriminative power of its internal feature representations. In classification tasks, an ideal model should exhibit high and concentrated confidence when predicting samples of the same class, indicating that the model has learned the essential features of the class rather than surface patterns formed by chance or overfitting.

KDE is a non-parametric probability density estimation method that can fit the overall probability distribution through the local density of sample points without presupposing the data distribution form. Its formula is as follows:

$$\hat{f}(x) = \frac{1}{n h} \sum_{i = 1}^{n} K\left(\frac{x - x_{i}}{h}\right) \tag{5}$$

where $n$ is the number of samples, $h$ is the bandwidth (controlling the smoothness), and $K(\cdot)$ is the kernel function (a Gaussian kernel function $K(u) = \frac{1}{\sqrt{2\pi}} e^{-\frac{u^{2}}{2}}$ is selected in this study).

In the KDE plot, the abscissa represents the range of the model’s prediction confidence, normalized to the interval $[0.0, 1.0]$, where 0.0 indicates that the model has no confidence in the prediction result at all, and 1.0 indicates that the model has absolute certainty. The ordinate represents the estimated value of the probability density function, reflecting the relative frequency of sample distribution within a specific confidence neighborhood. In addition, the second central moment (variance) of the density curve quantitatively characterizes the dispersion of the prediction confidence distribution. A wider distribution curve indicates high variance in the model’s confidence, reflecting high uncertainty in prediction results, whereas a narrower and sharper distribution indicates that the model exhibits a highly consistent confidence level in predictions.
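Equation (5) with the Gaussian kernel can be evaluated directly, as in the minimal sketch below. The confidence values, the bandwidth of 0.05, and the function names are illustrative assumptions; the bandwidth used for the paper's plots is not stated in this section. The sample variance is included as a simple proxy for the spread of the curve.

```python
import math


def gaussian_kernel(u):
    """K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, samples, h):
    """Equation (5): kernel density estimate at x with bandwidth h."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


def variance(samples):
    """Population variance of the confidence values (proxy for curve width)."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


# Made-up prediction confidences for two hypothetical models.
concentrated = [0.97, 0.98, 0.99, 0.98, 0.96]
dispersed = [0.55, 0.70, 0.90, 0.99, 0.62]

h = 0.05  # assumed bandwidth
grid = [i / 100 for i in range(101)]  # confidence axis from 0.0 to 1.0
curve_a = [kde(x, concentrated, h) for x in grid]
curve_b = [kde(x, dispersed, h) for x in grid]

# The concentrated model produces a taller, narrower curve and a smaller variance.
print(max(curve_a) > max(curve_b), variance(concentrated) < variance(dispersed))
```

Because a Gaussian kernel has unbounded support, part of the estimated density can fall outside $[0.0, 1.0]$ for confidences close to 1.0; plots restricted to that interval therefore show only the portion of the curve inside it.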

2.4. Dataset

Because multi-view sonar image datasets are scarce, this paper validates the proposed method on two datasets to ensure its effectiveness and credibility: the Custom Side-Scan Sonar Image Dataset (CSSID) and the Nankai Sonar Image Dataset (NKSID) [26], as detailed in Table 1.

CSSID is a dataset built by the research group, collected using a 650 kHz side-scan sonar system in a controlled experimental environment in offshore shallow waters. Its original imaging resolution is $512 \times 512$ pixels, corresponding to a physical scale of 0.02 m per pixel. After manual screening, a total of 174 valid images were obtained, covering 4 typical underwater target categories: Cone, Cylinder, Globe, and Shipwreck. The imaging results of the 4 target categories in CSSID from different viewing angles are shown in Figure 6.

NKSID is a public dataset collected by a professional marine detection team in the South China Sea using an Oculus M750d multi-beam forward-looking sonar operating at 0.75/1.2 MHz (Blueprint Subsea, Low Wood, Cumbria, UK). It has a larger data scale and richer category diversity, containing 2617 images covering 8 types of actual marine targets (Big propeller, Cylinder, Fishing net, Floats, Iron pipeline, Small propeller, Soft pipeline, Tire), with an original image resolution of $256 \times 256$ pixels (corresponding to a physical scale of 0.015 m per pixel). Notably, NKSID can directly provide multi-view images of the same target without additional manual processing, which is its core advantage over most existing sonar datasets for multi-view recognition research. The imaging results of the 8 target categories in NKSID are shown in Figure 7.

Partition Strategy

To address the class imbalance problem and avoid the impact of random partitioning on experimental reproducibility, this study adopts a stratified sampling strategy to partition the datasets, ensuring that the class distribution of each subset is consistent with that of the overall dataset: CSSID is divided into a training set (70%, 121 samples) and a test set (30%, 53 samples) at a ratio of 7:3; NKSID is also partitioned at a 7:3 ratio (1828 training samples and 789 test samples). The random seed for all partitioning processes is set to 42 to ensure the reproducibility of the experimental results. For small-sample categories (e.g., the Fishing net category in NKSID with only 20 samples), this study implements a sample guarantee mechanism to ensure that at least 3 samples are included in the test set, avoiding the problem of ineffective evaluation of recognition performance for small-sample categories.
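A stratified 7:3 split with a fixed seed and a minimum number of test samples per class can be sketched as follows. This is one plausible reading of the partition strategy, not the authors' code: the function name `stratified_split`, the use of `math.ceil` for the per-class test size, and the safeguard that leaves at least one training sample per class are assumptions.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.3, min_test=3, seed=42):
    """Split sample indices per class; returns (train_indices, test_indices)."""
    rng = random.Random(seed)  # fixed seed makes the partition reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label][:]
        rng.shuffle(indices)
        # Per-class test size, raised to min_test for small-sample categories.
        n_test = max(min_test, math.ceil(len(indices) * test_ratio))
        # Assumed safeguard: always leave at least one sample for training.
        n_test = min(n_test, len(indices) - 1)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return train, test


# Toy label list with one common class and one small-sample class.
labels = ["Tire"] * 100 + ["Fishing net"] * 5
train_idx, test_idx = stratified_split(labels)
rare_in_test = sum(1 for i in test_idx if labels[i] == "Fishing net")
print(len(train_idx), len(test_idx), rare_in_test)
```

Because the test size is rounded per class, the overall split can deviate slightly from an exact 70/30 ratio, which is consistent with the sample counts reported above.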

As shown in Table 1, CSSID has a balanced distribution of samples across conventional categories but a small overall sample size, making it suitable for verifying the basic effectiveness of the model; NKSID has a larger sample size and richer category types, including multiple small-sample categories (Fishing net accounts for only 0.76% of the total samples and Small propeller accounts for 3.55%), which can effectively test the model’s robustness in small-sample recognition. However, both datasets have certain limitations: CSSID is collected under controlled conditions and differs from real complex marine scenarios; NKSID’s targets are dominated by near-shore artificial facilities, lacking samples of deep-sea biological targets.

2.5. Experimental Setup

To systematically verify the effectiveness and superiority of the proposed MVDCNN framework and quantify the performance gain of multi-view feature-level fusion for sonar ATR tasks, this section specifies the experimental comparison baselines, software and hardware environments, training hyperparameters, and evaluation index system and designs a special comparison model (MV-YOLO) for practical application scenarios. The specific settings are as follows.

2.5.1. Baseline and MV-YOLO

The core verification objectives of this experiment are twofold: first, to verify the performance improvement of MVDCNN compared to single-view models; second, to highlight the adaptability advantage of classification-specialized backbones combined with multi-view fusion mechanisms by comparing with multi-view modified detection models. Based on this, two types of baseline models are selected for comparison.

First, mainstream classification architectures such as ResNet-34/50/101, Swin Transformer-tiny, and Vision Transformer-base are chosen to verify the universal performance gain of multi-view fusion across different backbone networks.

Second, representative YOLO series models in the sonar ATR field (YOLOv8m-cls and YOLO11n/11m/11x-cls) (Ultralytics Inc., Judicial Wy, Frederick, MD, USA) are selected. These models are optimized for dense detection tasks and can be used to compare the performance differences between classification-specialized architectures and detection architectures in sonar image classification tasks.

Considering that practical sonar ATR tasks mostly follow the “detection and recognition” workflow, to ensure the fairness and scenario adaptability of the comparison, this study modifies the YOLO series models for multi-view processing based on MVDCNN’s multi-view feature fusion mechanism, resulting in the MV-YOLO variant.

The evaluation of MV-YOLO is carried out from two dimensions: first, a quantitative comparison of classification performance, comparing the accuracy and small-sample recall rate differences between MV-YOLO and MVDCNN under the same view configuration; second, the convergence performance of training dynamics, analyzing the mitigation effect of multi-view input on the overfitting problem of YOLO models, providing a reference for the application of multi-view mechanisms in the full “detection and recognition” workflow.

2.5.2. Experimental Environment and Hyperparameters

All experiments are implemented based on the PyTorch (version 2.0.0, CUDA 12.0, Python 3.9) deep learning framework, with the hardware platform being an RTX 3060 GPU to ensure the reproducibility and computational efficiency of the experiments. Input images are uniformly standardized to $224 \times 224$ pixels to eliminate the impact of image size differences on model performance.

To account for differences in model parameter quantities and task adaptability, hyperparameter tuning is performed to ensure that each model fully converges. The specific parameter configurations are shown in Table 2.

Due to their focus on detection tasks and large parameter quantities, YOLO series models are set with higher learning rates ($2.5 \times 10^{-3}$ for YOLOv8m and $5.0 \times 10^{-3}$ for the YOLO11 series); classification backbones such as ResNet series and SwinT-tiny adopt a transfer learning strategy with a learning rate of $2.5 \times 10^{-5}$; ViT-base has a reduced learning rate of $5.0 \times 10^{-6}$ due to its high model complexity to avoid training oscillations. The batch size of YOLO series models is set to 8, and training is conducted for 20 epochs to ensure full convergence of the detection head; the batch size of classification backbone networks is increased to 16, and stable convergence can be achieved with 10 training epochs, balancing training efficiency and model performance.
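The settings described above can be gathered into a small lookup table. The following is a minimal sketch in plain Python: the dictionary layout and the names `TRAIN_CONFIG`, `IMAGE_SIZE`, and `config_for` are illustrative assumptions, while the numeric values are the ones stated in the text (Table 2 remains the authoritative source).

```python
# Hyperparameters as stated in the text; the structure and names are hypothetical.
TRAIN_CONFIG = {
    "yolov8m": {"lr": 2.5e-3, "batch_size": 8, "epochs": 20},
    "yolo11": {"lr": 5.0e-3, "batch_size": 8, "epochs": 20},
    "resnet": {"lr": 2.5e-5, "batch_size": 16, "epochs": 10},
    "swint": {"lr": 2.5e-5, "batch_size": 16, "epochs": 10},
    "vit": {"lr": 5.0e-6, "batch_size": 16, "epochs": 10},
}

# Input resolution used for every model.
IMAGE_SIZE = (224, 224)


def config_for(model_name: str) -> dict:
    """Return the training settings for a model, matched by family prefix."""
    name = model_name.lower()
    for family, cfg in TRAIN_CONFIG.items():
        if name.startswith(family):
            return cfg
    raise KeyError(f"no configuration for {model_name!r}")
```

For example, `config_for("YOLO11x-cls")` returns the YOLO11 settings and `config_for("ResNet-50")` returns the shared ResNet settings.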

2.5.3. Evaluation Metrics

To comprehensively evaluate model performance, the experiment adopts a two-layer evaluation index method combining “overall performance and category-specific characteristics”. Accuracy is used to measure the model’s global classification capability, intuitively reflecting the improvement of multi-view fusion on overall recognition effectiveness. Precision and recall are introduced to conduct special analyses on small-sample categories (e.g., Fishing net and Small propeller in NKSID) and easily confused categories (e.g., Big propeller vs. Small propeller), quantifying the recognition reliability of the model for targets of different categories. In addition, to deeply explain the internal mechanism of model performance gain, the experiment supplements auxiliary verification methods such as convergence analysis, feature distribution visualization (t-SNE), and prediction confidence distribution analysis (KDE), achieving dual verification of “performance quantification and mechanism interpretation”.
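As a reminder of how these three metrics relate to a confusion matrix, the following minimal sketch computes them in plain Python. It assumes that rows index the true class and columns the predicted class; the function names and the toy matrix are illustrative and are not taken from the experiments.

```python
def accuracy(cm):
    """Overall accuracy: correctly classified samples over all samples."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def precision(cm, c):
    """Precision of class c: true positives over all predictions of c."""
    predicted = sum(row[c] for row in cm)
    return cm[c][c] / predicted if predicted else 0.0


def recall(cm, c):
    """Recall of class c: true positives over all true samples of c."""
    actual = sum(cm[c])
    return cm[c][c] / actual if actual else 0.0


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
cm = [
    [50, 2, 0],
    [10, 5, 5],
    [0, 1, 27],
]
```

In this toy matrix the overall accuracy is high even though the middle class has low recall, which is why per-class precision and recall are reported alongside accuracy for small-sample categories.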

2.5.4. Ablation Study of Weight Sharing Mechanism

To verify the impact of the shared-weight architecture in MVDCNN’s feature extraction module on model performance, parameter efficiency, and view-invariant feature capture capability, this study additionally designs an ablation experiment of shared-weight and non-shared-weight models to ensure single-variable control and result comparability in the experiment.

The ablation study is conducted on the same backbone network, keeping the feature fusion (max-pooling) and classification modules of the MVDCNN identical and modifying only the weight mechanism of the feature extraction module. The shared-weight model (MVDCNN-shared) serves as the baseline experimental group, where feature extraction across all views shares the same set of backbone network weights to achieve parameter reuse for cross-view feature extraction; the non-shared-weight model (MVDCNN-independent) serves as the control group, where each view branch is configured with independent backbone network weights and there is no parameter reuse between branches.

Both types of models adopt completely consistent training hyperparameters: the learning rate is set to $2.5 \times 10^{-5}$, the batch size is 16, and the training epochs are 10, which are consistent with the training configuration of the backbone network in the MVDCNN, eliminating the interference of hyperparameter differences on experimental results.

In addition to conventional metrics such as accuracy, precision, and recall, the ablation experiment adds model efficiency metrics: the number of model parameters (M), training time per epoch (s/epoch), and parameter efficiency (accuracy/number of parameters) are counted to quantify the improvement in weight sharing on parameter compression and training efficiency.
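The efficiency metrics can be illustrated with a short calculation. The sketch below assumes that an independent-weight model holds one backbone copy per view while a shared-weight model holds a single copy; the function names and the 25.0 M backbone size are hypothetical and serve only to show the arithmetic.

```python
def total_params_m(backbone_m, head_m, num_views, shared):
    """Parameter count in millions for a multi-view model.

    With shared weights one backbone serves every view; with independent
    weights each view branch carries its own backbone copy.
    """
    copies = 1 if shared else num_views
    return copies * backbone_m + head_m


def parameter_efficiency(accuracy_pct, params_m):
    """Accuracy (%) per million parameters, as defined in the text."""
    return accuracy_pct / params_m


# Hypothetical sizes: a 25.0 M backbone, a negligible head, three views.
shared = total_params_m(25.0, 0.0, num_views=3, shared=True)
independent = total_params_m(25.0, 0.0, num_views=3, shared=False)
reduction = 1.0 - shared / independent  # fraction of parameters saved
```

Under these assumptions a three-view model saves two thirds of its parameters through weight sharing, regardless of the backbone size chosen.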

3. Results

3.1. Convergence Behavior

Figure 8 and Figure 9, respectively, show the training and validation loss curves, as well as the training and validation accuracy curves of the MVDCNN (five backbones) and MV-YOLO (four backbones) under different view configurations. The MVDCNN with different backbone network configurations exhibits rapid decreases and stabilization in both training and validation losses within the set 10 training epochs across different views, and the validation accuracy curve also enters a plateau phase in the later stage without significant jitter or decline. However, due to its large number of parameters and high complexity, ViT requires more refined hyperparameter tuning for multi-view training.

The validation loss curve of single-view YOLO models (especially the YOLO11 series) decreases extremely slowly, almost flat, forming a huge “generalization error” gap with the continuously decreasing training loss (severe overfitting). After introducing multi-view, the validation loss curve shows a significant downward trend, and the gap with training loss narrows, but it is still weaker than the convergence stability of the MVDCNN.

3.2. Overall Classification Performance

Figure 10 shows the normalized confusion matrices of five backbone networks (ResNet34/50/101, SwinT-tiny, ViT-base) under three view configurations (one-view, two-view, three-view) on the CSSID and NKSID datasets. Rows 1 to 3 present the model results on the CSSID, while rows 4 to 6 show the results on the NKSID.

It can be intuitively observed that with the increase in the number of views, the main diagonal elements of the confusion matrix for all backbone networks are significantly enhanced (darker and brighter), while the off-diagonal elements (representing misclassification) decrease. On the CSSID, the average proportion of the main diagonal for two-view and three-view increases by 9.05% and 11.55%, respectively; on the NKSID, the increases are 15.53% and 19.58% (the proportion of the main diagonal for one-view is only 75.18%). For small-sample categories such as Fishing net, the average misclassification rate drops from 60.00% in one-view to 26.80% in two-view and 10.00% in three-view; for Small propeller, the average misclassification rate decreases from 84.40% in one-view to 26.00% in both two-view and three-view.

The confusion matrix of single-view baseline models shows a more scattered misclassification pattern (with numerous and miscellaneous off-diagonal elements). However, after introducing two-view or three-view information, the performance of the MVDCNN is significantly improved, and the confusion matrix tends to be ideal. This intuitively confirms that multi-view feature-level fusion is a highly effective performance improvement strategy.

To better quantify the overall performance of the MVDCNN on the datasets, accuracy is calculated using the confusion matrix as shown in Table 3. 2V-DCNN (two-view MVDCNN) achieves average accuracies of 94.72% and 97.24% on the CSSID and NKSID, representing increases of 7.93% and 5.05% compared to single-view baselines; 3V-DCNN achieves average accuracies of 96.60% and 98.28%, with average increases of 9.81% and 6.09% relative to single-view baselines.

To ensure comparison fairness, this study modified MV-YOLO based on the architecture and multi-view feature fusion mechanism of the MVDCNN. An analysis of MV-YOLO’s confusion matrix (Figure 11) shows that although the main diagonal of the confusion matrix is enhanced under multi-view configurations, small-sample categories still exhibit significant misclassification. For example, the misclassification rates of 3V-YOLO for Fishing net and Small propeller are 66.8% and 76.0%, respectively, and the dispersion of off-diagonal elements is higher than that of the MVDCNN. The proportions of main diagonal elements for 2V-YOLO and 3V-YOLO are only 73.2% and 78.3%, indicating weak consistency in intra-class discrimination.

Table 4 presents the accuracy of MV-YOLO on the two datasets. 2V-YOLO achieves average accuracies of 91.03% and 91.06% on CSSID and NKSID, with average increases of 8.49% and 3.64% compared to single-view baselines; 3V-YOLO achieves average accuracies of 92.92% and 94.30%, with average increases of 10.38% and 6.88% relative to single-view baselines. Under the same two-view configuration, the average accuracy of the MVDCNN is 3.69% and 6.18% higher than that of MV-YOLO on CSSID and NKSID, respectively.
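The margins between the two frameworks follow directly from the averages quoted for Tables 3 and 4. The short sketch below recomputes them; the variable and function names are illustrative. Subtracting the reported averages shows that the quoted margins of 3.69 and 6.18 correspond to the two-view configuration, and it also yields the three-view margins.

```python
# Average accuracies (%) reported in the text for Tables 3 and 4.
mvdcnn = {"2V": {"CSSID": 94.72, "NKSID": 97.24},
          "3V": {"CSSID": 96.60, "NKSID": 98.28}}
mv_yolo = {"2V": {"CSSID": 91.03, "NKSID": 91.06},
           "3V": {"CSSID": 92.92, "NKSID": 94.30}}


def gap(views, dataset):
    """Accuracy margin of MVDCNN over MV-YOLO, in percentage points."""
    return round(mvdcnn[views][dataset] - mv_yolo[views][dataset], 2)


for views in ("2V", "3V"):
    for dataset in ("CSSID", "NKSID"):
        print(views, dataset, gap(views, dataset))
```

The three-view margins (3.68 on CSSID and 3.98 on NKSID) are smaller than the two-view margin on NKSID, since MV-YOLO gains more from the third view on that dataset.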

Figure 12 intuitively visualizes the performance comparison of sonar target recognition between the MVDCNN and MV-YOLO frameworks under one/two/three-view configurations, consisting of two subplots. The left subplot is a decagonal radar chart, with dimensions covering five backbones (ResNet34/50/101, SwinT-tiny, ViT-base) on the CSSID dataset and the same five backbones on the NKSID dataset; its scale starts at 65 to accommodate the low baseline accuracy of ViT-base, and blue dashed lines, green solid lines, and red dot-dashed lines represent single-view, two-view, and three-view configurations, respectively. The right subplot is an octagonal radar chart, with dimensions including four detection models (YOLOv8m and YOLO11n/m/x) on both CSSID and NKSID datasets; its scale starts at 70, and the curve markers for view configurations are consistent with the left subplot.

3.3. Feature Distribution Optimization

The previous sections have quantitatively demonstrated the performance advantages brought by multi-view fusion from the perspective of final classification decision outputs (confusion matrix and accuracy). However, to deeply understand the fundamental driver of its performance improvement, it is necessary to analyze the model’s internal feature representation space.

This section employs t-SNE dimensionality reduction visualization technology to visualize the feature space. As shown in Figure 13, the single-view feature space (the first subplot of each backbone network group) exhibits category mixing after t-SNE dimensionality reduction. On NKSID, which has numerous categories and large sample differences, the mixing is particularly evident, with point clouds of various categories interpenetrating and boundaries being blurred.

In addition, under multi-view configurations, the feature clustering of the Swin Transformer and Vision Transformer often exhibits better “intra-class compactness” and “inter-class separability”. This may be due to the global attention mechanism of the Transformer architecture, which can more effectively model long-range dependencies between feature blocks of different views, thereby achieving better cross-view feature alignment and complementarity.

After introducing two-view or three-view fusion, the feature spaces of the backbone networks (the second and third subplots of each backbone network group) show tightened feature clusters and clear category boundaries. Figure 14a–c clearly demonstrate the feature clustering effect of SwinT and 2/3V-SwinT on raw features. In the single-view feature space, the distribution of Floats feature points is somewhat discrete, and severe mixing occurs among Big propeller, Small propeller, and Cylinder, leading the model to misclassify Small propeller as Big propeller and Cylinder. As reflected in the confusion matrix, the correct recognition rate of Small propeller is only 10%, while 38% are misclassified as Big propeller and 21% as Cylinder. Multi-view fusion optimizes the feature space distribution: it not only compacts feature points of the same target but also increases the distance between feature clusters of different targets. This is the intrinsic mechanism by which multi-view fusion effectively reduces misclassification.
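The notions of "intra-class compactness" and "inter-class separability" are assessed visually in the t-SNE plots. As a rough numerical companion, the following sketch computes two simple proxies on feature vectors. These proxies, the function names, and the toy values are assumptions for illustration and are not metrics reported in this study.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def intra_class_spread(features_by_class):
    """Mean distance of each sample to its class centroid (lower = more compact)."""
    total, count = 0.0, 0
    for points in features_by_class.values():
        c = centroid(points)
        total += sum(math.dist(p, c) for p in points)
        count += len(points)
    return total / count


def inter_class_distance(features_by_class):
    """Mean pairwise distance between class centroids (higher = more separable)."""
    cents = [centroid(p) for p in features_by_class.values()]
    pairs = [(a, b) for i, a in enumerate(cents) for b in cents[i + 1:]]
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)


# Toy 2-D features for two classes (made-up values, not data from the paper).
toy = {
    "Big propeller": [[0.0, 0.0], [0.0, 2.0]],
    "Small propeller": [[4.0, 0.0], [4.0, 2.0]],
}
```

A feature space that improves in the sense described above would show a smaller spread and a larger centroid distance. Such proxies are better computed on the raw backbone features than on t-SNE coordinates, because t-SNE does not preserve global distances.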

3.4. Prediction Uncertainty Reduction

This section uses KDE to plot confidence distribution curves and quantitatively calculates the mean confidence to compare the differences in prediction certainty under different view configurations. Figure 15 adopts a color-coding scheme for multi-dimensional data visualization: the green curve represents single-view, the blue curve corresponds to two-view, and the red curve represents three-view.

In terms of distribution shape, for most categories (e.g., Shipwreck in CSSID and Big propeller in NKSID), the single-view curve (green) tends to have a wide and scattered distribution, indicating high variance in the model’s confidence judgments for samples of the same class and insufficient prediction stability. With the increase in the number of views, the two-view and three-view curves (blue and red) become significantly narrower and higher in peak value, indicating that multi-view fusion reduces prediction variance and improves prediction consistency.

In terms of peak position, while the distribution converges, the peak of the curve generally shifts toward higher confidence (to the right). As shown in Figure 16, the mean of the Cylinder category increases from 0.862 in single-view to 0.935 in two-view and 0.972 in three-view. This indicates that the model not only makes more consistent judgments but also becomes more “confident” in the correctness of its judgments. Multi-view fusion effectively reduces the uncertainty of model predictions. In addition, for the Globe category, the mean confidence of incorrect predictions decreases from 0.274 in single-view to 0.270 in two-view and 0.185 in three-view. Although the multi-view model still has misclassifications, this possibility is suppressed, meaning the model becomes increasingly “unconfident” in misclassifications.
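For readers unfamiliar with KDE, the following is a minimal sketch of a Gaussian kernel density estimate and the mean confidence, written in plain Python. The bandwidth of 0.05, the function names, and the confidence values are assumptions made for illustration; the study does not state its KDE settings.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a density function built from Gaussian kernels centred on samples."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def mean_confidence(samples):
    """Arithmetic mean of the predicted-class confidences."""
    return sum(samples) / len(samples)


# Made-up confidences for one class (not data from the paper).
single_view = [0.55, 0.70, 0.85, 0.95]
three_view = [0.95, 0.96, 0.97, 0.98]

kde_1v = gaussian_kde(single_view, bandwidth=0.05)
kde_3v = gaussian_kde(three_view, bandwidth=0.05)
```

With these toy inputs the three-view curve is narrower, peaks higher, and has a larger mean than the single-view curve, which is the pattern described for the real confidence distributions.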

3.5. View-Invariant Feature Capture

This section verifies the shared-weight architecture of the feature extraction module proposed in Section 2.2.2. The performance, feature distribution, and view consistency of shared-weight (MVDCNN-shared) and non-shared-weight (MVDCNN-independent) architectures are compared to quantify the capture effect of view-invariant features.

Taking the performance index comparison of the representative MV-SwinT as an example, Figure 17 shows the confusion matrices of the two architectures on the NKSID, and the performance index comparison is summarized in Table 5.

The accuracy of the shared-weight architecture is 96.4%, representing an increase of 7.7% compared to the independent-weight architecture; the weighted F1-score is 0.962, with a relative increase of 7.9%. Its model parameter quantity is 27.5 M, a reduction of 66.7% compared to the independent-weight architecture, and parameter efficiency is improved by 328%. In addition, an analysis of the confusion matrix and confidence distribution (shown in Figure 18) shows that the shared-weight architecture has better recognition performance for the small-sample Small propeller (the misclassification rate drops from 48% to 0%) and exhibits stronger stability and consistency in category prediction (more convergent confidence distribution and higher mean confidence). Through the extraction and learning of the common features of targets, the shared-weight architecture makes the prediction of targets more reliable and the model output more credible.

Figure 19 shows the feature space distribution of the two architectures. The feature points of different views in the independent-weight model exhibit obvious separation; the blue (one-view), orange (two-view), and green (three-view) feature point clusters are relatively independently distributed in space, with distinct category clustering.

Compared with the independent-weight model, the feature distribution of the shared-weight model exhibits view consistency and class separability. Feature points of different views are more concentrated, with blue, orange, and green points interweaving, and features of different views of the same category tend to cluster together, indicating that the model has successfully extracted common features independent of views.

In addition, the clustering structure in the feature space of the shared-weight model is clearer, making feature points of the same category easier to distinguish, and its clustering boundaries are more distinct compared to the independent-weight model.

3.6. Small-Sample Robustness Enhancement

The previous sections have conducted multidimensional analyses of the MVDCNN’s performance. In addition to overall performance improvement, the improvement in recognition of small-sample and hard-to-classify cases by the MVDCNN is also noteworthy. In the series of confusion matrix subplots on the NKSID, the improvement in two typical small-sample categories, Small propeller and Fishing net (accounting for 3.55% and 0.76% of training samples, respectively), can be observed: in single-view, there are numerous dark off-diagonal blocks in the corresponding rows and columns of the matrix, indicating severe misclassification into other categories (e.g., Big propeller and Cylinder). With the increase in the number of views, misclassification is suppressed, and the recall rate of Small propeller is significantly improved.

As shown in the table below, the average recall rate increases from 15.86% in single-view to 73.79% in both two-view and three-view. The Fishing net category is almost unrecognizable in single-view ResNet, whereas the MVDCNN achieves nearly perfect recognition of this category across all backbones. For easily confused category pairs with similar shapes, such as Big propeller vs. Small propeller and Iron pipeline vs. Soft pipeline, the single-view confusion matrix shows strong mutual misclassification between them. After multi-view fusion, the confusion between these similar category pairs is significantly reduced.

To reveal the internal mechanism of MVDCNN’s improved recognition of small-sample categories, we use Grad-CAM to visualize the attention distribution of the model under different views, intuitively compare the differences in their discriminative regions, and further explain how multi-view fusion improves small-sample recognition performance through feature complementarity and view-invariant feature capture.

Figure 20 shows the imaging results of the same “Small propeller” target from three different views and their corresponding attention distributions. Under View 1 and View 3, the target is predicted as “Big propeller” by the model: the attention region in View 1 is concentrated at the center of the target, which is the most discriminative feature of “Big propeller”, leading to misclassification; the attention region in View 3 is mostly focused on the acoustic shadow at the lower left of the target, resulting in model misclassification. Under View 2, the target is correctly predicted as “Small propeller” by the model with a confidence of up to 0.921. In the Class Probability Distribution plot, the MVDCNN integrates information from the three views through max-pooling and finally outputs a prediction probability of 0.456 for “Small propeller”. Although this value is lower than the 0.921 of View 2, it is higher than the prediction results of View 1 and View 3. More importantly, the MVDCNN effectively suppresses the attention bias of a single view, avoiding overall misclassification caused by a single poor-quality view.
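The cross-view max-pooling step can be sketched as an element-wise maximum over the per-view feature vectors. The code below is a framework-free illustration in plain Python; the function name and the toy feature values are assumptions, and in the actual network the operation would act on backbone feature tensors.

```python
def max_pool_fusion(view_features):
    """Element-wise maximum over views.

    view_features: list of V feature vectors, each of length D.
    Returns a single fused vector of length D.
    """
    if not view_features:
        raise ValueError("at least one view is required")
    dim = len(view_features[0])
    if any(len(f) != dim for f in view_features):
        raise ValueError("all views must have the same feature dimension")
    return [max(f[d] for f in view_features) for d in range(dim)]


# Made-up 4-D features from three views of one target.
views = [
    [0.1, 0.9, 0.0, 0.3],
    [0.7, 0.2, 0.1, 0.3],
    [0.2, 0.4, 0.8, 0.1],
]
fused = max_pool_fusion(views)
```

Because the fusion acts on features before the classifier, the fused class probability is not simply the largest per-view probability, which is consistent with the fused value of 0.456 lying below the 0.921 of View 2.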

4. Discussion

4.1. Mechanism and Effectiveness of the Multi-View Feature Fusion

This study verifies through multidimensional experiments that multi-view feature-level fusion is an effective method to improve the performance and robustness of sonar ATR, and its internal mechanism can be systematically deconstructed from three dimensions: feature representation, decision-making behavior, and architecture adaptation.

The bottleneck of single-view sonar ATR lies in inherent defects such as geometric distortion and acoustic shadow in sonar imaging, leading the model to easily learn “apparent features” strongly related to views (e.g., local shadow shape and distorted contour), which is manifested as category mixing in the feature space and a high inter-class overlap rate. In contrast, multi-view feature-level fusion forces the model to break through the information limitation of a single view by supplementing cross-view complementary features (texture, overall shape, etc.) and shifts to learning “essential features” consistent across views of the target (e.g., target geometric contour and core structure).

The model’s learning focus shifts from apparent associations in a single view to essential associations across multiple views, thereby exhibiting higher consistency, certainty, and robustness at the output level. This is verified by t-SNE visualization and quantitative indicators. The feature space transforms from disordered mixing to an ideal distribution of “intra-class compactness and inter-class clarity”, laying a solid foundation for robust classification.

Single-view models suffer from significant decision ambiguity due to information incompleteness: their prediction confidence distributions are wide and scattered, and they are prone to outputting incorrect results due to local noise. The multi-view feature complementarity mechanism solves this problem from the perspective of decision-making basis: when feature ambiguity or uncertainty exists in one view, other views can provide supporting or supplementary information to construct a more complete target representation.

KDE results show that multi-view fusion is not a simple superposition of information but achieves decision-making stability through feature purification, transforming the model’s output from uncertain fluctuations to high-confidence consistency and significantly reducing the randomness of classification results.

The shared-weight architecture is the key design that allows the MVDCNN to capture view-invariant features. Comparative experiments confirm that the shared-weight model forces all views to share the same feature extraction parameters, compelling the model to learn core features common across views (e.g., target structure and key texture) rather than view-specific noise. At the same time, the shared-weight architecture significantly reduces the number of parameters and improves parameter efficiency, keeping the model lightweight and efficient to train without sacrificing performance. This result reveals that the effectiveness of multi-view fusion depends not only on the fusion strategy but also on a feature extraction architecture matched to the task.

In addition, the shared-weight model has advantages in representation learning: it maps features of different views to a more consistent space and learns view-invariant features; although the independent-weight model can capture view-specific features, the dispersion of its feature representations may lead to performance degradation and poor generalization.

4.2. Comparison Between MVDCNN and MV-YOLO

Under the same multi-view fusion mechanism, the MVDCNN exhibits certain performance advantages over MV-YOLO. This difference is essentially attributed to the adaptability discrepancy between backbone networks and classification tasks, which also confirms that multi-view feature-level fusion can only maximize its performance gains when combined with task-specific architectures.

Backbone networks of the YOLO series are deeply optimized for dense object detection tasks. They feature complex structures and rely on pre-training with densely labeled natural images, with their core capability lying in capturing local localization features of targets rather than the global essential features required for fine-grained classification. When transferred to sonar ATR classification tasks, even with the introduction of multi-view fusion, these networks still struggle to overcome the feature extraction inertia inherent in their “detection-oriented” design. They show poor adaptability to the unique characteristics of sonar images, such as low contrast and geometric distortion, and are particularly prone to misclassification for small-sample categories due to feature sparsity.

In contrast, the modular design of the MVDCNN enables deep adaptation between classification-specialized backbones (ResNet, ViT, SwinT) and multi-view fusion. The core strength of classification-specialized backbones is capturing global discriminative features, which allows for the efficient mining of cross-view essential attributes of sonar targets. Meanwhile, the modular architecture supports flexible switching of backbone networks, making full use of the advantages of different architectures (e.g., the global attention mechanism of Transformers for modeling long-range dependencies). This combination of “classification backbone and multi-view fusion” precisely matches the fine-grained classification requirements of sonar ATR. Especially for small-sample categories, it alleviates the feature sparsity issue by supplementing structural feature diversity, achieving a breakthrough from unrecognizable to stable recognition.

4.3. Mechanism for Robustness Enhancement of Small-Sample Categories

Small-sample categories are a typical challenge in sonar ATR, with the core pain point being model overfitting and insufficient discriminative ability caused by feature sparsity. Multi-view fusion forms a multidimensional gain mechanism for this problem, verifying the generalizability of the MVDCNN.

First is the supplementation of feature diversity. Multi-view input provides structural features from different orientations for small-sample categories, making up for the defect of insufficient feature sample size under a single view. For example, the feature space of Fishing net (with only 14 training samples) transforms from “non-clustered” to “clearly clustered” with the supplement of multi-view information.

Second is the anchoring of core invariant features by the shared-weight architecture. This architecture guides the model to focus on the core invariant features of small-sample categories and avoid overfitting to local noise in a small number of samples. For instance, the blade structure of Small propeller is accurately anchored, reducing the misclassification rate as Big propeller from 38.6% (one-view) to 19.2% (two-view) and 15.8% (three-view).

Third is the suppression of decision-making uncertainty. Complementary multi-view information reduces decision ambiguity for small-sample categories, with a significant decrease in confidence variance, transforming the model’s output for scarce samples from “random guessing” to “stable judgment”. Grad-CAM visualization further confirms that multi-view models can break free from attention bias toward interference regions such as acoustic shadows under single views and shift to the target core contour, realizing robust recognition of small-sample categories.

4.4. Limitations and Future Work

Despite the excellent performance of the MVDCNN, it still has certain limitations: the current max-pooling fusion strategy treats all views “equally” and cannot adaptively distinguish the imaging quality of different views (e.g., noisy views with severe occlusion or a low signal-to-noise ratio) and may even introduce interference due to redundant views (on CSSID, 3V-ResNet shows no improvement in accuracy compared with 2V-ResNet).

To address these limitations, we propose replacing max-pooling with view attention fusion (VAF) through the introduction of a View Attention Module (VAM). This module automatically evaluates the importance of different viewpoint features via learnable weights, assigning higher weights to views with less occlusion and higher signal-to-noise ratios, thereby achieving adaptive feature fusion.

As illustrated in Figure 21, the input to VAM is a feature tensor of shape $[B, V, D]$. The module computes attention scores for each viewpoint’s feature vector, which are then normalized via Softmax to obtain corresponding viewpoint weights. These weights are used to perform a weighted average of the features across all viewpoints, resulting in a fused feature representation of shape $[B, D]$.
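The computation described for the VAM can be sketched in a few lines of plain Python. The sketch reads B, V, and D as batch size, number of views, and feature dimension, and it scores each view with a single linear function; both the interpretation of the symbols and the linear scoring rule are assumptions, since the text does not specify how the attention scores are produced. The names `softmax` and `view_attention_fusion` are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def view_attention_fusion(batch, w, b=0.0):
    """Fuse features of shape [B, V, D] into [B, D].

    Each view is scored with a linear function w . x + b (an assumed
    scoring rule); scores are normalised with softmax over the V views
    and used as weights for a weighted average of the view features.
    """
    fused_batch = []
    for views in batch:  # views: V vectors of length D
        scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in views]
        weights = softmax(scores)
        dim = len(views[0])
        fused = [sum(a * x[d] for a, x in zip(weights, views)) for d in range(dim)]
        fused_batch.append(fused)
    return fused_batch


# Made-up input: B = 1 sample, V = 2 views, D = 2 features.
batch = [[[1.0, 0.0], [0.0, 1.0]]]
fused = view_attention_fusion(batch, w=[0.0, 0.0])  # zero weights give uniform attention
```

With all-zero scoring weights the module reduces to a plain average over views. In a trained module the scoring parameters would be learned, so that cleaner views receive larger weights.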

Currently, the MVDCNN model only focuses on the multi-view sonar target recognition stage. However, practical underwater sonar ATR tasks follow a complete workflow of “detection and recognition”. Existing studies have not realized collaborative optimization and end-to-end deployment of detection and recognition modules, resulting in problems such as process fragmentation, low feature reusability, and insufficient real-time performance. To form a full-link sonar ATR technical closed loop from “target localization” to “fine-grained recognition”, future research will focus on the integration of the MVDCNN with multi-view detection models.

A cross-module feature reuse link of “detection features-recognition features” will be designed: the multi-view local features corresponding to target candidate boxes from the detection module are directly fed into the feature extraction module of the MVDCNN, avoiding feature loss in the traditional workflow of “cropping after detection and then recognition”. In addition, a bidirectional feedback mechanism will be built—the recognition results of the MVDCNN can reversely correct the confidence of candidate boxes in the detection module (e.g., increasing the weight of low-confidence detection boxes recognized as “Small propellers” and suppressing false detection boxes caused by acoustic shadows), realizing collaborative optimization of detection and recognition. Furthermore, an end-to-end joint training strategy will be adopted: after freezing the weights of the pre-trained backbone, the parameters of the detection head and the MVDCNN recognition head will be updated synchronously to improve the adaptability between modules.

5. Conclusions

To address the core challenges of single-view Sonar Automatic Target Recognition (ATR), including geometric distortion, acoustic shadow interference, training data scarcity, and poor recognition performance for small-sample categories, this paper proposes a Multi-View Deep Convolutional Neural Network (MVDCNN) based on feature-level fusion and designs a multi-view data augmentation method adapted to the characteristics of sonar imaging. The MVDCNN adopts a modular architecture: the input reshaping module achieves format compatibility between multi-view tensors and pre-trained backbones; the shared-weight feature extraction module integrates backbones such as ResNet, ViT, and SwinT to mine essential target features; the max-pooling feature fusion module selects salient cross-view features; and the final classification module completes target category mapping. The multi-view data augmentation method generates large-scale multi-view training sets from limited single-view samples based on combination formulas, effectively alleviating the data scarcity problem.

Experimental validation on the Custom Side-Scan Sonar Image Dataset (CSSID) and Nankai Sonar Image Dataset (NKSID) demonstrates that the MVDCNN achieves significant performance improvements over single-view baseline models. Under the two-view configuration, it reaches average classification accuracies of 94.72% on the CSSID and 97.24% on the NKSID, representing increases of 7.93% and 5.05% relative to single-view models; under the three-view configuration, the average accuracy is further improved to 96.60% and 98.28%. Meanwhile, the average misclassification rate of small-sample categories (e.g., Small propeller on the NKSID) drops from 84.4% in the one-view setting to 26.0% in both two-view and three-view settings, and the average recall rate rises from 15.86% in single-view to 73.79% in both two-view and three-view settings.

In addition, through t-SNE feature visualization, kernel density confidence analysis, and a comparison of shared-weight and independent-weight architectures, this paper reveals the core mechanism by which multi-view fusion guides the model to shift from learning apparent features to capturing view-invariant essential features and reduces prediction uncertainty, confirming the adaptability advantage of combining classification-specialized backbones with feature-level fusion. To address the limitation that max-pooling cannot adaptively weight views, this paper also proposes an optimization direction of introducing a view attention fusion module. This research not only provides a robust and efficient recognition scheme for sonar ATR but also offers theoretical support and technical paradigms for multi-view acoustic image understanding.

Author Contributions

Conceptualization, Y.F., C.P. and P.Z.; methodology, Y.F., P.Z., G.Z. and J.T.; software, C.P. and P.Z.; validation, Y.F., C.P., P.Z. and Z.Z.; formal analysis, Y.F.; investigation, Y.F. and C.P.; resources, P.Z.; data curation, C.P. and P.Z.; writing—original draft preparation, Y.F. and C.P.; writing—review and editing, Y.F., C.P., P.Z. and J.T.; visualization, C.P., P.Z. and Z.Z.; supervision, Y.F. and G.Z.; project administration, Y.F., G.Z. and J.T.; funding acquisition, Y.F. and J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This paper is supported by the National Natural Science Foundation of China (Grant Nos. 62401601, U2341227).

Data Availability Statement

The CSSID (Custom Side-Scan Sonar Image Dataset) developed in this work is available from the authors upon reasonable request. The NKSID (Nankai Sonar Image Dataset) used in this study is a third-party dataset publicly available at https://github.com/Jorwnpay/NK-Sonar-Image-Dataset (accessed on 25 April 2025), as described in the original publication.

Acknowledgments

The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ATR	Automatic Target Recognition
MVDCNN	Multi-View Deep Convolutional Neural Network
NKSID	Nankai Sonar Image Dataset
CSSID	Custom Side-Scan Sonar Image Dataset
UXO	Unexploded Ordnance
CNN	Convolutional Neural Network
ViT	Vision Transformer
SwinT	Swin Transformer
GANs	Generative Adversarial Networks
KDE	Kernel Density Estimation
t-SNE	t-Distributed Stochastic Neighbor Embedding
VAF	View Attention Fusion
VAM	View Attention Module

Figure 1. The architecture of Multi-View Deep Convolutional Neural Network (MVDCNN): Modular framework integrating input reshaping, shared-weight feature extraction, view-dimension max-pooling fusion, and classification modules for multi-view sonar ATR.
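The view-dimension max-pooling fusion named in Figure 1 can be illustrated with a short sketch. This is a minimal pure-Python illustration that assumes each view has already been mapped to a feature vector by the shared-weight backbone; the function name and the toy feature values are hypothetical and do not come from the paper.

```python
def fuse_views_max(view_features):
    """Element-wise maximum over the view dimension.

    `view_features` has shape [V][D]: one D-dimensional feature vector per
    view of the same target. Returns a single fused vector of length D.
    """
    if not view_features:
        raise ValueError("at least one view is required")
    dim = len(view_features[0])
    if any(len(vec) != dim for vec in view_features):
        raise ValueError("all views must share the same feature dimension")
    # zip(*...) groups the d-th feature of every view into one column.
    return [max(column) for column in zip(*view_features)]


# Three hypothetical 4-D feature vectors for one target (V = 3, D = 4).
features = [
    [0.2, 0.9, 0.1, 0.4],
    [0.7, 0.3, 0.1, 0.5],
    [0.1, 0.4, 0.8, 0.2],
]
fused = fuse_views_max(features)
print(fused)  # [0.7, 0.9, 0.8, 0.5]
```

Each fused component comes from whichever view responds most strongly, so the result does not depend on the order of the views. The same property means every view is treated with equal authority, which is the limitation the proposed view attention module is meant to address.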

Figure 2. Azimuth-constrained multi-view sonar imaging geometry: Principles of underwater target observation from varying viewing angles with defined angular intervals.

Figure 3. Multi-view sample augmentation example ($k = 3$): Transforming 6 single-view sonar samples into 9 valid multi-view combinations.
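The augmentation idea in Figure 3, enumerating $k$-view combinations of one target's single-view samples and then screening them by azimuth, can be sketched as follows. This is a minimal illustration under stated assumptions: the azimuth values, the 45-degree threshold, the "span" screening rule, and the function name are all hypothetical, since the exact screening criterion is not reproduced here.

```python
from itertools import combinations


def build_multiview_samples(samples, k, max_span_deg):
    """Enumerate all k-view combinations of one target's single-view samples
    and keep those whose azimuths fit inside `max_span_deg` degrees.

    `samples` is a list of (sample_id, azimuth_deg) pairs. The span test
    ignores 0/360 wrap-around for simplicity (an assumption of this sketch).
    """
    kept = []
    for combo in combinations(samples, k):
        azimuths = [azimuth for _, azimuth in combo]
        if max(azimuths) - min(azimuths) <= max_span_deg:
            kept.append(tuple(sample_id for sample_id, _ in combo))
    return kept


# Hypothetical azimuths (degrees) for six single-view images of one target.
single_views = [("s0", 0), ("s1", 15), ("s2", 30),
                ("s3", 45), ("s4", 60), ("s5", 75)]

candidates = list(combinations(single_views, 3))   # all C(6, 3) triples
multi_views = build_multiview_samples(single_views, k=3, max_span_deg=45)
print(len(candidates), len(multi_views))
```

With these made-up values, 20 candidate triples are generated and 10 pass the screen. The count of 9 in the figure depends on the actual azimuths and range used by the authors, which this sketch does not attempt to reproduce.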

Figure 4. Input reshaping process for multi-view sonar images: From 5D tensor $[B, V, C, H, W]$ to 4D tensor $[B, C, H, W]$. After input reshaping, the batch dimension ($B = 16$) and view dimension ($V = 3$) are combined into a single dimension of size “48”, as shown in the red text in the figure.

Figure 5. Channel replication strategy for single-channel sonar images to match three-channel pre-trained backbones. As shown in the red text, the original channel count of “1” is replicated to become “3”.
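The shape bookkeeping behind Figures 4 and 5 can be written out explicitly. The sketch below tracks only shapes and indices in pure Python rather than using a tensor library; the 224 × 224 image size, the 512-dimensional feature size, and all function names are assumptions made for illustration.

```python
def merge_batch_and_views(shape):
    """Map a 5D shape (B, V, C, H, W) to the 4D shape (B*V, C, H, W)."""
    b, v, c, h, w = shape
    return (b * v, c, h, w)


def flat_index(b, v, num_views):
    """Row of the merged tensor that holds view `v` of batch item `b`
    (row-major order, as a plain reshape would produce)."""
    return b * num_views + v


def split_batch_and_views(shape, num_views):
    """Inverse bookkeeping for backbone features: (B*V, D) -> (B, V, D)."""
    bv, d = shape
    if bv % num_views:
        raise ValueError("leading dimension is not a multiple of the view count")
    return (bv // num_views, num_views, d)


def replicate_channels(pixels, target_channels=3):
    """Turn one single-channel image (H x W nested lists) into
    `target_channels` identical channels, giving a C x H x W structure."""
    return [[row[:] for row in pixels] for _ in range(target_channels)]


merged = merge_batch_and_views((16, 3, 1, 224, 224))
print(merged)                               # (48, 1, 224, 224)
print(flat_index(b=15, v=2, num_views=3))   # 47, the last merged row
print(split_batch_and_views((48, 512), 3))  # (16, 3, 512)

three_channel = replicate_channels([[0.1, 0.2], [0.3, 0.4]])
print(len(three_channel))                   # 3
```

Merging the batch and view dimensions lets a backbone that expects 4D input process every view in one forward pass; splitting the leading dimension again afterwards recovers the per-target grouping needed for fusion over views.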

Figure 6. CSSID characteristics: Representative multi-view sonar images of four canonical underwater targets (Cone, Cylinder, Globe, Shipwreck) acquired in a controlled shallow-water environment.

Figure 7. NKSID characteristics: Real-world multi-view sonar imagery of eight marine targets captured in the South China Sea using Oculus M750d multi-beam forward-looking sonar.

Figure 8. Training dynamics of the MVDCNN: Loss and accuracy convergence curves of five backbone networks (ResNet34/50/101, SwinT-tiny, ViT-base) across 1/2/3-view configurations on CSSID and NKSID.

Figure 9. Training dynamics of MV-YOLO: Loss and accuracy convergence curves of four backbone networks (YOLOv8m and YOLO11n/11m/11x) across 1/2/3-view configurations on CSSID and NKSID.

Figure 10. Normalized confusion matrices of the MVDCNN (5 backbones) under 1/2/3-view configurations for sonar ATR on the CSSID and NKSID. Red saturation increases with the value of each matrix element. On the main diagonal, deeper red indicates a higher probability that targets of that category are recognized correctly; off the diagonal, deeper red (on a scale running from pink through light red and orange-red to bright red) indicates a higher probability that targets of the row category are misclassified into the column category.
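For readers unfamiliar with normalized confusion matrices, the sketch below shows how raw counts are turned into the row-normalized values such a figure displays, and how per-class recall and misclassification rate follow from them. The three-class count matrix is made up for illustration and does not correspond to any result in the paper.

```python
def normalize_rows(confusion):
    """Divide each row by its total so that every row sums to 1.

    Rows are true classes, columns are predicted classes.
    """
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


def per_class_recall(confusion):
    """Recall of class i is the diagonal entry of the row-normalized matrix."""
    return [row[i] for i, row in enumerate(normalize_rows(confusion))]


# Hypothetical counts for three classes (20 test samples each).
counts = [
    [18, 2, 0],
    [1, 19, 0],
    [7, 0, 13],
]
normalized = normalize_rows(counts)
recall = per_class_recall(counts)
misclassification = [1.0 - r for r in recall]
print(recall)
```

In a row-normalized matrix the diagonal entry of a row is that class's recall, and the off-diagonal entries of the same row sum to its misclassification rate.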

Figure 11. Normalized confusion matrices of MV-YOLO (4 backbones) under 1/2/3-view configurations for sonar ATR on CSSID and NKSID.

Figure 12. Radar chart of accuracy for MVDCNN and MV-YOLO with different backbones under 1/2/3-view configurations.

Figure 13. Feature space organization evolution: t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization showing transformation from disordered single-view feature distribution to structured multi-view clustering with enhanced intra-class compactness. For the CSSID dataset, which comprises four typical underwater targets, the color assignment follows the sequence: blue for Cone, orange for Cylinder, green for Globe, and red for Shipwreck. For the NKSID dataset, containing eight practical marine targets, the color mapping is extended to eight distinct hues from the same standardized palette: blue for Big propeller, orange for Cylinder, green for Fishing net, red for Floats, purple for Iron pipeline, brown for Small propeller, pink for Soft pipeline, and gray for Tire. This color scheme ensures high inter-class discriminability, and enables intuitive interpretation of intra-class compactness and inter-class separability of feature representations across different view configurations.

Figure 14. Progressive feature clustering enhancement: t-SNE visualization comparing SwinT feature distributions under 1/2/3-view configurations with emphasis on small sample classes (Small propeller and Floats). (a) SwinT. (b) 2V-SwinT. (c) 3V-SwinT.

Figure 15. Prediction uncertainty reduction: Kernel Density Estimation (KDE) visualization of confidence distributions across backbone networks showing convergence toward high-confidence predictions with increasing view count.
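The kind of curve shown in a KDE confidence plot can be reproduced with a fixed-bandwidth Gaussian kernel estimator. The following is a minimal sketch: the confidence values and the bandwidth of 0.05 are hypothetical, and the paper's own plots may have been produced with a different estimator or bandwidth rule.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at x using a Gaussian
    kernel with a fixed bandwidth."""
    if not samples:
        raise ValueError("at least one sample is required")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical prediction confidences for the same test targets.
single_view_conf = [0.52, 0.61, 0.70, 0.74, 0.88, 0.93]
three_view_conf = [0.91, 0.94, 0.95, 0.97, 0.98, 0.99]

kde_1v = gaussian_kde(single_view_conf, bandwidth=0.05)
kde_3v = gaussian_kde(three_view_conf, bandwidth=0.05)

for x in (0.6, 0.8, 0.95):
    print(f"x={x:.2f}  1-view density={kde_1v(x):.3f}  3-view density={kde_3v(x):.3f}")
```

A distribution that concentrates near 1.0 produces a tall, narrow peak at the right edge of the plot, which is the visual signature of reduced prediction uncertainty described in the caption.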

Figure 16. Prediction confidence distributions of ResNet50 on CSSID: the mean confidence of correct predictions increases and the mean confidence of incorrect predictions decreases. (a) ResNet50. (b) 2V-ResNet50. (c) 3V-ResNet50.

Figure 17. Weight sharing impact on classification: Confusion matrices comparing non-shared and shared-weight architectures. For the non-shared-weight matrix (left), the main diagonal exhibits heterogeneous red saturation: most categories (e.g., Big propeller, Cylinder) show moderately saturated red, while small-sample categories (e.g., Fishing net, Small propeller) present low-to-moderate saturation red. Concurrently, the off-diagonal regions contain multiple light-red/orange-red patches, indicating non-negligible inter-class confusion. In contrast, the shared-weight matrix (right) displays homogeneously high red saturation across most of the main diagonal: all categories except Fishing net (0.65, moderate red) achieve the maximum correct recognition rate (1.00, fully saturated deep red). Critically, the off-diagonal regions are almost entirely devoid of red (all values = 0.00), with only a faint red patch (0.35) observed for Fishing net misclassified to Tire, indicating a substantial reduction in inter-class misclassification. (a) MVDCNN-independent. (b) MVDCNN-shared.

Figure 18. Confidence reliability enhancement: KDE distributions comparing prediction confidence of non-shared and shared-weight architectures showing reduced variance and higher mean confidence. (a) MVDCNN-independent. (b) MVDCNN-shared.

Figure 19. View-invariant feature learning visualization: t-SNE embedding revealing how weight sharing (b) versus independent weights (a) affects cross-view feature consistency and class separability. (a) MVDCNN-independent. (b) MVDCNN-shared.

Figure 20. Multi-view attention complementarity for small-sample recognition: Grad-CAM visualization of three views of the Small propeller target showing how multi-view fusion overcomes single-view attention bias and acoustic shadow interference.

Figure 21. Adaptive View Attention Module (VAM): Learnable attention mechanism for weighting multi-view features according to imaging quality and discriminative importance to overcome max-pooling limitations.
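To make the contrast with max-pooling concrete, the sketch below shows one common form of attention-style fusion: a softmax over per-view scores followed by a weighted sum of view features. It is not the authors' VAM. In a trained module the scores would be produced by learnable layers, whereas here they are passed in directly, and all names and values are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_views_attention(view_features, view_scores):
    """Weighted sum of per-view feature vectors.

    `view_features` has shape [V][D]; `view_scores` has length V. The
    weights are softmax(view_scores), so they are positive and sum to 1.
    """
    if len(view_features) != len(view_scores):
        raise ValueError("one score per view is required")
    weights = softmax(view_scores)
    dim = len(view_features[0])
    fused = [
        sum(w * vec[d] for w, vec in zip(weights, view_features))
        for d in range(dim)
    ]
    return fused, weights


# Two hypothetical views with 2-D features.
views = [[1.0, 0.0], [0.0, 1.0]]
fused_equal, w_equal = fuse_views_attention(views, [0.0, 0.0])
fused_skewed, w_skewed = fuse_views_attention(views, [10.0, 0.0])
print(fused_equal, w_equal)
print(fused_skewed, w_skewed)
```

Equal scores reduce this to average pooling over views, while a dominant score makes the fused vector follow a single view, so a view degraded by shadow or distortion can in principle be down-weighted rather than contributing on equal terms.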

Table 1. Detailed statistics of the Custom Side-Scan Sonar Image Dataset (CSSID) and the Nankai Sonar Image Dataset (NKSID) for multi-view sonar target recognition.

Dataset	Class	Training Set	Testing Set	Class Total	Dataset Total
CSSID	Cone	35	15	50	174
	Cylinder	35	15	50
	Globe	16	8	24
	Shipwreck	35	15	50
NKSID	Big propeller	142	61	203	2617
	Cylinder	201	87	288
	Fishing net	14	6	20
	Floats	665	286	951
	Iron pipeline	78	34	112
	Small propeller	65	29	94
	Soft pipeline	80	35	115
	Tire	583	251	834
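The combination formula $C_{n_{i}}^{k}$ gives an upper bound on how many multi-view samples a class can yield before azimuth screening. The sketch below evaluates that bound for the smallest classes in Table 1. It assumes the combinations are formed within each class's training set, and the dictionary keys are illustrative labels only.

```python
from math import comb

# Training-set sizes of the smallest classes, copied from Table 1.
training_counts = {
    "CSSID/Globe": 16,
    "NKSID/Fishing net": 14,
    "NKSID/Small propeller": 65,
}

# upper_bounds[name][k] = C(n_i, k): candidate k-view samples before screening
upper_bounds = {
    name: {k: comb(n, k) for k in (2, 3)}
    for name, n in training_counts.items()
}

for name, bounds in upper_bounds.items():
    print(f"{name}: 2-view <= {bounds[2]}, 3-view <= {bounds[3]}")
```

Even the 14 Fishing net training images admit up to 91 two-view and 364 three-view candidates, which illustrates why the method helps most for small-sample categories. The number that survives azimuth screening will be smaller.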

Table 2. Backbone-specific hyperparameter configuration: Learning rates, batch sizes, and epochs optimized for classification (ResNet, ViT, SwinT) versus detection (YOLO) architectures to ensure a fair comparison.

Model Category	Specific Model	Learning Rate	Batch Size	Epochs
YOLO Series	YOLOv8m	$2.5 \times 10^{- 3}$	8	20
YOLO Series	YOLO11 series	$5.0 \times 10^{- 3}$	8	20
ResNet	ResNet-34/50/101	$2.5 \times 10^{- 5}$	16	10
Transformer	SwinT-tiny	$2.5 \times 10^{- 5}$	16	10
Transformer	ViT-base	$5.0 \times 10^{- 6}$	16	10

Table 3. Classification accuracy (%) of the five classification backbones under 1/2/3-view configurations on the CSSID and NKSID, with the improvement over the single-view baseline in parentheses.

Dataset	Backbone	Single View	2V-DCNN	3V-DCNN
CSSID	ResNet34	90.57	94.34 (+3.77)	98.11 (+7.54)
	ResNet50	88.68	92.45 (+3.77)	94.34 (+5.66)
	ResNet101	92.45	94.34 (+1.89)	94.34 (+1.89)
	SwinT-tiny	92.45	98.11 (+5.66)	100.0 (+7.55)
	ViT-base	69.81	94.34 (+24.53)	96.23 (+26.42)
NKSID	ResNet34	91.51	97.47 (+5.96)	98.86 (+7.35)
	ResNet50	89.48	97.59 (+8.11)	97.85 (+8.37)
	ResNet101	92.65	97.21 (+4.56)	97.59 (+4.94)
	SwinT-tiny	94.42	98.99 (+4.57)	99.87 (+5.45)
	ViT-base	92.90	94.93 (+2.03)	97.21 (+4.31)

Values in parentheses indicate the improvement, in percentage points, over the single-view baseline.

Table 4. Classification accuracy (%) of MV-YOLO under 1/2/3-view configurations on the CSSID and NKSID, with the improvement over the single-view baseline in parentheses.

Dataset	Model	Single-View	2-View	3-View
CSSID	YOLOv8m	73.58	86.79 (+13.21)	94.34 (+20.76)
	YOLO11n	84.91	92.45 (+7.54)	94.34 (+9.43)
	YOLO11m	86.79	92.45 (+5.66)	92.45 (+5.66)
	YOLO11x	84.91	92.45 (+7.54)	90.57 (+5.66)
NKSID	YOLOv8m	88.47	91.51 (+3.04)	95.18 (+6.71)
	YOLO11n	87.83	91.38 (+3.55)	94.17 (+6.34)
	YOLO11m	88.59	90.87 (+2.28)	94.04 (+5.45)
	YOLO11x	84.79	90.49 (+5.70)	93.79 (+9.00)

Values in parentheses indicate the improvement, in percentage points, over the single-view baseline.

Table 5. Performance and efficiency comparison: shared-weight vs. independent-weight MVDCNN (based on SwinT backbone on the NKSID).

Evaluation Metric	Shared Weights	Non-Shared Weights	Improvement
Accuracy (%)	96.4	88.7	+7.7%
Weighted F1 Score	0.962	0.883	+7.9%
Parameters (M)	27.5	82.6	−66.7%
Training Time (s/epoch)	45.2	52.3	−13.6%
Parameter Efficiency	3.51	1.07	3.28× (+228%)
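The derived quantities in Table 5 can be recomputed from the two reported columns. The sketch below does so under one assumption that is not stated in the table itself: that Parameter Efficiency is accuracy (%) divided by parameter count (M), which matches the reported 3.51 and 1.07 to rounding. Function and key names are illustrative.

```python
def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return 100.0 * (new - old) / old


# Values copied from Table 5 (SwinT backbone, NKSID).
shared = {"accuracy": 96.4, "params_m": 27.5, "sec_per_epoch": 45.2}
independent = {"accuracy": 88.7, "params_m": 82.6, "sec_per_epoch": 52.3}


def parameter_efficiency(model):
    """Assumed definition: accuracy (%) per million parameters."""
    return model["accuracy"] / model["params_m"]


accuracy_gain_points = shared["accuracy"] - independent["accuracy"]
param_change = relative_change(shared["params_m"], independent["params_m"])
time_change = relative_change(shared["sec_per_epoch"], independent["sec_per_epoch"])
eff_shared = parameter_efficiency(shared)
eff_independent = parameter_efficiency(independent)
eff_ratio = eff_shared / eff_independent

print(f"accuracy gain: {accuracy_gain_points:+.1f} points")
print(f"parameters:    {param_change:+.1f}%")
print(f"time/epoch:    {time_change:+.1f}%")
print(f"efficiency:    {eff_shared:.2f} vs {eff_independent:.2f} (ratio {eff_ratio:.2f}x)")
```

Under that assumption the shared-weight model is about 3.3 times as parameter-efficient as the independent-weight model, while the parameter and training-time reductions agree with the tabulated −66.7% and −13.6%.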


Share and Cite

MDPI and ACS Style

Fan, Y.; Peng, C.; Zhang, P.; Zhang, Z.; Zhang, G.; Tang, J. MVDCNN: A Multi-View Deep Convolutional Network with Feature Fusion for Robust Sonar Image Target Recognition. Remote Sens. 2026, 18, 76. https://doi.org/10.3390/rs18010076
