Abstract
Facing the challenge of scarce annotations in forward-looking sonar image segmentation, this paper proposes a teacher–student network for unsupervised domain adaptation (UDA). The student model first undergoes supervised learning on optical image data to acquire basic segmentation capabilities, while the segment anything model (SAM) serves as the teacher to generate pseudo-labels in the sonar image domain, thus achieving knowledge transfer without relying on annotated sonar images. An adaptive weighting approach is proposed, which generates a consistency map from the predictive consistency between the source and target domains to assess the quality of pseudo-labels. This method dynamically adjusts the supervision strength, preventing incorrect fitting caused by noisy pseudo-labels. In addition, a multi-scale attention module is designed to refine the bottleneck features of the U-Net. The effectiveness of the proposed method is validated on a self-built public forward-looking sonar image dataset, achieving a mean intersection over union (mIoU) of 40.8% and a mean average precision (mAP) of 70.3%, demonstrating significant improvements over existing typical UDA methods.
1. Introduction
Forward-looking sonars (FLSs) are widely used in various investigations of underwater environments, and foreground–background segmentation plays a fundamental role in underwater target detection and recognition systems. However, forward-looking sonar generates images through the reflection of acoustic waves, which are influenced by underwater acoustic propagation characteristics: scattering, reflection, and multi-path transmission introduce a significant amount of noise into sonar images []. Optical images contain rich information such as color and texture, and recognition methods for them are relatively well-established, but traditional image processing and computer vision methods exhibit performance degradation when applied to sonar images. In early research, most researchers focused on traditional image processing methods for contour detection in image segmentation, such as Canny [], Sobel [], Otsu [], and so on. While these methods achieved certain results in specific scenarios, they generally suffered from poor robustness and strong parameter dependence, making them sensitive to the complex noise in sonar images. Over the last two decades, the development of deep networks has brought new solutions to image segmentation []. Unlike traditional methods that rely on manually designed features, deep learning approaches automatically extract features from labeled samples and establish high-dimensional mapping relationships []. These methods demonstrate stronger generalization capabilities and have been widely used in sonar image segmentation []. Additionally, several studies have integrated traditional edge-based segmentation methods with deep learning approaches. These hybrid methods typically employ deep neural networks to estimate specific parameters of conventional algorithms, enabling dynamic adaptation to the input data. For instance, Ref. [] utilized a Deep Reinforcement Learning (DRL) model to enable adaptive threshold tuning in the Canny algorithm. Similarly, Ref. [] employed a YOLO network to detect potential obstacle regions in sonar images for threshold extraction, allowing the Otsu algorithm to dynamically adjust its threshold based on obstacle morphology.
However, the aforementioned methods still have shortcomings. First, learning-based methods require a large number of data samples to ensure diversity and avoid overfitting, without which a model performs well on the training set but struggles to generalize to unseen test data. Due to the high cost of sonar data collection, it is difficult to obtain sufficient data samples. Moreover, the low resolution and indistinct features of sonar images demand specialized expertise from human annotators, making the annotation process extremely challenging []. These facts pose significant challenges for applying traditional supervised deep learning methods to sonar image segmentation. In light of these challenges, improving model performance with limited labeled samples has become a key research focus in recent studies. Unsupervised learning is a crucial branch of machine learning that aims to discover inherent structures, patterns, or regularities in unlabeled data. Unlike supervised learning, it does not rely on manually annotated labels, making it particularly well-suited for exploring unknown data.
Some researchers are focusing on transfer learning, a method that leverages knowledge from source domains to improve performance in target domains. The core rationale lies in the fact that even though the target domain data lack labels, the general features and knowledge that models learn from large-scale, diverse source domains hold significant value for understanding the underlying structure of the target domain data. Unsupervised domain adaptation (UDA) is an important form of unsupervised learning. In the UDA framework, all labels of the source domain are known, while the target domain remains unlabeled. By minimizing the feature distribution discrepancy between the source and target domains, UDA transfers knowledge from the fully labeled source domain to the unlabeled target domain. General solutions to the UDA problem can be categorized into feature-space domain discrepancy minimization, adversarial learning-based domain confusion methods, and teacher–student model frameworks []. The teacher–student framework utilizes a relatively stable and high-performance teacher model to generate pseudo-labels for target domain samples, which are then used to guide the training of the student model in the target domain. Through this process, the student model progressively adapts to the feature distribution of the target domain, thereby accomplishing cross-domain knowledge transfer.
In this study, we cast the adaptation from supervised optical image segmentation to sonar image segmentation as a UDA problem and propose a teacher–student framework to address the challenge of segmentation without labeled sonar image samples. The framework employs the pre-trained segment anything model (SAM) [] as the teacher model and U-Net as the student model. We use a publicly available optical image dataset called SWIMSEG [], which provides foreground and background annotations, as the source domain to transfer segmentation knowledge to our self-collected unlabeled sonar images, thereby achieving effective cross-modal image segmentation. In the absence of true labels for the target domain, blindly trusting any single model for supervised training would lead to a rapid decline in accuracy, because both SAM's bias on sonar data and the student model's initially limited knowledge of the target domain introduce noise. To address this, our method constructs a consensus map through an intersection-like operation, enabling adaptive supervision by leveraging cross-domain knowledge consistency between the source and target domains. This more cautious training strategy adaptively regulates the supervisory weight of the pseudo-labels, thereby mitigating the negative impact of their inherent noise. The main contributions of this work are as follows:
- A teacher–student framework is proposed, in which the student model conducts supervised training in the source domain (optical images) to learn basic segmentation capabilities. Simultaneously, a pre-trained SAM network is employed as the teacher model to produce pseudo-labels in the target domain (sonar images), guiding the transfer learning process.
- A multi-scale attention module is designed to improve the performance of U-Net by emphasizing features across different receptive fields. This module enhances the model’s multi-scale representation by generating element-wise attention weights from the encoder features to refine the bottleneck features.
- An adaptive supervision weight adjustment method is proposed based on the consistency between pseudo-labels and student predictions. For each target-domain sample, the consistency between the teacher's and student's predictions is computed and represented as a pixel-wise consistency map, and the guidance intensity is dynamically adjusted based on this map.
The remainder of this paper is organized as follows: Section 2 presents representative works on U-Net-based sonar image segmentation, unsupervised domain adaptation techniques, and the teacher–student framework. Section 3 provides a brief overview of the proposed method and elaborates on the multi-scale attention module, the pseudo-label generation pipeline, and the consistency map calculation. Section 4 first introduces the datasets and experimental settings, and then presents the experimental results to validate the effectiveness of the method. Section 5 summarizes the entire work and discusses future research directions.
2. Related Works
2.1. U-Net Sonar Image Segmentation
Sonar image segmentation is an important task in underwater environment perception. Compared with optical images, sonar images present challenges such as low contrast, speckle artifacts, and blurred contours. U-Net [], proposed in 2015, is a representative deep neural network for image segmentation. It adopts a symmetric encoder–decoder structure: the encoder performs down-sampling to extract high-level semantic features while reducing spatial resolution, and the decoder restores spatial resolution via transposed convolutions. Skip connections enable feature fusion between low-level details and high-level semantics: since the encoder loses details and spatial information during pooling and the upsampling in the decoder has limited precision, skip connections directly transfer the low-level features containing rich detail and positional information from the encoder to the decoder, thereby improving the reconstruction accuracy of the decoder []. U-Net was originally developed for biomedical image segmentation tasks and has demonstrated strong performance on sonar image segmentation as well. Many researchers have improved U-Net to apply it to sonar image segmentation tasks [,,].
2.2. Unsupervised Domain Adaptation
The training of deep neural networks relies on accurate data labels to define the distribution characteristics of the data. However, in many cases, directly obtaining accurately labeled data samples is challenging. Using data with similar distributions to train the network can lead to a significant decline in model performance due to domain differences between the datasets. Unsupervised domain adaptation (UDA) is a transfer learning approach that aims to transfer knowledge from a labeled source domain to the unlabeled target domain. Typical UDA methods achieve adaptation by minimizing the distribution discrepancy between the source and target domains. Maximum mean discrepancy (MMD) [] achieves domain alignment by directly minimizing the mean discrepancy between source and target features in a reproducing kernel Hilbert space. The fundamental idea of MMD is that if two distributions are identical, then the means of all their mappings in the feature space should also be identical. Deep CORAL [] achieves domain adaptation by aligning the covariance matrices of source and target features, thereby reducing the distribution discrepancy between domains. The above methods achieve UDA by explicitly aligning statistical features between the source and target domains. Some researchers adopt adversarial training strategies to guide networks to produce domain-independent features. DANN [] applies a gradient reversal layer (GRL) between the domain discriminator and the feature extractor, which reverses the gradient during backpropagation. This encourages the feature extractor network to extract domain-invariant features. On this basis, DAAN [] introduces local class-wise discriminators to align the conditional distributions across domains for each semantic class. In addition, a dynamic weighting factor is incorporated to adaptively adjust the strength of adversarial learning.
Several researchers have also explored unsupervised domain adaptation in sonar imagery. For instance, Ref. [] employed a teacher–student-based model to transfer synthetic sonar shipwreck images to real environments, effectively overcoming the domain shift between simulated and real-world settings. Ref. [] formulated the domain shift between the training and test sets as an unsupervised domain adaptation problem and enhanced the performance of side-scan sonar in real-world environments through an adversarial learning framework. It is evident that unsupervised domain adaptation holds significant research value for sonar images, where labeled data are scarce.
2.3. Teacher–Student Model
In addition to the representative methods mentioned above, the teacher–student framework has also proven effective for unsupervised domain adaptation []. In this framework, the teacher model is typically a well-trained model that generates pseudo-labels to guide the student model’s learning process. The student model learns feature representations in the target domain based on the supervision signals provided by the teacher model []. Studies such as [,,] employ teacher models to generate pseudo-labels and thereby guide the student model in learning target domain representations. The teacher model leverages its own predictive capabilities to learn distributions from large amounts of unlabeled data, thereby assisting in supervised learning tasks and improving the model’s performance.
The Segment Anything Model (SAM) is pre-trained on a large-scale dataset and demonstrates strong generalization across diverse image domains. Using input points or bounding boxes as prompts, it can generate accurate segmentation results. With its powerful zero-shot segmentation capability, SAM is well-suited to serve as a teacher model for generating pseudo-labels; studies such as [,] employ SAM as a teacher model to achieve unsupervised domain adaptation. SAM is designed to operate in a category-agnostic manner, and its training on massive datasets gives it exceptional generalization capability. While it may not achieve optimal performance in highly specialized domains, its cross-domain knowledge can be effectively leveraged for transfer learning. In pseudo-label-based UDA, the quality of pseudo-labels is crucial for learning performance. However, as SAM's training dataset contains no sonar or radar data, its recognition capability for sonar images is inherently limited. Consequently, directly using SAM as a pseudo-label generator for sonar image segmentation would introduce substantial erroneous segmentation noise []. To address this issue, Ref. [] proposes a framework that leverages the pseudo-labels generated by SAM to train a refinement network, which in turn produces more reliable pseudo-labels in the target domain. Moreover, many studies have sought to improve pseudo-label quality [,,]. These methods mainly design confidence estimation strategies to select high-confidence pseudo-labels, thereby avoiding overfitting to noisy or unreliable labels.
3. Methodology
3.1. Overview
The problem of UDA refers to transferring knowledge learned by a model from a source domain to a target domain in the absence of annotated data in the target domain, enabling the model to adapt to the data distribution of the target domain and achieve satisfactory prediction performance. This section provides an overview of the proposed method. We introduce a teacher–student framework for unsupervised domain adaptation in sonar image segmentation. The method enables accurate obstacle segmentation in unlabeled sonar images by transferring knowledge from fully annotated optical images. The overall framework is illustrated in Figure 1. The proposed method involves a source domain dataset consisting of optical images with pixel-wise annotations, and a target domain dataset consisting of unlabeled high-resolution forward-looking sonar images. A pre-trained SAM network is employed as the teacher model to generate coarse pseudo-labels for the images in the target domain. Meanwhile, a U-Net enhanced with multi-scale attention is adopted as the student model, which is jointly trained on both the source and target domains.
Figure 1.
Overview of the proposed UDA framework.
In the training flow, as illustrated by the blue arrows in Figure 1, the target domain images are first enhanced through a series of preprocessing steps to generate prompts for the SAM network, which then produces the corresponding pseudo-labels. The source loss $\mathcal{L}_{src}$ is computed between the source-domain predictions and the labels, enabling the model to acquire basic predictive capabilities. Simultaneously, the pseudo-labels generated by the SAM teacher model guide the student model via pseudo-supervised learning []. The target loss $\mathcal{L}_{tgt}$ is calculated between the student model's predictions and the pseudo-labels on the target domain. The overall training objective is the weighted sum of these two losses.
On one hand, the strong supervision signal from the source domain helps correct noise in the pseudo-labels generated by SAM, preventing the student model from deviating due to inaccurate supervision. On the other hand, pseudo-labels allow the student model to directly learn from the target domain, mitigating domain shift caused by training solely on source data []. By jointly optimizing both losses, the model benefits from complementary learning signals: the student model acquires robust segmentation knowledge from the source domain while gradually transferring that knowledge to the target domain via pseudo-labels, thus achieving effective unsupervised cross-modal domain adaptation.
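To make this joint objective concrete, the following is a minimal PyTorch sketch of one training iteration. It assumes a `sam_teacher` wrapper that bundles SAM with the prompt-generation preprocessing of Section 3.3, and the loss helpers `dice_loss` and `consistency_bce` sketched in Section 3.4; all names are illustrative, not the released implementation.

```python
import torch

def train_step(student, sam_teacher, optimizer, src_batch, tgt_batch, lam=1.0):
    """One joint training iteration: source supervision plus weighted pseudo-supervision."""
    src_images, src_labels = src_batch   # labeled optical images (source domain)
    tgt_images = tgt_batch               # unlabeled sonar images (target domain)

    # Teacher: generate pseudo-labels for the target domain without gradients.
    with torch.no_grad():
        pseudo_labels = sam_teacher(tgt_images)

    # Student: predict foreground probabilities on both domains.
    src_pred = student(src_images)
    tgt_pred = student(tgt_images)

    # Overall objective: L = L_src + lambda * L_tgt (Section 3.4).
    loss = dice_loss(src_pred, src_labels) + lam * consistency_bce(tgt_pred, pseudo_labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```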
3.2. MSA U-Net
The standard U-Net simply fuses features between the encoder and decoder at the same scale through skip connections, rather than integrating features across multiple scales. Obstacles of diverse sizes in sonar images require receptive fields of different scales for effective feature extraction [], which motivates the incorporation of multi-scale attention mechanisms into the U-Net. The proposed network employs a novel multi-scale attention (MSA) module, enhancing the network's ability to perceive obstacles of different scales. The structure of the MSA U-Net is shown in Figure 2. The multi-scale attention module fuses the outputs of encoder layers at different scales to compute element-wise attention scores, which are then used to refine the features of the bottleneck layer.
Figure 2.
The structure of the MSA U-Net.
As shown in Figure 3, the multi-scale attention module consists of two stages: feature alignment and feature fusion. In the feature alignment stage, point-wise convolution layers are applied to unify the number of feature channels to a common value $C$, which is a tunable hyper-parameter. The point-wise convolution aligns the channel dimension without altering the spatial scale of the input features. Given $n$ input multi-scale features $\{F_1, F_2, \dots, F_n\}$, the point-wise convolution over the multi-scale features is expressed as follows:

$$F'_i = \sigma(W_i \ast F_i + b_i), \quad i = 1, 2, \dots, n$$

where $F'_i$ denotes the output for the $i$-th multi-scale feature, $W_i$ and $b_i$ are the weight and bias of the corresponding point-wise convolution, and $\sigma$ is the ReLU function. All feature channels are thus uniformly aligned to $C$. To achieve spatial dimension alignment, bilinear interpolation is applied for feature downsampling, where each output position is obtained as a weighted average of its four neighboring points:

$$\hat{F}_i(x, y) = \sum_{k=1}^{4} w_k \, F'_i(x_k, y_k)$$

where $\hat{F}_i$ denotes the aligned features, whose spatial scale is identical to that of the bottleneck, and $w_k$ is the weight used in the bilinear interpolation.
Figure 3.
The multi-scale attention module.
The aligned feature maps from different scales are concatenated along the channel dimension, as shown below:

$$F_{cat} = \mathrm{Concat}(\hat{F}_1, \hat{F}_2, \dots, \hat{F}_n)$$

where $F_{cat}$ denotes the result of feature fusion along the channel dimension. A 3 × 3 convolution kernel is applied to compress the channel dimension of the fused features to $C$, a point-wise layer is then used to align the number of channels with that of the bottleneck, and finally a sigmoid function produces the element-wise attention score:

$$A = \mathrm{Sigmoid}\big(W_p \ast \sigma(W_c \ast F_{cat} + b_c) + b_p\big)$$

where $A$ has the same spatial dimensions as the bottleneck feature map, $W_p$ and $b_p$ are the weight and bias of the convolutional layer used for channel alignment, $W_c$ and $b_c$ are the weight and bias of the convolution layer used for dimension compression, and $\sigma$ is the ReLU function. The attention score is multiplied with the bottleneck feature $F_b$ to achieve rectification:

$$\tilde{F}_b = A \odot F_b$$

where $\odot$ is the Hadamard product. The above method aligns the multi-scale features across the encoder stages of the U-Net and then performs feature fusion, enabling the network to focus on important regions that exhibit cross-scale consistency, thereby effectively enhancing the model's discriminative capability.
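The module above can be summarized in a few lines of PyTorch. The sketch below follows the two-stage description (channel alignment via 1 × 1 convolutions and spatial alignment via bilinear interpolation, then fusion and sigmoid scoring); the default aligned channel count and the constructor signature are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttention(nn.Module):
    """Multi-scale attention sketch: align encoder features in channel and
    spatial dimensions, fuse them, and rectify the bottleneck feature."""

    def __init__(self, encoder_channels, bottleneck_channels, aligned_channels=32):
        super().__init__()
        # Stage 1: point-wise (1x1) convolutions unify channel counts to C.
        self.align = nn.ModuleList(
            [nn.Conv2d(c, aligned_channels, kernel_size=1) for c in encoder_channels]
        )
        # Stage 2: a 3x3 conv compresses the concatenated features back to C,
        # then a point-wise layer matches the bottleneck channel count.
        self.compress = nn.Conv2d(aligned_channels * len(encoder_channels),
                                  aligned_channels, kernel_size=3, padding=1)
        self.project = nn.Conv2d(aligned_channels, bottleneck_channels, kernel_size=1)

    def forward(self, encoder_feats, bottleneck):
        h, w = bottleneck.shape[-2:]
        # Channel alignment (1x1 conv + ReLU), then spatial alignment (bilinear).
        aligned = [
            F.interpolate(F.relu(conv(f)), size=(h, w), mode="bilinear",
                          align_corners=False)
            for conv, f in zip(self.align, encoder_feats)
        ]
        fused = torch.cat(aligned, dim=1)      # channel concatenation
        score = torch.sigmoid(self.project(F.relu(self.compress(fused))))
        return bottleneck * score              # Hadamard rectification
```

For the 4-layer encoder used in the experiments (channels 16, 32, 64, and 128), a plausible instantiation would be `MultiScaleAttention([16, 32, 64, 128], bottleneck_channels=128)`; the actual bottleneck width is not specified in the text.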
3.3. Candidate Region Selection for SAM
SAM is an interactive segmentation model that takes as input both an image and corresponding prompts, such as bounding boxes or points, which specify the location and scope of the targets. A preprocessing pipeline is designed to produce accurate candidate boxes as prompts for the SAM network. First, the input image $I$ is enhanced using CLAHE (Contrast Limited Adaptive Histogram Equalization) [], and then binarized using a grayscale threshold filter: pixels with intensities greater than a given threshold $T$ are retained as the initial mask. Subsequently, the minimum bounding rectangles of all connected regions in the initial mask are computed, and those whose areas exceed a given minimum are selected as prompts. These prompts guide the SAM network to generate more focused and structurally complete pseudo-labels. Figure 4a shows the original sonar image, Figure 4b the CLAHE enhancement result, and Figure 4c the binarized sonar image filtered by the initial mask, in which the red boxes denote the prompts generated by the above method; Figure 4d displays the pseudo-labels generated by the SAM network using the corresponding prompts.
Figure 4.
Images of different stages of the pipeline in candidate region generation (The red boxes indicate the regions extracted by the pre-processing method to serve as the prompt input for the SAM model.). (a) Original sonar image. (b) Enhanced with CLAHE. (c) Binarized sonar image and prompt boxes for SAM. (d) Pseudo label from SAM.
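A minimal OpenCV sketch of this pipeline is given below. The threshold, CLAHE parameters, and minimum-area value are illustrative assumptions; the paper does not specify them.

```python
import cv2
import numpy as np

def generate_prompts(image, threshold=128, min_area=100,
                     clip_limit=2.0, tile_grid=(8, 8)):
    """Produce SAM box prompts from a single-channel uint8 sonar image."""
    # Step 1: CLAHE enhancement to boost local contrast.
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    enhanced = clahe.apply(image)

    # Step 2: grayscale threshold T -> initial binary mask.
    _, mask = cv2.threshold(enhanced, threshold, 255, cv2.THRESH_BINARY)

    # Step 3: bounding rectangles of connected regions, filtered by area.
    num, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = []
    for i in range(1, num):  # label 0 is the background
        x, y, w, h, area = stats[i]
        if area > min_area:
            boxes.append((x, y, x + w, y + h))  # SAM box prompt: (x1, y1, x2, y2)
    return np.array(boxes)
```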
CLAHE has shown good performance in image enhancement, effectively mitigating the noise in sonar images that often blurs object contours. By dividing the input image into multiple local tiles and applying histogram equalization to each, CLAHE effectively enhances local contrast; additionally, a contrast-limiting mechanism prevents over-amplification of noise in the sonar image. Assuming the input image is divided into $i \times j$ local tiles, the normalized cumulative distribution function of each tile is calculated as follows:

$$\mathrm{CDF}(k) = \frac{1}{N} \sum_{m=0}^{k} h(m)$$

where $h(m)$ is the clipped local gray-level histogram of the tile, $k$ is the gray level, and $N$ is the number of pixels in each tile. The output gray value of each pixel is then calculated as follows:

$$g_{out} = \mathrm{CDF}(g_{in}) \cdot (L - 1)$$

where $g_{in}$ is the input gray value and $L$ is the number of gray levels.
Finally, bilinear interpolation is used to smoothly fuse the edges of adjacent tiles and avoid discontinuities.
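As a worked illustration of the two equations above, the following NumPy sketch equalizes a single tile with histogram clipping. The clip fraction is an assumed value, and the final inter-tile bilinear blending (handled internally by `cv2.createCLAHE` in the earlier sketch) is omitted.

```python
import numpy as np

def tile_equalize(tile, clip_fraction=0.01, levels=256):
    """Clipped histogram equalization of one uint8 tile (illustrative)."""
    hist, _ = np.histogram(tile, bins=levels, range=(0, levels))
    # Contrast limiting: clip the histogram and redistribute the excess counts.
    limit = max(1, int(clip_fraction * tile.size))
    excess = int(np.sum(np.maximum(hist - limit, 0)))
    hist = np.minimum(hist, limit) + excess // levels
    # Normalized cumulative distribution function: CDF(k) = (1/N) * sum_m h(m).
    cdf = np.cumsum(hist) / np.sum(hist)
    # Output mapping: g_out = CDF(g_in) * (L - 1).
    return (cdf[tile] * (levels - 1)).astype(np.uint8)
```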
3.4. Consistency-Aware Joint Loss Function
A joint loss function is designed to transfer knowledge from the source domain to the target domain. This loss consists of two components: a supervised loss on the source domain, $\mathcal{L}_{src}$, and a pseudo-label loss on the target domain, $\mathcal{L}_{tgt}$. The computation of these two losses is illustrated in Figure 1. The overall loss function is the weighted sum of the two terms, as shown below:

$$\mathcal{L} = \mathcal{L}_{src} + \lambda \mathcal{L}_{tgt}$$

where $\lambda$ is a scaling factor for the teacher model's supervision. Dice loss is adopted for the source loss computation, as it effectively quantifies the overlap between the prediction and the ground truth. The dice loss is expressed as follows:

$$\mathcal{L}_{src} = 1 - \frac{2 \sum_{x} p_s(x) \, y_s(x) + \epsilon}{\sum_{x} p_s(x) + \sum_{x} y_s(x) + \epsilon}$$

where $p_s$ is the student model's prediction on the source domain, $y_s$ is the corresponding ground-truth label of the source-domain image, and $\epsilon$ is a small constant added to prevent division by zero. Since the pseudo-labels generated by the SAM network may contain wrong predictions that could mislead the student model, an adaptive weighting mechanism based on prediction consistency is designed to address this challenge. This mechanism dynamically adjusts the magnitude of $\mathcal{L}_{tgt}$ according to the consistency between the student model's predictions and the pseudo-labels, thereby effectively suppressing the noise in the pseudo-labels. The loss function formulated under the guidance of the prediction consistency map is expressed as follows:

$$\mathcal{L}_{tgt} = \frac{1}{|\Omega|} \sum_{x \in \Omega} M(x) \, \mathrm{BCE}\big(p_t(x), \hat{y}_t(x)\big)$$
where $p_t$ is the prediction of the student model on the target domain, $\hat{y}_t$ is the pseudo-label generated by the SAM network on the target domain, $\mathrm{BCE}$ is the binary cross-entropy loss function, $\Omega$ is the set of pixels, and $M \in [0, 1]$ is the consistency map between the student model's predictions and the pseudo-labels. The consistency map is constructed by aggregating the foreground and background prediction consistencies $M_{fg}$ and $M_{bg}$, as illustrated in Figure 5. Under the independence assumption of teacher–student predictions, the consistency map is defined as follows:

$$M = M_{fg} + M_{bg} = p_t \odot \hat{y}_t + (1 - p_t) \odot (1 - \hat{y}_t)$$
Figure 5.
Calculation of consistency map. (Red areas represent the highest values and blue areas the lowest.)
Since the student model learns basic segmentation knowledge from the source domain, it already possesses a preliminary prediction capability on the target domain images. A higher consistency coefficient indicates that the pseudo-label at a given pixel is more reliable, whereas a lower value reduces its influence during training. The introduction of the consistency map leverages the generalization ability of the student model to correct errors in the pseudo-labels generated by the teacher model. Meanwhile, it reinforces regions where the predictions of the teacher and student models align, adaptively adjusting the strength of supervision from the teacher model. This mechanism guides the network to learn from reliable pseudo-labels and effectively suppresses the negative impact of low-quality pseudo-labels on the training process.
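A compact PyTorch sketch of the two losses is given below, assuming sigmoid outputs from the student and binary SAM masks. Detaching the consistency map makes it act as a fixed per-pixel weight rather than a gradient path, which is an implementation choice we assume, as the paper does not state it.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Source-domain dice loss between predicted and ground-truth masks."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def consistency_bce(student_prob, pseudo_label):
    """Target-domain BCE weighted by the pixel-wise consistency map M."""
    pseudo_label = pseudo_label.float()
    # M = p * y_hat + (1 - p) * (1 - y_hat): agreement on foreground plus background.
    consistency = (student_prob * pseudo_label
                   + (1.0 - student_prob) * (1.0 - pseudo_label)).detach()
    bce = F.binary_cross_entropy(student_prob, pseudo_label, reduction="none")
    return (consistency * bce).mean()
```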
4. Experiments
4.1. Introduction of Datasets
Two datasets are required for the experiments: a source domain dataset and a target domain dataset. SWIMSEG [] is utilized as the source domain dataset, consisting of 1000 sky images captured by ground-based cameras. Each image is annotated with precise semantic segmentation labels, categorized into sky and cloud regions. Although SWIMSEG was originally developed for semantic segmentation of optical images, the spatial distribution and boundary characteristics of the "sky" and "cloud" regions are similar to those of the "background" and "obstacle" regions in forward-looking sonar images. The core task directly parallels the fundamental objective in sonar image analysis, which is to separate an often homogeneous water background from salient obstacle targets. Consequently, the ability the model learns from optical images, namely defining foreground objects by detecting local contrast variations, represents one of the core transferable features across modalities. In SWIMSEG, the distinction between clouds and sky primarily relies on local contrast (clouds appear brighter than the sky). Similarly, in sonar images, the separation of targets from the background largely depends on local contrast derived from differences in acoustic reflection intensity. This results in comparable spatial distribution and boundary characteristics. Representative samples from SWIMSEG and their corresponding labels are shown in Figure 6.
Figure 6.
Samples in SWIMSEG dataset.
The target domain data [] were collected by an autonomous underwater vehicle (AUV) equipped with an Oculus MD750d forward-looking sonar, manufactured by Blueprint Subsea. To ensure data diversity, collection experiments were conducted both in a harbor and a natural lake. All data were collected at a speed of 1 knot while the AUV cruised steadily. The dataset comprises 381 high-resolution forward-looking sonar images, with a split of 342 for training and 39 for testing. Each image has a resolution of 1300 × 800 pixels, and the obstacle regions in every image were meticulously annotated by human experts. It includes images containing structured obstacles with regular shapes and distinct contours, such as docks and vessels, as well as unstructured obstacles with blurred boundaries and irregular shapes, such as reefs and seabed slopes. This provides complex and challenging scenarios for model evaluation. The experimental data acquisition scenarios and collection equipment are illustrated in Figure 7.
Figure 7.
Data collection scenarios and equipment.
4.2. Experimental Setup and Evaluation
In this experiment, mean intersection over union (mIoU) and mean average precision (mAP) are used to comprehensively evaluate the model's performance. mIoU is a commonly used evaluation metric in semantic segmentation tasks. It is defined as the ratio of the intersection to the union of the predicted and ground truth regions for each class, averaged over all classes, and is expressed as follows:

$$\mathrm{mIoU} = \frac{1}{C} \sum_{i=1}^{C} \frac{TP_i}{TP_i + FP_i + FN_i}$$
where $TP_i$ is the number of true positives for class $i$, $FP_i$ is the number of false positives, $FN_i$ is the number of false negatives, and $C$ is the total number of classes. The mAP focuses more on the accuracy and robustness of predictions across different classes, and is particularly effective at highlighting performance differences under class imbalance or when object boundaries are ambiguous. The mAP is expressed as follows:

$$\mathrm{mAP} = \frac{1}{C} \sum_{i=1}^{C} \int_{0}^{1} P_i(R_i) \, dR_i$$
where $P_i$ is the precision of class $i$, and $R_i$ is the recall of class $i$. The mIoU emphasizes the spatial accuracy of the overlap between predictions and ground truth labels, whereas mAP captures the balance between precision and recall, thereby evaluating the model's classification performance across different classes. By combining mIoU and mAP, the segmentation model can be evaluated more comprehensively in terms of its reliability and stability in practical application scenarios.
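For reference, a simplified NumPy sketch of the two metrics for hard (argmax) predictions is shown below. Note that it approximates each class's AP by single-point precision rather than integrating a full precision–recall curve, which is a simplification of the definition above.

```python
import numpy as np

def evaluate(pred, gt, num_classes=2, eps=1e-9):
    """Hard-label mIoU and a single-point approximation of mAP."""
    ious, aps = [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        ious.append(tp / (tp + fp + fn + eps))  # IoU_c = TP / (TP + FP + FN)
        aps.append(tp / (tp + fp + eps))        # precision at one operating point
    return float(np.mean(ious)), float(np.mean(aps))
```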
To ensure a fair and comparable experimental environment, all models in this experiment are trained with a unified set of hyper-parameter configurations to eliminate performance deviations caused by optimization settings. The specific configurations are listed in Table 1. In addition, all models adopt the same data augmentation strategies and random seeds. The experiments are conducted on an NVIDIA RTX 3060 hardware platform, and all models are implemented in the PyTorch 2.3.0 framework.
Table 1.
Experimental hyper-parameter configuration.
4.3. Experiments on Pseudo-Label Generation
To verify the effectiveness of the proposed pseudo-label generation preprocessing strategy, a set of comparative experiments was conducted. The original sonar images are pre-processed by different methods and then fed into SAM to generate pseudo-labels. Figure 8 shows the experimental results: the first column displays the original sonar images, the second column the pseudo-labels generated by SAM without any prompt input, the third column the pseudo-labels produced by SAM from the original images without CLAHE enhancement, and the fourth column the results produced by the proposed pipeline. Table 2 measures the accuracy of the SAM-generated pseudo-labels against the ground truth labels in the target domain, evaluating configurations with and without prompt boxes and with and without CLAHE preprocessing.
Figure 8.
Pseudo-labels generated by SAM with different preprocessing methods. (White regions indicate obstacles).
Table 2.
Evaluating SAM-generated pseudo-label quality with different preprocessing methods.
The results indicate that SAM fails to produce accurate segmentation outputs without prompt boxes. In addition, applying CLAHE enhancement improves the local contrast of the images, enabling SAM to generate more accurate pseudo-labels.
4.4. Model Training
The proposed network is trained using the configuration described in Section 4.2. We analyze the weighting factor $\lambda$, which controls the supervision intensity of the teacher model, by running several training sessions with different values: $\lambda$ starts from 0 and is increased by 0.2 in each session. Note that when $\lambda = 0$, the model is trained solely on the source domain and directly validated on the target domain without any guidance from the teacher model. Figure 9 shows the model performance during the training process. As an upper-bound reference, we also trained our U-Net directly on the real annotations of the target domain; this fully supervised model achieved an mAP of 0.7454 and an mIoU of 0.4176 on the target domain, only slightly outperforming our proposed method, which indicates that our approach reaches performance close to that of supervised learning. The results of supervised training with the target domain's ground-truth annotations are represented by the red dashed line in Figure 9.
Figure 9.
Model performance at each epoch in training process. (a) mAP. (b) mIoU.
The results indicate that as $\lambda$ increases from 0 to 1.6, the model's mAP gradually improves, while mIoU first increases and then decreases, reaching its peak at an intermediate value of $\lambda$. In addition, when $\lambda = 0$, both mAP and mIoU are relatively low, suggesting that supervision from SAM pseudo-labels in the target domain significantly enhances the model's performance.
4.5. Comparative Experiments
To fairly evaluate the performance of the proposed method, a series of comparative experiments is conducted. All compared methods are built upon a U-Net of the same scale, and the supervised loss on the source domain is uniformly set to dice loss. In our experiments, we employ a U-Net architecture with a 4-layer encoder and a 4-layer decoder, with channel sizes of 16, 32, 64, and 128, respectively. Furthermore, owing to the symmetric structure of the U-Net architecture, the bottleneck features contain the highest-level semantic information; therefore, for all comparison methods, the domain adaptation loss is computed on features extracted from the U-Net bottleneck layer. For methods that directly align features between the source and target domains, such as BNM and Linear-MMD, alignment is performed on the bottleneck features. For adversarial transfer methods such as DANN and DAAN, although an additional adversarial sub-network is introduced, feature extraction is still conducted with the same U-Net architecture as in the other comparative methods. To eliminate the influence of optimization strategies on the experimental results, all methods adopt the same hyper-parameter settings as described in Section 4.2 during training. Table 3 presents the experimental results in terms of mIoU and mAP, and Figure 10 shows the best obstacle segmentation results achieved by each method under different values of $\lambda$.
Table 3.
Results of comparative experiment.
Figure 10.
Segmentation performance by different methods.
The experimental results indicate that adversarial unsupervised domain adaptation methods such as DANN and DAAN exhibit a degree of instability during training. This stems mainly from the adversarial game between the generator and the discriminator, which requires careful learning-rate tuning across modules to maintain their dynamic balance. In contrast, the Linear-MMD method does not depend on such a balance: its optimization objective is more straightforward and converges better, resulting in improved stability and transfer performance. Overall, the proposed method outperforms the other comparative methods in terms of both mIoU and mAP, demonstrating its effectiveness in sonar image segmentation tasks.
4.6. Ablation Experiments
A series of ablation studies are systematically conducted to validate the efficacy of both the multi-scale attention module and the consistency map-based weighting strategy. The experimental design comprises four distinct configurations: (1) Group A: Both the multi-scale attention module and consistency map weighting are applied; (2) Group B: Only the consistency map weighting is applied; (3) Group C: Only the multi-scale attention module is applied and (4) Group D: Neither the attention mechanism nor the consistency weighting is applied.
The results of the ablation experiments are shown in Figure 11, where the vertical axis represents the different experimental groups under varying values of $\lambda$; the left side of the horizontal axis shows mIoU, and the right side shows mAP. The baseline model, a traditional U-Net trained solely on the source domain, is indicated by the red dashed lines. Comparisons between Group A and Group B, as well as between Group C and Group D, indicate that the multi-scale attention module effectively improves the model's mAP. However, Group A shows a slight reduction in mIoU compared to Group B, which may be attributed to the additional parameters introduced by the attention mechanism, potentially leading to mild over-fitting. Meanwhile, comparisons between Group A and Group C, as well as between Group B and Group D, demonstrate that the consistency map significantly enhances both mIoU and mAP. Moreover, as $\lambda$ increases, both Group C and Group D experience a notable decline in mIoU and mAP, indicating that the pseudo-labels from SAM introduce negative transfer effects. In contrast, Group A and Group B, which incorporate the consistency map weighting, maintain high accuracy across all values of $\lambda$, demonstrating the robustness and effectiveness of the consistency-guided weighting strategy. This shows that the consistency map enhances training efficiency and model robustness by selectively incorporating high-consistency predictions: it suppresses noisy labels from the teacher model by leveraging the model's inherent discriminative capability, thereby preventing misleading signals from propagating during training. Since both the pseudo-labels generated by SAM and the target-domain predictions made by the student model from source-domain knowledge are inaccurate, relying solely on either prediction fails to provide reliable supervisory signals. To address this, the adaptive consistency-based weighting method employs an intersection-like operation to leverage only the consistent predictions between the two models: when the pseudo-labels from SAM align closely with the predictions of the student model, a higher weight is assigned; conversely, the weight of the pseudo-supervision is reduced wherever the predictions of the two models diverge.
Figure 11.
Result of ablation experiments.
5. Conclusions
This paper proposes an unsupervised domain adaptation method based on a teacher–student framework to address the challenges of limited samples and annotation difficulty in underwater sonar image segmentation. The method leverages the SAM to generate pseudo-labels for cross-domain transfer. First, CLAHE is applied to enhance image contrast, followed by threshold filtering and the generation of preliminary bounding boxes, which are used as prompts for the SAM to produce more accurate pseudo-labels. Subsequently, a multi-scale attention module is introduced to refine the bottleneck features of the U-Net, enhancing the model’s ability to extract semantic information across different scales.
Given the suboptimal accuracy of the pseudo-labels generated by the SAM, a consistency map between the source and target domain predictions is used to dynamically adjust the learning strength from the pseudo-labels. This mechanism effectively suppresses the negative impact of noisy labels by using the knowledge learned from supervised training on the source domain to guide the learning process in the target domain.
Although our proposed unsupervised domain adaptation solution achieves cross-modal transfer, it still relies on a high distributional similarity between the source and target domain data. While consistency-based adaptive supervision can effectively suppress prediction noise caused by domain shifts, the consensus map still struggles to isolate reliable supervision when the domain shift is too substantial. Furthermore, the inherent inadequacy of SAM in adapting to sonar images exacerbates the generation of noisy pseudo-labels, thereby misleading the student model. To address these issues, future work will focus on introducing an intermediate domain within a progressive framework to reduce the domain shift prior to training, enabling the model to smoothly transfer knowledge from the source to the target domain in a phased learning manner.
Author Contributions
Conceptualization, S.G., G.X. and W.G.; methodology, S.G. and B.L.; software, S.G.; validation, S.G., G.X. and B.L.; formal analysis, S.G. and B.L.; investigation, G.X.; resources, W.G. and G.X.; data curation, G.X. and S.G.; writing—original draft preparation, S.G.; writing—review and editing, W.G. and B.L.; visualization, S.G.; supervision, W.G.; project administration, W.G. and G.X.; funding acquisition, W.G. and G.X. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Youth Innovation Promotion Association of the Chinese Academy of Sciences (grant number: 2023386); the National Key Research and Development Program of China (grant number: 2023YFC2810100, 2020YFC1521704); the Hainan Province Science and Technology Special Fund (grant number: DSTIC-CYCJ-2025007).
Data Availability Statement
The data presented in this study are available on request from the corresponding author due to research reasons.
Acknowledgments
The authors thank other members of the research group for their participation in the experiment of underwater sonar dataset construction. The authors would like to thank the reviewers for their careful work.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Vishwakarma, A. Denoising and Inpainting of Sonar Images Using Convolutional Sparse Representation. IEEE Trans. Instrum. Meas. 2023, 72, 1–9. [Google Scholar] [CrossRef]
- Canny, J. A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 1986, 6, 679–698. [Google Scholar] [CrossRef]
- Sobel, I.; Feldman, G. A 3×3 Isotropic Gradient Operator for Image Processing. Stanf. Artif. Proj. 1968, 271–272. [Google Scholar]
- Otsu, N. A Threshold Selection Method from Gray-Level Histograms. Automatica 1975, 11, 23–27. [Google Scholar] [CrossRef]
- Tian, Y.; Lan, L.; Guo, H. A Review on the Wavelet Methods for Sonar Image Segmentation. Int. J. Adv. Robot. Syst. 2020, 17, 1729881420936091. [Google Scholar] [CrossRef]
- Chai, Y.; Yu, H.; Xu, L.; Li, D.; Chen, Y. Deep Learning Algorithms for Sonar Imagery Analysis and Its Application in Aquaculture: A Review. IEEE Sens. J. 2023, 23, 28549–28563. [Google Scholar] [CrossRef]
- Yu, S. Sonar Image Target Detection Based on Deep Learning. Math. Probl. Eng. 2022, 2022, 5294151. [Google Scholar] [CrossRef]
- Choi, K.-H.; Ha, J.-E. An Adaptive Threshold for the Canny Algorithm with Deep Reinforcement Learning. IEEE Access 2021, 9, 156846–156856. [Google Scholar] [CrossRef]
- Cao, X.; Ren, L.; Sun, C. Research on Obstacle Detection and Avoidance of Autonomous Underwater Vehicle Based on Forward-Looking Sonar. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 9198–9208. [Google Scholar] [CrossRef]
- Khan, R.; Mehmood, A.; Akbar, S.; Zheng, Z. Underwater Image Enhancement with an Adaptive Self Supervised Network. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1355–1360. [Google Scholar]
- Toldo, M.; Maracani, A.; Michieli, U.; Zanuttigh, P. Unsupervised Domain Adaptation in Semantic Segmentation: A Review. Technologies 2020, 8, 35. [Google Scholar] [CrossRef]
- Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.-Y.; et al. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
- Dev, S.; Lee, Y.H.; Winkler, S. Color-Based Segmentation of Sky/Cloud Images From Ground-Based Cameras. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 10, 231–242. [Google Scholar] [CrossRef]
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
- Tran, S.-T.; Cheng, C.-H.; Nguyen, T.-T.; Le, M.-H.; Liu, D.-G. TMD-Unet: Triple-Unet with Multi-Scale Input Features and Dense Skip Connection for Medical Image Segmentation. Healthcare 2021, 9, 54. [Google Scholar] [CrossRef]
- Sun, Y.; Zheng, H.; Zhang, G.; Ren, J.; Shu, G. CGF-Unet: Semantic Segmentation of Sidescan Sonar Based on Unet Combined with Global Features. IEEE J. Ocean. Eng. 2024, 49, 963–975. [Google Scholar] [CrossRef]
- Sun, Y.-C.; Gerg, I.D.; Monga, V. Iterative, Deep Synthetic Aperture Sonar Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
- He, J.; Chen, J.; Xu, H.; Yu, Y. SonarNet: Hybrid CNN-Transformer-HOG Framework and Multifeature Fusion Mechanism for Forward-Looking Sonar Image Segmentation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–17. [Google Scholar] [CrossRef]
- Long, M.; Cao, Y.; Wang, J.; Jordan, M.I. Learning Transferable Features with Deep Adaptation Networks. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 97–105. [Google Scholar]
- Sun, B.; Saenko, K. Deep CORAL: Correlation Alignment for Deep Domain Adaptation. In Proceedings of the Computer Vision—ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; pp. 443–450. [Google Scholar]
- Ganin, Y.; Lempitsky, V. Unsupervised Domain Adaptation by Backpropagation. In Proceedings of the International Conference on Machine Learning, Lille, France, 7–9 July 2015; pp. 1180–1189. [Google Scholar]
- Yu, C.; Wang, J.; Chen, Y.; Huang, M. Transfer Learning with Dynamic Adversarial Adaptation Network. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 778–786. [Google Scholar]
- Sethuraman, A.V.; Skinner, K.A. STARS: Zero-Shot Sim-to-Real Transfer for Segmentation of Shipwrecks in Sonar Imagery. arXiv 2023, arXiv:2310.01667. [Google Scholar]
- Wang, Q.; Zhang, Y.; He, B. Automatic Seabed Target Segmentation of AUV via Multilevel Adversarial Network and Marginal Distribution Adaptation. IEEE Trans. Ind. Electron. 2023, 71, 749–759. [Google Scholar] [CrossRef]
- Li, J.; Seltzer, M.L.; Wang, X.; Zhao, R.; Gong, Y. Large-Scale Domain Adaptation via Teacher-Student Learning. arXiv 2017, arXiv:1708.05466. [Google Scholar]
- Li, W.; Fan, K.; Yang, H. Teacher–Student Mutual Learning for Efficient Source-Free Unsupervised Domain Adaptation. Knowl.-Based Syst. 2023, 261, 110204. [Google Scholar] [CrossRef]
- Zhang, H.; Tang, J.; Cao, Y.; Chen, Y.; Wang, Y.; Wu, Q.J. Cycle Consistency Based Pseudo Label and Fine Alignment for Unsupervised Domain Adaptation. IEEE Trans. Multimed. 2022, 25, 8051–8063. [Google Scholar] [CrossRef]
- Zhao, X.; Mithun, N.C.; Rajvanshi, A.; Chiu, H.-P.; Samarasekera, S. Unsupervised Domain Adaptation for Semantic Segmentation with Pseudo Label Self-Refinement. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 1–6 January 2024; pp. 2399–2409. [Google Scholar]
- Deng, Z.; Luo, Y.; Zhu, J. Cluster Alignment with a Teacher for Unsupervised Domain Adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9944–9953. [Google Scholar]
- Wang, Z.; Zhang, Y.; Zhang, Z.; Jiang, Z.; Yu, Y.; Li, L.; Li, L. Exploring Semantic Prompts in the Segment Anything Model for Domain Adaptation. Remote Sens. 2024, 16, 758. [Google Scholar] [CrossRef]
- Yan, W.; Qian, Y.; Zhuang, H.; Wang, C.; Yang, M. Sam4udass: When SAM Meets Unsupervised Domain Adaptive Semantic Segmentation in Intelligent Vehicles. IEEE Trans. Intell. Veh. 2023, 9, 3396–3408. [Google Scholar] [CrossRef]
- Wang, F.; Zhao, L.; Hong, S.; Wang, Z.; Liu, C.; Gao, C.; Li, J.; Li, X.; Luo, D. Dual-Domain Teacher for Unsupervised Domain Adaptation Detection. IEEE Trans. Multimed. 2025, 27, 4217–4226. [Google Scholar] [CrossRef]
- Stenger, A.; Baudrier, É.; Naegel, B.; Passat, N. RESAMPL-UDA: Leveraging foundation models for unsupervised domain adaptation in biomedical images. Pattern Recognit. Lett. 2025, 196, 221–227. [Google Scholar] [CrossRef]
- Chen, H.; Li, L.; Chen, J.; Lin, K.-Y. Unsupervised Domain Adaptation via Double Classifiers Based on High Confidence Pseudo Label. arXiv 2021, arXiv:2105.04729. [Google Scholar] [CrossRef]
- Yang, R.; Tian, T.; Tian, J. Versatile Teacher: A Class-Aware Teacher–Student Framework for Cross-Domain Adaptation. Pattern Recognit. 2025, 158, 111024. [Google Scholar] [CrossRef]
- Wang, Q.; Breckon, T. Unsupervised Domain Adaptation via Structured Prediction Based Selective Pseudo-Labeling. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 6243–6250. [Google Scholar]
- Zhang, B.; Wang, Y.; Hou, W.; Wu, H.; Wang, J.; Okumura, M.; Shinozaki, T. FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling. Adv. Neural Inf. Process. Syst. 2021, 34, 18408–18419. [Google Scholar]
- Lee, D.-H. Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks. In Proceedings of the Workshop on Challenges in Representation Learning, ICML, Atlanta, GA, USA, 16–21 June 2013; Volume 3, p. 896. [Google Scholar]
- Yang, D.; Cheng, C.; Wang, C.; Pan, G.; Zhang, F. Side-Scan Sonar Image Segmentation Based on Multi-Channel CNN for AUV Navigation. Front. Neurorobotics 2022, 16, 928206. [Google Scholar] [CrossRef]
- Zuiderveld, K. Contrast Limited Adaptive Histogram Equalization. In Graphics Gems; Academic Press: San Diego, CA, USA, 1994; pp. 474–485. [Google Scholar]
- Gao, S.; Guo, W.; Xu, G.; Liu, B.; Sun, Y.; Yuan, B. A Lightweight YOLO Network Using Temporal Features for High-Resolution Sonar Segmentation. Front. Mar. Sci. 2025, 12, 1581794. [Google Scholar] [CrossRef]
- Ghifary, M.; Kleijn, W.B.; Zhang, M. Domain Adaptive Neural Networks for Object Recognition; Springer International Publishing: Berlin/Heidelberg, Germany, 2014. [Google Scholar]
- Cui, S.; Wang, S.; Zhuo, J.; Li, L.; Huang, Q.; Tian, Q. Towards Discriminability and Diversity: Batch Nuclear-Norm Maximization under Label Insufficient Situations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).