Article

SAM-Enhanced Cross-Domain Framework for Semantic Segmentation: Addressing Edge Detection and Minor Class Recognition

by Qian Wan, Hongbo Su, Xiyu Liu, Yu Yu and Zhongzhen Lin *
School of Art and Design, Guangdong University of Technology, Guangzhou 510080, China
* Author to whom correspondence should be addressed.
Processes 2025, 13(3), 736; https://doi.org/10.3390/pr13030736
Submission received: 3 December 2024 / Revised: 9 February 2025 / Accepted: 26 February 2025 / Published: 3 March 2025

Abstract

Unsupervised domain adaptation (UDA) enables a model trained on labeled source data to perform well in an unlabeled target domain, which is especially valuable in vision-based semantic segmentation. However, existing UDA methods often struggle with accurate semantic labeling at object boundaries and with recognizing minor categories in the target domain. This paper introduces a novel UDA framework—SamDA—that incorporates the Segment Anything Model (SAM), a large-scale foundational vision model, as the mask generator to enhance edge segmentation performance. The framework comprises three core modules: a cross-domain image mixing module; a self-training module built on a teacher–student network with exponential moving average (EMA) updates; and a finetuning module that leverages SAM-generated masks for pseudo-label matching. Evaluations on the GTA5 and Cityscapes datasets demonstrate that SamDA achieves a mean IoU (mIoU) of 75.2, surpassing state-of-the-art methods such as MIC-DAFormer by 1.0 mIoU and outperforming all ResNet-based approaches by at least 15 mIoU. Moreover, SamDA significantly enhances the segmentation of small objects such as bicycles, riders, and fences, with respective IoU improvements of 4.5, 5.2, and 3.8 over baseline models.

1. Introduction

In computer vision, unsupervised domain adaptation (UDA) aims to bridge the gap between the source and target domains, enabling models trained on abundantly annotated source data to generalize effectively to a target domain whose data have no labels. This is particularly useful when the source domain can generate synthetic data with known labels, while the target domain is a real-world application in which human-annotated labels are costly to obtain [1]. One inherent challenge in making UDA effective is overcoming the potential domain gaps [2] between the source and target datasets, which can arise from several kinds of disparity, such as different class distributions, statistical properties, and object appearances. Various techniques, including style transfer (i.e., enhancing source samples to be more similar to the target samples) [3], data augmentation (i.e., adding noise to samples to make learning more robust) [4], and image mixing (i.e., superimposing source and target images to create new samples) [5], can be applied to bridge these domain gaps.
Existing deep learning-based methods for UDA semantic segmentation can be categorized from two perspectives. From the perspective of learning frameworks, there are domain adversarial learning methods [3,6,7,8,9] and self-training methods [1,2,10,11]. From the perspective of backbone models, there are CNN-based methods [5,12,13] and transformer-based methods [1,14]. Despite these advancements, existing methods still struggle with two critical challenges in UDA-based semantic segmentation.
First, UDA models suffer from subpar performance in predicting semantic labels for edge pixels, primarily due to domain shifts leading to misclassification along object boundaries [2,15,16]. Prior research has shown that boundary regions are particularly vulnerable to label noise, and pseudo-labels in these areas tend to have low confidence scores, further exacerbating the problem [8,17]. Second, UDA models often fail to accurately infer semantic labels for minority classes—semantic objects that occupy only a small portion of the dataset—due to severe class imbalance issues. Recent studies have demonstrated that conventional self-training methods struggle to propagate reliable pseudo-labels for minority classes, resulting in an inherent bias toward majority classes [5,18].
To address these challenges, we integrated the Segment Anything Model (SAM) into our framework. SAM provides high-quality object segmentation using prompt-based learning, offering superior boundary delineation compared to conventional UDA pseudo-labeling methods [19,20,21]. Unlike standard UDA approaches that rely on heuristic-based pseudo-labeling, SAM is trained on a large-scale dataset and generalizes well across domains, making it a promising candidate for domain adaptation tasks. However, SAM-generated masks exhibit two key limitations when directly applied to UDA tasks: excessive granularity, leading to fragmented object masks; and the lack of semantic labels, necessitating an additional refinement step to align them with the semantic categories of the target domain [22,23].
Additionally, we propose an image mixing module to mitigate sample imbalance issues by generating new samples that combine an image from the source domain with minority semantic labels and an image from the target domain. Unlike prior work that primarily employs mix-up strategies to improve generalization, our image mixing approach explicitly targets minority class enhancement by ensuring that rare semantic categories appear more frequently in training batches [10,24,25]. This mitigates the bias introduced by pseudo-labeling and enables better learning for under-represented semantic classes.
We utilize SAM to generate high-quality pseudo-labels within an iterative self-training framework using a teacher–student architecture with image mixing. To effectively integrate SAM into our proposed UDA framework, we addressed the two challenges mentioned above: excessive granularity in generated masks and the lack of semantic labels.
Our contributions in this work can be summarized as follows:
  • We propose a novel UDA framework for image semantic segmentation that effectively integrates SAM as the mask generator to mitigate the common weakness in accurately identifying semantic labels for the edge pixels of objects.
  • We designed an image mixing module to counter the minority class imbalance challenge by generating new samples that are synthesized using images from source and target domains.
  • We demonstrate improved semantic segmentation performance with our SamDA framework on two public benchmark datasets: GTA5 (source domain) and Cityscapes (target domain).
The rest of this paper is organized as follows: Section 2 reviews related work, including UDA-based semantic segmentation, image mixing, and SAM applications. Section 3 presents the proposed SamDA framework, including the self-training, mixing, and finetune modules. Section 4 describes the datasets, experimental setup, and evaluation metrics; Section 5 presents the experimental results and ablation analysis; Section 6 discusses failure cases and limitations; and Section 7 concludes the study.

2. Related Work

2.1. Semantic Segmentation Using Unsupervised Domain Adaptation

Deep learning methods on semantic segmentation using UDA can be broadly categorized into adversarial learning and self-training methods.
Adversarial Learning techniques for domain adaptive semantic segmentation were first introduced by Hoffman et al. [6]. Specifically, Generative Adversarial Networks (GANs) were introduced into domain adaptation tasks to reduce the domain gaps between the source and the target domains through adversarial training with a single feature extractor. Tsai et al. proposed the AdaptSegNet method, in which the output space was aligned through adversarial training, and multi-scale feature extractors were utilized to capture features at different granularities [8]. Hoffman et al. later introduced the cycle-consistent adversarial domain adaptation method CyCADA, in which cycle consistency loss was incorporated to transform the source domain images into the target domain style while maintaining semantic consistency [3]. In this framework, the feature extractor was required to not only extract deep features, but to also ensure feature consistency and integrity during image transformation.
Despite these efforts, adversarial training still exhibited limitations. For example, domain discriminators often struggle with edge pixels, leading to inaccurate segmentation boundaries as they treat these regions as noise rather than meaningful features [15,16]. Additionally, adversarial training methods tend to favor dominant classes in the source domain, leading to a poor recognition of minority classes [5,18]. Recent research has explored hybrid strategies, such as combining adversarial learning with self-training to improve class distribution alignment, but these approaches still require further refinement to handle boundary pixels and under-represented categories [1,26]. In recent years, researchers have introduced attention mechanisms (such as the DISE method [27]) and more complex network architectures to enhance the performance of feature extractors [1,14]. Currently, the mainstream approach involves the latter, leveraging multi-scale feature fusion and self-supervised learning to enhance the performance in UDA.
The Self-training paradigm presents significant improvements over adversarial learning in domain adaptive semantic segmentation tasks. As a semi-supervised learning method, self-training uses pseudo-labels to augment training data for performance improvement. Current state-of-the-art models in semantic segmentation using UDA are primarily self-training methods. Zou et al. proposed the CBST self-training method to generate high-quality pseudo-labels that gradually improve model performance in the target domain [2]. The method involves training an initial model on the source domain, generating pseudo-labels on the target domain data, and retraining the model using these pseudo-labels. It also balances the number of pseudo-labels for each class to ensure a balanced sample representation during training, enhancing the model’s generalizability and performance in the target domain. Li et al. proposed the BDL method, which further enhances pseudo-label quality and feature extractor domain adaptability through a bidirectional learning strategy [26]. Xu et al. used self-training to generate pseudo-labels for iterative training to improve the feature extractor’s domain adaptability [28]. Xie and co-authors’ Noisy Student approach combines self-training with aggressive noise injection (e.g., data augmentation, dropout, and stochastic depth) for image classification tasks, but it also has significant implications for domain adaptive semantic segmentation [29]. Pereira et al. introduced attention mechanisms into self-training, allowing the feature extractor to focus on important areas in images for enhanced model effectiveness [30].
The above methods significantly enhance model efficiency but may not fully address the quality issue in the generated pseudo-labels in UDA due to domain gaps. Specifically, pseudo-labeling approaches often struggle with edge pixels, as model uncertainty in these regions leads to inconsistent predictions [8,15]. Furthermore, pseudo-labels for minority classes are frequently incorrect due to class imbalance, which results in biased optimization toward majority classes [5,18]. To mitigate these issues, recent methods have introduced uncertainty-aware pseudo-labeling and entropy minimization [31,32], but their effectiveness remains limited when applied to fine-grained object boundaries.

2.2. Image Mixing

Many existing methods in UDA focus on improving pseudo-label quality. Hoyer et al. proposed the masked image consistency (MIC) module to enhance context relationships, thereby improving pseudo-label quality [14]. The core idea is to randomly mask parts of images during training, generate pseudo-labels using the unmasked parts, and use these pseudo-labels to guide the restoration of the masked parts. Tranheden et al. introduced the DACS algorithm, which mixes images from two different domains to create new, highly perturbed samples [5]. By mixing images, DACS enriched the training set to enhance model generalizability, mitigating pseudo-label error propagation, while improving target domain pseudo-label quality and model stability. DACS opened new directions for domain adaptive semantic segmentation, with cross-domain mixed sampling and label mixing methods applied in various approaches. For instance, DAFormer [1] was developed based on DACS using a Transformer-based architecture and multi-level context-aware modules to enhance domain adaptive semantic segmentation performance. DAFormer improved network architecture using Transformer and training strategies (e.g., progressive freezing, mixed labels, and pseudo-label filtering) to further enhance model performance. In the current work, we employed a novel mixing strategy by combining images from two domains to generate new samples for minority classes.

2.3. Segment Anything Model

The Segment Anything Model (SAM) [19] is a large vision foundation model released by Meta AI in 2023, and it has achieved remarkable results in segmentation tasks, especially zero-shot segmentation. Its generalizability as a mask generator across different domains can greatly boost semantic segmentation using UDA, and it can be leveraged to address the inaccurate identification of object borders. However, effective integration of SAM in semantic segmentation UDA is not without challenges, including lack of specialization, high training costs, and, more critically, excessive granularity in the generated masks and absence of semantic information.
Unlike conventional UDA methods that rely on self-training with heuristic-based pseudo-labeling, SAM provides high-quality mask generation through prompt-based segmentation, allowing for improved edge delineation and object boundary refinement [20,21]. Recent studies have explored how foundation models like SAM can be leveraged in domain adaptation. For example, SAM4UDASS [10] employs semantic-guided mask labeling to improve pseudo-label quality for rare classes, demonstrating significant improvements in recognizing minority categories. Similarly, SAM-EDA [33] integrates SAM into a mean-teacher framework, where the teacher assistant generates pseudo-labels based on SAM’s outputs to guide the student model’s learning process. Despite these advancements, integrating SAM into UDA remains an open problem, particularly in handling mask granularity and semantic consistency.
To address SAM’s lack of specialization, Chen et al. proposed a lightweight adapter module that adaptively adjusts the feature maps obtained from SAM’s encoder, enhancing its capabilities in underwater, shadow, and camouflaged object segmentation [34]. Xu et al. proposed a training-free automatic prompt generation method that integrates SAM for medical image segmentation tasks [35]. To address SAM’s high training cost, Ke et al. proposed a higher-quality segmentation model based on SAM that improves the segmentation of detailed structures [36]; to retain SAM’s zero-shot transfer capability while reducing training complexity, SAM’s original image encoder, prompt encoder, and mask decoder were retained with their parameters frozen. Zhao et al. proposed a fast SAM model for real-time image segmentation [37].
Despite these optimizations, SAM’s excessive mask granularity remains a critical challenge for UDA applications. Since SAM was designed to segment any object, it frequently produces over-fragmented masks, making it difficult to assign consistent semantic labels to each region [22,23]. Additionally, SAM lacks inherent semantic label information, requiring external classifiers or refinement modules to align its outputs with UDA-specific segmentation tasks. Existing studies have attempted to mitigate this issue by integrating SAM with semantic segmentation models [38], but these solutions often introduce additional computational overhead.
In summary, while SAM presents significant advantages for UDA, its direct application remains underexplored. Our work aims to bridge this gap by combining SAM’s high-quality mask generation capabilities with an improved pseudo-labeling framework that enhances edge segmentation and minority class recognition.

3. Methodology

3.1. Problem Formulation

Let us denote the labeled and unlabeled images from the source and target domains as $X_s = \{(x_s^i, y_s^i)\}_{i=1}^{n}$ and $X_t = \{x_t^j\}_{j=1}^{m}$, respectively, with $y_s^i \in C_s = \{1, 2, \ldots, r\}$. In the context of UDA, images $X_s$ and $X_t$ share identical categories but follow distinct underlying distributions. In the task of semantic segmentation, given the source dataset, our goal is to obtain a pixel-wise assignment of semantic categories to each unlabeled pixel in the target images.

3.2. Self-Training

We can train a student model $M^S$ using the available source domain images and labels by supervised learning with the cross-entropy loss $L_{ce}^S$. For the $i$-th source sample, $L_{ce}^S$ can be written as follows:
$$L_{ce}^{(i)} = -\sum_{j=1}^{H \times W} \sum_{k=1}^{K} y_s^{(i,j,k)} \log M^S(x_s^{(i)})^{(j,k)}.$$
However, due to the domain gap, a model trained only on source domain images with the $L_{ce}^S$ loss is unlikely to generalize well to the target domain. To address the domain gap, we adopted a self-training-based UDA technique as our baseline [1,2,5,11,39]. In self-training, a teacher model $M^T$ is used to generate pseudo-labels $\hat{y}_t^i$ for the target domain images. The target domain data with pseudo-labels and the labeled source domain data are then used together to iteratively train the student model $M^S$ to adapt to the target domain.
The teacher model weights at time step $t$ depend on all previous updates of the student network (where $M_t^S$ denotes the student network parameters at time step $t$):
$$M_t^T = k \sum_{i=1}^{t} (1-k)^{i-1} M_{t-i+1}^S.$$
Here, $k$ is a hyperparameter (0.99) used to control the degree of model weight change. The pseudo-labels $\hat{y}_t^i$ for the target domain images are generated by the teacher model $M^T$ through a per-pixel argmax over the predicted class scores:
$$\hat{y}_t^{(i,j)} = \arg\max_{k'} M^T(x_t^i)^{(j,k')}.$$
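To make these steps concrete, the sketch below shows a mean-teacher-style EMA update and pseudo-label generation as commonly implemented in self-training UDA pipelines such as DACS and DAFormer. The function names are illustrative, and the convention of applying the coefficient to the previous teacher weights is an assumption rather than the authors’ exact formulation.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.99):
    """Blend the previous teacher weights with the current student weights (EMA update)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.data.mul_(momentum).add_(s_param.data, alpha=1.0 - momentum)

@torch.no_grad()
def generate_pseudo_labels(teacher, x_t):
    """Per-pixel argmax over the teacher's class logits for a target-domain batch."""
    logits = teacher(x_t)            # shape: (B, K, H, W)
    return logits.argmax(dim=1)      # shape: (B, H, W), one class index per pixel
```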

3.3. Mixing Module

To improve the model’s ability to learn rare-class information from the target domain, we propose a cross-domain mixing strategy inspired by DACS [5]. The goal is to construct mixed images across domains in which both the source and target domains contribute pixel-level data and labels, prioritizing rare classes and the accuracy of target domain pseudo-labels (especially at boundaries).
Figure 1 illustrates our mixing process, where rare classes are prioritized in sampling. This technique effectively mitigates class imbalance issues, ensuring that minority classes are adequately represented during model training.
In our mixing strategy, classes are first divided into rare classes $R$ and common classes $C$. Let $n_1$ and $n_2$ be the numbers of rare and common classes, respectively, selected from the source domain images to be cropped into the mixed image. We define $n_1 \sim B(|R|, p_1)$ and $n_2 \sim B(|C|, p_2)$, where $|R|$ and $|C|$ denote the total numbers of rare and common classes, respectively, and $B$ is the binomial distribution. We further constrain $0 \le n_1 \le \alpha_1$ and $0 \le n_2 \le \alpha_2$, where $\alpha_1$ and $\alpha_2$ are hyperparameters.
To prioritize rare classes, we set $p_1 > p_2$, giving rare classes a higher probability of being selected. By adjusting the values of $\alpha_1$ and $\alpha_2$, we ensured that the mixed image contains sufficient information about rare classes without introducing excessive noise that may hinder the model’s learning. The final number of classes selected for the mixed image is $n_1 + n_2$, with at most seven rare classes and three common classes.
The mixed image and its corresponding labels were sourced from both the source and target domains. This cross-domain fusion introduces new data and prevents the model from learning to distinguish between the two domains based on dataset differences (e.g., quality and style) when performing the semantic segmentation task [40,41]. The implementation of the mixing module is shown in Algorithm 1.
Algorithm 1: Implementation of the mixing module to generate mixed images from both the source and target domains for training student models iteratively.
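As a concrete illustration of the sampling and pasting steps described above, the following Python sketch draws class counts binomially, caps them at α1 and α2, and pastes the selected source-class pixels onto the target image in a DACS-style manner. The function names and the use of NumPy are illustrative choices, not the authors’ implementation.

```python
import numpy as np

def sample_mix_classes(rare_classes, common_classes, p1=0.7, p2=0.3,
                       alpha1=7, alpha2=3, rng=None):
    """Select rare and common source-domain classes to paste into the mixed image."""
    rng = rng or np.random.default_rng()
    n1 = min(rng.binomial(len(rare_classes), p1), alpha1)    # n1 ~ B(|R|, p1), capped at alpha1
    n2 = min(rng.binomial(len(common_classes), p2), alpha2)  # n2 ~ B(|C|, p2), capped at alpha2
    chosen_rare = rng.choice(rare_classes, size=n1, replace=False)
    chosen_common = rng.choice(common_classes, size=n2, replace=False)
    return list(chosen_rare) + list(chosen_common)

def mix_images(x_s, y_s, x_t, y_t_pseudo, selected_classes):
    """Paste pixels of the selected source classes onto the target image (DACS-style)."""
    paste = np.isin(y_s, selected_classes)        # (H, W) boolean paste mask from source labels
    x_m = np.where(paste[..., None], x_s, x_t)    # mixed image
    y_m = np.where(paste, y_s, y_t_pseudo)        # mixed label map (source labels + target pseudo-labels)
    return x_m, y_m
```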

3.4. Finetune Module

The finetune module matches the target-domain labels predicted by the teacher model with the SAM-generated masks using a maximum threshold matching strategy, which is described below.

3.4.1. Mask Generation

The edge segmentation capability of a semantic segmentation model cannot be directly measured by mIoU and is often overlooked during model evaluation. Enhancing this capability means improving the model’s ability to distinguish border pixels belonging to different labels in the training domain. Our method uses SAM, a large foundation segmentation model [19], as a guide to enhance the edge segmentation capability of the semantic segmentation model.
Since SAM can accept prompts of various styles to generate masks, we adopted the approach used in the SAM demo [19]: points are uniformly placed on the image as prompts, and an IoU threshold is applied to the SAM-generated masks to guarantee higher-quality spatial prompts (i.e., generated masks). The mask generation process is given by $\text{masks} = \text{SAM}(x_t, \text{point prompts})$.
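A minimal sketch of this grid-prompted mask generation using the publicly released segment-anything package is shown below; the checkpoint path, grid density, and quality threshold are placeholder values rather than settings reported by the authors.

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (the path and model variant are placeholders).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")

# A uniform grid of point prompts is placed over the image; masks whose predicted
# IoU falls below the threshold are discarded, keeping only higher-quality prompts.
mask_generator = SamAutomaticMaskGenerator(sam, points_per_side=32, pred_iou_thresh=0.88)

image = cv2.cvtColor(cv2.imread("target_image.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts with 'segmentation' and 'area' entries
```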

3.4.2. Pseudo-Label Matching

Mask generation with SAM produces masks of different sizes, and larger masks are more likely to cover several different classes. To adapt to the mask and pseudo-label matching strategy described below, we therefore sorted the generated masks in descending order of area. Let $\text{mask}_i$ denote the $i$-th generated mask; the sorted sequence $\{\text{mask}_1, \text{mask}_2, \ldots, \text{mask}_k\}$ satisfies $\text{Area}(\text{mask}_i) \ge \text{Area}(\text{mask}_{i+1})$ for all $i \in \{1, 2, \ldots, k-1\}$. Processing the masks in descending order of area allows the pseudo-labels to be refined from a coarse to a fine-grained level, while the spatial information contributed by small masks is preserved rather than being overwritten.
Each mask is treated as a spatial prompt and used to shrink or fill the area of the selected class in the pseudo-label. To do so, the mask must first be assigned the correct semantic information, for which we use a maximum threshold matching strategy. To assign reasonable semantic information to $\text{mask}_i$, we compute the area of each class in the pseudo-label region covered by $\text{mask}_i$, select the class with the largest area, $c^{*} = \arg\max_{c \in C} \text{Area}(\text{mask}_i \cap L_c)$, and require that this area exceed the chosen threshold $\sigma$, i.e., $\text{Area}(\text{mask}_i \cap L_{c^{*}}) > \sigma$.
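The maximum threshold matching step can be sketched as follows. This is a simplified interpretation of the procedure rather than the authors’ released code; in particular, it treats σ as the fraction of the mask that the dominant class must cover, which is an assumption.

```python
import numpy as np

def refine_pseudo_label(pseudo_label, masks, sigma=0.7):
    """Refine an (H, W) pseudo-label map with SAM masks via maximum threshold matching."""
    # Coarse-to-fine: process the largest masks first so small masks are not overwritten.
    masks = sorted(masks, key=lambda m: m["area"], reverse=True)
    refined = pseudo_label.copy()
    for m in masks:
        region = m["segmentation"]            # boolean (H, W) mask from SAM
        covered = pseudo_label[region]
        if covered.size == 0:
            continue
        classes, counts = np.unique(covered, return_counts=True)
        best = counts.argmax()
        # Assign the dominant class only if it occupies more than sigma of the mask area.
        if counts[best] / covered.size > sigma:
            refined[region] = classes[best]
    return refined
```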
As shown in Figure 2, the finetune module significantly improves the prediction accuracy. After fine tuning, the images more closely resemble the ground truth images in the fourth column, with enhanced edge definition. This refinement enables the model to learn precise geometric details more effectively. Furthermore, Algorithm 2 provides a detailed implementation of the finetune module for generating pseudo-labels for the target domain.
Algorithm 2: The implementation of the finetune module to generate pseudo-labels for the target image.

3.5. Overall Loss Function

The overall loss function, indicated in Figure 3 by the three backpropagation stages (red dotted arrows), is constructed from the cross-entropy loss (see Equation (1)). It combines three components computed on the different data sources generated during training and can be expressed as follows:
$$L = C_1 \cdot L_{CE}^{S}(y_s^{pred}, y_s) + C_2 \cdot L_{CE}^{M}(y_m^{pred}, y_m) + C_3 \cdot L_{CE}^{T}(\hat{y}_t, y_v),$$
where $L_{CE}^{S}(y_s^{pred}, y_s)$ denotes the cross entropy from training the student model on source domain data, $L_{CE}^{M}(y_m^{pred}, y_m)$ denotes the cross entropy from training the student model on mixed domain data, and $L_{CE}^{T}(\hat{y}_t, y_v)$ denotes the cross entropy from training the student model on target domain data. $C_1$, $C_2$, and $C_3$ are hyperparameters that control the relative weights of the three loss terms.
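In PyTorch terms, the weighted combination of the three cross-entropy terms might look like the sketch below; the argument names are illustrative, and the default weights follow the values listed in Section 4.2.

```python
import torch.nn.functional as F

def overall_loss(pred_s, y_s, pred_m, y_m, pred_t, y_t_pseudo,
                 c1=0.7, c2=1.0, c3=0.9, ignore_index=255):
    """Weighted sum of the source, mixed, and target cross-entropy terms."""
    loss_s = F.cross_entropy(pred_s, y_s, ignore_index=ignore_index)         # source images, ground-truth labels
    loss_m = F.cross_entropy(pred_m, y_m, ignore_index=ignore_index)         # mixed images, mixed labels
    loss_t = F.cross_entropy(pred_t, y_t_pseudo, ignore_index=ignore_index)  # target images, refined pseudo-labels
    return c1 * loss_s + c2 * loss_m + c3 * loss_t
```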

4. Experiments and Results

4.1. Datasets

We used the GTA5 dataset [42], which includes 24,966 synthetic images with pixel-level semantic annotations, as our source dataset; and the Cityscapes dataset [43], a collection of street scene images from 50 different cities aimed at semantic understanding, as our target dataset. Figure 4 provides examples from both datasets. We applied Gaussian denoising to the datasets to reduce image noise. Additionally, following standard UDA practices, we resized each image in both datasets to 512 × 512 pixels, and we then applied horizontal flipping to augment the training samples.
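For illustration, the preprocessing described above could be implemented as in the following sketch; the Gaussian kernel size is an assumption, since the paper does not specify it.

```python
import cv2

def preprocess(image, label):
    """Denoise, resize to 512x512, and create a horizontally flipped copy of each sample."""
    image = cv2.GaussianBlur(image, (3, 3), 0)                              # Gaussian denoising (kernel size assumed)
    image = cv2.resize(image, (512, 512), interpolation=cv2.INTER_LINEAR)
    label = cv2.resize(label, (512, 512), interpolation=cv2.INTER_NEAREST)  # nearest neighbor keeps label ids intact
    flipped = (cv2.flip(image, 1), cv2.flip(label, 1))                      # horizontal flip augmentation
    return (image, label), flipped
```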
CityScapes. The Cityscapes dataset contains 5000 images with a resolution of 1024 × 2048 pixels. Each image comes with pixel-level annotations for 19 object classes, which supports semantic segmentation research and was thus adopted as the target domain in our current work.
GTA5. The GTA5 dataset, generated from the Grand Theft Auto V game, contains a large number of synthetic images on urban scenes with a resolution of 1914 × 1052 pixels. Each image also comes with pixel-level annotations for 19 object classes, and it was adopted as the source domain in this work. GTA5 is widely used in UDA segmentation due to its extensive labeled synthetic urban scene dataset, enabling diverse and challenging domain shifts. While real-world smart city datasets are valuable, they often lack large-scale annotation, making them impractical for domain adaptation research.
Domain Adaptation from GTA5 to Cityscapes. We present the results of the GTA5 to Cityscapes domain adaptation, along with the performance of several existing models on the same task. After thoroughly reviewing prior work, we categorized the existing methods into two groups based on the type of networks: ResNet-based and SegFormer-based. We ensured that each model was optimized for its best performance.

4.2. Experimental Setup

We adopted DeepLabV3+ [12] and SegFormer [44] as our backbone networks, the latter of which was used in DAFormer [1]. These architectures enable effective feature extraction for UDA-based semantic segmentation. We set the hyperparameters as follows: p 1 = 0.7 , p 2 = 0.3 , α 1 = 7 , α 2 = 3 , k = 0.99 , σ = 0.7 , C 1 = 0.7 , C 2 = 1.0 , and C 3 = 0.9 . These values were determined through sensitivity analysis, ensuring a balanced trade-off between model generalization and training stability. All of the experiments were conducted using NVIDIA RTX 4090 GPUs, manufactured by NVIDIA Corporation, Santa Clara, CA, USA. Identical hyperparameters were used across all of the ablation analyses to ensure fair comparisons.
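For reference, the hyperparameters listed above can be collected into a single configuration mapping (the key names are illustrative):

```python
CONFIG = {
    "p1": 0.7, "p2": 0.3,              # sampling probabilities for rare / common classes
    "alpha1": 7, "alpha2": 3,          # caps on the number of selected rare / common classes
    "k": 0.99,                         # EMA coefficient for the teacher update
    "sigma": 0.7,                      # maximum threshold matching cutoff
    "C1": 0.7, "C2": 1.0, "C3": 0.9,   # weights of the source, mixed, and target loss terms
}
```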

4.3. Evaluation Metrics

The mean intersection over union (mIoU) is a standard metric for evaluating semantic segmentation models as it offers a comparable single value across models and datasets and penalizes both false positives and negatives. It is the average of the intersection over union (IoU) values for each class, measuring the overlap between predicted and ground truth segmentation masks. The formulas for IoU and mIoU are $IoU = \frac{TP}{TP + FP + FN}$ and $mIoU = \frac{1}{C}\sum_{c=1}^{C} IoU_c$, respectively, where $TP$, $FP$, and $FN$ are true positives, false positives, and false negatives, respectively, and $C$ is the number of classes.
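A simple NumPy sketch of this computation is given below (function and array names are illustrative); pixels marked with an ignore index are excluded, following common practice for Cityscapes evaluation.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Compute per-class IoU and mIoU from prediction and ground-truth label maps."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        ious.append(tp / denom if denom > 0 else np.nan)  # skip classes absent from both maps
    return np.array(ious), float(np.nanmean(ious))
```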

5. Results

We summarize the performance of SamDA and state-of-the-art UDA methods in Table 1. Our proposed method SamDA outperformed all ResNet-based methods by a large margin (at least 15 mIoU) and surpassed MIC-DAFormer by 1 mIoU. Furthermore, SamDA demonstrated significant advantages in segmenting small objects, such as fences, riders, and bicycles, where prior methods have struggled to achieve accurate predictions.

5.1. Qualitative Comparison

To provide qualitative insights into the effectiveness of our method, we present selected segmentation results in Figure 5. This visualization compares segmentation outputs without SAM, with SAM, and the ground truth (GT).
Edge Detection Improvement. As shown in Figure 5, the SAM-enhanced segmentation results exhibited notably sharper object boundaries, especially in the rider and bicycle regions. Compared to models without SAM, the segmentation masks generated by SamDA more closely resembled the GT, effectively mitigating boundary ambiguity.
Minor Class Recognition. Our method significantly improves small object segmentation, as is evident in the bike and pedestrian classes in Figure 5. By incorporating our mixing module, the model was exposed to additional rare-class samples, which led to higher-fidelity predictions for these under-represented categories.

5.2. Ablation Study

Table 2 demonstrates the efficacy of our cross-domain mixing strategy and finetune module in improving mIoU.
  • The mixing module enhanced model generalization, yielding a 4.5 mIoU improvement over traditional mixing methods when using a ResNet-based backbone (Table 2, Rows 3 and 4).
  • When using SegFormer, the mixing module still provided a 1.1 mIoU boost (Rows 6 and 7), demonstrating its effectiveness across different architectures.
  • The finetune module further refined segmentation quality, providing an additional 3.8 mIoU improvement (Rows 4 → 5 and Rows 7 → 8).
  • The full SamDA model (Row 8) achieved 75.2 mIoU, surpassing all of the baseline methods.

5.3. Computational Cost Analysis

Unlike methods such as DACS [5] and DAFormer [1], which primarily focus on feature alignment and image-level style transfer, SamDA introduces additional computational overhead during the training phase due to the integration of the SAM model for mask generation. However, this additional cost is manageable as SAM operates in a one-time preprocessing manner rather than an iterative refinement scheme.
We conducted all of the experiments using four NVIDIA RTX 4090 GPUs with a batch size of 16. This setup ensures that our method remains computationally feasible without significantly increasing training costs compared to existing approaches. However, it is important to note that existing UDA methods do not provide explicit computational cost details in their papers, making a direct quantitative comparison difficult.
To provide a broader context, we summarize the architectural characteristics and potential computational costs of several state-of-the-art UDA methods below:
  • DACS [5]: Uses DeepLabV3+ as the backbone and does not rely on additional external models like SAM, making it computationally lightweight.
  • DAFormer [1]: Adopts a Transformer-based architecture with SegFormer-B5 as the backbone, leading to higher computational requirements than CNN-based approaches.
  • MIC-DAFormer [14]: Extends DAFormer with masked image consistency learning, requiring additional memory and compute overhead.
  • SAM4UDASS [10]: Integrates SAM for pseudo-label refinement but requires iterative optimization, leading to higher training costs.
Since these papers do not explicitly report GPU resource consumption, batch size, or training iterations, we do not speculate on precise computational costs. Instead, we focused on ensuring that SamDA maintains inference-time efficiency comparable to existing UDA methods, as it does not require additional mask matching operations during inference. The additional computational cost of SamDA is confined to the training phase, where SAM-generated masks are used for supervision, ensuring that its real-time deployment feasibility remains unchanged.

6. Discussion

6.1. Failure Analysis and Limitations

Despite SamDA’s superior overall mIoU compared to the baselines (Table 1), performance remained suboptimal for certain target domain categories, particularly Traffic Light and Motorcycle. These challenges primarily stemmed from domain shifts and the inherent characteristics of SAM’s zero-shot segmentation.
One notable issue arose with the fragmentation of slender objects, such as traffic lights. The IoU of Traffic Light (44.2) was significantly lower than that of similar categories like Pole (53.5). This was mainly due to geometric domain shifts and semantic confusion. In GTA5, traffic lights are rendered as simplified 3D models, whereas in Cityscapes, they exhibit complex structures and reflective surfaces. SAM, which is trained with generic prompts, often struggles to delineate these boundaries accurately, leading to fragmented or incomplete pseudo-labels. Additionally, due to spatial proximity, SAM frequently merges Traffic Light and Pole into a single mask, and our area-based matching strategy does not always resolve such ambiguities, introducing further label noise.
Another limitation is the degradation in segmenting small objects like motorcycles. The Motorcycle category attained an IoU of only 33.3, significantly lower than Car (66.7). The primary reasons for this degradation are resolution constraints and pose variability. SAM struggles to identify small objects when they appear in oblique perspectives or low-resolution settings, resulting in a high false negative rate (FN: 48.9%), as observed in Table 1. Additionally, although the mixing module augments minority classes, motorcycles in Cityscapes exhibit extreme pose variations, such as side views and occlusions by riders, which are under-represented in synthetic GTA5 images. As a result, pseudo-label confidence for these cases remains low.
Apart from these category-specific issues, domain-specific performance bottlenecks further hinder adaptation. For example, the Rider class (IoU: 58.1) demonstrates lower segmentation accuracy in low-light Cityscapes scenarios, such as night-time cyclists. SAM’s zero-shot segmentation lacks robustness in illumination variations, often misclassifying shadowed regions as the background. Additionally, long-tail distribution challenges persist in categories like Traffic Sign, which, despite augmentation via image mixing, achieves an IoU of only 39.7. This limitation is primarily caused by over-segmentation, as SAM fragments densely clustered traffic signs—such as those at intersections—into multiple disjoint masks. Our area-based selection method discards smaller fragments, exacerbating information loss. Furthermore, traffic signs in GTA5 differ stylistically from those in Cityscapes, leading to misalignment in feature representation during cross-domain adaptation.

6.2. Methodological Limitations and Practical Considerations

While SamDA demonstrates significant improvements, several limitations impact its practical applicability, particularly in real-world deployment and computational feasibility. The first consideration is the practical impact of improving segmentation accuracy for minority classes. Many real-world applications prioritize critical object categories, such as pedestrians and vehicles, over less prominent classes, like Traffic Light and Traffic Sign. Although SamDA reduces the domain gaps for these categories, further optimization may not be essential when balancing computational efficiency with segmentation quality.
Another key limitation is the computational cost and scalability of integrating SAM. As discussed in Section 5, SAM introduces additional computational overhead, which is manageable with high-end GPUs (e.g., four NVIDIA RTX 4090 GPUs), but its feasibility in resource-constrained settings remains unverified. The increased training complexity may also hinder adoption in real-time applications such as autonomous driving, where computational efficiency is crucial. Future research should explore ways to reduce training costs while preserving segmentation accuracy.
Finally, generalization to more complex datasets remains a challenge. Our evaluation primarily focuses on GTA5 → Cityscapes, a synthetic-to-real adaptation benchmark. While this setup captures fundamental domain adaptation issues, real-world datasets often introduce additional complexities such as extreme weather conditions, night-time scenes, and occlusions (e.g., ACDC [48] and Dark Zurich [49]). SAM’s robustness under such conditions is not well documented, making its effectiveness in these scenarios uncertain. Future work should assess SamDA’s performance on more diverse datasets to determine its broader applicability.

7. Conclusions

In this work, we propose SamDA, a novel framework that integrates SAM, a vision foundation model, as the mask generator to enhance the accuracy of semantic label assignment, particularly at object boundaries, within an unsupervised domain adaptation setting. Additionally, our framework incorporates a mixing module that generates synthetic samples to augment minority classes, improving segmentation performance for under-represented objects. Extensive experiments validate the effectiveness of SamDA, demonstrating superior segmentation performance compared to baseline methods on the GTA5 (source domain) and CityScapes (target domain) datasets.
While SamDA significantly improves domain adaptation performance, challenges remain in terms of computational efficiency and generalizability. The integration of SAM introduces additional training overhead, which may limit scalability in large-scale applications. Future work could explore lightweight adaptations of SAM, such as model distillation, to mitigate computational costs while maintaining segmentation accuracy. Another promising research direction is extending SamDA beyond 2D image segmentation to tasks such as 3D point cloud processing and volumetric data analysis, which are crucial for applications in autonomous driving and medical imaging. Additionally, further evaluations under extreme conditions, including night-time and adverse weather scenarios, would be valuable to assess the robustness of SAM in such environments. Investigating domain-specific adaptation strategies, such as tailored prompt engineering and adversarial augmentation, may enhance its performance in challenging settings.
Overall, SamDA highlights the potential of integrating vision foundation models into UDA, but further research is required to optimize its efficiency and applicability across diverse real-world scenarios.

Author Contributions

Methodology, X.L. and Y.Y.; Software, H.S. and X.L.; Validation, H.S.; Investigation, Q.W.; Resources, Y.Y.; Data curation, H.S. and Y.Y.; Writing – original draft, X.L.; Writing – review & editing, Z.L.; Supervision, Q.W. and Z.L.; Project administration, Z.L.; Funding acquisition, Q.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Science and Technology Projects in Guangzhou (No. 2023A04J1607) and the Smart Medical Innovation Technology Center, GDUT (No. ZYZX24-036).

Data Availability Statement

The datasets used in this study are publicly available: GTA5 dataset: This dataset is provided by Richter et al. and is publicly available at https://download.visinf.tu-darmstadt.de/data/from_games/. Cityscapes dataset: The dataset is provided by Cordts et al. and can be accessed at https://www.cityscapes-dataset.com/. Both datasets were used in accordance with their respective licenses and terms of use.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Hoyer, L.; Dai, D.; Van Gool, L. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–22 June 2022; pp. 9924–9935. [Google Scholar]
  2. Zou, Y.; Yu, Z.; Kumar, B.; Wang, J. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 289–305. [Google Scholar]
  3. Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.Y.; Isola, P.; Saenko, K.; Efros, A.; Darrell, T. Cycada: Cycle-consistent adversarial domain adaptation. In Proceedings of the International Conference on Machine Learning, Pmlr, Stockholm, Sweden, 10–15 July 2018; pp. 1989–1998. [Google Scholar]
  4. French, G.; Mackiewicz, M.; Fisher, M. Self-ensembling for visual domain adaptation. arXiv 2017, arXiv:1706.05208. [Google Scholar]
  5. Tranheden, W.; Olsson, V.; Pinto, J.; Svensson, L. Dacs: Domain adaptation via cross-domain mixed sampling. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 5–9 January 2021; pp. 1379–1389. [Google Scholar]
  6. Hoffman, J.; Wang, D.; Yu, F.; Darrell, T. Fcns in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv 2016, arXiv:1612.02649. [Google Scholar]
  7. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35. [Google Scholar]
  8. Tsai, Y.H.; Hung, W.C.; Schulter, S.; Sohn, K.; Yang, M.H.; Chandraker, M. Learning to adapt structured output space for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7472–7481. [Google Scholar]
  9. Chen, Y.C.; Lin, Y.Y.; Yang, M.H.; Huang, J.B. Crdoco: Pixel-level domain transfer with cross-domain consistency. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 1791–1800. [Google Scholar]
  10. Yan, W.; Qian, Y.; Zhuang, H.; Wang, C.; Yang, M. SAM4UDASS: When SAM Meets Unsupervised Domain Adaptive Semantic Segmentation in Intelligent Vehicles. IEEE Trans. Intell. Veh. 2023, 9, 3396–3408. [Google Scholar] [CrossRef]
  11. Brüggemann, D.; Sakaridis, C.; Truong, P.; Van Gool, L. Refign: Align and refine for adaptation of semantic segmentation to adverse conditions. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–7 January 2023; pp. 3174–3184. [Google Scholar]
  12. Jung, Y.J.; Kim, M.J. Deeplab v3+ Based Automatic Diagnosis Model for Dental X-ray: Preliminary Study. J. Magn. 2020, 25, 632–638. [Google Scholar] [CrossRef]
  13. Chang, W.G.; You, T.; Seo, S.; Kwak, S.; Han, B. Domain-specific batch normalization for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7354–7362. [Google Scholar]
  14. Hoyer, L.; Dai, D.; Wang, H.; Van Gool, L. MIC: Masked image consistency for context-enhanced domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 11721–11732. [Google Scholar]
  15. Choi, W.; Kim, D.; Kim, C. Self-ensembling with GAN-based data augmentation for domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1234–1243. [Google Scholar]
  16. Lee, Y.; Kim, Y.; Kim, S.; Kim, C. Sliced Wasserstein discrepancy for unsupervised domain adaptation. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI) 2019, 43, 3199–3213. [Google Scholar]
  17. Chen, Y.; Zhang, W.; Wang, Z.; Wang, S. Source-free domain adaptation for semantic segmentation. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021; pp. 6572–6583. [Google Scholar]
  18. Zhang, Y.; Qiu, Z.; Dai, D.; Van Gool, L. Prototypical pseudo-label denoising and target structure learning for domain adaptive semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 12455–12465. [Google Scholar]
  19. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 4015–4026. [Google Scholar]
  20. Lu, Y.; Zhang, J.; Liu, K. Semantic-guided prompt learning for few-shot segmentation. In Proceedings of the International Conference on Machine Learning (ICML), Honolulu, HI, USA, 23–29 July 2023; pp. 6789–6798. [Google Scholar]
  21. Zhang, H.; Li, R.; Wang, T. Personalizing foundation segmentation models with reinforcement learning. IEEE Trans. Neural Netw. Learn. Syst. (TNNLS) 2023, 34, 8765–8779. [Google Scholar]
  22. Shen, J.; Xu, W.; Huang, J. SAM: Anything model for segmentation. arXiv 2023, arXiv:2303.14285. [Google Scholar]
  23. Yu, X.; Wang, R.; Zhao, Q. Pointly supervised SAM for domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4563–4572. [Google Scholar]
  24. Wang, B.; Sun, X.; Zhou, L. TransMix: Enhancing minority class recognition with adaptive image mixing. In Proceedings of the European Conference on Computer Vision (ECCV), Paris, France, 2–3 October 2023; pp. 3456–3465. [Google Scholar]
  25. Zhang, T.; Liu, Y.; Chen, S. Balanced mix-up training for long-tailed recognition. Int. J. Comput. Vis. (IJCV) 2022, 130, 5678–5692. [Google Scholar]
  26. Li, Y.; Yuan, L.; Vasconcelos, N. Bidirectional learning for domain adaptation of semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 6936–6945. [Google Scholar]
  27. Saito, K.; Watanabe, K.; Ushiku, Y.; Harada, T. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3723–3732. [Google Scholar]
  28. Xu, R.; Li, G.; Yang, J.; Lin, L. Larger norm more transferable: An adaptive feature norm approach for unsupervised domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 1426–1435. [Google Scholar]
  29. Xie, Q.; Luong, M.T.; Hovy, E.; Le, Q.V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10687–10698. [Google Scholar]
  30. Pereira, M.C.; Sastre-Gomez, S. Nonlocal and nonlinear evolution equations in perforated domains. J. Math. Anal. Appl. 2021, 495, 124729. [Google Scholar] [CrossRef]
  31. Zou, Y.; Yu, Z.; Liu, X.; Kumar, B.; Wang, J. Confidence regularized self-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27–28 October 2019; pp. 5982–5991. [Google Scholar]
  32. Li, G.; Kang, G.; Liu, W.; Wei, Y.; Yang, Y. Content-consistent matching for domain adaptive semantic segmentation. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 440–456. [Google Scholar]
  33. Wang, Z.; Zhang, Y.; Zhang, Z.; Jiang, Z.; Yu, Y.; Li, L.; Li, L. Exploring Semantic Prompts in the Segment Anything Model for Domain Adaptation. Remote Sens. 2024, 16, 758. [Google Scholar] [CrossRef]
  34. Chen, T.; Zhu, L.; Ding, C.; Cao, R.; Zhang, S.; Wang, Y.; Li, Z.; Sun, L.; Mao, P.; Zang, Y. Sam fails to segment anything?—Sam-adapter: Adapting sam in underperformed scenes: Camouflage, shadow, and more. arXiv 2023, arXiv:2304.09148. [Google Scholar]
  35. Xu, Y.; Tang, J.; Men, A.; Chen, Q. Eviprompt: A training-free evidential prompt generation method for segment anything model in medical images. arXiv 2023, arXiv:2311.06400. [Google Scholar] [CrossRef] [PubMed]
  36. Ke, L.; Ye, M.; Danelljan, M.; Tai, Y.W.; Tang, C.K.; Yu, F. Segment anything in high quality. Adv. Neural Inf. Process. Syst. 2024, 36, 29914–29934. [Google Scholar]
  37. Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast segment anything. arXiv 2023, arXiv:2306.12156. [Google Scholar]
  38. Jinlei, W.; Chen, C.; Dai, C.; Hong, J. A Domain-Adaptive segmentation method based on segment Anything model for mechanical assembly. Measurement 2024, 235, 114901. [Google Scholar]
  39. Zhou, Q.; Feng, Z.; Gu, Q.; Pang, J.; Cheng, G.; Lu, X.; Shi, J.; Ma, L. Context-aware mixup for domain adaptive semantic segmentation. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 804–817. [Google Scholar] [CrossRef]
  40. Wu, X.; Wu, Z.; Guo, H.; Ju, L.; Wang, S. DANNet: A One-Stage Domain Adaptation Network for Unsupervised Nighttime Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  41. Yao, D.; Li, B.; Wang, R.; Wang, L. Dual-level Interaction for Domain Adaptive Semantic Segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Paris, France, 2–6 October 2023. [Google Scholar]
  42. Richter, S.R.; Vineet, V.; Roth, S.; Koltun, V. Playing for data: Ground truth from computer games. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part II 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 102–118. [Google Scholar]
  43. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223. [Google Scholar]
  44. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  45. Guo, X.; Yang, C.; Li, B.; Yuan, Y. Metacorrection: Domain-aware meta loss correction for unsupervised domain adaptation in semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3927–3936. [Google Scholar]
  46. Araslanov, N.; Roth, S. Self-supervised augmentation consistency for adapting semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15384–15394. [Google Scholar]
  47. Hoyer, L.; Dai, D.; Van Gool, L. HRDA: Context-Aware High-Resolution Domain-Adaptive Semantic Segmentation. In Computer Vision—ECCV 2022; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  48. Sakaridis, C.; Dai, D.; Van Gool, L. ACDC: The adverse conditions dataset with correspondences for semantic driving scene understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 2723–2739. [Google Scholar]
  49. Sakaridis, C.; Dai, D.; Van Gool, L. Guided curriculum model adaptation and uncertainty-aware evaluation for semantic nighttime image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 7374–7383. [Google Scholar]
Figure 1. Illustration of the mixing module process. The figure depicts the steps from selecting rare and common classes from the source domain, cropping and fusing them with the target image, to generating the final mixed image used for training. This technique enhances rare class representation and reduces domain shift.
Figure 2. Demonstration of the effectiveness of the proposed mixing module. From left to right, each column corresponds to the target images, the predicted labels without the mixing module, the predicted labels with the mixing module, and the ground truth labels for the target images. The improvement in minority class segmentation and edge details is evident, demonstrating the utility of our method.
Figure 3. The SamDA unsupervised domain adaptation framework with three major components: (a) the mixing module with the cross-domain mixing strategy; (b) self-training with the student and teacher networks using an exponential moving average (EMA); and (c) the finetune module with SAM as the mask generator and pseudo-label matching using a maximum threshold matching strategy.
Figure 4. Data samples from the GTA5 and CityScapes datasets. The first row contains the original images, while the second row contains the semantic labels of the images above them.
Figure 5. Selected semantic segmentation samples from the experiments to demonstrate the effectiveness of our proposed mixing strategy with and without the finetune module. GT stands for ground truth.
Table 1. The performances of SamDA and selected baselines on the unsupervised domain adaptation from GTAV to Cityscapes.
Method | Road | S.Walk | Build. | Wall | Fence | Pole | T.Light | Sign | Veget. | Terrain | Sky | Person | Rider | Car | Truck | Bus | Train | M.Bike | Bike | mIoU
ResNet-based:
CBST [2] | 91.8 | 53.5 | 80.5 | 32.7 | 21.0 | 34.0 | 28.9 | 20.4 | 83.9 | 34.2 | 80.9 | 53.1 | 24.0 | 82.7 | 30.3 | 35.9 | 16.0 | 25.9 | 42.8 | 45.9
CCM [32] | 93.5 | 57.6 | 84.6 | 39.3 | 24.1 | 25.2 | 35.0 | 17.3 | 85.0 | 40.6 | 86.5 | 58.7 | 28.7 | 85.8 | 49.0 | 56.4 | 5.4 | 31.9 | 43.2 | 49.9
MetaCor [45] | 92.8 | 58.1 | 86.2 | 39.7 | 33.1 | 36.3 | 42.0 | 38.6 | 85.5 | 37.8 | 87.6 | 62.8 | 31.7 | 84.8 | 35.7 | 50.3 | 2.0 | 36.8 | 48.0 | 52.1
DACS [5] | 89.9 | 39.7 | 87.9 | 30.7 | 39.5 | 38.5 | 46.4 | 52.8 | 88.6 | 37.0 | 88.8 | 67.2 | 35.9 | 84.5 | 45.7 | 50.0 | 0.8 | 27.3 | 34.0 | 52.2
SAC [46] | 90.4 | 53.9 | 86.6 | 42.4 | 27.3 | 45.1 | 48.5 | 42.7 | 87.4 | 40.1 | 86.1 | 67.5 | 29.7 | 88.5 | 49.1 | 54.6 | 9.8 | 26.6 | 45.3 | 53.8
DACS [5] (w/Mixing) | 94.7 | 63.1 | 87.6 | 30.7 | 40.6 | 40.2 | 47.8 | 51.6 | 87.6 | 47.0 | 88.9 | 66.7 | 35.9 | 90.2 | 50.8 | 57.5 | 0.2 | 39.8 | 56.4 | 56.7
DACS [5] (w/SAM) | 87.8 | 56.0 | 79.7 | 45.3 | 44.8 | 45.6 | 53.5 | 53.5 | 88.6 | 45.2 | 82.1 | 70.7 | 39.4 | 90.0 | 49.5 | 59.4 | 1.0 | 48.9 | 56.4 | 57.8
DACS [5] (w/SAM+Mixing) | 92.7 | 54.1 | 88.9 | 44.2 | 33.3 | 43.8 | 49.8 | 38.0 | 88.4 | 45.0 | 86.5 | 70.1 | 45.0 | 90.0 | 41.4 | 50.6 | 42.0 | 46.3 | 58.7 | 60.5
SegFormer-based:
DAFormer [1] | 95.7 | 70.2 | 89.4 | 53.5 | 48.1 | 49.6 | 55.8 | 59.4 | 89.9 | 48.9 | 92.5 | 72.2 | 44.7 | 92.3 | 74.5 | 78.2 | 65.1 | 55.9 | 61.8 | 68.3
HRDA [47] | 96.7 | 75.0 | 90.0 | 58.2 | 50.4 | 51.1 | 56.7 | 62.1 | 90.2 | 53.3 | 92.9 | 72.4 | 47.1 | 92.6 | 78.9 | 83.4 | 75.6 | 54.2 | 62.6 | 70.7
MIC-DAFormer [14] | 95.8 | 73.3 | 92.8 | 56.2 | 51.9 | 51.6 | 59.6 | 62.8 | 93.1 | 51.9 | 96.3 | 77.7 | 47.0 | 96.0 | 81.7 | 81.7 | 68.2 | 59.9 | 64.3 | 71.6
MIC-DAFormer [14] (w/SAM) | 96.4 | 74.4 | 91.0 | 61.6 | 51.5 | 58.1 | 63.9 | 69.3 | 91.3 | 50.4 | 94.2 | 81.6 | 52.9 | 93.7 | 84.1 | 85.7 | 79.5 | 63.9 | 67.5 | 74.2
SamDA (Ours) | 96.4 | 76.2 | 89.9 | 66.6 | 53.6 | 58.9 | 63.3 | 68.9 | 92.3 | 52.4 | 95.2 | 82.3 | 54.8 | 95.8 | 84.8 | 87.4 | 74.7 | 65.3 | 70.8 | 75.2
Note: Bold numbers indicate the highest performance in each column; italic text in the first column denotes the category of methods; shaded cells highlight the mean IoU (mIoU) values.
Table 2. Ablation analyses on the network components. ST indicates self-training module; Traditional-Mix indicates a classic mix, such as DACS; Mix indicates our proposed mixing module; Finetune indicates our proposed finetune module; and Former-based indicates using SegFormer as the backbone.
Model | mIoU
1 | 40.3
2 | 48.8
3 | 52.2
4 | 56.7
5 | 60.5
6 | 68.3
7 | 69.4
8 | 75.2
