1. Introduction
Skin diseases, including basal-cell carcinoma (BCC), melanoma, squamous-cell carcinoma (SCC), and epithelial carcinoma, affect millions globally. Melanoma, the deadliest, accounts for about 75% of skin cancer-related deaths due to its aggressive metastatic potential [1]. Early symptoms appear as small moles or spots that may darken or grow. Early-stage melanoma can be treated with minor surgery [2]. Early detection is crucial for successful treatment, and dermatologists play a key role in identifying subtle changes in skin lesions [3]. Accurate segmentation of the lesion from the surrounding healthy tissue is vital for effective diagnosis and treatment planning.
Recent advances in artificial intelligence (AI) have revolutionized dermatological imaging, offering significant potential to assist physicians in diagnosis. AI integration improves diagnostic efficiency, reduces misdiagnosis rates, and enhances patient outcomes. Deep learning, particularly convolutional neural networks (CNNs), has become the leading method for image segmentation, automating the identification of cancerous lesions and reducing the burden on medical professionals [4,5,6,7]. However, these methods typically rely on large volumes of labeled training data, which can be resource-intensive.
The dependency on high-quality annotated data for fully supervised learning is a major limitation, as it can be costly and time-consuming to acquire, especially in the medical field [8]. Medical image annotations require a high level of expertise, as incorrect annotations can directly impact the quality and reliability of segmentation models. Alternative learning paradigms such as weakly supervised learning [9,10,11], unsupervised learning [12,13,14], and semi-supervised learning (SSL) have been explored by researchers to overcome the challenges associated with obtaining labeled data [15,16,17]. The aim of these approaches is to decrease dependence on large amounts of labeled data by incorporating different levels of supervision or using unlabeled data more effectively.
Recent advancements in SSL have demonstrated its potential to bridge the gap between fully supervised and unsupervised learning paradigms. SSL presents an attractive solution by enabling the model to learn from a small amount of labeled data along with a large amount of unlabeled data. This approach has significant implications for real-world clinical applications, as it reduces the burden of data annotation while maintaining high performance. Pseudo-labeling is a common strategy to exploit unlabeled data, in which pseudo-labels are assigned to unlabeled images, and both labeled and pseudo-labeled data are utilized to train the segmentation model [18,19,20]. A major drawback of this approach is that noise in the pseudo-labels can degrade both the learning process and the segmentation quality. This issue is particularly critical in medical imaging, where precision is required. The difficulty of segmenting skin disease images is shown in Figure 1.
To address the challenges associated with noisy pseudo-labels, researchers have proposed various methods aimed at improving the quality of pseudo-labels and enhancing the learning process. Recent advancements in semi-supervised medical image segmentation have been made possible by the incorporation of consistency regularization and unsupervised loss functions. The Mean Teacher (MT) model, in particular, has gained considerable attention due to its ability to enhance model stability and performance by enforcing consistency between the outputs of a teacher model and a student model when subjected to different perturbations [21]. The parameters of the teacher model are updated using the Exponential Moving Average (EMA) of the student model’s weights, which helps stabilize the learning process and reduce the effects of noisy labels. Building on this foundation, subsequent research has focused on developing SSL algorithms that leverage consistency learning to further improve segmentation performance [22].
Despite these advancements, existing state-of-the-art (SOTA) methods, such as Uncertainty-Aware Mean Teacher (UAMT) [23], Dual Fixmatch Cross Pseudo Supervision (DFCPS) [24], Multi-Task Mean Teacher (MTMT) [25], and Attention U-Net [26], have yet to fully overcome the challenges of noisy pseudo-labels and the limited utilization of complementary learning during training. These limitations underscore the need for innovative approaches that improve segmentation accuracy and robustness, particularly in low-label settings.
This paper proposes a novel end-to-end semi-supervised segmentation framework called UDAMT. UDAMT builds on the Mean Teacher model and introduces key innovations that improve segmentation performance in SSL scenarios. We design a dual-headed segmentation network that includes an auxiliary segmentation head in the student model. This dual-headed architecture lets the model extract complementary information during training, enhancing its ability to learn from both limited labeled data and pseudo-labeled data.
To address the challenges of noisy pseudo-labels, UDAMT incorporates an uncertainty-guided mechanism that identifies high-uncertainty regions within the pseudo-labeled data and excludes them from the training process. The mechanism uses uncertainty maps to guide the model towards the most reliable regions of the pseudo-labeled images. By minimizing noisy labels and increasing precise boundary information, the proposed approach enhances the model’s overall segmentation performance.
In summary, our work focuses on the segmentation of skin lesions, targeting the precise delineation of the lesion region from the surrounding healthy skin, and thereby addresses a critical need in the early diagnosis and treatment of skin cancer.
This work has the following contributions:
We propose a novel dual-headed teacher–student framework for semi-supervised skin lesion segmentation, enhancing the student model’s feature extraction and resilience to noisy pseudo-labels through an auxiliary segmentation head.
UDAMT introduces a pseudo-labeling mechanism based on uncertainty, efficiently identifying and excluding unreliable regions in pseudo-labeled data, improving the learning process’s quality and stability.
Extensive experiments on the ISIC 2016, 2017, and 2018 datasets demonstrate that UDAMT outperforms state-of-the-art methods, achieving an improvement of up to 1.17 percentage points in Dice coefficient under low-label settings, showing its potential for real-world clinical applications.
The paper is organized as follows:
Section 2 reviews related work in medical image segmentation and SSL.
Section 3 details the UDAMT framework and key components.
Section 4 presents the experimental setup and evaluation metrics.
Section 5 compares experimental results with current methods.
Section 6 concludes the paper and outlines future research directions.
2. Related Works
2.1. Medical Image Segmentation
Medical image segmentation plays a crucial role in clinical diagnosis by identifying distinct tissues or organs within images. Traditional methods, such as threshold-based segmentation and edge detection [27,28], initially provided simple but limited solutions. The threshold-based method [29,30] classifies pixels based on intensity but struggles with complex images and noise. Edge detection-based approaches [31,32,33] enhance boundaries but face challenges with blurred edges and substantial noise.
The advent of deep learning has revolutionized segmentation, with CNNs becoming the dominant approach. U-Net [34] introduced a U-shaped architecture with skip connections, significantly improving segmentation accuracy. Variants such as Attention U-Net, MultiResUNet [35], UNet++ [36], and TransUNet [37] further refined feature extraction through attention mechanisms and transformer-based architectures [38].
Recent advances have focused on enhancing segmentation accuracy, robustness, and computational efficiency. Methods such as UAMT and Cross Pseudo Supervision (CPS) [39] introduced uncertainty-guided learning and dual-headed architectures, improving resilience against noisy labels. Multi-task models like MTMT incorporate auxiliary tasks to enhance segmentation performance. Self-supervised approaches, including PseudoSeg [40], utilize unlabeled data more effectively, reducing annotation dependency.
Additionally, attention mechanisms, including Attention U-Net and TransUNet, enable refined feature selection. Emerging paradigms like uncertainty-aware learning [41,42,43] and self-supervised pretraining [44] further improve segmentation accuracy by focusing on high-confidence regions.
These advancements highlight the transformative role of deep learning in medical image segmentation. By integrating teacher–student architectures, uncertainty-guided learning, and self-supervised techniques, SOTA methods continue to push the boundaries of automation and diagnostic precision in clinical practice.
2.2. Semi-Supervised Medical Image Segmentation
SSL has emerged as a promising solution to mitigate the need for large-scale labeled datasets in medical image segmentation. Unlike fully supervised methods, SSL leverages a small amount of labeled data along with abundant unlabeled data, making it more scalable and cost-effective [45,46].
A key SSL technique is pseudo-labeling, where a model assigns labels to unlabeled data based on its predictions [47]. However, noisy pseudo-labels can degrade performance. To address this, consistency regularization enforces stability in model predictions under various perturbations, improving segmentation robustness. FixMatch, for example, refines pseudo-labeling by applying strong augmentations and confidence thresholding.
Recent works incorporate uncertainty estimation to enhance pseudo-label reliability. By quantifying uncertainty, models can prioritize high-confidence regions while filtering out noisy labels. Approaches such as Monte Carlo Dropout [48] and Uncertainty-Guided Collaborative Mean Teacher (UCMT) [49] have successfully integrated uncertainty awareness, leading to improved segmentation performance.
SSL has been widely adopted in various medical imaging tasks, including brain Magnetic Resonance Imaging (MRI) and cardiac MRI segmentation, demonstrating performance comparable to fully supervised models despite using fewer labeled samples [50]. The Mean Teacher framework, introduced by Tarvainen et al., remains foundational in SSL, with models such as UAMT extending its capabilities through uncertainty estimation and pseudo-label refinement.
These advancements demonstrate the effectiveness of SSL in bridging the gap between fully supervised and unsupervised learning. By integrating pseudo-labeling, consistency regularization, and uncertainty-guided mechanisms, modern SSL techniques offer scalable and robust solutions for real-world medical imaging applications.
2.3. Teacher–Student Framework in SSL
The teacher–student framework has become a cornerstone in SSL, particularly in medical image segmentation, due to its ability to effectively leverage unlabeled data while maintaining robust learning performance. In this paradigm, the teacher model generates pseudo-labels for unlabeled data, and the student model learns from both labeled data and these pseudo-labeled samples. This framework promotes the utilization of a large amount of easily obtainable unlabeled data, thereby reducing the dependency on expensive, expert-annotated medical images.
The original concept of the teacher–student model was popularized by the MT model proposed by Tarvainen and Valpola. In this model, the teacher model generates pseudo-labels for unlabeled data, and the student model is trained using a combination of these pseudo-labels and labeled data. A critical aspect of the Mean Teacher approach is that the teacher model’s parameters are updated through an EMA of the student model’s weights, rather than being learned directly from labeled data. This ensures that the teacher model provides stable and reliable pseudo-labels, improving the consistency of the learning process.
One major advantage of the teacher–student framework is its ability to enforce consistency between the teacher and student models. By applying different augmentations or perturbations to the input data and ensuring that both the teacher and student produce consistent predictions, the model’s robustness is improved. This consistency regularization serves as an unsupervised loss term that guides the student model toward more reliable segmentation, even in the presence of noisy labels. Consistency regularization has been widely adopted in SSL due to its effectiveness in reducing the adverse impact of noisy pseudo-labels.
Recent advancements within the teacher–student framework include the integration of uncertainty estimation. Uncertainty-aware teacher–student models aim to identify regions of high uncertainty in pseudo-labels and exclude these regions from the training process, thereby reducing the effect of noisy or incorrect labels. For instance, the UAMT model proposed by Yu et al. employs Monte Carlo Dropout to estimate uncertainty in the network’s predictions and uses these estimates to refine the pseudo-labels generated by the teacher model. This approach ensures that the student model focuses on more reliable information, leading to improved segmentation performance.
Another significant development is the CPS method introduced by Chen et al., which utilizes two student models with different initializations. Each student model generates pseudo-labels for the other, and both models learn from each other’s pseudo-labels through a cross-supervision strategy. This method enhances the robustness of the models by leveraging complementary information and reducing the reliance on a single-teacher model.
In recent developments, the Collaborative Mean Teacher (CMT) [15] framework extends the traditional teacher–student approach by introducing multiple student models that collaboratively generate pseudo-labels. This approach encourages diversity in the generated pseudo-labels, leading to more comprehensive learning for each student model. Furthermore, the introduction of Uncertainty-Guided Mixup (UMIX) [51] allows the model to manipulate input images based on uncertainty, enhancing the overall quality of the generated pseudo-labels.
Our proposed dual-headed teacher–student framework builds upon these advancements. We introduce an auxiliary segmentation head to the student model, enabling it to learn complementary features during training. This dual-headed setup allows the student model to cross-reference its segmentation outputs, providing an additional layer of consistency that enhances learning stability. Furthermore, by incorporating an uncertainty-guided mechanism, we ensure that only reliable pseudo-labels are used during training, thereby mitigating the negative effects of noisy labels. This combined approach leverages the strengths of the teacher–student paradigm (effective use of unlabeled data, consistency regularization, and uncertainty estimation) to achieve superior segmentation performance in skin lesion analysis.
The teacher–student framework, particularly with these recent innovations, remains a powerful tool for advancing medical image segmentation. By harnessing the synergy between labeled and unlabeled data, and by leveraging techniques such as consistency regularization and uncertainty awareness, the framework not only reduces the need for costly labeled data but also enhances the overall quality and reliability of medical image analysis.
3. Method
3.1. Overall Architecture
The proposed method adopts a dual-headed segmentation network within a teacher–student framework to leverage both labeled and unlabeled data for enhanced segmentation performance. The architecture comprises a teacher and student model working collaboratively to improve generalization and robustness. The student model benefits from an auxiliary segmentation head, which introduces multiple learning pathways during training and ensures better adaptability. The framework diagram is shown in Figure 2.
At the core of the design, the dual-headed segmentation network strengthens the student model’s ability to extract meaningful features from diverse data sources. Two segmentation heads are employed: the main segmentation head, which generates segmentation maps, and the auxiliary segmentation head, which guides the learning process. By combining outputs from these two heads, the model achieves richer feature representation and improved resilience to noise in pseudo-labels.
An auxiliary perspective introduced by the secondary head ensures consistent predictions across network depths. The auxiliary loss, calculated as the cross-entropy between the outputs of the main and auxiliary heads, drives the model towards improved accuracy while addressing challenges posed by noisy or uncertain labels. This dual-headed approach significantly reduces the risk of overfitting to erroneous pseudo-labels and enhances the overall generalization of the student model.
A key component of the framework, the teacher–student interaction facilitates semi-supervised segmentation. The teacher model, implemented as an EMA of the student model, evolves steadily over time. It generates pseudo-labels for unlabeled data, which the student model utilizes to refine its learning. This iterative process enables the student to effectively combine knowledge from labeled and pseudo-labeled data, resulting in improved segmentation performance.
Stability in pseudo-label generation is ensured by the EMA mechanism in the teacher model. Meanwhile, the student network’s dual-headed structure introduces diversified learning pathways, mitigating the effects of noisy labels. Together, these elements contribute to a robust and generalizable segmentation framework.
Parameter sharing between the teacher and student models is facilitated by using the EMA strategy, where the teacher model’s parameters are updated as a moving average of the student model’s weights. This approach stabilizes the training process, ensuring that the teacher model provides high-quality pseudo-labels throughout the learning process. The EMA update is given by
$$\theta'_t = \alpha\,\theta'_{t-1} + (1 - \alpha)\,\theta_t, \tag{1}$$
where $\theta'_t$ represents the teacher model parameters at iteration $t$, $\theta_t$ represents the student model parameters at iteration $t$, and $\alpha$ is the decay rate, typically close to 1 (e.g., 0.99), ensuring a gradual change in the teacher model’s parameters. This strategy ensures that the teacher model retains the beneficial knowledge from previous training steps while also incorporating new information learned by the student model.
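As a minimal, framework-agnostic sketch of the EMA update described above, the snippet below applies the rule to plain Python lists. The dictionary layout and the name `ema_update` are illustrative assumptions; a real implementation would operate on the network's weight tensors.

```python
def ema_update(teacher, student, alpha=0.99):
    """Return new teacher parameters: alpha * teacher + (1 - alpha) * student.

    `teacher` and `student` are dicts mapping a parameter name to a flat
    list of floats (a stand-in for real weight tensors).
    """
    return {
        name: [alpha * t + (1.0 - alpha) * s
               for t, s in zip(teacher[name], student[name])]
        for name in teacher
    }


# Toy example: one "layer" with two weights.
teacher = {"w": [0.0, 1.0]}
student = {"w": [1.0, 1.0]}
teacher = ema_update(teacher, student, alpha=0.99)
# With alpha = 0.99 the teacher moves only 1% of the way toward the student.
```

A decay rate close to 1 makes the teacher a slowly moving average of recent student states, which is what keeps its pseudo-labels stable between iterations.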
The training mechanism involves two main phases: supervised training with labeled data and semi-supervised training with both labeled and pseudo-labeled data. During supervised training, the student model learns directly from the ground truth annotations. During semi-supervised training, the teacher model generates pseudo-labels for the unlabeled data, which are used alongside labeled data to train the student. Specifically, the input image first passes through the teacher model to generate pseudo-labels, while Monte Carlo Dropout is applied to produce uncertainty maps. These uncertainty maps are then used to filter reliable pseudo-label regions, guiding the loss calculation for the student model. Thus, the core operations between them are uncertainty estimation and pseudo-label region filtering. Cross-entropy loss is computed for both the main and auxiliary heads, with an uncertainty-guided mechanism excluding highly uncertain regions from contributing to the loss calculation.
By leveraging both labeled and unlabeled data in a unified framework, and by incorporating dual-headed segmentation, our approach effectively addresses the challenges associated with limited labeled data and noisy pseudo-labels, resulting in a more accurate and robust segmentation model.
Figure 3 illustrates the pseudo-labeling process in the UDAMT model for semi-supervised skin lesion segmentation. The model follows a teacher–student framework, where both the teacher and student models share a ResNet50 backbone. The teacher model, updated via EMA, generates pseudo-labels for unlabeled data. To improve reliability, Monte Carlo Dropout is applied for uncertainty estimation, filtering out high-uncertainty regions based on a threshold (0.2). The student model incorporates a dual-head structure, comprising a primary Fully Convolutional Network (FCN) head and an auxiliary DeepLab segmentation head, which supports consistency learning. Input data undergoes strong and weak augmentations before training. The total loss function integrates supervised loss on labeled data, pseudo-label loss on filtered regions, and auxiliary consistency loss. The iterative process continues with the student model being optimized via backpropagation and the teacher model being updated through EMA, enhancing segmentation quality progressively.
3.2. Uncertainty Map Generation
In the SSL framework, uncertainty estimation plays a key role in identifying which pseudo-labels are reliable and which may contain errors. In our approach, uncertainty maps are generated to identify high-uncertainty regions, which are subsequently used to enhance the training process by focusing only on reliable regions.
To generate uncertainty maps, we employ Monte Carlo Dropout during inference. Specifically, the student model is run multiple times with dropout enabled, resulting in a set of predictions for each pixel. The uncertainty of each pixel is then quantified based on the variance of the predictions across multiple runs. The variance $\sigma_i^2$ of a pixel $i$ can be computed as
$$\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T}\left(p_i^{(t)} - \bar{p}_i\right)^2, \tag{2}$$
where $T$ is the number of stochastic forward passes, $p_i^{(t)}$ is the predicted probability of pixel $i$ in the $t$-th forward pass, and $\bar{p}_i$ is the average predicted probability over all passes. A high value of $\sigma_i^2$ indicates high uncertainty, allowing us to identify regions that are potentially mislabeled.
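The variance computation can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: `stochastic_predict` is a hypothetical callable standing in for a network evaluated with dropout left active, images are flattened to per-pixel lists, and the default of 8 passes is arbitrary.

```python
import random


def mc_dropout_variance(stochastic_predict, image, num_passes=8):
    """Per-pixel mean and variance over `num_passes` stochastic forward passes.

    `stochastic_predict(image)` must return a flat list of lesion
    probabilities, one per pixel, and may differ from call to call.
    """
    passes = [stochastic_predict(image) for _ in range(num_passes)]
    num_pixels = len(passes[0])
    means, variances = [], []
    for i in range(num_pixels):
        values = [p[i] for p in passes]
        mean = sum(values) / num_passes
        var = sum((v - mean) ** 2 for v in values) / num_passes
        means.append(mean)
        variances.append(var)
    return means, variances


_rng = random.Random(0)  # fixed seed so the toy example is repeatable


def toy_predict(image):
    """Hypothetical stand-in for a network with dropout enabled."""
    # Pixel 0 is predicted stably; pixel 1 flips between two values.
    return [0.9, _rng.choice([0.1, 0.9])]


mean_map, uncertainty_map = mc_dropout_variance(toy_predict, image=None)
```

In this toy setting the stable pixel has zero variance, while the flipping pixel has a positive variance whenever both values are drawn, which is the signal used to flag unreliable regions.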
Once the uncertainty map is generated, we employ a patching strategy to reduce the impact of noisy pseudo-labels. High-uncertainty regions are excluded from the training process by masking them out, allowing the model to focus on low-uncertainty, reliable areas. Specifically, the pseudo-labels for uncertain regions are not included in the loss calculation during training. The objective function for the segmentation task is modified as follows:
$$L_{seg} = -\sum_{i \in \Omega_{low}} y_i \log p_i, \tag{3}$$
where $\Omega_{low}$ represents the set of pixels with low uncertainty, $y_i$ is the ground truth or pseudo-label, and $p_i$ is the model’s prediction. By excluding high-uncertainty regions from the loss calculation, we effectively reduce the noise in the training process and ensure that the student model learns from more reliable data.
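A hedged sketch of uncertainty-guided masking is given below for the binary (lesion versus non-lesion) case. The function name, the flat-list layout, and the choice to average over the retained pixels are assumptions made for illustration; the threshold default of 0.2 is the value quoted in this paper.

```python
import math


def masked_cross_entropy(probs, pseudo_labels, uncertainty, threshold=0.2):
    """Binary cross-entropy averaged over pixels with uncertainty <= threshold.

    All inputs are flat per-pixel lists. `probs` holds predicted lesion
    probabilities and `pseudo_labels` holds 0/1 targets.
    Returns 0.0 when every pixel is masked out.
    """
    eps = 1e-7  # keeps log() finite for probabilities of exactly 0 or 1
    total, kept = 0.0, 0
    for p, y, u in zip(probs, pseudo_labels, uncertainty):
        if u > threshold:
            continue  # high-uncertainty pixel: excluded from the loss
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        kept += 1
    return total / kept if kept else 0.0


# The second pixel is badly predicted but also highly uncertain,
# so it does not contribute to the loss.
loss = masked_cross_entropy([0.9, 0.2], [1, 1], [0.05, 0.5])
```

Masking means a confidently wrong pseudo-label in an uncertain region cannot dominate the gradient, at the cost of discarding some correct pixels as well.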
This strategy is particularly effective in reducing the negative impact of erroneous pseudo-labels, leading to a more stable training process and better overall performance. Furthermore, we introduce a patching approach, wherein regions of high uncertainty are replaced with patches from regions of low uncertainty, thereby providing additional training examples that are reliable. This augmentation technique helps the model generalize better and enhances its robustness to noisy labels.
The combination of uncertainty map generation, noise reduction through masking, and patch-based augmentation ensures that our semi-supervised model learns effectively from the available data, even when the amount of labeled data is limited.
3.3. Loss Function
In our proposed semi-supervised segmentation model, the loss function plays a crucial role in guiding the model to learn from both labeled data and pseudo-labeled data effectively. The dual-headed architecture in the student model utilizes a cross-entropy loss function to enforce consistency between the segmentation heads and to enhance feature representation through complementary learning.
The dual-headed student model consists of a main segmentation head and an auxiliary segmentation head. To promote consistent learning across the two segmentation heads, we define a cross-entropy loss between them, encouraging the heads to produce similar segmentation outputs. This consistency regularization reduces divergence in feature extraction and prevents the network from overfitting to noisy pseudo-labels.
The cross-entropy loss between the main head and the auxiliary head is formulated as follows:
$$L_{cons} = -\sum_{i \in \Omega} p_i^{aux} \log p_i^{main}, \tag{4}$$
where $\Omega$ represents the set of all pixels in the image and $p_i^{aux}$ and $p_i^{main}$ are the predicted probabilities for pixel $i$ from the auxiliary and main heads, respectively. This loss encourages the two heads to agree on their predictions for each pixel, thereby improving consistency.
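The following sketch shows one way a cross-entropy consistency term between two heads could be computed for binary segmentation. Treating the auxiliary head's output as the soft target and averaging over pixels are assumptions made here for illustration; the function name is hypothetical.

```python
import math


def head_consistency_loss(p_aux, p_main):
    """Mean binary cross-entropy between two heads' lesion probabilities.

    `p_aux` acts as the soft target and `p_main` as the prediction.
    Both are flat per-pixel lists of probabilities in [0, 1].
    """
    eps = 1e-7  # avoids log(0)
    total = 0.0
    for a, m in zip(p_aux, p_main):
        m = min(max(m, eps), 1.0 - eps)
        total += -(a * math.log(m) + (1.0 - a) * math.log(1.0 - m))
    return total / len(p_aux)


agree = head_consistency_loss([0.9, 0.1], [0.9, 0.1])
disagree = head_consistency_loss([0.9, 0.1], [0.1, 0.9])
# `disagree` is larger than `agree`, so minimizing the loss pulls the
# main head's predictions toward the auxiliary head's.
```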
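The overall loss function for the student model combines multiple components to ensure effective learning. The total loss $L_{total}$ is defined as
$$L_{total} = \lambda_{sup} L_{sup} + \lambda_{cons} L_{cons} + \lambda_{unsup} L_{unsup}, \tag{5}$$
where $L_{sup}$ is the supervised loss on labeled data, calculated as
$$L_{sup} = -\sum_{i \in \Omega_L} y_i \log p_i, \tag{6}$$
with $\Omega_L$ representing the set of labeled pixels, $y_i$ being the ground truth label for pixel $i$, and $p_i$ being the prediction. $L_{cons}$ is the consistency loss between the main and auxiliary heads. $L_{unsup}$ is the unsupervised loss computed on pseudo-labeled data, using uncertainty-guided masking. $\lambda_{sup}$, $\lambda_{cons}$, and $\lambda_{unsup}$ are weighting factors that control the contribution of each component in the total loss.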
By integrating the consistency loss between the two segmentation heads, our model ensures robust feature extraction and reduces the risk of overfitting to noisy labels. This multi-component loss function allows the student model to effectively learn from both labeled and pseudo-labeled data, resulting in improved segmentation accuracy and generalization capabilities.
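The weighted combination can be sketched in a few lines. The default weights below are the values reported in the experimental setup of this paper (1.0, 0.5, and 0.7); the function and argument names are illustrative assumptions.

```python
def total_loss(l_sup, l_cons, l_unsup, w_sup=1.0, w_cons=0.5, w_unsup=0.7):
    """Weighted sum of the supervised, consistency, and unsupervised losses."""
    return w_sup * l_sup + w_cons * l_cons + w_unsup * l_unsup


# Example with made-up loss values for a single training step.
step_loss = total_loss(l_sup=0.40, l_cons=0.10, l_unsup=0.30)
# 1.0 * 0.40 + 0.5 * 0.10 + 0.7 * 0.30 = 0.66
```

Keeping the supervised weight at 1.0 and the other two below it means the ground-truth signal dominates, while the pseudo-label and consistency terms act as regularizers.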
To dynamically optimize the loss weight under low-label proportions, we introduce an adaptive weighting mechanism that adjusts the contribution of the unsupervised loss component based on the uncertainty map. Specifically, the unsupervised loss weight $\lambda_{unsup}$ is computed as a function of $u$, where $u$ represents the uncertainty measure for a given sample. This approach ensures that pseudo-labels with lower uncertainty have a greater impact on training while reducing the influence of noisy pseudo-labels.
To explicitly incorporate this into the total loss function, we redefine $\lambda_{unsup}$ as an uncertainty-dependent weight $\lambda_{unsup}(u) \le \lambda_{max}$, where $\lambda_{max}$ is the upper bound for the unsupervised loss weight. Substituting this into Equation (5), the updated total loss function becomes
$$L_{total} = \lambda_{sup} L_{sup} + \lambda_{cons} L_{cons} + \lambda_{unsup}(u)\, L_{unsup}.$$
This dynamic adjustment allows the model to focus on more reliable regions during training, further enhancing robustness and accuracy.
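To make the idea concrete, the sketch below uses a purely hypothetical weighting rule: a linear decay from an upper bound down to zero as the uncertainty measure grows. The linear form, the assumption that the uncertainty measure is normalized to [0, 1], and the default upper bound of 0.7 are all illustrative choices, not the formula used in this work.

```python
def adaptive_unsup_weight(uncertainty, lambda_max=0.7):
    """HYPOTHETICAL rule: weight falls linearly from lambda_max to 0.

    `uncertainty` is assumed to be a per-sample measure in [0, 1];
    values outside that range are clipped.
    """
    u = min(max(uncertainty, 0.0), 1.0)
    return lambda_max * (1.0 - u)


def adaptive_total_loss(l_sup, l_cons, l_unsup, uncertainty,
                        w_sup=1.0, w_cons=0.5, lambda_max=0.7):
    """Total loss with the unsupervised term scaled by the adaptive weight."""
    w_unsup = adaptive_unsup_weight(uncertainty, lambda_max)
    return w_sup * l_sup + w_cons * l_cons + w_unsup * l_unsup


confident = adaptive_unsup_weight(0.1)   # close to lambda_max
uncertain = adaptive_unsup_weight(0.8)   # strongly down-weighted
```

Any monotonically decreasing, bounded function of the uncertainty would express the same design intent: reliable samples contribute more to the unsupervised term than unreliable ones.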
4. Experimental Setup
4.1. Dataset and Evaluation Metrics
To validate the generalization and effectiveness of the proposed semi-supervised segmentation model, experiments were conducted on three publicly available datasets: ISIC 2016, ISIC 2017, and ISIC 2018. These datasets, maintained by the International Skin Imaging Collaboration (ISIC), contain dermoscopic images with pixel-level segmentation masks, distinguishing lesion from non-lesion regions. This allows for a precise evaluation of segmentation models.
The ISIC 2016, ISIC 2017, and ISIC 2018 datasets include various skin lesion types, primarily melanoma and non-melanoma lesions (such as nevus and keratosis). The distribution of lesion types varies across the datasets and is not consistently documented. Demographic details such as age, gender, and ethnicity are also not consistently provided. Specific details regarding imaging tools and acquisition procedures are not included in the dataset documentation.
The ISIC 2016 dataset contains 900 images, serving as a benchmark for testing models in data-limited scenarios. The ISIC 2017 dataset expands to 2000 images, providing a more diverse set for evaluating model generalization. The ISIC 2018 dataset, the largest of the three, contains 2594 images, making it ideal for evaluating segmentation methods in more complex scenarios. A summary of the dataset composition and splits is provided in Table 1.
The original image resolutions of the ISIC 2016, 2017, and 2018 datasets are 768 × 512, 1024 × 1024, and 1024 × 1024 pixels, respectively. In the experiments, all images were uniformly scaled to 224 × 224 pixels, which balances computational efficiency against detail preservation and enables effective lesion segmentation without excessive computational cost. Larger resolutions, such as 512 × 512 pixels, provide more detail but double the computation time. Smaller resolutions, such as 128 × 128 pixels, reduce computational requirements but lose critical lesion details, reducing segmentation accuracy.
Experiments show that the 224 × 224 pixel resolution yields strong performance in metrics like Dice and Sensitivity. A resolution of 512 × 512 pixels provided a marginal improvement (+0.5% in Dice) but at the cost of a 2× increase in computation time. Reducing the resolution to 128 × 128 pixels resulted in a 2.3% drop in Dice due to loss of boundary detail.
For each dataset, data were split into 80% training and 20% testing. Within the training set, either 5% or 10% of the images were used as labeled data, with the remainder treated as unlabeled. This setup simulates real-world scenarios with limited labeled data. The testing set, consisting of images with corresponding ground truth masks, was used exclusively for model evaluation, ensuring unbiased performance assessment.
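The split described above can be sketched as follows. The random shuffling, the fixed seed, the rounding rule, and the function name are assumptions for illustration; the actual partition used in the experiments may differ.

```python
import random


def split_dataset(image_ids, labeled_fraction, test_fraction=0.2, seed=0):
    """Split ids into (labeled, unlabeled, test) lists.

    `test_fraction` of all images is held out for testing, and
    `labeled_fraction` of the remaining training images keeps its labels.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    test_ids, train_ids = ids[:n_test], ids[n_test:]
    n_labeled = max(1, round(len(train_ids) * labeled_fraction))
    return train_ids[:n_labeled], train_ids[n_labeled:], test_ids


# With 900 images (the ISIC 2016 size quoted in this section) and 5% labels:
labeled, unlabeled, test_ids = split_dataset(range(900), labeled_fraction=0.05)
# 180 test images, 720 training images, of which 36 are labeled.
```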
Preprocessing steps were standardized across datasets as follows:
Image Resizing: All images were resized to 224 × 224 pixels using bilinear interpolation.
Normalization: Pixel intensity values were scaled to [0, 1] to stabilize model training.
Data Augmentation: Random augmentations included ±15° rotations, horizontal/vertical flips, and brightness/contrast adjustments.
Mask Binarization: All segmentation masks were standardized into binary format, distinguishing lesion from non-lesion areas.
This study focuses on binary segmentation of lesion and non-lesion regions. The primary evaluation metric is the Dice Similarity Coefficient (DSC), calculated as
$$DSC = \frac{2\,|X \cap Y|}{|X| + |Y|},$$
where $X$ and $Y$ represent the predicted and ground truth lesion regions, respectively. A Dice value closer to 1 indicates high segmentation accuracy.
Other metrics include the following:
mIoU: Measures overlap between predicted and ground-truth regions. Higher mIoU values indicate better segmentation quality.
Sensitivity: Reflects the true positive rate, with higher values showing better detection of lesions.
Specificity: Measures the true-negative rate, indicating the model’s ability to correctly identify non-lesion areas.
Overall Accuracy: Provides a holistic measure of segmentation performance.
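For reference, the metrics listed above can all be derived from the pixel-level confusion counts of a binary mask. The sketch below is a minimal illustration; it computes the lesion-class IoU for a single image (averaging over classes or images would give an mIoU), and returning 0.0 for an empty denominator is an assumed convention.

```python
def segmentation_metrics(pred, target):
    """Dice, IoU, sensitivity, specificity, and accuracy for binary masks.

    `pred` and `target` are flat lists of 0/1 values of equal length.
    """
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "dice": ratio(2 * tp, 2 * tp + fp + fn),
        "iou": ratio(tp, tp + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
    }


# Four-pixel toy example: one true positive, one false positive.
scores = segmentation_metrics(pred=[1, 1, 0, 0], target=[1, 0, 0, 0])
```

Note that the Dice expression `2 * tp / (2 * tp + fp + fn)` is the same quantity as the set-based definition, since the predicted region contains `tp + fp` pixels and the ground truth region contains `tp + fn`.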
These metrics collectively evaluate segmentation accuracy, boundary delineation, and the balance between detecting lesions and avoiding false positives. Taken together, they highlight both the strengths and the potential weaknesses of the model in segmenting skin lesions, offering deeper insight into its performance across scenarios and datasets.
4.2. Setup Details
The experiments were conducted on a high-performance computing setup equipped with an NVIDIA Tesla V100 GPU (32 GB memory), running Ubuntu 20.04, CUDA 11.1, PyTorch 1.8.1, and Python 3.8. This setup ensured efficient handling of extensive training iterations while providing real-time feedback for fine-tuning hyperparameters. The training process consisted of two phases—supervised training and semi-supervised training—designed to progressively enhance model performance.
Key hyperparameters were selected based on preliminary experiments and best practices in medical image segmentation. The batch size was set to 8, chosen to fit GPU memory constraints. An initial learning rate of 0.001 was used, with a cosine annealing scheduler for gradual decay. To improve generalization, a weight decay of 0.0005 was applied. The EMA decay rate (α) was set to 0.99, ensuring stable updates in the teacher model for reliable pseudo-labeling. An uncertainty threshold of 0.2 was introduced to filter out high-uncertainty regions, focusing training on reliable pseudo-labels. Additionally, the loss weighting factors were set to 1.0 for the supervised loss, 0.5 for the auxiliary consistency loss, and 0.7 for the unsupervised loss, balancing the supervised and semi-supervised learning objectives.
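To make the schedule and loss weighting concrete, the following sketch evaluates a standard cosine annealing curve starting from the stated learning rate of 0.001 and combines three loss terms with the stated weights. The total number of epochs (100) and the minimum learning rate (0) are assumptions made for illustration only, and the function names are hypothetical. A PyTorch implementation would typically rely on `torch.optim.lr_scheduler.CosineAnnealingLR` rather than computing the rate by hand.

```python
import math

BASE_LR = 1e-3          # stated initial learning rate
TOTAL_EPOCHS = 100      # assumption: not given in the text
W_SUP, W_AUX, W_UNSUP = 1.0, 0.5, 0.7  # stated loss weights


def cosine_annealed_lr(epoch, total_epochs=TOTAL_EPOCHS,
                       base_lr=BASE_LR, min_lr=0.0):
    """Learning rate after `epoch` epochs of cosine annealing."""
    cos_term = (1 + math.cos(math.pi * epoch / total_epochs)) / 2
    return min_lr + (base_lr - min_lr) * cos_term


def total_loss(l_sup, l_aux, l_unsup):
    """Weighted sum of supervised, auxiliary consistency, and unsupervised losses."""
    return W_SUP * l_sup + W_AUX * l_aux + W_UNSUP * l_unsup


lr_start = cosine_annealed_lr(0)
lr_mid = cosine_annealed_lr(TOTAL_EPOCHS // 2)
lr_end = cosine_annealed_lr(TOTAL_EPOCHS)
```

The rate starts at 0.001, passes through half that value at the midpoint, and decays to the minimum at the final epoch.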
In the initial phase, we trained the student model using only labeled data to establish a robust baseline. During this phase, the cross-entropy loss was computed solely on labeled samples, allowing the model to learn basic lesion segmentation patterns without interference from potentially noisy pseudo-labels. The supervised training phase typically lasted for the first 20 epochs to ensure that the model achieved stable performance on labeled data before introducing pseudo-labeled data.
After the initial supervised phase, we transitioned to the semi-supervised training phase, in which the teacher–student framework was applied. During this phase, the following steps were performed:
The teacher model generated pseudo-labels for the unlabeled data, refining them using an uncertainty-guided mechanism. High-uncertainty regions, identified via Monte Carlo Dropout, were excluded from the training loss calculation.
The student model learned from both the labeled and pseudo-labeled data. The dual-headed segmentation structure in the student model facilitated the learning of complementary information through the main and auxiliary heads, which were aligned using the auxiliary consistency loss.
The teacher model parameters were updated using the exponential moving average (EMA) of the student model’s weights, ensuring stable pseudo-label generation over time.
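The steps above can be sketched in a few lines of standard-library Python. The EMA update uses the stated decay of 0.99. The uncertainty measure, however, is an assumption: the sketch uses the binary predictive entropy (in nats) of the mean over several stochastic forward passes, which is one common choice for Monte Carlo Dropout, and compares it with the stated threshold of 0.2. All function names are hypothetical, and weights and pixels are represented as flat lists for brevity.

```python
import math


def ema_update(teacher, student, alpha=0.99):
    """New teacher weights: alpha * teacher + (1 - alpha) * student."""
    return [alpha * t + (1 - alpha) * s for t, s in zip(teacher, student)]


def binary_entropy(p, eps=1e-8):
    """Entropy in nats of a Bernoulli distribution with parameter p."""
    p = min(max(p, eps), 1 - eps)
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def masked_pseudo_labels(prob_samples, threshold=0.2):
    """Derive pseudo-labels and a keep-mask from T stochastic teacher passes.

    prob_samples: list of T lists, each holding per-pixel lesion probabilities.
    Returns (labels, keep), where keep[i] is False for high-uncertainty pixels.
    """
    n_passes = len(prob_samples)
    n_pixels = len(prob_samples[0])
    mean_probs = [sum(sample[i] for sample in prob_samples) / n_passes
                  for i in range(n_pixels)]
    labels = [1 if p >= 0.5 else 0 for p in mean_probs]
    keep = [binary_entropy(p) < threshold for p in mean_probs]
    return labels, keep


def masked_mean(per_pixel_loss, keep):
    """Average a per-pixel loss over the retained (low-uncertainty) pixels."""
    kept = [loss for loss, k in zip(per_pixel_loss, keep) if k]
    return sum(kept) / len(kept) if kept else 0.0


# Two stochastic passes over three pixels; the middle pixel is ambiguous
samples = [[0.98, 0.70, 0.02], [0.96, 0.40, 0.04]]
labels, keep = masked_pseudo_labels(samples)
new_teacher = ema_update([1.0, 0.5], [0.0, 0.5])
```

With an entropy threshold of 0.2 nats, only pixels whose mean probability lies roughly below 0.05 or above 0.95 contribute to the unsupervised loss under this assumed measure.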
To further enhance the model’s generalization ability, data augmentation techniques were applied to both labeled and unlabeled images. These augmentations included random rotations, flips, scaling, and color jittering, which helped the model become more robust to variations in lesion appearance and imaging conditions.
Throughout training, we evaluated the model’s performance on a validation set after each epoch, monitoring metrics such as the Dice coefficient to track segmentation accuracy. The model with the highest Dice coefficient on the validation set was saved as the final checkpoint for testing.
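The two-phase schedule and checkpoint selection described above can be summarized as follows. This is only a schematic with hypothetical function names: the 20-epoch supervised warm-up comes from the text, while the validation scores in the example are made-up placeholders rather than reported results.

```python
SUPERVISED_EPOCHS = 20  # stated length of the supervised-only phase


def phase_for_epoch(epoch):
    """Return the training phase for a zero-based epoch index."""
    return "supervised" if epoch < SUPERVISED_EPOCHS else "semi-supervised"


def select_best_checkpoint(val_dice_by_epoch):
    """Return (epoch, dice) of the highest validation Dice.

    Ties are resolved in favor of the earliest epoch.
    """
    best_epoch = max(range(len(val_dice_by_epoch)),
                     key=lambda e: val_dice_by_epoch[e])
    return best_epoch, val_dice_by_epoch[best_epoch]


# Placeholder validation curve, for illustration only
example_curve = [0.80, 0.85, 0.83]
best_epoch, best_dice = select_best_checkpoint(example_curve)
```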
This experimental setup and training protocol enabled our model to efficiently leverage both labeled and unlabeled data, achieving high segmentation accuracy while maintaining robustness to noisy pseudo-labels. The proposed framework, with its uncertainty-guided mechanism and dual-headed architecture, was particularly effective in addressing the challenges of limited labeled data and noisy pseudo-labeling in medical image segmentation tasks.
5. Results and Discussion
In this section, we provide a comprehensive evaluation of the proposed UDAMT framework for semi-supervised skin lesion segmentation. The experimental results demonstrate that UDAMT, by integrating a dual-headed architecture and an uncertainty-driven mechanism, significantly enhances segmentation performance under limited annotation conditions. Specifically, ablation experiments reveal that removing the auxiliary segmentation head results in an approximate 2.1 percentage point decrease in the Dice coefficient and a 1.7 percentage point drop in mIoU, while disabling the uncertainty mechanism causes about a 2.8 percentage point decline in Dice. When both modules are employed, the model achieves Dice improvements over the baseline of 1.17%p and 1.31%p under the 5% and 10% labeled data settings, respectively, underscoring their complementary roles in suppressing noisy pseudo-labels and enhancing feature representation.
In the comparative analysis, UDAMT attains Dice coefficients of 87.84% and 88.73% under 5% and 10% labeled data, respectively, outperforming traditional supervised models (such as U-Net and TransUNet) as well as other semi-supervised approaches (including MT, UAMT, CMT, and FixMatch). Furthermore, paired t-tests on the Dice and mIoU metrics (p < 0.05) confirm that these improvements are statistically significant.
Regarding computational resources, UDAMT requires only 12.9 M parameters and achieves an inference time of 25.7 ms per image, resulting in an overall computational cost that is considerably lower than that of MT and UAMT. Additionally, experiments on the PH2 dataset demonstrate excellent generalization to small, irregular lesions. Overall, UDAMT strikes a favorable balance between segmentation accuracy, resource consumption, and inference speed, supporting its feasibility and advantages for real-world clinical deployment.
The remainder of this section presents detailed results from the ablation studies, comparative analyses, and computational efficiency evaluations, and discusses the UDAMT framework’s performance across multiple metrics and its potential for practical applications.
5.1. Ablation Studies
To evaluate the contributions of each component in our framework, we conducted ablation studies by adding or removing modules and analyzing their impact on segmentation performance. We used the ISIC 2018 dataset with 5% and 10% labeled data, and the performance was measured using the Dice coefficient.
Table 2 provides the results of these ablation experiments.
Experiment 1 (baseline) uses a single segmentation head and is trained only on labeled data. This serves as the performance benchmark.
Experiment 2 adds the auxiliary segmentation head, improving the Dice coefficient by 0.99%p (5% labeled) and 0.91%p (10% labeled).
Experiment 3 incorporates the uncertainty map, which filters high-uncertainty regions, resulting in Dice improvements of 0.65%p (5% labeled) and 0.93%p (10% labeled).
Experiment 4 combines both the auxiliary head and uncertainty map, achieving the best performance with improvements of 1.17%p (5% labeled) and 1.31%p (10% labeled).
The ablation results clearly show that each module contributes positively to the model’s performance, with the auxiliary head and uncertainty map yielding significant improvements individually. However, the highest performance is achieved when both modules are incorporated, underscoring the synergy between the dual-headed architecture and uncertainty-guided training. This comprehensive ablation analysis demonstrates the effectiveness of our proposed framework, which leverages both labeled and unlabeled data to achieve high segmentation accuracy in a semi-supervised manner.
These findings confirm that our proposed framework, with its auxiliary segmentation head and uncertainty-guided pseudo-labeling, is highly effective in semi-supervised skin lesion segmentation, particularly under limited labeled data conditions.
To validate the independent contributions of the dual-headed architecture and the uncertainty-driven mechanism, ablation studies were conducted on the ISIC 2018 dataset. The results are summarized in Table 3.
The ablation results show that both modules independently contribute to performance improvement, with the full model achieving the best results.
5.2. Comparative Analysis
We evaluate the performance of our proposed UDAMT model against the baseline model and several state-of-the-art semi-supervised skin lesion segmentation methods on the ISIC 2018 dataset with 5% and 10% labeled data. The evaluation is based on segmentation accuracy, measured by the Dice coefficient, and robustness to noisy labels, which is crucial in semi-supervised settings where pseudo-labels can introduce noise.
The baseline model is a single-headed student model trained on labeled data only, providing a performance benchmark. UDAMT is compared to the following models:
MT: Teacher–student framework with EMA updates for the teacher model.
UAMT: This includes an uncertainty-aware mechanism to exclude uncertain regions.
CMT: This uses multiple teacher models and a region mixing strategy based on uncertainty.
FixMatch: This combines strong and weak data augmentations to generate pseudo-labels.
Uncertainty-Guided: Similar to UAMT, but with enhanced uncertainty estimation.
The Dice coefficients for each model under 5% and 10% labeled data are shown in Table 4. UDAMT outperforms the baseline by 1.17%p (5% labeled) and 1.31%p (10% labeled) and consistently performs better than other methods, including MT, UAMT, and FixMatch.
Table 4 compares the Dice coefficients of various methods on the ISIC2018 dataset under 5% and 10% labeled data conditions. The baseline model, a single-headed student trained solely on labeled data, achieves 86.67% and 87.42% Dice, respectively. In contrast, UDAMT attains 87.84% and 88.73%, reflecting improvements of 1.17%p and 1.31%p over the baseline.
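Because the gains are reported in percentage points (%p), it may help to see how they follow from the Dice values quoted above and how they differ from relative improvements. The short sketch below recomputes both from the four numbers in the text. The variable and function names are hypothetical.

```python
# Dice (%) quoted in the text for ISIC 2018, keyed by labeled-data ratio (%)
BASELINE_DICE = {5: 86.67, 10: 87.42}
UDAMT_DICE = {5: 87.84, 10: 88.73}


def gain_percentage_points(ratio):
    """Absolute Dice gain of UDAMT over the baseline, in percentage points."""
    return UDAMT_DICE[ratio] - BASELINE_DICE[ratio]


def gain_relative_percent(ratio):
    """Dice gain relative to the baseline score, in percent."""
    return 100 * gain_percentage_points(ratio) / BASELINE_DICE[ratio]


gains_pp = {ratio: round(gain_percentage_points(ratio), 2) for ratio in (5, 10)}
gains_rel = {ratio: round(gain_relative_percent(ratio), 2) for ratio in (5, 10)}
```

The absolute gains reproduce the reported 1.17%p and 1.31%p, while the corresponding relative gains are slightly larger (about 1.35% and 1.50%) because they are divided by a baseline below 100.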
Compared to other semi-supervised approaches—MT, UAMT, CMT, UCMT, FixMatch, and an Uncertainty-Guided Framework—UDAMT consistently shows higher or competitive Dice scores. For example, UDAMT outperforms MT by approximately 0.74–0.75%p and UAMT by around 0.39%p, while offering a simpler, single-teacher design that reduces computational overhead compared to more complex methods like UCMT. Overall, the results indicate that the dual-headed architecture and uncertainty-guided mechanism in UDAMT effectively enhance segmentation accuracy and robustness under limited labeled data conditions.
The UDAMT framework demonstrates notable robustness to noisy labels, a common challenge in SSL. By incorporating the auxiliary head and using an uncertainty map, UDAMT effectively mitigates the impact of noisy pseudo-labels. The auxiliary head enforces consistency in feature representation, helping the model learn robust segmentation boundaries even in the presence of label noise. Meanwhile, the uncertainty-guided masking strategy selectively excludes unreliable regions from contributing to the training loss, further enhancing stability and accuracy. This dual mechanism enables UDAMT to maintain high segmentation performance across different levels of labeled data. Segmentation results are visualized in Figure 4.
Figure 4 provides a visual comparison of segmentation results, demonstrating the superior performance of UDAMT in accurately segmenting lesion regions. The analysis highlights UDAMT’s effectiveness in handling noisy pseudo-labels through its uncertainty-guided masking strategy and auxiliary segmentation head. These mechanisms help reduce segmentation errors by excluding unreliable regions and ensuring consistent feature representation, which improves the model’s ability to handle label noise.
The dual-headed architecture of UDAMT enhances its ability to delineate lesion boundaries, even in complex cases like blurry edges, severe occlusion, and color inconsistencies.
Figure 4 shows that UDAMT’s segmentation closely matches the ground truth, outperforming baseline models in accurately detecting boundaries, which is crucial for clinical use.
UDAMT also demonstrates strong generalization across various lesion types and challenging scenarios, even with limited labeled data. The model’s robustness, as shown in Figure 4, makes it well-suited for real-world applications where annotated medical data are scarce.
The comparison in Figure 4 further emphasizes UDAMT’s ability to minimize false positives and negatives, leading to more reliable and clinically relevant segmentation. Its enhanced boundary detection aids dermatologists in accurate diagnosis and treatment planning, while its performance in low-label data settings makes it scalable for resource-constrained environments like telemedicine and low-resource healthcare systems.
In conclusion, UDAMT’s dual-headed architecture and uncertainty-guided training provide state-of-the-art performance in semi-supervised segmentation, particularly in challenging scenarios with noisy labels and limited annotations. This robust and efficient design positions UDAMT as a valuable tool for reliable segmentation in medical imaging.
5.3. Performance Analysis Across Evaluation Metrics
To give a comprehensive assessment of the proposed UDAMT framework, we analyzed the most important segmentation metrics: mIoU, DSC, sensitivity (SE), specificity (SP), and accuracy (ACC). These metrics were selected to evaluate different aspects of the model’s performance, such as segmentation accuracy, detection of lesion regions, avoidance of false positives, and overall prediction reliability. Models were evaluated on the ISIC 2016, ISIC 2017, and ISIC 2018 datasets using the 5% and 10% labeled data settings. UDAMT was compared against traditional supervised models such as U-Net and TransUNet and against state-of-the-art semi-supervised methods such as MT, UAMT, and CMT.
Note that all experiments were performed by training and testing each model separately on the ISIC 2016, ISIC 2017, and ISIC 2018 datasets to ensure that the evaluation metrics accurately reflect performance on each individual dataset.
The experimental results, presented in Table 5, demonstrate the consistent superiority of the UDAMT framework across all metrics and datasets. The model exhibited robust performance, particularly in handling limited labeled data scenarios, effectively leveraging pseudo-labeled data to enhance segmentation quality.
Table 5 presents key insights into the UDAMT framework’s performance compared to traditional supervised models and state-of-the-art semi-supervised approaches. UDAMT outperforms traditional models like U-Net and TransUNet, with improvements of 1.30%p and 1.72%p in Dice, respectively, highlighting the advantage of leveraging unlabeled data. It also surpasses the Mean Teacher model by 1.86%p in Dice, thanks to its uncertainty-guided masking and dual-headed design. When compared to UAMT and CMT, UDAMT achieves further gains (0.39%p over UAMT and 0.22%p over CMT, with a 0.54%p increase in accuracy).
UDAMT achieves the highest Dice coefficient (87.84% with 5% labeled data and 88.73% with 10% labeled data), as well as the highest sensitivity (86.67%) and accuracy (97.89%) across the ISIC 2016, 2017, and 2018 datasets, demonstrating its robustness in semi-supervised segmentation tasks.
To further validate the robustness of our results, we conducted statistical significance testing using paired t-tests. The t-tests were applied to the Dice coefficient and mIoU metrics across different methods. As shown in Table 6, the performance improvements of UDAMT over baseline methods are statistically significant, with p-values less than 0.05 in all cases, indicating a meaningful difference rather than random variation.
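For readers unfamiliar with the procedure, a paired t-test operates on per-sample score differences between two methods evaluated on the same images. The sketch below computes the test statistic and degrees of freedom with the Python standard library. The scores are made-up placeholders chosen only to exercise the function and are not results from this study. The function name is hypothetical, and in practice a routine such as SciPy's `scipy.stats.ttest_rel` would also return the p-value.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder per-image Dice scores for two methods (illustrative only)
method_a = [0.90, 0.88, 0.85, 0.91, 0.87]
method_b = [0.88, 0.87, 0.84, 0.89, 0.86]
t_stat, dof = paired_t_statistic(method_a, method_b)

# For 4 degrees of freedom, the two-sided critical value at the 0.05 level
# is about 2.776, so a larger |t| corresponds to p < 0.05.
significant = abs(t_stat) > 2.776
```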
To mitigate feature coupling between the two heads and encourage complementary learning, we incorporate an auxiliary consistency loss computed between the predictions of the main and auxiliary heads. Additionally, Figure 5 presents a visual comparison of the attention maps generated by each segmentation head, highlighting their distinct feature focus and verifying that the dual-headed architecture improves feature diversity. The heatmaps illustrate how each head emphasizes different lesion regions, contributing to improved segmentation performance. The main head predominantly captures larger structural details, while the auxiliary head focuses on fine-grained features, enhancing robustness and diversity in feature representation.
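The exact form of the auxiliary consistency loss is not recoverable from the text here. As one plausible instance, the sketch below assumes a mean squared error between the per-pixel probabilities produced by the main and auxiliary heads. Both the choice of MSE and the function name are assumptions made for illustration and should not be read as the formulation used in this work.

```python
def consistency_mse(p_main, p_aux):
    """Mean squared difference between two heads' per-pixel probabilities.

    Assumption: MSE stands in for the unspecified consistency loss.
    """
    if len(p_main) != len(p_aux):
        raise ValueError("both heads must predict the same number of pixels")
    return sum((m - a) ** 2 for m, a in zip(p_main, p_aux)) / len(p_main)


# The heads agree on the second pixel and differ by 0.2 on the first
main_head = [0.9, 0.2]
aux_head = [0.7, 0.2]
aux_loss = consistency_mse(main_head, aux_head)
```

Under this assumption the loss is zero when the heads agree exactly and grows quadratically with their disagreement. With the stated weight of 0.5, it would act as a regularizer alongside the supervised and unsupervised terms rather than dominate them.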
The proposed UDAMT framework outperformed both traditional supervised models and advanced semi-supervised approaches across all evaluated metrics. Its superior segmentation accuracy and robustness to noisy pseudo-labels demonstrate the effectiveness of its architectural innovations. These results underscore the potential of UDAMT for real-world medical imaging applications, where labeled data are scarce and robustness to label noise is critical. Future work will focus on extending the UDAMT framework to other medical imaging domains, optimizing its computational efficiency, and exploring its deployment in clinical settings.
5.4. Knowledge Distillation and Parameter Minimization Comparison
To evaluate the practicality and computational efficiency of the proposed UDAMT framework, a detailed comparison of parameter scale and inference speed was conducted against existing frameworks, particularly MT and other knowledge distillation-based models.
Table 7 provides an overview of the parameter count and inference time per image for each model, tested on the ISIC 2018 dataset using a single NVIDIA Tesla V100 GPU.
The UDAMT framework requires 12.9 M parameters, which is slightly higher than the baseline model (12.5 M) but significantly lower than MT (14.8 M) and UAMT (15.2 M). The lower parameter count relative to MT and UAMT is attributed to the use of the auxiliary segmentation head and the single-teacher architecture, which avoids the computational overhead associated with multiple teacher models.
The inference time of UDAMT averages 25.7 ms/image, closely matching the baseline model (25.4 ms) and outperforming MT (27.6 ms) and UAMT (28.1 ms). This demonstrates that UDAMT achieves high computational efficiency while maintaining superior segmentation performance.
UDAMT achieves the highest Dice coefficients across both 5% and 10% labeled data settings, showcasing a balance between performance and resource efficiency. This makes UDAMT suitable for real-world applications where computational resources are often limited.
From Table 7, it can be observed that although the baseline has a slightly lower computational cost (defined as the product of parameter count and inference time) than UDAMT (317.5 M-ms vs. 331.5 M-ms), UDAMT achieves a clear improvement in segmentation performance (for example, the Dice coefficient under the 5% labeled data setting rises from 86.67% for the baseline to 87.84% for UDAMT). Moreover, when compared to other advanced methods such as MT, UAMT, and CMT, UDAMT maintains a lower parameter count and faster inference while delivering superior overall performance. In summary, UDAMT demonstrates an excellent balance between performance and resource consumption, enabling efficient and accurate medical image segmentation even in environments with limited computational resources, which strongly supports its practical deployment in clinical settings.
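The cost figures quoted above can be reproduced directly from the parameter counts and inference times given in the text. The sketch below does this for the four models whose values are stated. The dictionary and function names are hypothetical, and CMT is omitted because its numbers are not quoted in the prose.

```python
# (parameters in millions, inference time in ms per image), as quoted in the text
MODELS = {
    "Baseline": (12.5, 25.4),
    "UDAMT": (12.9, 25.7),
    "MT": (14.8, 27.6),
    "UAMT": (15.2, 28.1),
}


def computational_cost(params_m, time_ms):
    """Cost in M-ms: product of parameter count and inference time."""
    return params_m * time_ms


costs = {name: round(computational_cost(p, t), 2)
         for name, (p, t) in MODELS.items()}

# Relative cost overhead of UDAMT compared with the baseline, in percent
udamt_overhead_pct = 100 * (costs["UDAMT"] - costs["Baseline"]) / costs["Baseline"]
```

Under this definition UDAMT costs about 4.4% more than the baseline, while MT and UAMT come out at roughly 408.5 and 427.1 M-ms, respectively.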
To assess the generalization of UDAMT beyond the ISIC dataset, we conducted additional experiments on the PH2 dataset, which contains small and irregularly shaped lesions. Our model achieved an 85.4% Dice coefficient for small lesion segmentation, demonstrating its robustness. Additionally, Table 8 presents inference speed comparisons across different hardware platforms, confirming UDAMT’s computational efficiency.
Practical Applications and Resource-Constrained Environments.
The results highlight the practical advantages of UDAMT in resource-constrained environments, such as medical imaging systems deployed on portable devices or edge computing platforms. The following aspects underline its applicability:
Low Parameter Overhead: With fewer parameters compared to other semi-supervised frameworks, UDAMT is particularly well suited for memory-constrained environments, such as mobile healthcare devices and embedded systems.
Efficient Inference: UDAMT’s fast inference speed enables its use in time-sensitive medical applications, including real-time lesion segmentation during dermatological examinations.
Scalability to Other Tasks: UDAMT’s computational efficiency makes it scalable to other medical imaging tasks, such as organ segmentation in CT scans or tumor detection in MRIs, especially in scenarios requiring SSL due to limited labeled data availability.
The computational efficiency of UDAMT, combined with its robust segmentation accuracy, positions it as a highly practical solution for real-world medical imaging tasks. Unlike models such as UAMT and FixMatch, which incur additional overhead due to their reliance on complex architectures or augmentation strategies, UDAMT achieves comparable or better performance with a streamlined design.
Future research will explore optimizing UDAMT further by integrating lightweight architectures, such as MobileNet-based backbones, and hardware-specific accelerations, such as TensorRT or FPGA implementations, to enhance its applicability in low-power environments.
This analysis reaffirms UDAMT’s value in achieving state-of-the-art performance while addressing the practical constraints of deploying machine learning models in healthcare scenarios.
6. Conclusions and Future Work
In this study, we proposed UDAMT, a novel SSL framework for skin lesion segmentation, designed to address the critical challenges posed by limited labeled data and noisy annotations in medical image analysis. Our approach effectively combines a dual-headed teacher–student architecture with an uncertainty-guided mechanism, significantly enhancing feature learning and boundary precision. Through extensive experiments on the ISIC 2016, ISIC 2017, and ISIC 2018 datasets, UDAMT achieved notable improvements in segmentation performance, including Dice improvements over the baseline of 1.17%p and 1.31%p with 5% and 10% labeled data, respectively. These results validate the robustness and effectiveness of UDAMT, particularly in scenarios where labeled data are scarce.
Furthermore, UDAMT demonstrates superior computational efficiency. The model requires only 12.9 M parameters, which is slightly higher than the baseline model (12.5 M) but significantly lower than methods like MT (14.8 M) and UAMT (15.2 M). Additionally, UDAMT achieves an inference time of 25.7 ms per image, making it computationally feasible for real-time clinical applications. These advantages underscore UDAMT’s potential for deployment in clinical environments with limited computational resources.
Ablation studies provided further insights into the contributions of key components. The dual-headed segmentation network and the uncertainty-guided mechanism were both essential in enhancing segmentation quality and robustness. The dual-headed architecture allowed for complementary learning, providing richer feature representations, while the uncertainty-guided mechanism effectively filtered out unreliable pseudo-labels, ensuring more accurate model predictions. These findings highlight the importance of combining these elements for optimal performance in semi-supervised medical image segmentation.
While the current implementation of UDAMT employs Monte Carlo Dropout for uncertainty estimation—introducing some additional computational cost—the overall overhead remains manageable, and the model’s performance is not compromised. In future work, we aim to explore more efficient uncertainty estimation techniques, such as Bayesian neural networks or learned uncertainty modules, which could further reduce computational costs while maintaining the model’s effectiveness in real-time applications.
Although UDAMT is currently optimized for binary skin lesion segmentation, we believe that the framework can be easily extended to multi-class and multi-organ segmentation tasks. By modifying the network to accommodate diverse anatomical structures and varying noise characteristics across different imaging modalities (e.g., MRI, CT), UDAMT’s applicability in clinical practice can be significantly broadened.
UDAMT’s potential for clinical deployment is substantial. By reducing the reliance on large annotated datasets and delivering high segmentation accuracy in data-constrained environments, UDAMT can alleviate the annotation burden on dermatologists and other medical professionals. Its efficiency and effectiveness make it well suited for deployment in resource-limited settings, such as mobile health devices and telemedicine platforms, where rapid, reliable skin lesion diagnosis is essential.
In conclusion, the UDAMT framework presents a meaningful advance in semi-supervised medical image segmentation, addressing both the technical and practical challenges in clinical settings. The framework’s ability to balance performance, computational efficiency, and scalability provides a solid foundation for future work. We plan to continue optimizing UDAMT to further enhance its robustness and expand its clinical applications, ultimately improving diagnostic accuracy and patient outcomes in real-world healthcare environments.