1. Introduction
Traditional machine learning algorithms, such as SVMs, Random Forests, and KNNs, perform well on small-scale datasets and established the foundation for early medical image analysis [1,2,3]. Deep learning models, such as CNNs and Transformers, leverage large annotated datasets and high-performance computing to overcome the limitations of traditional machine learning in feature extraction and pattern recognition, achieving automated, high-precision classification of large-scale medical imaging data [4,5,6].
The widespread adoption of deep learning across various domains has significantly accelerated industrial advancement. Medical image classification is a core component of medical image analysis, and deep learning models significantly reduce the workload of radiologists in both training and clinical practice while providing reliable auxiliary data for downstream diagnosis [7]. However, training deep learning models requires extensive annotated data, and because medical data annotation demands expert knowledge and careful review, annotation costs are much higher than in other image domains [8]. In recent years, semi-supervised learning, which relies on far less labeled data than traditional supervised learning, has gained widespread application across deep learning tasks [9,10]. Semi-supervised learning splits a dataset into a small labeled subset and a large unlabeled subset and uses strategies such as pseudo-labeling [11,12], consistency regularization [13,14], and label propagation [15] to train deep models with limited labeled data. This approach dramatically reduces annotation costs, and numerous studies [16,17,18] have successfully introduced semi-supervised learning into medical image classification tasks.
Two predominant approaches have emerged in current semi-supervised image classification [19,20]: consistency-based methods, which enforce prediction stability under input perturbations, and pseudo-labeling-based methods, which assign pseudo-labels to unlabeled data before retraining on the augmented set.
Consistency-based methods are exemplified by the mean teacher (MT) framework [21], which aligns student and teacher model predictions; extensions of this line of work incorporate relation modeling [17] and contrastive self-supervised pretraining [22] to better leverage unlabeled data. Pseudo-labeling approaches, on the other hand, generate supervisory signals from model predictions to expand the training set [23,24] and have drawn increasing attention due to their simplicity and scalability. Early works such as FixMatch [25] employed fixed-threshold pseudo-labeling but discarded a large portion of unlabeled data, particularly samples from minority classes. FlexMatch [26] introduced dynamic per-class thresholds but remained ineffective under class imbalance. Recent studies use adaptive thresholds [27], label smoothing [28], curriculum learning [29], or distribution modeling [30] to reduce imbalance and improve pseudo-label reliability, while other works leverage prototype alignment [31] or multi-level feature fusion [23]. Despite these advances, two major limitations remain: (i) existing approaches exhibit simplistic utilization of labeled data, neglecting the modeling of intermediate feature layers and multi-scale structural information; and (ii) their pseudo-label threshold adjustment mechanisms accumulate bias during the initial training phases due to noisy labels.
To address these challenges, this paper proposes a novel semi-supervised medical image classification framework integrating contrastive learning with category-adaptive pseudo-labeling. We design a semantic discrimination enhancement module that strengthens the utilization of labeled data through a supervised contrastive loss, improving feature representation by reducing intra-class distances while increasing inter-class separation. This ensures reliable model convergence during the early training stages, when labeled data are scarce and learnable information is limited. Considering the inherent class imbalance in medical imaging, where models tend to favor majority classes, we develop a category-adaptive pseudo-label regulation module that dynamically adjusts thresholds based on per-class learning progress, effectively alleviating head-class dominance while improving tail-class recognition. To minimize early-stage noise interference from erroneous pseudo-labels when the model's capability is still weak, our method applies category-adaptive pseudo-labeling only in the later training phases. Furthermore, we exploit the deep semantics in unlabeled data by enforcing consistency across different views of the same sample; dual augmented views and additional regularization improve the robustness of the learned semantic features, leading to better classification performance. Extensive experiments on the ISIC2018 and Chest X-ray14 datasets demonstrate significant improvements in classification accuracy. The main contributions of this work are summarized as follows:
We propose a novel semi-supervised medical image classification model, CLCP-MT, which addresses the dual challenges of annotation scarcity and class imbalance by integrating supervised contrastive learning with a category-adaptive pseudo-labeling mechanism, thereby significantly enhancing overall classification performance.
We design a semantic discrimination enhancement (SDE) module that leverages supervised contrastive learning to cluster intra-class samples while separating inter-class samples, effectively extracting discriminative information from limited labeled data in latent structural space and substantially amplifying the value of sparse annotations.
We introduce a category-adaptive pseudo-label regulation (CAPR) module, which dynamically adjusts pseudo-label confidence thresholds based on real-time learning progress across different categories, mitigating dominance by head classes while improving recognition performance for tail classes, thereby achieving effective modeling of long-tailed distributions.
The experimental results on the ISIC2018 and Chest X-ray14 datasets demonstrate that our method consistently outperforms existing semi-supervised approaches under varying annotation ratios and exhibits remarkable efficacy and robustness in class-imbalanced scenarios.
3. Methods
This section delineates the semi-supervised medical image classification model CLCP-MT proposed in our study, as illustrated in Figure 2. The overall architecture builds upon the mean teacher framework [21] and incorporates the consistency regularization principle: applying diverse perturbations to identical inputs while constraining output-level consistency, so as to fully exploit latent information in unlabeled data. Within the mean teacher paradigm, the teacher model parameters $\theta'$ are updated as an exponential moving average (EMA) of the student model parameters $\theta$, and gradient optimization is applied exclusively to the student network during training. This approach leverages temporally ensembled student weights to construct a more robust teacher network that generates reliable consistency targets to guide student training.
To effectively extract discriminative information under extreme label scarcity, we augment this framework with a semantic discrimination enhancement (SDE) module, which introduces a supervised contrastive loss to better exploit the limited labeled samples. Furthermore, to address error amplification from noisy pseudo-labels during the early training phases as well as head-class dominance, we apply a category-adaptive pseudo-label regulation (CAPR) module in the later training stages. This component employs dynamic thresholding to suppress head-class bias while enhancing tail-class recognition, thereby improving overall performance on class-imbalanced medical datasets.
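For concreteness, a minimal PyTorch-style sketch of the EMA teacher update is given below; the decay value of 0.99 follows the training settings in Section 4.2, and the function name is ours.

```python
import torch

@torch.no_grad()
def update_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                   ema_decay: float = 0.99) -> None:
    """EMA update: teacher <- ema_decay * teacher + (1 - ema_decay) * student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(ema_decay).add_(s_param, alpha=1.0 - ema_decay)
```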
3.1. Semantic Discrimination Enhancement (SDE) Module
To fully exploit the informational value of labeled data and investigate the relationships among samples in a low-dimensional space, we designed a semantic discrimination enhancement (SDE) module, as illustrated in Figure 3. By applying a contrastive loss to the labeled data, our approach minimizes intra-class distances while maximizing inter-class distances within the labeled set, thereby encouraging the network to extract additional semantic information from the scarce labeled samples.
The labeled data are subjected to two distinct perturbations, $\eta$ and $\eta'$, before being fed into the student network and the teacher network, respectively, yielding the embeddings $z_i^{s}$ and $z_i^{t}$, which denote the representations obtained from the student network and the teacher network for the $i$-th labeled image under the two perturbations. We concatenate the student embeddings $Z^{s}$ and the teacher embeddings $Z^{t}$ along dimension 0 to obtain the feature matrix $Z = \mathrm{concat}(Z^{s}, Z^{t})$, which, for clarity of notation, we write as $Z = \{z_1, z_2, \ldots, z_{2N}\}$, where $N$ is the number of labeled samples in the batch. Since both $Z^{s}$ and $Z^{t}$ are derived from the same labeled images, the label matrix $Y$ is obtained by concatenating the corresponding ground-truth labels $y$ along dimension 0, i.e., $Y = \mathrm{concat}(y, y)$.
The similarity matrix $S$, whose entry $S_{ij}$ measures the similarity between sample $i$ and sample $j$, serves to identify samples that are proximate in the feature space. The objective is to minimize the distance between samples $i$ and $j$ and enhance their similarity when they belong to the same category, while maximizing their separation and suppressing their similarity when they belong to distinct categories.
To identify homogeneous samples, we define a mask matrix $\widetilde{M}$, as expressed in Equation (7). When $\widetilde{M}_{ij} = 0$, the ground-truth labels of sample $i$ and sample $j$ have no overlapping components, meaning the samples belong to distinct classes; conversely, when $\widetilde{M}_{ij} = 1$, their ground-truth labels share overlapping components, indicating that the samples are from the same class. To exclude self-comparisons, an identity matrix $E$ is subtracted from $\widetilde{M}$, yielding the positive-pair mask matrix $M = \widetilde{M} - E$, as represented in Equation (8).
To minimize intra-class sample distances while maximizing inter-class sample distances, the supervised contrastive loss $\mathcal{L}_{cl}$ is defined in Equation (9), where two temperature coefficients scale the pairwise similarities.
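As a hedged sketch of the SDE computation, the following PyTorch code follows the construction above, with two assumptions of our own: cosine similarity is used for $S$, and a single temperature `tau` stands in for the two temperature coefficients of Equation (9).

```python
import torch
import torch.nn.functional as F

def sde_contrastive_loss(z_s: torch.Tensor,   # student embeddings, shape (N, D)
                         z_t: torch.Tensor,   # teacher embeddings, shape (N, D)
                         y: torch.Tensor,     # one-hot or multi-hot labels, shape (N, C)
                         tau: float = 0.1) -> torch.Tensor:
    # Z: concatenate student and teacher embeddings along dimension 0 -> (2N, D)
    z = F.normalize(torch.cat([z_s, z_t], dim=0), dim=1)
    # Y: concatenate the (shared) ground-truth labels along dimension 0 -> (2N, C)
    labels = torch.cat([y, y], dim=0).float()

    # S: pairwise cosine similarity in the feature space, scaled by the temperature
    sim = z @ z.t() / tau                                   # (2N, 2N)

    # M~: 1 where the label vectors of samples i and j overlap, 0 otherwise
    mask = (labels @ labels.t() > 0).float()
    # M: remove self-comparisons (subtract the identity matrix E)
    mask = mask - torch.eye(mask.size(0), device=mask.device)

    # Standard supervised-contrastive objective: pull positives together, push others apart
    logits = sim - torch.eye(sim.size(0), device=sim.device) * 1e9  # drop self from denominator
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    pos_per_row = mask.sum(dim=1).clamp(min=1.0)
    loss = -(mask * log_prob).sum(dim=1) / pos_per_row
    return loss.mean()
```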
3.2. Category-Adaptive Pseudo-Label Regulation (CAPR) Module
Medical imaging data often exhibit both class imbalance and sample-difficulty imbalance. During training, neural networks tend to prioritize easily classifiable samples while struggling with the minority of challenging cases. To address this issue, we propose a category-adaptive pseudo-label regulation module that dynamically reduces the weight of easy samples while increasing the weight of hard samples, thereby preventing training from being dominated by easily classifiable instances, as illustrated in Figure 4.
For each category, a specific threshold is dynamically adjusted for pseudo-label generation. The unlabeled data $x^{u}$ are fed into the student network and the teacher network after undergoing different perturbations, yielding the prediction outputs $p^{s} = f(x^{u}, \eta; \theta)$ and $p^{t} = f(x^{u}, \eta'; \theta')$, where $f(\cdot)$ denotes the classification network, $\theta$ and $\theta'$ are the parameters of the student and teacher networks, and $\eta$ and $\eta'$ are the perturbations applied to the student and teacher inputs, respectively. Based on the category-specific threshold $\tau_c(t)$, the pseudo-label $\hat{y}$ corresponding to the teacher prediction $p^{t}$ for the current unlabeled sample is derived; here, $\tau_c(t)$ denotes the threshold for the $c$-th category at the $t$-th epoch. This threshold is computed from a minimum threshold $\tau_{\min}$, a maximum threshold $\tau_{\max}$, the ramp-up period $T_r$, and a class weight $w_c$ derived from $N_c$, the number of samples belonging to class $c$ in the training set.
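As one hedged reading of this description, the sketch below ramps the per-class threshold between $\tau_{\min}$ and $\tau_{\max}$ over the ramp-up period and scales it by a class weight derived from the per-class counts $N_c$; the linear ramp, the specific form of `class_weight`, and the default values are illustrative assumptions rather than the paper's exact formula.

```python
import torch

def class_adaptive_thresholds(epoch: int,
                              class_counts: torch.Tensor,  # N_c: samples per class, shape (C,)
                              tau_min: float = 0.6,
                              tau_max: float = 0.95,
                              ramp_epochs: int = 30) -> torch.Tensor:
    """Per-class thresholds tau_c(t): ramp from tau_min toward tau_max, keeping
    rarer classes (small N_c) at lower thresholds so their pseudo-labels survive."""
    ramp = min(epoch / ramp_epochs, 1.0)                      # ramp-up progress in [0, 1]
    class_weight = class_counts.float() / class_counts.max()  # w_c in (0, 1], 1 for the head class
    return tau_min + (tau_max - tau_min) * ramp * class_weight

def generate_pseudo_labels(teacher_probs: torch.Tensor,      # p^t, shape (B, C)
                           thresholds: torch.Tensor):        # tau_c(t), shape (C,)
    """Hard pseudo-labels from the teacher plus a per-sample confidence mask."""
    conf, pseudo = teacher_probs.max(dim=1)                   # confidence and arg-max class
    mask = conf >= thresholds[pseudo]                         # keep only confident samples
    return pseudo, mask.float()
```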
To mitigate the interference of low-confidence pseudo-labels during training, we employ the category-specific threshold $\tau_c(t)$ to derive a confidence mask that retains only high-confidence pseudo-labeled samples, i.e., a sample is kept only if the teacher's confidence in its pseudo-label class exceeds the corresponding class threshold. In addition to the mask, we quantify the uncertainty of the pseudo-labels and weight the pseudo-label loss accordingly, so that the model focuses on high-confidence samples while reducing its reliance on low-confidence ones. We use the entropy $H$ of the teacher prediction to measure uncertainty and generate a confidence weight $w_i$ for each unlabeled sample from its entropy value: the lower the entropy, the larger the assigned weight.
By applying the confidence mask and the weights $w_i$, we compute the pseudo-label loss $\mathcal{L}_{pl}$ between the student network's prediction $p^{s}$ and the pseudo-label $\hat{y}$ generated by the teacher network for the same sample. The pseudo-label loss is formulated in Equation (17), where the per-sample term is given by the focal loss function.
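A sketch of the masked, entropy-weighted focal pseudo-label loss described above; the entropy-to-weight mapping (normalization by $\log C$) and the focal parameter $\gamma = 2$ are assumptions of this sketch rather than the exact form of Equation (17).

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(student_logits: torch.Tensor,  # student predictions for unlabeled data, (B, C)
                      teacher_probs: torch.Tensor,   # teacher probabilities p^t, (B, C)
                      pseudo: torch.Tensor,          # hard pseudo-labels from the teacher, (B,)
                      mask: torch.Tensor,            # confidence mask from the class thresholds, (B,)
                      gamma: float = 2.0) -> torch.Tensor:
    num_classes = teacher_probs.size(1)

    # Entropy of the teacher prediction as an uncertainty measure
    entropy = -(teacher_probs * teacher_probs.clamp_min(1e-8).log()).sum(dim=1)
    # Confidence weight: 1 for a one-hot prediction, 0 for a uniform one (assumed mapping)
    weight = 1.0 - entropy / torch.log(torch.tensor(float(num_classes)))

    # Focal loss between the student prediction and the pseudo-label
    ce = F.cross_entropy(student_logits, pseudo, reduction="none")
    pt = torch.exp(-ce)                     # probability assigned to the pseudo-label class
    focal = (1.0 - pt) ** gamma * ce

    # Apply the confidence mask and the entropy-based weights
    denom = mask.sum().clamp_min(1.0)
    return (focal * weight * mask).sum() / denom
```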
3.3. Overall Loss
During the initial training phase, the entire batch (containing both labeled and unlabeled data) is subjected to different perturbations before being fed into the student and teacher models. For the labeled data $x^{l}$, the predictions $p^{l}$ are compared with the ground-truth labels $y$ to compute the supervised loss $\mathcal{L}_{sup}$, as shown in Equation (19). The consistency loss $\mathcal{L}_{con}$ is derived by comparing the student network's predictions with the teacher network's predictions across the entire batch, as specified in Equation (20). The labeled contrastive loss $\mathcal{L}_{cl}$ is obtained by concatenating the feature representations of the differently perturbed labeled data produced by the student and teacher networks and computing the loss between these features and their corresponding ground-truth labels, as given in Equation (9).
During the later stages of training, category-adaptive pseudo-labels are incorporated. These pseudo-labels and their corresponding confidence scores are derived from the teacher network's predictions on the unlabeled data $x^{u}$. The pseudo-label loss $\mathcal{L}_{pl}$ is then computed by combining these results with the predictions of the student network, as given in Equation (17).
Consequently, the overall optimization objective of the framework can be formulated as
$$L = \mathcal{L}_{sup} + \lambda(t)\,\mathcal{L}_{con} + \lambda_{cl}\,\mathcal{L}_{cl} + \lambda_{pl}\,\mathcal{L}_{pl}, \quad (21)$$
where $\lambda(t)$ is an incrementally weighted factor, $\lambda_{cl}$ denotes the hyperparameter for the labeled contrastive loss, and $\lambda_{pl}$ denotes the hyperparameter for the pseudo-label loss. The ultimate objective is to minimize the loss $L$ by updating the student network's parameters through gradient descent. Here, $\lambda(t)$ follows a Gaussian ramp-up curve that controls the weight of the consistency term, $t$ denotes the current training iteration, and $T$ is the ramp-up length. During the initial $T$ training iterations, $\lambda(t)$ gradually increases from 0 to 1; thereafter, it is fixed at 1 for the remainder of training. This design ensures that, during the early stages of training, when the consistency targets for unlabeled data are still unreliable, the training loss is not dominated by the unsupervised loss.
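Combining the terms, a minimal sketch of the overall objective and the ramp-up weight follows; the Gaussian ramp-up uses the commonly adopted form $\exp(-5(1 - t/T)^2)$, which is an assumption here.

```python
import math

def gaussian_rampup(t: int, T: int) -> float:
    """lambda(t): rises from ~0 to 1 over the first T iterations, then stays at 1."""
    if t >= T:
        return 1.0
    phase = 1.0 - t / T
    return math.exp(-5.0 * phase * phase)

def total_loss(l_sup, l_con, l_cl, l_pl, t: int, T: int,
               lambda_cl: float, lambda_pl: float, use_pseudo: bool):
    """L = L_sup + lambda(t) * L_con + lambda_cl * L_cl + lambda_pl * L_pl,
    with the pseudo-label term enabled only in the later training stage."""
    loss = l_sup + gaussian_rampup(t, T) * l_con + lambda_cl * l_cl
    if use_pseudo:
        loss = loss + lambda_pl * l_pl
    return loss
```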
The training algorithm of the semi-supervised medical image classification model based on supervised contrastive learning and category-adaptive pseudo-labels is shown in Algorithm 1.
Algorithm 1: Contrastive Learning and Category-Adaptive Pseudo-Labeling for Semi-Supervised Medical Image Classification
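The following condensed sketch outlines one training epoch as described in Section 3; it reuses the helper functions from the earlier sketches, and the `cfg` object, the `return_features=True` model interface, and the choice of MSE for the consistency loss are assumptions of this example rather than details taken from Algorithm 1.

```python
import torch
import torch.nn.functional as F

def train_epoch(student, teacher, labeled_loader, unlabeled_loader, optimizer, epoch, cfg):
    """One CLCP-MT-style training epoch (illustrative sketch)."""
    for (x_l, y), x_u in zip(labeled_loader, unlabeled_loader):
        # Two differently perturbed views of every image
        x_l_s, x_l_t = cfg.augment(x_l), cfg.augment(x_l)
        x_u_s, x_u_t = cfg.augment(x_u), cfg.augment(x_u)

        # Forward passes; gradients flow only through the student network
        logits_l, z_s = student(x_l_s, return_features=True)
        logits_u_s = student(x_u_s)
        with torch.no_grad():
            _, z_t = teacher(x_l_t, return_features=True)
            probs_u_t = torch.softmax(teacher(x_u_t), dim=1)

        l_sup = F.cross_entropy(logits_l, y)                                  # Equation (19)
        # Consistency loss (shown for the unlabeled batch; the paper applies it to the full batch)
        l_con = F.mse_loss(torch.softmax(logits_u_s, dim=1), probs_u_t)       # Equation (20)
        l_cl = sde_contrastive_loss(z_s, z_t, F.one_hot(y, cfg.num_classes))  # Equation (9)

        use_pseudo = epoch >= cfg.pseudo_start_epoch  # CAPR only in the later training stage
        l_pl = torch.zeros((), device=l_sup.device)
        if use_pseudo:
            tau = class_adaptive_thresholds(epoch, cfg.class_counts)
            pseudo, mask = generate_pseudo_labels(probs_u_t, tau)
            l_pl = pseudo_label_loss(logits_u_s, probs_u_t, pseudo, mask)     # Equation (17)

        loss = total_loss(l_sup, l_con, l_cl, l_pl, epoch, cfg.rampup,
                          cfg.lambda_cl, cfg.lambda_pl, use_pseudo)           # Equation (21)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        update_teacher(student, teacher, cfg.ema_decay)                       # EMA teacher update
```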
4. Experiments and Results
We evaluated the proposed semi-supervised learning approach on two tasks: single-label skin lesion classification from dermoscopic images and multi-label thoracic disease diagnosis from chest X-ray images.
4.1. Datasets
4.1.1. ISIC2018 Dataset
We conducted skin lesion classification on the ISIC2018 dataset [36,37]. The ISIC2018 Task 3 dataset for skin lesion analysis, released by the International Skin Imaging Collaboration (ISIC) in 2018, comprises dermoscopic images of skin lesions: 10,015 training images, 1512 test images, and 193 validation images. The training set covers seven disease categories, with each image having a resolution of 600 × 450 pixels. Specifically, the 10,015 training images consist of 1113 melanoma (MEL) cases, 6705 melanocytic nevus (NV) cases, 514 basal cell carcinoma (BCC) cases, 327 actinic keratosis (AKIEC) cases, 1099 benign keratosis (BKL) cases, 115 dermatofibroma (DF) cases, and 142 vascular lesion (VASC) cases. This constitutes a single-label imbalanced dataset, with the distribution of different lesion types illustrated in Figure 5.
All the images were resized to 224 × 224 pixels. To leverage pretrained models, we normalized each image using statistical parameters derived from the ImageNet dataset [38]. For a fair comparison and in accordance with prior work [17], we randomly partitioned the dataset into 70% for training, 10% for validation, and 20% for testing. Our network architecture employed DenseNet121 [39] pretrained on ImageNet [40] as the backbone.
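A brief sketch of this preprocessing and backbone configuration; the seven-way classifier head reflects the seven lesion categories and is our assumption of how the head is attached.

```python
import torch.nn as nn
from torchvision import models, transforms

# Resize to 224 x 224 and normalize with the ImageNet channel statistics
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# ImageNet-pretrained DenseNet121 backbone with a 7-way head for the ISIC2018 classes
backbone = models.densenet121(pretrained=True)
backbone.classifier = nn.Linear(backbone.classifier.in_features, 7)
```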
4.1.2. Chest X-Ray14 Dataset
We conducted thoracic disease diagnosis using the Chest X-ray14 dataset [41]. The Chest X-ray14 dataset, collected between 1992 and 2015, comprises 112,120 frontal-view X-ray images from 30,805 unique patients, along with labels for 14 disease categories (each image may carry multiple labels) extracted from the corresponding radiology reports through natural language processing. The images have a resolution of 1024 × 1024 pixels. The 14 diagnostic labels are Atelectasis, Consolidation, Infiltration, Pneumothorax, Edema, Emphysema, Fibrosis, Effusion, Pneumonia, Pleural Thickening, Cardiomegaly, Nodule, Mass, and Hernia. The distribution of different disease types is illustrated in Figure 6.
All the images were resized to a resolution of 224 × 224 pixels and normalized using statistical parameters derived from the ImageNet dataset [38]. The official data partitioning protocol was adopted, allocating 70% of the samples for training, 10% for validation, and 20% for testing, with strict patient-wise separation ensuring no data leakage across the splits. Given the substantially larger scale of this dataset compared to ISIC2018, we employed a deeper backbone, DenseNet169 [39] pretrained on ImageNet [40].
4.1.3. Evaluation Metric
We selected six evaluation metrics for the ISIC2018 dataset: AUC (Area Under the Curve), sensitivity, specificity, accuracy, F1-score, and precision. For the Chest X-ray14 dataset, following previous work [42], we adopted AUC as the evaluation metric.
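For reference, these metrics can be computed with scikit-learn roughly as follows; macro averaging over classes and the one-vs-rest AUC are assumptions of this sketch, since the averaging scheme is not specified here.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, accuracy_score, f1_score,
                             precision_score, recall_score, confusion_matrix)

def isic_metrics(y_true, y_prob):
    """y_true: (N,) class indices; y_prob: (N, C) predicted class probabilities."""
    y_pred = y_prob.argmax(axis=1)
    auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="macro")
    prec = precision_score(y_true, y_pred, average="macro", zero_division=0)
    sens = recall_score(y_true, y_pred, average="macro")          # sensitivity = recall
    # Specificity: per-class true-negative rate, averaged over classes
    cm = confusion_matrix(y_true, y_pred)
    spec = np.mean([(cm.sum() - cm[c].sum() - cm[:, c].sum() + cm[c, c])
                    / (cm.sum() - cm[c].sum()) for c in range(cm.shape[0])])
    return dict(auc=auc, sensitivity=sens, specificity=spec,
                accuracy=acc, f1=f1, precision=prec)
```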
4.2. Implementation Details
The experiments were conducted using Python 3.8 as the programming language and PyTorch 1.7.0 as the framework for the CLCP-MT model. To ensure a fair comparison, all the trials were performed on a single NVIDIA RTX 3090 GPU (NVIDIA, Santa Clara, CA, USA) with 24 GB of video memory.
For the ISIC2018 dataset, we used DenseNet121 as the backbone. Each batch contained 48 samples: 12 labeled and 36 unlabeled. The models were trained for 100 epochs with a ramp-up period of 30.
For the Chest X-ray14 dataset, we used DenseNet169 as the backbone. Each batch contained 48 samples: 12 labeled and 36 unlabeled. Training lasted for 20 epochs with a ramp-up period of 10.
The learning rate was set to 1 × 10⁻⁴ and decayed exponentially by a factor of 0.9 per epoch. We trained with the Adam optimizer and used an EMA decay of 0.99. Consistent with numerous SSL algorithms [17,22], we applied various perturbations to the input unlabeled images, including random cropping, flipping, color jittering, and blurring.
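A sketch of this optimization and perturbation setup; the augmentation magnitudes and crop scale are illustrative values rather than the paper's exact settings.

```python
import torch
from torchvision import models, transforms

model = models.densenet121(pretrained=True)   # student backbone (DenseNet169 for Chest X-ray14)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)  # decay by 0.9 per epoch

# Perturbations applied to the unlabeled images
perturb = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.GaussianBlur(kernel_size=3),
])
```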
4.3. Comparison with the State-of-the-Art Methods
To validate the effectiveness of the CLCP-MT framework, we conducted comparative experiments with current mainstream semi-supervised learning approaches under identical experimental conditions. All the methods used the same training configuration, including data partitioning, preprocessing, input perturbations, learning rate schedule, and optimizer. On the ISIC2018 dataset, all the methods used a pretrained DenseNet121 as the backbone; on the Chest X-ray14 dataset, they used a pretrained DenseNet169.
4.3.1. Results on ISIC2018 Dataset
On the ISIC2018 dataset, we conducted a comprehensive comparison between our proposed method and several state-of-the-art approaches, including consistency-based methods (mean teacher and SRC-MT), a pseudo-labeling-based method (NM), a hybrid approach combining consistency and pseudo-labeling (FixMatch), and a curriculum learning-based method (FlexMatch). The mean teacher model enforces prediction consistency between the student network and teacher network through consistency loss. The teacher’s parameters are updated as the exponential moving average of the student’s parameters. Building upon mean teacher, SRC-MT incorporates SRC loss to model relational information among different samples for effective utilization of unlabeled data. The NM method functions as a pseudo-label estimator that propagates labels based on neighboring samples of unlabeled data. FixMatch generates pseudo-labels using a fixed threshold. It then computes cross-entropy loss between the pseudo-labels and strongly augmented predictions of the same sample. FlexMatch enhances FixMatch by replacing the fixed threshold with a dynamic threshold mechanism.
As shown in Table 1, we conducted experiments using 20% labeled data and 80% unlabeled data. The upper bound represents the fully supervised model trained with 100% (7000) labeled data, serving as the performance ceiling. The baseline denotes the fully supervised model trained with only 20% (1400) labeled data. Our proposed method outperforms the other state-of-the-art approaches across all the metrics except sensitivity. Compared to the mean teacher model, which also enforces prediction consistency between two models, the CLCP-MT model achieves improvements of 1.44% in AUC, 0.28% in specificity, 0.81% in accuracy, 4.8% in F1-score, and 8.17% in precision. Relative to the FixMatch method, which similarly constrains pseudo-labels with predictions from another model, CLCP-MT demonstrates enhancements of 0.87% in AUC, 0.77% in specificity, 0.04% in accuracy, 1.91% in F1-score, and 3.95% in precision. When compared to FlexMatch, which also employs dynamic thresholds, our CLCP-MT model shows superior performance with gains of 1.03% in AUC, 0.89% in specificity, 0.01% in accuracy, 2.01% in F1-score, and 3.86% in precision.
The experimental results demonstrate that all the methodologies outperform the baseline, indicating that unlabeled data can benefit the model. The mean teacher model enforces consistency between the predictions of the student and teacher networks, while the SRC-MT model incorporates a sample relation consistency paradigm on top of the mean teacher framework, leading to improvements in AUC, sensitivity, accuracy, F1-score, and precision, albeit with a 0.11% decline in specificity. This suggests that SRC-MT effectively leverages unlabeled sample information, though at the cost of increased false positives among negative samples with similar features. Compared to the mean teacher model, the FixMatch model exhibits superior performance across all the evaluation metrics except specificity, indicating its enhanced capability in detecting positive samples. The FlexMatch model employs a dynamic thresholding mechanism for pseudo-label generation but underperforms relative to FixMatch, suggesting its limited applicability to imbalanced medical imaging datasets. The NM model propagates labels based on neighboring unlabeled samples, achieving higher sensitivity than the other methods, which implies fewer missed diagnoses. Our proposed method surpasses all the comparative approaches in five of the six evaluation metrics, with only sensitivity being marginally lower than that of the NM model, demonstrating its superior effectiveness in utilizing unlabeled data.
4.3.2. Results on Chest X-Ray14 Dataset
On the Chest X-ray14 dataset, our approach was benchmarked against the consistency-based mean teacher model and the SRC-MT model, both of which were described in Section 4.3.1.
As illustrated in Table 2, our experimental setup comprised 20% labeled data and 80% unlabeled data. The upper bound represents the fully supervised model utilizing 100% (78,468) labeled data, serving as the performance ceiling for this experiment. The baseline constitutes the fully supervised model trained solely on 20% (15,694) labeled data, establishing our experimental starting point. Compared to the baseline, the CLCP-MT model demonstrated a 2.64% improvement in average AUC. Relative to the upper bound, the CLCP-MT model exhibited a 7.73% lower average AUC. When benchmarked against the MT model, the CLCP-MT architecture achieved AUC improvements in eight categories, including the Hernia class (an extreme minority representing only 0.2% of the dataset), and a 2.09% higher mean AUC. Similarly, the CLCP-MT model outperformed the SRC-MT model in eight categories, delivering a 1.36% improvement in mean AUC.
Compared to the baseline, the CLCP-MT model demonstrates improved average AUC values, indicating its effective utilization of additional discriminative information derived from unlabeled data. However, when benchmarked against the upper bound, our approach still leaves room for further improvement in leveraging unlabeled data. The CLCP-MT model achieves the highest average AUC among the comparative methods, outperforming both mean teacher and SRC-MT. Notably, it exhibits superior AUC performance in classifying the Hernia, Pneumonia, and Fibrosis categories, three underrepresented classes in the Chest X-ray14 dataset comprising merely 0.2%, 1.28%, and 1.5% of the total dataset, respectively. This result underscores our model's enhanced capability in recognizing tail-class data.
4.4. Ablation Study
4.4.1. Different Percentages of Labeled Data
We conducted ablation experiments on the ISIC2018 dataset under varying percentages of labeled training data. The upper bound represents the fully supervised model trained with 100% labeled data, while the baseline constitutes the fully supervised model trained exclusively on labeled data. Four label proportions (5%, 10%, 20%, and 30%) were selected for comparative analysis against both the baseline under different labeled data ratios and the upper bound.
As demonstrated in Table 3, the CLCP-MT model consistently outperforms the baseline across all four labeled data proportions (5%, 10%, 20%, and 30%). When utilizing 5% labeled data, the CLCP-MT model achieves improvements of 2.08% in AUC, 0.32% in sensitivity, 2.88% in specificity, 6.05% in accuracy, 12.12% in F1-score, and 15.06% in precision compared to the baseline using only 5% labeled data. With 20% labeled data, the model exhibits enhancements of 3.88% in AUC, 3.2% in sensitivity, 1.18% in specificity, 3.58% in accuracy, 10.38% in F1-score, and 14.46% in precision relative to the corresponding baseline. Furthermore, when moving from 20% to 30% labeled data, the CLCP-MT model shows additional gains of 0.46% in AUC, 6.05% in sensitivity, 0.86% in specificity, 0.08% in accuracy, and 0.85% in F1-score. At the 30% labeled data level, the model's AUC of 92.71% approaches the upper bound of 94.77% with a marginal gap of 2.06%, while its accuracy of 93.76% nears the upper bound of 95.29% with only a 1.53% difference. However, the F1-score still lags behind by 8.4%, indicating potential for further improvement in positive-class identification.
Under four distinct label proportions (5%, 10%, 20%, and 30%), the CLCP-MT model leveraging unlabeled data demonstrated statistically significant superiority over the baseline that solely utilized labeled data across all the evaluation metrics. This performance advantage remained consistent with increasing labeled data quantities. Specifically, when employing only 5% labeled data, our method achieved a 12.12% higher F1-score than the baseline through the utilization of unlabeled data, substantiating the model’s capability to extract valuable information from unlabeled datasets. While the model performance exhibited progressive improvement with additional labeled data, the enhancement became statistically insignificant when increasing from 20% to 30% labeled data. This observation indicates that augmenting labeled data yields diminishing returns: it substantially benefits model performance under data-scarce conditions but approaches a performance plateau as labeled data quantities increase.
Table 4 presents a comparative analysis of the mean AUC values between the CLCP-MT model and the MT model across varying proportions of labeled data in the Chest X-ray14 dataset. The CLCP-MT model consistently demonstrates superior performance under all the labeled data ratios. Specifically, with 2% labeled data, the CLCP-MT model achieves a 1.04% higher mean AUC value compared to the MT approach. When utilizing 20% labeled data, our method exhibits a more substantial improvement of 2.09% in mean AUC over the MT baseline. The progressive enhancement in the CLCP-MT model’s mean AUC with increasing labeled data quantities substantiates the positive impact of labeled data on model performance.
4.4.2. Effect of the Weight Coefficients of the Contrastive and Pseudo-Label Losses
We investigated the impact of varying the hyperparameters $\lambda_{cl}$ and $\lambda_{pl}$ in Equation (21). Here, $\lambda_{cl}$ denotes the weight of the labeled-data contrastive loss, while $\lambda_{pl}$ represents the weight of the pseudo-label loss. Both $\lambda_{cl}$ and $\lambda_{pl}$ were selected from the range [0, 1]; the specific values examined, together with the ablation study of these weight hyperparameters, are presented in Table 5. The configuration $\lambda_{cl} = 0$ and $\lambda_{pl} = 0$ indicates that the model exclusively employs the supervised loss and the consistency loss.
In comparison to the baseline configuration with $\lambda_{cl} = 0$ and $\lambda_{pl} = 0$, incorporating the labeled-data contrastive loss and the pseudo-label loss yields consistent improvements across all the evaluation metrics, including AUC, accuracy, and F1-score. The AUC rises from 90.81% to at most 92.25%, while the F1-score shows a more substantial increase from 55.87% to 61.82%, indicating a positive impact of these loss terms on the model's classification performance. Within the examined range, the results are more sensitive to one of the two weights than to the other (Table 5). The optimal AUC of 92.25% is achieved at the best-performing combination of $\lambda_{cl}$ and $\lambda_{pl}$ reported in Table 5, at which the accuracy reaches a near-optimal value of 93.68%. Beyond this configuration, further increases in either weight yield diminishing returns or marginal performance degradation.
Incorporating the contrastive learning and pseudo-labeling losses thus effectively enhances overall model performance. Appropriately increasing the corresponding weight improves the model's ability to capture positive samples, whereas increasing it excessively may reduce sensitivity. We selected the weight combination that best balances the model's discrimination between positive and negative samples and used this setting of $\lambda_{cl}$ and $\lambda_{pl}$ for the subsequent experiments.
4.4.3. Effect of Different Components
We conducted ablation studies using 20% (1400) labeled data and 80% (5600) unlabeled data from the training set. Our proposed CLCP-MT model consists of three components: the consistency regularization method, the semantic discrimination enhancement (SDE) module, and the category-adaptive pseudo-label regulation (CAPR) module. To investigate the contribution of each component to the overall performance, ablation experiments were carried out, as shown in Table 6.
The experimental results demonstrate that incorporating the consistency regularization method into the baseline model led to improvements of 2.44%, 3.74%, 0.90%, 2.77%, 5.58%, and 6.28% in AUC, sensitivity, specificity, accuracy, F1-score, and precision, respectively. Building upon the consistency regularization method, the addition of the SDE module further enhanced AUC, accuracy, F1-score, and precision, although sensitivity decreased by 0.90%. When the CAPR module was integrated on top of the consistency regularization method, AUC, sensitivity, F1-score, and precision increased by 1.83%, 0.06%, 0.05%, and 0.05%, respectively, while specificity and accuracy decreased by 0.79% and 0.22%. Finally, with both the SDE and CAPR modules added alongside the consistency regularization method, all the evaluation metrics showed improvements.
The aforementioned results demonstrate that the consistency regularization method significantly enhances the model's overall classification performance by enforcing output-level consistency between the student and teacher networks. While the SDE module facilitates deeper exploitation of discriminative information from labeled data, it may inadvertently push scarce positive samples toward negative regions in the highly class-imbalanced ISIC2018 dataset, consequently reducing sensitivity. The CAPR module introduces additional supervisory signals for unlabeled data, thereby strengthening the model's overall discriminative capability; however, the inherent noise in pseudo-labels can adversely affect precision, which limits the corresponding gains in the F1-score.