1. Introduction
Deep learning (DL) [1,2] is revolutionizing the way decisions are made across industries and is used in a wide range of fields, including natural language processing, medical and healthcare applications, and object recognition. Models such as Convolutional Neural Networks (CNNs) [3], Generative Adversarial Networks (GANs) [4], Recurrent Neural Networks (RNNs) [5], and Auto-Encoders [6] are among the most frequently used.
Convolutional Neural Networks are a rapidly growing and popular model structure for deep learning. They allow multi-layer computational models to progressively extract higher-level features from raw inputs. CNNs perform excellently in large-scale image processing and have a broad base of applications in areas such as autonomous driving systems [7], agriculture [8], and medical imaging [9,10]. For instance, autonomous driving systems leverage CNNs to interpret complex visual data in real time, ensuring safe and efficient navigation. Moreover, intelligent medicine applications utilize CNNs for medical image analysis, disease diagnosis, and personalized treatment planning. The continuous evolution and refinement of CNNs will undoubtedly lead to more sophisticated and innovative applications in the future, further integrating into and enriching our daily lives.
However, this success depends heavily on access to large-scale labeled data. When there is not enough labeled data available, the performance of deep networks decreases significantly [11]. In real-world scenarios, obtaining labeled data is expensive and sometimes impossible for nonspecialists, as is the case with medical images.
Semi-supervised learning (SSL) [12] addresses this issue by employing a limited quantity of labeled training data in conjunction with a substantial amount of unlabeled data. Adopting this approach can significantly reduce the labor of labeling data and swiftly enhance the network’s efficiency. Recently, deep semi-supervised learning has turned to a very interesting research problem, the open-world setting [13]. Traditional learning methods were designed for a closed-world setting; however, the real world is inherently open and dynamic, so previously unseen classes may appear in test data or during model deployment [14]. As deep learning rapidly advances, deep semi-supervised learning (DSSL) [15] has been suggested for fields where data labeling is challenging and expensive, particularly medical imaging. DSSL combines semi-supervised learning with a deep learning model as the backbone network. It can be classified into five categories: pseudo-labeling, consistency regularization, generative methods [16,17], graph-based methods [18,19], and hybrid methods [20].
The combination of consistency regularization and pseudo-labeling belongs to the holistic/hybrid methods [21,22,23], a current trend in SSL. Consistency regularization is founded on the principle that the output for perturbed data should remain consistent with the output for the original data. Pseudo-labeling [24,25] is a prevalent technique in semi-supervised learning. The basic idea is that models trained on labeled data should produce similar predictions, or identical pseudo-labels, for the same unlabeled data. These pseudo-labels are used as additional instances to train the model. Different pseudo-label selection methods lead to variations in model performance and can harm it if not implemented properly [26,27].
FixMatch [22], a holistic method, generates labels using consistency regularization and pseudo-labeling. Inspired by UDA [28] and ReMixMatch [21], FixMatch uses RandAugment [29] for strong augmentation to produce severely distorted versions of a given unlabeled image. Pseudo-labels are generated from weakly augmented unlabeled images (e.g., augmented only by flipping and shifting), and the loss is computed against the predictions for the corresponding strongly augmented images. Crucially, FixMatch uses a fixed threshold $\tau$ to divide the classes predicted for unlabeled data into two clusters, credible and noncredible. As shown in Figure 1, if the maximum prediction for an unlabeled image exceeds $\tau$, the corresponding class can be used as a pseudo-label to optimize the model; otherwise, the unlabeled image receives no pseudo-label.
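To make the selection rule concrete, the sketch below shows fixed-threshold pseudo-labeling in PyTorch. It is a minimal illustration of the FixMatch rule; the function and variable names are ours, not from any official implementation.

```python
import torch
import torch.nn.functional as F

def fixed_threshold_loss(logits_weak, logits_strong, tau=0.95):
    """Minimal sketch of FixMatch's fixed-threshold rule.

    logits_weak, logits_strong: model outputs for the weakly and strongly
    augmented views of the same unlabeled batch, shape (B, C).
    """
    probs = torch.softmax(logits_weak.detach(), dim=-1)  # no gradient through pseudo-labels
    max_prob, pseudo_label = probs.max(dim=-1)
    mask = (max_prob > tau).float()                      # keep only credible predictions
    loss = F.cross_entropy(logits_strong, pseudo_label, reduction="none")
    return (mask * loss).mean()
```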
FlexMatch [30] argues that this high, fixed threshold causes the model to under-consider the training difficulty of different categories, resulting in poorer results. Specific metrics can subtly indicate learning difficulty, but computing them requires partitioning a validation set, which is costly for medical data with limited labeling. FlexMatch therefore proposes a curriculum learning method, Curriculum Pseudo Labeling, which dynamically adjusts the threshold of each category during training without introducing additional parameters or computation. However, FlexMatch adjusts the threshold for each class based on the number of unlabeled samples entering model training, and this approach only works when the number of images per category is balanced. If the number of images varies greatly between classes, the threshold for the smaller classes is usually lower. The model is then prone to learning knowledge with low confidence, reducing its classification accuracy.
Real datasets in many domains suffer from category imbalance, which affects the robustness of the model. Faced with such problems, researchers have proposed data-level methods that use resampling (oversampling or undersampling). For example, Lin, T.Y. et al. [31] propose a novel undersampling method that offers significant advantages over other state-of-the-art methods on multiple publicly available datasets. However, resampling can be counterproductive [32]. In medical datasets, it is normal for at-risk cases to be rarer than other data, and forced balancing can yield models unsuitable for real-world applications. At the algorithmic level, some approaches modify the predictions of the underlying classifiers based on the frequency of each category, while others propose new loss functions that allow neural networks to handle unbalanced data streams online [33,34]. However, these methods require a priori knowledge and do not consider avoiding the imbalance problem in deep semi-supervised learning.
In this paper, we design Cumulative Effective Labeling (CEL) to evaluate learning difficulty using ground truth and effective pseudo-labels. CEL holds that the more learning instances a category has, the easier it becomes for the model to grasp that category. It counts the ground truth and highly reliable pseudo-labels at each iteration to determine each class’s learning difficulty, which is then used to calculate the threshold for each class in the next iteration. Additionally, we propose a way to calculate the threshold of each class from CEL, namely the Self-adaptive Dynamic Threshold (SDT). SDT uses a single hyperparameter and a carefully designed mapping function to let thresholds adapt during learning. It follows the idea of setting a high threshold when the learning difficulty is low and a low threshold when it is high. This prevents the model from acquiring incorrect knowledge and encourages it to learn efficiently from more data.
Our method, FldtMatch, combines SDT with a hybrid method and demonstrates remarkable performance. It achieves superior results on the DFUC2021 and ISIC2018 datasets compared with other semi-supervised methods, regardless of the backbone network used. When using EfficientNet [35] as the backbone network on the DFUC2021 dataset, our method obtains a macro F1-Score of 60.25%, exceeding FlexMatch, a well-known competitor in the field, by about 5.6%. On the ISIC2018 dataset, our method outperforms other dynamic threshold adjustment strategies by up to 2.2%.
In summary, this paper makes the following three contributions:
We design Cumulative Effective Labeling (CEL) to better assess each category’s learning difficulty. CEL underpins the subsequent dynamic threshold calculation and the model’s acquisition of knowledge.
We propose the Self-adaptive Dynamic Threshold (SDT), which adapts to different situations by introducing only one hyperparameter. An innovative mapping function efficiently computes per-category thresholds under unbalanced datasets, significantly enhancing model performance.
We combine SDT with a hybrid deep semi-supervised learning method, yielding FldtMatch. Compared with other threshold adjustment methods, FldtMatch achieves the best results on multiple datasets, in some cases by a significant margin.
3. Method
FlexMatch draws inspiration from Curriculum Labeling (CL) [44] and introduces Curriculum Pseudo Labeling (CPL). CPL collects and keeps records of the model’s highly reliable predictions on unlabeled data. It then calculates the frequency of each of these highly reliable predictions using the following formula:

$$\sigma_t(c) = \sum_{n=1}^{N} \mathbb{1}\big(\max p_{m,t}(y \mid \alpha(u_n)) > \tau\big) \cdot \mathbb{1}\big(\arg\max p_{m,t}(y \mid \alpha(u_n)) = c\big) \quad (6)$$

where $\sigma_t(c)$ is the number of samples judged to be in category $c$ with a probability exceeding a threshold $\tau$ at time step $t$, $p_{m,t}(y \mid \alpha(u_n))$ is the prediction for the weak augmentation $\alpha(u_n)$, and $\arg\max p_{m,t}(y \mid \alpha(u_n))$ is the pseudo-label for the weak augmentation.
This counting mechanism helps us understand the distribution of different categories in the unlabeled dataset, which can guide subsequent learning strategies or refine the model’s predictions. The advantage of this statistical pseudo-labeling is that it does not require much additional computational work and maintains data processing efficiency. At the same time, the maximum predicted values of the counted samples must exceed the predefined threshold, ensuring that frequent changes in categories during model instability do not adversely affect the calculation of dynamic thresholds. The learning difficulty $\beta_t(c)$ of each category is determined by Equation (7):

$$\beta_t(c) = \frac{\sigma_t(c)}{\max_{c'} \sigma_t(c')} \quad (7)$$

where $\sigma_t(c)$ is from Equation (6). Instead of using $\beta_t(c)$ directly as in Equation (5), FlexMatch introduces a mapping function $\mathcal{M}(\cdot)$ to compute the dynamic threshold $\tau_t(c)$ as follows:

$$\tau_t(c) = \mathcal{M}\big(\beta_t(c)\big) \cdot \tau \quad (8)$$
The mapping function effectively regulates the rate of threshold change, setting a minimum threshold to prevent erroneous knowledge from being learned. FlexMatch offers three methods for calculating the threshold for each class: (1) concave: $\mathcal{M}(x) = \ln(x+1)/\ln 2$, (2) linear: $\mathcal{M}(x) = x$, (3) convex: $\mathcal{M}(x) = x/(2-x)$, where $x$ is $\beta_t(c)$ [30].
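A minimal sketch of CPL’s per-class threshold computation follows, assuming confident pseudo-labels have already been collected into an array; names such as `cpl_thresholds` are ours for illustration.

```python
import math
import torch

def cpl_thresholds(pseudo_labels, num_classes, tau=0.95, mapping="convex"):
    """Sketch of FlexMatch's dynamic thresholds (Equations (6)-(8)).

    pseudo_labels: stored predictions for all unlabeled samples, shape (N,);
    entries are -1 until a confident prediction (> tau) has been recorded.
    """
    # Equation (6): count confident predictions per class.
    sigma = torch.bincount(pseudo_labels[pseudo_labels >= 0],
                           minlength=num_classes).float()
    # Equation (7): normalize by the best-learned class.
    beta = sigma / sigma.max().clamp(min=1.0)
    # Equation (8): map the learning difficulty and scale the fixed threshold.
    if mapping == "concave":
        m = torch.log1p(beta) / math.log(2.0)
    elif mapping == "convex":
        m = beta / (2.0 - beta)
    else:  # linear
        m = beta
    return m * tau  # per-class dynamic thresholds, shape (C,)
```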
However, these mapping functions do not accommodate unbalanced datasets. The reason is that, for a class $c$ with few samples, the pseudo-label count $\sigma_t(c)$ at moment $t$ is too low, and the dynamic threshold $\tau_t(c)$ obtained through Equations (7) and (8) is also low. Even after the model stabilizes following a period of training, $\tau_t(c)$ remains very low. A low threshold means that the model learns low-confidence predictions as pseudo-labels. In other words, the model is likely to learn wrong knowledge, leading to slower convergence. In Section 4.4.1, we show the problems with FlexMatch more visually.
3.1. Cumulative Effective Labeling
CPL only accounts for the number of high-confidence prediction results, as indicated in Equation (6). The learning difficulty from Equation (7) determines the dynamic threshold in Equation (8), which ultimately affects both the loss function and the model’s training process. However, the ground truth can have a more direct effect on learning difficulty. Since model training depends on the data, a large volume of data in a specific category will naturally influence the model’s learning of that category.
When the distribution of labeled data is unbalanced, meaning that some classes have significantly more labeled examples than others, the model’s training process can become biased. This bias can negatively impact its ability to make accurate predictions on unlabeled data. In other words, this bias can affect the selection of pseudo-labels generated from the model’s predictions on unlabeled data. Therefore, when using pseudo-labels for further training or analysis, we must consider potential imbalances in the labeled data distribution.
We propose Cumulative Effective Labeling, which takes the highly credible pseudo-labels and the ground truth of each category as effective labels, as in Equation (9):

$$\sigma'_t(c) = \sum_{n=1}^{N} \mathbb{1}\big(\max p_{m,t}(y \mid \alpha(u_n)) > \tau\big) \cdot \mathbb{1}\big(\arg\max p_{m,t}(y \mid \alpha(u_n)) = c\big) + \sum_{m=1}^{M} \mathbb{1}(y_m = c) \quad (9)$$

where the first term is from Equation (6), $M$ represents the number of labeled images, and $y_m$ is the ground truth of the $m$-th label. It is not one-hot encoded since no cross-entropy loss calculation is required. We then calculate the learning difficulty $\beta_t(c)$ through Equation (10):

$$\beta_t(c) = \frac{\sigma'_t(c)}{\max_{c'} \sigma'_t(c')} \quad (10)$$

where $\sigma'_t(c)$ is from Equation (9).
In practice, we can use an array to store the high-confidence prediction results for all samples. It is important to note that, as indicated in Equation (9), only prediction results that exceed the fixed threshold $\tau$ are recorded. This ensures the validity of the learning difficulty and the reliability of the subsequent pseudo-labels.
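Following the array-based bookkeeping just described, the sketch below computes the CEL count and the learning difficulty of Equations (9) and (10); the names are ours, and the stored `pseudo_labels` array uses −1 for samples that have never crossed the fixed threshold.

```python
import torch

def cel_learning_difficulty(pseudo_labels, labeled_targets, num_classes):
    """Sketch of Cumulative Effective Labeling (Equations (9) and (10)).

    pseudo_labels: per-sample stored predictions, shape (N,), -1 = none yet.
    labeled_targets: ground-truth class indices of the labeled set, shape (M,).
    """
    # Effective labels = confident pseudo-labels on the unlabeled data ...
    sigma = torch.bincount(pseudo_labels[pseudo_labels >= 0],
                           minlength=num_classes).float()
    # ... plus the ground truth of every labeled image (class indices, not one-hot).
    sigma += torch.bincount(labeled_targets, minlength=num_classes).float()
    # Equation (10): normalize by the class with the most effective labels.
    beta = sigma / sigma.max().clamp(min=1.0)
    return beta  # learning difficulty per class, shape (C,)
```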
3.2. Self-Adaptive Dynamic Threshold
Figure 2a illustrates the linear, convex, and concave mapping functions. Our approach to dynamic threshold adjustment considers both the number of samples and changes in that number. Categories with a large amount of training data have higher thresholds, and fluctuations in quantity have little impact on these thresholds, which remain stable. In contrast, when there is limited data, the thresholds start low; however, as the amount of data increases, the thresholds rise rapidly, ensuring the model converges more quickly. As a result, the linear mapping function $\mathcal{M}(x) = x$ is unsuitable for our purposes.
When comparing the suitability of convex and concave functions for modifying thresholds, especially in unbalanced datasets, the concave function emerges as the more appropriate choice. Concave functions inherently possess characteristics that allow for more nuanced and gradual adjustments to thresholds, which is particularly beneficial when dealing with datasets where the distribution of classes is uneven.
To this end, we propose a simple yet effective dynamic threshold method, the Self-adaptive Dynamic Threshold. Its core is the design of the mapping function in Equation (11), where $x$ is $\beta_t(c)$ from CEL and $\gamma$ is a hyperparameter.
Setting up a specialized mapping function for every dataset is not feasible. SDT spans numerous situations while requiring only the incorporation of one hyperparameter, $\gamma$. In Figure 2b, we present several typical examples of dynamically adjusted thresholds. The parameter $\gamma$ influences the curves’ minimum value and steepness. Specifically, as $\gamma$ increases, the curve becomes smoother, leading to a gentler change in the threshold. At the same time, SDT enforces a lower limit on the threshold, preventing it from becoming too low in categories with few samples. This threshold adjustment method is both sensitive and stable, particularly when working with unbalanced datasets.
Researchers can flexibly set $\gamma$ according to the specific dataset and experimental situation. When the number of images in each category is relatively balanced, $\gamma$ can be set to a smaller value; in this way, the model converges faster. Conversely, if there is a significant imbalance in the number of instances per category, $\gamma$ should be set to a relatively large value. In Section 4.4.3, we discuss the choice of $\gamma$ in more detail.
This approach prevents the threshold from becoming too low, as illustrated in Figure 2c, which could otherwise lead the model to incorporate erroneous information. Doing so helps mitigate the model’s bias towards categories with a larger number of instances.
Such a strategic choice of $\gamma$ based on the observed class distribution goes a considerable distance toward alleviating the problems caused by quantitative imbalance in the dataset. It fosters a more equitable treatment of all classes, enhancing the model’s overall performance and robustness in making accurate predictions across the entire spectrum of classes. Based on Equation (11), our SDT is calculated as shown below:

$$\tau_t(c) = \mathcal{M}\big(\beta_t(c)\big) \cdot \tau \quad (12)$$

where $\mathcal{M}(\cdot)$ is the SDT mapping function of Equation (11) and $\beta_t(c)$ is the learning difficulty from Equation (10).
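The sketch below turns Equations (11) and (12) into code. Because Equation (11) is characterized here only by its properties, we substitute an illustrative concave map with a floor, $\mathcal{M}(x) = \gamma + (1-\gamma)\ln(1+x)/\ln 2$; it shares SDT’s stated behavior (minimum value $\gamma$, smoother curve as $\gamma$ grows) but is an assumption, not the exact formula.

```python
import math
import torch

def sdt_thresholds(beta, tau=0.95, gamma=0.5):
    """Sketch of Self-adaptive Dynamic Threshold (Equations (11) and (12)).

    beta: learning difficulty per class from CEL, shape (C,), values in [0, 1].
    gamma: the single SDT hyperparameter.
    NOTE: this concave map is an illustrative stand-in for Equation (11).
    """
    m = gamma + (1.0 - gamma) * torch.log1p(beta) / math.log(2.0)
    return m * tau  # Equation (12): per-class dynamic thresholds
```

With this stand-in, a class that has produced no effective labels still keeps a threshold of $\gamma \cdot \tau$ rather than collapsing to zero, which is exactly the floor behavior described above.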
The loss function for unlabeled images at time step $t$ is given by Equation (13):

$$L_{u,t} = \frac{1}{B_u} \sum_{b=1}^{B_u} \mathbb{1}\big(\max(q_b) > \tau_t(\hat{q}_b)\big) \cdot H\big(\hat{q}_b,\, p_{m,t}(y \mid A(u_b))\big) \quad (13)$$

where $B_u$ denotes the number of unlabeled images entering model training at moment $t$, $q_b = p_{m,t}(y \mid \alpha(u_b))$, $\hat{q}_b = \arg\max(q_b)$ is the pseudo-label, $H$ is the cross-entropy, and $\tau_t(\cdot)$ is from Equation (12). For any unlabeled sample $u_b$, the maximum value of $q_b$ must exceed the dynamic threshold $\tau_t(\hat{q}_b)$ of the corresponding category. Only then do we consider the pseudo-label $\hat{q}_b$ reliable enough to compute the loss for $u_b$.
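A minimal sketch of Equation (13) with per-class thresholds follows; as before, the names are ours.

```python
import torch
import torch.nn.functional as F

def unlabeled_loss(logits_weak, logits_strong, class_thresholds):
    """Sketch of the unsupervised loss in Equation (13).

    class_thresholds: per-class dynamic thresholds tau_t(c), shape (C,).
    """
    probs = torch.softmax(logits_weak.detach(), dim=-1)
    max_prob, pseudo_label = probs.max(dim=-1)
    # Each sample is compared against the threshold of its own pseudo-class.
    mask = (max_prob > class_thresholds[pseudo_label]).float()
    loss = F.cross_entropy(logits_strong, pseudo_label, reduction="none")
    return (mask * loss).mean()
```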
3.3. FldtMatch: DSSL with Self-Adaptive Dynamic Threshold
To fully realize the potential of SDT, it must be integrated with a specific semi-supervised approach. We introduce FldtMatch, a method that combines SSL with a holistic approach to form a new deep semi-supervised learning technique. This holistic approach incorporates the concepts of consistency regularization and pseudo-labeling, briefly outlined in Section 1. The combination dynamically adjusts thresholds to promote more accurate and efficient learning from limited labeled data. To illustrate the model training process employing FldtMatch, we provide a graphical representation in Figure 3 and a detailed algorithmic description in Algorithm 1.
Suppose we have a training set $X \cup U$, where $X = \{(x_m, y_m)\}_{m=1}^{M}$ represents the labeled dataset and $U = \{u_n\}_{n=1}^{N}$ represents the unlabeled dataset. Each tuple in $X$ comprises an image $x_m$ and its label $y_m$, which is the ground truth. $U$ contains only unlabeled images $u_n$ and no ground truth.
During any iteration $t$, we use Equations (9)–(12) to compute the threshold for each category; for example, $\tau_t(3)$ in Figure 3 represents the dynamic threshold of the third class, and Algorithm 1 shows how it is obtained. Subsequently, weak augmentation $\alpha(\cdot)$ and strong augmentation $A(\cdot)$ are applied to each unlabeled image of a batch to obtain weakly and strongly augmented images, respectively. Strong augmentation transforms the data drastically, while weak augmentation applies milder changes; both aim to preserve the basic features of the image while introducing some variability. The weakly and strongly augmented images, along with the labeled images of the same batch, are fed into the model to obtain their corresponding predictions.
The predictions for the labeled images are directly compared with their ground truth annotations to compute the supervised loss $L_{s,t}$, as in Equation (14). This loss quantifies the discrepancy between the model’s predictions and the actual labels, providing a direct measure of the model’s performance on the labeled portion of the dataset:

$$L_{s,t} = \frac{1}{B_x} \sum_{b=1}^{B_x} H\big(y_b,\, p_{m,t}(y \mid \alpha(x_b))\big) \quad (14)$$

where $B_x$ denotes the number of labeled images in the batch at iteration $t$.
For the unlabeled data, pseudo-labels are first generated based on the dynamic threshold $\tau_t(c)$ and the model’s predictions on the weakly augmented images. These pseudo-labels serve as temporary labels for the unlabeled data, reflecting the model’s current understanding of the data distribution. Subsequently, these pseudo-labels are compared with the model’s predictions on the corresponding strongly augmented images to calculate the unsupervised loss $L_{u,t}$ from Equation (13). This loss measures the consistency of the model’s predictions across different augmentations of the same image, encouraging robust and consistent predictions even when the input varies. Finally, we combine $L_{s,t}$ and $L_{u,t}$ to optimize the model in the same way as in Equation (1), where the weighting term $\lambda$ is a hyperparameter, which we set to 1 in this paper.
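Putting the pieces together, the sketch below shows one FldtMatch optimization step under the assumptions above; it reuses `unlabeled_loss` and the per-class thresholds from the earlier sketches, and the batch layout is illustrative.

```python
import torch.nn.functional as F

def fldtmatch_step(model, optimizer, labeled_batch, unlabeled_batch,
                   class_thresholds, lam=1.0):
    """Sketch of one training step combining Equations (13), (14), and (1)."""
    x_weak, y = labeled_batch            # weakly augmented labeled images, targets
    u_weak, u_strong = unlabeled_batch   # two views of the unlabeled images

    loss_s = F.cross_entropy(model(x_weak), y)                 # Equation (14)
    loss_u = unlabeled_loss(model(u_weak), model(u_strong),
                            class_thresholds)                  # Equation (13)
    loss = loss_s + lam * loss_u                               # Equation (1), lambda = 1

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```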
The initialization phase has a time complexity of $O(C + N)$, where $C$ is the number of classes and $N$ is the number of unlabeled images. Next, as the model trains, CEL must be computed for each category. This can be accomplished in a single traversal, resulting in a time complexity of $O(M + N)$, in which $M$ is the number of labeled images. Finally, calculating the learning difficulty $\beta_t(c)$ and the dynamic threshold $\tau_t(c)$ has a time complexity of $O(C)$. Since both CEL and SDT are computed in each iteration, the method’s time complexity is $O(T(M + N + C))$, in which $T$ is the number of iterations. Given that $C$ is much less than $M + N$, this simplifies to $O(T(M + N))$.
Algorithm 1 FldtMatch: DSSL with Self-adaptive Dynamic Threshold
Require: Labeled set $X = \{(x_m, y_m)\}_{m=1}^{M}$, unlabeled set $U = \{u_n\}_{n=1}^{N}$. Labeled image number $M$, unlabeled image number $N$. Model iteration number $T$. Number of classes $C$. Fixed threshold $\tau$.
1: $\beta_0(c) \leftarrow 1$ for $c = 1, \ldots, C$ {Initialize the learning difficulty of all classes as 1, indicating easy learning.}
2: $\hat{q}_n \leftarrow -1$ for $n = 1, \ldots, N$ {Initialize predictions of all unlabeled data as −1, indicating no pseudo-labeling.}
3: for $t = 1$ to $T$ do
4:  for $c = 1$ to $C$ do
5:   Compute $\sigma'_t(c)$ using Equation (9) {Counting effective labeling}
6:  end for
7:  Calculate $\beta_t(c)$ for all classes using Equation (10)
8:  Calculate $\tau_t(c)$ for all classes using Equation (12) {Suppose a batch has $B_x$ labeled images and $B_u$ unlabeled images.}
9:  for each $x_b$ and $u_b$ in this batch do
10:   $\alpha(u_b)$ and $A(u_b)$ are generated by RandAugment [29] or otherwise [45].
11:   $p_{m,t}(y \mid \alpha(u_b))$, $p_{m,t}(y \mid A(u_b))$, and $p_{m,t}(y \mid \alpha(x_b))$ are the predictions provided by the model.
12:   if $\max p_{m,t}(y \mid \alpha(u_b)) > \tau_t\big(\arg\max p_{m,t}(y \mid \alpha(u_b))\big)$ then
13:    $\hat{q}_b \leftarrow \arg\max p_{m,t}(y \mid \alpha(u_b))$
14:   end if
15:  end for
16:  Calculate the loss $L$ using Equations (13) and (14)
17:  Optimize the model using $L$.
18: end for