DYMatch: Semi-Supervised Learning with Dynamic Pseudo Labeling and Feature Consistency

Abstract: A considerable number of approaches based on consistency regularization and pseudo-labeling have been proposed for Semi-supervised Learning (SSL). These approaches significantly enhance SSL methods by using large amounts of unlabeled data to improve the model's performance. However, existing methods may fail to use the unlabeled data efficiently, mainly because of the challenges of pseudo-label estimation. In this paper, we begin by analyzing the impact of the pseudo-label estimation method on training the SSL model. Furthermore, we emphasize that an effective pseudo-label estimation method should reflect the difference in recognition performance among samples from different categories, and should also maintain both a high quantity and a high quality of pseudo labels during each training iteration. Based on the above analysis, we propose DYMatch, an SSL method that employs a dynamic estimation process. Firstly, a dynamic pseudo-label estimation method based on the Gaussian mixture model is proposed to dynamically estimate the confidence threshold for different categories. Secondly, a feature-correlation consistency regularization method is introduced to further enhance learning on unlabeled data. The experimental results show that DYMatch is a simple and effective SSL method, especially when labeled data are limited.


Introduction
In the field of computer vision, deep learning methods have consistently been at the forefront, and their superior performance relies heavily on the availability of enough accurately annotated data [1,2]. However, acquiring enough labeled data is a labor-intensive and expensive task. Various methods are now available to address this issue, such as weakly supervised learning [3,4], zero-shot learning [5], and semi-supervised learning [6]. Among them, semi-supervised learning offers an effective approach by training on a small amount of labeled data together with a large amount of unlabeled data. It shows great potential in practical applications and can significantly reduce the dependence of training on laborious data collection and data annotation [7][8][9][10][11][12].
The main challenge in SSL lies in effectively utilizing unlabeled data to improve the model's generalization performance [6]. Consistency regularization [13][14][15] and pseudo-labeling [16,17] are the two most widely used paradigms designed for SSL, and their combinations also demonstrate good performance [9,10,18,19]. The key idea underlying these methods is that the training process of the SSL model should follow the low-entropy assumption, that is, the model should maintain consistent predictions when different perturbations are applied to the same unlabeled data [6]. However, a potential limitation of pseudo-labeling methods is the requirement to set a fixed confidence threshold [7][8][9][18] or to implement a stage-wise threshold adjustment method [20] for the selection of suitable pseudo labels.
In pseudo-labeling methods designed with a fixed confidence threshold, the model selects unlabeled data whose prediction scores surpass the threshold for training and directly discards those falling below it. For example, UDA [18] and FixMatch [9] employ a fixed high-confidence threshold to maintain the quality of the chosen pseudo labels. However, during the early training stages, a fixed high-confidence threshold (e.g., 0.99) may exclude too much unlabeled data from training, leaving a large amount of it unused. This approach overlooks the fact that, in the early training stages, the model has a limited ability to distinguish unlabeled samples, which makes high-confidence predictions hard to generate. An enhancement over the fixed threshold is a stage-wise threshold adjustment strategy. For example, Dash [20] and AdaMatch [21] opt to gradually increase the threshold as training progresses. Such methods utilize unlabeled data better than methods using a fixed confidence threshold. However, the performance of the stage-wise threshold adjustment strategy is directly influenced by pre-defined hyper-parameters, whose setting is in most cases largely decoupled from the model's training process. In particular, it disregards the varying difficulty of learning data from different categories. Furthermore, optimal hyper-parameters have to be searched for on each dataset. FlexMatch [10] attempts to employ different local (class-based) thresholds for samples from different categories. Although the local thresholds take into account the model's different learning status for different categories, they are still derived from a pre-defined fixed global (dataset-based) threshold. In essence, these methods ignore the crucial effect of the model's training process on pseudo-label estimation, whether for global or local thresholds.
Essentially, the core requirement for a pseudo-label estimation method is to maintain a balance between the quantity and the quality of pseudo labels during the training process. Methods that use a high fixed threshold explicitly abandon the requirement for quantity and concentrate solely on generating high-quality pseudo labels. However, when labeled data are extremely scarce, even a high fixed threshold has difficulty ensuring the quality of pseudo labels. Conversely, the stage-wise threshold adjustment method compromises the quality of pseudo labels in favor of increasing their quantity. Such a strategy is susceptible to introducing incorrect pseudo labels, which leads to cognitive bias during the training process. Therefore, a novel dynamic pseudo-label estimation method based on the Gaussian mixture model is proposed in this paper. We assume that the prediction scores for a specific category in a training iteration are sampled from a Gaussian mixture distribution comprising two components, namely the positive and the negative distribution. This assumption can be extended to the whole unlabeled data. By solving this Gaussian mixture model, the model can assign each category a local threshold that aligns with the current training status. Specifically, for a given category, we model the prediction scores of this category as a weighted sum of the positive and negative distributions, and employ maximum likelihood estimation to estimate the parameters of each Gaussian distribution. Consequently, the dynamic pseudo-label estimation method can adaptively generate a suitable local threshold at different training iterations so as to stabilize the quantity and quality of pseudo labels. Intuitively, during the early training stages, the dynamic pseudo-label estimation method employs a relatively low threshold to encourage the model to train on more unlabeled data and to accelerate convergence. As the classification performance improves and the prediction confidence increases, a higher threshold is generated to filter out incorrect pseudo labels, alleviating the cognitive bias of the model.
Inspired by previous research [22,23], we recognize the significance of constraining the training of the feature representation in SSL tasks. Therefore, in DYMatch, we also propose a feature-correlation consistency regularization method, whose integration with the dynamic pseudo-label estimation method can effectively enhance the model's classification performance. Unlike other feature-constraint methods, the feature-correlation consistency regularization method is applied selectively to the unlabeled data whose prediction scores exceed the confidence threshold generated by the dynamic pseudo-label estimation method. Moreover, we implement feature-correlation consistency regularization and dynamic pseudo-label estimation at different layers of the model, which avoids the influence of a strong coupling between these two methods on the model's performance.
The paper is structured as follows. Section 2 reviews the common pseudo-labeling and consistency regularization methods in SSL, and also introduces the relevant background on the Gaussian mixture model and its applications in SSL. Section 3 details the motivation and implementation of DYMatch. In Section 4, we present the results of the comparative experiments and ablation studies. The paper is summarized in Section 5.

Consistency Regularization and Pseudo-Labeling Methods in SSL
The SSL task addressed in this paper can be defined as a classification task with $C$ categories. We define $\{(\mathcal{X}_b, \mathcal{Y}_b) : b \in (1, \cdots, B)\}$ as the batches of labeled training samples, where $\mathcal{X}_b$ and $\mathcal{Y}_b$ represent the $b$-th batch of labeled samples and the labels corresponding to these samples, respectively. $B$ represents the number of these batches, and $b$ is also used to denote the index of an unlabeled sample in a batch. In addition, let $\{\mathcal{U}_b : b \in (1, \cdots, B)\}$ be the batches of unlabeled data. We use $N_L$ and $N_U$ to denote the number of labeled and unlabeled samples in each batch of training data, typically with $N_L < N_U$. $p_m(y \mid x_i; \theta)$ denotes the prediction score of the model $m$ with parameters $\theta$ for input $x_i$, where $x_i$ represents the $i$-th sample in a training batch.
In SSL, the motivation for consistency regularization usually follows the smoothness assumption or the low-density assumption [6], which means that the model should generate similar predictions for a given unlabeled sample under various perturbations. Consistency regularization can therefore be viewed as using the unlabeled data to find a smooth manifold on which the whole dataset lies [24]. A common strategy for introducing perturbations is data augmentation [25,26], which applies image transformations while preserving the semantic information of the data. Specifically, consistency regularization demands that the prediction for an unlabeled sample be similar to the prediction for its augmented counterpart. Therefore, most SSL models apply the consistency regularization loss term in Equation (1) to a batch of $N_U$ unlabeled samples:

$$\sum_{b=1}^{N_U} \left\| p_m(y \mid \alpha(u_b); \theta) - p_m(y \mid \alpha'(u_b); \theta) \right\|_2^2 \tag{1}$$

where $\alpha(\cdot)$ and $\alpha'(\cdot)$ represent two different data augmentation operations, and $u_b$ denotes the $b$-th unlabeled sample in a batch.

The Mean Teacher approach [27] contains a teacher model and a student model. The student model closely resembles the regular Π-Model [13], and the teacher model has the same network structure as the student model but updates its parameters as an exponential moving average of the student's weights. In Mean Teacher, the predictions of the student model and the teacher model are used to calculate the consistency regularization loss. Specifically, one of the two predictions in Equation (1) is produced by the model updated with the Exponential Moving Average (EMA) method. This operation yields a relatively stable feature representation, significantly improving the classification performance. Dual Student [28] [...] in Equation (1) represents an adversarial transformation. Consistency regularization is performed between the predictions for the original unlabeled data and for the perturbed data. UDA [18] can be viewed as an unsupervised data augmentation method that analyzes the contribution of noise perturbation methods to consistency regularization and replaces simple perturbation techniques with advanced data augmentation methods, such as RandAugment [25] and AutoAugment [26]. Within the consistency regularization framework, UDA extends data augmentation from fully supervised learning to SSL. In this paper, DYMatch also uses a combination of weak and strong data augmentation, similar to FixMatch [9], to introduce perturbations to the unlabeled data. However, whereas DYMatch applies feature-correlation consistency regularization in the feature space, existing consistency regularization methods tend to impose consistency on the model's predictions. The strong coupling with pseudo-labeling in these approaches may restrict the model's classification performance.
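The two ingredients described above, an EMA teacher and a consistency penalty between two sets of predictions, can be sketched in a few lines. This is a minimal illustration and not the implementation of any cited method: the function names, the flattened-weight representation and the decay value are assumptions made for the example.

```python
def ema_update(teacher, student, decay=0.999):
    """One EMA step: teacher <- decay * teacher + (1 - decay) * student.

    Weights are represented as flat lists of floats (an assumption made
    to keep the sketch free of any deep learning framework).
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def consistency_loss(preds_a, preds_b):
    """Mean squared L2 distance between two batches of prediction vectors."""
    total = 0.0
    for p, q in zip(preds_a, preds_b):
        total += sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return total / len(preds_a)


# Toy usage: two scalar "weights" and one two-class prediction pair.
teacher = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
loss = consistency_loss([[0.9, 0.1]], [[0.7, 0.3]])
```

With a decay close to 1 the teacher changes slowly, which is what makes its predictions a relatively stable target for the student.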
The difference between pseudo-labeling and consistency regularization methods is that consistency regularization methods typically rely on constraints imposed by various data transformations. In contrast, the pseudo-labeling method relies more on obtaining good pseudo labels with high confidence and adding them to the training set as labeled data. By defining $q_b = p_m(y \mid u_b)$ for the $b$-th of $N_U$ unlabeled samples in a batch, the loss function for unlabeled data can be derived as follows:

$$\frac{1}{N_U} \sum_{b=1}^{N_U} \mathbb{1}\left(\max(q_b) > \tau\right) H(\hat{q}_b, q_b) \tag{2}$$

where $\hat{q}_b = \arg\max(q_b)$, $\tau$ is a hyper-parameter known as the confidence threshold, $H(\cdot)$ is the standard cross-entropy loss function, and $\mathbb{1}(\cdot)$ is a mask function based on $\tau$. Additionally, the $\arg\max$ operation is used to generate a "one-hot" distribution from the model's predictions. In Pseudo-Label [16], the model is first trained in a fully supervised manner on labeled data. Subsequently, the same model is employed to predict the unlabeled data, and the predictions with the maximum confidence are taken as the pseudo labels. In FlexMatch [10], Curriculum Pseudo Labeling (CPL), a curriculum learning method, is employed to learn from the unlabeled data according to the model's training status. Specifically, the significant contribution of CPL lies in flexibly adjusting the thresholds for each category during the training process, which allows more unlabeled data to be used for training. The dynamic pseudo-label estimation method in DYMatch is also based on the model's training status. It dynamically selects local thresholds for different categories while simultaneously estimating a global threshold. This strategy helps the model maintain a good combination of pseudo labels in terms of both quantity and quality at any training iteration.
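As an illustration of the fixed-threshold pseudo-labeling loss described above, the following sketch computes the mask and the masked cross-entropy for a small batch. The function name and the threshold value 0.95 are assumptions chosen for the example; when the one-hot target and the prediction come from the same forward pass, the cross-entropy reduces to the negative log of the maximum probability.

```python
import math


def pseudo_label_loss(probs, tau=0.95):
    """Masked cross-entropy for one batch of unlabeled predictions.

    probs: list of per-sample class-probability lists (each sums to 1).
    tau:   fixed confidence threshold (illustrative value, an assumption).
    Returns (loss, mask), where mask[b] is 1 if sample b is kept.
    """
    total, mask = 0.0, []
    for q in probs:
        conf = max(q)                   # max(q_b)
        keep = 1 if conf > tau else 0   # indicator 1(max(q_b) > tau)
        mask.append(keep)
        # Cross-entropy against the one-hot argmax label is -log(conf).
        total += keep * -math.log(conf)
    return total / len(probs), mask


batch = [[0.98, 0.01, 0.01], [0.60, 0.30, 0.10], [0.02, 0.97, 0.01]]
loss, mask = pseudo_label_loss(batch)
```

In this toy batch the second sample falls below the threshold and contributes nothing to the loss, which is exactly the unused data that a high fixed threshold produces early in training.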

Gaussian Mixture Model and Its Application in SSL
Gaussian Mixture Models (GMMs) are commonly employed in machine learning to address clustering problems. The GMM belongs to the family of mixture models, which assume that each observed sample, whose label is unknown, is drawn from one of several multivariate Gaussian distributions. Specifically, the GMM assumes that the observed samples in the dataset belong to $K$ different categories. However, because the label information of the samples is missing, a prior probability $\pi_k$ is used to denote the likelihood that a sample belongs to category $k$, and the distribution corresponding to this category is modeled as a multivariate normal distribution $\mathcal{N}(\mu_k, \Sigma_k)$. Once the parameter estimates of the model are obtained, the posterior distribution of an observed sample can be calculated, which ultimately determines its category. In theory, the GMM is capable of fitting datasets with arbitrary distributions, and it performs better on datasets that can be modeled as clusters of multivariate Gaussian distributions [30]. In this paper, the model's prediction scores for the samples are continuous variables. Therefore, the GMM is selected to model and analyze them.
Within SSL, there is a crucial task known as noisy label learning, also referred to as robust training [31]. In this task, the GMM is an important and effective method for estimating noise rates [32,33]. The estimated noise rate is widely used to reweight samples in robust classifiers [34,35] or to determine the quantity of samples considered as clean training samples [36][37][38].
A novel SSL framework, DivideMix [39], has been proposed to deal with the challenge of training with noisy labels. Firstly, DivideMix proposes the concept of co-division, a process of training two networks simultaneously. The GMM is employed to fit the loss distribution of the training data in order to partition the training data into a labeled set and an unlabeled set. The labeled set can be considered as the samples with correct labels, while the unlabeled set can be regarded as the samples with noisy labels. The separated datasets are then used to train an SSL model. By iteratively repeating this process, the model gradually overcomes the influence of noisy labels. Inspired by DivideMix, DYMatch integrates this dynamic estimation method into semi-supervised classification learning. Specifically, in DYMatch, the dynamic estimation method is employed to yield the local threshold for a specific category, which effectively improves the utilization of unlabeled data.

DYMatch
The processing pipeline for unlabeled data in DYMatch is shown in Figure 1. In this section, we provide a detailed description of how the DYMatch model deals with the semi-supervised classification task. Additionally, we analyze DYMatch's two core components, namely, the dynamic pseudo-label estimation method based on the Gaussian Mixture Model and the feature-correlation consistency regularization method.

Motivation of Dynamic Pseudo-Label Estimation Method
The core idea of the pseudo-labeling method is to integrate more unlabeled data into the training set by employing an appropriate confidence threshold, thereby enriching the distribution of the training set and improving the classification and generalization performance of the model. In this section, inspired by Wang et al. [12], we treat the model's prediction scores for a specific category as a binary classification problem. The analysis of this binary classification problem motivates the design of the dynamic pseudo-label estimation method.
Assume a binary classification problem in which the true distribution is a uniform mixture of two Gaussian distributions, that is, the sample set contains positive samples ($Y = +1$) and negative samples ($Y = -1$). The input data $X$ follows the conditional distribution:

$$X \mid Y = -1 \sim \mathcal{N}(\mu_1, \sigma_1^2), \qquad X \mid Y = +1 \sim \mathcal{N}(\mu_2, \sigma_2^2) \tag{3}$$

Furthermore, we assume that $\mu_2 > \mu_1$. Given that binary classification models often employ a $\mathrm{sigmoid}(\cdot)$ function as the output activation function, the prediction score of the model can be defined as:

$$s(x) = \frac{1}{1 + \exp\left(-\beta\left(x - \frac{\mu_1 + \mu_2}{2}\right)\right)} \tag{4}$$

where $\beta$ is used to denote the current training status of the model. Typically, during the training process, the model should gradually become more confident in its classification, so $\beta$ is a gradually increasing positive parameter. $(\mu_1 + \mu_2)/2$ is the Bayes optimal linear decision boundary. In the early training stages, $\beta$ is relatively small. The decision boundary is effectively close to the input data $x$, and $s(x)$ may be close to 0.5, indicating that the model fails to generate a confident prediction for the input data $x$. As the model becomes more confident, $\beta$ is expected to grow gradually during training, and the decision boundary effectively moves away from the input data $x$. When $x - (\mu_1 + \mu_2)/2$ is positive and large enough, $s(x)$ tends towards 1. In this case, the input data $x$ is viewed as a positive sample. When $x - (\mu_1 + \mu_2)/2$ is negative and its absolute value is large enough, $s(x)$ approaches 0. In this case, the model assigns the input data $x$ to the negative class.
Based on the above analysis, we can employ a fixed threshold $\tau \in (0.5, 1)$ to partition the prediction scores $s(x)$. The input data $x$ will be assigned the pseudo label $+1$ when $x$ satisfies the following condition:

$$s(x) > \tau \tag{5}$$

Equation (5) can be rewritten as:

$$x > \frac{\mu_1 + \mu_2}{2} + \frac{1}{\beta} \log\frac{\tau}{1 - \tau} \tag{6}$$

Likewise, $x$ will be assigned the pseudo label $-1$ if:

$$s(x) < 1 - \tau \tag{7}$$

and Equation (7) is equivalent to:

$$x < \frac{\mu_1 + \mu_2}{2} - \frac{1}{\beta} \log\frac{\tau}{1 - \tau} \tag{8}$$

Finally, when $1 - \tau \le s(x) \le \tau$, the model cannot generate a high-confidence prediction for the input $x$. Generally, the utilization of unlabeled data is directly affected by the confidence threshold $\tau$. Specifically, as the confidence threshold $\tau$ increases, the utilization of unlabeled data gradually decreases. Moreover, in the early training stages, when $\beta$ is small and the model cannot generate confident predictions, using a higher threshold may result in a lower utilization rate of the unlabeled data and slower convergence.
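The effect of the training-status parameter on pseudo-label assignment can be checked numerically. The sketch below assumes the sigmoid confidence score centred on the midpoint of the two class means, as described above; the names `mu1`, `mu2`, `beta` and `tau`, and all numeric values, are illustrative assumptions.

```python
import math


def confidence(x, mu1, mu2, beta):
    """Sigmoid confidence score centred on the boundary (mu1 + mu2) / 2."""
    return 1.0 / (1.0 + math.exp(-beta * (x - (mu1 + mu2) / 2.0)))


def pseudo_label(x, mu1, mu2, beta, tau):
    """Return +1, -1, or 0 (sample left unused) for tau in (0.5, 1)."""
    s = confidence(x, mu1, mu2, beta)
    if s > tau:
        return 1
    if s < 1.0 - tau:
        return -1
    return 0


def margin(beta, tau):
    """Distance from the boundary that x must exceed to get a pseudo label."""
    return math.log(tau / (1.0 - tau)) / beta


# The same input is unused early in training (small beta) and labeled later.
early = pseudo_label(1.0, mu1=-1.0, mu2=1.0, beta=1.0, tau=0.95)
late = pseudo_label(1.0, mu1=-1.0, mu2=1.0, beta=5.0, tau=0.95)
```

Because the margin is inversely proportional to `beta`, a fixed high threshold discards a wide band of inputs around the boundary while the model is still unconfident.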
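If we integrate over $x$, the following conditional probabilities for the pseudo label $Y_p$ can be obtained:

$$P(Y_p = 1 \mid Y = 1) = \Phi\left(\frac{\frac{\mu_2 - \mu_1}{2} - \frac{1}{\beta}\log\frac{\tau}{1 - \tau}}{\sigma_2}\right) \tag{9}$$

$$P(Y_p = 1 \mid Y = -1) = \Phi\left(\frac{\frac{\mu_1 - \mu_2}{2} - \frac{1}{\beta}\log\frac{\tau}{1 - \tau}}{\sigma_1}\right) \tag{10}$$

$$P(Y_p = -1 \mid Y = -1) = \Phi\left(\frac{\frac{\mu_2 - \mu_1}{2} - \frac{1}{\beta}\log\frac{\tau}{1 - \tau}}{\sigma_1}\right) \tag{11}$$

$$P(Y_p = -1 \mid Y = 1) = \Phi\left(\frac{\frac{\mu_1 - \mu_2}{2} - \frac{1}{\beta}\log\frac{\tau}{1 - \tau}}{\sigma_2}\right) \tag{12}$$

where $\Phi$ is the cumulative distribution function of a standard normal distribution.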
If $P(Y = 1) = P(Y = -1) = 0.5$, we can obtain the following formulas:

$$P(Y_p = 1) = \frac{1}{2}\Phi\left(\frac{\frac{\mu_2 - \mu_1}{2} - \frac{1}{\beta}\log\frac{\tau}{1 - \tau}}{\sigma_2}\right) + \frac{1}{2}\Phi\left(\frac{\frac{\mu_1 - \mu_2}{2} - \frac{1}{\beta}\log\frac{\tau}{1 - \tau}}{\sigma_1}\right) \tag{13}$$

$$P(Y_p = -1) = \frac{1}{2}\Phi\left(\frac{\frac{\mu_2 - \mu_1}{2} - \frac{1}{\beta}\log\frac{\tau}{1 - \tau}}{\sigma_1}\right) + \frac{1}{2}\Phi\left(\frac{\frac{\mu_1 - \mu_2}{2} - \frac{1}{\beta}\log\frac{\tau}{1 - \tau}}{\sigma_2}\right) \tag{14}$$

When $\sigma_1 \neq \sigma_2$, we obtain $P(Y_p = 1) \neq P(Y_p = -1)$. Specifically, when the standard deviations ($\sigma_1$ and $\sigma_2$) of the two normal distributions are unequal, the samples of the two categories have different variability. In a binary classification task, this gap may affect the model's decision boundary. The model may pay more attention to the category with the smaller standard deviation, because the samples in that category are relatively more concentrated. As a result, $P(Y_p = 1)$ and $P(Y_p = -1)$ are not equal. The model adjusts the decision boundary according to the distribution properties of the different classes, making more confident predictions for the category with the smaller standard deviation while being more cautious for the category with the larger standard deviation. This leads to unequal prediction performance of the model for the two classes. In fact, when a larger confidence threshold $\tau$ is used, the imbalance in pseudo-label estimation becomes more obvious. Imbalanced pseudo-label estimation may distort decision boundaries and lead to cognitive bias in pseudo-labeling. A simple remedy is to use different local thresholds for different categories when estimating pseudo labels.
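The imbalance between the two pseudo-label rates can be reproduced numerically. The following sketch evaluates the two marginal probabilities under the assumptions of this section (equal class priors, Gaussian class-conditionals, sigmoid confidence); the function names and the parameter values are illustrative assumptions, and the standard normal CDF is computed from `math.erf`.

```python
import math


def phi_cdf(v):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def pseudo_label_rates(mu1, mu2, sigma1, sigma2, beta, tau):
    """Return (P(Y_p = 1), P(Y_p = -1)) for equal class priors."""
    t = math.log(tau / (1.0 - tau)) / beta   # margin around the boundary
    half = (mu2 - mu1) / 2.0                 # half the gap between the means
    p_pos = 0.5 * phi_cdf((half - t) / sigma2) + 0.5 * phi_cdf((-half - t) / sigma1)
    p_neg = 0.5 * phi_cdf((half - t) / sigma1) + 0.5 * phi_cdf((-half - t) / sigma2)
    return p_pos, p_neg


balanced = pseudo_label_rates(-1.0, 1.0, 1.0, 1.0, beta=2.0, tau=0.95)
skewed = pseudo_label_rates(-1.0, 1.0, 0.5, 2.0, beta=2.0, tau=0.95)
```

With equal standard deviations the two rates coincide, while unequal standard deviations give the two classes different shares of the pseudo labels under the same shared threshold.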
According to Equations (13) and (14), we can define $a_1 = \frac{1}{2\sigma_1}$, $a_2 = \frac{1}{2\sigma_2}$, $b_1 = \frac{\log(\tau/(1 - \tau))}{\beta\sigma_1}$, $b_2 = \frac{\log(\tau/(1 - \tau))}{\beta\sigma_2}$, and $z = \mu_2 - \mu_1$, so that $P(Y_p = 1) + P(Y_p = -1)$ can be written as:

$$P(Y_p = 1) + P(Y_p = -1) = \frac{1}{2}\left[\Phi(a_1 z - b_1) + \Phi(-a_1 z - b_1)\right] + \frac{1}{2}\left[\Phi(a_2 z - b_2) + \Phi(-a_2 z - b_2)\right] \tag{15}$$

According to Equation (15), each bracketed term has the form:

$$f(z) = \Phi(az - b) + \Phi(-az - b), \qquad a > 0,\ b > 0 \tag{16}$$

By taking the derivative of $f$ in Equation (16), we can obtain:

$$f'(z) = a\left[\phi(az - b) - \phi(-az - b)\right] \tag{17}$$

where $\phi$ is the probability density function of the standard normal distribution. According to its symmetry, Equation (17) can be rewritten as:

$$f'(z) = a\left[\phi(az - b) - \phi(az + b)\right] \tag{18}$$

Since $a$, $b$ and $z$ are positive, $|az - b| < |az + b|$, and the following conclusion can be derived: $\phi(az - b) > \phi(az + b)$. According to the above derivation, we can conclude that $f'(z) > 0$, that is, $f(z)$ is monotonically increasing on the interval $(0, \infty)$. Extending this conclusion to Equation (15), we can obtain that $P(Y_p = 1) + P(Y_p = -1)$ is also monotonically increasing in $z$. Therefore, the utilization rate of pseudo labels, $P(Y_p = 1) + P(Y_p = -1)$, decreases as $\mu_2 - \mu_1$ becomes smaller. Specifically, when the distributions of two categories are more similar, the model may face challenges in accurately distinguishing the samples of these two categories. As the separation between the two categories diminishes, more samples may be confused in the feature space, and the model cannot generate confident predictions for these samples. Therefore, a suitable pseudo-label threshold is needed to balance the utilization rate between these categories. Otherwise, there may not be enough samples to train the model to distinguish categories that are already hard to differentiate.

In summary, an effective pseudo-labeling method should take into account how the model's classification performance on different categories changes during the training process. Specifically, the confidence threshold $\tau$ should gradually increase as the model's performance parameter $\beta$ increases during training. This ensures enough unlabeled data for the model in the early training iterations, while, in the later iterations, the threshold $\tau$ can filter out incorrect predictions to mitigate the cognitive bias of the model. Moreover, because the model's classification performance differs across categories, with some being easier to classify than others, the pseudo-labeling method needs to adjust the threshold $\tau$ for each class so that thresholds are set equitably across classes. The main contribution of this paper lies in dynamically adjusting class-based local thresholds and dataset-based global thresholds according to the model's training status. Additionally, the combination of feature-correlation consistency regularization and dynamic pseudo-label estimation further enhances the performance of the SSL model. In the following, we provide a detailed description of these two methods used in DYMatch.
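The monotonicity argument can be sanity-checked numerically. The sketch below evaluates the total pseudo-label utilization as a function of the gap between the class means, under the same Gaussian and sigmoid assumptions as the analysis above; the function names and all parameter values are illustrative assumptions.

```python
import math


def phi_cdf(v):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def utilization(gap, sigma1, sigma2, beta, tau):
    """P(Y_p = 1) + P(Y_p = -1) as a function of gap = mu2 - mu1."""
    t = math.log(tau / (1.0 - tau)) / beta
    total = 0.0
    for sigma in (sigma1, sigma2):
        total += 0.5 * (phi_cdf((gap / 2.0 - t) / sigma)
                        + phi_cdf((-gap / 2.0 - t) / sigma))
    return total


# Utilization for increasingly well-separated classes at a fixed threshold.
rates = [utilization(g, 1.0, 2.0, beta=2.0, tau=0.95) for g in (0.5, 1.0, 2.0, 4.0)]

# Utilization for the same classes as the model becomes more confident.
early = utilization(2.0, 1.0, 2.0, beta=1.0, tau=0.95)
late = utilization(2.0, 1.0, 2.0, beta=5.0, tau=0.95)
```

Both comparisons point the same way: harder (less separated) categories and earlier training stages yield fewer pseudo labels under a fixed threshold, which is the motivation for a threshold that adapts per class and over time.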

Dynamic Pseudo-Label Estimation Method Based on Gaussian Mixture Model
During the training process, it is crucial to dynamically assign a corresponding confidence threshold to each class, and the model's prediction scores for the training data can accurately reflect the current training status for the corresponding category. Therefore, in DYMatch, we propose a dynamic pseudo-label estimation method based on the GMM. Firstly, the model's prediction scores during training are used to obtain class-based local thresholds, which dynamically partition unlabeled data into positive and negative samples. The final dynamic local threshold is then updated using the Exponential Moving Average (EMA) method based on the local thresholds at each training iteration. Additionally, the global dynamic threshold is estimated as the EMA of the model's prediction scores, and the dynamic local threshold is also used to adjust the global dynamic threshold.
Therefore, at the beginning of training, the confidence threshold may be relatively small, enabling the use of more unlabeled data with potentially correct predictions for training. As training proceeds, the confidence thresholds are dynamically adjusted and generally tend to increase. This adjustment filters out more unlabeled data with potentially incorrect predictions, reducing the training bias of the model.

Dynamic Local Confidence Threshold Estimation
Dynamic local confidence threshold estimation aims to estimate class-specific local thresholds that account for inter-class diversity. Furthermore, the dynamic local confidence threshold increases steadily during training to ensure that unlabeled data with incorrect predictions are discarded.
Therefore, the Gaussian Mixture Model is employed to distinguish positive from negative samples among the model's prediction scores for a specific category. Here, positive samples correspond to unlabeled data that the model predicts correctly, while negative samples correspond to unlabeled data with incorrect predictions. The prediction score that separates positive and negative samples can be regarded as the local threshold for the corresponding category. Specifically, we assume that the prediction scores $s_c$ for category $c$ are sampled from a Gaussian mixture distribution $p(s_c)$ with two components, positive and negative. Local thresholds are dynamically generated by fitting a Gaussian Mixture Model to the prediction scores.
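$$p(s_c) = \pi_n\, \mathcal{N}(s_c \mid \mu_n, \sigma_n^2) + \pi_p\, \mathcal{N}(s_c \mid \mu_p, \sigma_p^2)$$

where $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian distribution. The parameters $(\pi_n, \mu_n, \sigma_n^2)$ and $(\pi_p, \mu_p, \sigma_p^2)$ represent the weight, mean and variance of the negative and positive sample distributions, respectively. Then, the EM algorithm is employed to infer the posterior probability of the positive component given a prediction score and the fitted parameters. This posterior probability can be used directly to generate the pseudo label and to define the dynamic local threshold $\tau_t(c)$ corresponding to category $c$. By using the EMA method, we obtain the final dynamic local threshold, which reflects the learning status of the model for a specific category:

$$\tilde{\tau}_t(c) = \lambda\, \tilde{\tau}_{t-1}(c) + (1 - \lambda)\, \tau_t(c)$$

where $\lambda \in (0, 1)$ denotes the momentum decay in EMA, and $t$ represents the $t$-th iteration during the training process. Meanwhile, $\tau_t(c) = \mathrm{GMM}(\mathrm{DA}(s_c))$, where $\mathrm{GMM}(\cdot)$ represents the dynamic local threshold estimation method based on the GMM, and $s_c$ denotes the prediction scores of the unlabeled data predicted as class $c$. $\mathrm{DA}$ represents the distribution alignment strategy from ReMixMatch, which balances the distribution of prediction scores. The local thresholds are initialized to $1/C$, where $C$ represents the number of categories. Finally, $\tilde{\tau}_t = [\tilde{\tau}_t(1), \tilde{\tau}_t(2), \cdots, \tilde{\tau}_t(C)]$ contains the confidence thresholds for all categories.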
In the implementation of the dynamic local threshold estimation method, we maintain a queue $Q_c \in \mathbb{R}^{l}$ of prediction scores with dimension $l$ for each category. The queues of all categories form a prediction memory bank $Q = [Q_1, Q_2, \cdots, Q_C]$, $Q \in \mathbb{R}^{C \times l}$, where $C$ represents the number of categories. Specifically, the queue $Q_c$ stores the prediction scores of the unlabeled data predicted by the model as category $c$, and these prediction scores are used to fit a Gaussian mixture model. Subsequently, the dynamic threshold estimation method can adjust the threshold $\tau_t(c)$ corresponding to category $c$ based on the fitted Gaussian mixture model. As a result, $\tau_t(c)$ can align with the model's classification ability at different training stages. Moreover, the EM algorithm has a negligible effect on training time and does not impose an additional burden on training.
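To make the local threshold estimation concrete, the following sketch fits a one-dimensional two-component Gaussian mixture to a queue of prediction scores with a plain EM loop. It is an illustration under stated assumptions, not the paper's implementation: all function names are hypothetical, and the rule used to turn the fitted mixture into a threshold (the lowest score between the two means whose posterior probability of being positive reaches 0.5) is an assumption made for this example. Distribution alignment is omitted.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(scores, iters=100, var_floor=1e-6):
    """EM fit; index 0 is the negative component, index 1 the positive one."""
    lo, hi = min(scores), max(scores)
    weight = [0.5, 0.5]
    mean = [lo, hi]  # start the components at the extreme scores
    var = [max(((hi - lo) / 4.0) ** 2, var_floor)] * 2
    n = len(scores)
    for _ in range(iters):
        # E-step: posterior probability that each score is a positive sample.
        resp = []
        for s in scores:
            p0 = weight[0] * normal_pdf(s, mean[0], var[0])
            p1 = weight[1] * normal_pdf(s, mean[1], var[1])
            resp.append(p1 / (p0 + p1 + 1e-300))
        # M-step: re-estimate weight, mean and variance of both components.
        for k in (0, 1):
            r = resp if k == 1 else [1.0 - v for v in resp]
            nk = max(sum(r), 1e-12)
            weight[k] = nk / n
            mean[k] = sum(ri * s for ri, s in zip(r, scores)) / nk
            spread = sum(ri * (s - mean[k]) ** 2 for ri, s in zip(r, scores)) / nk
            var[k] = max(spread, var_floor)
    if mean[0] > mean[1]:  # keep index 1 as the higher-scoring component
        weight.reverse()
        mean.reverse()
        var.reverse()
    return weight, mean, var


def local_threshold(scores, steps=1000):
    """Assumed rule: first score between the means with posterior >= 0.5."""
    weight, mean, var = fit_two_component_gmm(scores)
    for i in range(steps + 1):
        s = mean[0] + (mean[1] - mean[0]) * i / steps
        p0 = weight[0] * normal_pdf(s, mean[0], var[0])
        p1 = weight[1] * normal_pdf(s, mean[1], var[1])
        if p1 / (p0 + p1 + 1e-300) >= 0.5:
            return s
    return mean[1]


def ema(previous, current, momentum=0.999):
    """Exponential moving average update of a threshold."""
    return momentum * previous + (1.0 - momentum) * current


# Toy queue for one class: a low-confidence and a high-confidence cluster.
queue = [0.35, 0.40, 0.45, 0.38, 0.42, 0.88, 0.90, 0.92, 0.95, 0.91, 0.89]
raw = local_threshold(queue)
tau_c = ema(previous=1.0 / 10, current=raw)  # initial value 1/C with C = 10
```

Because the queue holds only scalar scores, each EM fit touches a small one-dimensional sample, which is consistent with the remark that the fit adds little to training time.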

Dynamic Global Confidence Threshold Estimation
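The global threshold estimation shares its main characteristics with the local one, that is, both reflect the training status of the model and increase steadily during training. However, unlike the local threshold estimation, the global threshold estimation should capture the model's training status for the entire dataset. Therefore, we define the global threshold as the model's average prediction score for the unlabeled data. Specifically, the global threshold is estimated as the EMA of the prediction scores in each training iteration. We initialize the global threshold to $1/C$, where $C$ represents the number of categories. The global threshold $\tau_t$ is defined as:

$$\tau_t = \begin{cases} \dfrac{1}{C}, & t = 0, \\[2mm] \lambda\, \tau_{t-1} + (1 - \lambda)\, \dfrac{1}{N_U} \displaystyle\sum_{b=1}^{N_U} \max(q_b), & \text{otherwise,} \end{cases}$$

where $q_b$ denotes the model's prediction for the $b$-th of the $N_U$ unlabeled samples in the batch and $\lambda \in (0, 1)$ is the EMA momentum decay.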

Dynamic Pseudo-Label Estimation
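Having obtained the class-specific dynamic local thresholds $\tilde{\tau}_t(c)$ and the dataset-specific dynamic global threshold $\tau_t$, we employ the dynamic local thresholds to adjust the dynamic global threshold and obtain the final dynamic confidence threshold:

$$\tau_t^{*}(c) = \mathrm{MaxNorm}\left(\tilde{\tau}_t(c)\right) \cdot \tau_t$$

where $\mathrm{MaxNorm}(\cdot)$ denotes the maximum normalization function. Finally, the loss function for dynamic pseudo-label estimation of unlabeled data at the $t$-th iteration can be defined as:

$$\mathcal{L}_u = \frac{1}{N_U} \sum_{b=1}^{N_U} \mathbb{1}\left(\max(q_b) > \tau_t^{*}\left(\arg\max(q_b)\right)\right) H(\hat{q}_b, Q_b)$$

where $\max(q_b)$ represents the model's prediction score for an unlabeled sample, $\hat{q}_b$ is the one-hot pseudo label obtained from $\arg\max(q_b)$, $H(\cdot)$ is the cross-entropy loss, and $\tau_t^{*}(\arg\max(q_b))$ is the confidence threshold corresponding to the predicted category. $\mathbb{1}(\cdot)$ is a mask function based on the confidence threshold. In addition, we employ $q_b$ and $Q_b$ to represent $p_m(y \mid \omega(u_b); \theta)$ and $p_m(y \mid \Omega(u_b); \theta)$, where $\omega(\cdot)$ and $\Omega(\cdot)$ represent the weak augmentation and the strong augmentation method, respectively.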
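The interaction between the global threshold, the maximum-normalized local thresholds and the selection mask can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, the global update is written as an EMA of the batch-average maximum confidence, and the final per-class threshold is taken as the max-normalized local threshold multiplied by the global threshold.

```python
def update_global_threshold(tau_prev, batch_probs, momentum=0.999):
    """EMA of the batch-average maximum confidence (assumed form)."""
    mean_conf = sum(max(q) for q in batch_probs) / len(batch_probs)
    return momentum * tau_prev + (1.0 - momentum) * mean_conf


def max_norm(values):
    """Divide every entry by the largest entry."""
    largest = max(values)
    return [v / largest for v in values]


def final_thresholds(local, global_tau):
    """Scale the global threshold by the max-normalized local thresholds."""
    return [v * global_tau for v in max_norm(local)]


def select_mask(batch_probs, thresholds):
    """Keep a sample if its top score exceeds the threshold of its class."""
    mask = []
    for q in batch_probs:
        c = max(range(len(q)), key=q.__getitem__)  # predicted class
        mask.append(1 if q[c] > thresholds[c] else 0)
    return mask


# Two classes: class 0 is learned well, class 1 is harder.
thresholds = final_thresholds(local=[0.8, 0.4], global_tau=0.9)
mask = select_mask([[0.85, 0.15], [0.30, 0.70], [0.55, 0.45]], thresholds)
tau_next = update_global_threshold(1.0 / 3, [[0.9, 0.05, 0.05], [0.6, 0.2, 0.2]],
                                   momentum=0.5)
```

In the toy example a prediction of 0.70 for the harder class is accepted while 0.85 for the easier class is rejected, which shows how class-specific scaling lets harder categories contribute pseudo labels.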

Consistency Regularization Method Based on Feature-Correlation
In methods like FixMatch, MixMatch and ReMixMatch, consistency regularization is typically performed on the prediction scores. The strong coupling between consistency regularization and the pseudo-labeling methods, which also rely on the model's prediction scores, may constrain the performance of the model. Therefore, in this paper, we perform consistency regularization on the feature maps fed to the classification module. This strategy focuses on unlabeled data whose prediction scores exceed the confidence threshold. Specifically, for a selected unlabeled sample, we employ the negative cosine similarity to measure the correlation between two differently augmented versions of this sample, namely the strongly augmented and the weakly augmented version.
However, it should be emphasized that the strongly augmented and the weakly augmented versions of the same unlabeled sample differ. Consequently, they cannot be treated as identical for feature extraction. Therefore, a learnable projection module, denoted as $h(\cdot)$, is applied to the strongly augmented feature representation of the unlabeled data to weaken the constraint in the consistency regularization method. The module $h(\cdot)$ consists of two consecutive linear layers, with a ReLU layer following the first linear layer:

$$h(z) = \mathrm{Linear}_2\left(\mathrm{ReLU}\left(\mathrm{Linear}_1(z)\right)\right)$$

Finally, the loss function of the consistency regularization method based on feature correlation can be defined as:

$$\mathcal{L}_f = -\frac{1}{N_M} \sum_{b=1}^{N_U} M_b \cdot \cos\left(h(z_b^{s}),\ z_b^{w}\right)$$

where $\cos(\cdot)$ denotes the cosine similarity function, $z_b^{w} \in \mathbb{R}^{d}$ represents the feature representation of the weakly augmented unlabeled sample before the classification module, and $z_b^{s} \in \mathbb{R}^{d}$ represents the strongly augmented representation. In our consistency regularization method, $z_b^{s}$ is fed to the learnable projection module $h(\cdot)$, where $h: \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$. $M$ represents a mask vector in which all unlabeled samples with prediction scores exceeding the dynamic threshold are marked as 1, while the others are marked as 0. $N_M$ is the number of unlabeled samples marked as 1 in $M$.
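The masked feature-correlation loss can be illustrated with a small framework-free sketch. The projection head follows the description above (two linear layers with a ReLU after the first); the helper names, the tuple layout of the head and the use of bias terms are assumptions made for this example, and gradient handling is outside its scope.

```python
import math


def linear(x, weight, bias):
    """Affine map: weight is a list of rows, bias a list of offsets."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def project(z, head):
    """Two linear layers with a ReLU after the first one."""
    w1, b1, w2, b2 = head
    hidden = [max(0.0, v) for v in linear(z, w1, b1)]
    return linear(hidden, w2, b2)


def cosine(a, b, eps=1e-12):
    """Cosine similarity with a small constant to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + eps)


def feature_consistency_loss(z_weak, z_strong, mask, head):
    """Negative cosine similarity averaged over the samples kept by the mask."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    total = 0.0
    for zw, zs, m in zip(z_weak, z_strong, mask):
        if m:
            total -= cosine(project(zs, head), zw)
    return total / kept


# Identity head on non-negative features, so the projection changes nothing.
identity = [[1.0, 0.0], [0.0, 1.0]]
head = (identity, [0.0, 0.0], identity, [0.0, 0.0])
loss = feature_consistency_loss(
    z_weak=[[1.0, 2.0], [3.0, 1.0]],
    z_strong=[[1.0, 2.0], [0.0, 5.0]],
    mask=[1, 0],
    head=head,
)
```

Only the first sample passes the mask, and its two feature vectors are aligned, so the loss reaches its minimum value of about -1.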

Loss Function of DYMatch
In DYMatch, the final loss function for training the network is:

$$\mathcal{L} = \mathcal{L}_s + \lambda_u \mathcal{L}_u + \lambda_f \mathcal{L}_f$$

where $\mathcal{L}_s$ represents the standard cross-entropy loss used for training with labeled data, and $\mathcal{L}_u$ and $\mathcal{L}_f$ are the losses for dynamic pseudo-label estimation and feature-correlation consistency regularization on unlabeled data, respectively. $\lambda_u$ and $\lambda_f$ are fixed scalar hyper-parameters representing the relative weights of $\mathcal{L}_u$ and $\mathcal{L}_f$, respectively. $\mathcal{L}_s$ is defined as:

$$\mathcal{L}_s = \frac{1}{N_L} \sum_{b=1}^{N_L} H\left(y_b,\ p_m(y \mid x_b; \theta)\right)$$

where $H(\cdot)$ is the cross-entropy loss and $(x_b, y_b)$ is the $b$-th of the $N_L$ labeled samples in a batch. In the loss function for unlabeled data, we use the loss term $\mathcal{L}_f$ to obtain more discriminative feature representations by minimizing the difference between the feature representations of the weakly augmented and strongly augmented unlabeled data. This keeps the feature representations of the two augmented samples consistent during the training process, thereby improving the model's classification performance. Meanwhile, $\mathcal{L}_u$ ensures a balance between the quantity and quality of pseudo labels.
The CIFAR-10 and CIFAR-100 datasets each contain 50,000 training samples and 10,000 testing samples, but they consist of 10 and 100 categories, respectively. The SVHN dataset includes 10 categories, with 73,257 samples used for training and 26,032 samples used for testing. The STL-10 dataset is designed for evaluating SSL methods. It includes 5000 labeled samples and 100,000 unlabeled samples. Compared to other standard SSL datasets, STL-10 has fewer labeled data and compensates by offering a considerable amount of unlabeled data. Additionally, each category contains different proportions of labeled and unlabeled data. These settings make the STL-10 dataset more challenging for SSL training and closer to real-world tasks.
Three domain adaptation datasets are also employed for the performance evaluation: Office31 [44], Office-Home [41] and DomainNet [45]. The Office31 dataset comprises 4110 images distributed across three domains (Amazon, Webcam and Dslr), which share 31 object categories commonly found in offices. The Office-Home dataset comprises around 15,500 images of 65 object categories sampled from four domains, denoted as Artistic, Clip Art, Product and Real-World. The DomainNet dataset, containing 345 categories of common objects from six domains, is the largest domain adaptation dataset.
According to the standard settings used for SSL datasets, we randomly select a specific number of labeled samples to constitute the small labeled part of the training set. The remaining data, with their labels discarded, form the unlabeled part of the training set. In DYMatch, both weak and strong augmentation are applied to the unlabeled data. The weak augmentation involves flip-and-shift transformations, while the strong augmentation combines the RandAugment [25] method with the Cutout [46] method.
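The split described above can be sketched as follows. This is a minimal illustration that assumes the labeled samples are drawn with an equal number per class, which is the usual convention in SSL benchmarks but is not stated explicitly here; the function name and its arguments are hypothetical.

```python
import random
from collections import defaultdict


def split_labeled_unlabeled(labels, labels_per_class, seed=0):
    """Pick `labels_per_class` random indices per class as the labeled set.

    All remaining indices form the unlabeled set; their labels are simply
    never looked at again, which corresponds to discarding them.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    labeled = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) < labels_per_class:
            raise ValueError(f"class {label} has only {len(indices)} samples")
        labeled.extend(rng.sample(indices, labels_per_class))

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled


# Usage: a toy dataset with 2 classes and 10 samples per class.
toy_labels = [0, 1] * 10
labeled_idx, unlabeled_idx = split_labeled_unlabeled(toy_labels, labels_per_class=2)
```

Fixing the seed makes the labeled subset reproducible, which matters when results are averaged over several runs with different subsets.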

Experimental Settings
The baseline models employed for the comprehensive performance comparison were Mean Teacher [27], Pseudo-Label [16], MixMatch [7], ReMixMatch [8], UDA [18], FixMatch [9], FlexMatch [10], DoubleMatch [23], FeatMatch [22], Meta Pseudo Labels (MPL) [19], SimMatch [47], Semi-Clustering [48], SoftMatch [11] and FreeMatch [12]. Mean Teacher and Pseudo-Label employ the consistency regularization method and the pseudo-labeling method, respectively, both of which have become widely used in SSL. MixMatch and UDA both explore the impact of different data augmentation methods on SSL. Building on MixMatch, ReMixMatch suggests that balancing the distribution of prediction scores can significantly enhance the model's classification performance. FixMatch combines the consistency regularization and pseudo-labeling methods, resulting in good classification performance. Taking inspiration from FixMatch, DoubleMatch and FeatMatch refine the consistency regularization method, while FlexMatch enhances the pseudo-labeling method. SoftMatch and FreeMatch take into account the impact of the model's training status on the confidence threshold. MPL introduces an innovative training paradigm. SimMatch and Semi-Clustering integrate methods from other fields, including self-supervised and deep clustering strategies, into FixMatch. The hyper-parameters of all baseline methods are consistent with those in their corresponding papers.
To ensure a fair comparison, we followed the guidelines recommended in [49] and shared the same training pipeline for the same dataset, including the optimizer, the learning rate decay schedule and the backbone module. In our experiments on all benchmark datasets, the model was trained with the SGD optimizer. The momentum and weight decay in SGD were set to 0.9 and 1 × 10⁻³, respectively. The Nesterov method was not used in SGD. The learning rate was initialized at 0.02 and gradually decreased with the cosine annealing scheduler. The Wide ResNet network [50] was employed as the backbone module. In detail, we employed Wide ResNet-28-2 for CIFAR-10, SVHN and the domain adaptation datasets. For the CIFAR-100 and STL-10 datasets, the Wide ResNet-28-8 and Wide ResNet-37-2 networks were employed as the backbone, respectively. Moreover, we report the average and the standard deviation of the results for each model, obtained by training it three times for each number of labeled data on each dataset.
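The learning rate schedule can be illustrated with a short sketch. It assumes the standard half-cosine form that decays from the initial rate of 0.02 to zero over the whole run; the exact variant used in the experiments (for example, a truncated cosine as in FixMatch) is not specified in this section, and the function name and the `min_lr` argument are hypothetical.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=0.02, min_lr=0.0):
    """Learning rate at `step` under half-cosine annealing.

    Starts at `base_lr` for step 0 and reaches `min_lr` at `total_steps`.
    Steps outside [0, total_steps] are clamped to the nearest end.
    """
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Usage: sample the schedule at a few points of a 1000-step run.
schedule = [cosine_annealing_lr(s, 1000) for s in (0, 250, 500, 750, 1000)]
```

The schedule changes slowly at the start and at the end of training and fastest in the middle, which is the usual motivation for preferring it over a linear or stepwise decay.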
DYMatch involves five hyper-parameters: the dimension of the queue in the dynamic local threshold estimation, the momentum decay in the EMA method, the output feature dimension of the linear projection module h(·), and the two relative weight hyper-parameters of the unlabeled loss terms. During training, the EMA momentum decay and the queue dimension were set to 0.999 and 100, respectively. In most cases, both relative weights were set to 2.0. However, when only a minimal amount of labeled data is available (i.e., 40 labeled data for the CIFAR-10, SVHN and STL-10 datasets, and 400 labeled data for CIFAR-100), both weights were adjusted to 3.0, because in these situations more attention should be paid to the constraints on training with the unlabeled data. Moreover, because different backbones were employed for different datasets, the output dimension of h(·) was set to 128 for the CIFAR-10 and SVHN datasets, 512 for CIFAR-100 and 256 for STL-10. The labeled batch size is set to 64. The ratio of unlabeled to labeled data in a batch is set to 7, which means that the unlabeled batch size is 7 times the labeled batch size for all benchmark datasets.
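For reference, the settings listed above can be collected in a small configuration sketch. The class, the field names and the dataset keys are hypothetical and chosen only for illustration; the values are those stated in this subsection. The output dimension of h(·) for the domain adaptation datasets is not given here, so the sketch deliberately rejects dataset names it does not know.

```python
from dataclasses import dataclass

# Output dimension of the projection module per dataset, as stated in the text.
PROJECTION_DIM = {"cifar10": 128, "svhn": 128, "cifar100": 512, "stl10": 256}

# Label counts treated as the "minimal labeled data" case in the text.
FEW_LABEL_SETTINGS = {"cifar10": 40, "svhn": 40, "stl10": 40, "cifar100": 400}


@dataclass(frozen=True)
class DYMatchConfig:
    queue_size: int = 100          # queue dimension for threshold estimation
    ema_momentum: float = 0.999    # momentum decay of the EMA method
    projection_dim: int = 128      # output dimension of h
    weight_pseudo: float = 2.0     # weight of the pseudo-label loss
    weight_feature: float = 2.0    # weight of the feature-correlation loss
    labeled_batch_size: int = 64
    unlabeled_ratio: int = 7       # unlabeled batch = ratio * labeled batch

    @property
    def unlabeled_batch_size(self):
        return self.labeled_batch_size * self.unlabeled_ratio


def config_for(dataset, num_labels):
    """Build the configuration for one benchmark setting."""
    if dataset not in PROJECTION_DIM:
        raise KeyError(f"no projection dimension listed for {dataset!r}")
    weight = 3.0 if FEW_LABEL_SETTINGS[dataset] == num_labels else 2.0
    return DYMatchConfig(
        projection_dim=PROJECTION_DIM[dataset],
        weight_pseudo=weight,
        weight_feature=weight,
    )


# Usage: the 40-label CIFAR-10 setting uses the larger loss weights.
cfg = config_for("cifar10", 40)
```

With a labeled batch size of 64 and a ratio of 7, each training step processes 448 unlabeled samples.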

Experimental Results for Standard SSL Datasets
In this subsection, we compare the classification performance of DYMatch with that of the baseline methods on the standard SSL datasets, as reported in Tables 1-4. Results for Mean Teacher and Pseudo-Label with 40 labels per class are not included due to their poor performance on CIFAR-100. MPL, SoftMatch and FreeMatch obtained the best classification performance for a few specific settings of the number of labeled data. However, DYMatch demonstrates good classification performance, outperforming the other baseline methods for most labeled data settings on each standard SSL dataset.
As shown in Tables 1-3, DYMatch consistently obtains good classification performance on the CIFAR-10, CIFAR-100 and SVHN datasets for most settings of the number of labeled data. In experiments with 40 labeled data per class on CIFAR-10, the error rate of DYMatch was 0.07% higher than that of MPL, but lower than those of the other baseline methods. DYMatch also performed worse than MPL when employing 2500 labeled samples on CIFAR-100. Furthermore, as the quantity of labeled data decreases, DYMatch gradually demonstrates its robust classification performance. When using a very small number of labeled samples, for example, only 4 labels per class, DYMatch exhibited the lowest error rates, which were 1.73% and 0.46% lower than those of MPL on the CIFAR-10 and CIFAR-100 datasets, respectively. The classification accuracy of DYMatch was 0.03% lower than that of SoftMatch when 250 labeled samples were used on SVHN. However, for the other settings of the number of labeled data, DYMatch achieves better performance than SoftMatch. In general, most models obtained better classification performance on the SVHN dataset, resulting in a relatively small performance gap among the strongest models.
As shown in Table 4, DYMatch still achieved very competitive classification performance on the STL-10 dataset. DYMatch obtains the lowest error rate when using 1000 labeled data, and when only 40 labeled data are available, its error rate is only 2.02% higher than that of FreeMatch.
Domain adaptation datasets pose a challenge for training because each category spans multiple domains and significant differences exist among these domains. This intrinsic gap makes it difficult to obtain optimal performance even with fully-supervised training. Consequently, domain adaptation datasets tend to be more complex and challenging in the field of SSL.
Initially, we created the training and testing sets from each of these datasets following the strategy employed in [51]. DYMatch was evaluated on these datasets along with FlexMatch, ReMixMatch, FixMatch, DoubleMatch, SimMatch and SoftMatch, and their classification performance was compared. For a fair comparison, the training pipeline applied to these domain adaptation datasets was the same as that employed for the SSL benchmark datasets. We set the weight of the loss term for the unlabeled data following the settings used in the corresponding original paper. According to Table 5, an accuracy of 56.63% can be obtained through fully-supervised training on the most complex domain adaptation dataset, DomainNet. All semi-supervised learning methods included in the comparison struggle to achieve comparable classification performance, showing distinct performance gaps. With only 500 labeled data for each category on the DomainNet dataset, the accuracy of DYMatch is 0.8% lower than that of SoftMatch. However, in all other settings, DYMatch consistently demonstrates the best classification performance.

Ablation Study
Since DYMatch combines two effective methods, i.e., the dynamic pseudo-label estimation method based on the Gaussian mixture model and the consistency regularization method based on feature correlation, ablation studies were conducted to provide a deeper understanding of the factors contributing to DYMatch's good performance. In this subsection, results are presented only for the experiments involving 40 and 4000 labeled data on CIFAR-10.
As shown in Table 6, the error rates of DYMatch_PL (i.e., DYMatch with only the dynamic pseudo-label estimation method based on the Gaussian mixture model) were lower than those of DYMatch_CR (i.e., DYMatch with only the consistency regularization method based on feature correlation), and the combination of the two methods in DYMatch performed better than either DYMatch_PL or DYMatch_CR. The comparison between DYMatch_CR and DYMatch_PL highlights their differences in utilizing the unlabeled data. In DYMatch_CR, the feature maps after the backbone were used for feature-correlation consistency regularization. However, because the unlabeled data are not filtered, a large amount of data that cannot be accurately identified by the model may greatly hinder the convergence of the model. In contrast, DYMatch_PL employed the dynamic pseudo-label estimation method, which effectively filters the unlabeled data during the training process. Consequently, DYMatch_PL proved to be more effective than DYMatch_CR in utilizing the unlabeled data. Moreover, the superior performance of DYMatch indicates that the combination of these two methods results in a more efficient utilization of unlabeled data.
We also compared the performance of DYMatch when applying the feature-correlation consistency regularization method at different layers, i.e., after the backbone or after the classification module. DYMatch_F denotes DYMatch using the feature representation after the backbone, and DYMatch_G denotes DYMatch using the feature representation after the classification module. Table 7 shows that the error rate of DYMatch_G is significantly higher than that of DYMatch_F. This result demonstrates that performing consistency regularization after the classification module limits the performance of the model because of the coupling with the dynamic pseudo-label estimation method. According to the results shown in Table 7, the dynamic pseudo-label estimation method based on the Gaussian mixture model in DYMatch effectively utilizes unlabeled data. At the same time, the feature-correlation consistency regularization method focuses on the better pseudo labels, which are selected by the dynamic pseudo-label estimation method. Therefore, the consistency regularization method based on feature correlation can be used as an enhancement to the dynamic pseudo-label estimation method. The effective combination of these two methods enables the model to obtain excellent classification performance.
Moreover, we also compared DYMatch with other SSL methods on several metrics of the pseudo-labeling process, such as the variation in the confidence threshold and the quantity and accuracy of the selected pseudo-labels. As shown in Figure 2a,b, the confidence threshold of DYMatch gradually increases during the training process. This allows more unlabeled data to be utilized in the early stages of training, while the selection becomes more cautious in the later stages. This behavior aligns with the analysis in Section 4.1. Correspondingly, as shown in Figure 2c, DYMatch obtains better pseudo-label accuracy and classification performance than FixMatch, AdaMatch and FlexMatch.

Results for Different Optimizers and Learning Rate Decay Methods
The classification performance of the model is influenced by the choice of optimizer and its hyper-parameters. Table 8 reveals that SGD with a momentum of 0.9 produced the best results, and the Adam optimizer yielded the worst results. Additionally, the Nesterov method did not lead to a noticeable enhancement in performance. Furthermore, despite experimenting with different initial learning rates, we were not able to obtain a significant improvement in classification performance from this setting. In DYMatch, we employ a learning rate decay method based on the cosine annealing strategy. To verify its effect on the performance of the model, we compared this method with two other options, namely the Exp-warmup method and training without learning rate decay. The results in Table 9 show that the cosine annealing scheduler achieved the best classification performance.

Conclusions and Future Work
Although semi-supervised learning methods have made rapid progress in recent years, most of the more advanced methods are based on increasingly complicated learning algorithms. These methods often introduce complex data perturbation methods or integrate other complex algorithms into the SSL framework. In this paper, we proposed DYMatch, a new SSL algorithm that obtained better performance on a variety of standard SSL benchmark datasets. The main contribution of this paper lies in the introduction of a dynamic pseudo-label estimation method based on Gaussian mixture models. This strategy effectively improves the utilization of unlabeled data by dynamically estimating class-based confidence thresholds, thereby enhancing the model's generalization performance. Moreover, the combination of this strategy with a feature-correlation consistency regularization method also significantly enhances the efficiency of utilizing unlabeled data. The good experimental results illustrate the effectiveness of DYMatch on SSL classification tasks. We hope that this direction can serve as inspiration for further research, such as how to design better feature regularization methods that can more effectively utilize multi-domain datasets. If you are interested in our work and want to apply our methods to other fields, please feel free to contact the author.

Figure 1. A diagram of the training process of unlabeled data in DYMatch. Firstly, strong and weak augmentations are used to generate two different versions of the unlabeled data. The feature maps of the strongly augmented data after the projection module h(·) and the feature maps of the weakly augmented data are used to compute the loss of the feature-correlation consistency regularization method. The pseudo labels are selected by the dynamic pseudo-label estimation method based on the Gaussian mixture model. Please see more details in the method section.

Figure 2. Variations in the thresholds, the number of pseudo labels, and the accuracy of pseudo labels for DYMatch and other SSL methods at each iteration on CIFAR-10 with 40 labels. (a) Class-average thresholds. (b) Class-average number of pseudo labels. (c) The accuracy of pseudo labels.

Table 1. The Error Rates for CIFAR-10 with Five Different Numbers of Labelled Images. The best result is in bold.

Table 2. The Error Rates for SVHN with Five Different Numbers of Labelled Images. The best result is in bold.

Table 3. The Error Rates for CIFAR-100 with Four Different Numbers of Labelled Images. The best result is in bold.

Table 4. The Error Rates for STL-10 with Two Different Numbers of Labelled Images. The best result is in bold.

Table 5. The Error Rates for Other Object Recognition Datasets with Different Numbers of Labelled Images (the numbers of data for each category contained in each domain adaptation dataset are given). The best result is in bold.

Table 6. The Results of the Ablation Study on the Influence of the Dynamic Pseudo-Label Estimation Method and the Feature-Correlation-Based Consistency Regularization Method on DYMatch (PL represents the dynamic pseudo-label estimation method based on the Gaussian mixture model, and CR represents the consistency regularization method based on feature correlation). The best result is in bold.

Table 7. The Results of the Ablation Study for Features from Different Modules. The best result is in bold.

Table 8. The Results of the Ablation Study on Different Optimizers with Different Learning Rates (lr) and Momenta (mom). The best result is in bold.

Table 9. The Results for Different Learning Rate Decay Schedulers. The best result is in bold.