Graph-Based Self-Training for Semi-Supervised Deep Similarity Learning

Semi-supervised learning is a learning paradigm that can exploit both labeled and unlabeled data to train deep neural networks. Among semi-supervised learning methods, self-training-based approaches do not depend on a data augmentation strategy and have better generalization ability. However, their performance is limited by the accuracy of the predicted pseudo-labels. In this paper, we propose to reduce the noise in the pseudo-labels from two aspects: the accuracy of the predictions and the confidence of the predictions. For the first aspect, we propose a similarity graph structure learning (SGSL) model that considers the correlation between unlabeled and labeled samples, which facilitates the learning of more discriminative features and, thus, yields more accurate predictions. For the second aspect, we propose an uncertainty-based graph convolutional network (UGCN), which aggregates similar features based on the learned graph structure in the training phase, making the features more discriminative. It can also output the uncertainty of predictions in the pseudo-label generation phase, generating pseudo-labels only for unlabeled samples with low uncertainty and thereby reducing the noise in the pseudo-labels. Furthermore, a positive and negative self-training framework is proposed, which combines the proposed SGSL model and UGCN into the self-training framework for end-to-end training. In addition, to introduce more supervised signals into the self-training process, negative pseudo-labels are generated for unlabeled samples with low prediction confidence, and the positive and negative pseudo-labeled samples are then trained together with a small number of labeled samples to improve the performance of semi-supervised learning. The code is available upon request.


Introduction
Semi-supervised learning is a training scheme that uses a small amount of labeled data and a large amount of unlabeled data. Current semi-supervised learning methods are mainly categorized into consistency regularization methods [1,2] and pseudo-labeling methods [3,4]. Consistency regularization methods aim to keep the outputs of the model constant under perturbations. For example, Sajjadi et al. [5] proposed the π model, which applies two separate data augmentations to each input, predicts the two augmented versions with a deep network, and then minimizes the distance between the two predictions with a consistency loss. However, consistency regularization methods mostly rely on data augmentation strategies, so their generalization ability is limited.
In contrast, pseudo-labeling methods are independent of data augmentation. They aim to generate pseudo-labels for unlabeled data and then train the network along with a small amount of labeled data. Among pseudo-labeling methods, self-training methods [6,7] are the most widely studied, and such methods have three steps. First, the network is trained on the labeled data in a supervised manner; second, the trained network predicts the unlabeled data, and confident predictions are converted into pseudo-labels; third, the network is retrained on the pseudo-labeled data together with the labeled data, and the process is repeated.

The contributions of this paper are as follows: (1) An SGSL model is proposed to consider the potential correlation between labeled data and unlabeled data. It calculates the similarity between unlabeled and labeled sample features in a batch to initialize their correlation. Moreover, end-to-end training optimizes this correlation, which helps the network learn more discriminative features and, thus, makes the prediction confidence more accurate and credible.
(2) In order to improve the accuracy and reliability of pseudo-labels, the UGCN is proposed. It uses a graph convolutional network to aggregate features based on the learned graph structures, so that unlabeled sample features move close to similar labeled sample features. When such features are passed through the network, their predictions become consistent, which improves the prediction accuracy for unlabeled samples. In addition, we use dropout to obtain the uncertainty of predictions. If the uncertainty of a prediction is high, its confidence is not credible, and no pseudo-label is generated for the corresponding sample, which improves the reliability of pseudo-labels.
(3) A positive and negative self-training framework based on graph-based deep uncertainty is proposed, which fuses the proposed SGSL and UGCN in the self-training framework. It can make features more discriminative in data space and improve the accuracy of pseudo-labels when the framework is trained end-to-end.

Related Work
Semi-supervised learning methods can be broadly divided into two categories: consistency regularization methods [1,2] and pseudo-labeling methods [3,4]. There are three kinds of perturbations in consistency regularization methods, i.e., perturbations to inputs [1,9], perturbations to the network [10], and perturbations to the training process [2,11]. Applying perturbations to inputs is the most common strategy. For example, Guyon et al. [12] propose the mean teacher model, which consists of two parts: a student model and a teacher model. Images are augmented twice and then fed into the student model and the teacher model to predict the corresponding label distributions, after which a consistency loss is applied to both predictions. Ke et al. [13] propose the dual student method, which replaces the teacher model in the mean teacher method. For perturbations to the network, Zhang et al. [10] propose the worst-case perturbation method, in which additive and DropConnect perturbations are applied to the network. Methods that perturb the inputs are widely studied. However, these methods rely on data augmentation strategies, and their performance is limited when consistency regularization is applied in areas where the effectiveness of data augmentation is low (e.g., video, medical images).
Pseudo-labeling methods generate pseudo-labels for unlabeled data and then train the network. Pseudo-labeling methods can be divided into two categories, i.e., multi-view training methods [3,4,14,15] and self-training methods [6,7].
Multi-view training methods focus on training two or more different networks that provide pseudo-labels to each other. For instance, the co-training method [3] contains two networks that take images from two views as inputs. If one of the networks has higher confidence, pseudo-labels are generated for the inputs and serve as the training set for the other network in the next iteration. Chen et al. [14] propose a method with three networks: if the predictions of two of the models are consistent, pseudo-labels are generated and then used as training data for the third model in the next iteration. Multi-view training methods inevitably involve multiple networks, so the volume of network parameters to be trained increases, making them difficult to apply in scenarios with limited resources.
In contrast, self-training methods use a single network to predict and generate pseudo-labels. For example, Lee et al. [6] propose a pseudo-labeling method in which the network is first trained in a supervised manner on a small amount of labeled data; the trained model then predicts the unlabeled data, and the filtered predictions are turned into pseudo-labels and added to the training set to train the network iteratively.
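The iterative loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `train` and `predict` are hypothetical stand-ins for supervised training and model inference, and `tau` is a confidence threshold for filtering predictions.

```python
def self_train(labeled, unlabeled, train, predict, num_iter=20, tau=0.8):
    """Generic self-training loop: train on labeled data, pseudo-label
    confident unlabeled samples, and retrain on the enlarged set."""
    model = train(labeled)                      # step 1: supervised training
    for _ in range(num_iter):
        pseudo = []
        for x in unlabeled:
            label, conf = predict(model, x)     # step 2: predict unlabeled data
            if conf >= tau:                     # keep only confident predictions
                pseudo.append((x, label))
        model = train(labeled + pseudo)         # step 3: retrain with pseudo-labels
    return model
```

The quality of the filter in step 2 is exactly what limits such methods: every wrongly labeled sample that passes the threshold becomes noise in the next round of training.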
Xie et al. [16] propose a noisy student model, which consists of a teacher model and a student model. The teacher model is firstly trained on a small amount of labeled data, and then the teacher model is used to predict the unlabeled data and generate pseudo-labels. The pseudo-labeled data and the labeled data are then combined and trained with the student model, which becomes a new teacher model after training and is trained again iteratively by re-predicting the unlabeled data. Self-training methods do not rely on data augmentation strategies and their network parameters are greatly reduced compared to multi-view training methods. The main drawback of these methods is that the generated pseudo-labels are not always accurate. To reduce the noise in pseudo-labels, Rizve et al. [8] propose using uncertainty to determine whether predictions are reliable. Moreover, the higher the uncertainty, the less reliable the predictions. In addition, Rizve et al. [8] argue that the predictions with low confidence can also be used to generate pseudo-labels to perform negative learning. However, this method does not take into account the correlation between labeled data and unlabeled data during network training.
There are also many graph-based semi-supervised learning methods [17,18], in which all data are represented as nodes in a graph, and labels of unlabeled data are obtained by label propagation. These methods generally carry out research in terms of both graph construction [19] and label inference [20]. Unlike them, graphs are used in our approach to model the deep similarity between samples, which can be used for graph convolution to optimize the feature distribution and, thus, improve the quality of the generated pseudo-labels.

Overview
To learn the correlation between labeled and unlabeled data, we propose a positive and negative self-training framework based on graph-based deep uncertainty, as shown in Figure 1.
Given an image training set I = {I_1, I_2, ..., I_O}, where O represents the number of images in I. In the semi-supervised setting, the training set is divided into two sets, i.e., labeled images I_L = {I_1, I_2, ..., I_L} and unlabeled images I_U = {I_1, I_2, ..., I_U}, where L is the number of labeled images and U is the number of unlabeled images, with L ≪ U and I = I_L ∪ I_U. The proposed positive and negative self-training framework based on graph-based deep uncertainty has three stages, which can be described as follows.
In the first stage, I_L is passed through ResNet-50 in batches to obtain the batch features X_b^L ∈ R^{b×dim}, where b denotes the number of images in a batch and dim is the feature dimension. The batch features are used directly to generate predictions. After that, the predictions logits ∈ R^{b×M} and the labels are fed into the classification loss, where M is the number of classes. The batch features X_b^L are also fed into the proposed SGSL model, which outputs the correlation A_b^L ∈ R^{b×b} between samples in the current batch. Then A_b^L and the true correlation between samples in the current batch, denoted A_tar, are fed into a binary classification loss. Moreover, A_b^L and X_b^L are fed into the proposed dropout-based GCN, which outputs predictions logits_gnn ∈ R^{b×M}. Then logits_gnn and the labels are fed into the classification loss.
In summary, the losses in this stage consist of three items: (a) the loss between the similarity graph generated by SGSL and the true relationship graph between samples, which supervises the training of ResNet-50 and SGSL; (b) the loss between the predictions of UGCN and the ground-truth labels of samples, which supervises the training of ResNet-50, SGSL, and UGCN; (c) the loss generated directly from the classification of the batch features, which supervises the training of ResNet-50.

Figure 1. The proposed framework is divided into three stages. In the first stage, a small number of labeled samples are fed into the network and trained using a supervised learning approach. In the second stage, the parameters are fixed and the network is set to test mode. Unlabeled samples are input to predict and generate pseudo-labels via uncertainty filtering. The original unlabeled data are then updated by adding positive and negative pseudo-labels. In the third stage, the data with pseudo-labels are trained together with a small amount of labeled data.
In the second stage, the trained network is used to extract features of I_U in batches, i.e., X_b^U ∈ R^{b×dim}. Then X_b^U is input to the SGSL model to obtain the correlation A_b^U ∈ R^{b×b} of the features in that batch. After that, X_b^U and A_b^U are input to the proposed UGCN to generate positive and negative pseudo-labels for the unlabeled data. In this stage, the weights of the model are fixed.
In the third stage, the network is trained on the pseudo-labeled samples obtained in the second stage together with the original labeled samples. Positive and negative self-training is performed in this stage. The training process of positive learning is the same as in the first stage. For negative learning, I_U is fed into ResNet-50, the predictions logits_neg ∈ R^{b×M} are output, and logits_neg and the negative pseudo-labels are fed into a negative cross-entropy loss. More specifically, after obtaining pseudo-labels for the unlabeled data, where positive pseudo-labels represent the categories to which the samples belong and negative pseudo-labels indicate categories to which the samples do not belong, both positive and negative labels are used as inputs to the cross-entropy loss function to supervise the model in learning discriminative features. The difference is that for positive pseudo-labels the model is pushed toward the category the sample belongs to, while for negative pseudo-labels the model is pushed away from the category the sample does not belong to. The original ground-truth labels are used in the same way as the positive pseudo-labels.
In the self-training process, the second and third stages are iterated until the number of iterations reaches the preset number NUM_iter.

The Similarity Graph Structural Learning Model
In order to take into account the correlation between labeled and unlabeled samples in semi-supervised learning, so that unlabeled sample features can move close to their corresponding labeled sample features and the predictions of unlabeled samples become more credible, we propose the SGSL model to learn the correlation between labeled and unlabeled samples, as shown in Figure 2. Given batch features X_b ∈ R^{b×dim}, the purpose of the SGSL model is to learn the similarity graph structure Â ∈ R^{b×b}. First, a dimension is added to the batch features X_b, giving X'_b ∈ R^{1×b×dim}. Then, the first and second dimensions of X'_b are swapped to obtain X''_b ∈ R^{b×1×dim}. Next, X'_b and X''_b are subtracted (with broadcasting) to obtain the initialized representation A_fea of the similarity graph structure, i.e., A_fea = X''_b − X'_b with A_fea ∈ R^{b×b×dim}. The entry in the i-th row and j-th column of A_fea is the dim-dimensional correlation representation of the i-th and j-th samples in the batch. Then, A_fea is fed into the proposed SGSL model, which consists of convolutional layers, batch normalization, and activation functions. Each convolutional layer has a kernel size of 1 × 1 and a stride of 1 × 1. The first convolutional layer maps dimension dim to dim_out^conv1, the second maps dim_out^conv1 to dim_out^conv2, and the third maps dim_out^conv2 to 1, because the similarity graph structure of the batch samples needs to be obtained. After a sigmoid function, the structure A_sim ∈ R^{b×b} between the batch samples is obtained, with all values in A_sim between 0 and 1. Then, the normalized graph structure is computed as Â = D^{-1/2}(A_sim + J)D^{-1/2}, where D is the diagonal degree matrix of A_sim + J and J represents the identity matrix.
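The pairwise-difference construction can be sketched in pure Python. Here `mlp` is a hypothetical stand-in for the learned stack of 1 × 1 convolutions (a 1 × 1 convolution over a b × b × dim tensor is equivalent to applying the same small network to each pairwise difference vector independently):

```python
import math

def sgsl_similarity(X, mlp):
    """Sketch of the SGSL similarity graph.

    X   : list of b feature vectors (each a list of dim floats).
    mlp : maps a dim-dimensional difference vector to a scalar score;
          stands in for the 1x1 conv + BN + activation stack.
    Returns the b x b matrix A_sim with entries in (0, 1).
    """
    b = len(X)
    A_sim = [[0.0] * b for _ in range(b)]
    for i in range(b):
        for j in range(b):
            diff = [xi - xj for xi, xj in zip(X[i], X[j])]   # A_fea[i][j]
            A_sim[i][j] = 1.0 / (1.0 + math.exp(-mlp(diff)))  # sigmoid
    return A_sim
```

With a learned `mlp`, end-to-end training shapes these scores so that same-class pairs score near 1 and different-class pairs near 0.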
During the training process, the target graph structure A_tar of the current batch of samples is obtained from their true labels or pseudo-labels according to the following rule:

A_tar(i, j) = 1, if x_i and x_j have the same label or pseudo-label; 0, otherwise.

Then, A_tar and A_sim are input to a binary cross-entropy loss, i.e.,

L_bce = −(1/b²) Σ_{i,j} [A_tar(i, j) log A_sim(i, j) + (1 − A_tar(i, j)) log(1 − A_sim(i, j))].

Moreover, the data input to SGSL to model similarity differ across the three stages. In stage 1, SGSL is in training mode and all input data are labeled data with real labels; in stage 2, the weights of SGSL are fixed and the similarity between the input data (both labeled and unlabeled) is evaluated; in stage 3, SGSL is in training mode, the input data consist of labeled data and unlabeled data with positive pseudo-labels, and the labels consist of real labels and positive pseudo-labels.
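As a sketch, the target graph and the binary cross-entropy over graph entries can be written as follows (the exact reduction over entries is assumed to be a mean; the clamping constant `eps` is a standard numerical safeguard, not from the paper):

```python
import math

def target_graph(labels):
    """A_tar[i][j] = 1 if samples i and j share a (pseudo-)label, else 0."""
    b = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(b)]
            for i in range(b)]

def graph_bce(A_sim, A_tar, eps=1e-7):
    """Mean binary cross-entropy between predicted and target graphs."""
    b = len(A_tar)
    total = 0.0
    for i in range(b):
        for j in range(b):
            p = min(max(A_sim[i][j], eps), 1.0 - eps)  # clamp for log stability
            total += -(A_tar[i][j] * math.log(p)
                       + (1.0 - A_tar[i][j]) * math.log(1.0 - p))
    return total / (b * b)
```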

Uncertainty-Based Graph Convolutional Network
In order to make the features of unlabeled data close to those of the corresponding labeled data, so that similar features are consistent in prediction, and to use uncertainty to determine whether the prediction confidence is reliable, the UGCN is proposed, as shown in Figure 3.

Figure 3. After the GCN layers, F^(3) and X_b are concatenated and input to the classifier. "Conv" represents the convolutional layer, "BN" is batch normalization, and "DP" denotes dropout.
Given the batch features X_b and the output Â of the SGSL model, the UGCN first uses graph convolution to aggregate features based on the similarity graph structure Â, i.e.,

F^(l+1) = σ(Â · F^(l) · θ_gcn^(l)),

where F^(l) is the input of the l-th GCN layer and F^(1) = X_b, · denotes the inner product, θ_gcn^(l) ∈ R^{dim_out^(l−1)×dim_out^(l)} is the learnable parameter of the l-th GCN layer, and σ is the activation function. After the GCN layers, the aggregated features F^(3) ∈ R^{b×dim_out^(2)} are obtained. Then, F^(3) and X_b are concatenated, i.e.,

X_b^agg = concat(F^(3), X_b),

where concat represents concatenation along the feature dimension and X_b^agg ∈ R^{b×(dim+dim_out^(2))}. Then X_b^agg is input to a convolutional layer. After batch normalization, an activation function, and dropout, another convolutional layer and batch normalization are applied to obtain the predictions Ŷ_b ∈ R^{b×M} of X_b^agg. The output dimension of the second convolutional layer is M.
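A single propagation step, together with the normalized adjacency Â = D^{-1/2}(A_sim + J)D^{-1/2}, can be sketched in pure Python as follows (this assumes the standard GCN normalization with self-loops; a real implementation would use tensor operations):

```python
import math

def normalize_adj(A_sim):
    """A_hat = D^{-1/2} (A_sim + J) D^{-1/2}, with J the identity matrix."""
    b = len(A_sim)
    A = [[A_sim[i][j] + (1.0 if i == j else 0.0) for j in range(b)]
         for i in range(b)]
    d = [sum(row) for row in A]  # degree of each node
    return [[A[i][j] / math.sqrt(d[i] * d[j]) for j in range(b)]
            for i in range(b)]

def gcn_layer(A_hat, F, theta, act=lambda v: max(v, 0.0)):
    """One propagation step: F' = act(A_hat . F . theta)."""
    b, din, dout = len(F), len(F[0]), len(theta[0])
    AF = [[sum(A_hat[i][k] * F[k][j] for k in range(b)) for j in range(din)]
          for i in range(b)]                     # aggregate neighbor features
    return [[act(sum(AF[i][k] * theta[k][j] for k in range(din)))
             for j in range(dout)] for i in range(b)]  # linear map + activation
```

Each application of `gcn_layer` pulls a sample's features toward those of its high-similarity neighbors, which is what drives unlabeled features toward similar labeled ones.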
The above is the training process in the first and third stages. In the second stage, the UGCN outputs the uncertainty of predictions for generating pseudo-labels. The uncertainty is obtained via dropout. Specifically, in the second stage the model is in test mode, but the dropout layers remain in training mode. Therefore, the predictions differ when the same samples are input twice, and the standard deviation can be used to measure whether the predictions are credible. The proposed method feeds each batch of samples into the network T times to obtain T predictions. A sigmoid function is used to restrict the values to between 0 and 1, after which the average of the T predictions is calculated, i.e.,

Ŷ_b = (1/T) Σ_{t=1}^{T} sigmoid(Ŷ_b^(t)),

where Ŷ_b^(t) represents the output of the t-th pass, T denotes the number of times the data are repeatedly fed into the network (T = 10 in the proposed method), and Ŷ_b ∈ R^{b×M} denotes the predictions in the second stage. The maximum value in each row of Ŷ_b is then taken as P̂_b ∈ R^{b×1}, the confidence that each sample belongs to its predicted class. For the uncertainty, the standard deviation is calculated as

U_b = std(Ŷ),

where Ŷ ∈ R^{T×b×M} stacks the T outputs of the same batch of samples, std computes the standard deviation across the first dimension, and U_b ∈ R^{b×M}. Next, the standard deviation Û_b ∈ R^{b×1} corresponding to the maximum predicted value in P̂_b is obtained. Finally, the prediction confidence P̂_b of a batch of samples and its corresponding uncertainty Û_b are obtained.
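The Monte-Carlo-dropout procedure for a single sample can be sketched as follows; `forward` is a hypothetical stochastic model whose randomness stands in for active dropout layers:

```python
import math
import random
import statistics

def mc_dropout_stats(forward, x, T=10, seed=0):
    """Run T stochastic forward passes (dropout active), squash each with a
    sigmoid, and return (predicted class, mean confidence, std deviation)."""
    rng = random.Random(seed)
    runs = []
    for _ in range(T):
        logits = forward(x, rng)  # dropout makes each pass stochastic
        runs.append([1.0 / (1.0 + math.exp(-z)) for z in logits])
    M = len(runs[0])
    mean = [sum(r[m] for r in runs) / T for m in range(M)]       # Y_b
    std = [statistics.pstdev([r[m] for r in runs]) for m in range(M)]  # U_b
    k = max(range(M), key=lambda m: mean[m])                     # argmax class
    return k, mean[k], std[k]                                    # k, P_b, U_b
```

A small standard deviation across the T passes indicates that the network's confidence is stable under dropout noise, which is exactly the condition used to trust a prediction.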
In summary, the role and training of UGCN in the three stages are as follows: (a) In the first stage, UGCN is set to training mode, the sample features extracted by ResNet-50 are aggregated over their neighborhoods according to the similarity graph built by SGSL, and the predicted categories of the samples are output after graph convolution. Because the inputs are labeled data, the ground-truth labels supervise the training of UGCN. (b) In the second stage, UGCN is set to eval mode. The inputs to the network are unlabeled data, and UGCN predicts these samples to obtain their pseudo-labels. Moreover, UGCN generates a confidence for the prediction of each sample to assist pseudo-label generation. In this process, the weights of UGCN are fixed. (c) In the third stage, UGCN is set to training mode. The network input consists of labeled data and unlabeled data with pseudo-labels; UGCN performs graph convolution over the similarity graph of these data to output predictions, and the ground-truth labels and pseudo-labels generate losses that supervise its training.

Pseudo-Label Generation Based on Uncertainty
We utilize an uncertainty-based pseudo-label generation method. Given the prediction confidence P̂_b and the corresponding uncertainty Û_b, the i-th sample in the batch receives a positive pseudo-label only if the following condition is satisfied:

P̂_b(i) ≥ τ_p and Û_b(i) < κ_p,

where P̂_b(i) is the prediction confidence of the i-th sample in the batch and Û_b(i) is the corresponding uncertainty, and τ_p and κ_p are predefined thresholds used to filter the prediction confidence and the uncertainty, respectively. If the prediction confidence of sample x_i is greater than or equal to τ_p and its uncertainty is less than κ_p, then the prediction confidence is considered reliable and a positive pseudo-label is generated. Such a strategy leaves many unlabeled samples unlabeled; however, although these samples do not obtain positive pseudo-labels, they can obtain negative pseudo-labels, i.e., categories to which they explicitly do not belong. The specific rule is:

P̂_b(j) < τ_n and Û_b(j) < κ_n,

where τ_n and κ_n are predefined thresholds used to filter the prediction confidence and uncertainty for negative pseudo-labels. If sample x_j fails to be assigned a positive pseudo-label, has a prediction confidence less than τ_n, and has an uncertainty less than κ_n, then x_j is considered not to belong to the class corresponding to that prediction confidence. After this process, the generated positive and negative pseudo-labels are used to update the original unlabeled data, and in the third stage the positive and negative pseudo-labeled data are used to train the network together with the original labeled data.

Datasets and Settings
Our approach is suitable for tasks that are sensitive to inter-sample connections, such as clustering and retrieval tasks. The proposed method is evaluated on image clustering and person re-identification (re-ID) tasks. In these two tasks, the data can be naturally modeled as graph structures, which allows learning the similarity between samples. Since the inputs are image data, the general and powerful CNN model ResNet-50 [21] is used as the feature extractor. Our semi-supervised approach improves the performance of the model by increasing the accuracy of pseudo-labels. To evaluate the proposed method, we adopt the metrics used in previous works.
For image clustering tasks, IJB-B [22] and IJB-C [23] datasets are utilized. In the IJB-B dataset, there are seven subsets for clustering. In this paper, the top 3 subsets with the most images are selected for clustering, i.e., the subsets including 512, 1024, and 1845 identities. Moreover, in these subsets, there are 18,251, 36,575, and 68,195 images, respectively. The IJB-C dataset is an upgraded version of the IJB-B dataset, which has 4 subsets with 32, 1021, 1839, and 3531 identities, respectively. The top 3 subsets with the largest image numbers are also selected for clustering, and these subsets include 41,074, 71,392, and 140,623 images, respectively. The widely used normalized mutual information (NMI) is our evaluation metric for image clustering. In semi-supervised settings, only one-third of images of each subset are labeled, the rest of the labels are not involved in semi-supervised training.
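NMI can be computed from two label assignments alone. A minimal pure-Python version using the arithmetic-mean normalization is shown below; note that other normalizations (geometric mean, max) exist, and which one a given paper uses is not always stated:

```python
import math
from collections import Counter

def nmi(pred, true):
    """Normalized mutual information between a predicted clustering and
    ground-truth labels, normalized by the mean of the two entropies."""
    n = len(pred)
    p_c, t_c = Counter(pred), Counter(true)
    joint = Counter(zip(pred, true))
    # mutual information from the joint and marginal counts
    mi = sum(c / n * math.log(c * n / (p_c[a] * t_c[b]))
             for (a, b), c in joint.items())
    entropy = lambda cnt: -sum(c / n * math.log(c / n) for c in cnt.values())
    hp, ht = entropy(p_c), entropy(t_c)
    return 2 * mi / (hp + ht) if hp + ht > 0 else 1.0
```

NMI is 1 for clusterings identical up to a relabeling and 0 for statistically independent ones, which is why it is a standard metric when cluster indices carry no fixed meaning.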

Implementation Details
The proposed method was implemented using the PyTorch deep learning framework, including torch 1.10.0, cudnn 8.2.0, and CUDA 11.3. The Python version used was 3.8.5. The server hardware consisted of an NVIDIA GeForce RTX 3090 and an Intel(R) Core(TM) i9-10900K CPU @ 3.70 GHz. The operating system used was Ubuntu 20.04.3 LTS.
The original images were all resized to 256 × 128 and randomly horizontally flipped for data augmentation. The stochastic gradient descent (SGD) algorithm was utilized to optimize the proposed model, with an initial learning rate of 0.03 and a momentum of 0.9. Here, NUM_iter = 20, and in each iteration the proposed model was trained for 60 epochs. In addition, τ_p = 0.8, τ_n = 0.05, κ_p = 0.05, and κ_n = 0.005.

Ablation Study
To explore the impact of the proposed SGSL model and UGCN, ablation experiments were conducted on the Market-1501 dataset, as shown in Table 1. In Table 1, "w/o UGCN" indicates that the SGSL model and UGCN are removed from the proposed method, "w/o Uncertainty" indicates that uncertainty is not utilized in generating pseudo-labels, and "Proposed" indicates the proposed method.
As shown in Table 1, compared to variant 1, variant 3 improves mAP by 3.8%, Rank-1 by 2.8%, Rank-5 by 1.5%, Rank-10 by 1.2%, and Rank-20 by 0.8%. The difference between variant 3 and variant 1 is that variant 3 utilizes the proposed SGSL model and UGCN, and the experimental results improve because the SGSL model considers the correlation between unlabeled and labeled samples. This correlation is then input to the graph convolutional network. With the feature aggregation capability of UGCN, the features of the unlabeled samples gradually approach those of similar labeled samples, which drives the unlabeled samples to obtain more reliable classification predictions.
In addition, variant 3 improved mAP by 4.2%, Rank-1 by 3.2%, Rank-5 by 2.0%, Rank-10 by 1.4%, and Rank-20 by 0.9% compared to variant 2. The main difference between the two sets of experiments is that in variant 3, the proposed method utilizes uncertainty to assist in generating pseudo-labels for unlabeled samples. The main reason for the improved results is that the pseudo-label generation for unlabeled samples in variant 2 relies entirely on predictions of the network. However, if there are incorrect predictions, the generated pseudo-labels are more likely to be noisy and lead the network to be trained in the wrong direction. In variant 3, the same batch of samples is repeatedly fed into the network 10 times and the standard deviation of the prediction results is calculated. This standard deviation is used as the uncertainty of predictions. Then, pseudo-labels are generated by filtering the predictions with low uncertainty, which effectively reduces the noise in pseudo-labels and leads to an improvement in the network's performance. Therefore, the performance of variant 3 is better than that of variant 2.

Parameters Analysis
In the following experiments, the influence of threshold τ p and GCN layer l on the performance is explored.
To explore the impact of the threshold value τ_p, we varied it from 0.4 to 0.9 in increments of 0.1, with the number of GCN layers set to 2. The experimental results are presented in Figure 4, and specific numerical results are provided in Table 2. Figure 4 shows that the performance of the model is relatively stable on mAP, Rank-1, Rank-5, Rank-10, and Rank-20 as τ_p varies, and most of the evaluation metrics achieve their best results when τ_p is set to 0.8. These results show that τ_p has little influence on the proposed method. A possible reason is that after several iterations of the self-training process, the features learned by the model tend to be discriminative and stable, at which point the pseudo-labels predicted by UGCN tend to be correct and have high confidence. Therefore, adjusting the confidence threshold does not affect the model's selection of true positive samples.

In the experiments exploring the effect of the number of graph convolution layers l, we set l to 2, 3, and 4, with τ_p set to 0.8. The experimental results are shown in Figure 5, and the specific numerical results are shown in Table 3. It can be observed from Figure 5 that the model's performance remains relatively stable as the number of graph convolution layers changes, with most of the tested metrics reaching their best values when the model has two graph convolution layers. These results show that the model is not very sensitive to the number of graph convolution layers. This is likely because increasing the depth of the graph convolution introduces additional parameters, while in semi-supervised training most of the data are unlabeled; deepening the network does not effectively increase the knowledge the model gains from the data, so changing the number of graph convolution layers has little effect on performance and may even degrade it.

Runtime Analysis
The running time of the model in each stage of the proposed method is shown in Table 4. The results in the table are measured with a batch size of 64. From Table 4, it can be seen that stage 2, which generates pseudo-labels and performs uncertainty filtering, takes the longest time in training. This is probably because it has to traverse and filter the confidence of samples to obtain positive and negative pseudo-labels. In the testing phase, the model is able to process about 1800 images per second, thus providing a certain level of real-time performance.

The proposed method is compared to classical clustering methods. For a fair comparison, the features extracted by the proposed method are used for the other clustering methods. The experimental results are shown in Figures 6 and 7, and the specific numerical results are shown in Tables 5 and 6.

Method         IJB-C-1021   IJB-C-1839   IJB-C-3531
K-Means [26]   0.8690       0.8674       0.8676
DBSCAN [27]    0.7010       0.6625       0.6297
ARO [28]       0.8955       0.9101       0.9111
L-GCN [29]     0.8008       0.8042       0.8111
Proposed       0.9063       0.9271       0.9548

As shown in Figures 6 and 7, the proposed method outperforms the other clustering methods. For example, on the IJB-B-512 subset, the proposed method improves by 2.89% over k-means, 21.07% over DBSCAN, 4.4% over ARO, and 1.14% over L-GCN, and achieves similar gains on the rest of the IJB-B subsets. The experimental results show that the predictions of the proposed method have high accuracy. This is mainly because the proposed method improves the accuracy of predictions from two perspectives, i.e., the discriminativeness of features and the accuracy of pseudo-labeling. Specifically, the proposed method learns the similarity graph structure between labeled and unlabeled samples using the SGSL model, and then makes the features more discriminative through the UGCN. Moreover, when generating pseudo-labels for unlabeled samples, the proposed method not only uses uncertainty to check the reliability of the prediction confidence, but also makes full use of samples with low confidence by generating negative pseudo-labels for them to enrich the supervised information of the network.

Comparison of Person Re-ID Task
The proposed method is compared to semi-supervised person re-identification methods. For a fair comparison, it is only compared to methods with the same semi-supervised setup. These methods can be briefly described as follows: MVC [30] is a semi-supervised method based on self-training; SPC [31] is a semi-supervised method based on self-paced learning; and TSSML [32] is a person re-identification method based on transductive learning. The experimental results on the Market-1501 and DukeMTMC-reID datasets are shown in Figures 8 and 9, and the specific numerical results are shown in Tables 7 and 8, respectively.

From the comparison results on the Market-1501 dataset, it can be seen that the proposed method achieves the best results on mAP, Rank-1, Rank-5, and Rank-10. Compared to the second-best TSSML method, the proposed method improves by 0.8% on mAP, 0.6% on Rank-1, and 0.8% on Rank-5. Moreover, the comparison on the DukeMTMC-reID dataset shows that the proposed method improves by 0.6% in mAP over the TSSML method, and remains competitive on Rank-1 and Rank-5, although it is not the best there. Compared to the SPC-Combine method, the proposed method improves by 3.6% on mAP, 0.7% on Rank-1, 2.7% on Rank-5, and 2.9% on Rank-10.
There are two main reasons for the strong competitiveness of the proposed method. Firstly, we fully consider the potential correlation between labeled and unlabeled samples during training. Then, we exploit the neighborhood aggregation capability of the graph convolutional network to gradually drive the features of unlabeled samples to approach those of similar labeled samples during training. This, in turn, drives the backbone network to learn more discriminative features through backpropagation. Secondly, to reduce the noise in pseudo-labels, uncertainty is utilized to measure the reliability of predictions by repeatedly feeding batch samples into the network 10 times and calculating the standard deviation of 10 results. Only those with a standard deviation less than a threshold are considered reliable classification predictions. Therefore, the experimental results of the proposed method on both image clustering and person re-identification tasks are highly competitive, demonstrating that the proposed method can learn more discriminative features and generate more accurate pseudo-labels.

Conclusions
We propose a positive and negative self-training framework based on graph-based deep uncertainty, which can utilize the potential correlation between labeled and unlabeled samples in semi-supervised learning. The network includes two key models, i.e., SGSL and UGCN. The SGSL model builds a kind of similarity graph structure for labeled and unlabeled samples. The UGCN can aggregate features in the training phase based on the learned graph structure, making the features more discriminative. In addition, it can output uncertainty for predictions in the pseudo-label generation phase and generate pseudo-labels only for the unlabeled samples with low uncertainty, which in turn reduces the noise in pseudo-labels. The proposed method is evaluated on image clustering and person re-identification tasks, and both experimental results show the effectiveness of the proposed method.