A Novel Maximum Mean Discrepancy-Based Semi-Supervised Learning Algorithm

Abstract: To provide more external knowledge for training semi-supervised learning (SSL) algorithms, this paper proposes a maximum mean discrepancy-based SSL (MMD-SSL) algorithm, which trains a well-performing classifier by iteratively refining it with highly confident unlabeled samples. The MMD-SSL algorithm performs three main steps. First, a multilayer perceptron (MLP) is trained on the labeled samples and then used to assign labels to unlabeled samples. Second, the unlabeled samples are divided into multiple groups with the k-means clustering algorithm. Third, the maximum mean discrepancy (MMD) criterion is used to measure the distribution consistency between the k-means-clustered samples and the MLP-classified samples. Samples with consistent distributions are labeled as highly confident samples and used to retrain the MLP. The MMD-SSL algorithm iterates this training process until all unlabeled samples are consistently labeled. We conducted extensive experiments on 29 benchmark data sets to validate the rationality and effectiveness of the MMD-SSL algorithm. Experimental results show that the generalization capability of the MLP gradually improves as the number of labeled samples increases, and a statistical analysis demonstrates that the MMD-SSL algorithm yields better testing accuracy and kappa values than 10 other self-training and co-training SSL algorithms.


Introduction
Semi-supervised learning (SSL) is an important branch of data mining and machine learning [1], which uses a large number of unlabeled samples to improve the generalization capability of classifiers trained on a small number of labeled samples. Different from active learning [2], SSL focuses on the selection of easily classified samples rather than the selection of easily misclassified samples. The goal of active learning is to minimize the number of samples labeled by domain experts, while the goal of SSL is to maximize the usage of information from unlabeled samples without the intervention of domain experts. The lower labor and time costs achieved using SSL make it more suitable than active learning for a wide range of real-world applications such as automatic query classification [3], image recognition [4], fraudulent cash-out detection [5], and biological sequence analysis [6].
Up to now, researchers have proposed many useful methods to deal with SSL problems [7]. These methods can be categorized as self-training models, co-training models, generative models, semi-supervised SVMs, and graph models. Self-training [8] and co-training [9] methods have attracted much attention because they are simple-to-implement and easy-to-interpret SSL paradigms. The former is a single-view SSL paradigm, which iteratively updates a classifier based on the combination of labeled and pseudo-labeled samples, while the latter is a multiple-view SSL paradigm, which iteratively updates multiple classifiers based on such combinations. The objective of both is to build one or more classifiers that are as accurate as possible by efficiently exploiting a large number of unlabeled samples. Some key studies about each paradigm are summarized as follows.
• Self-training methods. Li and Zhou [10] devised a self-training algorithm named SETRED (self-training with editing), which introduced a data editing technique into the self-training process to filter out the noise in self-labeled examples. Wang et al. [11] proposed a self-training nearest neighbor rule using cut edges (SNNRCE) method, which is based on a nearest neighbor rule for classification and cuts edges in the relative neighborhood graph. Halder et al. [12] presented an advanced aggregation pheromone density-based semi-supervised classification (APSSC) algorithm, which makes no assumption on the data distribution and has no user-defined parameters. Wu et al. [13] designed a self-training semi-supervised classification (self-training SSC) framework based on density peaks of data, where the structure of the data space is integrated into the self-training process of SSC to help train a better classifier.
• Co-training methods. Zhou and Goldman [14] proposed a democratic co-learning (DemoCoL) method, which employs a set of different learning algorithms to train a set of classifiers separately on the labeled data and then combines their outputs using weighted voting to predict the labels of unlabeled examples. Zhou and Li [15] designed an extended co-training semi-supervised learning algorithm named Tri-Training, which generates three classifiers from the original labeled samples and then refines them using the unlabeled samples in the tri-training process. Wang et al. [16] proposed a random subspace co-training (RASCO) method, which trains many classifiers based on feature subspaces of the original feature space. Yaslan and Cataltepe [17] improved the classical RASCO algorithm into a relevant RASCO named Rel-RASCO, which produces relevant random subspaces by considering the mutual information between features and class labels. Huang et al. [18] presented a classification algorithm based on local cluster centers (CLCC) for SSL, which is able to reduce the interference of mislabeled data.
Although the aforementioned SSL methods have shown good performance in experiments, they still have some important drawbacks that leave room for improvement in the self-training and co-training SSL paradigms. In particular, for self-training SSL, the selection of the most confident pseudo-labeled samples mainly depends on internal judgment rather than external judgment, i.e., a classifier teaches itself using its own cognition until it is satisfied with its own learning. Moreover, for co-training SSL, the assumption that multiple views are conditionally independent often results in high computational complexity.
To address these issues, this paper presents a novel SSL algorithm named maximum mean discrepancy-based semi-supervised learning (MMD-SSL), which performs three main steps. First, a multilayer perceptron (MLP) is trained on the labeled samples and then used to assign labels to the unlabeled samples; the MLP-classified samples serve as internal information for the classifier training. Second, the unlabeled samples are divided into different groups using the k-means clustering algorithm; the k-means-clustered samples serve as external knowledge. Third, the maximum mean discrepancy (MMD) criterion measures the distribution consistency between the k-means-clustered samples and the MLP-classified samples, and the samples having consistent distributions are labeled and used to retrain the MLP. We conducted extensive experiments on 29 benchmark data sets to validate the rationality and effectiveness of the MMD-SSL algorithm. The results show that the generalization capability of the MLP gradually improves as the number of labeled samples increases. Moreover, a statistical analysis demonstrates that the MMD-SSL algorithm provides better testing accuracy and kappa values than 10 other self-training and co-training SSL algorithms, i.e., SETRED, SNNRCE, APSSC, Self-Training-NN, DemoCoL, Tri-Training, RASCO, Rel-RASCO, CLCC, and Co-Training-NN, where Self-Training-NN and Co-Training-NN are the classical self-training [8] and co-training [9] paradigms using neural networks as classifiers.
The remainder of this paper is organized as follows. In Section 2, we introduce the preliminaries of SSL. In Section 3, we propose the MMD-SSL method. In Section 4, we describe the experimental evaluation method and analyze the results. Finally, in Section 5, we conclude this paper and discuss future work.

Preliminaries
Assume there is a labeled data set containing N samples, each described by D condition attributes and one class attribute,

$\bar{D} = \left\{ (\mathbf{x}_n, \bar{y}_n) \,\middle|\, \mathbf{x}_n = (x_{n1}, x_{n2}, \cdots, x_{nD}),\ \bar{y}_n \in \{c_1, c_2, \cdots, c_K\},\ n = 1, 2, \cdots, N \right\},$

and an unlabeled data set having M samples with the same D condition attributes,

$D = \left\{ \mathbf{x}_m \,\middle|\, \mathbf{x}_m = (x_{m1}, x_{m2}, \cdots, x_{mD}),\ m = 1, 2, \cdots, M \right\},$

where $c_1, c_2, \cdots, c_K$ are the K discrete labels of the data set $\bar{D}$. The initial classifier $L^{(0)}$ is trained on the small number of samples in $\bar{D}$, so its generalization capability is restricted by the insufficient sample size. The data set D is easier to obtain than $\bar{D}$ because the class labels of its samples are not required, whereas labeling samples with the help of experts is very expensive. How to use the unlabeled samples to improve the generalization capability of a classifier $L^{(0)}$ trained on labeled samples is the primary focus of semi-supervised learning (SSL). Self-training and co-training are two classical SSL paradigms; a brief description of each is given next.

SSL with Self-Training Paradigm
The origin of the self-training SSL paradigm can be traced back to Scudder [8]. After that, several extended self-training SSL methods have been developed [10-13,19]. The main algorithmic steps of the self-training SSL paradigm are listed below.
Step 1: Train a classifier L on the labeled data set $\bar{D}$;
Step 2: Label the unlabeled samples in D with L;
Step 3: Evaluate the confidence scores of these newly labeled samples and obtain the data set $\hat{D}$ containing the samples with high confidence scores;
Step 4: Update the labeled data as $\bar{D} \leftarrow \bar{D} \cup \hat{D}$;
Step 5: Update the unlabeled data as $D \leftarrow D - \hat{D}$;
Step 6: Repeat Steps 1-5 until the stopping criteria are met.
To design an effective self-training SSL method, the key aspect is how to calculate confidence scores for the labels given to unlabeled samples. Here, we only introduce the simplest way of selecting samples with high confidence scores for reference. Assume that the probability output of an unlabeled sample $\mathbf{x}_m$ is $\mathbf{p}_m = (p_{m1}, p_{m2}, \cdots, p_{mK})$. The confidence score of $\mathbf{x}_m$ is calculated as

$\mathrm{conf}(\mathbf{x}_m) = \begin{cases} 1, & \text{if } \max_k p_{mk} > \gamma, \\ 0, & \text{otherwise}, \end{cases}$

where $\gamma \in (0, 1)$ is a threshold used to produce a hard label for $\mathbf{x}_m$. The samples having a confidence score of 1 are selected to update the classifier. This method usually yields many incorrectly labeled samples and can result in relatively poor training performance.
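For reference, the following minimal sketch implements this thresholded self-training loop (Steps 1-6). The MLP classifier, hidden-layer size, and γ = 0.9 are illustrative assumptions rather than settings prescribed by the paradigm.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def self_training(X_labeled, y_labeled, X_unlabeled, gamma=0.9, max_rounds=20):
    """Minimal self-training loop: samples whose maximum class probability
    exceeds gamma receive a hard pseudo label (confidence score 1) and are
    moved from the unlabeled pool into the labeled pool."""
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
    for _ in range(max_rounds):
        if len(X_unlabeled) == 0:
            break
        clf.fit(X_labeled, y_labeled)                 # Step 1
        proba = clf.predict_proba(X_unlabeled)        # Step 2
        confident = proba.max(axis=1) > gamma         # Step 3
        if not confident.any():
            break                                     # no high-confidence samples left
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_labeled = np.vstack([X_labeled, X_unlabeled[confident]])   # Step 4
        y_labeled = np.concatenate([y_labeled, pseudo])
        X_unlabeled = X_unlabeled[~confident]                        # Step 5
    return clf
```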

SSL with Co-Training Paradigm
The co-training SSL paradigm [9,20,21] requires two different views of a data set, i.e., two different feature subsets to label the unlabeled samples. Ideally, these two feature subsets are conditionally independent given the class and the class of samples can be correctly predicted using each view. The main algorithmic steps of the co-training SSL paradigm are provided as follows.
Step 1: Partition the labeled data set $\bar{D}$ into two labeled data sets $\bar{D}_1$ and $\bar{D}_2$ according to two different views $A^{(1)}$ and $A^{(2)}$;
Step 2: Train two classifiers $L_1$ and $L_2$ on the labeled data sets $\bar{D}_1$ and $\bar{D}_2$, respectively;
Step 3: Label the unlabeled samples in D with $L_1$;
Step 4: Evaluate the confidence scores of these newly labeled samples and obtain the $A^{(2)}$-view data set $D_2$ containing the samples having high confidence scores;
Step 5: Label the unlabeled samples in D with $L_2$;
Step 6: Evaluate the confidence scores of these newly labeled samples and obtain the $A^{(1)}$-view data set $D_1$ containing the samples having high confidence scores;
Step 7: Update the labeled data as $\bar{D}_2 \leftarrow \bar{D}_2 \cup D_2$;
Step 8: Update the labeled data as $\bar{D}_1 \leftarrow \bar{D}_1 \cup D_1$;
Step 9: Update the unlabeled data as $D \leftarrow D - \tilde{D}$, where $\tilde{D}$ is composed of the samples in $D_1$ and $D_2$ with full views;
Step 10: Repeat Steps 1-9 until the stopping criteria are met.
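The following sketch summarizes Steps 1-10 in code. The Gaussian naive Bayes base learner, the confidence threshold, and the representation of views as column-index lists are our own illustrative assumptions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X_lab, y_lab, X_unlab, view1, view2, gamma=0.9, rounds=10):
    """Sketch of the co-training loop: view1 and view2 are column-index
    lists defining the two (ideally conditionally independent) views."""
    X1, X2 = X_lab[:, view1], X_lab[:, view2]
    y1, y2 = y_lab.copy(), y_lab.copy()
    L1, L2 = GaussianNB().fit(X1, y1), GaussianNB().fit(X2, y2)      # Steps 1-2
    for _ in range(rounds):
        if len(X_unlab) == 0:
            break
        p1 = L1.predict_proba(X_unlab[:, view1])                     # Steps 3-4
        p2 = L2.predict_proba(X_unlab[:, view2])                     # Steps 5-6
        m1, m2 = p1.max(axis=1) > gamma, p2.max(axis=1) > gamma
        if not (m1 | m2).any():
            break
        # Steps 7-8: each classifier's confident labels extend the other view's data
        X2 = np.vstack([X2, X_unlab[m1][:, view2]])
        y2 = np.concatenate([y2, L1.classes_[p1[m1].argmax(axis=1)]])
        X1 = np.vstack([X1, X_unlab[m2][:, view1]])
        y1 = np.concatenate([y1, L2.classes_[p2[m2].argmax(axis=1)]])
        X_unlab = X_unlab[~(m1 | m2)]                                # Step 9
        L1, L2 = GaussianNB().fit(X1, y1), GaussianNB().fit(X2, y2)  # retrain
    return L1, L2
```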
Developing an effective co-training SSL method requires selecting two conditionally independent and sufficient views. Prior studies [22] have shown that the generalization capability of a classifier can be improved when the dependence between the two views is weak.

The Proposed MMD-SSL Algorithm
This section presents the novel maximum mean discrepancy-based semi-supervised learning (MMD-SSL) algorithm. Its main steps are listed in Algorithm 1. MMD-SSL belongs to the self-training SSL paradigm and performs three main operations, i.e., training a multilayer perceptron (MLP) classifier on the labeled data set, clustering the unlabeled samples using the k-means algorithm, and measuring the distribution consistency between the classification and clustering results using the maximum mean discrepancy (MMD) criterion [23,24].

Algorithm 1 MMD-SSL Algorithm.
Input: A labeled data set $\bar{D}$ and an unlabeled data set $D$.
Output: The predicted labels $y_1, y_2, \cdots, y_M$ and a multilayer perceptron (MLP) Learner.
1: repeat
2: Train an MLP Learner with two hidden layers on the labeled data set $\bar{D}$;
3: Predict the labels of the samples in the unlabeled data set $D$ with Learner and partition $D$ into $\widetilde{K}$ ($\widetilde{K} \le K$) disjoint data subsets $D_1, D_2, \cdots, D_{\widetilde{K}}$ according to the predicted labels;
4: Apply the k-means clustering algorithm to partition $D$ into $\widetilde{K}$ disjoint data subsets $Q_1, Q_2, \cdots, Q_{\widetilde{K}}$;
5: for $i = 1; i \le \widetilde{K}; i{+}{+}$ do
6:   for $k = 1; k \le \widetilde{K}; k{+}{+}$ do
7:     Calculate the maximum mean discrepancy between $Q_i$ and $D_k$ as $\mathrm{MMD}(Q_i, D_k)$;
8:   end for
9:   Match $Q_i$ with the subset $D_k$ yielding the minimal $\mathrm{MMD}(Q_i, D_k)$, assign the label of $D_k$ to the samples in $Q_i$, and add these samples to the high-confidence set $\hat{D}$;
10: end for
11: Update the labeled data as $\bar{D} \leftarrow \bar{D} \cup \hat{D}$;
12: Update the unlabeled data as $D \leftarrow D - \hat{D}$;
13: until the number of labels predicted by Learner for the unlabeled data set $D$ is 1 or the number of samples in $D$ is less than the given threshold $\zeta > 0$.
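Line 7 of Algorithm 1 requires an empirical estimate of the MMD between two sample sets. The sketch below uses a Gaussian kernel with the median-distance bandwidth heuristic; this kernel configuration is our own assumption, as the algorithm only requires some valid MMD estimator [23,24].

```python
import numpy as np

def mmd(X, Y, sigma=None):
    """Biased empirical MMD between sample sets X and Y (rows are samples),
    computed with a Gaussian kernel. If sigma is None, the bandwidth is set
    by the median heuristic over all pairwise distances."""
    Z = np.vstack([X, Y])
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    if sigma is None:
        sigma = np.sqrt(np.median(sq[sq > 0]))
    K = np.exp(-sq / (2 * sigma ** 2))
    n = len(X)
    mmd2 = K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()
    return float(np.sqrt(max(mmd2, 0.0)))
```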
In the MMD-SSL algorithm, it is feasible to match the k-means-clustered data set $Q_m$ with the MLP-classified data set $D_n$. Assume that $Q_m$ has a probability distribution consistent with that of $D_n$, which indicates that

$\mathrm{MMD}(Q_m, D_n) < \mathrm{MMD}(Q_m, D_k)$

holds for any $k \in \{1, 2, \cdots, \widetilde{K}\}$ with $k \ne n$. This observation is demonstrated by the illustration in Figure 1, in which the unlabeled data set is partitioned into 5 parts by the k-means algorithm ($Q_1, Q_2, Q_3, Q_4, Q_5$) and by the MLP classifier ($D_1, D_2, D_3, D_4, D_5$), respectively, and the MMD values of all data pairs are calculated. The results indicate that the k-means-clustered data sets $Q_1, Q_2, Q_3, Q_4, Q_5$ have the same class labels as the MLP-classified data sets $D_3, D_1, D_5, D_4, D_2$, respectively. Taking the data pair ($Q_1$, $D_3$) as an example, the samples in both $Q_1$ and $D_3$ are labeled as class 3 and added into the labeled data set to update the training of the MLP classifier in the next iteration.

The stopping criterion of the MMD-SSL algorithm is that only one class is predicted by the MLP classifier for the unlabeled data set or that the number of samples remaining in the unlabeled data set is less than the threshold $\zeta > 0$. The rationale of the first condition is that applying the k-means clustering algorithm is unnecessary for a data set having a single cluster, i.e., when the MLP classifier predicts the same label for all unlabeled samples; in this situation, the training of the MMD-SSL algorithm stops and all samples from the unlabeled data set receive the predicted label. For the second condition, we adopt an adaptive threshold determination strategy, i.e., we let $\zeta = \widetilde{K}$, which means that the training of MMD-SSL stops when the number of remaining unlabeled samples equals the number of labels predicted by the MLP classifier.
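Given the pairwise MMD values, the matching in lines 7-9 of Algorithm 1 reduces to an argmin per cluster. A hypothetical helper illustrating this, reusing the mmd() sketch above (Q_subsets, D_subsets, and labels are assumed names for the clustered subsets, the classified subsets, and their predicted labels):

```python
import numpy as np

def match_clusters(Q_subsets, D_subsets, labels):
    """Match each k-means cluster Q_i to the MLP-classified subset D_k
    with the smallest MMD; Q_i's samples then inherit D_k's label."""
    matches = []
    for i, Q in enumerate(Q_subsets):
        scores = [mmd(Q, D) for D in D_subsets]
        k = int(np.argmin(scores))            # most distribution-consistent subset
        matches.append((i, k, labels[k]))
    return matches
```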

Experimental Results and Analysis
We conducted two experiments to validate the rationality and effectiveness of the proposed MMD-SSL algorithm. The MMD-SSL algorithm was implemented in the Python programming language, and the other SSL algorithms were downloaded from sci2s (https://sci2s.ugr.es/SelfLabeled (accessed on 16 December 2021)), the website of the Soft Computing and Intelligent Information Systems research group at the University of Granada. All experiments were carried out on a personal computer equipped with an Intel(R) quad-core 3.00 GHz i5-7400 CPU and 16 GB of main memory.
The comparative results corresponding to 10 independent training runs of the MMD-SSL algorithm are summarized in Table 1. We can see that the clustering algorithms combined with the MLP classifier obtain better average testing accuracies than the clustering algorithms combined with other classifiers. The average testing accuracies of the MMD-SSL algorithm with MLP classification and the k-means, agglomerative, spectral, and BIRCH clustering algorithms are 0.975, 0.958, 0.996, and 0.954, respectively. Due to its better generalization capability, the MLP classifier was selected to carry out the classification in the MMD-SSL algorithm. For MMD-SSL with the MLP classifier, we used k-means to cluster the unlabeled data set because of its simple model structure and acceptable semi-supervised learning performance.
Second, we validated the convergence of the proposed MMD-SSL algorithm on another synthetic data set (#2), shown in Figure 2, which has the same data distribution as synthetic data set #1. Figure 3 shows the convergence process: the MMD-SSL algorithm converges within only six iterations, i.e., all unlabeled samples are labeled by the gradually updated MLP classifier. The testing accuracies of the MLP classifier over these six iterations (Figure 2d) are 0.711, 0.819, 0.836, 0.859, 0.869, and 0.869, an overall increase of more than 15%. This indicates that the designed MMD-SSL algorithm is reasonable and able to improve the generalization capability of a classifier by properly utilizing unlabeled training samples.

Effectiveness Validation
In this experiment, we compared the testing accuracy and kappa coefficient [25] of the MMD-SSL algorithm with those of 10 other self-training and co-training SSL algorithms, namely SETRED [10], SNNRCE [11], APSSC [12], Self-Training-NN [8], DemoCoL [14], Tri-Training [15], RASCO [16], Rel-RASCO [17], CLCC [18], and Co-Training-NN [9]. The parameters of these SSL algorithms were set as follows.
• CLCC: the number of random forests was 6, the manipulative beta parameter was 0.4, the number of initial clusters was 4, the running frequency was 10, and the number of best center sets was 6;
• Co-Training-NN: the maximum number of iterations was 40, the number of nearest neighbors was 3, and the size of the initial unlabeled sample pool was 75.
We selected 29 data sets from the KEEL data repository (https://sci2s.ugr.es/keel/category.php?cat=clas&order=name#sub2 (accessed on 16 December 2021)) to test the performance of these SSL algorithms. Detailed descriptions of these data sets are summarized in Table 2. Each data set was randomly partitioned into three parts: labeled training data, unlabeled training data, and labeled testing data. The ratios of labeled training data were set to 10% and 30% of the whole data set; for each ratio, the unlabeled training data and the labeled testing data accounted for 70% and 30% of the remaining samples, respectively. Each algorithm was trained using the labeled and unlabeled training data and tested on the labeled testing data. The testing accuracy and kappa value were computed as averages over 10 different random data partitions.

Tables 3-6 present the detailed comparative results of the 11 SSL algorithms for the two labeled data ratios. We can see that (1) the MMD-SSL algorithm obtains higher testing accuracy and kappa values than the other 10 SSL algorithms for each labeled data ratio and (2) the testing accuracy and kappa value of each SSL algorithm increase gradually as the labeled training data increase. We also use critical difference diagrams [26], shown in Figures 4 and 5, to present the statistical analysis of this comparison of 11 algorithms on 29 data sets. For a significance level of 0.05, the critical difference (CD) value is calculated as

$\mathrm{CD} = q_{0.05} \sqrt{\frac{11 \times (11+1)}{6 \times 29}} = 2.516,$

where $q_{0.05}$ is the critical value of the Studentized range (Tukey) distribution at this significance level. The statistical analysis indicates that the MMD-SSL algorithm obtains (1) significantly better testing accuracy and kappa values than APSSC, Self-Training-NN, Tri-Training, RASCO, Rel-RASCO, CLCC, and Co-Training-NN when using 10% labeled training data and (2) significantly better testing accuracy and kappa values than APSSC, SNNRCE, Tri-Training, RASCO, Rel-RASCO, CLCC, and Co-Training-NN when using 30% labeled training data. Above all, the MMD-SSL algorithm obtains the highest average testing accuracy and kappa value among the compared SSL algorithms, which demonstrates its effectiveness in dealing with SSL problems. In addition, the comparative results indicate that MMD-SSL is well suited to SSL on imbalanced data sets: the kappa coefficient is a popular index for measuring a learning algorithm's ability to handle imbalanced classification, and the average testing kappa values of MMD-SSL are higher than those of the other SSL algorithms.
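To make the partition ratios concrete, the sketch below generates one random partition as described above (10% or 30% labeled training data, then 70%/30% of the remainder as unlabeled training and labeled testing data); the function name and seeding are our own assumptions.

```python
import numpy as np

def partition(X, y, labeled_ratio=0.1, seed=0):
    """One random partition: labeled_ratio of the whole set is labeled
    training data; 70% of the remainder is unlabeled training data and
    the final 30% is labeled testing data."""
    idx = np.random.default_rng(seed).permutation(len(X))
    n_lab = int(labeled_ratio * len(X))
    n_unlab = int(0.7 * (len(X) - n_lab))
    lab, unlab, test = np.split(idx, [n_lab, n_lab + n_unlab])
    return (X[lab], y[lab]), X[unlab], (X[test], y[test])
```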
(a) CD diagram on 10% labeled training data (b) CD diagram on 30% labeled training data

Conclusions and Future Work
In this paper, we proposed a maximum mean discrepancy-based semi-supervised learning (MMD-SSL) algorithm, which is a data distribution-oriented SSL algorithm. The unlabeled samples are gradually labeled by considering the consistency between the clustering and classification results. The MMD-SSL algorithm belongs to the category of self-training SSL and performs three main steps: the training of a multilayer perceptron (MLP) classifier, the clustering of unlabeled samples, and the consistent labeling of unlabeled samples. The experimental results demonstrated the rationality and effectiveness of the designed MMD-SSL algorithm through comparisons with 10 other SSL algorithms on 29 benchmark data sets. Here, we briefly summarize the technical advantages of the MMD-SSL algorithm.

• Highly confident pseudo labeling. Because the MMD criterion is used to measure the distribution consistency between the k-means-clustered samples and the MLP-classified samples, the pseudo labeling considers both the inherent features (k-means clustering results) and the extrinsic characteristics (MLP classification results) of the unlabeled samples. This kind of pseudo labeling provides more confidence than pseudo labeling based only on internal or external information.
• Good generalization capability of the classifier. The MLP classifier is trained on samples with highly confident pseudo labels, and thus its testing performance gradually improves as the number of training samples increases, as the experimental results have demonstrated. The highly confident pseudo labeling leads to the good generalization capability of the MLP classifier.
• Easy implementation. The MMD-SSL algorithm is easy to understand and can be implemented in any programming language. Moreover, the training of the MMD-SSL algorithm converges as the number of unlabeled samples decreases.
Future work will focus on the following three directions. First, the MMD-SSL algorithm will be implemented in a distributed computation environment and used to deal with large-scale SSL problems. Second, an ensemble version of the MMD-SSL algorithm will be developed to further enhance the labeling confidence of unlabeled samples. Third, the MMD-SSL algorithm will be applied to real-world applications, e.g., the identification of harassing phone calls and the detection of abnormal power consumption behavior.

Data Availability Statement:
The data presented in this study are available in BaiduPan at https://pan.baidu.com/s/1aDm8n7AA2ETtSumM5LXVBQ (accessed on 16 December 2021) with extraction code vn6j.

Acknowledgments:
The authors would like to thank the editor and two anonymous reviewers who carefully read the paper and provided valuable suggestions that considerably improved the paper. They thank Philippe Fournier-Viger for helping them improve the linguistic quality of the manuscript so that it can be read more smoothly.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations

MMD	Maximum mean discrepancy
SSL	Semi-supervised learning
MMD-SSL	Maximum mean discrepancy-based semi-supervised learning
MLP	Multilayer perceptron
SVM	Support vector machine