Article

A Novel Maximum Mean Discrepancy-Based Semi-Supervised Learning Algorithm

1 Big Data Institute, College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China
2 National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(1), 39; https://doi.org/10.3390/math10010039
Submission received: 8 November 2021 / Revised: 10 December 2021 / Accepted: 17 December 2021 / Published: 23 December 2021

Abstract

To provide more external knowledge for training semi-supervised learning (SSL) algorithms, this paper proposes a maximum mean discrepancy-based SSL (MMD-SSL) algorithm, which trains a well-performing classifier by iteratively refining the classifier using highly confident unlabeled samples. The MMD-SSL algorithm performs three main steps. First, a multilayer perceptron (MLP) is trained based on the labeled samples and is then used to assign labels to unlabeled samples. Second, the unlabeled samples are divided into multiple groups with the k-means clustering algorithm. Third, the maximum mean discrepancy (MMD) criterion is used to measure the distribution consistency between k-means-clustered samples and MLP-classified samples. The samples having a consistent distribution are labeled as highly confident samples and used to retrain the MLP. The MMD-SSL algorithm repeats this training iteratively until all unlabeled samples are consistently labeled. We conducted extensive experiments on 29 benchmark data sets to validate the rationality and effectiveness of the MMD-SSL algorithm. The experimental results show that the generalization capability of the MLP algorithm gradually improves as the number of labeled samples increases, and the statistical analysis demonstrates that the MMD-SSL algorithm provides better testing accuracy and kappa values than 10 other self-training and co-training SSL algorithms.

1. Introduction

Semi-supervised learning (SSL) is an important branch of data mining and machine learning [1], which uses a large number of unlabeled samples to improve the generalization capability of classifiers trained on a small number of labeled samples. Different from active learning [2], SSL focuses on the selection of easily classified samples rather than the selection of easily misclassified samples. The goal of active learning is to minimize the number of samples labeled by domain experts, while the goal of SSL is to maximize the usage of information from unlabeled samples without the intervention of domain experts. The lower labor and time costs achieved using SSL make it more suitable than active learning for a wide range of real-world applications such as automatic query classification [3], image recognition [4], fraudulent cash-out detection [5], and biological sequence analysis [6].
Up to now, researchers have proposed many useful methods to deal with SSL problems [7]. These methods can be categorized as self-training models, co-training models, generative models, semi-supervised SVM, and graph models. Self-training [8] and co-training [9] methods have attracted much attention as they are simple-to-implement and easy-to-interpret SSL paradigms. The former is a single-view SSL paradigm, which iteratively updates a classifier based on the combination of labeled samples and pseudo-labeled samples, while the latter is a multiple-view SSL paradigm that iteratively updates multiple classifiers based on the combination of labeled samples and pseudo-labeled samples. The objective of self-training and co-training SSL is to create one or more classifiers that are as good as possible by efficiently using a large number of unlabeled samples. Some key studies about each paradigm are summarized as follows.
  • Self-training methods. Li and Zhou [10] devised a self-training algorithm named SETRED (self-training with editing), which introduced a data editing technique into the self-training process to filter out the noise in self-labeled examples. Wang et al. [11] proposed a self-training nearest neighbor rule using cut edges (SNNRCE) method, which is based on a nearest neighbor rule for classification and cuts edges in the relative neighborhood graph. Halder et al. [12] presented an advanced aggregation pheromone density based semi-supervised classification (APSSC) algorithm which makes no assumption on the data distribution and has no user-defined parameters. Wu et al. [13] designed a self-training semi-supervised classification (self-training SSC) framework based on density peaks of data, where the structure of the data space is integrated into the self-training process of SSC to help train a better classifier.
  • Co-training methods. Zhou and Goldman [14] proposed a democratic co-learning (DemoCoL) method, which employs a set of different learning algorithms to train a set of classifiers separately on the labeled data and then combines the outputs using weighted voting to predict the labels of unlabeled examples. Zhou and Li [15] designed an extended co-training semi-supervised learning algorithm named Tri-Training, which generates three classifiers from the original labeled samples and then refines them using the unlabeled samples in the tri-training process. Wang et al. [16] proposed a random subspace co-training (RASCO) method which trains many classifiers based on feature subspaces of the original feature space. Yaslan and Cataltepe [17] improved the classical RASCO algorithm and gave a relevant RASCO named Rel-RASCO, which produces relevant random subspaces by considering the mutual information between features and class labels. Huang et al. [18] presented a classification algorithm based on local cluster centers (CLCC) for SSL, which was able to reduce the interference of mislabeled data.
Although the aforementioned SSL methods have shown good performance in experiments, they still have some important drawbacks that can be further improved for the self-training and co-training SSL paradigms. In particular, the selection of the most confident pseudo-labeled samples for self-training SSL mainly depends on internal judgment rather than external judgment, i.e., a classifier teaches itself using its own cognition until it is satisfied with its own learning. Moreover, for co-training SSL, the assumption that multiple views are conditionally independent always results in a high computational complexity.
To address these issues, this paper presents a novel SSL algorithm, named maximum mean discrepancy-based semi-supervised learning (MMD-SSL), which performs three main steps. First, unlabeled samples are divided into different groups using the k-means clustering algorithm. Second, the k-means clustering results are used as external knowledge for training a multilayer perceptron (MLP), which is then used to assign labels to unlabeled samples; the MLP-classified samples serve as internal information for the classifier training. Third, the maximum mean discrepancy (MMD) criterion measures the distribution consistency between the k-means-clustered samples and the MLP-classified samples, and the samples having a consistent distribution are labeled and used to retrain the MLP. We conducted extensive experiments on 29 benchmark data sets to validate the rationality and effectiveness of the MMD-SSL algorithm. The results show that the generalization capability of the MLP algorithm gradually improves as the number of labeled samples increases. Moreover, a statistical analysis demonstrates that the MMD-SSL algorithm provides better testing accuracy and kappa values than 10 other self-training and co-training SSL algorithms, i.e., SETRED, SNNRCE, APSSC, Self-Training-NN, DemoCoL, Tri-Training, RASCO, Rel-RASCO, CLCC, and Co-Training-NN, where Self-Training-NN and Co-Training-NN are the classical self-training [8] and co-training [9] paradigms using neural networks as classifiers.
The remainder of this paper is organized as follows. In Section 2, we introduce the preliminaries of SSL. In Section 3, we propose the MMD-SSL method. In Section 4, we describe the experimental evaluation method and analyze the results. Finally, in Section 5, we conclude this paper and discuss future work.

2. Preliminaries

Assume there is a labeled data set containing N samples, described using D condition attributes and one class attribute as
$\bar{D} = \left\{ \left( \bar{x}_n, \bar{y}_n \right) \mid \bar{x}_n = \left( \bar{x}_{n1}, \bar{x}_{n2}, \ldots, \bar{x}_{nD} \right),\ \bar{y}_n \in \left\{ c_1, c_2, \ldots, c_K \right\},\ n = 1, 2, \ldots, N \right\}$
and an unlabeled data set having M samples with D condition attributes as
$D = \left\{ \left( x_m, y_m \right) \mid x_m = \left( x_{m1}, x_{m2}, \ldots, x_{mD} \right),\ y_m = \text{null},\ m = 1, 2, \ldots, M \right\},$
where $c_1, c_2, \ldots, c_K$ are the $K$ discrete labels of the data set $\bar{D}$. The initial classifier $L_0$ is trained with the small number of samples from the data set $\bar{D}$, so its generalization capability is restricted by the insufficient sample size. The data set $D$ is easier to obtain than $\bar{D}$ because the class labels of its samples are ignored, whereas labeling unlabeled samples with the help of experts is very expensive. How to use the unlabeled samples to improve the generalization capability of a classifier $L_0$ trained with labeled samples is the primary focus of semi-supervised learning (SSL). Self-training and co-training are two classical SSL paradigms, which are briefly described next.

2.1. SSL with Self-Training Paradigm

The origin of the self-training SSL paradigm can be traced back to Scudder [8]. After that, several extended self-training SSL methods have been developed [10,11,12,13,19]. The main algorithmic steps of the self-training SSL paradigm are listed below.
Step 1: Train a classifier $L$ on the labeled data set $\bar{D}$;
Step 2: Label the unlabeled samples in $D$ with $L$;
Step 3: Evaluate the confidence scores of these newly labeled samples and obtain the data set $\underline{D}$ including the samples with high confidence scores;
Step 4: Update the labeled data as $\bar{D} \leftarrow \bar{D} \cup \underline{D}$;
Step 5: Update the unlabeled data as $D \leftarrow D \setminus \underline{D}$;
Step 6: Repeat Steps 1–5 until the stopping criteria are met.
To design an effective self-training SSL method, the key aspect is how to calculate confidence scores for the labels given to unlabeled samples. Here, we only introduce the simplest way of selecting samples with high confidence scores for reference. Assume that the probability output of an unlabeled sample $x_m$ is $\left( p_1^m, p_2^m, \ldots, p_K^m \right)$, where $\sum_{k=1}^{K} p_k^m = 1$. The confidence score of $x_m$ is calculated as
$cs_m = \begin{cases} 1, & \text{if there exists } k \in \left\{ 1, 2, \ldots, K \right\} \text{ such that } p_k^m > \gamma, \\ 0, & \text{otherwise}, \end{cases}$
where $\gamma \in \left( 0, 1 \right)$ is a threshold used to produce a hard label for $x_m$. The samples having a confidence score of 1 are selected to update the classifier. This method usually leads to many incorrectly labeled samples and can yield relatively poor training performance.
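To make the paradigm concrete, the following minimal Python sketch (our illustration, not a reference implementation of any of the cited methods) implements Steps 1–6 with the threshold-based confidence score defined above; the choice of an MLPClassifier as the base learner, the threshold γ = 0.9, and the iteration cap are assumptions made only for this example.

```python
# Minimal sketch of the self-training SSL loop (Steps 1-6) using the
# threshold-based confidence score; the base classifier, gamma, and the
# iteration cap are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

def self_training(X_lab, y_lab, X_unlab, gamma=0.9, max_rounds=20):
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf = MLPClassifier(max_iter=300)
    for _ in range(max_rounds):
        clf.fit(X_lab, y_lab)                     # Step 1: train on the labeled data
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)        # Step 2: label the unlabeled samples
        confident = proba.max(axis=1) > gamma     # Step 3: cs_m = 1 iff max probability > gamma
        if not confident.any():
            break                                 # Step 6: stop when nothing new can be added
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])   # Step 4: enlarge the labeled set
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]             # Step 5: shrink the unlabeled set
    return clf
```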

2.2. SSL with Co-Training Paradigm

The co-training SSL paradigm [9,20,21] requires two different views of a data set, i.e., two different feature subsets to label the unlabeled samples. Ideally, these two feature subsets are conditionally independent given the class and the class of samples can be correctly predicted using each view. The main algorithmic steps of the co-training SSL paradigm are provided as follows.
Step 1: Partition the labeled data set $\bar{D}$ into two labeled data sets $\bar{D}_1$ and $\bar{D}_2$ according to two different views $A^{(1)}$ and $A^{(2)}$;
Step 2: Train two classifiers $L_1$ and $L_2$ on the labeled data sets $\bar{D}_1$ and $\bar{D}_2$, respectively;
Step 3: Label the unlabeled samples in $D$ with $L_1$;
Step 4: Evaluate the confidence scores of these newly labeled samples with $L_1$ and obtain the $A^{(2)}$-view data set $\underline{D}_2$ including the samples having high confidence scores;
Step 5: Label the unlabeled samples in $D$ with $L_2$;
Step 6: Evaluate the confidence scores of these newly labeled samples with $L_2$ and obtain the $A^{(1)}$-view data set $\underline{D}_1$ including the samples having high confidence scores;
Step 7: Update the labeled data as $\bar{D}_1 \leftarrow \bar{D}_1 \cup \underline{D}_2$;
Step 8: Update the labeled data as $\bar{D}_2 \leftarrow \bar{D}_2 \cup \underline{D}_1$;
Step 9: Update the unlabeled data as $D \leftarrow D \setminus \underline{D}$, where $\underline{D}$ is composed of the samples in $\underline{D}_1$ and $\underline{D}_2$ with full views;
Step 10: Repeat Steps 1–9 until the stopping criteria are met.
Developing an effective co-training SSL method requires selecting two conditionally independent and sufficient views. Prior studies [22] have shown that the generalization capability of a classifier can be improved when the dependence between the two views is weak.
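A compact sketch of this paradigm is given below. It is a simplified single-pool variant written only for illustration: the two views are passed as column-index lists, Gaussian naive Bayes is an assumed base learner, and confidently labeled samples (with their full views) are appended to one shared labeled set, which stands in for Steps 7–9 above.

```python
# Simplified sketch of the co-training SSL loop (Steps 1-10); the base
# learners, gamma, and the single shared labeled pool are illustrative assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_training(X_lab, y_lab, X_unlab, view1, view2, gamma=0.9, max_rounds=20):
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf1, clf2 = GaussianNB(), GaussianNB()
    for _ in range(max_rounds):
        clf1.fit(X_lab[:, view1], y_lab)            # Steps 1-2: one classifier per view
        clf2.fit(X_lab[:, view2], y_lab)
        if len(X_unlab) == 0:
            break
        p1 = clf1.predict_proba(X_unlab[:, view1])  # Steps 3-4: confident labels from L1
        p2 = clf2.predict_proba(X_unlab[:, view2])  # Steps 5-6: confident labels from L2
        confident = (p1.max(axis=1) > gamma) | (p2.max(axis=1) > gamma)
        if not confident.any():
            break                                   # Step 10: stop when nothing new can be added
        # Steps 7-9: a confident sample keeps the label of whichever view is surer
        # about it and joins the labeled pool with its full feature vector.
        pseudo = np.where(p1.max(axis=1) >= p2.max(axis=1),
                          clf1.classes_[p1.argmax(axis=1)],
                          clf2.classes_[p2.argmax(axis=1)])
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo[confident]])
        X_unlab = X_unlab[~confident]
    return clf1, clf2
```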

3. The Proposed MMD-SSL Algorithm

This section presents the novel maximum mean discrepancy-based semi-supervised learning (MMD-SSL) algorithm, whose main steps are listed in Algorithm 1. MMD-SSL belongs to the self-training SSL paradigm and performs three main operations, i.e., training a multilayer perceptron (MLP) classifier on the labeled data set, clustering the unlabeled samples using the k-means algorithm, and measuring the distribution consistency between the classification and clustering results using the maximum mean discrepancy (MMD) criterion [23,24].
To train the MLP classifier and apply the k-means algorithm, we use the standard sklearn packages, i.e., MLPClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html, accessed on 21 December 2021), and KMeans (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html (accessed on 21 December 2021)). For the data sets
$Q_i = \left\{ \left( x_m^i, y_m^i \right) \mid x_m^i = \left( x_{m1}^i, x_{m2}^i, \ldots, x_{mD}^i \right),\ y_m^i \in \left\{ w_1, w_2, \ldots, w_{\bar{K}} \right\},\ m = 1, 2, \ldots, M_i \right\}$
and
$D_k = \left\{ \left( \hat{x}_n^k, \hat{y}_n^k \right) \mid \hat{x}_n^k = \left( \hat{x}_{n1}^k, \hat{x}_{n2}^k, \ldots, \hat{x}_{nD}^k \right),\ \hat{y}_n^k \in \left\{ v_1, v_2, \ldots, v_{\bar{K}} \right\},\ n = 1, 2, \ldots, N_k \right\},$
the MMD value between them is calculated as
$\mathrm{MMD}\left( Q_i, D_k \right) = \left[ \frac{1}{M_i^2} \sum_{m=1}^{M_i} \sum_{n=1}^{M_i} \kappa\left( x_m^i, x_n^i \right) + \frac{1}{N_k^2} \sum_{m=1}^{N_k} \sum_{n=1}^{N_k} \kappa\left( \hat{x}_m^k, \hat{x}_n^k \right) - \frac{2}{M_i N_k} \sum_{m=1}^{M_i} \sum_{n=1}^{N_k} \kappa\left( x_m^i, \hat{x}_n^k \right) \right]^{\frac{1}{2}},$
where $x_m^i, \hat{x}_n^k \in \left\{ x_1, x_2, \ldots, x_M \right\}$, $\left\{ v_1, v_2, \ldots, v_{\bar{K}} \right\}, \left\{ w_1, w_2, \ldots, w_{\bar{K}} \right\} \subseteq \left\{ c_1, c_2, \ldots, c_K \right\}$, $M_i$ and $N_k$ are the numbers of samples in $Q_i$ and $D_k$,
$\kappa\left( a, b \right) = \exp\left( -\frac{\sum_{d=1}^{D} \left( a_d - b_d \right)^2}{2 \sigma^2} \right)$
is the Gaussian kernel function used to measure the distance between two vectors $a = \left( a_1, a_2, \ldots, a_D \right)$ and $b = \left( b_1, b_2, \ldots, b_D \right)$ in the reproducing kernel Hilbert space, and $\sigma^2$ is the kernel radius.
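A minimal numerical sketch of this MMD estimate, written with NumPy for illustration, is given below; the kernel radius σ² = 0.01 follows the setting used in Figure 1, and everything else mirrors the formulas above.

```python
# Minimal sketch of the MMD estimate with the Gaussian kernel
# kappa(a, b) = exp(-||a - b||^2 / (2 * sigma^2)); sigma^2 = 0.01 follows Figure 1.
import numpy as np

def gaussian_kernel(A, B, sigma2=0.01):
    # Pairwise kernel values between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma2))

def mmd(Q, D, sigma2=0.01):
    # Biased MMD estimate between the samples in Q (M_i x D) and in D (N_k x D).
    k_qq = gaussian_kernel(Q, Q, sigma2).mean()   # (1 / M_i^2) * sum of kappa(x_m^i, x_n^i)
    k_dd = gaussian_kernel(D, D, sigma2).mean()   # (1 / N_k^2) * sum of kappa(x_m^k, x_n^k)
    k_qd = gaussian_kernel(Q, D, sigma2).mean()   # (1 / (M_i N_k)) * sum of cross terms
    return np.sqrt(max(k_qq + k_dd - 2.0 * k_qd, 0.0))
```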
Algorithm 1: MMD-SSL Algorithm.
Input: A labeled data set $\bar{D}$ and an unlabeled data set $D$.
Output: The predicted labels $y_1, y_2, \ldots, y_M$ and a multilayer perceptron (MLP) $Learner$.
1: repeat
2:   Train an MLP $Learner$ with two hidden layers on the data set $\bar{D}$;
3:   Predict the labels of the samples from the unlabeled data set $D$ with $Learner$ and partition $D$ into $\bar{K}$ ($\bar{K} \le K$) disjoint data subsets $D_1, D_2, \ldots, D_{\bar{K}}$ according to the predicted labels;
4:   Apply the k-means clustering algorithm to partition $D$ into $\bar{K}$ disjoint data subsets $Q_1, Q_2, \ldots, Q_{\bar{K}}$;
5:   for $i = 1; i \le \bar{K}; i{+}{+}$ do
6:     for $k = 1; k \le \bar{K}; k{+}{+}$ do
7:       Calculate the maximum mean discrepancy (MMD) between $Q_i$ and $D_k$ as $\mathrm{MMD}(Q_i, D_k)$;
8:     end for
9:   end for
10:  $\underline{D} = \emptyset$;
11:  $M = N = \emptyset$;
12:  repeat
13:    Determine the data pair $(Q_m, D_n)$ with the consistent probability distribution, where $(m, n) = \arg\min_{i \in \{1, 2, \ldots, \bar{K}\} \setminus M,\; k \in \{1, 2, \ldots, \bar{K}\} \setminus N} \mathrm{MMD}(Q_i, D_k)$;
14:    Label the samples in $Q_m$ with the label of $D_n$;
15:    $M \leftarrow M \cup \{m\}$;
16:    $N \leftarrow N \cup \{n\}$;
17:    $\underline{D} \leftarrow \underline{D} \cup (Q_m \cap D_n)$;
18:  until $M = \{1, 2, \ldots, \bar{K}\}$ or $N = \{1, 2, \ldots, \bar{K}\}$.
19:  Update the labeled data as $\bar{D} \leftarrow \bar{D} \cup \underline{D}$;
20:  Update the unlabeled data as $D \leftarrow D \setminus \underline{D}$.
21: until the number of labels predicted with $Learner$ for the unlabeled data set $D$ is 1 or the number of samples in the unlabeled data set $D$ is less than the given threshold $\zeta > 0$.
In the MMD-SSL algorithm, it is feasible to match the k-means-clustered data set $Q_m$ with the MLP-classified data set $D_n$. Assume that $Q_m$ has a consistent probability distribution with $D_n$, which indicates that
$\mathrm{MMD}\left( Q_m, D_n \right) < \mathrm{MMD}\left( Q_m, D_k \right)$
holds for any $k \in \left\{ 1, 2, \ldots, \bar{K} \right\}$ with $k \ne n$. This observation can be demonstrated by the illustration in Figure 1: the unlabeled data set is partitioned into 5 parts by the k-means algorithm ($Q_1, Q_2, Q_3, Q_4, Q_5$) and by the MLP classifier ($D_1, D_2, D_3, D_4, D_5$). The MMD values corresponding to the different data pairs are calculated as
$\begin{array}{c|ccccc} \mathrm{MMD}\left( Q_i, D_k \right) & D_1 & D_2 & D_3 & D_4 & D_5 \\ \hline Q_1 & 0.08 & 0.52 & 0.48 & 0.58 & 0.49 \\ Q_2 & 0.45 & 0.30 & 0.43 & 0.26 & 0.44 \\ Q_3 & 0.45 & 0.51 & 0.46 & 0.57 & 0.33 \\ Q_4 & 0.44 & 0.48 & 0.00 & 0.54 & 0.44 \\ Q_5 & 0.54 & 0.57 & 0.53 & 0.63 & 0.25 \end{array}.$
Then, we can get the distribution consistency measure results as
$\begin{aligned} \mathrm{MMD}\left( Q_4, D_3 \right) &= \min_{i = 1, 2, 3, 4, 5;\; k = 1, 2, 3, 4, 5} \mathrm{MMD}\left( Q_i, D_k \right) = 0.00 \\ \mathrm{MMD}\left( Q_1, D_1 \right) &= \min_{i = 1, 2, 3, 5;\; k = 1, 2, 4, 5} \mathrm{MMD}\left( Q_i, D_k \right) = 0.08 \\ \mathrm{MMD}\left( Q_5, D_5 \right) &= \min_{i = 2, 3, 5;\; k = 2, 4, 5} \mathrm{MMD}\left( Q_i, D_k \right) = 0.25 \\ \mathrm{MMD}\left( Q_2, D_4 \right) &= \min_{i = 2, 3;\; k = 2, 4} \mathrm{MMD}\left( Q_i, D_k \right) = 0.26 \\ \mathrm{MMD}\left( Q_3, D_2 \right) &= 0.51 \end{aligned}$
for the data sets shown in Figure 1. This indicates that the k-means-clustered data sets $Q_1, Q_2, Q_3, Q_4, Q_5$ have the same class labels as the MLP-classified data sets $D_1, D_4, D_2, D_3, D_5$, respectively. Taking the data pair $\left( Q_4, D_3 \right)$ as an example, the samples in both $Q_4$ and $D_3$ are labeled as class 3 and further added into the labeled data set to update the training of the MLP classifier in the next iteration.
The stopping criteria of the MMD-SSL algorithm are that only one class is predicted by the MLP classifier for the unlabeled data set or that the number of samples remaining in the unlabeled data set is less than the threshold $\zeta > 0$. The rationale of the first criterion is that applying the k-means clustering algorithm is unnecessary for a data set having a single cluster, that is, when the MLP classifier predicts the same label for all unlabeled samples. In this situation, the training of the MMD-SSL algorithm stops and all samples from the unlabeled data set receive the predicted label. For the second criterion, we adopt an adaptive threshold determination strategy, i.e., we let $\zeta = \bar{K}$, which means that the training of MMD-SSL stops when the number of remaining unlabeled samples equals the number of labels predicted by the MLP classifier.
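To make the whole procedure concrete, the following Python sketch outlines one possible implementation of Algorithm 1 with the scikit-learn MLPClassifier and KMeans packages mentioned above and the mmd() function sketched earlier; the hidden-layer sizes, the greedy matching loop, and the handling of the stopping criteria are assumptions made for illustration rather than the authors' exact implementation.

```python
# Illustrative sketch of Algorithm 1; hyperparameters and implementation
# details are assumptions, and mmd() is the sketch given earlier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

def mmd_ssl(X_lab, y_lab, X_unlab, sigma2=0.01):
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    learner = MLPClassifier(hidden_layer_sizes=(400, 400), max_iter=300)
    while True:
        learner.fit(X_lab, y_lab)                        # line 2: train the MLP
        if len(X_unlab) == 0:
            break
        y_pred = learner.predict(X_unlab)                # line 3: classify the unlabeled data
        labels = np.unique(y_pred)
        K_bar = len(labels)
        if K_bar <= 1 or len(X_unlab) <= K_bar:          # line 21: stopping criteria (zeta = K_bar)
            break
        clusters = KMeans(n_clusters=K_bar).fit_predict(X_unlab)   # line 4: k-means clustering
        # lines 5-9: MMD between every cluster Q_i and every classified group D_k
        # (assuming every cluster and classified group is non-empty)
        mmd_mat = np.array([[mmd(X_unlab[clusters == i], X_unlab[y_pred == labels[k]], sigma2)
                             for k in range(K_bar)] for i in range(K_bar)])
        # lines 12-18: greedily pick the most consistent (Q_m, D_n) pairs; only the
        # samples lying in both Q_m and D_n are treated as highly confident
        confident = np.zeros(len(X_unlab), dtype=bool)
        pseudo = np.empty(len(X_unlab), dtype=labels.dtype)
        used = np.zeros_like(mmd_mat, dtype=bool)
        for _ in range(K_bar):
            m, n = np.unravel_index(np.where(used, np.inf, mmd_mat).argmin(), mmd_mat.shape)
            used[m, :] = True
            used[:, n] = True
            both = (clusters == m) & (y_pred == labels[n])
            confident |= both
            pseudo[both] = labels[n]
        if not confident.any():
            break
        X_lab = np.vstack([X_lab, X_unlab[confident]])   # line 19: enlarge the labeled set
        y_lab = np.concatenate([y_lab, pseudo[confident]])
        X_unlab = X_unlab[~confident]                    # line 20: shrink the unlabeled set
    return learner
```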

4. Experimental Results and Analysis

We conducted two experiments to validate the rationality and effectiveness of the proposed MMD-SSL algorithm. The MMD-SSL algorithm was implemented in the Python programming language, and the other SSL algorithms were downloaded from sci2s (https://sci2s.ugr.es/SelfLabeled (accessed on 21 December 2021)), which is maintained by the Soft Computing and Intelligent Information Systems (SCI2S) research group at the University of Granada. All the experiments were carried out on a personal computer equipped with an Intel(R) Quad-core 3.00 GHz i5-7400 CPU and 16 GB of main memory.

4.1. Rationality Validation

The first experiment evaluated the suitability of using the MLP classification algorithm and the k-means clustering algorithm in the MMD-SSL algorithm. We checked the testing performance of 28 different combinations of classification and clustering algorithms on a synthetic data set #1, which can be downloaded from our BaiduPan online storage (https://pan.baidu.com/s/1aDm8n7AA2ETtSumM5LXVBQ (accessed on 21 December 2021)) with extraction code nc19. The classification algorithms included MLP, Bernoulli naive Bayes (BNB) BernoulliNB (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html (accessed on 21 December 2021)), Gaussian naive Bayes (GNB) GaussianNB (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html (accessed on 21 December 2021)), support vector machines (SVM) SVC (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html (accessed on 21 December 2021)), k-nearest neighbors (k-NN) KNeighborsClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html (accessed on 21 December 2021)), decision tree DecisionTreeClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html (accessed on 21 December 2021)), and random forest RandomForestClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (accessed on 21 December 2021)). The clustering algorithms included k-means, agglomerative clustering AgglomerativeClustering (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html (accessed on 21 December 2021)), spectral clustering SpectralClustering (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html (accessed on 21 December 2021)), and BIRCH Birch (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html (accessed on 21 December 2021)). The synthetic data set #1 contained five classes and was divided into three parts, i.e., a labeled training data set (356 samples), an unlabeled training data set (692 samples), and a labeled testing data set (298 samples). The labeled training data set was randomly generated using the standard sklearn package make_blobs (https://scikit-learn.org/dev/modules/generated/sklearn.datasets.make_blobs.html (accessed on 21 December 2021)) with n_features = 2, centers = 5, and cluster_std = 0.3. The unlabeled training and labeled testing samples were generated by randomly adding small values in the interval [−0.08, 0.08] to the feature values of the labeled samples. The MLP classifier used in this experiment had two hidden layers, each with 400 ReLU activation nodes, a learning rate of 0.1, and a maximum of 300 training iterations.
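The generation procedure described above can be sketched as follows; the random seed and the way the perturbed copies are truncated to 692 and 298 samples are assumptions, while the make_blobs settings and the perturbation interval follow the text.

```python
# Sketch of the synthetic data set #1 generation; seed and truncation are assumptions.
import numpy as np
from sklearn.datasets import make_blobs

rng = np.random.default_rng(0)
X_lab, y_lab = make_blobs(n_samples=356, n_features=2, centers=5,
                          cluster_std=0.3, random_state=0)

def perturb(X, n_copies=1, eps=0.08):
    # Perturbed samples: feature values plus small noise drawn from [-eps, eps].
    return np.vstack([X + rng.uniform(-eps, eps, size=X.shape) for _ in range(n_copies)])

X_unlab = perturb(X_lab, n_copies=2)[:692]           # unlabeled training data
X_test, y_test = perturb(X_lab)[:298], y_lab[:298]   # labeled testing data
```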
The comparative results corresponding to 10 independent training runs of the MMD-SSL algorithm are summarized in Table 1. We can see that the clustering algorithms combined with the MLP classifier obtain better average testing accuracies than the clustering algorithms combined with the other classifiers. The average testing accuracies of the MMD-SSL algorithm with MLP classification and the k-means, agglomerative, spectral, and BIRCH clustering algorithms are 0.975, 0.958, 0.996, and 0.954, respectively. Due to its better generalization capability, the MLP classifier was selected to carry out the classification in the MMD-SSL algorithm. For MMD-SSL with an MLP classifier, we used k-means to conduct the clustering task for the unlabeled data set due to its simple model structure and acceptable semi-supervised learning performance.
Second, we validated the convergence of the proposed MMD-SSL algorithm on another synthetic data set #2, shown in Figure 2, which has the same data distribution as synthetic data set #1. Figure 3 shows the convergence process of the MMD-SSL algorithm. We can see that the MMD-SSL algorithm reaches convergence within only six iterations, i.e., all unlabeled samples are labeled by the gradually updated MLP classifier. The testing accuracies of the MLP classifier corresponding to these six iterations in Figure 2d are 0.711, 0.819, 0.836, 0.859, 0.869, and 0.869, i.e., the testing accuracy increases by more than 15 percentage points. This indicates that the designed MMD-SSL algorithm is reasonable and able to improve the generalization capability of a classifier by properly utilizing the unlabeled training samples.

4.2. Effectiveness Validation

In this experiment, we compared the testing accuracy and kappa coefficient [25] of the MMD-SSL algorithm with 10 other self-training and co-training SSL algorithms, namely SETRED [10], SNNRCE [11], APSSC [12], Self-Training-NN [8], DemoCoL [14], Tri-Training [15], RASCO [16], Rel-RASCO [17], CLCC [18], and Co-Training-NN [9]. The parameters of these SSL algorithms were set as follows.
  • SETRED: the maximum number of iterations was 40 and the size of the initial unlabeled sample pool was 75;
  • SNNRCE: the rejection threshold to test the critical region was 0.5;
  • APSSC: the spread of Gaussian was 0.3, evaporation coefficient was 0.7, and MT was 0.7;
  • Self-Training-NN: the maximum number of iterations was 40 and the number of nearest neighbors was 3;
  • DemoCoL: the number of nearest neighbors was 3 and the confidence of pruned tree was 0.25;
  • RASCO: the maximum number of iterations was 40 and the number of views was 30;
  • Rel-RASCO: the maximum number of iterations was 40 and the number of views was 30;
  • CLCC: the number of random forests was 6, the manipulative beta parameter was 0.4, the number of initial clusters was 4, the running frequency was 10, and the number of best center sets was 6;
  • Co-Training-NN: the maximum number of iterations was 40, the number of nearest neighbors was 3, and the size of the initial unlabeled sample pool was 75.
We selected 29 data sets from the KEEL data repository (https://sci2s.ugr.es/keel/category.php?cat=clas&order=name#sub2 (accessed on 21 December 2021)) to test the performance of these SSL algorithms. The detailed descriptions of these data sets are summarized in Table 2. Each data set was randomly partitioned into three parts: labeled training data, unlabeled training data, and labeled testing data. The ratios of labeled training data were set as 10% and 30% of each data set. For each labeled data ratio, the remaining samples were split into unlabeled training data (70%) and labeled testing data (30%). Each algorithm was trained using the labeled and unlabeled training data and tested on the labeled testing data.
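The following sketch shows one way to obtain such a three-way partition with scikit-learn; the stratified splitting and the fixed random state are assumptions, since the paper only states that the partitions were drawn randomly.

```python
# Sketch of the labeled / unlabeled / test partition used in the comparison;
# stratification and the random state are illustrative assumptions.
from sklearn.model_selection import train_test_split

def partition(X, y, labeled_ratio=0.1, random_state=0):
    # First split off the labeled training data (10% or 30% of the whole data set).
    X_lab, X_rest, y_lab, y_rest = train_test_split(
        X, y, train_size=labeled_ratio, stratify=y, random_state=random_state)
    # Then split the rest into unlabeled training data (70%) and labeled testing data (30%).
    X_unlab, X_test, _, y_test = train_test_split(
        X_rest, y_rest, train_size=0.7, stratify=y_rest, random_state=random_state)
    return X_lab, y_lab, X_unlab, X_test, y_test
```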
The testing accuracy and kappa value of each algorithm were calculated as the averages over the 10 testing accuracies and kappa values corresponding to 10 different data partitions. Table 3, Table 4, Table 5 and Table 6 present the detailed comparative results of these 11 SSL algorithms for the two labeled data ratios. We can see that (1) the MMD-SSL algorithm obtains higher testing accuracy and kappa values than the other 10 SSL algorithms for each labeled data ratio and (2) the testing accuracy and kappa value of each SSL algorithm increase gradually as the labeled training data increase. We also use the critical difference diagrams [26] shown in Figure 4 and Figure 5 to present the statistical analysis for this comparison of the 11 SSL algorithms over the 29 data sets. For the given significance level, the critical difference (CD) value is calculated as $\mathrm{CD} = q_{0.05} \sqrt{\frac{11 \times \left( 11 + 1 \right)}{6 \times 29}} = 2.516$, where $q_{0.05}$ is the critical value of Tukey's distribution. The statistical analysis indicates that the MMD-SSL algorithm obtains (1) significantly better testing accuracy and kappa values than APSSC, Self-Training-NN, Tri-Training, RASCO, Rel-RASCO, CLCC, and Co-Training-NN for SSL based on 10% labeled training data and (2) significantly better testing accuracy and kappa values than APSSC, SNNRCE, Tri-Training, RASCO, Rel-RASCO, CLCC, and Co-Training-NN for SSL based on 30% labeled training data. Overall, the MMD-SSL algorithm obtains the highest average testing accuracy and kappa value among the compared SSL algorithms, which demonstrates its effectiveness for SSL problems. In addition, the comparative results indicate that MMD-SSL is well suited to SSL on imbalanced data sets, because the kappa coefficient is one of the most popular indices for measuring a learning algorithm's ability to handle imbalanced classification and the average testing kappa values of MMD-SSL are higher than those of the other SSL algorithms.
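For reference, the critical difference value used in Figures 4 and 5 follows Demšar's formula, which can be sketched as below; the critical value q is taken as given rather than recomputed from Tukey's distribution.

```python
# Sketch of the critical difference (CD) computation for 11 algorithms and 29 data sets.
import math

def critical_difference(q, n_algorithms=11, n_datasets=29):
    return q * math.sqrt(n_algorithms * (n_algorithms + 1) / (6 * n_datasets))

# sqrt(11 * 12 / (6 * 29)) is about 0.871, so CD = 0.871 * q, which yields the
# CD value of 2.516 reported above for the critical value used in the paper.
```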

5. Conclusions and Future Work

In this paper, we proposed a maximum mean discrepancy-based semi-supervised learning (MMD-SSL) algorithm, which is a data distribution-oriented SSL algorithm. The unlabeled samples are gradually labeled by considering the consistencies between the clustering and classification results. The MMD-SSL algorithm belongs to the category of self-training SSL and performs three main steps: the training of a multilayer perceptron (MLP) classifier, the clustering of unlabeled samples, and the consistent labeling of unlabeled samples. The experimental results demonstrated the rationality and effectiveness of the designed MMD-SSL algorithm by comparing it with 10 other SSL algorithms on 29 benchmark data sets. Here, we briefly summarize the technical advantages of the MMD-SSL algorithm.
  • Highly confident pseudo labeling. Because the MMD criterion is used to measure the distribution consistency between the k-means-clustered samples and MLP-classified samples, the pseudo labeling considers both the inherent features (k-means clustering results) and extrinsic characteristics (MLP classification results) of unlabeled samples. This kind of pseudo labeling provides more confidence than the pseudo labeling done using only the internal or external information.
  • Good generalization capability of the classifier. The MLP classifier is trained based on the samples with highly confident pseudo labels and thus its testing performance is gradually improved with the increase of training samples. The experimental results have demonstrated this conclusion. The highly confident pseudo labeling leads to the good generalization capability of the MLP classifier.
  • Easy implementation. The MMD-SSL algorithm is easy to understand and implement in any programming language. Moreover, the training of the MMD-SSL algorithm converges as the number of unlabeled samples decreases.
Future work will focus on the following three directions. First, the MMD-SSL algorithm will be implemented on a distributed computation environment and used to deal with large-scale SSL problems. Second, an ensemble version of the MMD-SSL algorithm will be developed to further enhance labeling confidence of unlabeled samples. Third, the MMD-SSL algorithm will be applied in real-world applications, e.g., the identification of harassing phone calls and the detection of abnormal power consumption behavior.

Author Contributions

Methodology, Writing-Original Draft Preparation, Formal Analysis, Q.H.; Writing-Original Draft Preparation, Writing-Review and Editing, Y.H.; Supervision, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (61972261), the Basic Research Foundation of Shenzhen (JCYJ20210324093609026), and the Scientific Research Foundation of Shenzhen University for Newly-introduced Teachers (2018060).

Data Availability Statement

The data presented in this study are available in BaiduPan https://pan.baidu.com/s/1aDm8n7AA2ETtSumM5LXVBQ (accessed on 21 December 2021) with extraction code vn6j.

Acknowledgments

The authors would like to thank the editor and two anonymous reviewers who carefully read the paper and provided valuable suggestions that considerably improved the paper. They thank Philippe Fournier-Viger for helping them improve the linguistic quality of the manuscript so that it can be read more smoothly.

Conflicts of Interest

The authors declare no conflict of interest.

Acronyms

MMD: Maximum mean discrepancy.
SSL: Semi-supervised learning.
MMD-SSL: Maximum mean discrepancy-based semi-supervised learning.
MLP: Multilayer perceptron.
SVM: Support vector machine.
SETRED: Self-training with editing.
SNNRCE: Self-training nearest neighbor rule using cut edges.
SSC: Semi-supervised classification.
APSSC: Aggregation pheromone density based semi-supervised classification.
DemoCoL: Democratic co-learning.
RASCO: Random subspace co-training.
Rel-RASCO: Relevant RASCO.
CLCC: Classification algorithm based on local cluster centers.
BNB: Bernoulli naive Bayes.
GNB: Gaussian naive Bayes.
k-NN: k-nearest neighbors.
sci2s: Soft Computing and Intelligent Information Systems.
KEEL: Knowledge extraction based on evolutionary learning.
CD: Critical difference.

References

  1. Zhu, X.J.; Goldberg, A.B. Introduction to semi-supervised learning. Synth. Lect. Artif. Intell. Mach. Learn. 2009, 3, 1–130.
  2. Cohn, D.A.; Ghahramani, Z.; Jordan, M.I. Active learning with statistical models. J. Artif. Intell. Res. 1996, 4, 129–145.
  3. Beitzel, S.M.; Jensen, E.C.; Frieder, O.; Lewis, D.D.; Chowdhury, A.; Kolcz, A. Improving automatic query classification via semi-supervised learning. In Proceedings of the Fifth IEEE International Conference on Data Mining, Houston, TX, USA, 27–30 November 2005; pp. 8–15.
  4. Guillaumin, M.; Verbeek, J.; Schmid, C. Multimodal semi-supervised learning for image classification. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 902–909.
  5. Li, Y.; Sun, Y.; Contractor, N. Graph mining assisted semi-supervised learning for fraudulent cash-out detection. In Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Sydney, Australia, 31 July–3 August 2017; pp. 546–553.
  6. Tamposis, I.A.; Tsirigos, K.D.; Theodoropoulou, M.C.; Kontou, P.I.; Bagos, P.G. Semi-supervised learning of Hidden Markov Models for biological sequence analysis. Bioinformatics 2019, 35, 2208–2215.
  7. Van Engelen, J.E.; Hoos, H.H. A survey on semi-supervised learning. Mach. Learn. 2020, 109, 373–440.
  8. Scudder, H. Probability of error of some adaptive pattern-recognition machines. IEEE Trans. Inf. Theory 1965, 11, 363–371.
  9. Blum, A.; Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998; pp. 92–100.
  10. Li, M.; Zhou, Z.H. SETRED: Self-training with editing. In Proceedings of the 2005 Pacific-Asia Conference on Knowledge Discovery and Data Mining, Hanoi, Vietnam, 18–20 May 2005; pp. 611–621.
  11. Wang, Y.; Xu, X.; Zhao, H.; Hua, Z. Semi-supervised learning based on nearest neighbor rule and cut edges. Knowl.-Based Syst. 2010, 23, 547–554.
  12. Halder, A.; Ghosh, S.; Ghosh, A. Aggregation pheromone metaphor for semi-supervised classification. Pattern Recognit. 2013, 46, 2239–2248.
  13. Wu, D.; Shang, M.; Luo, X.; Xu, J.; Yan, H.; Deng, W.; Wang, G. Self-training semi-supervised classification based on density peaks of data. Neurocomputing 2018, 275, 180–191.
  14. Zhou, Y.; Goldman, S. Democratic co-learning. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, Boca Raton, FL, USA, 15–17 November 2004; pp. 594–602.
  15. Zhou, Z.H.; Li, M. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng. 2005, 17, 1529–1541.
  16. Wang, J.; Luo, S.W.; Zeng, X.H. A random subspace method for co-training. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks, Hong Kong, China, 1–8 June 2008; pp. 195–200.
  17. Yaslan, Y.; Cataltepe, Z. Co-training with relevant random subspaces. Neurocomputing 2010, 73, 1652–1661.
  18. Huang, T.; Yu, Y.; Guo, G.; Li, K. A classification algorithm based on local cluster centers with a few labeled training examples. Knowl.-Based Syst. 2010, 23, 563–571.
  19. Piroonsup, N.; Sinthupinyo, S. Analysis of training data using clustering to improve semi-supervised self-training. Knowl.-Based Syst. 2018, 143, 65–80.
  20. Wang, W.; Zhou, Z.H. A new analysis of co-training. In Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 21–24 June 2010; pp. 1–8.
  21. Zhan, W.; Zhang, M.L. Inductive semi-supervised multi-label learning with co-training. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; pp. 1305–1314.
  22. Zhou, Z.H. Disagreement-based Semi-supervised Learning. Acta Autom. Sin. 2013, 39, 1871–1878.
  23. Gretton, A.; Borgwardt, K.M.; Rasch, M.J.; Schölkopf, B.; Smola, A. A kernel two-sample test. J. Mach. Learn. Res. 2012, 13, 723–773.
  24. He, Y.L.; Huang, D.F.; Dai, D.X.; Huang, J.Z. General bounds for maximum mean discrepancy statistics. Math. Appl. 2021, 2, 284–288.
  25. Vieira, S.M.; Kaymak, U.; Sousa, J.M. Cohen’s kappa coefficient as a performance measure for feature selection. In Proceedings of the 2010 International Conference on Fuzzy Systems, Barcelona, Spain, 18–23 July 2010; pp. 1–8.
  26. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
Figure 1. Distribution consistency measures between k-means-clustered and MLP-classified data sets (σ² = 0.01).
Figure 2. The synthetic data set generated with the make_blobs package.
Figure 3. The convergence of the MMD-SSL algorithm.
Figure 4. Critical difference diagrams corresponding to accuracy comparisons in Table 3 and Table 4.
Figure 5. Critical difference diagrams corresponding to kappa comparisons in Table 5 and Table 6.
Table 1. Testing accuracies of different combinations of classification and clustering algorithms in the MMD-SSL algorithm.
Classification Algorithm | Clustering Algorithm | Maximum | Minimum | Mean
BNB | k-means | 0.673 | 0.673 | 0.673
BNB | Agglomerative | 0.673 | 0.673 | 0.673
BNB | Spectral | 0.673 | 0.673 | 0.673
BNB | BIRCH | 0.673 | 0.673 | 0.673
GNB | k-means | 0.957 | 0.947 | 0.952
GNB | Agglomerative | 0.907 | 0.907 | 0.907
GNB | Spectral | 0.947 | 0.930 | 0.937
GNB | BIRCH | 0.937 | 0.937 | 0.937
SVM | k-means | 0.953 | 0.953 | 0.953
SVM | Agglomerative | 0.920 | 0.920 | 0.920
SVM | Spectral | 0.997 | 0.997 | 0.997
SVM | BIRCH | 0.920 | 0.920 | 0.920
k-NN | k-means | 0.963 | 0.963 | 0.963
k-NN | Agglomerative | 0.930 | 0.930 | 0.930
k-NN | Spectral | 1.000 | 0.993 | 0.997
k-NN | BIRCH | 0.923 | 0.923 | 0.923
Decision tree | k-means | 0.987 | 0.867 | 0.888
Decision tree | Agglomerative | 1.000 | 0.873 | 0.919
Decision tree | Spectral | 0.920 | 0.830 | 0.863
Decision tree | BIRCH | 0.947 | 0.830 | 0.878
Random forest | k-means | 0.993 | 0.890 | 0.930
Random forest | Agglomerative | 0.923 | 0.867 | 0.902
Random forest | Spectral | 0.970 | 0.807 | 0.892
Random forest | BIRCH | 0.930 | 0.813 | 0.867
MLP | k-means | 0.990 | 0.967 | 0.975
MLP | Agglomerative | 0.987 | 0.937 | 0.958
MLP | Spectral | 0.997 | 0.990 | 0.996
MLP | BIRCH | 0.967 | 0.930 | 0.954
Table 2. Descriptions of the 29 benchmark data sets.
Data Sets | Samples | Features | Classes | Class Distribution
appendicitis | 106 | 7 | 2 | 85/21
australian | 690 | 14 | 2 | 383/307
banana | 5300 | 4 | 3 | 2924/2376
chess | 3196 | 36 | 2 | 1527/1669
coil2000 | 9822 | 85 | 2 | 9236/586
magic | 19,020 | 10 | 2 | 12332/6688
mammographic | 830 | 5 | 2 | 427/403
monk-2 | 432 | 6 | 2 | 204/228
nursery | 12,960 | 8 | 5 | 4320/4266/2/4044/328
page-blocks | 5472 | 10 | 5 | 4913/329/28/87/115
penbased | 10,992 | 16 | 10 | 1143/1143/1144/1055/1144/1055/1056/1142/1055/1055
phoneme | 5404 | 5 | 2 | 3818/1586
pima | 768 | 8 | 2 | 500/268
ring | 7400 | 20 | 2 | 3664/3736
saheart | 462 | 9 | 2 | 302/160
satimage | 6435 | 36 | 7 | 1533/703/1358/626/707/1508
segment | 2310 | 19 | 7 | 330/330/330/330/330/330/330
sonar | 208 | 60 | 2 | 111/97
spambase | 4597 | 57 | 2 | 2785/1812
spectfheart | 267 | 44 | 2 | 55/212
texture | 5500 | 40 | 11 | 500/500/500/500/500/500/500/500/500/500/500
thyroid | 7200 | 21 | 3 | 166/368/6666
tic-tac-toe | 958 | 9 | 2 | 332/626
titanic | 2201 | 3 | 2 | 1490/711
twonorm | 7400 | 20 | 2 | 3703/3697
vowel | 990 | 13 | 11 | 90/90/90/90/90/90/90/90/90/90/90
wine | 178 | 13 | 3 | 59/71/48
wisconsin | 683 | 9 | 2 | 444/239
zoo | 101 | 16 | 7 | 41/20/5/13/4/8/10
Table 3. Accuracy comparison of MMD-SSL and 10 other SSL algorithms on 10% labeled data.
Data Sets | MMD-SSL | SETRED (2005) | SNNRCE (2010) | APSSC (2013) | Self-Training (NN) | DemoCoL (2004) | Tri-Training (2005) | Rasco (2008) | Rel-Rasco (2010) | CLCC (2010) | Co-Training (NN)
SETRED, SNNRCE, APSSC, and Self-Training (NN) are self-training methods; DemoCoL, Tri-Training, Rasco, Rel-Rasco, CLCC, and Co-Training (NN) are co-training methods. Each algorithm column gives the Mean followed by the Std.
appendicitis0.75000.03950.73730.13040.79270.09000.67730.21260.75730.12290.82180.04710.73820.07280.79360.06820.75550.08370.85000.09390.78270.0721
australian0.79810.03230.80430.03620.80870.03290.83770.04040.80430.03620.84490.02680.80290.04210.70870.06990.74350.03890.85360.02930.80580.0319
banana0.87250.00760.86380.01190.86620.01250.82400.02080.86380.01190.84170.02260.86810.01010.85130.01290.84720.01210.58040.04190.84600.0166
chess0.93410.01260.81040.02830.82200.01840.83260.02180.80980.02920.91990.01750.83090.02940.79910.02780.80100.02670.66400.03740.79940.0272
coil20000.91470.00470.89260.00520.91280.00760.68380.03640.89040.00670.93220.00780.87950.01090.89320.00710.89460.00450.94030.00050.90090.0058
magic0.84800.00350.78400.00740.79180.00600.73790.01140.78400.00740.78420.01640.76780.00720.78000.00700.78120.00780.75940.01870.78120.0061
mammographic0.65280.08830.75800.05940.77730.05340.80220.04170.75910.06060.79630.05510.76990.06290.72770.05950.71510.05850.79850.04050.71280.0629
monk-20.74920.06180.64590.04090.69230.08310.65630.07840.64590.04520.90750.04520.64600.07510.65300.05650.68520.08940.70820.07660.63730.0823
nursery0.96540.00750.81010.00810.74990.01110.66830.01580.71430.01450.89510.00410.86980.00610.45870.01620.45730.01170.36030.01590.76980.0094
page-blocks0.94360.01250.93590.00820.93730.00870.80120.12020.92560.01050.90770.09560.93640.00800.84670.01290.85380.01110.89930.00330.93290.0104
penbased0.97790.00440.97780.00540.97300.00470.85520.00790.97780.00540.94740.01170.98010.00490.90540.01010.91040.01120.72810.04830.97520.0062
phoneme0.81060.00680.80460.02040.80590.01120.68320.02130.80440.02060.78740.01770.80460.01920.79980.02820.79510.02410.75850.02710.80590.0212
pima0.74110.01300.65650.05420.63830.06600.73320.03390.65650.05420.69670.05860.62650.05620.64070.04780.63410.05600.69540.03500.64860.0497
ring0.90240.01080.66910.02030.55730.00810.50490.00070.66910.02030.87410.01200.60410.00980.66680.02180.66920.02200.63880.04900.67760.0183
saheart0.64460.06410.63000.08350.62150.07400.65590.07720.64080.07840.68190.04720.62770.06900.60390.06270.60800.08350.65380.02730.63640.0557
satimage0.85420.00610.85700.01310.85470.01600.80110.01530.84660.01510.84620.01420.85210.01300.78180.02070.78550.02230.79440.02160.84910.0180
segment0.89540.01650.90650.01720.90220.02000.85190.02480.90610.01700.90260.01690.90740.01260.70520.04740.72600.05390.73590.03330.88270.0339
sonar0.66350.05970.66330.09640.64900.09610.70170.12730.66330.09640.60050.11120.63450.12760.62050.07970.62500.12760.56330.09400.68670.1160
spambase0.90490.01030.82810.01890.83270.01760.63240.13400.82810.01890.87770.01880.81100.02040.81640.01750.81380.02130.79660.02340.81400.0206
spectfheart0.78520.00600.72010.11890.74260.07290.37420.07440.68650.11700.73790.08280.69050.09110.72640.08600.70090.06210.79420.01660.51100.1227
texture0.97080.00410.95130.00800.95150.00780.87330.01350.95130.00760.89440.01560.95240.00560.81240.02160.82110.02430.71820.03380.94800.0070
thyroid0.95970.00310.90900.00930.92040.00530.65540.15720.89630.01110.93930.01750.90670.01070.89560.00960.89620.00810.92580.00250.90720.0077
tic-tac-toe0.67220.02480.72550.04060.73600.02980.67010.04940.71500.04570.69000.03090.70670.02680.67540.05360.70050.05320.64610.03920.71930.0443
titanic0.76280.00980.64020.14190.64160.14230.77560.02930.64020.14190.77560.02820.74150.03530.64020.14150.64020.14190.69790.01930.64020.1419
twonorm0.97400.00170.93580.00740.94590.00710.97590.00750.93580.00740.96450.00820.91090.00650.92420.01020.92650.00570.95890.01040.93640.0081
vowel0.41480.02260.48080.05330.48380.05910.43430.05170.48790.05210.41620.06390.49800.04910.31520.07780.31720.08420.22730.03620.48590.0680
wine0.98150.01170.94380.02490.92710.06640.96050.03590.94380.02490.94930.03900.92650.04460.66860.10340.61800.13990.94900.04000.87160.0963
wisconsin0.96000.00370.94780.04280.96220.02940.95930.02190.94780.04280.96500.02570.94620.03660.86420.06010.86290.05950.95220.03510.94340.0481
zoo0.88000.03400.93470.05480.92280.09370.93470.05480.92360.06830.93140.06500.93470.05480.60190.19020.66190.08790.83640.13690.83670.1171
Average0.83390.02010.80080.12630.80070.13150.74320.14740.79570.12560.83200.12470.79900.12750.73020.13700.73270.13580.74090.16750.78430.1295
Testing accuracies of the 10 other SSL algorithms: https://sci2s.ugr.es/sites/default/files/files/ComplementaryMaterial/SelfLabeled/SelfLabeled10.ods (accessed on 21 December 2021).
Table 4. Accuracy comparison of MMD-SSL and 10 other SSL algorithms on 30% labeled data.
Data Sets | MMD-SSL | SETRED (2005) | SNNRCE (2010) | APSSC (2013) | Self-Training (NN) | DemoCoL (2004) | Tri-Training (2005) | Rasco (2008) | Rel-Rasco (2010) | CLCC (2010) | Co-Training (NN)
SETRED, SNNRCE, APSSC, and Self-Training (NN) are self-training methods; DemoCoL, Tri-Training, Rasco, Rel-Rasco, CLCC, and Co-Training (NN) are co-training methods. Each algorithm column gives the Mean followed by the Std.
appendicitis0.85630.04240.81090.13890.82090.13050.82910.09470.83000.12400.86910.07340.80270.10390.72180.14100.78730.20230.84820.09830.83910.1194
australian0.85100.01990.81010.02290.81300.03070.85940.02750.81010.02290.85360.03000.80000.03230.77100.04140.77970.03770.85220.04620.81010.0188
banana0.89140.00350.87000.01240.87250.01100.83260.02200.87000.01240.87280.01330.86870.01140.86450.01180.86510.01270.58130.03040.86090.0131
chess0.95640.00620.86480.02390.86730.01200.90830.01880.86510.02400.95960.01740.79910.02470.86040.02370.86080.02130.67550.03770.85420.0257
coil20000.92090.00780.89920.00750.91580.00730.73170.03580.89770.00810.93200.00360.88200.01120.90010.00790.89650.01010.94030.00050.90180.0083
magic0.85260.00570.79500.00980.79520.00600.74500.00960.79500.00980.80160.00810.77760.01250.79400.00980.79400.01030.75310.01660.79320.0092
mammographic0.76000.01880.76200.07520.78760.07820.80330.04050.76200.07520.83000.05560.75570.07840.74010.07140.73530.07360.79020.04620.73290.0662
monk-20.96770.01850.75130.05880.74210.03960.78070.07070.75810.05790.94520.04340.67950.06630.73500.06400.74670.05130.73930.09750.70170.0688
nursery0.99680.00160.83570.01150.76730.01710.71430.00940.76870.01140.92120.01070.74150.01160.67370.01150.68330.01100.36330.01270.83550.0118
page-blocks0.94960.00140.94610.00600.94500.00580.85200.01770.94280.00620.92890.05230.94680.00750.88180.01430.87650.01330.89780.00050.94480.0062
penbased0.98570.00110.99010.00280.97400.00470.88770.00690.99010.00280.97290.00470.98880.00510.96640.00350.96400.00440.73250.03260.99020.0028
phoneme0.85600.00190.84700.01920.84360.01550.71500.02600.84700.01920.80290.02240.84640.01470.84200.02060.84340.01900.77160.01530.84640.0188
pima0.71170.00890.66940.04940.69030.05370.72530.02850.67330.04750.73050.04740.65900.06100.65110.04340.65630.02780.72670.03660.65500.0582
ring0.95660.00260.71040.01310.60070.01970.50490.00070.71040.01310.90890.00930.64530.00990.70970.01260.70990.01340.61700.02150.71320.0130
saheart0.69350.02640.66440.04410.67330.04810.64720.09690.66440.04080.70800.05200.66870.04510.67530.02640.62750.05650.67530.05650.65130.0590
satimage0.86940.00380.88620.01180.87230.02000.79720.01720.88220.01050.86930.01040.87380.01100.85530.01720.85440.01310.78630.02580.88500.0109
segment0.94320.00600.94110.01730.93680.01400.88870.01930.94110.01730.94160.01440.94590.01560.87970.02440.87750.02320.75890.03890.93770.0149
sonar0.84440.01190.76450.08400.74500.12320.78290.10250.76450.08400.73100.07390.76900.08070.68640.10720.71190.10640.62000.07580.76880.0849
spambase0.92640.00420.86950.00910.86580.01390.80950.02520.86950.00910.90520.01680.84880.01590.86620.01040.86530.00960.78810.02070.86690.0087
spectfheart0.77040.03460.71270.09880.75340.08740.43090.05910.70100.09250.71210.07990.73460.07570.70440.12600.67440.09080.79420.01660.57690.1289
texture0.99190.00110.98050.00490.96050.00470.88670.01230.98000.00550.93310.01080.97330.00550.92560.00990.92250.01410.72400.04810.97800.0055
thyroid0.96310.00720.91830.00800.92850.00490.49180.09030.91000.00680.95210.00590.91530.00460.90640.00630.90890.00520.92580.00250.91440.0049
tic-tac-toe0.79170.04960.79230.03480.79550.03850.72640.04610.79750.02740.76300.04820.71920.01970.77980.03590.78820.02670.66080.01680.78610.0307
titanic0.80000.00490.64070.14140.65520.14640.77740.03380.64070.14140.77920.02720.70830.03090.64070.14140.64020.14190.71970.01900.64070.1414
twonorm0.97260.00080.94390.00930.94930.00950.97580.00670.94390.00930.97010.00720.91310.01340.94050.00940.94110.00980.95730.01050.94350.0098
vowel0.67650.01450.77370.02400.75660.02320.69900.03930.78890.02730.59600.05090.78890.02450.61010.05070.61620.03440.22630.02680.76060.0239
wine0.97040.01890.92750.05570.88690.06270.95490.03360.93820.04610.96600.03730.93860.06310.71210.11820.68560.13160.93270.05970.85260.0757
wisconsin0.97370.00240.95350.04350.96370.02140.95780.02340.95350.04350.96660.02800.96350.02250.91700.05400.90830.05630.95930.02360.93720.0480
zoo0.90670.01330.93310.07140.92640.06270.93970.06790.93310.07140.91330.08480.93470.05480.80360.07820.82220.09280.88060.08810.91640.0830
Average0.88300.01170.83670.10010.83120.10220.78120.13610.83550.10040.86330.09960.82380.10450.79360.10270.79460.10340.74820.16490.82400.1096
Testing accuracies of the 10 other SSL algorithms: https://sci2s.ugr.es/sites/default/files/files/ComplementaryMaterial/SelfLabeled/SelfLabeled30.ods (accessed on 21 December 2021).
Table 5. Kappa comparison of the MMD-SSL algorithm and 10 other SSL algorithms on 10% labeled data.
Data Sets | MMD-SSL | SETRED (2005) | SNNRCE (2010) | APSSC (2013) | Self-Training (NN) | DemoCoL (2004) | Tri-Training (2005) | Rasco (2008) | Rel-Rasco (2010) | CLCC (2010) | Co-Training (NN)
SETRED, SNNRCE, APSSC, and Self-Training (NN) are self-training methods; DemoCoL, Tri-Training, Rasco, Rel-Rasco, CLCC, and Co-Training (NN) are co-training methods. Each algorithm column gives the Mean followed by the Std.
appendicitis0.38580.21560.09820.40710.13430.32570.33600.29990.16070.39670.12310.24620.05020.23470.19290.2756−0.00890.26130.34870.41450.07870.2266
australian0.61220.05810.59940.07620.60710.07070.67270.08040.59940.07620.68260.05670.59560.09110.40730.14590.47560.08270.70250.06290.60060.0695
banana0.77140.01380.72380.02430.72870.02530.64510.04220.72380.02430.67890.04600.73280.02030.69860.02670.69050.02410.07550.10450.68620.0343
chess0.85040.02760.61960.05680.64310.03660.66380.04380.61830.05880.83920.03530.51420.06120.59680.05590.60080.05380.31140.07940.59670.0551
coil20000.08090.02420.05290.03500.03850.03630.07770.03310.05150.03290.04040.04440.04910.03280.04120.04190.04500.05110.00000.00000.04900.0362
magic0.65770.01070.51630.01570.52520.01320.40420.02020.51630.01570.47220.04720.47950.01600.50740.01410.51000.01680.40400.05870.50660.0127
mammographic0.56070.07740.51600.12010.55430.10770.60640.08260.51830.12240.59370.10880.53930.12700.45480.12080.43030.11900.59810.08060.42230.1273
monk-20.53290.06880.29120.07900.37510.17100.30980.15650.29140.08770.81500.08850.28560.15570.29620.10940.36410.17860.40940.14520.28110.1653
nursery0.93800.02110.72010.01190.63330.01620.57510.02290.58280.02130.84480.00580.56190.00960.20590.02400.20390.01620.05560.02530.66270.0141
page-blocks0.73810.03840.64090.03970.64190.05280.37010.11190.60730.05250.64150.20040.64450.03690.19280.04990.23390.05900.06400.12880.62820.0513
penbased0.97610.00470.97530.00600.97000.00520.83910.00880.97530.00600.94160.01300.97790.00540.89490.01120.90040.01240.69780.05360.97240.0069
phoneme0.58400.01480.51920.05480.52200.02870.39940.03060.51860.05560.50240.04180.51930.05060.50740.07320.49750.06220.36050.14120.51770.0595
pima0.40980.02600.26520.10260.19940.11390.40250.08090.26520.10260.31220.14450.19630.08050.23760.09440.21350.11660.26490.13050.19130.1035
ring0.81700.02740.33380.04120.10680.01640.00000.00000.33380.04120.74750.02410.20190.02020.32930.04420.33410.04470.27250.09910.35110.0370
saheart0.20370.11290.18960.18800.13250.16330.28350.14960.21620.17960.25080.09790.10780.17980.09680.15860.12170.18460.06940.10520.12860.1405
satimage0.83760.00680.82350.01590.82060.01960.75610.01900.81080.01830.81020.01740.81770.01590.73040.02560.73500.02740.74240.02780.81350.0223
segment0.88740.02550.89090.02010.88590.02340.82730.02890.89040.01980.88640.01970.89190.01470.65610.05530.68030.06280.69190.03880.86310.0396
sonar0.21930.16940.30720.20970.28130.20170.39270.26390.30780.21000.16880.24290.25200.26650.21330.18340.22290.27880.09400.20490.37420.2352
spambase0.79200.02050.63930.03890.64650.03500.34200.20420.63930.03890.74390.03820.60340.03960.61360.03450.60930.04340.56460.05050.61380.0406
spectfheart0.03520.04310.22290.30570.18830.23300.10200.05400.14660.27060.36290.15270.16860.22380.20780.21880.09240.18300.00000.00000.11520.1792
texture0.97220.00560.94640.00880.94660.00850.86060.01490.94640.00840.88380.01720.94760.00620.79360.02370.80320.02680.69000.03710.94280.0077
thyroid0.67010.01640.25640.06850.21640.05080.07640.02580.22380.06210.26200.23770.21530.05910.16460.03100.16520.06190.00000.00000.24650.0571
tic-tac-toe0.36180.02130.36920.09160.41500.05630.30770.11020.36480.10220.25990.13400.20870.07850.28050.11350.32240.11650.07210.12560.33170.1058
titanic0.45330.04150.27000.19390.27170.19460.43760.07860.27000.19390.43780.08000.25840.13520.27010.19240.27000.19390.15950.11130.27000.1939
twonorm0.94540.00330.87160.01470.89190.01410.95190.01500.87160.01470.92890.01630.82190.01300.84840.02050.85300.01150.91780.02080.87270.0162
vowel0.46630.00690.42890.05860.43220.06510.37780.05690.43670.05730.35780.07020.44780.05400.24670.08560.24890.09260.15000.03980.43440.0748
wine0.93270.02250.91540.03700.89080.09870.94040.05400.91540.03700.92350.05870.88870.06780.50220.15350.41740.21220.92300.06040.81030.1384
wisconsin0.92800.01820.88240.09920.91580.06590.90880.04900.88240.09920.92380.05640.87870.08290.68700.14420.67920.14330.89140.08030.87110.1109
zoo0.62860.07710.90800.07900.89630.12510.90790.07900.89170.09700.90310.09250.90830.07880.45300.29200.50340.17470.78380.16940.75380.1957
Average0.62930.04210.54460.28310.53490.29590.50950.28380.53710.27990.59790.28550.50910.30070.42510.24060.42120.25250.39020.31370.51680.2798
Testing kappas of 10 other SSL algorithms: https://sci2s.ugr.es/sites/default/files/files/ComplementaryMaterial/SelfLabeled/SelfLabeled10.ods (accessed on 21 December 2021).
Table 6. Kappa comparison of the MMD-SSL algorithm and 10 other SSL algorithms on 30% labeled data.
Data Sets | MMD-SSL | SETRED (2005) | SNNRCE (2010) | APSSC (2013) | Self-Training (NN) | DemoCoL (2004) | Tri-Training (2005) | Rasco (2008) | Rel-Rasco (2010) | CLCC (2010) | Co-Training (NN)
SETRED, SNNRCE, APSSC, and Self-Training (NN) are self-training methods; DemoCoL, Tri-Training, Rasco, Rel-Rasco, CLCC, and Co-Training (NN) are co-training methods. Each algorithm column gives the Mean followed by the Std.
appendicitis0.25800.14110.49480.35530.42030.39900.52510.22770.48500.36790.57460.19110.42840.23120.14790.29580.39760.49580.45930.33750.40290.4151
australian0.63180.08370.61390.04870.61940.06610.71780.05530.61390.04870.70240.06060.59190.06880.53580.08540.55320.07760.70070.09660.61350.0398
banana0.78720.01190.73680.02510.74170.02250.66190.04410.73680.02510.74150.02730.73430.02310.72580.02380.72690.02550.08020.07920.71780.0263
chess0.96200.00580.72900.04790.73390.02410.81600.03770.72960.04820.91900.03500.59210.05070.72010.04770.72080.04280.33680.08070.70720.0518
coil20000.06170.01890.06960.05420.04300.04510.09630.03080.06720.05400.02960.03250.06290.05760.06800.05410.07220.05060.00000.00000.06930.0588
magic0.67470.01230.54070.02000.53210.01240.42700.02210.54070.02000.51440.02250.49890.02920.53850.02010.53840.02120.39450.05610.53530.0187
mammographic0.56920.07360.52360.15050.57530.15620.60830.08070.52360.15050.65980.11110.51040.15680.47970.14270.47040.14730.58120.09180.46440.1319
monk-20.90120.09320.50290.11400.47340.08690.55950.14150.51520.11390.89030.08610.35530.13730.46990.12340.49230.09880.47730.19670.40900.1243
nursery0.99440.00150.75900.01680.65830.02480.63610.01970.66210.01700.88360.01600.62330.01730.52190.01670.53600.01620.06390.01880.75900.0173
page-blocks0.82140.03390.69100.04060.67930.04430.42760.04020.68170.04170.66250.14950.70650.04110.36150.06680.35280.05660.00000.00000.69230.0375
penbased0.98870.00090.98900.00310.97110.00520.87520.00770.98900.00310.96990.00520.98760.00570.96270.00390.96000.00490.70260.03620.98910.0031
phoneme0.63550.02840.62450.05030.61420.04100.43760.04140.62450.05030.54430.04350.62210.04000.61220.05190.61650.04920.43800.04480.62270.0489
pima0.31890.06500.27570.11290.30100.12880.38650.07020.28670.10590.39520.10760.24270.13840.23850.07420.24300.05320.33560.10420.21540.1278
ring0.92090.00960.41750.02620.19500.04010.00000.00000.41750.02620.81750.01860.28560.02010.41620.02530.41650.02680.22820.04410.42330.0261
saheart0.27760.03510.27050.11130.24640.09470.27970.15990.27540.10030.32880.11330.23610.14370.27010.06630.17320.13920.12900.19130.19570.1408
satimage0.86490.00820.85970.01460.84210.02470.75160.02110.85480.01300.83860.01290.84450.01340.82140.02150.82050.01630.73110.03330.85810.0136
segment0.94590.00960.93130.02020.92630.01630.87020.02250.93130.02020.93180.01680.93690.01820.85960.02850.85710.02710.71870.04530.92730.0174
sonar0.52230.16870.52660.17030.48180.26010.55940.21280.52660.17030.45820.14820.52940.17160.36710.21980.41670.21360.23510.14630.53540.1780
spambase0.83730.00390.72620.01850.71700.02900.61360.04540.72620.01850.80160.03590.68280.03310.71950.02080.71750.02000.54680.04270.72120.0178
spectfheart0.13140.11990.20460.26650.28640.24390.14050.04390.18950.23080.35490.14560.27020.21210.16790.31660.10660.23950.00000.00000.14840.1924
texture0.99430.00160.97860.00540.95660.00520.87540.01350.97800.00610.92640.01190.97060.00600.91820.01090.91480.01550.69640.05290.97580.0060
thyroid0.82350.03070.32160.07490.28080.07590.07640.02320.29960.06190.51770.07850.29480.06340.25490.06740.26530.07830.00000.00000.30800.0599
tic-tac-toe0.49760.06960.52740.07330.53290.08880.43130.08970.54470.05600.44500.12320.23620.05600.49990.07840.51890.05680.06590.06990.50950.0696
titanic0.42940.04430.27130.19210.29210.19910.44980.08610.27130.19210.44140.07930.12620.11900.27130.19210.27000.19390.24420.06040.27130.1921
twonorm0.94740.00440.88780.01870.89860.01900.95160.01350.88780.01870.94030.01450.82620.02680.88110.01880.88220.01960.91460.02100.88700.0197
vowel0.76300.03350.75110.02640.73220.02560.66890.04330.76780.03000.55560.05600.76780.02700.57110.05580.57780.03780.14890.02950.73670.0263
wine0.94400.02510.89150.08260.83070.09310.93200.05040.90740.06820.94870.05600.90750.09480.56070.18580.52810.19680.89800.09050.77850.1145
wisconsin0.91030.03710.89620.09670.91980.04690.90600.05160.89620.09670.92740.06120.91920.04950.81070.12260.79230.13000.90950.05210.85760.1092
zoo0.89120.05390.90650.09870.89650.08770.91370.09720.90610.09880.88120.11100.90800.07890.74420.13570.76350.13460.83700.10940.87450.1311
Average0.70020.04230.61790.25570.59990.26030.57220.27480.61500.25620.67590.24270.57580.27840.53500.25070.54140.24670.40940.31050.59330.2636
Testing kappas of 10 other SSL algorithms: https://sci2s.ugr.es/sites/default/files/files/ComplementaryMaterial/SelfLabeled/SelfLabeled30.ods (accessed on 21 December 2021).
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
