A Two-Phase Approach for Semi-Supervised Feature Selection

Abstract: This paper proposes a novel approach for selecting a subset of features in semi-supervised datasets where only some of the patterns are labeled. The whole process is completed in two phases.


Introduction
Pattern classification [1] is one of the core challenging tasks [2,3] in data mining [4,5], web mining [6], bioinformatics [7], and financial forecasting [8,9]. The goal of classification [10,11] is to assign a new entity to a class from a pre-specified set of classes. As a particular case, the importance of pattern classification can be realized in the classification of breast cancer. There are two classes of patients: one belonging to the "benign" class, having no breast cancer, while the other belongs to the "malignant" class, which shows strong evidence of breast cancer. A good classifier will reduce the uncertainty of misclassifying patients into one of these two classes. Recently, a novel approach was presented using a real-coded genetic algorithm (GA) for a polynomial neural network (PNN) classifier [12]. Polynomials have powerful approximation properties [13] and excellent properties as a classifier [12].
One of the major problems in the mining of large databases is the dimension of the data. More often than not, it is observed that some features do not affect the performance of a classifier. There could be features that are derogatory in nature and degrade the performance of classifiers. Thus, one can have redundant features, bad features, and highly correlated features. Removing such features can not only improve the performance of the system but also make the learning task much simpler. More specifically, the performance of a classifier depends on several factors: i) the number of training instances; ii) the dimensionality, i.e., the number of features; and iii) the complexity of the classifier.
Dimensionality reduction can be done mainly in two ways: selecting a small but important subset of features, and generating (extracting) lower-dimensional data that preserve the distinguishing characteristics of the original higher-dimensional data [14]. Dimensionality reduction not only helps in the design of a classifier, but also helps in other exploratory data analysis tasks, such as assessing clustering tendency and deciding on the number of clusters by inspecting the scatterplot of the lower-dimensional data. Feature extraction and data projection can be viewed as an implicit or explicit mapping from a p-dimensional input space to a q-dimensional output space (q ≤ p) such that some criterion is optimized.
A large number of approaches for feature extraction and data projection are available in the pattern recognition literature [15][16][17][18][19][20]. These approaches differ from each other in terms of the nature of the mapping function, how it is learned, and what optimization criterion is used. Feature selection leads to savings in measurement cost, because some of the features get discarded. Another advantage of selection is that the selected features retain their original interpretation, which is important for understanding the underlying process that generates the data. On the other hand, extracted features sometimes have better discriminating capability, leading to better performance, but these new features may not have any clear physical meaning.
When feature selection methods use class information, the process is called supervised feature selection. Although the majority of feature selection methods are supervised in nature, there has been a substantial amount of work using unsupervised methods [21][22][23][24][25][26][27][28]. Apart from supervised and unsupervised feature selection, where classes are known and unknown, respectively, one more category of datasets is available, called semi-supervised, where classes are assigned to only some of the patterns. Methods that classify such partially labeled data fall under the category of semi-supervised methods, and feature selection on such datasets is called semi-supervised feature selection.
The contributions of the paper are as follows:
i. To find a subset of features that has maximum relevance and minimum redundancy (abbreviated to MRmr herein) by using the correlation coefficient. For this purpose, an algorithm (Algorithm 2) is presented to maintain a balance between features with high relevance and features with minimum redundancy.
ii. To determine a small feature subset that produces high classification accuracy on a supervised classifier, so as to minimize the time and complexity of implementing the method.
iii. To demonstrate the idea that if two clusters of a pair are nearly identical or very close to each other, and the class of one cluster of the pair is known, the same class can be assigned to the other cluster of the pair.
iv. To determine the classes or labels of all patterns using the proposed approach, which saves the time or cost that would otherwise be spent collecting labels for every pattern in the dataset.
v. The proposed method is a novel concept and can be applied to various real datasets.

This paper is organized as follows: the existing techniques of feature selection are presented in Section 2; Section 3 presents preliminaries of the methods used in the paper; Section 4 presents the proposed scheme with two algorithms; the experiments are presented in Section 5; and the results obtained from the proposed approach are discussed in Section 6, followed by conclusions in Section 7.

Existing Feature Selection Techniques
The problem of feature selection can be formulated as follows: given a dataset X ⊂ ℝ^p (i.e., each xᵢ ∈ X has p features), we have to select a subset of features of size q that leads to the smallest (or highest, as the case may be) value of some criterion. Let Ƒ be the given set of features and F the selected set of features of cardinality q, F ⊆ Ƒ. Let the feature selection criterion for the dataset X be represented by J(F, X) (a lower value of J(.) indicates a better selection). When the training instances are labeled, we can use the label information, but in the case of unlabeled data, this cannot be done [29].
In Saxena et al. [29], a new approach to unsupervised feature selection preserving the topology of the data is proposed. Here, the genetic algorithm (GA) is used to select a subset of features by taking the Sammon stress/error as the fitness function. The dataset with the reduced set of features is then evaluated using classification (1-nearest neighbor (1-NN)) and clustering (K-means) techniques. The correlation coefficient between the proximity matrices of the original dataset and the reduced one is also computed to check how well the topology of the dataset is preserved in the reduced dimension. In feature selection, the filter model [30][31][32], wrapper model [33], and embedded model [34] are the three main categories. The filter model relies on properties of the features measured with certain evaluation metrics. The wrapper model evaluates feature sets via the model's prediction accuracy with a combination of features. In the embedded model, the feature selection part and the learning part interact with each other. Though the wrapper and embedded models can achieve effectively selected features in certain cases, their computational cost is high in application [35].
Supervised feature selection relies on the classification labels, and the typical models are the Fisher metric [36] and Pearson's correlation coefficient [37]. Unsupervised feature selection is based on feature similarity or local information. The Laplacian score [38] is a typical unsupervised feature selection model, which measures the geometrical properties of the feature sets. Utilizing both the labeled and unlabeled data is a method to achieve optimal feature subsets, which is the focus of semi-supervised feature selection [39]. In [40], both labeled and unlabeled data are trained via spectral analysis to establish a regularization framework, and the authors demonstrate that unlabeled data can be helpful for feature selection. In [41], label propagation is conducted and a wrapper-type forward semi-supervised feature selection framework is proposed.
Xu et al. in [35] used Pearson's correlation coefficients to measure the feature-to-feature, as well as the feature-to-label, information. The coefficients are trained with the labeled and unlabeled data, and the combination of this two-fold information is performed with max-relevance and min-redundancy criteria. The experiments are applied to several real-life applications [35]. Besides purely labeled or purely unlabeled datasets, it is natural to have a mix of both, viz. labeled and unlabeled patterns, in a single dataset. Such semi-supervised datasets can arise intentionally or unintentionally. In the former case, labeling a dataset may be very costly due to the processes involved in obtaining labels after several experiments. In the latter case, labels are missing, doubtful, or effectively unavailable. Some related literature can be seen in Sheikhpour [42]. In this paper, we propose a two-phase approach to find the missing labels of patterns of a dataset that contains some labeled patterns.

Preliminaries of the Methods used in the Proposed Approach
The basics of some methods used in this paper are outlined below for quick reference. Classification: the process of separating data into groups based on given labels or similarities, as discussed above. In supervised classification, the training data contain the labels; the knowledge of these labels is used to determine the labels of the testing data.
Clustering: when labels are not attached to the patterns in advance, similarity among the patterns is used to group (cluster) the data. A common approach is to start with a random set of centroids (some or all of them can even be taken as existing patterns of the dataset), each representing its respective cluster, and then bring the closest patterns (i.e., those nearest to a centroid) into that cluster. After repeating this exercise, we obtain a set of centroids such that the patterns belonging to the clusters represented by these centroids do not shift from one cluster to another even on further repetition. This approach is commonly known as K-means clustering [43]. The K-means algorithm has some defects: 1) it is sensitive to the initial cluster centers, and a good or bad choice of initial centers affects the clustering results and the efficiency of the algorithm; 2) it is sensitive to outlier data and can end in a locally optimal solution [44].
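As a brief illustration (not the authors' code), the snippet below clusters a toy two-class dataset with K-means; using several random restarts (n_init) is a common way of mitigating the sensitivity to initial centroids noted above.

```python
# Minimal K-means usage sketch on synthetic two-cluster data (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[0, 0], size=(50, 2)),
               rng.normal(loc=[5, 5], size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # final centroids after iterative refinement
print(km.labels_[:10])       # cluster index assigned to each pattern
```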
Fuzzy C-Means (FCM): FCM is a clustering method that allows one point to belong to two or more clusters, unlike K-means, where only one cluster is assigned to each point. The method was developed by Dunn in 1973 [45] and improved by Bezdek in 1981 [46]. FCM provides a broader and softer assignment of a point to a cluster, which is why it is preferred over K-means. For more detail about clustering techniques, refer to the article by Saxena et al. [47]. Some semi-supervised FCM clustering algorithms are available in the literature; for an overview, refer to [48]. Garibaldi et al. [49] proposed an algorithm in which an effective feature enhancement procedure is applied to the entire dataset to obtain a single set of features or weights by weighting and discriminating the information provided by the user; by taking pair-wise constraints into account, they proposed a semi-supervised fuzzy clustering algorithm with feature discrimination (SFFD) incorporating a fully adaptive distance function. Although there can be other methods for semi-supervised feature selection, the objective of the present work is to apply a mechanism to obtain a subset of influential features with MRmr in Phase-I; the reduced dataset, containing only the influential features, is then clustered. In Phase-II, clustering of the reduced dataset is required and, for this purpose alone, any clustering algorithm could be applied; because of its flexibility in assigning a pattern to a cluster (as opposed to K-means clustering), FCM has been used.
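For reference, a from-scratch sketch of the standard FCM iteration (Bezdek's formulation) is given below; the function name fcm and the fuzzifier value m = 2 are illustrative choices, not the authors' implementation.

```python
# A minimal FCM sketch: alternate centroid and membership updates until convergence.
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Return (centroids, membership matrix U of shape (n, c))."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)                     # each row sums to 1
    for _ in range(n_iter):
        Um = U ** m
        centroids = (Um.T @ X) / Um.sum(axis=0)[:, None]  # weighted means
        # distance of every pattern to every centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))            # inverse-distance memberships
        U_new /= U_new.sum(axis=1, keepdims=True)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return centroids, U
```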
Correlation: correlation methods are used to find relationships among attributes (or columns) in a dataset. The most common way to determine the correlation between two column vectors is to compute Pearson's correlation (RRPC) coefficient. For any two column vectors X and Y, Pearson's correlation coefficient is calculated as

ρ(X, Y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (√(Σᵢ (xᵢ − x̄)²) · √(Σᵢ (yᵢ − ȳ)²)),

where xᵢ and yᵢ are the i-th values of the two feature vectors X and Y, with means x̄ and ȳ, respectively. When the value of ρ is close to 1 (one), the two column vectors are said to be highly correlated; on the other hand, ρ close to 0 (zero) indicates a very poor or no similarity between the two vectors.

Polynomial Neural Networks (PNNs): PNNs are a flexible neural architecture in which the topology is not predetermined or fixed, as in a conventional artificial neural network (ANN) [50], but is grown layer by layer through learning. The design is based on the group method of data handling (GMDH), invented by Ivakhnenko [51,52] as a means of identifying nonlinear relations between input and output variables. The individual terms generated in the layers are partial descriptions (PDs) of the data, being quadratic regression polynomials with two inputs [12].
PNN-based methods used for comparison in this paper are as follows:
• P1: Simple PNN method: the inputs fed into the input layer generate PDs in the successive layers [53].
• P2: RCPNN with gradient descent [53]: a reduced and comprehensible polynomial neural network (RCPNN) model generates PDs for the first layer of the basic PNN model, and the outputs of these PDs, along with the inputs, are fed to a single-layer feed-forward neural network. The network is trained using gradient descent.
• P3: RCPNN with particle swarm optimization (PSO): the same as the RCPNN except that the network is trained using particle swarm optimization (PSO) [54] instead of the gradient descent technique.
• P4: Condensed PNN with swarm intelligence: Dehuri et al. [55] proposed a condensed polynomial neural network using swarm intelligence for the classification task. The model generates PDs for a single layer of the basic PNN model; discrete PSO (DPSO) selects the optimal set of PDs and input features, which are fed to the hidden layer, and the model then optimizes the weight vectors using the continuous PSO (CPSO) technique [55].
• P5: All PDs with 50% training, used in the proposed scheme of [12].
• P6: All PDs with 80% training, used in the proposed scheme of [12].
• P7: Only the best 50% of PDs with 50% training, used in the proposed scheme of [12].
• P8: Only the best 50% of PDs with 80% training, used in the proposed scheme of [12].
• P9: Saxena et al. [29] proposed four methods for feature selection in an unsupervised manner using the GA. These methods also preserve the topology of the dataset while removing redundant features.
A brief summary of methods P5-P8, as used by Lin et al. [12], follows. In this work [12], a real-coded genetic algorithm (RCGA) is used to improve the performance of a PNN. The PNN tends to expand to a large number of nodes, which results in a large computation, making it costly in terms of time and memory. In this approach, the partial descriptions are generated at the first layer based on all possible combinations of two features of the training input patterns of a dataset. The set of partial descriptions from the first layer, the set of all input features, and a bias constitute the chromosome of the RCGA. The ability to solve a system of equations is utilized to determine the values of the real coefficients of each chromosome of the real-coded genetic algorithm for the training dataset, with the mean classification accuracy (abbreviated to CA herein) as the fitness measure of each chromosome. To adjust these values for unknown testing patterns, the RCGA is iterated using selection, crossover, mutation, and elitism.

Proposed Two-Phase Approach
The proposed approach includes two algorithms, with some assumptions, given in this section. The flow diagram of the proposed approach is given in Figure 1.

Assumptions
The dataset used for simulation purposes contains some patterns that have labels (or classes), while the remaining patterns do not. It is also assumed that all possible classes are included in the part of the dataset with known classes; in other words, no unlabeled pattern is assumed to possess a new class. It should be clarified that, to measure performance, we take datasets in which all patterns are labeled, hide the classes of some of the patterns, and treat these as unlabeled patterns. If we partition the dataset into a given number of clusters, each cluster shows one of the classes in the majority, and that class represents the cluster. For this reason, we determine the class to which the majority of patterns in a cluster belong, and the cluster is labeled with that class. The description of the datasets is given in Table 1. The number of patterns with known and unknown labels in each dataset is given in Table 2.

Algorithm of the Two-Phase Approach
Phase-I: Finding the reduced number of features (feature selection) and then finding the centroids of the clusters in the reduced dataset where classes are known.
Phase-II: Determining the classes of patterns of the other part of the reduced dataset where classes are unknown with the help of their closest clusters obtained in the first part.
The details are given below:
Algorithm 1
1. Take any dataset and shuffle its patterns randomly so that the labeled patterns are spread throughout the dataset. This ensures that every sample of patterns taken from this dataset will contain all available labels.
2. Divide the dataset into two parts, viz. a first part that contains labels and a second part in which the labels are hidden so that they appear to be absent.
3. Determine a number of subsets of features from the dataset based on the correlation-based MRmr approach. The details of the process are given in Algorithm 2.
4. Form various combinations of the feature subsets obtained by Algorithm 2. Apply each of these subsets (by taking the respective reduced datasets containing only the features of the subset) to a classifier and find the feature subset that provides the highest classification accuracy. Any supervised classifier can be used to determine the accuracy of this part; the chosen feature set will be the recommended reduced feature subset. 1-nearest neighbor (1-NN) has been applied in this paper to check the classifier's accuracy on the first part, using the knowledge of the class labels of the first part as supervised learning.
5. The entire dataset is reduced to the selected features, with the classes kept as extra information with each pattern. Divide this reduced dataset into two parts, viz. a first part that contains labels and a second part treated as unlabeled; the labels of the second part are hidden.
6. Cluster the first part of the reduced dataset, without taking the class labels into account, using any clustering method, and compute the cluster centroids. The number of clusters (centroids) is taken to be the same as the number of classes available in the dataset. The Fuzzy C-means clustering method is used for clustering.
7. Find the class of each cluster of the first part by taking the class label of the majority of patterns in that cluster (as per the assumption above).
8. Cluster the second part of the dataset, in which the class labels are hidden, and find its centroids using the same clustering method as in the first part.
9. Compare the two sets of centroids obtained from the two parts. The centroid of the first part that has the minimum distance to a centroid of the second part forms a pair with it. Thus, a set of pairs is formed, equal in number to the number of clusters (or classes).
10. In each pair, the class of the centroid of the first part is known. This class is assigned to the centroid of the other part, whose class is unknown.
11. Check the classes of the patterns of the second part obtained as above against their original classes, which were hidden. Compute the matching percentage to evaluate the classification accuracy (a code sketch of steps 6-11 is given after this algorithm).
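The following is a condensed sketch of steps 6-11 (clustering, majority-vote labeling, centroid pairing, and label transfer). K-means from scikit-learn is used here for brevity, although Fuzzy C-means can be substituted in the same role; the function name transfer_labels and the assumption of integer class labels are illustrative.

```python
# Sketch of the label-transfer phase: X1, y1 are the labeled part of the reduced
# dataset (y1 as a NumPy array of non-negative integer classes), X2 the unlabeled part.
import numpy as np
from sklearn.cluster import KMeans

def transfer_labels(X1, y1, X2, n_classes, seed=0):
    km1 = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit(X1)
    km2 = KMeans(n_clusters=n_classes, n_init=10, random_state=seed).fit(X2)
    # Step 7: each labeled cluster takes the majority class of its members.
    cluster_class = {k: np.bincount(y1[km1.labels_ == k]).argmax()
                     for k in range(n_classes)}
    # Steps 9-10: pair each unlabeled centroid with its nearest labeled centroid
    # and let it inherit that cluster's class.
    dists = np.linalg.norm(km2.cluster_centers_[:, None, :] -
                           km1.cluster_centers_[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)     # labeled centroid paired with each unlabeled one
    return np.array([cluster_class[nearest[k]] for k in km2.labels_])

# Step 11 (evaluation against the hidden labels y2_hidden):
# accuracy = (transfer_labels(X1, y1, X2, c) == y2_hidden).mean() * 100
```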
The algorithm to obtain the maximally relevant and minimally redundant features is given below:

Algorithm 2
Input: Dataset with d features, with labels given on some patterns (supervised) and not given on others (unsupervised).
1. Find the correlation coefficient of each feature with the class label for the part of the dataset where class labels are given (supervised). Sort these features in descending order of correlation coefficient values; let this list of features be F_Sup. Thus, a list of d correlation coefficients is obtained.
2. Take a certain number of features (given in Table 3) from the start of the feature list found in (1). These are the features with maximum relevance; let this list be denoted F_Sup_red.
3. Find the correlation coefficient of each feature in combination with every other feature (unsupervised). There will be d(d-1)/2 such combinations. Sort the features in ascending order of correlation coefficient values; let this list be F_UnSup.
4. Take a certain number of features (given in Table 3) from the start of the list obtained in (3) and denote it F_UnSup_red. These are the features with minimum redundancy. It is worth noting that each of these correlation coefficient values is generated by combining two features of the dataset, unlike the situation in (1), where each feature is combined with the class label only.
5. Find the poorest features F_poorest by computing F_UnSup_red − F_Sup_red, i.e., the subset containing those features of F_UnSup_red that are not in F_Sup_red. Thus, F_poorest is the subset of features that have minimum redundancy but very low relevance; they are dropped, as they would be harmful.
6. Find the best features F_best by taking the intersection of F_UnSup_red and F_Sup_red. This set is determined in order to fix the size of the combinations of features to be formed from the features of F_Sup_red. F_best denotes the features with maximum relevance and minimum redundancy.
However, this is not necessarily the best feature set as far as CA is concerned; this step is performed only to decide the size of the final feature set. Output: F_Sup, the list of features ordered by relevance, and F_best, a sample subset of features with maximum relevance and minimum redundancy (a minimal code sketch of this algorithm is given below).
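A minimal sketch of Algorithm 2 under stated assumptions: features are numeric, labels are integers, and the subset sizes of Table 3 are passed as n_rel and n_red. How a per-feature ordering is derived from the d(d-1)/2 pairwise coefficients in steps 3-4 is not fully spelled out, so the sketch adopts one possible reading (collect features from the least-correlated pairs first).

```python
# Correlation-based MRmr subset construction (illustrative reading of Algorithm 2).
import numpy as np
from itertools import combinations

def mrmr_subsets(X_labeled, y, X_all, n_rel, n_red):
    d = X_all.shape[1]
    # Steps 1-2: relevance = |correlation| of each feature with the class (labeled part).
    relevance = np.array([abs(np.corrcoef(X_labeled[:, j], y)[0, 1]) for j in range(d)])
    F_Sup_red = set(np.argsort(-relevance)[:n_rel])        # top-n_rel most relevant
    # Steps 3-4: sort the d(d-1)/2 feature pairs by |pairwise correlation|, ascending,
    # and collect features from the least-correlated pairs (one possible reading).
    pairs = sorted(combinations(range(d), 2),
                   key=lambda p: abs(np.corrcoef(X_all[:, p[0]], X_all[:, p[1]])[0, 1]))
    F_UnSup_red = set()
    for i, j in pairs:
        F_UnSup_red.update((i, j))
        if len(F_UnSup_red) >= n_red:
            break
    # Steps 5-6: drop low-redundancy but irrelevant features; F_best fixes the final size.
    F_poorest = F_UnSup_red - F_Sup_red
    F_best = F_UnSup_red & F_Sup_red
    return F_Sup_red, F_poorest, F_best
```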

Pseudo-Code for Proposed Two-Phase Method
Input: A dataset with labeled and unlabeled patterns.
a. Divide the dataset into two parts: a part with labeled patterns and a part with unlabeled patterns.
b. Find an appropriate feature set consisting of features with high relevance and low redundancy by eliminating very poor features from it, based on correlation coefficient values among features.
c. Iteratively apply various combinations of feature sets to any classifier on the first part of the dataset and find the feature set that produces the highest CA (a sketch of this step is given after this pseudo-code).
d. Obtain the reduced dataset by keeping only the features obtained above.
e. Cluster parts 1 and 2 of the reduced dataset and find the centroids of each part.
f. The class label of each cluster obtained from part 1 of the reduced dataset is assigned to its nearest cluster in part 2. Thus, classes can be assigned to the patterns of part 2, which were not labeled initially, and the CA can be determined from these patterns.
Output: A small but influential feature set used to decide the labels of the unknown patterns.
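Steps c and d can be sketched as follows; the helper name best_subset, the use of scikit-learn's 1-NN classifier with 5-fold cross-validation, and the way candidate subsets are enumerated are assumptions for illustration, not the authors' exact procedure.

```python
# Evaluate candidate feature subsets with 1-NN on the labeled part and keep the best one.
from itertools import combinations
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def best_subset(X_labeled, y, candidate_features, subset_size):
    best_feats, best_acc = None, -1.0
    for feats in combinations(sorted(candidate_features), subset_size):
        clf = KNeighborsClassifier(n_neighbors=1)
        acc = cross_val_score(clf, X_labeled[:, list(feats)], y, cv=5).mean()
        if acc > best_acc:
            best_feats, best_acc = feats, acc
    return best_feats, best_acc   # the reduced feature subset and its CA
```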

Complexity of Algorithms
Xu et al. [35] proposed an RRPC-based semi-supervised feature selection approach. Their criterion uses maximum relevance when selecting features from the labeled data and minimum redundancy when selecting features from the unlabeled data, and it is incremental in nature, as the features are added one by one. The approach used in this paper also applies relevance and redundancy criteria to determine an optimum set of features on the basis of correlation coefficient values; the selection of an optimum subset of features is given in Algorithm 2. The emphasis in this paper is on maintaining a good composition of features having high relevance but minimum redundancy. Features with minimum correlation coefficient values are dropped if they are not among the features with high relevance. An advantage of the present method is that it is simple and easy to implement, and it is a new but efficient approach. No incremental approach is used in this paper; all the features are available in the dataset in advance. Another advantage is the use of knowledge about clusters, a concept that, to the best of our knowledge, is not reflected in the literature. In a dataset, the training data can be divided into clusters with some centroids (K of them, for instance), and the test data are also clustered with some centroids. Since the dataset is the same, the clusters must also match across the two partitions, i.e., the training data and the testing data. The clusters may appear in a different order, but each centroid in the training data should match its nearest centroid in the testing data. Therefore, the label attached to a cluster of the training data must be the same as that of its closest centroid in the test data.
The computational complexity of the present two-phase algorithm is the sum of the following factors: O(n(1 + d(c²i + 1))), with K = 1 since 1-nearest neighbor is used in this work, where n is the number of observations (patterns) in the dataset under process, d is the number of features (attributes), c is the number of clusters (classes or labels), and i is the number of iterations. In the RRPC approach proposed by Xu et al. [35], the computational complexity is given by the sum of the terms (a) due to the similarity matrices, O(l(d+1)² + ud²), and (b) due to feature ranking, O(n(d+1)² + nd²); hence, the total complexity is a + b, where l + u = n, with l labeled and u unlabeled patterns.

Experiments
The proposed two-phase method was run on an i5 machine using MATLAB. The semi-supervised dataset was considered to contain two types of patterns: one part that contains the labels and the other part that does not. To find the reduced number of features in the whole dataset, we followed the approach proposed by Xu et al. [35] and applied the Karl Pearson correlation coefficient with MRmr. To obtain a feature subset with MRmr, Algorithm 2 is applied. All combinations of features from the feature set, as mentioned in Algorithm 2, are formed; the CA of every combination of the reduced feature set is found with the 1-NN method, and the feature subset that provides the maximum CA is selected. Thus, we have a reduced set of features, and the whole dataset is transformed to the reduced number of features. The reduced dataset is divided into two parts: one part that contains class labels and another part in which the class labels are kept hidden. This completes Phase-I of the proposed approach. Prior to beginning Phase-II, any clustering algorithm (K-means clustering in the present work) is applied to divide the reduced dataset obtained in the first part into a number of clusters equal to the number of classes available in the first part. Now, the centroids (equal in number to the classes in the first part) are collected. If we have c classes and r reduced features, we obtain c cluster centroids (rows), and each centroid has r values (columns). In each cluster, the class attached to the majority of patterns is considered the class of that cluster; this has been verified by several experiments with this method.
Phase-II starts with the second part of the reduced dataset, which is unsupervised; no classes are known. This part is divided into a number of clusters equal to the number of available classes of the dataset, and the centroids of the obtained clusters are found. Each cluster will contain a number of patterns with unknown class. These centroids are compared with those obtained in the first part of Phase-I, and a good mapping of each centroid of the second part onto a centroid of the first part is found. Pairs of clusters are formed such that each pair contains one cluster centroid from the first part and one cluster centroid from the second part of the dataset. As the class of the first cluster of each pair is known, as mentioned in Phase-I, the same class is assigned to the second cluster, belonging to the second part. The second part is already clustered, and the class of each of its clusters is decided using this approach; therefore, each pattern belonging to a cluster of the second part receives the class tagged to that cluster. For verification, we check the class of each pattern of the clusters of the second part against the original dataset and compute the accuracy of the matching. In this manner, we also find the classes of the second part, which were unknown originally.
Methods Used: For classification, we used 1-NN in Phase-I. The Karl Pearson coefficient was used to find relevance and redundancy. Fuzzy C-means clustering [46] was used to divide the datasets into C clusters, where C is the number of available classes. Summaries of these methods are presented in Section 3.
We used nine datasets, summarized in Table 1. The datasets are divided into two parts; the ratios of known (labeled) to unknown (unlabeled) classes, with the number of patterns in percent, are given in Table 2. The sizes of the feature subsets used for processing are given in Table 3. The correlation coefficient values for a typical dataset (WBC) are given in Tables 4 and 5 for relevance and redundancy purposes, respectively. In Table 4, as well as in Table 5, the sets of selected features are also given, based on higher and lower correlation coefficient values, respectively. The features considered good in Table 4, with high correlation coefficient values, must be retained in the final feature set, whereas features having low correlation coefficient values must not find a place in the final feature set. Table 6 provides the results obtained using the proposed two-phase method. Table 7 shows the distances between centroids obtained from the second-part and first-part clusters. Table 8 presents a comparison of the CA values obtained by various methods reported in the literature. Table 9 presents a list of the features used by the various methods for processing.
* P10: Proposed method with 70% known labels.
* P11: Proposed method with 50% known labels.
* P12: Proposed method with 40% known labels.

Results and Discussion
This paper proposes a two-phase method to find the most probable labels for some of the patterns in a dataset. The main concept is to utilize the knowledge of the patterns in the dataset that are labeled. The results of the experiments are shown in various tables. Table 1 presents the description of the datasets used for the experiments. These benchmark datasets are collected from the UCI repository [56,57]. The synthetic data are used to verify the correctness of the method, as it is obvious that the results for synthetic data should yield maximum accuracy. For all real datasets, the ratios of the number of labeled patterns vs. the number of unlabeled patterns (shown in percent) are given in Table 2. It is apparent from this table that all experiments have been performed on three splits of each dataset in terms of known and unknown numbers of patterns: (70,30), (50,50), and (40,60).

Table 6 presents the results obtained from experiments with various parameters. Column 1 shows the name of the dataset. Column 2 indicates what percentage of the dataset has been used with known labels; this information is also shown in Table 2. The correlation coefficients calculated as described in Section 3 are used to find the features with maximum relevance for each dataset. The third column lists the features producing higher correlation coefficient values with the corresponding class labels; these are termed features with maximum relevance in Table 6. Similarly, the fourth column shows the features with low redundancy, based on minimum correlation coefficient values calculated by taking features together in pairs; these are termed features with minimum redundancy in Table 6. Such features, when not also relevant, are considered harmful for classification purposes. Column 5 gives the list of features finally taken as the reduced subset; the dataset is reduced to this set of features. The CA obtained using 1-NN with the ratio shown in column 2 is listed in column 6. For the first part of the dataset, with labeled patterns, Fuzzy C-means clustering is used to find the centroids; these centroids are shown in column 7. The centroids obtained for the remaining part of the dataset, with unlabeled patterns, are shown in column 8. As explained in the description of the method, pairs are made between each centroid of column 7 and a centroid of column 8; these pairs are given in column 9. The class of each pattern in the second part (with unlabeled patterns), determined as described in the method, is compared with the actual class assigned to that pattern, and if it matches, the number of matching patterns increases by 1. The percentage of matching patterns is shown in column 10 of Table 6.
From Table 6, it is noted that for the synthetic data there are three features (1, 2, 3) found to have maximum relevance, whereas two features (4, 5), albeit having minimum redundancy, have values that are too poor and should, therefore, not be retained in the final feature subset. The final subset keeps only two features (1, 2). The selection of the final feature set proceeds as follows. Suppose we have five features in a dataset and decide to keep only 50% of the total number of features on the basis of high correlation coefficient values; thus, we take round(5/2) = 3 features for which the correlation coefficient values between feature and class are highest. Then, we take the three features that have the lowest correlation coefficient values when each feature is correlated with the other features. Now, we take those features that are only in the former set and discard those that are in the latter set, giving a feature set with three features. Again, this is reduced to 50%, i.e., round(3/2) = 2, so the final feature set will have only two features. Which two features are retained in the final feature set is decided as follows. Form all combinations of two features from the last-determined set of three features; we can have three subsets, say, (1, 2), (2, 3), and (1, 3). Find the CA with each of these three subsets of features; the subset that gives the maximum CA provides the reduced feature set. The CA using features 1 and 2 of the dataset comes out to be 100% for each of the three ratios of labeled data (70%, 50%, and 40%).

Let us understand the centroid mapping at this stage using an example. For the synthetic data, there are two features and two classes only. The first two rows in column 7 of the table (1: 5.06, 4.80; 2: 19.60, 19.61) give the centroid values for cluster 1 and cluster 2, respectively. These centroids are calculated for the clusters obtained by partitioning the first part of the dataset, where 70%, 50%, or 40% of the patterns are labeled. Similarly, in column 8 of Table 6, we have the centroids of cluster 1 and cluster 2 as (1: 5.00, 4.78; 2: 19.49, 19.63), respectively; these are computed from the second part of the dataset, where labels are not given, for the remaining 30%, 50%, and 60% of the data. When we compare these two sets of centroids, we find that the first centroid of the unlabeled data (1: 5.00, 4.78) is very close to the first centroid of the labeled data (1: 5.06, 4.80). This is recorded as the pair (1,1), which means that the first centroid of the unlabeled data is closest to the first centroid (cluster 1) of the labeled data. Similarly, the pair (2,2) is obtained. Since the class of the labeled centroid in each pair is known, the unlabeled centroids of clusters 1 and 2 take the same class as their closest labeled centroids. In this manner, each pattern of the unlabeled clusters is labeled one by one and simultaneously compared with its original class. If the class of an unlabeled pattern obtained by this method matches the actual class assigned to it, the number of matching patterns (and hence the CA) increases by 1. We continue this exercise for all patterns in unlabeled clusters 1 and 2.
After checking all patterns, the percentage of matching patterns in the unlabeled part of the dataset is found and shown in column 10 of Table 6. For the synthetic data it is 100%, meaning that all patterns that were unlabeled are correctly labeled. The same exercise is repeated for all the other datasets. For datasets with up to 20 features, at most half of the features are chosen at either the maximum-relevance or the minimum-redundancy level. For datasets with more than 20 features, we took only the 10 highest and the 10 lowest correlation coefficient values. The intersection of these sets gives a refined feature set, from which we take 50% of the features and repeat the exercise as above. The Sonar and Ionos datasets are used with 60 and 34 features, respectively. When deciding the proximity of centroids, the Euclidean distance was used. From the table, it is evident that for the iris data, features 3 and 4 were chosen as good features, but only one feature (number 4) was kept in the final feature set, which reduces redundancy. Due to the random partitioning of the dataset, feature 4 is chosen for 70% and 40% labeled data, whereas for 50% labeled data feature 3 is chosen as the best feature, with a CA of approximately 95%. After clustering the labeled and unlabeled data and pairing their centroids, the accuracy of predicting the class of the unlabeled data is also high, nearly 95%. For the wine data, where the CA was 87-92% while selecting features for the labeled data, the prediction accuracy of the proposed approach is 78-86% for the three folds of the datasets. It is worth mentioning that every time we predict accuracy, the method is iterated at least 10 times so that all parts of the datasets are properly validated; the maximum values obtained from these 10 iterations are given in Table 6. Similarly, the WBC data produce a prediction accuracy in the range of 96-98% after selecting features at an accuracy of approximately 96-97%. The selection of feature subsets for the wine and WBC data is explained in Section 3.
Various values are shown similarly in this table. The liver data produce a lower prediction accuracy of 52-58%, with a CA of 56-63% at the time of reducing the features for the labeled data. This is poor accuracy compared with what is obtained for the other datasets; it is observed that the smaller the CA obtained for the reduced feature subsets, the poorer the prediction accuracy. The distances between the centroids of each cluster and those of the other clusters were computed to find the minimum distance for each centroid; the minimum distances are shown in Table 7. As an example, for the synthetic data with 70% of the patterns labeled, the first centroid has values 5.06, 4.80 and the second centroid has values 19.60, 19.61 for the labeled data. For the unlabeled data, the corresponding values of the two centroids are 5.00, 4.78 and 19.49, 19.63, respectively. The distance of each labeled centroid to both unlabeled centroids is computed, and the minimum distance is found. The pair (1,1) shown in column 5 of Table 7 means that the first centroid (labeled) has a minimum distance of 0.0632 to the first centroid of the unlabeled data. Similarly, 0.1118 is the minimum distance between the second centroid (labeled) and the second centroid (unlabeled). All datasets have been subjected to this exercise, and only the minimum values are given in column 7 of Table 7. As a further elaboration, for the sonar data, the distances between the first centroid (labeled) and both centroids (unlabeled) are given: the pair (1,2) means that the first centroid (labeled) has distances 0.2250 and 0.0393 to the first and second centroids (unlabeled), respectively. The bold value indicates the minimum. As the second centroid (unlabeled) has the minimum distance to the first centroid (labeled), the pair is taken as (1,2). The prediction accuracy, or the matching-pattern percentage shown in the 10th column of Table 6, is computed on this pairing basis. Thus, a (1,2) pair means that if a pattern belongs to the second cluster of the unlabeled dataset, it will have the same class as that of cluster 1 of the labeled data. The class of cluster 1 has already been determined by taking the class of the majority of the patterns belonging to that cluster. Each pattern in the unlabeled data, when clustered, will belong to one of the clusters, and its class will be the same as the class of the cluster with which it is paired. All such patterns that are correctly classified are counted to compute the matching percentage or prediction accuracy.
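The synthetic-data pairing described above can be reproduced in a few lines using the centroid values quoted in the text (a worked check, not the authors' code):

```python
# Reproducing the synthetic-data centroid pairing of Table 7.
import numpy as np

labeled   = np.array([[ 5.06,  4.80], [19.60, 19.61]])   # centroids, labeled part
unlabeled = np.array([[ 5.00,  4.78], [19.49, 19.63]])   # centroids, unlabeled part

d = np.linalg.norm(labeled[:, None, :] - unlabeled[None, :, :], axis=2)
print(d.min(axis=1).round(4))   # [0.0632 0.1118] -- the minimum distances quoted above
print(d.argmin(axis=1))         # [0 1] -> pairs (1,1) and (2,2) in the paper's 1-based notation
```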
Table 8 presents a comparison of performance using various methods under different parameters. The proposed two-phase method (shown as P10, P11, and P12) performs very close to the other methods in terms of the CA obtained when finding the classes of unknown/hidden patterns. The other methods, P1-P9, reported in the literature perform more or less similarly to P10-P12. Table 9 presents a comparative study of the number of features used by the reported methods and the proposed method; methods P10-P12 use a much smaller number of features to predict the classes of unknown patterns.

Conclusions
In this paper, a novel two-phase scheme for feature selection on semi-supervised datasets and their classification has been presented. It is often observed that collecting the labels or classes of a large number of patterns can be quite expensive and may require a lot of effort. If we can manage to collect a few patterns with labels, it is worth using the knowledge of those labels to find the labels of the other patterns as well. For this purpose, Fuzzy C-means clustering has been applied to cluster the data. For the part of the data where labels are known, the class of each cluster can be decided by the highest number of patterns belonging to a particular class (repeating the exercise in the case of a tie). The data whose labels are unknown can likewise be clustered. From the two sets of clusters, a suitable mapping can be devised using the correspondence between the centroids. A pattern belonging to the unlabeled data but closely matched to a cluster of the labeled data can be given the same label as that cluster. For feature selection, Pearson's correlation coefficient can be used to find the maximally relevant and minimally redundant features. After an extensive experimental study on several datasets, it is noted that the proposed scheme gives good classification accuracy when classifying unlabeled data. Moreover, only a very small number of features is required to obtain this accuracy, which saves time and space. There are other schemes with variations available in the literature, but the proposed method presents a simple approach and utilizes the knowledge and properties of clusters to classify patterns without labels. Taking these facts together, it can be concluded that the proposed two-phase approach for feature selection on semi-supervised datasets can be applied to many other applications in science, engineering, data mining, health care, and similar areas. The proposed method, however, fails to predict correct classes for datasets with many classes, i.e., more than four classes. Although most real datasets have two or three classes, and the proposed scheme deals well with such cases, extending the present method to multiclass datasets remains future work for researchers.

Figure 1 .
Figure 1. The flow diagram of the proposed approach.

Table 1 .
Description of datasets used (original).

Table 2 .
The number of patterns with known and unknown labels in the dataset.

Table 3 .
Size of feature sets retained for computing maximum relevance and minimum redundancy. * Count this number on the basis of higher correlation coefficient values between features and classes. ** Count this number on the basis of low correlation coefficient values between each pair of features.

Table 6 .
Results of various parameters for datasets.

Table 7 .
Pairing of cluster centroids between labeled and unlabeled datasets.

Table 8 .
Performance comparison with some other schemes.

Table 9 .
Number of features used under some other schemes.