Augmentation of Densest Subgraph Finding Unsupervised Feature Selection Using Shared Nearest Neighbor Clustering

Determining the optimal feature set is a challenging problem, especially in an unsupervised domain. To mitigate this, this paper presents a new unsupervised feature selection method, termed densest feature graph augmentation with disjoint feature clusters. The proposed method works in two phases. The first phase focuses on finding the maximally non-redundant feature subset, and disjoint features are added to the feature set in the second phase. For experimental validation, the efficiency of the proposed method has been compared against four existing unsupervised feature selection methods on eight UCI datasets in terms of two performance criteria, namely classification accuracy and Matthews correlation coefficient. The experimental analyses show that the proposed method outperforms the considered methods.


Introduction
Over the past two decades, the pace at which humanity produces data has increased continuously [1]. Improvements in data acquisition methods have inevitably led to the availability of high dimensional datasets, especially in applications belonging to domains like pattern recognition, data mining, and computer vision. High dimensional data not only increases execution time and memory requirements but also degrades performance and restricts the applicability of machine learning methods due to the curse of dimensionality [2,3]. To mitigate this, various methods exist for reducing features, which can be broadly categorized into two approaches, i.e., feature extraction and feature selection. In feature extraction, the feature set is transformed into a new feature set of lower dimension. On the contrary, feature selection selects a subset of features from the original feature set by eliminating redundant or irrelevant features [3]. Moreover, the feature selection approach retains the original features of the data, which is advantageous in explaining the model.
In general, feature selection is a combinatorial optimization problem [4] that has facilitated various research fields such as data comprehension and visualization [5], DNA microarray analysis [6], text classification [7], and image annotation [8]. Existing feature selection methods can be either supervised [9–11] or unsupervised [12,13]. In supervised feature selection, class labels are known beforehand and are part of the context while predicting the label for unlabeled data points, whereas unsupervised feature selection has no such information at any stage of the feature selection process [14]. Moreover, feature selection methods can also be categorized into wrapper methods, embedded methods, and filter methods [15]. Both wrapper and embedded methods are classifier-dependent and may lead to overfitting, while filter methods are classifier-independent and therefore have better generalization capability [16].
In filter methods, the usual unsupervised approach to feature selection is to rank the features [17–24]. Rather than automatically selecting a subset of features, this approach provides the flexibility of choosing the number of highest-ranked features that fit within the memory constraints of the system. However, there is no way of knowing the exact number of features in the optimal feature subset other than trial and error. On selecting too many features, the method may add redundant or irrelevant features, which increases computational requirements and noise, respectively. On the other hand, by selecting too few features, the method may miss out on features that are relevant or essential for the complete representation of the data.
To determine the optimal subset directly, different graph-based approaches have been proposed for feature selection [25–37]. Das et al. [30] used a Feature Association Map to present a graph-based hybrid feature selection method, while Kumar et al. [34] employed a correlation exponential in their graph-based feature selection method. Lim et al. [18] presented a feature-dependency-based unsupervised feature selection (DUFS) method that uses pairwise dependency of the features to perform feature selection. Furthermore, Peralta et al. [35] proposed an unsupervised feature selection method that is robust and scalable in performing feature reduction; the presented method uses dissimilarity measures along with clustering algorithms [37] and is tested on a cell imaging dataset. Moreover, Das et al. [36] proposed a feature selection method using a bi-objective genetic algorithm with ensemble classifiers. He et al. [26] employed the Laplacian score for feature selection (LSFS), wherein the locality-preserving power of a feature is used as the basis for ranking. Multi-cluster feature selection (MCFS), proposed by Cai et al. [19], is another popular method that ranks features according to their ability to preserve the multi-cluster structure. Both LSFS and MCFS are often considered baseline methods. Goswami et al. [33] devised a feature selection technique that considers the variability score of a feature to measure its importance. Recently, Mandal et al. [17] presented a maximally non-redundant feature selection (MNFS) method in which features and their pairwise maximal information compression index (MICI [27]) values are used as the nodes and corresponding edge weights of a graph; the densest subgraph of the constructed graph is identified, and the corresponding features are selected as the maximally non-redundant subset. Yan et al. [31] and Saxena et al. [38] presented unsupervised feature selection methods that return an appropriate size of the feature subset. Bhadra et al. [32] employed a floating forward-backward search on the densest subgraph. Similarly, Bandyopadhyay et al. [12] proposed a variation of MNFS by clustering around the features of the densest subgraph, termed the densest subgraph finding with feature clustering (DSFFC) method. However, by requiring a minimum number of features, there is a possibility of adding irrelevant features to the selected feature set.
Therefore, this paper presents a novel unsupervised feature selection method. The proposed method consists of two phases: the first phase forms the densest feature subgraph, using the maximal information compression index for feature selection, and the second phase performs dynamic clustering of the features using shared nearest neighbors as the criterion. The cluster representatives that are mutually exclusive to the feature subgraph are added to the selected set of features. To experimentally evaluate the proposed method, eight standard UCI datasets have been considered, and the method is compared against four existing feature selection methods in terms of two performance parameters, namely ACC and MCC.
The remainder of the paper is organized as follows. Section 2 briefly reviews related work along with preliminary concepts. The proposed method is discussed in Section 3. Section 4 presents the experimental setup, followed by the performance analysis of the proposed method in Section 5. Finally, the conclusion is drawn in Section 6.

Preliminaries

Maximal Information Compression Index (MICI)
The maximal information compression index (MICI) is a dissimilarity measure defined by Mitra et al. [27]. Given that Σ is the covariance matrix of two random variables x and y, the MICI value λ2(x, y) is defined as the smallest eigenvalue of Σ and can be calculated by Equation (1).

2λ2(x, y) = var(x) + var(y) − √((var(x) + var(y))² − 4 var(x) var(y)(1 − ρ(x, y)²)) (1)

where var(x) is the variance of the random variable x, var(y) is the variance of the random variable y, and ρ(x, y) is the correlation between the random variables x and y.
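As an illustration, the MICI of two feature vectors can be computed directly from Equation (1). The following is a minimal sketch; the function name `mici` and the plain-list inputs are illustrative and not from the paper:

```python
import math

def mici(x, y):
    """Maximal information compression index of two feature vectors:
    the smallest eigenvalue of their 2x2 covariance matrix (Equation (1))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n          # var(x)
    vy = sum((b - my) ** 2 for b in y) / n          # var(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    rho = cov / math.sqrt(vx * vy)                  # correlation rho(x, y)
    s = vx + vy
    return (s - math.sqrt(s * s - 4 * vx * vy * (1 - rho ** 2))) / 2
```

For perfectly correlated features the MICI is zero (no information is lost by compressing one onto the other), and it grows as the linear dependency weakens, which is why it serves as a dissimilarity.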

Graph Density
Given a graph G(V, E, W), the average density d of the graph is the total edge weight divided by the number of vertices, as given in Equation (2).

d(G) = (Σ(i,j)∈E wij) / |V| (2)

Edge-Weighted Degree
Given a graph G(V, E, W), the edge-weighted degree δ(i) of a vertex i ∈ V is the sum of the weights of the edges incident on i, as given in Equation (3).

δ(i) = Σj∈V, j≠i wij (3)

The mean edge-weighted degree of the graph can be calculated using Equation (4).

δmean(G) = (Σi∈V δ(i)) / |V| (4)
As a fully connected graph is used in the proposed method and edge weights do not change, a subgraph G' can be uniquely identified by its corresponding vertex set V'. Therefore, δ(V'), δ mean (V'), and d(V') are defined as equivalent to δ(G'), δ mean (G'), and d(G'), respectively.
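Since Equations (2)–(4) survive only as references in the extracted text, the sketch below uses the standard weighted-graph definitions (sum of incident edge weights for the degree, total edge weight over |V| for the density). `W` is assumed to be a symmetric dict-of-dicts of MICI values, and subgraphs are identified by their vertex set `V` as described above:

```python
def ew_degree(W, V, i):
    """Edge-weighted degree of vertex i within vertex set V (Equation (3))."""
    return sum(W[i][j] for j in V if j != i)

def mean_ew_degree(W, V):
    """Mean edge-weighted degree over V (Equation (4))."""
    return sum(ew_degree(W, V, i) for i in V) / len(V)

def density(W, V):
    """Average density of the subgraph induced by V (Equation (2)):
    total edge weight divided by the number of vertices."""
    total = sum(W[i][j] for i in V for j in V if i < j)
    return total / len(V)
```

Because the full graph's edge weights never change, these three functions take only the vertex set, mirroring the δ(V'), δmean(V'), and d(V') notation.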

Shared Nearest Neighbors
The shared nearest neighbors parameter (N) represents the average number of features per cluster. It is computed by dividing the total number of features by the number of features in the resultant feature set (S), assuming S is the ideal feature subset. Equation (5) gives the mathematical formulation of N.

N = (Total number of features) / (Number of features in the resultant feature set) (5)

Nearest Neighbor Threshold Factor (β)
In the proposed method, β is a threshold parameter that limits the number of shared nearest neighbors two features 'a' and 'b' must have in common to form a cluster. Formally, this condition is captured by the relation R, defined in Equation (6).

a R b ⇔ |NL(a) ∩ NL(b)| ≥ N × β (6)

where NL(a) denotes the list of the N nearest neighbors of feature 'a'.
A lower value of β causes feature clusters to merge in the second phase of the proposed approach, which reduces the chance of including redundant features. Conversely, a higher value of β reduces the chance of missing the unique context of features in the second phase. In the experiments, β is set to 0.99 to promote loose clustering of the features.
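The relation R can be sketched as follows. One assumption is made beyond the text: since MICI is a dissimilarity, "nearest" is taken to mean smallest MICI value; `W` is again a symmetric dict-of-dicts of MICI values:

```python
def nearest_neighbors(W, a, N):
    """NL(a): the N nearest neighbors of feature a, where the features
    with the smallest MICI (least dissimilar) count as nearest."""
    others = [f for f in W if f != a]
    return set(sorted(others, key=lambda f: W[a][f])[:N])

def related(W, a, b, N, beta=0.99):
    """Relation R of Equation (6): features a and b are related when their
    nearest-neighbor lists share at least N * beta features."""
    shared = nearest_neighbors(W, a, N) & nearest_neighbors(W, b, N)
    return len(shared) >= N * beta
```

With β = 0.99 and integer N, two features are in effect required to share their entire neighbor lists, which is what keeps the clustering loose (many small clusters rather than a few merged ones).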


Proposed Method
This paper presents a novel method for the unsupervised feature selection problem, termed densest feature graph augmentation with disjoint feature clusters (DFG-A-DFC). Figure 1 illustrates the block diagram of the proposed method. The proposed method works in two phases:

1. First Phase: Finding the maximally non-redundant feature subset.
2. Second Phase: Maintaining the cluster structure of the original subspace at the cost of including some redundant features.

Algorithm 1 describes the pseudo-code of the proposed unsupervised feature selection method. Given an undirected fully connected graph G, wherein the vertex set V represents the feature set and the edge weight between any two vertices is the MICI value between the corresponding features, the first phase identifies the densest subgraph. In Algorithm 1, steps 1 to 6 detail the first phase of the proposed method. The densest subgraph is constructed by heuristically removing the vertices that have a lower edge-weighted degree than the average edge-weighted degree of the current subgraph. As the edge weight between two vertices represents the dissimilarity between them, removing such vertices favors features carrying unique information and maximizes the average edge-weighted degree. These steps are repeated until the density of the current subgraph falls below that of the previous iteration. The result of the first phase is the feature set S1, which corresponds to the vertices of the current subgraph.
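The first-phase peeling loop described above can be sketched as follows. This is an illustrative reading of steps 1–6, not the paper's Algorithm 1 verbatim: the exact stopping rule and tie handling are assumptions, and `W` is a symmetric dict-of-dicts of MICI dissimilarities:

```python
def densest_feature_subgraph(W):
    """First phase of DFG-A-DFC (a sketch): repeatedly drop vertices whose
    edge-weighted degree is below the current mean, stopping once the
    density of the surviving subgraph falls below the previous iteration."""
    def deg(V, i):
        # Edge-weighted degree of vertex i within vertex set V.
        return sum(W[i][j] for j in V if j != i)

    def density(V):
        # Total edge weight of the induced subgraph divided by |V|.
        return sum(W[i][j] for i in V for j in V if i < j) / len(V)

    V = set(W)
    prev_d = density(V)
    while len(V) > 1:
        mean_deg = sum(deg(V, i) for i in V) / len(V)
        keep = {i for i in V if deg(V, i) >= mean_deg}
        if not keep or keep == V:
            break                    # no vertex can be peeled off
        d = density(keep)
        if d < prev_d:
            break                    # density dropped: keep the previous set
        V, prev_d = keep, d
    return V                         # S1: the maximally non-redundant subset
```

Because the edge weights are dissimilarities, a vertex with a low edge-weighted degree is one that is similar (redundant) to many others, so peeling it off pushes the subgraph toward features carrying unique information.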
Further, the parameter 'k' defines the number of vertices to be clustered while maintaining the minimum number of clusters; its value is kept at 0.5. The subgraph corresponding to a cluster Ci is considered connected if every pair of its vertices 'i' and 'j' is related by the relation R depicted in Equation (6).
Then, a representative feature from each cluster is added to S1; the representative is the cluster member with the highest aggregated dissimilarity from the features already in S1, as defined in Equation (9).
Finally, the resultant feature set of the proposed method corresponds to the features in S1.
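A minimal sketch of the representative choice in the second phase, assuming (as the text states) that the representative is the cluster member with the highest aggregated dissimilarity from the already-selected features; the function name is illustrative:

```python
def representative(W, cluster, S1):
    """Pick the cluster member with the highest total MICI dissimilarity
    to the already-selected feature set S1."""
    return max(cluster, key=lambda f: sum(W[f][s] for s in S1))
```

Choosing the most dissimilar member keeps the augmentation step from re-adding information that S1 already carries, while still restoring the cluster structure of the original feature space.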
Furthermore, the proposed method, DFG-A-DFC, may seem similar to DSFFC, as both are two-phase methods wherein the first phase generates the densest subgraph and the second phase improves the generated clusters. However, DFG-A-DFC differs from DSFFC in several respects. In DSFFC, the first phase identifies the number of clusters for the optimal feature set and the second phase finds representatives for decision boundaries, while DFG-A-DFC recognizes clusters for the remaining features in the second phase. Moreover, DSFFC has additional logic for keeping the feature-set size within a given range, whereas DFG-A-DFC has a threshold for keeping a minimum number of features. Lastly, the DFG-A-DFC method employs shared nearest neighbors to decide whether two nodes belong to the same cluster, while the DSFFC method assigns features to clusters based on the number of clusters and the expected cluster centers. Therefore, the DFG-A-DFC method is clearly distinguishable from the DSFFC method.

Experimental Setup
This section details the considered datasets and performance metrics for the evaluation of the proposed approach. For performance validation, the proposed method (DFG-A-DFC) is compared against four state-of-the-art unsupervised feature selection methods, namely unsupervised feature selection using feature similarity measure (FSFS) [27], densest subgraph finding with feature clustering (DSFFC) [12], multi-cluster feature selection (MCFS) [19], and Laplacian score for feature selection (LSFS) [26]. Moreover, following the literature, the number of selected features is kept at half of the original number of features for a fair comparison [12]. To achieve this, LSFS selects features on the basis of their ranking, while DFG-A-DFC and DSFFC retain at least half of the original feature-set size.

Considered Dataset
The performance of the proposed method is evaluated on eight publicly available UCI datasets [28], namely Colon, Multiple Features, Isolet, Spambase, Ionosphere, WDBC, Sonar (Connectionist Bench), and SPECTF. Table 1 details the considered datasets in terms of the number of features, classes, and sample size. It can be observed from Table 1 that the feature sizes of the considered datasets vary widely, which tests the consistency of the proposed method.

Performance Evaluation
Two popular metrics are considered for the performance evaluation of the proposed method, namely classification accuracy (ACC) and Matthews correlation coefficient (MCC). Equations (8) and (9) depict the mathematical formulation of ACC and MCC, respectively.

ACC = (TP + TN) / (TP + TN + FP + FN) (8)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)) (9)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
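Both metrics can be computed from the four confusion-matrix counts; the following sketch uses the standard binary-classification definitions:

```python
import math

def acc(tp, tn, fp, fn):
    """Classification accuracy: fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined as 0.0 when any
    marginal of the confusion matrix is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike ACC, MCC stays informative on imbalanced datasets because it accounts for all four confusion-matrix cells, which is why both are reported.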
For each feature selection method, 10-fold cross-validation is performed to calculate the mean ACC and mean MCC along with the respective standard deviations. Further, the classification performance of the considered feature selection methods is evaluated on four classification models, namely K-Nearest Neighbors (KNN), Naïve Bayes, Support Vector Machine (SVM), and Adaboost.
In the KNN classifier, K is set to the square root of the data size. For the SVM classifier, the RBF kernel is used with parameters tuned via grid search. Finally, Naïve Bayes is used as the base estimator in the Adaboost classifier. For all other parameters, the default values from the respective literature are used.

Performance Analysis
Tables 2 and 3 demonstrate the performance of the considered feature selection methods on different classifier models in terms of mean ACC and mean MCC, respectively. From the tables, it can be observed that the proposed method, DFG-A-DFC, achieves the best results in more than 50% of the cases; the runner-up in terms of overall best classification accuracy is DSFFC. Further, Figure 2 illustrates a visual comparison of the feature selection methods in the form of a bar chart. In the figure, the x-axis corresponds to the considered methods, while the y-axis depicts the number of times the best value is reported by a method on both parameters, i.e., ACC and MCC. It is clearly visible that DFG-A-DFC is the best, as it reports the best value the maximum number of times among the considered methods.

Further, Figure 3a compares the considered feature selection methods with different classifiers on the number of datasets for which each reported the best value on the ACC parameter. From the figure, it is visible that the proposed method, DFG-A-DFC, with the SVM classifier outperformed the other methods on 87% of the considered datasets. Similarly, DFG-A-DFC with KNN and Naïve Bayes was superior on 50% of the datasets. Moreover, the proposed method shows competitive performance with the Adaboost classifier. The same comparison was conducted for the MCC parameter in Figure 3b. Here, DFG-A-DFC with the SVM classifier attained the best value on 5 out of 6 datasets, while it was better on 50% of the datasets with KNN and Naïve Bayes. However, FSFS fails to report a best value on any dataset for either parameter.

Further, the proposed feature selection method is analyzed in terms of the reduction in the selected features. Figure 4 illustrates a bar chart of the percentage of features reduced by the considered methods on the considered datasets. It can be observed that the proposed method, DFG-A-DFC, attained a maximum reduction of 40% on the WDBC dataset, while being competitive on the other datasets. Therefore, it can be claimed
that the proposed method is an efficient feature selection method. Moreover, this paper presents an ablation study on the two-phase architecture of the proposed method, focusing on the performance of the proposed method after the second phase. Table 4 highlights the classification accuracy (ACC) of the proposed method after the first and second phases with different classifiers for each considered dataset. Notably, the proposed method shows a significant improvement in classification accuracy for each dataset after the second phase. Therefore, it can be concluded that the inclusion of the second phase has strengthened the proposed method.


Conclusions
In this paper, a new unsupervised feature selection method, densest feature graph augmentation with disjoint feature clusters (DFG-A-DFC), has been proposed. The proposed method represents the feature set as a graph with the dissimilarities between features as the edge weights. In the first phase, the features of the densest subgraph are taken as the initial feature subset. In the second phase, shared nearest-neighbor-based clustering is applied to the feature set. Lastly, the final feature subset is formed by augmenting the initial feature subset with representative features from the formed clusters. To validate the efficiency of the proposed method, eight UCI datasets have been considered, and the method has been compared against four existing unsupervised feature selection methods in terms of two performance criteria, namely classification accuracy and Matthews correlation coefficient. Experiments demonstrate that the proposed method efficiently reduces the number of features while delivering better performance. Thus, it can be used as an alternative for performing feature selection. In the future, the proposed method can be applied to real-time applications such as image segmentation, wireless sensor networks, and data mining. Furthermore, the proposed method can be extended to big data.

Figure 1. Block diagram of the proposed method.

Algorithm 1: Densest Feature Graph Augmentation with Disjoint Feature Clusters.

Algorithms 2023, 16

Figure 2. Bar chart for the number of times the best value is reported by considered methods on ACC and MCC.

Figure 3. Comparison of considered feature-selection methods with different classifiers on the number of datasets for which the best value is attained on (a) the ACC parameter; and (b) the MCC parameter.

Figure 4. Comparison of the number of selected features by the considered methods.

Table 1. Details of considered datasets.
* The Ionosphere dataset originally has 34 features; however, as the second column is identical for all rows, it has not been considered in the experiments.

Table 2. Classification accuracy of different models on the considered approaches for various datasets.

Table 3. MCC of different models on the considered approaches for various datasets.

Table 4. Ablation study of the proposed method in terms of the ACC parameter.
