Self-Expressive Kernel Subspace Clustering Algorithm for Categorical Data with Embedded Feature Selection

Kernel clustering of categorical data is a useful tool for processing datasets that are not linearly separable and has been employed in many disciplines. Despite recent efforts, kernel clustering of categorical data remains a significant challenge for existing methods because of their assumptions of feature independence and equal feature weights. In this study, we propose a self-expressive kernel subspace clustering algorithm for categorical data (SKSCC) that uses a self-expressive kernel density estimation (SKDE) scheme together with a new feature-weighted non-linear similarity measurement. In the SKSCC algorithm, we propose an effective non-linear optimization method to solve the clustering objective function, which not only considers the relationships between attributes in a non-linear space but also assigns each attribute a weight measuring its degree of correlation. A series of experiments on widely used synthetic and real-world datasets demonstrated the effectiveness and efficiency of the proposed algorithm compared with other state-of-the-art methods, particularly in exploring non-linear relationships among attributes.


Introduction
One of the goals of clustering is to mine the internal structure and characteristics of unlabeled data, a task known as unsupervised learning [1,2]. Real-world applications such as pattern recognition [3], text mining [4], image retrieval [5], and bioinformatics [6] generate unlabeled data, and these data are increasingly categorical rather than purely numerical. Clustering analysis for categorical data has therefore attracted a great deal of interest from the scientific community. One example is that political philosophy is often measured as liberal, moderate, or conservative. Another is that breast cancer diagnoses based on mammograms use the categories normal, benign, probably benign, suspicious, and malignant.
In the past few decades, various clustering algorithms have been proposed [7][8][9][10][11] for numerical data. However, the attributes of categorical data are discrete, and their values come from a limited symbol set. Unlike continuous data, categorical data do not support mathematical operations such as the mean and standard deviation. As a result, algorithms suitable for continuous data cannot be directly applied to categorical data. To deal with this disadvantage, researchers have developed clustering algorithms for categorical data, such as ROCK [12], scaLable InforMation BOttleneck (LIMBO) [13,14], MGR [15], DHCC [16], and k-modes-type algorithms [17][18][19][20][21][22][23]. However, each of these algorithms has its own merits and disadvantages. Even state-of-the-art algorithms have shortcomings, and none is effective for all datasets. For instance, ROCK is a non-k-modes, agglomerative hierarchical clustering method that uses the conventional Jaccard coefficient to compute the similarity of two samples. However, the Jaccard coefficient cannot measure the specific magnitude of a difference; it can only record whether two values are the same or not. In addition, the time complexity of this algorithm is high: quadratic in the number of objects. LIMBO uses an agglomerative information bottleneck to measure the distance between entities but is not comprehensive enough to extract data clustering features. The MGR algorithm proposes a mean gain ratio to select clustering attributes. LIMBO and MGR are based on information theory, meaning that they can quickly take one related variable into account, but only one, while ignoring other important feature information. DHCC performs multiple correspondence analysis, avoiding one-to-one similarity calculations. However, this method is sensitive to outlying objects, and, compared with agglomerative approaches, DHCC is a divisive algorithm and has seen less application.
The conventional k-modes algorithm and its variants have been extensively used for categorical data clustering. The distance between samples is measured by the simple matching coefficient (SMC). However, these methods consider only the attributes' modes, while ignoring the statistical information of the data itself. Meanwhile, they can become trapped in local optima and are sensitive to the initial clusters and modes. Our numerical experiments even showed that the k-modes algorithm could not identify the optimal clustering results for some particular datasets, regardless of the selection of the initial centers.
To solve the problems of k-modes-type algorithms, Chen [24] proposed a probabilistic framework in which a kernel bandwidth is introduced with a soft feature selection scheme, so that the cluster center is equal to the smoothed frequency estimator of the categories. Feature selection is of great significance to data processing in the era of big data [25,26]. It often involves selecting the most important features representing an object's attributes and then building a learning model for clustering tasks. Feature selection can not only relieve the curse of dimensionality caused by too many attributes but can also retain relevant features, remove irrelevant features, reduce the difficulty of learning tasks, and reveal the essential features. Based on their evaluation criteria, embedded feature selection methods such as CART [27] not only overcome the low efficiency of wrapper feature selection methods [28][29][30] but also avoid the disconnection of filter feature selection methods. Algorithms that take a filter approach to feature selection, such as Chi-Square [31], information gain [32], gain ratio [33], support vector machines [34,35], ReliefF [36,37], and hybrid ReliefF [38,39], are used in many practical applications. The embedded feature selection approach uses a learning model, so that the feature selection process is automatically integrated with learner training. Although several clustering analysis methods employ feature selection [24,40], many current approaches have one or more of the following disadvantages: considering all features independently, considering all attributes equally important, and lacking an optimization solution.
The kernel clustering method, which adds an optimization process over sample features, uses the Mercer kernel to map samples from the input space to a high-dimensional feature space and clusters them in that feature space. Kernel clustering is widely used and is considered superior in performance to classical clustering algorithms: it can distinguish, extract, and enlarge useful features through non-linear mapping, so as to achieve more accurate clustering. The kernel k-means algorithm [41] makes samples linearly separable (or nearly so) in kernel space via the "kernel" trick. However, the kernel function is defined for continuous data, so it cannot be directly transferred to categorical data, and the algorithm is based on the assumption that the original features are equally important. Some recent self-expressiveness-based methods [42][43][44] use the subspace self-expressiveness property with related regularization terms. They are also unsuitable for categorical data, and they all involve linear combinations of attributes.
In this paper, we view the task of clustering categorical data from a kernel clustering perspective and propose a non-linear clustering algorithm for categorical data. The algorithm, named self-expressive kernel subspace clustering for categorical data (SKSCC), is based on kernel density estimation (KDE) and probability-based similarity measurement. SKSCC not only considers the relationships between attributes in a non-linear space but also gives each attribute a feature weight to measure its degree of correlation. KDE has been employed to estimate probability distributions for categorical data [24,45,46]. This work introduces self-expressive kernel density estimation (SKDE), in which every attribute has its own bandwidth, and then proposes a new non-linear similarity measurement method for categorical data in which a weight is added for each attribute to determine its importance. The objective function of the derived clustering algorithm is therefore non-linear, and, as is commonly accepted, non-linear equations and inequalities are not easy to solve. We thus propose an efficient non-linear optimization method to solve the objective function of the clustering algorithm.
In summary, the main contributions of our work are as follows:
• We define the self-expressive kernel density estimation approach, in which each symbol can be expressed by a probability proportional to the kernel bandwidth, and the cluster center is smoothed to the frequency estimator of the categories;
• We propose a non-linear feature-weighted similarity measurement method that takes the relationships between attributes into consideration;
• We put forward a non-linear optimization method in kernel subspace. Furthermore, we present SKSCC, an efficient self-expressive kernel subspace clustering algorithm for categorical data that uses feature selection to choose the important attributes;
• We conducted a series of experiments on several synthetic and real-world datasets to compare the performance of the proposed algorithm. The experimental results show that the proposed algorithm outperforms other algorithms in exploring non-linear relationships among attributes and improves the performance and efficiency of clustering.
The remainder of this paper is organized as follows: Section 2 describes related work. Section 3 introduces the KDE-based similarity for categorical data. In Section 4, the new clustering algorithm is elaborated. Experimental results are analyzed in Section 5. Section 6 presents our conclusions.

Related Work
The similarity measure of categorical data is the basis of categorical data analysis. A good clustering algorithm maximizes the similarity within clusters and minimizes the similarity between clusters. Although many researchers have proposed different methods to measure the similarity or dissimilarity of categorical data, none has been widely recognized. For numerical data, the Euclidean distance, the vector dot product, and other measures are available to quantify the similarity or difference between objects. For categorical data, the mean and variance are not defined, and the vector dot product is meaningless.
In 1998, Huang [17] proposed the conventional k-modes algorithm, a non-weighted feature clustering approach. The k-modes algorithm can be formulated as the following mathematical optimization model:

\min F(W, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, d(x_i, Q_l),

where W = [w_{li}] is a partition matrix with \sum_{l=1}^{k} w_{li} = 1 and w_{li} \in \{0, 1\}, and Q_l = \{q_{l1}, q_{l2}, \ldots, q_{lm}\} is the cluster center. The algorithm adopts a simple method, called the overlap measure (OM) [19], to measure the distance, as shown in Equations (2) and (3):

d(x_i, Q_l) = \sum_{j=1}^{m} \delta(x_{ij}, q_{lj}),    (2)

where

\delta(a, b) = \begin{cases} 0, & a = b, \\ 1, & a \neq b. \end{cases}    (3)

The difference between two symbols is thus simply equal or unequal. This measure is easy to use and computationally efficient, since no parameters are involved. However, its distances are not always reasonable indicators of the real dissimilarity, because it ignores valuable information about the relationships of correlated attributes. There are some variants of the k-modes algorithm, such as those presented in [47,48]. All of these algorithms suppose that features are equally important for clustering analysis, and they have seen limited use in real-world practice.
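A minimal sketch of the overlap measure of Equations (2) and (3): the dissimilarity of two categorical objects is simply the count of attributes on which their symbols disagree.

```python
def overlap_distance(x, y):
    """Simple matching (overlap) dissimilarity between two categorical
    objects: counts the attributes on which the symbols differ."""
    assert len(x) == len(y), "objects must have the same number of attributes"
    return sum(1 for a, b in zip(x, y) if a != b)

# Two objects agreeing on 2 of 3 attributes are at distance 1.
print(overlap_distance(("red", "small", "round"), ("red", "large", "round")))  # prints 1
```

As the text notes, the measure carries no parameters, which is what makes it fast but also blind to the magnitude of differences and to correlations between attributes.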
In weighted-feature clustering algorithms, such as WKM [22], wk-modes [21], and SCC [24], features are weighted according to their importance to the clustering task, so different features carry different importance. These algorithms compute the similarity between two samples under the assumption that each dimension is independent. Their mathematical optimization model can be expressed as follows:

\min F(W, \Lambda, Q) = \sum_{l=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{m} w_{li}\, \lambda_{lj}^{\beta}\, \delta(x_{ij}, q_{lj}),

where W is again a partition matrix with \sum_{l=1}^{k} w_{li} = 1 and w_{li} \in \{0, 1\}, \Lambda = [\lambda_{lj}] is a weight matrix, and \beta is an excitation parameter used to control the feature weights.
These algorithms also utilize the OM method to measure distance, as in Equations (2) and (3). Such methods have the advantage of high clustering efficiency. In addition, feature-weighting clustering algorithms assign a uniform weight to all the intra-attribute distances measured on a feature, which is suitable for well-defined distances. However, distance is not well-defined for categorical data, as evidenced by the OM distance measure. To solve this problem, most existing methods focus on exploring appropriate distance measures and attribute-weighting mechanisms, such as MWKM [23]. These methods are all linear algorithms based on the assumption that features are independent of each other, so the relationships between features are ignored, which means that a great deal of inter-feature information is lost.
At present, two methods are mainly used to explore the non-linear relationships between attributes: deep neural networks (DNNs) and the kernel method. As is well known, DNNs need a large amount of data to train; the larger the amount of data, the more accurate the result. The kernel method uses the Mercer kernel function to implicitly describe the non-linear relationships between attributes and has been widely studied and applied because of its simple mathematical expression and high computational efficiency. Chen et al. [24] proposed a soft subspace clustering approach based on probabilistic distance. Its mathematical optimization model can be expressed as follows:

\min F(W, \Pi) = \sum_{k=1}^{K} \sum_{x \in \pi_k} \sum_{d=1}^{D} w_{kd}\, \mathrm{Dis}_d(x, \pi_k),

where w_{kd} is the weight of the dth dimension for cluster k, x is a data sample, and \pi_k is the kth cluster. \mathrm{Dis}_d(x, \pi_k) denotes the distance of sample x to the kth cluster on the dth dimension, computed from two discrete probabilities. This method also defines a kernel density function \kappa(X_d \mid o_{dl}; \lambda_k), as shown in Equation (6), to estimate the probability, where \lambda_k \in [0, 1] is a single bandwidth shared by every attribute of cluster k:

\kappa(X_d \mid o_{dl}; \lambda_k) = \begin{cases} 1 - \lambda_k + \lambda_k / |O_d|, & X_d = o_{dl}, \\ \lambda_k / |O_d|, & X_d \neq o_{dl}. \end{cases}    (6)
Here, |O_d| denotes the cardinality of O_d, i.e., the number of distinct categories of the dth attribute, and o_{dl} denotes the lth category of the dth attribute. Although this method considers the relationships between attributes in a non-linear space, it does not distinguish the importance of attributes. It can also be seen as one in which all attributes are independent of each other, and all attributes in the same cluster use the same bandwidth.

KDE-Based Similarity for Categorical Data
In this section, we first propose a kernel density estimation (KDE) method for categorical attributes in which each attribute has its own bandwidth. The distance between categorical data objects can then be expressed through a probabilistic data distribution. Moreover, a new similarity measure in the kernel subspace is defined for clustering.

Self-Expressive Kernel Density Estimation (SKDE)
The kernel density estimation method uses no prior knowledge of the data distribution and attaches no assumptions to it. It studies the characteristics of the distribution from the data sample itself and is a non-parametric probability density estimation method. Unlike the kernel function in Equation (6), we define the kernel density function with a per-attribute bandwidth as follows:

\kappa(x_d, o_{dl}; \lambda_d) = \begin{cases} 1 - \lambda_d + \lambda_d / |O_d|, & x_d = o_{dl}, \\ \lambda_d / |O_d|, & x_d \neq o_{dl}, \end{cases}    (7)

where |O_d| denotes the cardinality of O_d, i.e., the number of distinct categories of the dth attribute, and \lambda_d denotes the bandwidth of the dth attribute. This can be expressed more simply as

\kappa(x_d, o_{dl}; \lambda_d) = (1 - \lambda_d)\, I(x_d = o_{dl}) + \lambda_d / |O_d|,    (8)

where I(\cdot) denotes the indicator function, with I(\mathrm{true}) = 1 and I(\mathrm{false}) = 0. According to Equation (7), we obtain

\sum_{l=1}^{|O_d|} \kappa(x_d, o_{dl}; \lambda_d) = 1.

The above equation shows that the kernel function we defined satisfies the basic properties of a probability distribution.
We use \hat{p}(o_{dl} \mid \lambda_d) to denote the kernel probability estimate of p(o_{dl}). According to the basic principle of the SKDE method, we have

\hat{p}(o_{dl} \mid \lambda_d) = \frac{1}{N} \sum_{x_i \in DB} \kappa(x_{id}, o_{dl}; \lambda_d) = (1 - \lambda_d)\, f(o_{dl}) + \lambda_d / |O_d|,

where DB is the sample set and f(o_{dl}) is the frequency estimate of o_{dl}. In order to map categorical data into the high-dimensional space through the kernel function, a symbolic vectorization technique is used, given in Definition 1.
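The SKDE estimate can be sketched directly: averaging the kernel of Equation (8) over the sample yields the smoothed frequency (1 − λ)f + λ/|O_d|, a convex combination of the empirical frequency and the uniform distribution. The attribute values and bandwidth below are illustrative.

```python
from collections import Counter

def kappa(x, o, lam, m):
    """Kernel value for one attribute with m = |O_d| categories:
    1 - lam + lam/m when x == o, and lam/m otherwise (Equation (8) form)."""
    return (1.0 - lam) * (x == o) + lam / m

def skde_estimate(column, lam):
    """Kernel probability estimate for every observed category of one
    attribute: p_hat(o) = (1 - lam) * f(o) + lam / m, where f is the
    sample frequency. Equals the average of kappa over the sample."""
    counts = Counter(column)
    n, m = len(column), len(counts)
    return {o: (1.0 - lam) * c / n + lam / m for o, c in counts.items()}

# With lam = 0.5 the estimate is shrunk halfway toward the uniform 1/m.
est = skde_estimate(["a", "a", "a", "b"], 0.5)
```

Note how λ = 0 recovers the raw frequencies and λ = 1 gives the uniform distribution, which matches the role of the bandwidth as a smoothing parameter.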

Definition 1.
We represent a categorical value x_{id} of the dth attribute as a vector

x_{id} = (x_{id1}, x_{id2}, \ldots, x_{id|O_d|}),

which satisfies the constraint condition \sum_{l=1}^{|O_d|} x_{idl} = 1; each component x_{idl} can be estimated using the kernel function shown in Equation (8).

Similarity Measurement Based on Kernel Subspace
The existing mainstream methods fail to consider the relationships between features. We formally define the non-linear similarity measurement in the kernel subspace as follows.

Definition 2. The similarity measure of the kernel subspace is given by

sim(x_i, x_j) = \kappa_w(x_i, x_j),

where \kappa_w(x_i, x_j) denotes the weighted-feature kernel function, combining the two sample objects over each attribute.
According to Definition 2, the polynomial kernel function can be expressed as follows:
• original polynomial kernel function:

\kappa(x_i, x_j) = \Big( \sum_{d=1}^{D} \langle x_{id}, x_{jd} \rangle + c \Big)^{p};

• weighted-feature polynomial kernel function:

\kappa_w(x_i, x_j) = \Big( \sum_{d=1}^{D} w_{kd}^{\theta} \langle x_{id}, x_{jd} \rangle + c \Big)^{p}.

We introduce a kernel function that originally acts on continuous data to project categorical data into the kernel space, together with a weight vector w_k = \{w_{kd} \mid d = 1, 2, \ldots, D\} for each cluster in the kernel space for original-feature selection. The greater the dth dimension's contribution to a cluster, the more important it is. The weights w_{kd} meet the constraints

w_{kd} \in [0, 1], \quad \sum_{d=1}^{D} w_{kd} = 1.

We introduce an exponent \theta (\theta \neq 0) on w_{kd} to control the incentive intensity and suppose \theta is a known constant. The larger the value of \theta, the smoother the weight distribution.
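The two polynomial kernels above can be sketched over vectorized objects, represented here as lists of per-attribute vectors (e.g., the probability vectors of Definition 1 or plain one-hot vectors). The offset c, degree p, and θ values are illustrative assumptions, not values fixed by the paper.

```python
def poly_kernel(x, y, c=1.0, p=2):
    """Original polynomial kernel over the concatenated per-attribute
    vectors: (sum_d <x_d, y_d> + c) ** p."""
    s = sum(xi * yi for xd, yd in zip(x, y) for xi, yi in zip(xd, yd))
    return (s + c) ** p

def weighted_poly_kernel(x, y, w, theta=1.5, c=1.0, p=2):
    """Feature-weighted variant: each attribute's inner product is scaled
    by w_d ** theta before the polynomial map, so the weights act inside
    the non-linear kernel rather than on independent per-feature distances."""
    s = sum(w[d] ** theta * sum(xi * yi for xi, yi in zip(x[d], y[d]))
            for d in range(len(x)))
    return (s + c) ** p
```

With all weights equal to 1 the weighted kernel reduces to the original one; unequal weights reshape the kernel-space geometry attribute by attribute.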
This similarity measure not only uses the kernel method to "kernel" the categorical data, but also considers the relationship between features in the non-linear space. We also select features in the mapped kernel space, which distinguishes the importance of features to the cluster.

Proposed Clustering Algorithm
In cluster analysis, a cluster is defined as a sample set with minimum dispersion (maximum compactness), where compactness is measured by the similarity between the samples and the cluster center. Combined with the non-linear similarity measurement of the kernel subspace defined above, the kernel subspace clustering optimization objective function for categorical data can be defined as follows:

\max J(\Pi, W) = \sum_{k=1}^{K} \sum_{x_i \in \pi_k} \kappa_w(x_i, v_k),    (14)

where v_k is the center of cluster \pi_k, denoted as a D-dimensional vector v_k = (v_{k1}, \ldots, v_{kd}, \ldots, v_{kD}). Since a categorical attribute value is represented by a vector according to Definition 1, the dth dimension's center of cluster \pi_k is also represented by a vector, v_{kd} = (v_{kd1}, \ldots, v_{kd|O_d|}). Therefore, we have

v_{kdl} = (1 - \lambda_d)\, f_k(o_{dl}) + \lambda_d / |O_d|,

where f_k(o_{dl}) denotes the frequency estimate of o_{dl} \in O_d in the dth attribute, computed within cluster \pi_k.

Non-Linear Optimization in Kernel Subspace
In the process of calculation, a sum appears inside the kernel function (as in the polynomial kernel subspace function above), which makes it difficult to solve for w_{kd} and, in turn, greatly increases the difficulty of solving the objective function. Therefore, we propose an efficient optimization method for the kernel subspace clustering objective. The objective function is transformed into the form of existing mainstream methods (such as the WKM [22] method) in order to improve computational efficiency. Further analysis of the optimization objective in Equation (14) leads to Theorem 1, which shows that, for all convex kernel functions, maximizing Equation (14) is equivalent to maximizing the function in Equation (16):

J(\Pi, W) = \sum_{k=1}^{K} \sum_{x_i \in \pi_k} \sum_{d=1}^{D} w_{kd}^{\theta}\, \kappa_d(x_i, v_k),    (16)

where \kappa_d(x_i, v_k) represents the inner product of the mapping functions of x_i and v_k in the dth dimension, that is, the kernel function in the dth dimension. For example, the per-dimension polynomial kernel function can be expressed as

\kappa_d(x_i, v_k) = (\langle x_{id}, v_{kd} \rangle + c)^{p}.

Theorem 1. When \theta \geq 1, for all convex kernel functions \kappa(\cdot, \cdot), maximizing Equation (14) yields the same solution as maximizing Equation (16).
Proof. We define z_d as the combination of the two input objects in the dth dimension for the similarity measurement in the kernel subspace. When the two input objects are the sample x_i and the cluster center v_k, z_d represents the combination of x_i and v_k in the dth dimension, so that \kappa_w(x_i, v_k) = f(\sum_{d=1}^{D} w_{kd}^{\theta} z_d) and \kappa_d(x_i, v_k) = f(z_d) for a convex function f. We proceed by induction on D:
(1) when D = 1, 2, the inequality clearly holds;
(2) we suppose that the inequality holds when D = n; when D = n + 1, let p_n = \sum_{d=1}^{n} w_{kd}, and applying the induction hypothesis together with the convexity of f shows that the inequality also holds for D = n + 1.
In particular, when \theta = 1, the inequality is Jensen's inequality. We obtain f(\sum_{d=1}^{D} w_{kd}^{\theta} z_d) by stretching the lower bound \sum_{d=1}^{D} w_{kd}^{\theta} f(z_d) to the upper bound: we adjust w_{kd} to maximize \sum_{d=1}^{D} w_{kd}^{\theta} f(z_d), and through step-by-step iteration we finally obtain the maximum of f(\sum_{d=1}^{D} w_{kd}^{\theta} z_d).
Combining Definition 1 and Theorem 1, the Gaussian kernel function [49] can be expressed as

\kappa(x_i, v_k) = \exp\Big( -\frac{\| x_i - v_k \|^2}{2\sigma^2} \Big),

where \| \cdot \| is the Euclidean norm, \sigma^2 is the variance, and the corresponding convex function is f(x) = \exp(x) with z_d = -\| x_{id} - v_{kd} \|^2 / (2\sigma^2).
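Since the Gaussian kernel factorizes over dimensions, the per-dimension kernel κ_d used in Equation (16) can be sketched directly. Here x_d and v_d are assumed to be the vectorized (probability or one-hot) representations of a single attribute, and σ² is a supplied parameter.

```python
import math

def gaussian_kernel_d(x_d, v_d, sigma2):
    """Per-dimension Gaussian kernel: exp(-||x_d - v_d||^2 / (2 * sigma2)),
    where x_d and v_d are the vector representations of one attribute.
    The full kernel is the product over d, i.e. exp of the sum of the z_d."""
    sq = sum((a - b) ** 2 for a, b in zip(x_d, v_d))
    return math.exp(-sq / (2.0 * sigma2))
```

Identical attribute vectors give the maximum value 1, and the kernel decays smoothly as the probability vectors move apart.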

SKSCC Clustering Algorithm
The Gaussian kernel function is the most widely used kernel function, because it performs well for both large and small samples and has fewer parameters than other kernel functions. This paper proposes SKSCC, which takes the Gaussian kernel as the kernel in the objective function of Equation (16). We can then rewrite Equation (16) as Equation (19):

\max J(\Pi, W) = \sum_{k=1}^{K} \sum_{x_i \in \pi_k} \sum_{d=1}^{D} w_{kd}^{\theta} \exp\Big( -\frac{\| x_{id} - v_{kd} \|^2}{2\sigma^2} \Big),    (19)

where \sigma^2 is defined as the global variance, learned from the N samples over the D attribute dimensions. Equation (19) is a non-linear optimization problem with constraints. Using Lagrange multipliers, the objective function can be transformed into Equation (20):

J(\Pi, W) = \sum_{k=1}^{K} \Big[ \sum_{x_i \in \pi_k} \sum_{d=1}^{D} w_{kd}^{\theta}\, \kappa_d(x_i, v_k) + \xi_k \Big( 1 - \sum_{d=1}^{D} w_{kd} \Big) \Big].    (20)

In this paper, we use the EM algorithm to optimize \max J(\Pi, W); in other words, a local optimum of J is obtained by an iterative method. Following this principle, we first set \Pi = \hat{\Pi} to maximize J(\hat{\Pi}, W) and obtain the weights, recorded as \hat{W}. Next, we set W = \hat{W} and maximize J(\Pi, \hat{W}) to compute the partition, recorded as \hat{\Pi}. The two steps, weight computing and clustering, are detailed as follows.
(1) Weight computing. We define K independent sub-objective functions:

J_k = \sum_{x_i \in \pi_k} \sum_{d=1}^{D} w_{kd}^{\theta}\, \kappa_d(x_i, v_k) + \xi_k \Big( 1 - \sum_{d=1}^{D} w_{kd} \Big).

Letting \partial J_k / \partial w_{kd} = 0, we obtain

\theta\, w_{kd}^{\theta - 1} \sum_{x_i \in \pi_k} \kappa_d(x_i, v_k) - \xi_k = 0,    (22)

and letting \partial J_k / \partial \xi_k = 0, we obtain

\sum_{d=1}^{D} w_{kd} = 1.    (23)

From Equations (22) and (23), we obtain the representation of w_{kd}:

w_{kd} = \frac{\big( \sum_{x_i \in \pi_k} \kappa_d(x_i, v_k) \big)^{1/(1-\theta)}}{\sum_{d'=1}^{D} \big( \sum_{x_i \in \pi_k} \kappa_{d'}(x_i, v_k) \big)^{1/(1-\theta)}}.    (24)

(2) Clustering. Clusters are generated by assigning each x_i to the cluster with the greatest similarity:

x_i \in \pi_{k^*}, \quad k^* = \arg\max_{k} \sum_{d=1}^{D} w_{kd}^{\theta}\, \kappa_d(x_i, v_k).

In summary, the algorithm is outlined in Algorithm 1. In terms of algorithmic structure, SKSCC can be viewed as an extension of the k-modes clustering algorithm, adding step (3) to update the cluster centers and step (5) to compute the attribute weights, both of which depend on the kernel bandwidths that can be learned from the objects themselves. Therefore, like the k-modes algorithm, the SKSCC algorithm converges in a finite number of iterations. The time complexity of SKSCC is O(KND) per iteration.
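The alternation above can be sketched in Python. This is an illustrative reconstruction, not the authors' exact implementation: categories are one-hot encoded rather than vectorized via Definition 1, a single bandwidth λ and a fixed σ² = 1 are assumed, initialization is a naive round-robin, and the weight update uses the stationary-point form of Equation (24).

```python
import math

def one_hot(value, domain):
    return [1.0 if value == o else 0.0 for o in domain]

def skscc_sketch(X, K, theta=1.5, lam=0.1, sigma2=1.0, iters=10):
    """Illustrative SKSCC-style alternation: compute smoothed cluster
    centers, update attribute weights per cluster, reassign objects to
    the cluster maximizing the weighted per-dimension Gaussian kernel."""
    N, D = len(X), len(X[0])
    domains = [sorted({x[d] for x in X}) for d in range(D)]
    enc = [[one_hot(x[d], domains[d]) for d in range(D)] for x in X]
    labels = [i % K for i in range(N)]            # naive initialization
    W = [[1.0 / D] * D for _ in range(K)]         # uniform initial weights

    def kappa_d(xd, vd):
        sq = sum((a - b) ** 2 for a, b in zip(xd, vd))
        return math.exp(-sq / (2.0 * sigma2))

    for _ in range(iters):
        centers = []
        for k in range(K):
            members = [enc[i] for i in range(N) if labels[i] == k] or enc
            vk = []
            for d in range(D):
                m = len(domains[d])
                freq = [sum(x[d][l] for x in members) / len(members)
                        for l in range(m)]
                # smoothed frequency estimator of the categories
                vk.append([(1.0 - lam) * f + lam / m for f in freq])
            centers.append(vk)
        for k in range(K):
            members = [enc[i] for i in range(N) if labels[i] == k] or enc
            Dk = [sum(kappa_d(x[d], centers[k][d]) for x in members)
                  for d in range(D)]
            # stationary-point weight update (Equation (24) form, assumed)
            raw = [dk ** (1.0 / (1.0 - theta)) for dk in Dk]
            W[k] = [r / sum(raw) for r in raw]
        labels = [max(range(K), key=lambda k: sum(
            W[k][d] ** theta * kappa_d(enc[i][d], centers[k][d])
            for d in range(D))) for i in range(N)]
    return labels, W
```

The sketch preserves the structural point of the section: the weights live per cluster, sum to one, and are recomputed in the same loop that reassigns objects.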

Optimization of Kernel Bandwidths
In light of the weight calculation formula, Equation (24), the weights depend on the kernel bandwidths, so we face the bandwidth optimization problem of the defined SKDE method. Here, we use the mean integrated squared error (MISE), a data-driven criterion for estimating the optimal bandwidth. For the dth attribute, the mean squared error of the kernel probability estimate for o_{dl} \in O_d can be expressed as

\mathrm{MSE}(\hat{p}(o_{dl} \mid \lambda_d)) = E\big[ (\hat{p}(o_{dl} \mid \lambda_d) - p(o_{dl}))^2 \big].

According to the definition of the kernel function and the properties of expectation, the bandwidth \lambda_d can be obtained by minimizing the integrated error over all categories:

\lambda_d^{*} = \arg\min_{\lambda_d} \sum_{l=1}^{|O_d|} E\big[ (\hat{p}(o_{dl} \mid \lambda_d) - p(o_{dl}))^2 \big].

Because E[I(x_{id} = o_{dl})] = p(o_{dl}), where N represents the number of samples, we have

E[\hat{p}(o_{dl} \mid \lambda_d)] = (1 - \lambda_d)\, p(o_{dl}) + \lambda_d / |O_d|.

Due to \mathrm{Var}[X] = E[X^2] - (E[X])^2 and [I(\cdot)]^2 = I(\cdot), we have

\mathrm{Var}[\hat{p}(o_{dl} \mid \lambda_d)] = \frac{(1 - \lambda_d)^2\, p(o_{dl})(1 - p(o_{dl}))}{N}.

Therefore, we obtain

\sum_{l=1}^{|O_d|} E\big[ (\hat{p} - p)^2 \big] = \lambda_d^2 \sum_{l=1}^{|O_d|} \Big( \frac{1}{|O_d|} - p(o_{dl}) \Big)^2 + \frac{(1 - \lambda_d)^2}{N} \sum_{l=1}^{|O_d|} p(o_{dl})(1 - p(o_{dl})).

Setting the derivative with respect to \lambda_d to zero, we have

\lambda_d^{*} = \frac{s_d^2}{s_d^2 + N \sum_{l=1}^{|O_d|} (1/|O_d| - p(o_{dl}))^2}, \quad s_d^2 = \sum_{l=1}^{|O_d|} p(o_{dl})(1 - p(o_{dl})).

We use the frequency distribution of the training samples to estimate p(o_{dl}) and calculate s_d^2 from the training samples. The kernel bandwidth algorithm is outlined in Algorithm 2. Several properties of the optimal bandwidth estimate can be analyzed:
(1) The larger the number of samples N, the smaller the bandwidth.
The larger the number of samples N, the smaller the bandwidths. When N → ∞, the bandwidth λ d → 0. This is consistent with the effect of bandwidth as the smoothing parameter of the kernel function.
(2) The larger the data dispersion, the larger the bandwidth.
Let us calculate the derivative of \lambda_d^{*} with respect to s_d^2: it is positive, and \lambda_d^{*} lies in the range [0, 1). The larger the data dispersion s_d^2, the larger the bandwidth \lambda_d^{*}; that is to say, the greater the discreteness of an attribute, the larger the kernel bandwidth corresponding to that attribute. In particular, when an attribute's categorical data are uniformly distributed, the corresponding kernel bandwidth takes its maximum value.
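One closed form consistent with the MISE minimization and the properties above is λ* = s²/(s² + N·B), with B = Σ_l (1/|O_d| − p_l)². This is a reconstruction under the stated assumptions (p estimated by sample frequencies), not necessarily the paper's exact Algorithm 2.

```python
from collections import Counter

def optimal_bandwidth(column):
    """Closed-form MISE-minimizing bandwidth for one categorical attribute
    (assumed reconstruction): lam* = s2 / (s2 + n * b), where
    s2 = sum p_l (1 - p_l) is the Gini dispersion and
    b = sum (1/m - p_l)^2 measures distance from the uniform distribution."""
    n = len(column)
    counts = Counter(column)
    m = len(counts)
    p = [c / n for c in counts.values()]
    s2 = sum(pl * (1.0 - pl) for pl in p)
    b = sum((1.0 / m - pl) ** 2 for pl in p)
    denom = s2 + n * b
    # A constant attribute is degenerate: no smoothing is needed.
    return s2 / denom if denom > 0 else 0.0
```

The formula reproduces both stated properties: growing N shrinks λ toward 0, and a uniform attribute (B = 0) yields the maximum bandwidth.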

Algorithm 2: The kernel bandwidth calculation algorithm.

Experimental Analysis
Experiments were performed to verify the effectiveness of our proposed SKSCC on synthetic and real datasets. Comparative experiments were carried out on some current mainstream categorical clustering algorithms.

Experimental Setup
In practical applications, the Gaussian kernel function is the most widely used kernel function, because it is suitable for a variety of samples and has few parameters. Moreover, the mapping space provided by this type of kernel function is infinitely dimensional, so that data that are not separable in the original space can be mapped to linearly separable points. Therefore, we chose the Gaussian kernel to mine the non-linear relationships between categorical attributes. The kernel parameter \sigma^2 is defined as the global variance and is learned from the data themselves. We chose three algorithms, k-modes [17], WKM [22], and MWKM [23], for our comparative experiments. WKM introduces attribute weighting within the framework of the k-modes algorithm, which is a linear weighting. The MWKM algorithm weights the attributes through the frequency of the mode. All three methods calculate sample similarity (or dissimilarity) under the assumption of feature independence. These algorithms were selected for comparison with the non-linear similarity measurement of SKSCC. The parameter \beta was set to 2 in WKM; in MWKM, \beta was set to 2 and T_s = T_v = 1.
Synthetic data allow us to control the cluster structure of datasets through the number and size of clusters, which is conducive to analyzing the performance of the algorithm and its adaptability to various datasets. For this paper, we first tested on several synthetic datasets and then carried out experiments on a number of real datasets. Because the labels are all known, two external evaluation indices, accuracy and F-score [22], were selected to evaluate the clustering performance of the new algorithm; the larger the value of either index, the better the clustering effect. The F-score is defined as

F = \sum_{k} \frac{n_k}{N} \max_{i} F(class_k, \pi_i), \quad F(class_k, \pi_i) = \frac{2\, P(class_k, \pi_i)\, R(class_k, \pi_i)}{P(class_k, \pi_i) + R(class_k, \pi_i)},

where class_k represents the kth real class in the dataset, n_k represents the number of samples in class_k, and P(class_k, \pi_i) and R(class_k, \pi_i) are, respectively, the precision and recall obtained by comparing the real class class_k with cluster \pi_i of the clustering results, that is,

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN},

where TP is the number of objects of the class correctly placed in the cluster, FN is the number of objects of the class missing from the cluster, and FP is the number of objects in the cluster that do not belong to the class.
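The F-score above can be computed directly from label lists; a small sketch, assuming (as is common for this external index) that each real class is matched to its best-scoring cluster.

```python
def f_score(true_labels, cluster_labels):
    """External F-score: for each real class, take the best F1 over all
    clusters, then weight by relative class size."""
    n = len(true_labels)
    total = 0.0
    for c in sorted(set(true_labels)):
        in_c = {i for i in range(n) if true_labels[i] == c}
        best = 0.0
        for k in sorted(set(cluster_labels)):
            in_k = {i for i in range(n) if cluster_labels[i] == k}
            tp = len(in_c & in_k)
            if tp == 0:
                continue
            p = tp / len(in_k)   # precision of cluster k w.r.t. class c
            r = tp / len(in_c)   # recall of cluster k w.r.t. class c
            best = max(best, 2 * p * r / (p + r))
        total += len(in_c) / n * best
    return total
```

A perfect clustering scores 1.0 regardless of how cluster labels are permuted, since each class simply matches its best cluster.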

Discussion of Parameters
In the kernel space, each attribute is automatically given a weight in the similarity measure, and the corresponding subspace is found through feature selection. In the weight formula of Equation (24), \theta is the incentive intensity, i.e., the allocation parameter controlling the weights. Figure 1 shows how the weights of three attributes in the Breastcancer dataset change with this parameter; here, the discreteness of the three attributes is set to increase from attribute 1. Four cases of \theta are worth noting.
(1) When \theta = 0, w_{kd}^{\theta} is constant; that is, each attribute is assigned an equal weight;
(2) When \theta = 1, the exponent 1/(1-\theta) \to \infty, but all weights must meet the restriction \sum_{d=1}^{D} w_{kd} = 1, so when \theta \to 1^{+}, the attribute with the minimum deviation of the samples receives all the weight, while the rest of the attributes are given zero weight; when \theta \to 1^{-}, the importance of all attributes tends to be the same;
(3) When 0 < \theta < 1, the more discrete the attribute, the greater its weight;
(4) When \theta < 0 or \theta > 1, the attribute weight is inversely proportional to the dispersion of the data distribution. Considering Theorem 1, we should set \theta > 1, but when \theta is too large, the differences between attribute weights are reduced.

Analysis of Synthetic Data and Results
This study used MATLAB (Version 9.9.0.1495850, R2020b) to generate the synthetic data for the experiment. First, four multi-dimensional numerical datasets were generated by the MATLAB function mvnrnd(·), in which the weight of the attributes was controlled by setting the attribute variances, and the degree of correlation between attributes was controlled by adjusting the parameters of the covariance matrix. The synthesized numerical data were then discretized by equal-width binning [40] and transformed into categorical data. The synthetic datasets, which contain the correct category labels, are summarized in Table 1. The four datasets were used to verify the advantages of SKSCC over the current mainstream categorical clustering methods.

• The covariance of attribute 1 and attribute 2 is set to −2 in DataSet1, which makes these attributes negatively correlated. The covariance of attribute 1 and attribute 4 is set to 2, which makes these attributes positively correlated. The variances are set to be equal for each attribute;
• DataSet2 has the same clusters as DataSet1, but the number of attributes differs. Ten attributes are selected to set their covariances. The variances are set to be equal for each attribute;
• DataSet3 has the same attributes as DataSet2, but the clusters are different. The variances are set to be equal in two clusters. Ten attributes are selected to set their covariances;
• DataSet4 has the largest numbers of attributes and clusters. Twenty attributes are selected to set their covariances in seven clusters, and all attributes are given covariances in one cluster. Half of the clusters are set to one set of variances and the other half to another.

Table 1. Summary of the synthetic datasets.

Dataset     Attributes   Clusters   Samples
DataSet1    6            2          1000
DataSet2    20           2          1000
DataSet3    20           4          1000
DataSet4    40           8          1000

We implemented 100 runs of each algorithm on each dataset and set \theta = 1.5. The average clustering accuracy reported in Table 2 reflects the overall performance of each clustering algorithm, and the stability of each algorithm's clustering performance can be judged from the listed variance: the smaller the variance of the clustering accuracy, the better the stability. From Table 2, we can see that as the number of related attributes increases, the clustering accuracy of SKSCC becomes significantly higher than that of the other algorithms. This is because SKSCC employs a "kernel" operation and takes the relationships between attributes into consideration.
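The synthetic-data pipeline (correlated Gaussians, then equal-width discretization) can be mimicked without MATLAB. A toy two-attribute sketch, where rho, bins, and the seed are illustrative parameters rather than the paper's settings:

```python
import random

def synth_categorical(n=1000, rho=-0.8, bins=4, seed=0):
    """Draw two correlated standard-normal attributes (correlation rho),
    then discretize each column by equal-width binning into `bins`
    categories labelled 0..bins-1."""
    rng = random.Random(seed)
    raw = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x1 = z1
        x2 = rho * z1 + (1 - rho ** 2) ** 0.5 * z2  # Corr(x1, x2) = rho
        raw.append((x1, x2))
    cat_cols = []
    for col in zip(*raw):
        lo, hi = min(col), max(col)
        width = (hi - lo) / bins or 1.0  # guard against a constant column
        cat_cols.append([min(int((v - lo) / width), bins - 1) for v in col])
    return list(zip(*cat_cols))  # n tuples of categorical codes
```

Negative rho reproduces the negative-correlation setup described for DataSet1; extending to more attributes would require a full covariance factorization (e.g., Cholesky).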

Analysis of Real-World Data and Results
In this part of the experiments, we set out to test and verify the performance of SKSCC in real-world datasets. We compared the SKSCC algorithm with three other algorithms: the original k-modes algorithm (k-mode), the weighting algorithm (WKM), and the mixed weighting algorithm (MWKM).

Real-World Datasets
To carry out the experiments, we obtained 10 datasets from the University of California Irvine (UCI) Machine Learning Repository [7]. Table 3 lists the details of these 10 datasets. The Breastcancer, Vote, Mushroom, and Adult+stretch datasets have the same number of clusters, but the Mushroom dataset has the most samples and the Adult+stretch dataset the fewest. The Balance and Splice datasets each have three clusters, but the dimensionality of Splice is higher. The Soybeansmall and Car datasets each have four clusters but different attributes and samples. Dermatology and Zoo are multi-cluster datasets. Because the initial cluster centers can affect the results, we randomly selected 100 sets of initial centers, and all of the algorithms used the same initial centers in each experiment. We implemented 100 runs of each algorithm on each dataset and set \theta = 1.5.
The average values and errors of the F-score and accuracy are presented in Table 4. The results show that our proposed method, SKSCC, achieved the best performance on most of the datasets in the comparative experiments. Because the k-mode [17], WKM [22], and MWKM [23] algorithms are all based on the mode-type category theory, they easily converge to a local minimum of the clustering objective, which limits their applicability. Nevertheless, WKM achieved good results on the Car and Splice datasets, and MWKM achieved high accuracy on the Dermatology dataset. Figure 2 shows the distribution of clustering accuracy over 100 runs of each algorithm; the abscissa indexes each algorithm's runs, and the ordinate is the F-score of each clustering result. SKSCC shows the smallest fluctuation, and hence the best stability, among all the algorithms, although WKM has the best average F-score on the Splice and Car datasets and MWKM has the best average F-score on the Dermatology dataset. The clustering results of the k-mode algorithms vary considerably because they consider only the mode in the clustering process, which makes it easy to fall into a local optimum, and because the initial cluster centers are k randomly selected objects; this is reflected in the standard deviation of the average precision. Because SKSCC quantizes the mode, it avoids the above-mentioned problems and performs more stably than the other algorithms.
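Clustering accuracy of the kind averaged in Table 4 is conventionally computed by matching predicted cluster labels to ground-truth classes. A sketch of such best-match accuracy via the Hungarian assignment (our illustration; the paper does not spell out its implementation), with the per-run scores then summarized by mean and variance:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Best-match clustering accuracy: find the label permutation that
    maximizes agreement with the ground truth (Hungarian assignment)."""
    k = int(max(y_true.max(), y_pred.max())) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1                            # contingency table
    rows, cols = linear_sum_assignment(-cost)      # negate to maximize matches
    return cost[rows, cols].sum() / len(y_true)

# per-run accuracies are then summarized by their mean and variance
accs = [clustering_accuracy(np.array([0, 0, 1, 1]), np.array([1, 1, 0, p]))
        for p in (0, 1)]
print(np.mean(accs), np.var(accs))
```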

Feature Weighting Results
Our SKSCC approach also has a feature selection effect. Using the Breastcancer dataset as an example, Figure 3 shows the attribute weights generated by the MWKM and SKSCC algorithms. The k-mode and WKM algorithms are not shown, because the former does not weight its features and the latter calculates weights from mode frequencies, similarly to the MWKM algorithm. From Figure 3, we can see that for SKSCC, A1 and A9 acquire the largest and smallest weights, respectively, for the benign class, whereas the MWKM algorithm produced the opposite result. To test the rationality of SKSCC's feature weighting, we removed the A1 and A9 features from the original Breastcancer data to form two reduced datasets. The F-score values of the different clustering algorithms on the Breastcancer dataset with the original and reduced feature sets are shown in Figure 4. For all the algorithms, the reduced dataset with the A9 feature removed achieved the highest F-score values, while the reduced dataset with the A1 feature removed showed decreased F-score values. These results indicate that our SKSCC algorithm, with its non-linear similarity measurement that considers the relationships between attributes, does a better job than the other algorithms.
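The ablation described above can be sketched as follows; `cluster_fn` and `fscore_fn` are hypothetical stand-ins for the clustering algorithm and the F-score measure, and the function name is ours:

```python
import numpy as np

def ablation_fscores(data, labels, cluster_fn, fscore_fn, cols=(0, 8)):
    """Drop one attribute column at a time (A1 -> column 0, A9 -> column 8
    on Breastcancer), re-cluster the reduced data, and collect F-scores."""
    results = {"full": fscore_fn(labels, cluster_fn(data))}
    for c in cols:
        reduced = np.delete(data, c, axis=1)  # dataset without attribute A{c+1}
        results[f"minus_A{c + 1}"] = fscore_fn(labels, cluster_fn(reduced))
    return results
```

Comparing `results["full"]` against the reduced variants shows whether a highly weighted feature actually carries clustering-relevant information.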

Time Consumption
This paper uses the logarithm of the average clustering time to compare the actual average times. The ordinate represents the average time (in ms) of each algorithm running on the real-world datasets. It can be seen from Figure 5 that the k-mode, WKM, and MWKM algorithms have high clustering efficiency, which is one of the advantages of mode-based clustering algorithms: because only the mode of each categorical attribute needs to be considered, the statistical information of the other categorical symbols can be ignored, which greatly reduces the algorithms' clustering times.
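A minimal sketch of how such log-average timings can be collected (our illustration, not the paper's benchmarking code):

```python
import math
import time

def timed_runs(fn, n_runs=100):
    """Average wall-clock time of fn over n_runs, in milliseconds, plus its
    base-10 logarithm (used to compare fast and slow algorithms on one axis)."""
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    avg_ms = (time.perf_counter() - start) / n_runs * 1000.0
    return avg_ms, math.log10(avg_ms)

avg_ms, log_ms = timed_runs(lambda: sum(range(1000)), n_runs=10)
print(avg_ms, log_ms)
```

Plotting the logarithm keeps millisecond-scale mode-based methods and slower kernel methods readable on the same axis.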

Conclusions
Kernel clustering of categorical data is a vital direction in application research. In view of the problems with current methods, such as assuming that all features are independent, treating all attributes as equally important, and the difficulty of finding an optimal solution, this paper proposes a novel kernel clustering approach for categorical data: a self-expressive kernel subspace clustering algorithm for categorical data (SKSCC). We first define a kernel function for self-expressive kernel density estimation (SKDE), in which each attribute has its own bandwidth that can be calculated from the data themselves. We also propose a novel non-linear similarity measurement method and an efficient non-linear optimization method (Theorem 1) to solve the objective function of the kernel clustering. Finally, the SKSCC algorithm is presented for categorical data. Our method not only considers the relationships between attributes in a non-linear space but also gives each attribute a feature weight to measure its degree of correlation during the algorithmic process. The experimental results indicate that the proposed algorithm outperforms the other algorithms on the synthetic and UCI datasets.
There are many directions of interest for future exploration. We will extend our approach to other kernel functions and test its performance on more datasets of various types. Our efforts will also be directed at combining our method with deep learning to estimate the parameters adaptively.