Article

CDE++: Learning Categorical Data Embedding by Enhancing Heterogeneous Feature Value Coupling Relationships

College of Computer, National University of Defense Technology, Changsha 410000, China
*
Author to whom correspondence should be addressed.
Entropy 2020, 22(4), 391; https://doi.org/10.3390/e22040391
Submission received: 4 March 2020 / Revised: 21 March 2020 / Accepted: 27 March 2020 / Published: 29 March 2020
(This article belongs to the Special Issue Information Theoretic Feature Selection Methods for Big Data)

Abstract

Categorical data are ubiquitous in machine learning tasks, and the representation of categorical data plays an important role in learning performance. The heterogeneous coupling relationships between features and feature values reflect the characteristics of real-world categorical data, and they need to be captured in the representations. This paper proposes an enhanced categorical data embedding method, CDE++, which captures the heterogeneous feature value coupling relationships in the representations. Based on information theory and the hierarchical couplings defined in our previous work CDE (Categorical Data Embedding by learning hierarchical value coupling), CDE++ adopts mutual information and marginal entropy to capture feature couplings and designs a hybrid clustering strategy to capture multiple types of feature value clusters. Moreover, an Autoencoder is used to learn non-linear couplings between features and value clusters. The categorical data embeddings generated by CDE++ are low-dimensional numerical vectors which can be directly applied to clustering and classification, where they achieve the best performance compared with other categorical representation learning methods. Parameter sensitivity and scalability tests are also conducted to demonstrate the superiority of CDE++.

1. Introduction

Categorical data with finite, unordered feature values are ubiquitous in machine learning tasks, such as clustering [1,2] and classification [3,4]. Most machine learning algorithms, such as k-means and SVM, are built for numerical data based on algebraic operations and cannot be used directly on categorical data. These algebraic machine learning algorithms become applicable to categorical data only if we embed the categorical data into a numerical vector space. However, learning numerical representations of categorical data is not a trivial task, since the intrinsic characteristics of categorical data need to be captured in the embeddings.
As stated in [5], the hierarchical coupling relationship (i.e., correlation and dependency) between feature values in categorical data is a crucial characteristic which should be mined thoroughly. The sophisticated couplings between feature values also reflect the correlations between features. Take the simple dataset in Table 1 as an example. Intuitively, the value (short for feature value) Female of feature Gender is highly coupled with the value Liberal arts of feature Major. Similarly, the value Engineering of feature Major is strongly coupled with the value Programmer of feature Occupation. Thus, the relation between features Gender and Major can be expressed by a semantic cluster, i.e., {Female, Liberal arts}, and that between Major and Occupation by {Engineering, Programmer}. These value clusters, which may contain multiple values, reflect the heterogeneous couplings in categorical data. Moreover, the feature value clusters are also coupled with each other, both within the same granularity and across granularities. These high-level couplings are heterogeneous, and both linear and nonlinear relationships exist among them.
For most learning tasks, the more relevant information (i.e., the hierarchical couplings) the categorical data embedding captures, the better the performance. However, apart from CDE [5], other representation learning methods capture only limited couplings in categorical data, or none at all. Generally, existing methods fall into two categories: embedding-based methods and similarity-based methods. Typical embedding methods, e.g., 1-hot encoding and Inverse Document Frequency (IDF) encoding [6,7], transform categorical data into numerical data directly through some encoding scheme. However, these methods treat features independently and ignore the couplings between feature values. Several similarity-based methods, e.g., ALGO (clustering ALGOrithm), DILCA (DIstance Learning for Categorical Attributes), DM (Distance Metric), and COS (COupled attribute Similarity) [8,9,10,11], do take value couplings into consideration. However, these methods do not account for the intrinsic clusters of feature values or the couplings between clusters, so their representation capacity for categorical data is limited.
Learning the heterogeneous hierarchical couplings in categorical data is not a trivial task, and little work so far has represented hierarchical couplings in categorical data. To our knowledge, our previous work CDE (Categorical Data Embedding) [5] is the first to focus on mining hierarchical couplings and representing categorical data. Compared with other existing representation methods, it achieves relatively better performance. However, CDE can only capture homogeneous value clusters through a single clustering strategy and linear correlations between value clusters through principal component analysis, which limits its performance on complex categorical data.
To address the above issues, we propose an enhanced Categorical Data Embedding method, CDE++, which can capture heterogeneous feature value relationships in categorical data. In the value coupling learning phase, we use mutual information and marginal entropy to learn the interactions of features and feature values. To learn the value cluster couplings, we design a hybrid clustering strategy to obtain heterogeneous value clusters from multiple aspects. Then an Autoencoder is applied to the value cluster indicator matrices to obtain lower-dimensional value embeddings which capture complex nonlinear relationships between value clusters. We finally concatenate the value embeddings to generate an expressive object representation. In this way, CDE++ captures the intrinsic data characteristics of categorical data in expressive numerical embeddings which greatly facilitate the subsequent learning tasks.
The contributions of this work are summarized as follows:
  • By analyzing the hierarchical couplings in categorical data, we propose an enhanced Categorical Data Embedding method (CDE++), which can capture heterogeneous feature value coupling relationships at each level.
  • We adopt mutual information and marginal entropy to capture the couplings between features, and design a hybrid clustering strategy to capture more sophisticated and heterogeneous value clusters at the low level. CDE++ combines clustering methods with different metrics, including a density-based method and a hierarchical method, applied with various clustering granularities from different perspectives and semantics.
  • We utilize an Autoencoder to learn the complex and heterogeneous value cluster couplings at the high level. With this, CDE++ maps the original value representation into a low-dimensional space while learning both linear and nonlinear value cluster coupling relationships.
  • We empirically prove the superiority of CDE++ through both supervised and unsupervised learning tasks. Experiment results show that (i) CDE++ significantly outperforms the state-of-the-art methods and their variants in both clustering and classification. (ii) CDE++ is insensitive to its parameters and thus has stable performance. (iii) CDE++ is scalable w.r.t. the number of data instances.
The rest of this paper is organized as follows. Related work is discussed in Section 2. We introduce the proposed method, CDE++, in Section 3. The experiment setup and result analysis are provided in Section 4. We conclude this work in Section 5.

2. Related Work

Existing representation learning algorithms broadly fall into two categories: (i) embedding-based representation, which represents each categorical object by a numerical vector, and (ii) similarity-based representation, which uses an object similarity matrix to represent the categorical objects.

2.1. Embedding-Based Representation

Embedding-based representation, which is the most widely used form of categorical data representation, generates a numerical vector to represent each categorical object. A popular embedding method called 1-hot encoding translates each feature value into a zero-one indicator vector [6]. It first counts the number of values of a feature f_i as |V_i|. Each value of the feature is then represented by a 1-hot |V_i|-dimensional vector, where '1' corresponds to the value's entry and '0' to the others. 1-hot encoding treats each value equally and ignores the intrinsic couplings present in real datasets. Our previous work CDE [5] is a state-of-the-art embedding-based representation which makes use of the coupling relationships in data sets. However, it cannot exploit heterogeneous coupling relationships comprehensively, due to its clustering method and its limited ability to mine nonlinear relationships. CDE uses a dimension reduction method, such as principal component analysis (PCA) [12], to alleviate the curse of dimensionality. IDF encoding is another popular embedding-based representation method [7]; it utilizes the probability-weighted amount of information (PWI), calculated from the value frequency, to represent each value. IDF encoding learns couplings between values only from the occurrence perspective, so its ability to mine the intrinsic coupling relationships of a data set is very limited. The method in [13] shares our goal of transforming categorical data into numerical representations. The main difference from our method is that it requires class labels, whereas our method is unsupervised.
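To make the 1-hot scheme above concrete, the following minimal sketch encodes a single categorical feature; the toy values and the use of pandas are illustrative assumptions, not part of any of the cited methods.

```python
# Minimal sketch of 1-hot encoding for one categorical feature f_i.
# Each of the |V_i| distinct values becomes an indicator column:
# '1' marks the entry matching the value, '0' the others.
import pandas as pd

major = pd.Series(["Engineering", "Science", "Liberal arts", "Engineering"])
one_hot = pd.get_dummies(major, prefix="Major", dtype=int)
print(one_hot)  # 4 objects x |V_i| = 3 indicator columns
```

Such an encoding treats the three values as mutually independent, which is exactly the limitation discussed above.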
Embedding-based representation methods are also used for textual data; effective examples include Skip-gram [14], latent semantic indexing (LSI) [15], latent Dirichlet allocation (LDA) [16], and several of their variants [17,18,19]. The Granular Computing paradigm [20,21,22] provides embedding methods which are powerful especially when dealing with non-conventional data such as graphs, sequences, and text documents. However, embedding representation for textual data differs significantly from that for categorical data, since categorical data are structured whereas textual data are unstructured. Thus, we do not detail these embedding methods here.

2.2. Similarity-Based Representation

Similarity-based representation methods utilize an object similarity matrix to represent categorical data. Several similarity-based methods are inspired by learning the couplings of categorical data. For instance, ALGO [8] takes advantage of the conditional probability between a pair of values to describe value couplings; DILCA [9] learns a context-based distance between feature values to capture feature couplings; DM [10] incorporates frequency probabilities and feature weighting to mine feature couplings; and COS [11] grasps couplings from two aspects, i.e., inter-feature and intra-feature. The above similarity measures learn feature couplings from value pairs. However, they cannot obtain comprehensive couplings, since value clusters and the couplings among them are not considered. Moreover, similarity-based methods are inefficient because they must calculate and store the object similarity matrix.
Several embedding methods utilize a similarity matrix to optimize their embedding representations [23,24]. However, the performance of these embedding methods depends heavily on the underlying similarity measures.

3. Method of CDE++

3.1. Learning Process of CDE++

We aim to rebuild the categorical data set so as to make it more amenable to subsequent learning tasks. Figure 1a illustrates the framework of our enhanced Categorical Data Embedding learning method (CDE++). The gray boxes in Figure 1a represent a series of learning methods, whereas the white boxes represent the intermediate data produced during representation building. Figure 1b shows an example of the data flow in CDE++. The notations are listed in Table 2.
As shown in Figure 1, we first construct the value coupling matrices using the occurrence-based and co-occurrence-based value coupling methods, which capture the interactions between values. Then, we learn value clusters through a hybrid clustering strategy with multiple granularities. After obtaining the value clusters, we learn the couplings between value clusters with a deep neural network, the Autoencoder, to produce the value representation. Finally, we obtain the object representation by concatenating the value vectors for the subsequent learning tasks.

3.2. Preliminaries

Consider a dataset X with n objects, that is, X = {x_1, x_2, ..., x_n}, where each object x_i is described by d categorical features, and the features belong to F = {f_1, f_2, ..., f_d}. Each feature f_i has a finite set of values V_i = {v_{i1}, v_{i2}, ...}. Moreover, the value sets of different features have no intersection, so the total number of feature values is $|V| = \sum_{i=1}^{d} |V_i|$, denoted as m.
To describe how the joint probability of two values v_i and v_j is calculated, we introduce some symbols. Let f_i denote the feature that v_i belongs to, and let $v_x^f$ denote the value taken by object x on feature f. Let p(v_i) denote the probability of v_i, calculated from its occurrence frequency. The joint probability of v_i and v_j is then

$$ p(v_i, v_j) = \frac{\left|\left\{ x \in X : v_x^{f_i} = v_i \wedge v_x^{f_j} = v_j \right\}\right|}{n}. $$
The normalized mutual information, denoted NMI, is a measure of the mutual dependence between two variables [25]. When we observe one variable, the amount of information we obtain about the other can be quantified by NMI. Accordingly, the relation between two features f_a and f_b can be defined as

$$ \rho(f_a, f_b) = \frac{2\, I(f_a, f_b)}{H(f_a) + H(f_b)}, $$
where I(f_a, f_b) is the relative entropy between the joint distribution and the product of the marginal distributions, written as

$$ I(f_a, f_b) = \sum_{v_i \in V_{f_a}} \sum_{v_j \in V_{f_b}} p(v_i, v_j) \log \frac{p(v_i, v_j)}{p(v_i)\, p(v_j)}. $$
H(f_a) and H(f_b) are the marginal entropies of features f_a and f_b, respectively. The marginal entropy of a feature f is

$$ H(f) = -\sum_{v_i \in V_f} p(v_i) \log p(v_i), \quad f \in \{f_a, f_b\}. $$
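As a concrete reference, the quantities above can be computed directly from a table of categorical columns. The following is a minimal sketch assuming the data live in a pandas DataFrame; the column names and toy values are hypothetical.

```python
# Sketch: normalized mutual information rho(f_a, f_b) between two categorical
# features, following the definitions above. Column names are hypothetical.
import numpy as np
import pandas as pd

def marginal_entropy(col: pd.Series) -> float:
    p = col.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log(p)).sum())

def nmi(df: pd.DataFrame, fa: str, fb: str) -> float:
    n = len(df)
    joint = df.groupby([fa, fb]).size() / n            # p(v_i, v_j)
    pa = df[fa].value_counts(normalize=True)            # p(v_i)
    pb = df[fb].value_counts(normalize=True)            # p(v_j)
    I = sum(pij * np.log(pij / (pa[vi] * pb[vj]))
            for (vi, vj), pij in joint.items())
    return 2.0 * I / (marginal_entropy(df[fa]) + marginal_entropy(df[fb]))

df = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Male", "Female", "Male"],
                   "Major": ["Engineering", "Science", "Liberal arts",
                             "Engineering", "Liberal arts", "Engineering"]})
print(nmi(df, "Gender", "Major"))  # larger values indicate stronger coupling
```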

3.3. Learning Value Couplings

The value couplings are learned to reflect the intrinsic relationships between feature values. We follow the approach used in our previous work [5], which has proved effective and intuitive. The relation between values has two aspects: on the one hand, the occurrence frequency of one value is influenced by other values; on the other hand, one value can be influenced by a paired value because they co-occur in the same object. To capture the value couplings based on occurrence and co-occurrence, two coupling functions and their corresponding relation matrices (of size m × m) are constructed, respectively.
The occurrence-based value coupling function is $\xi_o(v_i, v_j) = \rho(f_i, f_j) \times \frac{p(v_j)}{p(v_i)}$, which represents how the occurrence frequency of v_i is influenced by v_j. In this function, the NMI of the two features acts as a weight. With this coupling function, the occurrence-based relationship matrix M_o is constructed as

$$ M_o = \begin{bmatrix} \xi_o(v_1, v_1) & \cdots & \xi_o(v_1, v_m) \\ \vdots & \ddots & \vdots \\ \xi_o(v_m, v_1) & \cdots & \xi_o(v_m, v_m) \end{bmatrix}. $$
The co-occurrence-based value coupling function is $\xi_c(v_i, v_j) = \frac{p(v_i, v_j)}{p(v_i)}$, which indicates how the co-occurrence frequency of value v_i is influenced by value v_j. Note that f_i and f_j are never equal here, since two values belonging to the same feature cannot co-occur in one object. The co-occurrence-based relationship matrix M_c is thus designed as follows:

$$ M_c = \begin{bmatrix} \xi_c(v_1, v_1) & \cdots & \xi_c(v_1, v_m) \\ \vdots & \ddots & \vdots \\ \xi_c(v_m, v_1) & \cdots & \xi_c(v_m, v_m) \end{bmatrix}. $$
The two matrices can be treated as new representations of the value couplings based on occurrence and co-occurrence, respectively, and they serve as inputs to the following value clustering.
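A compact sketch of how the two coupling matrices could be assembled is given below; it reuses the nmi helper sketched earlier, and the bookkeeping of (feature, value) pairs is a simplified assumption.

```python
# Sketch: building the occurrence-based (M_o) and co-occurrence-based (M_c)
# value coupling matrices from a DataFrame of categorical columns.
import numpy as np
import pandas as pd

def coupling_matrices(df: pd.DataFrame):
    n = len(df)
    values = [(f, v) for f in df.columns for v in df[f].unique()]
    m = len(values)
    p = {fv: float((df[fv[0]] == fv[1]).mean()) for fv in values}   # p(v)
    M_o, M_c = np.zeros((m, m)), np.zeros((m, m))
    for i, (fi, vi) in enumerate(values):
        for j, (fj, vj) in enumerate(values):
            rho = 1.0 if fi == fj else nmi(df, fi, fj)               # NMI weight
            M_o[i, j] = rho * p[(fj, vj)] / p[(fi, vi)]              # xi_o
            if fi != fj:                                             # xi_c
                p_ij = float(((df[fi] == vi) & (df[fj] == vj)).mean())
                M_c[i, j] = p_ij / p[(fi, vi)]
    return M_o, M_c, values
```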

3.4. Hybrid Value Clustering

To capture value clusters from different perspectives and semantics, we cluster the feature values at different granularities and use the new representations (M_o, M_c) as the input of the clustering algorithms. To make the clustering results more robust and reflect the data characteristics more precisely, we adopt a hybrid clustering strategy, which combines the clustering results of DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and HC (Hierarchical Clustering).
The motivation for using the hybrid clustering strategy is as follows: (i) The metric of DBSCAN is density-based, whereas HC, like K-means, is partition-based. By combining the clustering results of the two methods, we obtain comprehensive value clusters, which is crucial for capturing the intrinsic data characteristics. (ii) DBSCAN performs well for both convex and non-convex data sets, whereas K-means is not suitable for non-convex data sets; HC can also handle the non-spherical datasets that K-means cannot. (iii) DBSCAN is not sensitive to noisy points, which means it is stable. Consequently, our hybrid clustering strategy is suitable for the majority of data sets while producing better clustering results.
DBSCAN takes a parameter pair τ(eps, MinPts), where eps is the maximum radius of the circles centered on cluster cores and MinPts is the minimum number of objects within such a circle. HC has only one parameter K, the number of clusters, like K-means. Therefore, to cluster at different granularities, we set parameter sets {τ_1, τ_2, ..., τ_o} and {τ_1, τ_2, ..., τ_c} for clustering M_o and M_c with DBSCAN, respectively. Likewise, we set parameter sets {k_1, k_2, ..., k_o} and {k_1, k_2, ..., k_c} for clustering with HC.
Parameter selection. In HC clustering, the strategy for choosing K is shown in Algorithm 1. Instead of fixing a value, we use a proportion factor ε to decide the maximum cluster number, as shown in Steps 3-12 of Algorithm 1. We remove the tiny clusters that contain only one value from the indicator matrix. When the number of removed clusters exceeds kε, we stop increasing K, whose initial value is 2. In DBSCAN clustering, for a specific τ(eps, MinPts), the parameters eps and MinPts are selected based on the k-distance graph. For a given k, the k-distance function maps each point to the distance to its k-th nearest neighbor. We sort the points of the clustering database in descending order of their k-distance values, set eps to the first point in the first "valley" of the sorted k-distance graph, and set MinPts to k. The value of k is the same as the parameter K of HC. The parameter selection follows [26].
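The eps selection from the sorted k-distance graph can be sketched as follows; the "valley" detection is simplified here to a largest-drop heuristic, which is an assumption rather than the exact procedure of [26].

```python
# Sketch: choosing DBSCAN's eps from the sorted k-distance graph.
# The "valley" detection is a simple largest-drop heuristic (an assumption).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def choose_eps(M: np.ndarray, k: int) -> float:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(M)   # +1 accounts for the point itself
    dists, _ = nn.kneighbors(M)
    k_dist = np.sort(dists[:, -1])[::-1]              # k-distances, descending
    drop = np.diff(k_dist)                            # negative steps between points
    return float(k_dist[np.argmin(drop) + 1])         # value right after the steepest drop
```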
After clustering, we obtain four clustering indicator matrices that represent the clustering results. The clustering indicator matrix of (M_o, DBSCAN) is denoted C_od, whose size is m × o. Likewise, the other indicator matrices {C_oh, C_cd, C_ch} have sizes m × c, m × o, and m × c, respectively. Finally, we concatenate the four indicator matrices into one indicator matrix, denoted C, which contains the comprehensive information of the value clusters. Similar to M_o and M_c, C can also be regarded as a new representation of the feature values, based on value couplings and value clusters.
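A minimal sketch of this hybrid clustering step is shown below, using scikit-learn's DBSCAN and AgglomerativeClustering as stand-ins for DBSCAN and HC; the parameter grids are illustrative assumptions, and pruning of one-value clusters (as in Algorithm 1) is omitted for brevity.

```python
# Sketch: hybrid value clustering on M_o and M_c with DBSCAN and hierarchical
# clustering (HC), turning each result into a 0/1 indicator matrix and
# concatenating them into the cluster indicator matrix C.
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

def indicator(labels: np.ndarray) -> np.ndarray:
    """One column per cluster; row i is 1 if value i belongs to that cluster.
    DBSCAN noise points (label -1) get no column."""
    cols = [(labels == c).astype(float) for c in sorted(set(labels)) if c != -1]
    return np.stack(cols, axis=1) if cols else np.empty((len(labels), 0))

def hybrid_cluster(M_o, M_c, eps_grid=(0.5, 1.0), k_grid=(2, 4, 8)):
    blocks = []
    for M in (M_o, M_c):
        for eps in eps_grid:                      # density-based granularities
            labels = DBSCAN(eps=eps, min_samples=2).fit_predict(M)
            blocks.append(indicator(labels))
        for k in k_grid:                          # hierarchical granularities
            labels = AgglomerativeClustering(n_clusters=k).fit_predict(M)
            blocks.append(indicator(labels))
    return np.concatenate(blocks, axis=1)         # cluster indicator matrix C
```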

3.5. Embedding Values by Autoencoder

Deep Neural Networks (DNNs) have attracted great attention in machine learning because of their ability in feature extraction. Each hidden layer of a DNN is capable of feature learning; it is a self-learning process that requires no prior knowledge.
After constructing the value cluster indicator matrix C, which contains comprehensive information, we further learn the couplings between the value clusters. Meanwhile, we need to build a concise but meaningful value representation. It is natural to use a DNN to learn the value cluster couplings, and we use an Autoencoder to do so in the unsupervised setting. The Encoder and Decoder functions are:
$$ \mathrm{Encoder}:\ code = f(x), $$
$$ \mathrm{Decoder}:\ x' = g(code) = g(f(x)). $$
The Encoder learns a low-dimensional representation, code, of the input X. Each layer of the Encoder learns features and feature couplings of the input X; therefore, code carries the information of X. The Decoder reconstructs X from its input, i.e., code. Training the Autoencoder amounts to minimizing the loss function Loss[x, g(f(x))]. After training, code contains the feature couplings of X and conveys information similar to X.
The Autoencoder makes it possible to capture the heterogeneous value cluster couplings and to obtain a relatively low-dimensional value representation. In our method, we train the Autoencoder with the value cluster indicator matrix C as input, and then use the Encoder to compute a new value representation matrix V_new of size m × q. The column size q is determined by the total number of value clusters (denoted by v_c) and the hidden factor λ, which is discussed in Section 4.5. The new value representation V_new conveys the information of the value clusters in C as well as the cluster couplings, and is thus a concise but meaningful value representation.
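The value-embedding step might look like the following minimal sketch with a small symmetric autoencoder; PyTorch, the layer sizes, and the simplified training loop are assumptions for illustration, whereas the paper's experiments use the Matlab Autoencoder configured as in Table 4.

```python
# Sketch: embedding the cluster indicator matrix C with a small autoencoder.
# C is a float tensor of shape (m, v_c); the code dimension q is set from lambda.
import torch
import torch.nn as nn

def embed_values(C: torch.Tensor, lam: int = 10, epochs: int = 1000):
    m, vc = C.shape
    q = max(vc // lam, 2)                  # hidden factor lambda sets the code size
    encoder = nn.Sequential(nn.Linear(vc, 64), nn.ReLU(), nn.Linear(64, q))
    decoder = nn.Sequential(nn.Linear(q, 64), nn.ReLU(), nn.Linear(64, vc))
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.Adam(params, lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(decoder(encoder(C)), C)   # reconstruct C from the code
        loss.backward()
        opt.step()
    return encoder(C).detach()             # V_new: one q-dimensional row per value
```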

3.6. The Embedding for Objects

The final step is to model the object embedding after we obtain the value representation from the Autoencoder. The general form is

$$ x_{new} = \Omega(v_1^x, v_2^x, \ldots, v_d^x), \quad v^x \in V_{new}. $$
The function Ω in Equation (7) can be customized to suit the downstream learning task. We concatenate the new value vectors from V_new to generate the new object embedding.
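For instance, with concatenation as Ω, the object embedding could be assembled as in the sketch below; the value_index lookup mapping each (feature, value) pair to its row of V_new is a hypothetical helper.

```python
# Sketch: building an object embedding by concatenating the embeddings of the
# d values the object takes. value_index maps (feature, value) -> row of V_new.
import numpy as np

def embed_object(obj: dict, V_new: np.ndarray, value_index: dict) -> np.ndarray:
    rows = [V_new[value_index[(f, v)]] for f, v in obj.items()]
    return np.concatenate(rows)            # a (d * q)-dimensional numerical vector

# x = {"Gender": "Female", "Major": "Liberal arts", "Occupation": "Lawyer"}
# x_new = embed_object(x, V_new, value_index)
```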
The main procedure of CDE++ is presented in Algorithm 1. Algorithm 1 has three inputs: the data set X, the factor ε for dropping redundant value clusters, and the hidden factor λ of the Autoencoder. The algorithm consists of four main steps. The first step calculates M_o and M_c with the occurrence-based and co-occurrence-based value coupling functions. Then CDE++ uses the hybrid clustering strategy to cluster values based on M_o and M_c; the parameter ε controls the clustering results and determines when to terminate the clustering process. In the third step, the algorithm uses the Autoencoder to learn the couplings of the value clusters and to generate the concise but meaningful value embedding; the parameter λ is the hidden factor relating the input and output dimensions of the Encoder, i.e., the ratio of dimension compression. Finally, CDE++ embeds the objects in the data set by concatenating the value embeddings.
Algorithm 1 Object Embedding (X, ε, λ)
Input: Dataset X, parameters ε and λ
Output: The new representation X_new of X
1: Generate M_o and M_c
2: Initialize C = ∅
3: for M ∈ {M_o, M_c} do
4:     for clusterer ∈ {DBSCAN, HC} do
5:         Initialize τ(eps, MinPts), k
6:         Initialize r = 0
7:         while r ≤ kε do
8:             C = [C; clusterer(M, τ or k)]
9:             Remove the clusters containing only one value and count them as r
10:        end while
11:    end for
12: end for
13: Train the Autoencoder
14: V_new = Encoder(C, λ)
15: X_new = concatenate values from V_new
16: Return X_new
Complexity Analysis. (1) Generating the value coupling matrices incurs a complexity of O(nd²). (2) Value clustering requires O(m ln m + m²). (3) The complexity of the Autoencoder is O(m · v_c · epoch), where v_c is the total number of value clusters and epoch is the number of training iterations. (4) Generating the numerical object embeddings from the value embeddings costs O(nd). Accordingly, the total complexity of CDE++ is O(nd² + m ln m + m² + m · v_c · epoch + nd). In real data sets, the number of values per feature is generally small, so m² is only slightly larger than d² and is negligible compared with nd². The total number of value clusters v_c is much smaller than m, and epoch is the number of iterations, which is set manually. Therefore, the time complexity of CDE++ is approximately O(nd² + m · v_c · epoch).

4. Experiments and Evaluation

4.1. Experiment Settings

4.1.1. Data Sets

To evaluate the performance of CDE++, fifteen real-world datasets from the UCI machine learning repository (https://archive.ics.uci.edu/ml/datasets.php) are used, as shown in Table 3. These datasets cover multiple areas, e.g., life, physical, game, social, computer, and education. Each data set has a class label used for evaluation and several features described by categorical values.
In the unsupervised K-means task, we use the whole dataset as both the training set and the test set. In the supervised SVM task, we use 75% of each dataset for training and the remaining 25% for testing.
The detailed attributes of the data sets are presented in Table 3, where {n, d, m, |C|} denote the numbers of objects, features, feature values, and ground-truth classes in each data set, respectively.

4.1.2. Baseline

In this test, CDE++ is compared with IDF encoding (denoted "IDF"), DILCA, our previous work (the coupled data embedding, denoted "CDE"), and the widely used 1-hot encoding (denoted "1-HOT"). Moreover, to make a fair comparison, we introduce variants of CDE and 1-HOT by replacing their last step of generating value embeddings with an Autoencoder; these variants are denoted CDE-AE and 1-HOT-AE, respectively. For CDE and its variant, the parameters are set according to the original paper. The parameters of CDE++ are given in Section 4.5. All Autoencoders use the same parameter settings, as shown in Table 4.

4.1.3. Evaluation Methods

The performance of a learning task depends significantly on the data representation: the more expressive the representation, the better the performance. To give a convincing evaluation, we feed the obtained representations into both unsupervised and supervised learning tasks. Without loss of generality, we choose K-means as the representative unsupervised learning task and SVM as the representative supervised learning task.
In K-means clustering, we set the number of clusters K = |C| for each data set. We use the widely used F-score to measure performance: the higher the F-score, the better the K-means clustering performance, and hence the better the object representation. Although the datasets we use are relatively balanced, we adopt the micro version of the F-score, calculated as follows.
$$ precision = \frac{\sum_i TP_i}{\sum_i (TP_i + FP_i)}, \qquad recall = \frac{\sum_i TP_i}{\sum_i (TP_i + FN_i)}, $$
$$ \text{F-score} = \frac{2 \cdot recall \times precision}{recall + precision}, $$
where TP_i, FP_i, and FN_i are the numbers of true positives, false positives, and false negatives for class i.
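For reference, the micro-averaged F-score above can be computed as in the following small sketch; the per-class counts in the example are made up.

```python
# Sketch: micro-averaged precision, recall, and F-score from per-class
# TP/FP/FN counts, matching the formulas above.
def micro_f_score(tp, fp, fn):
    precision = sum(tp) / (sum(tp) + sum(fp))
    recall = sum(tp) / (sum(tp) + sum(fn))
    return 2 * recall * precision / (recall + precision)

# Example with three classes (counts are illustrative only):
print(micro_f_score(tp=[50, 30, 20], fp=[5, 10, 5], fn=[8, 6, 6]))
```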
For SVM classification, we use Accuracy as the performance measure. Likewise, the higher the Accuracy, the better the object representation.
Since the starting points of value clustering are random, we run the proposed CDE++ 10 times and feed the obtained representations into the learning tasks. Each task is repeated 10 times to get a stable result. The reported F-score or Accuracy is the average value over these 100 validations. Therefore, the robustness of evaluation results is guaranteed.

4.1.4. Experimental Environment

All the experiments are conducted on the same workstation.
Hardware environment: MacBook Pro 2016; CPU: Intel Core i7; RAM: 16 GB (Apple Inc., USA).
Software environment: Matlab 2018b (MathWorks Inc., USA).

4.2. Results of Clustering

Table 3 presents the K-means clustering F-scores of the tested methods. On thirteen out of fifteen datasets, CDE++ achieves the best performance, much better than the other embedding methods. On average, CDE++ obtains approximately 16.58%, 14.56%, 9.10%, 10.80%, 13.56%, and 12.50% improvement over IDF, DILCA, CDE, CDE-AE, 1-HOT, and 1-HOT-AE, respectively. CDE outperforms the other state-of-the-art representation methods due to its learning of hierarchical couplings, while CDE++ enhances the capture of heterogeneous value relationships and achieves the best performance.

4.3. Results of Classification

Table 5 reports the Accuracy of SVM using the representations output by IDF, DILCA, CDE, CDE-AE, 1-HOT, 1-HOT-AE, and CDE++. CDE++ performs significantly better than the first four methods and is also better than 1-HOT and 1-HOT-AE. On average, CDE++ obtains approximately 12.76%, 13.55%, 10.38%, 17.3%, 5.8%, and 5.11% improvement over IDF, DILCA, CDE, CDE-AE, 1-HOT, and 1-HOT-AE, respectively. In the supervised learning task, our enhanced CDE++ likewise maintains higher performance than the other methods. Based on these results, CDE++ generalizes to both unsupervised and supervised tasks.

4.4. Ablation Study

To examine whether all components of CDE++ are necessary, we conduct an ablation study; Table 6 shows the settings of the comparison groups. We run the K-means clustering and SVM classification tasks on the output object embeddings. In the settings, (i) and (ii) use DBSCAN and HC for value cluster learning, respectively, whereas (iii) uses both of them; none of (i), (ii), or (iii) learns value cluster couplings. Setting (iv) uses all parts of CDE++.
Table 7 and Table 8 show the K-means clustering and SVM classification performance, respectively. With all parts of CDE++, the two learning tasks obtain the highest F-score and Accuracy. Based on the ablation study, we conclude that no component can be dropped from CDE++ and that the full structure yields better object embeddings.

4.5. Sensitivity Test w.r.t. Parameters ε and λ

We examine the sensitivity of CDE++'s performance w.r.t. ε and λ in this part. The first parameter, ε, controls the dimension of the feature value representation before the Autoencoder, whereas the second parameter, λ, controls the dimension of the feature value representation after the Autoencoder. For the robustness of the test, we select four datasets with different levels of clustering performance (i.e., F-score). Both parameters range over {2, 4, 6, 8, 10, 20, 40}.
To test the sensitivity w.r.t. ε, we first fix λ = 10. Figure 2 and Figure 3 present the dimension of the object representation and the clustering performance under different ε values. The dimension of the object representation is stable when ε ≥ 8, whereas the clustering performance is stable throughout. The reason CDE++ is stable w.r.t. ε is that ε only selects certain granularities of the value coupling clustering; under these granularities, the clusters with only one value have already been dropped. Thus the clustering performance remains stable, and so does CDE++.
Figure 4 and Figure 5 present the dimension of the object representation and the clustering performance under different λ values; here we fix ε = 8. Figure 5 shows that the clustering performance is relatively stable over the whole range of λ. However, the dimension of the object representation decreases as λ increases, as illustrated in Figure 4. λ adjusts the ratio between the output and input dimensions of the Autoencoder and is inversely proportional to the dimension of the output value representation. Within the value range above, although the dimension of the value representation decreases, it conveys similar information by virtue of the Autoencoder, so the clustering performance does not fluctuate sharply.
Based on the sensitivity test results, we can claim that the performance of CDE++ is insensitive to ε and λ. Moreover, we suggest ε = 8 and λ = 10 as a general parameter setting.

4.6. Scalability Test

For the scale-up test w.r.t. data size, we split the largest dataset in our work, Chess, into five subsets whose sizes double successively; the subsets of Chess have six fixed features. Likewise, for the scalability test w.r.t. data dimension, we synthesize data sets by varying the dimension within [20, 320] with a fixed data size (10,000 objects). The feature values of the synthetic data sets are randomly chosen from {0, 1}.
Figure 6 presents the scalability test results of the five embedding methods. As Figure 6 illustrates, the execution time grows modestly as the data set size increases, demonstrating that the execution time of CDE++ is linear in the data size and that CDE++ scales well w.r.t. data set size, whereas DILCA has complexity O(n²d² log d).
1-HOT is the most efficient embedding method since it does not consider the couplings between feature values and simply translates each feature value into a 1-hot vector. The time complexities of CDE++ and CDE before learning cluster couplings are similar; since the Autoencoder neural network is more time-consuming than PCA, the execution time of CDE++ is longer than that of CDE. When we replace the PCA of CDE with an Autoencoder, the execution time increases and becomes even longer than that of CDE++.
Figure 7 shows the execution times of the tested methods with different object dimensions. As the object dimension grows, the execution times of all five methods rise sharply. 1-HOT and 1-HOT-AE are much faster since they are simpler than the other methods, as introduced above. CDE++, CDE, and CDE-AE have similar, higher execution times because their complexities are quadratic in the number of features. Specifically, the execution time of CDE++ on a dataset with 10,000 objects and more than 300 features is about 10 minutes. Thus, the execution time is still acceptable for high-dimensional dataset embedding.

5. Conclusions

This paper proposes an enhanced Categorical Data Embedding method (CDE++), which aims to generate an expressive representation of complex categorical data by capturing heterogeneous feature value coupling relationships. We design a hybrid clustering strategy to capture more sophisticated and heterogeneous value clusters at the low level, and we utilize an Autoencoder to learn the complex and heterogeneous value cluster couplings at the high level. Different from existing representation methods, our work comprehensively captures the intrinsic data characteristics. Experiment results demonstrate that CDE++ is applicable to both supervised and unsupervised learning tasks and significantly outperforms existing state-of-the-art methods, with good scalability and efficiency. Moreover, it is insensitive to its parameters.
Building on CDE++, our future work will consider mixed data (i.e., categorical and continuous data). Meanwhile, CDE++ could be customized for different application requirements to obtain better performance.

Author Contributions

Conceptualization, B.D., S.J. and K.Z.; Data curation, B.D., S.J. and K.Z.; Formal analysis, B.D., S.J. and K.Z.; Investigation, B.D. and S.J.; Methodology, B.D., S.J. and K.Z.; Project administration, K.Z.; Resources, B.D., S.J. and K.Z.; Software, B.D. and S.J.; Supervision, K.Z.; Validation, B.D., S.J. and K.Z.; Writing–original draft, B.D. and S.J.; Writing–review and editing, B.D., S.J. and K.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by NSF 61902405.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Iam-On, N.; Boongeon, T.; Garrett, S.; Price, C. A link-based cluster ensemble approach for categorical data clustering. IEEE Trans. Knowl. Data Eng. 2010, 24, 413–425. [Google Scholar] [CrossRef]
  2. Jian, S.; Pang, G.; Cao, L.; Lu, K.; Gao, H. Cure: Flexible categorical data representation by hierarchical coupling learning. IEEE Trans. Knowl. Data Eng. 2018, 31, 853–866. [Google Scholar] [CrossRef]
  3. Chen, L.; Guo, G. Nearest neighbor classification of categorical data by attributes weighting. Expert Syst. Appl. 2015, 42, 3142–3149. [Google Scholar] [CrossRef]
  4. Alamuri, M.; Surampudi, B.R.; Negi, A. A survey of distance/similarity measures for categorical data. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 1907–1914. [Google Scholar]
  5. Jian, S.; Cao, L.; Pang, G.; Lu, K.; Gao, H. Embedding-based Representation of Categorical Data by Hierarchical Value Coupling Learning. In Proceedings of the International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 1937–1943. [Google Scholar]
  6. Pang, G.; Kai, M.T.; Albrecht, D.; Jin, H. ZERO++: Harnessing the Power of Zero Appearances to Detect Anomalies in Large-Scale Data Sets. J. Artif. Intell. Res. 2016, 57, 593–620. [Google Scholar] [CrossRef]
  7. Aizawa, A. An information-theoretic perspective of tf–idf measures. Inf. Process. Manag. 2003, 39, 45–65. [Google Scholar] [CrossRef]
  8. Ahmad, A.; Dey, L. A k-mean clustering algorithm for mixed numeric and categorical data. Data Knowl. Eng. 2007, 63, 503–527. [Google Scholar] [CrossRef]
  9. Ienco, D.; Pensa, R.G.; Meo, R. From context to distance: Learning dissimilarity for categorical data clustering. ACM Trans. Knowl. Discov. Data (Tkdd) 2012, 6, 1. [Google Scholar] [CrossRef]
  10. Jia, H.; Cheung, Y.m.; Liu, J. A new distance metric for unsupervised learning of categorical data. IEEE Trans. Neural Networks Learn. Syst. 2015, 27, 1065–1079. [Google Scholar]
  11. Wang, C.; Dong, X.; Zhou, F.; Cao, L.; Chi, C.H. Coupled attribute similarity learning on categorical data. IEEE Trans. Neural Networks Learn. Syst. 2014, 26, 781–797. [Google Scholar] [CrossRef]
  12. Jolliffe, I. Principal Component Analysis; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  13. Zhang, K.; Wang, Q.; Chen, Z.; Marsic, I.; Kumar, V.; Jiang, G.; Zhang, J. From categorical to numerical: Multiple transitive distance learning and embedding. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, 30 April–2 May 2015; pp. 46–54. [Google Scholar]
  14. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  15. Deerwester, S.; Dumais, S.T.; Furnas, G.W.; Landauer, T.K.; Harshman, R. Indexing by latent semantic analysis. J. Am. Soc. Inf. Sci. 1990, 41, 391–407. [Google Scholar] [CrossRef]
  16. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
  17. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. arXiv 2013, arXiv:1310.4546. [Google Scholar]
  18. Hofmann, T. Probabilistic latent semantic indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, CA, USA, 15–19 August 1999; pp. 211–218. [Google Scholar]
  19. Wilson, A.T.; Chew, P.A. Term weighting schemes for latent dirichlet allocation. In Proceedings of the Human language technologies: The 2010 annual conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, CA, USA, 1–6 June 2010; pp. 465–473. [Google Scholar]
  20. Martino, A.; Giuliani, A.; Rizzi, A. Granular computing techniques for bioinformatics pattern recognition problems in non-metric spaces. In Computational Intelligence for Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2018; pp. 53–81. [Google Scholar]
  21. Martino, A.; Giuliani, A.; Todde, V.; Bizzarri, M.; Rizzi, A. Metabolic networks classification and knowledge discovery by information granulation. Comput. Biol. Chem. 2020, 84, 107187. [Google Scholar] [CrossRef]
  22. Martino, A.; Giuliani, A.; Rizzi, A. (Hyper) Graph Embedding and Classification via Simplicial Complexes. Algorithms 2019, 12, 223. [Google Scholar] [CrossRef] [Green Version]
  23. Cox, T.F.; Cox, M.A. Multidimensional Scaling; Chapman and Hall/CRC: London, UK, 2000. [Google Scholar]
  24. Hinton, G.E.; Roweis, S.T. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems; MIT Press: Vancouver, BC, Canada, 2003; pp. 857–864. [Google Scholar]
  25. Estévez, P.A.; Tesmer, M.; Perez, C.A.; Zurada, J.M. Normalized mutual information feature selection. IEEE Trans. Neural Netw. 2009, 20, 189–201. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  26. Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. Kdd 1996, 96, 226–231. [Google Scholar]
Figure 1. Overview of CDE++.
Figure 2. Sensitivity test w.r.t. parameter ε in terms of Representation Dimension.
Figure 3. Sensitivity test w.r.t. parameter ε in terms of F-score.
Figure 4. Sensitivity test w.r.t. hidden factor λ in terms of Representation Dimension.
Figure 5. Sensitivity test w.r.t. hidden factor λ in terms of F-score.
Figure 6. Scalability test w.r.t. Data Size in terms of Execution Time.
Figure 7. Scalability test w.r.t. Data Dimension in terms of Execution Time.
Table 1. A simple example to explain the value coupling relationships.

Name     Gender    Major           Occupation
John     Male      Engineering     Programmer
Tony     Male      Science         Analyst
Alisa    Female    Liberal arts    Lawyer
Ben      Male      Engineering     Programmer
Abby     Female    Liberal arts    Marketing Manager
James    Male      Engineering     Technician
Table 2. The descriptions of the notations in CDE++.

Symbol: Description
X, x: The dataset and a specific object.
F, f: The feature set of the dataset and a specific feature.
f_i: The feature that value v_i belongs to.
V, v: The whole feature value set of the dataset and a specific feature value.
V_i: The feature value set of feature f_i.
v_x^f: The value taken by object x on feature f.
n: The number of objects in the dataset.
d: The number of features in the dataset.
m: The number of feature values in the dataset.
|C|: The number of ground-truth classes in the dataset.
p(v): The probability of v, calculated from its occurrence frequency.
p(v_i, v_j): The joint probability of v_i and v_j.
ρ(f_a, f_b): The relation between two features f_a and f_b.
I(f_a, f_b): The relative entropy between the joint distribution and the marginal distributions of f_a and f_b.
H(f): The marginal entropy of feature f.
ξ_o: The occurrence-based value coupling function.
ξ_c: The co-occurrence-based value coupling function.
M_o: The occurrence-based relationship matrix.
M_c: The co-occurrence-based relationship matrix.
τ(eps, MinPts): The parameter pair of DBSCAN.
K: The number-of-clusters parameter of HC.
C: The cluster indicator matrix.
v_c: The dimension of the cluster indicator matrix.
ε: The factor for dropping redundant value clusters.
λ: The hidden factor of the Autoencoder.
q: The dimension of a value embedding after the Autoencoder.
Ω: The general function to generate new object embeddings.
Table 3. The Dataset attributes and F-score results of Clustering by Inverse Document Frequency (IDF), DILCA, Categorical Data Embedding (CDE), CDE-AE, 1-HOT, 1-HOT-AE, and our method CDE++ on 15 Data Sets. The best performance for each data set is boldfaced. The Data Sets are sorted in descending order of F-score.

Datasets         n       d    m    |C|   IDF    DILCA  CDE    CDE-AE  1-HOT  1-HOT-AE  CDE++
Zoo              101     17   43   7     0.827  0.746  0.833  0.79    0.826  0.871     0.879
Iris             150     4    123  3     0.59   0.632  0.717  0.667   0.585  0.467     0.8
Hepatitis        155     19   360  2     0.535  0.679  0.672  0.687   0.677  0.684     0.755
Tic-tac-toe      958     9    27   2     0.536  0.542  0.557  0.559   0.578  0.578     0.659
Annealing        798     38   317  5     0.512  0.534  0.577  0.547   0.528  0.588     0.654
Bloger           100     5    15   2     0.484  0.492  0.546  0.539   0.53   0.51      0.61
Balance-scale    625     4    20   3     0.463  0.497  0.514  0.499   0.462  0.419     0.6
Lymphography     148     18   59   4     0.556  0.513  0.528  0.489   0.494  0.493     0.568
Hayes-roth       132     4    15   3     0.48   0.478  0.495  0.484   0.348  0.341     0.545
Teaching A.E.    151     5    101  3     0.395  0.41   0.432  0.44    0.428  0.444     0.503
Student A.P.     131     21   75   3     0.429  0.423  0.449  0.433   0.445  0.466     0.475
Lenses           24      4    9    3     0.442  0.471  0.458  0.467   0.546  0.583     0.458
Nursery          12,960  8    27   5     0.306  0.294  0.32   0.356   0.283  0.382     0.325
Primary-tumor    339     17   37   21    0.218  0.224  0.285  0.29    0.291  0.289     0.299
Chess            28,056  6    40   18    0.154  0.16   0.165  0.16    0.157  0.151     0.174
Average                                  0.462  0.473  0.503  0.494   0.479  0.484     0.554
Table 4. Basic Parameters of Autoencoder.

Architecture: A-64-code dimension-64-A
MaxEpochs: 1000
Loss function: MSE with L2 and Sparsity Regularizers
Training algorithm: Scaled Conjugate Gradient Descent
The code dimension in the architecture is determined by the hidden factor λ and the original data dimension.
Table 5. The Accuracy results of Classifying by IDF, DILCA, CDE, CDE-AE, 1-HOT, 1-HOT-AE, and our method CDE++ on 15 Data Sets. The best performance for each data set is boldfaced. The Data Sets are sorted in descending order of Accuracy.

Datasets         IDF    DILCA  CDE    CDE-AE  1-HOT  1-HOT-AE  CDE++
Zoo              0.937  0.946  0.97   0.944   1      1         1
Lenses           0.826  0.793  0.833  0.714   0.8    0.811     1
Annealing        0.973  0.979  0.985  0.978   0.991  0.989     0.988
Tic-tac-toe      0.894  0.872  0.913  0.735   0.981  0.98      0.984
Balance-scale    0.741  0.713  0.727  0.649   0.968  0.957     0.968
Nursery          0.804  0.729  0.817  0.549   0.937  0.939     0.943
Iris             0.883  0.897  0.893  0.887   0.92   0.927     0.911
Bloger           0.742  0.733  0.739  0.808   0.758  0.852     0.903
Hepatitis        0.829  0.817  0.834  0.811   0.877  0.804     0.894
Hayes-roth       0.754  0.771  0.807  0.763   0.829  0.834     0.861
Lymphography     0.792  0.803  0.819  0.803   0.826  0.834     0.849
Primary-tumor    0.538  0.551  0.577  0.528   0.595  0.61      0.623
Student A.P.     0.518  0.499  0.529  0.519   0.551  0.544     0.615
Teaching A.E.    0.547  0.543  0.576  0.509   0.561  0.585     0.613
Chess            0.184  0.216  0.242  0.215   0.27   0.257     0.413
Average          0.731  0.724  0.751  0.694   0.791  0.795     0.838
Table 6. Ablation Study Settings.

Setting    Learning Value Clusters    Learn Value Cluster Couplings
           DBSCAN    HC               Autoencoder
(i)        ✓         ×                ×
(ii)       ×         ✓                ×
(iii)      ✓         ✓                ×
(iv)       ✓         ✓                ✓
Table 7. F-score Results of Ablation Study based on Clustering task. The best performance for each data set is boldfaced.

Datasets         DBSCAN  HC     DBSCAN+HC  CDE++
Zoo              0.829   0.85   0.854      0.879
Iris             0.453   0.491  0.646      0.8
Hepatitis        0.503   0.617  0.666      0.755
Tic-tac-toe      0.621   0.578  0.573      0.659
Annealing        0.606   0.514  0.544      0.654
Bloger           0.53    0.588  0.572      0.61
Balance-scale    0.464   0.5    0.549      0.6
Lymphography     0.434   0.488  0.484      0.568
Hayes-roth       0.445   0.445  0.441      0.545
Teaching A.E.    0.437   0.419  0.491      0.503
Student A.P.     0.435   0.425  0.446      0.475
Lenses           0.625   0.583  0.583      0.458
Nursery          0.333   0.295  0.273      0.325
Primary-tumor    0.248   0.291  0.288      0.299
Chess            0.163   0.167  0.151      0.174
Average          0.475   0.483  0.504      0.554
Table 8. Accuracy Results of Ablation Study based on Classifying task. The best performance for each data set is boldfaced.

Datasets         DBSCAN  HC     DBSCAN+HC  CDE++
Zoo              0.964   0.97   0.976      1
Lenses           1       0.789  0.811      1
Annealing        0.975   0.93   0.971      0.988
Tic-tac-toe      0.695   0.965  0.977      0.984
Balance-scale    0.971   0.961  0.958      0.968
Nursery          1       0.944  0.943      0.943
Iris             0.833   0.738  0.898      0.911
Bloger           0.839   0.781  0.819      0.903
Hepatitis        0.828   0.838  0.845      0.894
Hayes-roth       0.812   0.837  0.834      0.861
Lymphography     0.828   0.813  0.843      0.849
Primary-tumor    0.625   0.579  0.564      0.623
Student A.P.     0.578   0.527  0.527      0.615
Teaching A.E.    0.565   0.557  0.578      0.613
Chess            0.665   0.461  0.236      0.413
Average          0.812   0.779  0.785      0.838
