Classification of Categorical Data Based on the Chi-Square Dissimilarity and t-SNE

Abstract: The recurrent use of databases with categorical variables in different applications demands new alternatives to identify relevant patterns. Classification is an interesting approach for the recognition of this type of data. However, only a few methods exist for this purpose in the literature, and those techniques focus specifically on kernels, which leads to accuracy problems and high computational cost. For this reason, we propose an identification approach for categorical variables using conventional classifiers (LDC, QDC, KNN, SVM) together with different mapping techniques that increase the separability of the classes. Specifically, we map the initial features (categorical attributes) to another space using the Chi-square (C-S) measure of dissimilarity. Then, we employ t-distributed stochastic neighbor embedding (t-SNE) to reduce the dimensionality of the data to two or three features, allowing a significant reduction of computational times in the learning methods. We evaluate the performance of the proposed approach in terms of accuracy for several experimental configurations and public categorical datasets downloaded from the UCI repository, and we compare it with relevant state-of-the-art methods. Results show that the C-S mapping and t-SNE considerably diminish the computational times of recognition tasks while preserving accuracy. Also, when only the C-S mapping is applied to the datasets, the separability of the classes is enhanced and the performance of the learning algorithms clearly increases.


Introduction
The high demand in the handling of all types of data forces companies, entities, and institutions to find underlying patterns. There are several ways to deal with this issue; in general, they fall under data analysis [1]. Correct data processing requires basic knowledge of the type of database, whose variables can be nominal or quantitative. Nowadays, the algorithms and methodologies applied in data analysis focus on quantitative data, whether for clustering, regression, or classification. The literature reports many works on this type of dataset, such as spectral clustering [2], support vector machines (SVM) [3], Gaussian processes (GP) [4], and ordinary classification methods [5], among others. On the other hand, categorical data have not been widely studied. Therefore, there is a lack of sophisticated learning algorithms for this purpose. Currently, categorical data are mostly recognized with decision trees. However, this method has limitations due to its low robustness, and its performance on validation data is not satisfactory (low generalization capability). Categorical data have a particularity, namely high overlapping, and for this reason the accuracy of automatic recognition is low. One reported alternative [6] binarizes the categorical attributes so they can be used as numeric descriptors in the ordinary K-means algorithm. Nonetheless, this proposal requires handling a great amount of binary points when the datasets have samples with many categories, which increases its computational cost and memory storage. Other methods have been reported, such as the similarity coefficient of Gower [7], dissimilarity measures [8], the PAM algorithm [9], hierarchical clustering [12], statistical fuzzy algorithms [23], and conceptual clustering methods [10]. All of them have limited performance when applied to massive categorical data.
There are also reports on clustering analysis [9,24,25] that discuss the issues of applying clustering methods to categorical data. However, none of these works gives a feasible solution to the existing problems of non-numeric repositories. Their main recommendation is to binarize the data and use binary similarity measures, but memory storage then becomes the main difficulty. The authors of [26] presented a study of distances for heterogeneous data (datasets with mixed qualitative and quantitative variables) based on a supervised framework, with each sample complemented by its class label; however, it is not generalizable to non-labeled databases. Recently, the authors of [27] developed a clustering algorithm that maps a categorical dataset into a Euclidean space. This method reveals the data configuration with a structure-based clustering (SBC) scheme, achieving acceptable results in the identification of groups and classes, and even improving the performance obtained by benchmark approaches for unsupervised learning: K-modes [28], dissimilarity distance [29], and Mkm-nof and Mkm-ndm [30]. The SBC framework has two considerable handicaps: first, its high computational cost; second, its reduced accuracy for high-dimensional datasets.
Many researchers have developed machine learning algorithms, e.g., genetic algorithms (GA) with fuzzy inference [31], artificial neural networks (ANN) [32], self-adaptive methods [33], support vector machines (SVM) [34], learning vector quantization (LVQ) [35], extreme learning machines [36], adaptive stochastic resonance [37], model-based class discrimination (VPMCD) [38], random forests [39], Artificial Bee Colony (ABC) [40], and deep belief networks [41], among others. All of the aforementioned techniques find categorical data difficult to interpret [42], because reducing the number of features (attributes) is a difficult task [43]. Theoretically, the presence of many features offers the opportunity to implement classifiers with better discriminating power. Nevertheless, this is not always true in practice, because not all features are relevant for representing the underlying phenomena of interest. Thus, reducing the number of attributes, or creating new ones, can bring several benefits: lower classifier complexity, reduced over-fitting, more interpretable results, robustness to noise, and improved accuracy of a basic classifier [44,45]. Most dimensionality reduction models are developed for continuous data, which motivates the search for dissimilarity measures that map categorical data to a continuous domain [46]. An example is the dissimilarity measure based on the Chi-square distance, which allows mapping from a discrete space to a continuous one.
Regarding supervised learning, several researchers have proposed interesting frameworks. For example, the authors of [14] introduced an approach based on a sparse weighted naive Bayes classifier; this work was the first attempt to extend sparse regression to categorical variables, with competitive outcomes. The authors of [19] developed a coupled attribute similarity scheme to capture a global picture of the features. Furthermore, [20] presented a method composed of Boolean kernels, whose basic concept is to create human-readable features that ease the extraction of interpretation rules directly from the embedding space. Finally, the research of [21] presented a classifier based on naive possibilistic estimation with a generalized minimum-based algorithm. These relevant works demonstrated that supervised algorithms can be adapted to categorical or qualitative data.

Chi-Square Distance
The Chi-square distance is similar to the Euclidean distance; however, it is a weighted distance and a suitable metric for the analysis of databases with qualitative, categorical, or nominal variables. The Chi-square distance compares the counts of responses from categorical variables with two or more independent features. For two samples $x_i$ and $x_{\tilde{i}}$ of a contingency table, it is computed as

$$\chi(x_i, x_{\tilde{i}}) = \sqrt{\sum_{d=1}^{D} \frac{1}{x_{\cdot d}} \left( \frac{x_{id}}{x_{i\cdot}} - \frac{x_{\tilde{i}d}}{x_{\tilde{i}\cdot}} \right)^2}.$$

Here, $D$ is the number of features or dimensions, $x_{i\cdot}$ denotes the total of row $i$, and $x_{\cdot d}$ the total of column $d$. The Chi-square distance uses a contingency table with the frequency of each attribute. This weighted distance (C-S) allows a better treatment of categorical features, because it improves the separability of the classes and allows an easier grouping or discrimination. However, an important drawback is the increase in dimensionality caused by mapping the data to a dissimilarity space. Therefore, it is necessary to use the t-SNE algorithm to reduce the dimensionality to 2 or 3 attributes. To preserve the structure of the databases, we implemented the C-S metric within the distance function of t-SNE, simultaneously enhancing the separability of the categorical data and reducing the computational times of the learning algorithms [46].
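For concreteness, the following is a minimal Python/NumPy sketch of this pairwise computation (the paper's experiments used Matlab; the function name `chi_square_distance` and the count/one-hot representation of the categorical attributes are illustrative assumptions):

```python
import numpy as np

def chi_square_distance(F):
    """Pairwise chi-square distances between the rows of a
    non-negative frequency matrix F of shape (N, D).
    Columns are assumed to have nonzero totals."""
    F = np.asarray(F, dtype=float)
    row_tot = F.sum(axis=1, keepdims=True)   # row totals x_{i.}
    col_mass = F.sum(axis=0) / F.sum()       # column masses x_{.d}
    P = F / row_tot                          # row profiles
    n = P.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        diff = (P[i] - P) ** 2 / col_mass    # column-weighted squared differences
        D[i] = np.sqrt(diff.sum(axis=1))
    return D

# Toy usage: three samples, three count-encoded attributes.
X = np.array([[2, 0, 1],
              [0, 3, 1],
              [1, 1, 2]])
print(chi_square_distance(X))
```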

T-Distributed Stochastic Neighbor Embedding
t-distributed stochastic neighbor embedding (t-SNE) minimizes the divergence between two distributions: a distribution that measures pairwise similarities of the input objects $X = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{D_1}$ and a distribution that measures pairwise similarities of the corresponding low-dimensional points of the embedding $Y = (y_1, y_2, \ldots, y_N) \in \mathbb{R}^{D_2}$, with $D_1 > D_2$. Suppose a dataset of $N$ input objects and a function $d(x_i, x_j)$ that calculates a distance between a pair of objects, for example, the Euclidean distance $d(x_i, x_j) = \|x_i - x_j\|_2$. Then, t-SNE defines joint probabilities $p_{ij}$ that measure the similarity between $x_i$ and $x_j$ [17]:

$$p_{j|i} = \frac{\exp\left(-d(x_i, x_j)^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-d(x_i, x_k)^2 / 2\sigma_i^2\right)}, \quad p_{i|i} = 0.$$

Also:

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}.$$

In the above formulation, the bandwidth of the Gaussian kernels, $\sigma_i$, is set in such a way that the perplexity of the conditional distribution $p_i$ equals a predefined perplexity $\mu$. As a result, the optimal value of $\sigma_i$ varies per object: in regions of the data space with a higher data density, $\sigma_i$ tends to be smaller, and vice versa. The optimal value of $\sigma_i$ for each input object can be found using a simple binary search [47] or a robust root-finding method.
The objective of t-SNE is to find a $D_2$-dimensional map $Y = (y_1, y_2, \ldots, y_N) \in \mathbb{R}^{D_2}$ that optimally reflects the similarities $p_{ij}$. Therefore, it measures the similarities $q_{ij}$ between two points $y_i$ and $y_j$ in a similar way:

$$q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}, \quad q_{ii} = 0.$$

The heavy tails of the normalized Student-t distribution allow dissimilar input objects $x_i$ and $x_j$ to be modeled by distant low-dimensional counterparts $y_i$ and $y_j$. The locations of the embedding points $y_i$ are determined by minimizing the Kullback-Leibler divergence between the joint distributions $P$ and $Q$:

$$KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}.$$

Due to the asymmetry of the Kullback-Leibler divergence, the objective function focuses on modeling high values of $p_{ij}$ (similar objects) by high values of $q_{ij}$ (nearby points in the embedding space). The objective function is usually minimized by gradient descent [48]:

$$\frac{\partial KL(P \| Q)}{\partial y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})\, q_{ij}\, Z\, (y_i - y_j), \quad \text{with } Z = \sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}.$$
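As an illustration of how the C-S metric can drive the embedding, the sketch below feeds a precomputed chi-square distance matrix (e.g., the output of the `chi_square_distance` sketch above) to scikit-learn's TSNE. Note that this only approximates the paper's Chi-tsne, which implements the C-S metric inside the distance function of t-SNE itself; the helper name `chi_tsne` is an assumption.

```python
from sklearn.manifold import TSNE

def chi_tsne(D_cs, n_components=3, perplexity=30.0, seed=0):
    """Embed samples into 2 or 3 dimensions with t-SNE driven by a
    precomputed chi-square dissimilarity matrix D_cs (shape N x N).
    The perplexity must be smaller than the number of samples N."""
    tsne = TSNE(n_components=n_components,
                metric="precomputed",  # consume the C-S matrix directly
                init="random",         # PCA init is not valid for precomputed
                perplexity=perplexity,
                random_state=seed)
    return tsne.fit_transform(D_cs)
```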

Standard Classification Techniques
At the supervised learning stage, we test four standard classifiers: the linear Bayesian classifier (LDC), the quadratic Bayesian classifier (QDC), the support vector machine (SVM), and K-nn. The purpose is to demonstrate that the core of this work is the processing of the categorical data, through the Chi-square mapping for increasing class separability and t-SNE for dimensionality reduction, rather than the classifiers themselves.

Support Vector Machines (SVMs)
Support vector machines (SVMs) are prevalent in applications such as natural language processing, speech and image recognition, and artificial vision. The full theory of SVMs can be found in [49]. The approach can be summarized as follows:

• Separation of classes: finding the optimal separating hyperplane between the two classes by maximizing the margin between the closest points of the classes.
• Overlapping classes: data points on the wrong side of the discriminating margin are down-weighted to reduce their influence (soft margin).
• Non-linearity: when a linear separator cannot be found, the points are mapped to a higher-dimensional space where the data can be separated linearly (this projection is realized via kernel techniques).
• Solution of the problem: the whole task can be formulated as a quadratic optimization problem that can be solved by known methods.
SVMs belong to a class of machine learning algorithms called kernel methods. Common kernels used in SVMs include the RBF (Gaussian), linear, polynomial, and sigmoidal kernels, among others [50]. We choose the RBF kernel due to its flexibility for different types of data, and we set the gamma and C hyper-parameters of the RBF kernel through cross-validation.
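A minimal sketch of this tuning step with scikit-learn follows. The grid values and the synthetic stand-in data are illustrative assumptions; the paper does not report its exact search ranges.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))        # stand-in for t-SNE embeddings
y_train = (X_train[:, 0] > 0).astype(int)  # toy labels

param_grid = {"C": [0.1, 1, 10, 100],      # illustrative search ranges
              "gamma": [1e-3, 1e-2, 1e-1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
svm = search.best_estimator_               # RBF-SVM with tuned C and gamma
print(search.best_params_)
```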

Bayesian Classifier
According to Bayes' rule, the probability that an example $E = (x_1, x_2, x_3, \ldots, x_D)$ belongs to class $C$ is (where $D$ is the number of attributes or features):

$$p(C \mid E) = \frac{p(E \mid C)\, p(C)}{p(E)}.$$

$E$ is classified as class $C = +$ if and only if:

$$f_b(E) = \frac{p(C = + \mid E)}{p(C = - \mid E)} \geq 1.$$

Suppose that all attributes are independent given the class variable; the resulting classifier is then:

$$f_{nb}(E) = \frac{p(C = +)}{p(C = -)} \prod_{d=1}^{D} \frac{p(x_d \mid C = +)}{p(x_d \mid C = -)}.$$

The function $f_{nb}(E)$ is called the naive Bayes classifier. The difference between the linear discriminant classifier (LDC) and the quadratic one (QDC) lies in the assumption on the covariance matrix. Specifically, if the covariance is assumed to be equal for all classes, we obtain the LDC, which allows considerable mathematical simplicity when calculating the predictive distribution, but with a possible loss of generalization capability. If the covariance is assumed to be different for each class, we obtain the QDC, which can separate non-linear data with more accuracy, but the calculation of the predictive distribution is more complex [51].
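The LDC/QDC distinction maps directly onto scikit-learn's discriminant analysis classes, as in this minimal sketch (the synthetic data is an illustrative stand-in for the embedded features):

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # stand-in for embedded features
y = (X[:, 0] + X[:, 1] ** 2 > 0.5).astype(int)  # toy non-linear labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

ldc = LinearDiscriminantAnalysis()     # pooled covariance -> linear boundary
qdc = QuadraticDiscriminantAnalysis()  # per-class covariances -> quadratic boundary
for clf in (ldc, qdc):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, round(clf.score(X_test, y_test), 3))
```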

K-Nearest Neighbor (K-nn)
The learning process of the K-nn method is based on the storage of the data. The method is described as follows (see the sketch after this list):

• The training data $X = \{x_1, x_2, \ldots, x_N\}$ with labels $y = \{y_1, y_2, \ldots, y_N\}$ ($N$ being the number of data samples) are stored in memory.
• For a new sample $x_i \in \mathbb{R}^D$, where $D$ is the number of attributes, the $k$ nearest neighbors are found using a distance $d$ over the whole training set ($k$ can be 1, 3, 5, 7, ...).
• A voting procedure over the neighbors selects the class of the new sample $x_i$.
• Common choices for $d$ are the Euclidean distance, $d(x, y) = \|x - y\|_2$, and the Mahalanobis distance, $d(x, y) = \sqrt{(x - y)^{\top} \Sigma^{-1} (x - y)}$, where $\Sigma^{-1}$ is the inverse of the covariance matrix of the data.

In this work, we employ the Mahalanobis distance. We tested k = 3, 5, and 7 neighbors, but we report the best results, which were obtained with k = 3.
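A minimal scikit-learn sketch of this configuration (k = 3 with the Mahalanobis distance) follows; estimating Σ from the training set is an assumption, since the paper does not state which covariance estimate it uses.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))        # stand-in for embedded features
y = (X[:, 0] > 0).astype(int)        # toy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

VI = np.linalg.inv(np.cov(X_train, rowvar=False))  # inverse covariance (assumed estimate)
knn = KNeighborsClassifier(n_neighbors=3,          # k = 3, as reported in the paper
                           metric="mahalanobis",
                           metric_params={"VI": VI})
knn.fit(X_train, y_train)
print("accuracy:", knn.score(X_test, y_test))
```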

Datasets and Experimental Setup
We test seven public datasets downloaded from the UCI machine learning repository (https://archive.ics.uci.edu/ml/index.php). Table 1 describes the databases and their main characteristics. First, we evaluate the t-SNE distances (Cosine, Jaccard, Mahalanobis, Chebychev, Minkowski, City block, Seuclidean, Euclidean, Chi-tsne) to demonstrate that the C-S metric combined with the t-SNE algorithm (Chi-tsne) enhances the separability of categorical databases. Then, we classify the datasets using four approaches (LDC, QDC, SVM, K-nn) to find which learning method is the most accurate in this context. For the sake of comparison, we test four different setups over the data: the single classifiers, the classifiers + C-S, the classifiers + C-S + t-SNE, and the classifiers + t-SNE. See Table 2 for the description of the experimental setups. We calculate the accuracy (AC) and computational times for all classifiers in each setup, under the same conditions. We perform a hold-out validation scheme with ten repetitions for each experiment, taking 70% of the data for training and 30% for validation. The simulations were performed with Matlab on a server with an Intel(R) Xeon(R) E5-2650 v2 CPU (2.60 GHz), two processors with eight cores each, and 280 GB of RAM.
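The following sketch mirrors this validation protocol (ten repetitions of a 70/30 hold-out split, reporting mean accuracy and standard deviation). It is written in Python rather than the authors' Matlab, and the stratified splitting is an assumption not stated in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def hold_out_accuracy(clf, X, y, reps=10, test_size=0.3):
    """Mean and std of accuracy over repeated 70/30 hold-out splits,
    mirroring the validation scheme described above."""
    scores = []
    for rep in range(reps):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size,
            stratify=y,          # assumption: class-balanced splits
            random_state=rep)    # one different split per repetition
        clf.fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    return np.mean(scores), np.std(scores)
```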

Results and Discussion
The experimental results show that the Chi-square (C-S) distance is suitable for categorical data due to its mathematical nature. Initially, this mapping increases the dimension of the data, maps the data to a real domain, and improves the separation of the classes. Later, we perform a dimensionality reduction with t-SNE to avoid computational complexity. We do not consider other measures such as the Kullback-Leibler divergence and the Wasserstein distance, because they were developed specifically for probability distributions and parameter estimation (KL is not symmetric, which can be an important limitation). This is not our case: we do not assume a probability distribution over the categorical data. Our intention is to map the categorical attributes to a real domain (instead of an integer domain) and to increase their separability.
In line with the above, Figure 1 illustrates the main effect of the C-S mapping. In this case, we show three of the seven databases (Congressional Voting Records, Balloons, and Breast Cancer). We can see that the original input space (left column) is highly overlapped and the features take only integer values. On the contrary, when the datasets are mapped with the C-S, the separability of the data increases. Table 3 shows the accuracy and standard deviation for LDC, QDC, SVM, and K-nn when we use the t-SNE algorithm over the databases. The objective was to evaluate the distances commonly applied in the t-SNE method (Cosine, Jaccard, Mahalanobis, Chebychev, Minkowski, City block, Seuclidean, Euclidean) and to demonstrate that C-S is the most suitable for categorical attributes. We can see that the C-S metric outperforms the comparison distances with statistically significant differences in most cases. Also, t-SNE reduces the dimensionality of the mapped data without losing relevant information or data structure. Figure 2 shows the accuracy achieved by each learning method in the different experimental setups described in Table 2. We can identify four different setups for each dataset. The first one consists of evaluating the standard classifiers on the categorical databases without any processing or mapping of the data. We can observe that the classification outcomes are not the best for any dataset. This proves that categorical data must be processed or mapped before recognition tasks.
In the second setup, we test the classifiers over the datasets mapped with the C-S dissimilarity. This mapping yields better separability, but a higher dimensionality, which implies longer computational times. However, the C-S mapping generates the best classification results for all datasets, as we see in Figure 2. We consider that this mapping transforms the categorical data into quantitative data, and the learning methods perform much better in this scenario. We explain this as follows: the primary function of the C-S mapping is to increase the dimensionality of the data to alleviate the overlap of the categorical features. Recall that categorical attributes are integers: $X \in \mathbb{Z}^D$. When $X$ is mapped with the C-S dissimilarity, the feature domain is transformed too, i.e., $X \in \mathbb{Z}^D \rightarrow X^* \in \mathbb{R}^K$ with $K > D$. For this reason, the C-S mapping realizes a transformation from categorical to quantitative data.
In the third setup, we combine the processing techniques. We initially map the data with the C-S dissimilarity, and then apply the Chi-tSNE algorithm to reduce the number of attributes to three. This dimensionality reduction diminishes the computational times while preserving the data structure. The accuracy results are comparable with those of the first setup, but the computational times are much better than in the other setups. This makes the third setup the most suitable for on-line recognition systems.
Finally, the fourth setup applies the Chi-tSNE directly over the categorical datasets without a previous C-S mapping. Although the computational times demanded for training the learning algorithms are lower, the accuracy is affected.

Table 3. Classification results (accuracy) for several distances of the t-SNE algorithm over seven UCI public datasets. LDC and QDC correspond to the linear and quadratic Bayesian classifiers, K-nn stands for K-nearest neighbor, and SVM is the support vector machine. The datasets A, B, BC, C, LD, MB, and V are defined in Table 1.

In general, we can see in Figure 2 that the best setup in terms of accuracy was the second one, where the categorical features (integer values) are mapped with the C-S dissimilarity to a real (quantitative) space of higher dimensionality, achieving better separability. It should be noted that the best classifier was K-nn in most experiments. It is also important to mention that the most efficient setup in computational cost was the third one, as shown in Table 4. This is remarkable, because its accuracy percentages are competitive while achieving the lowest computational times.

Finally, to demonstrate the efficiency of our method, we made a comparison with several classification methods reported in the literature for the recognition of categorical databases: the sparse weighted naive Bayes classifier [14], the coupled attribute similarity method [19], Boolean kernels [20], and a naive possibilistic classifier with a generalized minimum-based algorithm [21]. Five of the seven databases used in this paper are also evaluated in those works. We obtain better classification accuracy with our proposed methodology than the comparison methods, as can be seen in Table 5.

Table 5. Accuracy results in the identification of categorical data for the comparison methods versus the C-S approach. SWNBC corresponds to the sparse weighted naive Bayes classifier [14], C4.5 is the coupled attribute similarity method [19], BK is the classifier based on Boolean kernels [20], and NPC is the naive possibilistic classifier [21].

Conclusions and Future Work
In this work, we implemented a recognition approach for categorical data. To do this, we developed two complementary components. First, we mapped the categorical attributes to a higher-dimensional space with the Chi-square (C-S) dissimilarity. This procedure transforms the feature domain of categorical datasets from integer to real values, alleviating the overlapping problem; Figure 1 shows how this mapping of the categorical data increases the recognition accuracy. Second, we introduced an alternative distance based on the Chi-square into the t-distributed stochastic neighbor embedding method (t-SNE); see Table 3 for the results. The combination of the C-S dissimilarity and the Chi-tSNE applied to categorical data simultaneously increases the data separability and reduces the computational times for classification, as we showed in Table 4 by testing standard classifiers (LDC, QDC, K-nn, and SVM) over public categorical datasets downloaded from the UCI repository. Also, we described how our proposal using C-S as a measure of dissimilarity outperformed other methods for the classification of categorical data reported in the literature [14,19-21]; see Table 5.
As future work, we propose developing a new metric based on a kernel formulation specially designed for qualitative databases, for example, Boolean kernels. We would also like to evaluate advanced classifiers such as Gaussian processes or deep learning. Finally, we encourage the reader to perform an analysis of the Chi-square distance and its invariance properties based on the Wasserstein information matrix [52].