Does Deep Learning Work Well for Categorical Datasets with Mainly Nominal Attributes?

Abstract: Given the complexity of real-world datasets, it is difficult to present data structures using existing deep learning (DL) models. Most research to date has concentrated on datasets with only one type of attribute: categorical or numerical. Categorical data are common in credit scoring; the German (-categorical) credit scoring dataset, for example, contains numerical, ordinal, and nominal attributes. The heterogeneous structure of this dataset makes very high accuracy difficult to achieve. DL-based methods have achieved high accuracy (99.68%) for the Wisconsin Breast Cancer Dataset, whereas DL-inspired methods have achieved high accuracy (97.39%) for the Australian credit dataset. However, to our knowledge, no such method has been proposed to classify the German credit dataset. This study aimed to provide new insights into why DL-based and DL-inspired classifiers do not work well for categorical datasets consisting mainly of nominal attributes. We also discuss the problems associated with using nominal attributes to design high-performance classifiers. Considering the expanded utility of DL, this study's findings should aid the development of a new type of DL that can handle categorical datasets consisting mainly of nominal attributes, which are commonly used in risk evaluation, finance, banking, and marketing.


Background
Among existing deep learning (DL) models, convolutional neural networks (CNNs) [1,2] are the best architecture for most tasks involving image recognition, classification, and detection [3]. However, Wolpert [4,5] described what has come to be known as the no free lunch (NFL) theorem, which implies that all learning algorithms perform equally well when averaged over all possible datasets. This counterintuitive result suggests the infeasibility of finding a general, highly predictive algorithm. Gómez and Rojas [6] subsequently conducted an empirical investigation of the effects of the NFL theorem on several popular machine learning (ML) classification techniques using real-world datasets.

Types of Data Attributes
It is substantially more challenging to accurately present data structures using existing DL models due to the complexity and variety of real-world datasets. Most of the research on this issue has concentrated on datasets with only one type of attribute: categorical or numerical. However, the number of cases with more than one type of attribute in a supervised learning setting has increased [7]. Categorical attributes comprise two subclasses, nominal and ordinal, with the latter inheriting some properties of the former. As with nominal attributes, all of the categories (i.e., possible values) of an ordinal attribute are qualitative and therefore unsuitable for mathematical operations; unlike nominal categories, however, they are naturally ordered and comparable [8]. As an example, consider a dataset describing individuals that contains a numerical attribute with values such as 0.123, 4.56, 10, or 100; an ordinal attribute such as a Stage I, II, or III cancer diagnosis; and a nominal attribute such as occupation (university student, public employee, company employee, physician, or professor) [9].
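The distinction matters in practice when preparing features: ordinal categories can be mapped to integers that preserve their natural order, whereas nominal categories admit no order and are typically one-hot encoded. A minimal sketch of the two treatments, using the illustrative category values from the example above (the encodings themselves are assumptions, not taken from any particular dataset):

```python
# Ordinal attribute: cancer stage has a natural order, so an integer
# mapping preserves the comparability of its categories.
STAGE_ORDER = {"Stage I": 1, "Stage II": 2, "Stage III": 3}

def encode_ordinal(value):
    return STAGE_ORDER[value]

# Nominal attribute: occupations have no order, so one-hot encoding
# avoids imposing a spurious ranking on them.
OCCUPATIONS = ["university student", "public employee",
               "company employee", "physician", "professor"]

def encode_nominal(value):
    return [1 if value == c else 0 for c in OCCUPATIONS]

print(encode_ordinal("Stage II"))    # 2
print(encode_nominal("physician"))   # [0, 0, 0, 1, 0]
```

Note that the ordinal encoding keeps `Stage I < Stage III` meaningful, while the one-hot vectors make all occupations mutually equidistant.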

Heterogeneous Datasets
A numerical dataset's characteristics differ from those of categorical datasets that contain only ordinal attributes, such as the Wisconsin Breast Cancer Dataset (WBCD; https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(original)). From a practical perspective, categorical data that involve a mix of nominal and ordinal attributes are common in credit scoring datasets [10]. The German (-categorical) credit scoring dataset (https://archive.ics.uci.edu/ml/datasets/statlog+(german+credit+data)) is a typical heterogeneous dataset that contains numerical, ordinal, and nominal attributes. The heterogeneous structure of this dataset makes very high accuracy difficult to achieve. Several alternative approaches, such as artificial bee colony (ABC)-based support vector machines (SVMs) [11], feature selection and random forest (RF) [12], Information Gain Directed Feature Selection [13], synthetic minority oversampling technique (SMOTE)-based ensemble classification [14], and extreme learning machines (ELMs) [15], have been developed in recent years for the German credit dataset.

Deep Learning (DL) Approaches for Datasets with Ordinal Attributes
The present author previously proposed a general-purpose and straightforward method [16] for mapping the weights of deep neural networks (DNNs) trained by deep belief networks (DBNs) [17] to the weights of backpropagation neural networks (NNs). This method uses the recursive-rule eXtraction (Re-RX) algorithm [18] with J48graft [19,20], and it led to the proposal of a new method for extracting accurate and interpretable classification rules from categorical datasets that include rating or grading ordinal attributes. The method was then applied to the WBCD, a small, high-abstraction ordinal dataset with prior knowledge [21]. The present author also noted that the German credit dataset is a relatively low-abstraction dataset that mainly includes the nominal attributes used by banking professionals, without prior knowledge.

State-of-the-Art DL Classifiers for Categorical and Mixed Datasets
A variety of high-accuracy classifiers have recently been proposed. DL-based methods used for the WBCD have achieved accuracy as high as 99.68% [22][23][24], whereas DL-inspired methods used for the Australian credit dataset (http://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval)) have achieved accuracy as high as 97.39% [24,25].
Recently, the present author presented a new rule extraction method [26] for achieving transparency and conciseness in credit scoring datasets with heterogeneous attributes using a one-dimensional (1D) fully-connected layer first (FCLF)-CNN [23] combined with the Re-RX algorithm with a J48graft decision tree (hereafter, 1D FCLF-CNN; Figure 1). Although it does not completely overcome the accuracy-interpretability dilemma for DL, it does appear to resolve this issue for credit scoring datasets with heterogeneous attributes.

Novelty of This Paper
Nevertheless, and contrary to our expectations, although it appears relatively easy to construct DL-based and DL-inspired classifiers with very high accuracies for the WBCD and the Australian dataset, to our knowledge, no such method has been proposed for the German credit dataset. We hypothesized that one reason for these high accuracies is the ratio of the number of ordinal to nominal attributes.
In this paper, we provide new insights into why DL-based and DL-inspired classifiers do not work well for categorical datasets consisting mainly of nominal attributes, as well as the barriers to achieving very high accuracies for such datasets. We also discuss the pitfalls of using nominal attributes to design high-performance classifiers.

Categorical Datasets and Their Recent High Accuracies
In this section, we first tabulate the characteristics of three categorical datasets (Table 1). Tables 3 and 4 show the test accuracies and areas under the receiver operating characteristic curve (AUC-ROCs) [27] obtained by recent high-accuracy classifiers for the WBCD and the German dataset, respectively. As a related reference, a new measure, the concordant partial AUC (cpAUC) [28], has also been proposed. Table 2 shows the parameter settings used to train the 1D FCLF-CNN for the German and Australian credit scoring datasets, and Table 5 shows the test accuracies and AUC-ROCs obtained by recent high-accuracy classifiers for the Australian dataset.

The WBCD is composed of 699 samples (16 with missing values) obtained by fine-needle aspiration (FNA) [29] of human breast tissue. FNA allows malignancy in breast masses to be investigated in a noninvasive and cost-effective manner. In total, nine features related to the size, shape, and texture of the cell nuclei are measured in each sample. The observed values are scored on a 10-point scale, where 1 denotes the closest to benign, and each sample is given a class label (benign or malignant). Among the 683 complete samples, 239 malignant and 444 benign cases were observed. Pathologists assessed each sample based on an analysis of these nine ordinal features [30].
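The AUC-ROC reported in the tables can be computed without explicitly tracing the ROC curve: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (the Mann-Whitney formulation). A self-contained sketch, with made-up scores purely for illustration:

```python
def auc_roc(labels, scores):
    """AUC as the probability that a positive outranks a negative;
    ties contribute half a point (Mann-Whitney U formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores: a ranker that separates the classes perfectly gets AUC = 1.0.
print(auc_roc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
```

This quadratic-time version is only a sketch; production implementations sort the scores once and use rank sums instead.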

Recent High Accuracies by Classifiers for the WBCD
The accuracy for the WBCD plateaus at around 99.00%. DL-based classifiers [22][23][24] are relatively competitive with other recent high-accuracy classifiers; however, Deep Forest does not work well for the WBCD. Zhou and Feng [31,32] proposed the unique gcForest (multi-grained cascade forest) approach for constructing a non-NN-style deep model: a novel decision tree ensemble with a cascade structure that enables representation learning. We first demonstrated that this DL-inspired method achieves a considerably lower classification accuracy of 95.52% on the WBCD.
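The cascade idea behind gcForest can be illustrated independently of its implementation: each level trains on the original features augmented with the class-probability vector produced by the previous level, which is the "representation learning" the cascade performs. The sketch below substitutes a trivial nearest-centroid learner for the forests, purely to show the data flow; it is an assumption-laden toy, not the authors' system:

```python
from collections import defaultdict

def train_centroid(X, y):
    """Stand-in base learner (NOT a forest): per-class feature centroids."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for xi, yi in zip(X, y):
        sums[yi] = list(xi) if sums[yi] is None else \
            [a + b for a, b in zip(sums[yi], xi)]
        counts[yi] += 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}

def proba(model, xi):
    """Soft class scores from inverse distance to each class centroid."""
    dists = {c: sum((a - b) ** 2 for a, b in zip(xi, m)) ** 0.5
             for c, m in model.items()}
    inv = {c: 1.0 / (d + 1e-9) for c, d in dists.items()}
    z = sum(inv.values())
    return [inv[c] / z for c in sorted(inv)]

def cascade_fit_predict(X, y, x_new, levels=2):
    """Each level sees the raw features plus the previous level's
    class-probability vector, mimicking the cascade structure."""
    feats, feats_new = [list(r) for r in X], list(x_new)
    for _ in range(levels):
        model = train_centroid(feats, y)
        feats = [row + proba(model, row) for row in feats]
        feats_new = feats_new + proba(model, feats_new)
    final = train_centroid(feats, y)
    p = proba(final, feats_new)
    return sorted(set(y))[p.index(max(p))]
```

The real gcForest additionally uses multi-grained scanning and cross-validated probability estimates; none of that is reproduced here.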

Credit Scoring Datasets
Credit application scoring is an effective method for classifying whether a credit applicant belongs to a legitimate (creditworthy) or suspicious (non-creditworthy) group based on their credentials. Improving the predictive performance of credit scoring models, especially for applicants that fall into the non-creditworthy group, could be expected to have a substantial impact on the financial industry [38].

German (-Categorical) Dataset
The German dataset contains 1000 samples with 20 features that describe the applicant's credit history. In this dataset, 700 and 300 samples describe creditworthy and non-creditworthy applicants, respectively. Nominal attributes include the status of an existing checking account, credit history, the purpose of credit taken by a customer, savings accounts/bonds, present type and length of employment, personal characteristics and sex, debtors or guarantors, property holdings, installment plans, housing status, and job status [39].

Recent High-Accuracy Classifiers for the German Dataset
In many cases, credit scoring datasets contain customer profiles that consist of numerical, ordinal, and mainly nominal attributes; here, these are referred to as heterogeneous credit scoring datasets [26]. This section on the German dataset reveals the pros and cons of DL, DL-based classifiers, and DL-inspired classifiers from different perspectives. Ensemble classifiers with a neighborhood rough set [10], ABC-based SVM [11], and Bolasso-based feature selection with RF [12] showed very high accuracies.

Australian Dataset
The Australian dataset consists of 690 samples with 14 features. Each sample in the Australian dataset is composed of six categorical and eight numerical attributes, as well as a class attribute (accepted as positive or rejected as negative), where 307 and 383 instances are positive and negative applicants, respectively. The Australian dataset consists of credit card applications, but the feature values and names are changed into random symbols to preserve the data's secrecy [39].

Table 5. Comparison of recent classifier performances for the Australian dataset.

Discussion
We hypothesized that one reason why DL does not work well is the number of features and the characteristics of the attributes. At first glance, the results in Table 4 seem consistent with the description of low-dimensional data given for Deep Forest [31,32]; the authors noted that "fancy architectures like CNNs could not work on such data as there are too few features without spatial relationships" [32] (p. 81). However, a DL-based classifier [23] achieved an accuracy of 98.71% for the WBCD, which suggests that the WBCD, with only nine ordinal attributes, is easy to classify.
However, for the German dataset, a DL-based classifier [23,41] and Deep Forest [31,32] ranked at the bottom (Tables 4 and 5). Deep Forest showed the lowest classification accuracy, at 71.12%. The differences in accuracy between the 1D FCLF-CNN combined with Re-RX with J48graft (1D FCLF-CNN with Re-RX with J48graft) and Deep Forest were 10.17% and 15.45%, respectively; the highest accuracy reported was 86.57% [10]. Deep Forest's accuracy was inferior even to the highest accuracy obtained using a rule extraction method (79.0%) [42].
We believe that the main reason for this is that the German dataset consists of mainly nominal attributes, which are commonly found in finance and banking, risk evaluation [12], and marketing [43]. Therefore, innovations are needed for DL, DL-based, and DL-inspired methods for heterogeneous datasets such as the German credit dataset. We also argue that the Bene1 [44] and Bene2 [44] datasets, which have been used by major Benelux-based financial institutions to summarize consumer credit data, consist of mainly nominal attributes.
Surprisingly, a new deep genetic cascade ensemble of different SVM classifiers (the so-called Deep Genetic Cascade Ensemble of Classifiers [DGCEC]) [25] achieved the current highest accuracy (97.39%) for the Australian dataset. That study aimed to design a novel 16-layer deep genetic cascade ensemble of SVM classifiers based on evolutionary computation, ensemble learning, and DL techniques to allow the effective binary classification of accepted and rejected borrowers. The method combined SVM classifiers, normalization, feature extraction, kernel functions, parameter optimization, ensemble learning, DL, layered learning, supervised training, feature (attribute) selection, and the optimization of classifier parameters using a genetic algorithm.
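One ingredient of the DGCEC system, the genetic optimization of classifier parameters, can be sketched in isolation. In the sketch below, the fitness function is a stand-in for cross-validated accuracy over two hypothetical SVM hyperparameters (the real system evaluates ensembles of SVMs); the population size, mutation rate, and fitness surface are all illustrative assumptions:

```python
import random

def genetic_optimize(fitness, bounds, pop_size=20, generations=40, seed=0):
    """Minimal real-coded GA: selection from the top half, uniform
    crossover, Gaussian mutation, and elitism. Returns the best vector."""
    rng = random.Random(seed)
    dim = len(bounds)
    clip = lambda v, i: min(max(v, bounds[i][0]), bounds[i][1])
    pop = [[rng.uniform(*bounds[i]) for i in range(dim)]
           for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        new_pop = scored[:2]                       # elitism: keep the best two
        while len(new_pop) < pop_size:
            a, b = rng.sample(scored[:pop_size // 2], 2)
            child = [a[i] if rng.random() < 0.5 else b[i] for i in range(dim)]
            if rng.random() < 0.3:                 # mutate one gene
                i = rng.randrange(dim)
                child[i] = clip(child[i] + rng.gauss(0, 0.5), i)
            new_pop.append(child)
        pop = new_pop
    return max(pop, key=fitness)

# Stand-in fitness peaking at C = 1.0, gamma = 0.1, mimicking the shape of
# a cross-validated accuracy surface (negated squared distance to the peak).
def toy_fitness(params):
    c, g = params
    return -((c - 1.0) ** 2 + (g - 0.1) ** 2)

best = genetic_optimize(toy_fitness, [(0.01, 10.0), (0.001, 1.0)])
```

In the actual 16-layer system the fitness evaluation would involve training and validating SVM ensembles, which this toy surface only imitates.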
Although the spiking extreme learning machine method [39] was very systematic and sophisticated, it achieved only the second-highest accuracy (95.98%) for the Australian dataset, demonstrating that a DL-inspired tactic can achieve considerably higher classification accuracies. This method was specialized to predict creditworthiness in the Australian dataset; even so, its results remain worse than those obtained by the DGCEC system.

A Black-Box ML Approach to Achieve Very High Accuracies for the German (-Numerical) Credit Dataset
Pławiak et al. [45] recently proposed the Deep Genetic Hierarchical Network of Learners, a fusion-based system with a 29-layer structure that includes ML algorithms, SVM algorithms, k-nearest neighbors, and probabilistic NNs, as well as a fuzzy system, normalization techniques, feature extraction approaches, kernel functions, and parameter optimization techniques based on error calculation. Remarkably, they achieved the highest accuracy (94.60%) for the German (-numerical; 24 attributes) dataset, which contains only numerical attributes. Because it has no nominal attributes, this dataset is easier to classify and thus fundamentally different from the German (-categorical; 20 attributes) dataset; as a result, we omitted its performance from Table 4. Furthermore, even if this kind of ML technique effectively enhances classification accuracy, such methods hinder the conversion of a "black box" DNN trained by DL into a "white box" consisting of a series of interpretable classification rules [46].

Pitfalls in Handling Nominal Attributes to Design High-Performance Classifiers
As shown in Table 1, the WBCD consists of ordinal attributes. The Australian dataset is a mixed dataset, i.e., it consists of ordinal and numerical attributes. On the other hand, the German credit dataset is a mixed dataset that consists mainly of nominal attributes. When attempting to achieve only the highest accuracy for the German dataset, many papers do not focus on handling datasets with heterogeneous attributes or on appropriately maintaining the characteristics of the nominal attributes. For example, the highest accuracy achieved for the German (-categorical) dataset was 86.57%, by Tripathi et al. [10]. Kuppili et al. [40] and Tripathi et al. [39] achieved considerably high accuracies for the German dataset using an SVM and an NN, both of which require each data instance to be represented as a vector of real numbers. However, they did not handle the nominal attributes appropriately, as they converted them into numerical attributes before feeding them into the classifiers. If this pitfall can be avoided, classifiers with much higher accuracy could be designed using DL-based or DL-inspired methods.
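The pitfall can be made concrete: integer-coding a nominal attribute imposes an arbitrary order and distance on categories that have neither, and distance-based learners such as SVMs and NNs will exploit that fiction. A small sketch (the category values loosely follow the German dataset's "purpose of credit" attribute, but the list and the integer codes are arbitrary illustrations):

```python
PURPOSES = ["car", "furniture", "radio/TV", "education", "business"]

# Naive integer coding: the order below is arbitrary, yet it makes
# "car" appear four times closer to "furniture" than to "business".
int_code = {p: i for i, p in enumerate(PURPOSES)}

def int_distance(a, b):
    return abs(int_code[a] - int_code[b])

# One-hot coding: every pair of distinct categories is equidistant,
# the only geometry consistent with a nominal attribute.
def one_hot(p):
    return [1 if p == q else 0 for q in PURPOSES]

def onehot_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(one_hot(a), one_hot(b))) ** 0.5

print(int_distance("car", "furniture"), int_distance("car", "business"))  # 1 4
```

Under the integer coding, a margin- or distance-based classifier treats the spurious 1-versus-4 gap as signal; under one-hot coding, all distinct purposes are equally far apart.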

Conclusions
In this paper, we have provided new insights into why DL-based and DL-inspired classifiers do not work well for categorical datasets that consist mainly of nominal attributes, as well as the barriers to achieving very high accuracies for such datasets. One limitation of this work is the small number of available categorical datasets with mainly nominal attributes. As mentioned above, there are bigger datasets similar to the German dataset, including the Bene1, Bene2, Lending Club, and Bank Loan Status datasets. Considering the vastly expanded utility of DL, attempts should be made to develop a new type of DL that can handle categorical datasets consisting mainly of nominal attributes, which are commonly used in risk evaluation, finance, banking, and marketing.
Funding: This work was supported in part by the Japan Society for the Promotion of Science through a Grant-in-Aid for Scientific Research (C) (18K11481).

Acknowledgments:
The authors would like to express their appreciation to graduate students Naoki Takano and Naruki Yoshimoto for their helpful discussions during this research.

Conflicts of Interest:
The authors declare no conflict of interest.