Comparison between Statistical Models and Machine Learning Methods on Classification for Highly Imbalanced Multiclass Kidney Data

This study aims to compare the classification performance of statistical models and machine learning methods on highly imbalanced kidney data. The health examination cohort database provided by the National Health Insurance Service in Korea is utilized to build models with various machine learning methods. The glomerular filtration rate (GFR) is used to diagnose chronic kidney disease (CKD). It is calculated using the Modification of Diet in Renal Disease method and classified into five stages (1, 2, 3A and 3B, 4, and 5). Different CKD stages based on the estimated GFR are considered as six classes of the response variable. This study utilizes two representative generalized linear models for classification, namely, multinomial logistic regression (multinomial LR) and ordinal logistic regression (ordinal LR), as well as two machine learning models, namely, random forest (RF) and autoencoder (AE). The classification performance of the four models is compared in terms of accuracy, sensitivity, specificity, precision, and F1-Measure. To find the best model that classifies CKD stages correctly, the data are divided into a 10-fold dataset with the same rate for each CKD stage. Results indicate that RF and AE show better performance in accuracy than the multinomial and ordinal LR models when classifying the response variable. However, when a highly imbalanced dataset is modeled, accuracy alone can misrepresent the actual performance. This occurs because accuracy remains high even if a model classifies minority-class cases into a majority class. To solve this problem in performance interpretation, we consider not only accuracy from the confusion matrix but also sensitivity, specificity, precision, and F1-Measure for each class. To present classification performance with a single value for each model, we calculate the macro-average and micro-weighted values for each model. We conclude that AE is the best model classifying CKD stages correctly for all performance indices.


Introduction
Chronic kidney disease (CKD) is defined as kidney damage or the presence of a decreased glomerular filtration rate (GFR) for more than three months [1]. The National Kidney Foundation (NKF) created a guideline to help medical doctors identify the level of kidney disease and thus improve the quality of care for patients with kidney disease. The NKF presents the standard classification of kidney disease into five stages based on how well the kidneys can filter waste and excess fluid out of the blood. In the early stages, the kidneys are still able to filter waste out of the blood. In the later stages, the kidneys must work harder to remove waste and may stop working altogether. In particular, stage 3 is separated into two stages, stage 3a and stage 3b [2].
With rising chronic non-communicable diseases, such as obesity, diabetes, and hypertension, CKD has also been increasing rapidly worldwide, and its global prevalence ranges from 8% to 18% [3]. The most common causes of CKD are diabetes and hypertension. Other problems that can damage the kidneys include glomerulonephritis, polycystic kidney disease, nephrotoxic drugs, kidney stones, and infection. CKD is associated with a significant increase in cardiovascular disease, stroke, and death [4,5]. Accordingly, early detection and effective intervention of CKD are important to slow disease progression and reduce mortality.
Kidney function screening, such as serum creatinine concentration and dipstick urine testing, is an easy and inexpensive way to detect CKD, and in Korea, it is recommended that the entire population over 40 years old be screened [1]. The National Health Screening program helps in the early detection of medical problems and contributes to reducing further medical and social costs of diseases [6]. In recent years, machine learning and deep learning methods have been used to solve the class imbalance issue [7]. Machine learning methods have been utilized to predict and classify data in the healthcare field. Various machine learning methods have been used to classify and predict diabetes, and magnetic resonance imaging data of the brain have been used to detect Alzheimer's disease [8,9]. The prediction of heart disease by employing artificial neural network, decision tree, and naïve Bayes algorithms has also been proposed [10]. Moreover, a study was performed to classify and predict liver diseases by using the multilayer perceptron algorithm [11].
A similar study was conducted with 361 patients to classify different CKD stages based on the estimated GFR (eGFR) values via machine learning methods, where creatinine, gender, and age were considered as explanatory variables [12]. However, the National Health Screening dataset consists of big data that has a class imbalance problem because abnormal findings are only a small portion of the collected data [13]. A major challenge in effective healthcare data analytics is the highly imbalanced data with multiple skewed response variables due to highly unequal numbers of samples, as is the case in the present study. Highly imbalanced data may have high overall accuracy in classification and prediction results in categories with high response rates, but the sensitivity and specificity in categories with minority classes may be very low. Therefore, this study uses the entire data to classify CKD according to GFR. Building a model that considers overall accuracy and the sensitivity and specificity of each category is essential to more accurately classify highly imbalanced data and predict results.
The most commonly used method for classifying imbalanced data is to reconstruct the data through sampling techniques and apply this reconstruction to the analysis [14,15]. The oversampling technique can be used to duplicate a minority class to balance the dataset when one class of data is the minority class. Conversely, when a class of data is the majority class, undersampling can be used to balance it. However, oversampling methods may suffer from overfitting and become sensitive to overlapping between minority and majority classes, while undersampling methods easily deteriorate information about samples and classes [16]. In other words, sampling methods cannot accurately reflect the characteristics of very imbalanced data when the original data are damaged and new data are applied.
A related study showed that for imbalanced data, projecting samples from the input space onto a feature space with better classification representation between the majority and minority classes yields good classification for both classes [17]. Other studies argued that the autoencoder (AE) extracts meaningful feature representations and is therefore useful for developing a model that classifies imbalanced data [18][19][20][21]. On the basis of the above research, this study aims to build a model that explains the imbalanced characteristics and classifies and predicts more accurately by applying AE to CKD data.
In this study, we investigated the national health check data of a cohort of 134,895 samples and evaluated the performance of CKD classification using two generalized linear models for classification (multinomial LR and ordinal LR) and two machine learning algorithms (RF and AE). The maximum likelihood estimator is known to be biased when logistic regression is fitted to imbalanced data [22]. We therefore expected the generalized linear models to give poor classification performance. RF, which is based on the bagging algorithm, utilizes the ensemble learning technique. It reduces the overfitting problem of decision trees and also helps reduce the variance, which leads to improved classification performance.
The response variable of the data used in this study is the different CKD stages based on eGFR, which is a categorical variable with six stages (i.e., 1, 2, 3a, 3b, 4, and 5). In addition, this variable consists of highly imbalanced data with multiple response rate categories skewed to one side. Therefore, by constructing and comparing many machine learning models, this study identifies the factors influencing GFR through statistical models that reflect the characteristics of the ordinal response variable, with the goal of deriving the best model that more accurately classifies highly imbalanced data.
This paper is organized as follows. Section 2 describes the models and the data used in this study, including response variable, explanatory variables, and their characteristics. Section 3 presents and discusses the analysis results of comparing the classification performance between generalized linear models and machine learning algorithms. Finally, Section 4 concludes the paper with an analysis and future research directions.

Multinomial Logistic Regression Model
The multinomial LR model is a generalized logistic regression model used when the response variable has three or more categories. When the logit link function, which takes the logarithmic value of the odds as the link function, is used in the generalized linear model, the resulting model is known as the multinomial LR model. The logit link function and the multinomial LR model are expressed respectively as Equations (1) and (2).
where $x_1, \ldots, x_p$ are explanatory variables; $p$ is the number of explanatory variables; $\beta_1, \ldots, \beta_p$ are the regression coefficients; and $\pi_j$ is the probability that the response variable falls in the $j$th category. Furthermore, the probabilities of all categories of the response variable must sum to 1.
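For orientation, a standard textbook way to write the logit link and the baseline-category multinomial LR model with the symbols defined above is sketched here. The use of the last category $J$ as the baseline and the category-specific intercept $\beta_{0j}$ and coefficients $\beta_{kj}$ are assumptions of this sketch; they are not taken from the numbered equations of the paper.

```latex
\operatorname{logit}(\pi) = \log\frac{\pi}{1-\pi}

\log\frac{\pi_j}{\pi_J} = \beta_{0j} + \beta_{1j}x_1 + \cdots + \beta_{pj}x_p,
\qquad j = 1, \ldots, J-1,
\qquad \sum_{j=1}^{J} \pi_j = 1
```

Under this form, each non-baseline category has its own set of coefficients, so a response with six categories would involve five such equations.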

Ordinal Logistic Regression Model
The ordinal LR model extends the logistic regression model, which is used when the response variable has two categories, to ordinal response variables with three or more categories. Cumulative probabilities and logits between cumulative categories can be defined and modeled as Equations (3) and (4) [23].
A model with cumulative logit j is similar to a logistic regression model that combines the categories 1, ..., j into one category and views the categories j + 1 through J as another category.
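As a point of reference, the cumulative probabilities and the cumulative logit model are commonly written in the proportional-odds form below. The intercept notation $\alpha_j$ and the sign convention are assumptions of this sketch (some software subtracts the linear predictor from the intercept instead), and the form may differ in detail from Equations (3) and (4).

```latex
P(Y \le j) = \pi_1 + \cdots + \pi_j, \qquad j = 1, \ldots, J

\operatorname{logit}\bigl[P(Y \le j)\bigr]
  = \log\frac{P(Y \le j)}{1 - P(Y \le j)}
  = \alpha_j + \beta_1 x_1 + \cdots + \beta_p x_p,
\qquad j = 1, \ldots, J-1
```

In this form the coefficients $\beta_1, \ldots, \beta_p$ are shared across all $J-1$ cumulative logits, and only the intercepts $\alpha_j$ change with $j$.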

Random Forest
Random forest (RF) as a classification tool is a technique for generating multiple decision trees through sampling with replacement and using the outcomes to derive the final result [24,25]. First, the decision tree forms a tree model, classifies data, and makes a prediction by repeating the process of dividing each variable. Decision trees are largely composed of nodes and branches and set classification criteria that develop a tree. The classification criteria used include the p-value of the chi-squared statistic, Gini Index, Entropy Index, and others. The smaller the p-value of the chi-squared statistic, the larger the impurity. The larger the Gini Index or the Entropy Index, the larger the impurity. Therefore, splits are chosen in the direction that decreases the impurity.
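The two impurity indices mentioned above can be illustrated with a small sketch. The node labels and the split below are hypothetical, and the helper names are ours; the code only shows how the Gini Index and Entropy Index are computed from class proportions and how a split is scored by the size-weighted impurity of its child nodes.

```python
from collections import Counter
from math import log2


def gini_index(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy_index(labels):
    """Shannon entropy (base 2) of the class proportions."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())


def split_impurity(left, right, impurity):
    """Size-weighted impurity of the two child nodes produced by a split."""
    n = len(left) + len(right)
    return len(left) / n * impurity(left) + len(right) / n * impurity(right)


# Hypothetical node holding 6 cases of one stage and 4 cases of another.
parent = ["stage 2"] * 6 + ["stage 3a"] * 4
# Hypothetical split that separates the two stages perfectly.
left, right = parent[:6], parent[6:]

print(gini_index(parent), entropy_index(parent))
print(split_impurity(left, right, gini_index))
```

A split is preferred when the weighted impurity of the children is lower than the impurity of the parent node; the perfect split in this toy example drives it to zero.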
As depicted in Figure 1, RF is characterized by an improved predictive performance from generating multiple decision trees through the bootstrapping technique and combining them. RF has various advantages, such as presenting unexcelled accuracy, running efficiently on large sample sizes, giving variable importance in classification, and generating internal unbiased estimates.
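The two ingredients named above, bootstrapping and combining the trees, can be sketched as follows. This is a minimal illustration with hypothetical helper names and made-up tree outputs, not the implementation inside the "randomForest" package; fitting the individual trees is omitted.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (one bootstrap replicate)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the class predictions of several trees by majority vote."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)            # fixed seed so the sketch is repeatable
rows = list(range(10))            # hypothetical row indices of a training set
replicate = bootstrap_sample(rows, rng)

# Hypothetical predictions from three trees for a single patient.
tree_predictions = ["2", "2", "3a"]
print(replicate)
print(majority_vote(tree_predictions))
```

Each tree would be grown on its own bootstrap replicate, and the forest's prediction for a case is the class chosen by most trees.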

Autoencoder
AE is a multilayer neural network structure in which the number of nodes in the input layer is equal to the number of nodes in the output layer. The goal of AE learning is to make the output node values equal to the input node values. A basic AE structure is shown in Figure 2. Structurally, the number of nodes decreases from the input layer toward a specific middle point and then increases again until the output layer is reached. Many different terms are used to describe this middle point, such as code, latent variable, feature vector, and hidden representation. The part where the number of nodes is reduced from the input layer is called an encoder, and the part where the number of nodes is increased to the output layer is called a decoder. This structural feature, in particular, makes it possible to perform hierarchical feature extractions or dimension reductions in the encoder part. When an AE is used to extract features, only the portion between the input layer and the middle layer is cut out. Deep learning neural networks can be configured by adding a fully connected neural network or a softmax layer behind the cutout. When training the deep learning neural network constructed with the encoder part of the AE, both the input and output parts are used [26].
Diagnostics 2020, 10, 415
A stacked AE (SAE) is a special type of AE consisting of several layers of sparse AE where the output of each hidden layer is linked to the input of the successive hidden layer. SAE trains the hidden layer through an unsupervised learning algorithm and then fine-tunes the training with a supervised method. The three key steps of SAE are as follows [26].
• AE is trained using input data and then the learned data are acquired.
• Learned data from the previous layer are used continuously as input for the next layer until the training is completed.
• Once all the hidden layers are trained, the backpropagation algorithm is used to minimize the cost function and the weights are updated with the training set to achieve fine-tuning.
The advantage of SAE is that it extracts more detailed information from the raw data, thus providing better features from all explanatory variables. This characteristic of SAE eventually improves the accuracy in model performance [26].

Classification Performance
In this section, two generalized linear models for classification, namely, multinomial LR and ordinal LR models, as well as two machine learning method models, namely, RF technique and AE, were applied and compared to identify the model that best classifies CKD stages of highly imbalanced data.
In traditional model assessment, the models are fitted once by using the original training data. Thanks to recent advances in computing power, cross validation (CV) is now routinely conducted to select the better model because it gives researchers information about the variability of fitted models. There are several CV methods, such as the hold-out method (the validation set approach), leave-one-out CV (LOOCV), and K-fold CV. LOOCV is a special case of K-fold CV in which K is set to equal the sample size n.
LOOCV is clearly computationally expensive, especially for AE. Hence, K-fold CV was used to evaluate the models in this study. The collected data were split into K subsets, and each subset was used in turn as the test set, to ensure the reliability of the classification performance. A 10-fold CV was applied in this study. Figure 3 presents the methodology and performance workflow of this study.
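One way to split the data into folds that keep the same rate for each CKD stage is a stratified assignment, sketched below. The function name, the round-robin dealing scheme, and the toy label counts are assumptions made for illustration; the study's own fold construction may differ in detail.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=10, seed=0):
    """Return k lists of row indices, each with roughly the class mix of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled rows of this class to the folds in turn.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced labels: 70 / 20 / 10 cases in three stages.
labels = ["1"] * 70 + ["2"] * 20 + ["5"] * 10
folds = stratified_kfold(labels, k=10)

for test_fold in folds:
    held_out = set(test_fold)
    train_rows = [i for i in range(len(labels)) if i not in held_out]
    # A model would be fitted on train_rows and evaluated on test_fold here.
```

Because every class is dealt to the folds separately, even the smallest class appears in each test fold as long as it has at least K cases.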
To compare the performance of each model, a confusion matrix was calculated for each test set, and the accuracy, sensitivity, specificity, precision, and F1-Measure of the models were calculated and compared. Accuracy, sensitivity, specificity, precision, and F1-Measure were calculated using the mean value of 10 confusion matrices. Specificity corresponds to the ratio of accurate prediction in the case where the classification of the response variable is higher than the reference level. Sensitivity corresponds to the ratio of accurate prediction in the case where the classification of CKD stages is lower than the reference level.
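Sensitivity and specificity are considerably more important than accuracy when evaluating the performance on highly imbalanced data because an imbalanced dataset is dominated by either positive or negative cases. In their standard form, the performance measures are defined as
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{F1-Measure} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}},$$
where TP is the number of true positive classification cases, TN is the number of true negative classification cases, FP is the number of false positive classification cases, and FN is the number of false negative classification cases.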
For multiclass classification performance of 10-fold CV, 10 models are built on the basis of 10 CV datasets, yielding 10 confusion matrices. To calculate a single value of accuracy, sensitivity, specificity, precision, and F1-Measure, the averages of the 10 values of performance measure were obtained.
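For a multiclass confusion matrix, the per-class measures are usually obtained by treating one class as positive and all others as negative. The sketch below follows that one-vs-rest convention; the 3 x 3 matrix is made up for illustration, the helper names are ours, and the row/column orientation (rows are true classes, columns are predicted classes) is an assumption of this example.

```python
def one_vs_rest_counts(cm, k):
    """TP, TN, FP, FN for class k; cm[i][j] counts true class i predicted as j."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn


def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def class_metrics(cm, k):
    """Sensitivity, specificity, precision, and F1-Measure for class k."""
    tp, tn, fp, fn = one_vs_rest_counts(cm, k)
    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    if sensitivity is None or precision is None or sensitivity + precision == 0:
        f1 = None  # undefined when the denominator is zero
    else:
        f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}


def accuracy(cm):
    """Share of cases on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Hypothetical confusion matrix for three classes (rows: true, columns: predicted).
cm = [[90, 5, 0],
      [4, 40, 1],
      [0, 2, 8]]

print(accuracy(cm))
print(class_metrics(cm, 2))
```

Repeating this for each of the 10 test-set confusion matrices and averaging the resulting values gives the single figure per measure described above.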

Dataset
The National Health Insurance Corporation (NHIC) in Korea provides Koreans with the National Health Insurance Service. NHIC collects several types of personal medical data and manages a database according to Article 1 of the "Act on Provision and Utilization of Public Data." This study utilized the health examination cohort database managed by the NHIC. Data on 478,740 persons, selected through simple random sampling of 10% of all persons aged 51 years and older who had maintained health insurance status as of 2013, were used. Figure 4 shows a schematic of the study subjects. Among the total 478,740 possible subjects, the 214,818 people who had received general health examinations administered by the NHIC were chosen initially. The 263,922 people who had not taken health examinations as of 2013 were excluded from the dataset because they did not have medical information at that time. After the samples with missing information or variables were eliminated, the final group of study subjects included 134,895 people. The data that support the findings of this study are available from the National Health Insurance Sharing Service (https://nhiss.nhis.or.kr).

Description of Variable
The explanatory and response variables of this study are shown in Table 1 below. Categorical explanatory variables include income deciles, type of disability, gender, and smoking status, while continuous explanatory variables include age, fasting blood sugar, body mass index (BMI), systolic pressure, diastolic pressure, serum creatinine, gamma glutamyl transpeptidase (GTP), high-density lipoprotein (HDL) cholesterol, low-density lipoprotein (LDL) cholesterol, hemoglobin, aspartate amino-transferase (AST), alanine amino-transferase (ALT), total cholesterol, triglycerides, and waist measure. In addition, GFR combined with the Modification of Diet in Renal Disease (GFR-MDRD) was used as one of the response variables, which are described in detail in Section 3.1.

Stages of CKD
The GFR used as the response variable in this study is categorized into six stages, and GFR is calculated with the MDRD method as given in Equation (5). The result of Equation (5) was technically divided into five stages according to the criteria in Table 2. This study regards stage 3A and stage 3B as individual stages and uses six CKD stages based on eGFR as the response variable.
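To make the staging concrete, the sketch below uses the widely cited four-variable MDRD equation and the usual eGFR cut-offs (90, 60, 45, 30, and 15 mL/min/1.73 m²). These are assumptions of this example rather than a reproduction of Equation (5) or Table 2: the constant 175 applies to IDMS-standardized creatinine (186 in the original formulation), the race coefficient is omitted, and the function names are ours.

```python
def egfr_mdrd(serum_creatinine_mg_dl, age_years, female, constant=175.0):
    """Four-variable MDRD estimate of GFR in mL/min/1.73 m^2 (race term omitted)."""
    egfr = constant * serum_creatinine_mg_dl ** -1.154 * age_years ** -0.203
    if female:
        egfr *= 0.742  # sex coefficient of the MDRD equation
    return egfr


def ckd_stage(egfr):
    """Map an eGFR value to one of the six classes used as the response variable."""
    if egfr >= 90:
        return "1"
    if egfr >= 60:
        return "2"
    if egfr >= 45:
        return "3a"
    if egfr >= 30:
        return "3b"
    if egfr >= 15:
        return "4"
    return "5"


# Hypothetical patients: same creatinine and age, different sex.
for female in (False, True):
    value = egfr_mdrd(serum_creatinine_mg_dl=1.0, age_years=60, female=female)
    print(female, round(value, 1), ckd_stage(value))
```

Because the stage is a deterministic function of serum creatinine, age, and sex under this equation, the class boundaries in the response variable are sharp thresholds on the computed eGFR.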

The Property and Treatment of Highly Imbalanced Data
A classification problem on highly imbalanced data predominantly occurs in the real world. Highly imbalanced data have an extremely skewed distribution, in that the number of observations belonging to one class is significantly lower than those belonging to other classes. The problem hinders analytic models from sufficiently training and learning the feature of the minority class. In this situation, the predictive model developed using conventional machine learning algorithms could be biased and inaccurate.
Sampling is considered one of the approaches for solving such class imbalance problems. The main idea of this technique is to create a balanced class distribution. Each sampling technique has its pros and cons. For example, random oversampling increases the number of cases in the minority class by replicating these cases at random, whereas random undersampling balances the dataset by discarding cases from the overrepresented majority class. Unlike the undersampling technique, random oversampling does not lead to any information loss. However, it may result in overfitting because it replicates minority class events.
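For readers unfamiliar with the two techniques, a minimal sketch of random oversampling and random undersampling is given below. It is shown only to illustrate the trade-off described above; the data and function names are hypothetical, and the sketch is not part of the analysis in this study.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(rows, labels):
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups


def random_oversample(rows, labels, seed=0):
    """Replicate minority-class rows at random until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = max(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_rows.extend(group + extra)
        out_labels.extend([label] * target)
    return out_rows, out_labels


def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class so every class matches the smallest."""
    rng = random.Random(seed)
    groups = _group_by_class(rows, labels)
    target = min(len(group) for group in groups.values())
    out_rows, out_labels = [], []
    for label, group in groups.items():
        out_rows.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_rows, out_labels


# Hypothetical data: 8 majority cases and 2 minority cases.
rows = list(range(10))
labels = ["majority"] * 8 + ["minority"] * 2

_, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels), Counter(under_labels))
```

Oversampling keeps all 10 original rows but repeats the 2 minority rows, while undersampling discards 6 of the 8 majority rows, which is the information loss mentioned above.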
In this study, we tried to avoid using sampling techniques to find the best classification model on highly imbalanced data and instead compared generalized linear models to recent machine learning algorithms for original data.

Preprocessing
Data preprocessing was initially conducted via one-hot encoding for categorical variables and scaling for continuous variables. Resampling methods were not utilized so that the performance of AE for the original imbalance dataset could be confirmed.
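The two preprocessing steps can be sketched as follows. The text does not state which scaling was applied, so z-score standardization is an assumption of this example, as are the helper names and the toy values.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode a categorical variable as 0/1 indicator columns, one per category."""
    categories = sorted(set(values))
    encoded = [[1 if value == category else 0 for category in categories]
               for value in values]
    return encoded, categories


def standardize(values):
    """Scale a continuous variable to mean 0 and standard deviation 1 (assumed scaling)."""
    mu, sigma = mean(values), pstdev(values)
    return [(value - mu) / sigma for value in values]  # requires sigma > 0


# Hypothetical smoking status and serum creatinine values for three people.
smoking = ["never", "current", "never"]
creatinine = [0.8, 1.0, 1.2]

encoded, categories = one_hot(smoking)
scaled = standardize(creatinine)
print(categories, encoded)
print(scaled)
```

One-hot encoding expands each categorical variable into several input nodes, so the number of inputs seen by the models is larger than the number of original explanatory variables.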
We used R 3.6.2 as a statistical analysis tool and the packages "nnet," "ordinal," "randomForest," and "h2o" to build the multinomial LR, ordinal LR, RF, and AE models, respectively. For RF, the number of variables randomly sampled as candidates at each split was set to √19, and the number of trees to grow was set to 10.
In the first step of AE, we treated our input data as labeled with the same input values. Next, the network was forced to learn the identity through a nonlinear and reduced representation for half of each training data in order for a neural network to obtain convergence. This process is called unsupervised, layer-wise pretraining of supervised tasks. The second step of AE was to fine-tune the learned model for the other half of training data to build a final AE model. The hidden layer was set to (50, 20, 6, 20, 50) for both unsupervised and supervised tasks.
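The layer structure described above can be visualized with a small forward pass. This sketch only shows how an input is squeezed through hidden layers of sizes (50, 20, 6, 20, 50) and expanded back to the input width; the number of inputs (30), the tanh activation, and the random untrained weights are assumptions of this example, and it does not reproduce the training performed with "h2o".

```python
import math
import random

HIDDEN = (50, 20, 6, 20, 50)  # hidden layer sizes reported in the text


def init_layers(n_inputs, hidden=HIDDEN, seed=0):
    """Create random (untrained) weights for input -> hidden layers -> output."""
    rng = random.Random(seed)
    sizes = [n_inputs, *hidden, n_inputs]  # output width equals input width
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, x):
    """Return the activations of every layer, starting with the input itself."""
    activations = [x]
    for depth, (weights, biases) in enumerate(layers):
        previous = activations[-1]
        z = [sum(w * a for w, a in zip(row, previous)) + b
             for row, b in zip(weights, biases)]
        is_output = depth == len(layers) - 1
        # tanh on hidden layers (assumed); linear output for reconstruction.
        activations.append(z if is_output else [math.tanh(v) for v in z])
    return activations


def reconstruction_error(x, x_hat):
    """Mean squared difference between the input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


n_inputs = 30                      # hypothetical number of inputs after one-hot encoding
layers = init_layers(n_inputs)
x = [0.5] * n_inputs
activations = forward(layers, x)
code = activations[3]              # the 6-node middle layer
print([len(a) for a in activations])
print(reconstruction_error(x, activations[-1]))
```

Training would adjust the weights to shrink the reconstruction error; the 6-node middle layer is the compressed representation that the supervised step then builds on.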

Results and Discussion
The results presented in this section are as follows. Section 3.1 presents the baseline characteristics according to the six stages of CKD based on eGFR. Section 3.2 provides the results of the four built models to classify the response variable into multiple classes. It likewise compares the classification performance of the four models in terms of accuracy, sensitivity, specificity, precision, and F1-Measure so we can choose the best classification model. Finally, Section 3.3 compares the results of this study with those of a similar study and discusses the applicability of AE to other medical fields.

Baseline Characteristic of Variables
This section discusses the baseline characteristics of all explanatory variables by each class of response variable. Table 3 shows the mean, standard deviation, and frequency of each variable according to the six categories of GFR-MDRD.
A look at the frequencies of the six categories of GFR-MDRD in Table 3 reveals a sharp decrease in frequency in the high-risk group compared with that in the normal group, and in the most at-risk groups of <15 and 15-29, the frequencies are very small at 0.33% and 0.15%, respectively. These figures indicate the need for caution when interpreting classification performance, such as accuracy, sensitivity, and so on. Creatinine levels tend to increase rapidly as we move from the normal to the risk groups, and age and the proportion of women are also higher in the risk groups.
Table 4 shows the classification results obtained by applying the generalized linear models and machine learning models presented above to the different CKD stages. The average accuracies of the four models are all high at 0.8244, 0.9682, 0.9948, and 0.9958 for multinomial LR, ordinal LR, RF, and AE, respectively. The average accuracy for the two machine learning models is slightly higher than that for the generalized linear models. However, when it comes to the performance interpretation on a highly imbalanced dataset, average sensitivity and average specificity should be considered as key criteria. A close look at the average sensitivity of stages 4 and 5, which are the minority classes and the categories of interest, reveals that multinomial LR and RF present poor sensitivity, given that the sensitivity for stage 4 is 0.0195 and 0.2771, respectively. Meanwhile, the sensitivity of stage 5 for multinomial LR and RF is also too low for these algorithms to be regarded as good models.

Results
When comparing the average values of accuracy, sensitivity, and specificity of the four models applied in this study, the AE has the highest accuracy of 0.9958, with average sensitivities of 0.9976, 0.9965, 0.9905, 0.9223, 0.8462, and 0.9818, indicating that it has the highest value in the risk group (30-44, 15-29, <15) among the four models. Therefore, AE performance is the best among the four models in classifying the CKD stages of highly imbalanced data. Table 4 also gives the standard deviation for the 10-fold CV confusion matrix. A comparison of the standard deviation for classification performance, such as accuracy, sensitivity, specificity, precision, and F1-Measure between algorithms, confirms that AE shows a lower standard deviation than multinomial LR, ordinal LR, and RF.
The missing F1-Measure values for multinomial LR and RF may seem strange. They arise from the fact that the F1-Measure is calculated by averaging the F1-Measures from the 10 confusion matrices. If both sensitivity and precision are 0 in a certain confusion matrix, then the F1-Measure is not available because the denominator is 0. This also provides evidence that multinomial LR and RF perform poorly for one or more specific CV sets.
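The following minimal sketch shows how a single undefined fold makes the fold-averaged F1-Measure unavailable. The per-fold values are hypothetical, the function name `f1_measure` is introduced only for illustration, and representing "not available" as NaN is an assumption; the software actually used in the study may report it differently.

```python
import math


def f1_measure(sensitivity, precision):
    """Harmonic mean of sensitivity and precision.

    Returns NaN when both inputs are 0, because the denominator
    (sensitivity + precision) is then 0.
    """
    denominator = sensitivity + precision
    if denominator == 0:
        return math.nan
    return 2 * sensitivity * precision / denominator


# Hypothetical (sensitivity, precision) pairs for one minority class
# over four CV folds; the third fold detects no case of the class.
folds = [(0.30, 0.60), (0.25, 0.50), (0.0, 0.0), (0.35, 0.70)]
f1_per_fold = [f1_measure(s, p) for s, p in folds]

# A plain arithmetic mean propagates the NaN, so the averaged value
# is itself not available.
mean_f1 = sum(f1_per_fold) / len(f1_per_fold)
print(f1_per_fold)
print(mean_f1)
```

Averaging only the defined folds would yield a number, but it would hide the fold in which the class was missed entirely, so leaving the cell empty is the more transparent choice.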
The comparison of the classification performance indices (accuracy, sensitivity, specificity, precision, and F1-Measure) shows that the machine learning algorithms, RF and AE, perform better than multinomial LR and ordinal LR. With respect to sensitivity, ordinal LR and AE perform better than multinomial LR and RF. This result is based on the sensitivity for the minority classes, stages 4 and 5. Taken together, these results suggest that AE is the best model for dealing with highly imbalanced data without resampling techniques.
The AE algorithm, which is based on an artificial neural network, gives the best classification result among the four techniques in classifying CKD into six stages, as shown below. The accuracy of AE is 99.58%.

•
• Stage 5 patients: sensitivity 98.18%, specificity 99.99%, precision 98.21%, and F1-Measure 98.18%
In addition, a confusion matrix for AE, calculated as the sum of the confusion matrices from 10-fold CV, is shown in Figure 5. The orange and yellow areas for stages 4 and 5, respectively, indicate the correctly classified cases for each stage. Figure 5 graphically confirms the high sensitivity for each stage presented above: most cases are correctly classified, and the diagonal elements account for most cases in the confusion matrix from AE.
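Summing fold-level confusion matrices into a single matrix, as done for Figure 5, can be sketched as follows. The two fold matrices are hypothetical three-class examples rather than the study's six-class results, and the function names are assumptions made for this illustration.

```python
def sum_confusion_matrices(matrices):
    """Element-wise sum of equally sized square confusion matrices."""
    n = len(matrices[0])
    pooled = [[0] * n for _ in range(n)]
    for cm in matrices:
        for i in range(n):
            for j in range(n):
                pooled[i][j] += cm[i][j]
    return pooled


def row_sensitivity(cm):
    """Sensitivity per true class: diagonal count divided by the row total."""
    return [cm[k][k] / sum(cm[k]) for k in range(len(cm))]


# Hypothetical fold-level matrices (rows = true class, columns = predicted).
fold_cms = [
    [[90, 1, 0], [2, 20, 1], [0, 1, 3]],
    [[89, 2, 0], [1, 21, 1], [0, 0, 4]],
]
pooled = sum_confusion_matrices(fold_cms)
print(pooled)
print(row_sensitivity(pooled))
```

Note that sensitivities read from the pooled matrix need not equal the average of the fold-level sensitivities reported in Table 4, because pooling weights each fold by its number of cases in the class.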
Specifically, the values of sensitivity for stages 3B and 4 are 92.23% and 84.62%, respectively. These figures are lower than the other values of sensitivity because cases in stages 3B (0.66%) and 4 (0.15%) are rare. Although this phenomenon often occurs in an imbalanced dataset, the sensitivities for stages 3B (92.23%) and 4 (84.62%) can still be interpreted as good performance.
Table 5 lists the macro-averaged and micro-weighted sensitivity, precision, and F1-Measure for each model. These values combine the per-class sensitivity, precision, and F1-Measure into a single value. Macro values are the arithmetic mean of the per-class values. To calculate micro values, in contrast, the sensitivity, precision, and F1-Measure of each class are weighted by the number of samples from that class. Because some cells of the F1-Measure are not available for multinomial LR and RF, the micro F1-Measures for these two models are also unavailable. Similar to the result from Table 4, Table 5 confirms that AE shows superior classification performance over the other models on the basis of the macro and micro values.
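The two ways of combining per-class values can be sketched as below, following the definitions given in the text (arithmetic mean for macro values, class-size weights for micro values). The per-class sensitivities are the AE values reported above; the class counts are hypothetical, chosen only so that the three rarest classes have shares of 0.66%, 0.15%, and 0.33%, and the function names are assumptions for this illustration. The printed values are therefore not expected to reproduce Table 5.

```python
def macro_average(values):
    """Arithmetic mean of the per-class values."""
    return sum(values) / len(values)


def support_weighted_average(values, supports):
    """Per-class values weighted by the number of samples in each class."""
    total = sum(supports)
    return sum(v * n for v, n in zip(values, supports)) / total


# AE sensitivities reported in the text for stages 1, 2, 3A, 3B, 4, and 5.
ae_sensitivity = [0.9976, 0.9965, 0.9905, 0.9223, 0.8462, 0.9818]

# Hypothetical class counts per 10,000 cases (not the study's counts).
hypothetical_supports = [5000, 4000, 886, 66, 15, 33]

macro = macro_average(ae_sensitivity)
weighted = support_weighted_average(ae_sensitivity, hypothetical_supports)
print(macro)
print(weighted)
```

Because the macro value gives every class the same weight, it is pulled down by the rarer stages 3B and 4, whereas the weighted value is dominated by the majority classes; reporting both gives a more complete picture on imbalanced data.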

Discussion
For logistic regression to show good classification performance, the cases must be evenly distributed across the categories of the response variable. When logistic regression is applied to imbalanced data, the unbiasedness of the coefficients cannot be guaranteed. The results of this work confirm that the classification performance of the logistic regression models for minority classes is lower than that of AE. The performance of ordinal LR is clearly better than that of multinomial LR. A possible explanation for this outcome is that ordinal LR utilizes more samples to build a model than does multinomial LR. The fact that the response variable has ordered categories may be another reason. However, ordinal LR presents a less balanced performance across majority and minority classes than does AE.
This study used AE for modeling imbalanced data to check whether feature extraction, an outstanding advantage of AE, is effective for minority classes. The results confirm that AE shows excellent performance in both majority and minority classes. This study builds models on the raw data without applying sampling techniques for imbalanced data. The result can therefore be interpreted as meaning that AE is free from the data distortion that sampling techniques can introduce. One of the characteristics of medical data is that they are highly imbalanced because events of interest tend to happen rarely. For these reasons, AE can be widely applied to a variety of medical research fields.
One reason AE shows better classification performance in this study may be the large number of samples. A related study classifying different CKD stages based on eGFR values shows that a probabilistic neural network presents better performance than other machine learning algorithms, such as the multilayer perceptron, support vector machine, and radial basis function. The dataset used in that study consisted of 361 Indian patients with CKD and contained 25 variables (11 numerical and 14 categorical) [12]. That sample size is substantially smaller than that of our research, which had 134,895 samples. Therefore, classification performance may be affected by the total sample size.

Conclusions
Several studies have confirmed that AE has excellent performance in feature extraction. The results of this study reveal that this advantage enables AE to learn the characteristics of minority samples in highly imbalanced data. Because most medical data are imbalanced, AE has strong potential for application in various medical diagnoses. For example, AE can be utilized to detect rare symptoms via ECG or EEG signals in the form of sound data. AE will also be valuable in disease diagnosis using X-ray, CT, and MRI, which are image data types.
This study did not compare models built on a resampled dataset with models built on the original data. This decision was made on the basis of the large number of samples. In contrast, many datasets in medical fields have small sample sizes. Therefore, it would be very meaningful to study the effectiveness of recently suggested sampling techniques on such data.