Open Access Article

Comparison between Statistical Models and Machine Learning Methods on Classification for Highly Imbalanced Multiclass Kidney Data

1 Department of Information & Statistics, Chungbuk National University, Chungbuk 28644, Korea
2 Department of Internal Medicine, Chungbuk National University College of Medicine, Chungbuk 28644, Korea
3 Department of Internal Medicine, Chungbuk National University Hospital, Chungbuk 28644, Korea
4 Intelligent Network Research Section, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Korea
5 Department of Statistics, Sungshin Women’s University, Seoul 02844, Korea
* Authors to whom correspondence should be addressed.
These authors contributed equally to this manuscript.
Diagnostics 2020, 10(6), 415; https://doi.org/10.3390/diagnostics10060415
Received: 1 April 2020 / Revised: 3 June 2020 / Accepted: 16 June 2020 / Published: 18 June 2020
(This article belongs to the Section Machine Learning and Artificial Intelligence in Diagnostics)
This study compares the classification performance of statistical models and machine learning methods on highly imbalanced kidney data. The health examination cohort database provided by the National Health Insurance Service in Korea is used to build the models. The glomerular filtration rate (GFR) is used to diagnose chronic kidney disease (CKD). It is estimated using the Modification of Diet in Renal Disease method and classified into five stages (1, 2, 3, 4, and 5), with stage 3 subdivided into 3A and 3B; the resulting six categories form the classes of the response variable. This study uses two representative generalized linear models for classification, namely, multinomial logistic regression (multinomial LR) and ordinal logistic regression (ordinal LR), as well as two machine learning models, namely, random forest (RF) and autoencoder (AE). The classification performance of the four models is compared in terms of accuracy, sensitivity, specificity, precision, and F1-measure. To find the model that classifies CKD stages most accurately, the data are divided into 10 folds, each preserving the proportion of every CKD stage. Results indicate that RF and AE achieve higher accuracy than the multinomial and ordinal LR models when classifying the response variable. However, when a highly imbalanced dataset is modeled, accuracy alone can misrepresent actual performance, because accuracy remains high even when a model assigns minority-class cases to a majority class. To address this problem in performance interpretation, we consider not only the accuracy obtained from the confusion matrix but also the sensitivity, specificity, precision, and F1-measure for each class. To summarize classification performance with a single value for each model, we calculate the macro-average and micro-weighted values of these indices. We conclude that AE is the best of the four models at classifying CKD stages correctly across all performance indices.
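The point about accuracy on imbalanced data can be illustrated with a minimal Python sketch. The class counts below are invented for illustration and are not taken from the study's data; the function names (`per_class_counts`, `summarize`) are hypothetical, and the micro-average is assumed here to mean pooling the one-vs-rest counts over all classes, which may differ from the weighting the authors used.

```python
# Toy illustration: a classifier that always predicts the majority class
# scores high accuracy but very low macro-averaged sensitivity.
# All counts are invented; they are NOT the study's data.

LABELS = ["1", "2", "3A", "3B", "4", "5"]  # six CKD classes from the abstract


def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn, tn)} using one-vs-rest counting."""
    n = len(y_true)
    counts = {}
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        counts[c] = (tp, fp, fn, n - tp - fp - fn)
    return counts


def safe_div(num, den):
    """Division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def summarize(y_true, y_pred, labels):
    """Accuracy plus macro- and micro-averaged sensitivity (recall)."""
    counts = per_class_counts(y_true, y_pred, labels)
    accuracy = safe_div(sum(t == p for t, p in zip(y_true, y_pred)), len(y_true))
    # Macro: average the per-class sensitivities, so every class counts equally.
    recalls = [safe_div(tp, tp + fn) for tp, _, fn, _ in counts.values()]
    macro = sum(recalls) / len(labels)
    # Micro (assumed definition): pool TP and FN over classes, then divide once.
    tp_sum = sum(v[0] for v in counts.values())
    fn_sum = sum(v[2] for v in counts.values())
    micro = safe_div(tp_sum, tp_sum + fn_sum)
    return {
        "accuracy": accuracy,
        "macro_sensitivity": macro,
        "micro_sensitivity": micro,
    }


# Invented imbalanced sample: 90 cases in stage 2, 2 cases in each other stage.
y_true = ["2"] * 90 + [c for c in LABELS if c != "2" for _ in range(2)]
y_pred = ["2"] * len(y_true)  # degenerate model: always predict the majority

result = summarize(y_true, y_pred, LABELS)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

On this toy sample the degenerate model reaches an accuracy of 0.90, yet its macro-averaged sensitivity is only about 0.17, because five of the six classes are never detected. The pooled micro-average equals the accuracy in this single-label setting, which is why per-class and macro-averaged indices are the more informative summaries when classes are highly imbalanced.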
Keywords: imbalanced data; autoencoder; machine learning; chronic kidney disease; national health screening

MDPI and ACS Style

Jeong, B.; Cho, H.; Kim, J.; Kwon, S.K.; Hong, S.; Lee, C.; Kim, T.; Park, M.S.; Hong, S.; Heo, T.-Y. Comparison between Statistical Models and Machine Learning Methods on Classification for Highly Imbalanced Multiclass Kidney Data. Diagnostics 2020, 10, 415.

