Performance Analysis of Conventional Machine Learning Algorithms for Identification of Chronic Kidney Disease in Type 1 Diabetes Mellitus Patients

Chronic kidney disease (CKD) is one of the severe complications of type 1 diabetes mellitus (T1DM). However, the detection and diagnosis of CKD are often delayed because of its asymptomatic nature. In addition, patients often bypass the traditional urine protein (urinary albumin)-based CKD detection test. Even though disease detection using machine learning (ML) is a well-established field of study, it is rarely used to diagnose CKD in T1DM patients. This research aimed to employ and evaluate several ML algorithms to develop models that quickly predict CKD in patients with T1DM using easily available routine checkup data. This study analyzed 16 years of data from 1375 T1DM patients, obtained from the Epidemiology of Diabetes Interventions and Complications (EDIC) clinical trials directed by the National Institute of Diabetes, Digestive, and Kidney Diseases, USA. Three data imputation techniques (RF, KNN, and MICE) and the SMOTETomek resampling technique were used to preprocess the primary dataset. Ten ML algorithms, including logistic regression (LR), k-nearest neighbor (KNN), Gaussian naïve Bayes (GNB), support vector machine (SVM), stochastic gradient descent (SGD), decision tree (DT), gradient boosting (GB), random forest (RF), extreme gradient boosting (XGB), and light gradient-boosted machine (LightGBM), were applied to develop prediction models. The models drew on 19 demographic, medical history, behavioral, and biochemical features, and every feature’s effect was ranked using three feature ranking techniques (XGB, RF, and Extra Tree). Lastly, each model’s ROC, sensitivity (recall), specificity, accuracy, precision, and F1 score were estimated to find the best-performing model. The RF classifier model exhibited the best performance with 0.96 (±0.01) accuracy, 0.98 (±0.01) sensitivity, and 0.93 (±0.02) specificity. LightGBM performed second best and was close to RF with 0.95 (±0.06) accuracy. In addition to these two models, KNN, SVM, DT, GB, and XGB models also achieved more than 90% accuracy.


Introduction
Diabetes mellitus (DM) is currently one of the most severe health issues facing the world, and it affects around 463 million individuals worldwide [1]. DM is considered one of the most prevalent endocrine and metabolic disorders, causing substantial damage to various organs, including the kidney [2,3]. As a result, persons with diabetes mellitus are more likely to develop chronic renal disease. According to the International Diabetes Federation (IDF), around 10% of DM patients have Type 1 diabetes mellitus (T1DM).
In the T1DM population, the lifetime risk of kidney impairment is estimated to be 50% and could be as high as 70% [4]. According to the 2016 Annual Data Report of the US Renal Data System, diabetic kidney disease is one of the leading causes of end-stage renal disease (ESRD), also called end-stage kidney disease (ESKD), in North America [5]. Although the incidence of ESKD has stabilized or declined in patients with type 1 diabetes over the past decades [6,7], most likely due to the increased use of renin-angiotensin system (RAS) blockers [8], it remains a life-threatening complication. Chronic kidney disease (CKD) is linked to higher morbidity and mortality in T1DM patients, and ESRD significantly increases mortality [9].
CKD is diagnosed when a chronic decline in renal function and structural kidney damage are present [10]. Glomerular filtration rate (GFR), which represents the amount of fluid the kidneys filter per unit time, is the most precise indicator of overall kidney function [11]. Normal renal function can be defined using the estimated glomerular filtration rate (eGFR), and this definition is age-dependent. An eGFR of more than 90 mL/min/1.73 m² is considered normal renal function. Although the eGFR value decreases with age, an eGFR value lower than 90 mL/min/1.73 m² indicates that the kidney is not working properly. Although CKD diagnosis and classification have changed over time, according to KDIGO 2012 and current international standards, a person with an eGFR less than 60 mL/min/1.73 m² for more than 3 months is considered a CKD patient [12]. Fatigue, fluid retention, abnormalities in the urine, limb edema, nausea, vomiting, and neurological and cognitive impairment are symptoms of CKD, although it can be asymptomatic in many cases [13]. Because asymptomatic patients need a specific laboratory-based test to identify CKD, the recognition, diagnosis, and treatment of its many etiologies are often delayed.
Furthermore, in the conventional urine protein (urinary albumin)-based CKD diagnosis technique, 24 h urine collection specimen analysis is considered the gold standard. Although the urinary albumin-to-creatinine ratio (uACR) and urinary protein-to-creatinine ratio (uPCR) currently represent excellent alternatives to the gold-standard analysis of a 24 h urine collection [14], there is still a tendency to bypass the urine albumin test. According to Medicare (a national health insurance program in the USA) claims data for diabetic patients, only half of these patients conduct tests for urine albumin [5]. However, early detection of CKD can benefit patients in receiving effective treatment because there are therapy options for slowing the progression of renal disease [15]. As CKD is ubiquitous in patients with T1DM and can be asymptomatic, an accurate prediction model that operates on easily available features can be helpful to recognize patients at higher risk of kidney function decline who may benefit from more intensive management.
The use of machine learning (ML) algorithms in addressing various disease classification problems has recently expanded due to remarkable advancements in related technologies [16-18]. Although there are some examples of using ML tools in kidney disease prediction [13,19-22], their use in developing CKD prediction models for type 1 diabetes mellitus patients is scarce. For example, Segal et al. [13] employed a gradient boosting tree algorithm (extreme gradient boosting implementation) to construct a model for predicting ESRD. Another study [20] established and compared nine ML models to estimate the 24 h urinary protein result response to detect CKD. These two studies did not emphasize diabetic patients.
On the other hand, Makino et al. [22] used artificial intelligence (AI) to design a prediction model for diabetic kidney diseases (DKD) based on electronic medical records (EMRs) with a 0.74 AUC score at maximum. Another research conducted by Dagliati et al. [19] applied four machine learning methods to create prediction models to identify complications of type 2 diabetes mellitus (T2DM) and achieved an accuracy of up to 0.838. Low et al. [23] used stepwise multivariable logistic regression to design a CKD progression prediction model for T2DM, where sensitivity and specificity were 75.6% and 72.3%. Some other studies also developed kidney disease prediction models in T2DM patients [24,25].
Apart from some common factors, type 1 diabetes is different from type 2 diabetes [26]. Moreover, type 1 diabetes patients are diagnosed at a younger age than those with type 2 diabetes and are subjected to diabetes-related risk factors for a more extended period. Thus, adult patients with T1DM have an overall greater risk of CKD and ESKD than patients with T2DM [27,28]. Unfortunately, limited research has been conducted to develop prediction models for CKD in T1DM patients. Vistisen et al. [8] used Poisson regression analysis to develop an ESKD prediction model in T1DM patients with C-statistics between 0.88 and 0.96. Colombo et al. [6] employed ridge regression to create a model for predicting renal disease progression in T1DM patients. To our knowledge, no other prediction models have been built to identify CKD in the type 1 diabetic population. Neither of these models used traditional machine learning algorithms. In addition, both models included albuminuria, an increased excretion of urinary albumin (urine protein) and a marker of kidney damage, as one of the most vital features. However, according to the National Kidney Foundation, USA, 24 h urine collection is needed to properly detect albuminuria [29], which is inconvenient in many cases and can be overlooked easily by many asymptomatic CKD patients [5].
This study aimed to construct and compare CKD prediction models for T1DM patients utilizing 10 traditional supervised ML algorithms: logistic regression (LR), k-nearest neighbor (KNN), Gaussian naive Bayes (GNB), support vector machine (SVM), stochastic gradient descent (SGD), decision tree (DT), gradient boosting (GB), random forest (RF), extreme gradient boosting (XGB), and light gradient-boosted machine (LightGBM). Here, we included only demographic, behavioral, medical history, and biochemical blood features, which are easily available during routine follow-up of T1DM patients, to predict CKD. We also applied three feature ranking techniques, extreme gradient boosting (XGB), random forest (RF), and extremely randomized trees classifier (Extra Tree), to find the relative importance of these features. In summary, this study provides a reliable machine learning-based CKD prediction model dedicated to the T1DM population. The model can operate using simple routine checkup data of T1DM patients and deliver results almost immediately. As a result, when a 24 h urine protein-based laboratory test is not feasible, this model can be used to predict CKD. Furthermore, all T1DM patients may utilize this model to make an educated guess about their CKD state during their regular checkups, which will increase the likelihood of detecting asymptomatic CKD patients at an earlier stage.

Overall Process
This study followed a pipeline of seven steps: primary data selection, data imputation to fill missing data, data augmentation to balance target classes, feature ranking to identify the most important features, machine learning algorithms to develop models, model evaluation, and best model selection. Figure 1 illustrates the overall working procedure of training and testing different machine learning models for CKD prediction in T1DM patients.

Data Collection
This study used the GFR dataset from the Epidemiology of Diabetes Interventions and Complications (EDIC) clinical trial. The National Institute of Diabetes, Digestive, and Kidney Diseases (Bethesda MD, Montgomery, Maryland, USA) conducted this trial to observe the impact of intensive diabetes treatment on the T1DM population [30,31]. The EDIC study started with 1375 T1DM patients in 1994 and is ongoing [32]. Of these patients, 48% were female and 52% were male.
The EDIC study collects data at 28 EDIC clinic sites across the US and Canada, which ensures diversity in patient types. The EDIC study is longitudinal: the patients' initial ages ranged from 19 to 57 years, and, after decades of data collection, the study holds patients' information from age 19 years to 80 years. According to the International Diabetes Federation (IDF), most T1DM patients are adults aged 20-79 years [1]. In the EDIC study, serum creatinine levels were measured annually throughout the period at the EDIC Central Biochemistry Laboratory, University of Minnesota [31], using an automated kinetic modification of the Jaffe reaction on a Beckman Synchron CX3 Clinical C System [33,34]. The Chronic Kidney Disease Epidemiology Collaboration (CKD-EPI) formula was used to calculate estimated GFR (eGFR) using data on serum creatinine levels, age, sex, and race [31,35]. A sustained eGFR value <60 mL/min/1.73 m² on at least two consecutive collections was considered an abnormal eGFR.
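As an illustration of the eGFR calculation described above, the following minimal sketch implements the 2009 CKD-EPI creatinine equation. The text does not state which version of the CKD-EPI formula was applied; the 2009 version is assumed here because it takes serum creatinine, age, sex, and race as inputs. The function name and argument names are hypothetical.

```python
def ckd_epi_2009(scr_mg_dl: float, age_years: float, female: bool, black: bool) -> float:
    """Estimated GFR in mL/min/1.73 m^2 from the 2009 CKD-EPI creatinine equation.

    Assumption: serum creatinine is given in mg/dL.
    """
    kappa = 0.7 if female else 0.9          # sex-specific creatinine scaling
    alpha = -0.329 if female else -0.411    # sex-specific exponent below kappa
    ratio = scr_mg_dl / kappa
    egfr = (141.0
            * min(ratio, 1.0) ** alpha
            * max(ratio, 1.0) ** -1.209
            * 0.993 ** age_years)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159
    return egfr


# Example: a 40-year-old male with serum creatinine 1.5 mg/dL
print(round(ckd_epi_2009(1.5, 40, female=False, black=False), 1))
```

Higher serum creatinine or greater age lowers the estimate, which is why neither serum creatinine nor current GFR can be used as a predictor of an eGFR-based label without leaking the target.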
During the EDIC trial, participants' body mass index (BMI), blood pressure (BP), and glycated hemoglobin (HbA1c) levels were all measured yearly [36]. The presence of systolic BP ≥ 140 mmHg and/or diastolic BP ≥ 90 mmHg on two successive yearly visits was considered incident hypertension. Pulse pressure (PP) was calculated as the difference between systolic and diastolic pressure [36]. The albumin excretion rate (AER) and fasting lipid levels (cholesterol, triglycerides, HDL, and LDL) were measured every 2 years [31]. The EDIC study also included other physical (sex, age, weight), behavioral (smoking, drinking), medication use (use of antihypertensive medications, ACE inhibitors, and lipid-lowering agents), and diabetes-specific (daily insulin dose and duration of diabetes) information [36].
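The blood pressure rules above can be sketched in a few lines. This is only an illustration of the stated definitions (elevated BP on two successive yearly visits, and pulse pressure as a difference); the function names and the example readings are hypothetical.

```python
def pulse_pressure(sbp: float, dbp: float) -> float:
    """Pulse pressure: systolic minus diastolic pressure (mmHg)."""
    return sbp - dbp


def is_elevated(sbp: float, dbp: float) -> bool:
    """Elevated reading: systolic >= 140 and/or diastolic >= 90 mmHg."""
    return sbp >= 140 or dbp >= 90


def incident_hypertension_visit(visits):
    """Index of the visit that confirms hypertension, or None.

    `visits` is a list of (sbp, dbp) tuples in yearly order. Hypertension is
    confirmed at the second of two successive elevated visits.
    """
    for i in range(1, len(visits)):
        if is_elevated(*visits[i - 1]) and is_elevated(*visits[i]):
            return i
    return None


# Hypothetical yearly readings for one participant
yearly_bp = [(120, 80), (142, 85), (138, 92), (118, 76)]
print(incident_hypertension_visit(yearly_bp))
```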
Demographic and behavioral data were assessed by self-report, whereas experienced persons checked blood pressure, and medication use was assessed yearly by self-report [35]. All laboratory measurements were carried out using standardized methods in the EDIC central biochemistry laboratory, and long-term quality control mechanisms were in place to guard against measurement drift [36,37].
In our research, we considered data of 1375 participants over 16 years of the EDIC study from 1994 to 2010. After removing all duplicate data, we finally selected 3184 samples in total. In our dataset, the target variable had two classes: CKD represented as 1 and non-CKD represented as 0. We used an eGFR value of less than 60 mL/min/1.73 m² on at least two consecutive collections to define CKD as defined by KDIGO 2012 [12]. Furthermore, other studies utilized the same measurement to indicate substantial GFR deterioration [33,36-40], and it is considered evidence of CKD [10,11].
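The labeling rule above (eGFR below 60 mL/min/1.73 m² on at least two consecutive collections) can be expressed as a short function. This is a minimal sketch under the assumption that each participant's eGFR values are available as a chronologically ordered list without gaps; the function name and example series are hypothetical.

```python
def ckd_label(egfr_series, threshold=60.0, run_length=2):
    """Return 1 (CKD) if eGFR stays below `threshold` on at least
    `run_length` consecutive collections, otherwise 0 (non-CKD)."""
    run = 0
    for value in egfr_series:
        run = run + 1 if value < threshold else 0  # reset on any normal value
        if run >= run_length:
            return 1
    return 0


print(ckd_label([95, 58, 72, 59]))  # two low values, but not consecutive
print(ckd_label([95, 58, 57, 80]))  # two consecutive low values
```

A single low reading followed by recovery does not trigger the label, which matches the requirement that the decline be sustained.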
In our study, along with medical history, demographics, and behavioral information, we only considered laboratory data available through routine checkups of a T1DM patient. In total, we included 19 parameters: age, sex, BMI, smoking and drinking habit, hypertension, use of ACE inhibitors and antihypertensive medicine, daily insulin dose, hypercholesterolemia, duration of insulin-dependent diabetes mellitus (IDDM), glycated hemoglobin (HbA1c) levels, total cholesterol, triglycerides, high-density lipoproteins (HDL), low-density lipoproteins (LDL), systolic blood pressure (SBP), diastolic blood pressure (DBP), and mean blood pressure. These parameters were considered essential for CKD detection in other studies [13,19,20,38-40]. To avoid overfitting problems, we did not consider parameters such as albumin excretion rate (AER), serum creatinine, and current GFR because serum creatinine is used to calculate eGFR, and AER is a CKD identifier [10,24,35]. Moreover, a 24 h urine collection is necessary to measure albumin excretion [29], whereas this study aimed to develop a prediction model that would quickly predict CKD with easily available routine checkup data.

Data Imputation
Our primary dataset had 3184 samples, with 68 missing values in five features: sex, smoking and drinking habit, use of ACE inhibitors, and daily insulin dose. We used three data imputation techniques, random forest (RF), k-nearest neighbors (KNN), and multiple imputation by chained equations (MICE), to fill missing values [41][42][43]. Thus, we created three datasets using three different imputation methods: Dataset RF, Dataset KNN, and Dataset MICE. Figure 2 represents the correlation heatmap of Dataset KNN. Other datasets also provided similar correlation heatmaps (Supplementary Figures S1 and S2).
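The three imputation routes can be sketched with scikit-learn as follows. This is an illustration on synthetic data, not the study's code: the configuration of each imputer is an assumption, `IterativeImputer` is used as a MICE-style chained-equations imputer, and the random forest variant is approximated by plugging a `RandomForestRegressor` into the same iterative scheme. Binary features such as sex or smoking would additionally need their imputed values mapped back to 0 or 1, which is not shown.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import IterativeImputer, KNNImputer

# Synthetic stand-in for the feature matrix (assumption: numeric features only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 4] = X[:, 0] + 0.1 * rng.normal(size=200)  # one correlated column

# Knock out about 2% of entries to mimic sparse missingness
mask = rng.random(X.shape) < 0.02
X_missing = X.copy()
X_missing[mask] = np.nan

imputers = {
    "KNN": KNNImputer(n_neighbors=5),
    "MICE": IterativeImputer(max_iter=10, random_state=0),
    "RF": IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=20, random_state=0),
        max_iter=5,
        random_state=0,
    ),
}

# One completed dataset per imputation technique
datasets = {name: imp.fit_transform(X_missing) for name, imp in imputers.items()}
for name, data in datasets.items():
    print(name, data.shape, int(np.isnan(data).sum()))
```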

Data Augmentation
Our datasets were imbalanced; a total of 391 of 3184 samples were labeled CKD. Therefore, we used the SMOTETomek technique, which combines the synthetic minority oversampling technique (SMOTE) with Tomek links undersampling, to balance the dataset [44,45].
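With 391 CKD samples out of 3184, the majority class holds 3184 − 391 = 2793 samples, so oversampling has to generate roughly 2400 synthetic minority samples before Tomek links are cleaned. The sketch below shows the two building blocks in plain Python: SMOTE's interpolation between a minority sample and one of its minority-class neighbors, and the detection of Tomek links (pairs of mutual nearest neighbors with different labels). It is a simplified illustration with hypothetical function names and toy data, not the implementation used in the study; in practice a library such as imbalanced-learn provides the combined resampler.

```python
import math
import random


def smote_point(x, neighbour, rng):
    """Synthetic sample on the segment between a minority sample and a neighbour."""
    u = rng.random()  # interpolation weight in [0, 1)
    return [a + u * (b - a) for a, b in zip(x, neighbour)]


def nearest(i, points):
    """Index of the point closest to points[i] (excluding itself)."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) that are mutual nearest neighbours with different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links


# Toy one-dimensional example: samples 1 and 2 sit on the class boundary
points = [[0.0], [1.0], [1.2], [3.0], [4.0]]
labels = [0, 0, 1, 1, 1]
print(tomek_links(points, labels))

rng = random.Random(0)
print(smote_point([0.0, 0.0], [2.0, 4.0], rng))
```

Removing the samples involved in Tomek links after oversampling thins out the class boundary, which is why the final class sizes can differ slightly between datasets.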

Feature Ranking
To find feature importance, we applied three feature ranking models to each dataset. We developed these models using extreme gradient boosting (XGB), random forest (RF), and extremely randomized trees classifier (Extra Tree) [41,46,47]. Then, we rearranged all 19 features on the basis of their relative importance. Thus, in combination, we had three datasets with three feature rankings for each dataset.
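A feature ranking of this kind can be sketched with the impurity-based importances that tree ensembles expose. The example below uses synthetic data and only the two rankers available in scikit-learn (random forest and extremely randomized trees); an XGBoost classifier exposes a comparable `feature_importances_` attribute and is left out only to keep the dependencies minimal. The hyperparameters and feature names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic stand-in with 19 features (assumption: not the EDIC data)
X, y = make_classification(n_samples=500, n_features=19, n_informative=5,
                           n_redundant=2, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

rankers = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "Extra Tree": ExtraTreesClassifier(n_estimators=100, random_state=0),
}

rankings = {}
for name, model in rankers.items():
    model.fit(X, y)
    # Sort feature indices from most to least important
    order = np.argsort(model.feature_importances_)[::-1]
    rankings[name] = [feature_names[i] for i in order]

for name, ranked in rankings.items():
    print(name, ranked[:3])
```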

ML Model Development
This research examined the performance of conventional ML algorithm-based CKD classifiers on T1DM patients. Here, we applied 10 traditional supervised ML algorithms: logistic regression (LR), k-nearest neighbor (KNN), Gaussian naïve Bayes (GNB), support vector machine (SVM), stochastic gradient descent (SGD), decision tree (DT), gradient boosting (GB), random forest (RF), extreme gradient boosting (XGB), and light gradient-boosted machine (LightGBM) to develop 10 different prediction models. We used all three datasets with three different feature ranking strategies for each model to find out the best combination of data imputation technique, feature ranking technique, and number of features.
In-house Python 3.7 code using the Scikit-learn machine learning library [48] was used to develop all ML models for prediction, data imputation, augmentation, and feature ranking. To train and test the developed ML models, we used stratified k-fold cross-validation where the value of k was 10. In this work, a multiclass SVM model was considered. The KNN model was created for 25 nearest neighbors, and the RF model used 100 bagged decision trees.
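The evaluation protocol described above can be sketched as follows for two of the ten classifiers. Only the settings stated in the text are taken from the study (10 stratified folds, 25 neighbors for KNN, 100 trees for RF); the synthetic data, the shuffling, and the random seeds are assumptions made for the sake of a runnable example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic balanced binary problem with 19 features (assumption)
X, y = make_classification(n_samples=600, n_features=19, n_informative=6,
                           random_state=0)

# Stratified 10-fold CV: each fold keeps the class ratio of the full dataset
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=25),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=cv, scoring="accuracy")
          for name, model in models.items()}

for name, fold_scores in scores.items():
    print(f"{name}: {fold_scores.mean():.3f} (+/- {fold_scores.std():.3f})")
```

Reporting the mean and spread over the 10 folds is what yields figures of the form 0.96 (±0.01).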

Statistical Analysis
All statistical analyses for baseline EDIC patient characteristics were performed by contrasting the CKD and non-CKD groups. Our data had both continuous and categorical parameters. We calculated the mean ± standard deviation (SD), standard error of the mean (SEM), maximum and minimum value, 95% confidence interval, and correlation for continuous features. An independent t-test was used to obtain the 95% confidence intervals, and the correlation between different variables and CKD was evaluated using Pearson's correlation coefficient with p-values. Table 1 shows the baseline factors of continuous features of EDIC patients to better understand the patients' characteristics. Table 2 presents the baseline characteristics of categorical parameters. Here, all categorical features had binary values (0 or 1). We used the same method to calculate the correlation coefficient for these features. In-house Python 3.7 code was used to perform all statistical analyses.
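The descriptive statistics and the correlation with the CKD label can be sketched in plain Python. When one variable is the binary CKD label, Pearson's coefficient is the point-biserial correlation. The helper names and the toy values are hypothetical, and p-values are omitted because they require a t-distribution (available, for example, through `scipy.stats.pearsonr`).

```python
import math


def mean_sd_sem(values):
    """Mean, sample standard deviation, and standard error of the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return mean, sd, sd / math.sqrt(n)


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy example: a continuous feature against a binary CKD label
sbp = [118, 122, 130, 141, 150, 155]
ckd = [0, 0, 0, 1, 1, 1]
print(mean_sd_sem(sbp))
print(round(pearson_r(sbp, ckd), 3))
```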

Preparing Datasets
In the first year, the average age of EDIC patients was 35.093 (±6.98) years, with a mean diabetes duration of 13.64 (±4.94) years. Initially, only four patients had CKD; across the 16 years of the EDIC study, another 66 patients developed CKD. We considered data over the 16 years of the EDIC study, 20,394 samples in total, and finally selected 3184 samples, 391 of which were labeled CKD. After processing the primary dataset using three data imputations and the SMOTETomek augmentation technique, each of the final three datasets was prepared with 2790 (±10) samples per class.
We applied three different feature ranking algorithms (XGB, RF, Extra Tree) on augmented datasets to create three separate feature rankings for each dataset. Here, we found hypertension, antihypertensive medication, and duration of IDDM as the three most important features in most cases. All three algorithms returned hypertension as the most important feature in all datasets. Other significant characteristics were triglycerides, ACE inhibitors, age, SBP, HDL, LDL, total cholesterol, drinking, mean BP, BMI, drinker, daily insulin dose, and HbA1c. However, the position of different features and their relative importance value varied significantly in these lists. Figure 3 represents the relative importance of features using the Extra Tree, XGB, and RF techniques on Dataset KNN. We observed that each feature ranking model returned almost identical results on every dataset (Dataset RF, Dataset KNN, and Dataset MICE); details can be found in the Supplementary Materials (Supplementary Figures S3-S5).

Performance Evaluation of ML Models
After applying three different data imputation techniques, we had three different datasets (Dataset RF, Dataset KNN, and Dataset MICE), and we applied three different feature ranking methods (XGB, RF, Extra Tree) to these three datasets. In total, we had nine combinations of different data imputation techniques and feature ranking models, and we implemented different machine learning algorithms to construct CKD prediction models using all these combinations. For each combination, we trained and evaluated every ML model using the top feature, then the top two features, the top three features, etc., continuing for all 19 features, to identify the best combination of feature ranking model, data imputation technique, and minimum number of features to achieve the best performance. We applied 10 conventional ML algorithms, LR, KNN, GNB, SVM, SGD, DT, GB, RF, XGB, and LightGBM, to develop CKD prediction models and used 10-fold stratified cross-validation to train and test every model, with a 9:1 training/test data ratio.
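The size of this search can be made concrete by enumerating it. The sketch below lists every configuration implied by the description above (imputation technique × feature ranking × algorithm × number of top-ranked features); the dictionary keys and the generator name are hypothetical.

```python
from itertools import product

imputations = ["RF", "KNN", "MICE"]
feature_rankings = ["XGB", "RF", "Extra Tree"]
algorithms = ["LR", "KNN", "GNB", "SVM", "SGD", "DT", "GB", "RF", "XGB", "LightGBM"]
n_features = 19
n_folds = 10


def experiment_grid():
    """Yield one configuration per (imputation, ranking, algorithm, top-k) tuple."""
    for imp, rank, algo, k in product(imputations, feature_rankings,
                                      algorithms, range(1, n_features + 1)):
        yield {"imputation": imp, "ranking": rank, "algorithm": algo, "top_k": k}


grid = list(experiment_grid())
print(len(grid))            # configurations to evaluate
print(len(grid) * n_folds)  # individual model fits under 10-fold CV
```

Each configuration trains on the first `top_k` columns of the chosen ranking, so the sweep reveals the smallest feature set at which a given algorithm reaches its best performance.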
No significant difference was found in performance using different data imputation techniques. We had very few missing values, which could explain why alternative data imputation techniques had a minimal effect on model performance. On the other hand, the minimum number of features required to achieve the optimal model performance varied significantly between feature ranking techniques. The RF classifier model achieved the highest result with the XGB feature ranking method. This model used 11 features to reach 0.96 (±0.01) accuracy with 0.98 (±0.01) sensitivity and 0.93 (±0.02) specificity. For this model, the selected variables were hypertension, antihypertensive medicine, triglycerides, duration of IDDM, drinker, daily insulin dose, age, ACE inhibitors, BMI, HbA1c, and LDL. Figure 4 shows the ROC curve of the RF model using 1-19 features ranked by the XGB feature ranking technique on Dataset KNN. The ROC curves for the best models of the other algorithms can be found in the Supplementary Materials (Supplementary Figures S6-S14). LightGBM came in second place and was close to RF in terms of accuracy, with 0.95 (±0.06). In addition to these two models, KNN, SVM, DT, GB, and XGB models obtained greater than 90% accuracy. Despite having a lower sensitivity than several algorithms, SVM had the best specificity. The performance of the best models for each of the 10 ML algorithms employing three feature ranking approaches is shown in Table 3. The overall best model is shaded. Details of these models can be found in the Supplementary Materials (Supplementary Tables S1-S10). In Table 3, we only considered the KNN data imputation technique. The outcomes of the other two data imputation approaches were nearly identical.
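The metrics reported above all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the counts in the example are hypothetical and were chosen only so that sensitivity and specificity match the values reported for the RF model on a balanced test set.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall: CKD cases correctly flagged
    specificity = tn / (tn + fp)            # non-CKD cases correctly cleared
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts: 100 CKD and 100 non-CKD test samples
metrics = binary_metrics(tp=98, fp=7, tn=93, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With balanced classes, accuracy is the average of sensitivity and specificity, which is consistent with an accuracy of about 0.96 for a sensitivity of 0.98 and a specificity of 0.93.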

Discussion
At present, one of the fastest-growing diseases is diabetes mellitus (DM), with approximately 463 million people suffering worldwide [1]. DM patients have a higher risk of developing serious health problems that can affect the heart, eyes, kidneys, nerves, and teeth. According to the International Diabetes Federation (IDF), the leading cause of kidney failure in developed countries is diabetes [3]. Furthermore, IDF estimates that around 10% of these DM patients suffer from type 1 diabetes mellitus (T1DM). Although T1DM can affect people at any age, it generally develops among young adults. As a result, they are exposed to diabetes-related risk factors for a more extended period. Chronic kidney disease (CKD) is one of the most significant complications of T1DM, and about half of the patients with T1DM have a lifetime risk of developing CKD [4].
The most important accessible indicator of overall kidney function is the glomerular filtration rate (GFR). It represents the amount of fluid filtered through the kidney per unit of time [11]. The estimated glomerular filtration rate (eGFR) can be used to define normal renal function, and this definition is age-dependent. The eGFR value decreases with age, but it needs to be greater than 90 mL/min/1.73 m² to be considered normal renal function, whereas people with eGFR less than 60 mL/min/1.73 m² for more than 3 months are considered CKD patients [12,15].
Moreover, CKD is hard to detect, as it can be asymptomatic in many cases. People with CKD for a long period may not exhibit any symptoms, and, because of this asymptomatic nature, its recognition is often delayed [13]. Furthermore, there is a tendency to bypass the traditional urine protein (urinary albumin)-based CKD diagnostic approach. According to Medicare (a national health insurance program in the United States), barely half of individuals with diabetes get a urine albumin test [5]. Early detection of CKD can help reduce the risk of end-stage kidney disease (ESKD) through intensive management. As T1DM patients have a high risk of developing CKD, a prediction model that can predict CKD from patients' routine checkup data would greatly help them.
Machine learning (ML) approaches are now being explored in various medical systems. Due to the recent boost in related technology, applying ML techniques has become easier. Health professionals are increasingly enthusiastic about using the flexibility and self-learning capacity of ML as an aiding system for reliable performance. Intelligent systems based on ML algorithms have been intensively investigated for various biomedical systems, focusing on disease detection and risk reduction [16-18,49,50]. Like other severe diseases, CKD has attracted researchers' interest in creating ML-based diagnosis systems [3,13,18,20-22]. However, their application in developing prediction models for CKD in type 1 diabetes mellitus patients is rare.
In the literature, several ML-based kidney disease classifier models have been reported; however, most of them did not focus on diabetes mellitus patients. For example, Segal et al. [13] used the extreme gradient boosting algorithm to build a prediction model to identify end-stage renal disease (ESRD) progression for patients who already have CKD. In addition, Xiao et al. [20] targeted predicting 24 h urinary protein outcomes to detect CKD by applying different ML models. They included logistic regression, Elastic Net, lasso regression, ridge regression, support vector machine, random forest, XGBoost, neural network, and k-nearest neighbor and obtained a highest AUC of 0.873. Some studies only considered type 2 diabetes mellitus (T2DM) patients to develop kidney disease prediction models. Low et al. [23] applied multivariable logistic regression to design a CKD progression prediction model in patients with T2DM, where both sensitivity and specificity were below 80%. In another study, Dunkler et al. [25] designed a multinomial logistic model to predict CKD risk in individuals with type 2 diabetes.
Although type 1 diabetes and type 2 diabetes share some common characteristics, they are different [26]. In particular, T1DM occurs at a younger age and persists for a longer period than T2DM. As a result, T1DM patients have a greater risk of kidney diseases (including CKD) than T2DM patients. Thus, a CKD prediction model solely concentrating on T1DM patients would be more appropriate. Unfortunately, we found only two studies focused on developing kidney disease prediction models for T1DM patients.
Vistisen et al. [8] developed and evaluated an ESKD prediction model in T1DM patients using Poisson regression analysis and achieved C-statistic values between 0.88 and 0.96. Their study used albuminuria, smoking status, physical activity, alcohol intake, antihypertensive treatment, lipid-lowering treatment, RAS-blocker treatment, eGFR, and previous cardiovascular disease as variables and achieved a C-statistic of 0.888 (95% CI 0.849-0.927) in the derivation cohort. In another study conducted by Colombo et al. [6], ridge regression was implemented to build a model to predict renal diseases in T1DM patients. That study included serum creatinine, urinary albumin/creatinine ratio (ACR), age, sex, diabetes duration, follow-up time, HbA1c, and prior cardiovascular disease information to predict final eGFR with an r² of 0.745 (p < 10⁻¹⁶). Both studies used albuminuria as a parameter, which designates increased excretion of one kind of urine protein (urinary albumin). Albuminuria is generally used as a marker of kidney damage, but its measurement is lengthy. The National Kidney Foundation, USA, recommends using 24 h urine to measure albuminuria [29], which is not convenient for many patients, and barely half of diabetes patients in the USA get this urine protein-based test [5]. Asymptomatic CKD patients in particular may omit this test from their routine checkup due to this inconvenience. Moreover, traditional machine learning algorithms were not considered in these two studies.
This study applied and evaluated 10 traditional machine learning algorithms to build prediction models to quickly predict CKD in T1DM patients from easily available routine follow-up data. We used 16 years of data of 1375 type 1 diabetes mellitus patients from the clinical trials of the Epidemiology of Diabetes Interventions and Complications (EDIC) [30,31]. Our study included age, sex, BMI, smoking and drinking habit, hypertension, hypercholesterolemia, duration of insulin-dependent diabetes mellitus (IDDM), use of ACE inhibitors and antihypertensive medicine, daily insulin dose, glycated hemoglobin (HbA1c) levels, total cholesterol, triglycerides, high-density lipoproteins (HDL), low-density lipoproteins (LDL), systolic blood pressure (SBP), diastolic blood pressure (DBP), and mean blood pressure. These parameter values are easily available through routine checkups of a T1DM patient and have been considered in other clinical models for predicting renal function decline in diabetes patients [13,19,20,38-40]. We used the KDIGO 2012 [12] definition of CKD; an eGFR value of less than 60 mL/min/1.73 m² for more than three months was considered as CKD. We did not include serum creatinine, albumin excretion rate (AER), and current GFR to avoid overfitting, as serum creatinine is the most important parameter to calculate eGFR and AER is itself a CKD identifier. In addition, 24 h urine analysis is necessary to measure AER.
Our data had missing values and class imbalance. Three AI-based data imputation techniques, random forest (RF), k-nearest neighbors (KNN), and multiple imputation by chained equations (MICE), were used to fill missing values [41][42][43]. In addition, we used SMOTETomek, a combination of oversampling and undersampling techniques, to address class imbalance [51]. We used extreme gradient boosting (XGB), random forest (RF), and extremely randomized trees classifier (Extra Tree) for feature ranking and to select the 10 most significant features [41,46,47]. Thus, we had nine distinct combinations of different data imputation approaches and feature ranking models. We used 10 machine learning algorithms, logistic regression (LR), k-nearest neighbor (KNN), Gaussian naive Bayes (GNB), support vector machine (SVM), stochastic gradient descent (SGD), decision tree (DT), gradient boosting (GB), random forest (RF), extreme gradient boosting (XGB), and light gradient-boosted machine (LightGBM), to develop different prediction models for classifying CKD utilizing all nine combinations. We trained and assessed each ML model using the top feature, then the top two features, the top three features, and so on, until we found the optimum combination of feature ranking model, data imputation technique, and number of features to achieve the best performance. We employed 10-fold stratified cross-validation to evaluate different models, with a 9:1 training/test data ratio.
Hypertension, antihypertensive medicine, and duration of IDDM were the top three features in most feature ranking techniques. Triglycerides, ACE inhibitors, age, SBP, HDL, LDL, total cholesterol, drinking, mean BP, BMI, drinker, daily insulin dose, and HbA1c were other top features, but their positions and relative importance values were different for different models. We had few missing values, and different data imputation techniques showed no significant difference in performance.
With the XGB feature ranking technique and top 11 features, the RF classifier algorithm produced the best CKD prediction model with 0.96 (±0.01) accuracy, 0.98 (±0.01) sensitivity, and 0.93 (±0.02) specificity. LightGBM came in second with 0.95 (±0.06) accuracy. In addition to these two models, the accuracy of KNN, SVM, DT, GB, and XGB models was more than 90%. SVM had the greatest specificity while having a lower sensitivity than several algorithms.
In this study, conventional machine learning algorithms were used to develop a CKD prediction model in T1DM patients for the first time. Here, the suggested model showed reliable performance with more than 95% accuracy. Moreover, to operate this model, we do not need to collect 24 h urine protein or other critical values. Only general data from routine follow-up of a T1DM patient is enough to produce an accurate result without any delay. Consequently, this model can be used to predict CKD when critical laboratory tests are not possible. In addition, all T1DM patients may use this model to make an educated prediction of their CKD status during a regular checkup, and this can improve the chances of discovering asymptomatic CKD patients at an earlier stage.

Conclusions
CKD is one of the most common diabetes-related complications, and the lifetime risk of CKD in T1DM patients is almost 50%. Diagnosis of CKD is complicated because it can be asymptomatic even in the late stages. Although there are some prediction models to detect CKD in T2DM patients, this is a rare approach in T1DM patients, and none of the existing T1DM models use traditional ML algorithms. Meanwhile, the application of ML in several biomedical fields has shown a positive influence on enhancing performance over conventional methods. This study investigated the performance of various common ML approaches (LR, KNN, GNB, SVM, SGD, DT, GB, RF, XGB, and LightGBM) in the diagnosis and stratification of CKD in T1DM patients. We used general features available from a routine checkup. This analysis found that the model developed with the random forest (RF) algorithm performed best in CKD classification, reaching its best result with the top 11 features ranked by XGB. Therefore, a random forest or LightGBM-based CKD prediction technique can help healthcare professionals to identify potential CKD patients among T1DM patients and refer them for further investigation.
Supplementary Materials: Table S1. Performance analysis of LR algorithm using KNN data imputation and Extra Tree feature ranking; Table S2. Performance analysis of KNN algorithm using KNN data imputation and Extra Tree feature ranking; Table S3. Performance analysis of GNB algorithm using KNN data imputation and XGB feature ranking; Table S4. Performance analysis of SVM algorithm using KNN data imputation and XGB feature ranking; Table S5. Performance analysis of SGD algorithm using KNN data imputation and Extra Tree feature ranking; Table S6. Performance analysis of DT algorithm using KNN data imputation and XGB feature ranking; Table S7. Performance analysis of GB algorithm using KNN data imputation and Extra Tree feature ranking; Table S8. Performance analysis of RF algorithm using KNN data imputation and XGB feature ranking; Table S9. Performance analysis of XGB algorithm using KNN data imputation and Extra Tree feature ranking; Table S10. Performance analysis of LightGBM algorithm using KNN data imputation and XGB feature ranking.