Article

Evaluating Feature Selection Methods for Accurate Diagnosis of Diabetic Kidney Disease

by
Valeria Maeda-Gutiérrez
1,†,
Carlos E. Galván-Tejada
1,*,†,
Jorge I. Galván-Tejada
1,
Miguel Cruz
2,
José M. Celaya-Padilla
1,
Hamurabi Gamboa-Rosales
1,
Alejandra García-Hernández
1,
Huizilopoztli Luna-García
1 and
Klinge Orlando Villalba-Condori
3
1
Unidad Académica de Ingeniería Eléctrica, Universidad Autónoma de Zacatecas, Jardín Juárez 147, Centro, Zacatecas 98000, Mexico
2
Unidad de Investigación Médica en Bioquímica, Centro Médico Nacional Siglo XXI, Hospital de Especialidades, Instituto Mexicano del Seguro Social, Av. Cuauhtémoc 330, Col. Doctores, Del. Cuauhtémoc, Ciudad de Mexico 06720, Mexico
3
Vicerrectorado de Investigación, Universidad Nacional Pedro Henríquez Ureña, Santo Domingo 1423, Dominican Republic
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Biomedicines 2024, 12(12), 2858; https://doi.org/10.3390/biomedicines12122858
Submission received: 11 November 2024 / Revised: 10 December 2024 / Accepted: 12 December 2024 / Published: 16 December 2024
(This article belongs to the Special Issue The Promise of Artificial Intelligence in Kidney Disease)

Abstract

Background/Objectives: The growing number of patients with type 2 diabetes, together with the complications caused by the disease, is an alarming concern for the health sector. One of the main complications of diabetes is nephropathy, which is also the leading cause of kidney failure. In Mexican patients, kidney function is often already highly compromised at diagnosis, which is why acting preventively is extremely important. The aim of this research is to compare distinct feature selection methodologies to identify discriminant risk factors that may be beneficial for early treatment and prevention. Methods: This study evaluated a Mexican dataset collected from 22 patients and containing 32 attributes. To reduce dimensionality and choose the most important variables, four feature selection algorithms were implemented: Univariate, Boruta, Galgo, and Elastic net. The suitable features detected by each methodology were then included in a random forest classifier, yielding four models. Results: Galgo with Random Forest achieved the best performance with only three predictors: “creatinine”, “urea”, and “lipids treatment”. The model displayed a moderate classification performance, with an area under the curve of 0.80 (±0.3535 SD), a sensitivity of 0.909, and a specificity of 0.818. Conclusions: The proposed methodology has the potential to facilitate the prompt identification of nephropathy and non-nephropathy patients and could therefore be used in the clinical area as a preliminary computer-aided diagnosis tool.

1. Introduction

Declared a global health emergency, diabetes mellitus (DM), currently referred to simply as diabetes as established by the International Diabetes Federation (IDF), affects more than 537 million people worldwide. The IDF estimates that the number of subjects with DM will rise to 643 million by 2030 and to 783 million by 2045 [1]. In this sense, the early detection of DM is extremely important; nevertheless, a large number of people, approximately 240 million worldwide, are living with undiagnosed DM.
In Mexico, the number of individuals with DM is predicted to increase from 14.1 million in 2021 to 22.3 million by 2045, of whom more than 95% have type 2 diabetes (T2D) [2,3]. T2D is an endocrine disease resulting from pancreatic β-cell dysfunction combined with insulin resistance. Moreover, poor glycemic control results in multiple complications, including the development of micro- and macrovascular diseases.
Diabetic kidney disease (DKD) is one of the most common microvascular complications of T2D [4]. Clinically, it is distinguished by persistently elevated levels of albumin in the urine, a progressive decline in the glomerular filtration rate, and hypertension, leading to end-stage renal disease (ESRD) [5]. The traditional understanding of DKD progression has been marked by an initial rise in urinary albumin excretion, followed by the development of significant albuminuria and a subsequent rapid deterioration in renal function. As a result, proteinuria has long been viewed as a crucial indicator of declining renal health. Nevertheless, this theoretical model has been challenged by evidence demonstrating that some patients with proteinuria are able to revert to normal albumin excretion levels, either spontaneously or through comprehensive risk management strategies. These findings have raised doubts about the reliability of microalbuminuria as a marker and the appropriate timing of interventions in the management of DKD [4,6].
The percentage of ESRD attributed to DM varies between 10% and 67% because its prevalence is up to 10 times higher in people with DM [7]. In addition, it is a major health burden on health systems; for the specific case of Mexico, the cost generated in 2018 by DKD was $11,763 million (MXN, ~577,296,820 USD) for ESRD resulting from T2D [8].
It is a fact that the presence of T2D together with the development of its comorbidities, the care needs, treatment options, and associated costs have a significant impact. The pathogenesis of DKD is complex and is still not fully understood, resulting in poor therapeutic outcomes [9,10]. Therefore, early prevention, detection and diagnosis of DKD is fundamental in order to prevent its progression [11]. Actually, machine learning (ML) and artificial intelligence (AI) have played an important and transformative role in the healthcare industry. ML techniques have been extensively applied in various aspects of clinical decision-making and patient care, including but not limited to diagnosis, risk assessment, prediction, and prognosis [12,13,14,15]. These advancements have significantly enhanced the efficiency and accuracy of medical processes, leading to improved patient outcomes and more personalized treatment strategies. Furthermore, ML algorithms have exhibited exceptional proficiency in analyzing the extensive datasets of intricate healthcare information, encompassing electronic health records (EHRs), medical imaging, and genomic data. This analytical capability allows for the extraction of significant information and patterns that were previously inaccessible to healthcare practitioners. The incorporation of ML into healthcare frameworks holds the promise of transforming the methodologies employed in disease diagnosis, treatment and management. Consequently, this integration has the potential to usher in a new era of more efficacious and streamlined healthcare delivery.
The primary objective of this study is to explore the performance of four popular feature selection methods, namely, the Univariate filter method, Boruta, Galgo, and Elastic net, to identify risk factors that may contribute to the diagnosis of DKD in a clinical Mexican dataset; to evaluate the subset of relevant variables Random Forest was used for classifying non-DKD and DKD patients.

Related Work

Recent research endeavors have prioritized the identification of the risk factors associated with DM, to enhance diagnostic decision-making processes. These investigations have significantly contributed to the advancement of predictive modeling and risk assessment within the realm of DM. Noteworthy studies, exemplified by the innovative feature selection and classification model for heart disease prediction [16], have yielded novel insights into the intricate interplay of factors influencing DM diagnosis. Furthermore, the study of Nagarajan et al. [17] has facilitated refined data analysis and real-time monitoring, thereby bolstering DM management initiatives. Additionally, Chang et al. [18], have elucidated population-specific patterns and trends, offering tailored strategies for DM prevention and treatment. Moreover, the detection of DR through principal component analysis multi-label feature extraction and classification [19], has underscored the imperative of early detection and intervention in mitigating DM-related complications. Collectively, these studies underscore the pivotal role of harnessing ML techniques and innovative methodologies in addressing the multifaceted challenges inherent in DM diagnosis. This comprehensive body of research serves as a foundation for understanding the complexities surrounding DM and sets the stage for exploring related conditions such as DKD.
In the case of Rodriguez-Romero et al. [20], data mining and ML techniques were applied to identify DKD biomarkers. The diabetic dataset consisted of 10,251 subjects with 18 factors. The identification of the most important features was performed using the InfoGain method. Then, six learning algorithms were tested (one rule [1R], J48 decision tree [J48], random forest [RF], simple logistic [SL], sequential minimal optimization [SMO], and naïve Bayes [NB]). Based on the experimental results, RF and SL exhibited the best performance.
Jiang et al. [21] proposed the prediction of DKD in T2D patients. A total of 302 Chinese subjects, and 19 clinical features were included. Focused on identifying a set of the most important features, a Least Absolute Shrinkage and Selection Operator (LASSO) analysis was conducted, where nine potential attributes were found. Then, a multivariable logistic regression model was performed with the candidate predictors. Finally, the concordance index (C-index) was calculated, obtaining 0.934.
Furthermore, Shi et al. [22] developed a DKD or diabetic retinopathy (DR) incidence risk nomogram. It is important to mention, that 4219 patients were divided into groups: T2D with DKD or DR, with both, and without any complication. LASSO regression was selected as a feature selection method, thus 7 of 23 attributes were potential predictors of DKD, and four in DR. To tackle this, a logistic regression analysis was conducted to optimize the feature selection; the risk factors with a p-value less than 0.05 were selected and combined to generate a prediction model. The performance of the multivariable logistic regression model showed an AUC of 0.807.
Likewise, Xi et al. [23], constructed a DKD risk nomogram; the study enrolled 1095 subjects divided into DKD (203) and non-diabetic kidney disease (NDKD, 892). With regard to reducing the predictors of DKD, a total of 23 clinical features were submitted into a LASSO regression model, obtaining 18 possible features. Then, from the remaining predictors, a logistic regression analysis was applied, resulting in 10 statistically significant predictors that were used to build the model. Their proposed nomogram prediction model achieved an AUC of 0.813, demonstrating good discrimination.
The investigation conducted by Maniruzzaman et al. [24] presented the comparison of machine learning (ML) methods for the prediction of DKD patients. The dataset consisted of 133 respondents having 73 cases, and 60 controls. The combination of principal component analysis (PCA) for feature selection, and six ML algorithms namely linear discriminant analysis (LDA), support vector machine (SVM), logistic regression (LR), k-nearest neighborhood (K-NN), naïve Bayes (NB) and artificial neural network (ANN) was implemented to select the best features at distinct PCA cutoff values. One of their experiments was to optimize the kernel of SVM demonstrating that SVM-radial basis function (RBF) provided better performance. Afterward, the best features fed into the six ML techniques, and it was noted that the SVM-RBF kernel obtained a maximum accuracy (88.7%), 0.91 AUC at a PCA cutoff of 0.96.
On the other hand, Yang and Jiang [25] developed a nomogram to assess the risk of DKD in T2D patients; 706 subjects and 23 features were included in the statistical analysis. A univariate logistic regression test was used to evaluate the significance of each variable; attributes with a p-value less than 0.05 were considered significant. Subsequently, 13 main features were subjected to a stepwise regression according to the principle of akaike information criterion (AIC), giving the best model with the smallest AIC value. A binary logistic regression model was constructed with the seven independent risk predictors; the prediction performance was measured using the C-index and AUC values. The results showed a 0.773 and 0.758 in the training and validation sets, respectively.
Table 1 summarizes the general approaches of the selected studies, including the feature selection algorithms, ML techniques, and the predictors of DKD. According to these relevant studies, we resolved to undertake a comprehensive comparison of different feature selection methods. These encompass a range of techniques, including filter-based (Univariate), penalty-based (Elastic net), tree-based (Boruta) and heuristic (GALGO) approaches.
Among these methods, Univariate evaluation ranks features based on a performance measure and then selects the top variables with the highest scores [26]. The Elastic net is designed to strike a balance between accuracy and the magnitude of weight values within a linear model, which it accomplishes through the combined use of L1 and L2 regularization; it tends to produce sparse models with few non-zero weight values [27]. Boruta, in comparison with other feature selection algorithms, follows an all-relevant variable selection approach, encompassing all features associated with the target output variable, whereas other techniques follow a minimal-optimal approach, relying on a limited subset of features to minimize the error of a chosen classifier. Furthermore, Boruta can consider complex multi-variable relationships and proficiently explore interactions between variables [28]. Lastly, GALGO, a genetic-algorithm-based feature selection method, starts from a randomly generated population of models and refines them by emulating natural selection mechanisms. This evolutionary process involves an increased replication rate for the more efficient variable subsets, mutation for generating variations, and crossover for enhancing variable combinations. Significantly, the concurrent examination of variable sets organized within chromosomes during the selection process underscores the genuinely multivariate nature of the variable selection [29]. The inclusion of these methods provides a holistic analysis of their performance in the context of DKD diagnosis in Mexican patients.

2. Materials and Methods

The architecture of the proposed ML classification system is shown in Figure 1 and is discussed in the following subsections.

2.1. Dataset Description

In this study, the dataset was obtained from the “Unidad de Investigación Médica en Bioquímica, Centro Médico Nacional Siglo XXI, Instituto Mexicano del Seguro Social (IMSS)”. The Mexican patients signed an informed consent letter, and the protocol, which conforms to the ethical principles of the Declaration of Helsinki, was approved by the Ethics Committee of IMSS under number R-2011-785-018.
  • Inclusion criteria for cases:
    -
    Patients diagnosed with T2D according to the ADA criteria: fasting glucose equal to or greater than 126 mg/dL;
    -
    Any gender;
    -
    Beneficiaries affiliated with the IMSS, with active and up-to-date benefits at the time of enrollment;
    -
    Not having any family members participating in the study;
    -
    Patients who agreed to participate in the study and signed the informed consent form;
    -
    Adults diagnosed before the age of 55, with the current age for inclusion being between 35 and 85 years.
  • Inclusion criteria for controls:
    -
    Individuals aged between 35 and 85 years, without meeting the criteria for metabolic syndrome (according to the International Adult Treatment Panel (ATP) III criteria);
    -
    Any gender;
    -
    Not having any family members participating in the study;
    -
    Individuals who agreed to participate in the study and signed the informed consent form;
    -
    Fasting glucose (12 h) below 100 mg/dL, and post-load glucose levels below 140 mg/dL two hours after ingesting 75 g of glucose.
  • Exclusion criteria:
    -
    Beneficiaries with temporary or seasonal insurance coverage;
    -
    Individuals without a permanent residence or those who cannot be reached by phone, either at their home or through a family member;
    -
    Women in the climacteric stage.

2.1.1. Clinical Variables

Thirty-two baseline clinical variables were used as predictors to train and test the models: education (EDU), salary (SAL), sex, age, age at diabetes diagnosis (Age DX), waist-hip ratio (WHR), body mass index (BMI), glucose (GLU), urea, creatinine (CRE), cholesterol (CHOL), high-density lipoprotein (HDL), low-density lipoprotein (LDL), triglycerides (TG), HDL uncorrected by treatment (UHDL), LDL uncorrected by treatment (ULDL), total cholesterol uncorrected by treatment (TCHOLU), TG uncorrected (UTG), systolic blood pressure (SBP), diastolic blood pressure (DBP), SBP uncorrected by treatment (USBP), DBP uncorrected by treatment (UDBP), hypertension treatment (HA-TX, refers to medical interventions designed to manage and control high blood pressure), lipids treatment (LIPIDS-TX, refers to medical interventions aimed at managing and correcting abnormal levels of lipids in the blood), glycated hemoglobin (HbA1c), glomerular filtration rate (GFR), glibenclamide (GB), metformin (MF), pioglitazone (PG), rosiglitazone (RG), acarbose (AB), and insulin (INS).
A total of 22 subjects were included, comprising patients diagnosed with T2D and T2D patients with DKD, with a mean age of 57.9 ± 12.6 years (range 36–84 years). Patient-specific variables included demographics, clinical observations, and medical treatment. Furthermore, participant details, including continuous variables presented as mean ± SD, categorical variables expressed as n (%), and the associated p-values, are summarized in Table 2. The binary outcome for the classification model was defined as the presence or absence of DKD.

2.1.2. Data Pre-Processing

Initially, the dataset contained a total of 35 attributes; however, in the pre-processing step, the patient ID and the neuropathy and retinopathy cases were removed because these input variables are not relevant to this study. Furthermore, the variables DBP and UDBP each have two missing observations (9.0909%), while GFR has one missing observation (4.5454%). The mean of the non-missing observations was used to fill the missing values. The decision to use mean imputation is consistent with the findings of Liu and Gopalakrishnan [30], which reinforce the notion that imputation methods do not have a substantial impact on the results: no imputation method is consistently superior, and in several cases mean imputation has yielded results comparable to more complex algorithms.
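The mean-imputation step described above can be sketched as follows (an illustrative Python snippet, not the authors' R pipeline; the DBP values are hypothetical):

```python
# Illustrative sketch: fill missing values with the column mean,
# as done for the DBP, UDBP, and GFR variables in this study.
def mean_impute(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# Hypothetical DBP readings with one missing observation
dbp = [80.0, None, 78.0, 82.0]
print(mean_impute(dbp))  # the gap is filled with (80 + 78 + 82) / 3 = 80.0
```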

2.2. Feature Selection Methodologies

2.2.1. Boruta

The Boruta algorithm [31,32] is an all-relevant feature selection wrapper method built around the RF classifier that separates important from unimportant attributes in an information system. In the process, Boruta extends the input dataset (A1, A2, A3, …, An) by adding shuffled copies of all the variables, named shadow attributes (S1, S2, S3, …, Sn); the extended dataset is then used to build an RF classifier in order to find the maximum Z-score among shadow attributes (MZSA). All the variables that score lower than the MZSA are rejected, those scoring higher are labeled as confirmed, and the remaining variables are marked as tentative.
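The shadow-attribute idea can be illustrated with a minimal sketch (illustrative Python rather than the Boruta R package; a simple absolute-correlation score stands in for the random-forest Z-score, the "tentative" category is omitted for brevity, and the data are made up):

```python
import random

# Minimal sketch of Boruta's shadow-attribute comparison: each real feature
# must beat the best score achieved by any permuted (signal-free) copy.
def correlation(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

def boruta_like(features, target, seed=0):
    rng = random.Random(seed)
    shadow_scores = []
    for name, col in features.items():
        shuffled = col[:]
        rng.shuffle(shuffled)  # shadow attribute: same values, no signal
        shadow_scores.append(abs(correlation(shuffled, target)))
    threshold = max(shadow_scores)  # analogue of the maximum shadow Z-score
    return {name: ("confirmed" if abs(correlation(col, target)) > threshold
                   else "rejected")
            for name, col in features.items()}

# Hypothetical data: one feature tracks the outcome, one is noise
target = [0, 0, 0, 1, 1, 1]
features = {"informative": [1, 2, 1, 8, 9, 8],
            "noise": [5, 1, 4, 2, 5, 1]}
print(boruta_like(features, target))
```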

2.2.2. GALGO

GALGO employs genetic algorithms (GAs) to select models with high fitness. The procedure starts from a random population of feature subsets known as chromosomes. Each chromosome is evaluated for its ability to classify the desired outcome, obtaining a certain level of accuracy. The main idea is to replace the initial population with a new one that includes the variants of chromosomes with higher classification accuracy; this progressive improvement of the chromosome population is inspired by the process of natural selection (selection, mutation, and crossover). The explored proportion of the solution space increases by evolving chromosome populations in partially isolated environments (niches); chromosomes can migrate from one niche to another, ensuring the recombination of good solutions [29].
This process is carried out in four main steps:
  • Setting up the analysis. The input and the outcome features are specified, as well as the parameters that define the GA environment (gene expression), the classification method, and the error estimation.
  • Searching for relevant multivariate models. Starting from a random population of chromosomes, the GA method can find a diverse collection of good local solutions.
  • Refinement and analysis of the population of selected chromosomes. The chromosomes selected from the GA are subjected to a backward selection strategy, aiming to obtain a model with a chromosome population that significantly contributes to the classification accuracy.
  • Development of a representative statistical model. A single representative model is obtained; for this stage, a forward selection strategy was implemented based on the step-wise inclusion of the most prevalent genes in the chromosome population.
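The search step above can be sketched as a toy genetic algorithm (illustrative Python; a made-up fitness function stands in for random-forest classification accuracy, and all parameter values are hypothetical):

```python
import random

# Toy GALGO-style search: chromosomes are small feature subsets, refined by
# selection, crossover, and mutation. The fitness function here is made up.
def evolve(n_features, chrom_size, fitness, generations=50, pop_size=20, seed=1):
    rng = random.Random(seed)
    pop = [rng.sample(range(n_features), chrom_size) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # selection: keep the fittest half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            # crossover: combine genes from two parents, dropping duplicates
            child = list(dict.fromkeys(a[: chrom_size // 2] + b))[:chrom_size]
            while len(child) < chrom_size:        # top up if deduplication shrank it
                child.append(rng.randrange(n_features))
            if rng.random() < 0.3:                # mutation: swap in a random gene
                child[rng.randrange(chrom_size)] = rng.randrange(n_features)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

# Hypothetical fitness: chromosomes containing features 0 and 1 classify best.
best = evolve(n_features=32, chrom_size=5,
              fitness=lambda c: (0 in c) + (1 in c))
print(sorted(best))
```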

2.2.3. Elastic Net

Zou and Hastie [33] proposed the elastic net, a shrinkage and selection method.
It is a convex combination of two types of regularization, Ridge and LASSO regression (the L2 and L1 penalties):
  • L1 (LASSO): a penalized least squares method imposing an L1 penalty on the regression coefficients; it performs continuous shrinkage and automatic variable selection.
  • L2 (Ridge): minimizes the residual sum of squares subject to a bound on the L2-norm of the coefficients. Ridge regression achieves its prediction performance through a bias-variance trade-off; however, it cannot produce a parsimonious model because it always keeps all the predictors.
The penalization method of the elastic net is defined by Equation (1):
β̂(elastic net) = argmin_β ‖y − Xβ‖₂² + λ₂‖β‖₂² + λ₁‖β‖₁
where X is the measurement matrix, and λ₂ and λ₁ are the shrinkage parameters.
The elastic net formula is shown in the Equation (2):
P_α = Σ_{j=1}^{p} [(1 − α) β_j² + α |β_j|]
where α determines the specific type of regression: Ridge (α = 0) or LASSO (α = 1).
This union allows for learning a sparse model where few of the entries are non-zero, while still maintaining the regularization properties [34].
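The penalty in Equation (2) can be evaluated directly; a short illustrative Python sketch with hypothetical coefficients:

```python
# Elastic-net penalty from Equation (2): a convex mix of the ridge (L2)
# and LASSO (L1) terms, controlled by alpha.
def elastic_net_penalty(betas, alpha):
    return sum((1 - alpha) * b ** 2 + alpha * abs(b) for b in betas)

betas = [0.5, -2.0, 0.0]  # hypothetical regression coefficients
print(elastic_net_penalty(betas, alpha=0.0))  # pure ridge: 0.25 + 4.0 = 4.25
print(elastic_net_penalty(betas, alpha=1.0))  # pure LASSO: 0.5 + 2.0 = 2.5
print(elastic_net_penalty(betas, alpha=0.5))  # an equal blend of the two
```

Note how the zero coefficient contributes nothing under either penalty, which is why the elastic net favors sparse models.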

2.3. Classification Method

Random Forest

Random Forest (RF) is an ensemble method developed by Breiman [35] that can perform classification and prediction tasks. The classifier builds a collection of decision trees (DTs), each of which casts a unit vote for the most popular class; the final RF decision is obtained by majority rule, aggregating the outputs of all the DTs into a single result [36].
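The majority-vote aggregation can be sketched in a few lines (illustrative Python; the tree outputs are hypothetical):

```python
from collections import Counter

# Random forest aggregation: each decision tree casts one vote and the
# most popular class wins. The individual tree predictions are made up.
def majority_vote(tree_predictions):
    return Counter(tree_predictions).most_common(1)[0][0]

votes = ["DKD", "non-DKD", "DKD", "DKD", "non-DKD"]
print(majority_vote(votes))  # "DKD" wins, 3 votes to 2
```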

2.4. Evaluation Metrics

In this study, area under the curve (AUC), sensitivity (SN) and specificity (SP) were used to measure the effectiveness of the RF model. The formulas are as follows:
SN = TP / (TP + FN)
SP = TN / (TN + FP)
where true positive (TP) represents the number of positive instances that are correctly classified; true negative (TN) is the number of negative cases that are classified as negative; false positive (FP) is the number of negative instances incorrectly classified as positive; and false negative (FN) is the number of positive cases misclassified as negative. SN and SP analysis is commonly used for the evaluation of machine learning algorithms. SN is the ability of a test to correctly identify an individual as disease-positive (i.e., diabetic patients with DKD). SP is the converse of SN: the ability of a test to correctly classify an individual as disease-free (i.e., diabetic patients without DKD) [37].
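Equations (3) and (4) translate directly into code (illustrative Python; the confusion counts are hypothetical, chosen only to mirror the SN and SP values reported later):

```python
# Sensitivity and specificity from confusion-matrix counts,
# matching Equations (3) and (4).
def sensitivity(tp, fn):
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

# Hypothetical counts: 11 DKD patients (10 detected), 11 non-DKD (9 cleared)
print(round(sensitivity(tp=10, fn=1), 3))  # 0.909
print(round(specificity(tn=9, fp=2), 3))   # 0.818
```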
Also, the receiver operating characteristic (ROC) curve and AUC are used to evaluate a classifier’s performance. The ROC curve, a two-dimensional measure, is a graphical technique for visualizing model efficiency, plotting the true positive rate (SN) against the false positive rate (1 − SP). The AUC gives the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance; an AUC of 1.0 indicates that the model can perfectly distinguish between high- and low-risk patients, while an AUC of 0.5 is equivalent to random chance [38,39].

2.5. Cross Validation

Cross-validation (CV) is a statistical method for evaluating and comparing machine learning algorithms. One popular form is k-fold CV, in which the dataset D is split randomly into k folds (subsets D_1, D_2, …, D_k); the model is then trained and tested k times, each time holding out one fold for testing while the remaining k − 1 folds are used for learning [40].
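A minimal sketch of the k-fold split (illustrative Python, with a small hypothetical dataset of six samples):

```python
# k-fold cross-validation splitting: each fold serves once as the test set
# while the remaining k-1 folds are used for training.
def k_fold_indices(n_samples, k):
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

for train, test in k_fold_indices(n_samples=6, k=3):
    print("train:", train, "test:", test)
```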
Figure 2 shows a representation of 10-fold CV.

2.6. Interpretability

SHapley Additive exPlanations (SHAP) [41] compute Shapley values, which quantify the contribution of each feature to the model’s predictions. The methodology involves evaluating the model’s output across various combinations of features, measuring the average change in prediction when a specific feature is included versus excluded. This difference, referred to as the Shapley value, represents the marginal contribution of a feature to the overall prediction. By providing a numerical assessment of the influence of each feature, SHAP enables a detailed interpretability of the model’s decision-making process [42,43].
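The marginal-contribution idea can be made concrete with an exact Shapley computation on a toy two-feature model (illustrative Python, not the SHAP library; the coalition values are hypothetical, and exact enumeration over all orderings is only feasible for a handful of features):

```python
from itertools import permutations

# Exact Shapley values: average the marginal contribution of each feature
# over every order in which it can join the coalition.
def shapley_values(features, value):
    contrib = {f: 0.0 for f in features}
    orders = list(permutations(features))
    for order in orders:
        coalition = []
        for f in order:
            before = value(frozenset(coalition))
            coalition.append(f)
            contrib[f] += value(frozenset(coalition)) - before
    return {f: c / len(orders) for f, c in contrib.items()}

# Hypothetical prediction lift of each feature subset
v = {frozenset(): 0.0, frozenset({"CRE"}): 0.6,
     frozenset({"UREA"}): 0.2, frozenset({"CRE", "UREA"}): 0.8}
print(shapley_values(["CRE", "UREA"], v.__getitem__))
```

Because the toy value function is additive, each feature's Shapley value equals its individual lift, and the values sum to the full-coalition prediction.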
The data were analyzed using R (version 4.0.3), a free statistical software environment. The following R packages were used for modeling: Boruta [31], MLeval [44], glmnet [45], GALGO [29], caret [46] and iml [47].

3. Results

3.1. Comparison to Established Feature Selection Methods

3.1.1. Univariate and Multivariate Feature Selection

Univariate feature selection (UFS) examines the strength of the relationship between each variable and the response. Several statistical approaches can be applied to assess this relationship; in this case, the AUC value was calculated. In the context of FS, the AUC is often used to evaluate the relevance of a feature by interpreting its values as the output of a classifier. Features with AUC values closer to 1 indicate high discriminative power, while those near 0.5 suggest no discriminatory ability [48]. This characteristic makes the AUC particularly useful for identifying features that contribute meaningfully to the prediction task.
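The AUC of a single feature can be computed with the rank-based (Mann-Whitney) formulation (illustrative Python; the creatinine values are hypothetical):

```python
# Univariate AUC score for one feature: the probability that a randomly
# chosen positive case has a higher feature value than a randomly chosen
# negative case, with ties counted as half.
def univariate_auc(values, labels):
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical creatinine values: higher in DKD (label 1) than non-DKD (label 0)
cre = [1.9, 2.3, 0.8, 1.0, 2.1, 0.9]
dkd = [1,   1,   0,   0,   1,   0]
print(univariate_auc(cre, dkd))  # 1.0: every DKD value exceeds every non-DKD value
```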
Table 3 presents the 32 univariate models based on an RF approach; a multivariate model was constructed from the features whose models obtained an AUC value ≥ 0.60.
For comparison, the first approach was to include all 32 features and evaluate the resulting model. Then, to develop a multivariate model based on the most significant features, the seven selected variables were used as inputs for the RF algorithm.
Equation (5) represents the confirmed formula for the proposed multivariate model.
OUTPUT ~ GLU + UREA + CRE + TCHOLU + UTG + GFR + GB

3.1.2. Boruta

Figure 3 shows the Boruta result plot. Blue boxplots correspond to minimal, average, and maximum Z-scores for each shadow attribute. Red, yellow, and green boxplots represent Z-scores of, respectively, the rejected, tentative and confirmed attributes.
Table 4 shows a summary of each attribute: the mean, median, minimal, and maximal importance, the number of hits normalized to the number of importance-source runs, and the decision.
An important aspect to consider is that tentative attributes may be left without a decision; to fill the missing decisions, a simple comparison is performed using the median Z-score of the attributes with the median Z-score of the shadow attributes. The confirmed formula that will define a model based only on the confirmed attributes is presented in Equation (6):
OUTPUT ~ CRE

3.1.3. GALGO

Table 5 presents the general parameters used in GALGO.
To obtain a statistically significant model, the settings of this methodology were selected according to the literature [49,50]. Fifty-five feature RF models evolved throughout 300 generations. In each generation the chromosomes are selected, reproduced, crossed, and mutated, achieving an accurate and robust model. The fitness was defined as the accuracy of the model to classify the desirable outcome.
Figure 4 shows the top surviving features ordered by rank. In the plot, CRE and UREA show stable ranks, while the remaining features change rank colors due to their instability.
The forward selection process is shown in Figure 5. The y-axis presents the classification accuracy, and the x-axis illustrates the genes ordered by their rank. In terms of accuracy, the model labeled “1”, containing the three most frequent genes, was the best.
A robust gene backward elimination step was conducted, resulting in a representative model constructed with only three genes. The model obtained from GALGO with RF as a classifier is shown in Figure 6.

3.1.4. Elastic Net

The Elastic net model used a mixture of the Ridge (α = 0) and LASSO (α = 1) penalties. The values of the hyper-parameters α and λ were optimized by averaging five repetitions of 10-fold CV so as to minimize the deviance error; the optimal parameters of the elastic net were α = 0.555 and λ = 0.2.
In Figure 7, each curve corresponds to a variable. The x-axis shows the log-lambda value; the curves illustrate how the coefficients of the variables shrink to zero as the penalty increases.
Table 6 summarizes the Elastic net results, in which 6 of the 32 features were retained, along with the shrunken coefficients for all the retained variables.

3.2. Classification Performance

According to the Fawcett criterion [38], the interpretation of the AUC values is as follows:
  • Bad test = [0.5, 0.6)
  • Regular test = [0.6, 0.75)
  • Good test = [0.75, 0.9)
  • Very good test = [0.9, 0.97)
  • Excellent test = [0.97, 1)
These values were used to evaluate and interpret the efficiency of the models.
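The interpretation bands above can be expressed as a simple lookup (illustrative Python):

```python
# Map an AUC value to its Fawcett-style interpretation band.
def interpret_auc(auc):
    bands = [(0.97, "excellent"), (0.9, "very good"),
             (0.75, "good"), (0.6, "regular"), (0.5, "bad")]
    for lower, label in bands:
        if auc >= lower:
            return label
    return "worse than chance"

print(interpret_auc(0.80))  # the best model reported below falls in the "good" band
print(interpret_auc(0.60))
```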
First, a supervised RF classification model was constructed; this model includes all 32 features and employs a 10-fold CV strategy. Table 7 presents the results of the RF model, which obtained an AUC value of 0.60, indicating a bad performance according to the criterion above. To estimate the probability of the disease, the SN and SP were calculated.
Figure 8 shows the AUC obtained with all 32 features. The results establish that irrelevant variables degrade the performance of the model, meaning that it may not efficiently distinguish the outcome.
Subsequently, four RF classification models were constructed using a 10-fold CV strategy:
  • Model 1. “feature obtained by Boruta [CRE]” + RF
  • Model 2. “features obtained by GALGO [CRE, UREA and LIPIDS.TX]” + RF
  • Model 3. “features obtained by Elastic net [AGE, UREA, CRE, LIPIDS.TX, GFR and GB]” + RF
  • Model 4. “features obtained by UFS [GLU, UREA, CRE, TCHOLU, UTG, GFR, and GB]” + RF
The performance of each classification model is reported in Table 8. In the extreme case, Boruta, a tree-based method, selected only one attribute; however, the model showed poor performance, with an AUC value of 0.61. In contrast, GALGO, which uses a multivariate approach, obtained an AUC of 0.80 with three features, representing a good classification test; in terms of SN (0.909) and SP (0.818), the model can diagnose individuals both with and without the condition. Elastic net, a traditional method combining the Ridge and LASSO penalties, which shrink the regression coefficients toward zero, attained an AUC of 0.75 with six variables, demonstrating acceptable efficiency but regular performance in terms of SN and SP. Finally, the seven characteristics selected by UFS were evaluated in a multivariate model. The results showed regular discrimination (AUC of 0.74) but a lower ability to distinguish the non-DKD subjects.
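The AUC values in Table 8 admit a simple rank interpretation: the probability that a randomly chosen case receives a higher score than a randomly chosen control. A minimal sketch of that computation (toy labels and scores, not the study's data):

```python
def auc_from_scores(y_true, scores, positive=1):
    """AUC as the probability that a randomly chosen positive case is scored
    above a randomly chosen negative case (ties count as 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ranked correctly.
auc = auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])  # 0.75
```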
Despite their different methodologies, the FS algorithms show clear overlaps: CRE is a dominant attribute, selected by all four methods. UREA, another strong biochemical indicator, appears in three of the FS methods, while GFR, LIPIDS.TX, and GB were each selected by two.
Figure 9 shows the AUC-ROC curve for each FS methodology.

3.3. Model Interpretation

The GALGO feature selection method identified CRE, UREA, and LIPIDS.TX as the key attributes that best distinguish patients with DKD from those without. When these features were used in an RF classification model, it achieved good performance, with an AUC of 0.80. These findings demonstrate the effectiveness of GALGO in identifying clinically relevant predictors and the ability of RF to leverage these features for accurate classification. To provide greater interpretability of the model’s decision-making process, the SHAP algorithm was applied.
Figure 10 illustrates the feature importance scores computed using the SHAP methodology applied to the predictors of the Random Forest classifier. The feature importance is evaluated based on the cross-entropy loss, which reflects the influence of each variable on the predictive performance of the model. CRE has the highest importance score, and creatinine levels emerge as the most critical variable in predicting DKD; UREA demonstrates moderate importance, reinforcing its clinical utility in assessing kidney function and its association with DKD progression; and LIPIDS.TX highlights the potential influence of lipid treatment on DKD risk management and outcomes.
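SHAP values require access to the fitted model, but the underlying idea of loss-based importance, scoring a feature by how much performance degrades when its values are shuffled, can be sketched in a few lines. The rule-based predictor, accuracy metric, and data below are hypothetical stand-ins for the study's RF classifier and cross-entropy loss:

```python
import random

def permutation_importance(predict, X, y, feature_idx, metric, n_repeats=30, seed=0):
    """Average drop in metric after shuffling one feature column; larger drop = more important."""
    rng = random.Random(seed)
    base = metric(y, [predict(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        col = [row[feature_idx] for row in X]
        rng.shuffle(col)
        Xp = [row[:feature_idx] + [v] + row[feature_idx + 1:] for row, v in zip(X, col)]
        drops.append(base - metric(y, [predict(row) for row in Xp]))
    return sum(drops) / n_repeats

# Toy stand-in for the classifier: predicts DKD when "CRE" (feature 0) exceeds 1.0 mg/dL;
# "UREA" (feature 1) is ignored by this toy rule, so its importance is exactly zero.
predict = lambda row: 1 if row[0] > 1.0 else 0
accuracy = lambda y, p: sum(a == b for a, b in zip(y, p)) / len(y)
X = [[1.4, 30], [0.8, 22], [1.9, 41], [0.7, 18], [1.2, 35], [0.9, 20]]
y = [1, 0, 1, 0, 1, 0]
imp_cre = permutation_importance(predict, X, y, 0, accuracy)
imp_urea = permutation_importance(predict, X, y, 1, accuracy)  # 0.0
```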
Figure 11 demonstrates how CRE levels influence the model’s predicted probability of DKD using SHAP combined with Partial Dependence Plots (PDPs).
The graph is divided into two panels: NO (NO-DKD) and YES (YES-DKD). In the NO panel, higher CRE levels correlate with a sharp decrease in the predicted probability of DKD, suggesting that elevated CRE levels reduce the likelihood of false-positive classifications among non-DKD patients. Conversely, in the YES panel, higher CRE levels correlate with an increased predicted probability of DKD, reaching a maximum at levels exceeding 1.0 mg/dL. In addition, the plot highlights a non-linear relationship between CRE and DKD predictions, emphasizing its asymmetric impact on the NO and YES groups.
Figure 12 illustrates the SHAP-based dependence plot for UREA levels. In the NO panel, the relationship between UREA levels and the predicted probability of DKD fluctuates, with a noticeable peak at approximately 25 mg/dL, followed by a decline and stabilization around 0.4 for higher UREA levels. This indicates that while moderate UREA levels might momentarily increase the likelihood of misclassification, higher levels tend to reduce the probability of false positives among NO-DKD patients. In contrast, the YES panel shows an initial dip in predicted probability at UREA levels near 25 mg/dL, followed by a steady increase and stabilization around 0.6 for higher levels. This indicates that elevated UREA levels contribute positively to the classification of DKD, aligning with its established role as a marker of impaired renal function.
In Figure 13, the NO panel shows a high predicted probability of DKD (~0.75) when LIPIDS.TX is inactive (0), which declines sharply to ~0.4 as LIPIDS.TX becomes active (1). Conversely, the YES panel shows the predicted probability of DKD increasing from ~0.3 when LIPIDS.TX is inactive to ~0.6 when it is active. These findings suggest a positive relationship between lipid therapy and the diagnosis of DKD.
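A one-feature partial dependence curve of the kind shown in these panels is computed by forcing the feature to each grid value and averaging the model's predictions. A minimal sketch, with a hypothetical sigmoid-style probability model standing in for the RF classifier (it mimics the CRE behavior described for Figure 11, with probability rising past 1.0 mg/dL):

```python
import math

def partial_dependence(predict_proba, X, feature_idx, grid):
    """For each grid value, force the feature to that value in every row
    and average the predicted probabilities (a one-feature PDP)."""
    curve = []
    for v in grid:
        probs = [predict_proba(row[:feature_idx] + [v] + row[feature_idx + 1:]) for row in X]
        curve.append(sum(probs) / len(probs))
    return curve

# Hypothetical probability model: DKD probability rises smoothly as CRE
# (feature 0) passes 1.0 mg/dL; the second column is an ignored covariate.
predict_proba = lambda row: 1 / (1 + math.exp(-4 * (row[0] - 1.0)))
X = [[1.4, 30], [0.8, 22], [1.9, 41], [0.7, 18]]
grid = [0.6, 0.8, 1.0, 1.2, 1.4]
curve = partial_dependence(predict_proba, X, 0, grid)  # monotonically increasing
```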

4. Discussion and Conclusions

This work focused on comparing different feature selection (FS) methodologies to identify predictive risk factors of DKD using a dataset from Mexican T2D patients, both with and without DKD. Initially, the dataset comprised 47 features, but after a pre-processing stage to remove irrelevant features, it was narrowed down to 32 variables, as detailed in Table 2. The importance of identifying crucial features for disease diagnosis led to the adoption of feature dimension reduction methods such as UFS, Boruta, GALGO, and Elastic net, aimed at retaining significant information.
For the prediction of DKD or non-DKD status, a binary classifier was developed utilizing ML techniques. Among these, RF, an ensemble learning algorithm composed of multiple DTs, was selected for its extensive use in healthcare, particularly for developing prediction models. RF’s success is attributed to its ability to handle highly non-linear data, robustness to noise, and a simpler tuning process compared to other ensemble algorithms [51,52,53].
The attributes selected through FS methodologies were input into the RF classification algorithm, reducing the initial 32 candidate features to smaller subsets: 1 (Boruta), 3 (GALGO), 6 (Elastic net), and 7 (UFS). These subsets were then modeled using a 10-fold CV, demonstrating the effectiveness of this approach in achieving stable evaluations with a smaller dataset [54]. Table 8 shows that Elastic net and the UFS multivariate model behave quite similarly in terms of the variables selected, whereas Boruta and GALGO select fewer predictors.
The performance evaluation was carried out with quantitative metrics: AUC, SN, and SP. An ideal test has an AUC value of 1.0, while an AUC above 0.5 indicates some discriminating ability, allowing a description of the problem [55]. The models were then evaluated according to the Fawcett criterion [38]. First, all features were analyzed with the RF approach; the general model attained an AUC of 0.60, but its performance in identifying DKD (SN) and non-DKD individuals (SP) was poor. Boruta + RF reached a similar performance (AUC of 0.61) with only one risk factor, although, compared with the general model, the proportion of correctly identified negative instances was higher. GALGO + RF was the most acceptable approach, selecting three variables and recording an AUC of 0.80 while effectively identifying both cases and controls of the disease. Elastic net + RF can be an effective method, but it chooses a more exhaustive set of attributes. Notably, it retained the same three characteristics selected by GALGO; nevertheless, the larger number of variables lowered the precision of the model, which obtained an AUC of 0.75, a regular test, with equal SN and SP values (0.720). Finally, UFS selected seven risk factors, which formed a multivariate model evaluated in the same way as the other methodologies; it achieved an AUC of 0.74, but its SN and SP were deficient.
To compare the outcomes of the RF algorithm against alternative machine learning approaches, we used the three features selected by GALGO. Table 9 shows the sensitivity, specificity, and AUC values for both RF and Support Vector Machine (SVM) across an array of cross-validation strategies.
The results showed that RF achieved superior performance, particularly with k = 10 and Leave-One-Out Cross-Validation (LOOCV), a validation method in which each sample is used once as the validation set while the remaining samples form the training set. The elevated AUC values confirm RF’s ability to discriminate between the classes, increasing its utility for the precise classification of DKD and non-DKD patients. These findings reinforce RF as a valuable tool for classifying DKD subjects, and its robustness across cross-validation strategies demonstrates its reliability across a spectrum of clinical scenarios.
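The index bookkeeping behind these cross-validation strategies can be sketched in pure Python; for the n = 22 patients of this study, LOOCV is simply the special case k = n (an illustration, not the caret/MLeval implementation used in the study):

```python
import random

def kfold_indices(n, k, seed=0):
    """Shuffle sample indices and split into k folds; each fold serves once
    as the validation set while the rest form the training set."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits_10 = list(kfold_indices(22, 10))   # 10-fold CV on 22 patients
splits_loo = list(kfold_indices(22, 22))  # LOOCV: 22 single-sample validation sets
```

Every sample appears in exactly one validation fold, so each split's metrics are computed on held-out data.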
The findings of the study highlight the importance of understanding different FS methods. Using appropriate modeling methods is a key component of this process. A generalization of the findings is that for clinical problems in a smaller dataset, GALGO, which is a genetic algorithm, obtains better results in order to classify patients with DKD.
Using GALGO and RF with only three features, the proposed model achieved an AUC of 0.800, demonstrating its robustness. These outcomes are notable because they indicate that a parsimonious feature selection strategy can match the performance of studies employing more extensive sets of variables (Table 10).
The nature-inspired methodologies, such as GALGO, extend their examination beyond linear associations between features and target outcomes. These techniques assess the collective potential of feature subsets rather than focusing solely on individual feature performance; while this approach may appear straightforward, it is essential to note that the combinatorial nature of feature subsets results in an exponential rise in the number of potential combinations, rendering their exhaustive evaluation computationally demanding within reasonable time frames. The application of a genetic algorithm contributes to the development of a resilient multivariate model [56].
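This subset search can be pictured with a toy genetic algorithm over binary feature masks. The sketch below is a deliberately simplified stand-in for GALGO: the fitness function is hypothetical (it rewards three "informative" features and penalizes subset size, in place of the RF classification accuracy GALGO actually optimizes):

```python
import random

def ga_feature_selection(n_features, fitness, pop_size=20, generations=30, mut_rate=0.1, seed=0):
    """Toy genetic algorithm over binary feature masks: tournament selection,
    one-point crossover, bit-flip mutation. `fitness` scores a mask (higher is better)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=fitness, reverse=True)
        nxt = scored[:2]  # elitism: carry the two best masks forward unchanged
        while len(nxt) < pop_size:
            p1, p2 = (max(rng.sample(pop, 3), key=fitness) for _ in range(2))
            cut = rng.randrange(1, n_features)
            child = p1[:cut] + p2[cut:]
            child = [1 - b if rng.random() < mut_rate else b for b in child]
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Hypothetical fitness: reward features 0-2 as "informative", penalize subset size,
# standing in for the classification accuracy a wrapper like GALGO would compute.
fitness = lambda mask: sum(mask[:3]) - 0.1 * sum(mask)
best = ga_feature_selection(n_features=10, fitness=fitness)
```

Rather than scoring features one at a time, each chromosome is evaluated as a whole subset, which is what lets the search capture feature interactions without enumerating all 2ⁿ combinations.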
The features obtained by this methodology are in accordance with several investigations mentioned above (Table 1). CRE and UREA are common biochemical indicators used to monitor kidney function. CRE is a reliable indicator of kidney function, reflecting renal filtering capacity; a high level of this biomarker is associated with poor clearance of CRE by the kidneys. Its sensitivity is poor in the early stages of renal impairment unless the damage is severe enough to compromise filtration [57,58]. UREA is an indicator of kidney damage and dysfunction and is helpful in giving an initial diagnosis [59].
From a molecular perspective, both CRE and UREA are closely tied to the pathophysiological mechanisms underpinning DKD. Persistent inflammation of the circulatory system and renal tissue serves as a critical pathophysiological basis for the development and progression of DKD. This inflammation can be triggered by metabolic, biochemical, and hemodynamic disturbances commonly observed in DKD [60]. Chronic hyperglycemia in diabetes patients induces oxidative stress and activates inflammatory pathways, leading to the production of pro-inflammatory cytokines such as IL-6 and TNF- α . These molecules contribute to tubular and glomerular injury, reducing renal filtration efficiency and increasing serum creatinine and urea levels. Persistent inflammation also promotes extracellular matrix deposition, leading to fibrosis, which exacerbates renal dysfunction [9,61].
With regard to LIPIDS.TX, the dataset of this study only mentions whether the patients are under medical lipids treatment. It should be noted that if the patient is undergoing the treatment, and efficient control is not achieved, there are deleterious effects of elevated cholesterol levels on renal injury and on the initiation and progression of DKD [62]. Likewise, Ayodele et al. [63] support that lipids play a role in the development and progression of glomerular injury. Dyslipidemia, a common feature in nephrotic syndrome and DKD, exacerbates renal damage through multiple mechanisms. The uptake of oxidized LDL by mesangial cells promotes glomerulosclerosis, while free fatty acids directly impair podocyte function, leading to cellular damage and loss. Additionally, dysregulated lipid metabolism, including altered cholesterol and fatty acid pathways, contributes to injury in glomerular and tubular cells. In DKD, mitochondrial dysfunction further aggravates these effects by disrupting lipid metabolism, inducing oxidative stress, and triggering inflammatory pathways, ultimately accelerating fibrosis and renal decline [64,65].
This research was carried out under the guidance of subject matter experts in diabetes and its associated conditions, guaranteeing that the methodology and findings align with clinical significance. This oversight ensured that the predictions and risk factors identified by the model were consistent with the current clinical understanding of DKD.
There are some limitations to this study. Firstly, data were only available from a limited set of patients with DKD: 22 patients in total (11 cases and 11 controls). Secondly, the findings may only be applicable to Mexican people; the cultural, genetic, and environmental factors unique to Mexico could influence the manifestation and progression of DKD, making it challenging to extrapolate the findings to other populations. Replicating the study in different ethnic groups and geographic regions would provide valuable insights into the generalizability of the results and help to identify potential population-specific factors contributing to the disease. Thirdly, it would be interesting to reproduce this methodology with the inclusion of other routine laboratory parameters associated with renal failure, as well as comorbidities such as neuropathy and retinopathy; including these additional variables could provide a more comprehensive understanding of the factors influencing the development and progression of DKD. Furthermore, exploring the interactions between these attributes and identifying novel biomarkers predictive of disease outcomes could enhance the predictive accuracy and clinical utility of the model. It is also important to acknowledge that the dataset lacks variables pertaining to physical activity and dietary habits, factors known to substantially impact the progression of diabetes and associated complications; incorporating these variables could have provided a more comprehensive perspective on the factors contributing to DKD.
Finally, in this context the attainment of highly effective models alone is insufficient. It is imperative to comprehend their functioning and their associations with input data. This is where explanatory techniques, such as SHAP and LIME (Local Interpretable Model-Agnostic Explanations), play a fundamental role. These techniques are pivotal for analyzing and comprehending how the model’s outcomes correlate with various input features, thereby providing clarity and transparency to the clinical decision-making process. For instance, the analyzed Figure 11, Figure 12 and Figure 13 demonstrate how key attributes such as CRE, UREA and LIPIDS.TX influence the model’s predictions in a differentiated manner. Specifically, the SHAP-based analysis underscores the non-linear relationships between CRE and UREA levels and the predicted probability of DKD, revealing distinct patterns for patients classified as non-DKD and DKD. These relationships validate the biological relevance of the variables and emphasize how the model utilizes these features to differentiate between DKD and non-DKD cases. Similarly, the impact of LIPIDS.TX underscores the importance of including therapeutic factors in clinical models, as it shows how treatment-related variables can influence predictions. The integration of explanatory algorithms not only facilitates a profound understanding of how model predictions are derived but also enables the identification of critical factors influencing medical decisions. This is essential for enhancing the quality of DM management and the treatment of its complications. Furthermore, the utilization of explanatory techniques empowers healthcare professionals to validate and refine the model’s performance. By analyzing the intricate relationships between input features and model outcomes, as demonstrated in the discussed Figure 11, Figure 12 and Figure 13, clinicians can identify potential biases or areas for improvement within the model. 
This iterative process of refinement not only enhances the reliability and accuracy of predictions but also fosters trust and confidence in the model’s capabilities among healthcare providers.
In summary, different FS methodologies were applied to identify predictors of DKD. Four main RF classification models are presented in this work (UFS + RF, Boruta + RF, GALGO + RF, and Elastic net + RF), and their performance measures were calculated. The results showed that the GALGO + RF model achieved an AUC of 0.80, obtaining promising results with only three risk factors, CRE, UREA, and LIPIDS.TX, and enabling effective classification of patients with DKD. There is a need to detect which risk factors participate in the development of DKD, diagnose the disease early, and determine the most effective medical treatments necessary to prevent the progression of kidney damage. Based on the results, the model could enable the development of a preliminary tool to assist clinicians in the diagnosis of DKD and improve their decision-making in the management of this disease. While the selected biomarkers are well established in nephrology, their application within advanced FS methodologies such as GALGO outlines the potential of combining traditional markers in innovative ways to improve diagnostic performance. This approach not only validates the relevance of these biomarkers but also emphasizes their importance in creating highly sensitive and specific models for underrepresented populations, such as this Mexican cohort. The results show that leveraging ML techniques can uncover complex interactions between biomarkers, providing clinical value beyond their independent contributions. As a future direction of research, the development of a hybrid model is proposed to integrate clinical data and imaging for improving the diagnosis and prediction of DKD progression. This model will combine comprehensive clinical information from patient records, including demographic data, medical history, laboratory test results, and prior treatments.
In addition, medical imaging data such as ultrasounds will be collected to assess renal structure and function. This future work has the potential to significantly improve the effectiveness of DKD diagnosis and management, leading to better patient outcomes and more personalized and efficient healthcare delivery.

Author Contributions

Conceptualization, V.M.-G., C.E.G.-T., J.I.G.-T., M.C., A.G.-H. and H.L.-G.; Methodology, V.M.-G., C.E.G.-T., J.I.G.-T., M.C., J.M.C.-P., A.G.-H. and H.L.-G.; Software, V.M.-G., C.E.G.-T. and H.G.-R.; Validation, V.M.-G., C.E.G.-T. and J.I.G.-T.; Formal analysis, V.M.-G., C.E.G.-T., J.I.G.-T. and H.G.-R.; Investigation, V.M.-G., C.E.G.-T., J.I.G.-T., M.C. and H.G.-R.; Resources, C.E.G.-T., M.C., J.M.C.-P., A.G.-H. and H.L.-G.; Data curation, V.M.-G.; Writing—original draft, V.M.-G., C.E.G.-T., J.I.G.-T. and J.M.C.-P.; Writing—review & editing, V.M.-G., C.E.G.-T., J.I.G.-T., M.C., J.M.C.-P., H.G.-R., A.G.-H. and K.O.V.-C.; Visualization, V.M.-G., J.M.C.-P. and K.O.V.-C.; Supervision, C.E.G.-T., J.I.G.-T., M.C., J.M.C.-P. and H.G.-R.; Project administration, H.L.-G. and K.O.V.-C.; Funding acquisition, H.L.-G. and K.O.V.-C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the Instituto Mexicano del Seguro Social (IMSS) (protocol code R-2011-785-018). The dataset was obtained from the Unidad de Investigación Médica en Bioquímica, Centro Nacional Siglo XXI, IMSS.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

No new data were created. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Atlas, I. IDF Diabetes Atlas, 10th ed.; International Diabetes Federation: Brussels, Belgium, 2021. [Google Scholar]
  2. Saeedi, P.; Petersohn, I.; Salpea, P.; Malanda, B.; Karuranga, S.; Unwin, N.; Colagiuri, S.; Guariguata, L.; Motala, A.A.; Ogurtsova, K.; et al. Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: Results from the International Diabetes Federation Diabetes Atlas. Diabetes Res. Clin. Pract. 2019, 157, 107843. [Google Scholar] [CrossRef] [PubMed]
  3. Barengo, N.C.; Apolinar, L.M.; Estrada Cruz, N.A.; Fernández Garate, J.E.; Correa González, R.A.; Diaz Valencia, P.A.; Gonzalez, C.A.C.; Rodriguez, J.A.G.; González, N.C. Development of an information system and mobile application for the care of type 2 diabetes patients at the primary care level for the health sector in Mexico: Study protocol for a randomized controlled, open-label trial. Trials 2022, 23, 253. [Google Scholar] [CrossRef]
  4. Liu, X.z.; Duan, M.; Huang, H.d.; Zhang, Y.; Xiang, T.y.; Niu, W.c.; Zhou, B.; Wang, H.l.; Zhang, T.t. Predicting diabetic kidney disease for type 2 diabetes mellitus by machine learning in the real world: A multicenter retrospective study. Front. Endocrinol. 2023, 14, 1184190. [Google Scholar] [CrossRef] [PubMed]
  5. Huang, G.M.; Huang, K.Y.; Lee, T.Y.; Weng, J.T.Y. An interpretable rule-based diagnostic classification of diabetic nephropathy among type 2 diabetes patients. BMC Bioinform. 2015, 16, S5. [Google Scholar] [CrossRef] [PubMed]
  6. Yokoyama, H.; Araki, S.i.; Honjo, J.; Okizaki, S.; Yamada, D.; Shudo, R.; Shimizu, H.; Sone, H.; Moriya, T.; Haneda, M. Association between remission of macroalbuminuria and preservation of renal function in patients with type 2 diabetes with overt proteinuria. Diabetes Care 2013, 36, 3227–3233. [Google Scholar] [CrossRef] [PubMed]
  7. Atlas, I. IDF Diabetes Atlas, 9th ed.; International Diabetes Federation: Brussels, Belgium, 2019. [Google Scholar]
  8. Vázquez-Moreno, M.; Locia-Morales, D.; Peralta-Romero, J.; Sharma, T.; Meyre, D.; Cruz, M.; Flores-Alfaro, E.; Valladares-Salgado, A. AGT rs4762 is associated with diastolic blood pressure in Mexicans with diabetic nephropathy. J. Diabetes Its Complicat. 2021, 35, 107826. [Google Scholar] [CrossRef]
  9. Samsu, N. Diabetic nephropathy: Challenges in pathogenesis, diagnosis, and treatment. BioMed Res. Int. 2021, 2021, 1497449. [Google Scholar] [CrossRef] [PubMed]
  10. Basuli, D.; Kavcar, A.; Roy, S. From bytes to nephrons: AI’s journey in diabetic kidney disease. J. Nephrol. 2024, 1–11. [Google Scholar] [CrossRef] [PubMed]
  11. Afsaneh, E.; Sharifdini, A.; Ghazzaghi, H.; Ghobadi, M.Z. Recent applications of machine learning and deep learning models in the prediction, diagnosis, and management of diabetes: A comprehensive review. Diabetol. Metab. Syndr. 2022, 14, 196. [Google Scholar] [CrossRef]
  12. Vijayan, V.V.; Anjali, C. Prediction and diagnosis of diabetes mellitus—A machine learning approach. In Proceedings of the 2015 IEEE Recent Advances in Intelligent Computational Systems (RAICS), Trivandrum, India, 10–12 December 2015; pp. 122–127. [Google Scholar]
  13. Zou, Q.; Qu, K.; Luo, Y.; Yin, D.; Ju, Y.; Tang, H. Predicting diabetes mellitus with machine learning techniques. Front. Genet. 2018, 9, 515. [Google Scholar] [CrossRef]
  14. Fatima, M.; Pasha, M. Survey of machine learning algorithms for disease diagnostic. J. Intell. Learn. Syst. Appl. 2017, 9, 1–16. [Google Scholar] [CrossRef]
  15. Tuppad, A.; Patil, S.D. Machine learning for diabetes clinical decision support: A review. Adv. Comput. Intell. 2022, 2, 22. [Google Scholar] [CrossRef]
  16. Nagarajan, S.M.; Muthukumaran, V.; Murugesan, R.; Joseph, R.B.; Meram, M.; Prathik, A. Innovative feature selection and classification model for heart disease prediction. J. Reliab. Intell. Environ. 2022, 8, 333–343. [Google Scholar] [CrossRef]
  17. Nagarajan, S.M.; Deverajan, G.G.; Chatterjee, P.; Alnumay, W.; Ghosh, U. Effective task scheduling algorithm with deep learning for Internet of Health Things (IoHT) in sustainable smart cities. Sustain. Cities Soc. 2021, 71, 102945. [Google Scholar] [CrossRef]
  18. Chang, V.; Bailey, J.; Xu, Q.A.; Sun, Z. Pima Indians diabetes mellitus classification based on machine learning (ML) algorithms. Neural Comput. Appl. 2023, 35, 16157–16173. [Google Scholar] [CrossRef]
  19. Usman, T.M.; Saheed, Y.K.; Ignace, D.; Nsang, A. Diabetic retinopathy detection using principal component analysis multi-label feature extraction and classification. Int. J. Cogn. Comput. Eng. 2023, 4, 78–88. [Google Scholar] [CrossRef]
  20. Rodriguez-Romero, V.; Bergstrom, R.F.; Decker, B.S.; Lahu, G.; Vakilynejad, M.; Bies, R.R. Prediction of nephropathy in type 2 diabetes: An analysis of the ACCORD trial applying machine learning techniques. Clin. Transl. Sci. 2019, 12, 519–528. [Google Scholar] [CrossRef]
  21. Jiang, S.; Fang, J.; Yu, T.; Liu, L.; Zou, G.; Gao, H.; Zhuo, L.; Li, W. Novel model predicts diabetic nephropathy in type 2 diabetes. Am. J. Nephrol. 2020, 51, 130–138. [Google Scholar] [CrossRef] [PubMed]
  22. Shi, R.; Niu, Z.; Wu, B.; Zhang, T.; Cai, D.; Sun, H.; Hu, Y.; Mo, R.; Hu, F. Nomogram for the risk of diabetic nephropathy or diabetic Retinopathy among patients with type 2 diabetes mellitus based on questionnaire and biochemical indicators: A cross-sectional study. Diabetes Metab. Syndr. Obes. Targets Ther. 2020, 13, 1215. [Google Scholar] [CrossRef]
  23. Xi, C.; Wang, C.; Rong, G.; Deng, J. A nomogram model that predicts the risk of diabetic nephropathy in type 2 diabetes mellitus patients: A retrospective study. Int. J. Endocrinol. 2021, 2021, 6672444. [Google Scholar] [CrossRef] [PubMed]
  24. Maniruzzaman, M.; Islam, M.M.; Rahman, M.J.; Hasan, M.A.M.; Shin, J. Risk prediction of diabetic nephropathy using machine learning techniques: A pilot study with secondary data. Diabetes Metab. Syndr. Clin. Res. Rev. 2021, 15, 102263. [Google Scholar] [CrossRef]
  25. Yang, J.; Jiang, S. Development and Validation of a Model That Predicts the Risk of Diabetic Nephropathy in Type 2 Diabetes Mellitus Patients: A Cross-Sectional Study. Int. J. Gen. Med. 2022, 15, 5089. [Google Scholar] [CrossRef] [PubMed]
  26. Jović, A.; Brkić, K.; Bogunović, N. A review of feature selection methods with applications. In Proceedings of the 2015 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 25–29 May 2015; pp. 1200–1205. [Google Scholar]
  27. Lopez-Rincon, A.; Martinez-Archundia, M.; Martinez-Ruiz, G.U.; Schoenhuth, A.; Tonda, A. Automatic discovery of 100-miRNA signature for cancer classification using ensemble feature selection. BMC Bioinform. 2019, 20, 480. [Google Scholar] [CrossRef] [PubMed]
  28. Ahmadpour, H.; Bazrafshan, O.; Rafiei-Sardooi, E.; Zamani, H.; Panagopoulos, T. Gully erosion susceptibility assessment in the Kondoran watershed using machine learning algorithms and the Boruta feature selection. Sustainability 2021, 13, 10110. [Google Scholar] [CrossRef]
  29. Trevino, V.; Falciani, F. GALGO: An R package for multivariate variable selection using genetic algorithms. Bioinformatics 2006, 22, 1154–1156. [Google Scholar] [CrossRef]
  30. Liu, Y.; Gopalakrishnan, V. An overview and evaluation of recent machine learning imputation methods using cardiac imaging data. Data 2017, 2, 8. [Google Scholar] [CrossRef] [PubMed]
  31. Kursa, M.B.; Rudnicki, W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010, 36, 1–13. [Google Scholar] [CrossRef]
  32. Kursa, M.B.; Jankowski, A.; Rudnicki, W.R. Boruta–a system for feature selection. Fundam. Inform. 2010, 101, 271–285. [Google Scholar] [CrossRef]
  33. Zou, H.; Hastie, T. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2005, 67, 301–320. [Google Scholar] [CrossRef]
  34. Alrashdi, A.M.; Atitallah, I.B.; Al-Naffouri, T.Y. Precise performance analysis of the box-elastic net under matrix uncertainties. IEEE Signal Process. Lett. 2019, 26, 655–659. [Google Scholar] [CrossRef]
  35. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  36. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J.C.; Sheridan, R.P.; Feuston, B.P. Random forest: A classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958. [Google Scholar] [CrossRef]
  37. Parikh, R.; Mathai, A.; Parikh, S.; Sekhar, G.C.; Thomas, R. Understanding and using sensitivity, specificity and predictive values. Indian J. Ophthalmol. 2008, 56, 45. [Google Scholar] [CrossRef] [PubMed]
  38. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  39. Rakotomamonjy, A. Optimizing Area Under Roc Curve with SVMs. In Proceedings of the ROC Analysis in Artificial Intelligence, 1st International Workshop, ROCAI-2004, Valencia, Spain, 22 August 2004; pp. 71–80. [Google Scholar]
  40. Kohavi, R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the IJCAI’95: 14th International Joint Conference on Artificial Intelligence, Montreal, QC, Canada, 20–25 August 1995; Volume 14, pp. 1137–1145. [Google Scholar]
  41. Lundberg, S. A unified approach to interpreting model predictions. arXiv 2017, arXiv:1705.07874. [Google Scholar]
  42. Ghosh, S.K.; Khandoker, A.H. Investigation on explainable machine learning models to predict chronic kidney diseases. Sci. Rep. 2024, 14, 3687. [Google Scholar] [CrossRef] [PubMed]
  43. Ahmad, M.A.; Eckert, C.; Teredesai, A. Interpretable machine learning in healthcare. In Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, New York, NY, USA, 4–7 June 2018; pp. 559–560. [Google Scholar]
  44. John, C.R. MLeval: Machine Learning Model Evaluation, R package version 0.3; 2020. Available online: https://cran.r-project.org/package=MLeval (accessed on 1 September 2024).
  45. Friedman, J.; Hastie, T.; Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010, 33, 1–22. [Google Scholar] [CrossRef] [PubMed]
  46. Kuhn, M. caret: Classification and Regression Training, R package version 6.0-86; 2020. Available online: https://cran.r-project.org/package=caret (accessed on 1 September 2024).
  47. Molnar, C.; Casalicchio, G.; Bischl, B. iml: An R package for interpretable machine learning. J. Open Source Softw. 2018, 3, 786. [Google Scholar] [CrossRef]
  48. Sun, L.; Wang, J.; Wei, J. AVC: Selecting discriminative features on basis of AUC by maximizing variable complementarity. BMC Bioinform. 2017, 18, 73–89. [Google Scholar] [CrossRef] [PubMed]
  49. Martínez-Torteya, A.; Treviño, V.; Tamez-Peña, J.G. Improved diagnostic multimodal biomarkers for Alzheimer’s disease and mild cognitive impairment. BioMed Res. Int. 2015, 2015, 961314. [Google Scholar] [CrossRef] [PubMed]
  50. Zhang, Z.; Trevino, V.; Hoseini, S.S.; Belciug, S.; Boopathi, A.M.; Zhang, P.; Gorunescu, F.; Subha, V.; Dai, S. Variable selection in Logistic regression model with genetic algorithm. Ann. Transl. Med. 2018, 6, 45. [Google Scholar] [CrossRef] [PubMed]
  51. Wu, C.C.; Yeh, W.C.; Hsu, W.D.; Islam, M.M.; Nguyen, P.A.A.; Poly, T.N.; Wang, Y.C.; Yang, H.C.; Li, Y.C.J. Prediction of fatty liver disease using machine learning algorithms. Comput. Methods Programs Biomed. 2019, 170, 23–29. [Google Scholar] [CrossRef] [PubMed]
  52. Ganggayah, M.D.; Taib, N.A.; Har, Y.C.; Lio, P.; Dhillon, S.K. Predicting factors for survival of breast cancer patients using machine learning techniques. BMC Med. Inform. Decis. Mak. 2019, 19, 48. [Google Scholar] [CrossRef] [PubMed]
  53. Lebedev, A.; Westman, E.; Van Westen, G.; Kramberger, M.; Lundervold, A.; Aarsland, D.; Soininen, H.; Kłoszewska, I.; Mecocci, P.; Tsolaki, M.; et al. Random Forest ensembles for detection and prediction of Alzheimer’s disease with a good between-cohort robustness. NeuroImage Clin. 2014, 6, 115–125. [Google Scholar] [CrossRef] [PubMed]
  54. Cui, M.; Gang, X.; Gao, F.; Wang, G.; Xiao, X.; Li, Z.; Li, X.; Ning, G.; Wang, G. Risk assessment of sarcopenia in patients with type 2 diabetes mellitus using data mining methods. Front. Endocrinol. 2020, 11, 123. [Google Scholar] [CrossRef] [PubMed]
  55. Lasko, T.A.; Bhagwat, J.G.; Zou, K.H.; Ohno-Machado, L. The use of receiver operating characteristic curves in biomedical informatics. J. Biomed. Inform. 2005, 38, 404–415. [Google Scholar] [CrossRef] [PubMed]
  56. Villagrana-Bañuelos, K.E.; Maeda-Gutiérrez, V.; Alcalá-Rmz, V.; Oropeza-Valdez, J.J.; Oostdam, H.V.; Ana, S.; Castañeda-Delgado, J.E.; López, J.A.; Borrego Moreno, J.C.; Galván-Tejada, C.E.; et al. COVID-19 outcome prediction by integrating clinical and metabolic data using machine learning algorithms. Rev. Investig. Clínica 2022, 74, 314–327. [Google Scholar] [CrossRef]
  57. Dabla, P.K. Renal function in diabetic nephropathy. World J. Diabetes 2010, 1, 48. [Google Scholar] [CrossRef]
  58. Currie, G.; McKay, G.; Delles, C. Biomarkers in diabetic nephropathy: Present and future. World J. Diabetes 2014, 5, 763. [Google Scholar] [CrossRef]
  59. Campion, C.G.; Sanchez-Ferras, O.; Batchu, S.N. Potential role of serum and urinary biomarkers in diagnosis and prognosis of diabetic nephropathy. Can. J. Kidney Health Dis. 2017, 4, 2054358117705371. [Google Scholar] [CrossRef]
  60. Lim, A.K.; Tesch, G.H. Inflammation in diabetic nephropathy. Mediat. Inflamm. 2012, 2012, 146154. [Google Scholar] [CrossRef] [PubMed]
  61. Shang, J.; Wang, L.; Zhang, Y.; Zhang, S.; Ning, L.; Zhao, J.; Cheng, G.; Liu, D.; Xiao, J.; Zhao, Z. Chemerin/ChemR23 axis promotes inflammation of glomerular endothelial cells in diabetic nephropathy. J. Cell. Mol. Med. 2019, 23, 3417–3428. [Google Scholar] [CrossRef] [PubMed]
  62. Valensi, P.; Picard, S. Lipids, lipid-lowering therapy and diabetes complications. Diabetes Metab. 2011, 37, 15–24. [Google Scholar] [CrossRef] [PubMed]
  63. Ayodele, O.E.; Alebiosu, C.O.; Salako, B.L. Diabetic nephropathy–a review of the natural history, burden, risk factors and treatment. J. Natl. Med Assoc. 2004, 96, 1445. [Google Scholar]
  64. Lin, P.H.; Duann, P. Dyslipidemia in kidney disorders: Perspectives on mitochondria homeostasis and therapeutic opportunities. Front. Physiol. 2020, 11, 1050. [Google Scholar] [CrossRef] [PubMed]
  65. Mitrofanova, A.; Merscher, S.; Fornoni, A. Kidney lipid dysmetabolism and lipid droplet accumulation in chronic kidney disease. Nat. Rev. Nephrol. 2023, 19, 629–645. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Architecture of the proposed classification system. (The value of k refers to the 10 folds used in the k-fold cross-validation process).
Figure 2. Procedure of 10-fold cross-validation. The dataset was divided into 10 subsets; in each round, nine subsets were used for training and the remaining one for testing.
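The split procedure in Figure 2 can be sketched in a few lines. This is an illustrative standard-library re-implementation (the study itself relied on R tooling such as caret [46]); with `n_samples = 22` matching the dataset size, each fold holds two or three patients.

```python
import random

def kfold_indices(n_samples, k=10, seed=42):
    """Split sample indices into k folds; each fold serves once as the
    test set while the remaining k-1 folds form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# With 22 patients and k = 10, folds contain 2 or 3 samples each.
splits = list(kfold_indices(22, k=10))
```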
Figure 3. Boxplots of attributes based on importance values. (The horizontal line within each box represents the median value, and the white circles indicate outliers).
Figure 4. Gene rank stability.
Figure 5. Forward selection models.
Figure 6. The model obtained from GALGO with RF as a classifier.
Figure 7. Elastic net regularized coefficients shown as a function of log(λ). Each line represents the coefficient path of a feature as the regularization penalty changes; the colors correspond to different features in the model.
Figure 8. ROC curve for the DKD diagnosis by using the entire dataset.
Figure 9. ROC curves of the four RF classification models.
Figure 10. Feature importance derived from Shapley values for the top predictors. The predictors include creatinine (CRE), lipids treatment (LIPIDS.TX) and urea (UREA). The horizontal bars indicate the contribution of each feature to the model’s predictions, measured by Shapley values, with CRE showing the highest impact.
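The Shapley values in Figure 10 attribute the model's output to individual predictors. Because only three predictors remain in the final model, exact enumeration over all feature orderings is cheap. The sketch below is illustrative, not the authors' code (the study used the interpretability packages it cites), and the additive `weights` value function in the usage example is hypothetical, not the fitted RF.

```python
from itertools import permutations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: each feature's average marginal contribution
    to the coalition value function, averaged over all orderings."""
    phi = {f: 0.0 for f in features}
    for order in permutations(features):
        coalition = set()
        for f in order:
            before = value(frozenset(coalition))
            coalition.add(f)
            phi[f] += value(frozenset(coalition)) - before
    n_orderings = factorial(len(features))
    return {f: v / n_orderings for f, v in phi.items()}

# Hypothetical additive value function (NOT the fitted model): for an
# additive game, the Shapley value of each feature equals its weight.
weights = {"CRE": 0.5, "UREA": 0.3, "LIPIDS.TX": 0.2}
phi = shapley_values(list(weights), lambda s: sum(weights[f] for f in s))
```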
Figure 11. Dependence plot—effect of CRE levels on the predicted probability of DKD. The left panel (‘NO’) represents NO-DKD patients, where higher CRE levels correspond to a lower probability of misclassification. The right panel (‘YES’) represents YES-DKD patients, where increasing CRE levels are positively associated with the likelihood of DKD prediction.
Figure 12. Dependence plot—effect of UREA levels on the predicted probability of DKD. The left panel (‘NO’) represents NO-DKD patients, showing fluctuating probabilities with a peak around 25 mg/dL, followed by stabilization at lower probabilities as urea levels increase. The right panel (‘YES’) represents YES-DKD patients, where urea levels initially cause a dip in predicted probability but then steadily increase, stabilizing at higher probabilities beyond 50 mg/dL.
Figure 13. Dependence plot—effect of LIPIDS.TX on the predicted probability of DKD. The left panel (‘NO’) represents NO-DKD patients, showing a high predicted probability of misclassification for those without lipid treatment, which sharply declines beyond a certain threshold of treatment presence. The right panel (‘YES’) represents YES-DKD patients, where the predicted probability of DKD increases with the presence of lipid treatment, stabilizing at higher probabilities as treatment is applied.
Table 1. Summary of reviewed studies addressing DKD prediction in T2D patients.
| Author and Year | Feature Selection Method | Risk Factors | Classification Algorithm |
|---|---|---|---|
| Rodriguez-Romero et al. 2019 [20] | InfoGain | Age, cholesterol, triglycerides, low-density lipoprotein, urinary albumin excretion, glomerular filtration rate | 1R, J48, RF, SL, SMO, NB |
| Jiang et al. 2020 [21] | LASSO | DR, HbA1c, gender, anemia, hematuria, DM duration, blood pressure, urinary protein excretion, estimated glomerular filtration rate | LR |
| Shi et al. 2020 [22] | LASSO, LR | HbA1c, disease course, body mass index, total triglycerides, blood urea nitrogen, systolic blood pressure, postprandial blood glucose | LR |
| Xi et al. 2021 [23] | LASSO, LR | Age, gender, hypertension, medicine use, duration of DM, body mass index, serum creatinine level, blood urea nitrogen level, neutrophil-to-lymphocyte ratio, red blood cell distribution width | LR |
| Maniruzzaman et al. 2021 [24] | PCA | Age, HbA1c, triglycerides, DM duration, body mass index, fasting blood sugar, low-density lipoprotein, high-density lipoprotein, systolic blood pressure, diastolic blood pressure | LDA, SVM, LR, K-NN, NB, ANN |
| Yang and Jiang 2022 [25] | Univariate, AIC | HbA1c, triglycerides, hypertension, serum creatinine, body mass index, blood urea nitrogen, diabetic peripheral neuropathy | LR |

Abbreviations: Diabetic Kidney Disease (DKD); Type 2 Diabetes (T2D); Information Gain (InfoGain); Least Absolute Shrinkage and Selection Operator (LASSO); Diabetic Retinopathy (DR); Glycated Hemoglobin (HbA1c); Diabetes Mellitus (DM); Logistic Regression (LR); Principal Component Analysis (PCA); Linear Discriminant Analysis (LDA); Support Vector Machine (SVM); K-Nearest Neighbors (K-NN); Naive Bayes (NB); Artificial Neural Network (ANN); Akaike Information Criterion (AIC).
Table 2. Summary statistics of the entire dataset.
| Predictors (n = 32) | Cases (n = 11) | Controls (n = 11) | p-Value |
|---|---|---|---|
| Demographic characteristics | | | |
| EDU | | | 0.559 |
| 1—Elementary school | 2 (18.18%) | 1 (9.09%) | |
| 2—Secondary school | 3 (27.27%) | 2 (18.18%) | |
| 3—Technical school | 2 (18.18%) | 2 (18.18%) | |
| 4—High school | - | 3 (27.27%) | |
| 5—Professional | 4 (36.36%) | 3 (27.27%) | |
| 6—Postgraduate | - | - | |
| SAL | | | 0.409 |
| 1—Less than 2000.00 MXN (~98.16 USD) | 3 (27.27%) | 3 (27.27%) | |
| 2—Between 2000.00 and 5000.00 MXN (~98.16–245.39 USD) | 3 (27.27%) | 2 (18.18%) | |
| 3—More than 5000.00 MXN (~245.39 USD) | 5 (45.45%) | 6 (54.54%) | |
| SEX | | | 1 |
| 0—Male | 6 (54.54%) | 6 (54.54%) | |
| 1—Female | 5 (45.45%) | 5 (45.45%) | |
| AGE, years | 62.909 (±11.597) | 52.909 (±12.045) | 0.0816 |
| AGE DX, years | 45.636 (±6.561) | 45.272 (±11.199) | 0.922 |
| Clinical observations | | | |
| WHR, cm/cm | 0.963 (±0.092) | 0.914 (±0.094) | 0.233 |
| BMI, kg/m² | 30.486 (±3.685) | 29.225 (±7.205) | 0.591 |
| SBP, mmHg | 125.909 (±15.623) | 123.636 (±15.666) | 0.722 |
| DBP, mmHg | 82.727 (±9.045) | 79.545 (±8.790) | 0.395 |
| USBP, mmHg | 122.272 (±12.915) | 120.909 (±12.21) | 0.79 |
| UDBP, mmHg | 80.909 (±8.312) | 78.181 (±7.507) | 0.413 |
| GLU, mg/dL | 127.363 (±43.845) | 167.363 (±78.888) | 0.159 |
| UREA, mg/dL | 45.818 (±25.732) | 28 (±7.835) | 0.1082 |
| CRE, mg/dL | 1.196 (±0.622) | 0.733 (±0.136) | 0.074 |
| CHOL, mg/dL | 222.518 (±74.48) | 206.663 (±58.252) | 0.475 |
| HDL, mg/dL | 37.881 (±10.939) | 40.136 (±14.742) | 0.672 |
| LDL, mg/dL | 161.045 (±67.839) | 136.055 (±37.055) | 0.294 |
| TG, mg/dL | 247.427 (±162.302) | 310.536 (±274.891) | 0.503 |
| TCHOLU, mg/dL | 193.909 (±69.288) | 193 (±41.916) | 0.969 |
| UHDL, mg/dL | 41.639 (±9.982) | 41.0909 (±14.117) | 0.913 |
| ULDL, mg/dL | 130.181 (±60.237) | 123.545 (±22.979) | 0.722 |
| UTG, mg/dL | 211.090 (±151.341) | 302 (±260.799) | 0.325 |
| HBA1C, mmol/L | 6.994 (±2.527) | 8.209 (±2.667) | 0.28 |
| GFR | 76.807 (±28.853) | 111.454 (±44.331) | 0.0583 |
| Medical treatment | | | |
| LIPIDS-TX (0—No / 1—Yes) | 4 (36.36%) / 7 (63.63%) | 8 (72.72%) / 3 (27.27%) | 0.095 |
| HA-TX (0—No / 1—Yes) | 7 (63.63%) / 4 (36.36%) | 8 (72.72%) / 3 (27.27%) | 0.48 |
| GB (0—No / 1—Yes) | 6 (54.54%) / 5 (45.45%) | 10 (90.90%) / 1 (9.09%) | 0.08 |
| MF (0—No / 1—Yes) | 3 (27.27%) / 8 (72.72%) | 3 (27.27%) / 8 (72.72%) | 1 |
| PG (0—No / 1—Yes) | 11 (100%) / - | 11 (100%) / - | - |
| RG (0—No / 1—Yes) | 11 (100%) / - | 11 (100%) / - | - |
| AB (0—No / 1—Yes) | 11 (100%) / - | 11 (100%) / - | - |
| INS (0—No / 1—Yes) | 8 (72.72%) / 3 (27.27%) | 6 (54.54%) / 5 (45.45%) | 0.379 |
Table 3. Univariate analysis: AUC values of each feature.
| Feature | AUC Value | Feature | AUC Value |
|---|---|---|---|
| EDU | 0.29 | LDL | 0.29 |
| SAL | 0.55 | TG | 0.50 |
| SEX | 0.26 | TCHOLU | 0.64 |
| AGE | 0.53 | UHDL | 0.58 |
| AGE DX | 0.47 | ULDL | 0.55 |
| WHR | 0.25 | UTG | 0.67 |
| BMI | 0.57 | HBA1C | 0.46 |
| SBP | 0.13 | GFR | 0.68 |
| DBP | 0.28 | GB | 0.60 |
| USBP | 0.40 | MF | 0.16 |
| UDBP | 0.41 | PG | - |
| GLU | 0.67 | RG | - |
| UREA | 0.60 | AB | - |
| CRE | 0.61 | INS | 0.45 |
| CHOL | 0.53 | LIPIDS.TX | 0.48 |
| HDL | 0.44 | HA-TX | 0.24 |
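The per-feature AUC values above come from using each attribute alone as a ranking score; values below 0.5 indicate a feature that ranks controls above cases. A minimal sketch of that computation via the Mann–Whitney formulation of the AUC (illustrative, not the authors' exact code):

```python
def univariate_auc(values, labels):
    """AUC of a single feature used as a raw score: the probability that
    a randomly chosen case outranks a randomly chosen control, with ties
    counted as half (Mann-Whitney U / (n_pos * n_neg))."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```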
Table 4. Attribute importance statistics from the Boruta analysis.

| Attribute | meanImp | medianImp | minImp | maxImp | normHits | Decision |
|---|---|---|---|---|---|---|
| EDU | −0.68 | −1.02 | −2.07 | 1.73 | 0.00 | Rejected |
| SAL | 0.23 | 0.18 | −1.81 | 2.00 | 0.00 | Rejected |
| SEX | −0.44 | −0.725 | −1.69 | 1.41 | 0.00 | Rejected |
| AGE | 0.81 | 1.04 | −0.99 | 2.05 | 0.00 | Rejected |
| AGE DX | 0.01 | 0.21 | −1.96 | 1.93 | 0.00 | Rejected |
| WHR | −1.10 | −1.36 | −2.14 | 0.80 | 0.00 | Rejected |
| BMI | 0.05 | 0.09 | −1.01 | 1.20 | 0.00 | Rejected |
| GLU | 3.01 | 2.98 | −0.65 | 6.61 | 0.46 | Tentative |
| UREA | 0.92 | 0.96 | −1.07 | 2.34 | 0.00 | Rejected |
| CRE | 6.69 | 6.83 | 2.16 | 9.81 | 0.85 | Confirmed |
| LIPIDS-TX | 0.12 | 0.12 | −2.07 | 2.23 | 0.00 | Rejected |
| CHOL | −0.02 | 0.37 | −2.90 | 1.16 | 0.00 | Rejected |
| HDL | −0.43 | −0.38 | −2.20 | 1.56 | 0.00 | Rejected |
| LDL | −1.00 | −0.96 | −2.13 | 0.07 | 0.00 | Rejected |
| TG | −0.72 | −0.59 | −2.60 | 1.10 | 0.00 | Rejected |
| TCHOLU | −0.54 | −0.98 | −2.66 | 1.76 | 0.00 | Rejected |
| UHDL | −1.08 | −1.06 | −3.28 | 0.42 | 0.00 | Rejected |
| ULDL | 0.01 | −0.01 | −2.70 | 2.11 | 0.00 | Rejected |
| UTG | −1.08 | −0.98 | −2.32 | −0.16 | 0.00 | Rejected |
| HA-TX | −0.56 | −0.95 | −1.62 | 1.38 | 0.00 | Rejected |
| SBP | −0.81 | −0.74 | −2.47 | 1.41 | 0.00 | Rejected |
| DBP | −0.81 | −0.89 | −2.08 | 0.63 | 0.00 | Rejected |
| USBP | −1.03 | −1.25 | −2.13 | 0.48 | 0.00 | Rejected |
| UDBP | −0.51 | −0.46 | −1.66 | 0.75 | 0.00 | Rejected |
| HBA1C | 0.08 | 0.15 | −1.46 | 2.21 | 0.00 | Rejected |
| GFR | 2.95 | 3.00 | −0.75 | 7.23 | 0.48 | Tentative |
| GB | 0.13 | −0.22 | −1.16 | 2.05 | 0.00 | Rejected |
| MF | −0.92 | −1.00 | −2.08 | 1.00 | 0.00 | Rejected |
| PG | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | Rejected |
| RG | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | Rejected |
| AB | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | Rejected |
| INS | −0.75 | −0.85 | −1.72 | 0.87 | 0.00 | Rejected |
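Boruta's decision rule compares each attribute's importance against the best importance among randomly shuffled "shadow" copies of the features, counting hits across iterations. The sketch below mirrors that logic with absolute Pearson correlation as a stand-in importance measure and a simple hit-rate threshold; real Boruta uses random-forest importances and a statistical test to assign Confirmed/Tentative/Rejected, so this is illustrative only.

```python
import random

def corr(a, b):
    """Pearson correlation, used here as a stand-in importance measure."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    da = sum((x - ma) ** 2 for x in a) ** 0.5
    db = sum((y - mb) ** 2 for y in b) ** 0.5
    return num / (da * db) if da and db else 0.0

def boruta_step(X, y, rng):
    """One iteration: each real feature is a 'hit' if its importance
    exceeds the best importance among shuffled shadow copies."""
    shadows = [col[:] for col in X]
    for col in shadows:
        rng.shuffle(col)
    shadow_max = max(abs(corr(col, y)) for col in shadows)
    return [abs(corr(col, y)) > shadow_max for col in X]

def boruta(X, y, n_iter=50, threshold=0.8, seed=0):
    """Accept a feature if its normalized hit rate reaches the threshold
    (a simplification of Boruta's statistical decision procedure)."""
    rng = random.Random(seed)
    hits = [0] * len(X)
    for _ in range(n_iter):
        for j, hit in enumerate(boruta_step(X, y, rng)):
            hits[j] += hit
    return [h / n_iter >= threshold for h in hits]
```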
Table 5. GALGO input parameters.
| Parameter | Description | Value |
|---|---|---|
| Classification method | Method used for classification | RF |
| Chromosome size | Number of variables included in each model | 5 |
| Max solutions | Number of solutions to collect | 50 |
| Max generations | Number of generations the GA can evolve | 300 |
| Goal fitness | Desired fitness value | 1.0 |
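GALGO evolves fixed-size chromosomes of feature indices until a goal fitness or a generation limit is reached. The toy genetic algorithm below mirrors the parameters in Table 5 (chromosome size, number of solutions, generation limit, goal fitness) with elitist selection and point mutation; it is an illustrative sketch, and the fitness function is supplied by the caller, whereas GALGO uses cross-validated classifier accuracy.

```python
import random

def ga_select(n_features, fitness, chrom_size=5, pop_size=50,
              max_generations=300, goal_fitness=1.0, seed=0):
    """Tiny GA feature search in the spirit of GALGO: each chromosome is
    a fixed-size list of feature indices; the better half survives each
    generation and the rest are mutated copies of survivors."""
    rng = random.Random(seed)
    pop = [rng.sample(range(n_features), chrom_size) for _ in range(pop_size)]
    best = max(pop, key=fitness)
    best_fit = fitness(best)
    for _ in range(max_generations):
        if best_fit >= goal_fitness:
            break                                  # goal fitness reached
        scored = sorted(pop, key=fitness, reverse=True)
        elite = scored[: pop_size // 2]            # keep the better half
        pop = [chrom[:] for chrom in elite]
        while len(pop) < pop_size:                 # refill with mutants
            child = rng.choice(elite)[:]
            slot = rng.randrange(chrom_size)       # point mutation
            child[slot] = rng.choice([f for f in range(n_features)
                                      if f not in child])
            pop.append(child)
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > best_fit:
            best, best_fit = gen_best[:], fitness(gen_best)
    return best, best_fit
```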
Table 6. Selected features of elastic net analysis.
| Feature | Coefficient |
|---|---|
| AGE | 0.010 |
| UREA | 0.001 |
| CRE | 0.494 |
| LIPIDS-TX | 0.406 |
| GFR | −0.004 |
| GB | 0.310 |
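The elastic net combines L1 and L2 penalties, zeroing out weak coefficients while shrinking the rest; glmnet [45] fits it by coordinate descent. A bare-bones sketch of that update rule for a linear model, assuming roughly standardized predictors (illustrative, not the glmnet implementation):

```python
def elastic_net(X, y, alpha=1.0, l1_ratio=0.5, n_iter=500):
    """Coordinate-descent elastic net: each coefficient is updated by
    soft-thresholding its partial-residual correlation (the L1 part can
    set it exactly to zero); the L2 part inflates the denominator."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    l1, l2 = alpha * l1_ratio, alpha * (1 - l1_ratio)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual excluding feature j
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n + l2
            if rho > l1:                  # soft-thresholding step
                beta[j] = (rho - l1) / z
            elif rho < -l1:
                beta[j] = (rho + l1) / z
            else:
                beta[j] = 0.0
    return beta
```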
Table 7. Evaluation metrics for the model with 32 features.
| AUC | SN | SP |
|---|---|---|
| 0.60 | 0.545 | 0.455 |
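Here and in the following tables, sensitivity (SN) is the recall on cases and specificity (SP) the recall on controls. A minimal sketch of how both fall out of the confusion matrix; the confusion counts in the usage example are an assumed illustration for the 11/11 split (the paper does not publish its confusion matrix), chosen to reproduce the 0.909/0.818 pair reported for the GALGO model.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity (TP rate on cases) and specificity (TN rate on
    controls) from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical 10 TP / 1 FN / 9 TN / 2 FP split over 11 cases, 11 controls.
y_true = [1] * 11 + [0] * 11
y_pred = [1] * 10 + [0] + [0] * 9 + [1] * 2
sn, sp = confusion_metrics(y_true, y_pred)   # 10/11 = 0.909, 9/11 = 0.818
```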
Table 8. Performance of the models together with the number of variables selected by each feature selection method.

| Feature Selection Method | Number of Features | Features | AUC | SN | SP |
|---|---|---|---|---|---|
| Boruta | 1 | CRE | 0.61 | 0.455 | 0.636 |
| Galgo | 3 | CRE, UREA, LIPIDS.TX | 0.80 | 0.909 | 0.818 |
| Elastic net | 6 | AGE, UREA, CRE, LIPIDS.TX, GFR, GB | 0.75 | 0.720 | 0.720 |
| Multivariate | 7 | GLU, UREA, CRE, TCHOLU, UTG, GFR, GB | 0.74 | 0.636 | 0.545 |
Table 9. Performance comparison among different cross-validation strategies and classification models.
| Cross-Validation Strategy | Classifier | AUC | SN | SP |
|---|---|---|---|---|
| LOOCV + 3 features (GALGO) | RF | 0.790 | 0.818 | 0.818 |
| LOOCV + 3 features (GALGO) | SVM | 0.760 | 0.727 | 0.818 |
| 10-fold CV + 3 features (GALGO) | RF | 0.800 | 0.909 | 0.818 |
| 10-fold CV + 3 features (GALGO) | SVM | 0.750 | 0.727 | 0.727 |
Table 10. Comparative overview of our study and previous investigations.
| Author and Year | Feature Selection Method | # of Features | Classification Algorithm | Evaluation Metrics |
|---|---|---|---|---|
| Rodriguez-Romero et al. 2019 [20] | InfoGain | 6 | 1R; J48; RF; SL; SMO; NB | 1R: SN 0.871, SP 0.979, ACC 0.871; J48: SN 0.887, SP 0.249, ACC 0.887; RF: SN 0.887, SP 0.999, ACC 0.887; SL: SN 0.899, SP 0.183, ACC 0.899; SMO: SN 0.893, SP 0.159, ACC 0.893; NB: SN 0.804, SP 0.355, ACC 0.803 |
| Jiang et al. 2020 [21] | LASSO | 9 | LR | C-INDEX: 0.934 |
| Shi et al. 2020 [22] | LASSO, LR | 10 | LR | AUC: 0.807; C-INDEX: 0.807 |
| Xi et al. 2021 [23] | LASSO, LR | 10 | LR | AUC: 0.813; C-INDEX: 0.819 |
| Maniruzzaman et al. 2021 [24] | PCA | 10 | LDA; SVM-RBF; LR; KNN; NB; ANN | LDA: AUC 0.880, SN 0.867, SP 0.852; SVM-RBF: AUC 0.910, SN 0.867, SP 0.863; LR: AUC 0.890, SN 0.800, SP 0.823; KNN: AUC 0.890, SN 0.833, SP 0.849; NB: AUC 0.900, SN 0.717, SP 0.904; ANN: AUC 0.900, SN 0.800, SP 0.823 |
| Yang and Jiang 2022 [25] | Univariate, AIC | 7 | LR | AUC: 0.758; C-INDEX: 0.758 |
| Proposed | GALGO | 3 | RF | AUC: 0.800; SN: 0.909; SP: 0.818 |

Share and Cite

MDPI and ACS Style

Maeda-Gutiérrez, V.; Galván-Tejada, C.E.; Galván-Tejada, J.I.; Cruz, M.; Celaya-Padilla, J.M.; Gamboa-Rosales, H.; García-Hernández, A.; Luna-García, H.; Villalba-Condori, K.O. Evaluating Feature Selection Methods for Accurate Diagnosis of Diabetic Kidney Disease. Biomedicines 2024, 12, 2858. https://doi.org/10.3390/biomedicines12122858
