Optimal Feature Selection-Based Dental Caries Prediction Model Using Machine Learning for Decision Support System

The high frequency of dental caries is a major public health concern worldwide. The condition is common, particularly in developing countries. Because there are no evident early-stage signs, dental caries frequently goes untreated. Meanwhile, early detection and timely clinical intervention are required to slow disease development. Machine learning (ML) models can benefit clinicians in the early detection of dental cavities through efficient and cost-effective computer-aided diagnoses. This study proposed a more effective method for diagnosing dental caries by integrating the GINI and mRMR algorithms with the GBDT classifier. Because just a few clinical test features are required for the diagnosis, this strategy could save time and money when screening for dental caries. The proposed method was compared to recently proposed dental procedures. Among these classifiers, the suggested GBDT trained with a reduced feature set achieved the best classification performance, with accuracy, F1-score, precision, and recall values of 95%, 93%, 99%, and 88%, respectively. Furthermore, the experimental results suggest that feature selection improved the performance of the various classifiers. The suggested method yielded a good predictive model for dental caries diagnosis, which might be used in more imbalanced medical datasets to identify disease more effectively.


Introduction
Oral health is essential for general health and quality of life. Oral health means being free from throat cancer, infections and sores in the mouth, gum disease, tooth loss, dental caries, and other diseases so that no disturbances limit biting, chewing, smiling, speaking, and psychosocial well-being. One of the most common oral health diseases is dental caries. Dental caries is a health issue caused by residual food that attaches to the teeth, causing calcification. Teeth become porous, hollow, and sometimes fractured as a result. Dental caries is a disease associated with the hard tissues of the teeth, namely, enamel, dentin, and cementum. It appears as decayed areas on the teeth and occurs as the mineral surface of the teeth gradually dissolves, with the decay continuing to progress toward the inside of the teeth.
According to Global Burden studies in 2019, dental caries is the most frequent oral illness impacting around 3.5 billion individuals, with 2 billion suffering from permanent dental caries [1]. Moreover, in 2020, more than 6 million patients with dental caries in the Republic of Korea visited dentists, of which 1.45 million were children (0~9 years old) [2]. Therefore, dental caries illness is a challenge for scientists to address.
Recent research on dental caries was conducted by Rimi et al. [3], who discussed the prediction of dental caries using machine learning (ML). In their experiments, they used nine algorithms, including K-Nearest Neighbor (KNN), logistic regression (LR), support vector machine (SVM), random forest (RF), naïve Bayes (NB), and classification and regression trees. In another feature-selection study, the proposed FPFS-kNN algorithm, designated FPFS-kNN (P) and based on Pearson's correlation coefficient, outperformed the state-of-the-art kNN-based algorithms in 24 of 35 datasets in terms of each studied measure and in 31 of 35 datasets in terms of the accuracy measure. Furthermore, the results demonstrate that FPFS-kNN (P) outperformed the others in 29 datasets for accuracy and macro F-score rates and in 24 datasets for precision, recall, and macro F-score rates.
Over the last few years, gene expression data have been widely used with ML and computational tools. Several matrix-factorization-based dimensionality reduction algorithms have been developed in gene expression analyses. However, such systems can be improved in terms of efficiency and dependability. The research conducted by Farid Saberi-Movahed et al. [14] proposed a Dual Regularized Unsupervised Feature Selection Based on Matrix Factorization and Minimum Redundancy (DR-FS-MFMR), a new approach to feature selection. The primary goal of DR-FS-MFMR is to remove unnecessary features from the original feature set. To achieve this goal, the primary feature selection problem is stated in terms of two aspects: (1) the data matrix factorization in the feature weight matrix and the representation matrix and (2) the correlation information connected to the selected feature set. The objective function is then enhanced by applying two data representation features and an inner product regularization criterion to complete the redundancy minimization process and the sparsity task more precisely. A vast number of experimental investigations on nine gene expression datasets were carried out to demonstrate the competency of the DR-FS-MFMR approach. The computational results show that DR-FS-MFMR is efficient and productive for gene selection tasks.
The research conducted by Saeid Azadifar et al. [15] proposed a unique graph-based relevancy-redundancy gene selection method for cancer diagnosis that can effectively eliminate redundant and irrelevant genes. The proposed method uses a unique algorithm that finds gene clusters (i.e., maximum-weighted cliques). By grouping similar genes, the proposed method prevents the selection of redundant and similar genes. The provided findings show that the performance of the created similarity measure outperforms all other measures. On a colon dataset, the classification accuracy was 87.89%; on a Small-Round-Blue-Cell Tumor (SRBCT) dataset, it was 83.19%; on a Leukemia dataset, it was 92.14%; and on a Prostate Tumor dataset, it was 83.73%.
The main contributions of this study are as follows: to identify the most important characteristics needed to improve the detection of dental caries, and to obtain the most performant model for dental caries prediction via comparative experimentation on a novel combination of the GINI and mRMR techniques, which allows imbalanced dental data with a high number of features to be classified with GBDT to achieve efficient performance. The ML model described in this study efficiently provides patients with superior-quality dental services, including diagnosis and treatment.
The rest of this paper is structured as follows: Section 2 describes the caries prediction model proposed in this study. The experimentation and results are presented in Section 3. Section 4 presents the discussion, while Section 5 concludes the paper.

Materials and Methods
This section presents the dataset, data preprocessing, an overview of the different methods used in this study, and the proposed prediction model employed for predicting dental caries. Five prediction methods were trained using the reduced subsets of features obtained via feature selection algorithms, and the performances of the ML methods were compared. A conceptual representation of the proposed dental caries prediction model is shown in Figure 1. Following data collection via the GMFT survey, a preprocessing operation was performed in which feature selection (Chi-Square, Relief F, mRMR, and Correlation) and feature importance (Chi-Square + GINI, Relief F + GINI, mRMR + GINI, Correlation + GINI) methods were used to curate the dataset and obtain the optimal dataset. Finally, the prediction algorithms were fine-tuned, and experimentation was carried out. The best performer was chosen as the final classifier.

Data Collection
The dataset used in this study is the 2018 Children's Oral Health Survey conducted by the Korea Centers for Disease Control and Prevention. The data were collected by dentists visiting each institution for an oral examination survey. A total of 22,288 respondents were surveyed, and the oral health awareness questionnaire was selected and used for the survey. The oral health questionnaire consisted of a total of 43 items and 1 label, including age, gender, place of residence, snack frequency, tooth brushing frequency, oral care use, smoking experience, oral health awareness, and behavior, with act_caries as the label. These data are not subject to Institutional Review Board (IRB) approval, as they do not record patients' personal information. Complete descriptions of the questionnaire items are available in Table S1 (see the supplementary file).

Data Preprocessing
Data cleaning before building ML models is an essential step in increasing the efficiency and accuracy of the models [16], hence preventing bias or degradation of the model performance. The raw data were stripped of any empty or less critical features. The most significant features were chosen using feature importance and feature selection techniques (the methods used are described in the supplementary file). Features that were redundant, less significant, or unrelated to the target variable were discarded, leaving just the optimal features. The selected optimal features were scaled (using the Min-Max algorithm) to values between 0 and 1 to improve training speed and model performance and to facilitate more effective learning and the understanding of the task.

Definition 1. The normalization of v is defined by

v′ = (v − v_min) / (v_max − v_min). (1)

In Equation (1), v′ is the transformed value, v is the original value, v_min is the minimum value of the column, and v_max is the maximum value of the column. The Min-Max method is sensitive to the presence of outliers but does not change the original content of the data.
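Equation (1) can be sketched in code as follows (a minimal pure-Python illustration; the helper name `min_max_scale` and the sample values are hypothetical, not part of the study's implementation):

```python
def min_max_scale(column):
    """Rescale a list of numeric values to the [0, 1] range, as in Equation (1)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: map every value to 0.0 to avoid dividing by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Example: a hypothetical count-valued survey column rescaled to [0, 1]
print(min_max_scale([2, 4, 6, 10]))  # -> [0.0, 0.25, 0.5, 1.0]
```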

This study adopted the Synthetic Minority Oversampling Technique (SMOTE), considered the most widely used oversampling technique for generating synthetic data [17]. This technique addresses the class imbalance present in the experimental data, thereby increasing the classification accuracy by correcting the bias problem [18]. Our dataset contained an uneven data distribution, with 20,593 samples representing patients without dental caries (labeled 0) and 1695 samples representing those with dental caries (labeled 1), as seen in Figure 2a. The data distribution with respect to the target variable before and after applying SMOTE is shown in Figure 2.
Figure 2. Data distribution (a) before and (b) after applying SMOTE.
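The idea behind SMOTE can be sketched as follows (a toy pure-Python illustration of the interpolation step under simplified assumptions, not a full implementation; the function name and sample points are hypothetical):

```python
import random

def smote_oversample(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours by squared Euclidean distance, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        n = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, n)))
    return synthetic

# Three hypothetical minority-class samples, oversampled with five synthetic points
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1)]
new_points = smote_oversample(minority, n_new=5)
print(len(new_points))  # -> 5
```

Each synthetic point lies on the segment between two existing minority samples, so the oversampled class stays inside the region occupied by the original minority data.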

Methods Used in the Proposed Prediction Model
This section describes the mRMR feature selection techniques, the GINI feature importance, and the GBDT algorithms. These were the three methods utilized in the design of the proposed model.

Minimum Redundancy-Maximum Relevance (mRMR)
The features are rated according to their relevance to the target variable in the mRMR feature selection approach. The redundancy of the features is taken into account when ranking them. The feature with the highest rank in mRMR is the one with the most relevance to the target variable and the least redundancy among the characteristics. Redundancy and relevance are measured using Mutual Information (MI) [19][20][21].

Definition 2. The mutual information of two random variables X and Y is defined by

I(X, Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p_1(x) p_2(y)) ]. (2)

In this equation, p(x, y) signifies the joint probability distribution function of the random variables X and Y, whereas p_1(x) and p_2(y) define the marginal probability distribution functions of X and Y, respectively. When two random variables are totally independent, the mutual information is 0. It is symmetric and cannot take a negative value (I(X, Y) ≥ 0, I(X, Y) = I(Y, X)). Assume that S is the set of features to be chosen and that |S| denotes the number of items in this collection. The two criteria listed above must be met to ensure that the chosen feature set is the most efficient set [12].
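For discrete variables, the mutual information used by mRMR can be estimated directly from empirical frequencies. The following is a minimal pure-Python sketch (the function name and toy data are illustrative, not from the study):

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical I(X, Y) in bits for two discrete variables given as paired samples."""
    n = len(xs)
    p_xy = Counter(zip(xs, ys))   # joint counts
    p_x, p_y = Counter(xs), Counter(ys)  # marginal counts
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / n
        # pxy / (p_x * p_y) with the counts rewritten as probabilities
        mi += pxy * log2(pxy * n * n / (p_x[x] * p_y[y]))
    return mi

x = [0, 0, 1, 1]
print(mutual_information(x, x))             # identical variables: I(X, X) = H(X) = 1 bit
print(mutual_information(x, [0, 1, 0, 1]))  # empirically independent pair: 0 bits
```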


GINI
GINI impurity (or the GINI index) is a statistic used in decision trees to determine how to divide the data into smaller groups. It estimates the frequency with which a randomly selected element from a set would be erroneously classified if it were randomly labeled according to the distribution of labels in a subset [22]. GINI importance (also known as the mean decrease in impurity) is one of the most often used measures for estimating feature importance. It is simple to implement and readily available in most tree-based algorithms, such as RF and gradient boosting. The GINI importance of a feature gauges its effectiveness in reducing uncertainty or variance during decision tree construction. Thus, each time a tree splits on a particular feature, the resulting decrease in GINI impurity is added to that feature's total importance [23].
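The impurity calculation itself is short; a minimal pure-Python sketch (the helper name and toy label lists are hypothetical):

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: the probability that a randomly drawn element is mislabeled
    when labels are assigned according to the empirical class distribution."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity([0, 0, 0, 0]))  # -> 0.0 (pure node)
print(gini_impurity([0, 0, 1, 1]))  # -> 0.5 (maximally mixed two-class node)
```

A split's importance contribution is then the parent impurity minus the weighted average impurity of the two child nodes it produces.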

Gradient Boosting Decision Tree (GBDT)
Gradient boosting is a type of ML technique used for regression and classification problems that produces a prediction model in the form of an ensemble of weak prediction models (typically decision trees). Like other boosting approaches, it constructs the model in stages and generalizes them by allowing the optimization of an arbitrary differentiable loss function [24][25][26].

Definition 5. The initial constant value of the model is obtained as

f_0(x) = arg min_c Σ_{i=1}^{N} L(y_i, c),

where L(y_i, c) is the loss function.

Definition 6. For each iteration j = 1, 2, . . . , J, the residual along the negative gradient direction is written as

r_{ij} = −[∂L(y_i, f(x_i)) / ∂f(x_i)]_{f = f_{j−1}}, i = 1, 2, . . . , N.

Definition 7. A regression tree is fitted to the residuals r_{ij}, and the corresponding optimal fitting value for each of its leaf regions R_{jm} is computed as follows:

c_{jm} = arg min_c Σ_{x_i ∈ R_{jm}} L(y_i, f_{j−1}(x_i) + c).

Definition 8. The following depicts the model update operation:

f_j(x) = f_{j−1}(x) + Σ_m c_{jm} I(x ∈ R_{jm}).
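Under squared-error loss, Definitions 5-8 reduce to repeatedly fitting each new tree to the current residuals. The following is a minimal pure-Python sketch of that recurrence using depth-1 trees (stumps) on a toy one-dimensional dataset; it illustrates the boosting updates, not the study's GBDT implementation:

```python
def fit_stump(xs, rs):
    """Fit a depth-1 regression tree (stump) to residuals rs by minimizing
    squared error over candidate split points."""
    best = None
    for split in sorted(set(xs)):
        left = [r for x, r in zip(xs, rs) if x <= split]
        right = [r for x, r in zip(xs, rs) if x > split]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

def gbdt_fit(xs, ys, n_rounds=20, lr=0.5):
    """Definitions 5-8 with squared loss: start from the mean (Definition 5),
    fit each stump to the negative-gradient residuals (Definitions 6-7),
    and add it to the running model (Definition 8)."""
    f0 = sum(ys) / len(ys)
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # -dL/df for squared loss
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + lr * sum(s(x) for s in stumps)

xs, ys = [1, 2, 3, 4], [0.0, 0.0, 1.0, 1.0]
model = gbdt_fit(xs, ys)
print(model(1), model(4))  # converges toward the targets 0.0 and 1.0
```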

Dental Caries Prediction Model
This section describes the proposed prediction model. The model receives three parameters: the training dataset S, the number of features to be considered, and the threshold of feature importance to be considered. At every iteration over the set of features, the weight importance of the feature is computed, and mRMR is used to select the subset of features from the features in S. The mRMR score is calculated to rate the features according to their relevance to the target feature while considering redundancy. The features are ranked according to their mRMR scores, and a subset of features is selected according to the specified threshold. From the selected subset of features H, a sub-subset of features is selected via Gini Impurity (GI).
Four machine learning algorithms (RF, LR, SVM, and GBDT) and one Deep Learning (DL) model, Long Short-Term Memory (LSTM), were experimented on using the sub-subset of features. The RandomizedSearch algorithm was employed to find the best hyperparameters and reduce unnecessary computation, thereby addressing the drawback of GridSearchCV. Once the best parameters were found, training was performed. The best classifier (GBDT) was obtained after comparing the model evaluation metrics (accuracy, F1-score, recall, precision, and area under the receiver operating characteristic curve (AUROC)). The complete algorithm of the proposed model is presented in Algorithm 1.

Algorithm 1: Proposed model for Dental Caries Prediction.
Input: Training dataset S := (x_1, y_1), (x_2, y_2), . . . , (x_n, y_n); F_n, the number of features to be selected; and θ, the threshold of feature importance.
Output: the highest-performing classification approach C.
1. for each feature f_i in S: compute the weight importance of the feature.
2.   relevance = mutualInformation(f_i, y)
     redundancy = 0
     for each feature f_j in S:
         redundancy += mutualInformation(f_i, f_j)
     end for
     mrmrValues[f_i] = relevance − redundancy
3. end for
4. H = sort(mrmrValues).take(θ): select a subset of the most important features from the total set S.
5. for each feature f_k in H: rank features by their predictive power and select the most important ones.
   • Compute the Gini impurity of the target variable on f_k as GI = GiniImpurity(f_k).
   • Split the data samples x from f_k into two subsets x_{1∼k} and x_{k∼n}, then compute the average Gini impurity of the two subsets.
   • Store in the list of feature importances the Gini impurity decrease caused by splitting the data on the current feature.
   end for
6. Normalize the feature importances so that they sum to 1.
7. Select a final subset of features based on the feature importances obtained.
8. Train the candidate classifiers on the selected subset and return the best performer C.

Experimentation
This section describes the experimental data, the hyperparameter tuning process for the various ML models, and the caries prediction model results. The initial stage in the experiment was to construct different subsets of features from the dataset using feature selection and feature importance methods (explained in Section 2.3 and in the Supplementary Material). Following that, the ML models were applied to the complete feature set and the different subsets of features to demonstrate the benefit of feature selection. Finally, the results of each model were compared to determine which model performed best.

Dataset
The 2018 pediatric oral health examination data were used to train, validate, and evaluate the proposed method's effectiveness and performance. This dataset has 43 attributes in total. The Chi-Square, Relief F, and mRMR techniques were utilized, with the number of features (k) set to forty-three, forty, thirty-five, thirty, twenty-five, twenty, fifteen, ten, and five per experiment. The correlation method was also used to determine which variables to investigate further and to perform rapid hypothesis testing. A cutoff of 0.85 was set while performing the correlation technique, and features with a correlation value greater than the threshold were removed from the dataset's 43 features. In addition, GINI feature importance was applied on top of these four methods (Chi-Square, Relief F, mRMR, and Correlation), so the dataset was subjected to eight different approaches in total. The different numbers of features to which the feature selection and importance techniques were applied are shown in Table S2 (see the Supplementary File). The study used feature importance with the GBDT, RF, and LR models but not with LSTM and SVM: because SVM lacks intrinsic feature importance properties, the technique cannot be applied to it [27], and for LSTM, only the permutation importance algorithm can be applied [28]. The dataset attribute descriptions are available in Table S1 (see the Supplementary File).
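The correlation-based filtering step described above can be sketched as follows (a pure-Python illustration with hypothetical feature names and values; the study applied the 0.85 cutoff to its 43 survey features):

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sqrt(sum((x - ma) ** 2 for x in a))
    sb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def drop_correlated(features, cutoff=0.85):
    """Scan features in order; drop any feature whose |r| with an
    already-kept feature exceeds the cutoff."""
    kept = {}
    for name, col in features.items():
        if all(abs(pearson(col, kc)) <= cutoff for kc in kept.values()):
            kept[name] = col
    return list(kept)

# Hypothetical survey columns: the second is perfectly correlated with the first
features = {
    "snack_freq":  [1, 2, 3, 4, 5],
    "snack_score": [2, 4, 6, 8, 10],
    "brush_freq":  [3, 1, 4, 1, 5],
}
print(drop_correlated(features))  # -> ['snack_freq', 'brush_freq']
```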

Hyperparameters of Different Machine Learning Models
A well-selected training dataset and properly tuned ML algorithms were necessary to accurately predict dental caries. In this study, five models were employed, with 80% of the data used for training and validation, while the remaining 20% were set aside for testing and determining the best model for caries prediction. The hyperparameter tuning was carried out through RandomizedSearch instead of GridSearch. Table 1 shows the models used, the different feature selection and feature importance techniques, the hyperparameters of each model, and the optimal hyperparameter values achieved using the RandomizedSearch algorithm.
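The randomized search strategy can be sketched as follows (a minimal pure-Python illustration with a toy objective standing in for cross-validated accuracy; the parameter grid and function names are hypothetical, not the study's actual search space):

```python
import random

def randomized_search(evaluate, space, n_iter=20, seed=42):
    """Sample n_iter random configurations from the search space instead of
    exhaustively enumerating the grid (the drawback of GridSearchCV)."""
    rng = random.Random(seed)
    best_score, best_params = float("-inf"), None
    for _ in range(n_iter):
        params = {k: rng.choice(v) for k, v in space.items()}
        score = evaluate(params)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score

# Toy objective: peaks at n_estimators=100 and learning_rate=0.1
space = {"n_estimators": [50, 100, 200], "learning_rate": [0.01, 0.1, 0.3]}
def evaluate(p):
    return -abs(p["n_estimators"] - 100) / 100 - abs(p["learning_rate"] - 0.1)

best, score = randomized_search(evaluate, space)
print(best, score)
```

With enough draws, the sampled configurations cover the grid with high probability while keeping the number of (expensive) model evaluations fixed at n_iter.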

Results
This section discusses the results obtained from the performed experimentation. The ML models were trained with the complete feature set first and then with various reduced sets of features to demonstrate the effects of the feature selection and importance techniques on the classification performance. Out of the 80% portion of the dataset used for cross-validation training, 20% was utilized to evaluate model efficacy, highlight selection bias or overfitting issues, and provide insights into how the model would generalize to an independent dataset. The results described below are the test data results and present only the best accuracy achieved under each experimental condition.

Performance of the Classifiers without Feature Selection
This subsection presents the experimental findings achieved when training the GBDT, RF, LR, SVM, and LSTM models with the full feature set. A summary of these observations is provided in Table 2. Further, Figure 3 depicts the AUC-ROC curves for the GBDT, RF, LR, and SVM algorithms, as well as the training vs. validation accuracy and loss for the LSTM model. The findings reveal that RF outperformed the other ML classifiers, with F1-score, precision, recall, and accuracy values of 0.87, 0.92, 0.86, and 0.91, respectively.
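The evaluation metrics reported here can be computed from the confusion-matrix counts as follows (a minimal pure-Python sketch; the helper name and the label lists are hypothetical):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1-score from paired binary label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Toy example: 2 true positives, 1 false negative, 1 false positive, 4 true negatives
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))  # accuracy 0.75; precision, recall, and F1 each 2/3
```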
Performance of the Classifiers with Feature Selection
Various reduced feature sets were produced in this study using the Chi-Square, Relief F, mRMR, Correlation, and GINI methods. The combination of the mRMR feature selection and GINI importance algorithms enabled the selection of the features with the highest relevance to the label. The results of the experiments performed using the various subsets of features are shown in Table 3 and Figure 4. Table 3 depicts the feature selection procedures (mRMR + GINI, Relief F + GINI, Chi-Square + GINI, Relief F, and Chi-Square) used to construct the datasets, with different numbers of features chosen depending on their importance in the prediction of the target variable.
The results obtained with the reduced feature sets were compared to show the efficacy of feature selection on model performance. As seen in Table 3, all the classifiers outperformed their respective performances when trained on the complete feature set (as shown in Table 2). The F1-score, precision, recall, accuracy, and AUC values of the proposed model (combining mRMR + GINI and GBDT) were 93.79%, 99.84%, 88.44%, 95.19%, and 95%, respectively, outperforming the RF, LR, SVM, and LSTM models both before and after feature selection, as well as the conventional GBDT. This gain in performance highlights the usefulness of feature selection in the ML models. As a result, the proposed model is an effective method for predicting dental caries. Table 3 summarizes the study's findings and presents each model's performance; the proposed approach demonstrated the best predictive performance. As shown in Tables 3 and S2, each model's appropriate number of features varied. Table 3 presents the best performance produced via various optimal feature sets. The accuracy of mRMR + GINI paired with GBDT was 95.19%, which was 5.53% higher than that of the traditional GBDT. RF (with Relief F + GINI for feature selection) produced an accuracy of 95.13%, 5.08% higher than the conventional RF; similarly, LSTM (with Chi-Square + GINI), SVM (with Relief F), and LSTM (with Chi-Square) achieved accuracies of 82.56%, 90.39%, and 84%, respectively. These results were 0.53%, 9.11%, and 9.33% higher than those of their counterparts trained on the complete feature sets, as shown in Table 2.

Discussion
Dental caries is rapidly increasing and recognized as a significant public health problem beyond personal health care. In addition, the prevention and early detection of dental caries are critical for reducing the social costs of dental caries that will occur in the future. Currently, a caries diagnosis is performed using radiographs or probes. According to the clinical experience of dental specialists, these procedures demonstrate significant variations in the accuracy and reliability of dental caries diagnoses. Given the difficulties and limitations of diagnoses based on the clinician's subjective opinion and experience, research on and the development of ML-based dental caries decision support systems (DSSs) are required. These DSS tools will aid in the prevention of dental caries, the management of oral hygiene, the improvement of dietary-caries-related food habits, and the reduction of diagnostic time and expense.
The main purpose of this study was to construct a dataset comprising the most relevant features to enable an effective prediction of dental caries. Moreover, a model was proposed that can effectively and accurately classify dental caries using the selected feature set. Four ML algorithms (RF, LR, SVM, and GBDT) and one DL were trained on both the complete feature set and the optimized datasets (consisting of reduced feature sets).
The present study used the Chi-Square, Relief F, mRMR, Correlation, and GINI methods to obtain the optimal dataset. The most optimal dataset consisted of a feature set reduced to 18 out of 43 features, resulting in a 6% improvement in accuracy over training with the complete feature set. The experimental results reveal that the models trained on the reduced feature sets performed better than their respective counterparts trained on the complete feature sets. The proposed model trained on the reduced set of features remarkably predicted dental caries and yielded better performances than the previously published questionnaire-based dental caries prediction DSS.
Karhade, D. S. et al. [22] used data from 6404 people to study the classification of early childhood dental caries. Their proposed model yielded an AUC-ROC of 0.74, a sensitivity of 0.67, and a positive predictive value of 0.64. Park, Y. H. et al. [24] performed a similar study in children under the age of 5 and reported AUC-ROC values of 0.784 for LR, 0.785 for XGBoost, 0.780 for RF, and 0.780 for LightGBM. Our findings outperformed those of Karhade, D. S. et al. [22] and Park, Y. H. et al. [24]: the proposed prediction model's AUC-ROC values were 0.96, 0.95, 0.89, 0.90, and 0.83 for RF, GBDT, LR, SVM, and LSTM, respectively. The current study's data collection included 22,288 survey data samples. The proposed (mRMR + GINI and GBDT) model in our work achieved an accuracy of 95%, an F1-score of 93%, a precision of 99%, and a recall of 88%. However, because the data utilized were from the Korean child population, the performance analysis is only indicative of the Korean population rather than other groups. At the same time, an absolute comparison with other published studies was not possible because privacy restrictions prevent the data from being made publicly available.

Conclusions
The analysis of dental caries is one of the most frequent topics for modern-day oral health care research due to its severity and alarming rising rate. This study proposed an approach that combines mRMR + GINI feature selection and the GBDT classifier to improve the detection of dental caries. Five ML classifiers (LR, RF, SVM, LSTM, and GBDT) were implemented as benchmarks for a performance comparison. To start, the relevance of the various attributes was computed using Chi-Square, Relief F, mRMR, Correlation, and GINI. The classifiers were then trained using both the reduced and complete feature sets. The experimental results show that the feature selection improved the classifiers' performance. The proposed approach achieved superior performance over the other classifiers and methods in the recent literature. Therefore, combining mRMR + GINI and GBDT is a practical approach to dental caries detection and can be potentially applied for the early detection of dental caries through DSS diagnosis tools.
In this experiment, feature importance was computed through the GINI technique. The GINI technique could not be used with the SVM and LSTM models and could only be used with LR, RF, and GBDT. In future work, we plan to use datasets collected from Korea and other countries to analyze similarities in dental caries occurrence. In addition, we intend to apply permutation importance (a feature importance technique) to improve the models' performance. The proposed model shows promising performance and can be applied as a diagnostic aid to identify patients with dental caries. It can also support practical suggestions for caries prevention and treatment plan design. Through this, it is possible to drastically reduce patient diagnosis time and the social costs of tooth decay.