Utilization of Machine Learning Methods for Predicting Orthodontic Treatment Length

Abstract: Treatment duration is one of the most important factors that patients consider when deciding whether to undergo orthodontic treatment or not. This study aimed to build and compare machine learning (ML) models for the prediction of orthodontic treatment length and to identify factors affecting the duration of orthodontic treatment using the ML approach. Records of 518 patients who had successfully finished orthodontic treatment were used in this study. Seventy percent of the patient data were used for training ML models, and thirty percent of the data were used for testing these models. We applied and compared nine machine-learning algorithms: simple linear regression, modified simple linear regression, polynomial linear regression, K nearest neighbor, simple decision tree, bagging regressor, random forest, gradient boosting regression, and adaboost regression. We then calculated the importance of patient data features for the ML models with the highest performance. The best overall performance was obtained through the bagging regressor and adaboost regression ML methods. The most important features in predicting treatment length were age, crowding, artificial intelligence case difficulty score, overjet, and overbite. Without patient information, several ML algorithms showed comparable performance for predicting treatment length. Bagging and adaboost showed the best performance when patient information, including age, malocclusion, and crowding, was provided.


Introduction
Treatment duration is one of the most important factors that patients consider when deciding whether to undergo orthodontic treatment [1]. An exact and accurate prediction of the duration of the total orthodontic treatment might motivate patients or prepare them for what to expect (Mavreas and Athanasiou, 2008) [2]. Additionally, a reliable idea of the treatment duration helps the orthodontist to better plan the overall treatment and the sequence of appointments (Fink and Smith, 1992; Mavreas and Athanasiou, 2008) [1,2]. Earlier studies reported that orthodontic treatment employing fixed appliances typically lasts 14 to 33 months (Kafle et al., 2019; Tsichlaki et al., 2016) [3,4] with a mean of around 22 to 24 months, depending on the discrepancy being treated (Aljehani and Baeshen, 2018; Simister, 2007) [5].
As teeth have to be moved through the bone, one decisive factor influencing the speed of orthodontic tooth movement and thus treatment duration is bone metabolism, i.e., the ability of bone to remodel as a result of the applied force systems (Abbing et al., 2020) [7]. Bone metabolism depends, in part, on age, the bony structure itself, and/or systemic disease (Abbing et al., 2020; Kaur and El-Bialy, 2020; Landin-Ramos, 2020) [7][8][9]. One could approach the prediction of treatment duration via bone morphology. Here, the bone structure and density, the thickness of the cortical bone, and the structure of the spongious bone would have to be analyzed in detail. An approach using fractal analysis of panoramic X-ray images has recently been presented (Cesur et al., 2020) [10], while more classical approaches use indices of severity, such as the American Board of Orthodontics Discrepancy Index (ABO-DI), to give an answer to patients' frequent question, "When do I get my braces off?" (Aljehani and Baeshen, 2018) [5].
Artificial intelligence (AI) is bringing a paradigm shift to healthcare, powered by the increasing availability of healthcare data and the rapid progress of analytics techniques [11]. Machine learning (ML) is a subset of AI techniques, used to determine complex models and extract knowledge. In clinical practice, ML predictive models can assist the clinician in decision-making regarding individual patient care [12,13].
To our knowledge, ML has not been used to predict orthodontic treatment length. Therefore, our study aimed to build and compare ML models to predict orthodontic treatment length and to identify factors affecting the duration of orthodontic treatment using an ML approach.

Materials and Methods
We retrospectively evaluated the records of 631 patients who completed orthodontic treatment at All Care Orthodontics, Chicago, IL. Ethical approval (IRP Number 20193360) for this study was obtained from the research ethics committee of WIRB-Copernicus. All experiments were completed in accordance with approved guidelines.
The inclusion criteria were as follows: patients who had (1) received comprehensive orthodontic treatment; (2) successfully finished their orthodontic treatment without disruption during the treatment period; (3) a complete set of standard orthodontic records pretreatment and at the debond appointment; and (4) been treated by a board-certified orthodontist. The exclusion criteria were patients who had (1) received limited orthodontic treatment; (2) received phase one orthodontic treatment; (3) had treatment disrupted and, consequently, increased treatment length; (4) more than four failed appointments; (5) treatment under Medicaid coverage; and (6) craniofacial syndromes. A total of 518 patients met the inclusion criteria, and their records were used in this study.
The following parameters were collected for each patient: (1) gender, race, and age when treatment started; (2) commute distance to the orthodontic office in miles; (3) overjet, overbite, and maxillary and mandibular arch crowding, measured in mm; (4) malocclusion classification (I, II, and III); (5) actual treatment length, in months, from the bonding to the debonding appointment; (6) estimated treatment length determined by an orthodontist; (7) treatment difficulty estimated by artificial intelligence (AI score: 1, easy to 5, very difficult) using a deep learning model previously published by Talaat et al., 2021 [13].
The cases corresponding to each of the possible outcomes were divided into two groups: 70% of cases were used for ML training and the remaining 30% for ML testing. The same training and testing sets were used with every model to ensure a fair comparison. After each model was trained and optimized using 70% of the patient sample, the remaining 30% of cases served as the testing dataset to evaluate the model's predictive ability. We compared all models using three indicators: mean squared error (MSE) of the training data, MSE of the testing data, and coefficient of determination (R²) of the model on the entire dataset. Ideally, the testing MSE should be as low as possible. A training MSE that is much lower than the testing MSE usually indicates that the model is overfitting the training dataset. In addition, a higher R² score is desirable, as it represents the proportion of the variance in the dependent variable (actual treatment time) that is explained by the independent variables in a regression model. Furthermore, we analyzed residual values according to statistical best practices and generated feature importance and permutation importance for each model.
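The evaluation protocol described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the fixed random seed, the synthetic records, and the mean-only baseline predictor are all assumptions made for illustration, and only the Python standard library is used.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle once with a fixed seed and hold out `test_fraction` of the rows."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r2_score(actual, predicted):
    """Proportion of variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for the 518 records: (age_years, treatment_months) pairs.
records = [(10 + i % 30, 20.0 + (i % 7)) for i in range(518)]
train, test = train_test_split(records)

# Hypothetical baseline model: always predict the mean training duration.
baseline = sum(months for _, months in train) / len(train)
train_mse = mean_squared_error([m for _, m in train], [baseline] * len(train))
test_mse = mean_squared_error([m for _, m in test], [baseline] * len(test))
print(len(train), len(test), train_mse, test_mse)
```

Reusing the same seed for every model keeps the training and testing sets identical across models, which is what makes the MSE comparison fair.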

Results
This study used data from 518 patients, 281 females and 237 males. The mean patient age was 17.49 ± 8.15 years, and the mean patient treatment time was 26.10 ± 8.15 months; the mean crowding was 3.18 ± 3.64 mm for the maxillary arch and 2.79 ± 3.56 mm for the mandibular arch (negative crowding represents spacing); class I malocclusion was present in 299 cases, class II in 145, and class III in 74. The mean treatment difficulty estimated by AI score was 2.53 ± 0.81. The mean patient commute distance to the orthodontic office was 3.44 ± 4.979 miles (Table 1) (Figures 1 and 2).
The correlation between the variables shown in Figure 3 revealed that the overbite and overjet values were highly correlated (0.43). In addition, both maxillary and mandibular crowding values were highly correlated (0.51). All other pairs did not show significant correlations.
Different ML models behave differently when processing the inputs. Accordingly, the performance of these models also varies. For the ML algorithms evaluated, the following was observed: bagging and adaboost were the best models, with much lower MSE values for both training and testing datasets and a higher R² score to explain the variances (Table 2) (Figures 4 and 5).
The charts shown in the following figures identify the importance of each indicator in the ML models through feature importance and permutation importance. The R² scores of between 0.27 and 0.33 were significantly larger than the chance level, making it possible to extract individual feature importance and permutation importance to probe which features are most predictive.
Feature importance, as the name suggests, shows the importance of each feature variable in the model. For a complex model such as bagging, random forest, or adaboost, feature importance is averaged over all submodels. Permutation importance measures the decrease in a model score when the values of a single feature are randomly shuffled. This procedure breaks the relationship between the feature and the target. Therefore, the decrease in the model score indicates how much the model depends on the feature.
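The shuffling procedure behind permutation importance can be sketched as follows. This is a simplified illustration rather than the study's implementation: the function names, the toy model, and the synthetic data are hypothetical, and the model score is taken to be MSE, so a drop in score appears here as an increase in error.

```python
import random


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, X, y, feature_index, n_repeats=10, seed=0):
    """Average increase in MSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = mean_squared_error(y, [predict(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        # Shuffle only the chosen column, leaving the other features intact.
        column = [row[feature_index] for row in X]
        rng.shuffle(column)
        shuffled_X = [
            row[:feature_index] + [value] + row[feature_index + 1:]
            for row, value in zip(X, column)
        ]
        permuted = mean_squared_error(y, [predict(row) for row in shuffled_X])
        increases.append(permuted - baseline)
    return sum(increases) / n_repeats


def toy_model(row):
    """Hypothetical fitted model: uses feature 0 and ignores feature 1."""
    return 20.0 + 0.5 * row[0]


# Synthetic data in which the target depends on feature 0 only.
X = [[float(i), float(i % 3)] for i in range(30)]
y = [toy_model(row) for row in X]

importance_used = permutation_importance(toy_model, X, y, feature_index=0)
importance_unused = permutation_importance(toy_model, X, y, feature_index=1)
print(importance_used, importance_unused)
```

A feature the model never uses yields an importance of zero, because shuffling it leaves every prediction unchanged.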
With or without AI scores, the feature importance shows that patient age, maxillary crowding, and mandibular crowding are the three most predictive components in the Bagging model (Figure 6). Overjet, overbite, and race identification also have quite a significant feature importance.
We can see that patient age, maxillary crowding, and mandibular crowding are also the top predictive variables measured by permutation importance in the bagging model (Figure 7). In addition, Figure 5 shows that the AI score played an important role in the model including the AI score as a predictive variable.
With results very similar to the bagging model, the feature importance of the adaboost model (with or without an AI score; Figure 8) shows that patient age, maxillary crowding, and mandibular crowding are the three most predictive components. Overjet, overbite, and race identification also have significant feature importance.
The permutation importance results of adaboost (Figure 9) show results similar to those of the bagging models, with patient age, maxillary crowding, and mandibular crowding being more significant than other variables. In the adaboost model without an AI score, overjet stood out as the second most important variable.

Discussion
The ML models built in this study were used to predict the orthodontic treatment length based on multiple factors, including patient demographics, types of malocclusion, and measures of malocclusion severity such as crowding, overjet, and AI score for treatment difficulty. When we evaluated the performance of different ML models, we found that the bagging and adaboost models had better performance than the other ML models tested. Bagging, or bootstrap aggregating, is based on the decision tree model. It generates multiple samples of the training data via bootstrapping, trains a deep decision tree on each sample, and then outputs the averaged results of all trees, i.e., aggregating. Compared to regular decision tree models, bagging enjoys the benefits of high expressiveness and low variance. Adaboost is a complex boosting decision tree regression model that uses multiple subsequent trees of residuals to build a combined model, i.e., boosting. Adaboost assigns larger weights to outliers in each iteration of the boosting model building. This makes Adaboost especially efficient compared to other boosting methods [15]. We tested the performance of the ML models with and without the AI score [13]. Adding the AI score improved the ML models' performance, and this was especially evident with the bagging and adaboost models. The AI score is based on malocclusion detection and assessment by AI from clinical images, including crowding, spacing, deep bite, open bite, and crossbite [13]. The AI score is a novel method for assessing case difficulty, and its contribution supports the expectation that the more difficult the case, the longer the treatment duration.
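The bootstrap-aggregating procedure described above can be sketched in a few lines. This is an illustrative toy, not the model used in the study: the small hand-written regression tree, the depth limit, the number of trees, and the synthetic age and crowding data are all assumptions, and only the Python standard library is used.

```python
import random


def _sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(X, y, max_depth=3):
    """Recursively fit a regression tree; a leaf is the mean target value."""
    leaf = sum(y) / len(y)
    if max_depth == 0 or len(set(y)) == 1:
        return leaf
    best = None  # (squared error, feature index, threshold)
    for j in range(len(X[0])):
        # Skipping the largest value keeps both sides of every split non-empty.
        for threshold in sorted(set(row[j] for row in X))[:-1]:
            left = [t for row, t in zip(X, y) if row[j] <= threshold]
            right = [t for row, t in zip(X, y) if row[j] > threshold]
            score = _sse(left) + _sse(right)
            if best is None or score < best[0]:
                best = (score, j, threshold)
    if best is None:  # every feature is constant, so no split is possible
        return leaf
    _, j, threshold = best
    left_idx = [i for i, row in enumerate(X) if row[j] <= threshold]
    right_idx = [i for i, row in enumerate(X) if row[j] > threshold]
    return {
        "feature": j,
        "threshold": threshold,
        "left": fit_tree([X[i] for i in left_idx], [y[i] for i in left_idx], max_depth - 1),
        "right": fit_tree([X[i] for i in right_idx], [y[i] for i in right_idx], max_depth - 1),
    }


def predict_tree(tree, row):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree


def fit_bagging(X, y, n_trees=25, max_depth=3, seed=0):
    """Train one tree per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        trees.append(fit_tree([X[i] for i in idx], [y[i] for i in idx], max_depth))
    return trees


def predict_bagging(trees, row):
    """Aggregate by averaging the predictions of all trees."""
    return sum(predict_tree(tree, row) for tree in trees) / len(trees)


# Synthetic data: [age in years, crowding in mm] -> treatment length in months.
X = [[float(age), float(crowding)] for age in range(8, 40, 2) for crowding in range(6)]
y = [18.0 + 0.2 * row[0] + 1.5 * row[1] for row in X]
trees = fit_bagging(X, y)
print(predict_bagging(trees, [14.0, 3.0]))
```

Averaging over trees fitted to different bootstrap samples is what reduces variance relative to a single tree trained on the full training set.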
We assessed the feature importance for the ML predictive models; patient age, maxillary crowding, and mandibular crowding were the top features. Patient age could be a contributing factor due to the biological differences between adolescents and adults. Vayda et al., in 1995, reported significant differences in treatment length between adults and adolescents [16]. However, other studies reported no significant differences in treatment length between adults and adolescents [17]. Additional parameters contribute to treatment length prediction by ML. For example, crowding, overjet, overbite, and AI score are all measures of the severity of the malocclusion; previous studies found that quantitative malocclusion indices, such as peer assessment rating (PAR) and the objective grading system (OGS), correlated with treatment length [3]. Other factors were found to contribute less, such as gender, race, and malocclusion classification into Class I, II, and III; this aligns with previous findings [1,3,7]. Unexplored factors may also contribute to treatment length, including the orthodontic technique employed, operator skill and experience, and patient compliance. The impact of these factors is unknown and needs to be examined.
The scope of this study was to build a predictive model that can be used at initial patient screening or consultation. Other parameters can be used to fine-tune the ML models in the future. Furthermore, individual and subjective issues create more variation than the quantifiable factors presented in the study. However, additional studies could correlate those numeric variables to better understand their impact on treatment length. A clinical application of the ML predictive models presented in this study could be a software or mobile application with a graphical user interface (GUI) that could be used during the orthodontic screening or consultation to provide helpful information for both the patient and the orthodontist (Figure 10). Furthermore, these ML models could be integrated with orthodontic software currently available.

Conclusions
We achieved our objective of developing ML-based predictive models. Bagging and adaboost ML methods provided good predictability for orthodontic treatment length when patient information, such as age, malocclusion, and crowding, was provided. Furthermore, the study demonstrated the relative importance of each factor. Additional studies should be conducted on large, diverse datasets to include more variables and improve the performance of ML models for understanding orthodontic treatment length.

Figure 1. Histogram of patient age distribution.

Figure 2. Histograms showing (a) actual treatment time distribution and (b) actual treatment time based on malocclusion class. Boxplot showing (c) malocclusion versus actual treatment time.

Figure 3. Heat map showing the correlation between variables.

Figure 4. Scatterplots comparing actual treatment time vs predicted treatment time for the bagging model.

Figure 5. Scatterplots comparing actual treatment time vs predicted treatment time for the AdaBoost model.

Figure 6. Feature importance for the Bagging model.

Figure 7. Permutation importance for the bagging model.

Figure 8. Feature importance in the adaboost model.

Figure 9. Permutation importance for the adaboost model.

Oral 2022, 2, 12

Figure 10. Graphical user interface for mobile application for treatment length prediction.

Table 1. Description of Patient Demographic Data.

Table 2. Performance Comparison of the ML Models.