The Accuracy of Sex Identification Using CBCT Morphometric Measurements of the Mandible, with Different Machine-Learning Algorithms—A Retrospective Study

In forensics, predicting the sex is a crucial step in identification. Many studies have aimed to find an accurate and fast technique to estimate the sex. This study was conducted to determine the accuracy of volumetric and linear measurements of three-dimensional (3D) images of the mandible obtained from cone beam computed tomography (CBCT) radiographs, using different machine-learning (ML) models for sex identification. The CBCTs of 104 males and 104 females were included in this study. The radiographs were converted to 3D images, and the volume, surface area, and ten linear measurements of the mandible were obtained. The data were evaluated using statistical analysis and five different ML algorithms. All results were considered statistically significant at p < 0.05, and the precision, recall, f1-score, training accuracy, and testing accuracy were used to evaluate the performance of the ML models. All the studied parameters showed statistically significant differences between sexes (p < 0.05). The right coronoid-to-gonion linear distance had the highest discriminative power of all the parameters. Meanwhile, Gaussian Naive Bayes (GNB) showed the best performance among all the ML models. The results of this study revealed promising outcomes; the sex can be easily determined, with high accuracy (90%).


Introduction
Human identification is one of the most important aspects of forensic odontology. The identification of deceased persons is essential for social and legal closure [1]. Forensic dentistry is critical in recognizing individuals' unique dental and facial features. It is regarded as a powerful tool for the identification of human remains in events where highly damaged and dismembered dead bodies are recovered from mass graves, earthquakes, tsunamis, airplane crashes, train and road accidents, bomb blasts, and wars [2].
Sex determination is considered an essential step in forensic and anthropological human identification. It narrows the search for the deceased person's identity, by excluding approximately one-half of the population [3]. Skeletal evaluation is a very accurate and reliable method for sex determination; in particular, the accuracy of morphometric measurements of the pelvis can reach up to 95% [4]. However, in some instances, sex determination can be very challenging for forensic experts, when only a few fragments of a human skeleton have been recovered [5].
The mandible is a probable finding by forensic experts, as it is the strongest bone in the human face, characterized by its compact nature and higher resistance to degradation by environmental factors [6]. To overcome morphological sex estimation subjectivity and experience-related error, many studies have been conducted to evaluate the mandible morphometrically, to identify its ability to predict sex accurately. Interestingly the literature shows that the mandible's morphometric parameters have significant sexual dimorphism,

Study Design
This study was conducted from March 2022 to February 2023. The ethical approval of this study was obtained from the Ethics Committee at the College of Dentistry/University of Sulaimani-No. 36/21, on 11 August 2021. This research was performed following the relevant guidelines. The authors waived informed consent, as they analyzed archives of CBCT radiographs.
A total of 687 CBCT radiographs had been scanned from the archive of Foton Maxillofacial Imaging Clinic in Sulaimani/Iraq. Radiographs that showed the entire mandible and belonged to subjects aged 18 years and above from both sexes were included in this study. On the other hand, CBCTs that met the following criteria were excluded:
1. CBCTs did not show the entire mandible, or showed less than ten remaining teeth.
2. Certain pathological conditions affected the mandible's size and shape.
3. CBCT images contained metal in or near the mandible, or radiographs showed bone resorption around the lower teeth that reached the apical third of the roots.
As a result of the inclusion and exclusion criteria, 479 CBCTs were excluded, and only 208 radiographs were included in this study (104 females and 104 males), aged 18-70 years.
All the CBCT images were acquired using the CS 9600 unit, manufactured by Carestream Dental, Atlanta, GA, USA, with the following technical specifications: 15.4 cm spherical imaging volume, 150 µm × 150 µm × 150 µm voxel size, and a field of view of 16 × 12 cm. All radiographs were taken according to the following parameters: 120 kVp, 5 mA, and an exposure time of 40 s.

Image Analysis
The CBCT radiographs were exported to multiple digital imaging and communications in medicine (DICOM) file formats, using CS 3D Imaging version 3.10.21, developed by Carestream Dental, Atlanta, GA, USA. The DICOM files were imported into RealGUIDE version 5.0 software, developed by RealGUIDE™ Software Suite Milan, Italy. The software automatically creates a 3D image of the mandible, facial bones, and other bones included in the radiograph.
All bones were eliminated manually, by using the sculpting tool of the software, except for the mandible. The following settings were used in the bone segmentation section of the software: the bone threshold was set to 400, the background was set to 300, the segmentation quality was set to 80%, the smooth was set to 2, and the maximum number of objects was set to 1 (to eliminate other objects, and keep the largest object, which was the mandible). Using the segment tool, the software automatically detected the boundaries of the mandible, then the mandible was segmented and the mesh was created. By using the remove, add, smooth, and fill tools of the software, any minor defects were corrected manually. The 3D images were exported in the stereolithography (STL) file format, using the Export Anatomy tool in the Report/Export section of the software.
The volume (Vol) and the surface area (SA) of the mandible were obtained using the Meshmixer version 3.5.474 software, developed by Autodesk Inc., San Francisco, CA, USA. The STL files were opened in the 3D Slicer version 4.13.0 software, an open-source software platform for biomedical research, and all linear measurements (as shown in Figure 1) were taken manually by a single examiner. The intra-examiner reliability test was conducted on 30 randomly collected samples, after six months. The surface area/volume (SA/Vol) ratio, the mean coronoid-to-gonion linear distance (mean CorGonLD), and the mean gonion-to-menton linear distance (mean GonMenLD) were calculated using Microsoft Excel 2021, developed by Microsoft Corporation, Redmond, WA, USA.
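The three derived parameters are simple arithmetic on the measured values. The sketch below shows the calculation in Python rather than Excel; the function name, argument names, and example numbers are hypothetical and are not taken from the study's data.

```python
from statistics import mean


def derived_features(vol, sa, r_cor_gon, l_cor_gon, r_gon_men, l_gon_men):
    """Return SA/Vol and the right/left means of the two linear distances."""
    if vol <= 0:
        raise ValueError("volume must be positive")
    return {
        "SA/Vol": sa / vol,
        "mean CorGonLD": mean([r_cor_gon, l_cor_gon]),
        "mean GonMenLD": mean([r_gon_men, l_gon_men]),
    }


# Illustrative values only (hypothetical, not measurements from the study).
example = derived_features(
    vol=50000.0, sa=15000.0,
    r_cor_gon=62.0, l_cor_gon=60.0,
    r_gon_men=88.0, l_gon_men=86.0,
)
```

Because SA/Vol is a ratio of an area to a volume, its value depends on the units used for both inputs, so the same units should be kept for every mandible.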

Statistical Analysis
The data were analyzed using Statistical Package for The Social Sciences (SPSS) Version 25, developed by International Business Machines Corporation (IBM), New York, NY, USA. The intraclass correlation coefficient was used to assess the intra-examiner reliability, and descriptive statistics were used to calculate the mean ± standard deviation (SD), median, and interquartile range (IQR).
The normality of the data was tested using the Shapiro-Wilk normality test. The nonparametric data were analyzed using the Mann-Whitney U test, while the independent t-test was applied to the parametric data. To reveal the discriminative power of each morphometric measurement of the mandible when predicting the sex, an ROC (receiver operating characteristic) analysis was applied. All results were considered statistically significant at p < 0.05.
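The ROC analysis of a single measurement can be sketched in a few lines of Python. The study ran this analysis in SPSS, so the code below is only an illustration: the function names and the data are hypothetical, and choosing the cut-off by Youden's index (sensitivity + specificity - 1) is an assumption, since the text does not state how the cut-off values were selected.

```python
def auc(values, labels):
    """AUC as the share of (male, female) pairs where the male value is larger.

    labels: 1 for male, 0 for female. Tied pairs count as half.
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def roc_points(values, labels):
    """Return (cut-off, sensitivity, specificity) for every observed value.

    A value greater than or equal to the cut-off is classified as male.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = []
    for cut in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v >= cut)
        tn = sum(1 for v, y in zip(values, labels) if y == 0 and v < cut)
        points.append((cut, tp / n_pos, tn / n_neg))
    return points


def youden_cutoff(values, labels):
    """Pick the cut-off that maximizes sensitivity + specificity - 1."""
    return max(roc_points(values, labels), key=lambda p: p[1] + p[2] - 1)


# Hypothetical distances (not study data): four females, then four males.
values = [58, 59, 60, 63, 61, 62, 64, 66]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

example_auc = auc(values, labels)
best_cut, best_sens, best_spec = youden_cutoff(values, labels)
```

The pairwise definition used in `auc` is equivalent to the area under the empirical ROC curve. An AUC below 0.5 under this convention means the measurement tends to be larger in females.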

Machine-Learning Algorithms
Anaconda Navigator version 2.1.4 and Jupyter Notebook version 6.4.8, developed by Anaconda Inc., Austin, TX, USA, were used in this study to create the ML models in the Python programming language version 3.10, developed by The Python Software Foundation, Wilmington, DE, USA. The ML modeling was carried out on an MSI personal computer (Model: GF65) with a 9th-generation i7 processor. The GNB, logistic regression (LR), decision tree (DT), random forest (RF), and k-nearest neighbors (KNN) algorithms were used to analyze the data. The dataset was split, with 80% designated as the training set and the remaining 20% as the test set. Furthermore, to assess the reliability of the ML modeling, tenfold cross-validation of the accuracy values was performed, by shuffling the dataset and running the ML algorithms ten times. The mean and SD of each ML algorithm's accuracy were calculated using Microsoft Excel 2021, developed by Microsoft Corporation, Redmond, WA, USA.
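The splitting and validation scheme can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the study's code: the text does not name the library used, the cross-validation is interpreted here as standard shuffled k-fold, the SD is computed as a sample SD, and the midpoint-threshold classifier and the synthetic data are hypothetical stand-ins for the five algorithms and the real measurements.

```python
import random
from statistics import mean, stdev


def split_indices(n, test_fraction=0.2, seed=0):
    """Shuffle the indices 0..n-1 and return (train, test) index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


def kfold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for shuffled k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Return the mean and sample SD of the accuracy over k folds."""
    scores = []
    for train, test in kfold_indices(len(y), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return mean(scores), stdev(scores)


# Hypothetical stand-in classifier: threshold halfway between the class means.
def fit_midpoint(X, y):
    mean_f = mean(x for x, c in zip(X, y) if c == 0)
    mean_m = mean(x for x, c in zip(X, y) if c == 1)
    return (mean_f + mean_m) / 2


def predict_midpoint(threshold, x):
    return int(x >= threshold)


# Synthetic single-feature data (not the study's measurements): 104 per class.
rng = random.Random(1)
X = [rng.gauss(58, 3) for _ in range(104)] + [rng.gauss(64, 3) for _ in range(104)]
y = [0] * 104 + [1] * 104

train_idx, test_idx = split_indices(len(y))
cv_mean, cv_sd = cross_val_accuracy(X, y, fit_midpoint, predict_midpoint)
```

With 208 samples, the 80/20 split gives 166 training and 42 test cases, which matches the 20 females and 22 males in the reported confusion matrices. Every sample serves as test data exactly once across the ten folds.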

Performance Criteria
The Sensitivity, Specificity, Accuracy, Recall, Precision, and F1 score values were included as performance criteria. The process of taking the CBCTs, image analysis, and data-handling using ML algorithms is presented in Figure 2.
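These criteria all follow from the four cells of a binary confusion matrix. The sketch below uses a hypothetical helper, `binary_metrics`, and fills it with the GNB test-set counts reported in the Results (19 of 20 females and 19 of 22 males classified correctly). Treating female as the positive class is an arbitrary choice made for the example.

```python
def binary_metrics(tp, fn, tn, fp):
    """Compute the performance criteria from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # identical to recall for the positive class
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "recall": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# GNB counts from the Results, with female as the positive class:
# 19 females correct, 1 female missed, 19 males correct, 3 males missed.
gnb = binary_metrics(tp=19, fn=1, tn=19, fp=3)
```

For these counts the accuracy is 38/42, or about 0.905, which agrees with the reported testing accuracy of 0.90. Sensitivity and recall are the same quantity, so the two names in the list of criteria refer to one value per class.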

Results
A total of 208 CBCTs were included in this study (104 males and 104 females). The mean ± SD age of the male subjects was 36.74 ± 13.71 years, while the mean ± SD age of the female subjects was 37.56 ± 14.58 years. The mean age did not differ significantly between the male and female samples (p = 0.678). The distribution of the morphometric values of the mandible by sex is shown in Figure 3.
The normality test revealed that the BiconB, BigonB, BicorB, RCorGonLD, left coronoid-to-gonion linear distance (LCorGonLD), mean CorGonLD, LGonMenLD, and mean GonMenLD were normally distributed, while the volume (Vol), surface area (SA), SA/Vol, RGonMenLGonA, and RGonMenLD were not normally distributed. All parameters showed significant differences between males and females, with higher values in male subjects (Tables 1 and 2).
An ROC analysis was conducted to assess the discriminative power of each parameter in predicting sex. The ROC curves of all the mandible morphometric measurements are shown in Figure 4. The RCorGonLD and mean CorGonLD showed the highest AUC (0.914 and 0.913, respectively), followed by the LCorGonLD (0.901) and SA (0.888), while the SA/Vol showed the lowest discriminative power in estimating sex (AUC = 0.419) among all the morphometric mandible parameters (see Table 3). The ROC analysis showed statistical differences between all the parameters in males and females, with p < 0.05.
The RCorGonLD showed the highest sensitivity and specificity among all measurements (0.846 and 0.865, respectively), while the SA/Vol showed the lowest sensitivity and specificity (0.442 and 0.452, respectively). The AUC, cut-off value, p value, sensitivity, and specificity of all parameters are presented in Table 3.
GNB demonstrated the highest discriminative power of all the ML algorithms (AUC = 0.955), followed by the LR and RF (0.945 and 0.944, respectively), while the AUC of the KNN classifier was 0.897. The ROC analysis of the DT algorithm showed the lowest AUC among all the ML models (AUC = 0.811). Figure 5 shows the ROC curve and AUC of all the ML algorithms.
GNB had the highest testing accuracy (0.90) and the lowest training accuracy (0.84) among all the ML algorithms, and its confusion matrix showed that it predicted 19 of 20 females, and 19 of 22 males, correctly. This was followed by RF and LR (testing accuracy = 0.88), and the testing accuracy of the KNN model was 0.83. Meanwhile, the DT had the lowest testing accuracy (0.80) and the highest training accuracy (1), with 17 of 20 females, and 17 of 22 males, correctly predicted, according to its confusion matrix. Regarding the f1-score, the GNB algorithm had the highest f1-score for males and females (0.90), followed by RF and LR (0.88). However, the DT demonstrated the lowest f1-score for both sexes (0.81). The precision, recall, and f1-score, as well as the training accuracy and the testing accuracy of all the ML algorithms, are presented in Table 4. The results of the confusion matrices of all the ML models are shown in Figure 7.
The mean magnitude of the SHAP values was used to show the impact of each parameter on the RF and DT algorithms, as shown in Figure 8.
Regarding the reliability of this study, the result of the intraclass correlation coefficient, which was applied to assess the intra-examiner reliability of all parameters, ranged from 0.95 to 0.99. In addition, tenfold cross-validation was used to appraise the performance of the ML algorithms. The highest accuracy was obtained from KNN (0.848 ± 0.059) and LR (0.848 ± 0.068), followed by RF (0.84 ± 0.048), and GNB (0.84 ± 0.049), while DT showed the lowest accuracy (0.814 ± 0.042). The results of the tenfold cross-validation tests of all the ML algorithms are shown in Table 5.

Discussion
In forensic dentistry, various methods have been used to determine the sex of unknown human remains [25]. The radiograph is one of the most valuable tools in forensic odontology, because it gives objective evidence of the dental treatments, as well as the anatomical conditions, of the deceased person. In addition, it is a non-destructive, easy-to-use, and quick technique, which makes it cost-effective in comparison to molecular technology [26].
Medical image segmentation entails dividing DICOM images into distinct and meaningful segments. The selection of an appropriate threshold level is crucial for the segmentation of various structures in the skull. The 3D models of these structures are created by surface reconstruction, based on contour interpolation from different segments. The segmentation of 3D radiographic images is widely used in reconstructing anatomical structures of the cranium. The accuracy of 3D radiographic image reconstruction is influenced by the segmentation threshold range [27]. In this study, the mandible was segmented with minimal artifacts, by setting the bone threshold value to 400, and the background threshold value to 300.
Morphometric analysis of various bones of the human body has been used to predict the sex. The pelvis and cranium show the highest level of sexual dimorphism in the human skeleton. According to the literature, the possibility of finding an intact mandible is high, because the mandible is more durable than other bones in the cranium, and is composed of compact bone [28,29]. This study aimed to evaluate the performance of morphometric measurements obtained from CBCTs of the mandibles, using ML algorithms, in predicting the sex accurately.
In this study, all the morphometric measurements of the mandible (BiconB, BicorB, BigonB, LCorGonLD, LGonMenLD, mean CorGonLD, mean GonMenLD, RCorGonLD, RGonMenLD, RGonMenLGonA, SA, SA/Vol, and Vol) showed statistically significant differences between males and females (p < 0.05). Males had higher mean values for all measured parameters than females. This can be attributed to the fact that, during male growth, testosterone levels and the more extended puberty phase, with a related longer duration of bone growth, affect the bone size. Another consideration is the muscular tension that encourages bone growth; because males have stronger masticatory muscles than females, the mandible is generally more developed in males [30,31].
The ROC findings of all the parameters revealed that RCorGonLD had the highest sensitivity and specificity in predicting the sex (AUC = 0.914). The most accurate ML model in estimating the sex was GNB (90%), followed by LR and RF, while DT showed the lowest prediction accuracy (80%). In this study, the AUC and ROC curve were used to measure the classifier's ability to distinguish between males and females, the diagnostic efficiency of each morphometric mandible measurement in the prediction of the sex, and the cut-off values for each predictor. Researchers have indicated that using the ROC curve and AUC has advantages, such as the fact that the AUC is based on both specificity and sensitivity, and is unaffected by the prevalence of one investigated group over the other, in contrast to the single measures of specificity, sensitivity, and diagnostic accuracy. Furthermore, AUC and ROC curves can be used to compare different models, and are insensitive to class imbalance [32].
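The claim that the AUC is unaffected by the prevalence of one group can be illustrated with a small experiment. The sketch below uses hypothetical classifier scores (not values from the study) and compares a balanced sample with one in which the positive cases are repeated five times, leaving the score distribution within each class unchanged.

```python
def auc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold=0.5):
    """Share of cases where (score >= threshold) matches the label."""
    hits = sum(int(s >= threshold) == y for s, y in zip(scores, labels))
    return hits / len(labels)


# Hypothetical classifier scores, not values from the study.
neg_scores = [0.10, 0.30, 0.40, 0.45]
pos_scores = [0.35, 0.60, 0.80, 0.90]

balanced_scores = neg_scores + pos_scores
balanced_labels = [0] * 4 + [1] * 4

# Same score distributions, but five times as many positive cases.
skewed_scores = neg_scores + pos_scores * 5
skewed_labels = [0] * 4 + [1] * 20

auc_balanced = auc(balanced_scores, balanced_labels)
auc_skewed = auc(skewed_scores, skewed_labels)
acc_balanced = accuracy(balanced_scores, balanced_labels)
acc_skewed = accuracy(skewed_scores, skewed_labels)
```

In this example the AUC is 0.875 in both samples, while the accuracy at a 0.5 threshold falls from 0.875 to about 0.79 because the class with the lower per-class accuracy now dominates the sample.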
For this study, the null hypothesis was rejected, as sex could be determined with accuracy reaching up to 90%, by analyzing the morphometric measurements of the mandible using ML algorithms. The accuracy of sex estimation in this study appeared to be high, as the accuracy of sex prediction in previous studies that only evaluated the morphometric measurements taken from the mandible ranged between 53% and 90% [33][34][35][36][37][38][39][40].
A study conducted by Saloni et al. [33] showed that the sex could be predicted from the mandibular ramus, with accuracy reaching up to 77.6%, while Mehta et al. [34] reported that the sex could be estimated with an accuracy of 77.3% from the minimum ramus breadth. Finally, Samatha et al. estimated the sex with an accuracy rate of 53% for males, and 60% for females. Still, a comparison with these results is difficult, as they used linear measurements from orthopantomogram radiographs (OPG) of the mandible, and analyzed them using basic statistics. In this study, CBCT radiography was used, which is considered the gold standard in oral and maxillofacial region imaging. Furthermore, unlike the 2D imaging technique, both linear and volumetric measurements can be obtained using CBCT radiography [14].
The accuracy of other studies that have used CBCT radiography of the mandible and discriminant function analysis to estimate the sex ranges between 67% and 84.1% [36][37][38][39]. Despite the use of discriminant function analysis by many researchers in the field of forensic dentistry in predicting the sex, it has many limitations. For instance, the morphometric measurements of bone are not linear [41]. Moreover, all variables should be parametric, statistically independent, randomly sampled, and have equal sample sizes for both groups [42]. To overcome the limitations of discriminant function analysis, different ML algorithms have been used in this study, where the sample was split, with 80% dedicated to training, and the remaining 20% used to test the models.
Gabriela et al. [40], in their study, measured the ramus height and maximum length, coronoid height, gonial angle, and bigonial distance of 103 mandibular bones, using a digital caliper. The data were analyzed through an LR model that had been developed using 83 samples, and the remaining 20 samples were used for testing. The accuracy of this model in predicting the sex was 90%. This result is close to the finding of the present study. Abualhija et al. [43] used LR to estimate sex from the measurements of the ramus of the mandible from OPG radiographs. However, they predicted sex correctly with a total accuracy of 77.6%, which is lower than the accuracy found in this study. This difference could be due to the different radiographic techniques and the measured parameters.
One of the most contemporary techniques used in forensics is artificial intelligence. Patil et al. [44] studied seven linear parameters in OPG radiographs of the mandible to determine the sex using three different methods (discriminant analysis, LR, and artificial neural network analysis). The result indicated that the accuracy of discriminant analysis was 69.1%, while the accuracy of LR was 69.9%, and the accuracy of the artificial neural network was 75%. The differences in their results compared to the results of the present study could be attributed to the measuring parameters, and the use of different algorithms in the current study to identify the sex.
Using ML algorithms to analyze volumetric and linear data obtained from CBCT radiographs of various structures of the maxillofacial region is a new technique to estimate the sex. Hamad et al. [45] reported that the sex could be predicted with high accuracy, reaching up to 98%, by studying the maxillary sinus morphometry with the aid of ML models. This result suggests that using ML models and CBCT radiographs is a promising approach to determining the sex.
Few studies have evaluated 3D radiography of the cranium and mandible using ML models. Toy et al. [46] evaluated twenty-five linear morphometric measurements obtained from computed tomography (CT) scans of 150 male and 150 female individual skulls and mandibles, using DT, RF, LR, linear discriminant analysis, quadratic discriminant analysis, and extra tree classifier ML algorithms. The results indicated that the sex could be predicted successfully, with a high accuracy (90%). The result of the study conducted by Toy et al. is similar to the sex prediction accuracy of the present research.
Regarding the reliability of this study, intra-examiner calibration was used, to assess the examiner's reproducibility, and a tenfold cross-validation test was applied, to evaluate the performance of the ML models. The small sample size, and the fact that the sample belonged to one ethnic group, are considered the main limitations of this research.

Conclusions
The identification of deceased persons is essential for social and legal closure. Hence, sex determination is a crucial step in human identification, as it narrows the search for identity by excluding nearly one-half of the cases. Accordingly, much research has been conducted to find an accurate, fast, and cost-effective technique to estimate the sex. The findings of this study indicate that the RCorGonLD and mean CorGonLD are the most reliable parameters for predicting the sex with high accuracy. Interestingly, the sex can be quickly determined with high accuracy, up to 90%, using the ML algorithm GNB. However, further studies with larger sample sizes and racial diversity are needed, to support the result of this study.

Informed Consent Statement:
Patient consent was waived due to collected samples being obtained from archived CBCT radiographs.

Data Availability Statement:
The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest:
The authors declare no conflict of interest.