Automated Caries Screening Using Ensemble Deep Learning on Panoramic Radiographs

Caries prevention is essential for oral hygiene. A fully automated procedure that reduces human labor and human error is needed. This paper presents a fully automated method that segments tooth regions of interest from a panoramic radiograph to diagnose caries. A patient's panoramic oral radiograph, which can be taken at any dental facility, is first segmented into regions of individual teeth. Then, informative features are extracted from the teeth using a pre-trained deep learning network such as VGG, Resnet, or Xception. Each extracted feature set is used to train a classification model such as random forest, k-nearest neighbor, or support vector machine. The prediction of each classifier model is considered as an individual opinion that contributes to the final diagnosis, which is decided by a majority voting method. The proposed method achieved an accuracy of 93.58%, a sensitivity of 93.91%, and a specificity of 93.33%, making it promising for widespread implementation. The proposed method outperforms existing methods in terms of reliability and can facilitate dental diagnosis and reduce the need for tedious procedures.


Introduction
Dental health is important because of the correlation between oral health problems and illnesses such as cardiovascular disease and diabetes. Oral health has a significant impact on a person's overall health and quality of life. Oral health problems such as mouth and face discomfort, oral and throat cancer, oral infection and sores, periodontal (gum) diseases, tooth decay, and tooth loss impede a person's ability to bite, chew, and speak and affect psychological health. In 2016, the World Health Organization (WHO) projected that over 3.5 billion individuals were impacted by oral disorders and expected this number to continue to rise [1].
Dental caries form when acids produced by bacteria in the mouth erode dentin, causing damage to tooth structure or attachment, which can make gums bleed. They are the most common chronic oral disease in adults, affecting around 60% of adults over the age of 50. Dental health is part of oral health [2], including the state of oral tissues as well as factors that can affect oral health. Dental plaque is initially a soft, thin film. Soft plaque turns into hard plaque, which cannot be easily removed by brushing, via mineralization with calcium, phosphate, and other minerals [3]. Over time, caries cause holes, destroy the tooth, and increase the risk of further damage, including tooth loss (Figure 1).
Medical imaging technology, such as that based on X-rays and other forms of radiation, is used for diagnosis and treatment. Multimodal medical imaging technologies allow more than one form of radiation to be used at the same time to obtain an image that is more accurate and complete. Such technologies help doctors determine the best course of action for their patients. They also help reduce pain and speed up the diagnosis process. A concern of patients is radiation exposure. However, the radiation emitted is generally very low-level and is not likely to cause any long-term health problems.
Advancements in medical imaging technology enable the rapid gathering and analysis of a large amount of data. Computer-aided diagnosis (CAD) can assist physicians in interpreting 2D and 3D images [4]. 3D imaging provides more detail and is thus useful for complex cases. For example, a deep-learning-based method can segment the mandible from cone beam computed tomography images [5]. 2D imaging provides essential information for diagnosing problems such as cancer, diabetes, and caries [6,7]. Several studies [8–10] have advocated the use of photoacoustic images, wavelength images, or ultrasound imaging for caries detection. Other studies [10,11] have proposed an approach that employs an RGB oral endoscope image. However, most of these systems cannot observe the detailed anatomy of a tooth, especially the root, and hence cannot be used to diagnose caries. Dental radiography is a simple and affordable imaging method that can be performed in most dental offices and hospitals; other imaging techniques, such as CT radiography and near-infrared ranging, are more costly and thus less commonly used [12]. Dental radiography images are thus preferable for the early detection of caries based on computer-aided diagnosis.

Literature Review
Caries detection based on radiography uses panoramic radiographs, periapical images, bitewing images, or occlusal images. Panoramic radiographs, which are the most complex, present the health condition of all teeth and provide the benefit of showing the medical history in a whole oral image, whereas the other types of images show only a few teeth in a specific region. Periapical, bitewing, and occlusal images provide similar information. Therefore, panoramic radiographs are more informative and preferred for caries detection.
Li et al. [13] used a support vector machine (SVM) and a backpropagation neural network (BPNN) to identify tooth decay. The autocorrelation coefficient and the gray level co-occurrence matrix are used separately in their method for feature extraction. SVM and BPNN models are then used separately for classification. On a testing set, SVM had an accuracy of 79% and BPNN had an accuracy of 75%. These accuracies are insufficient for practical applications. Their study did not describe the dataset and thus the validity of their research is unknown. Yu et al. [14] attempted to improve the backpropagation neural network layer and autocorrelation coefficient matrix feature extraction. Their approach was evaluated using 80 private dental radiographs. An accuracy of 94% was obtained; however, as the number of network layers increases, the system becomes more computationally expensive. The sensitivity, specificity, precision, and F-measure were not reported. The small testing set (35 images) and lack of cross-validation are shortcomings of their study.
Patil et al. [15] developed an intelligent system based on the dragonfly algorithm. The feature set is extracted using multi-linear principal component analysis (MPCA). After the features are loaded into a neural network classifier, the classifier is trained using the adaptive dragonfly algorithm as an optimization strategy. A total of 120 private dental images were used to assess the MPCA model non-linear programming with the adaptive dragonfly algorithm (MNP-ADA) in three test scenarios. Each test case consisted of 40 images, 28 and 12 of which were utilized for training and testing, respectively. Other classifiers and feature sets, such as linear discriminant analysis (LDA) [16], principal component analysis (PCA) [17], and independent component analysis (ICA) [18], as well as fruit fly (FF) [19] and grey-wolf optimization (GWO) [20], were employed for comparison. The MNP-ADA model achieved an accuracy of 90%, a sensitivity of 94.67%, and a specificity of 63.33%. This low specificity indicates that many patients without caries were incorrectly labeled as patients with caries. The high sensitivity but limited specificity may raise questions about the data balance between images with and without caries.
Singh et al. [21] proposed an automated caries detection method based on Radon transform (RT) and discrete cosine transform (DCT). To capture low-frequency information, RT is performed on X-ray images for each degree. 2D DCT is then applied to the RT images to extract frequency characteristics (DCT coefficients). These coefficients are arranged into a 1D coefficient vector in zigzag order. Principal component analysis is then applied to this vector to retrieve features. The smallest number of features is then used with decision tree, k-nearest neighbor, random forest, naive Bayes, sequential minimal optimization, radial basis function, decision stump, and AdaBoost classifiers. The best result was achieved with random forest, with an accuracy of 86%, a sensitivity of 91%, and a specificity of 80%.
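The zigzag flattening step can be illustrated with a short sketch. It assumes the common JPEG-style zigzag order over anti-diagonals; the exact ordering used in [21] is not stated here, and the function name `zigzag` and the sample matrix are hypothetical.

```python
def zigzag(matrix):
    """Flatten a 2D coefficient matrix into a 1D list in JPEG-style zigzag order."""
    rows, cols = len(matrix), len(matrix[0])
    out = []
    for s in range(rows + cols - 1):
        # Cells on anti-diagonal s satisfy r + c == s.
        diag = [(r, s - r) for r in range(rows) if 0 <= s - r < cols]
        if s % 2 == 0:
            # Even diagonals are walked from bottom-left to top-right.
            diag.reverse()
        out.extend(matrix[r][c] for r, c in diag)
    return out


# Hypothetical 3x3 block standing in for a matrix of DCT coefficients.
coeffs = [[1, 2, 3],
          [4, 5, 6],
          [7, 8, 9]]
vector = zigzag(coeffs)
print(vector)
```

Low-frequency coefficients (top-left of the matrix) come first in the resulting vector, which is why this ordering is convenient before a dimensionality reduction step such as PCA.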
Le et al. [22] proposed a framework for diagnosing dental problems, called the Dental Diagnosis System (DDS), that uses panoramic radiographs. It is based on a hybrid approach that combines segmentation, classification, and decision-making. For the segmentation task, it uses the best method for dental image segmentation, which is based on semi-supervised fuzzy clustering. For the classification task, a graph-based algorithm called affinity propagation clustering was developed. To select a disease from a group of diseases found in the segments, a decision-making method was developed. DDS was designed based on actual dental cases at Hanoi Medical University, Vietnam, which included 87 dental images of cases with five prevalent diseases, namely root fracture, wisdom teeth, tooth decay, missing teeth, and periodontal bone resorption. The accuracy of DDS is 92.74%, which is higher than those of systems based on fuzzy inference (89.67%), fuzzy k-nearest neighbor (80.0%), Prim spanning tree (58.46%), Kruskal spanning tree (58.46%), and affinity propagation clustering (90.01%). Their dataset consisted of various types of images, which may have led to unreliable results.
Most previous methods are undependable: they either perform poorly or cannot fully automate diagnosis. In the present study, we comprehensively evaluate panoramic radiographs and develop a fully automated and dependable caries screening approach.


Dataset
We received a dataset from dentists at Shinjuku East Dental Office. The dataset consists of unprocessed radiographs of 95 individuals. These radiographs were automatically processed to generate 533 tooth regions in the tooth region proposal stage. Images are from real patient cases from the hospital. The patients were 18 years old or older and provided consent. It is important to highlight that caries is more severe in adults (over 18 years old) since their teeth are no longer milk teeth but rather permanent teeth, which cannot be restored to their previous state. The University Committee at Tokai evaluated the publishing and usage rights of the images in the dataset based on ethical considerations. Figure 2 shows an example image from the dataset. It includes the mouth and a portion of the patient's jaw bone.

Method
The proposed method, shown in Figure 3, consists of tooth segmentation, tooth feature descriptor, and caries prediction processes. In the first stage, a YOLO model is applied for tooth region proposal. Then, the proposal region is segmented from the image and fed into the feature descriptor. Several pre-trained networks, namely VGG16 [23], VGG19 [23], Resnet18 [24], Resnet50 [24], Resnet101 [24], Xception [25], and Densenet201 [26], are used as feature descriptors to extract informative features. Next, the features are used to train an SVM [27] classifier. Finally, a majority voting method is applied using the model features to produce the final optimal result.


Tooth Region Segmentation
As mentioned, caries detection methods that directly use images received from the dentist have been developed. The images are usually either unprocessed or periapical images, which makes using them expensive in terms of human labor and cost. In the present research, an automatic region proposal method is used to reduce cost and improve diagnosis.
First, we define a region of interest in the image. To prevent encroachment on the teeth, we choose a region in the center of the image with a preliminary ratio compared to the original image of 1:1.4. The images are scaled to fit the YOLOv3 model's input size. The YOLOv3 model is used to propose tooth regions, with SqueezeNet as the network's base [28][29][30]. We increase the number of detection heads and concatenate the output of each detection head with a suitable layer to generate better results. However, we must consider the model's size to avoid overfitting and decrease complexity. Three detection heads are utilized in this detection model. A detailed illustration of the tooth segmentation process is shown in Figure 4. The fine-tuned parameters are given in Table 1.
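The central region-of-interest step can be sketched as follows. This is a minimal illustration that assumes the 1:1.4 ratio applies to each side length of the image (the text does not say whether the ratio refers to side length or area); the function name `center_roi` and the example image size are hypothetical.

```python
def center_roi(width, height, ratio=1.4):
    """Return (left, top, right, bottom) of a centered crop.

    Assumption: each side of the crop is 1/ratio of the original side.
    """
    if ratio < 1:
        raise ValueError("ratio must be >= 1")
    roi_w = round(width / ratio)
    roi_h = round(height / ratio)
    left = (width - roi_w) // 2
    top = (height - roi_h) // 2
    return left, top, left + roi_w, top + roi_h


# Hypothetical 2800 x 1400 panoramic radiograph.
box = center_roi(2800, 1400)
print(box)
```

The resulting box would then be cropped and resized to the detector's input size before tooth region proposal.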


Deep Pre-Trained Network as Feature Descriptors
In this work, a convolutional neural network with pre-trained weights is employed as a feature descriptor to extract deep activated features. To determine the best descriptor among pre-trained networks, seven of the most popular networks, namely VGG16, VGG19, Resnet18, Resnet50, Resnet101, Xception, and Densenet, were used. The networks process RGB images, whereas the radiographs are grayscale; hence, we replicated the grayscale channel to fill the image's missing channels. Table 2 shows the depth, parameters, size, and input size for each pre-trained model. Among the network models, Densenet has the most layers (201), and VGG16 has the fewest layers (23).
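Filling the missing color channels can be sketched in a few lines. This is an illustration only, using plain nested lists rather than any particular imaging library; the function name `gray_to_rgb` and the sample patch are hypothetical.

```python
def gray_to_rgb(gray):
    """Copy a 2D grayscale image (rows of intensities) into three identical channels."""
    return [[(v, v, v) for v in row] for row in gray]


# Hypothetical 2 x 2 grayscale patch with 8-bit intensities.
patch = [[0, 128],
         [255, 64]]
rgb = gray_to_rgb(patch)
print(rgb[0][1])
```

Because the three channels are identical, the pre-trained network receives the same intensity information it would get from the grayscale image, only in the input shape it expects.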

Classification
The extracted feature set from each feature descriptor in the preceding stage is used to train an SVM classifier to predict caries [31]. The SVM model seeks to identify the optimal hyperplane that separates the classes (caries and non-caries in this scenario). The Gaussian radial basis function is used in the classifier to reduce the number of training points. For data D = {(x_i, y_i), i = 1, ..., N} and y_i ∈ {−1, 1}, the SVM model and mapping function of the Gaussian kernel can be described as follows:

$$\min_{w,\,b,\,\xi} \ \frac{1}{2}\|w\|^{2} + C \sum_{i=1}^{N} \xi_i \quad \text{subject to} \quad y_i\left(w^{\top}\phi(x_i) + b\right) \ge 1 - \xi_i, \quad \xi_i \ge 0,$$

where w and b are the weight vector and bias of the hyperplane, φ is the mapping function, C > 0 is the selected parameter, and ξ is a set of slack variables, and

$$K(x_i, x_j) = \phi(x_i)^{\top}\phi(x_j) = \exp\left(-A\,\|x_i - x_j\|^{2}\right),$$

where K is the kernel function and A is a constant. We also applied the feature set to k-nearest neighbor [32,33] and random forest [34][35][36][37] classifiers for comparison with the support vector machine.
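How a Gaussian kernel enters an SVM prediction can be sketched as follows. The sketch assumes the common kernel form exp(−A‖x − z‖²), with the constant A acting as a multiplier, and a trained model given by support vectors, coefficients, and a bias. All names and numeric values (`gaussian_kernel`, `decision_value`, the toy support vectors) are hypothetical and are not taken from the trained models in this study.

```python
import math


def gaussian_kernel(x, z, a=0.5):
    """Gaussian RBF kernel, assumed form: exp(-a * ||x - z||^2)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-a * sq_dist)


def decision_value(x, support_vectors, coeffs, bias, a=0.5):
    """f(x) = sum_i coeff_i * K(sv_i, x) + bias, where coeff_i = alpha_i * y_i."""
    return sum(c * gaussian_kernel(sv, x, a)
               for sv, c in zip(support_vectors, coeffs)) + bias


def predict(x, support_vectors, coeffs, bias, a=0.5):
    """Return +1 (e.g., caries) or -1 (e.g., non-caries) from the sign of f(x)."""
    return 1 if decision_value(x, support_vectors, coeffs, bias, a) >= 0 else -1


# Toy model: one support vector per class in a 2D feature space.
svs = [(0.0, 0.0), (2.0, 2.0)]
coeffs = [1.0, -1.0]
label_near_first = predict((0.1, 0.2), svs, coeffs, bias=0.0)
label_near_second = predict((1.9, 2.1), svs, coeffs, bias=0.0)
print(label_near_first, label_near_second)
```

In practice the features would be the high-dimensional activations from a pre-trained network, and the support vectors and coefficients would come from solving the SVM optimization problem.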

Majority Voting
The predictions obtained with each feature descriptor and the SVM predictor are considered as individual opinions with different contributions depending on accuracy performance. To produce a final prediction, voting is conducted among the predictors. The final diagnosis is made as

$$\hat{y} = \arg\max_{l \in \{1, \ldots, L\}} \sum_{n=1}^{N} P_n(l),$$

where N is the number of predictors, n is the predictor number, L is the number of classes, and P is probability, with P_n(l) denoting the probability that predictor n assigns to class l.
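The voting step can be sketched as below. This is a minimal illustration that assumes unweighted hard voting over the predicted labels of seven predictors (one per feature descriptor); any accuracy-based weighting used in the actual method is not modeled. The function name `majority_vote` and the example votes are hypothetical.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the largest number of predictors."""
    counts = Counter(labels)
    return counts.most_common(1)[0][0]


# Hypothetical predictions for one tooth region, one vote per feature descriptor.
votes = ["caries", "caries", "non-caries", "caries",
         "non-caries", "caries", "non-caries"]
final = majority_vote(votes)
print(final)
```

With an odd number of predictors and two classes, a tie cannot occur, so the vote always yields a single diagnosis.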

Measures
The performance of the proposed method was evaluated in terms of accuracy (ACC), sensitivity (SEN), and specificity (SPEC). In addition, the positive predictive value (PPV), negative predictive value (NPV), F1-score, and processing time are presented. The detailed calculation of each measure is as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{SEN} = \frac{TP}{TP + FN}, \qquad \mathrm{SPEC} = \frac{TN}{TN + FP},$$

$$\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad \mathrm{NPV} = \frac{TN}{TN + FN}, \qquad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN},$$

where true positive (TP) indicates the number of caries images correctly classified as caries, true negative (TN) indicates the number of non-caries images correctly classified as non-caries, false positive (FP) indicates the number of non-caries images incorrectly classified as caries, and false negative (FN) indicates the number of caries images incorrectly classified as non-caries.
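These measures can be computed directly from the four confusion counts using their standard definitions. In the sketch below, the function name `screening_metrics` is hypothetical, and the example counts are an assumption: they are not reported in this paper and were chosen only because they reproduce percentages of the kind reported for a single descriptor (90.57% accuracy, 95.65% sensitivity, 86.67% specificity).

```python
def screening_metrics(tp, tn, fp, fn):
    """Compute standard binary screening measures from confusion counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SEN": tp / (tp + fn),
        "SPEC": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "F1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for a test fold of 53 tooth regions (23 caries, 30 non-caries).
m = screening_metrics(tp=22, tn=26, fp=4, fn=1)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```

Reporting sensitivity and specificity alongside accuracy matters for screening, because a classifier can reach high accuracy on imbalanced data while missing most positive cases.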

Result Evaluation
An analysis of majority voting for several pre-trained neural networks and classifiers was conducted. The results are shown in Table 3. Overall, SVM has the best performance for every feature descriptor and in the final vote. Accuracy generally increased with the depth of the network. For SVM, the accuracy, sensitivity, and specificity with Densenet were 90.57%, 95.65%, and 86.67%, respectively, which is expected given the depth of the network. VGG16 had the lowest accuracy, sensitivity, and specificity (79.25%, 73.91%, and 83.33%, respectively). Majority voting made use of each feature descriptor and increased performance to 92.45% for accuracy, 95.65% for sensitivity, and 90% for specificity using an SVM classifier. Although random forest achieves a higher sensitivity in some cases, its other measures are not competitive.
To develop and evaluate an effective caries detection system, the data were randomly divided into training and testing sets for cross-validation. k-fold cross-validation was used to evaluate the robustness of the proposed method. The results demonstrate that the proposed method reliably adapts to unknown samples and covers the whole problem space. Additionally, k-fold cross-validation was used to avoid overfitting the proposed method to our testing data. It was applied to the best-performing classifier, which is the SVM. The difference in accuracy between folds is around 6% (lowest accuracy: 90.57%, highest accuracy: 96.23%). All average values of accuracy, sensitivity, and specificity are higher than 93%, which indicates that our method is stable and reliable. We also computed the receiver operating characteristic (ROC) curves and areas under the curve (AUC). The results for each fold and the average values are presented in Table 4 and Figure 5.
To compare the complexity of the method for various feature descriptors, we measured the execution time of each process in MATLAB R2020a running in a Windows 10 environment on a computer with an Intel i7 CPU, a GeForce GTX 2060 GPU, and 32 GB of RAM. Table 5 shows the execution time for each function in seconds. The operation for the Densenet feature descriptor is the most time-consuming. It took 113.7 seconds to finish, which is at least 10 times longer than any other operation. In comparison, the fastest process, on Resnet18, took only 4.33 seconds. Without considering the training process, the proposed method can be widely used because of its high processing speed.
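The k-fold cross-validation used in this evaluation can be sketched as an index-splitting routine. The value k = 10 below is an assumption, since the number of folds is not stated in the text; the function name `k_fold_indices` and the fixed seed are likewise hypothetical. The sample count of 533 matches the number of tooth regions in the dataset.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 533 tooth regions; k = 10 is assumed for illustration.
splits = list(k_fold_indices(533, 10))
for fold_no, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_no}: {len(train)} training, {len(test)} testing")
```

Each sample appears in exactly one testing fold, so the per-fold measures can be averaged to estimate performance on unseen data.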
Finally, we compare the proposed method with state-of-the-art methods. A short description of each existing method and its dataset is given below. Existing methods use distinct datasets, whose size and complexity affect performance. Therefore, this comparison is preliminary. The specifications of the state-of-the-art methods are given in Table 6. Although the method in [15] achieved a promising 90% accuracy, its low specificity of 63.33% is insufficient. In addition, the methods in [14,15] primarily use periapical images, which are often basic and require human effort to produce the final result. In contrast, the method in [21] has a general outcome that is not particular to the state of carious teeth. Despite a promising accuracy of 92.74%, the method in [22] is hampered by its use of mixed data, which leads to unknown validity. In addition, the sensitivity and specificity of this method were not reported. The table indicates that the proposed method has an accuracy of 93.58% and outperforms most existing methods. In addition, we present a full evaluation of the technique on a comprehensive dataset.

Conclusions
This study proposed a method for segmentation and caries diagnosis for caries screening. Most existing methods perform caries classification using periapical images, which require human labor to extract the input image. In contrast, the proposed method extracts the tooth region of interest automatically. Although the automatically segmented images may contain some errors, the proposed method has an accuracy of 93.58%, outperforming state-of-the-art methods.
Because features are extracted from seven feature descriptors, the combined feature set may contain redundant features. In future work, we would like to analyze each feature's contribution in order to lower the computational cost.

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
Restrictions apply to the availability of these data. The data were obtained from Shinjuku East Dental Office (the director is Makoto Kumon) and are available from the authors with the permission of Makoto Kumon or by sending a request to Makoto Kumon at: http://www.shinjukueast.com/doctor-staff/ (accessed on 3 August 2022).