Deep Fusion Feature Extraction for Caries Detection on Dental Panoramic Radiographs

Abstract: Caries is the most common oral disease and affects the oral health of billions of people around the world. Despite the importance of a well-designed detection method, studies on caries detection are still limited and restricted in performance. In this paper, we propose a computer-aided diagnosis (CAD) method to detect caries in dental radiographs. The proposed method consists of two main processes: feature extraction and classification. In the feature extraction phase, each 2D tooth image is used to extract deep activated features with a deep pre-trained model and geometric features with mathematical formulas. The two feature sets are then combined into a "fusion feature" set so that each can compensate for the other's shortcomings. The fusion feature set is fed into well-known classification models such as support vector machine (SVM), k-nearest neighbor (KNN), decision tree (DT), Naïve Bayes (NB), and random forest (RF) to determine which model best fits the feature set and yields the strongest result. The results show 91.70%, 90.43%, and 92.67% for accuracy, sensitivity, and specificity, respectively. The proposed method outperforms the previous state-of-the-art, with none of the measured factors falling below 90%; it is therefore promising for dentists and suitable for wide-scale implementation of caries detection in hospitals.


Introduction
Oral health plays a main role in people's overall health and quality of life throughout their lifetime, regardless of nationality, region, or religion. It is a state of health free from mouth and facial pain, oral and throat cancer, oral infections and sores, periodontal (gum) disease, tooth decay, tooth loss, and disorders that limit an individual's capacity for biting, chewing, speaking, and psychosocial wellbeing. The World Health Organization (WHO) estimated that around 3.5 billion people were affected by oral diseases in 2016, and the number continues to increase [1]. Caries, also known as tooth decay or oral cavities, is the most common disease affecting quality of life worldwide. Around 60-90% of school children and almost 100% of adults have dental cavities. Caries is the breakdown of teeth due to acid produced by bacteria. Untreated caries appears in different forms and colors, such as yellow or black, resulting in oral pain, facial pain, and tooth loss, and is a major noncommunicable disease. Treatment of oral diseases is usually expensive and not part of universal health coverage. Dental treatment accounts for 5% of total health spending and is generally a 20% out-of-pocket expenditure in many developed countries. The situation is worse in most developing countries, where people cannot afford oral health treatment services. Most caries conditions are treatable and preventable at an early stage. Detection of caries may consist of three phases: (1) segment (or isolate) the tooth to be diagnosed from the other teeth; (2) a preliminary diagnosis to determine whether a tooth has decay; (3) a comprehensive diagnosis to plan treatment for the decaying tooth and to classify the stage of decay into four groups (C1-C4) based on the condition and damage of the tooth. Although a nurse could perform phase one, phases two and three may require the practical experience of a dentist.
In this research, we aim to develop a method to make a preliminary diagnosis at phase two to reduce the dentist's effort on non-caries patients.
Recently, with the development of medical imaging technology, computer-aided diagnosis (CAD) systems have played a main role in the early detection of several diseases such as cancer, diabetes, and even caries [3,4]. Caries can be detected using several different methods and techniques. Some researchers proposed detection using photoacoustic images, wavelengths, or ultrasound images [5][6][7]. Other research has detailed an approach using RGB oral endoscope images [8,9]; however, most such approaches cannot capture the detailed structure of the tooth, especially the tooth root, and therefore struggle to support a caries diagnosis. Compared to oral endoscope imaging, dental radiographs provide greater image quality and reveal detailed structural deformities in the tooth [10]; the dental radiographic image is therefore the most widely used and preferred modality for detecting caries at an early stage.
Clinically, dental radiographs, which are used to identify tooth problems and evaluate oral health, are taken with low-dose X-rays to capture images of the interior of the teeth and gums. Radiographs are usually grayscale images and sometimes color images; however, color radiography requires significant investment, which is a barrier to entry for most hospitals, especially in low-income countries. To account for this, we focused on grayscale radiographs. Unfortunately, there is no reliable public dataset that provides high-quality images, descriptions, and reliable ground truth. In this field, most data are shared only under strict conditions, such as requiring all researchers to publish in a specific journal or to be a member of some group or event. Some researchers publish the private data used in their research, but such data usually suffer from problems with image quality, dataset size, lack of description and ground truth, and/or lack of long-term availability. In this research, the dataset and ground truth were provided by Dr. Kumon Makoto, director of the Shinjuku East Office, under a research contract with Tokai University. Dr. Kumon Makoto received The Academy of Clinical Dentistry Certified Physician qualification and registered as a professional dentist with No. 148529 on 19 May 2003. With 18 years of experience as a dentist and responsibility for over 200 patients per month, he could reliably provide a truthful dataset. More importantly, all the patients who participated in the dataset collection were real patients of Dr. Makoto and under his treatment, and each caries tooth in the dataset was confirmed in the patient's medical history during treatment. For these reasons, we believe that our dataset is trustworthy and can be used for research and publication purposes.

Related Works
In a dental examination using radiographs, caries can be recognized as a break in the tooth, parts missing from a tooth, or tooth loss. There are no obvious symptoms or criteria based on shape, size, or intensity for tooth decay other than the dentist's diagnostic experience, which poses a huge challenge for computer-aided diagnosis systems based on image processing. Wei Li et al. [11] proposed a method to detect tooth decay using a support vector machine (SVM) and a backpropagation neural network (BPNN). The method uses two feature sets for feature extraction: the autocorrelation coefficient and the gray-level co-occurrence matrix. SVM and BPNN models were then applied separately for classification. The results show that SVM reaches around 79% accuracy on the testing set, whereas BPNN reaches around 75%. These results are inefficient, and more work is needed for improvement. Moreover, the article gives no description of the dataset, which may raise questions about the research's reliability.
Yang Yu et al. [12] tried to enhance the backpropagation neural network layers and the feature extraction based on the autocorrelation coefficient matrix. The method was tested on 80 private tooth images (55 images for training and 35 images for testing) and shows 94% accuracy; however, the computational burden grows greatly as the number of layers in the backpropagation neural network increases. In addition, effectiveness measures such as sensitivity (SEN), specificity (SPEC), precision (PRE), and F-measure are not reported. Furthermore, the very small testing set (35 images) without cross-validation is a weakness, as it cannot cover the whole problem of tooth decay. Shashikant Patil [13] proposed an intelligent system with dragonfly optimization. Multi-linear principal component analysis (MPCA) was applied to extract the feature set, which was then fed into a neural network classifier trained using an optimization method, the adaptive dragonfly algorithm (ADA). The proposed MPCA non-linear programming with ADA (MNP-ADA) model was tested on 120 private tooth images divided into three test cases; each test case consisted of 40 images, with 28 used for training and 12 for testing. Other classifiers, such as fruit fly (FF) [14] and grey-wolf optimization (GWO) [15], and feature sets, such as linear discriminant analysis (LDA) [16], principal component analysis (PCA) [17], and independent component analysis (ICA) [18], were also tested for comparison. The final average results show that the MNP-ADA model reaches 90% accuracy, 94.67% sensitivity, and 63.33% specificity. The low specificity means that many non-caries patients are misclassified as caries patients; the distinction between caries and non-caries patients is therefore not efficient, and the performance needs to be improved.
Because that result shows a high accuracy value despite a low specificity value, one may doubt the balance between caries and non-caries images in the data. Our study also reports other measures, such as precision and F1-score, which are discussed in more detail in the Results section.
Nowadays, deep learning has made great breakthroughs in the machine learning field [19]. The convolutional neural network (CNN) is the most well-known deep learning model and can be used for many purposes, such as detecting new, unknown objects (transfer learning), fine-tuning weights, or feature extraction [20][21][22][23]; however, as far as we are aware, no previous study has applied deep learning to the caries classification problem, especially on dental radiographs, indicating a need for research in this area. In addition, a single CNN model may yield unsatisfactory performance and neglect a large space of unexplored potential in the image data. Thus, the deep activated features need to be improved by combining them with other sources of features. Consequently, in this study, we propose a deep activated model that best describes dental radiographs and improve the feature set's performance by combining it with mathematical features such as the mean, standard deviation (STD), and texture features. Each deep activated feature set is extracted carefully by testing the result of each candidate deep layer. The mathematical features are also tuned to obtain the minimal feature set while maintaining optimal performance. The combined feature set, called the "fusion feature" in this study, is then fed into different classification models to find the model that best fits the feature set and achieves the clearest distinction in the data. This study focuses on two key objectives: (i) stability, based on a large dataset that can describe the problem and on cross-validation to measure different situations of the problem; (ii) performance, namely a better result in accuracy and improved specificity, since the balance between sensitivity and specificity is sometimes more important than accuracy. Other measures are also reported for comparison with previous studies.
The rest of the paper is organized as follows. Section 3 describes the dataset and the proposed method and explains how to implement our method step by step. Section 4 presents the results of each step described in Section 3, with results from previous studies included for comparison. Section 5 provides a discussion, a summary, and conclusions.

Materials and Methods
This section describes the proposed method and gives information about our dataset. Since there is no well-known public dataset in this field, a carefully prepared dataset is important for evaluating the proposed method; thus, most researchers prefer to build their own datasets for experiments [11][12][13].

Radiographs Dataset
Teeth are diverse in size, shape, and structure, and the characteristics of tooth decay contribute even more to this diversity; therefore, the larger a dataset is, the better it can describe tooth decay. Our dataset was collected and labeled by a dentist from the Tokai hospital. The dataset was assessed for quality and ethics by Tokai University's committee for the right of use and publication. The dataset's images are panoramic oral radiographs of all teeth, whereas dental diagnosis and treatment should be made for each individual tooth; consequently, we needed to manually segment each tooth into a sub-image consisting of the target tooth to be diagnosed, together with its label. The segmentation is simple and can be done by any dentist or nurse; therefore, we anticipate no considerable effect on this study (Figure 2). To simulate real cases, where the area determined for each tooth varies with whoever performs the segmentation, we did not fix the area and range of the tooth to any size but kept it flexible depending on the tooth's size, position, and surrounding space. After segmentation, the dataset comprised 533 image samples: 229 caries teeth and 304 non-caries teeth. Since the difference between the numbers of caries and non-caries images is small (the caries/non-caries ratio is approximately 0.43/0.57), the dataset can be considered balanced. Each image is a two-dimensional grayscale image that contains the target tooth and its surrounding areas, such as black empty space or parts of neighboring teeth. The images present the original condition of the teeth without any modification in color, size, or angle. All images vary in size, matching the standard segmentation process, and are then fed into a fixed-size input layer for the feature extraction step.

Method
Caries detection mainly consists of two stages: feature extraction and classification. In the first stage, we experimented to find the deep activated features from pre-trained models that best describe the radiographs, using the Alexnet [20], Googlenet [24], VGG16 [25], VGG19 [25], Resnet18 [26], Resnet50 [26], Resnet101 [26], and Xception [27] networks. The experiments were conducted on the deepest layers of each model. Then, mathematical features, such as the mean and STD, and texture features, such as Haralick's features [28], were extracted to enrich the feature information. Both feature sets are then combined into fusion features. In the second stage, we test the feature set in classification models such as support vector machine (SVM), Naïve Bayes (NB), k-nearest neighbor (KNN), decision tree (DT), and random forest (RF). The whole process, along with the other sub-stages, is shown in Figure 3.

Features Descriptors using Pre-trained CNN Deep Networks
A pre-trained CNN is used in this study as a feature descriptor to extract deep activated features. The eight most well-known networks, Alexnet, Googlenet, VGG16, VGG19, Resnet18, Resnet50, Resnet101, and Xception, were applied to find the best pre-trained descriptor network. Table 1 describes each pre-trained model's specifications in detail, such as depth, number of parameters, size, and input size. The most commonly recommended layer for extraction is usually the last layer before the "prediction" layer, as it carries the deepest learned representation; therefore, in our experiments we tested several layers before the "prediction" layer (except the "drop" layer, because a drop layer carries essentially the same information as the preceding layer). Each image must be resized to a specific size before being fed into a particular network. Technically, the networks process RGB images, whereas the radiographs are grayscale; therefore, we replicated the grayscale channel to fill in the missing channels. The layer and network tests are shown in the Results section.

Geometric features are a fundamental factor in describing any kind of problem. Since these features are extracted using mathematical formulas, they are understandable and explainable. Despite the contribution of deep activated feature descriptors, geometric features can contain sufficient and relevant information that is noticeable to humans. Furthermore, deep activated features usually explore the data in a way that is impenetrable to humans, whereas geometric features are usually learned from experts' experience in the field; geometric features are therefore necessary and irreplaceable for solving a complex problem.
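As a minimal sketch of the grayscale-to-RGB preparation described above (the function name and the 224 × 224 target size are illustrative assumptions; the actual input size depends on the chosen network, see Table 1):

```python
import numpy as np

def prepare_for_cnn(gray, out_size=(224, 224)):
    """Resize a grayscale radiograph (nearest-neighbor) and replicate it
    into 3 identical channels, as pre-trained RGB networks expect."""
    h, w = gray.shape
    rows = np.arange(out_size[0]) * h // out_size[0]   # source row per output row
    cols = np.arange(out_size[1]) * w // out_size[1]   # source col per output col
    resized = gray[rows][:, cols]
    return np.stack([resized] * 3, axis=-1)            # shape: H x W x 3

tooth = np.arange(12, dtype=np.uint8).reshape(3, 4)    # toy "tooth crop"
x = prepare_for_cnn(tooth)
print(x.shape)                                          # (224, 224, 3)
```

In practice the resizing would use a library interpolation routine; the point here is only that the single grayscale channel is copied into all three input channels unchanged.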
In clinical practice, dentists manually determine the difference between caries and non-caries teeth based on the damage to the tooth's structure. This damage can be characterized by differences in size, shape, contrast, margin, intensity, and so on. Based on these characteristics, candidate features describing the state of the tooth are extracted, such as the mean, Haralick's features [28], and gray-level co-occurrence matrix (GLCM) features [29,30]. Table 2 describes the name and formula of each feature in detail. In the formulas, I(x, y) denotes the pixel value I at the coordinate point (x, y) of the candidate image N; p(i, j) denotes the (i, j)-th entry of the GLCM; N_g denotes the number of distinct gray levels in the image; and µ and σ denote the mean and standard deviation.
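To make the texture part concrete, the following sketch computes a small GLCM and three of Haralick's statistics with plain numpy (the quantization into 8 gray levels and the choice of a horizontal, distance-1 neighbor are illustrative assumptions; the paper's exact feature list is given in Table 2):

```python
import numpy as np

def glcm(img, levels=8):
    """Normalized gray-level co-occurrence matrix for horizontal,
    distance-1 pixel pairs."""
    q = (img.astype(float) * levels / (img.max() + 1)).astype(int)  # quantize
    m = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        m[a, b] += 1
    return m / m.sum()

def texture_features(p):
    """Contrast, homogeneity, and entropy computed from a GLCM p(i, j)."""
    i, j = np.indices(p.shape)
    contrast = ((i - j) ** 2 * p).sum()
    homogeneity = (p / (1 + np.abs(i - j))).sum()
    entropy = -(p[p > 0] * np.log2(p[p > 0])).sum()
    return contrast, homogeneity, entropy

img = np.array([[0, 50, 100], [50, 100, 200], [100, 200, 250]], dtype=np.uint8)
contrast, homogeneity, entropy = texture_features(glcm(img))
print(round(contrast, 3), round(homogeneity, 3), round(entropy, 3))
```

Library implementations (e.g. scikit-image's `graycomatrix`) would normally be used instead; the sketch only shows that these descriptors are simple closed-form statistics of pixel co-occurrences.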

Fusion Features
The features extracted by the deep networks and the geometric features are combined in this step: the full set of geometric features is concatenated to each deep activated feature set. The fusion feature is then fed into a classification model in the next step. Also, to measure the benefit of the geometric features and the fusion features over the deep activated features alone, we evaluated the performance by feeding each deep activated feature set and each fusion feature set into the classifier under the same conditions (Figure 4). The comparison between fusion and deep activated features is discussed in the Results section.
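The fusion step itself is a plain concatenation of the two vectors for each sample; a minimal sketch (the 2048- and 14-dimensional sizes are assumptions for illustration, not the paper's exact dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
deep_feat = rng.random(2048)   # e.g. a deep activated vector from a pooling layer
geo_feat = rng.random(14)      # e.g. mean, STD, and texture statistics

# Fusion feature: the geometric vector appended to the deep activated vector.
fusion = np.concatenate([deep_feat, geo_feat])
print(fusion.shape)            # (2062,)
```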

Classification
Each deep activated feature set is combined with the geometric features and then fed into the classification model. To test the efficiency of fusing deep activated and geometric features, the deep activated features are also tested separately and compared with the fusion features. Most tests were conducted using the well-known "optimal margin classifier", also known as the support vector machine (SVM) [31].
The SVM model aims to find the optimal hyperplane that best separates the data, caries and non-caries in this case. To moderate the number of training points, we apply the Gaussian radial basis function kernel in the classifier. For given training data D = {(x_i, y_i), i = 1 . . . N} with y_i ∈ {−1, 1}, the SVM classifier and the mapping function of the Gaussian kernel can be described as follows in Equations (1) and (2):

min_{w, b, ξ} (1/2)‖w‖² + C Σ_{i=1}^{N} ξ_i, subject to y_i(w^T φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, (1)

where C > 0 is the selected penalty parameter and ξ = {ξ_i} is the set of slack variables, and

K(x_i, x_j) = exp(−A‖x_i − x_j‖²), (2)

where K is the kernel function and A > 0 is a constant (the kernel width parameter). Furthermore, to guarantee the classification model that best fits the feature set, we also tested the best feature set on k-nearest neighbor (KNN) [32], decision tree (DT) [33], Naïve Bayes (NB) [34], and random forest (RF) [35].
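A minimal numpy sketch of the Gaussian kernel matrix (written here with a width parameter sigma, so the constant in the exponent corresponds to 1/(2·sigma²); the names are illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel: K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    sq_dist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X)
print(np.round(K, 4))   # diagonal is 1; off-diagonal is exp(-0.5)
```

In practice an SVM library (e.g. the RBF option of a standard SVM solver) evaluates this kernel internally; the sketch only shows the mapping the classifier relies on.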

Experimental Results
This section describes how we conducted the experiment and gives information about the experimental environment. The result of each step is explained in detail, and the best result is compared to the previous state-of-the-art.

Measures
Performance assessment of the proposed method relied on three well-known measures: accuracy (ACC), sensitivity (SEN), and specificity (SPEC). In addition, we report the precision or positive predictive value (PPV), negative predictive value (NPV), F1-score, area under the curve (AUC), and processing time to give a comprehensive view of the advantages of the proposed method and for reference in other research. The measures are calculated as follows in Equations (3)-(8):

ACC = (TP + TN)/(TP + TN + FP + FN), (3)
SEN = TP/(TP + FN), (4)
SPEC = TN/(TN + FP), (5)
PPV = TP/(TP + FP), (6)
NPV = TN/(TN + FN), (7)
F1 = 2 × PPV × SEN/(PPV + SEN), (8)

where true positive (TP) is the number of caries images classified correctly as caries; true negative (TN) is the number of non-caries images classified correctly as non-caries; false positive (FP) is the number of non-caries images classified wrongly as caries; and false negative (FN) is the number of caries images classified wrongly as non-caries.
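The measures above follow directly from the confusion-matrix counts; a small sketch with arbitrary toy counts (not the paper's results):

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, PPV, NPV, and F1-score
    from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)              # sensitivity (recall)
    spec = tn / (tn + fp)             # specificity
    ppv = tp / (tp + fp)              # precision / positive predictive value
    npv = tn / (tn + fn)              # negative predictive value
    f1 = 2 * ppv * sen / (ppv + sen)
    return acc, sen, spec, ppv, npv, f1

acc, sen, spec, ppv, npv, f1 = metrics(tp=90, tn=85, fp=10, fn=15)
print(round(acc, 4), round(sen, 4), round(spec, 4))   # 0.875 0.8571 0.8947
```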

Experiment and Result
In the first stage of the experiment, we determined the optimal layer in each deep pre-trained network that best represents the problem. Table 3 lists the feature sets extracted from each deep pre-trained network and their respective layers. The extracted feature sets were tested with a support vector machine model to reach the final classification result. There is no reference for choosing the layer in each network; therefore, we tried several layers before each network's prediction layer. For some networks, the best layer is the pooling layer, whereas for others it is an earlier layer. The highest performance was reached by the "fc8" layer of the VGG16 model, with an accuracy of 90.57%, sensitivity of 91.30%, and specificity of 90%. Furthermore, Resnet50, Resnet101, and Xception also show very promising results of around 88% accuracy. Notably, none of the deep activated feature sets fall below 80% accuracy, demonstrating that deep activated features are effective. The highest performance for each measured factor for each network is highlighted in bold.
To further enhance the performance, we combined each deep activated feature set with the geometric features and fed them into the SVM model (Table 4). The results show that the fused Xception features improved the most: after the combination, Figure 5 shows that the fusion features of the Xception network became the most prominent, improving the performance to 90.45%, 100%, and 86.67% for accuracy, sensitivity, and specificity, respectively. The largest difference is the improvement in sensitivity from 91% to 100%; the fused Xception feature set thus demonstrates the geometric features' contribution when properly combined with deep activated features. Although their performance does not match that of the Xception fusion features, Resnet18 and Googlenet also show improvements, from 83.02% to 86.79% accuracy and from 84.91% to 88.68% accuracy, respectively. Notably, none of the fusion feature sets has lower accuracy than its respective deep activated features. In conclusion, fusion features show a clear advantage over deep activated feature sets alone. The highest performance for each measured factor for each network is highlighted in bold.
To design and evaluate an appropriate caries detection method, we randomly divided the data into training and testing sets for cross-validation. K-fold cross-validation is a well-known, reliable technique for testing the robustness of a method. Applying k-fold cross-validation demonstrates the proposed method's reliability in covering the whole problem and adapting to unknown samples; this technique was also used to prevent the method from overfitting our testing data.
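A minimal sketch of how such k-fold splits can be generated (k = 5 and the shuffling seed are illustrative assumptions; the paper does not specify its fold count here):

```python
import numpy as np

def kfold_indices(n, k=5, seed=0):
    """Shuffle the n sample indices once, then yield (train, test) index
    pairs so every sample is tested exactly once across the k folds."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    for i in range(k):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

# With 533 samples (the dataset size), every sample lands in exactly one test fold.
tested = sorted(int(v) for _, test in kfold_indices(533) for v in test)
print(tested == list(range(533)))   # True
```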
We then used the most prominent results obtained with the different classification models to determine which model best fits the feature set. In this study, decision tree (DT), k-nearest neighbor (KNN), Naïve Bayes (NB), random forest (RF), and support vector machine were used (Table 5). In this step, we also applied k-fold cross-validation to prevent overfitting and to calculate the final average assessment. The support vector machine is clearly the most dominant model, showing an accuracy of 91.70%, sensitivity of 90.43%, and specificity of 92.67%. As mentioned in Section 3.1, the dataset is considered balanced, with a small difference between the numbers of caries and non-caries samples; the precision (also known as positive predictive value) and recall (also known as sensitivity) pair also show promising values of 91.51% and 90.43%, respectively. Finally, we generated the receiver operating characteristic (ROC) curves for each classifier in each fold of the experiment. The mean ROC curve of each classifier is interpolated in each graph from Figure 6a to Figure 6e and compared in Figure 6f.

For a comprehensive assessment, the execution time of the proposed caries detection method was also computed. The experiments were conducted in the Matlab 2020a environment on Windows 10. The main process was performed on a Core i7-9750HF CPU, supported by a GeForce GTX 2060 graphics card.
Each function's processing time is carefully considered, since these are factors that determine the complexity of the method. Table 6 shows that the whole process takes 13.79 s in total, and the most complex function, the deep activated feature extraction, takes less than 10 s. The geometric feature calculation takes only 2.52 s. Based on these results, we consider the proposed method to perform considerably well and to be capable of wide implementation, even on low-specification computers. Based on the prediction and evaluation times, a segmented tooth image takes only 0.28 s (less than 1 s) to be classified as caries or non-caries. This is practical for dentists, even in a large hospital with a huge number of patients. Lastly, the proposed method was compared with previous state-of-the-art techniques (Table 7). Because the different methods were evaluated on different datasets, each dataset's size and complexity affect the performance. For a fair comparison, we detail each method and describe its differences and advantages/disadvantages. In addition, because some methods are not fully described but have been tested on other datasets in other papers, we provide a reference to the appropriate study along with a description. The comparison table shows that [11,12] have disappointing performance, whereas [13] performs much better; however, considering its accuracy of 90.00%, sensitivity of 94.67%, and specificity of 63.33%, we can see an imbalance in the data as well as a low-performance result. Our proposed method achieved 92.67% specificity, a 29.34% improvement over the other methods, while sensitivity remains above 90%. The 4.24% decrease in sensitivity is a worthy exchange for the improvement in specificity.

Conclusions
In this article, we present a caries detection method using radiographic images. First, the radiographic images were manually labeled by dentists as either caries or non-caries. Then, in the feature extraction process, tooth images were used to extract the deep activated features; the proper layer for extracting deep activated features from each deep pre-trained model was determined through experiments. The geometric features were also extracted and combined with the deep activated features to build fusion features. The optimal feature set was found by comparing the performance of the deep activated features with that of the fusion features, and the set of geometric features was reduced to its minimum while retaining the optimal information. Next, we fed the fusion features into classification models such as support vector machine (SVM), decision tree (DT), k-nearest neighbor (KNN), Naïve Bayes (NB), and random forest (RF) to classify caries and non-caries images. Our proposed method achieved 91.70%, 90.43%, and 92.67% for accuracy, sensitivity, and specificity, respectively. Compared to previous state-of-the-art methods, we improved the accuracy by 1.7%, from 90% to 91.70%, and the specificity by 29.34%, from 63.33% to 92.67%; the sensitivity also remained good at 90.43%. The proposed method makes two key contributions: the first is to find the best feature set, namely the combination of deep activated features and geometric features, and then fit a proper classification model to describe the problem; the second is to enhance the performance by improving the specificity. The performance of a deep activated feature set is not proportional to the complexity or size of the model: the VGG16 deep activated features are better than Xception's, whereas the fusion result is the opposite. Our choice of deep activated feature plays an important role; however, the choice of analytically calculated features contributed equally to the result.
Finding which deep activated features are compatible with the analytically calculated features is more important than finding the best deep activated feature among all pre-trained models. While most research tries to build networks as deep as possible to improve learning performance, our results show that performance is sometimes unrelated to the network's depth. More importantly, the combination with calculated features may play a key role in improving performance and therefore cannot be exchanged for pre-trained model depth. The processing time, 13.79 s for the whole experiment and 0.28 s per prediction, demonstrates that the method can be widely implemented on a low-specification computer with trivial time consumption. Nonetheless, despite its advantages over the previous state-of-the-art, this study is limited in that caries detection was conducted on manually segmented teeth. In future work, we will extend our research into a fully automated system by performing automatic segmentation. We are also greatly interested in extending our method to classify different caries stages using three-dimensional approaches. With that, our system will be an adjunct tool for both experienced and junior dentists.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
Restrictions apply to the availability of these data. The data were obtained from the Shinjuku East Dental Office (director: Makoto Kumon) and are available from the authors with the permission of Makoto Kumon, or by sending a request to Makoto Kumon at: http://www.shinjukueast.com/doctor-staff/.