1. Introduction
Artificial intelligence (AI) models have made remarkable advancements in various fields, with deep convolutional neural networks (CNNs) [1] emerging as a powerful subset of AI, especially for processing and analyzing images. These networks, inspired by the structure of the visual cortex in the human brain, show superior performance in tasks such as image classification, object detection, and image segmentation [2]. In the field of medical imaging, these networks have demonstrated promising capabilities in detecting and diagnosing various diseases such as breast cancer, heart disease, and brain tumors [3,4]. Their performance is often reported to be comparable to that of experienced professionals, significantly reducing the time required for diagnosis [5,6,7].
Recently, dental medicine has also started to benefit from such deep learning techniques [8]. Specifically, these techniques have been applied to panoramic radiographs and cone-beam computed tomography (CBCT) images with the aim of assisting clinicians in detecting and analyzing dental conditions and diseases in the maxillofacial region [9,10,11]. Examples include the detection of maxillary sinus mucosa [12], pharyngeal airway space [13], calcifications of the cervical carotid artery [14], jaw cysts [15,16], supernumerary mesio-buccal root canals on maxillary molars [17], vertical root fractures [18], and periapical lesions (PALs).
PALs are among the most frequent pathological findings in dental images. They usually present as bacteria-induced osteolytic areas around the root tips, a few millimeters in diameter [19,20]. PALs are conventionally analyzed in radiographs, whereas CBCT images often reveal these lesions as incidental findings [20,21]. Widely used conventional intra- and extra-oral radiographs [22] involve lower radiation doses but suffer from superimposition issues due to their projective nature, whereas CBCT allows fully three-dimensional (3D) imaging of the maxillofacial region at the cost of a higher dose. Owing to its volumetric nature, CBCT has been shown to improve the detection of PALs compared with radiographs [23,24,25]. Manually identifying PALs with high sensitivity (recall) in either imaging modality requires considerable experience to prevent overlooked findings. As a result, automated deep learning-based methods for PAL detection in radiographs or CBCT imaging data have been proposed [16,26,27,28,29,30,31,32,33,34]. Serving as the foundation of this study, the promising CNN-based approach for periapical lesion detection in CBCT images proposed in [32] achieved a sensitivity of 97.1% and a specificity of 88.0% when evaluated on 144 CBCT volumes with 206 lesions.
The success of any deep CNN-based approach rests on the assumption that training and testing data come from the same distribution. When the test data deviate from the training data distribution, the ability of deep neural networks to generalize and perform well on the new data degrades [35]. This phenomenon is often observed in clinical datasets due to factors such as anatomical anomalies, image artifacts, or occlusions, which shift the data distribution. In light of this, our validation study aims to provide a thorough statistical evaluation of the effectiveness and generalization capability of the CNN-based PAL-detection model proposed in [32] on an entirely new, previously unseen clinical CBCT dataset whose distribution is shifted with respect to the data used to train the model. The null hypothesis of this validation study is that the method proposed in [32] delivers an inferior result when applied to our new, challenging evaluation dataset from clinical practice.
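Such a non-inferiority framing can be illustrated with a one-sided exact binomial test: the null hypothesis of inferiority is rejected when the observed detection rate is significantly above the reference rate minus a tolerance margin. The reference rate and margin below are hypothetical placeholders for illustration, not the study's actual design parameters.

```python
from scipy.stats import binomtest

def noninferiority_pvalue(detected, total, reference_rate, margin=0.05):
    """One-sided exact binomial test of H0: true rate <= reference_rate - margin.

    A small p-value rejects inferiority, i.e. the observed rate is not
    meaningfully below the reference. All numbers used with this function
    here are illustrative placeholders, not the study's actual parameters.
    """
    return binomtest(detected, total, reference_rate - margin,
                     alternative="greater").pvalue

# Hypothetical example: 180 of 200 lesions detected, reference 88%, margin 5%
p = noninferiority_pvalue(180, 200, 0.88, margin=0.05)
```

The same helper applies unchanged to specificity by passing true-negative and total-negative counts instead.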
3. Results
Table 4 shows that the overall sensitivity of the deep learning-based lesion detection approach, evaluated on all present teeth, was 86.7% (95% CI: 82.3–90.3%) when compared with the expert-derived ground truth. The specificity of the software was 84.3% (95% CI: 82.8–85.6%). The null hypothesis of inferiority of the software with respect to sensitivity could not be rejected, while the null hypothesis of inferiority with respect to specificity could be rejected. In our dataset, where images contained either one or both jaws, any jaw could have missing teeth. Out of a total of 669 missing teeth, the software produced 42 false positive lesion predictions, while 627 missing teeth were correctly identified as negatives. The confusion matrices of the overall results for present teeth, as well as for present and missing teeth combined, are given in Table 5 and Table 6.
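For reference, sensitivity, specificity, and approximate 95% confidence intervals of the kind reported above can be computed from confusion-matrix counts as in the following sketch. The Wilson score interval is used here purely for illustration and may differ from the interval method used in the study; all counts are hypothetical.

```python
import math

def wilson_ci(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (approx. 95% CI)."""
    p = successes / total
    denom = 1.0 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return centre - half, centre + half

def sensitivity_specificity(tp, fn, tn, fp):
    """Per-tooth sensitivity (recall) and specificity, each with a 95% CI."""
    sens, spec = tp / (tp + fn), tn / (tn + fp)
    return (sens, wilson_ci(tp, tp + fn)), (spec, wilson_ci(tn, tn + fp))

# Hypothetical counts (not the study's): 180 TP, 25 FN, 2100 TN, 390 FP
(sens, sens_ci), (spec, spec_ci) = sensitivity_specificity(180, 25, 2100, 390)
```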
Table 4 also illustrates individual lesion detection results stratified per jaw. For the upper jaw (a total of 1598 present teeth), the sensitivity is 87.8% (95% CI: 82.3–92.0%), and the specificity is 82.3% (95% CI: 80.2–84.3%). For the lower jaw (a total of 1349 present teeth), the sensitivity is 84.6% (95% CI: 76.2–90.9%), and the specificity is 86.4% (95% CI: 84.4–88.3%). The difference in sensitivity between the upper and lower jaw is 3.2% (95% CI: −5.2 to 11.5%); this difference is not significant according to Fisher’s exact test. The difference in specificity between the upper and lower jaw is −4.1% (95% CI: −6.9 to −1.4%), which is significant according to Fisher’s exact test.
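The jaw comparison relies on Fisher's exact test applied to 2×2 contingency tables; a minimal sketch, using hypothetical per-jaw counts rather than the study's actual numbers:

```python
from scipy.stats import fisher_exact

# Hypothetical sensitivity comparison: rows are jaws,
# columns are [true positives, false negatives] among lesions.
table = [[115, 16],   # upper jaw (illustrative counts)
         [66, 12]]    # lower jaw (illustrative counts)
odds_ratio, p_value = fisher_exact(table)
# A p_value above 0.05 would indicate no significant sensitivity difference.
```

The specificity comparison works analogously with [true negatives, false positives] per jaw.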
Moreover, we illustrate lesion detection results stratified per tooth type (combined for both jaws) in Table 4. Sensitivities are lower for the three categories in which the total number of lesions was also comparatively low (third molars, canines, lateral incisors), while the average sensitivity for the remaining five tooth categories is higher.
Finally, we analyze the lesion detection results with respect to lesion classifications (periapical index scores according to Estrela et al. [20]). In Figure 2, we plot a histogram of true positives and false negatives per lesion type, illustrating that for the smallest lesion type (class 1, with a diameter of periapical radiolucency between 0.5 and 1 mm), the sensitivity is low, while for classes 2 through 5 (diameters of periapical radiolucency larger than 1 mm; see also Table 2), the sensitivities are much higher.
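The per-class true positive/false negative tally underlying such a histogram can be sketched as follows; the list of lesions and the data layout are assumptions for illustration only.

```python
from collections import defaultdict

def recall_per_class(lesions):
    """lesions: iterable of (pai_class, was_detected) pairs.
    Returns {pai_class: (tp, fn, recall)} per periapical index class."""
    counts = defaultdict(lambda: [0, 0])  # class -> [tp, fn]
    for pai_class, detected in lesions:
        counts[pai_class][0 if detected else 1] += 1
    return {c: (tp, fn, tp / (tp + fn))
            for c, (tp, fn) in sorted(counts.items())}

# Hypothetical data: a class 1 lesion missed, larger lesions detected
lesions = [(1, False), (1, True), (2, True), (3, True), (3, True), (4, True)]
stats = recall_per_class(lesions)  # stats[1] == (1, 1, 0.5)
```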
Exemplary qualitative results of the software are shown in Figure 3.
4. Discussion
Recent machine learning methods, especially deep neural networks for assisting experts in the detection and segmentation of lesions in medical imaging data, have shown tremendous success, but they often struggle to generalize to datasets from clinical practice [35]. We have performed a thorough evaluation and non-inferiority testing of a recently published algorithm for automated periapical lesion segmentation from dental CBCT images [32]. The algorithm was hidden behind a graphical user interface that solely produced a lesion segmentation given an input image from the new, single-use testing dataset used in this study. The dataset comprises 196 subjects with images of adult upper and lower jaws, including tooth gaps, dental restorations, implants, and impacted third molars (see Table 1). Additionally, the new dataset was collected with an inclusion criterion that allowed up to 11 missing teeth per jaw. This led to a significantly higher number of missing teeth compared with the dataset in [32], where the aim was to include jaws with a minimal number of missing teeth. Thus, our new evaluation dataset reflects challenging circumstances encountered in clinical practice. Moreover, our evaluation protocol was very strict in defining false positive findings, since a single false positive (FP) voxel in the segmentation already led to an FP prediction (see Figure 3a), imposing a hard but realistic scenario for the algorithm.
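This strict criterion — a single predicted lesion voxel within a tooth's region already yields a positive prediction for that tooth — can be sketched as below. The label-volume layout (one integer label per tooth, 0 for background) is an assumption for illustration, not necessarily the software's internal representation.

```python
import numpy as np

def per_tooth_predictions(lesion_mask, tooth_regions):
    """lesion_mask: boolean voxel array of predicted lesion voxels.
    tooth_regions: integer voxel array, 0 = background, k = tooth k.
    A tooth is predicted lesion-positive if ANY lesion voxel lies
    inside its region (the strict FP-defining criterion)."""
    preds = {}
    for tooth_id in np.unique(tooth_regions):
        if tooth_id == 0:  # skip background
            continue
        preds[int(tooth_id)] = bool(lesion_mask[tooth_regions == tooth_id].any())
    return preds

# Tiny hypothetical volume: tooth 1 contains one lesion voxel, tooth 2 none
regions = np.array([[1, 1], [2, 2]])
mask = np.array([[True, False], [False, False]])
preds = per_tooth_predictions(mask, regions)  # {1: True, 2: False}
```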
The algorithm could be successfully applied to present and missing teeth from 195 of the 196 subjects, i.e., 99.5% of the total dataset. Our main result is a sensitivity of 86.7% and a specificity of 84.3% in detecting PALs at present teeth. The non-inferiority tests, which were designed upon sensitivity and specificity estimates derived from the proof-of-concept evaluation in [32], provided enough evidence to reject the null hypothesis for specificity but did not do so for sensitivity. Despite this drop in sensitivity, we still consider our absolute performance on this challenging dataset very promising (see also our qualitative results in Figure 3a–c), since both the sensitivity and specificity exceed the threshold of 80%, which, according to the systematic review in [24], can be interpreted as indicating excellent results. One reason for false positives might be that some lesions are located close to the incisive, inferior alveolar, and mental nerves, as illustrated in Figure 3d. Furthermore, artifacts caused by root canal fillings or dental implants potentially pose problems for the deep CNN (see Figure 3e). We also studied the algorithm’s performance at missing teeth and found that the overall specificity for present and missing teeth combined increases beyond the 84.3% obtained for present teeth alone (see confusion matrix in Table 6).
Regarding related work on automated detection of PALs in CBCT images, only a limited number of studies have been published. Zheng et al. [28] proposed an anatomically constrained Dense U-Net model, which they evaluated on 20 CBCT images, obtaining a sensitivity of 84.0% and a precision of 90.0% in a root-based evaluation. In addition, Orhan et al. [29] used a U-Net-based model to evaluate PAL detection in CBCT images and achieved a sensitivity of 92.8%. Setzer et al. [27] evaluated a U-Net-based model on 2D slices from 20 CBCT images and achieved a sensitivity of 93.0% and a specificity of 88.0% in PAL detection. Recently, Calazans et al. [34] proposed a classification model based on a 2D Siamese network combined with a DenseNet-121 CNN [40]. Their model was evaluated on 1000 coronal and sagittal slices extracted from CBCT images and achieved a sensitivity of 64.5% and a specificity of 75.8% in classifying PALs.
Comparing our study with those conducted by Zheng et al. [28] and Orhan et al. [29] is difficult due to the lack of reported specificities and details regarding negative class examples in their research. Relying on the precision metric for comparison may be misleading, since our dataset is highly imbalanced, whereas their datasets have a well-balanced distribution that does not reflect real-world clinical scenarios; precision is sensitive to the class distribution, making it less suitable in this context. In terms of sensitivity and specificity, our study outperforms the results of Calazans et al. [34], as they report a higher number of false negatives and false positives. While our sensitivity and specificity are lower than those of the closely related and best-performing work by Setzer et al. [27], it is important to note that their evaluation consisted of only 20 CBCT test images with 61 roots. We therefore argue that our evaluation protocol is stricter than theirs, owing to our extensive single-use testing dataset collected from clinical practice. Furthermore, many of these works use models trained on 2D slices, thus neglecting valuable 3D information.
In CBCT imaging, PALs are often not the primary clinical question; however, secondary PAL findings occur frequently and have to be documented by dentists who are often not radiological experts or may not have sufficient time to assess the CBCT images in great detail. In such cases, the help of an algorithm is invaluable for preventing overlooked findings, even at the cost of a larger number of false positives, which can be ruled out comparatively easily, either visually or via additional clinical assessment of the respective tooth.
To study our evaluation results in more detail, we also analyzed different stratifications of the testing dataset. While collecting the expert ground truth, lesions were classified by diameter into five periapical index score categories [20]. We see from Figure 2 that for lesion classes 2 through 5 (lesions with diameters larger than 1 mm), the algorithm produces few false negatives, i.e., high recall, while for lesion class 1 (lesions between 0.5 and 1 mm in diameter), a considerable proportion of the lesions in our dataset were missed. From a radiological point of view, such small lesions are generally challenging to detect, as previously reported by Tsai et al. [41] when studying simulated lesions in vitro on radiographs and CBCTs. If we use the lesion class stratification to compute the sensitivity solely for lesion classes 2 through 5, it reaches 90.4% (95% CI: 86.3–93.7%), which we consider a meaningful recall in clinical practice, such that the use of the algorithm can be suggested for lesions larger than 1 mm.
Another stratification we investigated was anatomical. Our results indicate that the algorithm provides a significantly higher specificity for teeth in the lower jaw, while the sensitivity difference between the lower and upper jaw was not statistically significant. We assume that this decrease in false positive findings for the lower jaw is due to its better radiological assessability compared with the upper jaw, since the contrast between radiolucent lesions and alveolar bone or teeth is higher in the lower jaw (see Figure 3a (at the second molar), c and d). Moreover, teeth in the upper jaw are located close to the maxillary sinus, such that the thin bony maxillary sinus floor or potential sinus membrane alterations might confuse the algorithm (see Figure 3b,f).
When looking at different tooth categories in Table 4, where teeth are assessed for both jaws combined, we notice three tooth groups with lower sensitivity: wisdom teeth (third molars), canines, and lateral incisors. Wisdom teeth are generally rarer in the population, since many are removed in adulthood or never erupt. This is also reflected in our dataset, leading to a low number of lesions as well (see Table 4). Moreover, unlike lateral incisors and canines, molars are the teeth most affected by PALs, according to [42]. Due to the lower number of lesions in the abovementioned three tooth groups (see Table 4 for numbers of lesions), false negatives have a larger relative influence. Additionally, class 1 lesions of smaller diameter are more strongly represented in two of these three tooth categories in our dataset (third molars and lateral incisors; see Table 3). We assume that the combination of these aspects leads to the lower sensitivity, while the average sensitivity of the remaining five tooth categories, each of which contains a larger number of lesions, is higher.
One limitation of our study is that the dataset for this evaluation was collected at the same hospital as in [32]. While the focus was on evaluating generalizability via the inclusion of challenging data representative of clinical practice, we can therefore not draw any conclusions about generalization to data from different sites. Moreover, the impact of anatomic variability due to differences in ethnicity, as demonstrated, e.g., in [43] regarding the occurrence of radix entomolaris in an Asian population, has not been taken into account. Another limitation is that our testing dataset contained only a low number of lesions of periapical index score 1. We conclude that in order to improve the algorithm and to draw stronger conclusions for small lesions as well, a re-training of the machine learning method on more data containing class 1 lesions is required, which we see as potential future work.
In summary, we see our results as a very promising indication that machine learning can play an important role in assisting experts in dental practice, where high recall is needed to prevent overlooked findings.