Article

An Attention-Enhanced Deep Learning Framework for Multi-Label Dental Findings Classification from Panoramic Radiographs

1 Department of Computer Science, College of Computing and Information Technology, Shaqra University, Shaqra 11911, Saudi Arabia
2 InnoV’COM Laboratory-Sup’Com, University of Carthage, Ariana 2083, Tunisia
* Author to whom correspondence should be addressed.
Information 2026, 17(5), 465; https://doi.org/10.3390/info17050465
Submission received: 22 March 2026 / Revised: 23 April 2026 / Accepted: 4 May 2026 / Published: 11 May 2026

Abstract

Panoramic radiographs are widely used in dental practice due to their ability to provide a comprehensive view of the teeth, jaws, and surrounding anatomical structures in a single examination. However, automated interpretation remains challenging because multiple conditions may co-exist within a single image, class distributions are highly imbalanced, and several findings exhibit subtle radiographic characteristics. This study presents a deep learning framework for multi-label dental findings classification using panoramic radiographs from the publicly available VZRAD2 dataset. Following a label curation process, eleven clinically relevant classes were retained, including diseases, treatments, and anatomical structures. The proposed EfficientNet-B4-CBAM model integrates an EfficientNet-B4 backbone with a Convolutional Block Attention Module (CBAM) to enhance feature representation through channel and spatial attention. EfficientNet-B4 and ResNet50 were used as baseline models for comparison under a unified training protocol. The training pipeline incorporates data augmentation, weighted sampling to address class imbalance, AdamW optimization, and Binary Cross-Entropy with Logits loss for multi-label learning. On the validation set, the proposed model achieved the highest micro-F1 score of 0.8567, compared to 0.8424 for EfficientNet-B4 and 0.8469 for ResNet50. ROC analysis showed comparable separability across models, with micro-AUC values of 0.946 (EfficientNet-B4-CBAM), 0.947 (EfficientNet-B4), and 0.960 (ResNet50). Class-wise evaluation indicated strong performance for visually distinct findings such as impacted tooth, implant, filling, and root canal treatment, while anatomically diffuse or underrepresented classes remained more challenging. Grad-CAM visualizations suggest that the model focuses on clinically relevant regions, supporting interpretability. Overall, the results indicate that attention-enhanced convolutional models can provide effective and interpretable support for multi-label dental findings classification. However, the observed performance improvements are modest, and further validation on independent datasets, along with clinical evaluation, is required to confirm generalizability and real-world applicability.

1. Introduction

Dental radiography plays a fundamental role in screening, diagnosis, treatment planning, and longitudinal monitoring in modern oral healthcare. Among the available imaging modalities, panoramic radiography is particularly advantageous because it provides a comprehensive view of the entire dentition, both jaws, the maxillary sinus floor, and surrounding anatomical structures in a single, low-cost examination. This wide field of view also makes panoramic images highly suitable for artificial intelligence (AI) applications, as a single image can simultaneously contain multiple restorations, pathologies, and anatomical landmarks [1,2,3,4,5,6].
However, automated analysis of panoramic radiographs remains challenging. Compared to intraoral imaging, panoramic images exhibit lower sharpness, structural overlap, and significant variability in anatomical presentation. Furthermore, class distributions are often highly imbalanced, and subtle findings, such as periapical lesions or sinus-related conditions, may occupy only small regions of the image, making detection difficult [1,6,7,8,9,10,11,12].
Recent literature demonstrates rapid progress in AI-based analysis of panoramic radiographs. Previous studies have reported strong performance in tasks such as tooth identification, periodontal bone loss assessment, caries detection, sinus analysis, lesion localization, and osteoporosis screening [5,6,8,13,14]. Disease-specific applications have also been explored, including pediatric furcation lesion detection [15], periodontitis staging [16], general dental disease classification [1], apical and sinus floor analysis [3], multistage caries detection [9], periapical lesion detection [12], and implant planning assistance [17]. Despite these advances, most existing approaches focus on binary or narrowly defined tasks, limiting their applicability to real-world clinical scenarios where multiple findings frequently co-exist within a single radiograph. In addition, the lack of interpretability remains a critical concern in medical imaging, as clinicians require models that not only perform well but also provide meaningful explanations aligned with anatomical structures [18,19,20].
In this context, the present study develops a clinically oriented and interpretable deep learning framework for multi-label dental findings classification from panoramic radiographs using the VZRAD2 dataset. Rather than introducing a fundamentally new architecture, this work adopts an application-driven perspective, focusing on addressing key challenges that hinder real-world deployment, including multi-label prediction, class imbalance, and the need for model interpretability in medical imaging. The proposed model is based on an EfficientNet-B4 backbone enhanced with a Convolutional Block Attention Module (CBAM), while EfficientNet-B4 and ResNet50 are used as strong baselines under a unified experimental setting.
The main contributions of this work are fourfold. First, we construct a curated eleven-class multi-label classification task from a large-scale panoramic radiograph dataset, ensuring both clinical relevance and data consistency. Second, we develop an attention-enhanced framework that improves feature representation by incorporating channel and spatial attention mechanisms tailored to panoramic imaging. Third, we conduct a systematic and reproducible evaluation using multiple complementary metrics, including micro-F1, macro-F1, AUC, subset accuracy, and class-wise analysis, providing a comprehensive assessment of model performance under class imbalance. Fourth, we integrate explainability through Grad-CAM to verify that model predictions are grounded in clinically meaningful regions, thereby supporting interpretability and trust in computer-aided diagnosis.
Rather than proposing a fundamentally new architectural design, this work offers a structured, reproducible, and clinically grounded framework for multi-label panoramic radiograph analysis, together with practical insights into model behavior under class imbalance and interpretability constraints, contributing toward the development of reliable and interpretable AI-assisted screening tools for dental practice.

2. Related Work

Recent advances in artificial intelligence (AI) have significantly expanded the capabilities of dental image analysis, moving beyond simple classification tasks toward more clinically relevant pathology detection and interpretation. Early work by Cejudo et al. demonstrated the feasibility of deep learning for radiograph-type classification, highlighting its role in automated data organization for downstream diagnostic tasks [2]. Subsequent studies, such as Almalki et al., extended this paradigm to multi-class dental disease classification on panoramic radiographs, demonstrating the potential of deep learning for direct diagnostic support [1].
More advanced approaches have incorporated detection and segmentation techniques to better capture clinically meaningful structures. For instance, Shon et al. combined U-Net and YOLOv5 to classify periodontitis stages, illustrating the benefits of integrating localization and classification [16]. Similarly, Karamüftüoğlu et al. applied deep learning to pediatric dentistry by detecting furcation involvement in primary molars [15], while Wu et al. developed AI-based systems for sinus floor analysis and implant planning [17]. These studies reflect a growing trend toward task-specific and clinically oriented AI applications in dental radiography.

In parallel, substantial progress has been made in tooth- and structure-level analysis. Yilmaz et al. compared Faster R-CNN and YOLOv4 for tooth detection, demonstrating the effectiveness of object detection frameworks in panoramic imaging [21]. Segmentation-based approaches have also been widely explored, including U-Net-based models for agenesis detection [22], grid-aware attention mechanisms for tooth segmentation and orientation [23], and Teeth U-Net for panoramic segmentation tasks [24]. A systematic review by Bonfanti-Gris et al. further confirmed the maturity of segmentation research, although it highlighted variability in performance and a lack of standardized evaluation protocols [13]. Additionally, data augmentation and synthesis techniques, such as Pano-GAN, have been proposed to address data scarcity and improve model robustness [25].
Disease-specific studies have also shown promising results. Multistage caries detection [9], ResNet-based caries classification, and detection of lesions under prostheses demonstrate the applicability of deep learning across various diagnostic scenarios. In particular, recent work on periapical lesion detection has achieved encouraging results, although challenges remain due to subtle radiographic features and variability in presentation [12].

Review studies have documented the rapid growth of AI in panoramic imaging, emphasizing both its potential and the variability in methodological rigor and validation practices [5,6,8]. Despite these advances, most existing studies focus on binary or narrowly scoped tasks, limiting their applicability to real-world clinical settings where multiple conditions frequently co-exist. Moreover, many models operate as “black boxes,” raising concerns about interpretability and clinical trust. To address this, explainable AI (XAI) techniques such as Grad-CAM have gained popularity, as they enable visualization of model attention on input images [18]. While such methods can enhance transparency and support clinical validation, recent reviews emphasize that their outputs must be interpreted cautiously and verified against domain expertise [19,20].

In light of these limitations, the present study focuses on a multi-label classification framework that integrates attention mechanisms and explainability. By combining EfficientNet-B4 with CBAM and incorporating Grad-CAM analysis, the proposed approach aims to address both performance and interpretability challenges in panoramic radiograph analysis.

3. Materials and Methods

3.1. Dataset Source and Label Curation

The experiments were conducted using the VZRAD2-v6 dataset [26], which was obtained from the Roboflow platform (Roboflow Universe) and exported on January 19, 2024. The dataset consists of 8429 panoramic dental radiographs annotated using the YOLOv8 format. During the export process, the dataset was automatically preprocessed by Roboflow, including auto-orientation, EXIF metadata removal, resizing to 640 × 640 pixels, random brightness adjustment in the range of −15% to +15%, and Gaussian blur with a kernel range between 0 and 0.7. These transformations aim to standardize image quality and improve model robustness. For the purpose of this study, the dataset annotations were converted into multi-label classification targets. The implementation utilized the CSV files generated by Roboflow to construct label vectors for the training and validation sets. A label curation process was applied to improve data quality and ensure stable model training. Labels were excluded if they met one of the following criteria: (i) low frequency, defined as fewer than 50 positive samples in the training set, or (ii) annotation inconsistencies, including typographical variations or ambiguous definitions. The threshold of 50 samples was selected to ensure sufficient representation for learning reliable feature patterns and to mitigate overfitting in the multi-label setting. After filtering, eleven classes were retained: Caries, Crown, Filling, Implant, Mandibular Canal, Missing Teeth, Periapical Lesion, Root Canal Treatment, Root Piece, Impacted Tooth, and Maxillary Sinus. Since multiple conditions may co-exist in a single radiograph, the task was formulated as a multi-label classification problem, and label vectors were used in conjunction with the BCEWithLogitsLoss function during training. While this filtering improves robustness and statistical reliability, it may limit the inclusion of rare conditions and reduce dataset diversity; this trade-off is acknowledged as a limitation. The resulting class distribution is summarized in Table 1.
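To make the label construction concrete, the following minimal Python sketch shows how per-image multi-label vectors could be assembled from a Roboflow-style CSV export. The column layout (a filename column plus one binary column per class) and the helper name are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch: building multi-label target vectors from a Roboflow CSV.
# Assumes (hypothetically) one CSV per split with a "filename" column and
# one binary column per annotated class; the actual export layout may differ.
import numpy as np
import pandas as pd

KEPT_CLASSES = [
    "Caries", "Crown", "Filling", "Implant", "Mandibular Canal",
    "Missing Teeth", "Periapical Lesion", "Root Canal Treatment",
    "Root Piece", "Impacted Tooth", "Maxillary Sinus",
]
MIN_POSITIVES = 50  # curation threshold used in this study

def build_targets(csv_path: str) -> tuple[list[str], np.ndarray]:
    df = pd.read_csv(csv_path)
    # Keep only curated classes with sufficient positive samples.
    counts = df[KEPT_CLASSES].sum(axis=0)
    kept = [c for c in KEPT_CLASSES if counts[c] >= MIN_POSITIVES]
    # One row per image -> one binary label vector per image.
    labels = df[kept].to_numpy(dtype=np.float32)
    return df["filename"].tolist(), labels
```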
Figure 1 presents six panoramic radiographs illustrating the diversity of anatomical structures and dental conditions encountered in the dataset. Figure 1a–f demonstrate variations in dentition, restorations, and image quality. Multiple dental findings, including fillings, crowns, missing teeth, and possible implants, can be observed across different regions of the jaw. The images also reflect variability in contrast, noise, and anatomical alignment, which increases the complexity of the multi-label classification task. These examples highlight the challenges associated with detecting both high-contrast structures (e.g., restorations) and subtle or diffuse conditions within full panoramic views.
Excluding labels affected by annotation inconsistencies or insufficient representation reduces ambiguity during training and keeps the resulting classification task balanced, learnable, and clinically meaningful, while lowering the risk of model degradation due to noisy or sparse data. Similar filtering strategies have been widely adopted in medical imaging studies to improve robustness and reproducibility when dealing with heterogeneous and imperfect datasets.
The dataset was partitioned into training and validation sets following the splits provided with the Roboflow export. Due to the absence of patient-level identifiers in the dataset, the splitting was performed at the image level rather than the patient level. While this approach enables model training and evaluation, it may introduce a risk of data leakage if multiple images from the same patient are distributed across different splits, potentially leading to optimistic performance estimates. This limitation is acknowledged and should be considered when interpreting the results. Future work will aim to adopt patient-wise or cross-institutional splitting strategies to ensure more robust evaluation.

3.2. Preprocessing and Augmentation

The preprocessing and augmentation pipeline consists of two distinct stages. The first stage is performed during dataset export using the Roboflow platform (version 3.0), where all images are standardized to a resolution of 640 × 640 pixels. Additional fixed transformations include auto-orientation, EXIF metadata removal, brightness adjustment within the range of −15% to +15%, and Gaussian blur with a kernel range between 0 and 0.7. These operations ensure consistent image formatting and improve robustness to variations in image quality.
The second stage involves on-the-fly data augmentation applied only during training using torchvision transforms. For the proposed model, training images are resized to 380 × 380 pixels and augmented using horizontal flipping, random rotation (±25°), RandomAffine, ColorJitter, and RandomPerspective transformations. These augmentations aim to increase invariance to variations in patient positioning, anatomical orientation, and imaging conditions. Horizontal flipping is applied as a data augmentation technique to improve model robustness to variations in image orientation. While laterality can be clinically relevant in certain dental findings, the current task is formulated as image-level multi-label classification rather than side-specific diagnosis, and therefore flipping does not alter the presence of labels. Nevertheless, this transformation may obscure left–right contextual information, and is acknowledged as a limitation. Future work will explore side-aware modeling and constrained augmentation strategies to better preserve anatomical laterality.
In contrast, validation images are only resized to 380 × 380 pixels, normalized using standard ImageNet mean and standard deviation, and converted to tensors without any stochastic augmentation, ensuring a fair and unbiased evaluation. The choice of 380 × 380 resolution represents a trade-off between preserving radiographic detail and maintaining computational efficiency within available GPU memory constraints.
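A sketch of this training-time stage using torchvision is shown below; the resolution, flip, and rotation settings follow the text, while the remaining augmentation magnitudes (affine shifts, jitter strength, perspective scale) are illustrative assumptions.

```python
# Sketch of the on-the-fly augmentation stage with torchvision transforms.
from torchvision import transforms

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_tf = transforms.Compose([
    transforms.Resize((380, 380)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=25),
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),  # assumed magnitude
    transforms.ColorJitter(brightness=0.2, contrast=0.2),        # assumed magnitude
    transforms.RandomPerspective(distortion_scale=0.2, p=0.3),   # assumed magnitude
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

# Validation: deterministic resize + normalization only.
val_tf = transforms.Compose([
    transforms.Resize((380, 380)),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
```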
This clear separation between fixed preprocessing and training-time augmentation ensures a reproducible and well-defined experimental pipeline across all models.

3.3. Class Imbalance Handling

To address the significant class imbalance inherent in the multi-label dataset, a WeightedRandomSampler was employed during training. Class-wise positive sample counts were computed from the training data, and inverse frequency weights were assigned to each class based on the ratio of total samples to class-specific occurrences.
For each image, the sampling weight was defined as the maximum weight among its associated positive labels, ensuring that samples containing rare classes were more likely to be selected during training. This strategy is particularly suitable for multi-label classification, where each sample may belong to multiple classes with varying frequencies.
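A minimal sketch of this weighting scheme, assuming a binary label matrix as input, is given below; it illustrates the inverse-frequency and per-image maximum-weight rules rather than reproducing the authors' exact code.

```python
# Sketch: per-class weight = total samples / class count;
# per-image weight = max weight over its positive labels.
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

def make_sampler(labels: np.ndarray) -> WeightedRandomSampler:
    # labels: (num_images, num_classes) binary matrix.
    n = labels.shape[0]
    class_counts = labels.sum(axis=0)               # positives per class
    class_weights = n / np.maximum(class_counts, 1) # inverse frequency
    # Each image inherits the weight of its rarest positive class;
    # images with no positive label fall back to weight 1.0.
    per_image = (labels * class_weights).max(axis=1)
    per_image = np.where(per_image > 0, per_image, 1.0)
    return WeightedRandomSampler(
        weights=torch.as_tensor(per_image, dtype=torch.double),
        num_samples=n,
        replacement=True,
    )
```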
By increasing the sampling probability of underrepresented classes, the proposed approach improves the model’s ability to learn rare but clinically important conditions. This is reflected in the improved recall observed for less frequent classes such as implant and root piece. Figure 2 presents the class distribution before and after applying the WeightedRandomSampler strategy.

3.4. Baseline Models

To evaluate the effectiveness of the proposed approach, two widely used deep learning architectures were selected as baseline models.
The first baseline is EfficientNet-B4, initialized with ImageNet-pretrained weights obtained via Noisy Student training. The classification head was modified to produce outputs corresponding to the eleven target classes. EfficientNet-B4 is well suited for medical imaging tasks due to its efficient compound scaling strategy, which balances network depth, width, and resolution.
The second baseline is ResNet50, a residual convolutional neural network known for its stable training behavior and strong generalization performance. Its residual connections facilitate gradient propagation, making it a reliable benchmark in medical image classification tasks. Both baseline models were trained under the same experimental conditions as the proposed EfficientNet-B4-CBAM model to ensure a fair and consistent comparison.

3.5. Proposed EfficientNet-B4-CBAM Model

The proposed architecture, referred to as EfficientNet-B4-CBAM, is based on the EfficientNet-B4 backbone enhanced with an attention mechanism. The EfficientNet-B4 network from the timm library is initialized with the parameters num_classes = 0 and global_pool = False in order to expose the convolutional feature maps prior to classification.
These extracted feature maps are then refined using a Convolutional Block Attention Module (CBAM), which sequentially applies channel attention and spatial attention to improve feature representation.
In the channel attention stage, global average pooling and global max pooling are applied to the feature maps to generate two channel descriptors. These descriptors are passed through a shared bottleneck consisting of 1 × 1 convolutional layers (equivalent to a multilayer perceptron) to produce channel attention weights. The resulting weights are applied to the feature maps via element-wise multiplication to emphasize informative channels. In the spatial attention stage, the channel-refined feature maps are further processed by computing average and max projections across the channel dimension. The resulting spatial descriptors are concatenated and passed through a 7 × 7 convolutional layer followed by a sigmoid activation to produce spatial attention weights. These weights highlight diagnostically relevant regions in the panoramic radiograph. After attention refinement, the feature maps are globally average-pooled and regularized using dropout with a probability of 0.4. The resulting feature vector is then passed to a fully connected layer that produces an 11-dimensional output, corresponding to the eleven dental disease classes. Because the task is multi-label classification, a sigmoid activation function is applied to obtain independent probabilities for each class.
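The following PyTorch sketch outlines this architecture under stated assumptions: hyperparameters follow the text (reduction ratio 16, 7 × 7 spatial convolution, dropout 0.4, eleven outputs), while the timm model identifier is assumed; note that timm disables pooling via global_pool="".

```python
# Sketch of the described EfficientNet-B4-CBAM architecture (not the
# authors' exact code). timm exposes unpooled feature maps when created
# with num_classes=0 and global_pool="".
import timm
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Shared 1x1-conv bottleneck (equivalent to an MLP over channels).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: shared MLP over global avg- and max-pooled maps.
        avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
        x = x * torch.sigmoid(avg + mx)
        # Spatial attention: 7x7 conv over channel-wise mean/max projections.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

class EfficientNetB4CBAM(nn.Module):
    def __init__(self, num_classes: int = 11):
        super().__init__()
        self.backbone = timm.create_model(
            "efficientnet_b4", pretrained=True, num_classes=0, global_pool=""
        )
        self.cbam = CBAM(self.backbone.num_features)
        self.dropout = nn.Dropout(0.4)
        self.fc = nn.Linear(self.backbone.num_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.cbam(self.backbone(x))   # attention-refined feature maps
        f = f.mean(dim=(2, 3))            # global average pooling
        return self.fc(self.dropout(f))   # 11 raw logits (sigmoid at inference)
```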
It should be emphasized that the implemented model does not perform feature fusion between EfficientNet and ResNet architectures. The proposed method consists of EfficientNet-B4 enhanced with CBAM attention, while ResNet50 is used solely as a baseline model for comparative evaluation.

Multi-Label Classification and Loss Function

Feature Extraction:
Let the input panoramic radiograph be represented as:
$X \in \mathbb{R}^{H \times W \times C}$
where $H$, $W$, and $C$ denote the height, width, and number of channels of the input image, respectively.
The input image is processed by the EfficientNet-B4 backbone network to extract high-level feature representations:
$F = f_{\mathrm{EffNet}}(X)$
where $f_{\mathrm{EffNet}}(\cdot)$ denotes the EfficientNet-B4 feature extraction function and $F$ the resulting convolutional feature map.
Attention Mechanism (CBAM):
To enhance the discriminative capability of the extracted features, a Convolutional Block Attention Module (CBAM) is applied. CBAM sequentially performs channel attention and spatial attention to highlight informative structures in the radiograph.
Channel Attention
Channel attention identifies important feature channels by aggregating spatial information using global pooling operations:
$M_c(F) = \sigma\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right)$
where $\sigma$ is the sigmoid activation function, $\mathrm{MLP}$ is a shared multilayer perceptron, and $\mathrm{AvgPool}$ and $\mathrm{MaxPool}$ denote global pooling operations.
The channel-refined feature map is obtained as:
$F' = M_c(F) \otimes F$
where ⊗ denotes element-wise multiplication.
Spatial Attention:
Spatial attention identifies important regions within the feature map and is computed as:
$M_s(F') = \sigma\left(f^{7 \times 7}\left([\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')]\right)\right)$
The final refined feature representation is:
$F'' = M_s(F') \otimes F'$
This process enables the network to emphasize diagnostically relevant anatomical regions in panoramic radiographs.
Multi-Label Classification:
The classifier predicts the probability of each dental disease class using a fully connected layer:
$z = W \cdot F'' + b$
where $W$ is the weight matrix and $b$ is the bias term.
Since multiple dental findings may occur in a single radiograph, the problem is formulated as a multi-label classification task. The predicted probability for each class (i) is computed using the sigmoid activation function:
$\hat{y}_i = \dfrac{1}{1 + e^{-z_i}}$
where ŷᵢ represents the predicted probability for class i.
Loss Function
The model is trained using Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss), which operates directly on raw logits without requiring a separate sigmoid activation during training. Given a predicted logit $z_i$ and ground-truth label $y_i \in \{0, 1\}$, the loss for each class is defined as:
$\mathcal{L} = -\dfrac{1}{N} \sum_{i=1}^{N} \left[ y_i \log\big(\sigma(z_i)\big) + (1 - y_i) \log\big(1 - \sigma(z_i)\big) \right]$
where σ(⋅) denotes the sigmoid function. In practice, BCEWithLogitsLoss combines the sigmoid activation and cross-entropy computation in a numerically stable formulation. During inference, a sigmoid function is applied to the output logits to obtain class-wise probabilities.
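As a minimal illustration, the objective and inference-time thresholding can be written as follows; the 0.5 threshold matches the protocol described in Section 3.6.

```python
# Sketch of the multi-label objective and thresholded inference.
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # sigmoid + BCE in one numerically stable op

def train_step(model, images, targets):
    logits = model(images)                 # raw logits, shape (B, 11)
    return criterion(logits, targets.float())

@torch.no_grad()
def predict(model, images, threshold: float = 0.5):
    probs = torch.sigmoid(model(images))   # independent per-class probabilities
    return (probs >= threshold).int()      # binary multi-label predictions
```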
Figure 3 illustrates the overall architecture of the proposed EfficientNet-B4-CBAM framework for multi-label dental findings classification from panoramic radiographs.
The pipeline begins with the input panoramic radiograph, which undergoes preprocessing and data augmentation, including resizing, horizontal flipping, and rotation, to improve model generalization. The processed image is then passed through the EfficientNet-B4 backbone, which extracts high-level convolutional feature representations. To enhance feature discrimination, a CBAM is applied to the final convolutional feature maps prior to global average pooling. The CBAM operates sequentially through two components: channel attention, which emphasizes the most informative feature channels, and spatial attention, which highlights diagnostically relevant regions within the radiograph. In the channel attention branch, a reduction ratio of 16 is used to balance representational capacity and computational efficiency. Following attention refinement, the feature maps are globally average-pooled to reduce spatial dimensions, and a dropout layer is applied to mitigate overfitting. The resulting feature vector is then passed through a fully connected layer that outputs eleven class-wise logits for multi-label classification. During inference, a sigmoid activation function is applied to obtain class probabilities. Finally, Grad-CAM is employed to generate visual explanations by highlighting the regions of the radiograph that contribute most to the model’s predictions, thereby supporting interpretability and potential clinical validation. All architectural components and hyperparameters are kept consistent across experiments to ensure fair comparison.

3.6. Optimization and Training Protocol

All models were implemented in Python (version 3.10) using the PyTorch (version 2.3.0) deep learning framework, with architectures obtained from the timm library. The proposed EfficientNet-B4-CBAM model is based on a pretrained EfficientNet-B4 backbone initialized with ImageNet weights and subsequently fine-tuned on the target dataset. The CBAM attention module is integrated to refine feature representations through channel and spatial attention mechanisms. The final classification head consists of global average pooling, followed by a dropout layer (rate = 0.4) and a fully connected layer producing eleven output logits.
All input images were resized to 380 × 380 pixels to match the EfficientNet-B4 input resolution. Data preprocessing and augmentation were performed using torchvision transforms. Training images were augmented using horizontal flipping, random rotation (±25°), affine transformations, color jitter, and random perspective distortion. In contrast, validation images were resized and normalized using standard ImageNet statistics without any stochastic augmentation, ensuring a fair evaluation protocol.
The models were trained using the AdamW optimizer with an initial learning rate of 3 × 10⁻⁴ and a batch size of 16. Binary Cross-Entropy with Logits Loss (BCEWithLogitsLoss) was used as the objective function for multi-label classification. To mitigate class imbalance, a WeightedRandomSampler was employed during training to increase the sampling probability of underrepresented classes.
Training was conducted for 20 epochs using a progressive fine-tuning strategy. Initially, only the classification head was trained while keeping the EfficientNet-B4 backbone frozen. Subsequently, the backbone layers were gradually unfrozen to enable full network fine-tuning. This strategy promotes stable convergence and reduces the risk of early overfitting.
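A sketch of this schedule is given below, reusing the EfficientNetB4CBAM class from the earlier sketch; the epoch at which the backbone is unfrozen is an assumption, since the text specifies only the overall 20-epoch budget and the two-phase strategy.

```python
# Sketch of the progressive fine-tuning schedule: head-only warm-up,
# then full-network fine-tuning with AdamW.
import torch

def set_backbone_trainable(model, trainable: bool):
    for p in model.backbone.parameters():
        p.requires_grad = trainable

model = EfficientNetB4CBAM()
set_backbone_trainable(model, False)  # phase 1: train CBAM + head only
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)

UNFREEZE_EPOCH = 5  # assumed switch point
for epoch in range(20):
    if epoch == UNFREEZE_EPOCH:
        set_backbone_trainable(model, True)  # phase 2: full fine-tuning
        optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
    # ... one training pass over the weighted-sampler DataLoader ...
```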
Model selection was based on the validation micro-F1 score, which provides a balanced evaluation of precision and recall across all classes and is well-suited for imbalanced multi-label settings. During inference, a sigmoid activation function was applied to the output logits to obtain class-wise probabilities, and a fixed threshold of 0.5 was used to generate binary predictions. While this threshold offers a reasonable balance between precision and recall, it may not be optimal for all classes, and threshold calibration represents a potential direction for future improvement (Table 2).
Table 2. Training configuration used for model development.

Parameter                 | Value
--------------------------|-----------------------------
Input size                | 380 × 380
Batch size                | 16
Optimizer                 | AdamW
Learning rate             | 3 × 10⁻⁴
Loss                      | BCEWithLogitsLoss
Epochs                    | 20
Sampler                   | WeightedRandomSampler
Task type                 | Multi-label classification
Model selection criterion | Validation micro-F1

3.7. Evaluation Metrics

The proposed framework was evaluated using multiple performance metrics to provide a comprehensive assessment of model behavior in a multi-label classification setting. The micro-F1 score was used as the primary metric for model comparison, as it aggregates contributions from all classes and is particularly suitable for imbalanced datasets. In contrast, the macro-F1 score provides a class-balanced perspective by assigning equal importance to each class, regardless of its frequency. In addition, subset accuracy was used to measure the proportion of samples for which the predicted label set exactly matches the ground truth. As a strict metric, it reflects the model’s ability to correctly predict all labels simultaneously. The Hamming loss was also computed to quantify the fraction of incorrectly predicted labels across all classes. To further analyze model performance, classification reports were generated to provide precision, recall, and F1-score for each class. Moreover, multilabel confusion matrices and one-vs-rest Receiver Operating Characteristic (ROC) curves were used to evaluate class-wise behavior and decision thresholds. The F1-score, which balances precision and recall, is defined as:
$\text{F1-score} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$
where
$\text{Precision} = \dfrac{TP}{TP + FP}$
$\text{Recall} = \dfrac{TP}{TP + FN}$
where
TP denotes true positives;
FP denotes false positives;
TN denotes true negatives;
FN denotes false negatives.
Additionally, the Receiver Operating Characteristic (ROC) curve is used to analyze the trade-off between true positive rate and false positive rate:
$\mathrm{TPR} = \dfrac{TP}{TP + FN}$
$\mathrm{FPR} = \dfrac{FP}{FP + TN}$
The ROC curve plots the TPR against the FPR at various threshold settings. The Area Under the ROC Curve (AUC) provides a scalar measure of the model’s overall discriminative ability, where a value closer to 1 indicates better class separability. In the context of multi-label classification, a one-vs-rest strategy is used to compute ROC curves for each class, and micro-averaging is applied to aggregate performance across all classes.
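Under the assumption that predictions are gathered into label and probability matrices, the reported metrics can be computed with scikit-learn as sketched below.

```python
# Sketch of the evaluation metrics; y_true and y_prob are (num_samples, 11)
# arrays of ground-truth labels and sigmoid probabilities.
import numpy as np
from sklearn.metrics import (
    accuracy_score, classification_report, f1_score, hamming_loss,
    roc_auc_score,
)

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5):
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "subset_accuracy": accuracy_score(y_true, y_pred),  # exact-match ratio
        "hamming_loss": hamming_loss(y_true, y_pred),
        "micro_auc": roc_auc_score(y_true, y_prob, average="micro"),
        "per_class": classification_report(y_true, y_pred, zero_division=0),
    }
```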
Although the proposed model demonstrates consistent improvements across several evaluation metrics, the observed performance gain (approximately 1% in micro-F1 score) remains modest. Statistical significance testing was not performed in this study due to the use of a single train–validation split without repeated runs, which limits the applicability of rigorous statistical comparisons such as paired t-tests or bootstrap-based confidence intervals. Reliable statistical validation would require multiple independent experiments or cross-validation to estimate variability in model performance. Therefore, the reported improvements should be interpreted with caution. Future work will incorporate repeated experiments and cross-validation to enable statistically robust and reliable model comparison.

3.8. Explainability with Grad-CAM

To enhance interpretability, the Gradient-weighted Class Activation Mapping (Grad-CAM) technique was employed to visualize the regions of the input image that contribute most to the model’s predictions. Grad-CAM was implemented by registering forward and backward hooks on the final convolutional layer of the network. During backpropagation, the gradients of the target class with respect to the feature maps were computed and globally average-pooled to obtain channel-wise importance weights. These weights were then combined with the corresponding feature maps to generate a class activation map. A ReLU activation was applied to retain only the positive contributions, and the resulting heatmap was upsampled to the original image resolution using bilinear interpolation.
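A compact, hook-based sketch of this procedure is shown below; the target_layer argument (the final convolutional block) and the normalization step for overlay are illustrative choices consistent with the description above.

```python
# Sketch of hook-based Grad-CAM for one image and one target class.
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx: int) -> torch.Tensor:
    feats, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: feats.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(a=go[0]))
    try:
        logits = model(image.unsqueeze(0))   # (1, num_classes)
        model.zero_grad()
        logits[0, class_idx].backward()      # gradients w.r.t. target class
        # Channel weights: global average of gradients over spatial dims.
        w = grads["a"].mean(dim=(2, 3), keepdim=True)
        cam = F.relu((w * feats["a"]).sum(dim=1, keepdim=True))
        # Upsample to input resolution and normalize to [0, 1] for overlay.
        cam = F.interpolate(cam, size=image.shape[-2:],
                            mode="bilinear", align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        return cam[0, 0]
    finally:
        h1.remove()
        h2.remove()
```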
Finally, the heatmap was overlaid on the original panoramic radiograph to visually assess whether the model focuses on clinically relevant regions, such as lesions, restorations, or anatomical structures. This provides an additional level of validation for the model’s predictions and supports its potential use in clinical decision-making.

4. Results

4.1. Aggregate Model Comparison

Table 3 summarizes the aggregate performance of the evaluated models across multiple evaluation metrics. The proposed EfficientNet-B4-CBAM model achieved the highest micro-F1 score (0.8567), indicating superior overall performance in aggregating predictions across all classes. This confirms its effectiveness in handling imbalanced multi-label classification tasks.
In terms of macro-F1 score, the ResNet50 model achieved the highest value (0.7025), suggesting better balanced performance across individual classes. This indicates that while the proposed model excels in overall performance, ResNet50 demonstrates slightly stronger class-wise consistency. The proposed model also achieved the highest subset accuracy (0.4722), reflecting improved capability in predicting the exact set of labels for each sample. Additionally, it achieved the lowest Hamming loss (0.0736), indicating fewer label-wise prediction errors compared to the baseline models.
Regarding micro-AUC, ResNet50 achieved the highest value (0.960), followed by EfficientNet-B4 (0.947) and EfficientNet-B4-CBAM (0.946). This suggests that while the proposed model improves threshold-dependent metrics such as F1-score and subset accuracy, ResNet50 maintains a slight advantage in ranking-based performance. Overall, Table 3 demonstrates that the proposed EfficientNet-B4-CBAM model provides the best trade-off between precision and recall, achieving consistent improvements in key metrics relevant to multi-label classification, particularly in terms of micro-F1, subset accuracy, and Hamming loss.
Although the proposed model shows consistent improvements in micro-F1 score compared to baseline models, the magnitude of this gain (approximately 1%) remains modest. In the absence of repeated runs, confidence intervals, or formal significance testing, it is not possible to determine whether the observed differences exceed typical run-to-run variability. Therefore, these improvements should be interpreted as indicative rather than statistically conclusive. A more rigorous assessment of robustness would require multiple independent experiments or cross-validation to estimate performance variability and compute statistical significance. This is identified as an important direction for future work.
Figure 4 presents a comparison of the micro-F1 scores achieved by the evaluated models, including EfficientNet-B4, ResNet50, and the proposed EfficientNet-B4-CBAM model. As shown, the proposed model attains the highest micro-F1 score (0.857), outperforming ResNet50 (0.847) and EfficientNet-B4 (0.842). Although the performance differences between the models are relatively small, the improvement achieved by the proposed model is consistent and indicates the effectiveness of incorporating the CBAM attention mechanism. The results suggest that attention-enhanced feature refinement enables better identification of relevant patterns in panoramic radiographs, leading to improved overall classification performance.
To assess the contribution of the attention mechanism, the comparison between EfficientNet-B4 and the proposed EfficientNet-B4-CBAM model can be interpreted as an ablation study isolating the effect of the CBAM. Both models share the same backbone architecture and are trained under identical experimental conditions, ensuring a fair comparison. The results indicate that incorporating CBAM yields a consistent, though modest, improvement in micro-F1 score and subset accuracy, suggesting that attention-based feature refinement enhances the model’s ability to capture relevant patterns in panoramic radiographs. In addition, Grad-CAM visualizations show that the attention-enhanced model produces more spatially coherent activation maps, supporting the qualitative benefit of the CBAM. However, the relatively small performance gain indicates that attention alone does not fully address the complexity of the task, highlighting the need for further methodological improvements. It should also be noted that the current evaluation is limited to two widely used convolutional baseline models. While these provide a strong and controlled reference, incorporating more recent architectures, such as transformer-based models, would enable a more comprehensive comparison with state-of-the-art approaches.

4.2. Class-Wise Performance

The class-wise results, summarized in Table 4, reveal substantial variability in performance across the eleven dental disease categories. This variation is expected in panoramic radiography, where classes differ in terms of prevalence, anatomical structure, and radiographic conspicuity. The highest F1-scores were achieved for impacted tooth (0.96), implant (0.95), filling (0.90), and root canal treatment (0.90). These classes typically exhibit well-defined shapes or high-contrast materials, making them easier to detect and classify. Additionally, crown (0.82) and missing teeth (0.79) also demonstrated strong performance, likely due to their relatively distinct structural patterns. In contrast, lower performance was observed for mandibular canal (0.44), periapical lesion (0.32), and maxillary sinus (0.18). These classes are more challenging due to their diffuse anatomical boundaries, low contrast, and relatively limited representation in the dataset. Such characteristics make them difficult to capture using a whole-image classification approach. Intermediate performance was observed for caries (0.63) and root piece (0.61). These classes often present with subtle or localized features that may not be fully captured at the global image level, contributing to moderate detection performance.
Overall, the results indicate that the proposed model performs well on structurally distinct and well-represented classes, while performance degrades for subtle, diffuse, or underrepresented conditions. This finding highlights the need for more localized or region-based modeling approaches to further improve classification performance for challenging classes.
Figure 5 illustrates the per-class F1-scores achieved by the proposed EfficientNet-B4-CBAM model across the eleven dental disease categories. The results show considerable variability in performance, reflecting differences in class prevalence, anatomical characteristics, and visual distinctiveness in panoramic radiographs. The highest F1-scores are observed for impacted tooth, implant, filling, and root canal treatment, all exceeding 0.90. These classes are characterized by well-defined shapes or high-contrast materials, making them easier for the model to detect. Crown and missing teeth also demonstrate strong performance, with F1-scores above 0.75. Moderate performance is observed for caries and root piece, with F1-scores around 0.60–0.65. These classes often exhibit subtle or localized features that may be less prominent in full panoramic images. In contrast, the lowest F1-scores are associated with mandibular canal, periapical lesion, and maxillary sinus, with values below 0.50. These classes are more challenging due to their diffuse anatomical structure, lower contrast, and relatively limited representation in the dataset.

4.3. ROC Analysis

The Receiver Operating Characteristic (ROC) analysis demonstrates strong class separability across all evaluated models, as illustrated in Figure 6. The EfficientNet-B4 model achieved a micro-average AUC of 0.947, while ResNet50 obtained the highest micro-average AUC of 0.960. The proposed EfficientNet-B4-CBAM model achieved a comparable micro-average AUC of 0.946. Although ResNet50 achieved the highest AUC, it did not yield the best micro-F1 score. This highlights an important distinction between ranking-based metrics and threshold-dependent metrics. Specifically, AUC evaluates the model’s ability to rank positive instances higher than negative ones across all thresholds, whereas F1-score reflects the balance between precision and recall at a specific decision threshold. The observed discrepancy indicates that ResNet50 produces well-separated probability distributions (i.e., strong ranking performance), but the proposed EfficientNet-B4-CBAM model achieves a more effective precision–recall trade-off when predictions are thresholded. This results in superior classification performance as measured by micro-F1. These findings emphasize the importance of evaluating multi-label medical classification models using complementary metrics, such as AUC and F1-score, rather than relying on a single performance indicator. In particular, threshold-dependent metrics are more aligned with real-world clinical decision-making, where binary predictions are required.

4.4. Confusion Matrices and Error Structure

Further insight into the model’s error distribution is provided by the class-wise confusion matrices in Figure 7. The results show strong diagonal dominance for classes such as impacted tooth, filling, implant, and root canal treatment, indicating a high number of correct predictions. This observation is consistent with the high F1-scores reported for these classes.
In contrast, classes such as mandibular canal, periapical lesion, and maxillary sinus exhibit a higher proportion of off-diagonal elements, particularly false negatives, reflecting the model’s difficulty in detecting these conditions. This behavior can be attributed to their lower representation in the dataset, as well as their diffuse anatomical structure and lower radiographic contrast.
Importantly, the confusion matrices reveal that the model does not tend to overestimate rare conditions (i.e., produce excessive false positives). Instead, the model exhibits a tendency toward underestimation, as evidenced by the higher number of false negatives for several classes. This suggests a conservative prediction behavior, where the model prioritizes precision over recall for difficult or underrepresented categories.
Overall, the confusion matrix analysis supports the quantitative results by demonstrating that model performance is strongly influenced by class characteristics, and it highlights the need for improved sensitivity for challenging classes in future work.

4.5. Grad-CAM Explainability

Grad-CAM visualizations were used to assess whether the model focuses on clinically meaningful regions within panoramic radiographs. As illustrated in Figure 8, the activation maps are not randomly distributed across the image background but are instead concentrated around relevant anatomical and pathological regions, such as restorative materials and affected teeth. This indicates that the model leverages localized radiographic evidence rather than spurious correlations. A comparison across models reveals notable differences in attention behavior. The EfficientNet-B4 and ResNet50 models exhibit more localized and, in some cases, fragmented activation patterns. In contrast, the proposed EfficientNet-B4-CBAM model demonstrates more coherent and spatially consistent attention, covering broader regions associated with the pathological area. This suggests that the integration of the CBAM attention mechanism enhances the model’s ability to capture both channel-wise and spatial dependencies, leading to improved feature representation. Importantly, the Grad-CAM results support the quantitative findings by showing that the proposed model focuses more effectively on diagnostically relevant regions, which contributes to its improved classification performance. However, it should be noted that Grad-CAM provides visual explanations of model attention rather than definitive evidence of correctness, and therefore should be interpreted as a supportive tool rather than a standalone validation method.

5. Discussion

The results of this study demonstrate that multi-label dental findings classification from panoramic radiographs can be effectively addressed using a carefully designed deep learning framework. From an application-oriented perspective, the proposed EfficientNet-B4-CBAM model achieves a balanced trade-off between predictive performance and interpretability, which is essential for clinical adoption. Rather than emphasizing architectural novelty, this work highlights the importance of aligning model design with the specific challenges of panoramic imaging, including multi-label prediction, class imbalance, and the need for transparent decision-making.
While the proposed model demonstrates competitive performance across multiple metrics, it does not consistently outperform all baseline models. In particular, ResNet50 achieves higher AUC values, indicating stronger ranking capability. This observation underscores the importance of considering both threshold-dependent metrics (e.g., micro-F1) and threshold-independent metrics (e.g., AUC) when evaluating multi-label classification performance. Furthermore, Grad-CAM visualizations provide qualitative insights into model attention but do not constitute quantitative validation of localization accuracy. Therefore, interpretations of attention maps should be made cautiously, and further validation using expert annotations or localization metrics is required to confirm their clinical relevance. The comparison between EfficientNet-B4 and the proposed EfficientNet-B4-CBAM model can be interpreted as an ablation analysis isolating the contribution of the CBAM. As both models share the same backbone and are trained under identical conditions, the observed improvement can be attributed to attention-based feature refinement. However, the magnitude of this improvement remains modest, suggesting that attention mechanisms alone are insufficient to fully address the complexity of panoramic radiograph analysis.
This behavior can be explained by the inherent characteristics of panoramic radiographs, where anatomical structures are distributed across the image and often overlap at multiple scales. In this context, the combined channel and spatial attention mechanisms of CBAM help emphasize diagnostically relevant regions while suppressing irrelevant background information. Nevertheless, consistent with recent literature, meaningful performance improvements are likely to require a combination of architectural refinement, data-centric strategies, and task-specific modeling.
A detailed analysis of evaluation metrics provides further insight into model behavior. While the proposed model achieves the highest micro-F1 score and lowest Hamming loss, ResNet50 attains a higher micro-AUC, reflecting superior ranking performance. This discrepancy highlights the distinction between threshold-dependent and threshold-independent evaluation. In clinical scenarios requiring binary decisions, threshold-dependent metrics may be more relevant, suggesting that the proposed model offers improved practical utility despite slightly lower ranking capability.
Class-wise results further emphasize the influence of anatomical and statistical factors on performance. High accuracy is observed for visually distinct classes such as impacted tooth, implant, filling, and root canal treatment. In contrast, lower performance is obtained for mandibular canal, periapical lesions, and maxillary sinus, which exhibit diffuse boundaries, lower contrast, and limited representation in the dataset. These findings indicate that global image-level classification is not well suited for detecting small or localized abnormalities. More advanced approaches such as detection-based models, patch-based learning, multi-instance learning, or segmentation-assisted pipelines, may be necessary to improve sensitivity for such conditions.
From a methodological perspective, the study follows a systematic and reproducible experimental protocol, in which all models are trained under identical conditions and evaluated using complementary metrics. The use of WeightedRandomSampler contributes to improved learning of underrepresented classes, although residual imbalance effects remain evident.
Interpretability remains a critical requirement in medical imaging. Grad-CAM visualizations indicate that the model focuses on clinically meaningful regions and produces more spatially coherent attention patterns than the baselines; however, these observations are qualitative and were not validated against expert annotations or quantitative localization metrics. Such interpretations should therefore be considered indicative rather than conclusive, and clinical validation by domain experts is necessary to confirm their reliability.
Several limitations should be acknowledged. First, the study relies on a single dataset, and the absence of external validation limits generalizability across different clinical environments. Second, the performance gains introduced by CBAM are modest, indicating that further methodological improvements are required. Third, the use of a fixed decision threshold may not be optimal for all classes, particularly for underrepresented conditions. Finally, statistical significance testing was not conducted due to the use of a single train–validation split, and therefore the reported improvements should be interpreted with caution.
From a clinical perspective, the proposed framework has potential as a computer-aided screening tool to support dental practitioners in interpreting panoramic radiographs. By providing probabilistic predictions and visual explanations, the system may assist in prioritizing cases and improving diagnostic efficiency. However, reduced sensitivity for low-prevalence or diffuse conditions and the lack of external validation remain important challenges. The model should therefore be considered a decision-support tool rather than a replacement for clinical expertise. Future work should include multi-center validation, threshold optimization, and clinician-in-the-loop evaluation to ensure safe and effective real-world deployment.
In summary, this study demonstrates that attention-enhanced convolutional models provide a methodologically sound and clinically relevant approach for multi-label dental findings classification. While the improvements are incremental, the results highlight important directions for improving robustness, interpretability, and clinical applicability in AI-based dental imaging systems.

6. Conclusions and Future Works

This study presented an attention-enhanced deep learning framework, EfficientNet-B4-CBAM, for multi-label dental findings classification from panoramic radiographs. The proposed approach integrates an EfficientNet-B4 backbone with a Convolutional Block Attention Module (CBAM) to refine feature representation through channel and spatial attention. The model was evaluated on the VZRAD2 dataset using a unified and reproducible experimental protocol designed to address key challenges in panoramic imaging, including multi-label prediction, class imbalance, and interpretability. The experimental results demonstrate that the proposed framework achieves consistent but modest improvements over strong baseline models, particularly in terms of micro-F1 score, subset accuracy, and Hamming loss. These findings suggest that attention mechanisms can enhance the model’s ability to capture clinically relevant patterns when combined with appropriate training strategies such as data augmentation and weighted sampling. However, the observed performance gains remain incremental, indicating that further advances will likely require a combination of architectural, data-centric, and task-specific improvements. From an application perspective, this work highlights the importance of developing clinically oriented and interpretable AI systems, rather than focusing solely on architectural novelty. The integration of Grad-CAM provides qualitative insights into model behavior, suggesting that predictions are associated with relevant anatomical regions; however, such explanations should be interpreted with caution and validated through expert assessment.
Several limitations should be acknowledged. The study relies on a single dataset, and the absence of external validation limits generalizability across different clinical environments and imaging conditions. In addition, the use of a fixed decision threshold may not be optimal for all classes in a multi-label setting, particularly for underrepresented conditions. Furthermore, the observed variability across classes indicates that global image-level classification may not adequately capture subtle or spatially localized pathologies.
Future work will focus on improving robustness and clinical applicability. In particular, evaluation on independent and multi-center datasets will be conducted to assess generalization. Threshold optimization and probability calibration strategies will be explored to improve decision reliability. Incorporating detection- or segmentation-based approaches may enhance performance for challenging classes with localized features. In addition, comparisons with advanced architectures, including Vision Transformers, Swin Transformers, and hybrid CNN–Transformer models, will be conducted to provide a more comprehensive benchmarking against recent developments in medical image analysis. Finally, clinician-in-the-loop studies will be essential to evaluate usability, trust, and real-world impact.
In conclusion, the proposed EfficientNet-B4-CBAM framework demonstrates that attention-enhanced deep learning models can provide a methodologically sound and clinically relevant approach for multi-label dental findings classification. While further validation and refinement are required, this work contributes toward the development of reliable and interpretable AI-assisted tools for dental radiographic analysis.

Author Contributions

Conceptualization, M.A. and S.D.; Methodology, M.A. and S.D.; Validation, S.D.; Formal analysis, M.A.; Resources, M.A.; Data curation, S.D.; Writing – original draft, M.A.; Writing – review & editing, S.D.; Visualization, M.A. and S.D.; Supervision, S.D.; Project administration, S.D.; Funding acquisition, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are derived from the VZRAD2 dental panoramic radiograph dataset, available through the Roboflow platform. https://universe.roboflow.com/arshs-workspace-radio/vzrad2/dataset/6 (accessed on 2 February 2026).

Acknowledgments

We would like to thank the Deanship of Scientific Research at Shaqra University for supporting this work.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Almalki, Y.E.; Din, A.I.; Ramzan, M.; Irfan, M.; Aamir, K.M.; Almalki, A.; Alotaibi, S.; Alaglan, G.; Alshamrani, H.A.; Rahman, S. Deep learning models for classification of dental diseases using orthopantomography X-ray OPG images. Sensors 2022, 22, 7370.
2. Cejudo, J.E.; Chaurasia, A.; Feldberg, B.; Krois, J.; Schwendicke, F. Classification of dental radiographs using deep learning. J. Clin. Med. 2021, 10, 1496.
3. Wu, P.-Y.; Lin, Y.-J.; Chang, Y.-J.; Wei, S.-T.; Chen, C.-A.; Li, K.-C.; Tu, W.-C.; Abu, P.A.R. Deep learning-assisted diagnostic system: Apices and odontogenic sinus floor level analysis in dental panoramic radiographs. Bioengineering 2025, 12, 134.
4. Mutawa, A.M. Deep learning applications for dental-disease classification using intraoral photographic images: Current status and future perspectives. AI 2026, 7, 85.
5. Webster, S.; Fraser, J. Artificial intelligence and dental panoramic radiographs: Where are we now? Evid. Based Dent. 2024, 25, 43–44.
6. Turosz, N.; Chęcińska, K.; Chęciński, M.; Brzozowska, A.; Nowak, Z.; Sikora, M. Applications of artificial intelligence in the analysis of dental panoramic radiographs: An overview of systematic reviews. Dentomaxillofacial Radiol. 2023, 52, 20230284.
7. Pham, T.D.; Al-Hebshi, S. Classification of pediatric dental diseases from panoramic radiographs using natural language transformer and deep learning models. Front. Artif. Intell. 2026, 9, 1754498.
8. Mohammad-Rahimi, H.; Motamedian, S.R.; Rohban, M.H.; Krois, J.; Uribe, S.E.; Mahmoudinia, E.; Rokhshad, R.; Nadimi, M.; Schwendicke, F. Deep learning for caries detection: A systematic review. J. Dent. 2022, 122, 104115.
9. Pornprasertsuk-Damrongsri, S.; Vachmanus, S.; Papasratorn, D.; Kitisubkanchana, J.; Chaikantha, S.; Arayasantiparb, R.; Mongkolwat, P. Clinical application of deep learning for enhanced multistage caries detection in panoramic radiographs. Sci. Rep. 2025, 15, 33491.
10. Biltekin, H.; Geduk, G.; Altan, A.; Karasu, S. Evaluation of deep learning systems in detection of dental caries on panoramic radiography. Am. J. Dent. 2025, 38, 163–168.
11. Ayhan, B.; Dönmez, N.; Korkmaz, Y.N. Detection of dental caries under fixed dental prostheses by analyzing digital panoramic radiographs with artificial intelligence algorithms based on deep learning methods. BMC Oral Health 2025, 25, 154.
12. Szabó, V.; Orhan, K.; Dobó-Nagy, C.; Veres, D.S.; Manulis, D.; Ezhov, M.; Sanders, A.; Szabó, B.T. Deep learning-based periapical lesion detection on panoramic radiographs. Diagnostics 2025, 15, 510.
13. Bonfanti-Gris, M.; Herrera, A.; Salido Rodríguez-Manzaneque, M.P.; Martínez-Rus, F.; Pradíes, G. Deep learning for tooth detection and segmentation in panoramic radiographs: A systematic review and meta-analysis. BMC Oral Health 2025, 25, 1280.
14. Ghasemi, N.; Rokhshad, R.; Zare, Q.; Shobeiri, P.; Schwendicke, F. Artificial intelligence for osteoporosis detection on panoramic radiography: A systematic review and meta-analysis. J. Dent. 2025, 156, 105650.
15. Karamüftüoğlu, N.; Bulut, A.; Akın, M.; Sağıroğlu, Ş. Panoramic radiograph-based deep learning models for diagnosis and clinical decision support of furcation lesions in primary molars. Children 2025, 12, 1517.
16. Shon, H.S.; Kong, V.; Park, J.S.; Jang, W.; Cha, E.J.; Kim, S.-Y.; Lee, E.-Y.; Kang, T.-G.; Kim, K.A. Deep learning model for classifying periodontitis stages on dental panoramic radiography. Appl. Sci. 2022, 12, 8500.
17. Wu, P.-Y.; Chen, S.-L.; Mao, Y.-C.; Lin, Y.-J.; Lu, P.-Y.; Yu, K.-H.; Li, K.-C.; Chi, T.-K.; Chen, T.-Y.; Abu, P.A.R. Automated implant placement pathway from dental panoramic radiographs using deep learning for preliminary clinical assistance. Diagnostics 2025, 15, 2598.
18. Borys, K.; Schmitt, Y.A.; Nauta, M.; Seifert, C.; Krämer, N.; Friedrich, C.M. Explainable AI in medical imaging: An overview for clinical practitioners—Saliency-based XAI approaches. Eur. J. Radiol. 2023, 162, 110787.
19. Champendal, M.; Prior, J.O.; Sadowski, S.M.; Reis, C.S. A scoping review of interpretability and explainability concerning artificial intelligence methods in medical imaging. Eur. J. Radiol. 2023, 169, 111159.
20. Nazir, S.; Khan, S.; Khan, H.U.; Saba, T.; Javed, A.; Mohamed, A.W. Survey of explainable artificial intelligence techniques for biomedical imaging with deep neural networks. Comput. Biol. Med. 2023, 156, 106668.
21. Yilmaz, S.; Tasyurek, M.; Amuk, M.; Celik, M.; Canger, E.M. Developing deep learning methods for classification of teeth in dental panoramic radiography. Oral Surg. Oral Med. Oral Pathol. Oral Radiol. 2024, 138, 118–127.
22. Tunç, H.; Akkaya, N.; Aykanat, B.; Ünsal, G. U-Net-based deep learning for simultaneous segmentation and agenesis detection of primary and permanent teeth in panoramic radiographs. Diagnostics 2025, 15, 2577.
23. Deb, M.; Deb, M.; Dhar, M.K. A deep learning approach to teeth segmentation and orientation from panoramic X-rays. Signals 2025, 6, 40.
24. Hou, S.; Wang, Y.; Bian, Z.; Liu, J.; Wang, H. Teeth U-Net: A segmentation model of dental panoramic X-ray images. Int. J. Med. Inform. 2022, 168, 104884.
25. Pedersen, S.; Jain, S.; Chavez, M.; Ladehoff, V.; de Freitas, B.N.; Pauwels, R. Pano-GAN: A deep generative model for panoramic dental radiographs. J. Imaging 2025, 11, 41.
26. VZRAD2 Dataset. Roboflow Universe. Available online: https://universe.roboflow.com/arshs-workspace-radio/vzrad2/dataset/6 (accessed on 10 February 2026).
Figure 1. Representative samples from the VZRAD2 dataset.
Figure 2. Class distribution before and after applying WeightedRandomSampler.
Figure 3. Architecture of the proposed EfficientNet-B4-CBAM framework for multi-label dental disease classification from panoramic radiographs.
Figure 4. Micro-F1 score comparison of the evaluated models.
Figure 5. Per-class F1 scores for the proposed EfficientNet-B4-CBAM model.
Figure 6. ROC curves for EfficientNet-B4, ResNet50, and EfficientNet-B4-CBAM.
Figure 7. Class-wise confusion matrices for the proposed EfficientNet-B4-CBAM model.
Figure 8. Grad-CAM visualizations for EfficientNet-B4, ResNet50, and the proposed EfficientNet-B4-CBAM model.
Table 1. Class-wise positive label counts in the dataset splits.

Class | Train-Positive Labels | Validation-Positive Labels | Test-Positive Labels
Caries | 1276 | 444 | 269
Crown | 1610 | 581 | 415
Filling | 3238 | 1458 | 1127
Implant | 280 | 77 | 58
Mandibular Canal | 262 | 38 | 21
Missing Teeth | 810 | 193 | 173
Periapical Lesion | 1018 | 362 | 212
Root Canal Treatment | 1910 | 768 | 606
Root Piece | 482 | 133 | 84
Impacted Tooth | 3594 | 1735 | 1340
Maxillary Sinus | 194 | 27 | 12
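The imbalance visible in Table 1 (3594 training positives for impacted tooth versus 194 for maxillary sinus) is what the WeightedRandomSampler in Figure 2 counteracts. Because each image can carry several labels, per-sample weights must be derived from per-class frequencies; the heuristic below, which weights each image by the inverse frequency of its rarest positive label, is one common choice offered as a sketch, not necessarily the exact scheme used in this study.

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler


def make_multilabel_sampler(labels: np.ndarray) -> WeightedRandomSampler:
    """labels: (N, C) binary matrix of training annotations.
    Each image is weighted by the inverse frequency of its rarest
    positive label, so images carrying underrepresented findings are
    drawn more often; images with no positive label get the median weight."""
    inv_freq = 1.0 / np.clip(labels.mean(axis=0), 1e-6, None)   # (C,)
    weights = np.where(labels.any(axis=1),
                       (labels * inv_freq).max(axis=1),
                       np.median(inv_freq))
    return WeightedRandomSampler(
        torch.as_tensor(weights, dtype=torch.double),
        num_samples=len(weights), replacement=True,
    )
```

The resulting sampler is passed to the training `DataLoader` in place of shuffling, which yields the rebalanced epoch-level label distribution illustrated in Figure 2.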
Table 3. Aggregate performance comparison of the evaluated models on the validation set.

Model | Micro-F1 | Macro-F1 | Subset Accuracy | Hamming Loss | Micro-AUC
EfficientNet-B4 | 0.8424 | 0.6626 | 0.4259 | 0.0806 | 0.947
ResNet50 | 0.8469 | 0.7025 | 0.4447 | 0.0792 | 0.960
EfficientNet-B4-CBAM | 0.8567 | 0.6822 | 0.4722 | 0.0736 | 0.946
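For reproducibility, every aggregate metric in Table 3 can be computed with scikit-learn from the validation label matrix and the model's predicted probabilities. The sketch below assumes the fixed 0.5 decision threshold discussed in the limitations; note that `accuracy_score` applied to multi-label indicator arrays is exactly the subset accuracy (exact-match ratio) reported here.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score,
                             hamming_loss, roc_auc_score)


def aggregate_metrics(y_true: np.ndarray, y_prob: np.ndarray,
                      threshold: float = 0.5) -> dict:
    """y_true, y_prob: (N, C) arrays of binary labels and sigmoid outputs."""
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "subset_accuracy": accuracy_score(y_true, y_pred),  # exact-match ratio
        "hamming_loss": hamming_loss(y_true, y_pred),
        "micro_auc": roc_auc_score(y_true, y_prob, average="micro"),
    }
```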
Table 4. Class-wise performance metrics of the proposed EfficientNet-B4-CBAM model on the validation set.

Class | Precision | Recall | F1-Score | Support
Caries | 0.80 | 0.52 | 0.63 | 444
Crown | 0.76 | 0.87 | 0.82 | 581
Filling | 0.90 | 0.90 | 0.90 | 1458
Implant | 0.95 | 0.96 | 0.95 | 77
Mandibular Canal | 0.62 | 0.34 | 0.44 | 38
Missing Teeth | 0.80 | 0.78 | 0.79 | 193
Periapical Lesion | 0.48 | 0.24 | 0.32 | 362
Root Canal Treatment | 0.92 | 0.89 | 0.90 | 768
Root Piece | 0.68 | 0.55 | 0.61 | 133
Impacted Tooth | 0.95 | 0.98 | 0.96 | 1735
Maxillary Sinus | 0.50 | 0.11 | 0.18 | 27
