1. Introduction
Accurate teeth segmentation is important in oral healthcare. It provides location data for orthodontic treatments, clinical diagnoses, and surgical procedures. It is also used to identify individuals, capture tooth morphology, and plan dental implants [
1]. Manually segmenting teeth is time-consuming and challenging, even for experienced professionals. Semi-automatic techniques can help, but they still require some human input. This is especially true for lower-resolution images, with which it can be difficult to accurately delineate certain tooth regions.
Teeth segmentation from panoramic X-ray images can be of two types: semantic segmentation and instance segmentation. Semantic segmentation is a simple approach to dental analysis that labels the entire teeth region with a single label. This simplicity makes it computationally efficient and easy to implement. Semantic segmentation is well suited for general dental analysis for which distinguishing individual teeth is not essential, such as identifying overall dental health or disease patterns. Semantic segmentation could be used to track the progression of dental diseases over time. This would allow dentists to monitor the effectiveness of treatment and make adjustments as needed. Teeth instance segmentation assigns separate labels to each individual tooth, enabling precise identification of each tooth’s boundaries and characteristics. With instance segmentation, dental professionals can accurately analyze the health of each tooth, aiding in personalized treatment planning for orthodontic procedures and other dental interventions. A dentist can use instance segmentation to identify a tooth that is decayed. This information can be used to plan the best course of treatment, such as a filling or a root canal. Instance segmentation allows for detailed quantitative analysis, such as measuring gaps between teeth, assessing wear patterns, and monitoring specific dental issues on a per-tooth basis. For tasks like designing dental prosthetics or implants, instance segmentation helps in creating accurate models that fit each tooth precisely.
While 3D instance segmentation is particularly valuable for complex cases that require a thorough understanding of tooth morphology, such as orthognathic surgery planning, dental implant placement, and in-depth orthodontic assessments, 2D instance segmentation is well suited for routine dental analyses and treatments that can be effectively assessed from a two-dimensional perspective, such as cavity detection, treatment planning, and basic orthodontic evaluations. Moreover, 2D instance segmentation is generally faster and requires less computational resources compared to its 3D counterpart, making it suitable for tasks that demand quick results.
In this paper, we use panoramic dental X-ray images for instance segmentation. Panoramic X-ray images are widely used for applications like dental caries, alveolar bone resorption, and impacted teeth [
2]. Research works on panoramic X-ray images are mostly limited to teeth detection, teeth segmentation, and teeth numbering. Tooth orientation remains largely unexplored, even though it can help dentists and oral surgeons diagnose dental problems and plan treatments, particularly for procedures related to restorative dentistry, such as dental implants, teeth alignment, and orthodontic interventions. Typically, positioning technology and segmentation are conducted in separate frameworks. Consequently, the development cycle takes longer, and the algorithm’s complexity is elevated [
3].
Our contributions in this paper can be summarized as follows:
We propose a two-stage framework that returns individual segmented teeth and their orientations. We adopt a deep learning-based approach to segment individual teeth, followed by applying Principal Component Analysis (PCA) to determine tooth orientation. Such outcomes can facilitate obtaining precise teeth positions in an image.
We modify the FUSegNet model [
4], originally developed for wound segmentation, for teeth instance segmentation by introducing grid-based attention gates in skip connections.
We extend our approach to produce horizontal bounding boxes (HBB) and oriented bounding boxes (OBB), and to detect missing teeth.
We generate the OBB coordinates for each of the teeth in the DNS dataset [
5], and these coordinates are available in our GitHub repository, as referenced in the
Data Availability Statement.
We provide comprehensive experimental findings, including a comparison with the state of the art and an in-depth ablation study, to demonstrate the efficacy of our approach.
To highlight the motivation, approach, and impact of this study, we present a structured summary in
Figure 1.
2. Literature Review
Deep learning can help dentists by automating the process of teeth segmentation. This saves time and reduces human error, allowing dentists to focus on more important tasks such as accurate diagnosis and treatment planning. Various methods have been developed for teeth segmentation. Koch et al. [
6] employed FCN in a U-Net setup. Zhao et al. [
7] introduced TSASNet with a two-stage approach involving contextual attention and segmentation based on attention maps. Chen et al. [
8] extended spatial pyramid pooling (SPP) to MSLPNet and introduced MS-SSIM loss. Salih and Kevin [
9] proposed LTPEDN, replacing LBC layers with LTP layers. Hou et al. [
10] proposed Teeth U-Net using a multi-scale feature aggregation attention block (MAB) and dilated hybrid self-attention block (DHAB) in the bottleneck layer for improved segmentation.
Jader et al. [
11] were the first to attempt teeth instance segmentation from panoramic X-ray images. They employed a set of 193 images for training and subsequently assessed their approach on 1224 images, achieving an F1-score of 88%. Rubiu et al. [
12] also used Mask R-CNN on the Tufts Dental Database [
13], which consists of 1000 panoramic dental radiographs, including both deciduous and permanent teeth. Their classification accuracy and dice score were 98.4% and 87%, respectively, and they observed poor segmentation for the right mandibular third molar. Silva et al. [
5] explored Mask R-CNN, PANet, HTC, and ResNeSt for teeth segmentation and numbering. They observed the best result for PANet with 71.3% of mAP for segmentation. Helli and Mahamci [
14] employed a two-step methodology in which a U-Net produces a binary prediction, followed by morphological operations to label connected components; thus, the full segmentation process is not deep learning-based. Their study utilized a limited dataset comprising 116 patients, with 11 images reserved for testing purposes. El Bsat et al. [
15] tried MobileNet, AdapNet, DenseNet, and SegNet for maxillary teeth and palatal rugae segmentation. Their dataset consists of 797 occlusal views from teeth photographs. They achieved the best performance from SegNet with 86.66% average mIoU. Wathore and Gorthi introduced a bilateral symmetry-based augmentation technique for panoramic X-rays, achieving a Dice Similarity Coefficient (DSC) of 76.7% using TransUNet [
16]. Similarly, Brahmi and Jdey presented a new dataset comprising 107 panoramic X-ray images, reporting an F1-score of 63% using Mask-RCNN [
17].
The above-mentioned works primarily addressed teeth segmentation, with some including numbering, yet none tackled tooth orientation. However, the orientation of teeth in 2D panoramic X-ray images is significant because it can help dentists identify the position of the teeth and their roots. This paper presents a two-stage framework: initially, utilizing deep learning for precise teeth segmentation, enabling subsequent numbering based on individual labels; subsequently, employing principal component analysis (PCA) [
18] to establish tooth orientation.
3. Materials and Methods
3.1. Dataset
In this paper, we use the publicly available DNS dataset, which is accessible upon request to the authors of [
The dataset comprises 543 panoramic X-ray images, split into five folds with a resolution of 1991 × 1127 pixels. The test fold has 111 images, while the other four have 108 each. Three folds are used for training and one for validation. To fit within GPU memory, we divide each image into patches of size 512 × 512 pixels with an overlap of 10 pixels along both height and width. This corresponds to a stride of 502 pixels, keeping redundancy low while enabling smooth transitions across patch boundaries. For border patches, the final starting index is adjusted by subtracting the patch size from the image dimension, ensuring that the patch fits entirely within the image boundaries. We choose a patch size of 512 × 512 to balance contextual richness and computational efficiency. Larger patches preserve more semantic and structural information, which is beneficial for tasks such as teeth segmentation. All images are normalized between 0 and 1. In addition, for training the deep-learning models, a unique segmentation label is assigned to each tooth, resulting in a total of 32 segmentation labels.
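To make the patching procedure concrete, the following is a minimal sketch (using NumPy; the helper names and the dummy image size are only illustrative, not the released implementation) of how patch start indices can be generated so that border patches are shifted to stay inside the image:

```python
import numpy as np

def patch_starts(dim_size, patch=512, stride=502):
    """Start indices along one dimension; the last start is clamped
    so the final patch ends exactly at the image border."""
    starts = list(range(0, dim_size - patch + 1, stride))
    if starts[-1] + patch < dim_size:
        starts.append(dim_size - patch)
    return starts

def extract_patches(image, patch=512, stride=502):
    """Split a (H, W) image into overlapping patches of size patch x patch."""
    h, w = image.shape[:2]
    patches = []
    for y in patch_starts(h, patch, stride):
        for x in patch_starts(w, patch, stride):
            patches.append(((y, x), image[y:y + patch, x:x + patch]))
    return patches

# Example with a DNS-sized image (1127 x 1991 pixels)
dummy = np.zeros((1127, 1991), dtype=np.float32)
coords = [c for c, _ in extract_patches(dummy)]
print(len(coords), coords[:3])  # patch count and the first few top-left corners
```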
Table 1 summarizes the key properties of the DNS dataset and the preprocessing steps adopted for model training. These include image resolution, fold distribution, patching strategy, and the use of PCA-based OBB generation. The distribution of segmentation labels is depicted in
Figure 2. Furthermore, we create oriented bounding box coordinates for each tooth using a PCA-based approach.
3.2. Model Architecture
Figure 3 illustrates our deep-learning model. Unlike the original FUSegNet, which was designed for wound segmentation, our model introduces grid-based attention gates into the skip connections to enhance the focus on spatially relevant features, which is particularly beneficial for fine-grained dental structures. It has four major parts: an encoder, a decoder, grid-based attention gates (AGs), and a parallel spatial and channel squeeze-and-excitation (P-scSE) module. Furthermore, to go beyond horizontal bounding boxes, which do not account for angular orientation, we implement a PCA-based oriented-bounding-box (OBB) technique, as detailed in
Section 3.4.
3.2.1. Encoder
To avoid manual scale adjustments, the study employs an EfficientNet architecture as the core encoder. Convolutional neural networks often require tweaking depth, width, and resolution, a traditionally cumbersome and largely trial-and-error process. EfficientNet’s authors [19] introduce a principled approach, using fixed coefficients (α, β, and γ) and a compound coefficient ϕ for uniform scaling. Depth, width, and resolution are scaled as α^ϕ, β^ϕ, and γ^ϕ, respectively. The authors used platform-aware neural architecture search via MnasNet [20] to discover the baseline EfficientNet-B0 architecture. Based on this, they conducted a small grid search to determine the compound scaling coefficients α = 1.2, β = 1.1, and γ = 1.15, which were used to uniformly scale the model to EfficientNet-B1 through B7. While EfficientNet-B0 operates at a 224 × 224 resolution, EfficientNet-B7 has ϕ = 6, yielding 224 × γ^ϕ = 224 × 1.15^6 ≈ 518 resolution. So, we use an EfficientNet-B7 model trained on ImageNet for patches of size 512 × 512.
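For reference, a short sketch of the compound-scaling arithmetic quoted above (the coefficients are those reported in [19]; this is only a worked check, not part of the training code):

```python
# Compound scaling used by EfficientNet: depth, width, and resolution
# are scaled by alpha**phi, beta**phi, and gamma**phi, respectively.
alpha, beta, gamma = 1.2, 1.1, 1.15

def scale_factors(phi):
    return alpha ** phi, beta ** phi, gamma ** phi

d, w, r = scale_factors(phi=6)   # EfficientNet-B7
print(round(224 * r))            # ≈ 518, matching the resolution quoted above
```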
3.2.2. Decoder
As depicted in
Figure 3, during each decoder stage, the upsampled output from the lower level is first concatenated with the encoder output of the corresponding level. However, in contrast to the FUSegNet architecture, the encoder output undergoes processing through a grid-based gated attention module [
21]. This is performed to regulate the flow along the skip connections and allow the attention coefficients to focus more specifically on local regions. The resulting concatenated output then passes through the P-scSE attention module, which aggregates spatial and channel-wise information. Finally, a 3 × 3 Convolution-ReLU-Batch normalization is applied to this output.
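A compact PyTorch sketch of one decoder stage as described above is given below. The module names GridAttentionGate and PscSE are placeholders for the components introduced in Sections 3.2.3 and 3.2.4, and the exact channel counts and gating-signal choice are assumptions rather than the released implementation:

```python
import torch
import torch.nn as nn

class DecoderStage(nn.Module):
    """One decoder level: upsample -> attention-gated skip -> concat -> P-scSE -> conv."""
    def __init__(self, in_ch, skip_ch, out_ch, attention_gate, pscse):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.att = attention_gate      # grid-based attention gate (Section 3.2.4)
        self.pscse = pscse             # parallel scSE module (Section 3.2.3)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x, skip):
        g = self.up(x)                    # upsampled decoder features
        skip = self.att(skip, g)          # gate the encoder features before fusion
        x = torch.cat([g, skip], dim=1)   # concatenate along the channel axis
        x = self.pscse(x)                 # spatial + channel recalibration
        return self.conv(x)               # 3 x 3 Conv-ReLU-BatchNorm
```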
3.2.3. Parallel Spatial and Channel Squeeze-And-Excitation (P-scSE)
The squeeze-and-excitation module [
22] was designed to boost the network’s representational power by highlighting significant features and ignoring less relevant ones. It generates a channel descriptor using global average pooling, triggering channel-related dependencies. It is also referred to as cSE due to its channel-wise excitation. Roy et al. [
23] introduced the sSE module, which squeezes along the channel axis and excites along spatial dimensions. The scSE module combines the cSE and sSE components. This combination can be performed in different ways. As shown in
Figure 4, P-scSE [
4] creates two parallel branches of the scSE module: one by adding cSE and sSE and the other by taking the maximum of them. While max-out offers competitiveness between the channel and spatial excitations, addition aggregates these two excitations. A switch is provided to skip max-out when the number of channels is small: with few channels, the model’s capacity to learn intricate channel dependencies is already limited, so a selective operation such as max-out risks discarding important features without contributing significantly.
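The following is a minimal PyTorch sketch of the P-scSE idea written from the description above; the reduction ratio, the use of 1 × 1 convolutions in the excitation paths, and the way the additive and max-out branches are fused (a final summation) are assumptions, not the exact FUSegNet implementation:

```python
import torch
import torch.nn as nn

class cSE(nn.Module):
    """Channel squeeze-and-excitation: global pooling -> channel-wise gating."""
    def __init__(self, ch, r=16):
        super().__init__()
        hidden = max(ch // r, 1)
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, hidden, 1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.fc(x)

class sSE(nn.Module):
    """Spatial squeeze-and-excitation: 1x1 conv -> per-pixel gating."""
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(ch, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.conv(x)

class PscSE(nn.Module):
    """Parallel scSE: additive branch plus an optional max-out branch."""
    def __init__(self, ch, use_maxout=True):
        super().__init__()
        self.cse, self.sse = cSE(ch), sSE(ch)
        self.use_maxout = use_maxout   # switched off when the channel count is small

    def forward(self, x):
        c, s = self.cse(x), self.sse(x)
        add = c + s                    # aggregation branch
        if not self.use_maxout:
            return add
        return add + torch.maximum(c, s)   # fuse with the competitive (max-out) branch
```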
3.2.4. Grid-Based Attention Gate
In this paper, we use grid-based attention gates (AGs) [
21] to improve skip connections. The AG is a trainable module that is added to skip connections of encoder–decoder-based architecture to reduce the number of false positive predictions for small objects with significant shape variation. It learns to weigh the features from the skip connections, giving more importance to the features that are relevant to the target structure. As shown in
Figure 5, the attention gate calculates the attention coefficient
α, which is the result of additive attention computed from the input feature maps
x and the contextual information provided from the gating signal
g. The attention coefficient identifies important spatial regions by paying attention to task-relevant activations.
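A compact sketch of an additive attention gate in this spirit is shown below (PyTorch). In the grid-based formulation of [21], the skip features are typically downsampled to the gating resolution; here the gating signal is simply resized to the skip resolution for brevity, and the channel sizes are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridAttentionGate(nn.Module):
    """Additive attention gate: weight encoder features x using decoder signal g."""
    def __init__(self, x_ch, g_ch, inter_ch):
        super().__init__()
        self.theta_x = nn.Conv2d(x_ch, inter_ch, kernel_size=1)
        self.phi_g = nn.Conv2d(g_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, x, g):
        # Project both inputs into a common intermediate space.
        theta = self.theta_x(x)
        phi = self.phi_g(g)
        if phi.shape[-2:] != theta.shape[-2:]:
            phi = F.interpolate(phi, size=theta.shape[-2:],
                                mode="bilinear", align_corners=False)
        # Additive attention -> per-pixel coefficient alpha in [0, 1].
        alpha = torch.sigmoid(self.psi(F.relu(theta + phi)))
        return x * alpha   # re-weight the skip-connection features
```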
3.3. Post-Processing
The output of the deep-learning model is refined through a post-processing stage. If two regions share the same label, the larger region is identified as the desired region, while the smaller region is categorized as unwanted. Initially, we identify border pixels using chain coding. Subsequently, we detect the neighboring pixels of each border pixel using 8-connectivity. We only consider those neighboring pixels whose intensities differ from that of the unwanted portion. A minimal code sketch of this relabeling logic follows the case list below. As shown in
Figure 6, we observe three cases that need to be addressed.
Case-I: If all the neighbor pixels are 0 (background), then dissolve the unwanted portion into the background.
Case-II: If the neighbor pixels consist of background pixels and pixels of a specific label, then perform the following:
- -
First, ignore background pixels.
- -
Then, dissolve the unwanted portion into label pixels.
Case-III: If there exist two different labels in the neighboring pixels, then perform the following:
- -
Count the number of border pixels that lie in both labels.
- -
The most frequent label is the winner.
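The sketch below (NumPy/SciPy) illustrates the relabeling logic of Cases I–III. It approximates the chain-coded border traversal with a binary dilation to obtain the 8-connected neighbourhood, so it is an illustrative reimplementation rather than the exact procedure used in the paper:

```python
import numpy as np
from scipy import ndimage

def relabel_small_regions(mask):
    """For each tooth label, keep the largest connected region and dissolve
    smaller duplicates into the background (Case I) or into the label that
    dominates their 8-connected border neighbourhood (Cases II-III)."""
    out = mask.copy()
    for label in np.unique(mask):
        if label == 0:
            continue
        comps, n = ndimage.label(mask == label)
        if n <= 1:
            continue
        sizes = ndimage.sum(np.ones_like(comps), comps, index=range(1, n + 1))
        keep = int(np.argmax(sizes)) + 1
        for c in range(1, n + 1):
            if c == keep:
                continue
            region = comps == c
            # 8-connected neighbours just outside the unwanted region.
            ring = ndimage.binary_dilation(region, structure=np.ones((3, 3))) & ~region
            neighbours = out[ring]
            neighbours = neighbours[neighbours != label]   # differ from the unwanted label
            foreground = neighbours[neighbours != 0]       # Case II: ignore background pixels
            if foreground.size == 0:
                out[region] = 0                            # Case I: dissolve into background
            else:
                vals, counts = np.unique(foreground, return_counts=True)
                out[region] = vals[np.argmax(counts)]      # Case II/III: most frequent label wins
    return out
```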
3.4. Oriented Bounding Box (OBB)
To implement the oriented bounding box (OBB), we utilize principal component analysis (PCA) [
18]. PCA is a statistical technique that transforms correlated observations into uncorrelated values called principal components. It ensures the first component captures the most variance, and subsequent components maximize variance while staying uncorrelated with preceding ones through orthogonal transformations.
We chose PCA due to its interpretability, computational efficiency, and effectiveness in capturing the primary axis of orientation from segmented tooth regions. While deep learning-based or regression-based OBB predictors are potential alternatives, they typically require supervised training with ground truth orientation angles, which are not available in the DNS dataset. We selected PCA over minimum-area bounding rectangles (e.g., OpenCV’s minAreaRect) because, unlike minAreaRect, which returns a rotated rectangle without clear directional cues, PCA explicitly identifies the dominant direction of the tooth contour, making it more suitable for estimating rotation in our context. The steps involved in generating OBBs in this paper are as follows:
Tooth separation: We approach the OBB generation for each tooth individually. This involves keeping one tooth in the image while rendering the others as background. The image is then binarized.
PCA: We then calculate the first two major principal components (PCs) and determine the angle between the first PC and the horizontal axis, which we refer to as PCAangle.
Rotation: We then generate a 2D rotation matrix to rotate the tooth and align it vertically. The rotation matrix for rotating a point (x, y) by an angle θ around an arbitrary pivot (x_c, y_c) can be expressed as follows:

\[
\begin{bmatrix} x' \\ y' \end{bmatrix} =
\begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}
\begin{bmatrix} x - x_c \\ y - y_c \end{bmatrix} +
\begin{bmatrix} x_c \\ y_c \end{bmatrix}
\tag{1}
\]

We calculate θ from PCAangle as the rotation required to bring the first principal component to the vertical axis, i.e., θ = 90° − PCAangle.
HBB: After aligning the tooth vertically, we proceed to plot the horizontal bounding box (HBB). Generating an HBB from contour points is much easier than generating an OBB.
Inverse rotation: We then rotate the tooth along with the HBB back to its original position. To achieve this, we substitute −θ into Equation (1).
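The whole OBB pipeline can be sketched as follows (NumPy). The sign convention θ = 90° − PCAangle and the use of SVD to obtain the first principal component are consistent with the description above, but they are assumptions rather than the authors’ exact code:

```python
import numpy as np

def pca_obb(binary_mask):
    """Sketch of the PCA-based OBB: find the dominant axis of a single-tooth
    mask, rotate the pixels so that axis is vertical, take the axis-aligned
    box (HBB), and rotate its corners back (inverse rotation with -theta)."""
    ys, xs = np.nonzero(binary_mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    center = pts.mean(axis=0)                             # pivot (x_c, y_c)

    # First principal component of the tooth pixels.
    _, _, vt = np.linalg.svd(pts - center, full_matrices=False)
    pc1 = vt[0]
    pca_angle = np.degrees(np.arctan2(pc1[1], pc1[0]))    # angle w.r.t. horizontal axis
    theta = np.radians(90.0 - pca_angle)                  # rotate the first PC to vertical

    def rotate(p, ang):
        r = np.array([[np.cos(ang), -np.sin(ang)],
                      [np.sin(ang),  np.cos(ang)]])
        return (p - center) @ r.T + center

    aligned = rotate(pts, theta)
    (x0, y0), (x1, y1) = aligned.min(axis=0), aligned.max(axis=0)
    hbb = np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]])   # HBB of the aligned tooth
    return rotate(hbb, -theta)                                  # OBB corners in the original frame
```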
3.5. Training and Evaluation
Experimental setup: All experiments are executed on Google Colab Pro+. For optimization, we use the Adam optimizer [25] with an initial learning rate of 0.001 and a weight decay of 1 × 10⁻⁵. The learning rate is decreased by a factor of 0.1 when the metric shows no improvement for 10 epochs. Models are trained for 50 epochs, and the best model is stored. We employ a combined loss function, using both dice loss and focal loss with equal weights. Unlike cross-entropy loss, which overlooks contextual information of surrounding pixels in favor of per-pixel calculation, dice loss accounts for local and global information. Dice loss is represented as DL = 1 − DSC, where DSC is the dice coefficient. Focal loss (FL) is valuable for addressing class imbalance (such as background ≫ foreground), as it prioritizes challenging cases by down-weighting easy examples [26]. So, the final loss function is expressed as follows:

L = DL + FL
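A minimal PyTorch sketch of the combined loss is given below; the focal-loss focusing parameter (γ = 2) and the class count (background plus 32 tooth labels) are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def dice_loss(probs, target_onehot, eps=1e-6):
    """DL = 1 - DSC, soft dice averaged over classes."""
    dims = (0, 2, 3)
    intersection = (probs * target_onehot).sum(dims)
    denom = probs.sum(dims) + target_onehot.sum(dims)
    dsc = (2 * intersection + eps) / (denom + eps)
    return 1 - dsc.mean()

def focal_loss(logits, target, gamma=2.0):
    """FL: cross-entropy with easy examples down-weighted by (1 - p_t)^gamma."""
    ce = F.cross_entropy(logits, target, reduction="none")
    pt = torch.exp(-ce)
    return ((1 - pt) ** gamma * ce).mean()

def combined_loss(logits, target, num_classes=33):
    """Equally weighted dice + focal loss (background + 32 tooth labels)."""
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    return dice_loss(probs, onehot) + focal_loss(logits, target)
```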
Evaluation metrics: For the segmentation task, we use intersection-over-union (IoU), precision, recall, and dice score (DSC), all of which are widely used for encoder–decoder-based architectures. They are defined as follows:

IoU = TP / (TP + FP + FN),  Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  DSC = 2TP / (2TP + FP + FN)

Here, TP, FP, and FN are true positives, false positives, and false negatives, respectively.
For OBB, we calculate the IoU between the ground-truth OBB and the predicted OBB. In later sections, this is termed rotated IoU (RIoU) to distinguish it from the IoU used for segmentation.
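One straightforward way to compute RIoU from the four corner points of each box is via polygon intersection; the sketch below uses the shapely library, which is an implementation choice assumed here rather than one specified by the paper:

```python
from shapely.geometry import Polygon

def rotated_iou(obb_pred, obb_gt):
    """RIoU between two oriented boxes given as 4x2 corner arrays (x, y)."""
    p, g = Polygon(obb_pred), Polygon(obb_gt)
    if not p.is_valid or not g.is_valid:
        return 0.0
    inter = p.intersection(g).area
    union = p.union(g).area
    return inter / union if union > 0 else 0.0
```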
4. Results and Discussion
We can divide our experiments into two parts: segmentation analysis and OBB analysis. In segmentation analysis, we compare our proposed model with state-of-the-art segmentation models. Since our goal is to develop an encoder–decoder-based architecture, we primarily compare the results with popular encoder–decoder structures. We tabulate the results in
Table 2. As shown in the table, our proposed model achieves an IoU score of 82.43% and a DSC score of 90.37%, both of which are the highest scores compared to the other models. Incorporating grid-based attention gates led to a ~1.5% IoU improvement over the original FUSegNet. The post-processing described in
Section 3.3 improves the IoU from 82.0 to 82.43 and the DSC from 90.1 to 90.37. While the gain may appear modest, this step helps to remove artifacts and ensures cleaner segmentation masks, particularly around tooth boundaries.
We also explore two transformer-based approaches. The first one is the Mixed Vision Transformer (MiT)-b2-UNet, for which we incorporate the SegFormer (b2-sized) encoder [
27] pretrained on ImageNet with the UNet architecture. The second one is Swin-Unet [
28]. However, we do not observe satisfactory performance with transformer-based approaches. Possibly, it is because transformers often tend to overfit for small-scale data due to the lack of structural bias [
29] or because they require additional task-specific tuning [
30].
Figure 7 provides a graphical summary of the IoU, Precision, Recall, and DSC scores reported in
Table 2, illustrating the overall performance of each model, with our proposed method achieving the highest scores across all metrics.
Table 2. Segmentation results achieved from deep-learning models. Values are in %.

| Model | IoU | Precision | Recall | DSC |
|---|---|---|---|---|
| DeepLabV3+ [31] | 80.45 | 88.59 | 89.75 | 89.17 |
| FPN [32] | 81.75 | 89.54 | 90.28 | 89.76 |
| MANet [33] | 62.70 | 85.82 | 69.95 | 77.08 |
| LinkNet [34] | 35.25 | 73.63 | 40.35 | 52.13 |
| FUSegNet [4] | 80.79 | 88.51 | 90.26 | 89.38 |
| PSPNet [35] | 28.94 | 59.60 | 36.00 | 44.89 |
| UNet [36] | 67.93 | 86.00 | 76.37 | 80.90 |
| MiT-b2-UNet [27] | 72.65 | 86.80 | 81.67 | 84.16 |
| Swin-Unet [28] | 53.69 | 72.43 | 67.47 | 69.87 |
| Ours | 82.43 | 90.48 | 90.26 | 90.37 |
Figure 8 illustrates the training and validation performance of the proposed model over 50 epochs. The model shows stable convergence with steadily decreasing loss and consistently high IoU and DSC scores on the validation set.
For OBB performance analysis, we first generate oriented bounding boxes using the method described earlier and calculate rotated IoU (RIoU).
We prefer IoU over mean average precision because, unlike region-proposal-based detection methods, oriented bounding boxes generated from encoder–decoder-based approaches do not carry confidence scores.
Figure 9 illustrates the
PCAangle for several example teeth by overlaying the first principal component on each tooth shape. This angle, measured relative to the horizontal axis, is then used to compute the tooth rotation angle,
θ, for generating OBBs. As shown in
Table 3, we achieve an RIoU score of 82.82%.
Figure 10 illustrates segmentation performance through the visualization of predictions compared to the ground truth, and it also presents the segmentation results using both horizontal bounding boxes and oriented bounding boxes.
So far, the evaluation has been based on the entire dataset, with the mean evaluation metrics computed over all teeth. However, such aggregate scores do not always reflect the performance on individual tooth labels. Therefore, we evaluate each of the 32 tooth labels separately. As shown in
Figure 11, we generate a radar chart to demonstrate the performance of each label. We observe that tooth labels 4, 8, 22, 24, and 32 show relatively poor performance compared to the other labels in terms of DSC and RIoU. Their DSC scores range from 80% to 85%, and their RIoUs range from 75% to 80%. In contrast, tooth labels 1–2, 9–10, 18–19, 20–21, and 27–29 demonstrate relatively high performance with DSC scores ranging from 90% to 95% and RIoU ranging from 85% to 90%.
Next, we perform categorical analysis. As shown in
Figure 2, 32 teeth are divided into four categories: incisors, canines, premolars, and molars. Each of these categories is further divided into two groups: upper and lower. So, in total, we have eight categories. Categorical analysis is presented in
Table 4. It is observed that incisors and canines, for both upper and lower jaws, demonstrate good performance. Lower premolars also exhibit good performance. These five categories achieve a DSC of >88% and an RIoU of >83%. However, upper premolars, along with both upper and lower molars, show relatively poor performance, with a DSC of ≤86% and an RIoU of ≤80%.
We identify several factors that influence the model’s performance: missing instance overlaps in the ground truth, the presence of foreign bodies, fuzzy tooth roots, and poor annotations. As shown in
Figure 12a,b, there are overlaps between premolar teeth and between premolar and molar teeth in the panoramic X-ray images; however, there is no instance of overlap in the ground truth. Foreign bodies include dental restorations, broken dental instruments, metal fillings, braces, and other metal objects.
Figure 12c shows the presence of braces and fillings, while
Figure 12d shows the presence of fuzzy tooth roots. As shown in
Figure 12e, some images are poorly annotated and contain sharp edges and cones. However, when compared to the original data, our model predicts smoother edges, which is more reasonable. While some researchers have made modifications to improve the dataset’s quality [
37,
38], in this work, we have used the original dataset as it is publicly available. To better understand the impact of dataset imperfections, we manually reviewed the test set and identified several recurring issues. Specifically, at least 21 images include foreign objects (e.g., braces, caps, fillings, or screws), at least 19 images (excluding foreign objects) exhibit some form of tooth overlaps that are not reflected in the ground truth, and at least 10 images show ambiguous or fuzzy tooth roots, among which 5 also suffer from low overall image quality. Many of the overlaps are observed in category 2. These factors introduce noise and ambiguity during training and evaluation, potentially affecting the model’s ability to learn clear tooth boundaries. While a more rigorous assessment would benefit from clinical validation by dental professionals, such an analysis was beyond the current scope of this study.
Furthermore, we calculate the number of false positives and false negatives in the OBBs in order to detect missing teeth. There are a total of 3382 teeth in the 111 test images. We train and evaluate our model five times. As shown in
Table 3, we do not observe any false positives in any of the runs, and the number of false negatives ranges from three to six. This indicates that the model can be used effectively to detect missing teeth.
The average inference time of the proposed model is approximately 1.2 s per image on Google Colab Pro+, with ~66M model parameters, indicating its suitability for practical clinical use where immediate response is not critical.
5. Conclusions
In this paper, we address the critical tasks of teeth segmentation and orientation in the context of oral healthcare. Accurate teeth segmentation holds immense importance for various applications, including orthodontic treatments, clinical diagnoses, surgical procedures, and dental implant planning. We recognize the challenges posed by manual segmentation, especially in lower-resolution images, and the limitations of existing semi-automatic techniques that still require human intervention. Our model architecture includes an EfficientNet-based encoder and a decoder with grid-based attention gates and a parallel spatial and channel squeeze-and-excitation (P-scSE) module. We also introduce oriented bounding box (OBB) generation using principal component analysis (PCA). By addressing tooth orientation, a largely unexplored problem, within the same framework as segmentation, our work extends beyond current methodologies, advancing the field toward more comprehensive and clinically useful analysis tools. Our model’s accuracy and capabilities open new possibilities for improving dental diagnoses and treatment planning, even when working with clinically standard, moderate-resolution panoramic X-rays. This makes the approach practical for real-world deployment where ultra-high-resolution imaging is not always available. Future work includes extending our approach to 3D instance segmentation of teeth, which is crucial for applications like orthognathic surgery planning and dental implant placement. In addition, we plan to explore advanced data augmentation techniques, including geometric transformations, elastic deformations, and intensity perturbations, to improve the robustness and generalization of the model. We also aim to adapt the proposed framework to other dental imaging modalities, such as bitewing and periapical X-rays, with appropriate adjustments for their localized views and anatomical characteristics.