Semantic Segmentation of Maxillary Teeth and Palatal Rugae in Two-Dimensional Images

The superimposition of sequential radiographs of the head is commonly used to determine the amount and direction of orthodontic tooth movement. A harmless alternative is superimposition on the relatively stable palatal rugae, which can be repeated without time restriction, but the method is performed manually and, if automated, relies on the best fit of surfaces, not only rugal structures. As a first step, motion estimation requires segmenting and detecting the location of teeth and rugae at any time during the orthodontic intervention. Aim: to develop a process of tooth segmentation that eliminates all manual steps to achieve an autonomous system of assessment of the dentition. Methods: A dataset of 797 occlusal views from photographs of teeth was created. The photographs were manually semantically segmented and labeled. Machine learning methods were applied to identify a robust deep network architecture able to semantically segment teeth in unseen photographs. Using well-defined metrics such as accuracy, precision, and the average mean intersection over union (mIoU), four network architectures were tested: MobileUnet, AdapNet, DenseNet, and SegNet. The robustness of the trained network was additionally tested on a set of 47 image pairs of patients before and after orthodontic treatment. Results: SegNet was the most accurate network, producing 95.19% accuracy and an average mIoU value of 86.66% for the main sample and 86.2% for pre- and post-treatment images. Conclusions: Four network architectures were tested for automated individual teeth segmentation and detection in two-dimensional photos that required no post-processing. Accuracy and robustness were best achieved with SegNet. Further research should focus on clinical applications and 3D system development.


Introduction
A crucial step in orthodontic treatment is the assessment of tooth movement, which has been studied using imaging modalities that include photographic 2D images, 2D X-rays, and intraoral 3D scanners. A precursor to measuring tooth movement is the detection, segmentation or separation of data that represents individual teeth [1]. In this paper, we focus on the detection and segmentation of individual teeth in 2D photographic images. Individual teeth segmentation has been achieved through a geometric approach and a machine learning approach.
With the geometric approach, teeth detection was addressed through the segmentation of maxillary individual teeth in 3D intraoral scans using minimum curvature to initiate the segmentation process [2]. However, user interaction was required at multiple stages to exclude the undesirable areas picked by the curvature-based algorithm. Li et al. [3] performed curvature calculations to identify teeth from a facial smile and intra-oral scan.
The process involved contour and curvature calculations followed by thresholding and boundary refinement to achieve segmentation of the front teeth, an insufficient approach to use for further calculations. Another approach involved tooth region separation in 2D X-rays by creating a dental arch represented as a four-degree polynomial [4]. This method validated separating the teeth on 2D X-rays by placing planes between the adjacent teeth; however, the teeth were not segmented.
Trials to segment teeth using machine learning are challenging because of the high variability of teeth shapes among humans and the necessity to develop a trained network and a machine learning model capable of segmenting teeth with high accuracy. Different schemes have been used, mainly on X-rays rather than oral scans. Oktay [5] applied object detection machine learning algorithms on panoramic 2D X-ray images, whereby each tooth was labeled by a bounding box belonging to a specific class of teeth: molars, premolars, and anterior teeth (canines and incisors). Preprocessing was performed to determine regions in images where teeth were expected to be found and to define symmetry [5]. An identified tooth in the X-ray image was highlighted with an encompassing rectangular shape, but this rendition is not the actual shape of the tooth.
Mikia et al. [6] applied deep learning to classify teeth, but the input to the network consisted of the X-ray images of manually pre-segmented teeth on a limited number of 52 images, considered insufficient for training purposes. The process only classified the tooth without generating a precise segmented boundary that can be used for further calculations.
Using an Artificial Neural Network (ANN), which is typically employed in tasks involving pattern recognition in the analysis of digital images, Raith et al. [7] classified dental features from a 3D scan. The features of interest, the cusps of the teeth, were input as feature vectors. Three cusp detection approaches were compared that only classified the cusps but did not segment the individual teeth. Lee et al. [8] performed instance segmentation of teeth, gingiva, and facial landmarks limited to the frontal smile-teeth to have an accuracy above 80%. Lower accuracy was achieved for the teeth farther from the center. The method was biased towards the front teeth.
Of greater relevance was the 3D model segmentation by Xu et al. [9]. The study classified mesh faces on a two-level segmentation, the first separating the teeth from the gingiva and the second segmenting individual teeth. A label optimization algorithm was introduced after each prediction to correct wrongly predicted labels. Nevertheless, "sticky teeth" (pairs of adjacent teeth similarly labeled after optimization) were sometimes falsely predicted. This problem was corrected with Principal Component Analysis. Finally, the predicted labels of the second segmentation were back-projected to the original model. This method required three-dimensional scan data and involved pre-and post-processing steps of the input to generate a precise prediction. Our method aims to generate automated predictions without any additional steps being applied to the input while generating the prediction.
One of the most significant goals of reproducing the dentition during growth or orthodontic tooth movement is the ability to determine tooth displacement relative to fairly stable oral structures, such as the palatal rugae, obviating the need to take sequential harmful radiographs. Present methods of rugae superimposition are performed manually and, if automated, rely on the best fit of surfaces, not only rugal structures. In a first step, motion estimation requires segmenting and detecting the location of teeth and rugae at any time during the orthodontic intervention.
Considering that the available methods of tooth prediction relied on manual manipulation or did not involve tooth segmentation, we aimed to generate automated segmentation consistent with all the teeth as well as precise predictions without any manual steps applied to the input while generating the prediction. In this paper, we present the development of tooth segmentation using readily available 2D occlusal photographs of the maxillary dental arch through a process that eliminates all manual steps to attain a completely autonomous system. Different architectures are tested in this process to find the most suitable network for the investigated dataset. A secondary aim was to compare four architectural styles that we planned on using.

Material and Methods
The study included the creation of a labeled dataset of photographic images of actual patients taken at the occlusal view of the maxillary arch. Deep learning was implemented through a Fully Convolutional Neural Network (F-CNN) architecture to semantically segment individual teeth and palatal rugae in color images. A benchmark using different network architectures was performed to assess the effect of data augmentation on the semantic segmentation of teeth. Accuracy and associated metrics were defined to identify the best network architecture for semantic segmentation of maxillary teeth and palatal rugae.

Dataset Collection
The dataset consisted of 797 photographic images compiled by the Division of Orthodontics and Dentofacial Orthopedics at the American University of Beirut Medical Center (AUBMC). In contrast to previous publications in which X-ray images were used, colored RGB two-dimensional (2D) images of the maxillary teeth and palate belonging to various malocclusions were taken according to standards at the occlusal plane with a single-lens reflex camera using an intraoral mirror. The images were taken at different distances to generate diversity in the dataset and make the training more robust. Images from the late mixed to permanent dentitions were included to illustrate the differentiation between primary canines and molars from permanent canines and molars. Also included to widen the scope of training were images with dental appliances, significant crowding, and prosthesis (including primary teeth with stainless steel crown) (Figure 1). Excluded were images with multiple missing teeth, supernumerary, and transposed teeth. The images were saved in "PNG" format and with a 480 × 320 pixel resolution because this was the common resolution for the majority of the taken images and required less memory for training compared to high resolution images.

A total of 719 images were segmented into 5 families of anatomical structures, including 4 for teeth (molars, premolars, incisors, canines) and one for the rugae. This set of data contributed to training a network to semantically segment these families (Figure 2A-C). To test individual structures, we augmented the sample with 78 randomly selected images. A total of 797 images were segmented into individual structures, comprising 23 labels for individual primary (n = 10) and permanent (n = 12) teeth as well as the rugae (n = 1) (Figure 2D,E). An additional dataset composed of 47 pairs of images was utilized solely to test the robustness of the trained network. Each pair of images was taken of the same patient before and after orthodontic treatment (Figure 3).

Dataset Labeling Methods
Image labels serve as the ground truth for the training, validation, and testing of various neural network architectures. Labeling for semantic segmentation consists of assigning a class to every pixel in an image. In this work, the users who performed the labeling (orthodontic residents) identified the pixels in an image in the form of polygons drawn to fit the shape of the object of interest. The labeling was applied to the entire dataset of (797 + 2 × 47 = 891) images.

Semantic Labeling
The pixel labeling was performed in a MATLAB application [10] by manually creating polygons following the contour of the regions of interest. For each image in the dataset (Figure 2A), an associated image of similar dimensions was created to identify the labels by assigning a different color to each label (Figure 2B,E). The teeth were captured upon superimposing the label on the original image (Figure 2C,F). All labels and contours of the teeth and the rugae area were verified by the labeling orthodontists.
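The conversion from hand-drawn polygons to a per-pixel label image can be illustrated with a minimal sketch. It is not the MATLAB application used in the study; the function names `point_in_polygon` and `rasterize`, the class ids, and the toy coordinates are all assumptions made for illustration. Each pixel center is tested against each polygon with a ray-casting rule and receives the class id of the polygon that covers it.

```python
# Illustrative sketch only: hypothetical helpers, not the study's MATLAB tool.

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if the point (x, y) lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(polygons, width, height, background=0):
    """Build a label image; each pixel gets the class id of the polygon covering its center."""
    label = [[background] * width for _ in range(height)]
    for class_id, polygon in polygons:
        for row in range(height):
            for col in range(width):
                if point_in_polygon(col + 0.5, row + 0.5, polygon):
                    label[row][col] = class_id
    return label


# Toy 8 x 6 image with two made-up regions (class ids are illustrative).
polygons = [
    (1, [(1, 1), (4, 1), (4, 4), (1, 4)]),  # a square region
    (2, [(5, 2), (8, 2), (8, 5), (5, 5)]),  # a second square region
]
label_image = rasterize(polygons, width=8, height=6)
```

Mapping each class id to a distinct color would then give a color-coded label image of the kind described for Figure 2B,E.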

Label Statistics
Semantic segmentation of teeth and rugae is challenging because of the variable sizes of the labels. The relatively larger size of the rugae label compared with the labels of individual teeth could skew the results of the network training. This issue was mitigated by utilizing an appropriate training accuracy metric (see Section 2.3.3. below). The number of pixels associated with each label was tallied for each labeling scheme and dataset combination.
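A per-label pixel tally of this kind can be sketched as follows. The helper names `tally_pixels` and `class_fractions` and the toy label images are assumptions for illustration; class 2 merely stands in for a large label such as the rugae.

```python
from collections import Counter


def tally_pixels(label_images):
    """Count how many pixels carry each class id across a set of label images."""
    counts = Counter()
    for label in label_images:
        for row in label:
            counts.update(row)
    return counts


def class_fractions(counts):
    """Convert raw pixel counts into the share of all pixels held by each class."""
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


# Two toy 3 x 4 label images (0 = background, 1 = a small label, 2 = a large label).
toy_labels = [
    [[0, 0, 1, 1],
     [0, 2, 2, 2],
     [0, 2, 2, 2]],
    [[0, 1, 1, 0],
     [2, 2, 2, 2],
     [2, 2, 2, 0]],
]
counts = tally_pixels(toy_labels)
fractions = class_fractions(counts)
```

In this toy example the large label holds 13 of 24 pixels, which shows how one dominant class can inflate a plain pixel accuracy.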

Machine Learning and Semantic Segmentation
Deep learning is a model designed to analyze data similar to human analysis by using a layered structure of algorithms called an artificial neural network [11]. The design of such networks was inspired by the biological neural networks of the human brain. The algorithms are trained to find and identify patterns and features in massive amounts of data, enabling the network to generate predictions.
To apply semantic segmentation machine learning, we identified a labeled dataset comprised of images and their associated labels. The labeled images were fed into a chosen network with a specific architecture as input (see Section 2.3.2). The network in turn produced a prediction of the label as an output. To assess the accuracy of the network, the predicted label was compared to the input label, which is considered as ground truth. While in most instances, the training process starts with initial values referred to as a pretrained network, such as the semantic segmentation application used by Siam et al. [12,13], in our study, the main architecture was trained from scratch because of the absence of a pre-trained network relevant to teeth segmentation.

Dataset Split
In typical machine learning applications, the data are split into three categories: training (in which most of the data are used), validation (against which the training progress is validated), and testing. The validation and test sets are disjoint from the training set. Nearly 90% of our dataset with the family of teeth labeling scheme were dedicated as training and validation sets (Table 1), which were further split into 92% for training (586 images), and 8% for validation (53 images). Likewise, nearly 90% of the dataset with the individual teeth labeling scheme were earmarked for training and validation, which were also further split into 91% for training (641 images) and 9% for validation (64 images). The data from the 47 pairs of before and after images of patients were entirely used for testing to assess the robustness of the trained model.
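The reported percentages can be checked from the image counts given above. The sketch below is only an arithmetic check; the helper name `split_summary` is hypothetical, and treating the remaining images as the test set is an assumption, since Table 1 is not reproduced here.

```python
def split_summary(total, train, val):
    """Express a train/validation split as shares of the dataset."""
    train_val = train + val
    return {
        "train_val_share": train_val / total,  # share of the whole dataset
        "train_share": train / train_val,      # share within train + validation
        "val_share": val / train_val,
        "test": total - train_val,             # assumed: the remainder is the test set
    }


# Counts taken from the text: 719 family-labeled and 797 individually labeled images.
family = split_summary(total=719, train=586, val=53)
individual = split_summary(total=797, train=641, val=64)
```

Both schemes give a training plus validation share of about 0.89, consistent with "nearly 90%", and training shares of about 0.92 and 0.91, respectively.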

Network Architectures
We applied four architectures (described in Appendix A) because together they cover characteristics of our dataset that no single architecture addresses on its own: (a) the MobileUnet network [14], comprised of a small number of layers, is relatively fast to train and widely used in medical applications; (b) the AdapNet network [15] is designed to adapt to environmental changes and to focus less on the environment when predictions are made, so that images taken in different lighting conditions would not affect its predictive ability; this feature was appropriate for our dataset because the images were taken at varying proximity; (c) the DenseNet network [16], a model that uses features of various complexity levels to predict smooth boundaries, can deal with datasets comprising a relatively small number of images, as in our study, compared with datasets of up to hundreds of thousands of images; (d) the SegNet network [17], designed to be efficient while using a limited amount of memory and primarily designed to perceive spatial relationships in scenes such as road scenes, was important for our dataset, which included variable distances from which images of the teeth were captured.

Network Assessment
Semantic segmentation predictions are typically evaluated using an average mean intersection over union (average mIoU). For semantic segmentation, given two image labels representing the ground truth and its associated prediction, the IoU for a given class (c) can be defined as $\mathrm{IoU}_c = \frac{\sum_i (o_i \cap y_i)}{\sum_i (o_i \cup y_i)}$, where $o_i$ is the predicted pixel and $y_i$ is the ground truth label pixel, each indicating membership in class c, ∩ is a logical "and" operation, and ∪ is a logical "or" operation. This computation, visually represented in Figure 4A, is similar to the originally defined equation [18].
In addition, other metrics were computed to assess the accuracy of the trained model, namely the pixel accuracy and pixel precision. Pixel accuracy is the percentage of pixels in an image that are correctly classified with respect to the input ground truth pixels. Given two images representing the ground truth and its associated prediction, the pixel accuracy for a given class (c) can be defined as $\mathrm{Accuracy}_c = \frac{\sum_i (o_i \cap y_i)}{\sum_i y_i}$, where $o_i$ is the predicted pixel, $y_i$ is the ground truth label pixel, and ∩ is a logical "and" operation (Figure 4B). This accuracy measure can be evaluated for a specific class in an image, as an average for all classes in an image, or as an average for a single class for the entire dataset. The latter metric is referred to as the "per-class" pixel accuracy, which can provide more information on the ability of the network to precisely segment a specific label. This metric is especially useful for classes that occupy small regions in the image.
Pixel precision is defined by the ratio of the correctly detected pixels to all predicted pixels. This metric describes how many correct predictions there are compared to the total predictions generated by the model. Similarly, given two images representing the ground truth and its associated prediction, the precision for a given class (c) can be defined as $\mathrm{Precision}_c = \frac{\sum_i (o_i \cap y_i)}{\sum_i o_i}$, where $o_i$ is the predicted pixel, $y_i$ is the ground truth label pixel, and ∩ is a logical "and" operation (Figure 4C). Because the dataset exhibited a class imbalance with the dissimilar class sizes, the average mIoU is a better metric to assess the network prediction accuracy than the pixel accuracy or pixel precision. The class imbalance is exaggerated by the fact that the background (gingiva and non-teeth regions) and the rugae labels cover relatively larger areas than the rest of the classes (teeth); hence, as expected, the high pixel accuracy does not translate into a more accurate semantic segmentation [19].
One of the methodological approaches employed in the individual teeth labeling scheme involved data augmentation. We list this method and the corresponding figure under Results (Section 3.1.3) because of the sequence-dependent results.
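The three metrics can be sketched per class on a toy label pair. This is a reading of the prose definitions above rather than the evaluation code used in the study: the names `class_metrics` and `mean_iou` are hypothetical, per-class accuracy is taken as correct pixels over ground truth pixels, and the mean is taken over classes within one image, which may differ from how the average mIoU was aggregated across the dataset.

```python
def class_metrics(pred, truth, cls):
    """Per-class IoU, pixel accuracy, and pixel precision for one label pair."""
    tp = fp = fn = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            if p == cls and t == cls:
                tp += 1  # predicted AND ground truth
            elif p == cls:
                fp += 1  # predicted only
            elif t == cls:
                fn += 1  # ground truth only
    union = tp + fp + fn  # predicted OR ground truth
    return {
        "iou": tp / union if union else None,
        "accuracy": tp / (tp + fn) if tp + fn else None,   # relative to ground truth pixels
        "precision": tp / (tp + fp) if tp + fp else None,  # relative to predicted pixels
    }


def mean_iou(pred, truth, classes):
    """Average the IoU over the classes present in the prediction or the ground truth."""
    ious = [class_metrics(pred, truth, c)["iou"] for c in classes]
    ious = [v for v in ious if v is not None]
    return sum(ious) / len(ious)


# Toy 3 x 4 ground truth and prediction with three made-up classes.
truth = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [2, 2, 2, 2]]
pred = [[0, 1, 1, 1],
        [0, 0, 0, 1],
        [2, 2, 2, 0]]
metrics_class_1 = class_metrics(pred, truth, 1)
miou = mean_iou(pred, truth, [0, 1, 2])
```

For class 1 in this example, three of four ground truth pixels are recovered with one spurious pixel, so the accuracy and precision are both 0.75 while the IoU is only 0.6, illustrating why IoU is the stricter measure.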

Results

Labeling Statistics
The statistics of the family and individual structure labeling schemes demonstrate the dominance of the rugae labels. The number of pixels associated with primary teeth and third molars was negligible in comparison with the other labels in both the family of teeth and individual teeth schemes. Accordingly, the accuracy in segmenting the primary teeth and the third molars was expected to be low. In the 47 pairs of (pre- and post-treatment) images (Figure 5C), the distribution of the teeth labels was similar to the training dataset (Figure 5B).

Family of Teeth Labeling Scheme
Of the four network architectures, the SegNet and DenseNet exhibited the highest accuracy in terms of average mIoU (55.99% and 54.95%, respectively) (Table 2). For both networks, the predicted labels exhibited spatial shifts (Figure 6), indicating that the network model memorized the spatial position of the teeth rather than segmenting them. To mitigate this issue, data augmentation was employed.

Individual Teeth Labeling Scheme
Considering the result of the family of teeth scheme, whereby the networks exhibited spatial memory, and the need for data augmentation to improve the networks' accuracy, two data augmentation methods were employed. The first involved rotating the images (and their labels) and then adding them to the original set (Figure 7). The second changed the perspective of the images (and their labels) by shearing them (Figure 7E).

To assess the effect of the data augmentation, the top two performing architectures, SegNet and DenseNet, were re-trained on the full dataset (Table 1). The two sets were tested individually and incrementally, culminating in six training combinations (Table 3). The highest average mIoU was on the dataset that used the rotation data augmentation only. The perspective data augmentation did not improve training accuracy.
Accordingly, only the rotation data augmentation was used in the dataset for the final training. Consequently, the four original architectures were retrained on the full dataset (including the rotation data augmentation) using the individual teeth labeling scheme. SegNet remained the best architecture, with an average mIoU of 86.66% and an accuracy of 95.19% (Table 4). A sample of the results from the SegNet architecture performed on the test dataset with the rotation augmented dataset is illustrated in Figure 8, showing the worst, average, and best predictions.
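Rotation augmentation of an image together with its label can be sketched as follows. The rotation angles and the interpolation used in the study are not stated, so the angle list, the nearest-neighbor sampling, and the helper names `rotate_nearest` and `augment_with_rotations` are assumptions. Nearest-neighbor sampling is chosen here because it never blends two class ids into a value that is not a valid label.

```python
import math


def rotate_nearest(grid, angle_deg, fill=0):
    """Rotate a 2D grid about its center using inverse mapping and nearest-neighbor sampling."""
    h, w = len(grid), len(grid[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Inverse rotation: locate the source pixel that lands on (r, c).
            dx, dy = c - cx, r - cy
            sx = round(cos_a * dx + sin_a * dy + cx)
            sy = round(-sin_a * dx + cos_a * dy + cy)
            if 0 <= sx < w and 0 <= sy < h:
                out[r][c] = grid[sy][sx]
    return out


def augment_with_rotations(pairs, angles):
    """Append rotated copies of every (image, label) pair to the original set."""
    augmented = list(pairs)
    for image, label in pairs:
        for angle in angles:
            # The same transform is applied to the image and to its label.
            augmented.append((rotate_nearest(image, angle), rotate_nearest(label, angle)))
    return augmented


# Toy single-channel "image" and label; the angle is illustrative only.
toy = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
augmented_set = augment_with_rotations([(toy, toy)], angles=[180])
```

Applying an identical transform to the image and its label keeps every pixel aligned with its ground truth class, which is the property the augmentation relies on.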

Machine Learning Robustness
The SegNet network was used to test the robustness of the trained model on the independent third dataset of 47 pairs of pre- and post-orthodontic treatment images.

Network Accuracy
Using the same statistical analysis on this dataset as on the previous two datasets, the number of pixels per class was computed. A sample result for the Right Central Incisor class is shown in the appendix (Figure A1A). The primary teeth and the third molar classes had the fewest pixels (close to 10% in all the images as shown in Figure A1O-T). Accordingly, two average mean IoUs were computed, one including all classes and the second ignoring the low-occurrence classes. To focus on the teeth segmentation accuracy of the network, the rugae class was also excluded from the second computation, which we refer to as the "Teeth Only IoU" (Table 5). The Teeth Only IoU value was 86.2% (Table 5), but the average mean IoU of all teeth (including primary teeth and third molars) and rugae was lower (82.9%), as expected, because the rugae boundaries were not consistently defined during the labeling process.
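The difference between the two averages can be sketched with a small helper. The per-class values below are made up for illustration and are not the results of this study; the helper name `mean_iou_over` and the class names are likewise hypothetical.

```python
def mean_iou_over(per_class_iou, exclude=()):
    """Average per-class IoU values, skipping any class named in `exclude`."""
    kept = [value for name, value in per_class_iou.items() if name not in exclude]
    return sum(kept) / len(kept)


# Made-up per-class IoU values, for illustration only (not the paper's results).
per_class_iou = {
    "central_incisor": 0.90,
    "canine": 0.88,
    "first_molar": 0.86,
    "third_molar": 0.40,     # low-occurrence class
    "primary_canine": 0.35,  # low-occurrence class
    "rugae": 0.70,
}
all_classes = mean_iou_over(per_class_iou)
teeth_only = mean_iou_over(
    per_class_iou, exclude={"third_molar", "primary_canine", "rugae"}
)
```

Excluding the low-occurrence classes and the rugae raises the average in this toy case, mirroring the direction of the difference reported between the two values in Table 5.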

Network Robustness
To validate the robustness of the trained network, the accuracy of prediction was gauged separately for the pre- and post-treatment images, and the two sets were compared using the chi-square goodness-of-fit test. The prediction values in both sets were not statistically different (Table 5).

Discussion
The main contribution of this study was the segmentation of 2D clinical images through artificial intelligence and machine learning to quantify orthodontic tooth movement. To the best of the authors' knowledge, prior reporting on semantic segmentation for teeth was not available, prompting the investigation of multiple architectures to determine the best performing method.
The goal of the segmentation was to label each pixel of the intraoral 2D image with the corresponding tooth structure. While the trained network was able to depict the spatial location of a family of teeth in the image, the initial step of segmentation was reinforced through data augmentation, resulting in a more precise estimation of a matching set of teeth. The perspective data augmentation did not improve training accuracy, possibly because the images already had perspective variability since they were taken with actual cameras.
The augmentation method is commonly used in the learning phases of segmentation to increase matching accuracy [19]. In this study, this approach induced a substantial increase in the average mIoU accuracy with the best performing architecture, SegNet, which yielded mislabeling of only 1/20th of individual tooth structure.
The rugae label exhibited the lowest accuracy, probably because the rugae area was not well-defined like the teeth. This finding was accentuated by the facts that the variation of the rugae class may have also been associated with the performance of the ground truth labels by several orthodontists, and that the defining boundary can vary between individuals. A separate investigation of the source of variation is warranted to determine the contribution of each of these factors to the rugal delineation. Operator errors in segmenting the various categories of teeth could also be explored, although in this initial project, errors in the delineation of tooth contours did not impact machine learning because tooth labeling was not affected. In addition, outliers such as those in the first row of Figure 8 show how far the worst result is from the average, demonstrating that many more such images should be included in future research for the model to be sufficiently trained on them.
The sample image exhibiting missing teeth was mislabeled by the trained model. Specifically, if one of the premolars was missing, the existing premolar could be mislabeled. For instance, in Figure 9B, the left premolar (situated on the right side of the image) was labeled correctly as the first premolar and colored in blue, while the right premolar on the opposite side was labeled incorrectly as a second premolar colored in brown. Orthodontists can label these teeth correctly because the small space between the left canine and premolar indicates the prior existence of a premolar that was extracted. The prediction of both teeth is depicted in Figure 9C. The right premolar was predicted correctly owing to the spacing; however, the left premolar was predicted falsely and was classified as a second premolar.
12, x FOR PEER REVIEW 13 of 19 and premolar indicates the prior existence of a premolar that was extracted. The prediction of both teeth is depicted in Figure 9C. The right premolar was predicted correctly owing to the spacing; however, the left premolar was predicted falsely and was classified as a second premolar.
The lack of statistical significance between pre- and post-treatment image predictions indicates that our model was robust regardless of whether the input images were taken from the pre- or the post-treatment set. This finding reflects a valuable attribute of the trained model because most of the pre-treatment images had malaligned and crowded teeth, yet the trained model was able to correctly segment them.
Considering the high scores in accuracy and robustness, the training of the system was successful in segmenting the teeth. This achievement sets the path to validate the superimposition of images aiming to quantify tooth movement during and after the completion of orthodontic treatment. Future research should help determine stable structures or planes on which to superimpose images taken at different timepoints, and compare the results to current radiological superimposition methods for cross-validation. Radiological superimposition to evaluate tooth movement could then be dispensed with, reducing radiation exposure.
The present two-dimensional teeth segmentation has validated the use of machine learning tools to identify and accurately segment teeth in 2D photographs. However, the 2D imaging modality restricts the motion estimation of teeth, which is limited to planar motion (x and y) and a single rotation (about z), both computed in the plane of the 2D image. Hence, the estimated motion would be a projection of the actual 3D motion of the teeth. To minimize the loss of information due to projection, the next step is to extend machine learning methods to the 3D domain, either by estimating the 2D motion on several independent image planes or by applying machine learning methods to a 3D imaging modality such as intraoral 3D scans. The applicability of the 2D machine learning methods developed in this work to 3D intraoral scans should be investigated.
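To make the planar restriction concrete, the following is a minimal sketch of how a translation (x and y) and a single rotation about z could be recovered from corresponding 2D points of one tooth at two timepoints. This work does not specify a motion estimator; the least-squares rigid fit, the function name, and the sample points are assumptions made for illustration only.

```python
import math


def estimate_planar_motion(before, after):
    """Least-squares rigid fit: returns (theta, (tx, ty)) such that
    after ~= R(theta) @ before + t, for lists of (x, y) point pairs."""
    n = len(before)
    if n == 0 or n != len(after):
        raise ValueError("need two non-empty point lists of equal length")
    # Centroids of both point sets.
    bx = sum(p[0] for p in before) / n
    by = sum(p[1] for p in before) / n
    ax = sum(q[0] for q in after) / n
    ay = sum(q[1] for q in after) / n
    # Accumulate the dot and cross products of the centred points.
    s_cos = 0.0
    s_sin = 0.0
    for (px, py), (qx, qy) in zip(before, after):
        ux, uy = px - bx, py - by
        vx, vy = qx - ax, qy - ay
        s_cos += ux * vx + uy * vy
        s_sin += ux * vy - uy * vx
    theta = math.atan2(s_sin, s_cos)  # rotation about the z axis
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated "before" centroid onto the "after" one.
    tx = ax - (c * bx - s * by)
    ty = ay - (s * bx + c * by)
    return theta, (tx, ty)


# Hypothetical contour points of one tooth, moved by a known planar motion.
true_theta = math.radians(10.0)
true_t = (2.0, -1.0)
before = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0), (0.0, 3.0)]
after = [(math.cos(true_theta) * x - math.sin(true_theta) * y + true_t[0],
          math.sin(true_theta) * x + math.cos(true_theta) * y + true_t[1])
         for x, y in before]
theta, (tx, ty) = estimate_planar_motion(before, after)
print(math.degrees(theta), tx, ty)
```

Any out-of-plane component of the true 3D motion (tipping or intrusion, for instance) is not observable in such a fit, which is the projection loss described above.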

Conclusions
A semantically labeled dataset of maxillary teeth photographed from the occlusal view was used to develop autonomous tooth segmentation through a process that eliminated manual manipulation. The dataset consisted of color images, in contrast to previous research in which X-ray images were used. Machine learning methods were applied to identify the best network architecture for semantic segmentation of the images. The best network was SegNet, which yielded 95.19% accuracy and an average mIoU of 86.66%. The developed method required neither post-processing nor pre-training. The model robustness, verified on an independent set of pre- and post-orthodontic treatment images, yielded an average mIoU value of 86.2% for the individually tested teeth. This model should help develop superimposition schemes of sequential occlusal images on stable structures (e.g., palatal rugae) to determine tooth movement, obviating the need for harmful radiation exposure.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A.1 Architectures
In this work, four main architectures were included in our benchmark analysis. Our approach was similar to the network architecture study employed for urban scenes [20]. All of the network architectures share the same down-sampling factor of 32, which allows for a proper assessment of the decoding method. The networks include:
(1) FC-DenseNet56 [16]: This network uses a downsampling-upsampling style encoder-decoder network. As the name suggests, it consists of 56 layers. In this architecture, dense blocks (DB) consisting of 4 layers are used. Each layer consists of a batch normalization, followed by ReLU, a 3 × 3 convolution, and dropout with probability p = 0.2. The DB has a growth rate of 12. In addition, on the downsampling side, each DB is followed by a Transition Down (TD) block that consists of batch normalization, followed by ReLU, a 1 × 1 convolution, a dropout with p = 0.2, and a non-overlapping max pooling of size 2 × 2. On the upsampling side, each DB is preceded by a Transition Up (TU) layer that has a 3 × 3 transposed convolution with a stride of 2 to compensate for the pooling operation. The network is terminated by a 1 × 1 convolution and a softmax layer.
(2) MobileUNet-Skip [14]: This architecture comprises 28 layers built from two types of convolution layer blocks. The first type is a convolution layer consisting of a regular convolution followed by batch normalization and ReLU. The second type consists of a depthwise convolution followed by batch normalization and ReLU. This layer is then followed by a 1 × 1 point-wise convolution, batch normalization, and ReLU. Finally, the architecture goes through a fully connected layer that feeds into a softmax layer for classification.
(3) Encoder-Decoder-Skip based on SegNet [17]: This network uses a VGG-style encoder-decoder, where the upsampling in the decoder is done using transposed convolutions. The encoder network consists of 13 convolutional layers. Each encoder layer performs a convolution with a filter bank to produce a set of feature maps, followed by batch normalization and an element-wise ReLU. Then, max-pooling with a 2 × 2 window and a stride of two is performed. Every encoder layer has a corresponding decoder layer; hence, the decoder network also consists of 13 layers, which are similar to the encoder layers but replace the max-pooling with upsampling of the input feature map followed by batch normalization. In addition, the architecture employs additive skip connections from the encoder to the decoder.
(4) AdapNet [15]: This architecture is a modified version of ResNet50 that uses bilinear upscaling instead of transposed convolutions. In addition, lower resolution processing is performed using a multi-scale strategy with atrous convolutions.
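Some of the quantities quoted above can be checked with simple arithmetic. The sketch below is illustrative only: the 512 × 512 input size and the 64 and 128 channel counts are hypothetical, the function names are ours, and bias terms are ignored. It assumes that a down-sampling factor of 32 corresponds to five 2 × 2 pooling steps, that each dense block layer contributes a number of feature maps equal to the growth rate, and that a depthwise convolution uses one k × k filter per input channel.

```python
def spatial_size_after_downsampling(height, width, num_poolings=5):
    """Feature-map size after repeated 2 x 2 pooling (2**5 = 32)."""
    factor = 2 ** num_poolings
    return height // factor, width // factor


def dense_block_new_feature_maps(num_layers=4, growth_rate=12):
    """Feature maps added by one dense block: one growth step per layer."""
    return num_layers * growth_rate


def regular_conv_weights(k, c_in, c_out):
    """Weights of a regular k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k, c_in, c_out):
    """Weights of a k x k depthwise convolution plus a 1 x 1 point-wise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise


# Hypothetical 512 x 512 input and 64 -> 128 channel convolution.
print(spatial_size_after_downsampling(512, 512))
print(dense_block_new_feature_maps())
print(regular_conv_weights(3, 64, 128))
print(depthwise_separable_weights(3, 64, 128))
```

Under these assumptions, the depthwise and point-wise pair used in MobileUNet-Skip needs roughly an eighth of the weights of a regular 3 × 3 convolution with the same channel counts, which is the usual motivation for that block design.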