Do Radiographic Assessments of Periodontal Bone Loss Improve with Deep Learning Methods for Enhanced Image Resolution?

Resolution plays an essential role in oral imaging for periodontal disease assessment. Nevertheless, due to limitations in acquisition tools, a considerable number of oral examinations have low resolution, making the evaluation of this kind of lesion difficult. Recently, the use of deep-learning methods for image resolution improvement has seen an increase in the literature. In this work, we performed two studies to evaluate the effects of using different resolution improvement methods (nearest, bilinear, bicubic, Lanczos, SRCNN, and SRGAN). In the first one, specialized dentists visually analyzed the quality of images treated with these techniques. In the second study, we used those methods as different pre-processing steps for inputs of convolutional neural network (CNN) classifiers (Inception and ResNet) and evaluated whether this process leads to better results. The deep-learning methods lead to a substantial improvement in the visual quality of images but do not necessarily promote better classifier performance.


Introduction
The visual quality of imaging examinations is a major factor impacting the diagnosis process for several oral diseases. This quality is essential, since accurate identification of anatomical substructures, pathologies, and functional features depends on it. Diagnostic errors affect treatment planning and pose a huge risk to the health of patients. Even though the advent of digital systems promoted better quality for examinations, many issues can result in low-quality images. Previous works assessed the visual quality characteristics observed by experts when they evaluated images [1,2]. Along with their overall appearance, experts considered features such as radiodensity, edge definition, image contrast, and resolution, this last being mainly related to the sensor's capacity.
Regarding periodontal imaging, the use of periapical radiographs can be considered an important tool in this area, helping in the diagnosis, treatment, and prognosis of periodontal diseases [1][2][3][4][5][6]. Moreover, periapical radiographs are considered the standard for evaluating periodontal bone loss (PBL), since this modality facilitates the identification of bone defects [3][4][5][6][7][8]. This kind of examination usually covers not only the imaged teeth in their entirety, from their roots to their crowns, but also their adjacent bone. For this type of examination, low resolution is also an issue, since the corresponding acquisition tools have limitations.
In the scope of radiographic image acquisition, high spatial resolutions demand high ionizing radiation doses. Given the popularity of radiographic examinations, many studies have emphasized the harmful effects of X-rays [9]. However, reducing the radiation dose to mitigate these effects also limits the achievable spatial resolution. Results presented in the literature show that using resolution improvements leads to an increase in CNNs' classification accuracy compared with using low-resolution images.
This work's main objective was to perform a complete evaluation, considering both qualitative and quantitative analyses, of how different super-resolution algorithms (nearest, bilinear, bicubic, Lanczos, SRCNN, and SRGAN) applied to periapical images can impact the assessment of periodontal bone loss. We observed the perceptual quality of the images provided by such methods (essential for human assessment) and these methods' effects on automatic pattern recognition algorithms. This is innovative work, since no previous works evaluated the effects of deep-learning resolution improvement methods on the assessment of oral diseases. For this analysis, we performed two studies.
In the first one, a set of five periapical radiographs was treated with the six super-resolution methods. The resultant images were visually evaluated considering their subjective qualities. Their scores were compared using the visual grading characteristics (VGC) curve [27].
In the second study, we extracted a set of regions of interest (ROIs) from periapical images covering the interproximal areas and treated them with the considered methods. Then, we trained convolutional neural networks with these treated images to classify whether or not the regions presented interproximal bone loss. In that way, we found out whether using the images treated with deep-learning methods led to better performance in the classification task.

Radiographic Identification of Periodontal Bone Loss
Radiographs act as a complement to clinical examinations in the assessment of PBL and periodontal diseases in general [1,2,28]. In most cases, radiographs reveal features that are difficult to assess clinically, such as advanced periodontal lesions [28]. Nevertheless, radiographs also present some limitations. For example, bone destruction tends to appear less severe than it actually is, resulting in undetected mild destructive lesions, since they do not change the tissue density, and consequently the radiodensity, enough to be detectable in an exam [28]. Early bone changes can be identified in radiographs as subtle, mild erosions in the interproximal alveolar bone crest. These erosions tend to appear as very slight changes, but this does not mean that the disease process is recent, since the loss must progress for 6 to 8 months before radiographic evidence becomes visible [28]. When these early bone changes progress, they evolve into more severe bone loss, which can be identified as an increase in radiolucency due to a decrease in tissue density.
In this work, we focus on horizontal bone loss, vertical bone defects, and interdental craters. These patterns of bone loss may be visible radiographically. In general, interproximal bone loss can be radiographically and clinically observed as an increase in the distance from the enamel-cement junction to the alveolar crest. Horizontal bone loss consists of a horizontal loss in the alveolar bone's height; i.e., the tissue destruction is symmetrical. Radiographically, vertical bone loss can be identified as a deformity in the alveolus extending apically along the root of the affected tooth from the alveolar crest. When it happens in an interproximal region between two teeth, it can be seen as an uneven lesion, more accentuated on one side. The interproximal crater consists of a lesion that radiographically can be observed as a two-walled, trough-like depression. This loss has a band-like or irregular appearance in the interdental region between adjacent teeth [28]. Figure 1 shows examples of these three types of bone loss.

Convolutional Neural Networks for PBL Identification
The last few years have seen an intensification of machine learning methods being used to support diagnosis in several medical conditions. Moreover, previous works demonstrated the feasibility of using neural networks to identify or classify periodontal diseases in radiographs [29][30][31][32][33][34]. Convolutional neural networks (CNNs) were applied for alveolar bone loss identification and measurement [30], and for the identification and severity assessment of compromised premolars and molars [31]. Recent works also used CNNs for the classification of periapical lesions, considering their extent [32,33]. Additionally, CNNs were used to detect apical lesions on panoramic dental radiographs [34]. The seven-layer network presented by Ekert et al. [34] achieved a sensitivity value of 0.65, a specificity value of 0.87, a negative predictive value of 0.93, and a positive predictive value of 0.49. CNNs also demonstrated good performance in the detection of PBL on panoramic dental radiographs. A network presented by Krois et al. [8], composed of seven layers, achieved accuracy, sensitivity, and specificity of 0.81 for this problem.
More recently, Moran et al. [29] evaluated two widely used CNN architectures (Inception and ResNet) to classify regions in periapical examinations according to the presence of periodontal bone destruction. The Inception model presented the best results, which were impressive, even considering the small and unbalanced dataset used. The final accuracy, precision, recall, specificity, and negative predictive values were 0.817, 0.762, 0.923, 0.711, and 0.902, respectively. Such results suggest the feasibility of using the CNN model as a clinical decision support tool to assess periodontal bone destruction in periapical exams.

Deep-Learning Resolution Improvement Methods
As previously mentioned, most commercial tools use interpolation methods for resolution improvement. Nevertheless, in the last few years, there has been an expansion of deep-learning methods for this task. As pointed out by Yang et al. [35], super-resolution (the spatial resolution improvement problem) solutions can be categorized based on the tasks they focus on, i.e., the specific classes of images they target. In this work, we focused on medical imaging, but in order to select the solutions to be included in our evaluation, we considered the algorithms' performances on benchmarks established in the literature.
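To make the baseline interpolation methods concrete, the sketch below implements nearest-neighbor and bilinear upscaling for a grayscale image in NumPy. This is an illustrative simplification (integer magnification factors, align-corners coordinate mapping), not the implementation evaluated in this work:

```python
import numpy as np

def upscale_nearest(img, factor):
    """Nearest-neighbor upscaling: each output pixel copies the closest input pixel."""
    h, w = img.shape
    rows = np.arange(h * factor) // factor
    cols = np.arange(w * factor) // factor
    return img[np.ix_(rows, cols)]

def upscale_bilinear(img, factor):
    """Bilinear upscaling: each output pixel is a distance-weighted average
    of its four nearest input pixels."""
    h, w = img.shape
    out_h, out_w = h * factor, w * factor
    # Map output coordinates back into the input grid (align-corners style).
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    tl = img[np.ix_(y0, x0)]; tr = img[np.ix_(y0, x1)]
    bl = img[np.ix_(y1, x0)]; br = img[np.ix_(y1, x1)]
    top = tl * (1 - wx) + tr * wx
    bottom = bl * (1 - wx) + br * wx
    return top * (1 - wy) + bottom * wy
```

Bicubic and Lanczos upscaling follow the same resampling scheme but use wider cubic and windowed-sinc kernels, respectively, which is why they tend to preserve edges better at the cost of some ringing.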

Deep-learning-based super-resolution methods have been widely used in medical imaging applications [21][22][23][24][26,36,37]. Zhang and An [23] proposed a deep-learning solution formed by two convolutional layers preceded by a fixed bicubic interpolation; the transfer learning technique was also considered. In that work, the authors applied the proposed method to different types of medical images, such as knee magnetic resonance images (MRI), mammography, and angiography. Shi et al. [36] proposed a residual-learning-based algorithm for MRI. The method proposed by Zeng et al. [22] also focuses on magnetic resonance images: it is a convolutional neural network that performs two types of super-resolution reconstruction (single and multi-contrast) at the same time. Park et al. [26] presented a super-resolution solution for computed tomography images, consisting of a deep convolutional neural network based mainly on the U-Net architecture. Zhao et al. [37] proposed SMORE, a deep-learning solution for improving the visualization of brain lesions in fluid-attenuated inversion recovery images. Resolution improvement methods based on deep learning have also been applied to oral radiographs. Hatvani et al. [24] proposed a super-resolution method for enhancing dental cone-beam computerized tomography; that method is based on tensor factorization and provides a two-times magnification increase. Concerning deep-learning methods for resolution improvement, two algorithms have achieved impressive results in several applications, including in medical imaging: the super-resolution convolutional neural network (SRCNN) [38] and the super-resolution generative adversarial network (SRGAN) [39]. These two solutions are state-of-the-art methods that have achieved the best results for super-resolution in the literature on benchmark datasets.
The SRCNN was initially proposed by Dong et al. [38] in 2016 and became the state-of-the-art method for super-resolution (considering a 2× magnification factor) on the BSD200 benchmark dataset [40]; due to this outstanding performance, it was included in our evaluation. The SRCNN (Figure 2) requires pre-processing of the inputs before the network handles them, which involves the application of bicubic interpolation to obtain an initial image of the desired resolution. Then, the deep neural network processes the pre-processed images. The network operation is divided into three main steps: (1) patch extraction and representation; (2) nonlinear mapping; (3) reconstruction. SRCNNs have been used in several resolution improvement tasks. Umehara, Ota, and Ishida [25] proposed a scheme for resolution improvement in chest CT images based on the SRCNN. The method proposed by Qiu et al. [41] is an SRCNN-based reconstruction solution for knee MRIs, formed by three SRCNN hidden layers and a sub-pixel convolution layer.
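The three-step SRCNN pipeline can be sketched as a chain of three convolutions applied to an already bicubic-interpolated input. The sketch below uses toy channel counts and randomly initialized filters purely to illustrate the data flow; the original network uses 64 and 32 feature maps with 9 × 9, 1 × 1, and 5 × 5 filters:

```python
import numpy as np

def conv2d_same(x, kernels, bias, relu=True):
    """'Same'-padded 2D convolution. x: (c_in, H, W); kernels: (c_out, c_in, k, k)."""
    c_out, c_in, k, _ = kernels.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    _, H, W = x.shape
    out = np.zeros((c_out, H, W))
    for o in range(c_out):
        for i in range(c_in):
            for dy in range(k):
                for dx in range(k):
                    out[o] += kernels[o, i, dy, dx] * xp[i, dy:dy + H, dx:dx + W]
        out[o] += bias[o]
    return np.maximum(out, 0) if relu else out

def srcnn_forward(y, w1, b1, w2, b2, w3, b3):
    """SRCNN on an interpolated image y of shape (1, H, W):
    patch extraction/representation -> nonlinear mapping -> reconstruction."""
    f1 = conv2d_same(y, w1, b1)                 # 9x9 filters: patch extraction
    f2 = conv2d_same(f1, w2, b2)                # 1x1 filters: nonlinear mapping
    return conv2d_same(f2, w3, b3, relu=False)  # 5x5 filters: reconstruction
```

Because all three convolutions use 'same' padding, the output keeps the spatial resolution of the interpolated input, so the network only restores detail rather than changing the image size.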
Another popular deep-learning method for resolution improvement is the SRGAN [39]. This solution was proposed in 2017 by Ledig et al. [39] and became the state-of-the-art for super-resolution on the BSD100 and PIRM datasets [40,42], outperforming all the super-resolution methods previously mentioned; the current state-of-the-art for this problem is a variation of the SRGAN. The SRGAN has also demonstrated high performance in a wide range of applications [21,[43][44][45]. Therefore, this solution was also selected for this analysis. The SRGAN follows the main structure of general generative adversarial networks: it is composed of a generative model G and a differentiable discriminator D (Figure 3). The generator network G is trained as a feed-forward CNN parametrized by θ G, where θ G corresponds to the weights and biases of an L-layer network, obtained by optimizing a super-resolution-specific loss [39]. In the training process, G is trained to create images that simulate real images and in that way mislead D, which in turn is trained to distinguish between real images and the images generated by G [39]. SRGANs have also been applied to medical imaging. Liu et al. [45] presented an SRGAN for obtaining high-resolution brain MRI data. Recently, Moran et al. [21] evaluated the use of an SRGAN for obtaining high-resolution periapical radiographs, considering the transfer learning technique.
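The adversarial objective described above can be illustrated with the following loss functions, a minimal NumPy sketch assuming sigmoid discriminator outputs; the 10⁻³ adversarial weight follows the perceptual loss of Ledig et al. [39], while the content term is left abstract:

```python
import numpy as np

def discriminator_loss(d_real, d_fake, eps=1e-12):
    """Standard adversarial loss for D: push D(real) -> 1 and D(G(z)) -> 0."""
    return -np.mean(np.log(d_real + eps) + np.log(1.0 - d_fake + eps))

def generator_loss(content_loss, d_fake, adv_weight=1e-3, eps=1e-12):
    """SRGAN-style perceptual loss: a content term (e.g., pixel-wise or VGG
    feature loss) plus a small adversarial term that rewards G when D is
    fooled, i.e., when D(G(z)) approaches 1."""
    adversarial = -np.mean(np.log(d_fake + eps))
    return content_loss + adv_weight * adversarial
```

In training, the two losses are minimized in alternation: the generator loss decreases as the discriminator is fooled, while the discriminator loss decreases as real and generated images become easier to tell apart.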
Although SRCNN and SRGAN (including their variations) are currently state-of-the-art methods for super-resolution, other algorithms have also shown impressive results. The KK method [46] has also presented good performance on the super-resolution problem. It is defined by the following steps: initial rescaling using bicubic interpolation, followed by high-frequency detail recovery using local patch-based regression. For this last step, the band frequency components are extracted by the Laplacian operation.
The sparse coding (SC) method proposed by Yang et al. [47] consists of defining a sparse representation for each image patch of the low-resolution input to generate the high-resolution output. For that, two dictionaries, D h and D l, are trained for high- and low-resolution image patches, respectively. The main idea is to obtain the same sparse representation for the low- and high-resolution versions of the same patch using D l and D h. Then, this translation scheme, composed of the two dictionaries, can be used to obtain high-resolution images from the sparse representations of low-resolution image patches. Using sparse representations of patches instead of the actual patches reduces the method's computational cost significantly. The anchored neighborhood regression (ANR) method [48] is also based on sparse representations, generalizing them by allowing the approximation of low-resolution input patches by a linear combination of their nearest neighbors. For that, a neighbor embedding should be defined, considering that the patch representations lie on low-dimensional nonlinear manifolds with locally similar geometry.
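The coupled-dictionary idea can be sketched as follows, with plain least squares standing in for the sparse-coding step and toy 2 × 2 / 4 × 4 patch sizes and three atoms (all simplifications relative to the actual SC method, which uses sparse solvers and dictionaries with hundreds of atoms):

```python
import numpy as np

rng = np.random.default_rng(42)

# Coupled dictionaries sharing one code space: D_l decodes low-resolution
# patches and D_h decodes high-resolution patches from the same coefficients.
n_atoms = 3
D_l = rng.standard_normal((4, n_atoms))    # 2x2 low-res patches, flattened
D_h = rng.standard_normal((16, n_atoms))   # 4x4 high-res patches, flattened

def super_resolve_patch(patch_l):
    """Estimate the shared code from the low-res patch, then decode it with
    the high-res dictionary (least squares stands in for sparse coding)."""
    alpha, *_ = np.linalg.lstsq(D_l, patch_l, rcond=None)
    return D_h @ alpha
```

Because both dictionaries are indexed by the same coefficient vector, recovering the code from the low-resolution observation is enough to synthesize the corresponding high-resolution patch.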

Materials and Methods
In order to assess how the use of deep-learning resolution improvement methods impacts the visual quality of periapical images, and consequently the identification of PBL, we performed two different studies. The Research Ethics Committee approved the studies presented here (CAAE, registered at the Brazilian Ministry of Health, 20703019.8.3001.5259). The periapical radiographs used were acquired at the Policlínica Piquet Carneiro of Rio de Janeiro State University, using the long-cone paralleling technique for minimal distortion. For image acquisition, the Sirona Heliodent Plus device (70 kVp, 7 mA, Kavo Brasil Focus) was used, with exposure times from 0.25 to 0.64 s. In addition, the image acquisition used the EXPRESS™ Origo imaging plate system (intraoral imaging plate system, https://www.kavo.com/dental-xray-machines-diagnostics/intraoral-x-ray, archived on 19 February 2021) by KaVo Dental (Biberach an der Riss, Germany). For both acquisition and storage, the Express digital system was used. The digital image format was grayscale JPEG.

Study 1-Qualitative Analysis of Image Quality
In the first study, we aimed to evaluate the perceptual quality of the periapical radiographs treated with the different methods. The main idea of Study 1 was to rank the approaches used to increase the spatial resolution of dental images according to the quality they provide, promoting an easier assessment of PBL. For that, five periapical radiographs were treated with each of the considered approaches (nearest, bilinear, bicubic, Lanczos, SRCNN, and SRGAN, the last two being deep-learning-based methods obtained by Moran et al. [21]). In total, 30 treated images were considered. Then, we asked observers to evaluate the quality of the treated images considering aspects that impacted their visual analysis of PBL, such as edge definition and the presence of artifacts such as blur and aliasing. The observers qualified each of the treated images by assigning scores based on the mean opinion score (MOS) metric [49], a perceptual quality metric that considers a scale ranging from 1 to 4: 1 denotes poor quality; 2, reasonable quality; 3, good quality; and 4, very high quality. This evaluation was performed asynchronously using an online form.
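For reference, the MOS of a method is simply the arithmetic mean of the observers' ratings on the 1-4 scale; a minimal sketch with hypothetical ratings:

```python
def mean_opinion_score(ratings):
    """MOS of one method: the arithmetic mean of the observers' 1-4 scores."""
    assert all(1 <= r <= 4 for r in ratings), "scores must lie on the 1-4 scale"
    return sum(ratings) / len(ratings)

# Hypothetical scores assigned by five observers to one treated image.
example_ratings = [4, 3, 3, 2, 4]
```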
Concerning the observers included in this study, they can be separated into two groups: experts and lay observers. The expert group was formed by experienced dentists: two of them were specialized in oral radiography (experts 1 and 2), and two were specialized in endodontics (experts 4 and 5). The lay group was formed by 17 participants who were neither dentists nor radiologists and can be considered lay in PBL and oral radiography assessments. Eight of them had previous contact with concepts related to medical images (radiographs and/or other medical images). Additionally, twelve of them had contact with concepts related to image processing techniques. The main idea of including laypeople in the study was to observe whether the quality trends indicated by lay observers would agree with the ones indicated by experts.

Study 2-Evaluation of the Impacts of Pre-Processing on Deep-Learning Based Classification
In the last few years, the use of computational algorithms as assistive tools in the diagnosis of several oral diseases has increased substantially. The applications in this scope cover, among other tasks, the classification of oral images according to the presence or absence of a certain lesion. In that way, Study 2 aimed to compare the considered super-resolution methods as pre-processing steps for deep neural networks, considering the task of classifying regions of interest in periapical radiographs according to the presence of PBL. The main idea of this study was to evaluate the impacts of this pre-processing on the classification performance of such networks.
The process to obtain these regions of interest, which were classified by the CNNs, was defined by Moran et al. [29] and includes the following steps: pre-processing of the periapical examinations using histogram equalization; manual extraction of regions of interest (interproximal areas between two teeth, limited at the top by the enamel-cement junction and at the bottom by the alveolar crests); and labeling of the regions of interest by experts (experienced dentists, one of them a specialist in oral radiology; no differences existed between their annotations) considering the presence or absence of interproximal PBL.
After obtaining the images of the regions of interest, they were split into three sets: training, validation, and test sets. For the test set, we obtained 52 images of each class (with and without PBL), resulting in 104 regions. The remaining images were subjected to two different data augmentation processes in order to increase the dataset's size and at the same time reduce the differences in the numbers of samples for the classes. For the PBL class, the data augmentation consisted of horizontal flips. For the healthy class, it consisted of horizontal and vertical flips. Consequently, for the training and validation sets, we obtained 1278 images of regions with PBL and 1344 images of healthy regions. The training-validation ratio was 80:20.
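The class-dependent flip augmentation described above can be sketched in NumPy as follows. This is an illustrative version under stated assumptions: only the two single flips are generated for the healthy class, since the source does not specify whether the combined horizontal-plus-vertical flip was also used:

```python
import numpy as np

def augment(region, with_pbl):
    """Return the original ROI plus its flipped copies: a horizontal flip for
    the PBL class, and horizontal and vertical flips for the healthy class."""
    out = [region, np.fliplr(region)]   # original + horizontal flip
    if not with_pbl:
        out.append(np.flipud(region))   # healthy class also gets a vertical flip
    return out
```

Flips are a natural choice here because a mirrored interproximal region remains an anatomically plausible radiograph, so the labels are preserved.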
The classification networks considered in this study were ResNet [50] and Inception [18]. The input images for ResNet models must be 224 × 224 pixels, as defined in [50]; the input resolution for Inception is 299 × 299 [18]. Nevertheless, the images of the regions of interest obtained in the previous steps presented spatial resolutions lower than these, so rescaling was required before they could be used as inputs to the classifiers. We resized all the images to the appropriate resolutions (224 × 224 for ResNet inputs and 299 × 299 for Inception inputs) in order to prepare the data to be processed by the CNN classifiers. For that, the images of the regions of interest (across the training, validation, and test sets) were treated with each one of the considered resolution improvement methods.
The images obtained for each method were used to train different models, resulting in twelve different models. Six of them were ResNet models, ResNet Nearest, ResNet Bilinear, ResNet Bicubic, ResNet Lanczos, ResNet SRCNN, and ResNet SRGAN, which correspond to the ResNet models trained exclusively with the images treated with the nearest, bilinear, bicubic, Lanczos, SRCNN, and SRGAN methods, respectively. Similarly, the other six models were the Inception models trained exclusively with these same data: Inception Nearest, Inception Bilinear, Inception Bicubic, Inception Lanczos, Inception SRCNN, and Inception SRGAN. Figure 4 shows the whole process for the ResNet models, from the original periapical images to the final trained models to be compared. By analyzing the performances of such classification networks, one can infer which super-resolution method is the best pre-processing rescaling step for these types of classifiers in the defined task.
The classifiers' training processes were performed using the backpropagation algorithm [51] and included 180 epochs. Transfer learning has been demonstrated to improve the performance of deep-learning methods in classification tasks, so we applied it in the training of our classifiers. All models were initialized with weights obtained by a fine-tuning process on the ImageNet dataset [52] in order to achieve better initial weight values and eliminate the impact of random weight initialization, which could interfere with a classifier's performance. The training and testing processes were executed on a desktop machine with the following configuration: Intel® Xeon® 2.30 GHz CPU by Intel (Mountain View, USA), Tesla P100-PCIE-16GB GPU by Nvidia (Santa Clara, USA), and 13 GB of RAM.

Study 1
Table 1 shows the general and per-observer MOS values based on the answers to: "Evaluate the quality of the radiograph, considering the general quality of the image, the visibility of the anatomical structures, the definition of the limits between the structures, and the presence of artifacts such as blur and aliasing." Table 1. Observers' ratings and respective mean opinion scores (MOS) for the visual grading characteristics (VGC) analysis.

To compare the resolution improvement methods, we used the visual grading characteristics (VGC) curve [27], based on the MOS values. As described by Bath [27], VGC analysis is an evaluation of an image's subjective characteristics, in which the observer assigns scores using a multi-step rating scale (in this case, MOS) in order to state opinions about defined quality criteria. Moreover, it compares the cumulative distributions of the scores of two different methods, providing a general comparison of them concerning a quality criterion. Given a set of images treated with different methods and their respective scores assigned by the observers, it is possible to define a probability distribution of the images from each method. For instance, considering that expert 1 assigned the score 1 to three images treated with the bicubic method (see Table 1, line 4, column 4), the probability of this observer assigning score 1 to an image treated with the bicubic method is 3/5 = 0.6 (i.e., 60%), since three of the five bicubic images were classified as "poor quality." Similarly, the probability of this same observer (expert 1) assigning "reasonable quality" (score 2) to an image treated with this same method is 2/5 = 0.4, i.e., 40%. In this way, the probability distribution of the bicubic method for observer 1 can be defined as P O 1 (x), where x ∈ {4, 3, 2, 1} corresponds to the MOS scores.
Consequently, the cumulative distribution referent to the bicubic method for observer 1 is defined by the function C O 1 (x), obtained by accumulating the probabilities P O 1 (x) over the MOS scores. Using this same idea, we can define the probability and cumulative distributions of any method M A for observer n, and the VGC curves were plotted from these cumulative distributions (in the plots, each color represents one observer or group: general VGC curves in maroon, VGC curves for all experts in light blue, curves for lay participants in dark blue, expert 1 in medium blue, expert 2 in red, expert 3 in yellow, expert 4 in green, and expert 5 in orange). Table 2 shows the AUC values of the general VGC curves (obtained considering the scores given by all expert observers and lay participants), the experts' VGC curves (obtained considering the scores given by all expert observers), and the lay participants' VGC curves (obtained considering the scores given by all lay participants). These results denote the superiority of SRGAN. To evaluate the differences between pairs of methods M A and M B, we applied the nonparametric paired Wilcoxon test [53]. This test is mainly applied to analyze data that present unknown distributions. We considered a 99% confidence level, and the alternative hypothesis H A was "the MOS scores of M B are higher than those of M A." For that, we considered all scores given by all observers. The Wilcoxon p-value achieved for the nearest-bilinear pair was 0.001; i.e., the p-value for the H A "bilinear MOS values are higher than nearest MOS values" was < 0.01, supporting H A.
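The VGC comparison described above can be sketched as follows, an illustrative implementation in which each curve point pairs the two methods' cumulative score proportions and an AUC above 0.5 indicates that the second method received better ratings:

```python
import numpy as np

def vgc_auc(scores_a, scores_b, levels=(4, 3, 2, 1)):
    """Area under the VGC curve built from two methods' MOS scores.
    Each curve point pairs the proportions of scores >= a given level;
    an AUC of 0.5 means the two methods were rated equally."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    xs, ys = [0.0], [0.0]
    for level in levels:
        xs.append(float(np.mean(a >= level)))
        ys.append(float(np.mean(b >= level)))
    # Trapezoidal rule over the piecewise-linear curve.
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
               for i in range(len(xs) - 1))
```

When the two score distributions are identical, every curve point lies on the diagonal and the AUC is exactly 0.5, which is why 0.5 serves as the "no quality difference" reference.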
The p-values for the nearest-bicubic, nearest-Lanczos, nearest-SRCNN, nearest-SRGAN, bilinear-bicubic, bilinear-Lanczos, bilinear-SRCNN, bilinear-SRGAN, bicubic-Lanczos, bicubic-SRCNN, bicubic-SRGAN, Lanczos-SRCNN, Lanczos-SRGAN, and SRCNN-SRGAN pairs were 8.191 × 10⁻⁹, 1.751 × 10⁻¹¹, 5.175 × 10⁻¹⁴, 7.996 × 10⁻¹⁵, 2.695 × 10⁻⁵, 2.406 × 10⁻⁹, 9.398 × 10⁻¹³, 2.485 × 10⁻¹², 3.795 × 10⁻⁵, 1.209 × 10⁻⁷, 2.634 × 10⁻⁹, 0.007, 1.825 × 10⁻⁶, and 0.002, respectively, all of which also support the corresponding hypotheses.
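For illustration, the signed-rank statistic underlying these tests can be computed as follows; this is a minimal sketch, since deriving a p-value additionally requires the exact null distribution or its normal approximation (in practice a library routine such as scipy.stats.wilcoxon is used):

```python
import numpy as np

def wilcoxon_w(sample_a, sample_b):
    """Paired Wilcoxon signed-rank statistics: rank the absolute differences
    (zeros dropped, ties given average ranks) and sum the ranks of the
    positive and negative differences separately."""
    d = np.asarray(sample_b, dtype=float) - np.asarray(sample_a, dtype=float)
    d = d[d != 0]
    abs_d = np.abs(d)
    ranks = np.empty(len(d))
    ranks[np.argsort(abs_d)] = np.arange(1, len(d) + 1)
    for v in np.unique(abs_d):          # average ranks over tied values
        ranks[abs_d == v] = ranks[abs_d == v].mean()
    return ranks[d > 0].sum(), ranks[d < 0].sum()
```

A one-sided alternative such as "M B scores are higher than M A scores" is supported when the positive-rank sum is large relative to its null distribution.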
In addition to the subjective evaluation provided by the MOS scores, we also included the following objective measures: peak signal-to-noise ratio (PSNR), mean square error (MSE), and structural similarity index (SSIM). For that, we downscaled high-resolution images from a 720 × 720 spatial resolution to a 128 × 128 spatial resolution and then applied each one of the considered methods to obtain new images with a 720 × 720 resolution. Then, the similarity between the original high-resolution images and the images obtained by each method was assessed using these measures. Table 3 shows the MSE, PSNR, and SSIM values for the images treated with each super-resolution method, where it is possible to observe that SRGAN provides high-resolution images very close to the original ones, since it presents the lowest MSE, the highest PSNR, and the SSIM closest to 1. Table 3. Average and standard deviation of mean square error (MSE), peak signal-to-noise ratio (PSNR), and structural similarity index (SSIM) computed for each method.
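MSE and PSNR can be computed as follows; this is an illustrative sketch, with SSIM omitted for brevity since it additionally involves local luminance, contrast, and structure terms:

```python
import numpy as np

def mse(reference, test):
    """Mean square error between two equally sized grayscale images."""
    diff = np.asarray(reference, dtype=float) - np.asarray(test, dtype=float)
    return float(np.mean(diff ** 2))

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    err = mse(reference, test)
    return float("inf") if err == 0 else 10.0 * float(np.log10(peak ** 2 / err))
```

Note the inverse relationship between the two: the lowest MSE always yields the highest PSNR, which is why SRGAN leads on both measures simultaneously.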

Study 2
The confusion matrices for all the ResNet and Inception models are presented in Tables 4 and 5. Tables 6 and 7 show the values achieved for the other metrics considered: sensitivity (recall), specificity, precision (positive predictive value, PPV), and negative predictive value (NPV) [54], all computed from the true/false positive and negative counts of the confusion matrices. Table 8 shows the proportion of correctly classified examples (test accuracy) for each method. Figure 6 shows the receiver operating characteristic (ROC) and precision-recall (PR) curves for each model. Figure 6a shows the ROC curves for ResNet Nearest, ResNet Bilinear, ResNet Bicubic, ResNet Lanczos, ResNet SRCNN, and ResNet SRGAN, and Figure 6b shows their PR curves. Similarly, Figure 6c shows the ROC curves for Inception Nearest, Inception Bilinear, Inception Bicubic, Inception Lanczos, Inception SRCNN, and Inception SRGAN, and Figure 6d shows their PR curves.
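These metrics are derived from the confusion-matrix counts as follows (an illustrative sketch with hypothetical counts; PBL is taken as the positive class):

```python
def classification_metrics(tp, fp, fn, tn):
    """Sensitivity (recall), specificity, precision (PPV), and NPV from the
    confusion-matrix counts of a binary classifier, with PBL as positive."""
    return {
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "precision": tp / (tp + fp),     # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }
```

These four quantities make the class-wise biases discussed below visible: a model with many false negatives shows low sensitivity and NPV, while one with many false positives shows low precision and specificity.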

General Discussion
Relative to Study 1, for expert 3, Lanczos's quality was very high and very close to SRGAN's, which is also denoted by the corresponding VGC curve and its AUC. For expert 4, the nearest method had better results, despite its intense aliasing effect. For expert 5, all interpolation methods and the SRCNN had very similar and reasonable performances, while the SRGAN was superior. Except for expert 2, the experts considered the nearest method equal to or better than the bilinear method. This suggests that the aliasing effect of the nearest method tends to be better tolerated than the blur effect of the bilinear method. For experts 1 and 2, the results' quality progressively improved across the bilinear, bicubic, Lanczos, SRCNN, and SRGAN methods, as expected. This superiority was also confirmed by the p-values achieved in the Wilcoxon tests. The same trend was observed in a general way, considering the scores of all observers. On the one hand, experts considered SRGAN the best method and found the effects of the bilinear, bicubic, and nearest methods relatively similar. Some of the experts even gave the same scores to those three methods, which can also be seen in the VGC curves that overlap for some pairs. Note that the results for experts 4 and 5 differ substantially from those of the other experts, especially for the bilinear and bicubic methods. This difference can mostly be related to the fact that they are specialized in endodontics, not in oral radiology, so their perception of visual quality regarding PBL assessments can differ. Experts 4 and 5 may have tolerated image flaws that were not tolerated by the dentists specialized in oral radiology. Additionally, due to their extensive experience in endodontics and PBL assessment, they may have been able to detect PBL even in blurred images.
On the other hand, the two deep-learning methods performed very similarly for laypeople, with SRGAN being slightly better. For the interpolation methods, quality improved in the expected order: nearest, bilinear, bicubic, and Lanczos.
These visual quality differences are exemplified in Figure 7, in which it is possible to see how the deep-learning methods increase the edges' definition in the interproximal area.
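As a toy illustration of why the nearest and bilinear artifacts differ (blocky aliasing versus edge blur), the following pure-Python sketch upscales a hard two-pixel edge with both schemes; the helper names are illustrative, not code from this work:

```python
def upscale_nearest(img, factor):
    """Nearest-neighbour upscaling: repeats pixels, producing blocky (aliased) edges."""
    h, w = len(img), len(img[0])
    return [[img[y // factor][x // factor]
             for x in range(w * factor)] for y in range(h * factor)]

def upscale_bilinear(img, factor):
    """Bilinear upscaling: blends the four neighbours, producing smooth (blurred) edges."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * factor):
        sy = min(y / factor, h - 1)
        y0, y1 = int(sy), min(int(sy) + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(x / factor, w - 1)
            x0, x1 = int(sx), min(int(sx) + 1, w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bot * fy)
        out.append(row)
    return out

# A hard vertical edge: nearest keeps it sharp but blocky; bilinear smooths it.
edge = [[0, 255],
        [0, 255]]
print(upscale_nearest(edge, 2)[0])   # [0, 0, 255, 255]
print(upscale_bilinear(edge, 2)[0])  # intermediate grey appears at the edge
```

Bicubic and Lanczos follow the same resampling scheme with wider kernels, trading some ringing for sharper edges, which is consistent with the quality ordering observed above.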
It is important to emphasize that MOS is a qualitative evaluation and therefore a subjective, highly observer-dependent metric. Even though the number of experts in this study is low, the general curves based on all experts' scores minimize these observer-dependent factors and provide an overview of the perceptual quality. Additionally, the complementary analysis with lay observers involved a much larger number of participants and provided more robust metrics, evidencing more general trends in the quality of the different methods.
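The MOS itself is simply the arithmetic mean of the observers' ordinal scores per method, and its spread across observers quantifies the observer dependence mentioned above; a minimal sketch with made-up scores (the values are illustrative, not the study's data):

```python
from statistics import mean, stdev

# Hypothetical 1-5 quality scores from five observers for two methods
scores = {
    "bicubic": [3, 4, 3, 2, 4],
    "SRGAN":   [5, 4, 5, 4, 5],
}

for method, s in scores.items():
    # MOS is the mean score; the standard deviation shows observer disagreement
    print(f"{method}: MOS = {mean(s):.2f} +/- {stdev(s):.2f}")
```

Averaging over more observers (as in the lay-observer analysis) shrinks the uncertainty of the MOS estimate, which is why the larger sample yields more robust trends.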
Regarding Study 2, at first sight, the classifiers' performances appear generally similar, as denoted by the ROC curves (Figure 6) and the overall accuracy, except for InceptionBilinear and InceptionBicubic, which had higher accuracy than the other methods. However, in contrast with the results of Study 1, the Study 2 results for the ResNet models suggest that the use of SRGAN may actually harm classification. The best overall accuracy was obtained with SRCNN (Table 8). For the ResNet models, Lanczos had worse accuracy than the bilinear and bicubic interpolation methods. Nevertheless, the bicubic interpolation led to more balanced results, in that its performance was similar for both classes (Table 4). With SRGAN, the accuracy for the healthy class was substantially higher than with the other methods, which is reflected in the high precision and specificity values (Table 6). However, its low accuracy for the PBL class resulted in many false negatives, which is also denoted by the recall and NPV values. Conversely, SRCNN presented the best accuracy for the PBL class, leading to high recall and NPV values, but its high number of false positives led to low precision and specificity. Similar phenomena occurred for the nearest, bilinear, bicubic, and Lanczos methods. Thus, the ResNetNearest, ResNetBilinear, ResNetLanczos, ResNetSRCNN, and ResNetSRGAN models appeared biased toward either the PBL or the healthy class. This might have been caused by the tendency of these methods to add certain artifacts to this kind of image: blur for bilinear, Lanczos, and SRCNN, and aliasing for nearest and SRGAN [9].
For the Inception models, the use of the deep-learning methods led to worse performance than the bicubic method on all evaluated metrics (Table 7), which suggests that the mentioned artifacts negatively impact the patterns used by models of this architecture. Concerning the interpolation methods, performance varied largely according to the class considered, denoting a high bias in such Inception models (Tables 5 and 7).
The results of Study 2 also suggest that advanced CNNs can handle blurred images during the training process, such that pattern recognition is not drastically affected by this kind of artifact. Moreover, the methods used as pre-processing steps for CNNs should consider factors that impact the pattern recognition algorithms rather than factors that impact human visual perception, since deep-learning algorithms classify differently from human experts. Additionally, developing deep-learning algorithms that directly handle low-resolution inputs could be beneficial.
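One simple way to pursue that last direction is to expose the classifier to degraded inputs during training, e.g., by randomly downscaling and re-upscaling a fraction of the training images. A framework-agnostic sketch of such an augmentation (the function names are hypothetical, and real pipelines would apply this inside the training data loader):

```python
import random

def degrade_resolution(img, factor):
    """Downscale by `factor` (pixel skipping) then upscale back (pixel repetition),
    simulating a low-resolution acquisition of the same scene."""
    small = [row[::factor] for row in img[::factor]]
    h, w = len(img), len(img[0])
    return [[small[y // factor][x // factor] for x in range(w)] for y in range(h)]

def resolution_augment(img, p=0.5, factors=(2, 4), rng=random):
    """With probability p, feed the model a degraded copy instead of the original."""
    if rng.random() < p:
        return degrade_resolution(img, rng.choice(factors))
    return img

# 4x4 gradient image: after degradation, 2x2 blocks share one value
img = [[x + 10 * y for x in range(4)] for y in range(4)]
print(degrade_resolution(img, 2))
```

Training with such copies encourages the network to rely on patterns that survive resolution loss, rather than requiring a super-resolution pre-processing step at inference time.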
Even though Study 2 focused on a classification task, a pre-processing step that improves the spatial resolution of input images is of interest for a wide range of automatic applications; for instance, object detection and segmentation could employ such pre-processing. For segmentation in particular, edge and detail definition are essential. One example application is the segmentation and detection of anatomical structures in CT (or CTA) scans of aortic dissections [55]. In such an application, the enhancement potentially provided by the super-resolution methods may reveal more details of the aortic wall and layers, especially in the primary entry tear area, which is required in that context.
Concerning the results of both studies, there is evidence that the deep-learning methods (especially SRGAN) improve the perceptible visual quality of the images in aspects related to PBL identification, such as contrast and edge definition. On the other hand, their application as a pre-processing step for CNN classifiers did not substantially improve overall performance, although it may help identify each class more precisely, depending on the method used and the classifier considered.

Conclusions
In this work, we evaluated how resolution improvement methods influence the assessment of periodontal bone loss. For that, we proposed two different studies, focusing on human and computer-based analysis of PBL, respectively. The results of Study 1 (MOS scores and VGC curves) demonstrated that both deep-learning methods, especially SRGAN, generate high-resolution images with high visual quality in aspects that influence PBL assessment, facilitating diagnosis. The interpolation methods' performances varied considerably, but the expected trend was observed in the general evaluation (considering all participants). On the other hand, the deep-learning methods did not substantially improve the CNN classifiers' performances, suggesting that they may add artifacts that affect the texture patterns the CNNs use to discriminate the sample groups. We highlight that one of the main limitations of this work was the low number of dentists participating in Study 1.
In future works, we aim to extend this analysis to evaluate the impacts of the deep-learning resolution improvement methods on other computer-based tasks, such as segmentation and object detection.

Informed Consent Statement: Patient consent was waived due to the retrospective nature of the analysis based on existing data.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy and ethical restrictions.