Using a Deep Learning Model to Address Interobserver Variability in the Evaluation of Ulcerative Colitis (UC) Severity

The use of endoscopic images for the accurate assessment of ulcerative colitis (UC) severity is crucial to determining appropriate treatment. However, experts may interpret these images differently, leading to inconsistent diagnoses. This study aims to address the issue by introducing a standardization method based on deep learning. We collected 254 rectal endoscopic images from 115 patients with UC, and five experts in endoscopic image interpretation assigned classification labels based on the Ulcerative Colitis Endoscopic Index of Severity (UCEIS) scoring system. Interobserver variance analysis of the five experts yielded an intraclass correlation coefficient of 0.8431 for UCEIS scores and a kappa coefficient of 0.4916 when the UCEIS scores were transformed into UC severity measures. To establish a consensus, we created a model that considered only the images and labels on which more than half of the experts agreed. This consensus model achieved an accuracy of 0.94 when tested with 50 images. Compared with models trained from individual expert labels, the consensus model demonstrated the most reliable prediction results.


Introduction
Ulcerative colitis (UC) is an idiopathic, chronic inflammatory disease of the colonic mucosa, usually beginning in the rectum and extending proximally through all or part of the colon [1]. With alternating periods of exacerbation and remission, the clinical course is unpredictable. Diagnosis of UC is based on clinical symptoms and confirmed by objective findings from endoscopic and histological examinations. Endoscopic evidence of UC includes persistent colonic inflammation, with confirmatory biopsy specimens indicating chronic colitis [2]. Endoscopy plays an important role in managing patients with UC, allowing clinicians to visualize and assess disease severity. Consequently, the objective assessment provided by endoscopy is important to the optimal management of patients with UC.
The most widely used scoring systems for assessing endoscopic disease activity in UC are the Mayo endoscopic subscore and the Ulcerative Colitis Endoscopic Index of Severity (UCEIS) [3]. The UCEIS, which was developed using a linear mixed regression model, assesses the extent of endoscopic severity based on three descriptors: vascular pattern (normal, 1; patchy obliteration, 2; obliterated, 3), bleeding (none, 1; mucosal, 2; luminal mild, 3; luminal moderate or severe, 4), and erosions and ulcers (none, 1; erosions, 2; superficial ulcer, 3; deep ulcer, 4) [4]. Correctly grading colonoscopies with the UCEIS is challenging, and even experienced, well-trained experts show interobserver variability. As a result, researchers have been working to develop deep learning systems for the consistent and objective analysis of UC endoscopic images based on artificial intelligence (AI) [5][6][7][8][9][10].
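The three descriptors above combine into a single score. As a minimal sketch, assuming the overall UCEIS is the sum of the three descriptor levels rescaled to start at zero (which yields the 0-8 range used later in this paper; the rescaling is our assumption, not stated explicitly here):

```python
# Hypothetical helper combining the three UCEIS descriptors into one score.
# Descriptor levels follow the text (vascular pattern 1-3, bleeding 1-4,
# erosions/ulcers 1-4); subtracting 1 from each level yields the 0-8 UCEIS
# range referenced elsewhere in this paper. The rescaling is our assumption.

DESCRIPTOR_LEVELS = {
    "vascular_pattern": 3,  # normal; patchy obliteration; obliterated
    "bleeding": 4,          # none; mucosal; luminal mild; luminal moderate/severe
    "erosions_ulcers": 4,   # none; erosions; superficial ulcer; deep ulcer
}

def uceis_score(vascular_pattern: int, bleeding: int, erosions_ulcers: int) -> int:
    """Return a UCEIS score in 0-8 from 1-based descriptor levels."""
    levels = {
        "vascular_pattern": vascular_pattern,
        "bleeding": bleeding,
        "erosions_ulcers": erosions_ulcers,
    }
    for name, level in levels.items():
        if not 1 <= level <= DESCRIPTOR_LEVELS[name]:
            raise ValueError(f"{name} level {level} out of range")
    return sum(level - 1 for level in levels.values())
```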
Our study used AI for the endoscopic evaluation and diagnosis of patients with UC. AI was first used in 2003 to assess endoscopic severity in UC patients. Sasaki et al. defined the Matts score for grading endoscopic severity using pictorial parameters of mucosal redness from 133 fixed digital colonoscopy images of 55 patients with UC. The degree of mucosal redness was measured as a hemoglobin index through a Bayesian-driven computer-aided detection algorithm, which could differentiate Matts grades based on the kurtosis of the hemoglobin index with high sensitivity and specificity [5]. More recently, Ozawa et al. attempted to detect mucosal remission or activity in UC patients using a computer-aided detection system based on convolutional neural networks (CNNs) and trained on large datasets of endoscopic still images. The system showed a high level of performance, with areas under the receiver operating characteristic curve of 0.86 and 0.98 for identifying Mayo 0 and Mayo 0-1, respectively [6].
In addition, many other authors have conducted UC-related research using AI. Sutton et al. utilized AI to differentiate UC from other intestinal diseases and to assess the severity of UC endoscopic ulcers, achieving an accuracy of 87.50% and an area under the curve of 0.90 with 851 images from UC patients [7]. Takenaka et al. created a deep neural network system that analyzed endoscopic images from 2012 patients with UC (40,758 images in total) and 6885 biopsy outcomes, achieving 90.1% accuracy in identifying endoscopic remission [8]. Yao et al. trialed a fully automated video system for analyzing and grading endoscopic disease in UC; applied to clinical trial videos comprising 51 high-resolution images and 264 tests, the system correctly differentiated between remission and active disease in 83.7% of cases [9]. Gottlieb et al. verified a deep learning algorithm's capability to predict levels of UC severity from full-length endoscopy videos from 249 patients, with areas under the curve ranging from 0.787 to 0.901 for the endoscopic Mayo Score and 0.855 for the UCEIS [10]. Finally, Bossuyt et al. developed an operator-independent computer-based tool to determine UC activity based on endoscopic images from 29 consecutive UC patients and 6 healthy controls; this tool's readings correlated significantly with the Robarts histological index, Mayo Endoscopic Score, and UCEIS [11]. These studies collectively indicate the growing potential of AI in improving the diagnosis, assessment, and management of UC.
Therefore, AI implementation in UC is promising for improving the assessment of disease activity and reducing interobserver variability in grading such activity. In most studies, however, the primary focus has been the binary classification of UC states, differentiating between the inactive and active phases of the disease.
In this study, we developed a model that predicts three stages of UC severity from endoscopic images of patients with UC. Furthermore, to enhance the objectivity and precision of UC diagnosis, we constructed a robust deep learning model that effectively reduces discrepancies between different expert evaluations.

Materials and Methods
In this study, we aimed to statistically analyze experts' scoring differences for patients with UC and to quantitatively assess the impact of this interobserver variability on diagnostic outcomes [12][13][14]. To achieve this, we trained a deep learning network on what we term consensus data, using only images for which expert scoring was consistent. We then compared the performance of this model with that of models trained using each individual expert's scoring when diagnosing test images. Figure 1 shows the flowchart of this study.

Patients and Images
A total of 254 rectal endoscopic images obtained from 115 patients with ulcerative colitis who underwent endoscopy at Ewha Womans University Seoul Hospital between 10 June 2019 and 29 February 2021 were included (Table 1). Patients with Crohn's disease, resection before the colonoscopy date, or other bowel resection were excluded from the study. The study protocol was approved by the Ethics Committee of Ewha Womans University Seoul Hospital (IRB no. 2023-03-028). The severity of ulcerative colitis was classified into three categories: remission/mild, moderate, and severe. These labels were used as input data for the deep learning network models. Figure 2 shows the collected endoscopic images and labels.

The endoscopic images were captured using a CV-290 (Olympus, Tokyo, Japan). They are RGB images with resolutions of 543 × 475 and 1242 × 1079 and an 8-bit color depth.

Scoring System
We introduced a consensus approach in this study. Under this approach, five experts independently assign scores, but a final score is adopted only if at least three experts assign the same score; we term the resulting set the "consensus data". This method minimizes bias from individual experts' judgments, increasing reliability.
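The consensus rule above can be sketched as follows; the function name and the explicit three-of-five threshold parameter are ours:

```python
from collections import Counter

def consensus_label(expert_labels, min_agreement=3):
    """Return the majority label if at least `min_agreement` of the experts
    agree; otherwise return None, and the image is excluded from the
    consensus data. A minimal sketch of the paper's consensus rule."""
    label, count = Counter(expert_labels).most_common(1)[0]
    return label if count >= min_agreement else None

# Three of five experts agree -> the label is adopted.
assert consensus_label(["moderate", "moderate", "moderate", "severe", "remission/mild"]) == "moderate"
# No three-expert majority -> the image is dropped.
assert consensus_label(["remission/mild", "remission/mild", "moderate", "moderate", "severe"]) is None
```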

Deep Learning Network
For image classification, we selected 13 CNN-based deep learning network models (Figure 3) known for their exceptional performance in this field: DenseNet121, MobileNetV2, DenseNet201, InceptionV3, EfficientNetB0, EfficientNetB7, MobileNetV3Large, ResNet152V2, ResNet50, ResNet50V2, VGG19, VGG16, and Xception. These models were chosen for their strong performance and extensive application in prior research. Despite their diverse architectures, all of them use convolutional and pooling layers for feature extraction and dimensionality reduction, making them well suited to the high-level image classification task in our study.
All of these models are TensorFlow implementations initialized with weights pretrained on the ImageNet dataset [15]. This method, known as transfer learning, is common practice in many fields of medical imaging and has proven exceptionally successful [16]. Its principal advantage is that it reuses the pre-learned weights of the lower layers, which often detect generalized features, such as edges or textures, that are universal to many image classification problems. Our study employed end-to-end training and achieved satisfactory results; for cases where this approach is less effective, fine-tuning with the lower layers frozen could be a viable alternative.
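This transfer-learning setup can be sketched in TensorFlow/Keras as follows. EfficientNetB0 is shown as one example of the 13 backbones; the three-class head and input size come from this paper, while the exact head architecture is our assumption:

```python
import tensorflow as tf

def build_classifier(num_classes=3, input_shape=(475, 543, 3), weights="imagenet"):
    """Attach a softmax severity head to an ImageNet-pretrained backbone.

    A sketch of the transfer-learning setup described above; the authors'
    exact head architecture is not specified in the text.
    """
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False,       # drop the 1000-class ImageNet head
        weights=weights,         # "imagenet" = transfer learning
        input_shape=input_shape,
        pooling="avg",           # global average pooling -> feature vector
    )
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(backbone.output)
    model = tf.keras.Model(backbone.input, outputs)
    # The study trained end to end (all layers trainable). For the
    # fine-tuning alternative mentioned above, freeze the backbone instead:
    #     backbone.trainable = False
    return model
```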
Our initial dataset consisted of endoscopic images split into training and test sets at an approximate ratio of 8:2 (Table 2). To further enhance and generalize our deep learning model, we supplemented our dataset with images from the HyperKvasir open dataset, collected at Baerum Hospital between 2008 and 2016, which contains 110,079 images. Of these, 10,662 images are labeled across 23 classes of findings [17], a significant portion of which relate to pathological findings such as Barrett's esophagus, esophagitis, polyps, ulcerative colitis, and hemorrhoids. From its 851 UC images, we carefully selected those with quality endoscopic features. Five experts then assessed these images for severity, yielding a consensus on 267 images. These were combined with our initial 254 images, for a total dataset of 521 images for deep learning training. Model training was conducted for 30 epochs with a batch size of 30. The categorical cross-entropy loss function, commonly used for multiclass classification, was selected, and the Adam optimization algorithm was used. The learning rate was set to 1 × 10⁻⁴, and all images were downscaled to 543 × 475.
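The training configuration above (categorical cross-entropy, Adam at 1 × 10⁻⁴, 30 epochs, batch size 30) might look like the following in Keras. The stand-in model and random data are ours, so the sketch is self-contained; in the study the model is one of the 13 pretrained CNNs and the inputs are the 543 × 475 endoscopic images:

```python
import numpy as np
import tensorflow as tf

# Stand-in model and data so the configuration sketch runs on its own.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8, 8, 3)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation="softmax"),  # three severity classes
])
x = np.random.rand(30, 8, 8, 3).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 3, size=30), 3)  # one-hot labels

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # lr = 1e-4, as in the paper
    loss="categorical_crossentropy",                         # multiclass loss, as in the paper
    metrics=["accuracy"],
)
# The paper trains for 30 epochs; 2 epochs here keep the sketch fast.
history = model.fit(x, y, epochs=2, batch_size=30, verbose=0)
```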
Accuracy, recall, precision, and F1 score were used to comprehensively assess the performance of our deep learning model. These metrics quantitatively evaluate the model's classification performance, capturing its sensitivity (recall) as well as the harmonic mean of precision and recall (F1 score).
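These four metrics can be computed with scikit-learn as below. The labels shown are illustrative placeholders, and macro averaging (weighting each severity class equally) is our assumption, since the paper does not state the averaging scheme:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative predictions over the three severity classes (placeholder data).
y_true = ["remission/mild", "moderate", "severe", "moderate", "remission/mild"]
y_pred = ["remission/mild", "moderate", "moderate", "moderate", "remission/mild"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging treats each class equally; zero_division=0 handles classes
# the model never predicts (e.g., "severe" above).
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
```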

Data Preprocessing
Endoscopic images inherently contain artifacts such as reflections caused by the light source and dark regions the light does not reach. These artifacts interfere with deep learning training, and controlling them is crucial to the model's accuracy and reliability [2,3]. To eliminate them, we first attempted to remove the areas corresponding to light reflection in the RGB channels. However, we could not accurately isolate only the reflective areas: parts of ulcer or erosion regions were eliminated along with the reflections, making precise detection of light reflection in RGB difficult (Figure 4).
In the data preprocessing stage, the color space of the images was therefore converted from RGB to HSV to eliminate reflections and dark areas (Figure 5) [18,19]. The ranges used to detect reflections and dark areas were (0, 360) for H, (90, 255) for S, and (65, 236) for V.
With this method, the identified regions were converted into binary mask images, which were then multiplied with the original RGB endoscopic images. Subsequently, the empty spaces were filled using an inpainting technique [20].
To enhance the generalization performance of the model, the following data augmentation techniques were applied: rotation range (360 degrees), zoom range (15%), width shift range (20%), height shift range (20%), shear range (15%), horizontal flipping, and fill mode ("reflect").

Interobserver Variation
We employed the intraclass correlation coefficient (ICC) to assess the agreement among the UCEIS scores (ranging from 0 to 8) assigned by the experts (Table 3). The ICC measures the level of agreement among observations as the proportion of the total variance attributable to differences between the rated subjects (here, the images) rather than between observers. Using this approach, we evaluated the consistency of the scores assigned by multiple experts to the same images.
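As an illustration, a two-way random-effects, absolute-agreement ICC(2,1) can be computed directly from the images × raters score matrix; ICC(2,1) is a common choice for this design, but the paper does not specify which ICC variant it used:

```python
import numpy as np

def icc2_1(ratings: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is an (images x raters) array of UCEIS scores. A standard
    textbook formulation, offered as a sketch of the analysis.
    """
    n, k = ratings.shape
    grand = ratings.mean()
    row_means = ratings.mean(axis=1)  # per-image means
    col_means = ratings.mean(axis=0)  # per-rater means
    ss_total = ((ratings - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-image variation
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-rater variation
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
```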

ICC        Level of Agreement
0.9-1.0    Excellent
0.75-0.9   Good
0.5-0.75   Moderate
<0.5       Poor

We used Fleiss' kappa to evaluate agreement in the severity classifications derived from the experts' UCEIS scores (Table 4). In this context, the labels were classified into four categories: remission, mild, moderate, or severe. Fleiss' kappa is a suitable statistical method for measuring agreement among evaluators assessing categorical data.

UCEIS Score Estimation
To measure interobserver variance, we closely examined the UCEIS scores and severities provided by each expert. The UCEIS scores and severities evaluated by the five experts, together with the consensus data, are shown in Figure 6a,b. When we examined the UCEIS scores given by each expert, we observed a maximum difference of 63 images between experts. Of the 254 images, we constructed a consensus dataset using the 220 images that were scored identically by more than half of the experts. Similarly, for severity, we found a maximum difference of 79 images; under the consensus criteria for severity, all 254 images met the consensus conditions.

Statistical Analysis of Interobserver Variance
The interobserver variance among the five experts was first assessed based on UCEIS scores. The ICC, a metric of interobserver consistency, was 0.8431, indicating good agreement among the observers' assessments [21].
Subsequently, the UCEIS scores were transformed into measures of UC severity, and interobserver variance was recalculated for the same set of experts. The resulting kappa coefficient was 0.4916. While this value does not imply perfect agreement among the observers, it denotes moderate agreement, demonstrating a certain degree of uniformity in the classification of UC severity [21].


Outcome of the Deep Learning Network Model
We evaluated the performance of the 13 deep learning network models on the consensus data, which is crucial for ensuring reliability in endoscopic image classification. Table 5 presents the results, including accuracy, F1 score, recall, and precision, with the models listed in descending order of accuracy. EfficientNetB0 exhibited the highest overall performance: with an accuracy of 79.20% and a balanced F1 score of 81.25%, it demonstrated robust capabilities in the classification tasks. The confusion matrix for the EfficientNetB0 model is shown in Figure 7.

Differences in Accuracy for Each Model
We also developed six deep learning network models, trained with labels from each of the five experts as well as with the consensus data, without using the HyperKvasir dataset. Table 6 shows the number of pass/fail images when labels were predicted for 50 test images. The networks trained on expert A's labels and on the consensus data predicted the most accurately.

Discussion
This study aimed to measure and quantify the differences in interpretation among experts analyzing endoscopic images of patients with UC. A deep learning model was developed using images and labels agreed upon by more than half of the participating experts. The proposed consensus approach can effectively reduce variation in expert opinions while preserving the diagnostic patterns specific to each institution.
Interobserver variance in both UCEIS scores and UC severity offers important insights into the complexities of medical diagnosis and evaluation. While the ICC for UCEIS scores was 0.8431, indicating a good level of agreement among experts, the kappa coefficient for UC severity was 0.4916, signifying only moderate consensus. It is important to recognize that clinical decisions, especially therapeutic choices, are grounded primarily in the assessment of UC severity rather than in UCEIS scores alone. This underscores the significance of the observed interobserver variation and points to the existence of such variance in real-world clinical settings. The difference between these evaluations emphasizes the need for a deep learning approach to assessing severity, ensuring consistency in patient care and treatment decisions.
The implementation of our deep learning model offers a transformative approach to diagnostic procedures in gastroenterology, particularly those related to UC. By providing a standardized diagnostic guide, it serves as a valuable asset not only for fellows and beginners entering clinical practice but also for seasoned practitioners handling many cases daily. This standardization is especially significant given the inherent subjectivity and variability in interpreting endoscopic images. As individual reading tendencies and biases may evolve with experience, the model acts as a consistent anchor, mitigating the risk of divergent interpretations. Furthermore, the model's adaptability ensures it remains relevant and up to date, reflecting advances in understanding and shifts in diagnostic criteria.
A comparison with previous studies on the evaluation of UC severity using deep learning is presented in Table 7. One notable feature of our study is its handling of artifacts. Using the HSV-based method, we effectively eliminated light reflections, and further adjustments to the HSV range made it possible to detect even the dark areas that are not directly illuminated. However, when we applied inpainting techniques to these dark areas, we experienced a significant loss of crucial endoscopic information; we therefore optimized the HSV range to capture the dark areas appropriately without compromising data quality. This careful calibration substantially boosted the performance of our deep learning model. Moreover, although the number of images used in our study is relatively small compared with other deep learning studies involving endoscopic images, we recognized that while data volume is important in deep learning, high-quality data are even more critical [22][23][24]. We therefore selectively utilized images of the rectum, because UC symptoms first appear in the rectum, manifesting most prominently there before gradually spreading to other areas.
Additionally, in the labeling process, we used only data that met the consensus criteria to minimize interobserver variance, ensuring a high standard of data quality throughout the study. When labels were consistent across experts, the chosen images were both representative and clear, eliminating potential ambiguities in interpretation. Where data were scarce, we combined our data with the HyperKvasir dataset; incorporating external datasets is a common practice for enhancing model robustness, especially when native datasets do not provide sufficient variability. From the HyperKvasir dataset, we selectively curated good-quality images suitable for endoscopic judgment to form the consensus dataset. This rigorous curation ensured that only the most pertinent and clear images were added to our data pool. Owing to this meticulous data management and comprehensive selection process, we achieved excellent accuracy even with a relatively small dataset [25,26].
Nevertheless, we acknowledge the lack of data for severe cases in our study. This limited number of severe cases affected the overall performance of our model, primarily because there were significantly fewer images of severe cases than of remission/mild or moderate severity. Such imbalances can introduce biases into deep learning models, affecting their generalizability across diverse clinical settings. Recognizing this limitation, we emphasize the need to gather a more extensive collection of image data, particularly of severe conditions, to strengthen the predictive capabilities of our model. To address this gap, we are laying the groundwork for a multicenter study in the near future. Collaborating with multiple centers will grant access to a larger and more diverse dataset covering a broader range of clinical scenarios. Through these efforts, we aim to fine-tune our model and enhance its reliability across the spectrum of UC severity.
The decision to combine the remission and mild categories in the dataset used to train the deep learning model is grounded in clinical rationale and informed by treatment objectives.From a treatment perspective, patients in remission or with mild symptoms are often not the primary targets for aggressive intervention.The combination of these categories into a single group reflects a clinically meaningful distinction in the condition's management.This approach ensures that the model's predictions align more closely with the clinical considerations that guide treatment decisions, potentially improving the model's utility in the real-world clinical setting.
We also collected pathologic readings of the endoscopic images and sought the input of two pathologists to ensure reliability. In cases where their opinions diverged, the pathologists discussed the case to reach a consensus on the pathological interpretation [27,28]. Figure 8 illustrates the distribution of severity based on both pathology and clinical findings. The differences between the pathological and clinical findings likely arise because pathological findings are evaluated only in the biopsy tissue, whereas clinical findings are evaluated across the entire endoscopic image. In clinical practice, biopsy results are important, but decisions are also influenced by the size and appearance of lesions visible on endoscopic images [27]. As a result, we chose not to use the pathology findings for data labeling in this study, focusing instead on other significant factors.

Institutional Review Board Statement: The studies involving human participants were reviewed and approved by the Institutional Review Board of Ewha Womans University Seoul Hospital, Korea (IRB no. 2023-03-028), in accordance with ethical guidelines and the Declaration of Helsinki. The patients/participants provided their written informed consent to participate in this study. Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Figure 2. Endoscopic images and ulcerative colitis labels determined using the Ulcerative Colitis Endoscopic Index of Severity (UCEIS).


Figure 5. The removal of reflections and dark areas via HSV conversion and inpainting (white: light reflection).


Figure 6. The distribution of (a) scores and (b) severity from five experts, and consensus data.


Figure 7. Confusion matrix of the EfficientNetB0 model for consensus data.


Figure 8. The distribution of severity in consensus and pathology data.


Table 1. Demographics and images.


Table 2. Distribution of images for deep learning: training vs. test set.

Table 4. Interpretation of kappa index.


Table 6. The number of pass/fail images among 50 test images with six network models.

Table 7. Comparison of previous studies and our study.
