1. Introduction
Ulcerative colitis (UC) is a chronic, idiopathic inflammatory bowel disease (IBD) characterized by persistent inflammation of the colonic mucosa. This inflammation primarily affects the rectum and may extend proximally in a continuous manner to involve the entire colon [1]. The underlying etiology of UC remains unclear, but it is thought to result from a complex interplay of genetic, environmental, immune, and microbial factors. The prevalence of UC varies globally, with higher incidence rates observed in Western countries. However, emerging epidemiological data indicate a rising trend, suggesting a possible influence of lifestyle and environmental changes [2]. UC significantly impacts patients’ quality of life, leading to symptoms such as diarrhea, rectal bleeding, abdominal pain, urgency, and fatigue [3]. Additionally, the chronic nature of the disease can contribute to emotional distress, anxiety, and reduced social functioning [4,5]. UC is also a remitting and relapsing disorder, characterized by periods of symptomatic exacerbation followed by phases of clinical remission [6]. Because of this unpredictable disease course, ongoing monitoring is essential for effective disease management. Regular assessment through clinical evaluation, mood-targeted interventions, endoscopic surveillance, and biomarker monitoring helps optimize treatment strategies, prevent complications, and improve long-term outcomes for patients [7,8].
Specifically, endoscopic assessment plays a critical role in the ongoing monitoring of ulcerative colitis (UC), providing direct visualization of mucosal inflammation and aiding in treatment decisions [9]. Mucosal healing, as determined by endoscopy, is a key therapeutic target associated with a reduced risk of disease progression, complications, and colectomy. The Mayo Endoscopic Score (MES) is the gold standard for evaluating disease severity in UC. This scoring system assesses mucosal inflammation through key visual indicators: erythema (redness of the mucosa), vascular pattern (visibility of blood vessels beneath the mucosa), friability (tendency of tissue to bleed when touched), and the presence of erosions or ulcers [10]. Despite its widespread use, current endoscopic assessment methods present several challenges, particularly inter- and intra-observer variability that depends on the level of experience [11]. The subjective nature of endoscopic scoring can lead to discrepancies in disease severity grading between different endoscopists (inter-observer variability) and even between repeated evaluations by the same endoscopist (intra-observer variability). Such inconsistencies can impact clinical decision-making, influencing treatment choices and disease monitoring strategies [12]. Additionally, endoscopic procedures are invasive, costly, and may not always be feasible for frequent monitoring. The MES is a four-tiered grading system (0 to 3) based on mucosal appearance during endoscopy, as follows:
0: Normal or inactive disease.
1: Mild disease (erythema, decreased vascular pattern, mild friability).
2: Moderate disease (marked erythema, absent vascular pattern, friability, erosions).
3: Severe disease (spontaneous bleeding, deep ulcers).
Given these challenges, efforts are being made to develop more objective and reproducible assessment tools, including artificial intelligence-assisted endoscopic evaluation and biomarker-based disease monitoring, to complement conventional endoscopic scoring systems [13,14,15,16,17].
To address these challenges, computer-aided diagnosis (CAD) using advanced artificial intelligence (AI) algorithms has emerged as a promising solution [17]. AI-based tools, particularly those utilizing deep learning models, have the potential to enhance the accuracy and consistency of endoscopic assessments. These algorithms can automatically analyze endoscopic images and videos, offering real-time, objective, and reproducible grading of mucosal inflammation. Such systems could help mitigate the issues of inter- and intra-observer variability, reduce human error, and provide consistent disease monitoring.
Recent successes in deep learning for medical imaging have demonstrated its capability to outperform traditional methods in various domains [18,19]. For example, deep learning has shown remarkable performance in detecting and classifying pathologies in radiology and dermatology, where algorithms can identify patterns that are often difficult for human observers to detect [20,21]. Furthermore, in the realm of gastrointestinal (GI) conditions, AI-driven techniques have already been successfully applied to gastroscopy and colonoscopy [22]. Studies have shown that deep learning models can accurately detect colorectal cancer and polyps [23], as well as assist in evaluating conditions such as Crohn’s disease and esophageal cancer [24]. These advancements suggest that AI may play an important role in enhancing diagnostic and monitoring capabilities for UC, ensuring more accurate, efficient, and consistent clinical care.
In the present study, several state-of-the-art deep learning models were trained for automated ulcerative colitis (UC) severity classification using endoscopic images. The key focus involves comparing different convolutional neural network (CNN) architectures to determine their effectiveness in accurately assessing disease severity while reducing the subjectivity of conventional scoring systems like the Mayo Endoscopic Score (MES).
2. Materials and Methods
2.1. Dataset
The Labeled Images for Ulcerative Colitis (LIMUC) dataset, which is publicly available, was utilized for this study [25]. For this dataset, a total of 19,537 endoscopic images were collected from 1043 colonoscopy procedures performed on 564 UC patients at Marmara University Institute of Gastroenterology between December 2011 and July 2019. All images were acquired using a Pentax EPK-i video processor and Pentax EC-380LKp video colonoscope (Pentax, Tokyo, Japan) and standardized to a resolution of 352 × 288 pixels during database storage. The images were captured at different time points during the colonoscopy procedures, ensuring no spatial relationship among images from the same patient and thus increasing dataset heterogeneity.
Images unsuitable for evaluation due to debris, inadequate bowel preparation, artifacts, retroflexion, or poor image quality were excluded from the study. All patient information, software outputs, and temporal data were masked to prevent bias. Two experienced gastroenterologists independently reviewed and classified all images according to the Mayo Endoscopic Score (MES). The inter-reader reliability for MES labeling was measured with quadratic weighted kappa and achieved a score of 0.781. The initial diagnoses from the gastroenterologist who performed the colonoscopy were not used in the labeling of this dataset.
For images with inconsistent labels between the two reviewers (7652 images), a third independent reviewer, blinded to previous classifications, provided an additional assessment. Final scores for these cases were determined using majority voting. Images that received different labels from all three reviewers were excluded from the study.
After applying all exclusion criteria, including 8060 images deemed unsuitable for MES evaluation and 201 images with complete reviewer disagreement, the final dataset consisted of 11,276 images with the following distribution:
MES 0: 6105 (54.14%).
MES 1: 3052 (27.07%).
MES 2: 1254 (11.12%).
MES 3: 865 (7.67%).
Figure 1 depicts a sample image for each MES evaluation.
2.2. Experimental Setup
In this study, the performance of several state-of-the-art deep learning models was evaluated on the LIMUC dataset. The selected models were DenseNet [26], EfficientNet [27], MobileNetV2 [28], ResNet [29], VGG [30], and Vision Transformer (ViT) [31], all of which have demonstrated state-of-the-art results on the ImageNet classification challenge [32]. Transfer learning [33] was utilized by initializing all models with weights pre-trained on ImageNet. To accommodate the LIMUC classification task, the final classification layer of each network was modified by replacing the original 1000-class ImageNet output layer with a layer containing four neurons, one for each MES score.
Neural network scaling affects both capacity and generalization performance [27]. Models with more parameters are usually better at handling complex tasks, but they may also overfit their data, particularly when datasets are small. For this reason, network architectures come in several variants, which share the same base architecture but differ in the number of parameters. However, it is not always clear which variant will be the most performant. Therefore, this study evaluates several variants of each model family. The detailed list of the network variants and their total parameters is shown in Table 1.
The dataset was split into training and test sets using an 80/20 ratio. To ensure a fair comparison, all models were trained and evaluated on the same dataset partitions. For training, the Adam optimizer [34] was used with a fixed learning rate. The training process was conducted on an NVIDIA RTX 4090 (NVIDIA Corporation, Santa Clara, CA, USA) graphics card with 24 GB of VRAM for 300 epochs, utilizing a fixed input image size of (224, 224) pixels and a batch size of 64. However, for EfficientNetB6 and EfficientNetB7, the batch size was reduced to 32 because of insufficient memory. Generally, batch size has been shown to influence the generalization performance of neural networks. Smaller batch sizes lead to noisier gradient updates, which can help escape local minima during training, but may also slow down convergence. Conversely, larger batch sizes typically lead to faster convergence and exhibit a more stable loss trajectory, but can also result in poorer generalization [35]. In this study, all models reached convergence regardless of batch size, suggesting that in this particular setup, this small batch size variation did not substantially affect the final results.
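The training configuration described above can be sketched as follows. This is a minimal illustration rather than the study's exact code, and the `lr` default is a hypothetical placeholder (the exact learning rate is not restated here):

```python
import torch
import torch.nn as nn

def train(model, loader, class_weights, epochs=300, lr=1e-4, device="cpu"):
    """Minimal training-loop sketch: Adam + class-weighted cross-entropy.

    `lr` is a placeholder value, not the study's reported setting.
    """
    model.to(device)
    criterion = nn.CrossEntropyLoss(weight=class_weights.to(device))
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        model.train()
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```

In the study, `loader` would iterate over augmented 224 × 224 LIMUC batches of size 64 (32 for EfficientNetB6/B7), and `class_weights` would be the inverse-frequency weights described in the next paragraphs.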
Due to class imbalance in the dataset, cross-entropy loss [36] with class weighting was employed to ensure that underrepresented classes received higher importance during training. Cross-entropy loss, commonly used for classification tasks, measures the divergence between the predicted probability distribution and the true class labels. It is defined as follows:

$$ \mathcal{L}_{CE} = -\sum_{c=1}^{K} y_c \log(p_c) $$

where $y_c$ is the true label (1 for the correct class, 0 otherwise), and $p_c$ is the predicted probability for class $c$. This loss function penalizes incorrect predictions more severely when the confidence in the wrong class is high, encouraging the model to output probabilities that align closely with the actual distribution of labels.
However, in imbalanced datasets, standard cross-entropy loss can lead to biased learning, where the model favors majority classes and struggles to recognize minority classes effectively [37]. To counteract this, class weighting is commonly applied, where the weight assigned to each class is computed as follows:

$$ w_c = \frac{N}{K \cdot n_c} $$

where $N$ is the total number of samples, $K$ is the total number of classes, and $n_c$ is the number of samples in class $c$. This ensures that classes with fewer samples receive higher weights, thereby balancing the contribution of each class to the loss function.
In addition to image resizing, each RGB channel of the inputs was independently normalized by subtracting the mean and dividing by the standard deviation, as computed on the entire training set. The mean was determined as [0.4143, 0.2877, 0.2184] and the standard deviation as [0.2986, 0.2210, 0.1784], for the red, green, and blue channels, respectively. In addition, data augmentation techniques were applied during training to increase the diversity of the training set. The augmentations used were as follows:
Horizontal flipping: The image was flipped horizontally with a 50% probability, introducing left–right variations.
Vertical flipping: The image was flipped vertically with a 50% probability, adding top–bottom variations.
Color jittering: The following image color properties were randomly adjusted:
– Brightness: between 60% and 160% of the original.
– Contrast: by a factor of 0.2.
– Saturation: by a factor of 0.1.
– Hue: slightly altered within a range of ±0.01.
Elastic transformations: The image structure was deformed in a non-linear fashion to mimic realistic distortions while preserving crucial features, using Lanczos4 interpolation for smooth transitions.
All experiments were conducted using Python 3.12 and PyTorch 2.1.2 with CUDA 11.8 for acceleration.
2.3. Network Evaluation
To evaluate model performance, four standard classification metrics were used: F1 score, recall, precision, and accuracy. These metrics were computed individually for each class following a one-versus-rest approach, where each class is evaluated independently by transforming the multi-class problem into a binary classification one. In this approach, instances of the target class are considered positive samples, while instances from all remaining classes are collectively treated as negative samples.
The accuracy of a class is defined as follows:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

The precision of a class is defined as follows:

$$ \text{Precision} = \frac{TP}{TP + FP} $$

The sensitivity (recall) of a class is given by the following:

$$ \text{Recall} = \frac{TP}{TP + FN} $$

where $TP$ (true positive) denotes correctly predicted positive cases, $TN$ (true negative) denotes correctly predicted negative cases, $FP$ (false positive) denotes negative cases incorrectly predicted as positive, and $FN$ (false negative) denotes actual positive cases that were incorrectly classified as negative.

The F1 score is the harmonic mean of precision and recall, formulated as follows:

$$ F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
This metric balances precision and recall, making it particularly useful for imbalanced datasets where one class might dominate over others.
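The one-versus-rest metrics can be made concrete with a short reference implementation (equivalent results are available from `sklearn.metrics`):

```python
def one_vs_rest_metrics(y_true, y_pred, cls):
    """Per-class accuracy, precision, recall, F1 with `cls` as positive class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (macro averaging)."""
    return sum(one_vs_rest_metrics(y_true, y_pred, c)[3]
               for c in range(num_classes)) / num_classes
```

Macro averaging gives each MES class equal influence regardless of its sample count, which is why it is used as the primary metric later in the paper.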
Furthermore, to facilitate comparison with the results of other studies, the weighted kappa score [38] was calculated for the best-performing model, as determined by the mean value of the F1 score across all classes. The weighted kappa score is a statistical measure used to assess the agreement between two raters while accounting for the degree of disagreement. Unlike simple accuracy, which only considers correct versus incorrect predictions, the weighted kappa score applies a weighting scheme that penalizes larger discrepancies more heavily. Given the ordinal nature of the MES scores, the quadratic weighted kappa (QWK) was employed, which assigns quadratic penalties to differences between predicted and actual ordinal categories. This makes it particularly useful in tasks such as medical diagnosis, grading systems, and other ordered classification problems.
To estimate the uncertainty associated with the weighted kappa score, bootstrapping was employed. Bootstrapping is a resampling technique that involves repeatedly drawing random samples with replacements from the original dataset to create multiple resampled datasets. By computing the weighted kappa score for each resampled dataset, a distribution of scores is obtained, allowing the estimation of confidence intervals. This approach provides insights into the stability and reliability of the model’s performance, ensuring that reported kappa scores are robust to variations in the data.
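A minimal sketch of the QWK computation and its percentile-bootstrap confidence interval (sklearn's `cohen_kappa_score(..., weights='quadratic')` computes the same statistic):

```python
import random

def quadratic_weighted_kappa(y_true, y_pred, k=4):
    """QWK with quadratic penalty w_ij = (i - j)^2 / (k - 1)^2."""
    n = len(y_true)
    obs = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1
    hist_t = [y_true.count(c) for c in range(k)]
    hist_p = [y_pred.count(c) for c in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * obs[i][j]            # observed weighted disagreement
            den += w * hist_t[i] * hist_p[j] / n  # chance-expected disagreement
    return 1.0 if den == 0 else 1.0 - num / den

def bootstrap_ci(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample prediction pairs with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = sorted(
        quadratic_weighted_kappa([y_true[i] for i in idx],
                                 [y_pred[i] for i in idx])
        for idx in ([rng.randrange(n) for _ in range(n)]
                    for _ in range(n_boot))
    )
    return (scores[int(alpha / 2 * n_boot)],
            scores[int((1 - alpha / 2) * n_boot) - 1])
```

Resampling operates on (true label, prediction) pairs from the test set, so each bootstrap replicate yields one kappa score and the interval follows from the empirical percentiles.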
Statistical Comparison
To assess the statistical significance of performance differences between models, McNemar’s tests were conducted [39]. McNemar’s test is a statistical technique specifically designed for analyzing paired nominal data. The test analyzes the disagreements between two classifiers, focusing on cases where one classifier is correct while the other is incorrect. By analyzing these mismatched predictions, McNemar’s test determines whether the performance gap between two classifiers is statistically significant or just random. The test creates a contingency table, $M$, of agreements and disagreements between the classifiers and then calculates a chi-square statistic to assess whether the observed differences in error patterns are meaningful. The test statistic ($\chi^2$) and the contingency table $M$ are defined as follows:

$$ \chi^2 = \frac{(b - c)^2}{b + c}, \qquad M = \begin{pmatrix} a & b \\ c & d \end{pmatrix} $$

where $b$ is the number of cases where classifier 1 is correct and classifier 2 is incorrect, $c$ is the number of cases where classifier 1 is incorrect and classifier 2 is correct, $a$ is the number of cases where both classifiers are correct, and $d$ is the number of cases where both classifiers are incorrect.
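The test can be sketched directly from the definitions above. This version uses the uncorrected statistic as defined, with the p-value from the chi-square distribution with one degree of freedom; libraries such as `statsmodels` also provide exact and continuity-corrected variants.

```python
import math

def mcnemar_test(correct1, correct2):
    """McNemar chi-square statistic and p-value from paired correctness vectors.

    correct1[i] / correct2[i] indicate whether each classifier got sample i right.
    """
    b = sum(c1 and not c2 for c1, c2 in zip(correct1, correct2))
    c = sum((not c1) and c2 for c1, c2 in zip(correct1, correct2))
    if b + c == 0:
        return 0.0, 1.0  # no discordant pairs: no evidence of a difference
    stat = (b - c) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p
```

Only the discordant cells $b$ and $c$ enter the statistic; the concordant cells $a$ and $d$ affect neither the statistic nor the p-value.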
3. Results
3.1. Network Performance
The performance metrics of all evaluated models are presented in Table 2, Table 3, Table 4 and Table 5. To determine the optimal model, the macro-averaged F1 score was utilized as the primary evaluation metric. This choice was motivated by the class imbalance in the dataset, as the F1 score provides a more balanced representation of model performance by considering both precision and recall.
Based on the macro-average F1 scores across all MES score classes, the top five performing models were VGG19 (0.7528), EfficientNetB1 (0.7521), EfficientNetB6 (0.7493), MobileNetV2 (0.7472), and DenseNet169 (0.7465). VGG19 demonstrated superior performance particularly for MES 1 and 2, although it was slightly outperformed in MES 0 and 3.
For MES 0, DenseNet201, EfficientNetB3, and EfficientNetB6 achieved the highest F1 scores (0.887), while VGG19 demonstrated the best performance for MES 1 (0.702) and MES 2 (0.668). For MES 3, DenseNet169 attained the highest F1 score (0.792). Notably, VGG19 did not achieve the highest F1 score for MES 0 and 3 but performed consistently well across all classes, resulting in the highest overall macro-average F1 score.
Accuracy metrics followed similar patterns, with DenseNet201 and EfficientNetB3 achieving the highest accuracy for MES 0 (0.877 and 0.880 respectively), VGG19 for MES 1 (0.839), VGG19 for MES 2 (0.929), and DenseNet169 for MES 3 (0.968). These results indicate that different architectures exhibited specific strengths in identifying particular MES score categories.
3.2. Statistical Analysis
The results of the McNemar statistical analysis, which compares all models against VGG19 (the selected benchmark model based on the macro-averaged F1 score), are shown in Table 6. Statistically significant differences (p < 0.05) were observed for several models. In particular, MobileNetV2, VGG16, ResNet34, EfficientNetB0, and EfficientNetB1 showed statistically significant differences from VGG19 in MES 0 performance. For MES 1, significant differences were observed with ViT and ResNet34. In MES 3, DenseNet121 and EfficientNetB4 demonstrated statistically significant performance differences compared to VGG19.
These findings reveal that, despite the small numerical differences in overall performance metrics, certain architectures may be particularly well-suited for detecting specific MES score categories. This could provide important context for model selection, particularly where accurate identification of specific disease severity levels is critical.
4. Discussion
4.1. Model Architecture and Performance Analysis
This study adopts a comparative approach, evaluating state-of-the-art deep neural network classifiers for automated UC severity classification using endoscopic images. While the VGG19 network was found to perform best based on the macro-averaged F1 score, the performance differences among the top-performing models were very small, with less than 0.01 separating the five highest-scoring architectures. This suggests that multiple models would be viable candidates for deployment, depending on specific requirements such as inference speed, memory constraints, or deployment environment. For example, smaller and faster models are suitable for deployment in settings with limited computing power or where real-time analysis is required.
The EfficientNet family demonstrated consistently strong results across its various configurations. Interestingly, EfficientNetB1 outperformed all its larger variants, EfficientNetB2-7, in terms of the macro-averaged F1 score, despite having significantly fewer parameters (6.5 M versus 63.8 M for EfficientNetB7). Similarly, MobileNetV2 exhibited competitive performance with only 2.2 million parameters, making it substantially more efficient than most other tested architectures. These findings challenge the conventional assumption that model size and complexity necessarily correlate with improved classification performance and are consistent with previous work in other domains [40].
The class imbalance inherent in the dataset used, despite being addressed through class weighting during training, noticeably affected model performance across different metrics. All models exhibited misleadingly high accuracy scores for the underrepresented classes (MES 2 and 3), primarily due to the dominance of true negatives in the evaluation. This imbalance paradoxically resulted in higher accuracy metrics for classes with fewer samples compared to the more abundant classes (MES 0 and 1). This observation reinforces the decision to prioritize the F1 score as the most reliable performance indicator, as it more effectively represents a model’s actual classification capabilities for minority classes by balancing precision and recall.
4.2. Comparison with Existing Methodologies
To directly compare the best-performing model in this study, VGG19, against other methodologies utilizing the LIMUC dataset, the bootstrapping resampling technique was employed and the network was re-evaluated several times to compute the weighted kappa score. Table 7 presents the quadratic weighted kappa (QWK) of VGG19, as well as the QWK scores reported in other studies.
Polat et al. [25] introduced a regression-based methodology. Rather than treating the MES categories as independent classes, their approach captures the ordinal relationship between them by producing a single continuous value representing disease severity. The effectiveness of this methodology was demonstrated through testing across multiple CNN architectures. The most performant model was DenseNet121, achieving a QWK score of 0.854 (95% CI: 0.842–0.867) for the Mayo sub-score classification. This study’s VGG19 achieved 0.876 (95% CI: 0.861–0.892), demonstrating slightly higher performance. The VGG19 model also showed stronger overall performance in class-specific metrics, achieving a mean F1 score of 0.753 with individual scores of 0.885, 0.702, 0.668, and 0.756 for MES 0–3, respectively, compared to DenseNet121’s reported macro F1 of 0.697. However, a detailed class-specific comparison was not possible, as DenseNet121’s individual class F1 scores were not reported.
Pyatha et al. [41] proposed a self-supervised learning (SSL) methodology. The models ResNet50, ViT, and SwinB were combined with the MoCo-v3 (momentum contrast) self-supervised learning framework. The authors first pre-trained their model using the MoCo-v3 framework with one of these models as the backbone. After this pre-training phase, they fine-tuned the model for the specific task of UC grading using all available labeled training data. The best-performing model was MoCo-v3-SB, with SwinB as the backbone, achieving a mean F1 score of 0.711 and a QWK score of 0.844 when fine-tuned using 100% of the samples. This research’s VGG19 model achieved higher mean F1 and QWK scores. Both approaches demonstrated particular strength in identifying both inactive and active disease states, with VGG19 having higher accuracy for MES 1, 2, and 3. Both models showed similar patterns of class imbalance, with the strongest performance in detecting MES 0 (inactive disease), which is clinically valuable for monitoring disease remission. Individual class F1 scores for the MoCo-v3-SB model with SwinB as its backbone were not reported.
In another study by Polat et al. [42], the class distance weighted cross-entropy (CDW-CE) loss function was introduced, designed specifically for ordinal classification tasks. The function penalizes predictions more severely when they deviate further from the true class, with the α parameter determining how harshly such deviations are penalized. Their experiments revealed that the Inception-v3 architecture achieved the highest performance, with a quadratic weighted kappa (QWK) score of 0.872 when using CDW-CE with a margin, compared to 0.868 without a margin. These results were further validated across other architectures, with ResNet18 improving from 0.857 to 0.860 and MobileNet-v3-L improving from 0.859 to 0.862. While the VGG19 architecture achieved a lower QWK score using standard cross-entropy loss compared to both variants of CDW-CE, this presents an opportunity for future research to investigate potential performance improvements by implementing CDW-CE loss with the VGG19 architecture.
4.3. Limitations and Future Perspectives
While this research’s implementation of the deep learning models demonstrated strong performance in UC severity classification, several limitations should be acknowledged.
First, the approach of using traditional cross-entropy loss may not optimally leverage the ordinal nature of Mayo scores. Future work should explore the incorporation of ordinal-aware loss functions such as CDW-CE, which has shown promise in recent studies. This could potentially improve the model’s understanding of the progressive nature of disease severity. Furthermore, this study utilized default hyperparameters across all model training. Future work should explore automated hyperparameter optimization techniques, to identify optimal configurations that could improve model accuracy and generalization capabilities.
Second, despite implementing weighted cross-entropy to mitigate class disparity, the models exhibited lower performance for underrepresented classes. This suggests that while class-weighting helps, it does not fully address the challenges posed by data imbalance. This imbalance could be problematic in clinical settings where accurate differentiation between moderate cases is crucial. Future research should investigate techniques to improve balanced performance across all Mayo scores. Furthermore, expanding the dataset with additional images, particularly for underrepresented categories, would likely enhance classification performance and lead to more robust, generalizable models. Data augmentation strategies, synthetic data generation, or other techniques should be explored to create a more balanced dataset. A critical next step for validating and assessing the generalizability and robustness of the current findings is to evaluate the best-performing models, particularly VGG19, on external, unseen ulcerative colitis (UC) image datasets from different medical centers and patient populations.
Third, unlike some comparative studies, k-fold cross-validation was not employed, although it could provide more robust performance estimates. The validation strategy was chosen to facilitate direct comparisons; in addition, k-fold cross-validation was avoided for statistical comparison purposes, as this approach can lead to elevated Type I error rates when comparing machine learning algorithms [43]. Future work should include more in-depth validation approaches to better assess model generalizability and achieve the best possible results for clinical implementation.
Lastly, while the current implementation achieves strong performance, investigation into model compression techniques could make it more suitable for deployment in live resource-constrained clinical settings, as results of this research indicate that larger models do not necessarily outperform smaller ones. This could include exploring quantization, pruning, or knowledge distillation approaches while maintaining classification accuracy.
5. Conclusions
This research demonstrates the potential of deep learning models for ulcerative colitis (UC) severity classification using endoscopic imaging. Notably, multiple deep learning architectures showed robust performance in automatically assessing UC severity, with several models achieving consistently high classification accuracy. While VGG19 achieved the highest macro-averaged F1 score of 75.3% (with class-specific scores of 0.885, 0.702, 0.668, and 0.756 for MES 0–3), the strong performance across diverse model architectures with minimal statistical differences suggests that model selection should be guided primarily by deployment requirements rather than marginal performance gains. In addition, smaller models achieved competitive results despite having significantly fewer parameters, challenging conventional assumptions about model complexity and classification performance.
The consistency of results across different neural network models underscores the viability of AI-assisted UC severity assessment. To translate these promising research findings into clinical practice, future studies should focus on clinical validation. This would involve testing these automated classification systems in real-world clinical settings to support clinical decision-making.
By providing more consistent and objective Mayo Endoscopic Score assessments, these automated approaches represent a significant step toward reducing the inter-observer variability that has traditionally complicated the determination of UC severity in clinical practice.