Modification of U-Net with Pre-Trained ResNet-50 and Atrous Block for Polyp Segmentation: Model TASPP-UNet †

Abstract: Colorectal cancer is the third most prevalent type of cancer globally, and it typically progresses unnoticed, making early detection via effective screening methods crucial. This study presents the TASPP-UNet, an advanced deep learning model that integrates Atrous Spatial Pyramid Pooling (ASPP) blocks and a ResNet-50 encoder to enhance polyp boundary delineation accuracy in colonoscopy images. We utilized augmented datasets from Kvasir-SEG and CVC Clinic-DB, which included up to 2000 images, to enrich the training examples’ variability. The TASPP-UNet achieved a superior IOU of 0.9276, compared to 0.9128 by the ResNet50-UNet and 0.8607 by the standard U-Net, demonstrating its efficacy in precise segmentation tasks. Notably, this model exhibited impressive computational efficiency with a processing speed of 151.1 frames per second (FPS), underscoring its potential for real-time clinical applications aimed at early and accurate colorectal cancer detection. This performance highlights the model’s capability not only to improve diagnostic accuracy but also to enhance clinical workflows, potentially leading to better patient outcomes.


Introduction
Colorectal cancer (CRC) ranks as the third most prevalent type of cancer and claims the lives of millions of people worldwide, according to GLOBOCAN [1]. The stomach, being a pouch-shaped organ, typically does not present symptoms from luminal tumor growth leading to obstructive complications in its initial stages [2]. Recently, major efforts have been focused on the early detection of colorectal polyps because they may represent a precancerous condition [3]. Adenomatous colorectal polyps are considered dangerous because they are more likely to develop into cancer; a small adenoma can slowly progress over 10 years before becoming invasive cancer through the pathway of chromosomal instability [4,5]. Currently, colonoscopy is widely used to examine the large intestine. However, wireless capsule endoscopy (WCE) offers a new approach by allowing the patient to swallow a small WCE capsule that collects video images of the colon [6]. However, earlier research indicates that 26% of colorectal polyps go undetected during colonoscopy, and when WCE is administered, the endoscopist requires a long time to analyze the recordings, which also increases the likelihood of polyps being missed [7]. In response to the problems of missed polyps during colonoscopy and the complexity of analyzing WCE data, systems based on artificial intelligence are being developed to automatically segment colon polyps. One study compared such computer-aided systems with traditional colonoscopy in polyp detection; the computer-assisted group revealed more polyps and adenomas than the group examined with the traditional white-light method [8]. Along the same lines, another study found that the use of a computer system in colonoscopy reduced the polyp miss rate to 13.89%, compared to 40.00% with traditional colonoscopy [9].
The application of neural network algorithms facilitates the automation of medical image analysis. However, segmentation and analysis of colon polyps with neural networks face several challenges, namely blurred polyp boundaries, the similarity of polyps to the surrounding tissue, and a lack of quality annotated data, which can lead to problems with model generalization [10]. In computer vision for medical image segmentation, convolutional neural networks have received particular attention, above all the U-Net architecture [11,12], with its symmetric encoder and decoder and skip connections for better feature transfer between network levels. However, the architecture has several disadvantages: mixing of features from different abstraction levels, low generalizability, and limited context perception. Many additions and modifications have been developed to improve the basic U-Net model for the polyp segmentation task, but outstanding challenges still remain [13].
In this paper, a TASPP-UNet model is proposed that is based on the U-Net architecture, improved by integrating a pre-trained ResNet-50 encoder [14] and adding an Atrous Spatial Pyramid Pooling (ASPP) block [15] for better context analysis. An experimental comparison with the baseline U-Net and with a version using only the pre-trained encoder was performed. Improving neural network architectures and applying transfer learning techniques aim to overcome the existing problems in polyp segmentation and, in turn, to aid the diagnosis of CRC.

Materials and Methods
The following section describes the basic encoder and decoder blocks of the TASPP-UNet method and compares its parameters with those of ResNet50-UNet and the classical U-Net.

Pre-Trained Encoder
The TASPP-UNet model uses the deep convolutional network ResNet-50 [16] as an encoder for extracting advanced contextual features. The pre-trained ResNet-50, trained on the multi-million-image ImageNet dataset [17], consists of fifty trainable layers organized into residual blocks designed to avoid vanishing gradients. This problem arises when the error gradient propagating backward from the output to the input layers becomes increasingly small, making it difficult to update the weights of the initial layers during training [18]. The residual block is described by Formula (1):

y = F(x) + x, (1)

where x is the input and F(x) is the output of the block's layers. Figure 1 depicts the residual block. In the context of colon polyp segmentation, employing ResNet-50 as a feature extractor should enhance the model's accuracy.
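The shortcut computation in Formula (1) can be sketched in a few lines of NumPy. This is an illustrative toy in which plain matrix products stand in for ResNet-50's convolution and batch-normalization stack, not the actual bottleneck block:

```python
import numpy as np

def residual_block(x, w1, w2):
    """Toy residual block implementing y = F(x) + x (Formula (1)).
    F is two linear transforms with a ReLU in between; a real
    ResNet-50 block uses convolutions and batch normalization."""
    h = np.maximum(0.0, x @ w1)   # first transform + ReLU
    f = h @ w2                    # second transform: F(x)
    return f + x                  # identity shortcut

# Even if F collapses to zero (w2 = 0), the input still flows through
# the shortcut unchanged, which is what keeps gradients from vanishing.
x = np.ones((1, 4))
print(residual_block(x, np.eye(4), np.zeros((4, 4))))  # → the input itself
```

The key property is visible in the last line: the identity path guarantees a direct route for both activations and gradients, regardless of what the learned branch F produces.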

ASPP Module
The ASPP module is located at the output of the pre-trained ResNet-50 encoder, where it processes the deepest and most abstract features before they are reconstructed in the decoder [19]. The ASPP block uses atrous convolution to process images at different scales, increasing the receptive field of the filters without increasing the number of parameters or the amount of computation. The ASPP module is shown in Figure 2, a schematic of the block with parallel convolutional pathways at different dilation rates as well as a global pooling layer. These paths are combined using a 1 × 1 convolution to create an enriched set of output features.
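The receptive-field effect of atrous convolution can be illustrated with a minimal 1-D NumPy sketch (the model itself uses 2-D convolutions; the kernel values and rates below are arbitrary illustration choices):

```python
import numpy as np

def atrous_conv1d(signal, kernel, rate):
    """1-D atrous (dilated) convolution sketch: kernel taps are spaced
    `rate` samples apart, so a 3-tap kernel covers a receptive field of
    (len(kernel) - 1) * rate + 1 input samples while keeping the same
    3 weights. ASPP runs several such branches in parallel at different
    rates and fuses their outputs with a 1x1 convolution."""
    k, span = len(kernel), (len(kernel) - 1) * rate
    out = np.zeros(len(signal) - span)
    for i in range(len(out)):
        out[i] = sum(kernel[j] * signal[i + j * rate] for j in range(k))
    return out

signal = np.arange(10, dtype=float)
kernel = np.array([1.0, 1.0, 1.0])
# rate 1 behaves like an ordinary convolution; rate 2 sees a wider context
print(atrous_conv1d(signal, kernel, rate=1))
print(atrous_conv1d(signal, kernel, rate=2))
```

Note that both calls use exactly three weights; only the spacing between the sampled inputs changes, which is why the receptive field grows at no extra parameter cost.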

Model Decoder Path
The decoder in the TASPP-UNet model sequentially increases the resolution of the feature map using up-sampling to recover spatial details lost in the encoding stage. Subsequent convolutional layers refine the features, while batch normalization ensures the stability of the learning process. Regularization mechanisms such as Dropout are applied to reduce overfitting [20]. Transposed convolution layers further expand the feature map, refining the segmentation. A final 1 × 1 convolution with sigmoid activation transforms the feature map into a per-pixel probability of membership in the object of interest, producing the segmentation map. Figure 3 shows the overall architecture of the TASPP-UNet model: an encoder with a pre-trained ResNet-50, an ASPP block for multilevel context analysis, and a decoder with up-sampling and skip connections to recover a detailed segmentation map [21]. The model outputs a binary polyp segmentation and a segmentation overlay on the original image.
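The two decoder steps described above, resolution recovery and the final sigmoid mapping, can be sketched in NumPy. Nearest-neighbour up-sampling stands in here for the model's learned transposed convolutions, and the feature values are illustrative:

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbour 2x up-sampling of an (H, W) feature map: the
    resolution-recovery step the decoder repeats at each stage (in the
    model, a learned transposed convolution plays this role)."""
    return np.repeat(np.repeat(fmap, 2, axis=0), 2, axis=1)

def to_probabilities(logits):
    """Final 1x1-conv + sigmoid step: map per-pixel scores to per-pixel
    polyp probabilities, then threshold at 0.5 for the binary mask."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, (probs > 0.5).astype(np.uint8)

fmap = np.array([[0.0, 4.0], [-4.0, 0.0]])
up = upsample2x(fmap)            # (2, 2) -> (4, 4)
probs, mask = to_probabilities(up)
print(mask)
```

Stacking several such up-sampling stages, each followed by convolutions fed with the matching skip connection, is what lets the decoder rebuild a full-resolution segmentation map from the encoder's compressed features.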

Comparison of Segmentation Models
In our study, we conducted a comparative analysis of the TASPP-UNet model against two other architectures: a U-Net with a ResNet-50 encoder and the basic U-Net. Table 1 provides a quantitative comparison of these architectures. The TASPP-UNet model, integrating ASPP and ResNet-50, has the highest complexity, with more than 19.5 million parameters and a model size of about 74.7 MB. The ResNet-50-UNet method, which uses the powerful ResNet-50 encoder but lacks the ASPP module, occupies an intermediate position with approximately 14.9 million parameters and a model size of 63.42 MB.

Evaluation Metrics
Several key indicators, each with a specific formula, were used to assess colon polyp segmentation performance in the research experiment.
Binary Cross-Entropy Loss: This function measures the difference between the actual label and the predicted probability; below it is referred to simply as Loss. It is calculated using Formula (2):

Loss = -(1/N) ∑ [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)], (2)

where N is the number of samples, y_i is the actual label, and ŷ_i is the predicted probability [22].
Dice Score: This measures the overlap between the prediction and the ground truth (Formula (3)):

Dice = 2|X ∩ Y| / (|X| + |Y|), (3)

where X is the set of predicted positives and Y is the set of actual positives [23].
Accuracy: This reflects the proportion of true results out of the total number of cases examined (Formula (4)):

Accuracy = (TP + TN) / (TP + TN + FP + FN), (4)

where TP, TN, FP, and FN denote true positives, true negatives, false positives, and false negatives, respectively [24].
Mean Intersection Over Union (Mean IOU): This is a common evaluation metric for segmentation tasks that measures the mean overlap between the predicted segmentation and the reference standard across classes. It is given by Formula (5):

Mean IOU = (1/N) ∑ |X_i ∩ Y_i| / |X_i ∪ Y_i|, (5)

where X_i and Y_i are the prediction and the reference standard for class i, and N is the number of classes [25].
Each metric provides insight into a different aspect of model performance: Loss gives a direct probabilistic measure of prediction error, the Dice score and Mean IOU assess the spatial accuracy of the segmentation, and Accuracy provides a quick overall measure of correctness.
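For reference, the four metrics can be sketched directly from Formulas (2)-(5) in NumPy for the binary, single-class case; the `eps` clipping constant is a common numerical safeguard, not a value from the paper:

```python
import numpy as np

def bce_loss(y_true, y_pred, eps=1e-7):
    """Binary cross-entropy, Formula (2); eps avoids log(0)."""
    p = np.clip(y_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def dice_score(y_true, y_pred):
    """Dice overlap, Formula (3): 2|X n Y| / (|X| + |Y|)."""
    inter = np.sum(y_true * y_pred)
    return 2.0 * inter / (np.sum(y_true) + np.sum(y_pred))

def accuracy(y_true, y_pred):
    """Formula (4): (TP + TN) / (TP + TN + FP + FN)."""
    return np.mean(y_true == y_pred)

def iou(y_true, y_pred):
    """Per-class intersection over union; Mean IOU (Formula (5))
    averages this value over the classes."""
    inter = np.sum(np.logical_and(y_true, y_pred))
    union = np.sum(np.logical_or(y_true, y_pred))
    return inter / union

truth = np.array([1, 1, 0, 0])
pred = np.array([1, 0, 0, 0])   # one of the two true pixels was missed
print(dice_score(truth, pred))  # 2*1 / (2 + 1) ≈ 0.667
print(accuracy(truth, pred))    # 3 of 4 pixels correct = 0.75
print(iou(truth, pred))         # 1 / 2 = 0.5
```

The toy example also shows why Dice and IOU are the stricter segmentation metrics: missing one of two foreground pixels barely dents Accuracy but cuts IOU in half.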

Data Preprocessing
In this study, two public datasets were used to address the colon polyp segmentation problem: Kvasir-SEG [26], comprising 1000 annotated images, and the CVC Clinic-DB set [27], containing 612 annotated images. These sets were combined for data diversity. During data preparation, augmentation was applied to expand the original image set to 2000 image instances. The augmentation process used operations such as random variation of brightness and contrast, rotation by a limited angle, and shifting and scaling while preserving the aspect ratio [28]. These operations were chosen to simulate possible variations in imaging conditions, thus enriching the training dataset. After reaching the target amount of data, the pixel values of images and masks were normalized to the range between 0 and 1 to improve the convergence of model training.
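A minimal NumPy sketch of such a pipeline is shown below. The parameter ranges, the restriction to 90-degree rotation steps, and the horizontal flip used in place of shift/scale are illustrative assumptions, not the paper's exact settings (which likely relied on a dedicated augmentation library):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, mask):
    """Sketch of the augmentation step: photometric jitter on the image
    only, plus a geometric transform applied identically to image and
    mask so the annotation stays aligned with the pixels."""
    # photometric: contrast (alpha) and brightness (beta) jitter
    alpha = rng.uniform(0.8, 1.2)
    beta = rng.uniform(-0.1, 0.1)
    image = np.clip(alpha * image + beta, 0.0, 1.0)
    # geometric: same transform for image and mask
    k = rng.integers(0, 4)           # rotation by k * 90 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:           # horizontal flip
        image, mask = image[:, ::-1], mask[:, ::-1]
    return image, mask

img = rng.random((64, 64))           # already normalized to [0, 1]
msk = (rng.random((64, 64)) > 0.5).astype(np.float32)
aug_img, aug_msk = augment(img, msk)
print(aug_img.shape, aug_msk.shape)
```

The essential design point is the split visible in the function body: photometric operations may touch only the image, while every geometric operation must be mirrored onto the mask.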

Implementation Details
The neural network models for this study were developed on an NVIDIA A100 PCIe 80 GB GPU (NVIDIA Corporation, Santa Clara, CA, USA). This graphics card is built on a 7 nm process, features a GA100 GPU, and contains 6912 shader units, 432 tensor cores, and 80 GB of HBM2e memory on a 5120-bit interface. The software stack comprised Python 3.10 and TensorFlow 2.15.0, providing the necessary compatibility and performance for medical data processing.

Results and Discussion
In this experiment for colon polyp segmentation, the datasets used were Kvasir-SEG, which includes 1000 annotated images, and CVC Clinic-DB, which comprises 612 images, totaling 1612 images. These were augmented to 2000 images to enhance dataset variability. Model training used the Adam optimizer [29] with a starting learning rate of 0.0001 and early stopping triggered when no improvement on the validation sample was noted over 25 epochs, with training capped at 140 epochs; metric results were rounded to four decimal places. The performance results for the main metrics on the training data are given in Table 2, and the corresponding learning curves are shown in Figure 4. The TASPP-UNet demonstrated the highest Mean IOU at 92.76%, reflecting its superior capability in precise boundary delineation, despite a Dice score of 0.9147 and an Accuracy score of 76.55%. The ResNet-50 U-Net, while achieving a marginally better Dice score of 0.9176, lagged slightly in Accuracy score and Mean IOU at 74.17% and 91.28%, respectively. The standard U-Net showed the lowest performance across all metrics, with a Dice score of 0.8731, an Accuracy score of 72.56%, and a Mean IOU of 86.07%, underscoring the enhancements provided by ASPP and the ResNet-50 encoder in the advanced models.
The results for the main metrics on the validation data are presented in Table 3. The validation data show that TASPP-UNet attained the overall top scores in all metrics, with a Mean IOU of 0.7141, indicating a strong capacity for accurate segmentation. It also led in Dice score and Accuracy score, reinforcing its robustness. The ResNet-50 U-Net, despite a slightly higher Loss, maintained competitive scores, notably a Dice score of 0.7430. The standard U-Net, while trailing with a Mean IOU of 0.5076, demonstrates potential areas for improvement in segmentation tasks, particularly in precision and generalizability, reflected in its Dice score of 0.6386 and Accuracy score of 72.56%. Figure 5 presents the learning curves for the main metrics on the validation data.
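The stopping rule described above can be sketched as a simple loop over recorded validation losses; the loss sequence in the example is synthetic, purely to illustrate the patience mechanism (the Adam learning rate of 1e-4 lives in the optimizer and is not shown here):

```python
def early_stopping(val_losses, patience=25, max_epochs=140):
    """Sketch of the training schedule: run for at most `max_epochs`
    epochs, but stop once the validation loss has not improved for
    `patience` consecutive epochs. Returns the 1-based epoch at which
    training stops."""
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch            # patience exhausted
    return min(len(val_losses), max_epochs)

# Loss improves for 30 epochs, then plateaus: training stops 25 epochs later.
losses = [1.0 / e for e in range(1, 31)] + [0.05] * 100
print(early_stopping(losses))  # → 55
```

In practice this logic is usually delegated to a framework callback (e.g., Keras's EarlyStopping with `patience=25`), which additionally restores the weights from the best epoch.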
The outcomes of the methods on the test set of 200 images are summarized in Table 4. Referring to the results in Table 5, the TASPP-UNet exhibits the shortest training time and the fastest speed (FPS), which makes it preferable when computing resources are limited. Resnet50-UNet and the standard U-Net require longer training times. These metrics are important for understanding the trade-offs between accuracy and efficiency of model training. Figures 6 and 7 show the original images with the model predictions of colon polyp segments and the segments overlaid on the images. Figure 6 shows that TASPP-UNet's predictions align closely with the expert annotations, indicating its precision. In contrast, Resnet-50-UNet and the standard U-Net show some inaccuracies, marked by blue boxes where polyps were either missed or incorrectly identified, illustrating the challenges these models face in polyp boundary delineation.
The polyp segmentation prediction results presented in Figure 7 illustrate that the TASPP-UNet model produces segmentations that agree well with the ground truth provided by experts, which emphasizes its performance.

Conclusions
The TASPP-UNet model proposed in this study and its experimental comparison with the classical U-Net and Resnet-50 U-Net on a test set demonstrate its performance in the task of colon polyp segmentation. Achieving a Mean IOU of 0.9276, TASPP-UNet demonstrates robust boundary detection, exceeding the performance of the Resnet-50 U-Net (0.9128) and the standard U-Net (0.8607). This boundary-detection performance can be attributed to the ASPP blocks, which enhance contextual information. In terms of operational efficiency, TASPP-UNet maintained a frame rate of 151.1 FPS during training, suggesting its potential for rapid deployment in clinical settings. This high frame rate emphasizes the model's ability to process large datasets quickly, facilitating its application in real-time medical imaging scenarios. The balance between high segmentation accuracy and computational efficiency underlines the promise of the TASPP-UNet model as a practical tool to assist endoscopists in the accurate and timely identification of colorectal polyps.
In addition, extending the applicability of the model to other types of medical images and conditions could greatly expand its utility. The inclusion of large and diverse datasets will be critical to improving the generalizability and robustness of the model. Efforts should also be directed towards optimizing the model for resource-constrained environments so that it can be used in a variety of clinical settings, including those with limited computing infrastructure.
Thus, the improvement of the TASPP-UNet model represents a leap forward in medical image segmentation and hopefully has promising implications for improving the efficiency and accuracy of colorectal polyp detection and segmentation in clinical practice.
Future research should focus on improving these computational paradigms, expanding the applicability of the model, and incorporating large datasets to assist endoscopists in segmenting and recognizing colon polyps.

Figure 3.
Figure 3. The overall structure of the TASPP-UNet model, which includes an encoder with a pre-trained ResNet-50, an ASPP block for multilevel context analysis, and a decoder with up-sampling and skip connections to recover a detailed segmentation map [21]. The model outputs a binary polyp segmentation and a segmentation overlay on the original image.

Figure 4 .
Figure 4. Comparative performance metrics on a training set of colon polyp segmentation methods, where (a) Loss, (b) Dice score, (c) Accuracy score estimation, and (d) Mean IOU estimation by epoch.



Figure 5 .
Figure 5. Comparative analysis on the validation data of colon polyp segmentation methods, where (a) Loss, (b) Dice score, (c) Accuracy score estimation, and (d) Mean IOU estimation by epoch.

On the test dataset, the TASPP-UNet method performed best, with a Loss of 0.0963, a Dice score of 0.8967, an Accuracy score of 0.5624, and a Mean IOU of 0.8789. Resnet50-UNet, although it showed a higher Loss, achieved a moderate Dice score and Mean IOU. The standard U-Net showed average results, with a Loss of 0.1461, a Dice score of 0.8234, an Accuracy score of 0.5529, and a Mean IOU of 0.7667, indicating potential for improvement in segmentation accuracy on unfamiliar data. Table 5 compares the training time and speed of the different methods in the polyp segmentation task.

Figure 6.
Figure 6. Comparison of polyp segmentation methods (predicted results are overlaid on the images, and areas with segmentation errors are highlighted in blue).


Figure 7.
Figure 7. Three samples from the test set comparing methods for predicting polyp segmentation (color segmentation is presented on a dark blue background).

Table 1 .
Comparison of characteristics and parameters of segmentation methods.

Table 2 .
Evaluation of segmentation methods on a training dataset.

Table 3 .
Evaluation of segmentation methods on a validation dataset.


Table 4 .
Evaluation of segmentation methods on a test dataset.


Table 5 .
Comparison of training time and speed for segmentation methods.
