Gastrointestinal Tract Polyp Anomaly Segmentation on Colonoscopy Images Using Graft-U-Net

Computer-aided polyp segmentation is a crucial task that supports gastroenterologists in examining and resecting anomalous tissue in the gastrointestinal tract. Polyps grow mainly in the colorectal area of the gastrointestinal tract, where the mucous membrane develops protrusions of abnormal tissue that increase the risk of incurable diseases such as cancer. Early examination of polyps, such as adenomas, can therefore decrease the chance that they develop into cancer. Deep learning-based diagnostic systems play a vital role in diagnosing diseases in the early stages. A deep learning method, Graft-U-Net, is proposed to segment polyps in colonoscopy frames. Graft-U-Net is a modified version of UNet, and the method comprises three stages: preprocessing, encoder, and decoder. The preprocessing technique improves the contrast of the colonoscopy frames. The encoder analyzes features, while the decoder synthesizes them. The experiments were conducted using two open-access datasets, Kvasir-SEG and CVC-ClinicDB, which were collected from the large bowel of the gastrointestinal tract during colonoscopy procedures. The Graft-U-Net model offers better segmentation results than existing deep learning models: on the Kvasir-SEG dataset, it achieves a mean Dice of 96.61% and a mean Intersection over Union (mIoU) of 82.45%. Similarly, on the CVC-ClinicDB dataset, it achieves a mean Dice of 89.95% and an mIoU of 81.38%.


Introduction
The gastrointestinal tract (GI tract) consists of the stomach, small intestine, and large intestine (which includes the colon, rectum, and anus) [1,2]. The GI tract is the core part of the human digestive system, where mucosal findings range from mild to extremely lethal diseases [3,4]. The mucous membrane can develop protrusions of abnormal tissue referred to as polyps. Polyps can grow anywhere in the GI tract, but most are found in the colorectal area. Colorectal polyps fall into two categories: non-neoplastic and neoplastic [5]. Non-neoplastic polyps are divided into hyperplastic, hamartomatous, and inflammatory polyps, which are recognized as non-cancerous. Neoplastic polyps, on the other hand, can become cancerous depending on their size. Polyps mostly grow in the inner tissue lining of the colorectal area; they are non-cancerous but can promote colorectal cancer (CRC), which is a very dangerous and lethal disease. Worldwide, CRC accounts for nearly 10% of all cancer-related deaths [6]. Colorectal polyps are analyzed and removed after examining the colon using a standardized colonoscopy procedure. There are different endoscopy methods to examine the GI tract, but Confocal Laser Endomicroscopy (CLE) is a cutting-edge, microscopic-level endoscopic technique that allows for subcellular imaging and optical biopsies to be

• The CLAHE technique is applied to the Kvasir-SEG dataset at the preprocessing stage to improve the contrast of the frames, which affects the overall performance of the deep learning model.
• A CNN-based 74-layer Graft-U-Net architecture is proposed, which is composed of an encoder (analyzing) block and a decoder (synthesizing) block. In the encoder and decoder blocks, filters of different depths are employed: 8, 16, 32, 48, and 64. The encoder is modified by including grafting layers parallel to the conventional UNet layers in the encoder block. The feature maps derived from the two parallel networks are added and forwarded to the next layers. Including the graft network layers in the encoder block improves the results of the model.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 addresses the materials and methods for the proposed Graft-U-Net structure and for polyp detection and segmentation. Section 4 presents the results of the experiments and the discussion. Section 5 summarizes the final remarks, consisting of a conclusion and a discussion of future work.

Related Works
Automatic disease detection and segmentation have become active research areas in the past decade [28][29][30][31][32][33][34][35]. Several algorithms and efficient methods have been developed for polyp detection. One study focused on the texture and color of polyps, applying handcrafted descriptors for feature learning [36]. An existing study reveals that CNNs have become a very popular method for tackling public challenges in the computer vision field [37]. Using CNNs, software modules and algorithms have been designed for edge and polyp detection in frames [38]. Colonoscopy images and videos have been used for polyp detection via region-based CNN methods, including transfer learning (Inception and ResNet) and post-processing techniques [39]. A framework based on the Generative Adversarial Network (GAN) model has been applied to disease detection and segmentation problems [40]. Real-time, high-sensitivity algorithms, including the YOLO algorithm, have been developed for polyp segmentation [41]. Transfer learning for polyp segmentation has been evaluated in terms of specificity and sensitivity [42]. Computer vision approaches to polyp segmentation have been improved by the inclusion of data-driven methods [43]. Object segmentation has been performed using down- and up-sampling techniques for the pixel-wise classification of polyps [44]. The fully convolutional network (FCN) was proposed by Long et al. for semantic segmentation [45].
UNet is a modified and extended architecture of the FCN [46]. UNet comprises an analysis path and a synthesis path, which are recognized as the encoder and the decoder, respectively. The analysis path provides the detail of the deep features, while the synthesis path produces the segmentation based on the learned features. The encoder-decoder network is a core component of semantic segmentation in UNet and the FCN [30]. Multiple variants of UNet for biomedical segmentation are found in the literature. Table 1 summarizes the existing models that are used for polyp segmentation on the Kvasir-SEG dataset. The encoder-decoder in UNet applies convolution layers, whereby the encoder extracts essential semantic features ranging from low level to high level. The decoder generates the required segmentation mask using the features extracted by the encoder. The up-sampled (decoder) features are concatenated with the down-sampled (encoder) features using a skip connection. The final output binary masks are produced by the convolutional layers. For polyp segmentation tasks, the encoder stage of the UNet model has been replaced by pre-trained networks, including VGG16 and VGG19 [53]. Residual networks such as ResNet50 are very successful in transfer learning for disease detection and localization [54]. The residual network uses identity mapping and 3 × 3 convolutional layers [55]. Identity mapping mitigates vanishing and exploding gradients in deeper neural networks [56]. Several clinical endoscopy and colonoscopy image datasets are publicly available to researchers. In the proposed work, two datasets, CVC-ClinicDB and Kvasir-SEG, are employed for model evaluation.

Materials and Methods
A model, Graft-U-Net for polyp detection, is proposed, which comprises three main phases: preprocessing, the encoder (analysis path), and the decoder (synthesis path). The CLAHE technique is used in the preprocessing stage, which makes the features in the frames more clearly visible. The frames are given as input to the encoder block, which explores the context of the frame without determining the location of the disease. The decoder follows the encoder and synthesizes the frames; the location is determined using the skip connections initiated from the encoder block. For the analysis of the model, the segmented mask and the ground truth mask are outlined over the original frame in blue and red, respectively. The block diagram of the method for polyp segmentation is shown in Figure 1. A detailed description of each block is provided in the following sections.
Table 1 (recovered rows; reference, year, model, reported score): [51], 2021, SANet, 90.40%; [52], 2021, UACANet, 90.50%.


Figure 1. Block diagram of the proposed method for polyp segmentation.

Preprocessing
Preprocessing is the first and most important stage of the presented approach; it enhances the intensity of the pixels in the images. The CLAHE method is applied over the complete Kvasir-SEG dataset. Controlling the intensity level of the pixels preserves the local details in the image. The image is divided into non-overlapping regions of equal size, which are classified as corner, border, and inner regions. The noise in the frames is clipped by setting the clipping threshold, which is not an easy task; the maximum redistribution level of the clipping and the histogram levels are kept equal. The clip limit is defined by Reza [35] as:
$$\beta = \frac{M}{N}\left(1 + \frac{\alpha}{100}\left(S_{max} - 1\right)\right)$$
where, in each region of the image, M is the number of pixels and N is the number of gray levels; α is a clipping factor in the range [0, 100]; and S_max is the maximum allowed slope of the transformation function, so that [1, S_max] is the slope range of each mapping. Figure 2 illustrates the frames preprocessed using the CLAHE method.
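As a minimal sketch of the clip-limit relation attributed to Reza [35], the function below evaluates the commonly cited form β = (M/N)(1 + (α/100)(S_max − 1)). It assumes that M is the number of pixels in a region and N is the number of gray levels; the function name, argument names, and example values are illustrative and do not come from the paper.

```python
def clahe_clip_limit(region_pixels, gray_levels, alpha, s_max):
    """Clip limit beta = (M / N) * (1 + (alpha / 100) * (s_max - 1)).

    region_pixels: M, number of pixels in one contextual region (assumed meaning)
    gray_levels:   N, number of gray levels (assumed meaning)
    alpha:         clipping factor in [0, 100]
    s_max:         maximum allowed slope of the mapping (>= 1)
    """
    if not 0 <= alpha <= 100:
        raise ValueError("alpha must lie in [0, 100]")
    if s_max < 1:
        raise ValueError("s_max must be at least 1")
    return (region_pixels / gray_levels) * (1 + (alpha / 100) * (s_max - 1))


# Hypothetical 64 x 64 region with 256 gray levels.
for alpha in (0, 50, 100):
    print(alpha, clahe_clip_limit(64 * 64, 256, alpha, s_max=4.0))
```

With α = 0 the limit equals M/N, the bin height of a perfectly flat histogram, which corresponds to a mapping slope of 1; with α = 100 the limit grows to S_max times that value.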

Proposed Graft-U-Net Model
Graft-U-Net is composed of encoder and decoder blocks. The encoder block contains five down-sample blocks (DSBs), which pass feature maps from one to the next up to the fifth DSB. In every DSB, grafting blocks are placed parallel to the conventional layers of the UNet encoder; hence the network, which is a modified form of UNet, is named Graft-U-Net. The decoder consists of five up-sampling blocks (USBs) that synthesize the information using skip connections. The encoder-decoder architecture of Graft-U-Net is depicted in Figure 3.

Each USB receives the information explored by the DSBs and synthesizes it to localize the disease by using a skip connection. The skip connections carry the early information from the encoder to the decoder block. Together, the USBs provide the disease location and also improve the model performance through advanced feature construction. Detailed explanations of the encoder (analysis) and decoder (synthesis) blocks are given in Sections 3.2.1 and 3.2.2, respectively.

Encoder DSB Blocks (Analysis Blocks)
The encoder block of the proposed Graft-U-Net consists of five DSBs. Each phase of the encoder block is distributed with two parallel networks, including the grafting layers network and a conventional network. Each network is created with different layers (convolution, batch normalization, and activation). The convolution layers provide a set of feature maps. Feature maps, after the activation of the layers of each network in every phase of the encoder block, are added and forwarded to the max-pooling layer. The sequence of the operation in the encoder block, with mathematical derivation, is defined as follows.
The size of the input frames is kept at 512 × 512 and the frames are provided to the network. The convolution operation is then performed on two inputs: a three-channel color image of size (n × n × c), where n is the spatial dimension and c is the number of channels, and a 3D volume filter (kernel) of size (f × f × c). The relationship between the input (images) and the output (feature maps) follows the standard convolution relation:
$$(n \times n \times c) * (f \times f \times c) \rightarrow \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right) \times \left(\left\lfloor \frac{n + 2p - f}{s} \right\rfloor + 1\right) \times k$$
where p is the padding, s is the stride, and k is the number of filters.
After the convolution operation, batch normalization (BN) is applied. BN computes the mean and variance of every feature over each mini-batch. The mean is subtracted from the features, which are then divided by the standard deviation [57]. The mean of the batch is represented mathematically as:
$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} f_i$$
where B = {f_1, f_2, …, f_m} is the mini-batch and f_i is a feature of the batch set. The variance of the mini-batch is represented as:
$$\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(f_i - \mu_B\right)^2$$
Then, the features are normalized as:
$$\hat{f}_i = \frac{f_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
where the constant ε ensures the numerical stability of the features. The activation function then sets negative feature values to zero. The mathematical equation of ReLU is given as:
$$f(x) = \max(0, x)$$
where x is the feature set of the frames. The complete set of features undergoes the convolution operation, BN, and the ReLU activation function, and is passed to the next convolution layer network, which is represented by the equation below:
$$\lambda = f\left(\alpha\left(w * x + b\right)\right)$$
where λ is the output feature set of the convolution layer network, α represents batch normalization, f is the activation function, w and x represent the weight and the input feature maps to the convolutional layers, respectively, and b represents the bias.
The feature set of the first convolution layer network is forwarded to the graft layer network, which is composed of a convolution layer, BN, and an activation layer. The information obtained from the graft layer network is presented below:
$$\vartheta = f\left(\beta\left(w * x + b\right)\right)$$
where ϑ is the output of the grafted convolution layer, β represents the batch normalization layer of the graft network, f is the activation function, w and x represent the weight and the input feature maps to the convolutional layers, respectively, and b is the bias of the neuron. The information collected from the graft layer network is added to that of the parallel convolution layer network, as presented in the equation below:
$$H(x) = \lambda + \vartheta$$
where H(x) is the feature map obtained after adding the feature maps of the two networks, namely the graft network and the parallel convolution layer network. After this addition, the feature set is passed to the max-pooling layer. The convolution operation decreases the resolution of the frames but increases the receptive field (context) information that is covered by the filter at any given time. Channel-wise attention is provided by the squeeze and excitation layers in a two-step approach. First, a max-pooling operation squeezes the n feature maps into feature vectors, where n is the feature map count. Next, the feed-forward network receives the global feature vector from the squeeze step; the features are reduced and then expanded to the original size n.
In the whole encoder block, the convolution, BN, activation, and max-pooling operations are performed, and the depth of the frame is increased. The convolution layers employ different numbers of filters (8, 16, 32, 48, and 64) with a filter size of 3 × 3. The number of filters is increased gradually from the upper to the lower blocks, which helps to explore more detailed features in the frames and supports an in-depth analysis of the polyp features. The encoder block provides overall context information, but not the actual location of the disease. To obtain the location, a decoder block is required that uses the skip connections to collect disease location information and to increase the resolution of the frames. Figure 4 depicts the visual information obtained from the different convolution layers (C1, C4, C7, C10, C13 in color, and C13 in grayscale) of the encoder block of Graft-U-Net using the Kvasir-SEG dataset.
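The sequence of convolution, batch normalization, ReLU, and branch addition described for one encoder stage can be sketched in a few lines. This is a deliberately simplified illustration, not the authors' implementation: each "convolution" is reduced to a scalar weight and bias applied to one channel of a mini-batch, both branches are assumed to receive the same input, and the names `branch` and `graft_stage` are hypothetical.

```python
import math


def batch_norm(features, eps=1e-5):
    """Normalize one feature over a mini-batch: (f - mean) / sqrt(var + eps)."""
    m = len(features)
    mean = sum(features) / m
    var = sum((f - mean) ** 2 for f in features) / m
    return [(f - mean) / math.sqrt(var + eps) for f in features]


def relu(features):
    """ReLU: negative values become zero, positive values pass through."""
    return [max(0.0, f) for f in features]


def branch(features, weight, bias):
    """Scalar stand-in for convolution (w * x + b), followed by BN and ReLU."""
    conv = [weight * f + bias for f in features]
    return relu(batch_norm(conv))


def graft_stage(features, conv_params, graft_params):
    """Element-wise sum of the conventional branch and the graft branch."""
    lam = branch(features, *conv_params)      # conventional convolution branch
    theta = branch(features, *graft_params)   # graft branch (same input assumed)
    return [a + b for a, b in zip(lam, theta)]


if __name__ == "__main__":
    mini_batch = [1.0, 2.0, 3.0]
    print(graft_stage(mini_batch, conv_params=(1.0, 0.0), graft_params=(2.0, 0.0)))
```

In a full network the summed output would then be passed to a max-pooling layer, and each branch would use 3 × 3 convolutions over multi-channel feature maps rather than scalar weights.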

Decoder USB Blocks (Synthesis blocks)
The decoder obtains the feature maps from the encoder and reconstructs the information on the polyp disease. The decoder contains five up-sampling blocks with 64, 48, 32, 16, and 8 filters. Each block contains several layers, including a convolution layer, batch normalization (BN), an activation layer, an up-sampling layer, and a concatenation layer (CNC), which are used for synthesizing the information using a skip connection. A layer-by-layer summary of the complete model is shown in Table 2, which uses the following notation: A, activation; C, convolution layer; BN, batch normalization; UP, up-sampling; MP, max pooling; CNC, concatenation layer.

In each phase of a USB, the convolution operation is performed and the feature maps are normalized by the BN layer. The activation function is then applied to the normalized feature maps. In the same way, the information of each frame is forwarded from one USB to the next up to the last convolution layer, as shown in Figure 3. The dimensionality of the feature maps is kept the same across layers in the decoder block so that the encoder feature maps can be combined at each stage through the skip connection. The skip connection supplies information that is lost due to the depth of the encoder network. It assists in better reconstruction of the semantic feature maps from the encoder, and the following residual block helps to learn the necessary features through repeated backpropagation. After the last convolution operation, the sigmoid activation function is applied, which provides the segmented frame as the final output of the Graft-U-Net model.
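The decoder steps named above (up-sampling, concatenation with the encoder features through a skip connection, and a final sigmoid) can be illustrated on tiny nested lists. The sketch below rests on several assumptions that the text does not confirm: nearest-neighbour interpolation with a factor of 2 for up-sampling, channel-wise stacking for the concatenation layer, and a 0.5 threshold to turn sigmoid outputs into a binary mask. All function names are hypothetical.

```python
import math


def upsample2x(feature_map):
    """Nearest-neighbour 2x up-sampling of one 2D feature map (list of rows)."""
    out = []
    for row in feature_map:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def concat_channels(decoder_maps, encoder_maps):
    """Skip connection: stack encoder channels after the decoder channels."""
    height, width = len(decoder_maps[0]), len(decoder_maps[0][0])
    for fmap in encoder_maps:
        if len(fmap) != height or len(fmap[0]) != width:
            raise ValueError("skip features must match the up-sampled size")
    return decoder_maps + encoder_maps


def sigmoid_mask(logits, threshold=0.5):
    """Apply the sigmoid to each value and threshold it to a binary mask."""
    return [[1 if 1.0 / (1.0 + math.exp(-v)) >= threshold else 0 for v in row]
            for row in logits]


if __name__ == "__main__":
    decoder_low = [[[0.5, -1.0], [2.0, 0.0]]]            # one 2 x 2 channel
    encoder_skip = [[[0.1] * 4 for _ in range(4)]]       # one 4 x 4 channel
    up = [upsample2x(fmap) for fmap in decoder_low]
    merged = concat_channels(up, encoder_skip)
    print(len(merged), "channels of size", len(merged[0]), "x", len(merged[0][0]))
    print(sigmoid_mask([[-2.0, 0.0, 3.0]]))
```

The size check in `concat_channels` reflects why the feature-map dimensions must agree between the decoder and the encoder at each stage before the skip features can be merged.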

Results and Discussion
This section describes the two datasets, the performance evaluation protocols, the visualizations of features, and the experimental setup, and reports the experiments on both datasets in detail. The training and testing of the model were performed on an NVIDIA GTX 1070 GPU, using a Core i5 machine with 8 GB of RAM running Windows 10. The Python Spyder IDE was used for compiling the results and evaluating the model.

Datasets
Medical-image analysis is a highly demanding task in which pixel-wise image segmentation is performed using medical-imaging datasets. Kvasir-SEG is an open-access, annotated medical-image dataset with corresponding segmentation masks. The Kvasir-SEG dataset [58], which consists of 1000 polyp images, was used for the evaluation of Graft-U-Net. The dataset was prepared by an expert endoscopist from Oslo University Hospital Norway (OUHN), and the original frames and their corresponding ground truth frames were verified by a qualified gastroenterologist. The size of the file containing the polyp frames is 46.2 MB. The resolution of the frames varies from 332 × 487 to 1920 × 1072 pixels across the dataset. The dataset is stored in two folders, one for the actual images and one for the ground truth images, and each ground truth frame has the same file name as its original image. The open-access CVC-ClinicDB dataset was also employed; it contains 612 images with a 384 × 288 resolution from 31 colonoscopy sequences [59]. CVC-ClinicDB is also composed of two folders, one for the original images and the other for the ground truth images (masks) that correspond to the polyp area covered in the original frame. A comprehensive summary of both datasets used with the proposed model is given in Table 3. Figure 5 illustrates samples of the original images with their corresponding ground truth images.
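Because both datasets store frames and masks in two folders with matching file names, pairing them can be sketched as follows. The folder names used in the example call are placeholders, not the datasets' actual directory layout, and the helper name `pair_frames_with_masks` is hypothetical.

```python
from pathlib import Path


def pair_frames_with_masks(image_dir, mask_dir):
    """Pair each frame with the mask of the same file name.

    Returns (pairs, missing): a list of (image_path, mask_path) tuples and
    a list of image file names for which no mask was found.
    """
    image_dir, mask_dir = Path(image_dir), Path(mask_dir)
    masks = {p.name: p for p in mask_dir.iterdir() if p.is_file()}
    pairs, missing = [], []
    for image in sorted(p for p in image_dir.iterdir() if p.is_file()):
        mask = masks.get(image.name)
        if mask is None:
            missing.append(image.name)
        else:
            pairs.append((image, mask))
    return pairs, missing


if __name__ == "__main__":
    # Placeholder folder names; adjust to wherever the dataset is unpacked.
    if Path("images").is_dir() and Path("masks").is_dir():
        pairs, missing = pair_frames_with_masks("images", "masks")
        print(len(pairs), "pairs,", len(missing), "frames without a mask")
```

Reporting the unmatched frames separately makes it easy to verify that every image has a ground truth mask before training starts.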

Performance Evaluation Measures
Standard computer vision measures for semantic segmentation were used to evaluate model performance on the Kvasir-SEG dataset: precision, recall, accuracy, mean Dice coefficient (mDice), mean Intersection over Union (mIoU), and F2-score.
Each evaluation measure provides specific information relevant to the experiment. A false positive (fp) is a prediction of the positive class when the actual class is negative; a true positive (tp) is a correct positive prediction; a false negative (fn) is a prediction of the negative class when the actual class is positive; and a true negative (tn) is a case in which both the actual class and the predicted class are negative. The F2 score, which is used in binary classification problems, measures test accuracy while weighting recall more heavily than precision.
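The measures above all follow from the four pixel counts tp, fp, fn, and tn. The sketch below uses the standard textbook definitions for a single pair of binary masks; it is not the authors' evaluation code, the function names are hypothetical, and how per-image values are averaged into mDice and mIoU is not specified in the text, so that step is left out.

```python
from itertools import chain


def confusion_counts(pred_mask, true_mask):
    """Count tp, fp, fn, tn over two binary masks given as 2D lists of 0/1."""
    tp = fp = fn = tn = 0
    pixels = zip(chain.from_iterable(pred_mask), chain.from_iterable(true_mask))
    for predicted, actual in pixels:
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif not predicted and actual:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (e.g. empty masks)."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(tp, fp, fn, tn, beta=2.0):
    """Standard definitions; beta=2 gives the F2 score (recall weighted higher)."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    b2 = beta * beta
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),
        "dice": _ratio(2 * tp, 2 * tp + fp + fn),
        "iou": _ratio(tp, tp + fp + fn),
        "f2": _ratio((1 + b2) * precision * recall, b2 * precision + recall),
    }


# Toy 2 x 4 masks, invented for illustration only.
pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
truth = [[1, 0, 0, 0], [0, 1, 1, 0]]
counts = confusion_counts(pred, truth)
metrics = segmentation_metrics(*counts)
```

For real masks an array library would normally replace the explicit pixel loop; plain lists are used here only to keep the sketch dependency-free.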

Experiment 1: Results of Kvasir-SEG Dataset Using Graft-U-Net
The experiment was performed on the Kvasir-SEG dataset, with the samples split into 70% for training and 30% for testing. The model was trained for 40 epochs using the 1000 frames of the dataset. The model achieved an mIoU of 82.45%, an mDice of 96.61%, an F2 score of 95.25%, a precision of 99.11%, a recall of 94.33%, and an accuracy of 85.11%. These results of the Graft-U-Net model on the Kvasir-SEG dataset are displayed in Figure 6.
The mIoU of Graft-U-Net (0.8245) is compared with that of the models from previous studies: UNet (0.4334), ResUNet (0.4364), and ResUNet++ (0.7927). The mIoU comparison is depicted in Figure 8.
The recall of Graft-U-Net (0.9433) is compared with that of the models from previous studies: UNet (0.6306), ResUNet (0.5041), and ResUNet++ (0.7064). The outcome shows that recall is improved on the Kvasir-SEG dataset. The recall comparison is presented as a table and a graph in Figure 9.
Figure 11 shows the results compiled on the Kvasir-SEG dataset: the input (original) images, the corresponding ground truth masks, the model-predicted output masks, and a combined view in which the ground truth mask is drawn with a blue outline and the predicted mask with a red outline.
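The 70%/30% division of the 1000 Kvasir-SEG frames used in this experiment can be sketched as follows. The text does not say whether the split was random or which seed was used, so the shuffling, the seed value, and the function name are assumptions made purely for illustration.

```python
import random


def split_train_test(items, train_fraction=0.7, seed=0):
    """Shuffle items reproducibly and split them into train and test lists."""
    items = list(items)
    rng = random.Random(seed)  # assumed seed; the paper does not report one
    rng.shuffle(items)
    n_train = round(train_fraction * len(items))
    return items[:n_train], items[n_train:]


# Frame indices stand in for the 1000 image/mask pairs of the dataset.
train_ids, test_ids = split_train_test(range(1000))
```

With 1000 frames this yields 700 training and 300 testing frames, and fixing the seed keeps the two subsets identical across runs.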

Experiment 2: Results of the CVC-ClinicDB Dataset Using Graft-U-Net
An in-depth performance analysis and additional experiments were performed for automatic polyp segmentation. The CVC-ClinicDB dataset is regarded as one on which good performance can make the model clinically acceptable. The mDice of Graft-U-Net (0.8995) is compared with that of the models from previous studies: UNet (0.6419), ResUNet (0.4511), and ResUNet++ (0.7955). The analysis shows that the proposed model provides improved results in terms of mDice on the CVC-ClinicDB dataset. Figure 12 presents the mDice results.
The mIoU of Graft-U-Net (0.8138) is compared with that of the models from previous studies: UNet (0.4711), ResUNet (0.4571), and ResUNet++ (0.7962). The comparison shown in Figure 13 indicates that the model performs better in terms of mIoU on the CVC-ClinicDB dataset.
Figure 16 shows the results compiled on the CVC-ClinicDB dataset: the input (original) images, the corresponding ground truth masks, the model-predicted output masks, and a combined view in which the ground truth mask is drawn with a blue outline and the predicted mask with a red outline.

Discussion
Semantic segmentation is a crucial technique for polyp detection in frames of the GI tract. Deep learning plays a decisive role in the computer vision field for feature learning using CNN techniques. The main challenges arise in data acquisition: the appearance and texture of the polyps vary even under the same lighting conditions, and the angular view varies under different lighting conditions. Graft-U-Net, the method proposed in this manuscript for polyp segmentation, addresses these challenges effectively. The proposed model comprises two main blocks (encoder and decoder), with a graft network proposed in the encoder block, as shown in Figure 3. The encoder analyzes the information in the frames, while the decoder synthesizes the visual information using skip connections. On the Kvasir-SEG dataset, Graft-U-Net achieves an mDice of 96.61%, an mIoU of 82.45%, a precision of 99.11%, and a recall of 94.33%, consistent with the values reported for Experiment 1. Similarly, on the CVC-ClinicDB dataset, the model achieves an mDice of 89.95%, an mIoU of 81.38%, a precision of 87.85%, and a recall of 92.11%. Consequently, the algorithm can be made more general by using small-sized polyps for semantic segmentation. In this regard, Graft-U-Net is proposed to handle small polyps and shape information, and to incorporate artifacts separately to improve the model's overall efficiency.

Conclusions and Future Work
The proposed Graft-U-Net model performs semantic segmentation better than the existing models with which it was compared. The model achieves accurate segmentation of colorectal polyps on the two polyp datasets described in the manuscript. During the preprocessing phase, the CLAHE technique is used to enhance the contrast of the frames of the Kvasir-SEG dataset. The proposed Graft-U-Net model is composed of encoder and decoder blocks built from five DSBs and five USBs. A graft network is proposed in each DSB of the encoder to obtain better feature maps. The decoder block constructs the feature maps used to locate the mask, which is the polyp area covered in the original frame. The proposed model achieves an mDice of 96.61%, an mIoU of 82.45%, a precision of 99.11%, and a recall of 94.33% on the Kvasir-SEG dataset, consistent with the values reported for Experiment 1; similarly, on the CVC-ClinicDB dataset, the model achieves an mDice of 89.95%, an mIoU of 81.38%, a precision of 87.85%, and a recall of 92.11%. These results are compared with those of the existing models UNet, ResUNet, and ResUNet++.
In future work, the encoder block can be replaced by models such as ResNet, VGG, InceptionNet, and AlexNet. The proposed model can serve as a strong baseline for further exploration toward a useful technique that helps to achieve the goal of generalizability.