Localization and Classification of Gastrointestinal Tract Disorders Using Explainable AI from Endoscopic Images

Abstract: Globally, gastrointestinal (GI) tract diseases are on the rise. If left untreated, these diseases can be fatal. Early discovery and categorization of these diseases can reduce their severity and save lives. Automated procedures are necessary, since manual detection and categorization are laborious, time-consuming, and prone to mistakes. In this work, we present an automated system for the localization and classification of GI diseases from endoscopic images with the help of an encoder-decoder-based model, XceptionNet, and explainable artificial intelligence (AI). Data augmentation is performed at the preprocessing stage, followed by segmentation using an encoder-decoder-based model. Later, contours are drawn around the diseased area based on segmented regions. Finally, classification is performed on segmented images by well-known classifiers, and results are generated for various train-to-test ratios for performance analysis. For segmentation, the proposed model achieved 82.08% dice, 90.30% mIOU, 94.35% precision, and 85.97% recall rate. The best performing classifier achieved 98.32% accuracy, 96.13% recall, and 99.68% precision using the softmax classifier. Comparison with the state-of-the-art techniques shows that the proposed model performed well on all the reported performance metrics. We explain this improvement in performance by utilizing heat maps with and without the proposed technique.


Introduction
GI tract diseases are disorders related to the digestive system. Diagnoses of these diseases are highly dependent on medical imaging. Processing large volumes of visual data is difficult for medical professionals and radiologists, which makes the process susceptible to incorrect medical evaluation [1]. The most common diseases that occur in the digestive system are ulcerative colitis, ulcers, esophagitis, and polyps, which can transform into colorectal cancer. These diseases are key causes of mortality around the globe [2].
According to a survey on colorectal cancer for the year 2019, 26% of men, as well as 11% of women, around the globe are diagnosed with this cancer [3]. In 2021, more than 0.3 million cases of colorectal cancer were diagnosed in the US, and the death toll rose to 44% [4]. Roughly 0.7 million new instances of diseases are reported each year worldwide [5]. Alongside GI malignancies [6,7], ulcer development in the GI tract is also a significant illness. The authors of [8] reported that the highest annual prevalence of ulcers was 141 per 1000 people in Spain, and the lowest was around 57 in Sweden.
During a routine endoscopic checkup, many lesions are missed due to factors like the presence of stool and the organ's complex topology. Although the bowel is cleansed to improve the detection of cancer or its precursor lesions, the ratio of missed polyps remains high, at 21.4-26.8% [9]. Moreover, the interclass

• Development of an encoder-decoder-based model for segmentation and localization of diseases.
• Development of an explainable AI-based model that is utilized for the classification of endoscopic images with contours into four main diseases.
• Development of an efficient and robust framework having better accuracy, precision, and recall rate.
The remainder of this paper is structured as follows: related works are presented in Section 2. The methodological specifics are then provided in Section 3. The experimental data are included in Section 4. In Section 5, discussion and analysis of the experimental results are presented. Finally, Section 6 concludes the paper.

Literature Review
In the past few years, the detection of diseases using medical imaging has been a hot area of research, especially in the domain of the gastrointestinal tract. The segmentation of polyps, in particular, has been the major focus because of the availability of ground truths. Furthermore, the classification of gastrointestinal diseases has also been an active area of research. The performance of machine learning (ML) algorithms reported in the literature has been quite impressive [17,18], but deep learning algorithms surpass the ML approaches and achieve better results [19].
For the detection of GI tract diseases, numerous studies are available in the literature that use ML methods. For example, in [17], the authors developed an ML model based on a longitudinal training cohort of over 20 thousand patients undergoing treatment for peptic ulcers between the years 2007 and 2016. Their greatest accuracies were 82.6% and 83.3% using logistic regression and ridge regression, respectively. Sen Wang et al. [18] established an ML architecture for ulcer diagnosis and performed experimentation on a privately developed dataset of 1504 WCE videos. The effectiveness of this technique was evaluated using the ROC curve and the AUC, and achieved a peak value of 0.9235. In a different work [13], Jinn-Yi Yeh et al. used color characteristics and a WCE image collection to identify bleeding and ulcers. They used texture information in addition to combining all the image attributes into a single matrix. Several classifiers, including SVM, neural networks, and decision trees, were applied to this feature matrix. Various performance metrics were included for examination, and the accuracy ranged from 92.86% to 93.64%.
It has been observed that deep learning (DL) models generally performed better in detecting GI tract diseases. The authors of [20] developed a CNN-based VGGNet model to detect GI ulcers, with a dataset of 854 images, and achieved 86.6% accuracy. However, these tests took place using conventional endoscopy images. In [21], the authors developed a CNN-based DL model; the dataset consisted of 5360 images containing ulcers and erosions, and contained merely 450 normal class images. The method achieved 90.8% detection accuracy. Sekuboyina and co-authors, in [22], proposed CNN-based models to detect different forms of diseases in WCE images, like ulcers, and more. They developed multiple subsections of images and applied the DL model. This experiment attained 71% sensitivity and 72% specificity.
Apart from the classification techniques, researchers also proposed segmentation techniques for the detection of the precursor disease of colorectal cancer. A fully convolutional network (FCN) was proposed in [23], which is trained end to end, pixel by pixel, and yields the segmentation of polyps. No extra postprocessing procedures are needed for the suggested model, which is the major contribution of that research. In another paper [24], the authors discussed and enhanced the FCN network and named it the U-Net architecture. The U-Net model achieved good results for localization. Furthermore, many researchers have tried to modify and enhance the U-Net architecture to achieve better segmentation and localization results [25-27], but on medical images, these variants are either not evaluated or do not provide better results. By maximizing the characteristics gleaned from two pre-trained models, the authors of [28] established a framework for gastrointestinal illness categorization and achieved 96.43% accuracy. In another framework [29], MobileNet-V2 is used for the multiclass classification of gastrointestinal illnesses, and a contrast enhancement approach was suggested.
Based on the literature, it can be said that sufficient related work has been performed in the field of GI tract disease detection and classification. The presented results show reasonable performance in terms of accuracy. However, performance can be improved. Accuracy is an important performance metric; however, for multiclass classification problems, accuracy is less informative than other performance metrics, especially when there is an imbalance in the dataset. In particular, we would like to emphasize that precision and recall rate are important performance measures for life-critical applications. Most of the presented works have reasonable accuracy, but they suffer from lower precision and recall rate, and require improvement.
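The point about accuracy under class imbalance can be illustrated with a small worked example. The following is a minimal sketch, not taken from any of the cited works; the confusion-matrix counts are hypothetical and were chosen only to show how a high accuracy can coexist with a low recall rate.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical imbalanced test set: 50 diseased images among 1000.
# The classifier finds only 20 of the 50 and raises 5 false alarms.
acc, prec, rec = binary_metrics(tp=20, fp=5, fn=30, tn=945)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
```

In this made-up case, the accuracy is above 96% even though more than half of the diseased images are missed, which is why recall and precision are reported alongside accuracy.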
A review of the existing work also highlights that most of the work on GI tract diseases has been conducted on datasets that are not publicly available. This makes it hard to generalize the results and compare the performance. Furthermore, researchers mostly focused on single disease detection and binary classification [14-16]. The focus of our work is to conduct experiments on publicly available datasets and target the multiclass classification of GI tract diseases like polyps, ulcers, ulcerative colitis, and esophagitis. Also, the suggested strategy has significantly improved the performance across practically all indicators.

Methodology
Various diseases can affect the human GI tract, such as colorectal cancer and its precursor conditions, like polyps, as well as other diseases, such as ulcers, esophagitis, and ulcerative colitis, to name a few. To diagnose such diseases, traditional endoscopic images or WCE images are needed and play a vital role. Artificial intelligence-based methods like DL have proved to be helpful for the diagnosis of such diseases. Therefore, in this paper, we have developed a DL-based model for segmentation as well as multiclass classification of GI tract diseases. The core aim of our research is to put forward a DL model based on the segmented images. This approach is used for the detection of multiple GI tract diseases, and hence reduces the time doctors spend diagnosing manually or using a separate application for each malady.
In our proposed methodology, we undertake the five steps shown in Figure 1. As a first step, we acquired the publicly available datasets, namely, the Kvasir-Seg, Kvasir V-2, and Hyper-Kvasir datasets. After that, the dataset was enlarged by applying data augmentation using multiple transformations. Subsequently, segmentation was performed using U-Net, an encoder-decoder-based model, with ResNet-34 as a backbone, and then, contours were drawn around the diseased area. In the second-last step, heat maps were generated to compare and analyze the model's performance on segmented and non-segmented images. In the last step, images with contours around the diseased area were used as the input of the Xception model for feature extraction, and multiple classifiers were applied for classification.

Dataset Collection and Preparation
The Kvasir-Seg [30] dataset was utilized for segmentation, and the Kvasir-V2 [31] and Hyper-Kvasir [32] datasets were utilized for classification. Our dataset contains four diseases, i.e., ulcers, polyps, esophagitis, and ulcerative colitis, as well as a normal class, with 1000 instances for each class other than ulcers, which has only 854 instances. As a result, our dataset consists of 4854 images divided into five classes: ulcerative colitis, polyps, ulcers, esophagitis, and normal. For segmentation, the Kvasir-Seg dataset is used, which contains 1000 images of the polyp class with their ground truths.
Initially, the segmentation results were collected based on the Kvasir-Seg dataset, and then this method was applied to all other diseased images, and classification was performed.

Preprocessing
DL models require more data to train on than ML models; otherwise, they start overfitting the data and lack generalization. Hence, augmentation was performed after the dataset was initially collected to enhance its size. Moreover, data augmentation is a very powerful technique used to reduce the validation error along with the training error [33]. The main transformations applied during data augmentation are rotation, width shifting, height shifting, horizontal/vertical flip, and zoom-in/out. The total dataset size after applying data augmentation increased to 30,000 images, with 6000 images for each class. The images generated after applying data augmentation are shown in Figure 2.
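To make the idea of augmentation concrete, the following is a minimal pure-Python sketch of three of the simpler transformations (horizontal flip, vertical flip, and a 90-degree rotation) applied to a tiny image stored as nested lists. It is an illustration under stated assumptions, not the pipeline used in this work: the function names are hypothetical, and the shifts, zooms, and arbitrary-angle rotations mentioned above are left out because they require interpolation.

```python
def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(img):
    """Return the original image followed by its transformed variants."""
    return [img, hflip(img), vflip(img), rot90(img)]


# A 2 x 2 toy "image"; real endoscopic frames would be much larger arrays.
tiny = [[1, 2],
        [3, 4]]
variants = augment(tiny)
for v in variants:
    print(v)
```

Each source image yields several variants in this way, which is how a collection of a few thousand images can be grown to a much larger training set.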

Segmentation
Segmentation of the diseased region was performed using the U-Net model. U-Net is a CNN-based segmentation model that was proposed in 2015 for biomedical images [23]. It has one encoder module and one decoder module. Figure 3 depicts the U-Net model's architecture. In the encoder module, two convolutional (3 × 3) layers are applied repeatedly with a stride of one. A ReLU layer and a 2 × 2 max-pooling layer with two and four strides follow each convolutional layer. A dropout layer is applied following the first convolutional layer. The bottom layers consist of 3 × 3 convolutional layers. The decoding part up-samples the image to its original dimensions by applying two convolutional (3 × 3) layers. The first layer is followed by ReLU and a dropout layer, and the next convolutional layer is followed by the ReLU layer only. The top layer, which is also the last layer, is a convolutional (1 × 1) layer. The encoder part is used for the extraction of features, and is similar to the VGG-16 model [34]. The up-sampling operation combines both low-resolution and high-resolution information, which provides object-based recognition as well as accurate positioning and segmentation, and is useful for medical image segmentation [34]. As the backbone of U-Net, we utilized the ResNet-34 model, which was observed to outperform other segmentation models [35]. The U-Net model outputs a black-and-white image mask, which was then used to draw contours around the diseased area of an image.
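The step from a binary mask to a contour can be sketched as follows. This is a minimal pure-Python illustration, assuming the mask is a nested list of 0/1 values; the function name `boundary_pixels` is hypothetical, and an image-processing library would normally be used for this step. A foreground pixel is treated as part of the contour when at least one of its four neighbours is background or lies outside the image.

```python
def boundary_pixels(mask):
    """Return (row, col) of foreground pixels that touch background or the image edge."""
    h, w = len(mask), len(mask[0])
    contour = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            # Check the 4-connected neighbourhood of this foreground pixel.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = nr < 0 or nr >= h or nc < 0 or nc >= w
                if outside or not mask[nr][nc]:
                    contour.append((r, c))
                    break
    return contour


# Toy mask with a 3 x 3 "lesion"; only its centre pixel is interior.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
contour = boundary_pixels(mask)
print(contour)
```

Drawing the returned pixel coordinates onto the original image gives an outline of the segmented region.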
During model training, the Adam optimizer was applied. Owing to its strong results and adaptive learning rate, the Adam optimizer is frequently used by researchers for CNNs [37], and root mean squared error (RMSE) is used as a loss function. The model was trained for a total of 250 epochs with a batch size of 50.
The Adam optimizer controls the gradient descent rate so that there is minimal fluctuation near the global optimum, while taking steps large enough near local optima to escape them and reach the global minimum efficiently. Adam combines the features of two gradient descent techniques, namely, momentum and root mean squared propagation (RMSP). Mathematical equations of momentum and RMSP are expressed as follows:
$$R_t = \sigma_1 R_{t-1} + (1 - \sigma_1)\left[\frac{\delta S}{\delta \forall_t}\right]$$
$$\phi_t = \sigma_2 \phi_{t-1} + (1 - \sigma_2)\left[\frac{\delta S}{\delta \forall_t}\right]^2$$
where $R_t$ is the gradient aggregate at time $t$, $\delta S$ is the derivative of the loss function, $\delta \forall_t$ is the derivative of the weights at $t$, $\sigma_1$ and $\sigma_2$ are the moving-average parameters, and $\phi_t$ is the sum of the squares of past gradients. Initially, both $R_t$ and $\phi_t$ are set to zero, and it is observed that both tend to be biased towards zero when $\sigma_1$ and $\sigma_2$ are set close to one. The Adam optimizer solves this problem by calculating bias-corrected values $\hat{R}_t$ and $\hat{\phi}_t$. Mathematical equations of these bias-corrected values are expressed as follows:
$$\hat{R}_t = \frac{R_t}{1 - \sigma_1^t}, \qquad \hat{\phi}_t = \frac{\phi_t}{1 - \sigma_2^t}$$
After each iteration, the new positions of the weights, obtained by substituting the updated values, are given as follows:
$$\forall_{t+1} = \forall_t - \exists\,\frac{\hat{R}_t}{\sqrt{\hat{\phi}_t} + \mu}$$
where $\forall_t$ is a weight at time $t$, $\exists$ is the learning rate, and $\mu$ is a small constant.
Mean squared error (MSE) is the average of the squares of the errors, that is, of the squared differences between the actual values and the estimates; RMSE is its square root. Mathematically, the equation of mean squared error is expressed as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$
where $y_i$ is the original valuation, $\hat{y}_i$ is the anticipated valuation of the model, and $n$ is the number of samples.
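The update rule described above can be traced with a short script. The following is a minimal sketch of an Adam-style update on a single scalar weight, together with the MSE computation; the variable names mirror the symbols in the text, while the toy loss, the learning rate, the moving-average parameters, and the number of steps are assumptions chosen for illustration and are not the settings used in this work.

```python
import math


def adam_minimize(grad, w0, lr=0.1, sigma1=0.9, sigma2=0.999, mu=1e-8, steps=500):
    """Run Adam-style updates on one scalar weight and return its final value."""
    w, R, phi = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        R = sigma1 * R + (1 - sigma1) * g            # momentum term
        phi = sigma2 * phi + (1 - sigma2) * g * g    # RMSP term
        R_hat = R / (1 - sigma1 ** t)                # bias correction
        phi_hat = phi / (1 - sigma2 ** t)
        w -= lr * R_hat / (math.sqrt(phi_hat) + mu)  # weight update
    return w


def mse(actual, predicted):
    """Mean of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Toy loss S(w) = (w - 3)^2, whose gradient is 2 * (w - 3); the minimum is at w = 3.
w_final = adam_minimize(lambda w: 2 * (w - 3), w0=0.0)
print(f"final weight: {w_final:.4f}")
print(f"MSE: {mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]):.4f}")
```

The bias-correction lines matter mostly in the first iterations, when `R` and `phi` are still close to their zero initial values.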

Heat Maps
Explainable artificial intelligence (XAI) in medical imaging is a set of techniques and approaches that enable medical experts to understand the process by which artificial intelligence models reach a judgment about disease. The gradient-weighted class activation map (Grad-CAM) is a tool created in 2017 that produces an explanation for each type of CNN model [38,39]. The Grad-CAM result is a heat map for the predicted label.
Heat maps of images were generated before segmentation and after segmentation for the analysis of the diseased area in an image. The magnitude with which the model highlights an area is called activation, and we display it using the Jet color map. Violet highlights the lowest-magnitude area, and red represents the highest-magnitude area. The process of heat map generation is shown in Figure 4.
Grad-CAM works by examining the gradient information flowing into the last convolutional layer. In our case, we applied the transfer learning concept and used the pre-trained Xception model, as it provides the best heat maps, and is therefore used for classification as well. The image and its heat map before and after segmentation are shown in Figure 5. It is apparent from Figure 5 that after segmentation, the model is more focused, and looks exactly at the diseased region as the high-magnitude area; therefore, we used images with contours drawn around the diseased area for classification.
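The Grad-CAM computation itself is compact once the feature maps of the last convolutional layer and the gradients of the class score with respect to them are available. The following is a minimal pure-Python sketch under that assumption; in practice both inputs would be obtained from a deep learning framework, and the function name `grad_cam` and the toy values are hypothetical. Each channel is weighted by the spatial average of its gradients, the weighted maps are summed, negative values are clipped, and the result is scaled to the range from 0 to 1 before a color map is applied.

```python
def grad_cam(activations, gradients):
    """Combine K feature maps (each H x W) into one heat map normalized to [0, 1]."""
    h, w = len(activations[0]), len(activations[0][0])
    # Channel weights: global average of the gradients of the class score.
    weights = [sum(sum(row) for row in g) / (h * w) for g in gradients]
    heat = [[0.0] * w for _ in range(h)]
    for a_k, feature_map in zip(weights, activations):
        for r in range(h):
            for c in range(w):
                heat[r][c] += a_k * feature_map[r][c]
    # ReLU keeps only the regions that support the predicted class.
    heat = [[max(v, 0.0) for v in row] for row in heat]
    peak = max(max(row) for row in heat)
    if peak > 0:
        heat = [[v / peak for v in row] for row in heat]
    return heat


# Two toy 2 x 2 channels: the first supports the class, the second opposes it.
activations = [[[1.0, 0.0], [0.0, 0.0]],
               [[0.0, 0.0], [0.0, 2.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heat = grad_cam(activations, gradients)
print(heat)
```

The resulting map is usually resized to the input resolution and overlaid on the image, which gives the kind of before-and-after comparison discussed above.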

Features Extraction and Classification
As a final step, the Xception model was fine-tuned, and multiple classifiers were applied to predict the true labels. In our proposed model, the transfer learning approach is utilized, as it performs better than training completely from scratch [40,41]. The Xception model, which was pre-trained on the ImageNet dataset, was used and fine-tuned on our dataset by applying a dropout layer with 0.4 probability. The input of the Xception model is the images with contours, and the output is the features. These features are used for classification by applying multiple classifiers, like softmax, linear SVM, quadratic SVM, and Bayesian.
The Xception model is based on CNN with depth-wise separable convolutional layers. This model has 36 convolutional layers that are arranged into 14 modules. In simple terms, the Xception model is a depth-wise separable CNN with residual connections. The architecture of Xception is shown in Figure 6. The authors of [42] showed through experimentation that Xception outperforms other CNN models like VGG-16, ResNet-152, and Inception V3 on the ImageNet dataset.
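The benefit of depth-wise separable convolutions can be seen from a simple parameter count. The sketch below compares a standard convolution with a depth-wise separable one for a single layer; the kernel size and channel counts are illustrative assumptions rather than values taken from a specific Xception layer, and bias terms are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depth-wise k x k convolution followed by a 1 x 1 point-wise one."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes the channels
    return depthwise + pointwise


k, c_in, c_out = 3, 128, 256
std = standard_conv_params(k, c_in, c_out)
sep = separable_conv_params(k, c_in, c_out)
print(f"standard: {std}, separable: {sep}, ratio: {std / sep:.2f}")
```

For these assumed sizes, the separable layer needs roughly one-ninth of the weights of the standard layer, which is the efficiency that the Xception design builds on.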
For experimentation, Python was used, and other settings are shown here in order to reproduce the results.During model training, the Adam optimizer was applied.Because of its outstanding outcomes and adaptable learning gain, the Adam optimizer is frequently used by researchers for CNNs [37].Categorical cross-entropy (CCE) was also employed as a loss function.During the training of the DL model, the loss function determines the difference between the original class and the anticipated class.It also adjusts the weights of the CNN to produce a better-fitting model [43].The set batch size was 50 and the model was trained on 250 epochs.

CCE loss measures how far apart two discrete probability distributions are. The mathematical equation of this loss is as follows:

$$\mathrm{CCE} = -\sum_{i} S_i \log \hat{S}_i$$

where $S_i$ is the original valuation, and $\hat{S}_i$ is the anticipated valuation of the model.
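As a small worked illustration of the CCE loss, the sketch below evaluates the standard categorical cross-entropy for a single sample. The function name, the clipping constant `eps`, and the example probabilities are assumptions made for illustration; they do not come from the training code used in this work.

```python
import math


def categorical_cross_entropy(true_dist, pred_dist, eps=1e-12):
    """Return -sum_i S_i * log(S_hat_i) for one sample.

    true_dist: original (e.g. one-hot) class distribution.
    pred_dist: anticipated class probabilities from the model.
    eps:       lower bound on predictions to avoid log(0) (assumption).
    """
    if len(true_dist) != len(pred_dist):
        raise ValueError("distributions must have the same length")
    return -sum(s * math.log(max(p, eps)) for s, p in zip(true_dist, pred_dist))


if __name__ == "__main__":
    # Hypothetical three-class example: the true class is the second one.
    original = [0.0, 1.0, 0.0]
    anticipated = [0.1, 0.8, 0.1]
    print(categorical_cross_entropy(original, anticipated))
```

With a one-hot original distribution, only the predicted probability of the true class contributes, so the loss reduces to the negative logarithm of that probability.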

Results
The results of our proposed model are compiled separately for segmentation and classification in Sections 4.1 and 4.2, respectively. The evaluation metrics used are dice, mIOU, precision, recall, and accuracy. Python 3.10, Matplotlib 3.6.2, PyTorch 1.12.0, and Keras 2.11.0 are the primary tools and libraries used for experimentation. The Adam optimizer and CCE are used, and the entire framework is developed on a machine with a 4 GB NVIDIA Tesla graphics card and 32 GB of RAM. The model was trained for 250 epochs with a fixed batch size of 50.

Segmentation Results
Segmentation results for colorectal cancer and its precursor lesions can be evaluated using different measures. The evaluation depends strongly on the detection rate and on the ratio of diseased pixels to total pixels. To check the effectiveness of segmentation using U-Net with ResNet-34 as a backbone model, we performed a set of experiments on the Kvasir-Seg dataset. The performance measures used to check the efficiency of segmentation are dice, mIOU, precision, and recall.
Dice, which is also known as the overlap measure, is the most frequently used measure for evaluating and testing the effectiveness of medical image segmentation [44]. The overlap region between the predicted segmented image and the ground truth is doubled, and the result is divided by the total number of pixels in both images. mIOU, the mean intersection over union, is also commonly used to evaluate medical image segmentation. IOU is calculated as the overlap between the anticipated segmentation and the ground truth divided by the area of their union. Mean IOU is calculated by taking the IOU of each label and averaging them. Precision measures the quality of the positive predictions, that is, the fraction of pixels predicted as positive that are truly positive. Recall measures the fraction of positive points in the ground truth that are predicted as positive by the model. The mathematical equations of these performance measures are provided as Equations (8)-(11).

$$\mathrm{Dice} = \frac{2M}{2M + N + O} \qquad (8)$$

$$\mathrm{IOU} = \frac{M}{M + N + O} \qquad (9)$$

$$\mathrm{Precision} = \frac{M}{M + N} \qquad (10)$$

$$\mathrm{Recall} = \frac{M}{M + O} \qquad (11)$$

where M is defined as true-positive, N is defined as false-positive, O is defined as false-negative, and P is defined as true-negative.
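The measures described above can be sketched for a single pair of binary masks as follows. This is a minimal illustration, assuming masks stored as nested lists of 0/1 values; the helper names and the toy masks are hypothetical, and the IOU is computed for the foreground label only (mIOU would average this value over all labels).

```python
def confusion_counts(pred_mask, true_mask):
    """Count M (true-positive), N (false-positive), O (false-negative), P (true-negative)."""
    m = n = o = p = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for pred, true in zip(pred_row, true_row):
            if pred and true:
                m += 1
            elif pred and not true:
                n += 1
            elif true:
                o += 1
            else:
                p += 1
    return m, n, o, p


def segmentation_metrics(pred_mask, true_mask):
    """Return dice, IOU, precision, and recall for one binary mask pair."""
    m, n, o, _ = confusion_counts(pred_mask, true_mask)
    # When a denominator is zero (no positives at all), report a perfect score.
    dice = 2 * m / (2 * m + n + o) if (2 * m + n + o) else 1.0
    iou = m / (m + n + o) if (m + n + o) else 1.0
    precision = m / (m + n) if (m + n) else 1.0
    recall = m / (m + o) if (m + o) else 1.0
    return {"dice": dice, "iou": iou, "precision": precision, "recall": recall}


if __name__ == "__main__":
    # Toy 3 x 3 masks (hypothetical, for illustration only).
    truth = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
    predicted = [[0, 1, 1], [0, 1, 0], [0, 1, 0]]
    print(segmentation_metrics(predicted, truth))
```

Note that the true-negative count P does not enter any of the four measures, which is why they remain informative when the diseased region covers only a small part of the image.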
For the segmentation results, the K-fold cross-validation technique was applied with K fixed to 10, as it is evident from research that the model performs better when K is 10 [40]. After we applied U-Net with ResNet-34 as a backbone model on our dataset, the model achieved a 0.9030 mIOU score, 0.8208 dice score, 0.9435 precision, and 0.8597 recall score. Table 1 compares the quantitative findings based on the Kvasir-Seg dataset using several segmentation methods. Qualitative results of segmentation and localization of polyps based on the Kvasir-Seg dataset are shown in Figure 7. Comparison with the ground truth shows that the segmentation results generated by U-Net with the ResNet-34 model as a backbone are up to the mark. Furthermore, the results show that the model detected the large diseased area and produced high-quality masks at a similar location but with a slightly different shape. The same segmentation model is applied to all other classes, like ulcer, polyp, ulcerative colitis, and esophagitis, for drawing contours around the diseased area and passing these images on for classification.
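The K-fold procedure mentioned above can be sketched as an index split. This is a minimal illustration, assuming a shuffled, non-stratified split with a fixed seed; the function name and these choices are assumptions, since the exact splitting strategy is not specified here.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation.

    Every sample appears in the test set of exactly one fold.
    Shuffling with a fixed seed is an assumption made for reproducibility.
    """
    if k < 2 or k > n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for fold, (train_idx, test_idx) in enumerate(k_fold_indices(25, k=10)):
        print(f"fold {fold}: {len(train_idx)} train, {len(test_idx)} test")
```

The reported score is then the average of the per-fold scores, so each image contributes to the test result exactly once.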

Classification Results
We evaluated the performance of classification on a fine-tuned Xception model using different performance measures, namely, precision, recall, and accuracy. In medical applications, high recall is the main concern, so that no disease case is treated as normal. Precision and recall were already discussed for segmentation, and their equations are given in (10) and (11); therefore, only accuracy is discussed in this section. Accuracy is the fraction of correct predictions among all predictions. The equation of accuracy is shown below:

$$\mathrm{Accuracy} = \frac{M + P}{M + N + O + P}$$

where M is defined as true-positive, N is defined as false-positive, O is defined as false-negative, and P is defined as true-negative.
Classification results were collected by distributing the dataset into various train-to-test ratios, namely, 80/20, 70/30, and 60/40, and by using 10-fold cross-validation. Initially, results were collected on input images with no contours using 10-fold cross-validation, to serve as a baseline for comparison. It is evident from Table 2 that the softmax classifier outperforms the other classifiers, with 89.62% precision, 78.25% recall, and 81.06% accuracy. However, quadratic SVM performance cannot be overlooked, as it is near to that of softmax.
On an 80/20 ratio, the best results achieved using the softmax classifier are 87.67% precision, 80.13% recall, and 85.27% accuracy. Results achieved by applying multiple classifiers using our proposed model are shown in Table 3. It is evident from Table 3 that quadratic SVM performance is also satisfactory and near to that of softmax. Moreover, the testing accuracy graph based on the model trained for the 80/20 train-to-test ratio is shown in Figure 8. The confusion matrix generated based on the 80/20 ratio using the softmax classifier is shown in Figure 9. Looking at the results, it is clear that in the ulcer class, four of the cases are shown as normal, and in the esophagitis class, three of the cases are treated as normal, which is a warning sign. Moreover, six of the normal cases are treated as disease cases.
On a 70/30 ratio, the best results achieved using the softmax classifier are 96.94% precision, 93.22% recall, and 94.68% accuracy. Results achieved by applying multiple classifiers using our proposed model are shown in Table 4. It is evident from Table 4 that after softmax, Bayesian performance is better than that of the other classifiers. Moreover, the testing accuracy graph on the model trained with the 70/30 train-to-test ratio is shown in Figure 10. The confusion matrix generated based on the 70/30 ratio using the softmax classifier is shown in Figure 11. Looking at the results, it is clear that with this ratio, the model performs much better, and only one disease case is treated as normal, which is in the esophagitis class. Moreover, only four of the normal cases are treated as disease cases.
On a 60/40 ratio, the best results obtained using the softmax classifier are 82.56% precision, 73.69% recall, and 78.06% accuracy. Results achieved by applying multiple classifiers using our proposed model are shown in Table 5. It is evident from Table 5 that quadratic SVM performance is also satisfactory and near to that of softmax. Moreover, the testing accuracy graph on the model trained with the 60/40 train-to-test ratio is shown in Figure 12. The confusion matrix generated based on the 60/40 ratio using the softmax classifier is shown in Figure 13. Looking at the results, it is clear that with this ratio, model performance worsens, as three polyp cases, six ulcer cases, six ulcerative colitis cases, and ten cases of esophagitis were predicted as non-diseased. Moreover, it is also a great concern that 20 of the normal cases were treated as disease cases. We believe that the behavior of the model worsened because the training data were reduced; hence, the model was not properly tuned.
For 10-fold cross-validation, the best results achieved using the softmax classifier are 99.68% precision, 96.13% recall, and 98.32% accuracy. Results achieved by applying multiple classifiers using our proposed model are shown in Table 6. It is evident from Table 6 that the quadratic SVM as well as Bayesian performances are satisfactory, and cannot be ignored. Moreover, the testing accuracy graph based on the model trained with 10-fold cross-validation is depicted in Figure 14. The confusion matrix produced by 10-fold cross-validation with the softmax classifier is shown in Figure 15. Looking at the results, it is clear that the model performs significantly better, and no disease case is treated as normal, which is the prime focus in medical applications. Hence, we achieved our desired performance using the proposed methodology. Moreover, only one of the normal cases is treated as a disease case. After analyzing the results, it is clear that the best results are achieved with 10-fold cross-validation, which shows the reliability of our model; hence, we select it as our proposed model.
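The way the confusion matrices are read in this section (overall accuracy, disease cases treated as normal, and normal cases treated as disease) can be sketched as follows. The helper names, the three-class label set, and the toy matrix are hypothetical and serve only as an illustration; they are not the values shown in the figures. Rows are assumed to hold the true class and columns the predicted class.

```python
def accuracy(matrix):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def disease_as_normal(matrix, labels, normal_label="normal"):
    """Per disease class, count the cases that were predicted as normal."""
    col = labels.index(normal_label)
    return {
        label: matrix[row][col]
        for row, label in enumerate(labels)
        if label != normal_label
    }


def normal_as_disease(matrix, labels, normal_label="normal"):
    """Count the normal cases that were predicted as any disease class."""
    row = labels.index(normal_label)
    return sum(value for col, value in enumerate(matrix[row]) if col != row)


if __name__ == "__main__":
    # Toy matrix (hypothetical): rows = true class, columns = predicted class.
    labels = ["normal", "polyp", "ulcer"]
    matrix = [
        [18, 1, 1],
        [0, 19, 1],
        [2, 0, 18],
    ]
    print("accuracy:", accuracy(matrix))
    print("disease treated as normal:", disease_as_normal(matrix, labels))
    print("normal treated as disease:", normal_as_disease(matrix, labels))
```

Separating the two error types matters in a medical setting, because a disease case treated as normal is usually more costly than a normal case flagged as disease.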

Discussion
This section focuses on the analysis of the proposed methodology and its effectiveness, along with its limitations. Better segmentation and heat maps contribute towards improved classification accuracy, precision, and recall. The dataset is split into various train-to-test ratios in order to ensure that no bias exists, and that samples are actual representatives of the dataset. If we analyze the results in Tables 3-6, it is clear that when the training data are reduced to 60%, the accuracy is reduced drastically, and the model treated 25 disease cases as normal, which means that the model does not generalize well when the training data are reduced. Moreover, better results on 10-fold cross-validation indicate a balance between the bias and variance of the model. It is also evident from the results that there is a significant improvement in precision and recall rate, along with accuracy, which is also an indication of robustness. Upon analysis of the confusion matrix presented in Figure 15, a clear indication of better performance in the case of diseased data is observed, as no disease case is treated as normal. The proposed framework includes numerous significant steps, and the classification results were improved most by using the images with contours, which indicates the significance of this step. The performance improvement between original and contour images can be observed by comparing the results presented in Tables 2 and 6. Accuracy improves by 17.26 percentage points (from 81.06% to 98.32%) for images with contours. This step highlights the boundary region of disease in an image, which in turn improves the classification outcomes. Moreover, heat maps also reveal that when the segmentation is performed, the model is more focused on the diseased area. Overall, the performance of the model in terms of false-negative rate (no diseased instance is classified as normal) with the 10-fold cross-validation technique demonstrates the robustness of our proposed methodology.
The proposed methodology outperformed the cutting-edge methods; however, there are certain limitations that need to be addressed in future studies. For instance, this study does not take into account the contrast and brightness issues of the endoscopic images. Moreover, neither the influence of training several models nor the optimization of features was taken into account in this study, either of which might lead to better results.
Finally, we also present a comparison with cutting-edge methods. Table 7 shows that the suggested model outperforms the state-of-the-art methods in terms of accuracy.

Conclusions
The manual detection and classification of GI diseases is a challenging task; therefore, an automated system is needed for improved results. In this work, we proposed a DL-based architecture to accurately segment and classify GI diseases. The main idea is to perform localization using an encoder-decoder-based segmentation technique and draw contours around the diseased area of an image. Furthermore, heat maps are generated using Grad-CAM for unsegmented and segmented images to visualize the high-magnitude region within an image. The images with contours are then used for classification using a deep learning-based model. Segmentation performance is evaluated using dice, mIOU, precision, and recall. For segmentation, our proposed model achieved 82.08% dice, 90.30% mIOU, 94.35% precision, and 85.97% recall. For classification, we reported effectiveness in terms of accuracy, precision, and recall rate. The proposed model achieved 98.32% accuracy, 96.13% recall, and 99.68% precision using the softmax classifier. Our findings show that the presented model did not treat any disease case as normal, which is crucial when human life is involved. Although the proposed model achieved better results as compared to the existing state-of-the-art techniques, several interesting questions need to be researched in the future. For instance, the effects of contrast enhancement and illumination variation were not considered in this research. These preprocessing steps will be the focus of future work, as they highlight the region of interest, and may result in improved performance. Furthermore, we plan to assess the performance of the proposed method in diverse domains, such as those mentioned in references [49][50][51][52][53].

Figure 1. Proposed methodology of localization and classification of GI tract disorders.

Figure 4. Heat map generation process for an endoscopic image.

Figure 5. Heat map visualization of input image before and after segmentation using the Xception-Net model.

Figure 7. Qualitative findings based on the Kvasir-SEG dataset after applying U-Net with ResNet-34 as the backbone.

Figure 8. Testing accuracy graph based on the 80/20 train-to-test ratio.

Figure 10. Testing accuracy graph based on the 70/30 train-to-test ratio.

Figure 12. Testing accuracy graph based on the 60/40 train-to-test ratio.
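weights at t, $m_t$ is the moving average of the gradients, and $v_t$ is the sum of the squares of past gradients. Initially, both $m_t$ and $v_t$ are set to zero, and it is observed that they tend to be biased towards zero as $\beta_1$ and $\beta_2$ are set close to one. The Adam optimizer deals with this problem by calculating the bias-corrected $\hat{m}_t$ as well as $\hat{v}_t$. The mathematical equations of these bias-corrected values are expressed as follows:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$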

Table 1. Quantitative findings based on the Kvasir-Seg dataset.

Table 2. Performance metrics based on images without contours using 10-fold cross-validation.

Table 3. Performance metrics based on the 80/20 train-to-test ratio.

Table 4. Performance metrics based on the 70/30 train-to-test ratio.

Table 5. Performance metrics based on the 60/40 train-to-test ratio.

Table 7. Proposed model comparison with other approaches.