Artificial Intelligence-Based Classification of Multiple Gastrointestinal Diseases Using Endoscopy Videos for Clinical Diagnosis

Various techniques using artificial intelligence (AI) have contributed significantly to the field of medical image and video-based diagnosis, such as radiology, pathology, and endoscopy, including the classification of gastrointestinal (GI) diseases. Most previous studies on the classification of GI diseases use only spatial features, which yield low performance in the classification of multiple GI diseases. Although a few previous studies used temporal features based on a three-dimensional convolutional neural network, they covered only a specific part of the GI tract with a limited number of classes. To overcome these problems, we propose a comprehensive AI-based framework for the classification of multiple GI diseases using endoscopic videos, which simultaneously extracts both spatial and temporal features to achieve better classification performance. Two different residual networks and a long short-term memory model are integrated in a cascaded mode to extract spatial and temporal features, respectively. Experiments were conducted on a combined dataset that forms one of the largest endoscopic video collections, with 52,471 frames. The results demonstrate the effectiveness of the proposed classification framework for multiple GI diseases. The proposed model (97.057% area under the curve) outperforms the state-of-the-art methods, which indicates its potential for clinical applications.


Introduction
Different types of gastrointestinal (GI) diseases, such as colorectal cancer and tumors, are a leading cause of death in the USA [1]. According to the American Cancer Society, approximately 76,940 people lost their lives in 2016 owing to different types of cancers in the GI tract [1]. The effective diagnosis of such GI diseases is a tedious and time-consuming task. Most small GI lesions remain imperceptible during the early stages and can ultimately evolve into a fatal ailment. Therefore, it is essential to develop computerized approaches that can assist physicians in effective diagnosis and treatment. Accordingly, substantial effort has been devoted over the past few decades to developing artificial intelligence (AI)-based computer-aided diagnosis (CAD) tools and applications in various medical fields [2][3][4]. These fields include the detection of brain tumors [5], classification of different types of skin cancers, diagnosis in radiation oncology, diabetic retinopathy, histologic classification of gastric biopsy, and endoscopy [6][7][8][9][10][11][12][13][14][15].
In the field of endoscopy, the recent AI-based CAD tools utilize the strength of deep learning (a set of advanced machine learning algorithms) for the analysis of various types of endoscopic scans.

Related Works
In recent years, the strength of deep learning-based algorithms has been utilized in the field of endoscopy, including capsule endoscopy (CE), esophagogastroduodenoscopy (EGD), and colonoscopy [6][7][8][9][10][11][12][13][14][15]. To assist physicians in the effective diagnosis of different GI lesions, several CNN-based CAD tools have been proposed in the literature. These CAD tools are capable of detecting and classifying even small lesions in the GI tract, which often remain imperceptible to the human visual system. Before the advent of deep learning methods, many studies focused on handcrafted feature-based methods, which mainly consider texture and color information.
Most of the previous studies addressed the detection and classification of different types of GI polyps in the field of CE. Generally, these methods followed a common approach of feature extraction followed by classification to detect and classify the GI polyps. In [19], Karargyris et al. proposed a method based on geometric and texture features for the detection of small bowel polyps and ulcers in CE. Log Gabor filters and the SUSAN edge detector were used to preprocess the images and, finally, the geometric features were extracted to detect the polyp and ulcer regions. Li et al. [20] utilized the advantages of a discrete wavelet transform and uniform local binary pattern (LBP) with a support vector machine (SVM) to classify normal and abnormal tissues. In this feature extraction approach, the wavelet transform provides multiresolution analysis and the uniform LBP provides robustness to illumination changes, which together result in better performance.
Similarly, another texture feature-based automatic tumor recognition framework was proposed in [6] for wireless CE images. In this framework, a similar integrated approach based on LBP and the discrete wavelet transform was adopted to extract scale- and rotation-invariant texture features. Finally, the selected features were classified by using an SVM. Yuan et al. [21] proposed an integrated polyp detection algorithm by combining the Bag of Features (BoF) method with a saliency map. In the first step, the BoF method characterizes the local features by using scale-invariant feature transform (SIFT) feature vectors with k-means clustering. Then, saliency features were obtained by generating a saliency map histogram. Finally, both BoF and saliency features were fed into the SVM to perform classification. Later, Yuan et al. [22] extended this approach with the addition of LBP, uniform LBP (ULBP), complete LBP (CLBP), and histogram of oriented gradients (HoG) features along with SIFT features to capture more discriminative texture information. Finally, these features were classified by using SVM and Fisher's linear discriminant analysis (FLDA) classifiers by considering different combinations of local features. The combination of SIFT and CLBP features with the SVM classifier resulted in the top classification accuracy.
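To make the LBP and uniform LBP descriptors used by these handcrafted methods concrete, the following minimal sketch computes the basic 8-neighbor LBP code of one pixel and tests whether a code is "uniform" (at most two 0/1 transitions in its circular bit string). It illustrates the generic operator only; the function names, the neighbor ordering, and the toy patch are assumptions for illustration and are not taken from [6], [20], or [22].

```python
def lbp_code(img, r, c):
    """Basic 8-neighbour LBP code of pixel (r, c) in a 2D list of grey levels."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def is_uniform(code, bits=8):
    """True if the circular bit string has at most two 0/1 transitions."""
    transitions = 0
    for i in range(bits):
        a = (code >> i) & 1
        b = (code >> ((i + 1) % bits)) & 1
        transitions += int(a != b)
    return transitions <= 2


# Toy 3 x 3 patch: bright pixels above and to the right, dark elsewhere.
patch = [[9, 9, 9],
         [1, 5, 9],
         [1, 1, 1]]
code = lbp_code(patch, 1, 1)
print(code, is_uniform(code))
```

Because each bit depends only on the sign of the difference to the center pixel, adding a constant brightness offset to the whole patch leaves the code unchanged, which is the illumination robustness these studies rely on.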
Seguí et al. presented a deep CNN system for small intestine motility characterization [7]. This CNN-based method exploited the general representation of six different intestinal motility events by extracting deep features, which resulted in superior classification performance when compared to the other handcrafted feature-based methods. Another CNN-based CAD tool was presented in [15] to quantitatively analyze celiac disease in a fully automated manner by using CE videos. This method utilized a well-known CNN model (i.e., GoogLeNet) to distinguish between normal and abnormal (i.e., diagnosed with celiac disease) patients. Thus, the effective characterization of celiac disease resulted in better diagnosis and treatment when compared to the manual analysis of CE videos. In [12], a multistage deep CNN-based framework for hookworm (i.e., intestinal parasite) detection was proposed using CE images. Two different CNNs, named the edge extraction network and the hookworm classification network, were unified to simultaneously characterize the visual and tubular patterns of hookworms.
In the field of EGD, a deep learning-based CAD tool was proposed for the diagnosis of Helicobacter pylori (H. pylori) infection [9]. In this framework, two-stage CNN models were used. In the first stage, a 22-layer deep CNN was fine-tuned for the classification (i.e., positive or negative) of H. pylori infection. Then, in the second stage, another CNN was used to further classify the dataset (EGD images) according to eight different anatomical locations. Similarly, Takiyama et al. proposed another CNN-based classification model to categorize the anatomical location of the human GI tract [8]. This technique could categorize the EGD images into four major anatomical locations (i.e., larynx, esophagus, stomach, and duodenum) and three subcategories for the stomach images (upper, middle, and lower regions). A pretrained CNN architecture, GoogLeNet, was used for this classification problem and demonstrated high classification performance. In a recent study by Hirasawa et al. [13], a fully automated diagnostic tool for gastric cancer was proposed by utilizing the detection capability of deep CNN-based architectures. A single-shot multibox detector (SSD) architecture was used to detect early and advanced stages of gastric cancer from EGD images. The method demonstrated substantial detection capability even for small lesions when compared to the conventional methods. The results of this study illustrated its practical usability in clinical practice for better diagnosis and treatment. However, it had certain limitations, as only high-quality EGD images from the same type of endoscope and endoscopic video system could be used.
Generally, deep learning-based methods tend to suffer from either over-fitting or under-fitting owing to the large number of network parameters and the limited amount of data available in the training dataset. This problem degrades the system performance in a real-world scenario. A similar problem also occurs in the domain of medical image analysis owing to the unavailability of a sufficiently large training dataset. To address this issue, a transfer learning mechanism is often adopted in this domain. In the field of colonoscopy, Zhang et al. [10] used this approach for automatic detection and classification of colorectal polyps. A novel transfer learning approach was applied to train two different CNN models on the source domain (i.e., nonmedical dataset), and then fine-tuning was performed on the target domain (i.e., medical dataset). Their method performed the polyp detection and classification tasks in two different stages. In the first stage, an image of interest (i.e., polyp image) was selected by using the CNN-based polyp detection model. In the second stage, another CNN model was used to categorize the detected polyp image as either a hyperplastic polyp or an adenomatous colorectal polyp. The results of this study demonstrated that the CNN-based diagnoses achieved a higher accuracy and recall rate than endoscopist diagnoses. However, their method is not applicable to real-time colonoscopy image analysis owing to the use of multistage CNN models.
Another study by Byrne et al. [14] presented a single deep CNN-based real-time colorectal polyp classification framework using colonoscopy video images. In this study, a simple CNN model was trained to classify each input frame into one of four different categories, i.e., hyperplastic polyp, adenomatous polyp, no polyp, or unsuitable. The end-to-end processing time of this CNN model was 50 ms per frame, which makes it applicable to the real-time classification of polyps. In another study [11], an offline and online three-dimensional (3D) deep CNN framework was proposed for automatic polyp detection. Two different 3D-CNNs, named the offline 3D-CNN and the online 3D-CNN, were used simultaneously to exploit a more general representation of features for effective polyp detection. In this framework, the offline 3D-CNN effectively reduced the number of false positives, whereas the online 3D-CNN was used to further improve the polyp detection. The experimental results showed that the 3D fully convolutional network was capable of learning more representative spatiotemporal features from colonoscopy videos in comparison with the handcrafted or two-dimensional (2D) CNN feature-based methods.
Endoscopy is a direct imaging modality that captures the internal structure of the human GI tract in the form of videos rather than still images. Therefore, it is possible to extract both spatial and temporal information from endoscopic data to enhance the diagnostic capability of different deep CNN-based CAD tools. Most of the previous studies considered only the spatial information for the classification and detection of different GI diseases without considering the temporal information. The loss of temporal information affects the overall performance of the CAD tools. In addition, the maximum number of classes handled in the previous studies was limited to eight [9], and those studies considered only a limited set of GI diseases, such as tumors or cancer.
To address these issues in previous research, we considered 37 different categories in our proposed work, which included both normal and diseased cases related to different parts of the human GI tract. We proposed a novel two-stage deep learning-based framework to enhance the classification performance for different GI diseases by considering both spatial and temporal information. Two different models, ResNet and LSTM, were trained separately to extract the spatial and temporal features, respectively. In Table 1, the strengths and weaknesses of previous studies and our proposed method are summarized.

Contribution
This is the first approach towards the classification of multiple GI diseases that includes 37 different categories related to normal and diseased cases while considering different parts of the human GI tract. The major contributions of this study can be summarized in the following five ways when compared to the previous methods.
(1) To the best of our knowledge, this is the first approach to develop a comprehensive deep learning-based framework for the classification of multiple GI diseases by considering deep spatiotemporal features. In contrast, most of the previous studies [6][7][8][9][10][11][12][13][14][15] considered a limited number of classes related to a specific portion of the GI tract.
(2) We proposed a novel cascaded ResNet and LSTM-based framework in the medical domain to learn both spatial and temporal features for different types of GI diseases. When compared to the previous methods based on handcrafted features and simple 2D-CNNs, our method can manage the large intraclass and low interclass variations among multiple classes more effectively.
(3) We analyzed in depth the performance of our proposed method by selecting multilevel spatial features for the LSTM from different layers of the ResNet network. Furthermore, the performance of the multilevel spatial features was also analyzed by applying principal component analysis (PCA).
(4) We compared the performance of various state-of-the-art CNN models and different handcrafted feature-based approaches. Our analysis was more detailed than that of previous studies [8,9], which provided only a limited performance analysis for a small number of classes related to a specific GI part.
(5) Finally, we have ensured that our trained model and the video indices of the experimental endoscopic videos are publicly available through [18] so that other researchers can evaluate and compare its performance.

Proposed Method
This section presents our proposed method for the classification of multiple GI diseases, including the CNN architecture for the extraction of spatial features, the LSTM-based network for the extraction of temporal features, and finally, the classification portion comprising fully connected (FC) layers.

Overview of the Proposed Approach
The conventional image or video classification framework comprises two main stages, known as the feature extraction stage and the classification stage. There are also certain other preprocessing steps, such as image resizing or batch normalization (BN), to adjust the dataset according to the network compatibility. A brief flowchart of our method for the classification of multiple GI diseases based on deep spatiotemporal features is illustrated in Figure 1. In the first preprocessing step, the size of each endoscopic video frame was adjusted to 224 × 224 × 3 (according to the input layer size of the CNN model). In the next steps, we used a cascaded CNN and LSTM-based deep network to extract the spatial and temporal features, respectively, from the resized sequence of frames. Using the CNN model, a sequence of spatial feature vectors was extracted, which was subsequently input to the LSTM for the extraction of temporal features. The final output of the LSTM is a single feature vector that contains both the spatial and temporal information for each given sequence of frames. In the last step, the extracted spatiotemporal feature vector was classified by assigning the given video sequence to one of 37 different categories (i.e., the normal and diseased cases related to the human GI tract).
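The data flow described above (per-frame spatial features, a temporal model that reduces the sequence to one vector, and a final classifier) can be summarized by the following minimal sketch. The three callables are hypothetical stand-ins rather than the trained ResNet18, LSTM, and FC layers; only the order of the stages follows the text.

```python
def classify_sequence(frames, spatial_net, temporal_net, classifier):
    """Return the predicted class index for one sequence of frames."""
    # Stage 1: one spatial feature vector per (already resized) frame.
    features = [spatial_net(frame) for frame in frames]
    # Stage 2: the temporal model reduces the whole sequence to a single vector.
    summary = temporal_net(features)
    # Stage 3: class scores for the spatiotemporal vector; pick the largest.
    scores = classifier(summary)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-ins (assumptions for illustration only).
def mean_feature(frame):
    return [sum(frame) / len(frame)]


def sequence_mean(features):
    return [sum(f[0] for f in features) / len(features)]


def two_class_scores(vector):
    return [vector[0], 3.0 - vector[0]]


frames = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
label = classify_sequence(frames, mean_feature, sequence_mean, two_class_scores)
print(label)
```

Keeping the stages as separate callables mirrors the cascaded design, in which the spatial and temporal models are trained separately and then chained.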

Structure of Our Proposed Model
Our proposed classification framework consists of cascaded CNN and LSTM-based deep networks with the capability to classify video data based on spatiotemporal features. The primary advantage of our network is its capability to categorize a variable-length sequence of n successive images (i.e., I_1, I_2, I_3, ..., I_n) with a significant performance gain. For example, the use of more successive images results in better classification performance. In addition, our cascaded deep learning model demonstrated high performance in comparison with CNN-only models. This is because CNN models extract only spatial information by processing each input image independently, rather than considering both spatial and temporal features in the case of a video dataset. Owing to the loss of temporal information in a CNN model, the overall classification performance deteriorates. To overcome the limitation of previous spatial feature-based methods in the medical domain, our study included a variant of the recurrent neural network (RNN), named LSTM, along with the conventional CNN model to enhance the classification performance. The overall structure of our proposed classification framework is shown in Figure 2. The complete framework comprises three different stages, i.e., spatial feature extraction, temporal feature extraction, and finally, the classification stage. In each stage, a specific set of deep learning procedures was applied to the given input sequence of endoscopic frames. Thus, the final class label was predicted for the input sequence from among 37 different categories of GI diseases. The detailed explanation of each stage is presented in the subsequent sections.


Spatial Features Extraction using a Convolutional Neural Network
The first stage of our proposed classification framework included a deep CNN model named ResNet18 [23], which was used for spatial feature extraction from each input frame. The primary reasons for selecting ResNet18 [23] were its high classification accuracy and its favorable number of learnable parameters when compared to the other state-of-the-art deep CNN models [16][24][25][26][27]. In a later section, the experimental results quantitatively illustrate the significance of our selected ResNet18 model when compared to the other models.
The complete structure of the spatial feature extraction model is illustrated in Figure 2. The entire network consists of multiple residual units, which can be considered as the basic building block. These residual units are categorized into two different types based on the type of shortcut connectivity (i.e., 1 × 1 convolutional-mapping-based shortcut connectivity and identity-mapping-based shortcut connectivity) [23]. The shortcut connectivity in an identity-mapping-based residual unit maintains the depth of the previous feature map without any modification, whereas the shortcut connectivity in the 1 × 1 convolutional-mapping-based residual unit increases the depth of the previous feature map by applying the 1 × 1 convolution. Moreover, in each residual unit, there are two convolutional layers with a filter size of 3 × 3 in sequential order. These filters contain the learnable parameters, which are optimized during the training procedure. ResNet18 consists of a total of eight residual units, including five identity-mapping-based residual units and three 1 × 1 convolutional-mapping-based residual units, as shown in Figure 3. The use of more identity-mapping-based residual units results in better performance in terms of computational complexity and training time. In addition, both types of residual units result in smoother information propagation in both forward and backward directions [28].
The layer-wise structural details are further explained in Table 2, which demonstrates the flow of information processing by the different layers of ResNet18 in sequential order. In general, the convolutional and FC layers are the main components of a conventional CNN model, which are used for feature extraction and classification, respectively. There are also certain other layers without learnable parameters, such as a rectified linear unit (ReLU) layer, softmax, max pooling, average pooling, and a classification layer. Our selected ResNet18 model primarily contains a total of eighteen layers, of which seventeen are convolutional layers and one is an FC layer. These layers encompass the learnable parameters (i.e., filter coefficients and biases), which are optimized through the training procedure. Each convolutional layer is followed by a BN layer (it normalizes the feature map of each channel) and then a ReLU layer (it performs a threshold operation).
The first convolutional layer (i.e., Conv1) of our selected model generates an output feature map X_1 of size 112 × 112 × 64 by applying 64 different filters of size 7 × 7 × 3 over the given input image X. After Conv1, the next max pooling layer further processes the output feature map X_1 by applying a filter of 3 × 3 pixels and generates a down-sampled feature map X_2 of size 56 × 56 × 64.
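The feature map sizes quoted for Conv1 and the max pooling layer can be checked with the usual output-size formula for strided convolution and pooling. The sketch below is illustrative only: the stride and padding values are assumptions (the customary ResNet18 settings), since they come from Table 2, which is not reproduced in this section.

```python
def out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


# Conv1: 7 x 7 filters; stride 2 and padding 3 are assumed values.
conv1 = out_size(224, kernel=7, stride=2, padding=3)
# Max pooling: 3 x 3 window; stride 2 and padding 1 are assumed values.
pool1 = out_size(conv1, kernel=3, stride=2, padding=1)
# Number of filter coefficients in Conv1: 64 filters of size 7 x 7 x 3.
conv1_weights = 64 * 7 * 7 * 3

print(conv1, pool1, conv1_weights)
```

Under these assumptions the computed sizes agree with the 112 × 112 and 56 × 56 maps stated in the text.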
The feature map X_2 is then passed through the first identity-mapping-based residual unit, which applies two convolution filters (Conv2-1 and Conv2-2) in sequential order and generates an intermediate feature map f(X_2, W_2). Finally, the output feature map X_3 of size 56 × 56 × 64 is generated by adding X_2 and f(X_2, W_2), i.e., X_3 = X_2 + f(X_2, W_2). The second identity-mapping-based residual unit performs a similar operation and converts the feature map X_3 to a new feature map X_4. The next 1 × 1 convolutional-mapping-based residual unit further processes the feature map X_4 by applying two convolution filters (Conv4-1 and Conv4-2) in sequential order and generates the first intermediate output feature map f(X_4, W_4). Meanwhile, a 1 × 1 convolution filter (Conv4-3) converts the feature map X_4 to the second intermediate output feature map h(X_4, W_4). Finally, the output feature map X_5 is obtained by adding both intermediate feature maps, i.e., X_5 = f(X_4, W_4) + h(X_4, W_4). Similarly, all the successive residual units process the output feature map of the previous residual unit in the same way by using a different number of filters with different sizes and stride values, as listed in Table 2. Finally, the feature vector x of size 1 × 1 × 512 is obtained by applying the average pooling layer with a filter size of 7 × 7 pixels over the last output feature map X_10 (i.e., the output of the last convolutional layer). In this way, a set of n feature vectors {x_1, x_2, x_3, ..., x_n} is obtained by processing all the successive images (I_1, I_2, I_3, ..., I_n). These extracted feature vectors are further used as the input to the LSTM network for temporal feature extraction. The remaining three layers (i.e., FC, softmax, and classification layer) only participate in the training procedure. Therefore, after completing the training process, the output feature vector is taken after the average pooling layer, rather than from the final classification layer, for further temporal feature extraction and classification.
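The two kinds of residual unit differ only in the shortcut path. The following minimal sketch shows this at a single spatial position, where a feature map reduces to a channel vector and a 1 × 1 convolution reduces to a matrix acting on that vector. The placeholder functions standing in for the two 3 × 3 convolutions, and all numeric values, are assumptions for illustration; BN and ReLU are omitted.

```python
def matvec(matrix, vector):
    """Matrix-vector product; models a 1 x 1 convolution at one position."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def add(u, v):
    """Element-wise sum; both paths must have the same depth."""
    assert len(u) == len(v), "shortcut and residual paths must match in depth"
    return [a + b for a, b in zip(u, v)]


def identity_unit(x, f):
    # Identity shortcut: output = x + f(x); the depth is unchanged.
    return add(x, f(x))


def projection_unit(x, f, w_1x1):
    # 1 x 1 convolutional shortcut: output = f(x) + h(x), with h(x) = w_1x1 . x,
    # which lets the unit change (here, increase) the depth.
    return add(f(x), matvec(w_1x1, x))


x = [1.0, 2.0]  # toy two-channel input


def f_same(v):  # placeholder residual path that keeps the depth at 2
    return [0.5 * a for a in v]


def f_wide(v):  # placeholder residual path that raises the depth to 4
    return [v[0], v[1], v[0] + v[1], 0.0]


w_1x1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]  # maps 2 to 4 channels

same_depth = identity_unit(x, f_same)
wider = projection_unit(x, f_wide, w_1x1)
print(same_depth, wider)
```

The depth check in `add` makes explicit why an identity shortcut cannot be used when the residual path changes the number of channels.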

Temporal Features Extraction by Long Short-term Memory Model
In the second stage, LSTM, a variant of the RNN model [29], was used to exploit the temporal information from the set of n feature vectors that were extracted in the first stage by using ResNet18. The structure of the LSTM consists of n LSTM cells [30]. Figure 2 (Stage 2) illustrates the flow of the n feature vectors (x_1, x_2, x_3, ..., x_n) through the multiple LSTM cells. In the figure, h_n and c_n denote the output (also known as the hidden state) and the cell state at time step n, respectively. The hidden state h_n contains the output of the LSTM cell for time step n, and the cell state c_n holds the information learned from all the previous time steps (i.e., 1 to n − 1). The first LSTM cell (at time step n = 1) uses the initial state of the network (h_0, c_0) and the input feature vector x_1 to compute the first output h_1 and the updated cell state c_1. At time step n (where n ≠ 1), the LSTM cell uses the current state of the network (h_{n−1}, c_{n−1}) and the input feature vector x_n to calculate the output h_n and the updated cell state c_n. Thus, the temporal information is exploited in the LSTM stage by using all the spatial feature vectors.
The basic structure of a standard LSTM cell is shown in Figure 4, which illustrates the flow of data at time step n. In general, four components, named the input gate (i_n), forget gate (f_n), cell candidate (g_n), and output gate (o_n), are responsible for controlling the state information at time step n. The input gate i_n controls the level of the cell state update, whereas the forget gate f_n controls the level of the cell state reset. The cell candidate g_n adds information to the cell state and, finally, the output gate o_n controls the level of the cell state added to the hidden state. Based on these components, the complete structure of the cell is divided into three gates, named the forget, input, and output gates, as highlighted in Figure 4.

Furthermore, three types of learnable parameters, termed the input weights (W), recurrent weights (R), and bias (b), are included in the LSTM cell and are responsible for learning the temporal information after sufficient training. These learnable parameters (W, R, b) and cell components (i_n, f_n, g_n, o_n) are used to calculate the cell state c_n and the output h_n at time step n. The following mathematical computations are performed to determine the state information and cell components:

$$
\begin{aligned}
c_n &= f_n \odot c_{n-1} + i_n \odot g_n \\
h_n &= o_n \odot \tanh(c_n) \\
i_n &= \sigma(W_i x_n + R_i h_{n-1} + b_i) \\
f_n &= \sigma(W_f x_n + R_f h_{n-1} + b_f) \\
g_n &= \tanh(W_g x_n + R_g h_{n-1} + b_g) \\
o_n &= \sigma(W_o x_n + R_o h_{n-1} + b_o)
\end{aligned}
$$

where $\odot$ denotes element-wise multiplication and tanh is the hyperbolic tangent function, calculated as tanh(x) = (e^x − e^{−x})/(e^x + e^{−x}), which is used as the state activation function. The function σ is the sigmoid function, calculated as σ(x) = (1 + e^{−x})^{−1}, which is used as the gate activation function.
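One time step of a standard LSTM cell, built from the four components (i, f, g, o) and the three parameter types (W, R, b) named in the text, can be sketched in plain Python as follows. This is an illustrative sketch of the standard formulation, not the trained model: the parameter layout (a dictionary of (W, R, b) triples), the zero initial state, and the toy sizes are assumptions.

```python
import math


def sigmoid(x):
    # Gate activation: sigma(x) = 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, R, b, x, h_prev):
    """Compute W.x + R.h_prev + b, one entry per hidden unit."""
    return [
        sum(w * xi for w, xi in zip(W[k], x))
        + sum(r * hi for r, hi in zip(R[k], h_prev))
        + b[k]
        for k in range(len(b))
    ]


def lstm_step(params, x, h_prev, c_prev):
    """One LSTM time step; params maps 'i', 'f', 'g', 'o' to (W, R, b)."""
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]    # forget gate
    g = [math.tanh(z) for z in affine(*params["g"], x, h_prev)]  # cell candidate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]    # output gate
    # Cell state: keep part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Hidden state: gated, squashed view of the cell state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(params, xs, hidden_size):
    """Feed x_1..x_n through the cell and return the last hidden state h_n."""
    h = [0.0] * hidden_size  # assumed zero initial state h_0
    c = [0.0] * hidden_size  # assumed zero initial state c_0
    for x in xs:
        h, c = lstm_step(params, x, h, c)
    return h


# Toy example with input size 1 and hidden size 1 (illustrative values only).
zero = ([[0.0]], [[0.0]], [0.0])
params = {"i": zero, "f": zero, "o": zero, "g": ([[1.0]], [[0.0]], [0.0])}
h_last = run_lstm(params, [[1.0]], hidden_size=1)
print(h_last)
```

Because the same `params` are reused at every step, the loop in `run_lstm` accepts sequences of any length, which matches the variable-length input described for the framework.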
In the first stage, ResNet18 processed the sequence of $n$ successive images (i.e., $I_1, I_2, I_3, \ldots, I_n$) in sequential order to extract the spatial features. Then, the LSTM model processed all the spatial feature vectors (a set of $n$ feature vectors $\{x_1, x_2, x_3, \ldots, x_n\}$) in a parallel fashion in the second stage. Therefore, the feature accumulation block, as shown in Figure 2, is used to accumulate all the spatial feature vectors (obtained from ResNet18 in the first stage) before inputting them to the LSTM model in the second stage. The layer-wise structural details of our proposed LSTM model are listed in Table 3. The final output of the LSTM model contains both the spatial and temporal information, and it is followed by a stack of FC layers that performs the final classification.
In the final classification stage, the output $h_n$ of the LSTM cell at the last time step $n$ is selected as the final output feature vector rather than using all the outputs (i.e., $h_1, h_2, h_3, \ldots, h_n$). Then, a stack consisting of FC, softmax, and classification layers is used to perform the final classification, as shown in Figure 2. The output of the last LSTM cell is followed by an FC layer in which the number of nodes is equal to the number of classes. The primary purpose of the FC layer is to determine the larger patterns by combining all the spatiotemporal features learned by the previous layers across the images. It multiplies the input feature vector obtained from the last LSTM cell by a weight matrix $W$ and then adds a bias vector $b$. The final output obtained after this FC layer is $y = W \cdot h_n + b$. The next softmax layer converts the output $y$ into probabilities by applying the softmax function [31]. Finally, the classification layer takes the output of the softmax layer and assigns each input to one of the 37 different categories by using the cross-entropy loss function [31].
In conclusion, the final class label is assigned to the given sequence of n successive images by exploiting both the spatial and temporal information.
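The classification stage described above (FC layer, softmax, and cross-entropy) can be sketched as follows. This is a minimal NumPy illustration and not the authors' MATLAB code: the hidden size of 8, the random weights, and the function names are assumptions made only for the example, while the 37 output nodes follow the number of classes in the text.

```python
import numpy as np


def softmax(y):
    """Convert FC outputs into class probabilities (shifted for stability)."""
    e = np.exp(y - np.max(y))
    return e / e.sum()


def classify(h_last, W_fc, b_fc):
    """FC layer y = W h_n + b followed by softmax; returns (label, probabilities)."""
    y = W_fc @ h_last + b_fc
    p = softmax(y)
    return int(np.argmax(p)), p


def cross_entropy(p, target):
    """Cross-entropy loss for one sample whose true class index is `target`."""
    return -float(np.log(p[target]))


rng = np.random.default_rng(0)
num_classes, hidden_size = 37, 8          # hidden_size is hypothetical
W_fc = rng.normal(0.0, 0.01, (num_classes, hidden_size))
b_fc = np.zeros(num_classes)
h_last = rng.normal(size=hidden_size)     # stand-in for the last LSTM output
label, probs = classify(h_last, W_fc, b_fc)
loss = cross_entropy(probs, target=label)
```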

Experimental Setup and Performance Analysis
In this section, we analyze the performance of our proposed ResNet18 and LSTM-based classification framework. We provide the details of the selected endoscopy dataset, experimental configurations, various performance analysis metrics used to evaluate the quantitative performance, observations, and analysis of the results as well as the comparison with other methods.

Dataset
To evaluate the performance of the proposed framework for the classification of multiple GI diseases, we selected open-access endoscopic videos from Gastrolab [32] and the KVASIR dataset [33].
The datasets contain various endoscopic videos related to different parts of the human GI tract, including both normal and diseased cases. The details of each individual video (including the information about normal and diseased cases as well as the anatomical district) are included as the video name. Based on the available information, the complete dataset was categorized into 37 different classes including both normal and diseased cases related to different parts of the human GI tract. These different classes include the multiple anatomical locations (i.e., esophagus, stomach, small intestine, large intestine, and rectum) of the human GI tract as shown in Figure 5. Furthermore, the details of multiple subcategories of each anatomical district and their corresponding number of classes with types of diseases and the number of training and testing sequences are listed in Table 4. The entire dataset contains a total of 77 video files including 52,471 frames. In the preprocessing part, all these frames were resized to a fixed spatial size of 224 × 224; subsequently, they were converted into a standard bitmap file format. We performed two-fold cross-validation by randomly dividing the entire dataset into 50% for training and the remaining 50% for testing. That is, in all the performance comparisons, the amount of training data is the same as that of the testing data, as shown in Table 4.
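The two-fold protocol described above can be sketched in a few lines of Python. The sketch assumes a plain random split of sample identifiers; whether the authors split per class, per video, or per sequence is not stated here, so the function name, the seed, and the integer identifiers are illustrative only.

```python
import random


def two_fold_split(items, seed=0):
    """Randomly halve `items` and return [(train, test), (train, test)].

    The second fold swaps the roles of the two halves, so every item is
    used once for training and once for testing.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:]
    return [(first, second), (second, first)]


folds = two_fold_split(range(10))
for fold_index, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_index}: {len(train)} training / {len(test)} testing items")
```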
In the first stage, an online data augmentation [34] process (including random translation and in-plane rotation) was used to solve the class imbalance problem [35] caused by the different number of training samples in each class. The data augmentation process was performed only for the training dataset in the first stage (i.e., spatial feature extraction using ResNet18), and was not performed for the testing dataset.
The visual representation of our selected dataset for each class is shown in Figure 6. In this diagram, each individual image presents a specific class from the total of 37 different classes (i.e., C1, C2, C3, ..., C37). The primary challenge in our selected dataset was the high intra-class variance caused by the different types of lesion structures and texture properties within the same class, as depicted in Figure 7. Furthermore, different viewing conditions and dynamic structural changes during the endoscopy procedure may also increase the intra-class variance.
To solve this problem, a high level of abstraction was required to present the common characteristics of datasets with such high intra-class variance. In addition, a sufficient amount of training data related to a particular domain can also enhance the overall performance of CAD systems. This type of dataset aids in analyzing the performance of our proposed framework in a challenging scenario.

Experimental Setup and Training
The proposed framework was implemented with MATLAB R2018b (MathWorks, Inc., Natick, MA, USA) [36] on a Windows 10 operating system. The deep learning library included in MATLAB, named the Deep Learning Toolbox, was used for the implementation of various CNN models [37]. Anyone who purchases MATLAB R2018b [36] can use this library under its licenses, which are based on the credits to the authors of the CNN models. All the experiments were performed on a desktop computer with a 3.50 GHz Intel® (Santa Clara, CA, USA) Core-i7-3770K central processing unit (CPU) [38], 16 GB random access memory (RAM), and an NVIDIA (Santa Clara, CA, USA) GeForce GTX 1070 graphics card [39]. The graphics card provides parallel processing capability for both the training and the testing phases.
As explained in Section 4, our proposed method combined two types of features for the classification of multiple GI diseases, i.e., the spatial features extracted by a deep CNN model in the first stage and the temporal features extracted by the LSTM model in the second stage. Both networks were trained separately by using the stochastic gradient descent (SGD) [40] optimizer, which is generally used for the training of CNNs. It is an efficient gradient-based optimization method that is combined with backpropagation to train deep networks. Its primary goal is to optimize the learnable parameters of the model (i.e., filter weights and biases) by using the gradient of the loss function. In addition, we initialized the parameters of the first-stage CNN model by using a pretrained ResNet18 model, which was trained on the ImageNet dataset [41]. This scheme has been widely used in previous studies to initialize the network parameters, because it makes the network training process easier and more time-efficient. In the case of the LSTM model, the weights were randomly initialized by using a Gaussian distribution with zero mean and 0.001 standard deviation, and the biases were initialized to zero. The parameters of the training procedure used in our experiments are listed in Table 5.
The performance of our proposed method was evaluated by performing the cascaded training of our ResNet18 and LSTM-based classification framework. In the first stage, we trained ResNet18 by using the training dataset (as listed in Table 4). Figure 8 shows the progress of training loss and accuracy over the epochs for both folds of the cross-validation. The training loss approaches zero after a certain number of epochs, and the training accuracy approaches 100%, which illustrates that our selected model is sufficiently trained.
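To make the initialization and update rule above concrete, the following sketch shows Gaussian weight initialization (zero mean, 0.001 standard deviation), zero biases, and a plain gradient-descent update. It is only an illustration: the learning rate, the helper names, and the toy quadratic loss are assumptions, and in actual SGD the gradient would be estimated from a mini-batch of training samples rather than computed exactly.

```python
import random


def init_weights(rows, cols, std=0.001, seed=0):
    """Zero-mean Gaussian initialization, as described for the LSTM weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


def init_biases(size):
    """Biases start at zero."""
    return [0.0] * size


def sgd_step(params, grads, learning_rate):
    """One update: each parameter moves against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]


# Toy convex loss L(w) = (w - 3)^2 with gradient 2 * (w - 3).
w = [0.0]
for _ in range(100):
    grad = [2.0 * (w[0] - 3.0)]
    w = sgd_step(w, grad, learning_rate=0.1)  # hypothetical learning rate
```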
In addition, after performing several training experiments for different CNN models, we determined that fine-tuning a pretrained model results in faster convergence than training from scratch. In other words, we used the ResNet18 model pretrained with the ImageNet dataset [41] and then fine-tuned it with our training dataset of Table 4. Therefore, we selected a pretrained ResNet18 model for spatial feature extraction in the first stage. Moreover, the average accuracy of our selected ResNet18 based on the spatial features was higher than that of other deep CNN models. Thus, both the ResNet18 and LSTM models were interconnected in a cascaded fashion, and separate trainings were performed for both networks. The second-stage training process was started after completing the training of the ResNet18 model.
In the second stage, the output feature vectors (extracted from the trained ResNet18 model in the first stage using the training dataset) were used to train our proposed LSTM model. In this stage, each training sample comprised a set of n feature vectors (extracted from n successive frames in the first stage) instead of a single feature vector.
Thus, an intermediate feature-based dataset was generated from the extracted feature vectors, which was further used for temporal feature extraction. In our experiment, a total of fifteen (i.e., n = 15) successive frames were used to generate a set of fifteen feature vectors for each training sample. Figure 9 shows the progress of training loss and accuracy for both folds of the cross-validation. The training loss approaches zero after a certain number of iterations in the first epoch and the training accuracy approaches 100%, which shows the convergence of the second stage (LSTM) of our model. In Figure 9, it can also be observed that the convergence of LSTM is faster and smoother when compared to ResNet18 (in the first stage). The primary reason for this result is the use of an intermediate dataset (i.e., a set of discriminative spatial feature vectors) for temporal feature extraction rather than using the successive frames.
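One possible way to build the intermediate dataset described above is to group the per-frame feature vectors of a video into runs of n successive frames. The sketch below assumes non-overlapping groups and drops an incomplete trailing group; the text does not state whether the authors used overlapping windows, so this grouping rule and the function name are assumptions.

```python
def make_sequences(frame_features, n=15):
    """Group per-frame feature vectors into non-overlapping runs of n frames.

    A trailing run with fewer than n frames is dropped (an assumption).
    """
    last_start = len(frame_features) - n
    return [frame_features[i:i + n] for i in range(0, last_start + 1, n)]


# Integers stand in for the 512-dimensional ResNet18 feature vectors.
video_features = list(range(47))
sequences = make_sequences(video_features, n=15)
print(len(sequences), [seq[0] for seq in sequences])
```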

Performance Analysis Metric
We employed average accuracy, F1 score, mean average precision (mAP), and mean average recall (mAR) [42] to quantitatively evaluate the performance of our proposed ResNet18 and LSTM-based classification model. Based on these four metrics, we evaluated the overall performance of the model by calculating the average value over all the classes. The four metrics are computed from $TP_k$, $FP_k$, $TN_k$, and $FN_k$, which denote the number of true positives, false positives, true negatives, and false negatives, respectively, for each class $k$. The value of $TP_k$ is the number of correctly classified images from class $k$, and $FP_k$ is the number of images that are misclassified as belonging to class $k$. $TN_k$ is the number of correctly classified images that do not belong to class $k$, and $FN_k$ is the number of misclassified images that actually belong to class $k$. Here, $K$ denotes the total number of classes, which is equal to 37 in our research.
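The per-class counts described above can be turned into the four metrics with a short Python sketch. It assumes a confusion matrix whose rows are true classes and whose columns are predicted classes, per-class precision $TP_k/(TP_k + FP_k)$ and recall $TP_k/(TP_k + FN_k)$ averaged over the K classes, and an F1 score taken as the harmonic mean of mAP and mAR. The exact averaging form used by the authors is an assumption here, as are the function names and the toy matrix.

```python
def per_class_counts(conf, k):
    """Return (TP, FP, TN, FN) for class k from a square confusion matrix."""
    total = sum(sum(row) for row in conf)
    tp = conf[k][k]
    fn = sum(conf[k]) - tp                  # class k samples predicted as another class
    fp = sum(row[k] for row in conf) - tp   # other samples predicted as class k
    tn = total - tp - fn - fp
    return tp, fp, tn, fn


def macro_metrics(conf):
    """Average accuracy, F1, mAP, and mAR over all classes."""
    num_classes = len(conf)
    acc = prec = rec = 0.0
    for k in range(num_classes):
        tp, fp, tn, fn = per_class_counts(conf, k)
        acc += (tp + tn) / (tp + tn + fp + fn)
        prec += tp / (tp + fp) if tp + fp else 0.0
        rec += tp / (tp + fn) if tp + fn else 0.0
    m_ap, m_ar = prec / num_classes, rec / num_classes
    f1 = 2 * m_ap * m_ar / (m_ap + m_ar) if m_ap + m_ar else 0.0
    return {"accuracy": acc / num_classes, "f1": f1, "mAP": m_ap, "mAR": m_ar}


toy_confusion = [[8, 2],   # made-up two-class counts, not results from the paper
                 [1, 9]]
metrics = macro_metrics(toy_confusion)
```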

Testing of the Proposed Method
The number of successive frames plays an important role in the system performance. A small number of successive frames carries little temporal information, whereas a long sequence increases the processing time and the effects of noise. Therefore, we trained our LSTM model for thirty different numbers of frames (i.e., n = 1, 2, 3, ..., 30). Then, the testing performance was evaluated for each sequence length. Figure 10 shows the average performance results according to the number of frames. In Figure 10, the green square box indicates the maximum average performance, whereas the red square boxes indicate the maximum performance with respect to the individual performance metrics (i.e., accuracy, F1 score, mAP, and mAR). Finally, based on the overall maximum average performance, we determined that the best accuracy could be obtained when the number of frames was 15 (n = 15).
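The selection rule described above, choosing the sequence length with the highest performance averaged over the four metrics, can be sketched as follows. The numbers in the example are made up for illustration and are not results from the paper; the function name is also hypothetical.

```python
def best_sequence_length(results):
    """Return the n whose four metrics have the highest average.

    `results` maps each sequence length n to a dict of metric values.
    """
    def average(metric_values):
        return sum(metric_values.values()) / len(metric_values)

    return max(results, key=lambda n: average(results[n]))


# Made-up values: n = 30 has the best single metric (mAP),
# but n = 15 has the best average over all four metrics.
toy_results = {
    5: {"accuracy": 0.90, "f1": 0.89, "mAP": 0.91, "mAR": 0.88},
    15: {"accuracy": 0.93, "f1": 0.92, "mAP": 0.93, "mAR": 0.91},
    30: {"accuracy": 0.92, "f1": 0.91, "mAP": 0.94, "mAR": 0.90},
}
best_n = best_sequence_length(toy_results)
```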
As the next experiment, we performed a layer-wise performance comparison between our method (ResNet18 + LSTM) and the ResNet18 model alone by selecting the features from different parts of the network. This additional experiment was also used to investigate whether the features at certain intermediate layers are more discriminative and could result in better performance. For this experiment, the output feature vectors were extracted from five different layers (i.e., Conv6-2, Conv7-2, Conv8-2, Conv9-2, and Avg. pooling, as listed in Table 2) of ResNet18 with feature map sizes of 14 × 14 × 256 (50,176), 14 × 14 × 256 (50,176), 7 × 7 × 512 (25,088), 7 × 7 × 512 (25,088), and 1 × 1 × 512 (512), respectively. In the case of our method, the classification performance for each layer was obtained by further extracting the temporal information from these features with the LSTM model. The layer-wise features from the ResNet18 model were classified using a k-nearest neighbor (KNN) classifier, which is widely used for pattern classification [43].
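For readers unfamiliar with the KNN classifier mentioned above, a minimal sketch is given below. It assumes Euclidean distance and a simple majority vote with k = 3; the value of k and the distance measure used in the experiments are not stated here, so both are assumptions, and the two-dimensional points merely stand in for the high-dimensional layer features.

```python
import math
from collections import Counter


def knn_predict(train_features, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training samples."""
    order = sorted(range(len(train_features)),
                   key=lambda i: math.dist(train_features[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


train_features = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["A", "A", "A", "B", "B", "B"]
predicted = knn_predict(train_features, train_labels, (0.2, 0.1))
```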
The complete layer-wise performances of our method and ResNet18 are listed in Table 6. Based on the overall performance, we concluded that the deeper features result in better classification performance in the case of our method and the ResNet18 model. However, the layer-wise performance of our method was still higher than the conventional ResNet18. Moreover, our method also showed a high average accuracy of 90.48% and mAP of 91.29% when a still image (i.e., n = 1) was used, which are higher values when compared to other CNN-based methods (an accuracy of 89.95% and mAP of 90.72% in the case of conventional ResNet18).
The features extracted from the last average pooling layer of ResNet18 were further analyzed by applying the PCA [44] technique as a post-processing step. The main objective of this analysis was to explore the discriminative nature of the features (i.e., to check whether our selected features were distinctive or redundant). For this purpose, all the extracted features of dimension 1 × 512 from the last average pooling layer were projected to the eigenspace by applying PCA. This eigenspace presented all the input feature vectors in a new coordinate system in a more distinctive way. The dimension of the newly obtained features was selected so that the projected data retained the maximum variance (i.e., greater than 99%). The eigenvalue corresponding to each eigenvector was used to select the eigenvectors. In the case of our dataset, a new set of feature vectors (with the feature dimension 1 × 136) was obtained by selecting a total of 136 eigenvectors with the highest eigenvalues. In our proposed model, this new set of feature vectors was further used as the input to the LSTM model to explore the temporal information, and then the final classification performance was obtained as listed in Table 7. In addition, the PCA feature-based performance was evaluated for ResNet18 by using the KNN classifier, which is also presented in Table 7. According to these final classification results, we concluded that the PCA-based features reduced the performance in both cases (i.e., our proposed model and ResNet18), whereas the original high-dimensional features resulted in better performance. Finally, it can be concluded that our extracted features (from the last average pooling layer) were already diverse, and the performance of our method was still high in comparison with the conventional ResNet18 after applying PCA.
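The variance-based selection of eigenvectors described above can be sketched with NumPy. The sketch computes the principal directions from a singular value decomposition of the centered data and keeps the smallest number of components whose cumulative variance reaches 99%. The synthetic data, the function names, and the use of SVD rather than an explicit covariance eigendecomposition are assumptions made for illustration.

```python
import numpy as np


def pca_fit(X, var_keep=0.99):
    """Return (mean, components) retaining at least `var_keep` of the variance."""
    mean = X.mean(axis=0)
    # Rows of Vt are eigenvectors of the sample covariance matrix, ordered by
    # decreasing eigenvalue s**2 / (N - 1).
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    eigenvalues = s ** 2 / (len(X) - 1)
    cumulative = np.cumsum(eigenvalues) / eigenvalues.sum()
    d = min(int(np.searchsorted(cumulative, var_keep)) + 1, len(s))
    return mean, Vt[:d]


def pca_transform(X, mean, components):
    """Project feature vectors onto the retained eigenvectors."""
    return (X - mean) @ components.T


rng = np.random.default_rng(1)
# Synthetic 20-dimensional features that really vary along only 3 directions.
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 20))
X += 1e-3 * rng.normal(size=X.shape)
mean, components = pca_fit(X, var_keep=0.99)
Z = pca_transform(X, mean, components)
```

In the paper's setting, X would hold the 1 × 512 average-pooling features, and the same rule is reported to yield 136 retained eigenvectors.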
As the confusion matrix in Figure 11 shows, only a few classes (i.e., C16, C31, C33, and C34) showed low classification performance owing to the high inter-class similarities in terms of lesion textures or GI organ structures. However, the overall performance of our proposed method remained high for a dataset with a large number of classes.
Figure 11. Confusion matrix of the proposed method. The entry in the $i$th row and $j$th column corresponds to the percentage of samples from class $i$ that were classified as class $j$. Precision and recall are calculated as $TP_k/(TP_k + FP_k)$ and $TP_k/(TP_k + FN_k)$ [45], respectively.

Comparisons with Previous Methods
The performance of our proposed ResNet18 and LSTM-based method was compared with various state-of-the-art deep CNN-based CAD tools that are used in the endoscopy domain [8,12,14,15]. To ensure a fair comparison, the performances of all the existing baseline methods were evaluated with our selected dataset using the same training and testing data of the two-fold cross-validation. In recent studies related to endoscopy, two different CNN models, GoogLeNet [8,12,15] and InceptionV3 [14], were primarily used in the diagnosis of various types of GI diseases. Therefore, the performance of these two models was evaluated in comparison with our proposed method. The experimental results showed that our method outperformed these two deep CNN models [8,12,14,15] with a significant performance gain, as listed in Table 8.
Further, we also compared the performance of our method with the recent CNN models [16,23-25] used in image classification domains other than endoscopy. The main objective of these comparisons was to estimate the performance of the existing state-of-the-art CNN models in the endoscopy image analysis domain. The complete experimental results for all the selected baseline methods are listed in Table 8. These results confirm that our proposed ResNet18 and LSTM-based method shows the highest performance in the endoscopy image analysis domain for the classification of multiple GI diseases.
The discriminative ability of our proposed method, in contrast with the other baseline methods, can also be observed through the receiver operating characteristic (ROC) curve, an effective measure used to evaluate the diagnostic ability of a model. It is created by plotting the true positive rate (known as the probability of detection) against the false positive rate (known as the probability of false alarm) at various threshold settings. From Figure 12, it can be observed that our proposed method also shows the highest area under the curve (AUC), with a value of 97.057%, in comparison with all the other selected baseline methods (i.e., SqueezeNet: 82.131%, AlexNet: 87.328%, GoogLeNet: 91.097%, VGG19: 92.039%, VGG16: 93.060%, InceptionV3: 95.000%, ResNet50: 95.924%, and ResNet18: 95.705%). All these ROC curves present the average values obtained from the two-fold cross-validation. The figure on the left side provides an enlarged view to illustrate the performance difference more clearly.
The complete parametric and structural details of our proposed model and the other selected models are listed in Table 9.
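The ROC construction described above, plotting the true positive rate against the false positive rate while sweeping a decision threshold, can be sketched for the binary case as follows. The sketch assumes one score per sample, binary labels with both classes present, and trapezoidal integration for the AUC. How the 37-class problem was reduced to ROC curves (for example, one-vs-rest averaging) is not stated here, so that step is left out, and the scores in the example are made up.

```python
def roc_curve(scores, labels):
    """Return ROC points (FPR, TPR) obtained by lowering the threshold.

    `labels` holds 1 for positive and 0 for negative samples; both classes
    must be present. Samples with tied scores are added together.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]   # made-up classifier scores
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(roc_curve(toy_scores, toy_labels))
```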
The AUC performance of ResNet18 is comparable with that of the second-best model, ResNet50, as shown in Figure 12; however, the number of trainable parameters of ResNet18 is less than half of that of ResNet50, as listed in Table 9. Therefore, we adopted the ResNet18 architecture as the backbone model to extract the spatial features, which are further used as inputs to the LSTM model to exploit the temporal information. In our proposed framework, the total number of learnable parameters was approximately 13.17M (including both ResNet18 and LSTM), which was still significantly lower than that of the second-best model (i.e., ResNet50), as shown in Table 9.
Furthermore, a sensitivity analysis was performed to evaluate the robustness of our method and the other CNN models. A Monte Carlo simulation [27] was performed to analyze this sensitivity performance. In this simulation setup, the performance of each individual CNN model was evaluated iteratively by randomly selecting 20% of the complete testing dataset as a testing subset. A total of 200 iterations were performed for both folds of the cross-validation. Finally, the average performance (i.e., average accuracy, F1 score, mAP, and mAR) as well as the standard deviation were obtained for each model. The overall sensitivity performance of our method and all the selected models is illustrated in Figure 13. It can be observed in Figure 13a-d that the overall sensitivity performance of our proposed method is higher in terms of average accuracy, F1 score, mAP, and mAR when compared to all the existing baseline models.
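The Monte Carlo sensitivity analysis described above can be sketched as follows for a single metric (accuracy). The sketch draws 200 random subsets, each containing 20% of the test samples, and reports the mean and standard deviation of the subset accuracies. Sampling without replacement, the fixed seed, and the function name are assumptions, and the labels in the example are synthetic.

```python
import random
import statistics


def monte_carlo_accuracy(y_true, y_pred, fraction=0.2, iterations=200, seed=0):
    """Mean and standard deviation of accuracy over random test subsets."""
    rng = random.Random(seed)
    subset_size = max(1, int(fraction * len(y_true)))
    accuracies = []
    for _ in range(iterations):
        subset = rng.sample(range(len(y_true)), subset_size)  # without replacement
        correct = sum(y_true[i] == y_pred[i] for i in subset)
        accuracies.append(correct / subset_size)
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Synthetic example: 100 test labels with every tenth prediction wrong.
y_true = [0, 1] * 50
y_pred = list(y_true)
for i in range(0, 100, 10):
    y_pred[i] = 1 - y_pred[i]
mean_acc, std_acc = monte_carlo_accuracy(y_true, y_pred)
```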
A t-test performance analysis [46] was further performed to illustrate the significance of the performance difference between our method and ResNet18. The t-test was performed only against ResNet18 because it shows the second-best accuracy, as shown in Table 8. In general, this analysis is often used to illustrate the performance difference between two systems or algorithms in a more discriminative way. It is based on a null hypothesis (H), which assumes that there is no performance difference (i.e., H = 0) between two models. Then, a rejection score (p-value) is calculated to check the validity of the null hypothesis based on the performance of the two models (in this case, our method and the second-best model).
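As a sketch of the test described above, the following code computes a two-sample t statistic and a two-sided p-value from two lists of per-iteration scores. The exact variant of the t-test used by the authors is not stated, so the unequal-variance (Welch) form is an assumption, and the p-value uses a normal approximation of the t distribution, which is only reasonable for large samples such as the 200 Monte Carlo iterations. The scores in the example are made up.

```python
import math
import statistics


def welch_t_test(a, b):
    """Return (t, p) for the null hypothesis that both means are equal.

    The two-sided p-value uses a normal approximation of the t distribution,
    so it is only indicative for small samples.
    """
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    t = (mean_a - mean_b) / standard_error
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return t, p


ours = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95, 0.94, 0.96]      # made-up scores
baseline = [0.90, 0.91, 0.89, 0.90, 0.91, 0.90, 0.89, 0.91]  # made-up scores
t_value, p_value = welch_t_test(ours, baseline)
reject_null = p_value < 0.01   # 99% confidence level, as in the text
```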
Figure 14 illustrates the t-test performance (for the values of mean (µ), standard deviation (ρ), and p-value) for our method and the second-best model. These results were calculated for all the performance measures. The obtained rejection scores (p-values) in the case of the average accuracy, F1 score, mAP, and mAR were $1.51 \times 10^{-43}$, $6.87 \times 10^{-20}$, $4.67 \times 10^{-10}$, and $1.03 \times 10^{-33}$, respectively. All these p-values are less than 0.01, which indicates that the null hypothesis is rejected (i.e., H ≠ 0) at a 99% confidence level for all the performance metrics. Based on these results, it can be concluded that there is a significant performance difference between our method and the second-best method. Furthermore, the higher mean (µ) performance of our method indicates its superiority over the second-best baseline model.
We also performed Cohen's d [47] analysis, by which the size of the difference between the two groups was demonstrated using the effect size [48]. Cohen's d analysis is widely used for analyzing the difference between two measured values. Generally, Cohen's d is classified as small at approximately 0.2-0.3, as medium at approximately 0.5, and as large at greater than or equal to 0.8. For example, if the calculated Cohen's d is closer to 0.2-0.3 than to 0.5 and 0.8, the difference between the measured values has a small effect size. If the calculated Cohen's d is closer to 0.8 than to 0.2-0.3 and 0.5, the difference between the measured values has a large effect size. The calculated Cohen's d values for the performance of the two models (our method and the second-best model) were approximately 1.57, 0.96, 0.64, and 1.33 for average accuracy, F1 score, mAP, and mAR, respectively. The values for average accuracy, F1 score, and mAR exceed 0.8, whereas the value for mAP lies between 0.5 and 0.8 and is marginally closer to 0.5.
Consequently, we concluded that the difference in the performances between our method and the second-best model has a large effect size for average accuracy, F1 score, and mAR, and a medium-to-large effect size for mAP.
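For the Cohen's d analysis, a minimal sketch of the effect-size computation and of the nearest-threshold labeling rule is given below. It assumes the common pooled-standard-deviation form of Cohen's d; the paper does not state which variant was used, and the function names and example scores are illustrative.

```python
import math
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled standard deviation of the two samples."""
    n_a, n_b = len(a), len(b)
    pooled_variance = ((n_a - 1) * statistics.variance(a)
                       + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_variance)


def effect_label(d):
    """Label |d| by the nearest reference: 0.2-0.3 small, 0.5 medium, 0.8 large."""
    d = abs(d)
    if d >= 0.8:
        return "large"
    distance = {
        "small": 0.0 if 0.2 <= d <= 0.3 else min(abs(d - 0.2), abs(d - 0.3)),
        "medium": abs(d - 0.5),
        "large": abs(d - 0.8),
    }
    return min(distance, key=distance.get)


d_value = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])   # made-up scores
label = effect_label(1.57)                              # value reported for accuracy
```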
In this section, we compare the performance of various handcrafted feature-based methods with that of our proposed CNN and LSTM-based classification framework. Three well-known handcrafted feature extraction methods, namely LBP [49], histogram of oriented gradients (HoG) [50], and multilevel LBP (MLBP) [51], were considered. The features extracted by each method were then classified by using four different classifiers: adaptive boosting (AdaBoostM2) [52], multiclass SVM (multi-SVM) [53], random forest (RF) [54], and KNN. All these handcrafted feature-based methods exploit low-level features (i.e., edge or corner information). We evaluated the performance of the resulting 12 handcrafted feature-based classification methods (three feature extractors combined with four classifiers) on our selected dataset to obtain a fair comparison. The detailed results for all these classification methods are listed in Table 10.
Among all these handcrafted feature extraction and classification methods, HoG + RF (i.e., the HoG feature extraction method followed by the RF classifier) demonstrated the best performance. This suggests that the HoG feature extraction method captured more discriminative low-level features than the other two methods. Furthermore, the RF classifier uses tree structures to reach the classification decision, which resulted in better performance and controlled the over-fitting problem. However, there is still a significant performance difference between our method and the best handcrafted feature-based method (HoG + RF). Our proposed method outperformed all the handcrafted feature-based methods.
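To make the notion of a handcrafted low-level feature concrete, the following minimal sketch computes a basic 8-neighbour LBP code and a normalised LBP histogram, and then classifies a histogram with a 1-nearest-neighbour rule. It is an illustration under stated assumptions rather than the configuration evaluated in Table 10: the exact LBP variant of [49], the neighbour ordering, the choice of k = 1, and all function names are hypothetical.

```python
def lbp_code(img, r, c):
    """8-neighbour LBP code for the interior pixel at (r, c)."""
    center = img[r][c]
    # Neighbours visited clockwise, starting at the top-left pixel (assumed order).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(img):
    """Normalised 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist


def nearest_neighbor_label(query_hist, train_hists, train_labels):
    """1-NN classification with squared Euclidean distance between histograms."""
    def sq_dist(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    best = min(range(len(train_hists)), key=lambda i: sq_dist(query_hist, train_hists[i]))
    return train_labels[best]


# Tiny grey-level patches (lists of rows), for illustration only.
flat = [[5] * 3 for _ in range(3)]
peak = [[1, 1, 1], [1, 9, 1], [1, 1, 1]]
print(lbp_code(flat, 1, 1), lbp_code(peak, 1, 1))
```

Because each code depends only on a pixel and its immediate neighbours, such descriptors capture local edge and corner patterns but no temporal information across frames.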

Discussion
Our proposed deep CNN and LSTM-based classification framework shows the best performance with a high AUC of 97.057%. This performance increases the usability of our proposed system in the diagnosis of several GI diseases, as it automatically detects different types of GI lesions or abnormalities, such as polyps, ulcers, or cancers, from endoscopic videos. Our AI-based CAD system can assist physicians in the effective diagnosis and treatment of many complex GI diseases. Furthermore, the classification of endoscopic videos can itself be beneficial for retrieving previously stored videos related to the current condition of a patient. Thus, past cases can provide a path toward the correct diagnostic decision. Therefore, we can also utilize our proposed classification framework for efficient endoscopic video frame retrieval by using the predicted class labels. The overall block diagram of our class prediction-based retrieval system is shown in Figure 15. In this retrieval system, the first step is to predict the class of the given query (i.e., successive endoscopic video frames). To predict the class label, a probability score corresponding to each class label is obtained for the given query by using our proposed classification framework, and the class label with the highest probability score is chosen. In the second step, the cases relevant to the input query frames are searched only within the predicted class based on feature matching. In this feature matching stage, the spatiotemporal feature vector extracted from the input query frames is matched one by one against the feature database of the predicted class by calculating the Euclidean distance. The frame index (i.e., name or ID information) with the minimum distance is selected. Finally, the relevant frame is retrieved from the database by using the frame index obtained in the previous stage.
Figure 15. Class prediction-based retrieval system by using our proposed classification framework.
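The two-step retrieval procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in this study: the function names, the dictionary-based feature database, the two-dimensional feature vectors, and the class and frame identifiers are all hypothetical.

```python
import math


def predict_class(probabilities):
    """Step 1: return the class label with the highest probability score."""
    return max(probabilities, key=probabilities.get)


def retrieve(query_feature, probabilities, feature_db):
    """Step 2: match the query only against features of the predicted class.

    feature_db maps a class label to a list of (frame_id, feature_vector) pairs.
    Returns the predicted label and the frame ID with the minimum Euclidean distance.
    """
    label = predict_class(probabilities)
    frame_id, _ = min(
        feature_db[label],
        key=lambda entry: math.dist(query_feature, entry[1]),
    )
    return label, frame_id


# Hypothetical database and classifier output, for illustration only.
feature_db = {
    "C1": [("c1_f001", [0.0, 0.0]), ("c1_f002", [1.0, 1.0])],
    "C2": [("c2_f001", [0.9, 0.9])],
}
probabilities = {"C1": 0.7, "C2": 0.3}
print(retrieve([0.8, 0.9], probabilities, feature_db))
```

In this toy example, the stored frame of class C2 is actually the closest one to the query, but it is never examined because the search is restricted to the predicted class. This restriction reduces the number of comparisons, and it also means that a classification error propagates directly to the retrieval result.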
A few correctly retrieved examples obtained by using our class prediction-based retrieval system are illustrated in Figure 16. It can be observed that the retrieved endoscopic frames have high intra-class variance with varying illumination and contrast. Nevertheless, our proposed system achieves 100% retrieval performance for all the selected cases. Moreover, the classification performance for these selected example cases is also 100%, which can be observed in Figure 11 (confusion matrix performance for each class). Further, Figure 17 shows the probability score corresponding to each input query. It can be observed that the highest probability score is obtained for the actual class, which shows that the proposed classification model is capable of extracting discriminative features for the given query. In conclusion, this significant performance gain (in both the classification and retrieval stages) shows that our method can be robust to the high intra-class variance of a dataset.
There are a few classes in our selected dataset that show low retrieval performance, as shown in Figure 18. The primary reason for this performance degradation is the overlapping anatomical structure and similar shape of GI lesions across different classes. Figure 18a shows a few incorrectly retrieved results, in which C30 (i.e., tuber adenoma in sigmoid colon) and C32 (i.e., ulcerative colitis in rectosigmoid part of large intestine) are retrieved for an input query of C16 (i.e., severe Crohn's disease in terminal ileum of small intestine). It can be observed from Figure 18a that the lesion characteristics of these three classes (i.e., C16, C30, and C32) resemble one another, which may cause the incorrect retrieval. Similarly, certain other incorrect retrieval cases were obtained for input queries of C31 (i.e., polypoid cancer in sigmoid colon), C33 (i.e., severe Crohn's disease in the rectum), and C34 (i.e., adenocarcinoma in the rectum) owing to similar lesion characteristics, as shown in Figure 18b-d. Moreover, Figure 19 shows the probability score corresponding to each input query, in which significantly high probability scores can be observed for multiple predicted class labels. These multiple high scores reflect the structural or lesion similarities among multiple classes, which can result in classification errors. However, the retrieval performance in these cases can be enhanced by searching for the input query in multiple classes, which can be selected as those whose probability scores are greater than a certain threshold.
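The threshold-based extension suggested above could look like the following sketch. It is an assumption-laden illustration rather than an evaluated method: the threshold value of 0.2, the fallback to the top-scoring class when no class passes the threshold, the function names, and the toy feature vectors are all hypothetical.

```python
import math


def candidate_classes(probabilities, threshold):
    """Classes whose probability score exceeds the threshold.

    Falls back to the single top-scoring class if none passes (assumed behaviour).
    """
    selected = [label for label, p in probabilities.items() if p > threshold]
    return selected or [max(probabilities, key=probabilities.get)]


def retrieve_multi(query_feature, probabilities, feature_db, threshold=0.2):
    """Nearest stored frame across every class that passes the threshold."""
    candidates = [
        (math.dist(query_feature, feature), label, frame_id)
        for label in candidate_classes(probabilities, threshold)
        for frame_id, feature in feature_db[label]
    ]
    _, label, frame_id = min(candidates)
    return label, frame_id


# Hypothetical scores in which two classes are both plausible.
probabilities = {"C16": 0.45, "C30": 0.35, "C32": 0.15, "C1": 0.05}
feature_db = {
    "C16": [("c16_f001", [0.0, 0.0])],
    "C30": [("c30_f001", [1.0, 1.0])],
    "C32": [("c32_f001", [0.5, 0.5])],
    "C1": [("c1_f001", [2.0, 2.0])],
}
print(retrieve_multi([0.9, 1.0], probabilities, feature_db))
```

Lowering the threshold widens the search to more classes at the cost of more distance computations, so the threshold trades retrieval robustness against search time.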

Conclusion
In this paper, a novel CNN and LSTM-based classification framework was proposed for the classification of multiple GI diseases using endoscopic videos. Moreover, our proposed classification framework was further utilized to design a class prediction-based endoscopic video retrieval system. The proposed spatiotemporal features-based method is capable of encoding more discriminative representations of multiple endoscopy scans than features learned only from spatial information. Therefore, using both spatial and temporal information results in better classification and retrieval performance. The performance of the proposed method was evaluated thoroughly using a publicly available dataset from GastroLab as well as the KVASIR database. Moreover, the same dataset and experimental protocol were adopted for the various state-of-the-art methods to make a fair comparison. The proposed method achieved its best result of 97.057% area under the curve, together with an average accuracy of 92.57%, F1 score of 93.41%, mAP of 94.58%, and mAR of 92.28%. In addition, the t-test rejection scores (p-values) obtained for our proposed method and the second-best method are less than 0.01 (1.51 × 10⁻⁴³, 6.87 × 10⁻²⁰, 4.67 × 10⁻¹⁰, and 1.03 × 10⁻³³ for the average accuracy, F1 score, mAP, and mAR, respectively), which indicates that the null hypothesis (H₀) is rejected at a 99% confidence level for all the performance metrics. After performing a detailed analysis, we observed that our method consistently achieved high classification performance in comparison with various state-of-the-art deep CNN methods and the handcrafted features-based methods of LBP, HoG, and MLBP. The classification and retrieval performance of the proposed system reveals its applicability to clinical diagnosis, treatment, education, and research. We have also made our trained model publicly available to aid other researchers in performance comparisons.
As future work, we plan to expand the dataset by considering more than 37 classes. In addition, we plan to perform real-time detection of small lesions using endoscopic video. We also plan to improve the overall classification performance by combining multiple deep CNN models.
Author Contributions: M.O. and K.R.P. designed the overall system. In addition, they wrote and revised the paper. M.A., J.C., and T.M. helped to design the comparative analysis and experiments.