Colon Tissues Classification and Localization in Whole Slide Images Using Deep Learning

Colorectal cancer is one of the leading causes of cancer-related death worldwide. The early diagnosis of colon cancer not only reduces mortality but also reduces the burden related to treatment strategies such as chemotherapy and/or radiotherapy. However, when the microscopic examination of a suspected colon tissue sample is carried out, it becomes a tedious and time-consuming job for the pathologists to find the abnormality in the tissue. In addition, there may be interobserver variability that might lead to conflict in the final diagnosis. As a result, there is a crucial need to develop an intelligent automated method that can learn from the patterns themselves and assist the pathologist in making a faster, accurate, and consistent decision for determining the normal and abnormal regions in colorectal tissues. Moreover, the intelligent method should be able to localize the abnormal region in the whole slide image (WSI), which will make it easier for the pathologists to focus only on the region of interest, making the task of tissue examination faster and less time-consuming. As a result, artificial intelligence (AI)-based classification and localization models are proposed for determining and localizing the abnormal regions in WSI. The proposed models achieved an F-score of 0.97 and an area under the curve (AUC) of 0.97 with the pretrained Inception-v3 model, and an F-score of 0.99 and an AUC of 0.99 with the customized Inception-ResNet-v2 Type 5 (IR-v2 Type 5) model.


Introduction
Colorectal cancer (CRC) is one of the leading causes of death worldwide. Asia accounted for 49.3% of new cancer cases and, moreover, more than half (58.3%) of the total cancer mortality. In terms of the type of cancer, colorectal cancer accounted for 10% of newly diagnosed cases, remaining the third leading cause of cancer death in Asia [1]. Cancer also remained the top cause of death in Taiwan. In the literature, transfer learning (TL) has been investigated by considering models whose source domain differs from that of the target dataset. When the diabetic foot ulcer dataset was considered, the proposed model achieved an F-score of 89.4% with TL from a domain different from the target dataset, and an F-score of 97.6% with TL from the same domain as the target dataset. A similar investigation was carried out in [14], but considering a breast cancer dataset. The authors proposed a hybrid model of parallel convolutional layers and residual links, which was trained with same-domain transfer learning. The proposed model achieved classification accuracies of 97.4% and 96.1% on the validation set and the testing set, respectively. Addressing the image annotation issue, a novel TL approach was proposed in [15], wherein the need for data labeling was reduced by training the DL model with a large number of unlabeled images of a specific task, followed by using a small number of labeled images for fine-tuning the model. The proposed model achieved F-scores of 98.53% and 97.51% for skin cancer and breast cancer, respectively. Nonetheless, on the basis of the above literature survey, it remains unclear whether the use of TL provides better performance when the source domain is completely different from the target domain or when the source domain is the same as the target domain. Moreover, it is also unclear whether a model derived by employing same-domain TL performs better than a model trained from scratch [12][13][14][15].
Consequently, in this paper, DL methods were employed for the patch-level classification of normal and abnormal CRC tissues in WSI. The aim of this work included multiphase analyses for colorectal tissue classification and localization. In phase one, pretrained convolutional neural network (CNN) architectures trained on a different source domain were compared. In the second phase, different customized models were designed using Keras [16], following structures similar to popular models such as the Visual Geometry Group network (VGG) [17], Residual Networks (ResNet) [18], Inception [19,20], and Inception-ResNet-v2 [21]. The customized models were trained from scratch on the target dataset (our own CRC dataset) instead of using the pretrained ImageNet weights, and the performances of the different models were compared. In phase three, the best-performing customized model was examined for the scope of further improvement, yielding an improved customized IR-v2. Ultimately, the abnormal regions were localized in the WSI in the final phase.
In this paper, we aimed to perform the classification and localization of colorectal tissue, considering both same-domain and different-domain TL, as well as training the models from scratch. The contributions made in this work can be summarized as follows:
• A new dataset consisting of 297 WSIs of the colon was collected and manually annotated by a well-experienced pathologist.
• Transfer learning was investigated by training different CNN architectures using weights obtained from a different-domain dataset, and performances were recorded after hyperparameter tuning.
• Different customized CNN models were built and trained from scratch using the target dataset, and performances were recorded and investigated after hyperparameter tuning.
• Among the customized models, the top-performing model was studied further to check whether it could be further tuned to obtain the best customized model.
• The best customized model, IR-v2 Type 5, achieved an F-score of 0.99 and an AUC of 0.99.
• The patches classified as abnormal were localized in the WSI, which could be beneficial for pathologists by allowing examination of a smaller area compared to the whole slide.
• On the basis of our study, we empirically showed that the customized IR-v2 Type 5 model provides better results for the CRC dataset when trained from scratch.
• The IR-v2 Type 5 model developed through this study may be deployed in different hospitals for automatic classification and localization of abnormal tissues in WSI, which can assist pathologists in making accurate decisions faster and can ultimately help expedite the treatment and therapy procedure for CRC patients.

Study Outline
Therefore, in this paper, artificial intelligence-based methods are proposed for the automatic classification and automatic localization of abnormal regions in WSI. The proposed models can assist pathologists and surgeons in faster decision making. The remaining sections of this paper are organized as follows. In Section 2, the dataset and methods used in this paper are discussed, followed by the experiments performed in Section 3. Section 4 discusses the results of both the pretrained and customized models along with the localization results. In Section 5, discussions and comparisons with current works in this field are presented, and finally, the work is concluded in Section 6. Figure 1 shows the overall outline of the study, wherein the steps involved from image data collection to scanning and obtaining WSI are briefed. This was followed by preprocessing and the division of data for classification model evaluation. The customized models were evaluated using performance metrics and, finally, abnormal tissues were localized in the WSI. All the steps are elucidated in the following subsections.

Data Acquisition
In this work, data were collected from 77 subjects containing 297 WSIs of the colon, including both normal and abnormal subjects. All the collected tissue samples were taken between 2009 and 2011 at the Chang Gung Memorial Hospital, Linkou, Taiwan. The samples collected for the study were anonymized and approved by the Institutional Review Board, Chang Gung Memorial Hospital, Linkou, Taiwan, under the license number 201702073B0. The size of the specimen varied among subjects, and thus so did the size of the scanned images. These specimens were stained with H&E and scanned using a Hamamatsu NanoZoomer at 40× magnification, resulting in a vendor-dependent format named NanoZoomer Digital Pathology (NDP) images. In this study, the largest source size of the specimen obtained was 43.4 mm × 25.6 mm, which led to an NDP image size of 3.18 GB with a resolution of 229 nm/pixel (110,917 dpi) and dimensions of 188,416 × 112,384 pixels.


Image Annotation
After obtaining the digitalized WSIs comprising 269 slides for abnormal subjects and 28 slides for normal subjects, we needed to annotate the normal and abnormal tissue regions in order to use supervised learning to train the deep learning models. As a result, in the WSI, the normal and abnormal regions were annotated by a pathologist with more than 10 years of experience. While analyzing the WSIs, we observed that some of the slides were not optimally scanned, which resulted in blurred images when magnified. In addition, the WSIs with overlapped and/or folded tissue samples were discarded. As a result, after discarding non-optimal slides, the study consisted of 22 subjects containing 28 normal WSIs and 55 subjects containing 187 abnormal WSIs. All image annotations were carried out using the NDP.view2 software, a free edition provided by Hamamatsu Photonics K.K. for viewing NDP images.

Image Preprocessing
The WSIs obtained for the subjects ranged from 1 to 3 GB, including images from both normal and abnormal subjects. However, WSIs of such a huge size could not be directly used as input to the deep learning algorithms due to the technical challenge of not fully fitting into the computer's memory. Consequently, NDPITools [22], open-source software distributed under the GNU General Public License 3.0, was used for splitting each WSI into smaller splits with an overlap of 25 pixels, resulting in splits in JPEG format. There were 32, 64, or 128 splits formed per WSI, depending on the size of the input WSI. When scanned, the WSI also contained the white background of the slide, which was not required in the analysis. In addition, some splits consisted of 75-100% overlapping tissue, the result of poor fixation during slide preparation. As a result, the splits containing white background, artefactual staining, and tissue wrinkling were discarded manually under the pathologist's supervision. The remaining splits were used as input to obtain the patches (tiles): TileMage Image Splitter 2.11 was used to obtain patches varying from 200 × 200 to 300 × 300 pixels. After patch formation, a similar approach was adopted for the removal of patches containing white background and tissue wrinkling; finally, the patches were ready for training the deep learning models. Figure 2 shows the step-by-step procedure adopted from image data collection to preprocessing.
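To illustrate how split origins with a fixed 25-pixel overlap can be computed along one axis, the following is a minimal sketch; the helper name is ours, and NDPITools handles this internally:

```python
def split_origins(length, split_size, overlap=25):
    # Hypothetical helper: top-left coordinates of consecutive splits
    # along one axis, each advancing by (split_size - overlap) pixels
    # so that neighboring splits share `overlap` pixels.
    step = split_size - overlap
    origins = []
    pos = 0
    while pos + split_size < length:
        origins.append(pos)
        pos += step
    origins.append(max(length - split_size, 0))  # last split flush with the edge
    return origins
```

For example, a 1000-pixel axis cut into 300-pixel splits yields origins at 0, 275, 550, and a final split flush with the right edge.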




Artificial Intelligence-Based Analysis
In artificial intelligence, among the artificial neural networks, the CNN, which uses the convolution operation, weight sharing, and the local connectivity principle, is considered best suited for image analysis. Two types of training were considered for the CNNs: the first set consisted of pretrained networks, and the second set consisted of customized CNNs, which were built taking different CNN architectures as the base. The CNN is well known for its use in medical image analysis; it learns the important features of an image efficiently, omitting the feature engineering step used in a typical machine learning approach. The most important layer in a CNN is the convolutional layer, which applies convolution over the input. Let the input image be denoted by q and the kernel be denoted by r. The output indexes of the rows (m) and columns (n) of the resultant feature map will be as given in Equation (1):

$(q * r)(m, n) = \sum_{i} \sum_{j} q(m + i,\, n + j)\, r(i, j)$ (1)
When applying convolution, the information from pixels located on the outskirts is lost. As a result, padding is used for solving such issues. Depending on whether padding is used or not, the types of padding are valid, meaning that no border is used around the input, and same, wherein a border is used around the input. When same padding is applied with filter dimension r, the padding p must satisfy Equation (2):

$p = (r - 1)/2$ (2)
After the padding, the dimension of the output feature map ($d_{out}$) or output image can be calculated using Equation (3) as follows:

$d_{out} = \lfloor (d_{in} + 2p - r)/s \rfloor + 1$ (3)

where s is the stride, and $d_{in}$ is the dimension of the input feature map. Let us consider that there are $n_r$ filters and the number of channels for the image is $n_c$. The dimension of the whole output can be calculated using Equation (4):

$d_{out} \times d_{out} \times n_r$ (4)
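As a quick sanity check of Equations (3) and (4), the output dimensions can be computed as follows (a minimal sketch; the function names are ours, not from the paper):

```python
def conv_output_dim(d_in, r, p, s):
    # Equation (3): spatial dimension of the output feature map
    return (d_in + 2 * p - r) // s + 1

def conv_output_shape(d_in, r, p, s, n_r):
    # Equation (4): full output shape with n_r filters
    d_out = conv_output_dim(d_in, r, p, s)
    return (d_out, d_out, n_r)
```

For example, a 224 × 224 input with a 3 × 3 kernel, same padding (p = 1), and stride 1 keeps its 224 × 224 spatial size, while a 299 × 299 input with a 3 × 3 kernel, no padding, and stride 2 shrinks to 149 × 149.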
When applying the convolution over an input during the forward pass, we calculated an intermediate value z using Equation (5) as follows:

$z = \sum_{i} w_i x_i + b$ (5)

where $w_i$ represents the weight associated with input feature $x_i$, and b is the bias. If z, the output produced by one layer, were directly forwarded to the next layer, it would lead to linear and weak learning. As a result, for introducing non-linearity, an activation function g was applied after the convolution operation in the convolutional layer itself, using Equation (6):

$a = g(z)$ (6)
The most preferred activation function, the rectified linear unit (ReLU), is estimated as given in Equation (7):

$\mathrm{ReLU}(z) = \max(0, z)$ (7)

When performing convolution over an input, not all the outputs produced were important. In addition, the size of the output increased with the increase in filters. Therefore, it was required to downsample the output produced by the convolutional layer. The downsampling was carried out using either a max-pooling or an average-pooling operation. When using a CNN, convolutional layers with activations followed by pooling layers were applied multiple times as per the requirement. All these layers followed local connectivity. However, for the output of the network to be obtained, all the features must be aggregated, which created the requirement of global connectivity. Therefore, fully connected layers were attached towards the end of the CNN. The fully connected layer works on the multilayer perceptron principle, obeying global connectivity among the nodes. Finally, a softmax activation function was used in the output layer to determine the probabilities of each category, using Equation (8):

$\mathrm{softmax}(z)_i = e^{z_i} / \sum_{j} e^{z_j}$ (8)

Here, the standard exponential function was applied to each element $z_i$ of the input vector z and normalized by dividing by the sum of all the exponentials.
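The building blocks described above, i.e., valid convolution, ReLU, and softmax, can be sketched in plain Python (an illustrative toy implementation, not the Keras code used in the study):

```python
import math

def relu(z):
    # Equation (7): rectified linear unit
    return max(0.0, z)

def softmax(z):
    # Equation (8): exponentiate each element, then normalize by the sum
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [v / total for v in exps]

def conv2d_valid(q, r):
    # Equation (1) with 'valid' padding: slide kernel r over image q,
    # summing elementwise products at every position (m, n)
    kh, kw = len(r), len(r[0])
    oh = len(q) - kh + 1
    ow = len(q[0]) - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    for m in range(oh):
        for n in range(ow):
            out[m][n] = sum(q[m + i][n + j] * r[i][j]
                            for i in range(kh) for j in range(kw))
    return out
```

For instance, convolving a 3 × 3 image with a 2 × 2 kernel yields a 2 × 2 feature map, consistent with Equation (3) for p = 0 and s = 1.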

Data Divisibility
To carry out the image analysis for normal vs. abnormal classification, we obtained 303,012 normal patches and approximately 1,000,000 abnormal patches after all preprocessing steps were completed. To obtain unbiased performance estimates, we performed a fivefold cross-validation study [23] for each pretrained and customized model, and the standard deviation (SD) was also reported. During each round of cross-validation, the dataset was randomly divided into a training set and a testing set in the ratio of 80:20, wherein the training set was used for model derivation and the testing set was used for model evaluation.
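One round of the random 80:20 division described above can be sketched as follows (a minimal illustration with a hypothetical helper name; the paper does not specify its splitting code):

```python
import random

def split_80_20(indices, seed=0):
    # Hypothetical helper: shuffle the sample indices, then cut at 80%
    # to form one train/test round of the cross-validation study.
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]
```

Repeating this with a different random shuffle per round gives the five train/test partitions used for reporting the mean and SD of each metric.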

Transfer Learning Using Pretrained CNN Architectures
In order to compare transfer learning with a different-domain dataset (ImageNet) against training the models from scratch using our own dataset, we considered different popular pretrained CNNs such as VGG, ResNet, Inception, and IR-v2 for the analysis. In the case of the pretrained CNN architectures, the ImageNet weights were used; therefore, only the last layer was fine-tuned with the considered CRC dataset. While using deep CNNs, we needed to find the most suitable model for the analysis of histopathology images of the colon. Therefore, different performance metrics [24] such as recall (sensitivity), specificity, precision, accuracy, F-score, and the Matthews correlation coefficient (MCC) [25] were considered to evaluate the different CNNs. Moreover, the receiver operating characteristic (ROC) curve [26] was plotted showing the AUC, and the average precision (AP) [27] was also calculated.
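The metrics listed above all derive from the binary confusion matrix; the following sketch shows the standard definitions (an illustrative helper of ours, not code from the paper):

```python
import math

def metrics(tp, fp, tn, fn):
    # Standard metrics from a binary confusion matrix:
    # tp/fp/tn/fn = true/false positives and negatives.
    recall = tp / (tp + fn)               # sensitivity
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * recall / (precision + recall)
    mcc = ((tp * tn - fp * fn) /
           math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"recall": recall, "specificity": specificity,
            "precision": precision, "accuracy": accuracy,
            "f_score": f_score, "mcc": mcc}
```

For example, 40 true positives, 10 false positives, 40 true negatives, and 10 false negatives give recall, precision, and F-score of 0.8 each and an MCC of 0.6.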

Deep Learning Using Customized CNN Architectures
Instead of using the ImageNet weights, we built models from scratch using Keras and trained them from scratch; the built and compiled models learned only the patterns of the histopathology images, and the weights in all layers were updated during learning. The structures of popular models, such as the five blocks in VGG16; the 1 × 1 and 3 × 3 convolutions in the Inception family architectures; the skip connections in the ResNet50 architecture; and the 1 × 1, 3 × 3, and 7 × 7 factorizations with skip connections in IR-v2, were used as the base.

Deep Learning Using Variants of Customized Inception-ResNet-v2
On the basis of the modifications made in the IR-v2 model, we discuss five types of customized IR-v2 models in this study. In Type 1 of IR-v2, the default configuration of the network was used. In Type 2, the originally used numbers of linear filters in the Inception-A (384), Inception-B (1154), and Inception-C (2048) blocks were reduced to 128 in every block; moreover, the numbers of filters were also reduced to 128 in every convolutional layer defined in the reduction modules. In addition to the configuration of Type 2, the number of Inception-B modules was reduced from 10 to 5 in the network used in Type 3. In the network used in Type 4, the activation function in the output layer was changed from softmax to sigmoid, keeping the configurations the same as in Type 3. Finally, in the case of Type 5, the numbers of linear filters in the Inception-A, Inception-B, and Inception-C blocks were reduced to 128, 512, and 512, respectively.
The detailed structure of the best-performing model, IR-v2 Type 5, is shown in Figure 3, wherein the considered numbers of filters, stride, and pooling are mentioned for the structures of the Stem in Figure 3a, the Inception-A block in Figure 3b, the Inception-B block in Figure 3c, the 35 × 35 to 17 × 17 reduction module A in Figure 3d, the Inception-C block in Figure 3e, and finally the 17 × 17 to 8 × 8 reduction module B in Figure 3f. In addition to changing the numbers of filters and layers, modifications were made in the number of modules, such as reducing the number of Inception-B modules from 10 to 7.

Localization
When considering the automation of WSI analysis, we found that patch-level classification was not enough to assist the pathologists. To reduce their burden, the abnormal tissues must be exactly localized in the WSI. Therefore, the template matching algorithm was used to exactly localize the abnormal splits in the WSI. Template matching is a popular digital image processing technique used for matching a small part of an image, referred to as a template (T), to a source image (I). As shown in Figure 4, the template matching in this work consisted of two phases: in the first phase, only the abnormal patches formed from the splits were localized in the respective splits of varying dimensions; in the second phase, the splits were localized in the WSI, showing the abnormal regions in the WSI of colon tissue. The localization method used in this work is the normalized correlation method [28], calculated using Equation (9):

$R(x, y) = \dfrac{\sum_{x', y'} T(x', y')\, I(x + x', y + y')}{\sqrt{\sum_{x', y'} T(x', y')^{2} \cdot \sum_{x', y'} I(x + x', y + y')^{2}}}$ (9)
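A minimal sketch of template matching by normalized correlation follows; it assumes the standard normalized cross-correlation score (our reading of the method cited as [28]), single-channel images as nested lists, and function names of our own choosing:

```python
import math

def normalized_correlation(I, T, x, y):
    # Score of template T against source image I at offset (x, y):
    # correlation normalized by the energies of both windows,
    # so an exact match scores 1.0.
    th, tw = len(T), len(T[0])
    num = sum_t2 = sum_i2 = 0.0
    for u in range(th):
        for v in range(tw):
            i_val = I[y + u][x + v]
            t_val = T[u][v]
            num += t_val * i_val
            sum_t2 += t_val * t_val
            sum_i2 += i_val * i_val
    return num / math.sqrt(sum_t2 * sum_i2)

def match_template(I, T):
    # Slide T over every valid offset in I; return the best offset and score.
    th, tw = len(T), len(T[0])
    best, best_score = (0, 0), -1.0
    for y in range(len(I) - th + 1):
        for x in range(len(I[0]) - tw + 1):
            score = normalized_correlation(I, T, x, y)
            if score > best_score:
                best_score, best = score, (x, y)
    return best, best_score
```

In the two-phase scheme of Figure 4, this matching would first place each abnormal patch within its split, then place the split within the WSI.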

Figure 5 illustrates the proposed work: the patches generated after preprocessing were used for differentiating the abnormal tissues, wherein both pretrained models and customized models were used individually. Finally, the abnormal tissues differentiated by the model were ultimately localized in the WSI using the template matching algorithm.


Implementation Environment
All the classification and localization models were implemented using TensorFlow 1.14 [29], computed on a Linux OS machine with TITAN RTX 24 GB × 4 GPUs and Intel® Xeon® Scalable Processors, 3 UPI up to 10

Transfer Learning Using Pretrained CNN Architectures
When using the ImageNet weights and fine-tuning the last layer of the considered pretrained CNN architectures, we trained the models with a learning rate of 0.0001, a batch size of 256, and 20,000 iterations. Validation was performed after every 400 iterations. The training times for the models were similar, approximately 2 h ± 15 min per 400 iterations, i.e., approximately 20 s per iteration; similarly, the validation time was approximately 3 min. The results for the different metrics with SD, obtained after performing a fivefold cross-validation study, are shown in Table 1. On the basis of the values obtained by the CNN architectures, we observed that VGG16 had the highest sensitivity (0.99 ± 0.012); however, its specificity was lower, which resulted in a lower AUC (0.96), as shown in Figure 6a. A similar AUC can be seen in the case of the IR-v2 model, as shown in Figure 6f. On the other hand, ResNet50 and the Inception family performed with an AUC of 0.97, as shown in Figure 6b-e. In general, as per Table 1 and Figure 6, all the models showed similar performances. As a result, further studies were conducted to verify whether the performances of the models could be improved when trained from scratch with a same-domain dataset.



Deep Learning Using Customized CNN Architectures
When considering the creation, compilation, and training of models from scratch, we trained all the customized models with an initial learning rate of 0.0008, a batch size of 128, 50 epochs, and the Adam optimizer. Moreover, the input size was 224 × 224 for VGG16, ResNet50, and GoogLeNet, and 299 × 299 for Inception-v3, Inception-v4, and IR-v2. All the customized models took approximately 5 days when trained from scratch. The performances of the different customized models are presented in Table 2. In terms of the F-score, VGG16 achieved 0.89 ± 0.002, a reduction of 0.06 compared to the F-score achieved in Table 1 (0.95 ± 0.00). Similar observations were made for all other models (except IR-v2), wherein the F-score was reduced by 0.17 (ResNet50), 0.10 (GoogLeNet), 0.03 (Inception-v3), and 0.12 (Inception-v4) when the models were trained from scratch. The declines in performance were further confirmed by the ROC curves for the different customized models, shown in Figure 7. On the basis of these observations, we found that the performance of the customized VGG16 decreased, which is reflected in the AUC achieved in Figure 7a, and the AP also decreased from 0.91 ± 0.003 (Table 1) to 0.89 ± 0.013 (Table 2). Similar observations were made for the other models, where the AUC and AP were reduced to 0.78 (Figure 7b) and 0.71 ± 0.017 (Table 2), respectively, for ResNet50. Moreover, the AUC and AP were respectively reduced to 0.73 (Figure 7c) and 0.74 ± 0.013 (Table 2) for GoogLeNet, 0.94 (Figure 7d) and 0.89 ± 0.019 (Table 2) for Inception-v3, and 0.82 (Figure 7e) and 0.74 ± 0.018 (Table 2) for Inception-v4.


Deep Learning Using Variants of Customized Inception-ResNet-v2
Among all the customized CNNs, as observed from Table 2 and Figure 7, the customized IR-v2 performed better than all the other models in terms of accuracy, sensitivity, F-score, etc. Therefore, for further analysis, IR-v2 was considered as the base model, and several modifications were made, such as changing the input image size, the number of hidden layers, and the number of filters in the hidden layers, as already elucidated in Section 3.4. The training parameters for the models are given in Table 3. Among all the variants of customized IR-v2, IR-v2 Type 5 performed better than the other types, as observed from the values of the different performance metrics presented in Table 4. The model achieved an F-score of 0.99 ± 0.005 in a fivefold cross-validation study [23], with minimum SD, indicating no overfitting in the model derivation. The better performance of IR-v2 Type 5 is further supported by the ROC curves plotted in Figure 8e, which show an AUC of 0.99, followed by the next best performing model, IR-v2 Type 1, with an AUC of 0.98, as observed in Figure 8a. The remaining models achieved AUCs of 0.88 (Figure 8b), 0.89 (Figure 8c), and 0.97 (Figure 8d) for Type 2, Type 3, and Type 4, respectively.
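The fivefold cross-validation protocol behind the reported mean ± SD F-score can be sketched as follows; `LogisticRegression` and the synthetic data are stand-ins for the actual IR-v2 Type 5 model and patch dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Stand-in data: replace with normal/abnormal patch features and labels
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                          random_state=0).split(X, y):
    # Each fold trains on 4/5 of the data and validates on the held-out 1/5
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[val_idx], clf.predict(X[val_idx])))

mean_f1, sd_f1 = np.mean(scores), np.std(scores)  # report as mean ± SD
```

A small SD across folds, as reported for IR-v2 Type 5, indicates that the model's performance does not depend on any particular train/validation split, supporting the no-overfitting claim.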

Classification Results
To verify that the performance of IR-v2 Type 5 is better than that of Inception-v3, the best performing model among the pretrained CNNs, we trained the former from scratch, while the latter used ImageNet weights for transfer learning. Some of the outputs produced by both models are displayed in Figure 9. The first column of Figure 9 shows the input images, wherein the first image in column 1 is a normal patch and the remaining three images are abnormal patches. The outputs produced by the pretrained Inception-v3 and the customized IR-v2 Type 5 are shown in column 2 and column 3, respectively. Here, Inception-v3 misclassified two abnormal patches (rows 3 and 4) as normal, whereas IR-v2 Type 5 correctly classified both patches as abnormal.
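The transfer-learning recipe used for the pretrained model keeps the ImageNet-trained convolutional backbone frozen and trains only a new classification head on the target-domain patches. A schematic sketch of that recipe, with a frozen random feature map standing in for the pretrained backbone and a logistic-regression head standing in for the new dense layers (all names and data here are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for the frozen ImageNet backbone: a fixed, non-trainable feature map
W_frozen = rng.normal(size=(64, 32))
def backbone(x):
    return np.maximum(x @ W_frozen, 0.0)  # fixed features with ReLU nonlinearity

# Stand-in patch data: 64-dim inputs with a binary normal/abnormal label
X = rng.normal(size=(300, 64))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Only the new head is trained, mirroring fine-tuning on target-domain patches;
# the backbone weights never change during this step
head = LogisticRegression(max_iter=1000).fit(backbone(X), y)
acc = head.score(backbone(X), y)
```

The design choice is the same as in the paper's pretrained pipeline: the generic low-level features transfer across domains, so only the task-specific decision layer needs target-domain labels.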


Localization Results
The localization results of abnormal tissues in WSI are shown in Figure 10, where column 1 shows the digitally scanned original WSIs of different subjects, column 2 shows the annotated ground truths, and column 3 shows the abnormal tissue regions localized after classification by IR-v2 Type 5. It can be observed that IR-v2 Type 5 can accurately localize the abnormal tissues in the WSI, thereby minimizing the burden on pathologists during tissue examination.
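Localization of this kind can be obtained by tiling the WSI into patches, classifying each patch, and writing the decisions back into a mask at the corresponding positions. A minimal numpy sketch, where the threshold classifier is a stand-in for IR-v2 Type 5:

```python
import numpy as np

def localize(wsi, patch, classify):
    """Tile `wsi` into non-overlapping patch x patch tiles and mark the
    tiles that `classify` flags as abnormal in a binary mask."""
    mask = np.zeros_like(wsi, dtype=bool)
    h, w = wsi.shape
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            if classify(wsi[r:r + patch, c:c + patch]):
                mask[r:r + patch, c:c + patch] = True
    return mask

# Illustrative use: a dark slide with one bright "abnormal" quadrant
wsi = np.zeros((8, 8))
wsi[0:4, 4:8] = 1.0
mask = localize(wsi, 4, lambda p: p.mean() > 0.5)
```

Overlaying such a mask on the original slide yields the column-3 visualizations, and the patch size sets the spatial resolution of the localization.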


Figure 10. Comparison of ground truths vs. the outputs produced by IR-v2 Type 5: column 1: original WSIs; column 2: corresponding ground truths; column 3: corresponding output for abnormal tissues localized in WSIs.

Comparison with Previous Works in the Same Domain
When considering the analysis of different cancers and diseases such as kidney disease [30], brain tumor, prostate cancer, and colon cancer [31], researchers have carried out several studies on automatic and semi-automatic analysis using AI [32]. In particular, for histology image analysis, research has been conducted not only on colorectal cancer but also on other types of cancer [33], such as breast cancer, skin cancer [34], and renal cancer [35]. Considering manual feature extraction in the histology image analysis of CRC, the authors in [36] performed ML-based differentiation of colorectal tissue with and without adenocarcinoma using quasi-supervised learning. Statistical and texture features were extracted from 230 images of colorectal tissue, each 4080 × 3720 pixels in size. The accuracy achieved by the model for binary classification was 95%. In [37], the authors considered 20 WSIs of CRC tissue, generating patches of 150 × 150 pixels for training and 5000 × 5000 pixels for testing. The patches were used to extract different statistical and texture features and to train ML models for eight-class classification. The authors used one-nearest-neighbor, support vector machine (SVM), and decision tree classifiers, and the best accuracy achieved was 87.4%. However, in such ML analysis the features are handcrafted; manual feature extraction is a tedious job, and unknown but important features might be overlooked.
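The handcrafted pipeline described above — statistical and texture features feeding a classical classifier such as an SVM — can be sketched as follows. The horizontal co-occurrence contrast/energy here is a deliberately simplified stand-in for the full texture feature sets used in [36,37], and the smooth-vs-noisy patches are synthetic:

```python
import numpy as np
from sklearn.svm import SVC

def texture_features(img, levels=8):
    """Mean/std plus contrast and energy from a horizontal gray-level
    co-occurrence matrix (a simplified GLCM)."""
    top = img.max()
    q = (img / top * (levels - 1)).astype(int) if top > 0 else np.zeros(img.shape, int)
    glcm = np.zeros((levels, levels))
    for a, b in zip(q[:, :-1].ravel(), q[:, 1:].ravel()):
        glcm[a, b] += 1            # count horizontally adjacent gray-level pairs
    glcm /= glcm.sum()             # normalize to joint probabilities
    i, j = np.indices(glcm.shape)
    contrast = ((i - j) ** 2 * glcm).sum()
    energy = (glcm ** 2).sum()
    return np.array([img.mean(), img.std(), contrast, energy])

# Illustrative use: separate smooth patches from noisy ones with an SVM
rng = np.random.default_rng(0)
smooth = [np.full((16, 16), 0.5) + rng.normal(0, 0.01, (16, 16)) for _ in range(20)]
noisy = [rng.random((16, 16)) for _ in range(20)]
X = np.array([texture_features(p) for p in smooth + noisy])
y = np.array([0] * 20 + [1] * 20)
clf = SVC().fit(X, y)
acc = clf.score(X, y)
```

Every feature in this pipeline has to be designed by hand, which is exactly the limitation that motivated the shift to DL-based automatic feature extraction discussed next.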
Therefore, from late 2016 onwards, many published works focused on the use of DL for image analysis [38]. One such work [39] used a CNN to extract features automatically, and the extracted features were used to classify breast and colon cancer into benign and malignant tumors. The CNN consisted of five layers with an architecture similar to LeNet and achieved 99.74% accuracy for binary classification. Focusing on colorectal histology image analysis, a binary classification model was proposed using VGG as the base model. The work used 28 normal and 29 tumor images, cropped into 6806 normal and 3474 tumor patches, achieving sensitivity, specificity, and accuracy of 95.1%, 92.76%, and 93.48%, respectively. The best modified model correctly classified 294 out of 309 normal images and 667 out of 719 tumor images [40]. Similarly, the work in [41] used DL to omit feature engineering, classifying colorectal cancer into benign or malignant on the basis of tumor differentiation and classifying tumors in brain and colorectal tissues into normal and abnormal; considering 717 patches and the AlexNet architecture, it achieved 97.5% classification accuracy. However, the authors used an SVM classifier instead of softmax for the final classification.
A different contribution was made in [42], wherein the authors attempted to predict 5-year disease-free survival (DFS) in patients with CRC. The work used VGG16 for feature extraction and a long short-term memory network for predicting the 5-year survival probability. However, it achieved an AUC of only 0.69 when performing DFS prediction directly from the image. Recently, the authors in [43] trained CNNs and recurrent neural networks on WSIs of stomach and colon for multiclass classification with three categories, namely, adenoma, adenocarcinoma, and non-neoplastic. They achieved AUCs up to 0.99 and 0.97 for gastric adenoma and adenocarcinoma, respectively; for colonic adenoma and adenocarcinoma, AUCs of 0.99 and 0.96 were achieved, respectively. Considering 170,099 patches obtained from around 14,680 WSIs of more than 9631 subjects, a first large-scale generalizable AI system was developed in [44]. The system used a novel patch aggregation strategy for CRC diagnosis using weakly labeled WSIs, with Inception-v3 as the architecture and weights initialized via transfer learning. The AI system generated output in the form of a heatmap highlighting cancerous tissues/cells in the WSI. Considering the various works proposed for the histopathological image analysis of colon cancer, we present a summary table (Table 5) showing the performances of different works in terms of one or more metrics such as accuracy, sensitivity, and AUC.

Comparison of Different CNN Architectures Taking Public Dataset
As presented in Table 5, a recent study [12] used TL with EfficientNet for the classification of breast cancer images and achieved an accuracy of 98.33%. Furthermore, another study [44] used TL with Inception-v3 and achieved an AUC of 0.988 for CRC classification. The achievements of both works were comparable to the performance of the IR-v2 Type 5 model. As a result, a public dataset [45] was used to compare the performances of pretrained Inception-v3 and EfficientNet with IR-v2 Type 5 to verify the robustness of the latter. The validation dataset consisted of nine classes; however, only 741 normal images and 1233 tumor (abnormal) images were used. The performance metrics of the different models are shown in Table 6, wherein IR-v2 Type 5 performed better, with an accuracy of 90% and an F-score of 91%.

Table 5. Comparison of previous methods with the proposed IR-v2 Type 5 considering colon cancer histopathological WSI.

Reference  Method                        Results
[32]       ML-based feature extraction   Accuracy: 98.07%
[36]       Quasi-supervised learning     Accuracy: 95%
[37]       Multi-class texture analysis  Accuracy: 84%

Furthermore, the confusion matrices for the models are presented in Figure 11, which show that IR-v2 Type 5 identified the tumor images more correctly than the other models, as illustrated in Figure 12, wherein the outputs produced by the different models are presented. The first column represents the input image, while the second, third, and fourth columns contain the classified outputs produced by EfficientNet, Inception-v3, and IR-v2 Type 5, respectively. On the basis of Figure 12, all three models correctly classified the input images presented in the first and second rows. However, both EfficientNet and Inception-v3 failed to correctly classify the input images given in the third and fourth rows, whereas IR-v2 Type 5 classified both correctly. In the case of the input image given in row 5, all models incorrectly classified the normal image as abnormal. The results in Table 6, the confusion matrices in Figure 11, and the outputs in Figure 12 show that IR-v2 Type 5 also performed better when the public dataset was used.
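The metrics discussed above — accuracy, F-score, and the confusion matrices of Figure 11 — all derive from the four counts of true/false positives and negatives. A minimal numpy sketch on a toy label vector:

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F-score from a 2x2 confusion matrix."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))  # abnormal called abnormal
    tn = np.sum((y_true == 0) & (y_pred == 0))  # normal called normal
    fp = np.sum((y_true == 0) & (y_pred == 1))  # normal called abnormal
    fn = np.sum((y_true == 1) & (y_pred == 0))  # abnormal called normal
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

# Illustrative use on toy labels (1 = abnormal, 0 = normal)
acc, prec, rec, f1 = binary_metrics([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

Because the F-score is the harmonic mean of precision and recall, it penalizes the missed-tumor (false-negative) errors that matter most clinically, which is why it is reported alongside plain accuracy throughout this work.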

Processing Time and Visualization of Intermediate Outputs
When considering a model for medical image analysis, we need to ensure not only that the model produces better output but also that it produces that output quickly. Therefore, Table 7 shows the time details related to model training for 50 epochs, single-epoch processing time, and individual image testing time on our own dataset. The derived IR-v2 Type 5 model could classify each image within 0.58 s, which is significantly fast. This could be further improved using a GPU with a more powerful configuration than the one currently used. In addition, Figure 13 shows the outputs of the learned features in different blocks of the IR-v2 Type 5 model: initially, the model learned low-level features such as edges and colors; gradually, it captured more complex features, which became less interpretable and more abstract as learning moved towards the output layers.
Figure 12. Comparison of outputs generated by the models: column 1: input image; column 2: classified outputs produced by EfficientNet; column 3: classified outputs produced by Inception-v3; and column 4: classified outputs produced by IR-v2 Type 5.
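Per-image inference times such as the 0.58 s reported above can be measured by averaging wall-clock time over repeated predictions; a sketch using `time.perf_counter`, with a trivial stand-in `predict` function in place of IR-v2 Type 5 inference:

```python
import time

def predict(image):
    # Stand-in for IR-v2 Type 5 inference on one patch/image
    return sum(image) > 0

def mean_inference_time(images, n_warmup=1):
    # Warm-up runs are discarded so one-off setup cost does not skew the mean
    for img in images[:n_warmup]:
        predict(img)
    start = time.perf_counter()
    for img in images:
        predict(img)
    return (time.perf_counter() - start) / len(images)

avg_s = mean_inference_time([[0.1] * 100 for _ in range(50)])
```

Averaging over many images after a warm-up run matters in practice, since the first prediction typically pays one-time costs (model loading, GPU kernel compilation) that do not reflect steady-state throughput.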

Considering the different works discussed above, this paper presented a comparative study of different pretrained CNNs and customized CNNs. Moreover, this work mainly focused on developing a model using a customized CNN architecture that performs well on digital histopathology WSI data, especially CRC.
Consequently, the IR-v2 Type 5 model developed through this study could be deployed in different hospitals for the automatic classification and localization of abnormal tissues in WSIs, assisting pathologists in making faster and more accurate decisions and ultimately expediting the treatment and therapy of CRC patients. In addition, the model could be applied to decisions on other cancers by fine-tuning it with samples of other types of cancer images.

Conclusions
In this paper, the IR-v2 Type 5 model was designed for distinguishing the normal and abnormal patches obtained from the WSIs of colon cancer patients. Moreover, the abnormal regions were localized in the WSI. It was observed that, among the pretrained CNN architectures, Inception-v3 performed better than the other models, with an F-score of 0.97. However, when the different models were customized in terms of the number of filters, input image size, and/or number of hidden layers, and trained from scratch, the IR-v2 Type 5 model achieved an F-score and AUC of 0.99. With the automatic classification and localization of abnormal tissues in WSI, the workload of pathologists would be reduced and faster decisions on treatment procedures could be made. However, there are some limitations to our study: the population considered represents only a specific region, and the slides used for collecting the WSIs belonged to a single hospital. As a result, we plan a multi-institutional study, wherein data would be collected from different hospitals and analyzed to improve the robustness of the derived IR-v2 Type 5 model. Furthermore, future work includes extending the designed model to not only detect and localize the abnormal regions in the WSI but also determine primary tumor growth for determining the stages of CRC.