Deep Learning Approaches to Colorectal Cancer Diagnosis: A Review

Abstract: Unprecedented breakthroughs in the development of graphical processing systems have led to great potential for deep learning (DL) algorithms in analyzing visual anatomy from high-resolution medical images. Recently, in digital pathology, the use of DL technologies has drawn a substantial amount of attention for use in the effective diagnosis of various cancer types, especially colorectal cancer (CRC), which is regarded as one of the dominant causes of cancer-related deaths worldwide. This review provides an in-depth perspective on recently published research articles on DL-based CRC diagnosis and prognosis. Overall, we provide a retrospective synopsis of simple image-processing-based and machine learning (ML)-based computer-aided diagnosis (CAD) systems, followed by a comprehensive appraisal of use cases with different types of state-of-the-art DL algorithms for detecting malignancies. We first list multiple standardized and publicly available CRC datasets from two imaging types: colonoscopy and histopathology. Secondly, we categorize the studies based on the different types of CRC detected (tumor tissue, microsatellite instability, and polyps), and we assess the data preprocessing steps and the adopted DL architectures before presenting the optimum diagnostic results. CRC diagnosis with DL algorithms is still in the preclinical phase, and therefore, we point out some open issues and provide some insights into the practicability and development of robust diagnostic systems in future health care and oncology.


Introduction
Global cancer statistics from 2018 show that colorectal cancer (CRC) ranks third in incidence, after lung cancer and breast cancer, and that it accounts for approximately 10% of the total annual cancer cases worldwide among both men and women [1]. Although people aged 65 years and above are most commonly affected by this disease, the risk in younger patients is also significant, with heredity being the largest risk factor (35%), followed by other factors such as obesity, poor nutritional habits, and smoking [2]. These rates show no sign of declining; rather, they are expected to increase by more than 60% in the next decade, with more than two million new diagnoses and over a million deaths [3]. There is therefore a need for an optimal diagnosis strategy for the early and precise detection of CRC.
Because routine screening is an important step in reducing the mortality rate of this disease, colonoscopy (an endoscopic method) is considered the primary and most straightforward clinical diagnosis method for CRC [4]. Aside from this method, medical imaging techniques such as CT colonography, a complementary imaging method for polyp detection in CRC, and the histological evaluation of hematoxylin and eosin (H&E) slides remain indispensable for the fine-grained inspection of CRC. Manual observation of these imaging modalities by individual pathologists has long been the norm, but it has recently come to be regarded as a traditional, unsophisticated approach that is highly labor-intensive and time-consuming. In addition, inter-observer variation can be significant during pathological diagnosis, resulting in biased typing and grading of tumors [5]. Therefore, more standardized and automated techniques based on computer-aided diagnosis (CAD) have recently gained considerable interest and demand.
Many CAD systems have been utilized in mainstream radiology to assist physicians, from improved chest X-ray and mammography applications in the 1960s to the early diagnosis of cancers in the 2000s. Considering the medical and economic burdens caused by the prognosis and treatment of CRC-related diseases, researchers have been focusing on developing CAD systems for the early and effective diagnosis of CRC. CAD systems for CRC have evolved from conventional models that require complex a priori knowledge of mathematics [6][7][8] to advanced machine learning (ML)-based systems [9][10][11][12], some of which are reported to perform beyond human levels of accuracy.
Although cancer diagnosis with deep learning (DL) has been a very popular subject in the medical imaging domain, comprehensive literature reviews covering the various aspects of CRC diagnosis and prognosis with state-of-the-art DL schemes are still limited. The existing reviews do not survey the available standard CRC imaging datasets. In addition, a substantial body of novel research and findings on DL-based CRC diagnosis has appeared within a short period of time. These state-of-the-art findings need to be reviewed in terms of the data preprocessing strategies they adopt, and a methodology needs to be laid out to assist upcoming researchers and scholars in this field. This review paper therefore intends to fill this gap in four ways. First, it provides a brief retrospective overview of conventional CAD systems based on simple image processing and ML-based approaches to CRC diagnosis. Secondly, we identify and list some of the publicly available imaging datasets collected and archived from various independent sources, which are standardized for DL-based CRC diagnosis. Thirdly, we systematically categorize and highlight the latest studies on DL-based detection, diagnosis, and prognosis regarding different types of CRC, including tumors, microsatellite instability (MSI), and polyps. Lastly, we outline some open issues observed in this area of research and speculate on future studies regarding the optimization of diagnostic accuracy, such that it is practical and suitable for use in the clinical domain.
This paper is organized as follows. The next section gives a brief account of CAD approaches based on simple image processing techniques and ML-based techniques. Section 3 presents a detailed discussion of recently published research into CRC diagnosis, organized into categories that each highlight key contributions to data preprocessing, model architectures, and the optimal results obtained. Finally, a discussion and conclusion are presented in Sections 4 and 5, respectively.

Simple Image-Processing-Based and Machine-Learning-Based CAD Approaches
Conventional CAD systems built on image processing techniques have been used in diagnosing CRC for a few decades. Table 1 gives a brief overview of conventional studies on simple image processing and ML-based CAD systems for diagnosing CRC. CAD systems based on simple image processing methods rely on mathematical models explicitly defined by human rules for processing images from one modality to another, and they require case-by-case tuning of model parameters for optimal performance. Specifically, diagnoses are mainly based on feature engineering methods, where features are extracted from either vessel-structure or textural analysis of the image patches using a local binary pattern (LBP) [13]. Relying solely on simple image processing algorithms, these CAD systems [6][7][8] can classify tumor regions or identify various traits of malignant tissues in CRC. Although these systems have been a part of digital pathology for the clinical diagnosis of CRC, they are application-specific, heuristic approaches that require strong domain expertise and depend upon the unique characteristics of the imaging type involved.
With technological advancements in the field of artificial intelligence (AI), a computer can mimic cognitive functions to solve real-world problems by learning on its own. Within AI, ML, which allows computers to learn from real-world data without being explicitly programmed, has been applied extensively in the medical imaging domain. For the clinical diagnosis of CRC patients in particular, several research works based on ML approaches [9][10][11][12] have been conducted. These techniques rely on the handcrafted extraction of predefined morphological features that capture the shape, color, and textural information of the image data. Such features are usually extracted with procedures such as LBP, wavelet transforms, gray-level co-occurrence matrices, the scale-invariant feature transform (SIFT), and the histogram of oriented gradients (HoG). The extracted features are then passed to ML-based classification algorithms that include, but are not limited to, support vector machines, k-nearest neighbors, logistic regression, decision trees, and m-Gaussian mixture models. Although these techniques have been introduced into the medical diagnosis of CRC patients, they are limited by their fragile feature extraction procedures. Moreover, such ML algorithms struggle to produce unbiased representations of large amounts of data, which makes them highly susceptible to overfitting and errors.
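To make the notion of a handcrafted texture feature concrete, the following is a minimal sketch of the basic 8-neighbor LBP descriptor mentioned above. It is an illustration only, not the implementation used in any cited study: the function names, the neighbor ordering, and the toy patch values are assumptions made for this example.

```python
from collections import Counter

# Offsets of the 8 neighbours, clockwise from the top-left pixel
# (the ordering is an assumption; any fixed ordering works).
NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Return the 8-bit LBP code of the interior pixel at (row, col)."""
    center = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(NEIGHBOURS):
        # A neighbour at least as bright as the center sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalized 256-bin histogram of LBP codes over all interior pixels."""
    rows, cols = len(image), len(image[0])
    counts = Counter(
        lbp_code(image, r, c)
        for r in range(1, rows - 1)
        for c in range(1, cols - 1)
    )
    total = sum(counts.values())
    return [counts.get(code, 0) / total for code in range(256)]


# Toy 4x4 grayscale patch with hypothetical intensities.
patch = [
    [10, 10, 10, 10],
    [10, 50, 50, 10],
    [10, 50, 50, 10],
    [10, 10, 10, 10],
]
features = lbp_histogram(patch)
```

The resulting 256-element vector is the kind of fixed, hand-designed representation that would then be handed to a separate classifier such as an SVM or k-nearest neighbors.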

Deep Learning-Based Studies for CRC Diagnosis
Deep learning [14] is the branch of ML in the AI paradigm that identifies trends and patterns in the data without the need for human intervention or any feature engineering methods. A DL model uses multiple hidden layers, each of which extracts an abstract representation of its input that is appropriate for a specific task. DL models have been regarded as superior to conventional ML-based techniques in the presence of large amounts of data and have been popular in multiple disciplines [15,16], including the diagnosis and prognosis of cancer in digital pathology. Figure 1 shows a procedural diagram of the working mechanisms of ML-based and DL-based CAD systems for screening CRC patients. ML relies on handcrafted feature extraction before passing features on for classification, whereas DL extracts and classifies features jointly through multiple hidden layers and activation functions. This makes DL suitable for learning more task-specific representations of large-scale image datasets; thus, it is frequently preferred over conventional ML in medical image classification and tumor detection problems.

Datasets
Data constitute an indispensable part of a DL algorithm, from which the model can learn concealed information or the underlying statistics within them. Data can be in any form, such as numbers, audio, images, and videos. Dataset preparation is a long process that includes collection, analysis and treatment, exploration, training, and testing. The data on which the model is trained must be relevant to the specific problem and must resemble real-world data as much as possible. To train a DL model, a large amount of data with substantial variability is required. With more data, a DL algorithm can generally achieve better accuracy, as the model sees an abundant number of variations and learns to recognize invariant features and discrete instances of the input samples.
In medical imaging, data are acquired for several purposes, including but not limited to disease diagnosis, therapy planning, intraoperative navigation, and biomedical research [17]. Unlike natural images, medical image data are hard to acquire due to privacy and confidentiality considerations. In addition, requirements for accurate imaging, specific contrast, minimal artifacts, and a sufficient signal-to-noise ratio make it hard to obtain the image quality required for clinical practice. In cancer diagnosis, particularly in the CRC domain, the analysis of endoscopy/colonoscopy image samples, as shown in Figure 2 (top row), has been popular in the past. Colonoscopy is considered an invasive procedure in which a thin tube is inserted along the entire length of the colon to provide an interior view of cross-sectional areas. Histopathology imaging, as shown in Figure 2 (bottom row), on the other hand, is a less invasive procedure that provides a more comprehensive view of the disease and preserves the underlying tissue architecture. Due to a lack of computational resources and the high cost of digital imaging equipment, this image modality was overlooked in the past. However, thanks to recently developed high-end computational resources, spatial analysis of histopathology imagery has become the backbone of most automated image analysis techniques, and histopathology remains the gold standard for diagnosing a considerable number of diseases, including almost all cancer types [18]. In digital pathology, tissue sections are stained with H&E to reveal cellular and tissue structural details. These H&E-stained slides are utilized to confirm the presence or absence of disease, for disease grading, and for measuring disease progression in CRC.
To this end, different types of CRC datasets belonging to either colonoscopy or histological imaging have been introduced. These images are preprocessed by applying several techniques before passing them to DL algorithms for specific tasks, such as detection, segmentation, and classification. Table 2 lists some of the popular datasets used in multiple studies based on developing DL-based CAD systems. These datasets provide comprehensive imagery of CRC tissue and tumors and entail disease-specific characteristics that are annotated by experienced pathologists.

Tumor Tissue Detection and Classification
Tumors are complex structures in a human organ that comprise multiple, distinct types of tissue. They can be interpreted as abnormal tissue composed of multiple types of cells or a matrix of cells. In CRC, the architecture of a tumor varies as it develops and is a major factor in patient prognosis [25]. Therefore, an automated and highly quantitative analysis of tumor tissue is indispensable for the clinical diagnosis of CRC. Automatic analysis of these tissue regions can help to quantify their extent, to grade tumors, and to investigate biological hypotheses based on tissue morphology. Figure 3 shows the different types of tumor tissues obtained from H&E-stained histological slides that are relevant to CRC. These tissue types, when evaluated by pathologists, are visually classified into one of eight different categories (tumor, stroma, complex, lympho, debris, mucosa, adipose, and empty). A DL-based CAD system can automatically classify these tumor regions if provided with adequate amounts of data and if trained with optimal network hyper-parameters.
Multiple studies based on DL have been conducted to accurately classify CRC tumor regions. Ponzio et al. [32] proposed a CNN framework to distinguish adenocarcinomas (a type of tumor) from healthy tissues and benign lesions. As a preprocessing step, they created a total of 13,500 image patches with dimensions of 1089 × 1096 at 40× magnification from the original H&E-stained whole slide images (WSIs). Subsequently, to compensate for color inconsistencies, the image patches were normalized based on mean and standard deviation. These preprocessed patches were fed into a CNN model with thirteen convolutional layers, five max pooling layers, and three fully connected layers (FCLs) to assign each of them unambiguously to one of three tissue subtypes, namely adenocarcinoma, tubulovillous adenoma, and healthy tissue. An initial classification accuracy of around 90% was obtained, which was improved to 96% by using a transfer learning strategy. Another study [33] developed a DL-based automated analysis of CRC image samples with the objective of improving the prognostic stratification of patients. In this study, the original H&E-stained WSIs were split into uniform tiles of 224 × 224 pixels, after which VGG16 [34] (a popular pretrained CNN model) was used to extract intermediate features from the image patches. The extracted 4096-dimensional feature vector was classified into several tumor types by using a combination of long short-term memory (LSTM) [35] with one of three classifiers: a support vector machine (SVM), logistic regression, or naïve Bayes. LSTM is a type of recurrent neural network (RNN) that is well suited for classifying, processing, and making predictions on time series data and is known for its capability of learning the long-term temporal dependencies of input data. The model's performance was assessed with several metrics; an area under the curve (AUC) of 0.69 and a hazard ratio of 2.3, reported with a 95% confidence interval (CI), were achieved. Similar to the previous study, Yue et al. [36] also used the well-known VGG16 framework with some notable changes to its architecture, where classification was carried out with a voting classifier and an SVM classifier. In this study, data preprocessing was applied to the H&E slides before passing them on for feature extraction. The steps included chromatic normalization of the image patches at 224 × 224 pixels and data augmentation to increase the number of samples for better generalization of the network. The patch-level accuracy and F1-score were found to be 70% and 0.67, respectively, while a cluster-level experiment significantly outperformed the former, with an accuracy of 100% and an F1-score of 1.
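The gap between patch-level and cluster-level results reported for [36] can be understood through vote aggregation: individual patches may be misclassified while the majority within a group is still correct. The sketch below illustrates this with simple majority voting. It is an assumption-laden toy example, not the voting scheme of [36], whose details are not given here; the group identifiers, labels, and function names are hypothetical, and every patch is assumed to inherit the label of its group.

```python
from collections import Counter, defaultdict


def aggregate_by_majority(patch_predictions):
    """Map each group id to its most frequent patch-level label.

    patch_predictions: iterable of (group_id, predicted_label) pairs.
    """
    votes = defaultdict(Counter)
    for group_id, label in patch_predictions:
        votes[group_id][label] += 1
    return {gid: counter.most_common(1)[0][0] for gid, counter in votes.items()}


def accuracy(predicted, truth):
    """Fraction of keys in `truth` whose predicted label matches."""
    correct = sum(predicted[key] == label for key, label in truth.items())
    return correct / len(truth)


# Hypothetical patch-level predictions for two slides.
patches = [
    ("slide_A", "tumor"), ("slide_A", "tumor"), ("slide_A", "stroma"),
    ("slide_B", "stroma"), ("slide_B", "tumor"), ("slide_B", "stroma"),
]
slide_truth = {"slide_A": "tumor", "slide_B": "stroma"}

# Patch-level accuracy: 4 of the 6 patches match their slide's label.
patch_correct = sum(label == slide_truth[gid] for gid, label in patches)
patch_accuracy = patch_correct / len(patches)

# Group-level accuracy after voting: both slides are labeled correctly.
slide_labels = aggregate_by_majority(patches)
slide_accuracy = accuracy(slide_labels, slide_truth)
```

In this toy case a third of the patches are wrong, yet every group is classified correctly, which is the same qualitative pattern as a modest patch-level score alongside a perfect cluster-level score.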
In DL, the accuracy of a model depends significantly on the type of feature extractor and the classification procedure [37]. Therefore, multiple studies either used a variety of popular CNN models or designed a model from scratch with carefully tuned hyper-parameters. To make use of multiple pretrained models and to evaluate their performance, Kather et al. [38] investigated whether existing pretrained CNN models could extract prognosticators directly from H&E-stained tissue slides. Human cancer tissue slides from multiple patient cohorts were used as training and testing datasets: the NCT biobank at http://dx.doi.org/10.5281/zenodo.1214456 (accessed on 18 July 2021), a DACHS study at http://dx.doi.org/10.5281/zenodo.1214456 (accessed on 18 July 2021), and a TCGA cohort at http://cancer.digitalslidearchive.net (accessed on 18 July 2021) (NCT: National Center for Tumor Diseases, DACHS: Darmkrebs Chancen der Verhütung durch Screening, and TCGA: The Cancer Genome Atlas). For data preprocessing, they created non-overlapping image patches of 224 × 224 pixels each and normalized them with the Macenko method [39]. Five pretrained models (VGG19 [34], AlexNet [40], SqueezeNet v1.1 [41], GoogLeNet [42], and ResNet-50 [43]) were used for feature extraction, while classification was carried out by replacing the classification head with a new fully connected layer. Among them, the best classification accuracy was achieved by VGG19, which was trained on a full set of 100,000 images and tested on an external test set of more than 7000 images, while the least accurate model was SqueezeNet, with a classification accuracy of less than 50%.
The study in [44] segmented and filtered out the background area of tumor images by using Otsu's thresholding [45] and labeled the tumor area with a self-developed annotation tool before passing it to the CNN model for feature extraction and classification. The authors built a new model by combining DeepLab v2 [46] and ResNet-34 [43] and compared its performance with the analyses of experienced pathologists. They found that their DL model for the diagnosis of adenoma in CRC performed comparably to the pathologists, with a slide-level accuracy of over 90% and an AUC of 0.92. Choi et al. [47] used an approach similar to the one in [44], where data preprocessing was carried out by filtering out the unnecessary black regions in the endoscopic image samples. A transfer learning approach was used in which the pretrained weights of various DL models, such as Inception-v3 [48], ResNet-50 [43], and DenseNet-161 [49], were used with 10-fold cross-validation. They evaluated performance in terms of accuracy, recall, and precision, obtaining values of 92.48%, 99.7%, and 99.2%, respectively. Similar studies [50][51][52] on tumor tissue detection are listed in Table 3.
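As an illustration of the background filtering step described for [44], the following is a minimal sketch of Otsu's thresholding [45], which picks the intensity threshold that maximizes the between-class variance of a grayscale histogram. It is a plain reimplementation under stated assumptions (8-bit integer intensities, bright background, hypothetical pixel values), not the code used in the cited study.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximizes between-class variance.

    pixels: iterable of integer intensities in [0, levels).
    Pixels <= t form one class and pixels > t form the other.
    """
    histogram = [0] * levels
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(levels):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        if weight_low == 0:
            continue  # no pixels in the low class yet
        weight_high = total - weight_low
        if weight_high == 0:
            break  # all pixels are in the low class; nothing left to split
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


# Hypothetical intensities: dark stained tissue and a bright slide background.
pixels = [20] * 5 + [30] * 5 + [220] * 5 + [230] * 5
threshold = otsu_threshold(pixels)
tissue_mask = [value <= threshold for value in pixels]
```

Because the threshold is derived from each image's own histogram, no per-slide manual tuning is needed, which is one reason the method is a common first step before tiling.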

MSI Detection
Microsatellites, which are also known as short tandem repeats (STRs), are tiny repeating stretches of DNA that are scattered across the entire genome, accounting for approximately 3% of it [55]. The MSI phenotype is one of the molecular changes that occurs in CRC, and it is also observed in other types of cancer, such as adrenocortical, rectal, colon, stomach, and endometrial tumors, and breast and prostate cancer [56]. MSI can also be described as a hyper-mutable phenotype that results from deficient mismatch repair (dMMR). Figure 4 shows MSI patches, indicated by yellow arrows, with activations around the potential patterns of infiltrating immune cells. Identifying the MSI status of CRC patients is crucial because it helps to determine the presence of related diseases such as Lynch syndrome, a highly penetrant hereditary cancer syndrome accounting for one-third of patients with MSI. Therefore, less labor-intensive and broadly accessible MSI testing tools based on DL approaches have been studied lately. These CAD systems provide an automated screening tool to triage patients when making clinical decisions, so as to identify differential treatment responses.

The authors in [57] introduced adversarial MSI-based assessment (AMIBA), a method to diagnose microsatellite instability directly from histopathological images. Histological image data with a clinically determined MSI status (MSI-H, MSI-L, and MSS) were obtained from TCGA, available at https://portal.gdc.cancer.gov/ (accessed on 21 July 2021), where MSI-H, MSI-L, and MSS denote high microsatellite instability, low microsatellite instability, and microsatellite stable, respectively. For data preprocessing, the slides were cut into non-overlapping image patches of 1000 × 1000 pixels at 20× magnification. Furthermore, patches in which more than half of the area was empty were excluded, which left a total of 620,833 patches to train the DL architectures. Multiple state-of-the-art DL architectures were used, including ResNet-18 [43], AlexNet [40], and VGG-19 [34], with weights initialized from parameters pretrained on the ImageNet dataset [58]. By using the Adam optimization algorithm with a learning rate of 0.0001, the authors obtained patch-level and slide-level accuracies of 91.7% and 98.3%, respectively.
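The tiling step described for [57], cutting a slide into non-overlapping patches and discarding those that are more than half empty, can be sketched as follows. This is a simplified illustration on a tiny grayscale grid: the exact-match background value of 255, the function name, and the toy slide are assumptions for this example, and a real pipeline would operate on large WSI arrays with a tolerance-based emptiness test.

```python
def tile_slide(slide, tile_size, background=255, max_empty_fraction=0.5):
    """Yield (top, left, tile) for non-overlapping tiles that are not mostly empty.

    slide: 2D list of grayscale intensities; `background` marks empty pixels.
    Tiles that would extend past the slide border are skipped.
    """
    rows, cols = len(slide), len(slide[0])
    for top in range(0, rows - tile_size + 1, tile_size):
        for left in range(0, cols - tile_size + 1, tile_size):
            tile = [row[left:left + tile_size] for row in slide[top:top + tile_size]]
            empty = sum(pixel == background for row in tile for pixel in row)
            # Keep the tile unless more than half of its area is empty.
            if empty / (tile_size * tile_size) <= max_empty_fraction:
                yield top, left, tile


# Hypothetical 4x4 "slide"; 255 stands for empty background.
slide = [
    [255, 255, 90, 80],
    [255, 255, 70, 60],
    [255, 40, 255, 255],
    [30, 20, 255, 10],
]
kept_positions = [(top, left) for top, left, _ in tile_slide(slide, tile_size=2)]
```

Here the fully empty top-left tile and the three-quarters-empty bottom-right tile are dropped, so only two of the four tiles would reach the network.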
Considering ways to facilitate universal MSI screening, the research in [59] studied how deep residual learning can predict the MSI status directly from H&E-stained histological slides. Multiple datasets regarding MSI status were collected from large patient cohorts in TCGA, which were manually annotated and classified to represent one tumor tissue and two non-tumor tissues (dense and loose tissue). The image slides were preprocessed to create 11,977 unique image tiles, each with a 256 µm edge length. Furthermore, to convert all the images into a reference color, a color normalization technique based on the Macenko method was used. The authors conducted initial experiments with multiple convolutional layers, from which ResNet-18 was selected as the optimum model due to its noteworthy advantages, such as a short training time, better classification performance, less risk of overfitting, and comparatively fewer training parameters. All models were pretrained on the ImageNet dataset, and only the weights of the last 10 layers were fine-tuned, while the rest were frozen. By using the Adam optimizer [60] and L2 regularization with multiple learning rates {10⁻⁵, 10⁻⁶}, they obtained an area under the curve (AUC) of 0.99. The CI for both true MSI and MSS tiles was found to be 95%.
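Because AUC is the headline metric in this and several later studies, a short sketch of how it can be computed may help. The example below uses the rank-based definition (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one); the function name `auc_score` and the toy labels and scores are assumptions for illustration and are unrelated to the data of [59].

```python
def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = MSI, 0 = MSS; scores are hypothetical model outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = auc_score(labels, scores)  # 8 of 9 pairs are ordered correctly
```

This pairwise form is quadratic in the number of samples, which is acceptable for illustration; library implementations typically use a sorted-rank formulation instead.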
Another study conducted in 2020 by Lee et al. [61] developed a two-stage DL-based classification pipeline for predicting MSI status in CRC patients. In the two-stage process, the first stage was responsible for segmenting the tumor area into two types of tissue (MSI-H and MSI-L). The second stage then classified the tissue types into their corresponding class. H&E-stained histological WSIs annotated by professional pathologists were obtained from a pathology AI platform (PAIP) at http://wisepaip.org/paip (accessed on 21 July 2021) and were preprocessed before being used as input to the DL architecture. During preprocessing, the WSIs were cropped at magnifications of 20× and 10× to obtain image patches of 224 × 224 pixels before converting the RGB images to the CIE L*a*b* color space. Other preprocessing methods, such as foreground mask extraction with Otsu's thresholding, were used to segment individual patches. Two DL models were adopted in this research: the feature pyramid network (FPN) [62] and Inception-ResNet-v2 [43], one for each stage in the classification pipeline. Multiple optimization algorithms such as Adam and RMSprop were used, with each being trained on one of two learning rate schedulers (step decay and cosine annealing) with a learning rate of 10⁻⁴. The optimum precision, recall, and F1-score were found to be 0.93, 0.93, and 0.94, respectively.
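Otsu's thresholding, mentioned above for foreground mask extraction, selects the intensity threshold that maximizes the between-class variance of the two resulting pixel groups. The sketch below is a plain-Python illustration on a flat list of 8-bit grayscale values; it is not the implementation used in [61], and the assumption that tissue is darker than the background (so that foreground means "at or below the threshold") is ours.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t maximizing between-class variance; pixels <= t form class 0."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in class 0 so far
    sum0 = 0  # intensity sum of class 0 so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue  # class 0 still empty
        if w1 == 0:
            break     # class 1 empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def foreground_mask(pixels, threshold):
    """Assumed convention: tissue is darker than background, so foreground is <= threshold."""
    return [p <= threshold for p in pixels]


# Toy bimodal patch: dark tissue values and bright background values.
patch = [10] * 50 + [12] * 50 + [200] * 50 + [205] * 50
t = otsu_threshold(patch)
mask = foreground_mask(patch, t)
```

For this toy patch the threshold falls between the two intensity clusters, so exactly the 100 dark pixels are marked as foreground.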
Similarly, to develop a DL system for detecting CRC tumor specimens with MSI, Echle et al. [63] collected H&E-stained slides from 8836 CRC tumors from the MSI-DETECT consortium (https://jnkather.github.io/msidetect/ (accessed on 22 July 2021)). All specimens belonged to a large cohort of patients from Germany, the Netherlands, the United Kingdom, and the United States, where each specimen with MSI was identified via genetic analysis. These data were preprocessed by tessellating the slides into individual square tiles of 256 µm edge length, followed by color normalization with the Macenko method, before passing them to a ShuffleNet model [64] for classification. The whole model was trained on Nvidia RTX6000 graphical processing unit (GPU) hardware with the Adam optimizer, L2 regularization, and a learning rate of 5 × 10⁻⁵. The classification results were evaluated based on several performance metrics, where the optimal values of the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), sensitivity, and specificity were recorded as 0.96, 0.9, 99%, and 98%, respectively.
Recently, several other DL-based studies for MSI detection and/or classification were conducted, which are listed in Table 4.

Polyp Detection
Most CRCs begin as a growth of tissue on the inner lining of the colon. These abnormal growths of tissue from the mucous membrane, developing over a period of time, are called polyps and are often considered a precursor to CRC. Figure 5a-c, respectively, show a polyp image from a DL-based CAD system extracted from a CRC patient, the corresponding annotations, and the detected polyps. Colonic polyps, especially those that are large or numerous, are more likely to be cancerous, and if not treated early, they could develop into colon cancer. CRC polyps can be categorized as neoplastic and non-neoplastic. The latter are typically non-cancerous, while the former can develop into cancer and can be further sub-categorized into adenomas and serrated polyps. In clinical practice, the detection of polyps is usually accomplished via colonoscopy, which is an expensive, manual, and time-consuming procedure. Frequent reviews of colonoscopy data are required, because 20% of polyps are likely to be missed during a single review. This is extremely labor intensive, and the lack of a thorough inspection of the data might result in the missed detection of polyps. Taking this into account, an automated and non-invasive procedure based on CAD has been reliable and is considered more robust for the accurate detection of polyps. Specifically, DL-based segmentation and classification algorithms have been widely applied recently to enable the routine detection of polyps in CRC diagnosis. In this regard, in [68], multiple studies related to colon cancer analysis were collected under the field of colon cancer and deep learning and then grouped into five categories: detection, classification, segmentation, survival prediction, and inflammatory bowel diseases. The systematic review in [69] addressed colorectal cancer detection and localization, along with the difficulties of a fair comparison and the reproducibility of those methods.
A conference paper published by Godkhindi et al. [70] in 2017 aimed at the automatic detection of polyps via CT colonography using DL techniques. To achieve this, a CT colonography image dataset was collected from the Cancer Imaging Archive (TCIA), available at https://www.cancerimagingarchive.net/ (accessed on 23 July 2021), containing ground truth information for segmentation purposes. In the data preprocessing steps, the authors discarded the air gap-filled regions in the colon images by using thresholding and filtering techniques. Furthermore, to label each block in the image, binary region of interest (ROI) segmentation was performed. A CNN model with three convolutional layers, three max pooling layers, and a single FCL was designed and trained using 10-fold cross-validation, which obtained classification accuracy, sensitivity, and specificity levels of 88.56%, 88.77%, and 87.35%, respectively.
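The 10-fold cross-validation protocol mentioned above can be illustrated with a small index-splitting sketch. The function name `k_fold_indices` is hypothetical, and the example omits shuffling and stratification, which are commonly added in practice; it does not reproduce the exact splits used in [70].

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train, test) index lists so that every sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # The first n_samples % k folds receive one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Toy example: 25 image blocks split into 10 folds.
folds = list(k_fold_indices(25, k=10))
```

A model would be trained ten times, once per fold, and the reported accuracy, sensitivity, and specificity would then be averaged over the ten held-out test folds.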
Another similar study [71] applied state-of-the-art DL algorithms to each colonoscopy frame of the gastrointestinal image analysis (GIANA) dataset (available at https://giana.grand-challenge.org/ (accessed on 23 July 2021)), which consisted of 18 videos collected from endoscopic results of multiple patients. For data preprocessing, the black edges of the endoscopy image frames were removed, and the images were resized to model-specific dimensions of 284 × 265 pixels before being passed for data augmentation with horizontal and vertical flips and a blur filter. This approach used ResNet-50 as a fully convolutional neural network (FCNN) to extract descriptive characteristics from the input image. The extracted features were then passed to a Faster RCNN [72] model with two FCLs, one operating as a regression layer and the other as a classification layer. After extensive experiments and evaluations, their model achieved a precision of 80.31%, recall of 75.37%, accuracy of 71.99%, and specificity of 65.70%.
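The four metrics reported above all derive from the same confusion-matrix counts. The sketch below shows the relationships; the function name `detection_metrics` and the counts in the example are hypothetical and are not the counts behind the figures reported in [71].

```python
def _ratio(num, den):
    """Safe division that returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def detection_metrics(tp, fp, tn, fn):
    """Compute common diagnostic metrics from true/false positive/negative counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # also called sensitivity
    return {
        "precision": precision,
        "recall": recall,
        "specificity": _ratio(tn, tn + fp),
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "f1": _ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts: 8 polyps found, 2 false alarms, 6 correct negatives, 4 misses.
m = detection_metrics(tp=8, fp=2, tn=6, fn=4)
```

Because precision and specificity both depend on false positives while recall depends on false negatives, a model can score well on one and poorly on another, which is why studies in this section report several of them together.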
Different levels of diagnostic accuracy can be observed by adopting different strategies for data preprocessing and by using different DL algorithms. The optimal solution for medical image diagnostics is, however, not obtained through minimal trials and testing, but through continuous and long-running research. In this context, several studies tried to overcome the shortcomings of previous research or developed a completely novel scheme for CRC diagnosis. To enhance the detection accuracy obtained by reference studies, Lee et al. [73] developed and validated a robust DL algorithm for the detection of colorectal polyps. In that study, the authors collected endoscopy data samples from the Asan Medical Center, Korea, between May 2017 and February 2018. The whole training dataset contained 8075 images from 185 colonoscopy videos of 103 patients. For validation and testing, different sets of data samples were collected from the same center. These datasets were preprocessed by storing them at a fixed resolution of 475 × 420 pixels before labeling the location and dimension of each polyp in the image with bounding boxes. This study used YOLO v2 [74], a one-shot model that detects and classifies every object present in the image, without an attention mechanism. The classification model was a fine-tuned Darknet19 model provided at https://pjreddie.com/darknet/ (accessed on 26 July 2021), which was pretrained on the ImageNet dataset. By creating B bounding boxes with a confidence score for the class probability of each box, the model achieved a sensitivity of 96.7% with a false positive rate (FPR) of 6.3%.
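Bounding-box detectors of this kind are usually scored by comparing each predicted box with the annotated box through intersection over union (IoU). The following sketch shows that computation together with a simple confidence filter; the box format `(x_min, y_min, x_max, y_max)`, the function names, and the 0.5 confidence cutoff are assumptions for illustration and are not taken from [73].

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def keep_confident(predictions, min_confidence=0.5):
    """Keep (box, confidence) pairs whose confidence reaches the assumed cutoff."""
    return [(box, conf) for box, conf in predictions if conf >= min_confidence]


# Toy example: one annotated polyp and two predicted boxes with confidences.
annotation = (0, 0, 10, 10)
predictions = [((5, 5, 15, 15), 0.9), ((40, 40, 50, 50), 0.2)]
kept = keep_confident(predictions)
overlap = iou(annotation, kept[0][0])  # 25 / 175
```

A prediction is then typically counted as a true positive when its IoU with an annotation exceeds a chosen threshold, which is how per-box outputs are turned into sensitivity and FPR figures.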
Similarly, in 2020, Poudel et al. [75] developed a classification model for use in the identification of adenomas, Crohn's disease, ulcerative colitis, and normal images by using endoscopic image samples from CRC patients. They used two datasets: the first was provided by Gill Hospital in Korea with a total of 3515 images, and the second was the publicly available KVASIR dataset [19] with 4000 endoscopy samples. Each dataset was normalized to model-specific inputs and was subjected to augmentation, which included flipping, scaling, rotation, zooming, contrast normalization, and shearing. A transfer learning approach was used with a ResNet-50 architecture as a baseline model, initialized with pretrained weights from ImageNet. An efficient dilation technique [76] was adopted to preserve the spatial information of the final layers in the network by using dilated convolution layers in ascending and descending order. The original ResNet-50 model was also modified by using DropBlock regularization [77] at deeper layers to make it robust to noise and artifacts. With extensive experiments using both datasets, the optimal values for precision, recall, and F1-score were found to be 0.932, 0.928, and 0.93, respectively.
Another study [78] created an endoscopic dataset from different sources and annotated the ground truths in collaboration with experienced gastroenterologists. Due to the severe differences among the existing datasets in terms of image resolution and color temperature (possibly due to different imaging equipment setups), the authors built a new dataset to serve as a benchmark for training and evaluating DL models for polyp detection and classification. The new dataset included multiple publicly available endoscopic datasets as well as some independently collected from the University of Kansas Medical Center. Due to the extreme imbalance among the total number of image frames in each dataset, an adaptive sampling rate was utilized to homogenize the representativeness of each polyp by extracting important frames from the video. In total, 116 training, 17 validation, and 22 testing sets were generated, comprising 28,773, 4254, and 4872 frames, respectively. Using these datasets, eight of the most popular state-of-the-art object detection models were evaluated: Faster RCNN [72], YOLOv3 [79], SSD [80], RetinaNet [81], DetNet [82], RefineDet [83], YOLOv4 [84], and ATSS [85]. Using these frameworks, three types of experiments were conducted: first, frame-based one-class polyp detection; second, frame-based two-class polyp detection; and third, sequence-based two-class polyp detection. For the two frame-based detection experiments, the performance was measured with regular object detection metrics, while for the sequence-based detection, regular object detection was applied to each frame, and a voting procedure was then applied to pick the most frequently predicted polyp class. For both the frame-based and sequence-based detection methods, RefineDet performed very well, with F1 scores of 88.6 and 86.3, respectively. Other similar studies published recently on polyp detection using endoscopy image samples are listed in Table 5.
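The voting step used for sequence-based detection can be sketched as a majority vote over per-frame predictions. This is an illustrative assumption about how such a vote might work, not the procedure of [78]: the function name `sequence_vote`, the class names, and the handling of frames without a detection (`None`) are all hypothetical.

```python
from collections import Counter


def sequence_vote(frame_predictions):
    """Return the most frequent class over a sequence; frames with no detection (None) are ignored."""
    votes = Counter(p for p in frame_predictions if p is not None)
    if not votes:
        return None  # no frame in the sequence produced a detection
    # On a tie, Counter.most_common keeps the class that was seen first.
    return votes.most_common(1)[0][0]


# Toy sequence of five frames showing the same polyp.
frames = ["adenoma", "hyperplastic", "adenoma", None, "adenoma"]
label = sequence_vote(frames)
```

Voting of this kind lets occasional per-frame misclassifications be outweighed by the majority, at the cost of assuming that each sequence contains a single polyp.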

Discussion
Multiple DL algorithms discussed in the aforementioned sections have achieved highly reliable results in accurately detecting different types of tumors, MSI cells, and colorectal polyps. These models have been evaluated through several validation tests and are designed to perform domain-specific tasks such as segmenting tumorous from non-tumorous tissue or distinguishing cancerous cells from healthy ones. For tumor tissue classification tasks, the EfficientNet [88] model was shown to display superior performance, while U-Net [90] and YOLO architectures [74,79,84] showed high precision in solving polyp segmentation and detection tasks, respectively. Using DL methods, the clinical inspection of CRC-related patients can be performed quickly with high diagnostic accuracy. That said, the accuracy of DL algorithms can vary significantly and depends on the amount of data with which the model is trained. Especially in the medical imaging sector, publicly usable, large-scale standard datasets for conducting experiments are rare relative to other fields, for example, natural images (ImageNet). In most scenarios, to compensate for the scarcity of data, techniques such as data augmentation are widely practiced. In addition to traditional augmentation techniques such as flipping, shifting, and rotation, novel techniques such as generative adversarial networks (GANs) [91] and style transfer [92] have also been extensively used to create synthetic instances that increase the number of data samples, with the aim of improving the performance of DL models. To address the complications created by limited data, other techniques, such as transfer learning, can mitigate the model's dependency on training sample size by initializing the model parameters with weights pretrained on other large-scale datasets.
Because real-world medical imaging data are hard to acquire, data augmentation and synthetic imaging techniques can be helpful to enhance the accuracy of DL models in diagnosing CRC. Similarly, a clear difference in the accuracy of the DL model can be perceived, depending upon the use of preprocessed and non-preprocessed datasets to train any model. For a quantitative evaluation of the DL model, data preprocessing such as ROI extraction, color normalization, thresholding, etc., must be incorporated. In addition, data cleaning/preprocessing eliminates a portion of low-quality data or outliers, such as image pairs with suboptimal registration.
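The traditional augmentation techniques mentioned above (flipping and rotation) are simple geometric transforms. The sketch below applies them to an image stored as a 2D list; the function names are hypothetical, and a real pipeline would normally rely on an image library and would also transform any bounding-box or mask annotations accordingly.

```python
def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def flip_vertical(img):
    """Mirror the image top to bottom."""
    return [row[:] for row in img[::-1]]


def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def augment(img):
    """Return the original image plus three transformed copies."""
    return [img, flip_horizontal(img), flip_vertical(img), rotate_90(img)]


# Toy 2 x 2 "image".
image = [[1, 2],
         [3, 4]]
samples = augment(image)
```

Each source image yields four training samples here, which illustrates how augmentation enlarges a small dataset without acquiring new patient data.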
Apart from that, image annotation such as tumor tissue labeling in CRC is considered a highly sophisticated and time-consuming task, and thus, there is a need for highly skilled and experienced pathologists to prepare high-quality datasets for training and testing. Not only limited to image annotations, the requirement for medical associates and pathologists is also extremely important in order to consider the values and preferences of the patients, the medical judgments, the interventional procedures, policy making, and other tasks that cannot be accomplished by computer programs alone. Therefore, the need for pathologists remains essential for medical practices diagnosing not only CRC but also other cancer variants.
Current DL models exist in various forms and architectures, and frequent optimizations of those models are being released to ensure highly accurate results in CRC diagnosis. However, only an abundant number of experiments and user-based experiences can guarantee the reliability of those models for clinical purposes. Therefore, it is necessary to apply several DL algorithms to identify and detect each type of CRC malignancy and compare them to find the optimal diagnosis procedure. Moreover, improving the existing theoretical foundation of the DL on the basis of the type of experimental data must be taken into consideration to quantify the performance of multiple DL-based CRC detection modalities. Such improvements must address the data-specific assessment of any algorithm, its computational complexity, and the hyperparameter tuning strategies [93]. The fact that currently incorporated models can be biased towards non-CRC datasets cannot be overlooked, and thus, specific criteria should be validated for CRC-specific DL models in order to obtain intuitive insights into their optimization characteristics and certainties. CRC diagnosis and prognosis with DL technologies are almost ready to be commercialized for practical use cases in clinical settings. By exploring several other opportunities regarding data preparation and model architectures, there is still room for improvement in the accuracy of those models that are still in the suboptimal phase.

Conclusions
DL has expanded rapidly over the past few years in the field of oncology, especially for the screening and diagnosis of CRC-related diseases. In this paper, we reviewed the publicly available CRC imaging datasets and recently published research works that focused on different CRC detection tasks, including tumor detection, MSI detection, and polyp detection. Furthermore, we outlined some issues regarding the scarcity of data and preprocessing strategies and provided insights into developing problem-specific DL architectures to diagnose CRC patients in real time to enable their commercialization for clinical practice. Through extensive research and development of medical application-oriented DL models, and by collaborating with experienced pathologists in collecting high-quality annotated datasets, we believe that the reliable and automated screening of one of the most fatal cancer subtypes will be possible in the near future.