Deep Learning on Histopathological Images for Colorectal Cancer Diagnosis: A Systematic Review

Colorectal cancer (CRC) is the second most common cancer in women and the third most common in men, with an increasing incidence. Pathology diagnosis complemented with prognostic and predictive biomarker information is the first step for personalized treatment. The increased diagnostic load in the pathology laboratory, combined with the reported intra- and inter-observer variability in the assessment of biomarkers, has prompted the quest for reliable machine-based methods to be incorporated into routine practice. Recently, Artificial Intelligence (AI) has made significant progress in the medical field, showing potential for clinical applications. Herein, we aim to systematically review the current research on AI in CRC image analysis. In histopathology, algorithms based on Deep Learning (DL) have the potential to assist in diagnosis, predict clinically relevant molecular phenotypes and microsatellite instability, identify histological features related to prognosis and correlated to metastasis, and assess the specific components of the tumor microenvironment.


Introduction
Colorectal cancer (CRC) is one of the most common types of gastrointestinal cancer, the second most common cancer in women and the third in men [1]. Despite existing variations, such as geographical distribution, age and gender differences, the overall CRC incidence is estimated to increase by 80% by the year 2035 worldwide [2]. This rising incidence of CRC is mainly due to changes in lifestyle, particularly dietary patterns [3]. Most CRCs are sporadic (70-80%), while approximately one third have a hereditary component [4]. The term CRC covers a wide range of carcinoma subtypes, characterized by different morphological features and molecular alterations.
The cornerstone of CRC diagnosis is the pathologic examination (biopsy or surgical excision) [5]. With the advent of screening methods, many precursor lesions are also detected and biopsied. Consequently, a wide range of pre-malignant lesions have been identified, and occasionally, a differential diagnosis between pre-malignant and malignant lesions is quite challenging [6]. The histopathological examination of the tissue remains the "gold standard" for diagnosis, with the first step being the optimal preparation of the histological section, stained with Hematoxylin and Eosin (H&E) [7]. Further examination with special in situ methods, such as immunohistochemistry (IHC) and in situ hybridization (ISH), and other molecular techniques follows [8]. There are published guidelines for preanalytical, analytical and post-analytical procedures in a pathology laboratory [9].
Machine learning is a branch of AI which is based on the concept that machines could have access to data and be able to learn on their own. AI has a broader scope and involves machines that are capable of carrying out tasks requiring intelligence. Machine learning techniques focus on the creation of intelligent software using statistical learning methods and require access to data for the learning procedure [27]. A branch of machine learning, which has drawn a lot of attention over the last few years, is DL. DL involves training artificial neural networks (ANNs) with multiple layers of artificial neurons (nodes). Neural networks are inspired by the physiology of the human brain, comprising a simplified artificial model of the human neural network. An ANN is a collection of connected artificial neurons. The simplest ANN architecture is the single-layer feed-forward neural network. In these types of networks, the information moves in one direction only, from the input nodes to the hidden layer nodes and then to the output nodes.
The success and wide acceptance of ANNs rely on their capability to solve complex mathematical problems, nonlinear or stochastic, by using very simple computational operations. In contrast to a conventional algorithm, which needs complex mathematical and algorithmic operations and could only apply to one problem, an ANN is computationally and algorithmically very simple and its structure allows it to be applied to a wide range of problems [28].
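The one-directional flow of information described above can be sketched in a few lines. This is a minimal illustration only, assuming a sigmoid activation and hand-picked weights; the function names and all numeric values are hypothetical and do not come from any reviewed study.

```python
import math


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def feed_forward(inputs, hidden_w, hidden_b, output_w, output_b):
    """Information flows one way: input nodes -> hidden layer -> output nodes."""
    hidden = dense(inputs, hidden_w, hidden_b)
    return dense(hidden, output_w, output_b)


# Hypothetical weights: 2 input nodes, 2 hidden neurons, 1 output neuron.
HIDDEN_W = [[0.5, -0.5], [1.0, 1.0]]
HIDDEN_B = [0.0, -1.0]
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [0.0]

prediction = feed_forward([1.0, 1.0], HIDDEN_W, HIDDEN_B, OUTPUT_W, OUTPUT_B)
print(prediction)
```

Each neuron performs only multiplications, additions and one activation call, which is the sense in which the individual operations of an ANN are simple. Training, that is, adjusting the weights from data, is not shown here.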
DL has rapidly developed during the last decade due to the significant increase in processing power and to the fact that, for the first time, artificial models are able to achieve more accurate results than humans in classification tasks [29]. Both DL and machine learning techniques in general affect our everyday life in various ways, from the simple-looking face recognition program used in Facebook to the classification of abnormal/normal human cells in bioinformatics. For image analysis problems, such as the detection of histological lesions, prognosis and diagnosis, DL approaches mainly employ Convolutional Neural Networks (CNNs) for segmentation and classification, while a few studies employ another DL approach, called Generative Adversarial Networks (GANs), to improve the training set of images before classification.
CNNs have produced high classification rates in modern computer vision applications. The term "convolutional" indicates that a deep neural network applies the mathematical convolution operation in at least one of its hidden layers. Many CNN model variations have been implemented in recent years, which are based on a common layer pattern: (a) 1 input layer, (b) L-1 convolution layers and (c) 1 classification layer. The key feature of a sequential CNN is that it transforms the input data through neurons that are connected to neurons of the previous convolution layer. Initially, the raw image is loaded at the input layer, which is usually set to accept a three-dimensional spatial form of an image file (width × height × depth), with the depth, in this case, indicating the RGB (Red, Green, Blue) color channels. More technically, each convolution layer calculates the dot product between a local region of its input and the weights of a filtering kernel of predetermined size (e.g., 3 × 3). In this way, local features can be detected through K declared kernels. As a result, each node (neuron) of a convolution layer calculates its activation value based on only a subset of spatially adjacent nodes in the feature maps of the previous layer. The most common deep network architectures, such as AlexNet and GoogleNet, use the same neuron type at each hidden layer [30,31]. These architectures achieve very high accuracy in classification problems, although their training is a computationally intensive and time-consuming process. Currently, many different architectures, such as VGG, DenseNet, ResNet, Inception.v3, etc., have been proposed, performing well under different conditions and problem parameters [31][32][33].
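The dot product between a 3 × 3 kernel and each local image region can be illustrated with a short sketch. It assumes a single-channel image, a square kernel, stride 1 and no padding, and it follows the deep learning convention of not flipping the kernel (strictly, a cross-correlation). The image and kernel values are hypothetical.

```python
def conv2d(image, kernel):
    """Slide a square kernel over a 2D image (stride 1, no padding).

    Each output value is the dot product between the kernel weights and
    the spatially adjacent input values under the kernel window.
    """
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Hypothetical 4 x 4 single-channel image with a vertical edge in the middle.
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 3 x 3 kernel that responds to left-to-right intensity increases.
KERNEL = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = conv2d(IMAGE, KERNEL)
print(feature_map)  # [[3, 3], [3, 3]]
```

In a trained CNN the kernel weights are learned rather than hand-written, and K such kernels produce K feature maps per layer. For an RGB input the same dot product would additionally run across the three color channels.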
GANs are also a DL approach applied to digital image analysis [34]. GANs are a smart way to train a generative model by framing it as a supervised learning problem, even though, in principle, they are unsupervised machine learning procedures. A typical GAN consists of two sub-models: (a) the generator network, which is trained to generate new samples with similar characteristics to the real ones and (b) the discriminator network, which performs a binary classification of the samples, discriminating the real (approved) samples from the fake ones. GANs have rapidly evolved, especially in image processing and classification, providing a sophisticated approach to simulate images for CNN training, avoiding overtraining and overfitting. They are an alternative to conventional image augmentation, which produces simulated images using simple transformations such as rotation, shearing, stretching, etc.
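For comparison with GAN-based image synthesis, conventional augmentation by simple transformations can be sketched as follows. This is an illustrative example only: it uses 90-degree rotations, as mentioned in the text, plus horizontal mirroring, which is added here as an assumption. The tiny patch and the function names are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def augment(image):
    """Return the original image, its three rotations and their mirrored copies."""
    variants = [image]
    for _ in range(3):
        variants.append(rotate90(variants[-1]))
    variants.extend([flip_horizontal(v) for v in variants[:4]])
    return variants


# Hypothetical 2 x 2 patch standing in for a histology tile.
PATCH = [
    [1, 2],
    [3, 4],
]

augmented = augment(PATCH)
print(len(augmented))  # 8
```

Such transformations only rearrange existing pixels, whereas a GAN generator produces new images, which is the distinction the paragraph above draws.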
In this paper, a systematic review of the application of DL in colorectal cancer, using digital image analysis in histopathological images, is presented. The manuscript aims to investigate the field from both medical and technical viewpoints. The innovative contribution of this systematic review is the combination of the two viewpoints, presenting a more comprehensive analysis of AI-based models in CRC diagnosis. A deeper understanding of both the medical and technical aspects of DL will better reveal the opportunities for implementing DL-based models in clinical practice, as well as help overcome several challenges to the optimal performance of the algorithms. According to the PRISMA guidelines [35], an expanded algorithm was used for searching the literature. Specific inclusion and exclusion criteria were defined to arrive at the final studies of interest, which have been categorized from both the medical and technical points of view. In the next sections, the relevant background for both the clinical practice and DL in image analysis is outlined, the method for the study selection is analyzed, and the results are extensively discussed.

Search Strategy
We systematically searched PubMed from inception to 31 December 2021 for primary studies developing a DL model for the histopathological interpretation of large bowel biopsy tissues and CRC. For this purpose, we used the following algorithm: (convolutional neural networks OR CNN OR deep learning) AND ((cancer AND (colon OR colorectal OR intestin* OR bowel)) OR (adenocarcinoma AND (colon OR colorectal OR intestin* OR bowel)) OR (carcinoma AND (colon OR colorectal OR intestin* OR bowel)) OR (malignan* AND (colon OR colorectal OR intestin* OR bowel))) AND (biop* OR microscop* OR histolog* OR slide* OR eosin OR histopatholog*). The search was conducted on 14 January 2022.
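The nested Boolean structure of the search algorithm is easier to check when it is assembled from its building blocks. The following sketch is only an illustration of that structure: the helper `any_of` and the variable names are hypothetical, and no database is queried.

```python
def any_of(terms):
    """Join terms with OR inside one pair of parentheses."""
    return "(" + " OR ".join(terms) + ")"


# Terms copied from the search algorithm quoted above.
METHOD = any_of(["convolutional neural networks", "CNN", "deep learning"])
SITE = any_of(["colon", "colorectal", "intestin*", "bowel"])
DISEASE = any_of(
    f"({term} AND {SITE})"
    for term in ["cancer", "adenocarcinoma", "carcinoma", "malignan*"]
)
MATERIAL = any_of(
    ["biop*", "microscop*", "histolog*", "slide*", "eosin", "histopatholog*"]
)

# The three blocks are combined with AND, as in the published query.
query = " AND ".join([METHOD, DISEASE, MATERIAL])
print(query)
```

Written this way, the query reads as three conditions: a DL method, a malignant disease of the large bowel, and histological material.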

Study Eligibility Criteria
The study was conducted according to the PRISMA guidelines and registered in PROSPERO 2020. Eligible articles were considered based on the following criteria. We included studies presenting the development of at least one DL model for the histopathological assessment of large bowel slides and CRC. Eligible applications of the DL models included diagnosis, tumor tissue classification, tumor microenvironment analysis, prognosis, survival and metastasis risk evaluation, tumor mutational burden characterization and, finally, microsatellite instability detection. We excluded articles that presented in vitro models, used endoscopic or radiological images instead of histological sections, or involved nonphotonic microscopy. Furthermore, eligible articles had to report original studies rather than reviews/meta-analyses, concern humans and be written in English. Additionally, articles referring to organs other than the large bowel and to benign entities were deemed ineligible.

Study Selection
All citations collected by the previously mentioned methodology were independently screened by four researchers, who were properly trained before the process started, using the online software Rayyan. Three of the researchers were scientifically capable of evaluating the medical aspect of the query and one of them was a CNN expert, able to assess the technical part. During the screening period, the researchers would meet regularly to discuss disagreements and continue training. Conflicts were resolved by consensus. The full texts of potentially eligible articles were later retrieved for further evaluation.

[Figure: PRISMA flow diagram. Identification of studies via MEDLINE database; reports excluded (n = 10).]

Medical Viewpoint
From the medical viewpoint, the studies fall into five categories: (a) studies for diagnostic purposes, (b) the classification of the tumor tissue, (c) the investigation of the tumor microenvironment, (d) the role of histological features in prognosis, metastasis and survival, and finally, (e) the identification of microsatellite instability.

Diagnosis
DL techniques can assist in the process of pathology diagnosis [14]. The algorithms perform a binary classification, for instance, cancer/non-cancer, colon benign tissue/colon ADC.
The classification of the tumor regions in WSIs by AI-based models could assist in the time-consuming process of a microscopical examination. The models suggested in the study by Gupta et al. classified normal and abnormal tissue in CRC slides and localized the cancer regions with good performance metrics [36]. Zhou et al. used global labels for tumor classification and localization without the need for annotated images [37]. In the same framework, DL algorithms performed a binary classification of CRC images to distinguish cancerous from non-cancerous regions, achieving good performance metrics and supporting the potential for use in clinical practice [38][39][40][41][42]. A recent study evaluating the segmentation performance of different DL models showed that AI-patch-based models had great advantages, although this segmentation approach could result in lower accuracy when more challenging tumor images are included [43]. Moreover, AI-based models could be combined with persistent homology profiles (PHPs) to effectively distinguish normal from tumor tissue regions by evaluating the nuclear characteristics of tumor cells [44]. A patch-cluster-based aggregation model, developed by Wang et al. on a great number of WSIs, classified CRC images (cancer, not cancer) by assessing the clustering of tumor cells, and the results were comparable to pathologists' diagnosis, revealing no statistical difference [45]. The acceleration of tumor detection by CNNs could be obtained by reducing the number of patches, taking care to select the most representative regions of interest [46]. Both proposed methods in the study of Shen et al. performed with good accuracy and efficiency in detecting negative cases. Lastly, Yu et al., using a large dataset, demonstrated that SSL, with large amounts of unlabeled data, performed well at patch-level recognition and had an AUC similar to that of pathologists [47].
Colon benign tissue and colon ADC were classified with good accuracy by DL models developed by Togaçar et al. and Masud et al. [48,49]. The study of Song et al. showed that the DL model and the pathologists' estimation were in agreement in diagnosing CRC [50]. However, the binary classification algorithm for adenoma and non-cancerous (including mucosa or chronic inflammation) tiles showed a proportion of false predictions in challenging tiles consisting of small adenomatous glands.
A DL model trained with multiphoton microscopy (MPM) images distinguished benign from malignant tissues with a sensitivity of 0.8228 and a specificity of 0.9114, although the images were lacking biomarkers such as colonic crypts and goblet cells [51]. Holland et al. used the same classification model and 7 training datasets consisting of a descending number of images [52]. The mean generalization accuracy appeared to rely on the number of images within the different training sets and CNNs, although the larger datasets did not result in the higher mean generalization accuracy that was expected. Lizuka et al. conducted a classification of CRC into adenocarcinoma, adenoma or normal tissue on three different test sets, revealing great performance metrics and promising results for clinical practice [53]. The progression of CRC could be assessed by a CNN designed to identify benign hyperplasia, intraepithelial neoplasia, and carcinoma using multispectral images; however, the contribution of the pathologist's assessment and a bigger dataset were required [54]. Another study demonstrated that colorectal histological images could be classified into normal mucosa, an early preneoplastic lesion, adenoma and cancer with good accuracy, although these four classes may occasionally overlap and result in uncertainty in labeling [55]. Moreover, the ARA-CNN model was designed for an accurate, reliable and active tumor classification in the histopathological slides, aiming to minimize the uncertainty of mislabeled samples [56]. The model achieved great performance metrics not only in the binary, but also in the multiclass tumor classification, as did the CNNs proposed by Xu et al. and Wang et al. [57,58]. Three studies by Papadini et al., Jiao et al. and Ben Hamida et al. proposed CNN approaches for multi-class colorectal tissue classification on large datasets, underlining the great potential of AI-based methods to efficiently perform multiple classifications of tumor regions [59][60][61]. Repurposing a stomach model trained on poorly differentiated cases of gastric ADC using a transfer learning method, DL algorithms could perform the classification of poorly differentiated adenocarcinoma in colorectal biopsy WSIs, benefiting from histological similarities between gastric and colon ADC [62].
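Sensitivity and specificity, such as the values quoted above for the MPM-trained model, are derived from the confusion matrix of a binary classifier. A minimal sketch follows, assuming labels coded as 1 for malignant and 0 for benign. The ten labels are hypothetical and are not data from any reviewed study.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = malignant."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # malignant, detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # malignant, missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign, cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign, flagged
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical ground truth and predictions for ten tiles.
Y_TRUE = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
Y_PRED = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(Y_TRUE, Y_PRED)
print(sens, spec)
```

Sensitivity is the fraction of malignant cases that are detected, and specificity is the fraction of benign cases that are correctly cleared. The two are reported together because either can be raised at the expense of the other by moving the decision threshold.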
The challenging task of gland segmentation was approached by Xu et al. and Graham et al., developing CNNs for gland segmentation and achieving a good performance in statistical metrics as well as generalization capability [63,64]. In addition, Kainz et al. trained two networks to recognize and separate glands which achieved 95% and 98% classification accuracy in two test sets [65]. Further research, both in H&E-stained and IHC images of colorectal tissue, was performed for glandular epithelium segmentation [66].
Grading into normal, low-grade and high-grade CRC was approached by Awan et al. and Shaban et al. with 91% and 95.7% accuracy, respectively, using the same dataset [67,68]. Lastly, the grading of colorectal images was performed by an unsupervised feature extractor via DL, showing great accuracy, although, as expected, the subcategorization of low-grade tissue images reduced the accuracy [69].

Tumor Microenvironment
An automated assessment of the CRC tumor microenvironment was carried out, including the stroma, necrosis and lymphocytes associated with progression-free intervals (PFI) [70]. Jiao et al. demonstrated that a higher tumor-stroma ratio was a risk factor, whilst high levels of necrosis and lymphocyte features were associated with a low PFI. Pham et al. proposed a DL model for binary and 8-class tumor classification in CRC images, as well as for the prediction and prognosis of the protein marker DNp73 in IHC rectal cancer images; the model provided perfect results and outperformed other CNNs [71]. Pai et al. conducted a tumor microenvironment analysis in colorectal TMAs [72]. The algorithm efficiently detected differences between MMRD and MMRP slides based on inflammatory stroma, tumor infiltrating lymphocytes (TILs) and mucin, and quantified the proportion of tumor budding (TB) and poorly differentiated clusters (PDCs) associated with lymphatic, venous and perineural invasion. A Desmoplastic Reaction (DR) could also be classified by DL algorithms in CRC histopathological slides containing the deepest tumor invasion area [73]. The classification of a DR based on a myxoid stroma could be a significant prognostic marker for patients' survival.
The ImmunoAIzer, a DL model for cell distribution description and tumor gene mutation status detection in CRC images proposed by Bian et al., showed great performance in the comprehensive analysis of the tumor microenvironment [74]. Optimal results were achieved in accuracy and precision for biomarker prediction, including CD3, CD20, TP53 and DAPI. Additionally, the suggested DL framework could effectively quantify TILs and PD-1 expressing TILs in anti-PD-1 immunofluorescence staining images, as well as detect APC and TP53. Lymphocytes could be detected in colorectal IHC images stained positive for CD3 and CD8 biomarkers by 4 different CNNs, with U-Net showing the best performance according to the F1 score [75]. In the same framework, Xu et al. proposed a DL model for the quantification of the immune infiltration (CD3 and CD8 T-cells' density) within the stroma region using IHC slides [76]. The CNN-IHC model performed with high accuracy and was efficient in predicting survival probability, which was increased when patients had a higher stromal immune score. Mutations in genes such as APC, KRAS, PIK3CA, SMAD4, TP53 and BRAF could be predicted by DL algorithms to support the clinical diagnosis and better stratify patients for targeted therapies [77,78]. Schrammen et al. proposed the Slide-Level Assessment Model (SLAM) for simultaneous tumor detection and prediction of genetic alterations [79]. In a 2017 study, recognizing the molecular tumor subtype based on histopathology image data, Popovici et al. proposed a challenging approach utilizing a DCNN, which was effective in predicting relapse-free survival [80]. Xu et al. compared a DCNN to handcrafted feature representation in IHC slides of CRC, stained for an Epidermal Growth Factor Receptor (EGFR), and demonstrated that the DCNN performed better than the handcrafted features in classifying epithelial and stromal regions [81]. In addition, Sarker et al. developed a DL approach for the identification and characterization of an Inducible T-cell COStimulator (ICOS) biomarker, which achieved high accuracy in the ICOS density estimation and showed potential as a prognostic factor [82]. Tumor budding could be quantified in CRC IHC slides stained for pan-cytokeratin, whereas a high tumor budding score was correlated to a positive nodal status [83].
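The F1 score used above to rank the lymphocyte-detection networks combines precision and recall into a single number. A minimal sketch of the standard definition is given below. The detection counts are hypothetical and do not reproduce the results of any cited study.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F1) from detection counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 80 lymphocytes correctly detected,
# 20 spurious detections and 10 lymphocytes missed.
precision, recall, f1 = precision_recall_f1(80, 20, 10)
print(precision, recall, f1)
```

Because it is a harmonic mean, F1 is pulled toward the lower of the two values, so a detector cannot score well by over-detecting or by under-detecting cells.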
Analysis of cell nuclei types (epithelial, inflammatory, fibroblasts, "other") by a CNN model trained on 853 annotated images showed a 76% classification accuracy [26]. All four cell types were associated with clinical variables; for instance, fewer inflammatory cells were related to mucinous carcinoma, while metastasis, residual tumors, as well as venous invasion were related to lower numbers of epithelial cells. A similar study, by Sirinukunwattana et al., described a CNN method for the detection and classification of four cell nuclei types (epithelial, inflammatory, fibroblast and miscellaneous) in histopathological images of CRC [84]. Höfener et al. used the same dataset as Sirinukunwattana et al. for nuclei detection with CNNs based on the PMap approach [85]. A novel CNN architecture, Hover-net, was proposed by Graham et al. for the simultaneous segmentation and classification of nuclei, as well as for the prediction of 4 different nuclear types [86]. In 2017, the deep contour-aware network (DCAN) was developed by Chen et al. for accurate gland and nuclei segmentations on histological CRC images [87].

Histological Features Related to Prognosis, Metastasis and Survival
A peri-tumoral stroma (PTS) score evaluated by CNNs was significantly higher in patients with positive lymph nodes compared to the Lymph Node Metastasis (LNM)-negative group [88]. However, due to the small dataset and the selection of classes used, the PTS score for LNM and extramural tumor deposits in early-stage CRC was not detected. Kiehl et al. and Brockmoeller et al. showed that LNM could be predicted by DL models with a good performance [89,90]. Furthermore, the incidence of metastasis in histologic slides with one or more lymph nodes was predicted by CNN, with good accuracy, both for micro- and macro-metastases [91].
Bychkov et al., using TMAs of the most representative tumor area of CRC, demonstrated the efficiency of a DL model in predicting the 5-year disease-specific survival (DSS), while Skrede et al. reported data for the prediction of cancer-specific survival [92,93]. Similarly, DSS was predicted by a DL model and clinicopathological features, such as poorly differentiated tumor cell clusters, were associated with high DL risk scores [94]. A Crohn-like lymphoid reaction (CLR) density at the invasive front of the tumor was a good predictor of prognosis in patients with advanced CRC, independent of the TNM stage and tumor-stroma ratio [95]. Determining the ratio of the desmoplastic and inflamed stroma in histopathological slides by DL models could be of great value in predicting the recurrence of disease after rectal excision, and a lower desmoplastic to inflamed stroma ratio was associated with a good prognosis [96]. Tumor-stroma ratio (TSR) measures could be an important prognostic factor and, as shown by Zhao et al. and Geesink et al., a stroma-high score was associated with reduced overall survival [97,98]. The "deep stroma score" by Kather et al., a combination of non-tumor components of the tissue, could be an independent prognostic factor for overall survival, especially in patients with advanced CRC [99]. IHC slides stained for pan-cytokeratin from patients with pT3 and pT4 colon ADC were used to train a DCNN to predict the occurrence of distant metastasis based on tumor architecture [100]. Another study showed that IHC-stained images of the amplified in breast cancer 1 (AIB1) protein from CRC patients could operate as a predictive 5-year survival marker [101].

Microsatellite Instability
Deploying the dataset of the MSIDETECT consortium, Echle et al. developed a DL detector for the identification of MSI in histopathological slides [102]. High MSI scores were accompanied by the presence of poorly differentiated tumor tissue; however, false MSI scores were also noted in necrotic and lymphocyte-infiltrated areas. The binary classification of DL algorithms for predicting MSI and MSS status in CRC images was performed in studies by Wang, Yamashita, Bustos and Cao et al., with the latter study associating MSI with genomic and transcriptomic profiles [103][104][105][106]. Another MSS/MSI-H classifier model was trained on tumor-rich patch images for better classification results, although some images were misclassified, indicating that a larger dataset was required [107]. Synthesized histology images could also be utilized by DL models for detecting MSI in CRC, as demonstrated by Krause et al. [108]. A synthetic dataset achieved an AUC in predicting MSI similar to that of real images, although the best performance was noted when a combination of synthetic and real images was used. Image-based consensus molecular subtype (CMS) classification in CRC histological slides from 3 datasets showed a good performance, and the slides having the highest prediction confidence were in concordance with the histological image features [109]. In another study, CMS classification was associated with mucin-to-tumor area quantification, and revealed that CMS2 CRC had no mucin and that MUC5AC protein expression was an indication of worse overall survival [110]. Lastly, a CNN for predicting tumor mutational burden-high (TMB-H) in H&E slides was developed by Shimada et al. and showed an AUC of 0.91, while high AUC scores were also noted in the validation cohorts [111]. TMB-H was associated with TILs, although further development is needed before this CNN model can be included in clinical practice.
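The AUC values reported for MSI and TMB-H prediction can be read as the probability that a randomly chosen positive slide receives a higher score than a randomly chosen negative one. A minimal sketch of this pairwise definition follows, with ties counted as one half. The labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; ties contribute one half.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical slide-level labels (1 = MSI, 0 = MSS) and model scores.
LABELS = [1, 1, 1, 0, 0, 0]
SCORES = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = roc_auc(LABELS, SCORES)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. The quadratic loop is adequate for illustration, while production code would normally use a rank-based implementation from a statistics library.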

Technical Viewpoint
The presented DL methods for image analysis in colorectal histopathology images could follow a categorization close to the one presented in the background section. The systematic review indicates a rapid development of the field, presenting DL applications that cover many technical approaches. Most of the presented works in the literature employ a Convolutional Neural Network in different segmentation and classification problems (i.e., binary classification for the diagnosis or prognosis of cancer, multiclass problems to characterize different tissue types, segmentation problems for the detection of the microenvironment of the tissue). According to the scope of the study, the authors proposed an appropriate architecture, reporting the performance of their method and, in some cases, comparing it with other already developed CNNs. A few studies used GANs to improve the training of the network, while several of them extended architectures for encoding and decoding, such as U-Net. Recent studies took advantage of the high classification performance, developing retrospective or cohort studies based on the DL results. Technically, almost all the studies utilized popular machine learning environments, such as PyTorch, TensorFlow, Keras, Fastai, etc., which provided robust implementations of DL approaches. The main category of CNN application can be divided into three subcategories: (i) custom CNN architectures, (ii) popular architectures with transfer learning, and finally, (iii) novel architectures, ensemble CNNs or frameworks.

Custom CNN Architecture
Custom CNN architectures denote those approaches where the authors built, from scratch, all the layers of the network, describing in detail the feature extraction layers, the fully connected layers of the classifier, as well as all the layers between them. Commonly, these architectures consisted of few layers and a small number of parameters, in contrast to the well-known architectures, which are larger and deeper than the custom ones. In several cases, custom CNNs performed well for typical simple problems, where it was probably meaningful to avoid complex architectures and networks with a high computational cost. Several proposed custom CNNs were constructed, containing up to 4 convolution layers for feature extraction and up to 2 fully connected layers for the classifier [38,45,53,66,80]. For example, one of the first presented methods, by Xu et al., classified the regions of the image as epithelium or stroma, employing a simple CNN with a total of 4 layers (2 convolution and 2 fully connected) [81]. Other research teams implemented deeper architectures than the latter, including at least 8 layers [40,83,98]. For example, one of the most recent studies used a custom architecture of 15 layers (12 convolutional and 3 fully connected) for diagnosis purposes [40]. Finally, the most complex custom CNN, proposed by Graham et al. and called MilD-Net+, provides simultaneous gland and lumen segmentation [64].
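The remark that custom CNNs have few layers and a small number of parameters can be made concrete by counting the trainable parameters of a 4-layer network (2 convolution and 2 fully connected layers). The sketch below is an assumption-laden illustration: the channel counts, the 32 × 32 RGB input, the padded convolutions and the two 2 × 2 poolings are all hypothetical and do not describe the network of Xu et al. or any other cited study.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per output channel of a square-kernel convolution."""
    return (kernel_size * kernel_size * in_channels + 1) * out_channels


def dense_params(in_features, out_features):
    """Weights plus one bias per output neuron of a fully connected layer."""
    return (in_features + 1) * out_features


# Hypothetical layout for 32 x 32 RGB patches; with padded convolutions and a
# 2 x 2 pooling after each of them, the feature maps shrink 32 -> 16 -> 8.
LAYERS = [
    ("conv1", conv_params(3, 16, 3)),
    ("conv2", conv_params(16, 32, 3)),
    ("fc1", dense_params(32 * 8 * 8, 64)),
    ("fc2", dense_params(64, 2)),  # two output classes
]

total = sum(count for _, count in LAYERS)
for name, count in LAYERS:
    print(name, count)
print("total", total)  # total 136354
```

Under these assumptions the whole network has roughly 0.14 million parameters, most of them in the first fully connected layer. Widely used architectures such as VGG have parameter counts several orders of magnitude larger.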

Popular Architectures with Transfer Learning
The most convenient way to apply CNNs to imaging problems is through the machine learning environments, where researchers can easily call already developed architectures. Such architectures gradually became very popular due to their standard implementation as well as their ability to transfer learning from training on other datasets. According to the concept of transfer learning, it is less computationally expensive to employ a pre-trained deep network instead of a network with randomly generated weights, even if the dataset used for pre-training includes images with different characteristics and classes. As a result, in most cases, the popular models were trained on the ImageNet dataset, which contains many images from different sources [27]. The most common pre-trained model used for CRC is based on the VGG architectures. Four of the studies, presented by Zhao et al. [95,97], Xu et al. [76] and Jiao et al. [70], employed the VGG-19, while two of the studies employed the VGG-16 [41,101]. Furthermore, two other studies compared different parameters of the general VGG architecture [38,80]. The second and third most used CNNs for CRC are the Inception.v3 [39,45,53,77,111] and the ResNet (ResNet-50 used by Chuang et al. [91], ResNet-18 used by Kiehl et al. [90] and Bilal et al. [78], and ResNet-34 used by Bustos et al. [105] and Bilal et al. [78]), along with their combination, called the InceptionResNet.v2 [100]. These architectures introduced the Inception and the residual blocks, which made the model less sensitive to overfitting. Interesting approaches [67,70,88] were developed using either the U-Net model, where the initial image was encoded to a low resolution and then decoded, providing images with similar characteristics, or the ShuffleNet [80,91,103].
Finally, other well-known models were also used, such as AlexNet [57], the YOLO detector [75], the CiFar Model [25], the DenseNet [73], the MobileNet [94], LSTM [71], Xception [51], the DarkNet [48] and EfficientNetB1 [62].
All the comparative works can also be included in the category of popular pre-trained models. These studies employed either the well-known models referenced above [36,61], or other models such as GoogleNet [99], SqueezeNet [52] and ResNeXT [43]. Finally, two studies utilized [72] or proposed [103] cloud platforms where the user can fine-tune several hyper-parameters of popular pre-trained architectures.

Novel Architectures
Many research teams focus on technical innovation, evaluating their proposed methodologies on colorectal image datasets. The studies of this category are mostly (a) modifications of popular architectures, (b) combinations of techniques into a framework, or (c) ensemble approaches.
The modified architectures include the HoVer-Net [64] based on the Preact-ResNet-50, the KimiaNet [112] based on the DenseNet, the architecture proposed by Yamashita et al. [104] based on the MobileNet, and the modification of the loss functions on the ResNet proposed by Medela et al. [113]. Finally, Bian et al. [74] proposed a CNN based on the Inception.v3, adding several residual blocks.
Several studies combined a CNN architecture with other sophisticated methods and concepts of artificial intelligence. One of the first attempts in the field was developed by Sirinukunwattana et al., proposing a combination of a custom CNN architecture with the Spatial Constrain Regression [84]. A similar concept combined two custom CNN architectures with PMaps approaches [85]. Chen et al. presented a novel deep contour-aware network for the detection and classification of the nuclei [87]. A Deep Belief Network for feature extraction, followed by Support Vector Machines for classification, was deployed by Sari et al. [69]. A recent work employed a Deep embedding-based Logistic Regression (DELR), which also used active learning as a sample selection strategy [60]. In two other studies, the DenseNet was combined with Monte Carlo approaches [46], while the Inception.v3 was combined with Adversarial Learning [109]. Kim et al. [114] combined the InceptionResNet.v2 with Principal Component Analysis and Wavelet Transform. Some other research teams combined two or more CNNs in a single framework. Two different approaches combined the VGG architectures with the concept of the ResNet [66,92], while the ARA-CNN, proposed by Raczkowski et al. [56], combined the ResNet with the DarkNet. Lee et al. [107] proposed a framework of an initial custom architecture followed by the Inception.v3. Furthermore, three frameworks based on the ResNet were developed by Zhou et al. [37]. Shaban et al. [68] developed a novel context-aware framework consisting of two stacked CNNs. Finally, another combination of different architectures presented in the literature is the DeepLab.v2 with ResNet-34 [50].
In recent years, voting systems have been increasingly used for classification purposes. These ensemble approaches engage two or more algorithms, whose individual predictions are combined to produce the final prediction. The first ensemble pipeline was presented by Cao et al. in 2020, which votes according to the likelihood extracted from ResNet-18 [106]. Nguyen et al. [42,110] proposed an ensemble approach with two CNNs (VGG and CapsuleNet), while Kheded et al. deployed an approach with three CNNs as combination backbones: (a) the U-Net with the ResNet, (b) the U-Net with the InceptionResNet.v2 and (c) the DeepLab.v3 with Xception [115]. Another ensemble framework was developed by Skrede et al. [93], with ten CNN models based on the DoMore.v1. The most extensive voting systems were presented by Paladini et al. [59], who introduced two ensemble approaches using the ResNet-101, ResNeXt-50, Inception-v3 and DenseNet-161. In the first one, called the Mean-Ensemble-CNN approach, the predicted class of each image was assigned using the average of the predicted probabilities of the four trained models, while in the second one, called the NN-Ensemble-CNN approach, the deep features corresponding to the last FC layer were extracted from the four trained models.
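The averaging rule used by a mean-ensemble approach can be sketched as follows. This is a minimal illustration of averaging class probabilities across models, not the implementation of any reviewed study; the function name `mean_ensemble` and the probability values are hypothetical.

```python
def mean_ensemble(prob_vectors):
    """Average per-class probabilities over models.

    prob_vectors: list with one probability vector per model, all of the
    same length (one entry per class).
    Returns (index of the predicted class, list of averaged probabilities).
    """
    n_models = len(prob_vectors)
    n_classes = len(prob_vectors[0])
    averaged = [
        sum(probs[c] for probs in prob_vectors) / n_models
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


# Hypothetical softmax outputs of four trained models for a single image
# patch and three tissue classes (illustrative values only).
example_probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.5, 0.4, 0.1],
    [0.3, 0.6, 0.1],
]
predicted_class, averaged_probs = mean_ensemble(example_probs)
print(predicted_class, averaged_probs)
```

In this example, two models favor the first class and two favor the second, so a simple majority vote would be tied. Averaging the probabilities gives approximately 0.40, 0.45 and 0.15, and the second class (index 1) is selected.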

Improving Training with GANs
Apart from segmentation and classification, DL in CRC has also been applied for the improvement of the training dataset using GANs. There have been three works with GAN applications presented during the past two years. In the first attempt [108], a Conditional Generative Adversarial Network (CGAN), consisting of six convolution layers for both the generator and the discriminator network, was employed to train the ShuffleNet for classification. Finally, a very recent study presented a novel GAN architecture, called SAFRON [116], which enabled the generation of images of arbitrarily large sizes after training on relatively small image patches.

Discussion
A pathology diagnosis focuses on the macroscopic and microscopic examination of human tissues, with the light microscope having been a valuable tool for almost two centuries [11]. A meticulous microscopic examination of tissue biopsies is the cornerstone of diagnosis and is a time-consuming procedure. An accurate diagnosis is only the first step for patient treatment. It needs to be complemented with information about grade, stage, and other prognostic and predictive factors [4]. Pathologists' interpretations of tissue lesions become data, guiding decisions for patients' management. A meaningful interpretation is the ultimate challenge. In certain fields, inter- and intra-observer variability are not uncommon [12,13]. In such cases, the interpretation of the visual image can be assisted by objective outputs. Many studies have been published over the last 5 years exploring the possibility of moving on to computer-aided diagnosis and the measurement of prognostic and predictive markers for optimal personalized medicine [117,118]. Furthermore, the implementation of AI is now on the horizon. In the last 5 years, extensive research has been conducted to implement AI-based models for the diagnosis of multiple cancer types and, in particular, CRC [14,15,119]. The important aspects of a CRC diagnosis, such as histological type, grade, stromal reaction, immunohistochemical and molecular features, have been addressed using breakthrough technologies.
Traditional pathology methods offer great advantages [120]. The analytical procedures in pathology laboratories are cost-effective and, during recent years, have become automated, reducing the time and errors of procedures, while maintaining high levels of sensitivity and specificity of techniques, such as IHC [119]. Despite their widespread availability, challenges and limitations of traditional pathology methods remain, such as the differences between laboratories' protocols and techniques, as well as the subjective interpretation between pathologists, resulting in inconsistency in diagnoses [12,13]. Novel imaging systems and WSI scanners promise to upgrade traditional pathology, preserving the code and ethics of practice [119]. The potential of DL algorithms is expanding across all fields of histopathology. In clinical practice, such algorithms could provide valuable information about the tumor microenvironment through quantitative analysis of histological features [76]. Better patient stratification for targeted therapies could be approached by DL-based models predicting molecular features, such as MSI status [77,78,107]. More than ever, AI could be of great importance for a pathologist in daily clinical practice. AI is consistently supported by extensive research, which reports good performance metrics and potential. Several studies have shown that the predictions of many DL-based models did not differ significantly from pathologists' predictions [45,104]. Thus, DL algorithms could provide valuable results for diagnoses in clinical practice, especially when inconsistencies occur. The available scanned histological images can be reviewed and examined collaboratively by pathologists, simultaneously and from different locations [121,122]. For an efficient fully digital workflow, however, the development of technology infrastructure, including computers, scanners, workstations and medical displays, is necessary.
Summarizing the presented DL studies from the medical point of view, 17 studies focus on diagnosis, classifying the images as cancer/not cancer, benign/colon ADC or benign/malignant, 17 studies classify tumor tissues, 19 studies investigate the microenvironment of tumors, 14 studies extract histological features related to prognosis, metastasis and survival, and finally, 10 studies detect the microsatellite instability status. The remaining 5 studies that were not described mainly concerned the technical aspects of DL in histological images of CRC. Summarizing the presented DL works from the technical point of view, 80 studies are applications of CNNs, either for image segmentation or classification, and 2 studies employ GANs for the simulation of histological images. The unbalanced distribution between CNN-based and GAN-based studies is an expected result due to the objectives of these two deep learning approaches. CNNs directly classify the images into different categories (e.g., cancer/not cancer). In contrast, GANs only enrich the dataset to avoid overtraining and overfitting during the training procedure, without dealing directly with the main medical question. Among the CNN-based studies, 10 studies proposed a custom CNN architecture, which was developed from scratch, 42 studies employed already developed architectures, often using transfer learning, and finally, 26 studies implemented novel architectures, such as (a) the modification of those already developed (5 studies), (b) a combination between CNNs or CNNs with other AI techniques (15 studies) and (c) ensemble methods (6 studies). Finally, two (2) of the studies did not provide any detail about the DL approach.
The application of DL methods in the diagnosis of CRC over the last 5 years seems to be evolving rapidly, faster than in other fields of histopathology. However, it seems that there is an expected gradual evolution, starting from the simple techniques of CNNs, then applying transfer learning to the networks, and finally attempting to develop new architectures, focusing on the requirements of the medical question. Additionally, in the last two years, alternative deep learning techniques such as GANs have started to be used. The contribution of such methods will be significant, since DL requires a sufficiently large training set to perform well and to generalize. Large data sets may not always be available from the annotations of pathologists and, therefore, need to be enriched with a simulated training set.
The direct application of CNNs to histopathological images is expected to perform better than traditional techniques. CNNs are advantageous over traditional image processing techniques due to the training procedure, while they are also more robust than traditional AI techniques because they automatically extract features from the image. In this systematic review, different studies use a variety of performance metrics, while the nature of each classification problem also differs. Therefore, it is not meaningful to calculate an average performance value over all the studies. For this reason, only the accuracy (Acc) and area under the curve (AUC), which were used more than the other metrics, have been used to evaluate each different classification problem. The mean value and Standard Error of Mean have been computed for binary classification problems (Acc = 94.11% ± 1.3%, AUC = 0.852 ± 0.066), 3-class classification problems (Acc = 95.5% ± 1.7%, AUC = 0.931 ± 0.051), and finally 8-class classification problems (Acc = 94.4% ± 2.0%, AUC = 0.972 ± 0.022), which provided sufficient samples of these metrics. The above performance values confirm that DL in colorectal histopathological images can achieve a reliable prediction.
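The summary statistics reported above can be computed as in the following minimal sketch, which assumes the usual definition of the Standard Error of Mean as the sample standard deviation divided by the square root of the number of studies. The accuracy values in the example are hypothetical and are not the values extracted from the reviewed studies.

```python
import math
import statistics


def mean_and_sem(values):
    """Return the mean and the standard error of the mean of a sample.

    SEM is computed as the sample standard deviation (n - 1 denominator)
    divided by sqrt(n); at least two values are required.
    """
    if len(values) < 2:
        raise ValueError("at least two values are needed to estimate the SEM")
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical accuracies (%) reported by four studies on the same kind
# of classification problem (illustrative values only).
hypothetical_accuracies = [92.0, 94.0, 96.0, 98.0]
acc_mean, acc_sem = mean_and_sem(hypothetical_accuracies)
print(f"Acc = {acc_mean:.1f}% ± {acc_sem:.1f}%")
```

For these illustrative values the result is Acc = 95.0% ± 1.3%. Grouping the studies by type of classification problem before applying such a computation keeps the averaged values comparable.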

Conclusions
When dealing with human disease, particularly cancer, we need all available resources in our armamentarium, and AI promises to deliver valuable guidance. Specifically for CRC, it appears that the recent, exponentially growing research will soon transform the field of tissue-based diagnoses. Preliminary results demonstrate that AI-based models are increasingly applied in clinical cancer research, including CRC, and breast and lung cancer. However, to overcome several limitations, a larger number of datasets, quality image annotations, as well as external validation cohorts are required to establish the diagnostic accuracy of DL models in clinical practice. Given the available collected data, a part of the current systematic review could be extended to a meta-analysis, especially utilizing the data from retrospective studies and survival analysis. The latter could provide us with a comprehensive picture of the contribution of DL methods to the diagnosis of CRC.