Deep Learning and Its Applications in Computational Pathology

Deep learning techniques, such as convolutional neural networks (CNN), generative adversarial networks (GAN), and graph neural networks (GNN), have dramatically improved predictive accuracy in many diverse fields over the past decade. In recent years, the application of deep learning to computer vision tasks in pathology has demonstrated extraordinary potential for assisting clinicians, automating diagnosis, and reducing costs for patients. Deep learning models have also captured previously unknown pathological evidence, such as morphological features related to specific biomarkers, copy number variations, and other molecular features. In this paper, we review popular deep learning methods and recent publications on their applications in pathology.


Introduction
With the development of artificial intelligence and machine learning techniques in the past decade, many deep-learning-based computer vision models play important roles in our daily lives and have revolutionized various industries through their superior performance and efficiency in prediction tasks such as autonomous driving, machine translation, electronic sports, and facial recognition [1][2][3][4][5]. Recently, these technologies have also shown extraordinary potential and capability for solving complicated questions in the biomedical field by analyzing massive biomedical data, such as protein structure prediction with AlphaFold, which approaches experimental accuracy [6,7], and tumor segmentation in MRI scans [1]. In particular, computational pathology, a discipline that involves the effort of both pathologists and informaticians, has benefited especially from the advancement of deep learning in recent years [8][9][10][11]. Several models have been demonstrated to be useful in clinical diagnosis based on histopathology images [12][13][14]. In addition, some model-extracted morphological features show correlations with features at the molecular level, including single mutations and subtypes, most of which were previously unknown to human pathologists and clinicians [13]. Here, we discuss these deep learning methods from a technical perspective and summarize their successful applications in pathology from recent publications.

Convolutional Neural Networks
Deep learning is a type of machine learning method built on multi-layer artificial neural networks (ANN) [1,2,15]. Training a deep learning model involves designing and selecting a neural network architecture, loss functions, and evaluation metrics, as well as tuning hyperparameters including batch size, step size, and regularization methods [1,2,16,17]. Convolutional neural networks (CNN), a variant of ANN, have proved their power in tackling various computer vision tasks, such as image classification, segmentation, and object detection [18][19][20][21][22]. The first modern CNN architecture, LeNet5, was introduced by Yann LeCun et al. in 1998 [23]. This gradient-based convolutional neural network demonstrated its power in recognizing handwritten digits and characters [23]. However, the development of CNN was restricted by limited computational capacities and resources for a decade. The advancement of computational hardware in recent years, especially the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU), has enabled the rapid growth of deep neural networks. Many CNN architectures, such as AlexNet, VGG, InceptionNet, and ResNet, can be trained into models that even outperform human beings in the ImageNet computer vision classification challenge, which contains 1.2 million high-resolution images across 1000 classes [18,[24][25][26][27][28].
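As an illustration of the core operation these architectures share, the following minimal NumPy sketch implements a "valid" 2D convolution (strictly, cross-correlation, as in most deep learning frameworks). The edge-detecting kernel is a toy example, not taken from any cited architecture:

```python
import numpy as np

def conv2d(image, kernel):
    """Minimal 'valid' 2D convolution: slide the kernel over the image
    and sum the elementwise products at each position."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A horizontal difference kernel applied to a step image: the response
# is nonzero exactly where the intensity changes (a vertical edge).
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
kernel = np.array([[1.0, -1.0]])
edges = conv2d(image, kernel)
```

In a trained CNN, the kernel values are learned rather than hand-designed, and many kernels run in parallel to produce a stack of feature maps.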
AlexNet was introduced in 2012. This architecture is much larger than the earlier LeNet5, packing 650,000 neurons and 60 million trainable parameters into a design of 5 convolutional and 3 fully connected layers [27]. Overlapping max pooling, ReLU nonlinearity, and dropout regularization are also incorporated. Thanks to advances in hardware, AlexNet could be trained and run on just 2 GPUs [27]. It achieved a top-5 test error rate of 15.3%, winning the ILSVRC-2012 competition and outperforming the second-best model by more than 10 percentage points [27]. The overwhelming success of AlexNet drew attention back to CNN, and numerous new architectures, including VGG, InceptionNet, and ResNet, came out in the following years.
The VGG architecture was introduced in 2014, when it won the localization task and placed second in the classification task of that year's ImageNet challenge [29]. Compared to AlexNet, VGG increases the depth of the model by adding more convolutional layers with smaller convolutional filters [29]. However, with the introduction of newer architectures in the following years, the VGG architecture lost popularity due to its large size, high training cost, and comparatively weaker performance.
The InceptionV1 architecture was announced in 2015 under the name GoogLeNet, a 22-layer deep CNN (Figure 1) [19]. The two key innovations that make Inception architectures stand out are the inception module and the auxiliary classifier. The inception module consists of multiple convolutional kernels of different sizes on the same layer [19]. This design allows the model to capture similar features at various scales. Deep CNN are prone to overfitting, and propagating gradient updates through the entire network is hard, a difficulty often referred to as the vanishing gradient problem. By adding auxiliary classifiers in the middle of the network, the auxiliary losses contribute to the final loss calculation so that the gradients also reflect the middle of the network [19]. InceptionV2 and InceptionV3 were introduced a year later; they modified the inception module by factorizing larger kernels into stacks of smaller kernels to make the architecture more computationally efficient [24]. In addition, InceptionV3 uses the RMSProp optimizer and adds batch normalization to the auxiliary classifiers, which significantly improves performance, reaching a top-5 error of 3.58% and a top-1 error of 17.2% on ImageNet, better than estimated human performance [24]. InceptionV4 further refined the architecture by adding reduction blocks and unifying the inception modules [25].
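A minimal sketch of the inception-module idea: parallel branches with different kernel sizes, "same"-padded so their outputs can be stacked. The single-channel input and averaging kernels below are toy simplifications; real modules use learned multi-channel filters plus pooling and 1x1 bottleneck branches:

```python
import numpy as np

def same_conv(x, k):
    """'Same'-padded single-channel convolution so every branch keeps
    the spatial size of the input."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def inception_module(x, kernels):
    """Run parallel branches with different kernel sizes on the same
    input and stack the results along a new channel axis."""
    return np.stack([same_conv(x, k) for k in kernels])

x = np.random.rand(8, 8)
# 1x1, 3x3, and 5x5 branches (toy averaging filters)
branches = [np.ones((1, 1)), np.ones((3, 3)) / 9, np.ones((5, 5)) / 25]
y = inception_module(x, branches)
```

Each branch responds to structure at a different scale; the network learns which scale matters for which feature.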
A major competitor of InceptionNet is ResNet, which applies the idea of the residual connection (Figure 2) [26]. In this architecture, each layer learns a residual function with reference to the layer inputs [26]. ResNet's top-5 error on ImageNet is 3.57%, similar to the performance of InceptionV3 [26]. Interestingly, the residual connection idea was later adopted by the InceptionNet team to create Inception-ResNetV1, a modified version of InceptionV3, and Inception-ResNetV2, a modified version of InceptionV4 [25]. Thanks to residual connections, Inception-ResNetV2 achieved a remarkable 3.1% top-5 error on ImageNet [25].
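The residual connection itself fits in a few lines of NumPy. The linear-plus-ReLU F below is a toy stand-in for the convolutional block in the actual architecture:

```python
import numpy as np

def residual_block(x, weight):
    """y = x + F(x): the layer only has to learn the residual F
    relative to the identity. Here F is a toy linear map plus ReLU."""
    fx = np.maximum(weight @ x, 0.0)  # F(x)
    return x + fx                      # identity shortcut

# With zero weights, F(x) = 0 and the block is an exact identity
# mapping, which is part of why very deep residual stacks remain
# trainable: unneeded layers can degenerate gracefully to identity.
x = np.array([1.0, -2.0, 3.0])
y = residual_block(x, np.zeros((3, 3)))
```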

Visualization of CNN Models
Ever since the introduction of deep neural networks, people have been eager to know what their models have learned [30]. For image-based tasks with CNN models, visualizing the captured features is the most straightforward way. Class activation mapping (CAM) and the saliency map are two simple ways to visualize learned features by projecting the weights and gradients of the output layer back onto the input image [31][32][33]. However, these visualization methods are image-specific and only roughly indicate where the models are focusing. In addition, many saliency methods have recently been criticized for giving misleading visual interpretations, and researchers are advised to use them with caution [34]. To unveil CNN models further, direct deconvolution and indirect optimization are the two major approaches [35]. Deconvolution starts by finding an image from the dataset that triggers high activity in the neuron of interest, and the gradient of the neuron activity is then calculated [35]. In general, a deconvolutional network is a reversed convolutional network, which maps features back to pixels [36]. However, deconvolution visualizations can be noisy and may contain features that are not easy to interpret [37]. The indirect optimization approach can provide more accurate visualizations than deconvolution [35]. These algorithms optimize the pixel values of an image to maximize the activation of the neuron of interest [37][38][39]. Once a set of optimized images for many neurons has been obtained, a dimensionality reduction method, such as UMAP or t-SNE, can create an atlas that systematically displays the correlations among features captured by different neurons at the same layer [37][40][41][42].
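CAM itself reduces to a class-weighted sum of the final convolutional feature maps, which can be sketched directly in NumPy (the feature maps and class weights below are synthetic, for illustration only):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """CAM: weight each final-layer feature map by the classifier
    weight for the target class and sum, giving a coarse spatial
    heatmap of where the evidence for that class came from.
    feature_maps: (K, H, W); class_weights: (K,)."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)
    cam = np.maximum(cam, 0)  # keep only positive evidence
    return cam / cam.max() if cam.max() > 0 else cam

# Toy example: two 4x4 feature maps; the second map fires on the
# bottom-right corner and carries most of the class weight, so the
# heatmap highlights that corner.
maps = np.zeros((2, 4, 4))
maps[1, 2:, 2:] = 1.0
cam = class_activation_map(maps, np.array([0.1, 0.9]))
```

In a real model the heatmap would be upsampled to the input resolution and overlaid on the image; the coarseness of that upsampling is one reason CAM only roughly localizes evidence.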

Graph Neural Networks
A graph neural network (GNN) is a type of neural network designed for data containing relational information [43]. Data with non-Euclidean structure, such as particle interactions, molecular structures, and object relationships in images, can be modeled by GNNs [44]. In general, GNNs can be further classified into four categories: recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial-temporal GNNs [44].
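A single convolutional-GNN layer can be sketched as neighborhood averaging followed by a shared linear map. The simple row-mean normalization over A + I below is a simplification of the symmetric normalization used in common GCN formulations:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution layer: each node averages the features of
    its neighbors (and itself, via self-loops), then applies a shared
    linear map and ReLU."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)  # row-mean normalize
    return np.maximum(a_hat @ features @ weight, 0.0)

# Three nodes in a path graph 0-1-2 with 2-d features and an identity
# weight matrix, so the layer reduces to pure neighborhood averaging.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 0.0]])
out = gcn_layer(adj, feats, np.eye(2))
```

Stacking such layers lets information propagate across multiple hops, which is what allows a GNN to relate tissue structures at distant locations on a slide graph.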

Generative Adversarial Networks
A generative adversarial network (GAN) is a type of neural network consisting of two networks that are trained simultaneously [45]. The generator is trained to create fake images that try to fool the discriminator, while the discriminator, trained on both real and generated fake images, learns to distinguish them [45]. Many GAN variants have been applied to tasks such as style transfer, visualization of neural networks, and object segmentation [45][46][47][48][49][50][51][52]. Cycle-GAN, a GAN variant that uses a cycle-consistency loss to train two pairs of generators and discriminators simultaneously, has become increasingly popular for image-to-image translation tasks [49]. Unlike the conditional GAN, which requires paired data of two styles, cycle-GAN only needs two unpaired sets of images of the two styles, which significantly lowers the data requirement while preserving the quality of style transfer [49].
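The cycle-consistency idea can be sketched with a toy round trip. The "generators" below are simple intensity inversions rather than trained networks, chosen so the A→B→A reconstruction is exact:

```python
import numpy as np

def cycle_consistency_loss(x, g_ab, g_ba):
    """L1 distance between x and G_BA(G_AB(x)): translating from
    domain A to B and back should reproduce the original image.
    Penalizing this reconstruction error is what lets cycle-GAN learn
    from unpaired style sets."""
    return np.mean(np.abs(g_ba(g_ab(x)) - x))

# Toy 'generators': invert intensities in both directions, so the
# round trip is (numerically almost) exact and the cycle loss ~ 0.
g_ab = lambda x: 1.0 - x
g_ba = lambda x: 1.0 - x
x = np.random.rand(16, 16)
loss = cycle_consistency_loss(x, g_ab, g_ba)
```

In the full cycle-GAN objective this term is added (in both directions) to the usual adversarial losses of the two discriminators.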

Classification and feature prediction
With the success of CNN models in various real-world computer vision classification tasks, researchers have also been training and testing these models on use cases in biomedical fields, including pathology. These studies may involve training an existing CNN architecture from scratch; however, this requires more data, and standard data augmentation techniques are not always suitable for biomedical images. Alternatively, transfer learning, which freezes most of the parameters of a model typically pre-trained on ImageNet, demands considerably less data. For example, an ImageNet pre-trained InceptionV3-based CNN model achieves a high level of accuracy in determining skin lesion malignancy and the likelihood of melanoma [12,53].
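A hedged sketch of the transfer-learning setup: with the backbone frozen, training reduces to fitting a small classifier head on the extracted features. The 1-D features below are synthetic stand-ins for pre-trained activations; no specific model from the cited studies is implied:

```python
import numpy as np

def train_linear_head(features, labels, lr=0.5, steps=200):
    """Fit a logistic-regression head by gradient descent. In a real
    transfer-learning pipeline, `features` would be activations from a
    frozen, ImageNet-pretrained backbone; only this head is updated."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid
        grad = p - labels                               # dL/dlogits
        w -= lr * features.T @ grad / len(labels)
        b -= lr * grad.mean()
    return w, b

# Linearly separable toy 'features' for two classes
feats = np.array([[0.0], [0.2], [0.8], [1.0]])
labels = np.array([0, 0, 1, 1])
w, b = train_linear_head(feats, labels)
preds = (1.0 / (1.0 + np.exp(-(feats @ w + b))) > 0.5).astype(int)
```

Because only the head's few parameters are learned, far fewer labeled images are needed than for training the full network from scratch.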
In clinical settings, pathologists typically examine histopathology slides under microscopes to provide diagnoses or other clinical information. Thanks to the development of digital pathology equipment, digitizing histopathology slides has become cheaper and more accessible. As a result, more and more de-identified digital histopathology slide images are becoming available in many databases. These images, often with extremely large dimensions of hundreds of thousands by hundreds of thousands of pixels, are saved in SVS or SCN format, which stores a pyramid of the same image at different resolutions [54]. Because of this, in order to fit these digital histopathology images into CNN architectures, researchers usually develop their own customized pipelines with commonly used techniques such as tiling the whole slide images (WSI) or sampling regions of interest (ROI) (Figure 3) [55]. In the past few years, classification CNN models trained on histopathology images have shown phenomenally high performance and promising clinical potential in predicting both morphological and molecular features. Visualization techniques also reveal results that often match pathologists' expectations, and many models generalize to independent real-world clinical images. For example, Inception- and InceptionResNet-based models show high accuracy and strong statistical metrics in predicting subtypes and key biomarker mutations, such as STK11 and EGFR, from non-small-cell lung cancer histopathology slides [13,56,57]. With the integration of other critical clinical variables and images, immune response, G-CIMP status, and telomere length can be predicted in glioblastoma patients [58]. BRAF mutation, a well-known biomarker in malignant melanoma, can also be accurately predicted with a CNN-based model [59]. Other molecular and genomic features, such as microsatellite instability (MSI), can likewise be predicted from histopathology slides with reasonable accuracy [14].
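A minimal tiling sketch: cut the slide array into non-overlapping patches and discard near-empty background tiles. The non-zero "tissue" test here is a placeholder for the stain-intensity thresholding that real WSI pipelines use:

```python
import numpy as np

def tile_slide(slide, tile_size, min_tissue_fraction=0.1):
    """Cut a (huge) slide array into non-overlapping tiles and keep
    only tiles containing enough tissue. 'Tissue' here is simply
    non-zero pixels; real pipelines threshold on stain intensity."""
    h, w = slide.shape[:2]
    tiles = []
    for i in range(0, h - tile_size + 1, tile_size):
        for j in range(0, w - tile_size + 1, tile_size):
            tile = slide[i:i + tile_size, j:j + tile_size]
            if np.mean(tile > 0) >= min_tissue_fraction:
                tiles.append(((i, j), tile))
    return tiles

# A mostly empty 'slide' with tissue in the top-left quadrant: only
# that quadrant's tile survives the background filter.
slide = np.zeros((512, 512))
slide[:256, :256] = 1.0
kept = tile_slide(slide, 256)
```

The surviving tiles (with their coordinates) then become the CNN's fixed-size inputs, and tile-level predictions are aggregated back to a slide-level result.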
Critical gene expression levels can also be inferred by applying these CNN classification models to WSI [60]. Some recent models also show successful classification results on histopathology images of multiple tissue types [61,62]. These successes indicate that CNN are a suitable approach for studying the correlation between molecular and morphological features in histopathology slides, some of which may be undetectable or often overlooked by human pathologists.
However, histopathology images differ markedly from those in ImageNet because of their much larger sizes, higher resolution, and sparser distribution of useful features [10,54,63]. Deep learning architectures that exploit these characteristics are likely to achieve better results and unveil more interesting hidden features in histopathology image classification tasks. For instance, a multi-resolution CNN model, which takes advantage of the data structure of SVS and SCN image files, achieves higher performance in classifying endometrial cancer molecular features than its single-resolution counterparts [64]. Weakly supervised techniques, such as multiple instance learning, also demonstrate decent performance in histopathology classification tasks and have been gaining popularity in recent years [63,65,66]. The innovative idea of bringing GNN models into histopathology classification offers greater capacity for understanding the subtle relations between features of different tissue structures and at different locations on gigantic digital histopathology slides [67,68].
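The multiple-instance-learning setup can be sketched as pooling tile-level scores into a single slide-level score. Max and mean pooling below are the simplest aggregators; attention-based pooling is a common learned alternative:

```python
import numpy as np

def mil_slide_score(tile_scores, pooling="max"):
    """Multiple instance learning aggregation: a slide (the 'bag')
    receives a label from its tiles (the 'instances') without any
    tile-level annotation. Max pooling flags a slide if any single
    tile looks positive; mean pooling is a smoother alternative."""
    scores = np.asarray(tile_scores)
    return scores.max() if pooling == "max" else scores.mean()

# A slide with one strongly suspicious tile among many benign ones:
# max pooling flags the slide even though the average looks benign.
scores = [0.05, 0.1, 0.92, 0.07]
```

Only the slide label is needed for training, which is what makes weak supervision attractive when per-tile annotation is impractical.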

Segmentation
In addition to classification tasks, CNN models are also capable of segmenting cells or tissues in histopathology slides [8,54]. The segmented cells or tissues can then be used to train classification models for different prediction tasks, including the recurrence of non-small-cell lung cancer [69] and endometrial tissue types [70]. A popular segmentation CNN architecture in the biomedical field is U-net, which has a structure similar to an autoencoder [71]. A 3D version of U-net, which uses 3D instead of 2D convolutional layers, is capable of segmenting volumetric images [21]. Other autoencoder-based methods also achieve promising results in segmentation tasks on histopathology images, such as highlighting tumor regions in liver cancer WSI [72]. Well-trained style transfer models are also viable options for segmentation tasks [47]. With the introduction of GAN, using conditional-GAN or cycle-GAN models in combination with CNN models for segmentation has also been shown to be viable, with less stringent training data requirements [45,52,73]. Unlike most classification models, segmentation models can be more adaptable to different tissue types due to the similarities of stained features and textures across histopathology slides [74]. Also, the evaluation metrics for segmentation models can differ drastically from those for classification models. Since the segmentation labels are usually themselves images, it is not easy to determine a binary prediction or even a prediction score at the per-image level. Hence, typical statistical metrics such as AUROC or precision and recall often cannot fairly evaluate segmentation tasks. Pixel-level metrics, such as intersection over union (IoU), have their own weakness in that they cannot objectively assign relative importance to pixels in different regions.
Object-level metrics can be an optimal alternative, but the requirement of identifying all objects in the label images hinders their adoption in real-world model evaluation. Therefore, researchers often use customized evaluation metrics combining customized pixel weights, dice loss, and IoU with specific thresholds [75,76].
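Both pixel-level metrics are easy to compute from binary masks, and they are related by Dice = 2·IoU/(1 + IoU). A minimal sketch on toy masks:

```python
import numpy as np

def iou(pred, target):
    """Intersection over union of two binary masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union > 0 else 1.0

def dice(pred, target):
    """Dice coefficient = 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * inter / total if total > 0 else 1.0

# Toy masks: predicted top half vs. ground-truth left half of a 4x4
# image; they overlap only in the top-left quadrant.
pred = np.zeros((4, 4), dtype=bool)
pred[:2, :] = True
target = np.zeros((4, 4), dtype=bool)
target[:, :2] = True
```

Note that both metrics treat every pixel equally, which is exactly the limitation discussed above: a missed pixel on a tumor boundary counts the same as one in an uninformative background region.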

Summary
In this review, we introduced popular deep learning algorithms, such as CNN, GNN, and GAN, highlighting how they work and how they can be applied to solve clinical and scientific questions in pathology. We also discussed recent publications that apply these deep learning techniques to classify or segment histopathology imaging data. With the continuing advancement of machine learning and deep learning techniques and the development of hardware and software, it is realistic to believe that the integration of artificial intelligence and pathology will become an even more attractive field to explore. Although one must be rigorous and ethical about translating these AI-based technologies into clinical settings, we hold an optimistic view that they will eventually revolutionize medical diagnosis and push the development of precision medicine forward. Above all, the ultimate goal of introducing AI into pathology, and biomedicine in general, is to make healthcare more accessible, affordable, and agreeable.
Nevertheless, a number of limitations of current studies and potential obstacles still prevent these models from being implemented in real-world clinical settings. For example, only patterns already understood by pathologists can be used as reliable evidence for prediction, which significantly limits the tasks to which deep learning models can be applied. In addition, the patient samples available as training, validation, and test sets are very limited for each specific task of interest. Moreover, detailed labels for medical images are often unavailable, and labeling standards among clinicians vary significantly across countries. More advanced self-supervised or semi-supervised methods may solve some of these problems from a technical perspective in the future.