Applied Self-Supervised Learning: Review of the State-of-the-Art and Implementations in Medicine

Machine learning has become an increasingly ubiquitous technology, as big data continues to inform and influence everyday life and decision-making. Currently in healthcare, as well as in most other industries, the two most prevalent machine learning paradigms are supervised learning and transfer learning. Both practices rely on large-scale, manually annotated datasets to train increasingly complex models. However, the requirement that data be manually labeled leaves an excess of unused, unlabeled data available in both public and private data repositories. Self-supervised learning (SSL) is a growing area of machine learning that can take advantage of unlabeled data. In contrast to other machine learning paradigms, SSL algorithms create artificial supervisory signals from unlabeled data and pretrain models on these signals. The aim of this review is two-fold. First, we provide a formal definition of SSL, divide SSL algorithms into four distinct subsets, and review the state-of-the-art published in each of those subsets between 2014 and 2020. Second, this work surveys recent SSL algorithms published in healthcare, in order to provide medical experts with a clearer picture of how they can integrate SSL into their research, with the objective of leveraging unlabeled data.


Introduction
For decades, computational methods have been applied to the analysis of images and the wide array of real-world applications of image analysis has led to the growth of a subfield of artificial intelligence known as computer vision (CV). CV is an interdisciplinary field that deals with the design of algorithms that allow computers to gain a high-level, semantic understanding of images and videos. Historically, the performance of computer vision algorithms has been dependent on hand-crafted features such as SIFT [1] and HOG [2]. While these features did at one point represent the state-of-the-art, SIFT and HOG are analogous to representations associated with lower levels of the primate visual pathway: object recognition for humans occurs several stages downstream from this area, and so it was assumed that better features could be discovered if they were learned in a more hierarchical fashion [3]. Convolutional Neural Networks (CNNs), a subset of deep learning algorithms modeled after the visual learning mechanisms of biological organisms [4], aim to imitate this hierarchical learning process.
In recent years, the field of computer vision has seen remarkable growth thanks to a much wider adoption of CNNs and similar methods, and performance of deep neural networks has since eclipsed earlier approaches on computer vision tasks across the board. The success of CNNs in image processing is due to two primary factors: the development of powerful GPUs or custom computing hardware that enables the training of larger CNN architectures, and the availability of much larger and more diverse datasets [5] for training such large networks. Where vision is concerned, the landmark dataset that sparked a lot of innovation is ImageNet [6], a large-scale image database of more than 14 million manually annotated images. ImageNet contains a densely populated semantic hierarchy, high precision annotation for all subcategories of images, and diverse sets of images for objects that include variable appearances, positions, viewpoints, poses, and occlusions. These characteristics have spurred the growth of research in the development of novel computer vision algorithms, mainly due to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [7], which provides a benchmark for state-of-the-art computer vision algorithms. The first major breakthrough came when AlexNet [5] took first place in the 2012 ILSVRC challenge, the first time that a CNN based method won in head-to-head competition with classical computer vision approaches. AlexNet's state-of-the-art performance showed that CNNs had the ability to outperform traditional computer vision algorithms, and since then, ILSVRC has consistently seen novel CNN architectures yielding increasingly better results [8].
While CNNs have continued to improve and allowed the field of image processing to evolve at a rapid pace, they have known limitations. CNN architectures generally contain a large number of parameters, which means that they are computationally expensive to train. This also means that the underlying semantic distributions they learn from datasets are heavily dependent on the size of the datasets, and so in many cases their performance and generalizability are upper-bounded by dataset size, especially within the domain of object recognition [9]. Because CNNs are usually trained in a supervised manner, these datasets must also be annotated, which is time consuming and costly, given that these datasets contain millions of images. This problem is compounded in domains such as digital pathology and laboratory medicine, where the annotation of images can be error-prone, expert consensus is not uniform in certain domain areas, and training data is limited and expensive to generate [10].
A subfield of machine learning that addresses some of these challenges is transfer learning. In a typical transfer learning process, a model is first pretrained on a large, labeled dataset such as ImageNet. After this, the model parameters are frozen, an adaptation layer is added on top of its architecture, and the new network is finetuned on target tasks using a smaller dataset with limited annotations [9]. In practice, this allows the network to leverage representations learned on the larger dataset, boosting training efficiency and performance on the smaller dataset of interest. This general practice has been shown to yield strong results in many contexts but has achieved only mixed results in medicine [11]. This may be in part because features learned from the natural images found in ImageNet may not be semantically important in specific domains with very different structure such as those commonly encountered in pathology or radiology. A review on transfer learning can be found in [12].
Because of these challenges, the need to leverage unlabeled training data without any type of manual annotation has begun to emerge. Self-supervised learning (SSL) has emerged as a field that attempts to exploit unlabeled data to allow networks to extract meaningful representations without the use of labeled data. This is done by automatically creating artificial supervisory signals from unlabeled data and using these signals to pretrain networks on different imaging tasks. As a consequence of providing the ability to leverage large amounts of unlabeled training data, SSL has allowed researchers in specific fields, including medicine, to create highly specialized datasets. These datasets can then be used to train networks to learn specific representations of the images they are analyzing without being dependent on curation, data abstraction, manual annotations or data labeling.
Given the huge volumes of unlabeled data routinely created in clinical practice and biomedical research, SSL represents a promising approach for medicine and healthcare in general. In order to give medical professionals a clear understanding of the utility SSL can provide, this review will be organized into three sections. First, a background of prerequisite material that has led to the advent of SSL will be covered. Second, a comprehensive review of SSL will be given. This covers some of the earliest techniques and pretext tasks published that provided a foundation for the field, as well as the current state-of-the-art. Lastly, a review of self-supervised pretext tasks applied to medicine is proposed, with a focus on pathology. Here we will cover novel pretext tasks that have been designed specifically for use in the field of digital pathology, look at commonalities that have led to their success, and discuss potential directions for future research and clinical implementations.

Materials and Methods
A search was conducted for papers published between the years 2014 and 2020 discussing either the field of SSL or applications of SSL in pathology. Three academic publication databases, Scopus, Google Scholar, and CrossRef, were queried using specific keywords ("Self-supervised learning", "Selfsupervised learning", "representation learning"). Papers that did not contain open-source code or links to project repositories were excluded from further evaluation. In addition to this, papers published on the topic of general SSL with fewer than 5 citations were also filtered out. For papers published specifically on the application of SSL in medicine, citation count was not a factor for inclusion, since this is a specific area covered here. Paper abstracts were reviewed to ensure that their content was relevant to either SSL or its applications in medicine. Papers that contained sufficient content relevant to these fields were read in full, characterized, and incrementally related to the rest of the study corpus. In sum, we screened over 1,500 papers and retained 118 for inclusion in our review (Figure 1). The full list of papers we reviewed and characterized can be found in Supplementary Table 1. All works included in this review are organized into two categories, general SSL works and applications of SSL in medicine, in Table 1. The following sections serve as prerequisite information for several research domains that are relevant to the field of SSL. Because the scope of this review is limited to SSL's applications in computer vision, all of the algorithms covered will generally use CNNs as part of their architecture. Some of the more complex algorithms additionally utilize different forms of adversarial or contrastive learning as part of their frameworks, which at times allows them to achieve more robust and generalizable results. In order to set the stage for SSL, we will also provide a short summary of transfer learning. This is motivated by the fact that transfer learning is the current dominant learning paradigm in machine learning, and transfer learning results are currently used as a touchstone to validate new SSL algorithms.

Convolutional Neural Networks (CNNs)
In order to provide professionals not familiar with machine learning with a more concrete idea of the foundational topics that will come up throughout this review, we begin by providing a brief summary of convolutional neural networks, which are a type of neural network architecture generally used for imaging tasks. In contrast with multilayer perceptrons (MLP) and other common network architectures consisting only of fully connected layers, CNNs are primarily composed of convolution and pooling layers (Figure 2). In a convolutional layer, small groups of weights, also called filters or kernels, are convolved with inputs from the previous layer, forming dot products with patches of the inputs equal to the size of the filters. Because each filter is convolved with the entire input, CNNs demonstrate useful properties such as translation invariance (i.e., recognizing a feature no matter where in the image it is located). Each convolutional filter produces a feature map, and CNN architectures typically have many filters in each layer. Pooling layers aim to decrease the dimensionality of feature maps by iterating over patches of values in the feature maps and either keeping only the maximum value (max pooling) or the average value (mean pooling). These different types of layers build on one another in a hierarchical fashion and allow CNNs to extract elementary visual features such as oriented edges, end-points, and corners, which are then combined in higher layers to create higher-order features [13]. CNN weights are trained and updated in an iterative manner using various optimization algorithms. As training progresses, the higher-order features learned become increasingly representative of the images the CNN is being trained with. 
This enables CNNs to act as autonomous feature extractors, which gives them an advantage over algorithms that rely on handcrafted features, because handcrafted features must make assumptions about the input data that do not account for unknown variability. It has also been shown that the features CNNs learn are partially invariant to shifts, scaling, and distortions, which makes them more suitable for computer vision than fully connected neural networks. Further details can be reviewed in [4]. In 2012, CNNs saw a surge in popularity, when Krizhevsky et al. designed AlexNet, which had a much deeper architecture with many more parameters than standard CNNs. AlexNet's novel architecture gave it a greater learning capacity which achieved state-of-the-art performance on the ImageNet dataset. Since then, CNNs have continued to embody the state-of-the-art in image classification.
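To make the convolution and pooling operations described above concrete, the following is a minimal numpy sketch of a single filter and a max-pooling step applied to a toy image containing a vertical edge. This is illustrative only; real CNN layers use many learned filters and highly optimized implementations, and the filter and image here are assumptions chosen for clarity.

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2-D convolution (really cross-correlation, as in most CNN libraries):
    the kernel forms a dot product with each patch of the input."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling: keep only the maximum of each size x size patch."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# Toy image: left half dark, right half bright, i.e. one vertical edge.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
# A hand-picked vertical-edge filter (in a real CNN this would be learned).
edge_kernel = np.array([[-1., 1.],
                        [-1., 1.]])
fmap = conv2d(image, edge_kernel)   # responds strongly only at the edge
pooled = max_pool(fmap)             # halved spatial dimensions, edge response kept
```

Note how the pooled map retains the strong edge response while discarding spatial detail, which is the dimensionality-reduction role pooling layers play.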

Generative Adversarial Networks (GANs) and Adversarial Learning
Adversarial learning is a form of learning in which networks are pitted against one another. Adversarial learning is most commonly utilized in the form of the Generative Adversarial Network (GAN) framework (Figure 3). GANs were first introduced by Goodfellow et al. in [14], and can be described as follows. We begin with some data modeled as a random vector x that we would like to generate new instances of. This vector can represent any type of data, such as images. To generate new instances, we need to know the probability distribution of x, which we will call px. In order to approximate this probability distribution, two separate networks are trained, a generator and a discriminator. The generator takes as input a random noise vector z drawn from a known prior pz and is tasked with learning a mapping from z to x. The discriminator takes as input either an output of the generator or a ground truth image, and outputs a probability, P(y), representing how likely it is that its input is real rather than generated. As the two networks are jointly optimized, the generator learns a mapping from z to x whose output distribution approximates px. Readers interested in their use in the medical domain can refer to [15,16]. Figure 3. A standard GAN framework, consisting of a generator, which generates fake images from a random probability distribution, and a discriminator, which learns to discriminate between real and fake images.
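To make the adversarial objective concrete, the following is a minimal numpy sketch of the two losses being optimized in the standard GAN formulation [14]. The discriminator outputs p_real and p_fake are hypothetical stand-ins for actual network outputs; in practice both losses are backpropagated through deep networks.

```python
import numpy as np

def bce(p, label):
    """Binary cross-entropy for a scalar probability p against label 1 (real) or 0 (fake)."""
    eps = 1e-12  # numerical safety for log
    return -(label * np.log(p + eps) + (1 - label) * np.log(1 - p + eps))

# Hypothetical discriminator outputs: P(input is real).
p_real = 0.9   # its belief that a ground truth image is real
p_fake = 0.2   # its belief that a generated image is real

# Discriminator objective: push p_real toward 1 and p_fake toward 0.
d_loss = bce(p_real, 1) + bce(p_fake, 0)

# Generator objective: fool the discriminator, i.e. push p_fake toward 1.
g_loss = bce(p_fake, 1)
```

The generator's loss is large whenever the discriminator confidently rejects its samples, which is exactly the pressure that drives the generator's output distribution toward px.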

Contrastive Learning
In contrastive learning, the goal is for a network to learn latent representations of instances such that similar objects are closer together in the latent space and dissimilar objects are farther apart. In many cases, a typical contrastive learning framework will utilize a siamese neural network [17]. In a siamese neural network architecture, two different instances are passed through two branches of the network that share the same weights for their first several layers. An embedding is learned for both of these inputs, and then the embeddings are concatenated and passed to further layers which translate them into some desired value. In contrastive learning, this architecture typically utilizes a contrastive loss [18]. More recent state-of-the-art techniques have also achieved better results using contrastive learning [19,20]. These techniques are elaborated on in section 3.5.
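As an illustration, the margin-based contrastive loss of [18] can be sketched in a few lines of numpy: similar pairs are penalized for being far apart, and dissimilar pairs are penalized only when they fall inside a margin. The embeddings and margin below are toy values chosen for illustration.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, similar, margin=1.0):
    """Contrastive loss (Hadsell et al. style): pull similar pairs together,
    push dissimilar pairs at least `margin` apart in embedding space."""
    d = np.linalg.norm(emb_a - emb_b)       # Euclidean distance in latent space
    if similar:
        return d ** 2                       # penalize any distance between similar pairs
    return max(0.0, margin - d) ** 2        # penalize dissimilar pairs closer than margin

# Toy embeddings, as would be produced by the two shared-weight branches.
a = np.array([0.10, 0.20])
b = np.array([0.15, 0.25])   # near a: a "similar" pair
c = np.array([0.90, 0.80])   # far from a: a "dissimilar" pair

loss_sim = contrastive_loss(a, b, similar=True)    # small: pair is already close
loss_dis = contrastive_loss(a, c, similar=False)   # zero once distance reaches the margin
```

Minimizing this loss over many pairs is what shapes the latent space so that semantic similarity corresponds to geometric proximity.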

Transfer Learning
Transfer learning is formally defined as the following: given a source domain Ds with a corresponding source task Ts and a target domain Dt with a corresponding task Tt, transfer learning is the process of improving the target predictive function ft(.) by using the related information from Ds and Ts, where Ds ≠ Dt and Ts ≠ Tt [12]. A standard transfer learning framework is shown in Figure 4. Transfer learning has shown promise on a variety of computer vision tasks. The network pretrained in [9] obtained state-of-the-art results when transferring its learned features to the downstream task of object classification on the Pascal VOC 2007 [21] and Pascal VOC 2012 datasets. In [22], it is shown that transferring features and then fine-tuning them usually results in networks that generalize better than those trained directly on the target dataset. In [23] a CNN shows state-of-the-art results when transferring features learned from ImageNet pretraining to downstream tasks on the Caltech-101 and Caltech-256 datasets. At the time of their publication, [24] ranked 4th in classification, 1st in localization, and 1st in detection on the ILSVRC 2013 dataset. The different networks used for these tasks all shared a common set of features learned through transfer learning. In [3], the authors discriminatively pretrained a CNN on a large auxiliary dataset (ILSVRC 2012 classification) using image-level annotations only and then transferred these features to classification tasks on the 200-class ILSVRC2013 detection dataset, outperforming the existing state-of-the-art method. In [25], the authors showed that generic visual representation learned through transfer learning outperforms many other visual representations on standard benchmark object recognition tasks.
Transfer learning has also been successfully applied to more specific domains of interest, such as the field of medicine. In [10], the authors show that a CNN pretrained on ImageNet learns transferable features that outperform handcrafted features and a CNN trained from scratch on 4 different medical tasks. In [26], experiments show similar results when comparing off-the-shelf CNN features to CNNs trained from scratch and then fine-tuned for a specific medical domain. However, [27] shows that when applying CNNs to the detection of lymph node metastasis in pathology images, pretraining improves convergence speed but the transferred features do not improve performance; the authors postulate that this is potentially due to a large domain difference between pathology images and the natural scenes in ImageNet, leading to limited transferability. [11] shows that feature representations learned through transfer learning are only as useful as their applicability to the downstream task of interest. In addition to this, transfer learning typically requires extremely deep architectures that have been pretrained on large datasets such as ImageNet, which are not always necessary for medical tasks [28]. Transfer learning is also not directly applicable to 3D medical image analysis applications (e.g., MRIs, CTs and other "voxel"-based representations) because pretrained 2D CNNs are not compatible with 3D architectures. So, while the above papers show that transfer learning has achieved success with a variety of different computer vision tasks and datasets, it too has its limitations. Not all images can be approximated by the "natural" images found in datasets such as ImageNet, particularly those found in different medical domains.
Although they might share basic geometric features such as circles and shaded lines, a picture of a park or a dog has a very different underlying semantic structure than a high-resolution H&E-stained pathology image or a brain MRI scan.
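The freeze-and-finetune workflow described above can be sketched as follows. This is an illustrative numpy toy, not a real CNN: a fixed random linear map stands in for the frozen pretrained layers, the dataset and labels are synthetic, and only a new logistic "adaptation head" is trained on the small target dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen pretrained layers (in practice: CNN weights learned
# on a large source dataset such as ImageNet, then frozen).
W_frozen = rng.normal(size=(8, 4))
def features(x):
    return np.maximum(x @ W_frozen, 0)   # frozen layers + ReLU

# Small synthetic labeled target dataset.
X = rng.normal(size=(32, 8))
y = (features(X) @ rng.normal(size=4) > 0).astype(float)

def log_loss(w):
    p = 1 / (1 + np.exp(-features(X) @ w))
    return -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

# Fine-tuning: gradient descent updates only the new head `w`;
# W_frozen is never touched.
w = np.zeros(4)
loss_start = log_loss(w)
W_before = W_frozen.copy()
for _ in range(200):
    p = 1 / (1 + np.exp(-features(X) @ w))
    w -= 0.1 * features(X).T @ (p - y) / len(X)
loss_end = log_loss(w)
```

The point of the sketch is the division of labor: the frozen map supplies representations learned elsewhere, and only the small head is fit to the limited target data, which is what makes transfer learning data-efficient.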

Self-Supervised Learning (SSL)
SSL is a field of machine learning that has recently begun to emerge as a promising alternative to supervised learning and transfer learning. SSL bears some similarity to transfer learning in that representations are learned from an auxiliary pretext task and then transferred to a downstream task of interest. However, unlike transfer learning, in SSL the data used for the pretext task and the downstream task can be taken from the same data source, or from different sources. SSL can be formally defined by modifying the definition of transfer learning: Given a source domain Ds and a target domain Dt, where Ds = Dt or Ds ≠ Dt, a pretext task Ts and a corresponding downstream task Tt, SSL is the process of improving the target predictive function ft(.) by using the related information from Ds and Ts. Put in less technical terms, SSL is defined by creating an artificial supervisory signal from some unlabeled data that can optionally be related to the target data, pretraining a network, and then finetuning the pretrained weights of that network on the target data. In some situations, the fact that the domain is the same for both the pretext task and the target task allows SSL to overcome some of the weaknesses of transfer learning, the most important one in the medical field being the poor learning caused by large visual and semantic differences between source and target domains (e.g., the ImageNet dataset vs. digitized pathology slides).
A standard SSL framework is shown in Figure 5. SSL can be used as a preprocessing technique, where a network's weights are first pretrained using a pretext task and then trained on the dataset's actual labels. Recent advancements in this field have produced pretext tasks that allow networks to come close to matching the performance of networks trained through purely supervised techniques [20,29,30].

Figure 5.
A standard SSL framework. Unlabeled source or target data is transformed to create an auxiliary supervisory signal. The CNN is pretrained to accomplish a pretext task using either target data or separate source data, and then the weights from layers 1-3 are reused when training to accomplish a downstream task using the target data.
The success of SSL is heavily dependent on how well the pretext tasks are designed. If not designed properly, the learning algorithm will be able to find "trivial" solutions which it can exploit as a shortcut to representation learning. These include low-level cues like boundary patterns or textures continuing between patches, as well as chromatic aberration [31,32]. These shortcuts vary depending on the details of the pretext task and are most often dealt with through various preprocessing techniques.
Self-supervised learning can be divided into four broad categories: pixel to scalar, pixel to pixel, adversarial learning, and contrastive learning. In the following sections, we review each in detail.

Self-Supervised Learning -Pixel to Scalar
One of the most central tasks in computer vision is image classification: a dataset is divided into different subgroups called classes, where each class shares some predetermined commonality. This dataset is then used to train a machine learning algorithm to differentiate between these classes. Once the algorithm is trained, it is tested on its ability to correctly classify new images into one of the original classes. An example of this would be training a CNN to discriminate between pictures of cats and dogs, and then once it is optimized, giving it images it hasn't seen before and asking it to classify each image as a cat or a dog. Characteristics of each image that determine what class it belongs to, called features, are used to train the classifier.
Image classification has repeatedly shown itself to be an efficient task to force CNNs to learn powerful and versatile representations of images. Consequently, many self-supervised algorithms also model their pretext task as an image classification task. For this review, we labeled any pretext task that transforms an image into either a scalar or vector value as pixel-to-scalar. The primary difference between pixel-to-scalar pretext tasks and a typical image classification pipeline is that, instead of using manually annotated class labels as the ground truth feedback signal, the training data is augmented in some way to create an artificial supervisory signal. This artificial supervisory signal does not require manual annotation. It is instead extracted from the training data autonomously, so massive amounts of unlabeled data can be leveraged for training. When the pretext task is designed in a clever way, it allows CNNs to learn representations that are almost as powerful and robust as those learned through supervised training.
Many pixel-to-scalar pretext tasks that follow this classification paradigm revolve around the idea of "solving jigsaw puzzles." In [31], the authors create an artificial supervisory signal by randomly sampling pairs of patches from the training image. The pairs of patches are then fed to a siamese CNN, which extracts low-dimensional representations for each image separately. The representations are combined to form a fused representation which is used to classify the location of the neighboring patch given the location of the first patch. It is shown qualitatively and empirically that the network trained with this pretext task learns to associate semantically similar patches and generalizes well to the object detection task on the PASCAL VOC 2007 dataset. Noroozi et al. expand upon the work of [31], introducing the Context Free Network (CFN), an algorithm that learns by solving jigsaw puzzles as a pretext task [32]. Instead of only sampling pairs of patches from an image, all 9 patches are sampled at once. These patches are then realigned according to a permutation randomly sampled from a set of predefined permutations, each with an index assigned to it. The 9 patches are fed through a 9-headed siamese CNN, which learns a feature encoding for each patch. The feature encodings are then combined into a single fully-connected layer, which is downsampled to predict the index of the correct permutation. This method is shown to be more robust than [31] due to the fact that spatial ambiguities between similar patches are avoided when all 9 patches are evaluated at once. In [33], an algorithm called DeepPermNet is introduced, where an image is split into patches and shuffled, and a siamese CNN takes the patches as input and outputs the permutation matrix that was used to shuffle the original image.
The pretext task used in [32] is extended in [34], where the task of solving jigsaw puzzles is optimized jointly with the task of object classification over different joint domains. Schematic illustrations of the frameworks used in [31] and [32] are shown in Figure 6. Figure 6. (a) An illustration of the algorithm used in [31]. A central patch and a surrounding patch are extracted from an image and encoded by a CNN, which then predicts the relative location of the neighbor patch. (b) An illustration of the algorithm used in the Context Free Network [32]. Nine spatially close patches are shuffled and then embedded by a CNN, which predicts the original permutation of the patches.
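The label-creation step shared by these jigsaw-style pretext tasks can be sketched in a few lines: split the image into a grid of patches, shuffle them with a permutation drawn from a predefined set, and use the permutation's index as the artificial class label. The grid size, image, and two-element permutation set below are toy values; [32] uses a much larger permutation set chosen to maximize Hamming distance between permutations.

```python
import numpy as np

rng = np.random.default_rng(0)

def jigsaw_example(image, permutations):
    """Split an image into a 3x3 grid of patches, shuffle them with a randomly
    chosen predefined permutation, and return (shuffled patches, permutation index).
    The index is the artificial class label the network must predict."""
    ph, pw = image.shape[0] // 3, image.shape[1] // 3
    patches = [image[i*ph:(i+1)*ph, j*pw:(j+1)*pw]
               for i in range(3) for j in range(3)]
    label = int(rng.integers(len(permutations)))
    shuffled = [patches[k] for k in permutations[label]]
    return shuffled, label

# Toy permutation set: identity and full reversal.
perms = [np.arange(9), np.array([8, 7, 6, 5, 4, 3, 2, 1, 0])]
image = np.arange(36.0).reshape(6, 6)   # toy 6x6 "image"
shuffled, label = jigsaw_example(image, perms)
```

No human annotation appears anywhere in this pipeline, which is precisely what makes the supervisory signal "free" to generate at scale.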
Another common pixel-to-scalar pretext task formulation is that of solving image transformations such as image rotations. This idea was first introduced in [35], where the authors apply multiple random transformations to an image and use all transformations derived from the same image to create a surrogate class. They then use an index for each image as the class label and train a CNN in a supervised fashion with these surrogate classes. In another example [36], the authors explicitly use rotations as the transformation. They postulated that in order for a CNN to accurately predict the degree of rotation that has been applied to an image, it must possess a high-level understanding of the objects present in the image. This pretext task is implemented by rotating images from the ImageNet dataset by 0, 90, 180, and 270 degrees. This technique differs from the common data augmentation practice of rotating images to artificially increase the size of a dataset, because here the images are segmented into classes based on their degree of rotation. A CNN is then trained to predict the class representing that image's degree of rotation. A schematic of this framework is shown in Figure 7. This idea is extended in [37] and [38]. In [37], the authors apply image rotation to the domain of semi-supervised learning, where they train a CNN on both labeled and unlabeled data from ImageNet. In order to optimize the network, they combine the unsupervised image rotation loss for the unsupervised dataset with a standard cross-entropy classification loss for the labeled images. In [38], the authors seek to improve the image rotation pretext task based on two observations: the fact that the features learned are discriminative with respect to rotation transformations and are therefore less applicable for tasks that are rotation invariant, and the fact that not all training examples have their scenes and objects obfuscated through rotation.
The latter problem is handled by adding a weight corresponding to each training instance that mitigates the influence of noisy examples. The former problem is handled by modifying the architecture used in [36]. After the rotated versions of an image are fed to a CNN, the learned feature representation f is split in half. One half, f1, contains rotation-relevant features and is used to predict image rotations. The other half, f2, contains rotation-irrelevant features. In order to learn f2, two additional terms are added to the loss function. The first term is used to enforce similarity between copies of the same images that have been rotated multiple times. The second term is used to ensure spatial dissimilarity between the learned feature representations for each instance. Image rotation and relative patch prediction are both used as auxiliary losses in [39] to increase the effectiveness of the authors' few-shot learning algorithm. Figure 7. An illustration of the algorithm used in [36]. An input image is rotated by 0, 90, 180, and 270 degrees. A CNN is then tasked with predicting the degree of rotation for an input image.
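The supervisory signal used in [36] is simple to construct: each rotated copy of an image carries its rotation index as a class label. A minimal numpy sketch (note that np.rot90 rotates counter-clockwise; the 2x2 toy image is an assumption for illustration):

```python
import numpy as np

def rotation_batch(image):
    """Create the four rotated copies of an image and their class labels
    (k = 0, 1, 2, 3 for counter-clockwise rotations of 0, 90, 180, 270 degrees)."""
    return [(np.rot90(image, k), k) for k in range(4)]

image = np.array([[1., 2.],
                  [3., 4.]])
batch = rotation_batch(image)   # four (rotated image, label) training pairs
```

A classifier trained on such pairs receives one image at a time and must output k, which, per the argument in [36], requires recognizing the canonical orientation of the objects in the scene.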
In addition to raw image data, instances of the same image converted to multiple image modalities (such as RGB and optical flow) and videos supply abundant sources of unlabeled data for pixel-to-scalar pretext tasks. In [40], two pairs of images are passed to a network at a time, where each pair contains the same image but different modalities. The pretext task used for pretraining the network is to maximize the distance between embeddings of different images regardless of modality, but minimize the distance between embeddings of the same image represented by different modalities [40]. In [41], the authors design a pretext task to learn representations inspired by the relationship between visual stimuli and egomotion. Pairs of images taken by a moving agent are used to predict the camera transformation between the images. The image pairs are first fed to a siamese CNN, which learns lower-dimensional representations for each image. These representations are then combined into a fully connected layer which is downsampled to predict the camera transformation between the two images. The camera transformation is expressed as a 3D vector whose dimensions represent translations along the Z and X axes and rotation about the Y axis.
The pretext tasks that have been covered up to this point deal with pretraining networks for the common tasks of image classification and object detection. A comparison of the performance for the most frequently cited algorithms covered in this section can be found in Table 1. In addition to these more generally applicable tasks, pixel-to-scalar pretext tasks are versatile in how they can be designed and have been applied to a variety of highly specialized domains, where the downstream task is something specific to that domain. In [42], the authors design a self-supervised pipeline that takes images from multiple views, and outputs 6D poses (three geometrical and three angular positions based on a relative origin point) for objects in a scene. They circumvent the onerous task of manually labeling training data by utilizing object masks to separate foreground from background, which allows them to autonomously obtain pixel-wise object segmentation labels. In [43], the authors use EXIF metadata from pairs of image patches as a supervisory signal for training a classifier to determine whether an image is self-consistent. The network is then applied to the downstream tasks of splice detection and splice localization. In both [44] and [45], the authors use learning to rank as a pretext task. The network pretrained in [44] is then successfully used for the downstream task of crowd counting, while the network in [45] is used for the tasks of crowd counting and image quality analysis. In [46], the authors use an auxiliary pretext task that maximizes the Euclidean distance between different data instances in the feature space in order to train a network for the downstream task of person re-identification. In [47], the authors use SSL to address the distribution shift that occurs when a model is trained on data from one distribution (source), but the goal is to make good predictions on some other distribution (target) that shares the label space with the source.
This is done by jointly training a supervised head on labeled source data and several self-supervised heads on unlabeled data from both domains. The multi-task learning process pushes the features learned by the shared feature representation in the network closer together for both domains.

Self-Supervised Learning - Pixel to Pixel
Autoencoders were one of the first neural network architectures to learn data distributions from unlabeled data [48,49]. An autoencoder consists of two components, an encoder and a decoder. The encoder takes some data as input and compresses it into a smaller feature representation, called an embedding. The decoder then reconstructs the input from the embedding. A basic autoencoder is optimized by minimizing the difference between the original input and the reconstructed output. Once the reconstructed outputs are sufficiently close to the inputs, the encoder is used to extract low-dimensional feature representations of the original inputs for use in downstream tasks. Many pretext tasks follow a similar version of this learning paradigm. We will refer to this category of pretext tasks as pixel-to-pixel.
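As a minimal illustration of this paradigm, the following NumPy sketch shows a toy linear autoencoder with the encode-decode-reconstruct loop and the reconstruction objective. All names and dimensions are illustrative; real autoencoders use learned nonlinear layers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder: 8-dim inputs compressed to a 2-dim embedding.
W_enc = rng.normal(scale=0.1, size=(8, 2))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(2, 8))   # decoder weights

def encode(x):
    return x @ W_enc                          # low-dimensional embedding

def decode(z):
    return z @ W_dec                          # reconstruction of the input

def reconstruction_loss(x):
    """Mean squared error between input and its reconstruction."""
    return np.mean((decode(encode(x)) - x) ** 2)
```

Training would adjust `W_enc` and `W_dec` to minimize `reconstruction_loss`; once it is small, `encode` supplies features for downstream tasks.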
One of the seminal papers in SSL was the work of Pathak et al. [50]. In their paper, the authors designed a pretext task in which part of an image is removed, and a specialized convolutional autoencoder which they call a context encoder is trained to reconstruct the missing piece. The incomplete image is passed through an encoder network and compressed down to a low-dimensional feature representation. That representation is then passed to a decoder which uses it to produce the missing image content. The network is optimized according to the difference between the pixels of the ground truth missing image content and that produced by the decoder. The context encoders are able to attain a higher-level understanding of images than normal autoencoders because the process of reconstructing part of an image (inpainting) requires a much deeper semantic understanding of the scene, while regular and denoising autoencoders typically only learn low-level features. A visualization of this framework is shown in Figure 8: a patch is removed from a target image, and the incomplete image is then passed to an autoencoder, which is tasked with predicting the missing section of the image [50].
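The masking step and region-restricted reconstruction loss of a context encoder can be sketched as follows. This is a simplified NumPy illustration with a fixed square mask and an L2 loss over only the missing region; function names are ours, and the actual model in [50] is a convolutional network:

```python
import numpy as np

def mask_center(image, size):
    """Zero out a square center patch; return the corrupted image and mask."""
    img = image.copy()
    h, w = img.shape
    top, left = (h - size) // 2, (w - size) // 2
    mask = np.zeros_like(img, dtype=bool)
    mask[top:top + size, left:left + size] = True
    img[mask] = 0.0
    return img, mask

def inpainting_loss(pred, target, mask):
    """Reconstruction loss computed on the missing region only."""
    return np.mean((pred[mask] - target[mask]) ** 2)
```

The network receives the corrupted image and is penalized only on the pixels it was asked to hallucinate.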
Many pretext tasks that fall under the category of pixel-to-pixel use some form of colorization. Colorization is the process of "filling in" images that have been converted to grayscale [51,52]. The architecture used in [51] takes in a grayscale image and predicts a color histogram at every pixel. In [52], a CNN is trained to convert a grayscale image to a distribution over quantized color values. In both studies, the networks learn strong feature representations because the architectures must interpret the semantic composition of the scene and also localize objects in order to colorize arbitrary images. The ideas used in these papers are extended in [53], where the authors augment the architectures used in [51] and [52] by separating the colorization network into two disjoint subnetworks, where each one predicts one color channel for the image. Instead of only colorizing a grayscale image, they feed the entire architecture an image, and then one subnetwork receives the grayscale information and uses this to predict the color information, while the other receives the color information and uses this to predict grayscale information. The two representations are then concatenated and used to reconstruct the original image. Illustrations of [52] and [53] are shown in Figure 9: in (a), a grayscale image is passed to an autoencoder, which is tasked with predicting the colorized version of the image [52]; in (b), a grayscale and a colorized version of the same image are passed to two separate autoencoders, which are tasked with predicting the colorized and grayscale versions of their input images [53].
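The cross-channel setup of [53] can be illustrated with a minimal NumPy sketch that splits an RGB image into the two disjoint prediction problems: one subnetwork maps grayscale to color, the other maps color to grayscale. The function name and the standard luma weights are illustrative assumptions:

```python
import numpy as np

def split_brain_targets(rgb):
    """Build the two cross-channel (input, target) pairs.

    One subnetwork sees grayscale and must predict color; the other
    sees color and must predict grayscale. Luma weights are the
    conventional Rec. 601 coefficients, used here for illustration.
    """
    gray = rgb @ np.array([0.299, 0.587, 0.114])   # H x W grayscale
    color = rgb                                    # H x W x 3 color
    return (gray, color), (color, gray)
```

The two subnetworks' representations would then be concatenated to reconstruct the original image, as described above.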
While many of the pixel-to-scalar and pixel-to-pixel pretext tasks discussed to this point share a similar design paradigm, pretext tasks with different goals will inherently learn different features [54]. Doersch et al. suggest that pushing a network to learn multiple pretext tasks at the same time, a process called multi-task learning, allows the network to cover a larger area of the feature space, and therefore allows it to learn more generalizable feature representations. The four pretext tasks used in conjunction with one another are the context prediction task from [31], the exemplar task from [35], the colorization task from [52], and a motion segmentation task by Zou et al. [55], in which the authors extract frames from videos and set up a pretext task where a CNN predicts which pixels will move in subsequent frames. For the multi-task architecture, all pretext tasks share a common low-level architecture based on the ResNet-101 architecture [56]. At higher levels, each pretext task has its own head, with a specific architecture designed for that pretext task. Similar to pixel-to-scalar tasks, videos also provide an abundance of unlabeled data for pixel-to-pixel tasks. In [57], the authors use optical flow to segment groups of pixels into objects. This allows them to autonomously extract segmentation masks from unlabeled video data. A CNN is then fed a static frame and tasked with predicting these segmentation masks. In [58], a network is trained from unlabeled video data to learn facial attributes. The network is given a source frame and a target frame as inputs. It is then optimized to generate the target frame by predicting the flow field between the two frames. In [59], the authors use frames from videos to train a network to predict the pixel values in a target frame given a source frame.
Pixel-to-pixel tasks can also be designed for many different downstream tasks of interest from specific domains. In [60], the authors pretrain a network for the task of optical flow prediction. Sundermeyer et al. [61] use SSL to pretrain a network for the downstream task of 6D object detection using RGB images. In [62], SSL is utilized to learn to detect visual landmarks in different object categories, such as the eyes and nose on a face. Ma et al. in [63] and Goddard et al. [64] design pretext tasks to train networks without any manually annotated data on the downstream tasks of depth completion and depth estimation, respectively. A comparison of the performance for the most frequently cited algorithms covered in this section can be found in Table 2.

Self-Supervised Learning - Adversarial Learning
SSL algorithms have also shown strong results when designed using adversarial learning, as opposed to the purely discriminative approaches covered so far. These algorithms typically use GANs as their foundation. One of the first papers to introduce this technique was the work of Radford et al. [65]. This paper proposes training GANs to learn image representations and later reusing the learned features for supervised tasks. However, scaling up GANs to utilize modern CNN architectures and model natural images causes their training to become unstable in practice. In order to fix this, the authors apply three architectural modifications that have recently been applied to CNNs. The first modification is to replace spatial pooling layers with strided convolutions, creating a network consisting entirely of convolutional layers and allowing the network to learn its own spatial downsampling [66]. The second is to eliminate fully connected layers on top of convolutional features [67]. The third is the technique of batch normalization, which stabilizes learning by normalizing the input to each unit to have zero mean and unit variance [68]. Through visualization, it is shown that the model learns relevant representations, and that the discriminator learns object detectors. It is also shown that the generator learns specific object representations for major scene components.
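The batch normalization transform cited from [68] can be sketched in a few lines of NumPy; the learnable scale and shift parameters are omitted for brevity:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch to zero mean and unit
    variance (the stabilizing transform of [68]); the learnable
    scale/shift parameters are omitted in this sketch."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)
```

Applying this to each layer's inputs keeps activation statistics stable across training iterations, which is what allowed the authors to scale GAN training to deeper convolutional architectures.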
Building on this, several works have modified the GAN architecture and successfully applied it to learn robust feature representations without any labeled data. In Donahue et al. [69], the authors augment the GAN architecture by adding an encoder E that maps data x to a latent encoding E(x). The discriminator is then trained to classify between outputs from the encoder E(x) and inputs to the generator z, and between outputs from the generator G(z) and the ground-truth images x. This pushes the network to learn an additional inverse mapping from data to latent representation. This network is called a Bidirectional GAN (BiGAN). In Chen et al. [70], the authors postulate that because the input vector z to the GAN framework is completely random and unconstrained, the learned representations do not correlate with semantic features of the data. To account for this, the input vector to the generator is split into two parts: z', which is still used as noise, and c, which is designed to learn the structural semantic features of the data distribution. An additional term is then added to the GAN loss function that maximizes the mutual information (MI) shared between c and x. The addition of this constraint yields results showing empirically that components of c are highly correlated with high-level semantic features of x. The architecture of a BiGAN is shown in Figure 10 [69]: the top section follows a typical GAN architecture, while in the bottom section this process is reversed. A fake image generated by the generator is passed to an encoder, which creates an encoding. The discriminator is then tasked with identifying which vector, either random noise or encoding, comes from the original distribution.
In Tian et al. [71], the authors also add an encoder to the GAN architecture. They hypothesize that adding viewpoint as a label will force the GAN framework to learn more complete image representations. This is done through two pathways, a generation path and a reconstruction path. In the generation path, the generator G is given random noise z and a view label v as input, taken from a ground truth image x. The output of G, x', and x are fed to a discriminator D, which outputs two values: Dv, the probability of x' being a specific view, and Ds, the image quality. Then, in the reconstruction path, a pair of images xi and xj are used, where both images have different viewpoints but share the same identity. The image xi is fed to the encoder E, which produces representations z and v, corresponding to xi's feature representation and view representation, respectively. G takes z and the ground truth view v as input, reconstructs the image, and feeds it to D along with xj. D then again outputs the probability Dv and the image quality score Ds. The authors incorporate SSL by first pretraining E using labeled images and then using its representation of v to estimate viewpoints for unlabeled images. In Jenni et al. [72], the authors propose a pretext task to learn features by classifying images as real or containing artifacts. In order to generate artifacts, the generator is changed to an autoencoder that reproduces images and drops entries from the encoded features, and a repair network is added to the decoder to help it render a realistic image. A discriminator is then trained to distinguish real from corrupt images.
In Chen et al. [73], the authors use SSL to address the challenge of GANs forgetting previously learned tasks due to the fact that they learn in a non-stationary environment [74][75][76]. This challenge is addressed by adding an auxiliary, self-supervised loss to the discriminator to predict image rotations [36]. In this framework, the generator and discriminator follow a traditional GAN framework for the task of predicting real versus fake images, however, they are designed to collaborate with one another when tasked with predicting image rotations. The addition of the image rotation task yields substantially better results than a baseline GAN and matches the performance of a GAN augmented with a supervised task requiring labeled training data. Similar to previous sections, self-supervised pretext tasks designed using an adversarial framework have a variety of applications in specific domains. In Wu et al. [77], the authors use a 3D-GAN to generate 3D objects from a probabilistic space. They utilize the techniques used in [65] to stabilize training and significantly outperform other unsupervised object generation methods. In Lin et al. [78], the authors use SSL to pretrain a specialized GAN architecture for the downstream task of remote image scene classification. This is a difficult task due to the fact that remote sensing images vary from natural images in several ways. Objects in the same category frequently have different sizes, colors, and angles. To tailor a GAN framework to learn better representations for this problem, the authors propose two changes. First, they add a layer in the discriminator to combine information from different levels of representations. The generator is then modified to optimize two separate tasks: to make the reconstructed images similar to the samples drawn from the training set, and to match the expected values of the features in the custom layer added to the discriminator. In Ren et al. 
[79], the authors devise a GAN framework that learns features from unlabeled synthetic images that are robust enough to be used on real images. First, they train a network that takes an image as input and predicts its depth, surface normal vector, and instance contour maps. These three quantities can be extracted from a synthetic image autonomously. In this setup, the generator and discriminator share the weights of an encoder. The discriminator then compares the features extracted from this encoder for real and synthetic images. This framework is visualized in Figure 11: an encoder is given multiple tasks; it must create an encoding of a synthetic image that can be used to predict the edge, depth, and surface normal properties of that image, and this encoding must also be able to fool a discriminator that is given encodings of a synthetic and a real image [79]. In Singh et al. [80], a pretext task using adversarial learning is designed to pretrain a network for the task of semantic segmentation of overhead imagery obtained from satellites. This is a difficult task due to the domain gap between overhead images and ground (natural) images. The authors adopt a modified version of the inpainting task from [50] where they inpaint difficult and semantically meaningful regions. A comparison of the performance of the most frequently cited algorithms covered in this section can be found in Table 3.

Self-Supervised Learning - Contrastive Learning
Most recently, SSL algorithms have begun to shift from pixel-to-scalar and pixel-to-pixel based tasks toward frameworks built around contrastive learning [20,29,30]; a schematic representation of the most relevant of these frameworks is depicted in Figure 12. SSL algorithms that utilize different forms of contrastive learning have achieved state-of-the-art results, and in some cases have matched or surpassed CNNs pretrained on ImageNet using supervised learning, a substantial milestone in unsupervised learning.
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 11 August 2021 doi:10.20944/preprints202108.0238.v1
Figure 12 shows: (a) an illustration of the algorithm used in [20], in which an image is divided into overlapping subsections that are encoded with their relative spatial locations preserved; an autoregressive network then takes the top section of a column, encodes it into a context vector, and uses this to make predictions about the following rows in the corresponding column; (b) an illustration of the algorithm used in [29], in which an image is transformed in two different ways and then encoded, and these encodings are passed to a contrastive loss function along with several negative encodings; and (c) an illustration of the algorithm used in [30], in which a random group of image samples is encoded by a dictionary encoder, a target image is encoded by a separate encoder, and all samples are passed to a contrastive loss.
Earlier works applying contrastive learning to SSL algorithms were built around the framework of siamese CNNs and focus on learning feature representations that maximize or minimize a given distance metric. One of the earliest works to apply contrastive learning to SSL is that of Wang et al. [81]. In their work, the authors design a pretext task that compares patches from video frames.
These patches are defined as similar when they are the first and last frame in which an object appears in the video (i.e., "query patch" and "tracked patch", respectively). A siamese triplet network is trained using a ranking loss function to learn a feature space such that the query patch is defined as closer to the tracked patch relative to other randomly sampled patches. A model ensemble is then created by pretraining CNNs using different sets of data. In Zeng et al., the authors extend self-supervised distance learning to 3D, utilizing a 3D CNN to learn a mapping from a volumetric 3D patch to a low-dimensional feature representation that serves as the descriptor for that local region [82]. During training, pairs of learned mappings are then fed to a siamese CNN which minimizes a contrastive loss representing the distance between learned embeddings. The embeddings are then successfully utilized for several practical applications, including scene reconstruction and 6D object pose estimation. In Wu et al. [83], the authors suggest that by learning to discriminate between individual instances, a network can learn a representation that captures similarity among these individual instances. The features for each instance are low-dimensional vector representations that are learned by a CNN. Each instance is stored in a discrete memory bank and assigned an index that acts as its class label. As training progresses, the feature representations stored in the memory bank for every instance are dynamically updated. The amount of similarity between instances is calculated directly from the features using noise contrastive estimation (NCE).
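The memory-bank mechanics of [83] can be illustrated with a minimal NumPy sketch: each stored instance representation is updated as a momentum average of its newly computed feature, and similarity to the bank is scored with a temperature-scaled softmax. The function names, the momentum value, and the simplified softmax scoring (in place of the full NCE estimator) are our assumptions:

```python
import numpy as np

def update_memory_bank(bank, idx, feature, momentum=0.5):
    """Momentum update of the stored representation for instance `idx`;
    the updated entry is re-normalized to unit length."""
    v = momentum * bank[idx] + (1.0 - momentum) * feature
    bank[idx] = v / np.linalg.norm(v)
    return bank

def instance_similarity(bank, query, temperature=0.07):
    """Softmax similarity of a query feature to every stored instance."""
    logits = bank @ query / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

Treating each instance's index as its own class label, the softmax over the bank acts as a non-parametric classifier over all training instances.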
Zhuang et al. [84] take a similar approach to [83]. In their paper, the authors also design a pretext task with the goal of learning embeddings of images where similar images are clustered closer together while dissimilar images are separated. The loss function is designed to push the current embedding vector closer to its close neighbors and further from its background neighbors. For the task of ImageNet classification, the algorithm used in this paper, called Local Aggregation (LA), achieves an important milestone: through only self-supervised training, it surpasses the performance of the original AlexNet architecture pretrained using supervised learning. In Sharma et al. [85], the authors demonstrate the ability of self-supervised contrastive learning to be applied to specific domains. Here, they apply SSL to the task of learning face representations for face clustering. They first automatically generate training data without the use of manual labeling by comparing frames and sorting them into positive (similar) and negative (dissimilar) pairs based on their Euclidean distance. The training pairs are then fed to a siamese CNN, which is again trained using a contrastive loss.
Self-supervised methods that utilize contrastive losses have also been developed around the idea of maximizing mutual information (MI) between the inputs and outputs of the network; here the MI definition is adapted from information theory, where MI is used to denote the dependence between two random variables. In Hjelm et al. [86], the authors train an encoder network to maximize the MI between its inputs and outputs. They define their framework, called Deep InfoMax (DIM), by combining three objectives: maximizing local information, maximizing global information, and utilizing an adversarial loss to force the learned representations to have desired characteristics specific to a prior distribution. The contrastive loss used in [19] is also integrated into the authors' framework, which achieved state-of-the-art results at the time of its publication. In Bachman et al. [87], the authors extend the DIM framework by augmenting the views of each input; this forces the network to extract high-level features present in all views in order to maximize the MI between them, increasing the robustness of the learned features.
Three of the most promising SSL algorithms to come out in the last two years all involve contrastive learning. These include Contrastive Predictive Coding (CPC), SimCLR, and Momentum Contrast (MoCo) [19,29,30]. In CPC, the main intuition is that if data from any domain is modeled as a sequence, then as the model predicts further into the future, shared information decreases, and the model needs to utilize higher-level structures to make farther predictions. The architecture of CPC can be summarized as follows. First, an encoder maps the input sequence of observations to a sequence of embeddings. Then, an autoregressive model compresses all embeddings less than or equal to the current timestep t in the sequence into a latent representation ct, referred to as the context. After this, a function f is used to model a density ratio that preserves the MI between ct and future embeddings. A loss function called InfoNCE is used to optimize f, where the loss corresponds to the cross-entropy of identifying one positive sample from the true distribution p(xt+k|ct) among multiple negative samples from a decoy distribution p(xt+k). CPC has proven to be a very versatile framework, achieving promising results in speech, images, text, and reinforcement learning. Here, we will focus on CPC's applications in computer vision. In Tian et al. [88], the authors modify the CPC framework in order to learn representations that capture MI shared between multiple views of data. In Hénaff et al. [20], the authors implement the CPC architecture specifically for images, dividing each image into smaller patches and then using a neural network to embed them.
This architecture is then improved using four different techniques: the model's depth and width are increased, layer normalization is used during the training process, the complexity of the task is increased by pushing the model to make predictions in four different directions, and more extensive data augmentation is applied during preprocessing. When trained on the transfer learning task of object detection for the PASCAL VOC 2007 dataset, the improved CPC framework's learned features outperform features yielded by a network trained in a supervised manner with all ImageNet labels, which represented another landmark event in self-supervised learning [20]. In Trinh et al. [89], the authors take inspiration from both [19,90] to design their framework. Given an occluded patch from an image, their network is tasked with selecting the correct patch among negative samples obtained from the same image.
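The InfoNCE objective at the core of CPC can be sketched in NumPy as the cross-entropy of identifying the positive sample among negatives; this sketch uses a simplified temperature-scaled dot-product score, and the function name is ours:

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.1):
    """InfoNCE: cross-entropy of picking the positive key (index 0)
    among negatives, using temperature-scaled dot-product scores."""
    keys = np.vstack([positive, negatives])     # positive is row 0
    logits = keys @ query / temperature
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())  # stable log-sum-exp
    return log_z - logits[0]                      # -log p(positive)
```

The loss is small when the query is much closer to the positive than to any negative, and grows as negatives become more similar to the query.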
The two most recent self-supervised techniques that currently represent the state-of-the-art are SimCLR and MoCo. The SimCLR framework consists of four components. A stochastic data augmentation module first applies a random transformation to training instances, resulting in two correlated views of the same example, which are considered a positive pair. An encoder then compresses the training examples into embeddings. Next, a neural network projection head performs a second mapping of these embeddings, and the resulting latent representations are passed to a contrastive loss function. Along with other recently published contrastive self-supervised methods, this work further reduces the gap between self-supervised and supervised learning. For transfer learning across 12 natural image datasets, SimCLR outperforms a network pretrained through supervised learning on 5 datasets, while the supervised network outperforms SimCLR on 2.
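The stochastic augmentation module can be sketched as follows. This toy NumPy version uses only random crops and horizontal flips on a single-channel image, a simplification of SimCLR's full augmentation pipeline (which also includes color distortion and blur); the function name is illustrative:

```python
import numpy as np

def two_views(image, crop=6, rng=None):
    """Produce two randomly augmented views of the same image; the pair
    is treated as a positive pair by the contrastive loss."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = image.shape
    views = []
    for _ in range(2):
        i = rng.integers(0, h - crop + 1)      # random crop position
        j = rng.integers(0, w - crop + 1)
        v = image[i:i + crop, j:j + crop]
        if rng.random() < 0.5:
            v = v[:, ::-1]                     # random horizontal flip
        views.append(v)
    return views
```

Each call yields a positive pair; views of different images serve as negatives for the contrastive loss.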
The MoCo framework consists of two encoders. One encodes a new instance into a low-dimensional representation q, and the other encodes a set of samples {k0, k1, k2, …} that are the keys of a dictionary. The dictionary is dynamic because both encoders are updated during training. A contrastive loss function is utilized to push q closer to its positive key in the latent space and farther from all other dissimilar keys. A query and a key are considered similar if they both come from the same image, and vice versa. Representing another milestone achieved by SSL, MoCo outperforms a network pretrained through supervised learning on 7 detection and segmentation tasks. In Chen et al. [91], the authors take inspiration from the SimCLR framework and improve the MoCo framework by replacing the fully connected layer following the encoders with an MLP head and adding more data augmentation. Both of these additions increase the performance of MoCo on ImageNet. A comparison of the performance of the most frequently cited algorithms covered in this section can be found in Table 4.
Table 4. Results of using learned feature representations from contrastive learning pretext tasks for the downstream task of ImageNet classification using a ResNet-50 architecture. For comparisons of results for downstream tasks with networks trained using supervised training, please refer to [20,29,30,91].
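Two mechanisms distinguish MoCo from the preceding methods: the key encoder is updated as a slow moving average of the query encoder, and the dictionary is maintained as a fixed-size FIFO queue. Both can be sketched in NumPy, with flat parameter vectors standing in for full network weights (function names and the momentum value are illustrative):

```python
import numpy as np

def momentum_update(theta_q, theta_k, m=0.999):
    """Key-encoder parameters track the query encoder as a slow
    exponential moving average, keeping the dictionary consistent."""
    return m * theta_k + (1.0 - m) * theta_q

def enqueue(queue, keys, max_size):
    """Append the newest key embeddings and drop the oldest, so the
    dictionary stays a fixed-size FIFO queue."""
    queue = np.vstack([queue, keys])
    return queue[-max_size:]
```

The slow momentum update is what keeps keys encoded at different training steps comparable, which in turn lets the queue hold far more negatives than a single batch could.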

The previously discussed results are focused on natural images, i.e., images of everyday objects or places such as those contained in ImageNet. However, medical images, such as those arising from medical equipment or digital pathology workflows, are often extremely different from natural images, both in the semantic structure of the contained information and in technical representation (e.g., file size, file format). Therefore, applying SSL to the domain of medicine requires specific research and performance evaluation. Here, we review the current state of SSL in medicine.
There are many examples of SSL being successfully applied to different domains of medicine: radiology domain experts created the first applications, possibly due to the more advanced digitalization status of the imaging field, followed by other clinical specialties. In one of the earliest examples, Jamaludin et al. [92] use longitudinal information from MRI scans to train a siamese CNN to learn embeddings in which pairs of images from the same patient at different points in time are pushed closer together in the latent space, while pairs of images from different patients are pushed further apart. A second pretext task used is predicting vertebral body levels, and the loss functions from these two pretext tasks are combined. Around the time of this paper's publication, SSL was also successfully applied to human brain scans. In Alex et al. [93], self-supervised and semi-supervised learning are combined to pretrain a network for the downstream task of segmentation of gliomas, a type of brain tumor, from MRIs. Stacked denoising autoencoders were pretrained layer by layer using unlabeled data consisting of lesion and non-lesion image patches and a reconstruction loss. After pretraining, labeled patches from a subset of patients were used to finetune the network. In Spitzer et al. [94], the authors design an auxiliary task for classifying cortical brain areas [95]. Pairs of patches are sampled from images of the same brain, and the pretext tasks are to approximate the geodesic distance between these patches as well as to predict the 3D coordinates of each patch.
Shortly after the publication of these papers, further research was conducted applying SSL to other data modalities, such as endoscopic video data [96,97]. In Liu et al. [96], the authors design a self-supervised approach to train deep CNNs for the task of dense depth map estimation using monocular endoscopic video data [98]. In Ross et al. [97], the authors design a pretext task where they re-colorize unlabeled endoscopic video frames with a specialized GAN framework. Colors are converted to the Lab color space, and a U-Net [99] model that predicts the corresponding a and b channels from the lightness (L) channel is used as the generator, with a ResNet18 model as the discriminator. This method reached comparable performance using only ¼ of the original dataset, and also performed better than other pretraining methods that use non-medical data or other medical data.
SSL has also seen successful applications in cardiac MR imaging. Qin et al. [100] address the downstream task of cardiac image segmentation by taking advantage of the fact that the tasks of cardiac MR image segmentation and motion estimation are closely related [101,102]. To leverage the related nature of these tasks, the authors design a network consisting of two branches: an unsupervised branch for the task of motion estimation and a segmentation branch. Both branches share a feature encoder. The cardiac motion estimation branch is tasked with finding a sequence of consecutive optical flow representations between a given target frame and a series of source frames. The representations learned from this task are then used for segmentation. Bai et al. [103] use standard cardiac MR scans to derive an auxiliary training signal. This leads to the pretext task of using anatomical positions defined by cardiac chamber view planes to derive feature representations of the images. This pretext task achieves a high segmentation accuracy on the downstream tasks of short-axis and long-axis image segmentation, surpassing or matching the performance of a network trained from scratch using supervised learning.
Recently, more general studies have been performed assessing the robustness and reliability of SSL tasks when applied to multiple medical domains. In Zhou et al. [104], the authors develop a generalized SSL framework for dealing with different types of medical images, which they name Models Genesis. In Models Genesis, an autoencoder is trained using multiple SSL pretext tasks, which include non-linear transformation, local pixel shuffling, outpainting, and inpainting. In Tajbakhsh et al. [28], a large-scale study is performed to evaluate the effects of pretraining using different self-supervised tasks in different medical domains. Four medical applications are considered across various specialties: false positive reduction for nodule detection in chest CT images; lung lobe segmentation in chest CTs; severity classification of diabetic retinopathy in fundus images; and skin segmentation in color tele-medicine images. The pretext tasks used include image rotation, patch reconstruction using a Wasserstein GAN [105], and colorization using a conditional GAN [106]. Pretrained models were more successful in all tasks except for skin segmentation, where transfer learning from ImageNet performed better. The authors postulate that this is most likely because skin images are closer to natural images than other medical images.
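The local pixel shuffling transform used in Models Genesis can be sketched as follows. This NumPy toy version permutes pixels only within small windows, corrupting local texture while leaving global anatomical structure intact; the restoration network then learns to undo the corruption. Window size and function name are illustrative:

```python
import numpy as np

def local_pixel_shuffle(image, window=4, rng=None):
    """Shuffle pixels independently within each non-overlapping window,
    so local texture is destroyed but global structure is preserved."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = image.copy()
    h, w = image.shape
    for i in range(0, h - window + 1, window):
        for j in range(0, w - window + 1, window):
            flat = out[i:i + window, j:j + window].ravel()
            rng.shuffle(flat)
            out[i:i + window, j:j + window] = flat.reshape(window, window)
    return out
```

A network trained to map the shuffled image back to the original must learn local texture statistics, which is the intent of this pretext task.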

Selected Applications in Medicine
Digital pathology is one field in particular where SSL has the potential to improve upon the results of transfer learning techniques for the computational diagnoses of medical images [104,107]. In routine pathology workflows, biopsies are mounted on glass slides and manually examined at the microscopic level by pathologists to assess disease characteristics such as cancer progression, genetic profiles, and cellular morphology [108]. With the development and adoption of high-resolution slide scanning technology, many parts of the pathology workflow are increasingly transitioning towards digital. As new cases are digitized along with entire archives of glass slides, large datasets of pathology images are becoming increasingly available. However, these images are many orders of magnitude larger than other types of images, typically containing over a billion pixels [109] and often have only slide-level labels (e.g. diagnosis). Obtaining pixel-level labels from expert annotations is prohibitively costly and also error-prone [27]. Therefore, self-supervised learning represents an especially promising approach to enable models to be trained on unlabeled images in pathology.
Microscopy has similarly benefited from SSL. In Lu et al. [110], the authors train a CNN to automatically learn representations for single cells in microscopy images. The features learned by the CNN through this pretext task improve upon other unsupervised methods on the downstream task of discriminating protein subcellular localization classes at the single-cell level. In Zheng et al. [111], the authors address the task of segmenting white blood cells (WBCs) in blood smear images. This task is difficult for three reasons: different staining techniques and illumination conditions create significant variability across blood smear images; different types of WBCs cause variation within the same blood smear image; and the boundaries between neighboring cells are blurred due to the irregular shapes of WBCs. As the pretext task, K-means clustering is used to separate the background and foreground regions of the blood smear images; cell regions are then segmented using shape-based concavity analysis.
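The K-means foreground/background separation step can be sketched as follows. This is a simplified illustration of the idea, not the pipeline of Zheng et al. [111]; the assumption that the darker (stained) cluster corresponds to cells is a heuristic introduced here for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def separate_foreground(image_rgb, n_clusters=2, seed=0):
    """Cluster pixel colours with K-means and return a boolean foreground
    mask, labelling the darkest cluster as cells (an assumed heuristic)."""
    h, w, _ = image_rgb.shape
    pixels = image_rgb.reshape(-1, 3).astype(float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(pixels)
    # The cluster whose centre has the lowest mean intensity is taken
    # to be the stained (foreground) region.
    fg_label = np.argmin(km.cluster_centers_.mean(axis=1))
    return (km.labels_ == fg_label).reshape(h, w)
```

The resulting mask would then feed into a shape-based concavity analysis to split touching cells.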
Over the past two years, as general SSL techniques have continued to raise the upper bound of unsupervised learning, more papers applying these techniques to the complex downstream task of analyzing pathology images have been published. In Yamamoto et al. [112], the authors design a pretext task that utilizes both the nucleus structure of cells analyzed in high-magnification images and the structural pattern of cells analyzed in low-magnification images. Low-magnification pathology images were first divided into patches, embedded by an encoder network to form a latent representation, and then clustered using k-means. Impact scores for each image were then calculated using these clusters. A similar analysis was done for high-magnification images, and images with mismatched impact scores were removed. The features learned from the adjusted clustering were then used for subsequent predictions. When analyzed by an expert pathologist, the features learned by the pretrained networks were found to correlate with the Gleason score, the grading system used by pathologists to assess the progression of prostate cancer. These features were also notable for being interpretable by pathologists. In Tellez et al. [113], the authors create a technique called Neural Image Compression (NIC), which compresses large histopathology images to a higher-level latent space using unsupervised learning. NIC first divides gigapixel pathology images into smaller patches. An encoder then embeds each patch into a low-dimensional embedding vector, and the embeddings are concatenated so that their original spatial arrangement is kept intact. To learn the patch encodings, three different unsupervised image representation learning methods were used: a variational autoencoder, contrastive training, and BiGAN [69]. In Gildenblat et al. 
[109], the authors use contrastive learning and a siamese CNN to pretrain a network to learn feature representations for the downstream task of image retrieval. The pretext task in this paper is based on the assumption that, in pathology images, patches that are close to one another are more likely to represent similar tissue morphology. To implement this, patches were extracted from each image and the network was trained to pull the embeddings of spatially adjacent patches together while pushing the embeddings of distant patches apart.
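A margin-based contrastive loss of this kind can be sketched in a few lines. This is a generic siamese contrastive objective (in the style of Hadsell et al.) keyed on spatial adjacency, offered as an illustration rather than the exact loss used in [109]; the margin value is an assumption.

```python
import numpy as np

def contrastive_loss(z1, z2, same_region, margin=1.0):
    """Pull embeddings of spatially adjacent patches together and push
    embeddings of distant patches at least `margin` apart.
    `same_region` is 1 for adjacent patch pairs, 0 otherwise."""
    d = np.linalg.norm(z1 - z2)
    if same_region:
        return 0.5 * d ** 2          # adjacent: penalise any distance
    return 0.5 * max(0.0, margin - d) ** 2  # distant: penalise closeness
```

Averaged over many sampled patch pairs, this objective yields embeddings in which nearby tissue regions cluster together, which is what makes them useful for retrieval.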
In Hu et al. [114], the authors take inspiration from the adversarial frameworks used in [70,115] to design an adversarial self-supervised framework that learns cell-level visual representations and is able to separate different types of cells based on their semantic features. In Rawat et al. [107], the authors design a pretext task inspired by the idea that molecular differences between tumors can be identified through differences in morphologic phenotypes. To implement this pretext task, tissue microarray (TMA) datasets containing 207 tissue cores each were used. For pretraining, patches were extracted from each tissue core and assigned an index between 1 and 207.
The loss function consisted of a cross-entropy loss used to predict the identity of each patch, plus an additional term designed to minimize the distance between patches from images stained at different locations, which introduced small variations in their appearance. Empirical results showed that this approach outperformed patch-based classification methods as well as networks pretrained through transfer learning. In Lu et al. [116], the authors apply SSL techniques to the classification of breast cancer histology slides, using CPC (as defined in Section 3.5 above) as the pretext task. Results show that using CPC as a pretext task leads to better feature representations than pretraining networks on ImageNet. A comparison of several of the pretext tasks covered in this section against their supervised learning counterparts is shown in Table 5.

Table 5. A comparison of results yielded by training algorithms using different pretext tasks designed for pathology images versus their supervised counterparts. For Hu et al. [114], the results are averaged over 4 different runs. In all cases, self-supervised methods outperformed supervised methods.

Paper | Task | Metric | Supervised Training | Self-Supervised
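Returning to the two-part objective described above for Rawat et al. [107], a minimal sketch might look as follows. The weighting `alpha` and the squared-distance form of the staining-invariance term are hypothetical choices for illustration; the paper's exact formulation may differ, and `core_index` is zero-based here.

```python
import numpy as np

def combined_loss(logits, core_index, z_a, z_b, alpha=0.1):
    """Cross-entropy over tissue-core identity (which of the 207 cores a
    patch came from) plus a penalty on the embedding distance between the
    same patch under different staining conditions."""
    # Numerically stable softmax cross-entropy for patch-identity prediction.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    ce = -log_probs[core_index]
    # Staining-invariance term: same patch, different stain location.
    stain = np.sum((z_a - z_b) ** 2)
    return ce + alpha * stain
```

Minimizing the second term encourages the network to ignore staining variation while the first term forces it to encode the morphologic phenotype that identifies each core.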

Discussion
This review has provided a broad and comprehensive overview of the current state of self-supervised learning research. While still very much in its infancy, SSL has continued to yield stronger and more accurate results over the last half decade. Analytical medicine and self-supervised learning are a natural pairing, as the strengths of self-supervised learning address many of the weaknesses that currently exist in machine learning in medicine. Specifically, SSL addresses the fact that while modern machine learning algorithms typically require large, labelled datasets to leverage their full learning capabilities, few such publicly available datasets exist in medicine, and hiring medical practitioners to perform manual labeling is expensive.
These problems can be circumvented through the use of pretext tasks that leverage implicit supervisory signals in unlabeled datasets to provide learning close to, or equal to, that obtained with manual labeling. The trade-off is that while SSL requires less labeled data, it generally requires more GPU compute time. In fields where labeling requires highly trained specialists whose time is very expensive, SSL can be an extremely cost-effective option. Leveraging the complex, implicit signals in medical datasets also has the potential to allow both medical practitioners and data scientists to approach current medical problems from a different angle. Potential directions for future research include exploring pretext tasks that allow intelligent algorithms to generalize from one domain to another, as well as designing pretext primitives that serve as building blocks for more complex tasks that these algorithms could not learn from the data currently available to them.

Conclusions
SSL is a growing field that has seen rapid improvement and evolution over the past decade. In this review, we have covered the four major areas of SSL: pixel-to-scalar pretext tasks, pixel-to-pixel pretext tasks, adversarial learning, and contrastive learning. A wealth of different pretext tasks now exists for researchers to compare their own work against, and SSL is at a point in its lifecycle where it is spreading to specific and challenging domains such as digital pathology, as well as beginning to challenge supervised learning as the dominant training paradigm. Several benchmarks for SSL have already been published [117,118], but with the vast number of new pretext tasks being designed, there is still a need for a more powerful benchmarking tool in order to fully take advantage of state-of-the-art training algorithms such as MoCo and SimCLR.
The focus of our work is to provide a comprehensive overview of SSL, with a specific emphasis on applications in medicine and digital pathology. It is our hope that the material covered in this review will allow researchers to leverage the powerful potential of SSL and integrate it into their own machine learning pipelines, especially in areas where data are abundant and labels are scarce. We anticipate that SSL will help unlock the value of unlabeled data across different specialties in medicine and healthcare, and moving forward, it represents an appealing option for researchers seeking to make use of large, unlabeled datasets.

Author Contributions: Conceptualization, A.C. and R.U.; methodology, A.C. and R.U.; formal review and paper analysis, A.C.; writing-original draft preparation, A.C.; writing-review and editing, A.C., R.U., J.W., and J.R. All authors have read and agreed to the published version of the manuscript.