A Survey on Contrastive Self-supervised Learning

Self-supervised learning has gained popularity because of its ability to avoid the cost of annotating large-scale datasets. It is capable of adopting self-defined pseudo labels as supervision and using the learned representations for several downstream tasks. Specifically, contrastive learning has recently become a dominant component in self-supervised learning methods for computer vision, natural language processing (NLP), and other domains. It aims at embedding augmented versions of the same sample close to each other while trying to push away embeddings from different samples. This paper provides an extensive review of self-supervised methods that follow the contrastive approach. The work explains commonly used pretext tasks in a contrastive learning setup, followed by different architectures that have been proposed so far. Next, we present a performance comparison of different methods for multiple downstream tasks such as image classification, object detection, and action recognition. Finally, we conclude with the limitations of the current methods and the need for further techniques and future directions to make substantial progress.


Introduction
The advancements in deep learning have elevated it to become one of the core components in most intelligent systems in existence. The ability to learn rich patterns from the abundance of data available today has made deep neural networks (DNNs) a compelling approach in the majority of computer vision (CV) tasks such as image classification, object detection, image segmentation, activity recognition as well as natural language processing (NLP) tasks such as sentence classification, language models, machine translation, etc. However, the supervised approach to learning features from labeled data has almost reached its saturation due to intense labor required in manually annotating millions of data samples. This is because most of the modern computer vision systems (that are supervised) try to learn some form of image representations by finding a pattern between the data points and their respective annotations in large datasets. Works such as GRAD-CAM [1] have proposed techniques that provide visual explanations for decisions made by a model to make them more transparent and explainable.

arXiv:2011.00362v1 [cs.CV] 31 Oct 2020
Traditional supervised learning approaches rely heavily on the amount of annotated training data available. Even though there is a plethora of data available, the lack of annotations has pushed researchers to find alternative approaches that can leverage it. This is where self-supervised methods play a vital role in fueling the progress of deep learning without the need for expensive annotations, learning feature representations where the data itself provides supervision. Supervised learning not only depends on expensive annotations but also suffers from issues such as generalization error, spurious correlations, and adversarial attacks [2]. Recently, self-supervised learning methods have integrated both generative and contrastive approaches that have been able to utilize unlabeled data to learn the underlying representations. A popular approach has been to propose various pretext tasks that help in learning features using pseudo-labels. Tasks such as image inpainting, colorizing greyscale images, jigsaw puzzles, super-resolution, video frame prediction, audio-visual correspondence, etc. have proven to be effective for learning good representations. Generative models gained their popularity after the introduction of Generative Adversarial Networks (GANs) [3] in 2014. The work later became the foundation for many successful architectures such as CycleGAN [4], StyleGAN [5], PixelRNN [6], Text2Image [7], DiscoGAN [8], etc. These methods inspired more researchers to switch to training deep learning models with unlabeled data in a self-supervised setup. Despite their success, researchers started realizing some of the complications in GAN-based approaches. They are harder to train for two main reasons: (a) non-convergence: the model parameters oscillate a lot and rarely converge, and (b) the discriminator becomes so successful that the generator network fails to create realistic fakes, due to which the learning cannot continue.
Also, proper synchronization is required between the generator and the discriminator to prevent the discriminator from converging while the generator diverges. Unlike generative models, contrastive learning (CL) is a discriminative approach that aims at grouping similar samples closer and diverse samples far from each other as shown in figure 1. To achieve this, a similarity metric is used to measure how close two embeddings are. In particular, for computer vision tasks, a contrastive loss is evaluated based on the feature representations of the images extracted from an encoder network. For instance, one sample from the training dataset is taken and a transformed version of the sample is retrieved by applying appropriate data augmentation techniques. During training (see figure 2), the augmented version of the original sample is considered as a positive sample, and the rest of the samples in the batch or dataset (depending on the method being used) are considered negative samples. Next, the model is trained in a way that it learns to differentiate positive samples from the negative ones. The differentiation is achieved with the help of some pretext task (explained in section 2). In doing so, the model learns quality representations of the samples, which are used later for transferring knowledge to downstream tasks. This idea is supported by an interesting experiment conducted by Epstein [9] in 2016, where he asked his students to draw a dollar bill with and without looking at the bill. The results from the experiment show that the brain does not require complete information of a visual piece to differentiate one object from the other. Instead, only a rough representation of an image is enough to do so.
Most of the earlier works in this area combined some form of instance-level classification approach [10] [11] [12] with contrastive learning and were successful to some extent. However, recent methods such as SwAV [13], MoCo [14], and SimCLR [15] with modified approaches have produced results comparable to the state-of-the-art supervised method on ImageNet [16] dataset as shown in figure 3. Similarly, PIRL [17], Selfie [18], and [19] are some papers that reflect the effectiveness of the pretext tasks being used and how they boost the performance of their models.

Pretext Tasks
Pretext tasks are self-supervised tasks that act as an important strategy to learn representations of the data using pseudo labels. These pseudo labels are generated automatically based on the attributes found in the data. The model learned from the pretext task can be used for downstream tasks such as classification, segmentation, detection, etc. in computer vision. Furthermore, these tasks can be applied to any kind of data such as image, video, speech, signals, and so on. For a pretext task in contrastive learning, the original image acts as an anchor, its augmented (transformed) version acts as a positive sample, and the rest of the images in the batch or in the training data act as negative samples.
Most of the commonly used pretext tasks are divided into four main categories: color transformation, geometric transformation, context-based tasks, and cross-modal based tasks. These pretext tasks have been used in various scenarios based on the problem intended to be solved. Color transformation involves basic adjustments of color levels in an image such as blurring, color distortions, converting to grayscale, etc. Figure 4 represents an example of color transformation applied on a sample image from the ImageNet dataset [15]. During this pretext task, the network learns to recognize similar images invariant to their colors.
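The color transformations described above can be illustrated with a minimal sketch. It is not taken from any of the surveyed methods: the image format (rows of RGB tuples in [0, 1]), the BT.601 grayscale weights, the jitter range, and all function names are assumptions made for this example.

```python
import random

def to_grayscale(image):
    """Convert an RGB image (rows of (r, g, b) tuples in [0, 1]) to grayscale."""
    # BT.601 luma weights are assumed here; other weightings are possible.
    return [[(0.299 * r + 0.587 * g + 0.114 * b,) * 3 for (r, g, b) in row]
            for row in image]

def jitter_brightness(image, max_delta, rng):
    """Shift every channel by one random offset and clamp the result to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return [[tuple(min(1.0, max(0.0, c + delta)) for c in px) for px in row]
            for row in image]

def color_positive(image, rng):
    """Build a positive sample: brightness jitter, then grayscale half of the time."""
    out = jitter_brightness(image, 0.2, rng)
    if rng.random() < 0.5:
        out = to_grayscale(out)
    return out

rng = random.Random(0)
anchor = [[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
          [(0.0, 0.0, 1.0), (0.5, 0.5, 0.5)]]
positive = color_positive(anchor, rng)  # same content, different colors
```

The anchor and the returned image would form a positive pair, while other images in the batch would serve as negatives.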

Geometric Transformation
A geometric transformation is a spatial transformation where the geometry of the image is modified without altering its actual pixel information. The transformations include scaling, random cropping, flipping (horizontally, vertically), etc. as represented in figure 5, through which global-to-local view prediction is achieved. Here the original image is considered as the global view and the transformed version is considered as the local view. Chen et al. [15] performed such transformations to learn features during the pretext task. Traditionally, solving jigsaw puzzles has been a prominent task in learning features from an image in an unsupervised way. It involves identifying the correct position of the scrambled patches in an image by training an encoder (figure 6). In terms of contrastive learning, the original image is the anchor, and an augmented image formed by scrambling the patches in the original image acts as a positive sample. The rest of the images in the dataset or batch are considered to be negative samples [17].
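As a rough illustration of the geometric and jigsaw transformations discussed above, the following sketch operates on a toy image stored as a list of pixel rows. The function names, the grid size, and the image format are assumptions for this example and do not reproduce the implementation of any cited work.

```python
import random

def hflip(image):
    """Flip an image (list of pixel rows) horizontally."""
    return [list(reversed(row)) for row in image]

def random_crop(image, size, rng):
    """Cut a random size x size window (a 'local view') out of the image."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - size)
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in image[top:top + size]]

def jigsaw(image, grid, rng):
    """Split the image into grid x grid patches and reassemble them in random order."""
    h, w = len(image), len(image[0])
    ph, pw = h // grid, w // grid
    patches = [[row[c * pw:(c + 1) * pw] for row in image[r * ph:(r + 1) * ph]]
               for r in range(grid) for c in range(grid)]
    perm = list(range(grid * grid))
    rng.shuffle(perm)
    scrambled = []
    for r in range(grid):
        for y in range(ph):
            row = []
            for c in range(grid):
                row.extend(patches[perm[r * grid + c]][y])
            scrambled.append(row)
    return scrambled, perm

rng = random.Random(0)
image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
local_view = random_crop(image, 2, rng)
scrambled, perm = jigsaw(image, 2, rng)  # positive sample for the anchor image
```

The permutation is returned alongside the scrambled image so that a position-prediction variant of the task could use it as a pseudo label.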

Frame order based
This approach applies to data that extends through time. An ideal application would be in the case of sensor data or a sequence of image frames (video). A video contains a sequence of semantically related frames. This implies that frames that are nearby with respect to time are closely related and the ones that are far away are less likely to be related. Intuitively, the motivation for such an approach is that solving a pretext task which requires recovering the temporal coherence of a video allows the model to learn useful visual representations. Here, a video with shuffled order in the sequence of its image frames acts as a positive sample while all other videos in the batch or dataset would be negative samples.
Similarly, other possible approaches include randomly sampling two clips of the same length from a longer video or applying spatial augmentation for each video clip. The goal is to use a contrastive loss to train the model such that clips taken from the same video are arranged closer whereas clips from different videos are pushed away in the embedding space. In the work proposed by Qian et al. [20], the framework contrasts the similarity between two positive samples to those of negative samples. The positive pairs are two augmented clips from the same video. As a result, it separates all encoded videos into non-overlapping regions such that an augmentation used in the training perturbs an encoded video only within a small region in the representation space. One of the most common strategies for data that extends through time is to predict future or missing information. This is commonly used for sequential data such as sensory data, audio signals, videos, etc. The goal of a future prediction task is to predict high-level information of a future time step given a series of past ones. In the work proposed by [21,22], high-dimensional data is compressed into a compact lower-dimensional latent embedding space. Powerful autoregressive models are used to summarize the information in the latent space and a context latent representation C_t is produced as represented in figure 7. When predicting future information, the target (future) and context C_t are encoded into a compact distributed vector representation in a way that maximally preserves the mutual information of the original signals.
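The clip-sampling strategy mentioned above can be sketched as follows. This is only an illustration under simple assumptions: a video is represented by its frame indices, clips are contiguous, and the function names are hypothetical.

```python
import random

def sample_clip(num_frames, clip_len, rng):
    """Return the frame indices of one random contiguous clip."""
    start = rng.randint(0, num_frames - clip_len)
    return list(range(start, start + clip_len))

def positive_clip_pair(num_frames, clip_len, rng):
    """Two clips of equal length from the same video form a positive pair."""
    return sample_clip(num_frames, clip_len, rng), sample_clip(num_frames, clip_len, rng)

rng = random.Random(0)
clip_a, clip_b = positive_clip_pair(num_frames=100, clip_len=16, rng=rng)
# Clips sampled from any other video in the batch would act as negatives.
```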

View Prediction (Cross modal-based)
View prediction tasks are preferred for data that has multiple views of the same scene. Following this approach, in [23], the anchor and its positive images taken from simultaneous viewpoints, are encouraged to be close in the embedding space while distant from negative images taken from a different time within the same sequence. The model learns by trying to simultaneously identify similar features between the frames from different angles and also trying to find the difference between frames that occur later in the sequence. Figure 8 represents their approach for view prediction. Similarly, recent work proposes an inter-intra contrastive framework where inter-sampling is learned through multi-view of the same sample, and intra-sampling that learns the temporal relation is performed through multiple approaches such as frame repetition and frame order shuffling that acts as the negative samples [24].

Identifying the right pre-text task
The choice of pretext task relies on the type of problem being solved. Although numerous methods have been proposed in contrastive learning, a separate track of research is still going on to identify the right pre-text task. Work has identified and proved that it is important to determine the right kind of pre-text task for a model to perform well with contrastive learning. The main aim of a pre-text task is to compel the model to be invariant to these transformations while remaining discriminative to other data points. But the bias introduced through such augmentations could be a double-edged sword, as each augmentation encourages invariances to a transformation which can be beneficial in some cases and harmful in others. For instance, applying rotation may help with view-independent aerial image recognition but might significantly downgrade the performance while trying to solve downstream tasks such as detecting which way is up in a photograph for a display application [25]. Similarly, colorization-based pretext tasks might not work out in a fine-grain classification represented in figure 9 [23]. Similarly, in work [26], the authors focus on the importance of using the right pretext task. The authors pointed out that in their scenario, except for rotation, other transformations such as scaling and changing aspect ratio may not be appropriate for the pretext task because they produce easily detectable visual artifacts. They also reveal that rotation does not work well when the image in a target dataset is constructed by color textures as in the DTD dataset [27] as shown in figure 10.
Figure 9: Most of the shapes in these two pairs of images are the same. However, low-level statistics are different (color and texture). Usage of the right pre-text task here is necessary [28].
Figure 10: A sample from the DTD dataset [27]. An example of why a rotation-based pretext task will not work well.

Architectures
Contrastive learning methods rely on the number of negative samples for generating good quality representations. It can be seen as a dictionary-lookup task where the dictionary is sometimes the whole training set and at other times a subset of the dataset. An interesting way to categorize these methods would be based on the technique used to collect negative samples against a positive data-point during training. Based on the approach taken, we categorized the methods into four major architectures as shown in figure 11. Each architecture is explained separately along with examples of successful methods that follow similar principles.
Figure 11 (caption, partially recovered): (b) Using a memory bank to store and retrieve encodings of negative samples (c) Using a momentum encoder which acts as a dynamic dictionary lookup for encodings of negative samples during training (d) Implementing a clustering mechanism by using swapped prediction of the obtained representations from both the encoders using end-to-end architecture

End-to-End Learning
End-to-end learning is a complex learning system that uses gradient-based learning and is designed in such a way that all modules are differentiable [29]. This architecture prefers large batch sizes to accumulate a greater number of negative samples. Except for the original image and its augmented version, the rest of the images in the batch are considered negative. The pipeline employs two encoders: a Query encoder (Q) and a Key encoder (K) as shown in figure (11a). The two encoders can be different and are updated end-to-end by backpropagation during training. The main idea behind training these encoders separately is to generate distinct representations of the same sample. Using a contrastive loss, it converges to make positive samples closer and negative samples far from the original sample. Here, the query encoder Q is trained on the original samples and the key encoder K is trained on their augmented versions (positive samples) along with the negative samples in the batch. The features q and k generated from these encoders are used to calculate the similarity between the respective inputs using a similarity metric (discussed later in section 5). Most of the time, the similarity metric used is "cosine similarity" which is simply the inner product of two vectors normalized to have length 1 as defined in equation 2.
Recently, a successful end-to-end model was proposed in SimCLR [15] where they used a batch size of 4096 for 100 epochs. It has been verified that end-to-end architectures are simple in complexity but perform better with large batch sizes and a higher number of epochs as represented in figure 12. Another popular work that follows the end-to-end architecture was proposed by Oord et al. [21], where they learn feature representations of high-dimensional time series data by predicting the future in latent space using powerful autoregressive models along with a contrastive loss. This approach makes the model tractable by using negative sampling. Other works that follow this approach include [30,31,32,33,34].
The number of negative samples available in this approach is coupled with the batch size as it accumulates negative samples from the current batch. Since the batch size is limited by the GPU memory size, the scalability factor with these methods remains an issue. Furthermore, for larger batch sizes, the methods suffer from a large mini-batch optimization problem and require effective optimization strategies as pointed out by [35].
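The coupling between batch size and the number of negatives can be made concrete with a small sketch. It assumes that row i of the query features and row i of the key features come from the same original sample, and it uses the inner product of length-normalized vectors as the similarity; the function names and the toy features are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def in_batch_similarities(queries, keys):
    """Similarity of every query to every key in the batch (cosine similarity)."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

def positives_and_negatives(sim):
    """For query i, key i is the positive and the other N - 1 keys are negatives."""
    return [(row[i], row[:i] + row[i + 1:]) for i, row in enumerate(sim)]

queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # features q of the original samples
keys = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.8]]     # features k of their augmented versions
pairs = positives_and_negatives(in_batch_similarities(queries, keys))
```

With a batch of N samples, each query sees exactly N - 1 negatives, which is why these methods benefit from very large batches.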

Using a Memory Bank
With potential issues from having larger batch sizes that could inversely impact the optimization during training, a possible solution is to maintain a separate dictionary known as memory bank.  The representation of a sample in the memory bank gets updated when it is last seen, so the sampled keys are essentially about the encoders at multiple different steps all over the past epoch. PIRL [17] is one of the recent successful methods that learns good visual representations of images trained using a memory bank as shown in figure 13. It requires the learner to construct representations of images that are covariant to any of the pretext tasks being used, though they focus mainly on the Jigsaw pretext task. Another popular work that uses a memory bank under contrastive setting was proposed by Wu et al. [12] where they implemented a non-parametric variant of softmax classifier that is more scalable for big data applications.
However, maintaining a memory bank during training can be a complicated task. One of the potential drawbacks of this approach is that it can be computationally expensive to update the representations in the memory bank as the representations get outdated quickly in a few passes.
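A memory bank can be sketched as a table with one stored representation per training sample. The version below is a deliberately simplified assumption: it overwrites an entry whenever the sample is seen and draws negatives uniformly at random, whereas published methods may use other update and sampling rules. The class and method names are hypothetical.

```python
import random

class MemoryBank:
    """Stores one embedding per training sample, indexed by sample id."""

    def __init__(self, num_samples, dim):
        self.bank = [[0.0] * dim for _ in range(num_samples)]

    def update(self, index, embedding):
        # The entry is refreshed only when its sample is seen, so other
        # entries may come from older versions of the encoder.
        self.bank[index] = list(embedding)

    def sample_negatives(self, exclude_index, k, rng):
        """Draw k stored embeddings, never the anchor's own entry."""
        candidates = [i for i in range(len(self.bank)) if i != exclude_index]
        return [self.bank[i] for i in rng.sample(candidates, k)]

rng = random.Random(0)
bank = MemoryBank(num_samples=5, dim=2)
for i in range(5):
    bank.update(i, [float(i), 1.0])
negatives = bank.sample_negatives(exclude_index=2, k=3, rng=rng)
```

Because negatives are read from the bank rather than from the current batch, their number is no longer tied to the batch size.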

Using a Momentum Encoder
To address the issues with a memory bank explained in the previous section, the memory bank is replaced by a separate module called the momentum encoder. The momentum encoder generates a dictionary as a queue of encoded keys with the current mini-batch enqueued and the oldest mini-batch dequeued. The dictionary keys are defined on-the-fly by a set of data samples in the batch during training. The momentum encoder has the same architecture as the encoder Q as shown in figure 11c. It is not updated by backpropagation after every pass; instead, its parameters are updated based on the parameters of the query encoder as represented by equation 1 [14]:
$$\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q \qquad (1)$$
In the equation, m ∈ [0, 1) is the momentum coefficient, θ_q denotes the parameters of the query encoder, and θ_k denotes the parameters of the momentum encoder. Only the parameters θ_q are updated by back-propagation. The momentum update makes θ_k evolve more smoothly than θ_q. As a result, though the keys in the queue are encoded by different encoders (in different mini-batches), the difference among these encoders can be made small.
The advantage of using this architecture over the first two is that it does not require training two separate models. Furthermore, there is no need to maintain a memory bank that is computationally and memory inefficient.
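The momentum update and the queue of keys described in this section can be sketched in a few lines. Parameters are represented as flat lists of floats, and the queue length of three mini-batches is an arbitrary choice for this example; the function name is hypothetical.

```python
from collections import deque

def momentum_update(theta_k, theta_q, m):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element-wise."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

theta_q = [0.0, 1.0]  # query encoder parameters (updated by back-propagation)
theta_k = [1.0, 0.0]  # momentum encoder parameters (never back-propagated)
theta_k = momentum_update(theta_k, theta_q, m=0.9)

# Dictionary as a queue: appending a new mini-batch of keys drops the oldest one.
queue = deque(maxlen=3)
for step in range(4):
    queue.append(["key_batch_%d" % step])
```

A momentum coefficient close to 1 keeps the key encoder changing slowly, which is what keeps the keys in the queue comparable to each other.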

Clustering Feature Representations
All three architectures explained above focus on comparing samples using a similarity metric and try to keep similar items closer and dissimilar items far from each other allowing the model to learn better representations. On the contrary, this architecture follows an end-to-end approach with two encoders that share parameters, but instead of using instance-based contrastive approach, they utilize a clustering algorithm to group similar features together. One of the most recent works that employ clustering methods, SwAV [13] is represented in figure 14. The diagram points out the differences between other instance-based contrastive learning architectures and the clustering-based methods. Here, the goal is not only to make a pair of samples close to each other but also, make sure that all other features that are similar to each other form clusters together. For example, in an embedded space of images, the features of cats should be closer to the features of dogs (as both are animals) but should be far from the features of houses (as both are distinct).
In instance-based learning, every sample is treated as a discrete class in the dataset. This makes it unreliable in conditions where it compares an input sample against other samples from the same class that the original sample belongs to. To explain it clearly, imagine we have an image of a cat in the training batch that is the current input to the model. During this pass, all other images in the batch are considered as negative. The issue arises when there are images of other cats in the negative samples. This condition forces the model to learn two images of cats as not similar during training despite both being from the same class. This problem is implicitly addressed by a clustering-based approach.

Encoders
Encoders play an integral role in any self-supervised learning pipeline as they are responsible for mapping the input samples to a latent space. Figure 15 reflects the role of an encoder in a self-supervised learning pipeline. Without effective feature representations, a classification model might have difficulty in learning to distinguish among different classes. Most of the works in contrastive learning utilize some variant of the ResNet [36] model. Among its variants, ResNet-50 has been the most widely used because of its balance between size and learning capability.
In an encoder, the output from a specific layer is pooled to get a single-dimensional feature vector for every sample. Depending on the approach, they are either upsampled or downsampled. For example, in the work proposed by Misra et al. [17], a single linear projection is further applied to the pooled features to get a 128-dimensional feature vector. Also, as part of their ablation test, they investigated features from various stages such as res2, res3, and res4 to evaluate the performance. As expected, features extracted from the later stages of the encoder proved to be a better representation of the input than the features extracted from the earlier stages.
Similarly, in the work proposed by Chen et al. [37], a traditional ResNet is used as an encoder where the features are extracted from the output of the average pooling layer. Further, a shallow MLP (1 hidden layer) maps representations to a latent space where a contrastive loss is applied. For training a model for action recognition, the most common approach to extract features from a sequence of image frames is to use a 3D-ResNet as encoder [22,24].
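The pooling and projection steps described above can be sketched on a toy feature map. The dimensions here (2 channels projected to 3 values) are chosen only for readability and are far smaller than those used in practice; the weights are made up, and the function names are hypothetical.

```python
def global_average_pool(feature_map):
    """Average each channel (an H x W grid) down to a single value."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

def linear_projection(vector, weights, bias):
    """Apply one linear layer: out[j] = sum_i weights[j][i] * vector[i] + bias[j]."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

feature_map = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
               [[0.0, 0.0], [0.0, 8.0]]]   # channel 1
pooled = global_average_pool(feature_map)  # one value per channel
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
embedding = linear_projection(pooled, weights, bias=[0.0, 0.0, 0.0])
```

The contrastive loss would then be computed on the projected embedding rather than on the pooled features.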

Training
To train an encoder, a pretext task is used that utilizes contrastive loss for backpropagation. The central idea in contrastive learning is to bring similar instances closer and push dissimilar instances far from each other. One way to achieve this is to use a similarity metric that measures the closeness between the embeddings of two samples. In a contrastive setup, the most common similarity metric used is cosine similarity, which acts as a basis for different contrastive loss functions. The cosine similarity of two variables (vectors) is the cosine of the angle between them and is defined as follows:
$$\mathrm{sim}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} \qquad (2)$$
Contrastive learning focuses on comparing the embeddings with a Noise Contrastive Estimation (NCE) [38] function that is defined as follows:
$$L_{NCE} = -\log \frac{\exp(\mathrm{sim}(q, k_{+})/\tau)}{\exp(\mathrm{sim}(q, k_{+})/\tau) + \exp(\mathrm{sim}(q, k_{-})/\tau)} \qquad (3)$$
where q is the original sample, k_+ represents a positive sample, and k_- represents a negative sample. τ is a hyperparameter used in most of the recent methods and is called the temperature coefficient. The sim() function can be any similarity function, but generally a cosine similarity as defined in equation 2 is used. The initial idea behind NCE was to perform a non-linear logistic regression that discriminates between observed data and some artificially generated noise.
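A minimal sketch of the cosine similarity and of the NCE loss with a single negative, as described above, is given below. The temperature value and the toy vectors are assumptions chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nce_loss(q, k_pos, k_neg, tau=0.1):
    """NCE loss for one query with one positive and one negative key."""
    pos = math.exp(cosine_similarity(q, k_pos) / tau)
    neg = math.exp(cosine_similarity(q, k_neg) / tau)
    return -math.log(pos / (pos + neg))

q = [1.0, 0.0]
good = nce_loss(q, k_pos=[0.9, 0.1], k_neg=[-1.0, 0.0])  # positive is close to q
bad = nce_loss(q, k_pos=[-1.0, 0.0], k_neg=[0.9, 0.1])   # positive is far from q
```

The loss is small when the positive is more similar to the query than the negative, and a lower temperature makes this contrast sharper.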
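When more than one negative sample is used, a variant of NCE called InfoNCE is used as represented in equation 4:
$$L_{infoNCE} = -\log \frac{\exp(\mathrm{sim}(q, k_{+})/\tau)}{\exp(\mathrm{sim}(q, k_{+})/\tau) + \sum_{i=1}^{K} \exp(\mathrm{sim}(q, k_{i})/\tau)} \qquad (4)$$
where k_i represents a negative sample and K is the number of negative samples.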
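The InfoNCE loss can be sketched as a softmax cross-entropy in which the positive key is the correct class. The function below takes precomputed similarities and uses a log-sum-exp formulation for numerical stability; the default temperature of 0.07 and the function name are assumptions for this example.

```python
import math

def info_nce(pos_sim, neg_sims, tau=0.07):
    """InfoNCE loss given sim(q, k+) and a list of sim(q, k_i) for the negatives."""
    logits = [pos_sim / tau] + [s / tau for s in neg_sims]
    # Subtracting the maximum logit avoids overflow in exp().
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]

easy = info_nce(pos_sim=0.9, neg_sims=[-0.2, 0.1, 0.0])  # positive stands out
hard = info_nce(pos_sim=0.1, neg_sims=[0.8, 0.7, 0.9])   # negatives look closer
```

With a single negative this reduces to the two-term NCE loss, and adding more negatives makes the discrimination task harder.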
Similar to other deep learning methods, contrastive learning employs a variety of optimization algorithms for training. The training process involves learning the parameters of the encoder network by minimizing the loss function. Stochastic Gradient Descent (SGD) has been one of the most popular optimization algorithms used with contrastive learning methods [17,14,10,12]. It is a stochastic approximation of gradient descent optimization since it replaces the actual gradient (calculated from the entire data set) by an estimate calculated from a randomly selected subset of data. A crucial hyperparameter for the SGD algorithm is the learning rate, which in practice should gradually be decreased over time. An improved version of SGD (with momentum) is used in most deep learning approaches.
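As an illustration of SGD with momentum and a decaying learning rate, the sketch below minimizes the toy function f(x) = x^2, whose gradient is 2x. The decay schedule, the momentum value, and the function name are assumptions for this example and are not taken from any of the cited methods.

```python
def sgd_momentum_step(params, grads, velocity, lr, momentum=0.9):
    """One update: v <- momentum * v - lr * g, then p <- p + v."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity

x, v = [1.0], [0.0]
for t in range(200):
    lr = 0.1 / (1.0 + 0.01 * t)  # learning rate decreased gradually over time
    grads = [2.0 * x[0]]         # gradient of f(x) = x^2
    x, v = sgd_momentum_step(x, grads, v, lr)
```

In a real training loop the gradient would be estimated from a randomly selected mini-batch rather than computed exactly.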
Another popular optimization method known as adaptive learning rate optimization algorithm (Adam) [39] has been used in a few methods [21,40,41]. In Adam, momentum is incorporated directly as an estimate of the first-order moment. Furthermore, Adam includes bias corrections to the estimates of both the first-order moments and the second-order moments to account for their initialization at the origin.
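The bias-corrected moment estimates in Adam can be sketched for a single scalar parameter. The default hyperparameters below are the commonly used ones and are assumptions here rather than values reported by the cited methods.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias correction: m starts at 0
    v_hat = v / (1.0 - beta2 ** t)               # bias correction: v starts at 0
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

param, m, v = 1.0, 0.0, 0.0
param, m, v = adam_step(param, grad=2.0, m=m, v=v, t=1)
```

Without the bias correction, the first few updates would be far too small because both moment estimates start at zero.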
Since some of the end-to-end methods [15,42,13] use a very large batch size, training with standard SGD-based optimizers with a linear learning rate scaling becomes unstable. In order to stabilize the training, the Layer-wise Adaptive Rate Scaling (LARS) [43] optimizer along with a cosine learning rate schedule [44] was introduced. There are two main differences between LARS and other adaptive algorithms such as Adam. First, LARS uses a different learning rate for every layer, which leads to better stability. Second, the magnitude of the update is based on the weight norm for better control of training speed. Furthermore, employing a cosine learning rate involves periodic warm restarts of SGD, where in each restart, the learning rate is initialized to some value and is scheduled to decrease over time.
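The two ideas in this paragraph can be sketched separately: a per-layer learning rate derived from the ratio of weight norm to gradient norm, and a cosine schedule that restarts periodically. The trust coefficient, the fixed restart period, and the function names are assumptions for illustration.

```python
import math

def lars_local_lr(weights, grads, trust_coef=0.001, weight_decay=0.0):
    """Layer-wise rate: trust_coef * ||w|| / (||g|| + weight_decay * ||w||)."""
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0  # fall back to the global rate when a norm is zero
    return trust_coef * w_norm / (g_norm + weight_decay * w_norm)

def cosine_lr(step, period, lr_max, lr_min=0.0):
    """Cosine decay from lr_max to lr_min, restarting every `period` steps."""
    t_cur = step % period
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t_cur / period))

layer_rate = lars_local_lr(weights=[3.0, 4.0], grads=[0.0, 5.0])
schedule = [cosine_lr(step, period=10, lr_max=1.0) for step in range(20)]
```

Each layer would multiply the scheduled global learning rate by its own local rate, so layers with small gradients relative to their weights are not starved of updates.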

Downstream Tasks
Figure 16: Image classification, localization, detection, and segmentation as downstream tasks in computer vision [49]
Generally, computer vision pipelines that employ self-supervised learning involve performing two tasks: a pretext task and a downstream task. Downstream tasks are application-specific tasks that utilize the knowledge that was learned during the pretext task. They can be anything such as classification, detection, segmentation, future prediction, etc. in computer vision. Figure 17 represents the overview of how knowledge is transferred to a downstream task. The learned parameters serve as a pretrained model and are transferred to other downstream computer vision tasks by fine-tuning. The performance of transfer learning on these high-level vision tasks demonstrates the generalization ability of the learned features. To evaluate the effectiveness of features learned with a self-supervised approach for downstream tasks, methods such as kernel visualization, feature map visualization, and nearest-neighbor based approaches are commonly used.

Visualizing Kernels and Feature Maps
Here, the kernels of the first convolutional layer from encoders trained with both self-supervised (contrastive) and supervised approaches are compared. This helps to estimate the effectiveness of the self-supervised approach [50]. Similarly, attention maps generated from different layers of the encoders can be used to evaluate if an approach works or not. Gidaris et al. [51] assessed the effectiveness based on the activated regions observed in the input as shown in figure 18.

Nearest Neighbor retrieval
In general, the samples that belong to the same class are expected to be closer to each other in the latent space. With the nearest neighbor approach, for a given input sample, top-K retrieval of the samples from the dataset can be used to analyze whether a self-supervised approach performs as expected or not.
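Top-K retrieval in the latent space can be sketched as follows. The class labels are used only to check the retrieved neighbors, not for training; the toy embeddings, labels, and function names are assumptions for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query, gallery, k):
    """Indices of the k gallery embeddings most similar to the query."""
    order = sorted(range(len(gallery)),
                   key=lambda i: cosine(query, gallery[i]), reverse=True)
    return order[:k]

def precision_at_k(neighbor_ids, labels, query_label):
    """Fraction of retrieved neighbors that share the query's class."""
    hits = sum(1 for i in neighbor_ids if labels[i] == query_label)
    return hits / len(neighbor_ids)

gallery = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [0.1, 0.9]]
labels = ["cat", "cat", "house", "house"]
neighbors = top_k([1.0, 0.0], gallery, k=2)
score = precision_at_k(neighbors, labels, query_label="cat")
```

A high precision among the retrieved neighbors suggests that samples of the same class lie close together in the learned embedding space.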

Benchmarks
Recently, several self-supervised learning methods for computer vision tasks have been proposed that challenge the existing state-of-the-art supervised models. In this section, we collect and compare the performances of these methods based on the downstream tasks they were evaluated on. For image classification, two popular datasets, ImageNet [16] and Places [52], have been used by most of the methods. Similarly, for object detection, the Pascal VOC dataset has often been referred to for evaluation where these methods have outperformed the best supervised models. For action recognition and video classification, datasets such as UCF-101 [53], HMDB-51 [54], and Kinetics [55] have been used. Table 1 highlights the performance of several methods on ImageNet and reflects how these methods have evolved and performed better with time. At the moment, as seen in figure 3, SwAV [13] produces comparable accuracy to the state-of-the-art supervised model in learning image representations from ImageNet. Similarly, for the image classification task on the Places [52] dataset, SwAV [13] and AMDIM [32] have outperformed top supervised models with higher top-1 accuracies as shown in table 3. The methods shown in the table were first pretrained on ImageNet and later inferred on the Places dataset using a linear classifier. The results suggest that representations learned by contrastive learning methods performed better than the supervised approach when tested on a different dataset.
Table caption (fragment): (2) Object detection with finetuned features on VOC7+12 using Faster-CNN
These methods have not only excelled in image classification but also have performed well on other tasks like object detection and action recognition. As shown in [...]
[...] [72] for natural language processing in 2013. The authors proposed a contrastive learning-based framework by using co-occurring words as semantically similar points and negative sampling [73] for learning word embeddings. The negative sampling algorithm differentiates a word from the noise distribution using logistic regression and helps to simplify the training method. This framework results in a huge improvement in the quality of the learned representations of words and phrases in a computationally efficient way. Arora et al. [74] proposed a theoretical framework for contrastive learning that learns useful feature representations from unlabeled data, introduces latent classes to formalize the notion of semantic similarity, and performs well on classification tasks using the learned representations. Its performance is comparable to the state-of-the-art supervised approach on the Wiki-3029 dataset. Another recent model, CONtrastive Position and Ordering with Negatives Objective (CONPONO) [75], models discourse coherence and encodes fine-grained sentence ordering in text, and outperforms the BERT-Large model despite having the same number of parameters as BERT-Base.
Contrastive learning has started gaining popularity on several NLP tasks in recent years. It has shown significant improvement on NLP downstream tasks such as cross-lingual pretraining [76], language understanding [77], and textual representation learning [78]. INFOXLM [76], a cross-lingual pretraining model, proposes a cross-lingual pretraining task based on maximizing the mutual information between two input sequences and learns to differentiate the machine translation of an input sequence using contrastive learning. Unlike TLM [79], this model aims to maximize mutual information between machine translation pairs in a cross-lingual setting and improves cross-lingual transferability in various downstream tasks, such as cross-lingual classification and question answering. Table 6 lists recent contrastive learning methods for NLP downstream tasks.
Most popular language models, such as BERT [80] and GPT [81], are pretrained at the token level and hence may not capture sentence-level semantics. To address this issue, CERT [77], which pretrains models at the sentence level using contrastive learning, was proposed. This model works in two steps: 1) creating augmentations of sentences using back-translation, and 2) predicting whether two augmented versions are from the same sentence or not by fine-tuning a pretrained language representation model (e.g., BERT, BART). CERT was also evaluated on 11 different natural language understanding tasks in the GLUE benchmark, where it outperformed BERT on 7 tasks. DeCLUTR [78] is a self-supervised model for learning universal sentence embeddings. This model outperforms InferSent, a popular sentence encoding method. It has been evaluated based on the quality of its sentence embeddings on the SentEval benchmark. Table 5 provides a comparison of accuracy on different NLP datasets.

Model                              Dataset              Application areas
Distributed Representations [72]   Google internal      Training with Skip-gram model
Contrastive Unsupervised [74]      Wiki-3029            Unsupervised representation learning
CONPONO [75]                       RTE, COPA, ReCoRD    Discourse fine-grained sentence ordering in text
INFOXLM [76]                       MMLM                 Learning cross-lingual representations
CERT [77]                          GLUE benchmark       Capturing sentence-level semantics
DeCLUTR [78]                       OpenWebText          Learning universal sentence representations

Table 6: Recent contrastive learning methods in NLP along with the datasets they were evaluated on and the respective downstream tasks

Discussions and Future Directions
Although empirical results show that contrastive learning has narrowed the performance gap with supervised models, more theoretical analysis is needed to provide a solid justification. For instance, a study by Purushwalkam et al. [82] reveals that approaches like PIRL [17] and MoCo [14] fail to capture viewpoint and category-instance invariance, which are crucial components for object recognition. Some of these issues are further discussed below.

Lack of Theoretical Foundation
In an attempt to investigate the generalization ability of the contrastive objective function, the empirical results from Arora et al. [74] show that architecture design and sampling techniques also have a profound effect on performance. Tsai et al. [83] provide an information-theoretical framework from a multi-view perspective to understand the properties that encourage successful self-supervised learning. They demonstrate that self-supervised learned representations can extract task-relevant information (with a potential loss) and discard task-irrelevant information (with a fixed gap). Ultimately, this makes the methods highly dependent on the pretext task chosen during training. This affirms the need for more theoretical analysis of the different modules in a contrastive pipeline.

Selection of Data Augmentation and Pretext Tasks
PIRL [17] emphasizes methods that produce consistent results irrespective of the pretext task selected, but works like SimCLR [37], MoCo-v2 [42], and Tian et al. [19] demonstrate that selecting robust pretext tasks along with suitable data augmentations can substantially boost the quality of the representations. Recently, SwAV [13] beat other self-supervised methods by using multiple augmentations. It is difficult to directly compare these methods in order to choose the specific tasks and transformations that yield the best results on any dataset.

Proper Negative Sampling during Training
During training, an original (positive) sample is compared against its negative counterparts, which contribute towards a contrastive loss used to train the model. In the case of easy negatives (where the similarity between the original sample and a negative sample is very low), the contribution towards the contrastive loss is minimal. This limits the ability of the model to converge quickly. To get more meaningful negative samples, top self-supervised methods either increase the batch size [15] or maintain a very large memory bank [17]. Recently, Kalantidis et al. [84] proposed a few hard negative mixing strategies to facilitate faster and better learning. However, this introduces a large number of hyperparameters that are specific to the training set and are difficult to generalize to other datasets.
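To illustrate why easy negatives contribute little to the loss, the following minimal sketch computes an InfoNCE-style contrastive loss for a single anchor. The function name `info_nce`, the temperature of 0.1, and all similarity values are assumptions made for this example; they are not taken from any of the surveyed methods.

```python
import math


def info_nce(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE-style loss for one anchor.

    pos_sim:  similarity between the anchor and its positive (e.g. cosine similarity)
    neg_sims: similarities between the anchor and each negative
    """
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - logits[0]


# Easy negatives: very dissimilar to the anchor (made-up similarities).
easy = info_nce(0.9, [-0.8, -0.9, -0.7])
# Hard negatives: almost as similar to the anchor as the positive.
hard = info_nce(0.9, [0.8, 0.85, 0.7])
```

With easy negatives the loss is close to zero, so there is almost no training signal, whereas hard negatives yield a much larger loss. This is the motivation for large batches, memory banks, and hard negative mixing.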

Dataset Biases
In any self-supervised learning task, the data itself provides the supervision. As a result, the representations learned using self-supervised objectives are influenced by the underlying data. Such biases are difficult to minimize as the size of the datasets increases.

Conclusion
This paper has extensively reviewed recent top-performing self-supervised methods that follow contrastive learning for both vision and NLP tasks. We explain the different modules in a contrastive learning pipeline: choosing the right pretext task, selecting an architectural design, and using the learned parameters for a downstream task. Works based on contrastive learning have shown promising results on several downstream tasks such as image and video classification, object detection, and NLP tasks. Finally, this work concludes by discussing some of the open problems of current approaches that are yet to be addressed. New techniques and paradigms are needed to tackle these issues.