# A Survey on Contrastive Self-Supervised Learning


## Abstract


## 1. Introduction

## 2. Pretext Tasks

#### 2.1. Color Transformation

#### 2.2. Geometric Transformation

Here, the original image is considered the global view and the transformed version the local view. Chen et al. [15] performed such transformations to learn features during the pretext task.

#### 2.3. Context-Based

#### 2.3.1. Jigsaw Puzzle

#### 2.3.2. Frame Order Based

#### 2.3.3. Future Prediction

#### 2.4. View Prediction (Cross-Modal-Based)

#### 2.5. Identifying the Right Pre-Text Task

#### 2.6. Pre-Text Tasks in NLP

#### 2.6.1. Center and Neighbor Word Prediction

#### 2.6.2. Next and Neighbor Sentence Prediction

#### 2.6.3. Autoregressive Language Modeling

#### 2.6.4. Sentence Permutation

## 3. Architectures

#### 3.1. End-to-End Learning

#### 3.2. Using a Memory Bank

#### 3.3. Using a Momentum Encoder for Contrastive Learning

#### 3.4. Clustering Feature Representations

## 4. Encoders

## 5. Training

## 6. Downstream Tasks

#### 6.1. Visualizing the Kernels and Feature Maps

#### 6.2. Nearest-Neighbor Retrieval

## 7. Benchmarks

## 8. Contrastive Learning in NLP

## 9. Discussions and Future Directions

#### 9.1. Lack of Theoretical Foundation

#### 9.2. Selection of Data Augmentation and Pretext Tasks

#### 9.3. Proper Negative Sampling during Training

#### 9.4. Dataset Biases

## 10. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Conflicts of Interest

## References

- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]
- Liu, X.; Zhang, F.; Hou, Z.; Wang, Z.; Mian, L.; Zhang, J.; Tang, J. Self-supervised learning: Generative or contrastive. arXiv, 2020; arXiv:2006.08218. [Google Scholar]
- Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. arXiv, 2014; arXiv:1406.2661. [Google Scholar] [CrossRef]
- Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
- Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
- Oord, A.V.d.; Kalchbrenner, N.; Kavukcuoglu, K. Pixel recurrent neural networks. arXiv, 2016; arXiv:1601.06759. [Google Scholar]
- Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. arXiv, 2016; arXiv:1605.05396. [Google Scholar]
- Kim, T.; Cha, M.; Kim, H.; Lee, J.K.; Kim, J. Learning to discover cross-domain relations with generative adversarial networks. arXiv, 2017; arXiv:1703.05192. [Google Scholar]
- Epstein, R. The Empty Brain. 2016. Available online: https://aeon.co/essays/your-brain-does-not-process-information-and-it-is-not-a-computer (accessed on 1 November 2020).
- Bojanowski, P.; Joulin, A. Unsupervised learning by predicting noise. arXiv, 2017; arXiv:1704.05310. [Google Scholar]
- Dosovitskiy, A.; Fischer, P.; Springenberg, J.T.; Riedmiller, M.; Brox, T. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Trans. Pattern Anal. Mach. Intell. **2015**, 38, 1734–1747. [Google Scholar] [CrossRef][Green Version]
- Wu, Z.; Xiong, Y.; Yu, S.X.; Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3733–3742. [Google Scholar]
- Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. arXiv, 2020; arXiv:2006.09882. [Google Scholar]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv, 2020; arXiv:2002.05709. [Google Scholar]
- Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
- Misra, I.; Maaten, L.V.D. Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 6707–6717. [Google Scholar]
- Trinh, T.H.; Luong, M.T.; Le, Q.V. Selfie: Self-supervised pretraining for image embedding. arXiv, 2019; arXiv:1906.02940. [Google Scholar]
- Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; Isola, P. What makes for good views for contrastive learning. arXiv, 2020; arXiv:2005.10243. [Google Scholar]
- Qian, R.; Meng, T.; Gong, B.; Yang, M.H.; Wang, H.; Belongie, S.; Cui, Y. Spatiotemporal Contrastive Video Representation Learning. arXiv, 2020; arXiv:2008.03800. [Google Scholar]
- Van den Oord, A.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv, 2018; arXiv:1807.03748. [Google Scholar]
- Lorre, G.; Rabarisoa, J.; Orcesi, A.; Ainouz, S.; Canu, S. Temporal Contrastive Pretraining for Video Action Recognition. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 662–670. [Google Scholar]
- Sermanet, P.; Lynch, C.; Chebotar, Y.; Hsu, J.; Jang, E.; Schaal, S.; Levine, S.; Brain, G. Time-contrastive networks: Self-supervised learning from video. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1134–1141. [Google Scholar]
- Tao, L.; Wang, X.; Yamasaki, T. Self-supervised video representation learning using inter-intra contrastive framework. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2193–2201. [Google Scholar]
- Xiao, T.; Wang, X.; Efros, A.A.; Darrell, T. What Should Not Be Contrastive in Contrastive Learning. arXiv, 2020; arXiv:2008.05659. [Google Scholar]
- Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Computer Vision—ECCV 2016, Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 69–84. [Google Scholar]
- Yamaguchi, S.; Kanai, S.; Shioda, T.; Takeda, S. Multiple Pretext-Task for Self-Supervised Learning via Mixing Multiple Image Transformations. arXiv, 2019; arXiv:1912.11603. [Google Scholar]
- Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; Vedaldi, A. Describing textures in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3606–3613. [Google Scholar]
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. arXiv, 2013; arXiv:1301.3781. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv, 2018; arXiv:1810.04805. [Google Scholar]
- Kiros, R.; Zhu, Y.; Salakhutdinov, R.R.; Zemel, R.; Urtasun, R.; Torralba, A.; Fidler, S. Skip-thought vectors. Adv. Neural Inf. Process. Syst. **2015**, 28, 3294–3302. [Google Scholar]
- Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. **2018**. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv, 2019; arXiv:1910.13461. [Google Scholar]
- Glasmachers, T. Limits of end-to-end learning. arXiv, 2017; arXiv:1704.08305. [Google Scholar]
- Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Bachman, P.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv, 2018; arXiv:1808.06670. [Google Scholar]
- Ye, M.; Zhang, X.; Yuen, P.C.; Chang, S.F. Unsupervised Embedding Learning via Invariant and Spreading Instance Feature. arXiv, 2019; arXiv:1904.03436. [Google Scholar]
- Bachman, P.; Hjelm, R.D.; Buchwalter, W. Learning representations by maximizing mutual information across views. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 15535–15545. [Google Scholar]
- Henaff, O. Data-efficient image recognition with contrastive predictive coding. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; pp. 4182–4192. [Google Scholar]
- Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. arXiv, 2020; arXiv:2004.11362. [Google Scholar]
- Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv, 2017; arXiv:1706.02677. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Chen, T.; Zhai, X.; Ritter, M.; Lucic, M.; Houlsby, N. Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12154–12163. [Google Scholar]
- Gutmann, M.; Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Sardinia, Italy, 13–15 May 2010. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv, 2014; arXiv:1412.6980. [Google Scholar]
- Srinivas, A.; Laskin, M.; Abbeel, P. CURL: Contrastive Unsupervised Representations for Reinforcement Learning. arXiv, 2020; arXiv:2004.04136. [Google Scholar]
- Hafidi, H.; Ghogho, M.; Ciblat, P.; Swami, A. GraphCL: Contrastive Self-Supervised Learning of Graph Representations. arXiv, 2020; arXiv:2007.08025. [Google Scholar]
- Chen, X.; Fan, H.; Girshick, R.; He, K. Improved Baselines with Momentum Contrastive Learning. arXiv, 2020; arXiv:2003.04297. [Google Scholar]
- You, Y.; Gitman, I.; Ginsburg, B. Large Batch Training of Convolutional Networks. arXiv, 2017; arXiv:1708.03888. [Google Scholar]
- Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv, 2016; arXiv:1608.03983. [Google Scholar]
- Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep Clustering for Unsupervised Learning of Visual Features. arXiv, 2019; arXiv:1807.05520. [Google Scholar]
- Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised Representation Learning by Predicting Image Rotations. arXiv, 2018; arXiv:1803.07728. [Google Scholar]
- Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; Torralba, A. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. **2017**, 40, 1452–1464. [Google Scholar] [CrossRef][Green Version]
- Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv, 2012; arXiv:1212.0402. [Google Scholar]
- Kuehne, H.; Jhuang, H.; Garrote, E.; Poggio, T.; Serre, T. HMDB: A large video database for human motion recognition. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 2556–2563. [Google Scholar]
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
- Zhuang, C.; Zhai, A.L.; Yamins, D. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6002–6012. [Google Scholar]
- Donahue, J.; Simonyan, K. Large scale adversarial representation learning. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 10542–10552. [Google Scholar]
- Li, J.; Zhou, P.; Xiong, C.; Socher, R.; Hoi, S.C.H. Prototypical Contrastive Learning of Unsupervised Representations. arXiv, 2020; arXiv:2005.04966. [Google Scholar]
- Asano, Y.M.; Rupprecht, C.; Vedaldi, A. Self-labelling via simultaneous clustering and representation learning. arXiv, 2019; arXiv:1911.05371. [Google Scholar]
- Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial Feature Learning. arXiv, 2017; arXiv:1605.09782. [Google Scholar]
- Doersch, C.; Gupta, A.; Efros, A.A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1422–1430. [Google Scholar]
- Zhang, R.; Isola, P.; Efros, A.A. Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction. arXiv, 2017; arXiv:1611.09842. [Google Scholar]
- Zhang, L.; Qi, G.J.; Wang, L.; Luo, J. AET vs. AED: Unsupervised Representation Learning by Auto-Encoding Transformations rather than Data. arXiv, 2019; arXiv:1901.04596. [Google Scholar]
- Goyal, P.; Mahajan, D.; Gupta, A.; Misra, I. Scaling and Benchmarking Self-Supervised Visual Representation Learning. arXiv, 2019; arXiv:1905.01235. [Google Scholar]
- Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2536–2544. [Google Scholar]
- Zhang, R.; Isola, P.; Efros, A.A. Colorful Image Colorization. arXiv, 2016; arXiv:1603.08511. [Google Scholar]
- Kim, D.; Cho, D.; Kweon, I.S. Self-supervised video representation learning with space-time cubic puzzles. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 8545–8552. [Google Scholar]
- Lee, H.Y.; Huang, J.B.; Singh, M.; Yang, M.H. Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 667–676. [Google Scholar]
- Sayed, N.; Brattoli, B.; Ommer, B. Cross and learn: Cross-modal self-supervision. In GCPR 2018: Pattern Recognition, Proceedings of the German Conference on Pattern Recognition, Stuttgart, Germany, 9–12 October 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 228–243. [Google Scholar]
- Fernando, B.; Bilen, H.; Gavves, E.; Gould, S. Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3636–3645. [Google Scholar]
- Misra, I.; Zitnick, C.L.; Hebert, M. Shuffle and learn: Unsupervised learning using temporal order verification. In Computer Vision—ECCV 2016, Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 527–544. [Google Scholar]
- Yao, T.; Zhang, Y.; Qiu, Z.; Pan, Y.; Mei, T. SeCo: Exploring Sequence Supervision for Unsupervised Representation Learning. arXiv, 2020; arXiv:2008.00975. [Google Scholar]
- Liu, Z.; Gao, G.; Qin, A.; Li, J. DTG-Net: Differentiated Teachers Guided Self-Supervised Video Action Recognition. arXiv, 2020; arXiv:2006.07609. [Google Scholar]
- Tian, Y.; Krishnan, D.; Isola, P. Contrastive Multiview Coding. arXiv, 2019; arXiv:1906.05849. [Google Scholar]
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. **2013**, 26, 3111–3119. [Google Scholar]
- Gutmann, M.U.; Hyvärinen, A. Noise-Contrastive Estimation of Unnormalized Statistical Models, with Applications to Natural Image Statistics. J. Mach. Learn. Res. **2012**, 13, 307–361. [Google Scholar]
- Arora, S.; Khandeparkar, H.; Khodak, M.; Plevrakis, O.; Saunshi, N. A Theoretical Analysis of Contrastive Unsupervised Representation Learning. arXiv, 2019; arXiv:1902.09229. [Google Scholar]
- Iter, D.; Guu, K.; Lansing, L.; Jurafsky, D. Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models. arXiv, 2020; arXiv:2005.10389. [Google Scholar]
- Chi, Z.; Dong, L.; Wei, F.; Yang, N.; Singhal, S.; Wang, W.; Song, X.; Mao, X.L.; Huang, H.; Zhou, M. InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. arXiv, 2020; arXiv:2007.07834. [Google Scholar]
- Fang, H.; Wang, S.; Zhou, M.; Ding, J.; Xie, P. CERT: Contrastive Self-supervised Learning for Language Understanding. arXiv, 2020; arXiv:2005.12766. [Google Scholar]
- Giorgi, J.M.; Nitski, O.; Bader, G.D.; Wang, B. DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. arXiv, 2020; arXiv:2006.03659. [Google Scholar]
- Lample, G.; Conneau, A. Cross-lingual Language Model Pretraining. arXiv, 2019; arXiv:1901.07291. [Google Scholar]
- Purushwalkam, S.; Gupta, A. Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases. arXiv, 2020; arXiv:2007.13916. [Google Scholar]
- Tsai, Y.H.H.; Wu, Y.; Salakhutdinov, R.; Morency, L.P. Self-supervised Learning from a Multi-view Perspective. arXiv, 2020; arXiv:2006.05576. [Google Scholar]
- Kalantidis, Y.; Sariyildiz, M.B.; Pion, N.; Weinzaepfel, P.; Larlus, D. Hard Negative Mixing for Contrastive Learning. arXiv, 2020; arXiv:2010.01028. [Google Scholar]

**Figure 1.** Basic intuition behind the contrastive learning paradigm: pull the original and augmented images closer together, and push the original and negative images apart.
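The push/pull intuition in Figure 1 is usually realized with an InfoNCE-style loss. Below is a minimal NumPy sketch (the function name `info_nce_loss` and the toy embeddings are illustrative, not from the survey): minimizing the loss increases the anchor–positive similarity relative to the anchor–negative similarities.

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor embedding.

    anchor, positive: 1-D unit-normalized embeddings.
    negatives: 2-D array, one unit-normalized negative per row.
    Minimizing this pulls anchor and positive together and pushes
    the anchor away from every negative.
    """
    pos_sim = np.dot(anchor, positive) / temperature
    neg_sims = negatives @ anchor / temperature
    # cross-entropy of the positive against positive + all negatives
    logits = np.concatenate(([pos_sim], neg_sims))
    return float(np.log(np.sum(np.exp(logits))) - pos_sim)

# toy example: a positive close to the anchor, random negatives
rng = np.random.default_rng(0)
a = np.array([1.0, 0.0])
p = np.array([0.9, 0.1])
p = p / np.linalg.norm(p)
negs = rng.normal(size=(8, 2))
negs = negs / np.linalg.norm(negs, axis=1, keepdims=True)
loss = info_nce_loss(a, p, negs)
```

A perfectly aligned positive (identical to the anchor) always yields a strictly lower loss than a slightly rotated one, which is exactly the gradient signal the encoder trains on.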

**Figure 3.** Top-1 classification accuracy of different contrastive learning methods compared with a supervised baseline on ImageNet.

**Figure 4.** Color transformation as a pretext task [15]. (**a**) Original; (**b**) Gaussian noise; (**c**) Gaussian blur; (**d**) color distortion (jitter).

**Figure 5.** Geometric transformation as a pretext task [15]. (**a**) Original; (**b**) crop and resize; (**c**) rotate (90°, 180°, 270°); (**d**) crop, resize, and flip.
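Pretext tasks like those in Figures 4 and 5 start from two stochastic transformations of the same image, which then serve as a positive pair. A minimal NumPy sketch of the geometric case (random crop + resize, random horizontal flip) follows; the helper `random_views` and the fixed output size are assumptions for illustration, not the augmentation pipeline of any specific method.

```python
import numpy as np

def random_views(img, rng, size=32):
    """Return two stochastically transformed views of one image.

    Each view: random crop, nearest-neighbour resize to a fixed
    size, and a random horizontal flip. img: HxWxC uint8 array.
    """
    def one_view():
        h, w = img.shape[:2]
        ch = rng.integers(size, h + 1)          # crop height
        cw = rng.integers(size, w + 1)          # crop width
        top = rng.integers(0, h - ch + 1)
        left = rng.integers(0, w - cw + 1)
        crop = img[top:top + ch, left:left + cw]
        # nearest-neighbour resize back to size x size
        ys = np.arange(size) * ch // size
        xs = np.arange(size) * cw // size
        view = crop[ys][:, xs]
        if rng.random() < 0.5:                  # random horizontal flip
            view = view[:, ::-1]
        return view
    return one_view(), one_view()

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
v1, v2 = random_views(image, rng)
```

The two returned arrays play the roles of anchor and positive in the contrastive loss; crops of different images supply the negatives.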

**Figure 6.** Solving a jigsaw puzzle used as a pretext task to learn representations. (**a**) Original image; (**b**) reshuffled image. The original image is the anchor and the reshuffled image is the positive sample.

**Figure 7.** Contrastive Predictive Coding: although the figure shows audio as input, a similar setup can be used for video, images, text, etc. [21].

**Figure 8.**Learning representation from video frame sequence [23].

**Figure 9.** The shapes in these two pairs of images are mostly the same, but the low-level statistics (color and texture) differ. Choosing the right pretext task here is necessary [26].

**Figure 10.** A sample from the DTD dataset [28]: an example of why rotation-based pretext tasks may not work well.

**Figure 11.** Different architecture pipelines for contrastive learning: (**a**) end-to-end training of two encoders, where one generates representations for positive samples and the other for negative samples; (**b**) using a memory bank to store and retrieve encodings of negative samples; (**c**) using a momentum encoder that acts as a dynamic dictionary lookup for the encodings of negative samples during training; (**d**) implementing a clustering mechanism by swapped prediction of the representations obtained from both encoders in an end-to-end architecture.
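The momentum-encoder pipeline in (**c**) keeps the key encoder as an exponential moving average of the query encoder, so the dictionary of negative keys changes slowly and stays consistent across batches. A minimal sketch of this MoCo-style update (treating parameters as plain NumPy arrays; the function name `momentum_update` is illustrative):

```python
import numpy as np

def momentum_update(query_params, key_params, m=0.999):
    """MoCo-style momentum update.

    The key encoder's parameters become an exponential moving
    average of the query encoder's parameters; only the query
    encoder receives gradients.
    """
    return [m * k + (1.0 - m) * q for q, k in zip(query_params, key_params)]

# toy parameters: the key encoder drifts slowly toward the query encoder
q = [np.ones((2, 2)), np.zeros(3)]
k = [np.zeros((2, 2)), np.ones(3)]
for _ in range(5):
    k = momentum_update(q, k)
```

With m close to 1 (0.999 is the value commonly cited), the key encoder moves only a fraction of a percent per step, which is what keeps the queued negative keys comparable to each other.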

**Figure 12.** Linear evaluation of models (ResNet-50) trained with different batch sizes and numbers of epochs. Each bar represents a single run from scratch [15].

**Figure 13.** Usage of a memory bank in PIRL: the memory bank contains moving-average representations of all negative images used in contrastive learning [17].
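A memory bank of this kind is updated by blending each stored representation with the freshly computed one and re-normalizing, so negatives can be drawn from the bank rather than from the current batch. A minimal NumPy sketch under those assumptions (the helper name `update_memory_bank` and the 0.5 blending weight are illustrative):

```python
import numpy as np

def update_memory_bank(bank, indices, new_reprs, momentum=0.5):
    """Moving-average memory-bank update.

    Blend the stored representation of each updated image with its
    freshly computed one, then re-normalize so every bank entry
    stays on the unit sphere.
    """
    blended = momentum * bank[indices] + (1.0 - momentum) * new_reprs
    bank[indices] = blended / np.linalg.norm(blended, axis=1, keepdims=True)
    return bank

# toy bank of 100 unit-norm representations, 4 of them refreshed
rng = np.random.default_rng(1)
bank = rng.normal(size=(100, 16))
bank /= np.linalg.norm(bank, axis=1, keepdims=True)
fresh = rng.normal(size=(4, 16))
fresh /= np.linalg.norm(fresh, axis=1, keepdims=True)
bank = update_memory_bank(bank, np.array([3, 10, 42, 7]), fresh)
```

Keeping every entry unit-norm means dot products with the bank directly give the cosine similarities that the contrastive loss consumes.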

**Figure 14.**Conventional contrastive instance learning vs. contrastive clustering of feature representations in SwAV [13].

**Figure 17.**Image classification, localization, detection, and segmentation as downstream tasks in computer vision.

**Figure 18.** Attention maps generated by a trained AlexNet. The left set of images corresponds to a supervised approach and the right set to a self-supervised approach. The maps are computed on features from different convolutional layers: (**a**) Conv1 27 × 27, (**b**) Conv3 13 × 13, and (**c**) Conv5 6 × 6.

**Table 1.** Performance on the ImageNet dataset: Top-1 and Top-5 accuracies of different contrastive learning methods trained self-supervised on ImageNet, with the models used as frozen encoders for a linear classifier. The rightmost two columns show the Top-5 accuracy of these methods after fine-tuning on 1% and 10% of ImageNet labels.

| Method | Architecture | Top-1 (Self-Sup.) | Top-5 (Self-Sup.) | 1% Labels (Top-5) | 10% Labels (Top-5) |
|---|---|---|---|---|---|
| Supervised | ResNet50 | 76.5 | - | 56.4 | 80.4 |
| CPC [38] | ResNet v2 101 | 48.7 | 73.6 | - | - |
| InstDisc [12] | ResNet50 | 56.5 | - | 39.2 | 77.4 |
| LA [56] | ResNet50 | 60.2 | - | - | - |
| MoCo [14] | ResNet50 | 60.6 | - | - | - |
| BigBiGAN [57] | ResNet50 (4×) | 61.3 | 81.9 | 55.2 | 78.8 |
| PCL [58] | ResNet50 | 61.5 | - | 75.3 | 85.6 |
| SeLa [59] | ResNet50 | 61.5 | 84.0 | - | - |
| PIRL [17] | ResNet50 | 63.6 | - | 57.2 | 83.8 |
| CPCv2 [38] | ResNet50 | 63.8 | 85.3 | 77.9 | 91.2 |
| PCLv2 [58] | ResNet50 | 67.6 | - | - | - |
| SimCLR [15] | ResNet50 | 69.3 | 89.0 | 75.5 | 87.8 |
| MoCov2 [47] | ResNet50 | 71.1 | - | - | - |
| InfoMin Aug [19] | ResNet50 | 73.0 | 91.1 | - | - |
| SwAV [13] | ResNet50 | 75.3 | - | 78.5 | 89.9 |

| Method | Architecture | Parameters | Top-1 Accuracy |
|---|---|---|---|
| Supervised | ResNet50 | 25.6 M | 53.2 |
| BiGAN [60] | AlexNet | 61 M | 31.0 |
| Context [61] | AlexNet | 61 M | 32.7 |
| SplitBrain [62] | AlexNet | 61 M | 34.1 |
| AET [63] | AlexNet | 61 M | 37.1 |
| DeepCluster [50] | AlexNet | 61 M | 37.5 |
| Color [64] | ResNet50 | 25.6 M | 37.5 |
| Jigsaw [64] | ResNet50 | 25.6 M | 41.2 |
| Rotation [51] | ResNet50 | 25.6 M | 41.4 |
| NPID [12] | ResNet50 | 25.6 M | 45.5 |
| PIRL [17] | ResNet50 | 25.6 M | 49.8 |
| LA [56] | ResNet50 | 25.6 M | 50.1 |
| AMDIM [37] | - | 670 M | 55.1 |
| SwAV [13] | ResNet50 | 25.6 M | 56.7 |

**Table 3.** (1) Linear classification top-1 accuracy on top of frozen features, and (2) object detection with fine-tuned features on VOC07+12 using Faster R-CNN.

| Method | Architecture | Parameters | (1) Classification | (2) Detection |
|---|---|---|---|---|
| Supervised | AlexNet | 61 M | 79.9 | 56.8 |
| Supervised | ResNet50 | 25.6 M | 87.5 | 81.3 |
| Inpaint [65] | AlexNet | 61 M | 56.5 | 44.5 |
| Color [66] | AlexNet | 61 M | 65.6 | 46.9 |
| BiGAN [60] | AlexNet | 61 M | 60.1 | 46.9 |
| NAT [10] | AlexNet | 61 M | 65.3 | 49.4 |
| Context [61] | AlexNet | 61 M | 65.3 | 51.1 |
| DeepCluster [50] | AlexNet | 61 M | 72.0 | 55.4 |
| Color [66] | ResNet50 | 25.6 M | 55.6 | - |
| Rotation [51] | ResNet50 | 25.6 M | 63.9 | 72.5 |
| Jigsaw [64] | ResNet50 | 25.6 M | 64.5 | 75.1 |
| LA [56] | ResNet50 | 25.6 M | 69.1 | - |
| NPID [12] | ResNet50 | 25.6 M | 76.6 | 79.1 |
| PIRL [17] | ResNet50 | 25.6 M | 81.1 | 80.7 |
| MoCo [14] | ResNet50 | 25.6 M | - | 81.4 |
| SwAV [13] | ResNet50 | 25.6 M | 88.9 | 82.6 |

**Table 4.** Accuracy on video classification. All methods were pretrained with their proposed contrastive approaches, and a linear model was used for evaluation. R3D denotes 3D-ResNet. † indicates that the model was pretrained on another dataset and then fine-tuned on the listed dataset. K denotes the Kinetics dataset.

| Method | Model | UCF-101 | HMDB-51 | K (Top-1) | K (Top-5) |
|---|---|---|---|---|---|
| C3D (Supervised) | - | 82.3 † | - | - | - |
| 3DResNet-18 (Supervised) | R3D | 84.4 † | 56.4 † | - | - |
| P3D (Supervised) | - | 84.4 † | - | - | - |
| ImageNet-inflated [67] | R3D | 60.3 | 30.7 | - | - |
| Jigsaw [26] | - | 51.5 | 22.5 | - | - |
| OPN [68] | - | 56.3 | 22.1 | - | - |
| Cross Learn (with optical flow) [69] | - | 58.7 | 27.2 | - | - |
| O3N [70] | - | 60.3 | 32.5 | - | - |
| Shuffle and Learn [71] | - | 50.2 | 18.1 | - | - |
| IIC (Shuffle + res) [24] | R3D | 74.4 | 38.3 | - | - |
| Inflated SimCLR [20] | R3D-50 | - | - | 48.0 | 71.5 |
| CVRL [20] | R3D-50 | - | - | 64.1 | 85.8 |
| TCP [22] | R3D | 77.9 (3 splits) | 45.3 | - | - |
| SeCo inter + intra + order [72] | R3D | 88.26 † | 55.5 † | 61.91 | - |
| DTG-Net [73] | R3D-18 | 85.6 | 49.9 | - | - |
| CMC (3 views) [74] | R3D | 59.1 | 26.7 | - | - |

**Table 5.**Recent contrastive learning methods in NLP along with the datasets they were evaluated on and the respective downstream tasks.

| Model | Dataset | Application Areas |
|---|---|---|
| Distributed Representations [75] | Google internal | Training with Skip-gram model |
| Contrastive Unsupervised [77] | Wiki-3029 | Unsupervised representation learning |
| CONPONO [78] | RTE, COPA, ReCoRD | Discourse fine-grained sentence ordering in text |
| INFOXLM [79] | XNLI and MLQA | Learning cross-lingual representations |
| CERT [80] | GLUE benchmark | Capturing sentence-level semantics |
| DeCLUTR [81] | OpenWebText | Learning universal sentence representations |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A Survey on Contrastive Self-Supervised Learning. *Technologies* **2021**, *9*, 2.
https://doi.org/10.3390/technologies9010002
