Heuristic Attention Representation Learning for Self-Supervised Pretraining
Abstract
1. Introduction
 We introduce a new self-supervised learning framework (HARL) that maximizes the similarity agreement between object-level latent embeddings of different augmented views in vector space. The framework implementation is available in the Supplementary Materials section.
 We utilize two heuristic mask proposal techniques, one from conventional computer vision and one from unsupervised deep learning, to generate binary masks for a natural-image dataset.
 We construct two novel heuristic binary segmentation mask datasets for ImageNet ILSVRC-2012 [24] to facilitate research on perceptual grouping for self-supervised visual representation learning. The datasets are available for download as described in the Data Availability Statement section.
 Finally, we demonstrate that adopting early visual attention provides a diverse set of high-quality semantic features that enable more effective representation learning during self-supervised pretraining. We report promising results when transferring HARL's learned representation to a wide range of downstream vision tasks.
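The object-level agreement idea above can be illustrated with a minimal NumPy sketch (this is our own simplification, not the released implementation; function names and tensor shapes are assumptions): a heuristic binary mask pools the encoder's spatial feature map into an object-level embedding, and embeddings from two augmented views are compared with a BYOL-style negative cosine similarity.

```python
import numpy as np

def masked_pool(feature_map, mask):
    """Average spatial features over the object region given by a binary mask.

    feature_map: (H, W, C) spatial feature map from the encoder.
    mask:        (H, W) binary mask, 1 inside the object, 0 for background.
    Returns an object-level embedding of shape (C,).
    """
    weights = mask[..., None]                       # (H, W, 1)
    denom = max(float(weights.sum()), 1e-8)         # avoid division by zero
    return (feature_map * weights).sum(axis=(0, 1)) / denom

def similarity_loss(p, z):
    """BYOL-style objective: 2 - 2 * cos(p, z) between the online
    prediction p and the target projection z (both assumed nonzero)."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return 2.0 - 2.0 * float(np.dot(p, z))
```

In the full framework the loss would be symmetrized over the two augmented views, and the mask would be warped with the same geometric transforms as its image so it stays aligned with the object.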
2. Related Works
3. Methods
3.1. HARL Framework
Algorithm 1: HARL: Heuristic Attention Representation Learning
Input:
 $D$, $M$, $T$, and $T'$: sets of images and masks, and distributions of transformations;
 $\theta$, $f_{\theta}$, $g_{\theta}$, and $q_{\theta}$: initial online parameters, encoder, projector, and predictor;
 $\xi$, $f_{\xi}$, $g_{\xi}$: initial target parameters, target encoder, and target projector;
 Optimizer: updates the online parameters using the loss gradient;
 $K$ and $N$: total number of optimization steps and batch size;
 $\{\tau_{k}\}_{k=1}^{K}$ and $\{\eta_{k}\}_{k=1}^{K}$: target network update schedule and learning rate schedule.

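Algorithm 1 alternates gradient steps on the online parameters $\theta$ with a momentum (exponential moving average) update of the target parameters $\xi$ according to the schedule $\{\tau_k\}$. A minimal sketch of those two pieces, assuming the cosine ramp used in BYOL-style methods (the $\tau_{\mathrm{base}}$ default is an assumption, not a value taken from this paper):

```python
import math

def ema_update(online_params, target_params, tau):
    """Momentum update of the target network:
    xi <- tau * xi + (1 - tau) * theta, applied parameter-wise."""
    return [tau * xi + (1.0 - tau) * th
            for th, xi in zip(online_params, target_params)]

def tau_schedule(k, K, tau_base=0.996):
    """Cosine schedule ramping tau from tau_base at step 0 toward 1 at step K."""
    return 1.0 - (1.0 - tau_base) * (math.cos(math.pi * k / K) + 1.0) / 2.0
```

Only the online parameters receive gradients; the target network is updated solely through `ema_update`, which is what stabilizes training without negative pairs.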
3.2. Heuristic Binary Mask
4. Experiments
4.1. Self-Supervised Pretraining Implementation
4.2. Evaluation Protocol
4.2.1. Linear Evaluation and Semi-Supervised Learning on the ImageNet Dataset
4.2.2. Transfer Learning to Other Downstream Tasks
5. Ablation and Analysis
5.1. The Output of Spatial Feature Map (Size and Dimension)
5.2. Objective Loss Functions
5.2.1. Mask Loss
5.2.2. Hybrid Loss
5.2.3. Mask Loss versus Hybrid Loss
5.3. The Impact of Heuristic Mask Quality
6. Conclusions and Future Work
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Implementation Detail
Appendix A.1. Implementation Data Augmentation
 Random cropping with resizing: a random patch of the image is selected. In our pipeline, we use inception-style random cropping [62], whose crop area is uniformly sampled in [0.08, 1.0] of the original image area and whose aspect ratio is logarithmically sampled in [3/4, 4/3]. The patch is then resized to 224 × 224 pixels using bicubic interpolation;
 Optional horizontal flipping (left and right);
 Color jittering: the brightness, contrast, saturation and hue are shifted by a uniformly distributed offset;
 Optional color dropping: the RGB image is replaced by its greyscale values;
 Gaussian blurring of the 224 × 224 image with a square Gaussian kernel and a standard deviation uniformly sampled from [0.1, 2.0];
 Optional solarization: a point-wise color transformation $x \mapsto x \cdot \mathbb{1}_{x < 0.5} + (1 - x) \cdot \mathbb{1}_{x \ge 0.5}$ applied to pixel values in the range [0, 1].
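The solarization transform above can be written directly as a small sketch for pixel values in [0, 1] (vectorized over an array of pixels):

```python
import numpy as np

def solarize(x):
    """Point-wise solarization on pixel values in [0, 1]:
    x -> x where x < 0.5, and 1 - x where x >= 0.5."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(x < 0.5, x, 1.0 - x)
```

Dark pixels pass through unchanged while bright pixels are inverted, which is why the transform leaves the midpoint 0.5 fixed.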
| Parameter | T | T′ | M | M′ |
|---|---|---|---|---|
| Inception-style random crop probability | 1.0 | 1.0 | 1.0 | 1.0 |
| Flip probability | 0.5 | 0.5 | 0.5 | 0.5 |
| Color jittering probability | 0.8 | 0.8 | – | – |
| Brightness adjustment max intensity | 0.4 | 0.4 | – | – |
| Contrast adjustment max intensity | 0.4 | 0.4 | – | – |
| Saturation adjustment max intensity | 0.2 | 0.2 | – | – |
| Hue adjustment max intensity | 0.1 | 0.1 | – | – |
| Color dropping probability | 0.2 | 0.2 | – | – |
| Gaussian blurring probability | 1.0 | 0.1 | – | – |
| Solarization probability | 0.0 | 0.2 | – | – |
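The parameters in the table above can be collected into per-distribution configurations; a hedged sketch (the dictionary keys are our own naming, not the paper's). The image distributions T and T′ apply photometric jitter, while the mask distributions M and M′ use only the geometric transforms so each mask stays aligned with its image:

```python
# Augmentation probabilities taken from the table above.
AUG_PARAMS = {
    "T":  {"crop_p": 1.0, "flip_p": 0.5, "jitter_p": 0.8, "gray_p": 0.2,
           "blur_p": 1.0, "solarize_p": 0.0},
    "T'": {"crop_p": 1.0, "flip_p": 0.5, "jitter_p": 0.8, "gray_p": 0.2,
           "blur_p": 0.1, "solarize_p": 0.2},
    # Masks: geometric transforms only, no photometric jitter.
    "M":  {"crop_p": 1.0, "flip_p": 0.5},
    "M'": {"crop_p": 1.0, "flip_p": 0.5},
}
```

Note the asymmetry between T and T′ (blurring almost always applied to one view, solarization only to the other), which follows the BYOL-style recipe.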
Appendix A.2. Implementation Masking Feature
Appendix B. Evaluation on the ImageNet and Transfer Learning
Appendix B.1. Linear Evaluation and Semi-Supervised Protocol on ImageNet
| Dataset | Classes | Original Training Examples | Training Examples | Validation Examples | Test Examples | Accuracy Measure | Test Provided |
|---|---|---|---|---|---|---|---|
| Food-101 | 101 | 75,750 | 68,175 | 7575 | 25,250 | Top-1 accuracy | – |
| CIFAR-10 | 10 | 50,000 | 45,000 | 5000 | 10,000 | Top-1 accuracy | – |
| CIFAR-100 | 100 | 50,000 | 44,933 | 5067 | 10,000 | Top-1 accuracy | – |
| SUN397 (split 1) | 397 | 19,850 | 15,880 | 3970 | 19,850 | Top-1 accuracy | – |
| Cars | 196 | 8144 | 6494 | 1650 | 8041 | Top-1 accuracy | – |
| DTD (split 1) | 47 | 1880 | 1880 | 1880 | 1880 | Top-1 accuracy | Yes |
 Top-1: We compute the proportion of correctly classified examples.
 AP, AP_{50}, and AP_{75}: We compute the average precision as defined in [56].
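As a simplified illustration of the AP metric (this is a generic all-point-interpolated sketch for a single class, not the exact evaluation code; matching predictions to ground-truth boxes at a given IoU threshold is assumed to happen upstream):

```python
import numpy as np

def average_precision(scores, labels):
    """Area under the precision-recall curve for ranked detections.

    scores: confidence score per prediction.
    labels: 1 if the prediction matches a ground-truth object, else 0.
    """
    order = np.argsort(-np.asarray(scores))          # rank by confidence
    tp = np.asarray(labels, dtype=np.float64)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    n_pos = tp.sum()
    recall = tp_cum / max(n_pos, 1e-8)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-8)
    # Precision envelope: at each rank, the best precision achievable
    # at that recall level or higher.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate precision over recall increments.
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * precision))
```

AP_{50} and AP_{75} would apply this computation after matching predictions at IoU thresholds of 0.5 and 0.75, respectively.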
Appendix B.2. Transfer via Linear Classification and Fine-Tuning
Appendix B.3. Transfer Learning to Other Vision Tasks
Appendix C. Heuristic Mask Proposal Methods
Appendix C.1. Heuristic Binary Mask Generation Using DRFI
Appendix C.2. Heuristic Binary Mask Generation Using Unsupervised Deep Learning
References
 Shu, Y.; Kou, Z.; Cao, Z.; Wang, J.; Long, M. Zoo-Tuning: Adaptive transfer from a zoo of models. arXiv 2021, arXiv:2106.15434.
 Yang, Q.; Zhang, Y.; Dai, W.; Pan, S.J. Transfer Learning; Cambridge University Press: Cambridge, UK, 2020.
 You, K.; Kou, Z.; Long, M.; Wang, J. Co-Tuning for transfer learning. Adv. Neural Inf. Process. Syst. 2020, 33, 17236–17246.
 Misra, I.; Shrivastava, A.; Gupta, A.; Hebert, M. Cross-stitch networks for multi-task learning. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 3994–4003.
 Li, X.; Xiong, H.; Xu, C.; Dou, D. SMILE: Self-distilled mixup for efficient transfer learning. arXiv 2021, arXiv:2103.13941.
 Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the 2015 IEEE Information Theory Workshop (ITW), Jeju Island, Korea, 11–15 October 2015; pp. 1–5.
 Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv 2017, arXiv:1703.00810.
 Amjad, R.A.; Geiger, B.C. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2225–2239.
 Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G.E. A simple framework for contrastive learning of visual representations. arXiv 2020, arXiv:2002.05709.
 Goyal, P.; Caron, M.; Lefaudeux, B.; Xu, M.; Wang, P.; Pai, V.; Singh, M.; Liptchinsky, V.; Misra, I.; Joulin, A.; et al. Self-supervised pretraining of visual features in the wild. arXiv 2021, arXiv:2103.01988.
 Misra, I.; van der Maaten, L. Self-supervised learning of pretext-invariant representations. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6706–6716.
 Ermolov, A.; Siarohin, A.; Sangineto, E.; Sebe, N. Whitening for self-supervised representation learning. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021.
 Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.Á.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv 2020, arXiv:2006.07733.
 Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. arXiv 2021, arXiv:2104.14294.
 Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; Joulin, A. Unsupervised learning of visual features by contrasting cluster assignments. arXiv 2020, arXiv:2006.09882.
 Bromley, J.; Bentz, J.W.; Bottou, L.; Guyon, I.; LeCun, Y.; Moore, C.; Säckinger, E.; Shah, R. Signature verification using a “Siamese” time delay neural network. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, Chambéry, France, 28 August–3 September 1993.
 He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R.B. Momentum contrast for unsupervised visual representation learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 9726–9735.
 Chen, X.; He, K. Exploring simple Siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021.
 Hayhoe, M.M.; Ballard, D.H. Eye movements in natural behavior. Trends Cogn. Sci. 2005, 9, 188–194.
 Borji, A.; Sihite, D.N.; Itti, L. Quantitative analysis of human-model agreement in visual saliency modeling. IEEE Trans. Image Process. 2013, 22, 55–69.
 Benois-Pineau, J.; Le Callet, P. Visual content indexing and retrieval with psycho-visual models. In Multimedia Systems and Applications; Springer: Cham, Switzerland, 2017.
 Awh, E.; Armstrong, K.M.; Moore, T. Visual and oculomotor selection: Links, causes and implications for spatial attention. Trends Cogn. Sci. 2006, 10, 124–130.
 Tian, Y.; Chen, X.; Ganguli, S. Understanding self-supervised learning dynamics without contrastive pairs. arXiv 2021, arXiv:2102.06810.
 Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.S.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
 Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML), Helsinki, Finland, 5–9 July 2008.
 Bojanowski, P.; Joulin, A. Unsupervised learning by predicting noise. arXiv 2017, arXiv:1704.05310.
 Larsson, G.; Maire, M.; Shakhnarovich, G. Colorization as a proxy task for visual understanding. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 840–849.
 Iizuka, S.; Simo-Serra, E. Let there be color! Joint end-to-end learning of global and local image priors for automatic image colorization with simultaneous classification. ACM Trans. Graph. 2016, 35, 1–11.
 Pathak, D.; Krähenbühl, P.; Donahue, J.; Darrell, T.; Efros, A.A. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 2536–2544.
 Gidaris, S.; Singh, P.; Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv 2018, arXiv:1803.07728.
 Noroozi, M.; Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016.
 Zhang, R.; Isola, P.; Efros, A.A. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 645–654.
 Mundhenk, T.N.; Ho, D.; Chen, B.Y. Improvements to context-based self-supervised learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9339–9348.
 Donahue, J.; Krähenbühl, P.; Darrell, T. Adversarial feature learning. arXiv 2017, arXiv:1605.09782.
 Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative adversarial nets. In Proceedings of the Neural Information Processing Systems (NIPS), Montreal, QC, Canada, 8–13 December 2014.
 Donahue, J.; Simonyan, K. Large scale adversarial representation learning. In Proceedings of the Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019.
 Bansal, V.; Buckchash, H.; Raman, B. Discriminative autoencoding for classification and representation learning problems. IEEE Signal Process. Lett. 2021, 28, 987–991.
 Kingma, D.P.; Welling, M. Auto-encoding variational Bayes. arXiv 2013, arXiv:1312.6114.
 Chen, T.; Kornblith, S.; Swersky, K.; Norouzi, M.; Hinton, G.E. Big self-supervised models are strong semi-supervised learners. arXiv 2020, arXiv:2006.10029.
 Jaiswal, A.; Babu, A.R.; Zadeh, M.Z.; Banerjee, D.; Makedon, F. A survey on contrastive self-supervised learning. Technologies 2020, 9, 2.
 Hinton, G.E.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531.
 Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Van Gool, L. Unsupervised semantic segmentation by contrasting object mask proposals. arXiv 2021, arXiv:2102.06191.
 Zhang, X.; Maire, M. Self-supervised visual representation learning from hierarchical grouping. arXiv 2020, arXiv:2012.03044.
 Jiang, H.; Yuan, Z.; Cheng, M.M.; Gong, Y.; Zheng, N.; Wang, J. Salient object detection: A discriminative regional feature integration approach. Int. J. Comput. Vis. 2013, 123, 251–268.
 He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
 Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167.
 Nair, V.; Hinton, G.E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), Haifa, Israel, 21–24 June 2010.
 You, Y.; Gitman, I.; Ginsburg, B. Scaling SGD batch size to 32K for ImageNet training. arXiv 2017, arXiv:1708.03888.
 Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2017, arXiv:1608.03983.
 Goyal, P.; Dollár, P.; Girshick, R.B.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv 2017, arXiv:1706.02677.
 Kolesnikov, A.; Zhai, X.; Beyer, L. Revisiting self-supervised visual representation learning. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 1920–1929.
 Ye, M.; Zhang, X.; Yuen, P.; Chang, S.F. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 6203–6212.
 Hjelm, R.D.; Fedorov, A.; Lavoie-Marchildon, S.; Grewal, K.; Trischler, A.; Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv 2019, arXiv:1808.06670.
 Chen, X.; Fan, H.; Girshick, R.B.; He, K. Improved baselines with momentum contrastive learning. arXiv 2020, arXiv:2003.04297.
 Kornblith, S.; Shlens, J.; Le, Q.V. Do better ImageNet models transfer better? In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 2656–2666.
 Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.M.; Zisserman, A. The Pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 2009, 88, 303–338.
 Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 39, 1137–1149.
 Lin, T.Y.; Maire, M.; Belongie, S.J.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014.
 He, K.; Gkioxari, G.; Dollár, P.; Girshick, R.B. Mask R-CNN. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 386–397.
 Zhang, S.; Liew, J.H.; Wei, Y.; Wei, S.; Zhao, Y. Interactive object segmentation with inside-outside guidance. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12231–12241.
 Xie, Z.; Lin, Y.; Zhang, Z.; Cao, Y.; Lin, S.; Hu, H. Propagate yourself: Exploring pixel-level consistency for unsupervised visual representation learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 16679–16688.
 Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
 Pascanu, R.; Mikolov, T.; Bengio, Y. On the difficulty of training recurrent neural networks. In Proceedings of the International Conference on Machine Learning (ICML), Atlanta, GA, USA, 16–21 June 2013.
 Bossard, L.; Guillaumin, M.; Van Gool, L. Food-101 – Mining discriminative components with random forests. In Proceedings of the European Conference on Computer Vision (ECCV), Zurich, Switzerland, 6–12 September 2014.
 Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, University of Toronto, Toronto, ON, Canada, 2009. Available online: https://www.cs.toronto.edu/~kriz/learningfeatures2009TR.pdf (accessed on 8 April 2009).
 Xiao, J.; Hays, J.; Ehinger, K.A.; Oliva, A.; Torralba, A. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, 13–18 June 2010; pp. 3485–3492.
 Krause, J.; Stark, M.; Deng, J.; Fei-Fei, L. 3D object representations for fine-grained categorization. In Proceedings of the 2013 IEEE International Conference on Computer Vision Workshops, Sydney, Australia, 2–8 December 2013; pp. 554–561.
 Cimpoi, M.; Maji, S.; Kokkinos, I.; Mohamed, S.; Vedaldi, A. Describing textures in the wild. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3606–3613.
 Hénaff, O.J.; Srinivas, A.; De Fauw, J.; Razavi, A.; Doersch, C.; Eslami, S.M.A.; van den Oord, A. Data-efficient image recognition with contrastive predictive coding. arXiv 2020, arXiv:1905.09272.
 Borji, A.; Cheng, M.M.; Jiang, H.; Li, J. Salient object detection: A benchmark. IEEE Trans. Image Process. 2015, 24, 5706–5722.
 Wang, W.; Lai, Q.; Fu, H.; Shen, J.; Ling, H. Salient object detection in the deep learning era: An in-depth survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 3239–3259.
 Zou, W.; Komodakis, N. HARF: Hierarchy-associated rich features for salient object detection. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 406–414.
 Zhang, J.; Zhang, T.; Dai, Y.; Harandi, M.; Hartley, R.I. Deep unsupervised saliency detection: A multiple noisy labeling perspective. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9029–9038.
| Method | Linear Top-1 | Linear Top-5 | Semi-Sup. Top-1 (1%) | Semi-Sup. Top-1 (10%) | Semi-Sup. Top-5 (1%) | Semi-Sup. Top-5 (10%) |
|---|---|---|---|---|---|---|
| Supervised | 76.5 | – | 25.4 | 56.4 | 48.4 | 80.4 |
| PIRL [11] | 63.6 | – | – | – | 57.2 | 83.8 |
| SimCLR [9] | 69.3 | 89.0 | 48.3 | 65.6 | 75.5 | 87.8 |
| MoCo [17] | 60.6 | – | – | – | – | – |
| MoCo v2 [54] | 71.1 | – | – | – | – | – |
| SimSiam [18] | 71.3 | – | – | – | – | – |
| BYOL [13] | 74.3 | 91.6 | 53.2 | 68.8 | 78.4 | 89.0 |
| HARL (ours) | 74.0 | 91.3 | 54.5 | 69.5 | 79.2 | 89.3 |
| Method | Food-101 | CIFAR-10 | CIFAR-100 | SUN397 | Cars | DTD |
|---|---|---|---|---|---|---|
| Linear evaluation: | | | | | | |
| HARL (ours) | 75.0 | 92.6 | 77.6 | 61.4 | 67.3 | 77.3 |
| BYOL [13] | 75.3 | 91.3 | 78.4 | 62.2 | 67.8 | 75.5 |
| MoCo v2 (repo) | 69.2 | 91.4 | 73.7 | 58.6 | 47.3 | 71.1 |
| SimCLR [9] | 68.4 | 90.6 | 71.6 | 58.8 | 50.3 | 74.5 |
| Fine-tuned: | | | | | | |
| HARL (ours) | 88.0 | 97.6 | 85.6 | 64.1 | 91.1 | 78.0 |
| BYOL [13] | 88.5 | 97.4 | 85.3 | 63.7 | 91.6 | 76.2 |
| MoCo v2 (repo) | 86.1 | 97.0 | 83.7 | 59.1 | 90.0 | 74.1 |
| SimCLR [9] | 88.2 | 97.7 | 85.9 | 63.5 | 91.3 | 73.2 |
| Method | VOC07+12 Det. AP_{50} | AP | AP_{75} | COCO Det. AP_{50} | AP | AP_{75} | COCO Seg. ${\mathrm{AP}}_{50}^{mask}$ | ${\mathrm{AP}}^{mask}$ | ${\mathrm{AP}}_{75}^{mask}$ |
|---|---|---|---|---|---|---|---|---|---|
| Supervised | 81.3 | 53.5 | 58.8 | 58.2 | 38.2 | 41.2 | 54.7 | 33.3 | 35.2 |
| SimCLR [18] | 81.8 | 55.5 | 61.4 | 57.7 | 37.9 | 40.9 | 54.6 | 33.3 | 35.3 |
| MoCo [17] | 82.2 | 57.2 | 63.7 | 58.9 | 38.5 | 42.0 | 55.9 | 35.1 | 37.7 |
| MoCo v2 [54] | 82.5 | 57.4 | 64.0 | – | 39.8 | – | – | 36.1 | – |
| SimSiam [18] | 82.4 | 57.0 | 63.7 | 59.3 | 39.2 | 42.1 | 56.0 | 34.4 | 36.7 |
| BYOL [13] | – | – | – | – | 40.4 | – | – | 37.0 | – |
| BYOL (repo) | 82.6 | 55.5 | 61.9 | 61.2 | 40.2 | 43.9 | 58.2 | 36.7 | 39.5 |
| HARL (ours) | 82.7 | 56.3 | 62.4 | 62.1 | 40.9 | 44.5 | 59.0 | 37.3 | 40.0 |
| Method | Top-1 Accuracy | Top-5 Accuracy |
|---|---|---|
| Mask loss: | | |
| α_base = 0.3 | 51.3 | 77.4 |
| α_base = 0.5 | 53.9 | 79.4 |
| α_base = 0.7 | 54.6 | 79.8 |
| Hybrid loss: | | |
| λ_base = 0.3 | 55.0 | 79.4 |
| λ_base = 0.5 | 57.8 | 81.7 |
| λ_base = 0.7 | 58.2 | 81.8 |
| Method | VOC07+12 Det. AP_{50} | AP | AP_{75} | COCO Det. AP_{50} | AP | AP_{75} | COCO Seg. ${\mathrm{AP}}_{50}^{mask}$ | ${\mathrm{AP}}^{mask}$ | ${\mathrm{AP}}_{75}^{mask}$ |
|---|---|---|---|---|---|---|---|---|---|
| HARL (DRFI masks) | 82.3 | 55.4 | 61.2 | 44.2 | 24.6 | 24.8 | 41.8 | 24.3 | 25.1 |
| HARL (deep learning masks) | 82.1 | 55.5 | 61.7 | 44.7 | 24.7 | 25.3 | 42.3 | 24.6 | 25.2 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Tran, V.N.; Liu, S.H.; Li, Y.H.; Wang, J.C. Heuristic Attention Representation Learning for SelfSupervised Pretraining. Sensors 2022, 22, 5169. https://doi.org/10.3390/s22145169