Heuristic Attention Representation Learning for Self-Supervised Pretraining

Recently, self-supervised learning methods have been shown to be powerful and efficient at yielding robust representations by maximizing the similarity between different augmented views in embedding vector space. However, a main challenge arises when generating different views with random cropping: the semantic features may differ across views, making the similarity-maximization objective inappropriate. We tackle this problem by introducing Heuristic Attention Representation Learning (HARL). This self-supervised framework relies on a joint embedding architecture in which two neural networks are trained to produce similar embeddings for different augmented views of the same image. HARL adopts prior visual object-level attention by generating a heuristic mask proposal for each training image and maximizes the similarity of the abstract object-level embeddings in vector space instead of the whole-image representation used in previous works. As a result, HARL extracts higher-quality semantic representations from each training sample and outperforms existing self-supervised baselines on several downstream tasks. In addition, we provide efficient techniques based on conventional computer vision and deep learning methods for generating heuristic mask proposals on natural image datasets. HARL achieves a +1.3% advancement on the ImageNet semi-supervised learning benchmark and a +0.9% improvement in AP50 on the COCO object detection task over the previous state-of-the-art method BYOL. Our code implementation is available for both TensorFlow and PyTorch frameworks.


Introduction
Visual representation learning has long been an active research area covering both supervised and unsupervised methods. Most supervised learning models learn visual representations by training on large labeled datasets and then transferring the knowledge to other tasks [1][2][3][4][5]. Most supervised learning frameworks tune their parameters to maximally compress the mapping of the input variables while preserving the information about the output variables [6][7][8]. As a result, most deep neural networks fail to generalize and maintain robustness when the test samples differ from the training samples in distribution or domain.
The new approaches are self-supervised representation learning methods that overcome the existing drawbacks of supervised learning [9][10][11][12][13][14][15]. These techniques have attracted significant attention for efficient, generalizable and robust representation learning, achieving on-par or even better performance than supervised baselines when transferring the learned representation to multiple downstream tasks. Furthermore, self-supervised learning methods overcome the limits of human supervision by leveraging the enormous availability of unlabeled data. Despite the variety of self-supervised frameworks, these methods still involve certain limitations, which motivate our contributions:
1. We introduce a new self-supervised learning framework (HARL) that maximizes the similarity agreement of object-level latent embeddings in vector space across different augmented views. The framework implementation is available in the Supplementary Material section.
2. We utilize two heuristic mask proposal techniques, based on conventional computer vision and unsupervised deep learning methods, to generate binary masks for natural image datasets.
3. We construct two novel heuristic binary segmentation mask datasets for ImageNet ILSVRC-2012 [24] to facilitate research on perceptual grouping for self-supervised visual representation learning. The datasets are available to download in the Data Availability Statement section.
4. Finally, we demonstrate that adopting early visual attention provides a diverse set of high-quality semantic features that enable more effective representation learning in self-supervised pretraining. We report promising results when transferring HARL's learned representation to a wide range of downstream vision tasks.
The remainder of this paper is organized as follows. In Section 2, we discuss related works. Section 3 introduces the HARL framework in detail. Section 4.1 briefly describes the implementation of the HARL framework for self-supervised pretraining. Section 4.2 evaluates and benchmarks HARL on the ImageNet evaluation, transfers the learned representation to other downstream tasks and compares it to previous state-of-the-art methods. In Section 5, we analyze the components impacting performance to understand the behavior of our proposed method. Finally, this paper is concluded in Section 6.

Related Works
Our method is most closely related to unsupervised visual representation learning methods, which aim to exploit the internal distributions and semantic information of input signals without human supervision. Early works focused on designing and solving pretext tasks and on image generation approaches. Pretext tasks focus on aspects of image restoration such as denoising [25], predicting noise [26], colorization [27,28], inpainting [29], predicting image rotation [30], solving jigsaw puzzles [31] and more [32,33]. However, the representations learned by neural networks pre-trained on these pretext tasks still failed to generalize robustly when applied to different downstream tasks. Generative adversarial learning [34][35][36] and variational auto-encoding [25,37,38] operate directly on pixel space and model high-level details for image generation, requiring costly computation that may be neither essential nor efficient for visual representation learning.
Self-supervised contrastive learning. Popular self-supervised contrastive learning frameworks [9,39,40] aim to pull together semantic features from different cropped views of the same image while pushing away features from other images. However, the downside of contrastive methods is that they require a considerable number of negative pairs, leading to significant computation resources and memory footprint. An efficient alternative is non-contrastive learning [13,18], which only maximizes the similarity of two views from the same image without contrasting them with views from different images.
Self-supervised non-contrastive learning. Distillation-based frameworks [13,18], inspired by knowledge distillation [41], apply a joint embedding architecture. One branch is defined as the student network and the other as the teacher network. The student network is trained to predict the representation of the teacher network; the teacher network's weights are optimized either as a running average of the student network's weights [13] or by sharing the student's weights while blocking the gradient flow through the teacher network [18]. Non-contrastive frameworks are effective and computationally efficient compared to self-supervised contrastive frameworks [9,17,39].
However, most contrastive or non-contrastive self-supervised techniques maximize the similarity agreement of whole-image context representations across different augmented views. Meanwhile, work on localization attention that separates semantic features [42,43] by perceptually grouping semantic information has shown that adopting prior mid-level visual cues in pretraining improves the efficiency of representation learning. The study most closely related to ours [39] leverages visual attention with segmentation and obtained impressive results when transferring the learned representation to downstream object detection and segmentation tasks on multiple datasets. In contrast to our work, it employs pixel-level models for contrastive learning, relies on backbones specialized for semantic segmentation and uses different loss functions. Notably, its primary objective is difficult to transfer to other self-supervised frameworks. It also did not investigate the masking feature method or the impact of the dimension and size of the output spatial feature maps on the latent embedding representation, which we examine next.

Methods
HARL's object-level objective is applicable to both contrastive and non-contrastive learning-based frameworks. In this study, we implement it on a non-contrastive framework that uses an exponential moving average of one encoder's weights to update the other and an extra predictor, inspired by BYOL [13]. HARL's objective maximizes the agreement of the object-level (foreground and background) latent embedding vectors across different cropped and augmented views, as shown in Figure 1.

Figure 1. The HARL architecture. The heuristic binary mask is estimated using either conventional computer vision or deep learning approaches. After that, data augmentation transformations are applied to both the image and its mask (bottom). The image pairs then flow through a convolutional feature extraction module, and the heuristic mask is used to mask the output feature maps in order to separate the foreground from the background features (middle). These features are further processed by non-linear multi-layer perceptron (MLP) modules. Finally, the similarity objective maximizes the agreement of the foreground and background embedding vectors across different augmented views of the same image (top).


HARL Framework
The HARL framework consists of three essential steps. In step 1, we estimate the heuristic binary mask for the input image, which segments the image into foreground and background (described in detail in Section 3.2). These masks can be computed using either conventional computer vision methods such as DRFI [44] or unsupervised deep learning saliency prediction [42]. After the mask is estimated, we apply the same geometric transformations (cropping, flipping, resizing, etc.) to both the image and its mask. For RGB images, additional transformations such as color distortion are applied to the image only, following the image augmentation pipeline of SimCLR [9]. The detailed augmentation pipeline is described in Appendix A.1. After data augmentation, each image and mask pair yields two augmented images x, x′ aligned with two augmented masks m and m′, as illustrated in Figure 1.
In step 2, we implement the standard ResNet-50 [45] convolutional residual neural network as the feature extractor, denoted f. Each image passes through the feature extractor to produce spatial feature maps of size 7 × 7 × 2048; this feature extraction process can be formulated as h = f(x), where h ∈ ℝ^(H×W×D). The feature maps are then separated into foreground and background feature maps by performing element-wise multiplication with the heuristic binary mask. In addition, we provide ablation studies analyzing the impact of spatial feature maps of various sizes and dimensions, as described in Section 5.1. The foreground and background features are denoted h^f and h^b, respectively (Appendix A.2 provides details of the masking feature method). The foreground and background spatial features are down-sampled using global average pooling and projected to a smaller dimension with a non-linear multi-layer perceptron (MLP) architecture g.
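For concreteness, the following is a minimal PyTorch-style sketch of this masking step (the function name, tensor shapes and nearest-neighbor mask resizing are our assumptions; the released implementation may differ):

```python
import torch
import torch.nn.functional as F

def mask_features(feature_maps: torch.Tensor, mask: torch.Tensor):
    """Split encoder feature maps into foreground/background embeddings.

    feature_maps: (B, D, H, W), e.g. (B, 2048, 7, 7) from the ResNet-50 encoder.
    mask:         (B, 1, H_img, W_img) binary heuristic mask aligned with the view.
    """
    # Resize the binary mask to the spatial size of the feature maps.
    m = F.interpolate(mask.float(), size=feature_maps.shape[-2:], mode="nearest")
    h_fg = feature_maps * m          # foreground feature maps
    h_bg = feature_maps * (1.0 - m)  # background feature maps
    # Global average pooling before the MLP projector g.
    h_fg = h_fg.flatten(2).mean(dim=2)  # (B, D)
    h_bg = h_bg.flatten(2).mean(dim=2)
    return h_fg, h_bg
```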
The HARL framework structure adapts from BYOL [13], in which one augmented image x is processed by the encoder f_θ and projection network g_θ, where θ denotes the learned parameters. The other augmented image x′ is processed by f_ξ and g_ξ, where ξ is an exponential moving average of θ. The first augmented image is further processed by the predictor network q_θ. The projection and predictor networks share the same non-linear multi-layer perceptron (MLP) architecture, as detailed in Section 4. The definitions of the encoder, projection and prediction networks are adapted from BYOL. Finally, the latent embedding vectors corresponding to the foreground and background features of the two augmented views are denoted z^f, z^b, z′^f and z′^b ∈ ℝ^d.
In step 3, we compute HARL's loss function from the given foreground and background latent representations (z^f, z^b, z′^f and z′^b, extracted from the two augmented images x and x′), which we define as the mask loss. We apply ℓ2-normalization to these latent vectors, then minimize their negative cosine similarity, weighting the foreground and background terms with a coefficient α, as illustrated in Equation (1):

ℒ_mask(θ, ξ) = α · ℒ_foreground(θ, ξ) + (1 − α) · ℒ_background(θ, ξ). (1)

We study the impact of the α value and the combination of whole-image and object-level latent embedding vectors in the loss objective in Section 5.2.
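A minimal sketch of this objective, assuming BYOL-style negative cosine terms with a stop-gradient on the target branch (helper names are hypothetical):

```python
import torch.nn.functional as F

def neg_cosine(p, z):
    """Negative cosine similarity; gradients do not flow into the target z."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z.detach(), dim=-1)
    return -(p * z).sum(dim=-1).mean()

def mask_loss(p_fg, p_bg, z_fg, z_bg, alpha=0.7):
    """Equation (1): alpha-weighted foreground/background similarity objective."""
    return alpha * neg_cosine(p_fg, z_fg) + (1.0 - alpha) * neg_cosine(p_bg, z_bg)
```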

Heuristic Binary Mask
Our heuristic binary mask estimation technique does not rely on external supervision, nor is it trained on a limited annotated dataset. We propose two approaches, one using conventional computer vision and one using unsupervised deep learning, and both appear to generalize well across image datasets. First, we use the traditional computer vision method DRFI [44] to generate a diverse set of binary masks by varying two hyperparameters (the Gaussian filter variance σ and the minimum cluster size s). In our implementation, we set σ = 0.8 and s = 1000 for generating binary masks on the ImageNet [24] dataset. In the second approach, we leverage the self-supervised pre-trained ResNet-50 backbone from [9,42] as a feature extractor, then pass its output feature maps to a 1 × 1 convolutional classification layer for saliency prediction. The classification layer predicts the saliency or "foregroundness" of each pixel. We therefore take the output values of the classification layer and set a threshold of 0.5 to decide which pixels belong to the foreground: pixels with saliency values greater than the threshold are labeled as foreground objects. Figure 2 shows example heuristic masks estimated by these two methods. The detailed implementation of the two methods, DRFI and the deep learning feature extractor combined with a 1 × 1 convolutional layer, is described in Appendix C. In most of our experiments, we used the masks generated by the deep learning method because it runs on the GPU rather than only the CPU and is therefore faster than DRFI.
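The thresholding step can be sketched as follows (a simplification under our assumption that the 1 × 1 head outputs scores mapped to [0, 1] via a sigmoid):

```python
import torch

def saliency_to_mask(saliency_logits: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Binarize per-pixel saliency scores from the 1x1 convolutional head.

    saliency_logits: (B, 1, H, W). Pixels whose predicted saliency exceeds
    the threshold are labeled foreground (1), the rest background (0).
    """
    saliency = torch.sigmoid(saliency_logits)  # map scores into [0, 1]
    return (saliency > threshold).to(torch.uint8)
```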

Figure 2. Example heuristic binary masks used for the mask contrastive learning framework. First row: random images from the ImageNet [24] training set. Second row: masks generated by the DRFI algorithm with a predefined sigma σ value of 0.8 and a minimum component size of 1000. Third row: masks obtained from the self-supervised pre-trained ResNet-50 feature extractor followed by a 1 × 1 convolutional classification layer for foreground and background prediction.

Self-Supervised Pretraining Implementation
HARL is trained on the RGB images and the corresponding heuristic masks of the ImageNet ILSVRC-2012 [24] training set without labels. We implement the standard ResNet encoder [45]. Following previous works SimCLR and BYOL [9,13], the encoder representation output is projected into a smaller dimension using a multi-layer perceptron (MLP). In our implementation, the MLP comprises a linear layer with an output size of 4096, followed by batch normalization [46], rectified linear units (ReLU) [47] and a final linear layer with 512 output units. We apply the LARS optimizer [48] with a cosine decay learning rate schedule without restarts [49] over 1000 epochs, with a base learning rate of 0.2 scaled linearly [50] with the batch size (LearningRate = 0.2 × BatchSize/256) and 10 warmup epochs. Furthermore, we apply a global weight decay parameter of 5 × 10⁻⁷ while excluding the biases and normalization parameters from the LARS adaptation and weight decay. The optimization of the online and target networks follows the protocol of BYOL [13]. We use a batch size of 4096 split over 8 Nvidia A100 GPUs. This setup takes approximately 149 h to train a ResNet-50 (×1).
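The learning rate schedule can be sketched as below (a simplification: the LARS per-layer trust-ratio adaptation is omitted, and the warmup-step count derived from 10 epochs is left to the caller):

```python
import math

def learning_rate(step, total_steps, batch_size, base_lr=0.2, warmup_steps=0):
    """Linear batch-size scaling, linear warmup, then cosine decay without restarts."""
    scaled_lr = base_lr * batch_size / 256  # LearningRate = 0.2 x BatchSize / 256
    if warmup_steps and step < warmup_steps:
        return scaled_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return scaled_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```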
The computational requirements of the self-supervised pretraining stage are largely due to forward and backward passes through the convolutional backbone. For the typical ResNet-50 architecture applied to 224 × 224 resolution images, a single forward pass requires approximately 4B FLOPS. The projection head MLP (2048 × 4096 + 4096 × 512) requires roughly 10M FLOPS. In our implementation, the convolutional backbone and MLP network are the same as in the baseline BYOL. Since we forward the foreground and background representations through the projection head twice instead of once, our framework incurs an additional 10M FLOPS, less than 0.25% of the total. Finally, the cost of computing the heuristic mask images is negligible because they can be computed once and reused throughout training. Therefore, the per-iteration complexity of our method and the baseline BYOL is almost the same in both computational cost and training time.

Evaluation Protocol
We evaluate the representation learned in the self-supervised pretraining stage on various natural image datasets and tasks, including image classification, segmentation and object detection. First, we assess the obtained representation with linear classification and semi-supervised learning on ImageNet, following the protocols of [9,51]. Second, we evaluate the generalization and robustness of the learned representation by transferring it to other natural image datasets and vision tasks across image classification, object detection and segmentation. Finally, Appendix B provides the detailed configuration and hyperparameter settings of the linear and fine-tuning protocols in our transfer learning implementation.

Linear Evaluation and Semi-Supervised Learning on the ImageNet Dataset
The evaluation for linear and semi-supervised learning follows the procedure in [9,52,53]. For the linear evaluation, we train a linear classifier on top of the frozen encoder representation and report Top-1 and Top-5 accuracies in percent on the test set, as shown in Table 1. We then evaluate semi-supervised learning, fine-tuning the pre-trained encoder on small subsets with 1% and 10% of the labeled ILSVRC-2012 ImageNet [24] training set. We also report the Top-1 and Top-5 accuracies on the test set in Table 1. HARL obtains 54.5% and 69.5% Top-1 accuracy for semi-supervised learning using the standard ResNet-50 (×1). This represents a +1.3% and +0.7% advancement over the baseline framework BYOL [13] and a significant improvement over the strong supervised baseline. Table 1. Evaluation on ImageNet. Linear evaluation and semi-supervised learning with a fraction (1% and 10%) of ImageNet labels, reporting Top-1 and Top-5 accuracies (in %) using the pre-trained ResNet-50 backbone. The best result is bolded.


Transfer Learning to Other Downstream Tasks
We evaluated the quality of HARL's representation learning with linear classification and fine-tuned models following the evaluation protocols of [9,13,39,55], as detailed in Appendix B.2. HARL's learned representation performs well on all six natural image datasets. It is competitive across datasets of various distributions compared to the baseline BYOL [13] and improves significantly over the SimCLR [9] approach on all six datasets, as shown in Table 2.
We further evaluated HARL's generalization ability and robustness on different computer vision tasks, including object detection on VOC07 + 12 [56] using the Faster R-CNN [57] architecture with an R50-C4 backbone and instance segmentation on COCO [58] using Mask R-CNN [59] with an R50-FPN backbone. The fine-tuning procedure and hyperparameter settings are detailed in Appendix B.3. We report the standard AP, AP50 and AP75 metrics in Table 3. HARL outperforms the baseline BYOL [13] and also performs significantly better than other self-supervised frameworks such as SimCLR [9] and MoCo_v2 [17], as well as the supervised baseline, on object detection and segmentation. Table 2. Transfer via fine-tuning on the image classification task. The transfer learning performance of the HARL framework and other self-supervised baselines across six natural image classification datasets, using representations pre-trained on the ImageNet 1000 classes with the standard ResNet-50 backbone. The best result is bolded.

Ablation and Analysis
We study HARL's components to give intuition about its behavior and their impact on performance. We rerun the HARL framework in multiple experiments, holding the same set of hyperparameter configurations and changing only the configuration of the component under investigation. We perform our ablation experiments with the ResNet-50 and ResNet-18 architectures on the ImageNet training set without labels. We evaluate the learned representation with ImageNet linear evaluation during the self-supervised pretraining stage. To do so, we attach a linear classifier on top of the base encoder while blocking gradient flow at the classifier's input, which prevents the label information from influencing and updating the encoder (a similar approach to SimCLR [9]). We run ablations over 100 epochs and evaluate performance on the public validation set of the original ILSVRC-2012 ImageNet [24] in Top-1 accuracy every 100 or 200 steps per epoch, following the protocol described in Appendix B.1.

The Output of Spatial Feature Map (Size and Dimension)
In our HARL framework, separating foreground and background features from the output spatial feature map is essential to maximizing the similarity objective across different augmented views. To verify this hypothesis, we analyze several spatial outputs of various sizes and dimensions by modifying the stride of ResNet kernels to generate different feature map sizes with the same dimension. For illustration, the standard ResNet is a sequence of four convolutional building blocks (conv2_x, conv3_x, conv4_x, conv5_x). For the ResNet-50 architecture, the conv5_x block's output feature map is 7 × 7 × 2048. After changing the kernel stride of the conv4_x block from two to one, the conv5_x output becomes 14 × 14 × 2048. In this modified ResNet-50 architecture, the conv5_x block's spatial feature map size equals that of the conv4_x block's output in the standard architecture.
We conduct the experiment for three different sizes with both a deep ResNet-50 (7 × 7 × 2048, 14 × 14 × 2048, 28 × 28 × 2048) and a shallow ResNet-18 (7 × 7 × 512, 14 × 14 × 512, 28 × 28 × 512). Figure 3 shows how the various output sizes and dimensions in the pretraining stage impact the learned representation when evaluated on ImageNet with the linear evaluation protocol. Both the shallow and deep ResNet architectures yield better learning ability with the larger 14 × 14 output spatial feature map than with 7 × 7. In our experiments, performance decreases as we move to still larger output sizes of 28 × 28 or 56 × 56.
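One way to realize this stride modification, sketched here with torchvision (where layer3 corresponds to conv4_x; the exact mechanism in our code base may differ):

```python
import torchvision.models as models

# Standard ResNet-50: conv5_x (layer4 in torchvision) outputs 7 x 7 x 2048 maps
# for 224 x 224 inputs.
model = models.resnet50(weights=None)

# Setting the stride of the first conv4_x (layer3) bottleneck from 2 to 1 doubles
# the remaining spatial resolution, so conv5_x now outputs 14 x 14 x 2048 maps.
model.layer3[0].conv2.stride = (1, 1)
model.layer3[0].downsample[0].stride = (1, 1)
```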



Objective Loss Functions
The HARL framework structure reuses elements of BYOL [13]. We use two neural networks, denoted the online network and the target network, defined by sets of parameters θ and ξ, respectively. The optimization objective minimizes the loss ℒ(θ, ξ) with respect to the learnable parameters θ, while ξ is parameterized by an exponential moving average of θ with decay rate τ ∈ [0, 1], as shown in Equation (3):

ξ ← τξ + (1 − τ)θ. (3)

Unlike previous approaches that minimize a loss based only on the whole-image latent embedding vectors of the two augmented views, HARL minimizes the distance between object-level latent representations, which are associated with the same spatial regions abstracted from the segmentation mask and thus the same semantic meanings. As shown in Figure 1, we use the mask information to separate the spatial semantic object-level features (foreground and background) of the two augmented views. Then, we minimize their negative cosine similarity, denoted the mask loss in Equation (1). In addition to our mask loss objective, we combine the distance loss of the whole-image representation with the object-level one, resulting in a hybrid loss as described in Equation (4). We study these two loss objectives in the self-supervised pretraining stage and then evaluate the obtained representation with the ImageNet linear evaluation protocol.
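A minimal sketch of the Equation (3) update (the decay rate value is our assumption, borrowed from BYOL's default):

```python
import torch

@torch.no_grad()
def update_target_network(online_net, target_net, tau=0.996):
    """Equation (3): exponential moving average of the online parameters."""
    for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
        p_target.mul_(tau).add_((1.0 - tau) * p_online)
```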

Mask Loss
The mask loss objective minimizes the distance between the foreground and background latent embeddings in vector space, ℒ_foreground(θ, ξ) and ℒ_background(θ, ξ), with the weighting coefficient α as described in Equation (1). We study the impact of α when it is set to a few predefined values and when it varies according to a cosine scheduling rule. In the first approach, we perform self-supervised pretraining sweeping over three values {0.3, 0.5, 0.7}. In the second approach, we schedule α as α ≜ 1 − (1 − α_base) · (cos(πk/K) + 1)/2, gradually increasing it from the starting value α_base to 1 over the current training step k out of the total training steps K. We tried three α_base values: 0.3, 0.5 and 0.7. We report the Top-1 accuracy on the ImageNet linear evaluation set during the self-supervised pretraining stage, as shown in Figure 4. The weighting coefficient α value of 0.7 yields consistent learned representations in both approaches. Furthermore, the experimental results demonstrate that the foreground latent representation is more important than the background one. This is plausible since, in the ImageNet training set, many images exist in which the background occupies more than 50% of the image.
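The schedule is straightforward to implement; a sketch (the same rule is reused for λ in the hybrid loss below):

```python
import math

def alpha_at(step: int, total_steps: int, alpha_base: float = 0.7) -> float:
    """Cosine schedule: alpha rises from alpha_base at step 0 to 1.0 at the last step."""
    return 1.0 - (1.0 - alpha_base) * (math.cos(math.pi * step / total_steps) + 1.0) / 2.0
```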


Hybrid Loss
The hybrid objective combines the whole-image representation embeddings v₁ and v₂ with the object-level mask loss described in Equation (1). v₁ and v₂ are extracted from the two augmented views x and x′ and are defined as v₁ ≜ q_θ(g_θ(f_θ(x))) ∈ ℝ^d and v₂ ≜ g_ξ(f_ξ(x′)) ∈ ℝ^d. The hybrid loss minimizes the negative cosine similarity with weighting coefficient λ:

ℒ_hybrid(θ, ξ) = −λ · ⟨v₁, v₂⟩ / (‖v₁‖₂ · ‖v₂‖₂) + (1 − λ) · ℒ_maskloss(θ, ξ), (4)

where v₁ and v₂ are the whole-image latent representations; ℒ_maskloss(θ, ξ) is the distance loss computed from the foreground and background latent representations described in Equation (1); ‖·‖₂ is the ℓ2-norm; and λ is a weighting coefficient in the range [0, 1].
To study the impact of the weighting coefficient λ, we use a cosine scheduling rule similar to that of α in the mask loss section. In our experiments, the cosine-scheduled λ sweeps over four λ_base values {0.3, 0.6, 0.7, 0.9}. We report the Top-1 accuracy of the ImageNet linear evaluation protocol on the validation set during the self-supervised pretraining stage in Figure 5. We found that a λ_base value of 0.7 obtains consistent learned representations when transferring to downstream tasks.


Mask Loss versus Hybrid Loss
We compare the representations obtained using the mask loss and the hybrid loss in self-supervised pretraining. To do so, we implement the HARL framework with both loss objectives. We use the cosine schedule function to control the weighting coefficients α and λ, sweeping over three initial values {0.3, 0.5, 0.7} for both coefficients. We evaluate the representation of the pre-trained ResNet-50 encoder in Top-1 and Top-5 accuracy (in %) with the ImageNet linear evaluation protocol, as shown in Table 4. According to the experimental results, the hybrid loss, which incorporates both global and object-level latent representations, yields better representation learning during self-supervised pretraining.

The Impact of Heuristic Mask Quality
In our work, the HARL objective uses two different image segmentation techniques. Which one leads to the best representation? We first consider the heuristic masks obtained from the computer vision DRFI [44] approach, varying its two hyperparameters (the Gaussian filter variance σ and the minimum cluster size s) as described in detail in Appendix C.1. In our implementation, we generate a diverse set of binary masks from different combinations of σ ∈ {0.2, 0.4, 0.8} and s ∈ {1000, 1500}. The sets of generated masks are shown in Figure 6. We found that the setting σ = 0.8 and s = 1000 generates more stable mask quality than the other combinations. For the deep learning technique, we use a pre-trained deep convolutional neural network as the feature extractor and design a saliency prediction head on top of the feature extractor's output representation in three steps, described in Appendix C.2. The generated masks depend on the pixel saliency threshold, which determines the foregroundness or backgroundness of each pixel. In our implementation, we tested saliency threshold values in {0.4, 0.6, 0.7}, as shown in Figure 7, and chose a threshold of 0.5 for generating masks on the ImageNet dataset. After choosing the best configuration for each of the two techniques, we generate masks for the whole ImageNet [24] training set. We evaluate the quality of the generated masks by computing the mean Intersection-over-Union (mIoU) against the ImageNet ground-truth masks annotated by humans from Pixel-ImageNet [60]. The deep learning masks achieve an mIoU of 0.485 versus 0.398 for the DRFI masks on a subset of 0.485 million images (946/1000 ImageNet classes). We found that in complex scenes, where multiple objects exist in a single image, the masks generated by the DRFI technique are noisier and less accurate than the deep learning masks, as illustrated in Figure 8.
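The mIoU comparison can be computed as in the following sketch (a straightforward implementation of the standard metric; mask arrays and their pairing are assumed to be prepared by the caller):

```python
import numpy as np

def mean_iou(pred_masks, gt_masks):
    """Mean Intersection-over-Union between predicted and ground-truth binary masks."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        pred, gt = pred.astype(bool), gt.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            continue  # skip pairs where both masks are empty
        ious.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(ious))
```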
To fully evaluate the impact of representation learning on downstream performance, we inspect the obtained representational quality via transfer learning performance on object detection and segmentation, shown in Table 5. The results indicate that, for most object detection and segmentation tasks, HARL trained with deep learning masks outperforms the variant trained with DRFI masks, although the difference is very small. This shows that the quality of the mask used for HARL has only a small impact on downstream task performance.

Conclusions and Future Work
We introduce HARL, a new self-supervised visual representation learning framework that leverages visual attention via a heuristic binary mask. As a result, HARL captures higher-quality semantic information that considerably improves the representation learning of self-supervised pretraining compared to previous state-of-the-art methods [9,13,17,18] on semi-supervised and transfer learning benchmarks. The two main advantages of the proposed method are: (i) the early attention mechanism can be applied across different natural image datasets because we use unsupervised techniques to generate the heuristic mask and do not rely on external supervision; (ii) the entire framework can be transferred and adapted quickly to either self-supervised contrastive or non-contrastive learning frameworks. Furthermore, our method can support and accelerate the current self-supervised learning direction on pixel-level objectives: our object-level abstraction can make such objectives more efficient than existing work based on computing pixel distances [61].
In our HARL framework, the heuristic binary mask is critical. However, the remaining challenge is that accurate mask estimation currently suits datasets with one primary object, such as ImageNet. An alternative for complex datasets containing multiple objects is to mine object proposals from the image by producing heuristic semantic segmentation masks. Designing a new self-supervised framework that resolves this remaining challenge of multi-object datasets is an essential next step and an exciting research direction for our future work.

Acknowledgments:
The authors would like to thank the following people for their help throughout the process of building this study: Timothy Ko for helping create the heuristic binary masks with the DRFI technique for the ImageNet ILSVRC-2012 dataset, and Kai-Lin Yang, who set up the virtual machines for the experiments and gave suggestions during the development of the project.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Implementation Detail
Appendix A.1. Implementation of Data Augmentation
The HARL data augmentation pipeline starts with the standard inception-style random cropping [62]. These cropped views are further transformed using the same set of image augmentations as in SimCLR [9], consisting of an arbitrary sequential composition of transformations (color distortion, grayscale conversion, Gaussian blur, solarization).
Each RGB image and its corresponding heuristic binary mask are transformed through the augmentation pipeline composed of the operations described below (a code sketch of the paired geometric transforms is given below). First, we apply random cropping with resizing and random flipping. The binary masks receive only the cropping and flipping applied to their corresponding RGB images. Then, the cropped images are transformed, each with a given probability, by color distortion (color jittering, color dropping), random Gaussian blur and solarization.
1. Random cropping with resizing: a random patch of the image is selected. In our pipeline, we use inception-style random cropping [62], whose crop area is uniformly sampled in [0.08, 1.0] of the original image and whose aspect ratio is logarithmically sampled in [3/4, 4/3]. The patch is then resized to 224 × 224 pixels using bicubic interpolation;
2. Random flipping: the image is optionally flipped horizontally;
3. Color jittering: the brightness, contrast, saturation and hue are shifted by a uniformly distributed offset;
4. Optional color dropping: the RGB image is replaced by its greyscale values;
5. Random Gaussian blur and solarization, applied with the probabilities given in Table A1.
The two augmented views x, x′ and the mask pair m, m′ result from augmentations sampled from distributions T, T′, M and M′, respectively. These distributions apply the primitives described above with different probabilities and magnitudes, as shown in Table A1. These parameters are inherited from the BYOL framework [13] without modification.
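As an illustration of applying identical geometric transforms to an image and its mask, here is a minimal torchvision-based sketch (a simplification of the pipeline above; the probabilities and magnitudes from Table A1 are not reproduced):

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode, RandomResizedCrop

def paired_crop_flip(image, mask, size=224):
    """Apply the same inception-style crop and horizontal flip to image and mask.

    Photometric transforms (color jitter, blur, solarization) would afterwards
    be applied to the image only.
    """
    i, j, h, w = RandomResizedCrop.get_params(image, scale=(0.08, 1.0),
                                              ratio=(3 / 4, 4 / 3))
    image = TF.resized_crop(image, i, j, h, w, [size, size], InterpolationMode.BICUBIC)
    mask = TF.resized_crop(mask, i, j, h, w, [size, size], InterpolationMode.NEAREST)
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    return image, mask
```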

Appendix A.2. Implementation of the Masking Feature
The masking feature step of the HARL framework is essential to leverage the object-level information from the heuristic binary mask. The masking feature method is composed of three steps. The first step takes the 7 × 7 × 2048 spatial feature map output, which is the final layer before the global average pooling of the ResNet architecture. Second, in our training loop design, the mask image is directly resized to 7 × 7 × 3 to match the size of the output spatial feature maps without passing through the encoder. The resized mask then indexes the features: one encodes the foreground and zero encodes the background. Finally, we multiply the indexing mask with the spatial feature maps to separate the foreground and background features (the corresponding output is two 7 × 7 × 2048 spatial feature maps, one for the foreground and one for the background features). These two spatial feature map outputs then undergo global average pooling and are further reduced in dimension with the non-linear multi-layer perceptron (MLP) architecture.

Appendix B. Evaluation on the ImageNet and Transfer Learning
Appendix B.1. Linear Evaluation and Semi-Supervised Protocol on ImageNet
Our data preprocessing procedure is as follows. At training time, the images receive simple augmentation strategies, including random flips and random crops with resizing to 224 × 224 pixels. At testing time, all images are resized to 256 pixels along the shorter side using bicubic resampling, followed by a 224 × 224 center crop. At both training and testing time, the images are normalized per color channel by subtracting the average color and dividing by the standard deviation computed on ImageNet ([9,13] provide a similar data processing pipeline).
Linear evaluation: We train a linear classifier on top of the frozen pre-trained encoder representation without updating the network parameters or the batch statistics. In design and configuration, we follow the standard ImageNet protocol as in [9,51,54,55]. To train and optimize the linear classifier, we use the SGD optimizer with Nesterov momentum to optimize the cross-entropy loss over 100 epochs, using a batch size of 1024 and a momentum of 0.9, without regularization methods such as weight decay or gradient clipping [63]. We report accuracy on the test set (the public validation set of the original ILSVRC-2012 ImageNet [24] dataset).
Semi-supervised evaluation: We fine-tune the network parameters of the pre-trained encoder representation following the semi-supervised learning protocol and procedure in [9]. Data preprocessing and augmentation strategies at training and testing time for the 1% and 10% subsets follow the linear evaluation procedure (described in Appendix B.1), except that we use a larger batch size of 2048 and train over 60 epochs for 1% labeled data and 30 epochs for 10% labeled data. In Table 1 of Section 5.1, we report the results of fine-tuning the representation over the 1% and 10% ImageNet splits from [9] with the ResNet-50 (×1) architecture.
Datasets: We followed previous works [9,13] in transferring the representation for linear classification and fine-tuning on six different natural image datasets, namely Food-101 [64], CIFAR-10 and CIFAR-100 [65], the SUN397 scene dataset [66], Stanford Cars [67] and the Describable Textures Dataset (DTD) [68]. Details of each dataset are described in Table A2. We use the training and validation sets specified by the dataset creators to select hyperparameters. For datasets without a test set or validation set, we use the validation examples as a test set or hold out a subset of the training examples as the validation set, as described in Table A2.

Standard evaluation metrics:
To evaluate HARL transfer learning on different datasets and other vision tasks, we use the standard evaluation metrics of each dataset to assess and benchmark our results, namely Top-1, AP, AP50 and AP75:
• Top-1: the proportion of correctly classified examples;
• AP, AP50 and AP75: the average precision as defined in [56].

Appendix B.2. Transfer via Linear Classification and Fine-Tuning
Transfer linear classification: We initialize the network with the pre-trained parameters and freeze the encoder without updating the network parameters or the batch statistics. The standard linear evaluation protocol follows [9,51,55]. At both training and testing time, the images are resized to 224 × 224 along the shorter side using bicubic resampling and then normalized with ImageNet statistics (subtracting the average color and dividing by the standard deviation per channel), without data augmentation. We train a regularized multinomial logistic regression classifier on top of the frozen representation, optimizing the cross-entropy loss with ℓ2-regularization and selecting the regularization parameter from a range of 45 logarithmically spaced values between 10⁻⁶ and 10⁵ (similar to the optimization procedure of [13]). The model is retrained on the training and validation sets combined, and the model's accuracy is reported on the test set.
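For reference, the regularization grid can be generated as follows (a trivial sketch; the classifier training loop is omitted):

```python
import numpy as np

# 45 logarithmically spaced l2-regularization strengths between 1e-6 and 1e5,
# swept on the validation set before retraining on train + validation.
l2_grid = np.logspace(-6, 5, num=45)
```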
Transfer fine-tuning: We follow the fine-tuning protocol of [9,51,69], initializing the network with the parameters of the pre-trained representation. At both training and testing time, we follow the same image preprocessing and data augmentation strategies as the linear evaluation procedure in Appendix B.1. To fine-tune the network, we optimize the cross-entropy loss using the SGD optimizer with a Nesterov momentum value of 0.9, training over 20,000 steps with a batch size of 256. We set the hyperparameters, including the momentum parameter for batch statistics, the learning rate and the weight decay, with the same selection method as in [9,13]. After selecting the optimal hyperparameter configuration on the validation set, the model is retrained on the combined training and validation sets with the selected parameters. The absolute accuracy is reported on the test set.