Revisiting Low-Resolution Images Retrieval with Attention Mechanism and Contrastive Learning

Abstract: Recent empirical works reveal that visual representations learned by deep neural networks can be successfully used as descriptors for image retrieval. A common technique is to leverage pre-trained models to learn visual descriptors by ranking losses and fine-tuning with labeled data. However, the performance of retrieval systems decreases significantly when the query images have a lower resolution than the training images. This study considers a contrastive learning framework, fine-tuned on features extracted from a pre-trained neural network encoder equipped with an attention mechanism, to address low-resolution image retrieval. Our method is simple yet effective since the contrastive learning framework drives similar samples close to each other in feature space by manipulating variants of their augmentations. To benchmark the proposed framework, we conducted quantitative and qualitative analyses on the CARS196 (mAP = 0.8804), CUB200-2011 (mAP = 0.9379), and Stanford Online Products (mAP = 0.9141) datasets.


Introduction
Even though high-quality images are common, we often obtain degraded or low-resolution images because of poor focus or because the picture was taken from afar. Retrieval from low-quality images is therefore needed in real applications, yet retrieval results for a degraded image are usually poor. In this paper, we propose a deep-learning-based framework that can deal directly with low-resolution images. This study implemented three main modules for solving category image retrieval on low-resolution samples. First, an attention-based encoder network was employed to extract meaningful visual representations of images. Second, we used a contrastive learning framework to obtain embeddings for retrieval. The purpose of contrastive learning is to find consistent representations of different-resolution views augmented from the same source. Third, the model was trained end-to-end, including its encoder network and projection head, with multiple loss functions: a contrastive loss for maximizing the agreement between different-resolution versions of an identical image, a cross-entropy loss for classification, and a triplet loss for maximizing the distance of negative pairs (different categories) and minimizing the distance of positive pairs (same category).
This section presents a brief review of previous studies on image retrieval, the attention mechanism, and contrastive learning. To show the importance of this study, we examined some failure cases when using low-resolution images as queries for image retrieval, as shown in Figure 1. These examples highlight that low-resolution images are inferior for feature matching despite the use of a powerful pre-trained model.
Figure 1. Failure cases when querying a retrieval system using low-resolution images. The base model is EfficientNet-b7 with features extracted from the last convolutional layer. The first image of each row is the query, and the next five images are the top 5 retrieval results, with labels of some images different from the label of the query image. Images in the first two rows belong to the CUB200-2011 dataset, while those of the last two rows belong to the Stanford Online Products dataset. First row: 1st image is the query image from class "mourning warbler", 2nd image from class "magnolia warbler", 3rd image from class "yellow-throated vireo", 4th image from class "white-crowned sparrow", 5th image from class "blue-headed vireo", 6th image from class "yellow-throated vireo". Second row: 1st image is the query image from class "Clark nutcracker", 2nd image from class "Western Kingbird", 3rd image from class "song sparrow", 4th image from class "American pipits", 5th image from class "blue-headed vireo", 6th image from class "scissor-tailed flycatcher". Third row: 1st image is the query image from class "stapler", 2nd image from class "lamp", 3rd image from class "cabinet", 4th and 5th images from class "stapler", 6th image from class "kettle". Fourth row: 1st image is the query image from class "lamp", 2nd and 3rd images from class "cabinet", 4th image from class "lamp", 5th and 6th images from class "chair".
Image Retrieval: Image descriptors based on deep convolutional neural networks (CNNs) have been used as primary descriptors in various computer vision tasks, such as classification, semantic segmentation, and especially image search [1,2]. Studies on image retrieval have progressively developed various methods to compress spatial feature maps into a vector-form descriptor. There is growing interest in how to construct image descriptors. Outputs from the final fully connected layers [1,3], the most activated convolutions [2], and generalized pooling of convolutions [4] have been utilized as image descriptors. Each descriptor emphasizes different properties, such as concentrating on informative regions or on regions with a large receptive field. Consequently, modern methods have adopted ensemble techniques to boost system performance. Early fusion blends descriptors across layers and trains an integrated model end-to-end [3,5,6,7], while late fusion combines individual models and features from multiple learners to form a compact global descriptor [8]. Recently, the attention mechanism has attracted growing interest from researchers in this field.
Attention Mechanism: One of the current trends in designing neural network architectures is the attention mechanism [9][10][11]. An attention model integrates the concept of relevance: it focuses only on the parts of a given input that are useful for the task at hand and ignores irrelevant details, which helps to achieve compelling performance. Approaches that tackle the image retrieval problem with attention structures include those introduced in [12][13][14][15].
Appl. Sci. 2021, 11, 6783
This study employed a simple contrastive learning framework [16] for image retrieval and investigated an effective architecture for the feature extractor. We found that the Visual Transformer (ViT) [11] model is especially effective for visual representation learning. Inspired by the Transformer architecture [10] from the natural language processing domain, ViT was proposed as a promising model that naturally integrates a self-attention mechanism to solve computer vision problems. Self-attention is introduced to visual tasks to capture the correlation among image regions: a high attention score between two visual patches indicates a strong relation, and a low score indicates a weak one. Although ViT is not the first method to implement self-attention for visual tasks [17][18][19], it is remarkable for its strong results, its efficiency on hardware accelerators, and its simple implementation. When pre-trained on large-scale datasets and transferred to multiple recognition benchmarks, ViT outperforms state-of-the-art convolution-based neural networks [11,20,21]. However, Transformers lack some of the inductive biases of CNNs, such as translation equivariance, and thus training on sufficient amounts of data is recommended. On the other hand, the self-attention mechanism retains the receptive field properties of CNNs and considers wide regions even in low layers.
Contrastive learning: Learning visual representations is mostly a label-driven task, where learnable feature extractors are trained to optimize objective functions that involve the labels of samples, such as categories or pairs of negative and positive samples. The success of such tasks requires large amounts of labeled data [22][23][24], which are not always available and are often very expensive to acquire. Unsupervised visual representation learning, by contrast, remains an underexplored area in computer vision research. Recently, considerable research effort has been put into methods that enhance vision systems without requiring a large amount of full supervision. In particular, this effort is characterized by advances in self-supervised learning with a contrastive loss function [25][26][27]. Self-supervised learning frameworks formulate pretext tasks that leverage unlabeled data to learn high-level semantic visual representations useful for the downstream task of interest. For example, pretext tasks such as predicting the orientation of a rotated image [28], filling in a missing patch [29], or jigsaw re-ordering [30] are beneficial for downstream tasks such as recognition and semantic segmentation because high-level concepts of objects (e.g., shape and texture) are encoded when solving the pretext task. More precisely, the underlying concept of self-supervised representation learning is the maximization of the mutual information between different views of the data [31][32][33].
The main theme of contrastive learning is an instance discrimination task. An image and its augmentation are taken to be in the same class (positives), and all other images are considered to be of different classes (negatives). Notably, the contrastive loss objective and aggressive data augmentation are two further key factors in the success of self-supervised representation learning [16,32]. In addition, Ref. [34] showed that the contrastive loss encourages consistent representations of augmented views and matches a prior distribution. There are two main practical advantages of contrastive learning. First, the agreement is estimated only between the learned representations of various views, which lie in a lower-dimensional space than the original data. Second, various views can be chosen to capture different aspects and modalities of the data with plenty of modeling flexibility [35,36]. These properties can be especially beneficial for the feature matching that underlies retrieval tasks.
Contribution: Our research aimed to find a solution for the challenging problem of category image retrieval on low-resolution images. The problem can be framed as a general question about learning effective visual representations in an embedding space where similar images with different resolutions are kept close to each other and dissimilar ones are placed far apart. In this paper, we present a framework that consists of contrastive learning trained over the Visual Transformer (ViT) encoder. Contrastive learning is expected to maximize, via the contrastive loss, the agreement of positive pairs, which are samples from the same class or different-resolution augmentations of the same source. However, training a contrastive framework from scratch is usually a huge challenge since it requires large-batch training for a long period [16]. We therefore used a powerful pre-trained encoder to extract visual representations and fine-tuned the contrastive learning framework to learn embeddings for feature matching. Our contributions can be summarized as follows:

1. We adapted the Visual Transformer to the image retrieval task when the embedded vectors were calculated using attention weights. The main advantage of this method is that the attention mechanism of the ViT model helps to focus more on an object of interest when comparing two images.
2. We addressed the problem of retrieval with degraded samples such as low-resolution. We proposed using a contrastive learning framework to learn an embedded space where the same samples are close together with respect to Euclidean distance.
3. We conducted extensive experiments on CARS196, Stanford Online Products, and CUB200-2011 datasets under various circumstances. Both quantitative and qualitative results show that the proposed framework is efficient.

Materials and Methods
This study introduces an effective yet simple framework for image retrieval. The feature descriptor is extracted from a backbone network and then passes through a projection module to become an embedded vector for retrieval. We study the behavior of the output representation space when training with a contrastive loss, in particular how augmentation affects the properties of the space and the performance of image retrieval on low-resolution inputs. Our framework is illustrated in Figure 2. This section describes the three modules of our framework. A feature extractor module extracts a visual representation of a given image, while a projection head trained in a contrastive manner maps the visual representation to an embedding space in which the similarity of samples can be calculated. Finally, we introduce an auxiliary module with a classification loss and a triplet loss, which significantly enhances category retrieval performance.
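The triplet term of the auxiliary module can be sketched as follows. This is a minimal illustration that assumes the common hinge formulation with Euclidean distance; the margin of 0.2 and the function names `euclidean` and `triplet_loss` are illustrative assumptions, not values or code taken from this study.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss; the margin value is an illustrative assumption."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)


# Toy embeddings (made-up numbers): the positive is close, the negative is far.
anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
print(triplet_loss(anchor, positive, negative))  # 0.0, the triplet is already satisfied
```

The loss is zero once the negative is farther from the anchor than the positive by at least the margin, so only violating triplets contribute to the gradient.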

Figure 2. The overall architecture of our proposed framework. The framework is described using five major computing steps. (1) An input image is augmented into two views with different resolutions for contrastive learning. (2) Two augmented images go through a shared encoder module to extract representation vectors. (3) A projection head is added on top of the encoder model to perform non-linear mapping into an embedding space so that (4) contrastive and triplet loss can be established to maximize the similarity of positive pairs and dissimilarity of negative pairs. (5) We add a linear layer with softmax activation followed by batch normalization to calculate all samples' class probability from two branches.

Feature Extractor
We use the Visual Transformer [11] model to extract a visual representation of given images. The main component of the ViT model is the self-attention encoder module, which implements the Transformer architecture [10] in the most standard way. Following the original version of ViT, our feature extractor involves three main steps.
Patch embedding: We split an image into a sequence of patches and map each patch to a $D$-dimensional embedding space. Precisely, we pass an image $x \in \mathbb{R}^{C \times W \times H}$ through $D$ 2D convolutions with a kernel size of $P \times P$ and a stride of $P$, resulting in a feature map of size $D \times N \times N$, and then flatten the feature map into a sequence of $N^2$ latent vectors of constant size $D$. In this configuration, $N^2 = H \times W / P^2$ is the number of embedded patches, where $(H, W)$ is the resolution of the original image and $(P, P)$ is the resolution of an image patch. Apart from the embedded patches, the ViT model adds an extra learnable class embedding for classification tasks, and the results obtained from using this class token are referred to as "ViT-class" in our study. To maintain the position of the patches after flattening, we follow the standard approach of adding a learnable 1D positional embedding to each patch; this positional embedding does not share weights across patches.
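As a quick check of the patch arithmetic above, the following sketch computes the number of embedded patches. The 224 × 224 input and the 16 × 16 patch size are illustrative assumptions, since the text does not state the values used in this study, and `num_patches` is a hypothetical helper name.

```python
def num_patches(height: int, width: int, patch: int) -> int:
    """Number of non-overlapping P x P patches, N^2 = H * W / P^2."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Illustrative values only (not taken from the paper).
n_squared = num_patches(224, 224, 16)
print(n_squared)  # 196 patches, i.e. N = 14
```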
Encoder: The encoder is a computational block consisting of a multi-head attention module [10] and an MLP with two consecutive linear layers. Both the input and the output of the encoder module are sequences of embedded vectors. LayerNorm is applied before feeding the embedded vectors into the attention module and into the MLP. The multi-head attention module expands the model's ability to jointly focus on different positions, providing different representation subspaces of the triplet (query $Q$, key $K$, value $V$) from different attention heads:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad (1)$$
where each head is a context vector from scaled dot-product attention,
$$\mathrm{head} = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V. \qquad (2)$$
$Q$, $K$, and $V$ represent a query, key, and value, respectively, calculated inside the transformer architecture, which encodes information from the image's patches with the self-attention mechanism so that the patches mutually attend to each other. Here $h$ is the number of attention heads, $W^{O}$ is the output projection matrix of [10], and $d$ is the dimension of the patch embeddings; Equation (2) employs $d$ to scale the attention scores. The encoder module is illustrated in Figure 3.
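The scaled dot-product attention of a single head can be sketched in a few lines. This is a minimal illustration in plain Python under stated assumptions: it omits the learned query, key, and value projections and the multi-head concatenation, and the toy inputs and function names are made up for the example.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of row vectors.

    Returns (context, weights), where weights[i][j] is how much
    patch i attends to patch j.
    """
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K]
              for qi in Q]
    weights = [softmax(r) for r in scores]
    context = [[sum(w * v[c] for w, v in zip(wi, V)) for c in range(len(V[0]))]
               for wi in weights]
    return context, weights


# Toy example: three patches with 2-dimensional embeddings (made-up numbers).
Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, weights = attention(Q, K, V)
```

Each row of `weights` sums to one, so every output vector in `context` is a convex combination of the value vectors.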

Visual descriptor: We develop a novel visual descriptor for the corresponding image based on the attention mechanism. Our technique arises naturally from the attention maps of the ViT model. We generate visual descriptors by combining the final embeddings of informative regions selected by ranking attention weights.
Let $z^{(i)} = [z^{(i)}_1, \ldots, z^{(i)}_{N^2}]$ be the list of embedded patches resulting from the $i$th encoder. The first list of embedded patches, $z^{(0)}$, is the sum of the linear projections and the positional embeddings described above. The $i$th embedding layer is the output of the $i$th encoder module, whose input is the $(i-1)$th embedding layer: $z^{(i)} = \mathrm{encoder}(z^{(i-1)})$. Let $a_p$ be the joint self-attention weight of a local patch $x_p$, $p = 1, \ldots, N^2$; then our visual descriptor for an image $x$ is the rank-weighted sum of its patch embeddings,
$$\mathrm{descriptor}(x) = \sum_{k=1}^{K} \bar{a}_k \, \bar{z}_k, \qquad (3)$$
where $K$ is the desired rank, $\bar{a} = \sigma([a_1, \ldots, a_{N^2}])$ is a permutation of the list of joint attention weights, and $\bar{z} = \sigma([z_1, \ldots, z_{N^2}])$ is the corresponding list of embedded patches output by the final encoder, sorted in the same order. When $\sigma$ sorts from greatest to least, our visual descriptor is the weighted sum of the $K$ most attentive regions.
According to the ViT architecture, each encoder block has its own self-attention maps from the multi-head attention. The self-attention mechanism allows ViT to integrate information across the entire image, even in the early layers. In the early layers, some heads consistently focus on small areas, while others attend to most parts of the image, indicating that the ability to integrate information globally is already in use in the early layers. Meanwhile, the attention regions of all heads tend to be wider in higher layers, showing that the model aims to capture global information at higher layers. This ability is analogous to the receptive field concept in CNNs, which is the strength of convolutional layers capable of integrating both local and global information.
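The ranking step of Equation (3) can be sketched as follows, taking the joint attention weights as given. This is a minimal illustration with made-up numbers; `topk_descriptor` is a hypothetical helper name, and no normalization of the selected weights is applied because the text does not mention one.

```python
def topk_descriptor(weights, embeddings, k):
    """Weighted sum of the k patch embeddings with the largest attention weights."""
    if not 1 <= k <= len(weights):
        raise ValueError("k must be between 1 and the number of patches")
    # Sort patch indices by attention weight, greatest first (the permutation sigma).
    order = sorted(range(len(weights)), key=lambda p: weights[p], reverse=True)
    dim = len(embeddings[0])
    descriptor = [0.0] * dim
    for p in order[:k]:
        for c in range(dim):
            descriptor[c] += weights[p] * embeddings[p][c]
    return descriptor


# Toy example: four patches with 2-dimensional embeddings (made-up numbers).
a = [0.1, 0.5, 0.3, 0.1]
z = [[1.0, 0.0], [0.0, 2.0], [4.0, 0.0], [9.0, 9.0]]
print(topk_descriptor(a, z, k=2))  # [1.2, 1.0]
```

With k = 2, only the patches with weights 0.5 and 0.3 contribute, so the low-weight fourth patch is ignored even though its embedding has the largest magnitude.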
Instead of considering the correlation among patches of an image, we investigate which parts of the image should be attended to and extract their corresponding embedded patches. As highlighted in Equation (3), the embedded patches are output from the final encoder block, but the attention weights are jointly measured across the multi-head attention modules of all layers.
Let $A^{(i)} \in [0, 1]^{h \times N^2 \times N^2}$ be the attention map calculated inside the multi-head attention module of the $i$th encoder, where $h$ is the number of attention heads and $N^2$ is the number of patches, as mentioned earlier. $A^{(i)}$ is thus a collection of $h$ attention maps, each a square matrix of size $N^2 \times N^2$ whose coefficients estimate the degree of attention between two patches. To derive the joint attention map, we first average the attention maps of the different heads to obtain a single attention map for the layer. To account for residual connections, we add an identity matrix to the attention map and re-normalize the weights,
$$\hat{A}^{(i)} = \mathrm{normalize}\!\left(\frac{1}{h}\sum_{k=1}^{h} A^{(i)}_{k} + I_{N^2 \times N^2}\right), \qquad (4)$$
where $\hat{A}^{(i)}$ is the normalized average attention map from the $i$th layer, $I_{N^2 \times N^2}$ is an identity matrix of size $N^2 \times N^2$, and $\mathrm{normalize}(X) = X / \sum_{u,v} X_{uv}$. Finally, the joint attention map is obtained by multiplying the attention maps across all layers,
$$A_{\mathrm{joint}} = \prod_{i=1}^{L} \hat{A}^{(i)}, \qquad (5)$$
where $A_{\mathrm{joint}} \in [0, 1]^{N^2 \times N^2}$ and $L$ is the number of encoder blocks. To generate the visual descriptors, we require only the attention weights assigned to image patches; these are the weights with which patches attend to themselves, and they can be obtained by extracting the diagonal of the joint attention matrix, $a = [a_1, \ldots, a_{N^2}] = \mathrm{diagonal}(A_{\mathrm{joint}})$. The attention weights derived here are exactly the weights involved in Equation (3). The aforementioned method is referred to as attention rollout and was introduced in [11]. Figure 4 illustrates our approach to extract visual representations from the ViT model.
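The rollout procedure described above (head averaging, identity addition, re-normalization, layer-wise product, diagonal extraction) can be sketched in plain Python. This is a minimal illustration on made-up attention maps; the function names are hypothetical, and the order in which layer maps are multiplied is an assumption, since the text only states that the maps are multiplied across layers.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def normalized_layer_map(heads):
    """Average the per-head maps, add the identity, and rescale to sum to 1."""
    n, h = len(heads[0]), len(heads)
    avg = [[sum(head[r][c] for head in heads) / h + (1.0 if r == c else 0.0)
            for c in range(n)] for r in range(n)]
    total = sum(sum(row) for row in avg)
    return [[v / total for v in row] for row in avg]


def joint_attention_weights(layers):
    """Multiply the normalized maps of all layers and return the diagonal."""
    joint = normalized_layer_map(layers[0])
    for heads in layers[1:]:
        # Later layers are applied on the left (an assumed multiplication order).
        joint = matmul(normalized_layer_map(heads), joint)
    return [joint[p][p] for p in range(len(joint))]


# Toy input: 2 layers, 2 heads each, 2 patches (made-up attention maps).
layers = [
    [[[0.9, 0.1], [0.4, 0.6]], [[0.7, 0.3], [0.2, 0.8]]],
    [[[0.5, 0.5], [0.5, 0.5]], [[0.6, 0.4], [0.1, 0.9]]],
]
a = joint_attention_weights(layers)
```

The resulting list `a` holds one weight per patch and can be passed directly to a ranking step such as the one in Equation (3).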

Contrastive Learning Framework
We train projection heads on top of the ViT model in a contrastive way to maximize the agreement of positive pairs in an embedding space. The contrastive learning framework can be decomposed into four modules: data augmentation, backbone network, projection head, and contrastive objective function.
Data augmentation: This study concentrates on low-resolution image retrieval, and thus the augmentation method applied here varies only the resolution of the input. A stochastic data augmentation module transforms the input data randomly, resulting in two correlated views of the same sample with different resolutions. In this work, we apply random cropping followed by a low-pass filter. The low-pass filter is a Gaussian blur with a fixed kernel size and a random standard deviation.
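The augmentation described above can be sketched as a random crop followed by a separable Gaussian blur with a random standard deviation. This is a minimal illustration on a 2D grayscale image stored as a list of rows; the crop size, kernel size of 5, and sigma range of 0.1 to 2.0 are illustrative assumptions, not settings reported in this study.

```python
import math
import random


def gaussian_kernel(size, sigma):
    """1D Gaussian kernel of odd length `size`, normalized to sum to 1."""
    half = size // 2
    vals = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-half, half + 1)]
    total = sum(vals)
    return [v / total for v in vals]


def blur_1d(row, kernel):
    """Convolve one row with the kernel, clamping indices at the borders."""
    half, n = len(kernel) // 2, len(row)
    return [sum(w * row[min(max(i + j - half, 0), n - 1)] for j, w in enumerate(kernel))
            for i in range(n)]


def gaussian_blur(img, size, sigma):
    """Separable Gaussian blur of a 2D grayscale image (list of rows)."""
    kernel = gaussian_kernel(size, sigma)
    rows = [blur_1d(r, kernel) for r in img]
    cols = [blur_1d(list(c), kernel) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def random_crop(img, out_h, out_w, rng):
    """Cut a random out_h x out_w window from the image."""
    top = rng.randint(0, len(img) - out_h)
    left = rng.randint(0, len(img[0]) - out_w)
    return [row[left:left + out_w] for row in img[top:top + out_h]]


def two_views(img, crop=(6, 6), kernel_size=5, sigma_range=(0.1, 2.0), rng=random):
    """Two random crops of the same image, each blurred with its own random sigma."""
    views = []
    for _ in range(2):
        view = random_crop(img, crop[0], crop[1], rng)
        views.append(gaussian_blur(view, kernel_size, rng.uniform(*sigma_range)))
    return views


# Toy 8 x 8 image with made-up intensities.
image = [[float(r * 8 + c) for c in range(8)] for r in range(8)]
view_a, view_b = two_views(image, rng=random.Random(0))
```

A larger sigma removes more high-frequency detail, so the two views imitate the same content captured at different effective resolutions.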
Base encoder: The input to the base encoder network is the augmented data, and its output is a representation vector. We mainly use ViT as our base encoder model; however, we also experiment with other encoder networks, such as BiT [20] and EfficientNet [21]. In the case of the BiT and EfficientNet encoders, we take the output of the final convolutional layer before the fully connected layers and then apply adaptive average pooling to obtain a 1-D representation vector. This study uses z = Encoder(x) as a notation for the representation vector. The dimension of the representation vector therefore varies with the encoder architecture. The base encoder is visualized as two "Encoder" blocks in Figure 2.
Projection head: The projection head consists of a non-linear mapping that maps representation vectors into an embedded space where the similarity between samples can be measured. As suggested in [16], we use an MLP with one hidden layer to formulate the projection head. This study uses e = Projector(z) as a notation for the embedded vector. The projection head is depicted in Figure 2 as "Projector" blocks with a hidden layer. The size of the hidden layer is the same as the number of dimensions of embedded vectors.
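The projection head described above can be sketched as a forward pass in NumPy. This is only an illustration under stated assumptions: the ReLU activation, the random initialization, and the names `init_projector` and `project` are hypothetical and not taken from the original implementation. The representation size of 768 corresponds to the hidden size of ViT-B, and the output is normalized to unit length, as the retrieval phase assumes.

```python
import numpy as np

def init_projector(rep_dim, embed_dim, seed=0):
    """Weights of a one-hidden-layer MLP; the hidden size equals embed_dim."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, rep_dim ** -0.5, size=(rep_dim, embed_dim)),
        "b1": np.zeros(embed_dim),
        "W2": rng.normal(0.0, embed_dim ** -0.5, size=(embed_dim, embed_dim)),
        "b2": np.zeros(embed_dim),
    }

def project(z, params):
    """Map representation vectors z (batch, rep_dim) to unit-norm embeddings e."""
    # ReLU is an assumption; the text only says "non-linear mapping".
    hidden = np.maximum(z @ params["W1"] + params["b1"], 0.0)
    e = hidden @ params["W2"] + params["b2"]
    return e / np.linalg.norm(e, axis=1, keepdims=True)

# Stand-in for class-token features of four images from a ViT-B encoder.
z = np.random.default_rng(1).normal(size=(4, 768))
e = project(z, init_projector(rep_dim=768, embed_dim=128))
```

Here `e` plays the role of e = Projector(z), with one 128-dimensional row per input image.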
Contrastive objective function: A contrastive function is defined so that minimizing it maximizes the agreement between positive pairs; in other words, it pulls similar samples closer to each other and pushes dissimilar samples apart. We measure the similarity within a minibatch of N random samples. Following the setting of the data augmentation module, two different resolution versions of each original input are generated inside a minibatch, resulting in 2N data points. Within a multiview minibatch, let i ∈ I ≡ {1, 2, ..., 2N} be the index of an arbitrary augmented sample, and let j(i) ∈ I be the index of the other augmented sample obtained from the same source sample. In self-supervised contrastive learning [16], the loss function for a positive pair of examples (i, j(i)) is defined as

$$\ell_{i,j(i)} = -\log \frac{\exp\left(e_i \cdot e_{j(i)} / \tau\right)}{\sum_{a \in A(i)} \exp\left(e_i \cdot e_a / \tau\right)}$$

where e_i is the embedding obtained from the projection head, τ is the temperature scaling factor, and A(i) ≡ I \ {i} is the set of all indices other than i. This loss is the InfoNCE loss, which maximizes a lower bound on the mutual information of two observations.

When labels are available, supervised contrastive losses [37] can be used, which generalize to an arbitrary number of positive pairs. The loss function takes the following form:

$$\mathcal{L}^{\mathrm{sup}} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp\left(e_i \cdot e_p / \tau\right)}{\sum_{a \in A(i)} \exp\left(e_i \cdot e_a / \tau\right)}$$

where P(i) ≡ {p ∈ A(i) : y_p = y_i} is the set of indices of all positives in the multiview batch. In addition to the augmented version of the anchor, supervised contrastive loss treats samples with the same label within the minibatch as positive pairs.
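Both contrastive objectives can be illustrated with a short NumPy sketch. It assumes unit-norm embeddings and averages the loss over anchors rather than summing it; the function names and the `partner` array (holding j(i) for each i) are hypothetical and do not come from the original code.

```python
import numpy as np

def _log_softmax_over_others(e, tau):
    """Row-wise log-softmax of e_i . e_a / tau over all a different from i."""
    logits = e @ e.T / tau
    np.fill_diagonal(logits, -np.inf)                  # exclude a = i
    logits = logits - logits.max(axis=1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def info_nce_loss(e, partner, tau=0.5):
    """Self-supervised loss; partner[i] is the index of the other view of i."""
    log_prob = _log_softmax_over_others(e, tau)
    return -log_prob[np.arange(len(e)), partner].mean()

def sup_con_loss(e, labels, tau=0.5):
    """Supervised loss; positives are all other samples sharing the label of i."""
    log_prob = _log_softmax_over_others(e, tau)
    positives = labels[:, None] == labels[None, :]
    np.fill_diagonal(positives, False)
    # Assumes every anchor has at least one positive (true with two views).
    summed = np.where(positives, log_prob, 0.0).sum(axis=1)
    return (-summed / positives.sum(axis=1)).mean()

rng = np.random.default_rng(0)
e = rng.normal(size=(6, 16))
e /= np.linalg.norm(e, axis=1, keepdims=True)          # 2N = 6 unit embeddings
partner = np.array([3, 4, 5, 0, 1, 2])                 # view i pairs with i + N
instance_labels = np.array([0, 1, 2, 0, 1, 2])         # one label per source
class_labels = np.array([0, 0, 1, 0, 0, 1])            # two sources share class 0
```

When every source image has its own label, the only positive of an anchor is its other view, so `sup_con_loss(e, instance_labels)` coincides with `info_nce_loss(e, partner)`. With `class_labels`, samples of the same class are pulled together as well.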

Auxiliary Module: Classification Loss and Triplet Loss
Contrastive loss maximizes agreement between augmented versions derived from the same source without explicitly sampling negative pairs [16]. To achieve better performance, many negative pairs are needed to ensure the convergence of the contrastive objective function. For example, [16] used a batch size of 8192, which gives 2(8192 − 1) = 16,382 negative samples per positive pair from both augmentation views; similar conditions were applied in [37] with a batch size of 6144. Training with such a large batch size is computationally expensive and difficult with standard optimizers [38,39]. In this study, instead of using a large batch size, we leverage the samples' labels to implicitly generate pairs of negative and positive samples. We also note that training with implicit labels or supervised training is standard practice for learning embedded vectors for category image retrieval.
As proposed in previous literature on image retrieval, softmax cross-entropy loss and ranking losses, such as triplet loss, are used either for end-to-end training of a CNN backbone [6] or for fine-tuning with triplet loss a model that was first trained as a classifier with cross-entropy loss [5,8].
In this study, we train the model with an auxiliary classification loss to maximize inter-class distance and utilize triplet loss to rank the embeddings of inter-class pairs over intra-class pairs. We add label smoothing [40] and temperature scaling [41] to the auxiliary cross-entropy loss function to prevent overconfidence and to learn better embeddings.

$$\mathcal{L}^{\mathrm{ce}} = -\frac{1}{2N} \sum_{i=1}^{2N} \sum_{j=1}^{M} \tilde{y}_{i,j} \log \frac{\exp\left(c_{i,j} / \tau\right)}{\sum_{l=1}^{M} \exp\left(c_{i,l} / \tau\right)}$$

where 2N is the batch size, including augmented samples, M is the number of classes, τ is the temperature scaling factor, and ỹ_{i,j} is the label-smoothed target of the i-th sample for class j. c_i = W^T z_i + b is the vector of logits of the i-th sample, obtained by adding a trainable linear layer over the embedded vector e_i.

Additionally, we add triplet loss to the objective function. Minimizing the triplet loss in the embedding space means that instances with the same label, and their augmentations, move closer together to form well-separated clusters. The version of triplet loss used in this study is online triplet mining using the batch-hard strategy [42]. For each sample a in the batch, we select the hardest positive and the hardest negative samples within the batch when forming the triplets used for computing the loss.

$$\mathcal{L}^{\mathrm{tri}} = \sum_{i=1}^{P} \sum_{a=1}^{K} \Big[\, m + \max_{p=1,\dots,K} D\left(e^{i}_{a}, e^{i}_{p}\right) - \min_{\substack{j=1,\dots,P,\ j \neq i \\ n=1,\dots,K}} D\left(e^{i}_{a}, e^{j}_{n}\right) \Big]_{+}$$

where m is the margin, P and K are the number of classes and the number of samples per class within a minibatch, e^i_a is the embedding of the a-th sample of class i, D(·,·) is the distance in the embedding space, and [·]_+ = max(·, 0). The final loss function for end-to-end training of our framework is the weighted summation of a contrastive loss and the two auxiliary losses.
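The two auxiliary losses can be sketched in NumPy as follows. The sketch rests on assumptions that the text does not fix: label smoothing is taken in its common form (a fraction `smoothing` of the target mass spread uniformly over the M classes), the distance is Euclidean, and both losses are averaged over the batch. The function names are hypothetical.

```python
import numpy as np

def smoothed_cross_entropy(logits, labels, tau=0.5, smoothing=0.1):
    """Cross-entropy of softmax(logits / tau) against label-smoothed targets."""
    n, m = logits.shape
    scaled = logits / tau
    scaled = scaled - scaled.max(axis=1, keepdims=True)   # numerical stability
    log_prob = scaled - np.log(np.exp(scaled).sum(axis=1, keepdims=True))
    targets = np.full((n, m), smoothing / m)              # uniform share
    targets[np.arange(n), labels] += 1.0 - smoothing      # true-class share
    return -(targets * log_prob).sum(axis=1).mean()

def batch_hard_triplet_loss(e, labels, margin=1.0):
    """Per anchor: farthest same-label sample versus closest other-label sample."""
    diff = e[:, None, :] - e[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))               # pairwise Euclidean
    same = labels[:, None] == labels[None, :]
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    hardest_neg = np.where(~same, dist, np.inf).min(axis=1)
    return np.maximum(margin + hardest_pos - hardest_neg, 0.0).mean()

clustered = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
good_labels = np.array([0, 0, 1, 1])   # labels agree with the clusters
bad_labels = np.array([0, 1, 0, 1])    # labels cut across the clusters
```

With `good_labels`, every negative is farther than the margin from its anchor and the triplet loss vanishes; with `bad_labels`, each anchor's hardest positive is far away and its hardest negative coincides with it, so the loss is large.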

Results
We evaluated our proposed framework on category image retrieval tasks with low-resolution queries. This section first gives a brief overview of the datasets used in the experiments, then describes implementation details such as model configurations and training settings, and finally presents quantitative results in terms of ranking recall together with visualization results.

Datasets
We report performance on three popular datasets widely used for category-level image retrieval. The CUB200-2011 [43] dataset contains 11,788 images representing 200 bird classes. The CARS196 [44] dataset consists of 16,185 images corresponding to 196 classes. Stanford Online Products [45] contains 120 k online product images of 22,634 categories. These datasets are well suited to fine-grained tasks, especially CARS196 and CUB200-2011, in which images from different categories often share object properties such as shape and color. In the contrastive learning framework, we apply an identical augmentation to all three datasets. First, we randomly crop part of the input image with a ratio ranging from 0.5 to 1, then apply a Gaussian kernel to generate two blurred images originating from the same source; the kernel size is set to 23, but the variance is randomized from 1 to 5 to generate multiresolution samples. At the end of the preprocessing step, we resize the augmented images to a fixed size of 224 × 224. The training and testing sets for each dataset are separated as per the default settings provided in the datasets package. We further split off 20% of the training set to form a validation set. The sampling strategy is the same for all three datasets. In the retrieval phase, samples from the test set without blurry augmentation are used to form a gallery, and samples from the validation set with the above augmentations are used as queries.
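The augmentation described above can be sketched in NumPy. Several details are assumptions rather than statements of the original implementation: the crop ratio is applied to each side of the image, the randomized value between 1 and 5 is passed to the Gaussian as its spread parameter `sigma`, both views share one crop, and borders are handled by edge padding. The final resize to 224 × 224 is omitted, and all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel_1d(kernel_size, sigma):
    """Normalized 1-D Gaussian kernel of odd length kernel_size."""
    x = np.arange(kernel_size) - (kernel_size - 1) / 2.0
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(image, kernel_size, sigma):
    """Separable Gaussian blur of an (H, W, C) float image with edge padding."""
    k = gaussian_kernel_1d(kernel_size, sigma)
    pad = kernel_size // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    # Filter along the vertical axis, then along the horizontal axis.
    vertical = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, vertical)

def two_blurred_views(image, rng, crop_scale=(0.5, 1.0), kernel_size=23,
                      sigma_range=(1.0, 5.0)):
    """Random crop followed by two independently blurred views of that crop."""
    h, w, _ = image.shape
    scale = rng.uniform(*crop_scale)
    ch, cw = max(1, int(h * scale)), max(1, int(w * scale))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = image[top:top + ch, left:left + cw]
    return tuple(gaussian_blur(crop, kernel_size, rng.uniform(*sigma_range))
                 for _ in range(2))

rng = np.random.default_rng(0)
image = rng.random((64, 48, 3))            # stand-in for a decoded RGB image
view_a, view_b = two_blurred_views(image, rng)
```

The two returned views have the same size but different amounts of blur, which is what the contrastive objective treats as a positive pair.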

Implementations
All experiments are implemented in Python using PyTorch, including model construction, dataset loading, and evaluation, and are run on a Titan V GPU with 12 GB of memory. The source code is available at https://github.com/Ka0Ri/Contrastivelearning-for-image-retrieval (accessed on 20 May 2021).
We use BiT [20] and EfficientNet [21] as backbone networks for comparison with the ViT [11] model. All models are fine-tuned from weights pre-trained on ImageNet. The comparison of model sizes is given in Table 1. Specifically, we use ViT-B-16, the base version with 12 encoder layers and 12 self-attention heads per layer, which splits an image into 16 × 16 patches. We use the BiT-M-R152×4 architecture, a variant of ResNet version 2 with 152 layers and a width factor of 4. BiT models maintain almost the same architecture as the original ResNet version 2, except that batch normalization is replaced with group normalization and weight standardization. BiT comes in three versions, S, M, and L, where the size of the pre-training dataset scales up from S to L. The scale of a BiT model depends on which ResNet model is used; two factors are considered: the depth factor, which reflects how deep the model is (50, 101, or 152 layers), and the width factor, which defines how many channels are used in a residual block (×1, ×2, ×4). EfficientNets are a family of models obtained through neural architecture search to balance performance and computational resource usage. EfficientNet's scaling factors include depth, width, and input resolution, chosen to maximize model accuracy under given resource constraints. EfficientNet is scaled up from a MobileNet-like baseline (B0) into seven versions, B1 to B7, with the number of parameters gradually increasing from 5.3 M to 66 M. We experiment with EfficientNet and BiT along with the ViT architecture to assess the effectiveness of ViT's attention-based mechanism relative to residual-based and depth-wise convolution-based mechanisms. In this study, we use EfficientNet-B7 for a fair comparison with the other architectures. ViT, BiT, and EfficientNet are recent models that compete with the state of the art on the large-scale ImageNet dataset.
We use the same training hyperparameters in all experiments unless otherwise stated in the individual experiments. In the training phase, we use the AdamW optimizer [46] with a cosine-annealing learning-rate schedule; the weight decay is set to 10⁻⁶, and the other optimizer parameters are left at their defaults. The learning rate is warmed up from 0 to 0.001 during the first ten epochs. The model is trained for 100 epochs with a batch size of 64.
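The learning-rate schedule can be illustrated with a small function. It is a sketch under assumptions the text leaves open: the rate is updated once per epoch, the warm-up is linear, and the cosine phase spans the remaining 90 epochs and decays to 0. The function name is hypothetical.

```python
import math

def learning_rate(epoch, base_lr=1e-3, warmup_epochs=10, total_epochs=100):
    """Linear warm-up from 0 to base_lr, then cosine annealing towards 0."""
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# One value per epoch of the 100-epoch run.
schedule = [learning_rate(epoch) for epoch in range(100)]
```

The rate peaks at 0.001 at epoch 10 and decreases monotonically afterwards.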

Quantitative Results
We save the model weights that attain the lowest validation loss or, when the auxiliary classification loss is used, the highest validation accuracy. These weights are loaded into the corresponding model to extract embedded vectors for image search in the retrieval phase. The search strategy used in our study is simply an exhaustive search using L2 similarity. Note that the embedded vectors are normalized to the unit sphere, so the similarity can be calculated simply by the dot product: for unit vectors u and v, ‖u − v‖² = 2 − 2 u · v, and both measures give the same ranking. For quantitative evaluation, the embedded vectors from the test set are used to build a gallery, and the embedded vectors extracted from the validation set act as queries. A more efficient search strategy than exhaustive L2 comparison could be applied for image retrieval; however, we focus on improving the search space and leave efficient search methods to future work. The matching score for a query is measured by ranking recall as follows.

$$\mathrm{recall@}k(x) = \begin{cases} 1, & \text{if some result in } R_k \text{ has the same label as } x \\ 0, & \text{otherwise} \end{cases}$$

where x is a query image and R_k is the set of top-k retrieval results. We report the recall averaged over all queries from the validation set; a higher value indicates better performance. In addition, we calculate the mean average precision (mAP) for comparison purposes.
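The retrieval step and the ranking recall can be sketched in NumPy. The sketch assumes unit-norm embeddings, so ranking by dot product is equivalent to ranking by L2 distance, and it counts a query as a hit when at least one of its top-k results shares its label. The function name and the toy data are hypothetical.

```python
import numpy as np

def recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, k):
    """Mean recall@k over queries, using exhaustive dot-product search."""
    scores = query_emb @ gallery_emb.T              # similarity to every gallery item
    top_k = np.argsort(-scores, axis=1)[:, :k]      # indices of the top-k results
    hits = (gallery_labels[top_k] == query_labels[:, None]).any(axis=1)
    return hits.mean()

gallery_emb = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
gallery_labels = np.array([0, 1, 0])
query_emb = np.array([[0.6, 0.8]])                  # unit vector, closest to item 1
query_labels = np.array([0])
```

In this toy example the nearest gallery item has a different label, so the query misses at k = 1 but hits at k = 2.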
Most experiments were carried out with the following basic setting: the ViT-B-16 model is fine-tuned and used to perform image retrieval on the CUB200-2011 dataset with the supervised contrastive loss function (α = 0), auxiliary classification loss (β = 1), and triplet loss function (γ = 1); the embedding dimension is set to 128, and the representation vector is extracted from the class token. First, we show the performance of the proposed framework on different datasets. Then, in the subsequent sections, we conduct an ablation study on the effectiveness of the backbone networks, the number of embedding dimensions, and the loss components.

Experimental Results of Different Datasets
This section sets a benchmark for our framework on the CUB200-2011, CARS196, and SOP datasets. Recall results are reported for the first five ranks on the CUB200-2011 and CARS196 datasets, and at ranks 1, 10, 100, 500, and 1000 on the SOP dataset with the instance-level label. The experimental results show that our approach achieves a first-rank recall of 0.9414, 0.8541, and 0.9806 and an mAP of 0.9379, 0.8804, and 0.9141 on the CUB200-2011, CARS196, and SOP datasets, respectively; details are given in Table 2. The SOP dataset provides two types of labels: class labels with 22,634 categories and super-class labels with 12 categories. The super-class labels indicate the type of product, while the class label is specific to each product, that is, it groups different views of the same product. Our result also shows a high recall value for instance image retrieval, and we obtain a first-rank recall of 0.947 on the SOP dataset.

Table 3 shows image retrieval performance for different backbone encoder networks. The highest mAP value is obtained with the ViT architecture (0.9379), which indicates that representations encoded by ViT perform better than representations extracted from other state-of-the-art architectures, such as BiT (mAP = 0.904) and EfficientNet (mAP = 0.7906). Our results suggest that an attention-based architecture performs better than the ResNet-based architectures used in previous studies on image retrieval [1,3,8]. It is well understood that a bigger model commonly results in better performance. However, these results are not biased, since we selected architectures with an approximately equal number of parameters.

The Effectiveness of the Number of Embedding Dimensions
Following a typical experiment in previous studies on information retrieval, we analyze the impact of the number of embedding dimensions. We find that 256 dimensions produce the best performance, with an mAP of 0.955 and a first-rank recall of 0.9453, as shown in Table 4. These findings are consistent with previous research on contrastive learning [16,26,37], which shows that the embedding dimension should be either 128 or 256. They are also consistent with studies on image retrieval using the embedding search strategy [3,6,8]. Note that the size of the hidden layer in the projection head is equal to the dimension of the embedded vector.

The Effectiveness of Loss Components
We study how the loss components affect the performance of image retrieval. We add auxiliary loss functions, namely triplet loss and classification loss, to the contrastive loss to train the model end-to-end. In particular, we experiment with a tuple of three parameters (α, β, γ) ∈ {0, 1}³ that characterize the use of self-supervised contrastive loss (α = 1) or supervised contrastive loss (α = 0), the presence of classification loss (β = 1), and the presence of triplet loss (γ = 1). As mentioned in the previous section, the classification loss used in this study is the cross-entropy loss with label smoothing (p = 0.1) and temperature scaling (τ = 0.5). The temperature scaling factor in the contrastive loss is also set to 0.5. In addition, the margin parameter in the triplet loss is set to 1. Table 5 demonstrates that supervised contrastive loss in combination with classification and triplet loss achieves the best performance (mAP = 0.9379). The experiment also shows that stand-alone contrastive loss is not enough to accomplish the category image retrieval task, as shown in the cases of self-supervised contrastive loss (mAP = 0.6026) and supervised contrastive loss (mAP = 0.6214). In addition, classification loss is more effective than triplet loss, as shown by (self-)contrastive loss combined with classification loss (mAP = 0.9120, mAP = 0.8929) compared with (self-)contrastive loss combined with triplet loss (mAP = 0.8445, mAP = 0.8910). The results confirm that classification loss is a good option for supplying category information in image retrieval tasks.
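One way to read the (α, β, γ) configuration is as switches on the total objective. The form below is an assumption inferred from the description above, not an equation taken from the original paper; the symbols for the four loss terms (self-supervised contrastive, supervised contrastive, cross-entropy, and triplet) are chosen here for illustration.

```latex
\mathcal{L} = \alpha\,\mathcal{L}^{\mathrm{self}}
            + (1-\alpha)\,\mathcal{L}^{\mathrm{sup}}
            + \beta\,\mathcal{L}^{\mathrm{ce}}
            + \gamma\,\mathcal{L}^{\mathrm{tri}},
\qquad (\alpha, \beta, \gamma) \in \{0, 1\}^{3}
```

Under this reading, the basic setting (α, β, γ) = (0, 1, 1) trains with the supervised contrastive, classification, and triplet terms together, while (1, 0, 0) and (0, 0, 0) correspond to the stand-alone self-supervised and supervised contrastive cases.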

The Effectiveness of Attention Mechanism
Finally, we analyze the effectiveness of attention embeddings, which is the key proposal of our study. The ViT architecture is built upon the attention mechanism, where each image patch has its own attention weight that determines which regions should be focused on. In this study, we investigate the effectiveness of the attention mechanism by varying the number of "decisive" patches, that is, the patches with the highest attention weights. Table 6 illustrates that our simple method achieves a better result using 25 decisive patches (recall@1 = 0.9492, mAP = 0.9475). The case of only one decisive patch means that only the most attentive region is extracted, while the case of 128 decisive patches reflects that the average attention of all patches' embeddings is extracted. To put our results in context, we compare them with a previous study [6], as shown in the CGD row. The study in [6] is a typical study on image retrieval in which a ResNet-based model is fine-tuned using combined global descriptors, and it achieved significant performance on a broad range of datasets. We verify that our proposed framework outperforms CGD in terms of both ranking recall and mAP. Together, the present findings indicate that using the embeddings of attended patches is slightly more robust than using the class embedding, as suggested in [12]. To conclude this section, we show that fine-tuning the ViT model with contrastive learning provides better results than the direct use of features extracted from the pre-trained model, that is, an mAP of 0.54 without fine-tuning compared to an mAP of 0.6214 when fine-tuned with supervised contrastive learning. We expect this observation to extend to other models as well.
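The selection of decisive patches can be sketched in NumPy. The text does not specify how the attention weights are obtained or how the selected patches are aggregated, so this sketch assumes one attention weight per patch (for example, from the class token in the last encoder layer) and mean pooling of the selected patch embeddings followed by normalization. The function name and the toy values are hypothetical.

```python
import numpy as np

def decisive_patch_embedding(patch_emb, attention, k):
    """Average the embeddings of the k patches with the highest attention."""
    top = np.argsort(-attention)[:k]          # indices of the k decisive patches
    pooled = patch_emb[top].mean(axis=0)      # mean pooling is an assumption
    return pooled / np.linalg.norm(pooled)

patch_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # three toy patches
attention = np.array([0.1, 0.7, 0.2])                        # one weight per patch
most_attentive = decisive_patch_embedding(patch_emb, attention, k=1)
top_two = decisive_patch_embedding(patch_emb, attention, k=2)
```

With k = 1 the result is the embedding of the single most attended patch, and as k grows the result approaches the average over all patches.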

Qualitative Analysis
From the quantitative results presented in the previous section, we verify that our best model for a retrieval system is the model with 25 decisive patch embeddings trained with supervised contrastive loss, classification loss, and triplet loss. This section presents some qualitative analyses obtained with our method. First of all, we show that our framework works well across fine-grained datasets, namely CUB200-2011, CARS196, and SOP. Figure 5 shows the retrieval results when the queries are images with mild resolution reduction. A successful retrieval system should return results that match the query under different views, such as varied lighting conditions or points of view. This property partially manifests in our results, for example, a green car seen from a different viewpoint or the same type of car in two different colors, red and blue, as illustrated in Figure 5.
To clarify the purpose of our study, we consider the case where strong resolution reduction is used, as depicted in Figure 6. The result in Figure 6 supports the claim that our framework can deal with low-resolution images. We speculate that this is due to using contrastive learning to fine-tune the model with both multiresolution and fine-resolution samples. One may argue that superior performance could be obtained by using super-resolution preprocessing. However, training additional models for super-resolution is impractical because of the demand for high computational resources. In addition, based on our findings, we believe that the effectiveness of descriptors for feature matching is a result of careful fine-tuning of the model with proper augmentation.
Finally, we once again address the effectiveness of the attention mechanism, as demonstrated in Figure 7. The attention mechanism grants the ability to focus on the region of interest, that is, the region that contains objects, as displayed in Figure 6. Attention is particularly important when investigating visual objects against a distracting background. The experimental results herein aim to verify that, with the help of attention, the retrieval results can be more accurate in fine-grained category image retrieval and even in difficult cases where the semantic content is hard to distinguish.

Conclusions
This study introduced a simple yet effective framework for category image retrieval. We exploited the Vision Transformer architecture to take advantage of the self-attention mechanism and enhance the robustness of representation vectors. Additionally, we addressed the low-resolution image retrieval problem by using contrastive learning with a proper augmentation strategy. The solution proposed here addresses only the case of low-resolution samples; however, the same framework can be applied to other degraded image retrieval systems if a suitable augmentation method is defined. We verified the effectiveness of our approach through extensive experiments, both quantitative and qualitative, on several public datasets. However, a limitation of this study is that the retrieval system was only analyzed at the category level. The lack of evaluations at the instance level makes the model somewhat inferior to a general-purpose retrieval system, and we wish to tackle this challenge in future studies.