Toward Improving Image Retrieval via Global Saliency Weighted Feature

To fully describe the semantic information of images, image retrieval tasks increasingly use deep convolutional features trained by neural networks. However, to form a compact feature representation, the obtained convolutional features must be further aggregated, and the quality of this aggregation affects retrieval performance. To obtain better image descriptors for image retrieval, we propose a method with two modules. The first module, named generalized regional maximum activation of convolutions (GR-MAC), pays more attention to global information at multiple scales. The second module, called saliency joint weighting, uses nonparametric saliency weighting and channel weighting to focus feature maps on the salient region without discarding overall information. Finally, we fuse the two modules to obtain more representative image feature descriptors that not only consider the global information of the feature map but also highlight the salient region. We conducted experiments on multiple widely used retrieval data sets, such as ROxford5k, to verify the effectiveness of our method. The experimental results show that our method is more accurate than state-of-the-art methods.


Introduction
Content-based image retrieval (CBIR) [1][2][3][4] has been dramatically reshaped by the advent of convolutional neural networks in recent years. The volume of image databases is also increasing with the rapid development of computer and Internet technology, so developing quick and accurate methods to retrieve the required images from large-scale image databases has become a popular research topic. These methods mainly extract features from images, measure the similarity of those features, and return the retrieval results. Query expansion and other methods are typically used as a supplement [5][6][7][8] to improve retrieval accuracy.
The research on content-based image retrieval mainly falls into two categories: manual features and deep features. Early image retrieval was mainly based on a few manual global features, such as color and texture [9][10][11], but because these global features are easily affected by occlusions, displacements, and lighting conditions, retrieval performance is considerably reduced. To address these problems, researchers proposed manual local features that are not easily affected by scale changes and illumination [12,13]. The most representative method is the scale-invariant feature transform (SIFT) [12], which is invariant to image distortion, illumination changes, viewpoint changes, and scaling. Owing to the complexity and time-consuming computation of SIFT [12], speeded up robust features (SURF) [14] was proposed based on it. Bag-of-visual-words (BoW) [15], inspired by text retrieval, was proposed to aggregate these local features into a global image representation. The main contributions of this paper are as follows:
1.
We introduce a novel feature pooling method, named generalized R-MAC, which captures the information contained in all the feature points in each region of R-MAC [25] instead of considering only the maximum value of the feature points in each region.

2.
We present an approach for the aggregation of convolutional features, including nonparametric saliency weighting and pooling steps. It focuses more attention on the convolutional features of the salient region without losing the information of the entire building (target region).

3.
We conducted comprehensive experiments on several popular data sets, and the outcomes demonstrate that our method provides state-of-the-art results without any fine-tuning.

Aggregation Methods
Since convolutional neural networks (CNNs) have been broadly adopted in the field of image retrieval, research has gradually concentrated on the convolutional layers rather than the fully connected layers. The features produced by the convolutional layers are more robust to image transformations and express spatial information more accurately. Therefore, obtaining a more representative image descriptor becomes a key step in improving the accuracy of image retrieval.
In the early days, a multitude of classical encoding methods for hand-crafted features were used to generate image descriptors. SIFT [12] is a scale-invariant feature transform that detects key points in the image. BoW [15] uses clustered local image features as visual words and counts their frequency to construct image descriptors. VLAD, proposed by Jegou et al. [16], trains a small codebook through clustering and encodes each feature according to its distance from the cluster centers. Hao et al. proposed a multiscale fully convolutional (MFC) network [27], which extracts features at three different scales and fuses them to generate the final descriptor.
In recent years, three-dimensional convolutional features have been aggregated into more compact feature descriptors. Babenko et al. proposed SPoC [23], which averages each feature map to obtain a compact descriptor; however, this average does not consider the importance of each feature value, although points with larger values are more likely to lie in the target region. Razavian et al. proposed MAC [28], which takes the maximum value of each feature map to obtain a compact descriptor; although this filtering is simple, it loses too much of the information contained in the feature map. R-MAC [25] performs sliding-window sampling on deep convolutional features and aggregates the maximum value of each window into the image descriptor. Compared with MAC, R-MAC captures more information from each feature map in a multiscale manner, but it still shares MAC's problem of selecting only the maximum value in each region. Crow [26], proposed by Jimenez et al., adopts the idea of the attention mechanism and performs spatial weighting and channel weighting on the convolutional feature map to highlight the target region. The semantic-based aggregation (SBA) method was proposed by Xu et al. [29]. These aggregation methods improve retrieval accuracy to a certain extent, but they neither effectively use the global information of the feature map nor focus on its salient region. Based on this, we propose global saliency-weighted deep convolutional features, which make the feature descriptors more representative of image information and achieve more accurate retrieval results. We conducted comprehensive comparison experiments to show that our algorithm exceeds various existing state-of-the-art algorithms.

Normalization and Whitening
Normalization is an overwhelmingly crucial step in image retrieval [30]; it converts data to a uniform range for comparison. L2 normalization [30] scales a vector to unit Euclidean length; for the nonnegative convolutional features used here, the components then lie between 0 and 1. The value range of the features output by a convolutional neural network is usually extremely large, so L2 normalization can be used to balance their influence. The specific calculation is as follows:
X̂ = X / ‖X‖₂, (1)
where X is a concrete vector and ‖X‖₂ is its 2-norm. Power normalization [30] shrinks the vector according to a power exponent. The specific calculation formula is as follows:
X̂ = sgn(X) |X|^ξ, (2)
where ξ is a configurable hyperparameter ranging from 0 to 1 and X is a concrete vector. To avoid changing the sign of the values after power normalization, we use the sign function sgn(X), defined as sgn(X) = 1 for X > 0 and sgn(X) = −1 for X ≤ 0. Whitening has often been adopted as a postprocessing operation in image retrieval to whiten and reduce dimensions, based on the work of Jegou and Chum [8]. When reducing dimensionality, whitening prevents interaction effects between raw data components and reduces noise [7,31,32]. Here, we use whitening as a postprocessing step to improve retrieval performance. The essence of the whitening proposed by Mikolajczyk and Matas [32] is linear discriminant projections, which are computed in two parts.
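As a quick sketch, the two normalizations just described can be written in NumPy; `l2_normalize` and `power_normalize` are hypothetical helper names, and `xi` plays the role of ξ:

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit Euclidean length, as in Equation (1)."""
    return x / (np.linalg.norm(x) + eps)

def power_normalize(x, xi=0.5):
    """Signed power normalization with exponent xi in (0, 1], as in Equation (2)."""
    return np.sign(x) * np.abs(x) ** xi
```

For post-ReLU features all values are nonnegative, so `np.sign` returning 0 at exactly 0 makes no practical difference here.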
In the first part, we obtain the covariance matrix C_S of the intraclass images (matched image pairs) using Equation (3) and the covariance matrix C_D of the interclass images (non-matched image pairs) using Equation (4):
C_S = Σ_{label_i = label_j} (x_i − x_j)(x_i − x_j)^T, (3)
C_D = Σ_{label_i ≠ label_j} (x_i − x_j)(x_i − x_j)^T, (4)
where i and j denote two distinct images in the data set, x_i and x_j denote the feature descriptors of the images after pooling, and label_i = label_j (label_i ≠ label_j) means that images i and j belong to the same (different) class.
In the second part, we obtain the projection P in the whitened space (the eigenvectors of the covariance matrix C_S) and apply it to the image descriptors:
P = eig(C_S), (5)
x_whiten_i = P^T (x_i − µ), (6)
where eig(·) returns the eigenvectors of the given matrix, µ is the mean pooling vector (the mean of the descriptors of all images in the test data set after pooling) used for centering, and x_whiten_i is the whitened result of x_i. It is worth noting that Equation (5) does not retain only the eigenvectors greater than a specific threshold for dimension reduction.
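A minimal sketch of the two whitening parts follows; `learn_whitening` and `apply_whitening` are hypothetical helper names, and for brevity only the matched-pair covariance C_S and its plain eigenvectors are used, as in Equations (3), (5), and (6):

```python
import numpy as np

def learn_whitening(X, labels):
    """Learn a whitening projection from descriptors X (N x D) and class labels.

    Accumulates the intraclass covariance C_S over matched pairs and takes
    its eigenvectors as the projection P; the paper's exact recipe may
    differ in detail (e.g. in how C_D is used).
    """
    N, D = X.shape
    C_S = np.zeros((D, D))
    for i in range(N):
        for j in range(i + 1, N):
            if labels[i] == labels[j]:   # matched (intraclass) pair
                d = X[i] - X[j]
                C_S += np.outer(d, d)
    _, P = np.linalg.eigh(C_S)           # columns of P are eigenvectors of C_S
    mu = X.mean(axis=0)                  # mean pooling vector used for centering
    return P, mu

def apply_whitening(x, P, mu):
    """x_whiten = P^T (x - mu), as in Equation (6)."""
    return P.T @ (x - mu)
```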

Similarity Measurement Method
After obtaining the feature descriptor of an image, we need to measure the similarity between the descriptor of the query image and the descriptors of all the images in the test set. A commonly used similarity measure for two vectors is based on the inverse of their Euclidean distance. The formula is as follows:
L(A, B) = √(Σ_{i=1}^{n} (a_i − b_i)²), (7)
where A ∈ R^n and B ∈ R^n are two sample vectors, a_i and b_i are the ith components of A and B, respectively, and L(A, B) denotes the Euclidean distance between A and B. Cosine similarity is also used for similarity measurement; it represents the similarity of two vectors by the cosine of the angle between them. The closer the cosine value is to 1, the more similar the two vectors are; the closer it is to −1, the more dissimilar they are. The specific calculation formula is as follows:
S(A, B) = (Σ_{i=1}^{n} a_i b_i) / (√(Σ_{i=1}^{n} a_i²) √(Σ_{i=1}^{n} b_i²)), (8)
where S(A, B) is the cosine similarity between A and B.
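The two measures above can be sketched directly in NumPy (`euclidean_distance` and `cosine_similarity` are hypothetical helper names):

```python
import numpy as np

def euclidean_distance(a, b):
    """L(A, B): Euclidean distance between two descriptors, Equation (7)."""
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b, eps=1e-12):
    """S(A, B): cosine of the angle between two descriptors, Equation (8)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Note that for L2-normalized descriptors the two measures produce identical rankings, which is why retrieval pipelines often use a plain dot product after normalization.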

Query Expansion
Query expansion (QE) was first applied in text retrieval to improve effectiveness by extracting new keywords from top ranked results retrieved by an original query [5][6][7] to generate a new expanded query.
Inspired by this idea, query expansion in image retrieval starts with a given query image: we retrieve the top n ranked images, including the query image itself, calculate the average of their features, and generate a new query that is evaluated to re-rank [33] the images. Query expansion broadens the scope of image retrieval and is one of its common postprocessing operations. Radenovic et al. [7] proposed α-weighted query expansion (αQE).
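The plain average variant described above can be sketched as follows, assuming the database descriptors are L2-normalized rows of a matrix (`average_query_expansion` is a hypothetical helper name; αQE would additionally weight each neighbor by its similarity raised to the power α):

```python
import numpy as np

def average_query_expansion(query, database, n=5):
    """Average query expansion (AQE) sketch.

    Ranks database descriptors (rows, assumed L2-normalized) by dot-product
    similarity to the query, averages the top-n together with the query
    itself, and re-normalizes the result to form the expanded query.
    """
    sims = database @ query              # cosine similarity via dot product
    top = np.argsort(-sims)[:n]          # indices of the top-n ranked images
    new_query = (query + database[top].sum(axis=0)) / (n + 1)
    return new_query / np.linalg.norm(new_query)
```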

Algorithm Background
Our algorithm calculates the descriptors of convolution features in two modules: generalized R-MAC (GR-MAC) and saliency joint weighting (SJW). They are introduced in detail in Sections 3.2 and 3.3, respectively.
We obtain the ultimate feature descriptor by fusing the feature descriptors obtained by the two modules.
Gsw = α · Gr + β · Sjw, (10)
where Gr comes from Section 3.2, Sjw comes from Section 3.3, and α and β are fusion factors with values ranging from 0 to 1. GR-MAC inevitably weakens the influence of the salient region while obtaining more global information. Therefore, integrating the SJW module is equivalent to assigning greater weight to the features of the more important salient regions while still obtaining global information. Such a fusion complements the GR-MAC module on data sets with more obvious salient regions; for more complex data sets with obvious salient regions, the improvement is even greater.
The algorithm takes five variables as input: X is the three-dimensional feature map produced by the last convolutional layer, n is the number of channels selected, ρ is the scale factor, and α and β are the fusion factors. For an input three-dimensional feature map, we obtain two different one-dimensional feature descriptors through Sections 3.2 and 3.3, respectively, and then normalize them separately. In the end, fusion is performed according to the values of α and β. After normalization and whitening, the final feature descriptor is obtained. The specific retrieval process is shown in Figure 1.
As shown in Figure 1, we briefly describe the specific retrieval process after applying the global saliency weighting algorithm. We first used the pretrained network to extract the feature maps from the images in the data sets. Afterward, we used generalized R-MAC and the saliency joint-weighted deep convolutional features, described in detail in Sections 3.2 and 3.3, to obtain one-dimensional descriptors for the feature maps separately. GR-MAC considers the maximum value of each region while also considering the information contained in the other response values, and it aggregates the three-dimensional feature maps obtained by the pretrained network into one-dimensional descriptors with more global information. SJW introduces a saliency algorithm to weight the salient region, obtaining features that are more conducive to image retrieval, such as the descriptor of the edge of a building. Then, after L2-normalizing the abovementioned one-dimensional descriptors separately, we merged them in a linear manner (as in Equation (10)). Finally, the fused feature descriptors were whitened to obtain the Gsw feature descriptors used for image retrieval, returning the top K images that are most similar to the query image.
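The normalize-then-fuse step of the pipeline can be sketched as follows, assuming `gr` and `sjw` hold the two module descriptors (`l2n` and `fuse_descriptors` are hypothetical helper names):

```python
import numpy as np

def l2n(x, eps=1e-12):
    """L2-normalize a descriptor."""
    return x / (np.linalg.norm(x) + eps)

def fuse_descriptors(gr, sjw, alpha=0.5, beta=0.5):
    """Fuse the GR-MAC and SJW descriptors as in Equation (10).

    Both descriptors are L2-normalized before the weighted sum, as in the
    retrieval pipeline; whitening would follow as a separate step.
    """
    return alpha * l2n(gr) + beta * l2n(sjw)
```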

Generalized R-MAC
Motivation: in Section 2, we briefly introduced the original R-MAC [25] algorithm. First, the R-MAC [25] algorithm divides each feature map into multiple regions according to the scale, as shown in Figure 2. Then, it takes the maximum value for each region to aggregate and to obtain the final image feature descriptor.

As shown in Figure 2, we use L to describe the size of the scales. In R-MAC [25], L = 1, 2, 3. A sliding window is used to extract R square feature regions at different scales. When L = 1, the side length of the extracted square region is the minimum of the height H and width W, namely min(H, W). Assuming that the number of extracted regions is m when L = 1, the side length of each region is 2min(H, W)/(L + 1), and each scale extracts L * (L + m − 1) regions.
The specific operation details are as follows: taking the feature map of one channel as an example, the feature map is divided into R regions, and the resulting regions are as follows:
X_k^{r-mac} = {X_{k1}^{r-mac}, X_{k2}^{r-mac}, …, X_{kR}^{r-mac}}, (11)
where X_{kr}^{r-mac} is the rth region of the kth feature map. The R regions of each channel are obtained by the above formula; in the R-MAC [25] algorithm, the maximum value of each region is retained for feature aggregation. However, using max pooling in each scale region of R-MAC [25] loses the information contained in the other feature values of the region. A didactic example is provided to illustrate this phenomenon in Figure 3.

As shown in Figure 3, we assume that the black square is a channel-wise feature map after the convolution layer. The blue and green squares are two different regions obtained by R-MAC [25]. As R-MAC [25] takes the maximum value of each region, the feature values obtained in the blue region and the green region are both 0.37, but it is apparent that the distributions of the feature values in the two regions are not the same. R-MAC [25] only obtains the information at the maximum response of 0.37 in each region and thus ignores the information contained in the other feature values of the region. The feature values obtained by this part of the convolution are also part of the image representation.
Method: to solve this problem, we propose the idea of a generalized R-MAC. As shown in Figure 4, we use the l p norm in each scale region of R-MAC [25] to integrate max pooling and average pooling. In this way, not only can we obtain the maximum value of the most representative feature obtained by convolution but also,through the average pooling of l p norm fusion, we can obtain the overall information contained in the region. Our proposed generalized R-MAC can obtain richer semantic information. Here, we designed an effective scheme to calculate the generalized R-MAC.
For the specific region division of each feature map, we follow the idea of R-MAC [25], as shown in Equation (11).
R-MAC [25] distills the maximum value in each region, but this loses the global feature information of the region. Hence, we use another strategy for calculating regional feature values, as shown in Equation (12). The specific method uses the l_p norm to fuse max pooling and average pooling to obtain more representative regional features. The specific calculation formula is as follows:
f_kr = ((1/|X_{kr}^{r-mac}|) Σ_{x ∈ X_{kr}^{r-mac}} x^p)^{1/p}, (12)
where f_kr is the value calculated by the l_p norm over the rth region of the kth feature map, |X_{kr}^{r-mac}| is the number of feature values in the region, and p = 3.
Compared with R-MAC, which selects the largest feature value in each region as its representative, we use the feature value calculated by Equation (12) as the representative of each region. In Equation (12), when p = ∞, the value of f_kr is the value of max pooling, and when p = 1, the value of f_kr is the value of average (avg) pooling. That is, the proportion of max pooling and avg pooling can be adjusted through the coefficient p.
Accordingly, the representative value finally obtained for each region not only has the max-pooling characteristic of emphasizing the larger responses of the target region but also has the avg-pooling characteristic of preserving the information contained in all the feature values of the region. In this way, each region within each scale obtains a better feature value, and finally, a more representative image feature descriptor is obtained.
According to the multiple regional feature values obtained for each channel, the regional feature values of the channel are summed to generate the ultimate feature descriptor Gr, as follows:
g_k = Σ_{r=1}^{R} f_kr, Gr = [g_1, g_2, …, g_K], (13)
where g_k is the sum of the values f_kr obtained over all regions of the kth channel and Gr is the image feature descriptor obtained using generalized R-MAC.
The resulting generalized R-MAC descriptor considers the overall information and maximum response in each region. It can more accurately describe the global information of convolution features.
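The GR-MAC idea can be sketched as follows. For readability, this sketch partitions each feature map into an L × L grid per scale rather than reproducing R-MAC's exact overlapping sliding windows, and `gem` and `gr_mac` are hypothetical helper names:

```python
import numpy as np

def gem(region, p=3.0):
    """Generalized-mean pooling over one region, as in Equation (12).

    p -> infinity approaches max pooling; p = 1 is average pooling.
    Assumes nonnegative (post-ReLU) activations.
    """
    return np.mean(region ** p) ** (1.0 / p)

def gr_mac(X, levels=(1, 2, 3), p=3.0):
    """GR-MAC sketch over a K x H x W feature map (H, W >= max level).

    Pools each region with the generalized mean, then sums the regional
    values per channel as in Equation (13).
    """
    K, H, W = X.shape
    g = np.zeros(K)
    for L in levels:
        hs = np.linspace(0, H, L + 1, dtype=int)   # grid boundaries (rows)
        ws = np.linspace(0, W, L + 1, dtype=int)   # grid boundaries (cols)
        for a in range(L):
            for b in range(L):
                region = X[:, hs[a]:hs[a + 1], ws[b]:ws[b + 1]]
                for k in range(K):
                    g[k] += gem(region[k], p)      # accumulate per channel
    return g
```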

Saliency Joint Weighting Method
Motivation: in Section 2, we briefly introduced the original cross-dimensional weighting (Crow) [26] algorithm, which performs spatial weighting and channel weighting on the convolutional feature map. We find that the spatial matrix of Crow [26] merely sums the feature maps of all channels and then normalizes the result as the spatial weighting matrix, and its channel weighting simply counts the number of nonzero values in each channel, assigning greater weight to channels whose features contain more zero values. We think these simple spatial and channel weighting matrices still have room for optimization. We intend to use a saliency algorithm to identify the salient region of the convolutional features and to apply greater weight to it, designing a weighting matrix that better distinguishes the target region from the background noise region. This retains Crow's [26] attention to the image target region while also strengthening salient regions that are more conducive to image retrieval, such as the edges of buildings.
Method: first, we use Figure 5 to introduce the general flow of our method.
As shown in Figure 5, we briefly describe the key steps of obtaining the final saliency joint-weighted (SJW) descriptor from the equations below. For an original image, we obtain the feature map through a pretrained neural network. The green squares in the figure represent the feature map of each channel obtained by different convolution kernels. Then, through the variance-based channel selection step, the feature maps corresponding to the first n channels with the largest variance values are selected, and the spatial weighting matrix Sal is obtained by Equation (18). The normalized spatial weighting matrix M is obtained by Equation (19). To highlight the salient region we need, we first perform a saliency operation on the matrix M from the spatial perspective to obtain the saliency weighting matrix Sw via Equations (20) and (21). We then fuse it with the matrix M through Equation (22) to obtain the final spatial weighting matrix S. Afterward, through Equation (23), the saliency joint-weighted (SJW) feature maps are obtained from the feature maps X and the spatial weighting matrix S, and the preliminary descriptor is obtained by pooling. A series of blue cubes of different shades represents the saliency joint-weighted (SJW) feature maps.
To make the final descriptor more representative, we also weight it from the perspective of the channel. Finally, the raw descriptor obtained by sum pooling the saliency joint-weighted (SJW) feature maps and the channel weighting matrix C obtained by Equation (17) are combined through an element-wise product to obtain the one-dimensional saliency joint-weighted (SJW) descriptor. To make the flowchart more intuitive, H and W represent the sizes of the matrices Sal, M, Sw, and S, the characters ω, Ψ, δ, and Φ represent the elements of the matrices Sal, M, Sw, and S, respectively, and ϕ is an element of the channel weighting matrix C.
First, for a specific image I, we used the pretrained network without the fully connected layers to obtain a 3D activation tensor of H × W × K dimensions, where K denotes the number of feature maps output by the last convolutional layer.
Then, similar to Crow [26], we performed spatial weighting and channel weighting operations on the obtained convolutional features. For our channel weighting matrix, unlike Crow [26], which only considers the number of nonzero values in each channel, we also considered the variance of each channel as a factor that affects the channel weight matrix. As in Crow [26], for each channel of the feature map, we calculated the proportion of nonzero values among all spatial locations:
c1_k = |{(i, j) : X_ijk > 0}| / (H × W), (14)
where c1_k is the proportion of nonzero values among all spatial locations of the kth feature map. We can judge the amount of information contained in a feature map by counting its nonzero values, which is used to boost features that are rarely activated but still meaningful.
Based on c1_k, we propose adding a variance term to optimize the channel weight matrix, as shown in Equation (15). For the feature map of each channel, we compute its variance:
c2_k = (1/(H × W)) Σ_{i,j} (X_ijk − X̄_k)², (15)
where X̄_k is the average of the X_ijk in the kth channel and c2_k is the variance of the kth channel. Then, we calculated the proportions of c1_k and c2_k and summed them:
c_k = c1_k / Σ_{j} c1_j + c2_k / Σ_{j} c2_j. (16)
When aggregating deep convolutional features, since we subsequently strengthen the target channel and target region considerably based on the variance and response values, channels with less information may be ignored. However, such channels may also contain especially crucial information, and channels with small variance can also suppress noise. Therefore, we need to assign a larger weight to feature channels with many zeros and small variance. Finally, similar to Crow [26], the inversion operation is performed through the log function. The specific formula for obtaining the final channel weight C_k using the logarithmic transformation is as follows:
C_k = log(Σ_{j} c_j / (c_k + ε)), (17)
where C_k is the channel weight and ε is an extremely small constant added to prevent the denominator from being zero.
Our spatial weighting matrix also differs from that of Crow [26], which only sums and normalizes all the channel feature maps to serve as the final spatial weighting matrix S. We first filter, by variance, the top n feature channels that are more conducive to distinguishing the target region from the background noise region, sum them into the initial weighting matrix Sal, and then normalize it into the weighting matrix M, similar to Crow [26]. According to the obtained channel selection factor c2, we select the channels with large variance to build the spatial weighting matrix, because channels with large variance are more conducive to distinguishing the target region from the background region.
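Before turning to the spatial weighting detailed below, the channel weighting just described (sparsity plus variance cues, followed by log inversion) can be sketched as follows; `channel_weights` is a hypothetical helper name, and the exact normalization in the paper may differ:

```python
import numpy as np

def channel_weights(X, eps=1e-6):
    """Channel weighting sketch combining sparsity and variance cues.

    X is a K x H x W feature map. c1 is the per-channel proportion of
    nonzero responses (Equation (14)), c2 the per-channel variance
    (Equation (15)); both are normalized and summed (Equation (16)),
    then inverted with a log (Equation (17)).
    """
    K = X.shape[0]
    flat = X.reshape(K, -1)
    c1 = (flat > 0).mean(axis=1)       # nonzero ratio per channel
    c2 = flat.var(axis=1)              # variance per channel
    c = c1 / (c1.sum() + eps) + c2 / (c2.sum() + eps)
    return np.log(c.sum() / (c + eps))  # larger weight for sparse, low-variance channels
```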
The specific calculation is as follows: let c2 ∈ R^K be the vector of the K channels' variances, and let Max_n(c2) denote the n channels with the largest variances in c2. Sal ∈ R^(W×H) is then the matrix obtained by summing, at each spatial position, the feature values of the channels selected by Max_n(c2).
Then, the obtained spatial weighting matrix Sal is normalized and power-scaled to obtain the normalized weighting matrix M.
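A minimal sketch of this channel selection and normalization, assuming ReLU (nonnegative) feature maps and square-root power-scaling as in CroW (the exact exponent is an assumption):

```python
import numpy as np

def spatial_weight_matrix(X, n=0.3):
    """Sketch of the described spatial weighting.

    X: feature maps, shape (K, H, W), assumed nonnegative (post-ReLU).
    Keeps the top-n fraction of channels by variance, sums them into Sal,
    then normalizes and power-scales the result into M.
    """
    K, H, W = X.shape
    c2 = X.reshape(K, -1).var(axis=1)                      # per-channel variance
    top = np.argsort(c2)[::-1][:max(1, int(round(n * K)))]  # top-n channels
    Sal = X[top].sum(axis=0)                               # sum selected channels, (H, W)
    M = Sal / (np.linalg.norm(Sal) + 1e-6)                 # normalize
    M = np.power(M, 0.5)                                   # power-scale (assumed exponent)
    return M
```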
We used the linear computational (LC) algorithm [34] to detect the saliency of the weighting matrix M. First, we scaled the values of M to between 0 and 255 so that the LC algorithm [34] could be applied; the mapped matrix MG thus normalizes the values of M to the pixel range 0 to 255 for the subsequent saliency calculation. Following the idea of the LC algorithm [34], we took, for each point of the spatial weighting matrix, the sum of the Euclidean distances between that point and all other points as its response value, thereby obtaining the spatial saliency weighting matrix Sw, where Sw_ij is its value at position (i, j). We then fused Sw into the original weighting matrix M with a scale factor ρ so that the final spatial weighting matrix S assigns greater weight to the key region while retaining the response information of both the target region and the background region contained in the original convolution features.
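The LC-style saliency computation on M can be sketched as follows; using a 256-bin histogram makes the all-pairs distance sum linear in the number of positions, which is the point of the LC algorithm:

```python
import numpy as np

def lc_saliency(M, eps=1e-6):
    """Sketch of LC-style saliency on a weighting matrix M (shape H x W).

    Scales M to [0, 255] integer levels (MG), then sets each position's
    saliency to the sum of absolute gray-level differences to all other
    positions, computed via a 256-bin histogram.
    """
    MG = np.round(255.0 * (M - M.min()) / (M.max() - M.min() + eps)).astype(int)
    hist = np.bincount(MG.ravel(), minlength=256)   # pixel count per gray level
    levels = np.arange(256)
    # dist[v] = sum over all positions u of |v - level(u)|
    dist = np.array([(np.abs(levels - v) * hist).sum() for v in range(256)])
    Sw = dist[MG].astype(float)                      # look up saliency per position
    return Sw
```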
where ρ is the fusion factor and S is the spatial weighting matrix after saliency joint weighting. Finally, we obtained the saliency joint-weighted (SJW) descriptor from the final spatial weighting matrix S and the channel weighting matrix C. We multiplied S with the three-dimensional feature map obtained by convolution and then summed the feature map of each channel to obtain a one-dimensional image descriptor F.
where f_k is the feature value obtained from the kth feature map after spatial weighting and F is the vector of all f_k. We then weighted the descriptor F with the channel weight vector obtained by Equation (17) above, so that the descriptor pays more attention to the important feature channels, forming the final saliency joint-weighted descriptor Sjw.
where g_k is the feature value obtained from the kth feature map after channel weighting and Sjw is the vector of all g_k.
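Putting the pieces together, a hedged sketch of the SJW descriptor assembly (the additive form of the ρ-fusion and the final L2 normalization are assumptions about details the extracted equations omit):

```python
import numpy as np

def sjw_descriptor(X, M, Sw, C, rho=100.0):
    """Sketch of assembling the saliency joint-weighted (SJW) descriptor.

    X: feature maps (K, H, W); M: normalized spatial weights (H, W);
    Sw: saliency weights (H, W); C: channel weights (K,).
    """
    Sw = Sw / (Sw.sum() + 1e-6)                  # normalize the saliency weights
    S = M + rho * Sw                             # fused spatial weighting matrix
    F = (X * S[None, :, :]).sum(axis=(1, 2))     # f_k: spatially weighted sum per channel
    Sjw = F * C                                  # g_k: channel-weighted descriptor
    return Sjw / (np.linalg.norm(Sjw) + 1e-6)    # L2-normalize for cosine similarity
```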

Experiments and Evaluation
In this section, to verify the rationality of our designed convolution feature aggregation scheme, we conducted many comparative experiments. First, we tested the effectiveness of the two modules separately, and then, we fused the two modules according to a certain scale factor. The experimental outcomes revealed that our proposed global saliency-weighted convolution feature achieves state-of-the-art performance.

Data Set
The retrieval data set was used to train and test the retrieval algorithm. In this section, we briefly introduce the data set used in the experiment.

1.
Oxford5k data set [33]: This data set was collected from Flickr and contains 11 Oxford landmarks, with a total of 5063 images.

2.
Paris6k data set [35]: Paris6k was also collected from Flickr. It covers 11 categories of Paris buildings and includes 6412 images in total, with 5 query regions for each class.

3.
Holidays data set [36]: The Holidays data set consists of 500 groups of similar images; each group has a query image for a total of 1491 images.

4.
Revisited Oxford and Paris [37]: The Revisited-Oxford (ROxford) and Revisited-Paris (RParis) data sets consist of 4993 and 6322 images, respectively; each data set has 70 query images. They revisit the oxford5k and paris6k data sets by revising the annotations and adding images. The evaluation protocol has three difficulty levels: easy, medium, and hard.

Test Environment and Details
Our experiments were run on a TITAN XP with 11 GB of graphics processing unit (GPU) memory (a graphics card consists of a GPU computing unit, video memory, etc.; the video memory can be regarded as storage space, similar to main memory). We used the PyTorch deep learning framework to construct the VGG16 model [3], pretrained on ImageNet [38]; hence, no training was needed. For testing, we used the VGG16 model to extract convolutional feature maps from the conv5 layer, with 512 channels in total. For our newly designed algorithms, the first module (GR-MAC) required 3 min 58 s and 2012 MB of video memory; the second module (SJW) required 19 min 13 s and 2138 MB of video memory; and the entire algorithm (GSW) required 19 min 28 s and 2629 MB of video memory. The tests were conducted on the oxford5k [33], paris6k [35], Holidays [36], oxford105k, paris106k, ROxford, and RParis data sets [37] (5063, 6412, 1491, 4993, and 6322 images for oxford5k, paris6k, Holidays, ROxford, and RParis, respectively). As for image size, we followed Crow [26], which keeps the original size of images as input. After parameter analysis, we set the best hyperparameters in all experiments: L = 3, n = 0.3, ρ = 100, α = 0.6, and β = 0.4. In addition, we used the same network model, image descriptor dimensions, and input image size throughout. Similarity was calculated with cosine similarity. As general evaluation standards, we used mean average precision (mAP) [33] and precision.
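For concreteness, ranking database descriptors by cosine similarity, as used in our evaluation, can be sketched as:

```python
import numpy as np

def rank_by_cosine(query, database):
    """Rank database descriptors (N, D) by cosine similarity to a query (D,).

    Returns the indices sorted from most to least similar, plus the sorted
    similarity scores.
    """
    q = query / (np.linalg.norm(query) + 1e-6)
    db = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-6)
    sims = db @ q                       # cosine similarity per database item
    order = np.argsort(-sims)           # most similar first
    return order, sims[order]
```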
The average precision (AP) measures the quality of the learned model for each query; mAP measures the quality over all queries by averaging the obtained AP values. The range of mAP values is between 0 and 1. In the AP formula, k indexes the returned images that are relevant to the query image, b is the number of images returned during retrieval, and m is the number of images in the test data set that are relevant to the query image.
In the mAP formula, Q is the number of query images in a data set. Precision means that, given a specific number of returned images in image retrieval, we take the ratio of the number of correct images to the number of returned images. P@1 denotes the accuracy when one image is returned, while P@5 denotes the accuracy when five images are returned.
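The AP, mAP, and P@k measures described above can be sketched as follows (variable names follow the text: b returned images, m relevant images; the exact interpolation convention is an assumption):

```python
def average_precision(ranked_relevant, m):
    """AP for one query: ranked_relevant is a boolean list over the ranked
    returned images; m is the number of images relevant to the query."""
    hits, total = 0, 0.0
    for b, rel in enumerate(ranked_relevant, start=1):
        if rel:
            hits += 1
            total += hits / b          # precision at each relevant position
    return total / m

def mean_average_precision(aps):
    """mAP: mean of the per-query APs over the Q queries."""
    return sum(aps) / len(aps)

def precision_at_k(ranked_relevant, k):
    """P@k: fraction of the top-k returned images that are relevant."""
    return sum(ranked_relevant[:k]) / k
```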

Testing the Two Modules Separately
Next, we conducted various comparative experiments on the two proposed aggregation modules. Since whitening generally reduces the influence of noise, we applied whitening without dimension reduction to increase precision. To further compare our approach with R-MAC [25] and Crow [26], we evaluated the performance of the generalized R-MAC pooling module and the saliency joint-weighted deep convolutional feature module on the Oxford5k, Paris6k, Holidays, ROxford, and RParis data sets with VGG16. Table 1 illustrates the outcomes of the R-MAC [25] and generalized R-MAC feature descriptors. From this table, we find that, when tested on VGG16, the mAP on Paris6k, Oxford5k, and Holidays using generalized R-MAC (GR-MAC) was 83.63%, 70.35%, and 89.58%, respectively, and the best retrieval outcomes were obtained. Table 1. Performance (mean average precision (mAP)) comparison between regional maximum activation of convolutions (R-MAC) [25] and generalized R-MAC (GR-MAC) on the Paris6k, Oxford5k, and Holidays data sets. The best result is highlighted in bold. Table 2 shows the results of the R-MAC [25] and GR-MAC feature descriptors. From this table, we conclude that, when tested on VGG16, the mAP on ROxford-Easy and RParis-Easy using GR-MAC was 63.33% and 80.15%, respectively; the P@1 and P@5 on ROxford-Easy and RParis-Easy using GR-MAC were 85.29% and 78.24%, and 95.71% and 93.71%, respectively; and the mAP on ROxford-Medium and RParis-Medium using GR-MAC was 42.26% and 63.86%, respectively. The P@1 and P@5 on ROxford-Medium and RParis-Medium using GR-MAC were 84.29%, 70.19%, 95.71%, and 96.29%, respectively, and the best retrieval results were obtained. Table 2. Performance (mAP) comparison between regional maximum activation of convolutions (R-MAC) [25] and generalized R-MAC (GR-MAC) on the ROxford and RParis data sets. The best result is highlighted in bold.
After a series of comparative experiments with R-MAC [25] and our improved generalized R-MAC (GR-MAC) on five data sets, analysis of the experimental data in Tables 1 and 2 makes it clear that our algorithm outperforms R-MAC [25], verifying the effectiveness of our method. Whereas R-MAC takes only the maximum value in each divided region, we take the maximum value while also considering the influence of the other response values in the region, so that our image feature descriptor obtains better retrieval results. Comparing the test results, we find that, on the more difficult Oxford5k, ROxford, and RParis data sets, our improved GR-MAC is clearly more accurate, and on the simpler Paris6k and Holidays data sets, our method also provides a small improvement. Table 3 indicates the outcomes of the cross-dimensional weighting (Crow) [26] and saliency joint weighting (SJW) methods. From this table, we learn that, when tested on VGG16, the mAP on Paris6k, Oxford5k, and Holidays using SJW was 79.41%, 69.62%, and 89.74%, respectively, which are the best retrieval results. Table 4 indicates the outcomes of the Crow [26] and saliency joint-weighted feature descriptors. From this table, we learn that, when tested on VGG16, the mAP on ROxford-Easy and RParis-Easy using SJW was 63.09% and 78.68%, respectively; the P@1 and P@5 on ROxford-Easy and RParis-Easy using SJW were 88.24% and 75.00%, and 97.14% and 93.71%, respectively. The mAP on ROxford-Medium and RParis-Medium using SJW was 47.36% and 60.90%, respectively; the P@1 and P@5 on ROxford-Medium and RParis-Medium using SJW were 87.14% and 72.57%, and 97.14% and 96.00%, respectively, and the best retrieval results were obtained. In addition, for the adequacy of the experiments, we conducted a comparative test of Crow [26] and our improved SJW method on five data sets.
By analyzing the experimental data in Tables 3 and 4, we show that our algorithm is better than Crow [26] on the five data sets, which verifies the effectiveness of our method. Compared with the feature descriptor obtained by Crow's [26] spatial weighting and channel weighting, the saliency weighting matrix obtained by the saliency detection algorithm is more conducive to distinguishing the salient region, and the image feature descriptor obtained by the improved channel weighting matrix achieves better results in retrieval.

Method
We visualized the heat maps for the improved saliency weighting (SW) and saliency joint weighting (SJW) and found that the saliency weighting matrix Sw pays more attention to the salient regions that can distinguish different buildings, such as the edge information of the overall shape of a building or the detailed information of a window grille.
These salient regions are precisely the key regions in general image retrieval (such as on building data sets). As shown in the visualization in Figure 5, column b, the key region is clearly brighter than the others. The representational ability of the feature maps strongly influences the accuracy of image retrieval. Therefore, to improve the expressive ability of the feature map, we improved the feature map from the original convolutional layer from two perspectives. From a spatial perspective, we used the saliency weighting matrix Sw to highlight the feature values of the salient region while fusing the feature maps of the convolutional layer proportionally. As shown in column c of Figure 5, this saliency joint weighting scheme makes the improved feature map pay more attention to the salient region without losing the overall building information obtained by the deep neural network. From a channel perspective, we used the nonzero counts and channel variance to measure the importance of different channels and applied channel weighting to obtain the final saliency joint-weighted (SJW) descriptor Sjw.
As shown in Figure 6, we randomly selected images and obtained their original feature maps, saliency-weighted (SW) maps, and saliency joint-weighted (SJW) heat maps. Comparing the heat maps, we found that our saliency joint weighting method makes the target region more prominent without ignoring the information of other regions. (a) We randomly selected three original images from the Oxford5k data set [33]. (b) The SW map is the saliency-weighted map, drawn from the results calculated by Equations (16) and (17). After visualization, we found that the saliency-weighted map focuses on the target region of the image. (c) The SJW map is the saliency joint-weighted map: the obtained saliency-weighted map was fused with the original features using a scale factor to obtain the final saliency joint-weighted map, as shown in Equation (22). From the above results, it can be seen that our saliency joint-weighted (SJW) feature focuses on the key region for retrieval without losing the information of other parts. We also provide a retrieval-result analysis to illustrate the proposed algorithm. As shown in Figure 7, the query image (the image in the blue box) corresponds to the returned images of each module's retrieval results. The first row shows the retrieval results of GR-MAC, the second row shows those of SJW, and the third row shows those of GSW. Comparing the rows, the differences arise because the two modules focus on different information. GR-MAC pays more attention to global information, and the first row returns buildings that are more similar overall. SJW pays more attention to salient local details, and the second row can return partially obscured details of the building (such as the dome-shaped top).
GSW can return both images with global information and images with more attention to salient details.

Feature Aggregation Method Comparison
We conducted retrieval experiments on the global saliency-weighted (GSW) descriptor. Query expansion (QE) can further enhance the performance of image retrieval. In this section, we use our method to calculate feature descriptors and then verify the benefit of query expansion on the Oxford5k, Paris6k, and Holidays data sets. Table 5 illustrates the outcomes of different feature descriptors before and after query expansion. From this table, we find that, when tested on VGG16, the SJW module obtains more discriminative global features for data sets with more salient regions; for example, the improvement on Oxford5k was significant. The accuracy of GR-MAC in Table 5 is 70.35%, and after integrating the SJW module, it reaches 72.90%. Since many images in the Paris6k data set do not have obvious salient regions, our GR-MAC algorithm, which is more conducive to obtaining global information, produced better results there than the GSW algorithm, which incorporates saliency weighting. The mAP on Paris6k, Oxford5k, and Holidays using GSW (ours) + QE was 89.55%, 79.87%, and 91.51%, respectively; clearly, the best retrieval results were obtained. From Table 6, we infer that the results for the features generally increase when query expansion is applied on diverse data sets.
From the experimental results, we conclude that the post-processing QE operations previously used to improve retrieval accuracy also apply well to our proposed algorithm. Whether for the feature descriptor obtained from the GR-MAC or SJW module alone, or for the descriptor obtained by fusing the two modules, the QE operation in the subsequent retrieval process greatly improves our retrieval results.
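As an illustration of the QE step (average query expansion is one common variant; the choice of top_k here is an assumption, not the paper's setting):

```python
import numpy as np

def average_query_expansion(query, database, top_k=10):
    """Sketch of average query expansion (AQE): average the query descriptor
    with the descriptors of its top-k retrieved neighbors, re-normalize,
    and use the result as a new query."""
    db = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-6)
    q = query / (np.linalg.norm(query) + 1e-6)
    order = np.argsort(-(db @ q))[:top_k]     # indices of the top-k neighbors
    q_exp = q + db[order].sum(axis=0)         # combine query with its neighbors
    return q_exp / (np.linalg.norm(q_exp) + 1e-6)
```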
To prove the effectiveness and superiority of our proposed global saliency-weighted convolution feature algorithm, we compared our experimental results with those of the latest feature aggregation algorithms. We not only conducted experiments using the original image representations but also compared the results after query expansion. The outcomes are displayed in Tables 6 and 7. Table 6 reveals that our method alone obtains the best results and that query expansion improves them further; in Table 6, when no experiment under a given condition is reported in the literature for a method, we mark the missing value with a dash. Table 7 likewise indicates that our method obtains the best results and that query expansion improves them further. Our algorithm not only provided promising results on the classic data sets but also greatly improved results on the revisited building data sets released in 2018, further proving its generalization ability. The extensive results in Tables 6 and 7 show that the two proposed modules promote each other: the feature descriptors obtained by fusing the two modules achieve the best accuracy in subsequent retrieval. Compared with previous methods, our method significantly improves the retrieval results on the five data sets, and an even higher accuracy can be achieved after the QE operation. Figure 8 visualizes the embedding of our global saliency weighting (GSW) method using Barnes-Hut t-SNE (t-distributed stochastic neighbor embedding) [41] on the oxford5k data set [33], which shows that our method groups semantically similar images.

Discussion
From the experiments above, key findings emerge: 1. The proposed generalized R-MAC (GR-MAC) algorithm produces a better retrieval effect than the regional maximum activation of convolutions (R-MAC) [25] algorithm by capturing more effective information in the multiple regions of R-MAC [25].
2. Our proposed saliency joint weighting (SJW) algorithm produces better spatial and channel weighting matrices through saliency detection. Compared with the previous cross-dimensional weighting (Crow) [26] obtained by spatial weighting and channel weighting, our weighting method effectively improves retrieval performance.
3. We fused the feature descriptors obtained by GR-MAC and SJW as the final retrieval feature and found that the two proposed modules promote each other; this fusion achieved the best retrieval effect. Compared with previous algorithms, our method provided significantly improved results on multiple data sets.
These two aggregation modules have two limitations: they may be suitable for instance retrieval but not for other types of image retrieval that require higher saliency precision, and their computational cost is not low. In the future, we will optimize them further to reduce the computational burden of retrieval and study a more versatile aggregation method.


Parameter Analysis
In this section, we test the primary parameters of our algorithm on the oxford5k data set [33]. We used the same evaluation criterion, mAP, on the previous feature pooling methods (SPoC [23], MAC [24], R-MAC [25], Crow [26], etc.) to measure the accuracy of retrieval. We tested the best parameters on oxford5k and applied them to the other four data sets.
In the calculation of the global saliency-weighted aggregated convolution features, we used a total of five adjustable parameters: the scale L of the R-MAC [25] region division, the fraction n of channels selected from the three-dimensional feature map, the proportion ρ of the fused saliency weighting matrix, and the scale factors α and β that combine the generalized R-MAC module and the saliency joint-weighted deep convolutional feature module. By adjusting these parameters, we selected the best values to obtain the final feature descriptor. To fuse the descriptors produced by the generalized R-MAC module and the saliency joint-weighted deep convolutional feature module, we first performed L2 normalization on each of them and then performed L2 normalization again after fusion to obtain the final feature descriptor, facilitating the subsequent whitening operations.
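The fusion with L2 normalization described above can be sketched as:

```python
import numpy as np

def fuse_descriptors(gr_mac, sjw, alpha=0.6, beta=0.4):
    """Fuse the GR-MAC and SJW descriptors as described: L2-normalize each,
    combine them with the scale factors alpha and beta, then L2-normalize
    the fused result (the best factors found in the paper are 0.6 and 0.4)."""
    g = gr_mac / (np.linalg.norm(gr_mac) + 1e-6)
    s = sjw / (np.linalg.norm(sjw) + 1e-6)
    gsw = alpha * g + beta * s
    return gsw / (np.linalg.norm(gsw) + 1e-6)
```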
Firstly, we tested the value of the scale parameter L, which determines the number of regions R divided in the GR-MAC algorithm. When the value of L is too small, it does not reflect the advantage of obtaining more information on multiple scales. When the value of L is too large, there will be large amounts of repeated information collected and this information may be the background region and may not necessarily all contribute to the image. Thus, a larger L is not necessarily better.
We aimed to select the optimal L so that the GR-MAC algorithm could obtain the optimal number of regions R. Table 8 reports the results for parameter L. To obtain the best feature descriptors, we tested values of L between 1 and 4. The mAP was highest at L = 3, and the maximum mAP value is bolded in Table 8; we therefore chose L = 3 to obtain the final generalized R-MAC feature descriptor. We then tested the parameter n described in Section 3.2, which represents the fraction of feature maps selected in the saliency joint-weighted deep convolutional feature module. We sampled n uniformly and experimented on the Oxford5k data set using a pretrained VGG16 network; the results are shown in Table 9. We aimed to choose a moderate n: too small a value loses too much feature channel information, while too large a value makes the selection an ineffective operation. We aimed to choose the most appropriate n so that the chosen feature channels contain enough information after summation; at the same time, because these channels have large variance, they are more conducive to distinguishing the target region from the background region.
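For reference, the standard R-MAC region sampling of Tolias et al. determines how many regions R a given scale L yields; we assume GR-MAC inherits this grid, and details such as the 40% overlap target follow the original R-MAC:

```python
import math
import numpy as np

def rmac_regions(W, H, L=3):
    """Sketch of the standard R-MAC region sampling: at scale l = 1..L,
    square regions of side ~2*min(W,H)/(l+1) are placed on a uniform grid
    with roughly 40% overlap. Returns (x, y, w, h) tuples on the
    feature-map grid."""
    ovr = 0.4                                  # desired overlap between regions
    steps = np.arange(2, 8, dtype=float)       # candidate region counts per row
    w = min(W, H)
    b = (max(H, W) - w) / (steps - 1)
    # pick the step count whose overlap is closest to the target
    idx = int(np.argmin(np.abs((w ** 2 - w * b) / w ** 2 - ovr)))
    Wd = idx + 1 if H < W else 0               # extra regions along the long side
    Hd = idx + 1 if H > W else 0
    regions = []
    for l in range(1, L + 1):
        wl = int(math.floor(2 * w / (l + 1)))  # region side length at scale l
        wl2 = int(math.floor(wl / 2 - 1))
        bW = (W - wl) / (l + Wd - 1) if l + Wd > 1 else 0
        bH = (H - wl) / (l + Hd - 1) if l + Hd > 1 else 0
        cenW = np.floor(wl2 + np.arange(l + Wd) * bW) - wl2
        cenH = np.floor(wl2 + np.arange(l + Hd) * bH) - wl2
        for x in cenW:
            for y in cenH:
                regions.append((int(x), int(y), wl, wl))
    return regions
```

On a square feature map, scale l contributes l × l regions, so L = 3 gives 1 + 4 + 9 = 14 regions, which is why a larger L quickly multiplies the amount of (partly repeated) information collected.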
From Table 9, we infer that the mAP reaches its maximum at n = 30%; the maximum mAP value is bolded. We chose n = 30% to obtain the final saliency joint-weighted feature descriptor.
We then tested the parameter ρ described in Section 3.2, which represents the scale factor with which the saliency weighting matrix is aggregated into the spatial weighting matrix in the saliency joint-weighted deep convolutional feature module. If ρ is too large, the descriptor may focus too much on the target region and ignore background regions containing a certain amount of information; if ρ is too small, it may not focus on the desired target information. Using the VGG16 network pretrained on ImageNet, we experimented with ρ ranging from 1 to 500; the results are listed in Table 10. From Table 10, we conclude that the mAP reaches its maximum at ρ = 100; the maximum mAP value is bolded. We chose ρ = 100 to obtain the final saliency joint-weighted feature descriptor.
Finally, the obtained generalized R-MAC feature descriptor and saliency joint-weighted feature descriptor were tested for the fusion scale factors α and β. We used the VGG16 network pretrained on ImageNet to test the Oxford5k data set, and the results are shown in Table 11.
We used the fusion factors α and β to fuse the image feature descriptors obtained by the GR-MAC algorithm and SJW algorithm to obtain the final GSW feature for retrieval. The feature descriptors obtained by GR-MAC focus more attention on the global information of feature maps on multi-scales. Under the influence of the saliency algorithm, the feature descriptor obtained by SJW focuses on the saliency region that is most conducive to image retrieval while focusing on the target region and ignoring the background region. The two descriptors have their advantages; consequently, we hoped to find the most effective fusion ratio and to obtain the feature descriptor with the best retrieval effect. From Table 11, the results demonstrated that mAP obtains the maximum value at α = 0.6 and β = 0.4. The maximum mAP value is bolded. We chose α = 0.6 and β = 0.4 to obtain the final global saliency weighting feature descriptor.
As shown in Table 12, we used AlexNet to perform module ablation experiments on the oxford5k and paris6k data sets separately, to illustrate that our proposed method remains applicable to other models. Together, the findings confirm that the feature maps obtained by different pretrained models have a certain impact on the retrieval effect, but for feature maps obtained with the same network, the feature descriptors produced by our method outperform those of the previous algorithms.

Conclusions
In this paper, we constructed two effective aggregation and improvement methods for deep convolution features and then merged them; the final retrieval accuracy reached state-of-the-art levels. We improved the classic regional maximum activation of convolutions (R-MAC) [25] method and proposed a generalized R-MAC (GR-MAC) aggregation method, which allows the descriptor to obtain richer global information instead of being limited to a single maximum value. The saliency joint weighting (SJW) module not only provides an intuitive view of the convolutional-layer features but also yields SJW features that pay more attention to the salient region of the image without losing the overall building information, improving retrieval performance beyond that of current methods. Fusing the two proposed improved modules produced even better retrieval performance on multiple building data sets.
Collectively, neither of the proposed modules requires network training. Being nonparametric and easy to implement, the two aggregation modules can be easily embedded in other deep learning tasks. Future research should consider the potential effects of these two aggregation modules on more deep learning tasks, such as target detection and few-shot learning.