Distribution Structure Learning Loss (DSLL) Based on Deep Metric Learning for Image Retrieval

The massive number of images demands highly efficient image retrieval tools. Deep distance metric learning (DDML) learns image similarity metrics in an end-to-end manner based on convolutional neural networks and has achieved encouraging results. The loss function is crucial in DDML frameworks. However, we found limitations in existing losses: when learning the similarity of positive and negative examples, current methods aim to pull positive pairs as close as possible and to separate negative pairs to equal distances in the embedding space. Consequently, the structure of the data distribution is ignored. In this work, we propose the distribution structure learning loss (DSLL) algorithm, which aims to preserve the geometric information of images. To achieve this, we first propose a metric distance constraint for highly matching images that preserves their internal similarity structure. Second, we introduce an entropy weight based on the structural distribution to set the weight of each representative negative sample. Third, we incorporate these weights into the learning-to-rank process so that the negative samples preserve the consistency of their structural distribution. Finally, we present comprehensive experimental results on three popular landmark building datasets and demonstrate that our method achieves state-of-the-art performance.


Introduction
With the popularity of the Internet and smart devices, the number of pictures stored on the network has exploded, and image retrieval has therefore become a common concern. One big challenge in image retrieval is to extract robust and discriminative features from numerous images. Traditional methods depend on handcrafted features, including global features such as spectral (color), texture, and shape features, as well as aggregated features like the bag of words (BoW) [1], the vector of locally aggregated descriptors (VLAD) [2], and the Fisher vector (FV) [3]. Designing such features is time-consuming and requires substantial professional knowledge.
The advancement of deep learning has pushed forward the development of content-based image retrieval [4]. Deep convolutional neural network (CNN) features are highly abstractive and contain high-level semantic information, and they have outperformed traditional handcrafted features in image retrieval [5,6]. Furthermore, deep features are learned automatically from the data. The main contributions of this work are as follows:

1.
We propose a novel and effective metric learning method. To the best of our knowledge, it introduces the concepts of structure preservation and structural consistency for the first time. Structure preservation deals with the intraclass distribution of positive samples, while structural consistency accounts for the structural distribution information of negative samples.

2.
To avoid shrinking all feature vectors of the positive samples into the same point, we set up a hypersphere for every class. The method uses the hardest positive sample to train the network by adjusting its distance from the query sample, which effectively preserves the structure of the positive samples.

3.
We add an entropy weight based on the structural distribution to the loss function. The algorithm considers the distribution characteristics of the negative samples, captures the feature structure, and addresses the relative similarity of each sample, which has been ignored in past works.

The rest of this paper is organized as follows. We first review some deep metric learning methods and briefly introduce their existing applications in image retrieval in Section 2. Section 3 describes the algorithm framework and its details. Section 4 discusses the major significance and evaluates our newly proposed network architecture through a series of targeted experiments, after which conclusions and future work are drawn in the final section.

Image Retrieval
Search accuracy and search speed play a key role in image retrieval: search accuracy relies on the image features, while search speed is largely determined by indexing. In the early stages of image retrieval, image features were acquired by local patch descriptors (e.g., SIFT [26] and SURF [27]). Image-level signatures were then obtained with BoW by aggregating [28] the local features [24,29].
The inverted file built on these features is further used as an efficient index structure. FV and average pooling (SPoC) [30] were then used to aggregate the local features of an image into a global feature [30]. For feature aggregation, Tolias et al. [31] proposed maximum activations of convolutions (MAC) descriptors extracted from many multi-scale overlapping regions of the last convolutional feature map. The region descriptors are L2-normalized, whitened by principal component analysis (PCA) [32], L2-normalized again, and finally aggregated into a global signature called regional maximum activations of convolutions (RMAC) [33].
The above methods all use fixed, pre-trained CNNs. However, these CNNs are trained for image classification (a case in point being the 1000 classes of ImageNet), which matches neither the aggregation method nor the retrieval task, and thus tends to perform sub-optimally in image retrieval. To solve this problem, Radenovic et al. [34] proposed adding a MAC layer after the last convolutional layer of VGG [35] or ResNet [36] and training the network with a Siamese structure [34]. In [37], a trainable generalized mean (GeM) pooling layer replaces the MAC layer, which greatly improves retrieval accuracy. The recent NetVLAD consists of a standard CNN followed by a VLAD layer that aggregates the final convolutional features into a fixed-dimensional signature and whose parameters can be trained by backpropagation [38]. To decrease the co-occurrences of certain weights and handle the issue of over-counting, Jégou and Chum [32] considered it essential to whiten the data representation. In the above process, whitening is learned generatively and without supervision through PCA conducted on an independent dataset [24,39]. Radenović et al. [37] instead proposed a discriminative whitening transform, learned with the same acquisition procedure from training data based on a 3D model. In image retrieval, query expansion based on CNN global descriptors [13,37] can be used to improve search accuracy.

Deep Metric Learning
Metric learning aims to learn a nonlinear projection function that maps an image from the pixel level to a discriminative space, where samples from the same category or with high similarity are drawn together and samples from different categories or with low similarity are pushed apart. Recent research on deep metric learning has played a catalytic role in the visual tasks of face recognition [40], pedestrian recognition [41], image set classification [42], fine-grained retrieval [43], image retrieval [44], target tracking [45], multimedia retrieval [46], and visual search [47,48]. We summarize the latest deep metric learning methods in the following subsections.

Contrastive Loss
By minimizing the loss function, the gap (distance) between positive pairs is gradually narrowed, while the gap (distance) between negative pairs is gradually widened. This loss operates on pairs of samples. Let $I_a, I_b \in I$ be a pair of input images from the dataset. Through a forward pass of the network, we obtain their normalized feature vectors $f(I_a)$ and $f(I_b)$; $d_{I_a,I_b} = \lVert f(I_a) - f(I_b) \rVert_2$ is the Euclidean distance between the two feature vectors; $Y = 1$ if $I_a$ and $I_b$ are considered similar, and $Y = 0$ if they are considered dissimilar. The loss is

$$L_{\mathrm{con}}(I_a, I_b) = Y\, d_{I_a,I_b}^{2} + (1 - Y)\,\big(\alpha - d_{I_a,I_b}\big)_{+}^{2},$$

where $(z)_+$ denotes $\max(z, 0)$ and $\alpha$ is a threshold parameter designed according to actual needs.
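As a concrete reference, the following is a minimal PyTorch sketch of this pairwise contrastive loss; the function name and the margin value alpha = 0.7 are our own illustrative choices, not taken from the paper.

```python
import torch

def contrastive_loss(f_a, f_b, y, alpha=0.7):
    """Pairwise contrastive loss. f_a, f_b: L2-normalized feature
    vectors of shape (batch, dim); y: 1.0 for similar pairs, 0.0 for
    dissimilar pairs; alpha: margin threshold; (z)_+ = max(z, 0)."""
    d = torch.norm(f_a - f_b, p=2, dim=1)                  # d_{Ia,Ib}
    pos = y * d.pow(2)                                     # pull similar pairs together
    neg = (1 - y) * torch.clamp(alpha - d, min=0).pow(2)   # push dissimilar pairs beyond alpha
    return (pos + neg).mean()
```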

Triplet Loss
In the feature space, the triplet loss pulls the anchor point and the positive sample as close together as possible and pushes the anchor point and the negative sample as far apart as possible, requiring the gap between the two distances to exceed a threshold $m$:

$$L_{\mathrm{tri}} = \sum_{(a,b,c) \in T} \big(d_{I_a,I_b} - d_{I_a,I_c} + m\big)_{+},$$

where $T$ is the set of triplets; $a$, $b$, and $c$ index the anchor, the positive sample, and the negative sample, respectively; $f$ is the embedding function; $d_{I_a,I_b} = \lVert f(I_a) - f(I_b) \rVert_2$ is the Euclidean distance between the feature vectors of the two pictures $I_a$ and $I_b$; $(z)_+$ denotes $\max(z, 0)$; and $m$ is a threshold parameter designed according to actual needs.
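A corresponding PyTorch sketch of the triplet loss over a batch of (anchor, positive, negative) triplets; the margin value m = 0.1 is an illustrative default.

```python
import torch

def triplet_loss(f_a, f_b, f_c, m=0.1):
    """Triplet loss over a batch of triplets (anchor a, positive b,
    negative c), each of shape (batch, dim); m is the margin."""
    d_ab = torch.norm(f_a - f_b, p=2, dim=1)  # anchor-positive distance
    d_ac = torch.norm(f_a - f_c, p=2, dim=1)  # anchor-negative distance
    return torch.clamp(d_ab - d_ac + m, min=0).mean()
```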

N-Pair Loss
The N-pair loss is designed to exploit all sample pairs in a mini-batch and to learn more discriminative representations based on the structural information between the data. Specifically, each tuple consists of one positive sample and $N-1$ negative samples from $N-1$ different categories, i.e., one negative sample per category, and the loss function can be expressed as follows:

$$L_{N\text{-pair}} = \frac{1}{N}\sum_{i=1}^{N}\log\Big(1 + \sum_{j \neq i}\exp\big(f(X_i)^{\top} f(X_j^{+}) - f(X_i)^{\top} f(X_i^{+})\big)\Big) + \lambda \lVert \eta \rVert_2^2,$$

where $\eta$ denotes the L2-regularized embedding vectors, $N$ is the number of selected sample classes, $X_i$ is the query image, $X_i^{+}$ is a positive sample for $X_i$, and $X_j^{+}$ ($j \neq i$) is a sample with low similarity to $X_i$.
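The following PyTorch sketch implements this N-pair-mc formulation; the regularization weight reg = 0.002 is an assumed value.

```python
import torch

def n_pair_loss(anchors, positives, reg=0.002):
    """N-pair-mc loss over a batch of N (anchor, positive) pairs, one
    pair per class. anchors, positives: (N, dim) un-normalized
    embeddings; reg: weight of the L2 regularization on embeddings."""
    n = anchors.size(0)
    sim = anchors @ positives.t()                 # sim[i, j] = <f(X_i), f(X_j^+)>
    diff = sim - sim.diagonal().unsqueeze(1)      # sim[i, j] - sim[i, i]
    diff = diff.masked_fill(torch.eye(n, dtype=torch.bool), float('-inf'))
    # log(1 + sum_{j != i} exp(diff[i, j])) via logsumexp with a zero column
    loss = torch.logsumexp(torch.cat([torch.zeros(n, 1), diff], dim=1), dim=1).mean()
    l2 = reg * (anchors.pow(2).sum(1).mean() + positives.pow(2).sum(1).mean())
    return loss + l2
```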

Lifted Structured Loss
Lifted structured loss exploits the advantages of training in the mini-batches of minibatch SGD: it uses randomly sampled image pairs or triplets, constructs training batches, and computes the loss over all pairs in the batch in a log-sum-exp formulation:

$$L = \frac{1}{2\lvert P \rvert}\sum_{(i,j) \in P}\Big(D_{I_i,I_j} + \log\Big(\sum_{(i,k) \in N}\exp\big(\alpha - D_{I_i,I_k}\big) + \sum_{(j,l) \in N}\exp\big(\alpha - D_{I_j,I_l}\big)\Big)\Big)_{+}^{2},$$

where $D_{I_i,I_k} = \lVert f(I_i) - f(I_k) \rVert_2$ is the Euclidean distance in the embedding space between samples $I_i$ and $I_k$, $\alpha$ is a threshold parameter designed according to actual needs, $P$ is the set of similar sample pairs, and $N$ is the set of dissimilar sample pairs.
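A PyTorch sketch of this log-sum-exp formulation over a mini-batch; the O(B²) loops are kept for clarity rather than speed, and the helper names are ours.

```python
import torch

def lifted_struct_loss(emb, labels, alpha=1.0):
    """Lifted structured loss. emb: (B, dim) embeddings; labels: (B,)
    class labels; alpha: margin threshold."""
    D = torch.cdist(emb, emb, p=2)                       # pairwise distances D_{i,k}
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # positive-pair mask
    loss, n_pos = emb.new_zeros(()), 0
    B = emb.size(0)
    for i in range(B):
        for j in range(i + 1, B):
            if not same[i, j]:
                continue
            # log-sum-exp over all negatives of i and of j
            neg = torch.cat([D[i][~same[i]], D[j][~same[j]]])
            J = D[i, j] + torch.logsumexp(alpha - neg, dim=0)
            loss = loss + torch.clamp(J, min=0).pow(2)
            n_pos += 1
    return loss / (2 * max(n_pos, 1))
```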

Proxy-NCA
By learning anchor points together with similarity and dissimilarity proxy points, a triplet loss over the proxies is optimized in place of the original triplet loss, and the proxy-based loss is a tight upper bound of the original loss. The Neighbourhood Components Analysis (NCA) loss attempts to use exponential weighting such that $I_a$ is closer to $I_b$ than to any element in the set $Z$. The loss can be expressed as follows:

$$L_{\mathrm{NCA}} = -\log\left(\frac{\exp\big(-D_{I_a,I_b}\big)}{\sum_{I_c \in Z}\exp\big(-D_{I_a,I_c}\big)}\right),$$

where $D_{I_a,I_b} = \lVert f(I_a) - f(I_b) \rVert_2$ is the Euclidean distance in the embedding space between samples $I_a$ and $I_b$, $Z$ is the negative data set, $I_b$ and $I_c$ are the proxies of the positive and negative points, and $I_a$ is the anchor.
Static proxy assignment, in which each class has one proxy, performs much better than dynamic assignment. However, the proxies in static assignment are learned during training and stored as the class vectors of the fully connected classification layer. Therefore, proxy-NCA does not retain the scalability of DML, because it must account for the number of classes.
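A PyTorch sketch of proxy-NCA with static proxy assignment (one learnable proxy per class); normalizing embeddings and proxies, and using squared distances, are our assumptions.

```python
import torch
import torch.nn.functional as F

def proxy_nca_loss(emb, labels, proxies):
    """Proxy-NCA loss. emb: (B, dim) anchor embeddings; labels: (B,)
    int64 class indices; proxies: (C, dim) learnable proxies, one per
    class (static proxy assignment)."""
    emb = F.normalize(emb, dim=1)
    proxies = F.normalize(proxies, dim=1)
    D = torch.cdist(emb, proxies, p=2).pow(2)     # squared distances to all proxies
    pos = D.gather(1, labels.unsqueeze(1)).squeeze(1)   # D(anchor, positive proxy)
    neg_mask = torch.ones_like(D, dtype=torch.bool).scatter_(1, labels.unsqueeze(1), False)
    neg = (-D).masked_fill(~neg_mask, float('-inf'))
    # -log( exp(-pos) / sum_neg exp(-D_neg) ) = pos + logsumexp(-D_neg)
    return (pos + torch.logsumexp(neg, dim=1)).mean()
```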
The proposed DSLL is a structured loss based on structural consistency and structural-distribution ranking consistency. It avoids two limitations of traditional methods by combining all meaningful data points and exploiting the inherent structural information. Figure 1 shows a comparison of our method with the other methods.

Figure 1. A graphical representation of the losses of different metric learning methods. Different categories are represented by different shapes. For the sake of simplicity, we only represent three classes. The anchor (query picture) is represented in brown. In the triplet loss [7], the anchor is only compared with one positive sample and one negative sample. N-pair-mc [20], proxy-NCA [10], and lifted struct [22] introduce one positive sample and multiple negative sample classes. N-pair-mc randomly selects one negative sample from each negative class. Proxy-NCA pushes the proxy of the negative sample set away from the anchor instead of pushing the negative samples themselves farther away. Lifted struct uses all negative sample categories. In contrast, our proposed distribution structure learning loss uses a representative positive sample and five negative samples with different similarities to the anchor.

Proposed Method
The design of our DSLL algorithm solves two basic problems of deep metric learning for image retrieval: (1) we maintain the structure of the samples highly correlated with the query sample, and (2) we preserve the consistency of the structural-similarity ordering of the samples with low correlation to the query sample. The first novelty of our approach is to set a margin between the positive sample and the given query picture so as to build a more informative structure for feature learning, thereby maintaining the structure of the positive-sample feature vectors. The entropy weight based on the structural distribution is the second major innovation of our approach. The basic idea is to calculate how the neighboring samples of each negative sample are distributed and to use this distribution information to control the subsequent distance-weighting operations. The entropy based on the structural distribution describes the relative relationship between the spatial distance distributions of the samples around the low-correlation negative samples. Hard negative samples provide higher discriminability between matching and non-matching pairs, and their structural information is more informative for recognition, so they are given higher weights. In this section, we elaborate on the proposed DSLL for image retrieval, including the loss function and the fine-tuned architecture of the two-branch network, and then explain the principle of the loss function.

CNN Network Architecture
To obtain the feature vectors of images, we constructed a CNN network. As shown in Figure 2, we selected the convolutional layers of VGG (in the dashed box in Figure 2); the network is fully convolutional and can therefore extract features regardless of image size. The last pooling layer of the VGG convolutional part is replaced by generalized-mean pooling (GeM). The feature map obtained by the convolutions is vectorized by the pooling operation, the Lw whitening operation is then performed, and the result is finally L2-normalized. The VGG network is fine-tuned. The AlexNet [49] and ResNet101 [50] frameworks (pre-trained on ImageNet) used in our experiments were formed by replacing the VGG part (in the dashed box) in Figure 2, with the rest of the network connected in the same way as for VGG. Again, we removed the fully connected layers and replaced the last pooling operation with GeM pooling.
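A minimal PyTorch sketch of the Figure 2 pipeline, assuming a VGG16 trunk with its final max-pooling removed, GeM pooling, and L2 normalization; the initial exponent p = 3 is an assumed value, and the separately learned Lw whitening is omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class GeM(nn.Module):
    """Generalized-mean pooling: ((1/|X|) sum x^p)^(1/p).
    p = 1 gives average pooling (SPoC); large p approaches MAC."""
    def __init__(self, p=3.0, eps=1e-6):
        super().__init__()
        self.p = nn.Parameter(torch.tensor(p))  # p is learned with the network
        self.eps = eps

    def forward(self, x):                       # x: (B, C, H, W) feature map
        x = x.clamp(min=self.eps).pow(self.p)
        return F.avg_pool2d(x, (x.size(-2), x.size(-1))).pow(1.0 / self.p)

class RetrievalNet(nn.Module):
    """VGG16 convolutional trunk + GeM pooling + L2 normalization."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True)
        self.features = vgg.features[:-1]       # drop the final max-pool
        self.pool = GeM()

    def forward(self, x):
        v = self.pool(self.features(x)).flatten(1)   # (B, 512) global descriptor
        return F.normalize(v, p=2, dim=1)
```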

The Architecture of Network Training
As shown in Figure 3, the training part of the network is composed of multiple CNNs sharing the same weights. The CNN used here is the network constructed in Figure 2. A positive sample is highly similar to the query image and belongs to the same category, while a negative sample comes from a different category. The multiple identical networks take the input images and output the corresponding feature vectors. We combine the feature vectors of the negative samples with their obtained ordering, and then combine the query image with the feature vectors of the positive samples to train the network.


Network Evaluating Architecture
In the test part of this article, we build the network framework shown in Figure 4. First, we rescale the query image and the images in the database to different sizes through multi-scale processing, then feed the multiple inputs to the CNN network built as in Figure 2, and form a single descriptor by combining the global descriptors from the multiple scales. We compute the Euclidean distance between the query feature vector and each feature vector in the image library, sort the images by similarity, and obtain the top-ranked images. Results that do not match the query image are discarded; the remaining results are summed with the original query descriptor and re-normalized.
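A sketch of the evaluation-time multi-scale descriptor described above; the scale factors (1, 1/√2, 1/2) and averaging as the combination rule are our assumptions.

```python
import torch
import torch.nn.functional as F

def multiscale_descriptor(net, image, scales=(1.0, 0.707, 0.5)):
    """Extract a global descriptor at several image scales, average
    the per-scale descriptors, and re-normalize (Figure 4 pipeline).
    net: the Figure 2 network; image: (1, 3, H, W) tensor."""
    descs = []
    for s in scales:
        im = F.interpolate(image, scale_factor=s, mode='bilinear',
                           align_corners=False)
        descs.append(net(im))                  # (1, dim), L2-normalized
    v = torch.stack(descs).mean(0)             # combine scales into one descriptor
    return F.normalize(v, p=2, dim=1)
```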


Distributed Structure Learning Loss
We aim to learn a discriminant function L_m and train the network so that, in the feature space, positive sample pairs exhibit higher similarity and negative sample pairs lower similarity.
Based on the pairwise similarity of past works [26], our goal is to keep the positive samples within a predetermined structure-preserving threshold boundary and to separate the positive and negative samples with the thresholds $\beta$ and $e^{\alpha_n}\tau$. To achieve this, we adopt a Siamese-based architecture and train the network with two branches. The training input consists of image pairs $(q, i)$ and labels $y_{(q,i)} \in \{0, 1\}$, where $y_{(q,i)} = 0$ if $q$ and $i$ are non-matching and $y_{(q,i)} = 1$ if they are matching. Given an image $X_i$, our goal is to pull the positive sample within the structure-preserving threshold boundary $\beta$ while pushing each negative sample beyond the boundary $\tau$ at a distance modulated by its $e^{\alpha_n}$ value, where $\beta$ and $\tau$ are the two boundary thresholds. We choose the contrastive loss as the basis of a loss function built on the consistency of distribution-structure preservation and structural similarity, applied to matching and non-matching pairs:

$$L(q, i) = y_{(q,i)}\,\big(d_{q,i} - \beta\big)_{+}^{2} + \big(1 - y_{(q,i)}\big)\,\omega\,\big(e^{\alpha_n}\tau - d_{q,i}\big)_{+}^{2},$$

where $y_{(q,i)} = 1$ if $q$ is similar to $i$ and $y_{(q,i)} = 0$ otherwise, $d_{q,i} = \lVert f(X_q) - f(X_i) \rVert_2$ is the Euclidean distance between the two points, and $\omega$ is the structure-distribution entropy weight defined below.
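A minimal sketch of this pairwise term under our reading of the loss above; the weight w and rank factor exp_alpha come from the entropy weighting of Equation (9), which the text does not fully specify, so their computation is left to the caller, and beta = 0.15 and tau = 1.25 follow the values chosen in Section 4.

```python
import torch

def dsll_pair_loss(f_q, f_i, y, w, exp_alpha, beta=0.15, tau=1.25):
    """Sketch of the DSLL pairwise term. f_q, f_i: L2-normalized
    descriptors (batch, dim); y: 1 matching / 0 non-matching;
    w: structure-distribution entropy weight of the negative;
    exp_alpha: rank-dependent factor e^{alpha_n} scaling the negative
    margin. Positives are kept within beta rather than collapsed."""
    d = torch.norm(f_q - f_i, p=2, dim=1)
    pos = y * torch.clamp(d - beta, min=0).pow(2)
    neg = (1 - y) * w * torch.clamp(exp_alpha * tau - d, min=0).pow(2)
    return (pos + neg).mean()
```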
Selection of positive images: several sets of randomly selected images from positive image pairs were prepared during the training process. The annotated positive image pairs from the training dataset were treated as the positive images of the training sets. Following the method proposed in [37], each positive image was randomly selected from a set of candidate images rather than from a pool of images captured at a similar camera position. The selected image shares sufficient co-observed points with the query image and does not exhibit extreme scale changes; in Equation (8), the scale change between the two images is reflected by scale(i, q). The harder matching examples selected by this method are guaranteed to depict the same object, and the method helps to introduce more diverse viewpoints.
Selection of negative images: we calculate the Euclidean distance between the query image and the feature vectors of the images in the dataset to obtain clusters at different distances from the query image, and we select negative samples from the low-correlation clusters, which differ from the query image.
The clusters obtained by computing the Euclidean distances are shown in Figure 5, where panel q is the query image and panels a-f belong to negative clusters that lie far from the query image. Suppose we want to select five negative examples. We first consider panel a: if it is neither in the query cluster q nor in a positive-example cluster, panel a is listed as a low-correlation image in the input set of query q. Similarly, panel b is also marked as a low-correlation image in the input set. For image c, although its feature vector is far from that of the query image, panels c and b belong to the same labeled cluster, so panel c is not taken into the negative image set of query q. Panels d-f are taken as low-correlation images for q. Once the number of selected images reaches N, no further low-correlation images are considered. If we define N as 5, the selected pictures are labeled N1, N2, N3, N4, and N5, ranked by the Euclidean distance between the hardest sample of each cluster and the query image.

Entropy weight based on structural distribution: for each query sample q, there is a large number of negative samples with different structural distributions. To fully exploit them, we propose to weight the negative samples according to their spatial distribution, i.e., their self-similarity and relative similarity, that is, the extent to which each sample violates the constraint. The principle of the weighting strategy is shown in Figure 6. The query sample and the positive sample are represented by red circles, and N1 to N5 represent negative samples. The left panel shows the case where N2 to N5 are far from the query sample while N1 is relatively close, making N1 the hardest negative sample; in this case, we keep the distance between N1 and the query sample and pull in the other negative samples. In the right panel, N1 is no longer the hardest negative sample. This decrease in relative similarity shows that the distribution of samples around a negative sample has a large impact on the similarity evaluation between negative and positive samples. To highlight this effect during training, we propose an entropy weight based on the structural distribution, given in Equation (9), where q is the query sample, i is the selected sample, the selected (q, i) is a negative sample pair, and $N_i$ is the negative sample set. We define the similarity of two samples as $S_{qi} := \langle f(x_q;\theta), f(x_i;\theta) \rangle$, where $\langle \cdot,\cdot \rangle$ denotes the inner product, yielding an $n \times n$ similarity matrix $S$ whose element at $(q, i)$ is $S_{qi}$; $\lambda$, $\alpha$, and $\beta$ are fixed hyper-parameters. The harder the negative sample, the larger $S_{qi}$; $\omega$ is ranked from large to small, and the rank index gives the value of $\alpha$.

Given a query $X_i$, we rank the samples according to their similarity with the query sample and the structural distribution, as illustrated in Figure 7. Each set of pictures contains one positive sample and five negative samples; the query sample is denoted by Q, P denotes the positive sample, and N1, N2, N3, N4, and N5 denote negative samples with decreasing similarity to the query picture, N1 being the most similar. The loss function then acts as follows. First, the positive sample P is pushed toward the query sample Q until its distance from Q is within β, rather than infinitely close, satisfying the first constraint; then N1 should be closer to the query picture Q than N2, so N1 is placed closer, the second constraint is enforced on the other images, and so on.
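A sketch of the negative-selection walk in Figure 5, under assumed data structures (a feature matrix plus per-image cluster labels); all names here are ours.

```python
import numpy as np

def select_negatives(q_vec, db_vecs, db_cluster_ids, q_cluster, pos_clusters, N=5):
    """Walk the database in order of Euclidean distance to the query
    and pick the hardest image from each of N distinct clusters,
    skipping the query's own cluster and any positive cluster."""
    order = np.argsort(np.linalg.norm(db_vecs - q_vec, axis=1))
    chosen, seen = [], {q_cluster} | set(pos_clusters)
    for idx in order:
        c = db_cluster_ids[idx]
        if c in seen:
            continue                    # one negative per cluster (cf. panel c)
        chosen.append(idx)              # hardest sample of this cluster
        seen.add(c)
        if len(chosen) == N:
            break
    return chosen                       # ranked N1..N5 by distance to the query
```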
Proper sampling speeds up the convergence of the model and improves its performance. In this paper, we create a tuple dataset (X_q, X_i), where q indexes the query image and i an image matched against the query image. All these tuples form the image pairs used for training.

The Training Process of DSLL
In this section, we describe and illustrate the training process of our proposed DSLL, shown in the algorithm presented below. First, we perform feature extraction on all samples and use the extracted feature vectors to compute the Euclidean distances between the query sample and the samples in the dataset, ranking them online. Then, we select positive and negative samples according to the requirements of Section 3.2.1 and extract features for the selected samples. Finally, the extracted features are used to compute the loss, and the network is optimized by this loss.
The detailed action process of DSLL is as follows.

Input: query sample q, retrieval images i, the number of negative samples n, the negative-sample ranking index a, the learning rate η, the initial weights δ, the initial biases b, and the structure-distribution entropy weight ω.

Sample pair selection: (a) find the set of samples S such that, based on prior knowledge, x_i is deemed similar to x_q; (b) pair the sample x_q with the remaining training samples and label the pairs so that y(q,i) = 1 if X_i belongs to S and y(q,i) = 0 otherwise. Form the labeled training set by combining all the pairs.

Training process:
Step 1: feed all images forward through f to obtain the image embeddings x_i.
Step 2: compute the distribution of the samples around each negative sample, i.e., the structure-distribution entropy weight ω, rank online iteratively, and obtain each image's ranking index a. Repeat until convergence: for each pair (x_q, x_i) in the training set, if y(q,i) = 1, update δ to decrease d(x_q, x_i); if y(q,i) = 0, update δ to increase d(x_q, x_i).
Step 3: compute the gradients and back-propagate to update the parameters δ and b.
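Putting the steps together, a condensed PyTorch training-loop sketch; the batch format (query, positive, ranked negatives with their weights and rank factors) is an assumption, and dsll_pair_loss refers to the pairwise sketch in Section 3.2.

```python
import torch

def train_epoch(net, loader, optimizer, beta=0.15, tau=1.25):
    """One epoch of DSLL training. Each batch is assumed to provide a
    query tuple: query q, one positive p, and n ranked negatives negs
    of shape (B, n, C, H, W) with entropy weights w and rank factors
    exp_alpha of shape (B, n)."""
    net.train()
    for q, p, negs, w, exp_alpha in loader:
        f_q = net(q)                                   # Step 1: embeddings
        f_p = net(p)
        ones = torch.ones(f_q.size(0))
        loss = dsll_pair_loss(f_q, f_p, ones, ones, ones,
                              beta=beta, tau=tau)      # positive term (y = 1)
        for k in range(negs.size(1)):                  # Step 2: weighted negatives
            f_n = net(negs[:, k])
            loss = loss + dsll_pair_loss(f_q, f_n, torch.zeros(f_q.size(0)),
                                         w[:, k], exp_alpha[:, k],
                                         beta=beta, tau=tau)
        optimizer.zero_grad()
        loss.backward()                                # Step 3: back-propagate
        optimizer.step()
```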

Experiments
This section evaluates our proposed DSLL model and compares it with current state-of-the-art models. We first introduce the experimental setup, including the datasets and evaluation protocols. Then we analyze the parameters of the proposed method. In addition, we combine different aggregation methods (SPoC, MAC, and GeM) and whitening methods (PCAw and Lw) with our model to select the configuration most beneficial to it. At the end of this section, we present a comparison with existing methods and a visualization of the results.

Training Datasets
We adopted the training dataset constructed by Schonberger et al. [37]. The dataset includes 7.4 million images downloaded from Flickr, covering popular landmarks in cities and countries worldwide. 3D models were reconstructed from the dataset with BoW and structure-from-motion (SfM), and a method exempt from manual annotation was used to automatically obtain a large dataset of query images, positive images, and clusters with serial numbers.
There are a total of 91,642 training images in the dataset; 98 cluster images identical or nearly identical to the test datasets were excluded through BoW-based image retrieval. About 20,000 images were selected as query images and 181,697 pairs as positive images, drawn from 551 training clusters containing more than 163,000 images of the original datasets, using the min-hash and spatial verification methods mentioned in the clustering procedure [51]. The original datasets contained all the images of the Oxford 5k and Paris 6k datasets.

Training Configurations
In the experiments, our proposed method applies to any fully convolutional CNN [52]. We trained the DSLL-based deep network model with the PyTorch deep learning framework, initializing the network parameters with the weights of the corresponding networks pre-trained on ImageNet [53].
The learning rate of each parameter is dynamically adjusted using the first- and second-moment estimates of the gradient. Since pre-trained network parameters were used [39], the learning rate was lr = 10^{-6} for the AlexNet network during training, and lr = 7 × 10^{-7} for the VGG16 and ResNet101 networks, with momentum 0.9; the smaller rate is justified by the increased dimensionality of the embedding. All images in the training set were resized to a maximum size of 362 × 362 while preserving the original aspect ratio. The reported training results are the experimental data obtained over 30 epochs.

Test Datasets
To evaluate the impact of our proposed metric-learning-based distribution loss on instance-level image retrieval, we used three publicly available datasets: the Oxford 5k building dataset [1], the Paris 6k dataset [24], and the Holiday 1k dataset [25].
The Oxford 5k building dataset is a widely used landmark dataset consisting of 5062 building images, collected from the Flickr dataset, corresponding to 11 famous landmarks in the Oxford area, with 55 query images evenly distributed across the 11 landmarks.
The Paris 6k dataset consists of 6392 Paris travel images, including the Eiffel Tower, etc. Similar to the Oxford 5k dataset, there are also 55 query images.
The Holiday 1k dataset consists of holiday photos dominated by natural scenery, with a total of 1491 images, including 500 query images and the corresponding 991 related images.
We combine these datasets with the 100k distractor images of Oxford 100k for larger-scale evaluation.

Performance Evaluation Metrics
Throughout the experiments, we used the Euclidean distance to measure similarity and the standard protocol provided on the dataset websites, namely the mean average precision (mAP) of the search results. The mAP value is calculated as shown in Equation (14):

$$\mathrm{mAP} = \frac{1}{N_q}\sum_{q=1}^{N_q} AP(q), \qquad AP(q) = \int_{0}^{1} P_q(r)\, dr,$$

where $AP(q)$ is the average precision of the results of query $q$ compared with the benchmark annotations of the dataset, $P_q(r)$ is the precision of the $q$-th query at recall $r$, and $N_q$ is the number of queries.
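For reference, a small NumPy sketch of Equation (14), with AP computed as the mean precision at the ranks of the relevant results (a standard discrete approximation of the integral).

```python
import numpy as np

def average_precision(ranked_relevant):
    """AP for one query. ranked_relevant: boolean array, True where
    the i-th ranked result is a ground-truth match."""
    hits = np.cumsum(ranked_relevant)                       # relevant count so far
    ranks = np.flatnonzero(ranked_relevant) + 1             # 1-based relevant ranks
    precisions = hits[ranked_relevant] / ranks              # precision at each hit
    return precisions.mean() if precisions.size else 0.0

def mean_average_precision(per_query_rankings):
    """mAP = (1 / N_q) * sum over queries of AP(q)."""
    return float(np.mean([average_precision(r) for r in per_query_rankings]))
```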

The Impact of Margin Parameter τ
As shown in Section 3, for each query, DSLL guarantees the consistency of the structural-similarity ordering of the negative samples. Specifically, we ensure this consistency by adjusting the size of the negative-sample structure space, which is determined by the constraint parameter τ. As shown in Figure 8, we conducted experiments on the Oxford 5k and Paris 6k datasets to analyze its impact.
To find a threshold τ that suits different datasets and improves network performance, we experimented with values from 0.5 to 2.0 and report the results for τ = 0.75, 1.00, 1.25, and 1.50. The data shown in Figure 8a,b were obtained under AlexNet training, while Figure 8c,d show the data obtained under VGG training; the results show that performance is best when τ = 1.25. The figure shows that network performance increases as the threshold τ increases, but when τ grows to 1.45, the mAP value decreases, because a larger threshold τ pulls away negative samples with high similarity and harms the training. In the following experiments, we therefore set the threshold τ to 1.25.

The Role of Structural Similarity Ranking Consistency
In this experiment, we use the contrastive loss function as a prototype and compare the retrieval performance without ranking, with ranking by Euclidean distance, and with ranking by the structural-similarity measure. The search performance results are shown in Figure 9. From the figure, we can see that adding a ranking loss improves on the original model, and the ranking loss measured by structural similarity performs best. It outperforms ranking by Euclidean distance because it considers the spatial distribution of the feature and of the samples around it, which preserves the vector structure of the negative samples to the greatest extent, whereas ranking by Euclidean distance alone cannot characterize the feature structure well.

Impact of Margin Parameter β
To study the influence of the threshold parameter β, we set τ = 1.5 in this experiment. The results are presented in Figure 10. We observe that a proper positive constraint β is important for DSLL to learn discriminative embeddings. Since β controls the extent of this approximation, we chose β = 0.15 for the later experiments.


The Combined Impact of DSLL
To verify the effect of our newly proposed loss function based on structural consistency and similarity-ranking consistency, we compare the original contrastive loss, the loss function with structural-similarity ranking consistency, and the loss function with both structural consistency and structural-similarity ranking consistency. The experimental results are shown in Figure 11, where None refers to the original contrastive loss, A denotes the loss function based on structure preservation, and B denotes the loss function based on structural-similarity ranking consistency.

Comparison of MAC, SPoC and GeM
This section compares state-of-the-art representations: MAC, SPoC, and GeM. We use the DSLL function to perform end-to-end network training on the datasets, connecting each pooling method under comparison to the last convolutional layer of the AlexNet network. It can be concluded from Figure 12 that GeM is superior to MAC and SPoC on all datasets.

Whitening has proved essential in some cases for CNN-based descriptors. In this experiment, we chose the best whitening method for our network by comparing PCA whitening [32] (PCAw), learned discriminative whitening [37] (Lw), and the results without post-processing. Figure 13 shows the comparison. Our experiments show that PCAw usually degrades performance, while Lw achieves the best performance in most cases and never the worst. Therefore, we choose Lw for its performance and generalization ability.


Gradient Value Selection
This section compares the performance of two pooling methods, MAC pooling and GeM pooling, in combination with PCAw and Lw at different descriptor dimensionalities. The performance for varying dimensionality is plotted in Figure 14. As can be seen from the figure, the higher the dimensionality, the better the performance. For the same pooling method, Lw outperforms PCAw, and performance is best in combination with GeM pooling. Overall, at dimensionality 256, the combination of GeM pooling and Lw performs best.


Comparison with the State of the Art
This section compares the performance of our proposed DSLL method with state-of-the-art representations. Table 1 lists the performance comparisons. We divide the networks into two categories, (1) fine-tuned networks (yes) and (2) off-the-shelf networks (no), so that we can observe that our proposed DSLL is superior to all prior methods. With the VGG network framework, DSLL provides a significant improvement over RMAC [33] of +5.3%, +1.8%, and +1.7% in mAP on the Oxford 5k, Paris 6k, and Holiday 1k datasets. Furthermore, the DSLL signatures achieve gains of +0.5%, +1.2%, and +1.3% on Oxford 5k, Paris 6k, and Holiday 1k, surpassing the recently published [5]. With the ResNet network framework, the experimental results achieve +0.6%, +1.2%, and +0.4% growth over GeM on the Oxford 5k, Paris 6k, and Holiday 1k datasets. The improvement in retrieval accuracy over GeM is even more significant on large-scale datasets: Holiday 101k (+3.2%) under the VGG framework and Holiday 101k (+2.8%) under the ResNet framework. Re-ranking and query expansion [13,37] have recently become standard techniques for improving search quality; we applied QE to the DSLL representation, and the resulting improvement can be observed in Table 1. †: results of our evaluation of average pooling (SPoC) and MAC using PCAw and off-the-shelf networks. ‡: results of evaluating R-MAC using [13] and off-the-shelf networks.

Visualization Purposes
To visualize the search results, as shown in Figure 15, we present qualitative results for several query samples. They demonstrate that the top-five retrieved images are similar to the query image when using features extracted by the pretrained, fine-tuned, and DSLL-based ResNet101 networks, respectively. In Figure 15, the first row shows results on the Oxford dataset, with the query image from Hertford Bridge, Oxford; the second row shows results on the Paris dataset, with query images from the Arc de Triomphe and Notre Dame de Paris; the last row shows results on the Holiday dataset, with a query image of a field and snowy mountains. DSLL-based features significantly outperform the pretrained and fine-tuned features. The results indicate that the DSLL algorithm, based on structural consistency and structural-similarity ranking, can distinguish images with high within-class variance by learning high-level semantic features well. Figure 15. Top-five retrieval results on the Oxford (first row), Paris (second row), and Holiday (third row) datasets. In each group, the first image is the query image, and for each query, the three rows show the results of features extracted by the pretrained, fine-tuned, and DSLL-based ResNet101, respectively. The green ticks at the bottom right of a picture denote correct results, while the red marks denote false results.


Conclusions
In this paper, we proposed a novel DSLL algorithm based on structure preservation and structural-distribution consistency. It learns the hierarchical structure underlying the deep features of image samples in two ways. First, it preserves the structural information of the positive samples by learning a hypersphere for each query sample in the model; then, the entropy weight based on the structural distribution captures the spatial relationship between each negative sample and its surrounding samples to obtain the weights. The feature vectors are combined with the weights to train the network, and retrieval accuracy is improved by realizing both the structure preservation of the image feature vectors and the consistency of the structural-similarity ranking. We train the whole framework in an end-to-end fashion, and experimental results from broad tests show that DSLL achieves state-of-the-art performance.