Unsupervised Hashing with Gradient Attention

Abstract: Existing learning-based unsupervised hashing methods usually use a pre-trained network to extract features and then use the extracted feature vectors to construct a similarity matrix, which guides the generation of hash codes through gradient descent. Existing research shows that algorithms based on gradient descent cause the hash codes of paired images to be updated toward each other's position during training. In unsupervised training, this causes large fluctuations in the hash codes and limits the efficiency with which they are learned. In this paper, we propose a method named Deep Unsupervised Hashing with Gradient Attention (UHGA) to solve this problem. UHGA consists of three parts: (1) pre-trained network models are used to extract image features; (2) the cosine distance between the features of each pair of images is calculated, and a similarity matrix is constructed from the cosine distances to guide the generation of hash codes; (3) a gradient attention mechanism is added during the training of the hash codes to attend to their gradients. Experiments on two existing public datasets show that our proposed method obtains more discriminative hash codes.


Introduction
Image retrieval is an active topic in multimedia data retrieval, and many different methods exist today. The content-based image retrieval method [1] cannot meet the requirements of timeliness and retrieval precision because of its inherent limitations. To obtain higher retrieval accuracy, early researchers preferred nearest neighbor (NN) search. NN search retrieves the closest object in the dataset, and its complexity increases linearly with the amount of data, so its time performance cannot be guaranteed. Social networks now hold large amounts of high-dimensional multimedia data, for which existing NN-based retrieval methods can provide neither ideal retrieval results nor an acceptable retrieval time. Researchers have therefore turned to approximate nearest neighbor (ANN) search [2][3][4][5][6][7][8][9]. Compared with the traditional retrieval method [1], hashing [2] is an effective alternative that converts high-dimensional media data into hash codes in a binary space.
In recent years, hashing has become a popular subfield of ANN search [6,10,11]. With its low storage cost and high query speed, hashing has become the preferred approach for efficient ANN search. Hashing methods fall into two subcategories: (1) data-dependent hashing and (2) data-independent hashing. Data-independent hashing obtains hash codes through random mapping, so a reliable hash function cannot be obtained. Liu et al. [3] use a set of hash functions to create multiple hash tables. After hash mapping, similar data points are more likely to collide in these hash tables, whereas the probability that dissimilar points collide is very small, so similar and dissimilar images can be well distinguished. On this basis, Bai et al. [12] proposed transferring the distance constraint of the hash code from the Hamming space to the Euclidean space. Data-dependent hashing methods use image colors, textures, shapes, or other features to directly generate hash codes and construct hash functions. Current research focuses on deep learning methods that map features extracted by a Convolutional Neural Network (CNN) to the Hamming space through a hash function.
The data-dependent hashing method includes two subclasses: supervised hashing and unsupervised hashing. The two differ in whether label information is used to construct the hash function. The method of Gong et al. [13] is a representative unsupervised hashing algorithm. It applies Principal Component Analysis (PCA) to reduce the dimensionality of the original dataset and constructs a binary hypercube whose vertices represent the data points in the dataset; the quantization error is minimized by a rotation matrix. Supervised hashing methods [14][15][16][17][18][19] use the similarity information of paired images to construct the hash function. In the binary space, the Hamming distance of an image pair is determined by the similarity of the pair.
As CNNs have developed further, they have shown excellent performance in feature extraction [20], and existing supervised deep hashing methods have achieved good performance [21][22][23][24][25][26]. However, manually labeling the massive amount of image data available online is unrealistic, so hashing methods that rely on annotated data may not be suitable for practical applications.
To exploit these massive amounts of unlabeled data, researchers proposed deep unsupervised hashing methods. In early deep unsupervised hashing methods, hash codes were learned by deep autoencoders [27,28]. However, the variability of natural images in position, color, posture, and so on affected how representative the learned hash codes were. Erin Liong et al. [29] maximized the representation of hash codes. Lin et al. [30] learned hash codes by keeping the hash code of a rotated image unchanged. Jin et al. [31] further studied the rotation invariance between different images with the same category label. Yang et al. [32] used two half-Gaussian distributions to estimate the cosine distance distribution of different data points and, based on this estimate, designed a pairwise loss function to retain semantic information. Generative adversarial networks (GANs) have been widely used in various fields [33]. Ghasedi Dizaji et al. [34] used a generator and a discriminator with shared parameters to optimize the hash function adversarially. Building on GANs, Deng et al. [35] constructed a semantic similarity matrix from feature similarity and nearest neighbor similarity to guide the construction of hash codes.
The optimization algorithm commonly used in current deep learning models is gradient descent. Huang et al. [36] pointed out that a hash model optimized by gradient descent may fall into a dilemma during optimization: the hash codes of paired images are always updated toward each other, and in the worst case the paired hash codes exchange positions after an update. This phenomenon is the "dilemma" of gradient descent. Once the model falls into this dilemma, the Hamming distance between the corresponding hash codes is not reduced, which hinders learning. Because the problem is caused by the gradients used to update the hash codes in backpropagation, selectively ignoring or reducing the gradient of one of the hash codes during backpropagation is an effective way to solve it. The gradient attention mechanism was designed to focus on the gradients of paired image hash codes in supervised hashing.
Existing unsupervised deep hashing methods usually use a pre-trained network model to construct a similarity matrix that guides the generation of hash codes. The above problem therefore also exists in unsupervised learning. Our experimental results show that the unsupervised hashing method achieves better performance after the gradient attention mechanism is added.

Deep Hash Model
For unsupervised hashing, $X = \{x_i\}_{i=1}^{N}$ represents a dataset containing $N$ images, where $x_i$ is the $i$-th image. Unsupervised hashing aims to obtain a binary code for each data point in the dataset and to construct a hash function $M(\theta; x)$ that generates this code, where $\theta$ represents the parameters of the deep model. After passing through the deep model $M$, each image $x_i$ is mapped to a $k$-dimensional vector, giving $Z = \{z_i\}_{i=1}^{n}$, where $z_i$ is the $i$-th $k$-dimensional vector. Applying the binarization $b_i = \mathrm{sign}(z_i)$ to this vector yields the corresponding hash code, where $b_i$ is the $i$-th hash code and the sign function is given by Equation (1).
Unsupervised hashing usually guides the learning of hash codes by constructing a semantic similarity matrix between image pairs. In this paper, we extract the FC-7 layer features of $n$ randomly selected data points through the pre-trained AlexNet network to construct the similarity matrix. Most current deep hashing algorithms use the VGG-F network in MATLAB as the benchmark, whereas our experiments are implemented in the PyTorch deep learning framework. To keep the comparison fair, AlexNet, which has the same structure as the VGG-F network, is selected as the benchmark network. For the input $N$ data points, the corresponding feature matrix is obtained from the pre-trained AlexNet. From the feature matrix, we calculate the cosine distance between the features of every pair of data points and construct the cosine distance matrix $D$ by Equation (2).
In Equation (2), $F_i$ represents the FC-7 layer feature of image $x_i$. The similarity matrix $S$ is then constructed from the cosine distance matrix by Equation (3), where $D.\mathrm{mean}$ represents the average value of the cosine distance matrix $D$, and $\alpha$ and $\beta$ are threshold parameters. $S_{ij} = 1$ means that $x_i$ is semantically similar to $x_j$, $S_{ij} = -1$ means the opposite, and $S_{ij} = 0$ means that the semantic relationship between $x_i$ and $x_j$ is uncertain.
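The construction of $D$ and $S$ can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the cosine distance is one minus the cosine similarity, and that the thresholding rule labels a pair as similar when $D_{ij} \le D.\mathrm{mean} - \alpha$ and as dissimilar when $D_{ij} \ge D.\mathrm{mean} + \beta$. The function names and the toy feature vectors are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(features):
    """Pairwise cosine distance matrix D for a list of feature vectors."""
    return [[cosine_distance(fi, fj) for fj in features] for fi in features]

def similarity_matrix(D, alpha, beta):
    """Three-valued similarity matrix S built from D.

    Assumed rule (the original Equation (3) is not reproduced here):
    S_ij = 1 if D_ij <= mean - alpha, S_ij = -1 if D_ij >= mean + beta,
    and S_ij = 0 (uncertain) otherwise.
    """
    n = len(D)
    mean = sum(sum(row) for row in D) / (n * n)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if D[i][j] <= mean - alpha:
                S[i][j] = 1
            elif D[i][j] >= mean + beta:
                S[i][j] = -1
    return S

# Toy stand-ins for FC-7 features: the first two vectors point in nearly the
# same direction, the third is orthogonal to the first.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
D = distance_matrix(features)
S = similarity_matrix(D, alpha=0.1, beta=0.1)
```

In this toy case the first two images are labeled similar to each other and dissimilar to the third; with wider thresholds more pairs would fall into the uncertain band and receive the label 0.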
During training, the sign function cannot be optimized by gradient descent because its gradient is zero almost everywhere. The tanh function is therefore generally used instead, relaxing each binary code into a continuous hash code $h \in [-1, +1]^k$, where $h^k$ and $b^k$ denote the $k$-th bit of $h$ and $b$.
To preserve the similarity of paired images in the hash codes, we use the inner product of the hash codes to construct the predicted similarity matrix $\Im$, with entries $\Im_{ij} = \frac{1}{k} h_i^{T} h_j$, where $h_i$ and $h_j$ are relaxed hash codes.
To maintain the semantic information between image pairs, we minimize the L2 distance $\lVert \Im_{ij} - S_{ij} \rVert^2$ between the similarity matrix constructed from the feature matrix and the predicted similarity matrix. In the resulting objective function $l(\theta)$, $\Im_{ij} = \frac{1}{k} h_i^{T} h_j$, $h_i = \tanh(M(\theta; x_i))$, and $\theta$ denotes the parameters of the network model. The parameter $\theta$ is commonly updated by calculating the gradient of the loss function. We first define $\nu_i = M(\theta; x_i)$; the derivative of $\nu_i$ with respect to $\theta$ can be calculated directly, and the derivative of $l(\theta)$ with respect to $\nu_i$ is then calculated by Equation (6).
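The relaxation and the pairwise objective can be sketched as follows. This is an illustration under stated assumptions rather than the paper's exact loss: the objective is taken to be a plain sum of squared differences over all pairs without normalization, the sign convention at zero (mapping 0 to +1) is assumed, and the raw outputs `Z` are hypothetical stand-ins for $M(\theta; x_i)$.

```python
import math

def relax(z):
    """Continuous hash code h = tanh(z), with every bit in [-1, 1]."""
    return [math.tanh(v) for v in z]

def binarize(z):
    """Binary hash code b = sign(z); sign(0) = +1 is an assumed convention."""
    return [1 if v >= 0 else -1 for v in z]

def predicted_similarity(h_i, h_j):
    """Scaled inner product (1/k) * h_i^T h_j."""
    k = len(h_i)
    return sum(a * b for a, b in zip(h_i, h_j)) / k

def pairwise_loss(H, S):
    """Sum over all pairs of (predicted similarity - S_ij)^2 (assumed form)."""
    n = len(H)
    return sum(
        (predicted_similarity(H[i], H[j]) - S[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )

# Hypothetical network outputs for three images with code length k = 4.
Z = [[2.0, -1.5, 0.5, 3.0], [1.8, -2.0, 0.7, 2.5], [-2.2, 1.9, -0.4, -3.1]]
S = [[1, 1, -1], [1, 1, -1], [-1, -1, 1]]

H = [relax(z) for z in Z]      # relaxed codes used during training
B = [binarize(z) for z in Z]   # binary codes used at retrieval time
relaxed_loss = pairwise_loss(H, S)
binary_loss = pairwise_loss(B, S)
```

Here the binary codes reproduce `S` exactly, so their loss is zero, while the relaxed codes leave a small positive residual because tanh never reaches ±1.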
Gradient descent is usually used to optimize the parameters: $\theta \leftarrow \theta - \mu \frac{\partial l}{\partial \theta}$. During optimization, backpropagation is used to calculate the gradient with respect to the model parameter $\theta$: the partial derivative $\partial l / \partial \Im$ is calculated first, and the partial derivative with respect to $h_i^k$ is then calculated by Equation (7), where $h_i^k$ denotes the $k$-th bit of $h_i$. Finally, the gradient $\partial l / \partial \theta$ is calculated by Equation (8).
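As a hedged worked example of this chain rule, assume that a single pair with $i \neq j$ contributes the term $l_{ij} = (\Im_{ij} - S_{ij})^2$ to the objective; the original Equations (6) to (8) may contain additional terms or normalization. To avoid a clash with the bit index $k$, the code length is written as $K$ here. Under that assumption:

```latex
\begin{align}
\frac{\partial l_{ij}}{\partial \Im_{ij}} &= 2\,(\Im_{ij} - S_{ij}), \\
\frac{\partial l_{ij}}{\partial h_i^k}
  &= \frac{\partial l_{ij}}{\partial \Im_{ij}} \cdot \frac{\partial \Im_{ij}}{\partial h_i^k}
   = \frac{2}{K}\,(\Im_{ij} - S_{ij})\, h_j^k, \\
\frac{\partial h_i^k}{\partial \nu_i^k} &= 1 - \bigl(h_i^k\bigr)^2
  \qquad \text{since } h_i^k = \tanh(\nu_i^k), \\
\frac{\partial l_{ij}}{\partial \theta}
  &= \sum_{k=1}^{K} \left[
     \frac{\partial l_{ij}}{\partial h_i^k}\,\bigl(1 - (h_i^k)^2\bigr)\,\frac{\partial \nu_i^k}{\partial \theta}
   + \frac{\partial l_{ij}}{\partial h_j^k}\,\bigl(1 - (h_j^k)^2\bigr)\,\frac{\partial \nu_j^k}{\partial \theta}
     \right].
\end{align}
```

The second line makes the coupling explicit: the derivative for bit $k$ of $h_i$ is proportional to the same bit of $h_j$, and vice versa, which is what drives the paired codes toward each other.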

The Dilemma of Gradient Descent
The purpose of network model optimization is to update the model parameter $\theta$ during training. However, after $\theta$ is updated, the resulting changes in $h$ cannot be observed directly. To analyze how $h$ changes, [36] proposes a lemma on the update behavior of $h$.

Lemma 1: Consider a composite function $g(y)$, where $x$ is a vector and $y = f(x)$ is a scalar function. Suppose $x$ is updated by gradient descent, that is, $x^{+} = x - \alpha \frac{\partial g}{\partial x}$, where $\alpha$ is the step size. Then $y^{+} = f(x^{+}) \approx y - \hat{\alpha} \frac{\partial g}{\partial y}$, where $\hat{\alpha}$ is some positive scalar; i.e., $y$ is updated along its negative gradient.
Proof: Taking the first-order Taylor expansion of $f(x^{+})$ at $x$ gives Equation (9). This completes the proof.
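The expansion behind this proof can be written out as follows. This is a reconstruction from the statement of the lemma and the chain rule, not a copy of the original Equation (9), and it uses the chain-rule identity $\frac{\partial g}{\partial x} = \frac{\partial g}{\partial y} \frac{\partial f}{\partial x}$:

```latex
\begin{align}
y^{+} = f(x^{+})
  &\approx f(x) + \left(\frac{\partial f}{\partial x}\right)^{T} (x^{+} - x) \\
  &= y - \alpha \left(\frac{\partial f}{\partial x}\right)^{T} \frac{\partial g}{\partial x} \\
  &= y - \alpha \left\lVert \frac{\partial f}{\partial x} \right\rVert^{2} \frac{\partial g}{\partial y}
   = y - \hat{\alpha}\, \frac{\partial g}{\partial y},
  \qquad \hat{\alpha} = \alpha \left\lVert \frac{\partial f}{\partial x} \right\rVert^{2}.
\end{align}
```

The scalar $\hat{\alpha}$ is positive whenever $\partial f / \partial x \neq 0$, so $y$ moves along its own negative gradient even though only $x$ is updated explicitly.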
By substituting $x = \theta$, $y = h$ and $f(x) = \tanh(M(\theta))$ into Equation (9), it can be seen that when $\theta$ is updated by gradient descent, $h$ changes along its negative gradient direction, so the change in $h$ can be observed through its gradient.
For a pair of input images $x_i$, $x_j$, assume that $x_i$ is similar to $x_j$, i.e., $s_{ij} = 1$. If one bit of the current continuous hash codes takes the values $h_i = 1$ and $h_j = -1$, the loss on this bit is positive, $l > 0$, and the current hash codes need to be updated. Given the gradients of the loss with respect to $h_i$ and $h_j$ and the lemma above, the updated values $h_i^{+}$ and $h_j^{+}$ are expressed by Equations (12) and (13), where $\alpha_{ij}$ and $\alpha_{ji}$ are some positive scalars. Equations (12) and (13) show that, during optimization, $h_i$ moves toward $h_j$ and tends to take the same value as $h_j$; $h_j$ is updated toward $h_i$ in the same way. Note that when $h_i = h_j$, $\Im_{ij} = S_{ij}$ and the hash codes are optimal. However, in some cases the two codes overshoot and exchange values after the update, so the updated loss remains unchanged. In the next round of updates, $h_i$ and $h_j$ switch signs again and the corresponding loss is still not reduced; this can happen repeatedly. The same dilemma occurs when the codes of a dissimilar pair are updated to move away from each other.
Updating only one of the paired hash codes during backpropagation is an effective solution when their similarity prediction is incorrect. When the prediction is correct, $h_i$ and $h_j$ are both updated toward their sign values, which avoids the sign-switching problem. The gradient weights for $h_i$ and $h_j$ should therefore be chosen carefully, so that only one code is updated when the predicted similarity is incorrect and both codes are updated when it is correct.
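The dilemma and the one-sided remedy can be reproduced numerically on a single bit. The sketch below is a toy illustration, not the paper's derivation: it assumes the per-bit loss $(h_i h_j - s_{ij})^2$, updates the codes directly along their negative gradients (as the lemma suggests), and picks a step size of 0.5 specifically to trigger the worst case. The weights `w_i` and `w_j` are hypothetical stand-ins for attention weights.

```python
def bit_loss(h_i, h_j, s):
    """Assumed per-bit pairwise loss (h_i * h_j - s)^2."""
    return (h_i * h_j - s) ** 2

def bit_grads(h_i, h_j, s):
    """Derivatives of the assumed loss with respect to h_i and h_j."""
    residual = 2.0 * (h_i * h_j - s)
    return residual * h_j, residual * h_i

def update(h_i, h_j, s, step, w_i=1.0, w_j=1.0):
    """One gradient step on both codes, with optional per-code weights."""
    g_i, g_j = bit_grads(h_i, h_j, s)
    return h_i - step * w_i * g_i, h_j - step * w_j * g_j

h_i, h_j, s = 1.0, -1.0, 1          # similar pair with opposite bits
loss_before = bit_loss(h_i, h_j, s)

# Both codes updated: they overshoot and exchange values.
swapped = update(h_i, h_j, s, step=0.5)
loss_swapped = bit_loss(*swapped, s)

# Only h_i updated: it lands on h_j and the loss vanishes.
one_sided = update(h_i, h_j, s, step=0.5, w_i=1.0, w_j=0.0)
loss_one_sided = bit_loss(*one_sided, s)
```

With both codes updated the pair ends at (-1, 1) and the loss stays at 4, whereas updating only one code brings the pair to (-1, -1) and the loss to 0.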

Gradient Attention Network
In this paper, we use a gradient attention network to focus on the gradients of $h_i^k$ and $h_j^k$ and to generate weights for the derivatives of $h_i^k$ and $h_j^k$ before they are backpropagated to $\theta$. The weighted derivatives of $h_i^k$ and $h_j^k$ are used instead of the original derivatives to calculate the gradient of $\theta$. For all training pairs, the weighted derivatives are calculated by Equations (14) and (15), where $\xi_{ij,i}^{k}$ and $\xi_{ij,j}^{k}$ are the weights generated for the derivatives of the pair of bits $h_i^k$ and $h_j^k$, respectively. A new gradient $g(\phi)$ is obtained by substituting (14) and (15) into (8), where $\phi$ denotes the parameters of the gradient attention network.
For a hash model optimized by gradient descent, the hash code bit $h_i^k$ of a paired image $x_i$, $x_j$ is updated from its original value by the derivative of the loss during optimization. These original values and loss derivatives form the inputs of the gradient attention network. The gradient attention network first generates the feature value $y_{ij,i}^{k}$ for the $k$-th bit of the hash code of the paired image, then calculates the weight of the $k$-th bit and normalizes it with the softmax function, as given by Equations (16) and (17). Figure 1 shows the deep unsupervised hash network model. The gradient attention network consists of two fully connected layers, each of which contains 100 hidden units.
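The weighting step can be sketched as below. This is a simplified illustration under an assumption: the softmax is taken over the two feature values of a pair, so that the two weights for a given bit sum to one. The feature values `y_i` and `y_j` would come from the two-layer attention network; here they are passed in as plain numbers, and the function names are hypothetical.

```python
import math

def softmax_pair(y_i, y_j):
    """Softmax over the two feature values of a pair (assumed normalization)."""
    m = max(y_i, y_j)                      # subtract the max for stability
    e_i, e_j = math.exp(y_i - m), math.exp(y_j - m)
    total = e_i + e_j
    return e_i / total, e_j / total

def weighted_derivatives(g_i, g_j, y_i, y_j):
    """Scale the two loss derivatives of one bit by their attention weights."""
    xi_i, xi_j = softmax_pair(y_i, y_j)
    return xi_i * g_i, xi_j * g_j

# Equal feature values split the update evenly between the two codes.
even = weighted_derivatives(4.0, -4.0, 0.0, 0.0)

# A much larger feature value for code i routes almost the whole update to it.
xi_i, xi_j = softmax_pair(10.0, -10.0)
```

When one feature value dominates, the other code's derivative is scaled to nearly zero, which approximates the one-sided update that avoids the sign-switching dilemma.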
The detailed learning process of the UHGA model is shown in Algorithm 1.

Input: Training set $X = \{x_i\}_{i=1}^{n}$; code length $k$.
Output: Updated network parameters $\theta$ and hash codes $B$.
Extract the features of the AlexNet FC-7 layer.
Use (2) to calculate the cosine distance of paired image features.
Use (3) to construct the similarity matrix $S$.
Repeat: take a mini-batch of images from $X = \{x_i\}_{i=1}^{n}$ and, for each image $x_i$, follow each step below:
• Calculate $l(\theta)$ and obtain the derivative $\partial l / \partial \Im$ in the process of backpropagation;
• Use [...] and $\partial l / \partial h_i^k$ as input of the gradient attention network to calculate the attention weights $y_{ij,i}^{k}$;
• Calculate $\xi_{ij,i}^{k}$ and $\xi_{ij,j}^{k}$ by (16) and (17)

Optimization
The gradient attention network addresses the problem of the continuous hash codes of a pair of images exchanging positions after an update. Because the loss remains unchanged when this happens, the quality of the attention weights can be evaluated by the loss reduction achieved during the update. In the gradient attention network, the new gradient is generated first, and the parameter $\theta$ of the hash model is then updated by Equation (18). The updated loss is $l(\theta^{+}) = l(\theta - \alpha g(\phi))$, and the reduced loss is $l(\theta) - l(\theta^{+})$. After $\theta$ is updated, the goal is still to minimize the loss, so the gradient attention network is optimized according to Equation (19). For Equation (19), $\theta$ can be optimized using Stochastic Gradient Descent (SGD) from Equation (6) and, as long as $\theta^{+}$ is a function of $g(\phi)$, $\phi$ can be optimized by any optimizer.
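The idea of scoring attention weights by the loss they remove can be illustrated on the single-bit toy case. The sketch below is only a stand-in for the learned network: it assumes the per-bit loss $(h_i h_j - s_{ij})^2$, applies the weighted update directly to the codes rather than to $\theta$, and replaces optimization of $\phi$ with a search over a few hand-picked candidate weight pairs.

```python
def bit_loss(h_i, h_j, s):
    """Assumed per-bit pairwise loss (h_i * h_j - s)^2."""
    return (h_i * h_j - s) ** 2

def weighted_step(h_i, h_j, s, lr, xi_i, xi_j):
    """One gradient step with per-code weights on the derivatives."""
    residual = 2.0 * (h_i * h_j - s)
    g_i, g_j = residual * h_j, residual * h_i
    return h_i - lr * xi_i * g_i, h_j - lr * xi_j * g_j

def loss_reduction(h_i, h_j, s, lr, xi_i, xi_j):
    """l(before) - l(after): the quantity the attention weights should maximize."""
    new_i, new_j = weighted_step(h_i, h_j, s, lr, xi_i, xi_j)
    return bit_loss(h_i, h_j, s) - bit_loss(new_i, new_j, s)

h_i, h_j, s, lr = 1.0, -1.0, 1, 0.5

# (1, 1) is the unweighted update; the others are hypothetical attention outputs.
candidates = [(1.0, 1.0), (0.5, 0.5), (1.0, 0.0), (0.0, 1.0)]
reductions = [loss_reduction(h_i, h_j, s, lr, a, b) for a, b in candidates]
best = candidates[reductions.index(max(reductions))]
```

In this toy setting the unweighted update removes no loss, the even split removes part of it, and either one-sided weighting removes all of it, which is the behavior the objective for $\phi$ rewards.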

Experimental Results and Analysis
The UHGA model is based on the AlexNet implementation in PyTorch. Detailed configuration information for the experimental machine is shown in Table 1.

Datasets, Evaluation Metrics and Benchmarks
We use two widely adopted public image datasets, CIFAR-10 and NUS-WIDE, as benchmarks to compare UHGA with other methods and to verify its feasibility.
(1) CIFAR-10. A single-label dataset with a total of 60,000 images in 10 categories, each of which contains 6000 images.
(2) NUS-WIDE. A multi-label dataset with a total of 269,648 images. We use 21 categories, each associated with at least 5000 images; each image belongs to one or more categories.

Hyperparameter Analysis
We use Equation (3) to construct a similarity matrix between images. Equation (3) has two hyperparameters, $\alpha$ and $\beta$. To analyze their influence, we simply set $\alpha = \eta(D.\mathrm{mean} - D.\mathrm{min})$ and $\beta = \eta(D.\mathrm{max} - D.\mathrm{mean})$, so that only the hyperparameter $\eta$ needs to be adjusted to observe how the constructed similarity matrix affects retrieval accuracy. Table 2 shows the effect of different values of $\eta$ on the two datasets for different code lengths, and Figure 2 visualizes the resulting change in MAP. Figure 2 shows that MAP is relatively stable when $\eta$ is between 0.2 and 0.4 and that the trend reverses as $\eta$ increases further. Therefore, the hyperparameter $\eta$ is set to 0.3 in subsequent experiments.
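The reduction of the two thresholds to the single hyperparameter $\eta$ can be sketched as follows. The function name and the toy distance matrix are hypothetical, and the statistics are assumed to be taken over all entries of $D$, including the diagonal.

```python
def thresholds_from_eta(D, eta):
    """Derive alpha and beta from eta and the statistics of D.

    alpha = eta * (D.mean - D.min), beta = eta * (D.max - D.mean).
    """
    values = [v for row in D for v in row]
    mean = sum(values) / len(values)
    alpha = eta * (mean - min(values))
    beta = eta * (max(values) - mean)
    return alpha, beta

# Toy symmetric cosine distance matrix for three images.
D = [
    [0.0, 0.4, 1.0],
    [0.4, 0.0, 0.8],
    [1.0, 0.8, 0.0],
]
alpha, beta = thresholds_from_eta(D, eta=0.3)
```

Because $\alpha$ and $\beta$ scale with the spread of $D$ below and above its mean, $\eta = 0$ places both thresholds at the mean, while $\eta = 1$ pushes them out to the minimum and maximum distances.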

The effect of Gradient Attention
To study the effect of gradient attention, we added the gradient attention mechanism to methods based on pairwise loss. Table 3 compares four settings: SSDH, SSDH+GA, Ours, and Ours+GA. SSDH is the original method [32], SSDH+GA adds the gradient attention mechanism to SSDH, Ours trains the unsupervised hash model using only our own loss function, and Ours+GA adds the gradient attention mechanism to our loss function. Table 3 shows that the MAP of SSDH [32] improves significantly after the attention mechanism is added. Our own loss function already gives good results, and its MAP increases further with the attention mechanism. Figure 3 shows the training loss before and after adding the gradient attention (GA) mechanism to SSDH [32] and to our method. The loss clearly decreases faster once the gradient attention mechanism is added, which helps to obtain better hash codes.
The loss function we propose yields a large improvement in MAP in Table 3, which is also reflected in the training loss curves in Figure 3. The loss function we designed is more effective at constraining the generated hash codes. After the attention mechanism is added, both our method and SSDH [32] improve significantly in MAP, and in Figure 3 this improvement in hashing performance appears as a faster decline in training loss. Figure 4 shows the effect of different learning rates on retrieval results.
Panels (a) and (b) of Figure 4 show that, without the gradient attention mechanism, retrieval quality declines as the learning rate increases, because the network cannot converge during backpropagation when the learning rate is high. With the gradient attention mechanism, the magnitude of each gradient descent step is controlled not only by the learning rate but also by the gradient attention. During backpropagation, the gradient attention mechanism can select appropriate gradients with which to modify the network parameters, yielding a better deep hash model.

These results indicate that unsupervised methods based on pairwise loss face a dilemma in gradient descent, and that the design of the loss function also affects how likely this dilemma is to occur. The loss function we designed appears to be less prone to the dilemma. Adding the gradient attention mechanism effectively reduces its occurrence, thereby improving the performance of the hash method.

Comparison of Experimental Results
To train the hash function on CIFAR-10, we select 500 images from each category to form the training dataset and 100 images from each category to serve as the test dataset. All remaining images are used as the retrieval set. From the NUS-WIDE dataset, we select 21 common categories, each associated with at least 5000 images, for a total of 190,000 images. The NUS-WIDE test dataset consists of 100 images from each class, 2100 images in total. The training dataset contains 5000 randomly selected images, and the remaining images form the retrieval set.
Table 4 compares all methods on the evaluation metric MAP. We calculated the MAP of every comparison method for hash code lengths from 16 bits to 128 bits on the two datasets. For the different hash code lengths, our method is 8.73%, 8.67%, 12.36%, and 13.79% higher than USDH on CIFAR-10, and 5.77%, 5.71%, 9.51%, and 9.21% higher than USDH on NUS-WIDE, respectively.
Figures 5 and 6 show the PR and P@N curves on CIFAR-10, respectively. In Figure 5, the area under the PR curve of our method is the largest, and our method shows better results as the length of the hash code increases. Figure 6 shows that the retrieval precision of all methods declines as the number of retrieved results increases, but our method always gives the best results. Figures 7 and 8 show the PR and P@N curves on NUS-WIDE, respectively. Figure 7 shows that our method also performs better on a multi-label dataset. Although the precision of our method is lower than that of USDH when the recall is between 0 and 0.1, the area under the PR curve of our method is still the largest overall, which also indicates that our method is more stable. Figure 8 shows that the downward trend of the P@N curve is less pronounced than in Figure 6, and our method still consistently gives the best retrieval accuracy.
The above experimental results show that our unsupervised hashing method achieves the best performance among the compared methods. Compared with the other methods, our proposed method has two advantages: (1) the designed loss function generates better-performing hash codes; (2) the added gradient attention mechanism reduces the impact of the gradient descent dilemma during training.
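For reference, the MAP and P@N metrics used above can be computed from a Hamming ranking as in the following sketch. It is not the evaluation code behind the reported tables: it assumes single-label relevance (a retrieved image is relevant when its label equals the query label, as on CIFAR-10), whereas a multi-label dataset such as NUS-WIDE would typically count images sharing at least one label. The codes and labels below are toy values.

```python
def hamming(a, b):
    """Number of bit positions in which two hash codes differ."""
    return sum(x != y for x, y in zip(a, b))

def ranked_labels(query_code, db_codes, db_labels):
    """Database labels sorted by increasing Hamming distance to the query."""
    order = sorted(range(len(db_codes)),
                   key=lambda i: hamming(query_code, db_codes[i]))
    return [db_labels[i] for i in order]

def average_precision(query_code, query_label, db_codes, db_labels):
    """Mean of the precision values at the ranks of the relevant items."""
    hits, precision_sum = 0, 0.0
    ranking = ranked_labels(query_code, db_codes, db_labels)
    for rank, label in enumerate(ranking, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def precision_at_n(query_code, query_label, db_codes, db_labels, n):
    """Fraction of relevant items among the top-n retrieved results."""
    top = ranked_labels(query_code, db_codes, db_labels)[:n]
    return sum(label == query_label for label in top) / n

def mean_average_precision(q_codes, q_labels, db_codes, db_labels):
    """Average of the per-query average precision values."""
    aps = [average_precision(c, l, db_codes, db_labels)
           for c, l in zip(q_codes, q_labels)]
    return sum(aps) / len(aps)

db_codes = [[1, 1, -1, -1], [1, 1, 1, -1], [-1, -1, 1, 1], [-1, 1, 1, 1]]
db_labels = [0, 0, 1, 1]
q_codes = [[1, 1, -1, -1], [1, 1, 1, -1]]
q_labels = [0, 1]

map_score = mean_average_precision(q_codes, q_labels, db_codes, db_labels)
p_at_3 = precision_at_n(q_codes[1], q_labels[1], db_codes, db_labels, n=3)
```

The first query ranks both relevant items first, while the second query's code sits closer to the wrong class, so its relevant items appear only at ranks 3 and 4 and pull the mean down.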

Conclusions
In unsupervised deep hashing, a similarity matrix is usually constructed from extracted image features and used to train the neural network, with gradient descent updating the parameters during training. However, unsupervised hashing methods trained on image pairs may encounter a problem in the hash code update process: the hash codes of paired images are updated toward each other, so that after the network parameters are updated, the hash codes change position but the overall loss remains unchanged. This paper presents Deep Unsupervised Hashing with Gradient Attention (UHGA), an algorithm based on an end-to-end deep learning model. UHGA uses the cosine distance between features to construct similarity information between images and applies a gradient attention mechanism during gradient descent, which yields a better deep hash model and improves retrieval quality. Experiments on two datasets show that UHGA significantly improves on recent hashing methods.