An Efficient Supervised Deep Hashing Method for Image Retrieval

In recent years, searching and retrieving relevant images from large databases has become an emerging challenge for researchers. Hashing methods, which map raw data into short binary codes, have attracted increasing attention. Most existing hashing approaches map samples to a binary vector via a single linear projection, which restricts their flexibility and leads to optimization problems. To tackle this issue, we introduce a CNN-based hashing method that uses multiple nonlinear projections to produce additional short-bit binary codes. Further, an end-to-end hashing system is built using a convolutional neural network. We also design a loss function that maintains the similarity between images and minimizes the quantization error while encouraging a uniform distribution of the hash bits. Extensive experiments conducted on various datasets demonstrate the superiority of the proposed method in comparison with state-of-the-art deep hashing methods.


Introduction
With the rapid advancement of information technology, a vast amount of data has accumulated in different fields, and we are now living in the era of big data. A major challenge is determining how users can rapidly and precisely search and retrieve relevant data from large-scale databases. Effective and efficient retrieval from big datasets has become a hot research direction in both industry and academia. Owing to their efficiency in processing high-dimensional data, hashing methods have emerged as a solution to this challenge in recent years. The primary goal of hash learning is to transform each data item into a low-dimensional representation: a short code consisting of a series of bits, known as a binary hash code [1,2]. Hashing techniques can be categorized into two groups: data-independent and data-dependent. The key difference is that in data-independent methods the hash function is randomly generated or manually designed, whereas data-dependent methods learn it automatically from the data. The most widely used data-independent hashing methods for image retrieval and related tasks are locality-sensitive hashing (LSH) [3] and its variants, such as Superbit LSH [4], KLSH [5], NLSH [6], and variants with faster-to-compute hash functions [7]. In LSH, a set of random hyperplanes is generated from a Gaussian distribution, the original high-dimensional data are projected onto these hyperplanes, and a threshold function is applied to the projection results. The arrival of LSH enhanced the performance of image retrieval and opened a new approach to large-scale image retrieval. However, the hash functions of data-independent methods such as LSH are randomly generated or manually designed.
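The random-hyperplane idea behind LSH described above can be illustrated with a minimal sketch. It assumes a sign threshold at zero and codes in {−1, +1}; the function name `lsh_hash`, the dimensions, and the random seed are hypothetical choices for illustration and are not taken from the cited LSH variants.

```python
import numpy as np

def lsh_hash(x, hyperplanes):
    """Project rows of x onto random hyperplanes and threshold at zero.

    x:           (n_samples, dim) real-valued data
    hyperplanes: (n_bits, dim) normals drawn from a Gaussian distribution
    returns:     (n_samples, n_bits) codes with entries in {-1, +1}
    """
    projections = x @ hyperplanes.T
    return np.where(projections >= 0, 1, -1)

# Illustrative sizes only (assumptions, not values from the paper).
rng = np.random.default_rng(0)
dim, n_bits = 64, 16
hyperplanes = rng.standard_normal((n_bits, dim))
x = rng.standard_normal((5, dim))
codes = lsh_hash(x, hyperplanes)
```

Because the hyperplanes are drawn without looking at the data, the resulting codes ignore the data distribution, which is the limitation of data-independent hashing noted in the text.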
As the number of bits increases, the accuracy of a data-independent method such as LSH improves only slowly, because the hash function is generated randomly or designed manually, regardless of the distribution of the original data. As a result, achieving stable retrieval
certain limits on the highest layer of the deep neural network, the proposed method is optimized using the stochastic gradient descent approach and the backpropagation algorithm. The following list summarizes the significant contributions of the proposed method:

•
A deep convolutional neural network-based hash coding approach is introduced that employs multiple nonlinear projections to generate additional distinctive short binary codes. A CNN serves as the basis of the network to extract a rich representation of mid-level information. Meanwhile, learning the hash encoding and the feature representation concurrently in an end-to-end network keeps the binary codes consistent with the features.

•
To maintain semantic similarity, a loss function is designed that pushes the codes of dissimilar images apart and pulls the codes of similar images together. At the same time, we minimize the quantization error while encouraging a uniform distribution of the hash bits.

•
Finally, extensive experiments are conducted on three benchmarks for retrieval tasks: NUSWIDE, MIRFLICKR25, and MS COCO. The comparative results show that our method outperforms state-of-the-art supervised hashing methods.

Figure 1.
Basic structure of our model. The primary network is GoogleNet, and our hash layer replaces the last classification layer of GoogleNet. The number of units in the hash layer equals the number of hash bits. During training, a batch of images is fed to the network as input. The network is forced to encode similar images into similar binary codes and dissimilar images into different codes, while minimizing the quantization error and keeping the hash bits evenly distributed. At test time, a new image is fed to the trained network, and its output is quantized into {+1, −1}.

This article is organized as follows. Section 2 presents the proposed method and primary model. Section 3 describes the experimental results and baseline metrics, Section 4 presents the experimental setting and analysis compared with state-of-the-art methods, and Section 5 concludes the article.

Proposed Methodology
This study uses a supervised deep hashing method to generate compact binary codes. An overview of the proposed method is presented in Figure 1; it consists of three main stages. In the first stage, the network parameters are initialized. In the second stage, the images and their label information are used to fine-tune the network. Finally, once training is complete, binary codes are derived by quantizing the network's outputs for the input images. To enhance the quality and compactness of the generated binary codes, the proposed method forces them to meet the following requirements. (1) Similar images should be encoded into similar binary codes, whereas dissimilar images should be encoded into codes that are as different as possible. (2) With evenly distributed hash bits, the quantization error between Hamming space and Euclidean space should be reduced. Our method is further explained as follows.

Network Architecture
The training phase of the deep architecture has two main parts: initialization of the network and optimization. Owing to its excellent image classification performance, the well-known CNN-based Inception architecture GoogleNet is adopted in our model as the basic structure for extracting information for hashing. We initialize the network with a GoogleNet Caffe model pretrained on the large-scale ImageNet dataset, which contains over one million images divided into 1000 categories. To force the learning of compact binary codes for the hashing task, the last classification layer of GoogleNet is replaced by a fully connected hash layer. In the second stage, the network is fine-tuned on each image retrieval dataset using stochastic gradient descent and the backpropagation algorithm. The following sections describe the details of the loss function and parameter update.

Loss Function
Let Ω denote the space of input images. Our primary objective is to learn a mapping from Ω to a K-bit binary code, $\Omega \to \{+1, -1\}^K$, such that similar images, whether semantically or visually similar, are represented by similar binary codes. In this sense, the codes of similar images should be as close as possible, while the codes of different images should be far apart. Based on this, the loss function is designed to push the codes of different images away from each other and pull the codes of similar images together.
Given a pair of images $I_1$, $I_2$, the corresponding binary codes are denoted $c_1$, $c_2$, and the hash code length is K. Meanwhile, we ensure that the quantization error is minimized and that the hash bits are uniformly distributed so that the codes carry more information. The loss associated with this pair of images is defined in Equation (1), where Y indicates the semantic similarity of the images: Y = 1 if the images are similar and Y = 0 otherwise. The Hamming distance between two binary vectors is denoted D(·, ·), and m > 0 is a margin threshold. The first term of Equation (1) forces the hash codes of similar images closer together and those of dissimilar images farther apart, so that the Hamming distance between dissimilar images becomes larger. For clarity, this part is written separately as Equation (2).
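To make the behaviour of the pairwise term concrete, the following is a minimal sketch under stated assumptions. It uses the standard contrastive form (similar pairs are penalized by their distance, dissimilar pairs by a hinge on the margin m) with a coefficient of 1/2; both the exact form and the coefficient are assumptions, since the paper's Equations (1) and (2) are not reproduced here. The function names are hypothetical.

```python
import numpy as np

def hamming_distance(c1, c2):
    """D(c1, c2): number of positions where two {-1, +1} codes differ."""
    return int(np.sum(np.asarray(c1) != np.asarray(c2)))

def pairwise_similarity_loss(c1, c2, y, m):
    """Assumed contrastive form of the similarity term.

    y = 1 (similar pair):    loss grows with the Hamming distance.
    y = 0 (dissimilar pair): loss is zero once the distance reaches margin m.
    """
    d = hamming_distance(c1, c2)
    if y == 1:
        return 0.5 * d
    return 0.5 * max(m - d, 0)

c1 = np.array([1, -1, 1, 1])
c2 = np.array([1, 1, -1, 1])   # differs from c1 in two bits
loss_similar = pairwise_similarity_loss(c1, c2, y=1, m=4)
loss_dissimilar = pairwise_similarity_loss(c1, c2, y=0, m=4)
```

With this form, a dissimilar pair stops contributing to the loss once its codes are at least m bits apart, so the margin bounds how far dissimilar codes are pushed.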
The network takes batches of images as input, and each image in a batch is paired with every later image in the batch. Hence, if there are n images in the batch, the number of pairs is $C_n^2 = \frac{n!}{2!\,(n-2)!}$. Figure 2 shows how the combination is achieved.

Figure 2.
Assume a batch consisting of four images as illustrated. The loss is always calculated pairwise, between each image and every later image within the batch. The total number of pairs is denoted as $C_n^2$.

Training the network directly on Equation (2) with the backpropagation algorithm would be ideal. Nevertheless, this is challenging because Equation (2) is non-differentiable. A popular remedy is to apply a tanh or sigmoid function, which constrains the outputs to a bounded range approximating {−1, +1} and thereby relaxes the integer constraint. However, this type of technique slows down network convergence. Therefore, we relax the binary constraint on {−1, +1} and work with the real-valued network outputs, and Equation (2) is rewritten as Equation (3). We use the $l_2$ norm in this part of the loss function to compute the distance between the network's outputs, because lower-order norms produce subgradients that treat image pairs with different distances equally and thus ignore the information carried by the distance magnitudes.

The network is trained using the backpropagation algorithm with minibatch gradient descent. For this purpose, the gradient of Equation (3) with respect to $c_i$, $i \in \{1, 2\}$, needs to be computed. Equation (4) gives the gradient when Y = 1, and Equation (5) gives the gradient when Y = 0. A binarization step is required since the network's outputs are real-valued. The second part of Equation (1) aims to reduce the quantization error within Hamming space.
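The pairing scheme and the relaxed, real-valued version of the pairwise term can be sketched as follows. The pair enumeration follows the text directly (each image with every later image, giving n!/(2!(n−2)!) pairs). The loss form (squared l2 distance, hinge with margin m, coefficient 1/2) is an assumption carried over from the standard contrastive loss, and the similarity matrix `sim` is a hypothetical input standing in for the label-derived Y values.

```python
from itertools import combinations
from math import comb

import numpy as np

def batch_pairs(n):
    """All index pairs (i, j) with i < j in a batch of n images."""
    return list(combinations(range(n), 2))

def relaxed_batch_loss(v, sim, m):
    """Sum of the assumed relaxed pairwise loss over all pairs in a batch.

    v:   (n, K) real-valued network outputs
    sim: (n, n) matrix with sim[i][j] = 1 for similar pairs, 0 otherwise
    m:   margin threshold (m > 0)
    """
    total = 0.0
    for i, j in batch_pairs(len(v)):
        d = float(np.sum((v[i] - v[j]) ** 2))   # squared l2 distance
        if sim[i][j] == 1:
            total += 0.5 * d
        else:
            total += 0.5 * max(m - d, 0.0)
    return total

pairs = batch_pairs(4)   # the four-image batch of Figure 2
n_pairs = comb(4, 2)
```

For a batch of four images, this enumerates the pairs (0, 1), (0, 2), (0, 3), (1, 2), (1, 3), and (2, 3).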
The real-valued outputs of the network are denoted $v_i$. To increase the information capacity of the codes, we encourage a uniform distribution of the compact binary codes in the remaining part of Equation (1). A binary code carries more information when the probability of each bit being −1 or +1 is closer to 50%, which makes the sum of the bits nearly zero. Therefore, the loss function is as follows.
Here, $c_i(j)$ represents the j-th bit of the i-th binary code, q denotes the number of binary codes, and n denotes the code length. With this loss function, the network preserves the semantic similarity of images and improves retrieval performance, while also reducing the quantization error and distributing the binary codes evenly. After the training phase is complete, the model generates binary codes for test images. As illustrated in Figure 1, an image is first fed into the network and encoded into a K-dimensional real-valued feature vector v. A simple quantization c = sign(v) then yields the K-bit binary code, as stated earlier.
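The quantization step c = sign(v), the quantization error, and the bit-balance idea can be sketched as below. Only the sign quantization is taken directly from the text. The squared-error form of the quantization penalty, the per-bit mean used for the balance penalty, and the Hamming ranking helper are assumptions chosen for illustration, and all function names are hypothetical.

```python
import numpy as np

def binarize(v):
    """c = sign(v), mapping real-valued outputs to codes in {-1, +1}."""
    return np.where(np.asarray(v) >= 0, 1, -1)

def quantization_error(v):
    """Assumed penalty: squared gap between outputs and their binary codes."""
    v = np.asarray(v, dtype=float)
    return float(np.sum((v - binarize(v)) ** 2))

def bit_balance_penalty(codes):
    """Assumed penalty: squared mean of each bit over the q codes.

    It is zero when every bit is -1 for half of the codes and +1 for the rest.
    """
    codes = np.asarray(codes, dtype=float)
    return float(np.sum(np.mean(codes, axis=0) ** 2))

def hamming_rank(query, database):
    """Indices of database codes sorted by Hamming distance to the query."""
    k = database.shape[1]
    dists = (k - database @ query) // 2   # valid for {-1, +1} codes
    return np.argsort(dists, kind="stable")

v = np.array([[0.9, -0.8], [-0.7, 0.6]])
codes = binarize(v)
database = np.array([[1, 1, 1, 1], [-1, -1, -1, -1], [1, 1, 1, -1]])
order = hamming_rank(np.array([1, 1, 1, 1]), database)
```

The identity used in `hamming_rank` holds because the inner product of two {−1, +1} codes of length K equals K minus twice their Hamming distance.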

Dataset
The performance of the proposed method is evaluated on three widely used datasets, and the results are compared with other state-of-the-art methods.
MIRFLICKR-25: This is a smaller version of the MirFlickr25K dataset [35]. The dataset has 25,000 images divided into 24 categories. In our experiments, images are categorized according to their raw annotations. Each image in the collection is labeled with a 24-dimensional vector corresponding to 24 different object types. We use 10,000 images to train our hash encoding method and another 5000 images to test the hash model.

NUSWIDE [36]: Images are divided into 81 categories, and some images have multiple labels. In our study, we used a subset of the NUSWIDE dataset. Following previous work, we used the 21 most frequently used labels, each of which is associated with at least 5000 images.

MS COCO: The dataset contains 82,783 images in the training set and 40,137 images in the validation set [37]. Each image is annotated with five sentences, and ground-truth labels are based on the 80 most frequent categories. After removing images without category information, 82,081 training images remain.

Table 1 lists the dimensions of the three datasets and the number of training images, test images, and labels for each dataset.

Evaluation Metrics
To evaluate experimental performance, we use mean average precision (mAP) to measure the quality of retrieving database images, which has been shown to be discriminative and stable.
Here, S denotes the number of related images, and AP is the average of the precision values computed over the first N retrieved images at the rank of each related image. The mAP is then the mean of these AP values.
Here, $\Pi(\cdot) \in \{0, 1\}$ is an indicator function that equals 1 when $r_p > 0$, $r_p$ is the similarity of the p-th ranked image to the query image, $G_{r>0}$ is the number of related images, and P@N is the precision weighted by the similarity level of each image.
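The metrics described above can be illustrated with a minimal sketch. It assumes binary relevance (an image counts as related when its similarity r_p is greater than zero) and the standard definitions of P@N and average precision; the similarity-weighted variant mentioned in the text is not reproduced, and the function names are hypothetical.

```python
import numpy as np

def precision_at_n(relevance, n):
    """Fraction of the first n retrieved images that are related (r_p > 0)."""
    rel = np.asarray(relevance[:n]) > 0
    return float(rel.mean()) if rel.size else 0.0

def average_precision(relevance, n):
    """Mean of the precision values at the rank of each related image."""
    rel = (np.asarray(relevance[:n]) > 0).astype(float)
    if rel.sum() == 0:
        return 0.0
    ranks = np.arange(1, rel.size + 1)
    precision_at_rank = np.cumsum(rel) / ranks
    return float(np.sum(precision_at_rank * rel) / rel.sum())

def mean_average_precision(relevance_lists, n):
    """Mean of the AP values over all queries."""
    return float(np.mean([average_precision(r, n) for r in relevance_lists]))

# Ranked relevance for two hypothetical queries (1 = related, 0 = unrelated).
query_a = [1, 0, 1, 0]
query_b = [0, 1]
ap_a = average_precision(query_a, n=4)
map_value = mean_average_precision([query_a, query_b], n=4)
```

For `query_a`, the related images appear at ranks 1 and 3, so the AP is the mean of 1/1 and 2/3.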

Experimental Setting and Result Analysis
Owing to its remarkable performance in image classification, the well-known CNN-based Inception architecture GoogleNet is adopted in our model as the basic structure for hashing. The network is initialized from a GoogleNet Caffe model pre-trained on the large-scale ImageNet dataset. For the hashing task, the last classification layer of GoogleNet is replaced with a fully connected layer to enforce compact binary code learning. In the second phase, the network is fine-tuned on each dataset for image retrieval using stochastic gradient descent and the backpropagation algorithm.

Table 2 shows the mAP values for all compared baselines. The results on the different datasets with different binary code lengths show that our method performs significantly better than each of the competitive baselines. On the MIRFlickr25K, MSCOCO, and NUSWIDE datasets, the performance of the proposed method improves by 11.05%, 38.45%, and 23.27%, respectively, in comparison with the NNH method. On the three widely used datasets, the mAP values obtained by our method are 10.02%, 8.91%, and 12.37% higher than those obtained by other methods. These improvements indicate that our method is effective.

Figure 3 shows the precision-recall curves for the three datasets with 16-bit hash codes. Besides the mAP and the precision curves of the 16-bit hash codes, we also report P@5000 evaluated over various numbers of top retrieved samples, as shown in Figures 4 and 5, respectively. The average performance gaps between PCAH and the proposed method on the three large datasets are 14.31%, 38.61%, and 23.17%. Compared with DSH, the proposed method achieves average performance improvements of 10.02%, 8.79%, and 12.287% in P@5000 on the three datasets. Figure 5 illustrates that the proposed method consistently outperforms all other state-of-the-art methods on the various datasets. The figure shows the precision-recall curve for three different datasets, and the area under the curve indicates significant performance. Furthermore, we found that the method also yields high accuracy at low recall points, which is sufficient for implementing such a system for image retrieval. Therefore, the results indicate that our proposed technique is significantly better than all baseline approaches evaluated on the various datasets in terms of mAP, P@5000, and precision-recall curves.

Table 3 reports the retrieval performance in terms of mAP. In comparisons with other methods on the three datasets, the proposed method consistently outperforms them; for example, compared with the competitive method CNNH [38], our method achieves average improvements of 6.01%, 4.71%, and 5.67% on MIRFlickr25K, MSCOCO, and NUSWIDE, respectively. On the three widely used datasets, we find that the average performance gap is 2.01%, 1.07%, and 2.30%. Table 4 reports the P@5000 results. On average, the proposed method significantly outperforms CNNH and DNNH [39], achieving performance improvements of 4.81%, 4.95%, and 4.95%, respectively. Meanwhile, compared with DNNH, our method improves average performance by 2.32%, 2.07%, and 3.15% on MIRFlickr25K, MSCOCO, and NUSWIDE, respectively. Our method also performs better in mAP and P@5000 than SDH, the best deep hashing algorithm that uses a binary quantization function.

Figure 6 shows retrieval results for the top 11 returned samples based on Hamming ranking on NUSWIDE and MIRFlickr25K. It can be seen that the proposed strategy produces better results than the other options. On the NUSWIDE dataset in particular, the proposed method achieves better retrieval results than other state-of-the-art methods. The proposed approach better retains the similarity of image pairs while producing discriminative hash codes.

Impact of parameter: The proposed method has a single parameter, the dimension K of the feature space. Using a linear search over $(2^1, 2^2, 2^3, 2^4, 2^5)$, we analyze the impact of K on the different datasets. In particular, we fix the code length to 60 bits, which is the least common multiple of the values of $\log_2 K$. Figure 7 reports the mAP and P@5000 results for image retrieval. For the MIRFlickr25K dataset, different settings of K have only a very slight impact on performance. In contrast, performance decreases on MSCOCO when K is set to 2. The settings of K for MIRFlickr25K, MSCOCO, and NUSWIDE can be determined from Figure 7.

Convergence Analysis and Time Complexity
The convergence and time complexity of the proposed method were evaluated experimentally. Table 5 compares its time complexity with that of other state-of-the-art hashing methods. Convergence was evaluated using the loss value. Figure 8 shows how the loss of the proposed method changes on the three datasets. The loss becomes smaller and more stable as the number of iterations increases. During training, the proposed method appears to reach convergence quickly, which significantly reduces training time.

Conclusions
This article presents a CNN-based supervised deep hashing method that aims to produce high-quality short binary codes with efficient performance for image retrieval. The proposed method can be viewed from two perspectives. First, learning the hash coding and the feature representation simultaneously makes the hash codes fit the features. Second, the designed loss function constrains the binary codes so that the similarity of the original space is maintained. While the quantization error is minimized, the hash bits are distributed uniformly. Experiments on three widely used standard datasets show that the proposed system outperforms the compared state-of-the-art algorithms.