Deep Supervised Hashing by Fusing Multiscale Deep Features for Image Retrieval



Introduction
Advances in the internet and communication technologies have led to an overwhelming influx of images on the web [1][2][3], creating a challenge for accurate and efficient large-scale image retrieval. To address this, hash-based image retrieval techniques [4] have gained attention for their ability to generate compact binary codes, which offer computational efficiency and storage advantages.
Deep hashing techniques [19][20][21][22] have arisen from the achievements of deep neural networks in computer vision tasks. Unlike traditional hashing methods, they can effectively extract high-level semantic features and support end-to-end frameworks for generating binary codes. Nevertheless, a drawback of many existing deep hashing techniques [23][24][25] is their reliance on features from the penultimate fully connected layer, which serve as global image descriptors but fail to capture local characteristics.
To overcome these challenges, this paper proposes a novel deep hashing method, called Deep Supervised Hashing by Fusing Multiscale Deep Features (DSHFMDF), which effectively captures multiscale object information. Specifically, it extracts features from different network stages, fuses them at a fusion layer, and encodes them into robust hash codes. The network exploits multiple hashing results based on features at different scales, enhancing retrieval recall without sacrificing precision. The key contributions of this paper are as follows:
1. Integration of Structural Information: Multiscale deep hashing allows for the integration of structural information from images at multiple levels of granularity. Different scales capture varying levels of structural detail. By considering these multiple scales, the model can learn to encode both low-level details and high-level structural features, resulting in more informative hash codes.

2. Feature Hierarchy: Deep neural networks used in multiscale hashing have multiple layers, each capturing features at a different level of abstraction. Lower layers capture finer details, while higher layers capture more abstract features and structures. This hierarchical representation encodes both fine-grained structural details and higher-level abstract information, addressing the trade-off between the two.

3. Adaptability to Task-Specific Needs: Multiscale deep hashing can be tailored to the specific requirements of the retrieval task. For tasks where preserving structural information is crucial, appropriate design choices can prioritize encoding such information in the hash codes.

4. Quantization and Bit Allocation: Hashing involves quantizing continuous features into binary codes. The bit allocation strategy can be optimized to balance capturing structural information against retrieval accuracy; techniques such as adaptive bit allocation can assign more bits to preserve crucial structural details while maintaining retrieval accuracy.

5. Learning Objectives and Loss Functions: Learning objectives and loss functions can be customized to emphasize the preservation of structural information. Including loss terms that encourage the preservation of specific structural features guides the learning process accordingly.

Related Works
In recent years, hashing methods have gained significant attention in the field of image retrieval due to their ability to store large amounts of data efficiently and process them quickly [6,26]. Like Principal Component Analysis (PCA) [27] and Linear Discriminant Analysis (LDA) [28], which are widely used for dimensionality reduction, the fundamental objective of hashing is to transform high-dimensional input data, such as images, into low-dimensional hash codes. Hashing methods thereby aim to reduce the Hamming distance between similar image pairs while maximizing it for dissimilar pairs, enabling efficient and accurate image retrieval.
The existing literature on hashing can be broadly categorized into supervised and unsupervised approaches. Supervised hashing methods utilize labeled data, whereas unsupervised methods operate without any supervision. Unsupervised hashing methods, such as Locality Sensitive Hashing (LSH) [29], Spectral Hashing (SH) [30], and Iterative Quantization (ITQ) [8], seek to learn hash functions from unlabeled training samples. These approaches convert input images into binary codes, enabling efficient storage and retrieval. While LSH has been one of the most widely used unsupervised hashing approaches, methods like SH and ITQ have also been successfully employed in subsequent studies. Supervised hashing techniques, on the other hand, leverage labeled data to improve the accuracy of the generated hash codes and outperform unsupervised approaches in retrieval performance. Notable supervised methods include Supervised Hashing with Kernels (KSH) [31], Minimal Loss Hashing (MLH) [32], and Supervised Discrete Hashing (SDH) [33]. KSH introduces a nonlinear hash function in kernel space to capture complex relationships between image features and their labels. Rather than directly optimizing hash functions, MLH employs structured Support Vector Machines (SVMs) to build an objective function for learning them. In contrast, SDH redefines the optimization objective to generate high-quality hash codes directly, without relaxation.
CNNH learns hash codes from features extracted by Convolutional Neural Networks (CNNs). It follows a two-step process in which hash function learning and feature representation learning are performed independently; by leveraging the rich, high-level representations captured by CNNs, it generates effective hash codes for image retrieval. In contrast, DPSH takes a Bayesian approach to establish a relationship between hash codes and pairwise labels, optimizing this relationship to learn hash functions that preserve pairwise similarities and yield more discriminative codes. HashGAN introduces Wasserstein Generative Adversarial Networks (GANs) to enhance the training process: it exploits pairwise similarity or dissimilarity information within a Bayesian framework, effectively augmenting the training data and generating high-quality hash codes. Zhuang et al. [36] propose a binary CNN classifier with a triplet-based loss that learns semantic relationships and hashing functions simultaneously; by considering triplets of samples and their similarities or dissimilarities, it produces hash codes that preserve semantic relationships while enhancing retrieval accuracy. DTQ combines a triplet quantization strategy with a supervised deep learning framework, jointly optimizing quantization and feature learning so that the resulting codes preserve the relative distances between samples and improve retrieval performance. SSDH builds hash functions as new fully connected (FC) layers and learns the hash codes by minimizing the classification error, yielding compact, discriminative codes that preserve the semantic information in the data. Wang et al. provide a general framework for distance-preserving linear hashing that incorporates deep hashing approaches; the framework learns hash functions that preserve pairwise distances between samples, enabling efficient similarity search. SADH is a two-step hashing algorithm that uses the output representations of the FC layers to update the similarity graph matrix, capturing the intrinsic structure of the data and refining the hash codes accordingly. Overall, these approaches showcase various strategies for leveraging deep learning in image hashing: exploiting deep neural networks, modeling pairwise relationships, incorporating Bayesian frameworks, and optimizing hash functions under different objectives and constraints.
In addition to these methods, recent research has introduced further advancements in image hashing. Deep-Feature Enhancing and Semantic-Preserving Hashing (DFEH) [41] addresses issues in traditional hashing by introducing a feature enhancement layer that improves feature extraction, removes redundant features, and better preserves semantic relationships; it uses contrastive and balance losses to produce compact binary codes. Deep Hashing via Weight Pruning (DHWP) [42] obtains short, high-quality hash codes by training models with relatively long codes and gradually shortening them via weight pruning, outperforming existing state-of-the-art methods, especially for short hash codes. Deep Momentum Uncertainty Hashing (DMUH) [43] explicitly estimates and leverages uncertainty during training, modeling bit-level uncertainty to improve retrieval performance.
While many hashing methods rely on features from the last fully connected (FC) layer, it has been recognized that extracting different types of features can yield a more comprehensive image description and enhance retrieval performance. Several approaches have been suggested for multi-level image retrieval. For instance, Lin et al. propose DDH [44], which combines end-to-end learning, divide-and-encode, and hash code learning into a unified framework; DDH employs a stack of convolutional pooling (conv-pool) layers and obtains multiscale features by combining the outputs of the third pooling layer and the fourth convolutional layer. Yang et al. present Feature Pyramid Hashing (FPH) [45], a novel image hashing architecture with two pyramids, vertical and horizontal, designed to capture intricate visual details and semantic information for fine-grained image retrieval. In [46], Redaoui and Belloulata propose Deep Feature Pyramid Hashing (DFPH), which fully utilizes the multi-level visual and semantic information of images. Ng et al. introduce multi-level supervised hashing (MLSH) [47], which constructs and trains distinct hash tables on different levels of features, such as semantic and structural information; by incorporating multiple levels of information, MLSH aims to enhance retrieval accuracy.
In summary, hashing methods for image retrieval have witnessed significant advancements in recent years. From unsupervised approaches to supervised and deep hashing techniques, researchers have explored various ways to learn effective hash functions and generate compact codes for efficient retrieval. Furthermore, incorporating multi-level features has shown promise in improving retrieval performance by capturing both fine-grained details and high-level semantics.

Proposed Method
In this section, we provide a comprehensive explanation of our proposed approach, Deep Supervised Hashing by Fusing Multiscale Deep Features (DSHFMDF). We start by defining the problem of learning hash codes and then introduce the architecture of our model. Finally, we describe the objective function of the proposed DSHFMDF method.

Problem Definition
Let $X = \{x_i\}_{i=1}^{N}$ denote a set of $N$ training images, and let $Y = \{y_i\}_{i=1}^{N} \in \mathbb{R}^{K \times N}$ denote the ground-truth labels of the samples $x_i$, where $K$ represents the number of classes. The pairwise label matrix $S = \{s_{ij}\}$ indicates the semantic similarity between training image pairs, with $s_{ij} \in \{0, 1\}$: if $s_{ij} = 1$, samples $x_i$ and $x_j$ are semantically similar, whereas $s_{ij} = 0$ indicates that they are not. The objective of deep hashing methods is to learn a deep hash function $f : x_i \mapsto b_i \in \{-1, +1\}^{L}$, where $L$ represents the length of the binary codes.

To begin with, the feature extraction stage plays a crucial role in capturing relevant information from the input image. In our approach, we leverage the VGG-19 network as the backbone, a deep convolutional neural network known for extracting rich and discriminative features. The VGG-19 architecture comprises several layers, each responsible for extracting features at a different level of abstraction.
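As a concrete illustration, the following minimal sketch shows how the pairwise label matrix $S$ can be constructed from multi-hot label vectors; the function name and tensor layout are our own illustration, not part of the paper.

```python
import torch

def similarity_matrix(labels: torch.Tensor) -> torch.Tensor:
    """Build S from label vectors: s_ij = 1 if samples i and j share
    at least one semantic label, else 0. labels: (N, K) multi-hot."""
    labels = labels.float()
    return (labels @ labels.t() > 0).float()
```

For single-label datasets such as CIFAR-10, one-hot label vectors reduce this to an equality test on class indices.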

Model Architecture
During the feature extraction process, we extract features from multiple levels of the VGG-19 network: low-level features that capture local structural information, such as edges and corners, and high-level features that capture more abstract, semantic information about the image. Specifically, we use the feature output of the last convolutional layer within each convolutional block ('conv3', 'conv4', and 'conv5'), as well as the fully connected layer 'fc1'. We deliberately exclude 'conv1' and 'conv2' due to their substantial memory footprint and relatively low semantic content.
Table 1 provides a detailed overview of the convolutional layers' parameters and the corresponding feature sizes of the different convolutional blocks. This multi-level feature-extraction strategy captures a wide spectrum of image characteristics and, by considering features from multiple levels of abstraction, yields a comprehensive representation that balances fine-grained local detail against global semantic information, resulting in a more robust and discriminative representation for our task.
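The sketch below illustrates one way to expose these four feature levels with torchvision's pre-trained VGG-19; the sequential slice indices (taking the ReLU after the last convolution of blocks 3-5) and the module layout are assumptions about torchvision's layer ordering, not code from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class MultiScaleVGG19(nn.Module):
    """Extract 'conv3', 'conv4', 'conv5', and 'fc1' features (sketch)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        f = vgg.features
        self.stem = f[:10]        # conv1-conv2 blocks (excluded from hashing)
        self.block3 = f[10:18]    # up to the ReLU after the last conv3 layer
        self.pool3, self.block4 = f[18], f[19:27]
        self.pool4, self.block5 = f[27], f[28:36]
        self.pool5 = f[36]
        self.avgpool = vgg.avgpool
        self.fc1 = vgg.classifier[:2]   # Linear(25088, 4096) + ReLU

    def forward(self, x):
        c3 = self.block3(self.stem(x))
        c4 = self.block4(self.pool3(c3))
        c5 = self.block5(self.pool4(c4))
        flat = torch.flatten(self.avgpool(self.pool5(c5)), 1)
        return c3, c4, c5, self.fc1(flat)
```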
After feature extraction, we move on to the feature reduction stage, where we reduce the dimensionality of the extracted features while preserving their discriminative power. To achieve this, we employ a 1 × 1 convolutional kernel, which acts as a linear combination of the features from different levels. This process enhances the depth and robustness of the extracted features while reducing redundancy.
Next, we proceed to the feature fusion layer, which consists of 1024 nodes. At this layer, we connect and combine the different feature levels, allowing the integration of both low-level and high-level information. This fusion captures a comprehensive representation of the image, combining local structural details with global semantic information.
To approximate hash codes, we perform a nonlinear mapping of the features from the fusion layer and the fully connected (FC) layer. This mapping is accomplished using hash layers consisting of L nodes, where L is the desired hash code length. The nonlinear mapping ensures that the generated hash codes capture the essential characteristics of the image representation in a compact and efficient manner.
Moving forward, we concatenate the two hashing layers, yielding a consolidated representation of the hash codes. This concatenated layer is then connected to the final hashing layer, which further refines the representation and prepares it for classification. This arrangement enhances the preservation of semantic information in the generated hash codes, ensuring that they possess meaningful and discriminative properties.
The classification layer is the last component of our model architecture. It contains as many neurons as there are classes in the dataset, allowing the network to classify images based on the learned representations. The classification layer exploits the discriminative power of the hash codes to accurately assign images to their respective classes.
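A compact sketch of the head described above follows. The 1024-node fusion layer, the L-bit hash layers, their concatenation, the final hash layer, and the class-count output follow the text; the 256-channel 1 × 1 reduction, the global average pooling, and tanh as the nonlinear mapping are our assumptions.

```python
import torch
import torch.nn as nn

class HashingHead(nn.Module):
    """Feature reduction, fusion, hash coding, and classification (sketch)."""
    def __init__(self, bits=48, num_classes=10, reduced=256):
        super().__init__()
        # 1 x 1 convolutions linearly combine and reduce each conv feature map
        self.red3 = nn.Conv2d(256, reduced, kernel_size=1)
        self.red4 = nn.Conv2d(512, reduced, kernel_size=1)
        self.red5 = nn.Conv2d(512, reduced, kernel_size=1)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fusion = nn.Linear(3 * reduced, 1024)   # 1024-node fusion layer
        self.hash_fused = nn.Linear(1024, bits)      # hash layer on fused features
        self.hash_fc1 = nn.Linear(4096, bits)        # hash layer on 'fc1' features
        self.hash_final = nn.Linear(2 * bits, bits)  # final hash layer
        self.classifier = nn.Linear(bits, num_classes)

    def forward(self, c3, c4, c5, fc1):
        pooled = [self.gap(red(c)).flatten(1)
                  for red, c in ((self.red3, c3), (self.red4, c4), (self.red5, c5))]
        fused = torch.tanh(self.fusion(torch.cat(pooled, dim=1)))
        u = torch.cat([torch.tanh(self.hash_fused(fused)),
                       torch.tanh(self.hash_fc1(fc1))], dim=1)
        u = torch.tanh(self.hash_final(u))           # continuous hash output u_i
        return u, self.classifier(u)
```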
Through the comprehensive approach outlined above, our model utilizes different hashing outcomes based on the various feature levels. This leads to improved image retrieval performance, as the hash codes capture both local and global information. Furthermore, the learning process of the hash codes ensures the preservation of pairwise similarity and the maintenance of semantic information, resulting in more meaningful and effective image retrieval based on the learned representations.

Objective Function
To ensure the learning of similarity-preserving hash codes, our DSHFMDF approach employs three loss functions: pairwise similarity loss, quantization loss, and classification loss. These losses are combined to train our model effectively.

Pairwise Similarity Loss
Our DSHFMDF strives to preserve the similarity between pairs of input data in the Hamming space, and we measure pairwise similarity with the inner product. The Hamming distance between hash codes $b_i$ and $b_j$ is related to their inner product $\langle b_i, b_j \rangle = b_i^T b_j$ by $\mathrm{dist}_H(b_i, b_j) = \frac{1}{2}\left(L - b_i^T b_j\right)$. Given the binary codes $B = \{b_i\}_{i=1}^{N}$ and the pairwise labels $S = \{s_{ij}\}$, the likelihood of the pairwise labels is formulated as

$$p(s_{ij} \mid b_i, b_j) = \begin{cases} \sigma(w_{ij}), & s_{ij} = 1, \\ 1 - \sigma(w_{ij}), & s_{ij} = 0, \end{cases}$$

where $\sigma(w_{ij}) = \frac{1}{1 + e^{-w_{ij}}}$ and $w_{ij} = \frac{1}{2} b_i^T b_j$. This formulation implies that a larger inner product $\langle b_i, b_j \rangle$ corresponds to a smaller $\mathrm{dist}_H(b_i, b_j)$ and a higher value of $p(1 \mid b_i, b_j)$; thus, when $s_{ij} = 1$, the binary codes $b_i$ and $b_j$ are considered similar.
Taking the negative log-likelihood of the labels in $S$ leads to the following optimization problem:

$$J_1 = -\log p(S \mid B) = -\sum_{s_{ij} \in S} \left( s_{ij} w_{ij} - \log\left(1 + e^{w_{ij}}\right) \right).$$

Minimizing this objective reduces the Hamming distance between similar samples while increasing it between dissimilar ones, in line with the goals of pairwise similarity-based hashing techniques.
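A minimal PyTorch sketch of $J_1$ is shown below, using the continuous hash-layer outputs $u_i$ in place of $b_i$ during training (as is standard for this relaxation) and softplus for numerical stability; the function name and batch layout are ours.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(u: torch.Tensor, s: torch.Tensor) -> torch.Tensor:
    """J1 (sketch): u is the (Q, L) hash-layer output, s a (Q, Q) 0/1 matrix."""
    w = 0.5 * (u @ u.t())                  # w_ij = (1/2) u_i^T u_j
    # -(s_ij * w_ij - log(1 + e^{w_ij})), averaged over all pairs in the batch
    return (F.softplus(w) - s * w).mean()
```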

Quantization Loss
In practical applications, binary hash codes are commonly used to measure similarity. However, optimizing discrete hash codes within a CNN is challenging. To overcome this, we adopt a continuous relaxation of hash coding: the output of the hash layer is denoted $u_i$, and we set $b_i = \mathrm{sgn}(u_i)$.
To minimize the discrepancy between the continuous outputs and the discrete hash codes, we introduce the quantization loss as the second objective:

$$J_2 = \sum_{i=1}^{Q} \left\| b_i - u_i \right\|_2^2,$$

where $Q$ represents the mini-batch size.
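A corresponding sketch of $J_2$ follows; averaging over the mini-batch instead of summing is a minor normalization choice on our part.

```python
import torch

def quantization_loss(u: torch.Tensor) -> torch.Tensor:
    """J2 (sketch): distance between u_i and b_i = sgn(u_i)."""
    b = torch.sign(u).detach()   # discrete codes; no gradient flows through sgn
    return (b - u).pow(2).sum(dim=1).mean()
```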

Classification Loss
To ensure robust learning of multiscale features throughout the deep network, we employ the cross-entropy (classification) loss to classify the classes:

$$J_3 = -\sum_{i=1}^{Q} \sum_{k=1}^{K} y_{i,k} \log p_{i,k},$$

where $y_{i,k}$ denotes the true label and $p_{i,k}$ the softmax output of the $i$-th training sample for the $k$-th class.
To summarize, the overall loss function combines the pairwise similarity, quantization, and classification losses:

$$J = J_1 + \beta J_2 + \gamma J_3,$$

where $\beta$ and $\gamma$ are trade-off hyperparameters weighting the quantization and classification terms (cf. the ablation studies below).
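Combining the pieces, here is a sketch of the overall objective that reuses the loss functions sketched above; the default weights mirror the values reported in the experimental settings, under our assumption about which trade-off weight belongs to which term.

```python
import torch.nn.functional as F

def total_loss(u, logits, s, targets, beta=0.01, gamma=0.1):
    """J = J1 + beta * J2 + gamma * J3 (sketch)."""
    j1 = pairwise_similarity_loss(u, s)
    j2 = quantization_loss(u)
    j3 = F.cross_entropy(logits, targets)   # J3: softmax cross-entropy
    return j1 + beta * j2 + gamma * j3
```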

Experiments
We validate the effectiveness of our approach on two publicly available datasets: NUS-WIDE and CIFAR-10. We first provide a concise overview of these datasets, followed by our experimental configuration. Section 4.3 presents the evaluation metrics and baseline methods. Finally, we present the results of our method, including validations and comparisons with several state-of-the-art hashing techniques.

Datasets
The CIFAR-10 [48] database, described by Krizhevsky et al. (2009), comprises 60,000 images in 10 classes, each of dimension 32 × 32 pixels. Following the protocol in [49], we randomly choose 100 images per class as queries, resulting in 1000 test instances; the remaining images form the database set. Additionally, we randomly select 500 images per category (5000 in total) from the database to create the training set.
The NUS-WIDE [50] database, introduced by Chua et al. (2009), is a collection of approximately 270,000 images sourced from Flickr, annotated with 81 labels or concepts. For our experiments, we randomly choose 2100 images from 21 classes as the query set, while the remaining images serve as the database. Furthermore, we randomly select 10,000 images from the database to construct the training set.
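For reference, here is a sketch of the CIFAR-10 split protocol described above (100 queries and 500 training images per class, with the non-query images forming the database); the function name and seed are illustrative only.

```python
import numpy as np

def split_cifar10(labels, n_query=100, n_train=500, seed=0):
    """Return query, database, and training indices per the protocol above."""
    rng = np.random.default_rng(seed)
    query_idx, train_idx = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        query_idx.extend(idx[:n_query])                    # 100 queries per class
        train_idx.extend(idx[n_query:n_query + n_train])   # 500 training images
    db_idx = np.setdiff1d(np.arange(len(labels)), query_idx)  # rest = database
    return np.asarray(query_idx), db_idx, np.asarray(train_idx)
```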

Experimental Settings
We implement DSHFMDF in PyTorch. As the base network, we employ a VGG-19 convolutional network pre-trained on the ImageNet dataset [51]. Throughout our experiments, we train the network with the Adam optimizer [52] at a learning rate of $1 \times 10^{-5}$. For the hyperparameters of the cost function, we set the two trade-off weights to 0.01 and 0.1, respectively (i.e., $\beta = 0.01$ and $\gamma = 0.1$ in the notation of the objective function).
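A minimal training-loop sketch under the stated settings (Adam, learning rate $1 \times 10^{-5}$), reusing the loss sketches above; the epoch count, the combined model returning (u, logits), and the single-label similarity construction are assumptions.

```python
import torch

def train(model, loader, epochs=50, device="cuda"):
    """Train the hashing network (sketch)."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=1e-5)
    for _ in range(epochs):
        for images, targets in loader:
            images, targets = images.to(device), targets.to(device)
            u, logits = model(images)   # continuous codes and class scores
            # single-label case: s_ij = 1 iff the two samples share a class
            s = (targets[:, None] == targets[None, :]).float()
            loss = total_loss(u, logits, s, targets)
            opt.zero_grad()
            loss.backward()
            opt.step()
```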

Evaluation Metrics
To assess the performance of the various approaches, we employ four evaluation metrics: Mean Average Precision (MAP), Precision-Recall (PR) curves, Precision within Hamming radius 2, and Precision at top N returned results (P@N).
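For completeness, here is a sketch of MAP under Hamming ranking, the protocol used in the result tables below; the helper name and array layout are ours.

```python
import numpy as np

def mean_average_precision(q_codes, db_codes, q_labels, db_labels, topk=None):
    """MAP over Hamming ranking (sketch). Codes are +/-1 arrays of shape
    (n, L); labels are multi-hot arrays, and a shared label means relevant."""
    n_bits = db_codes.shape[1]
    aps = []
    for q, ql in zip(q_codes, q_labels):
        ham = 0.5 * (n_bits - db_codes @ q)      # Hamming distance to every item
        order = np.argsort(ham)
        rel = (db_labels[order] @ ql > 0).astype(np.float64)
        if topk is not None:
            rel = rel[:topk]
        if rel.sum() == 0:
            continue
        prec = np.cumsum(rel) / np.arange(1, len(rel) + 1)   # precision@k
        aps.append((prec * rel).sum() / rel.sum())           # average precision
    return float(np.mean(aps))
```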
We compare our proposed DSHFMDF method with several classical and state-of-the-art methods, encompassing five unsupervised shallow methods, two traditional supervised hashing techniques, and eight deep supervised hashing techniques. On CIFAR-10 and the multi-label NUS-WIDE dataset, samples are considered similar if they share at least one semantic label; otherwise, they are considered dissimilar.

Results Discussion
The results of our experiments on the CIFAR-10 and NUS-WIDE datasets, evaluating performance across various hash code lengths, are presented in Table 2; part of these results appeared in our preliminary paper [53]. The table clearly shows that our proposed DSHFMDF (Deep Supervised Hashing by Fusing Multiscale Deep Features) method outperforms all compared methods by a significant margin. Specifically, compared to SDH (Supervised Discrete Hashing), one of the strongest shallow hashing methods, DSHFMDF achieves absolute increases of 49.4% and 24% in average Mean Average Precision (MAP) on the CIFAR-10 and NUS-WIDE datasets, respectively.

To further evaluate DSHFMDF, we analyze its Precision-Recall (PR) performance and Precision at N (P@N) measures against the other approaches. Figures 2b,c and 3b,c present the PR and P@N results for the CIFAR-10 and NUS-WIDE datasets, respectively. As Figures 2c and 3c show, the proposed DSHFMDF method achieves the highest precision when using 48-bit hash codes, demonstrating its effectiveness in generating precise retrieval results. Additionally, Figures 2b and 3b show consistently high precision at low recall, which is of great importance in precision-oriented retrieval tasks and has practical application in various systems.
In conclusion, our DSHFMDF method surpasses the compared methods across multiple evaluation aspects, highlighting its superiority in image retrieval tasks. To provide visual evidence of its effectiveness in removing irrelevant images, we present Figure 4, which showcases the retrieval accuracy for various image categories of the CIFAR-10 dataset using DSHFMDF with 48-bit binary codes. The figure includes the query images in the first column, while the subsequent columns display the images retrieved by DSHFMDF. This example further emphasizes the ability of our approach to accurately retrieve relevant images and demonstrates its practical utility.
Overall, our extensive experiments and detailed analysis validate the superiority of the proposed DSHFMDF method in generating high-quality hash codes, improving retrieval performance, and achieving accurate and precise image retrieval results.

Ablation Studies
(1) Backbone feature extractors: In our ablation studies, we explored different fundamental feature extractors, replacing VGG19 with VGG13 and VGG16; the performance of these models on the CIFAR-10 dataset is presented in Table 3. The table shows that deeper networks improve image retrieval performance, which led us to choose VGG19 as our primary feature extractor.

(2) Multi-level image representations for improved hash learning: We conducted a thorough investigation into the impact of different levels of image representation on the performance of our DSHFMDF method. Unlike many existing approaches that primarily extract semantic information from the penultimate layer of fully connected networks, we recognized the importance of structural information, which carries crucial semantic details for effective hash learning. To address this, we considered feature maps from various layers, ranging from low-level to high-level features. Table 4 presents the retrieval performance using different feature maps on CIFAR-10. Notably, features from 'fc1' alone achieved the highest single-scale average mAP of 69%, emphasizing the importance of high-level features in capturing semantic information, whereas features from 'conv3'-'conv5' alone yielded an average mAP of only 59.4%, reflecting the contribution of low-level structural details. Our proposed DSHFMDF, which combines features from all scales, outperformed both variants with an average mAP of 82.15%, demonstrating that fusing low-level and high-level features significantly enhances performance. In Figure 5, we display the Precision-Recall curves of DSHFMDF for the various scale features. DSHFMDF retains over 80% precision and nearly identical Precision-Recall curves at 12, 24, 36, and 48 hash bits, and achieves superior precision and recall at the same code length compared to the single-scale variants. The binary hash codes perform best when all feature scales are used. This shows that high-level features carry more information for hash code creation; low-level features contribute supplementary information but cannot entirely replace high-level ones. The information contained in each scale's features is essential, which further demonstrates how well DSHFMDF exploits features at all scales.

(3) Objective function: We conducted ablation studies on our objective function to assess the influence of the pairwise quantization loss and classification loss constraints, examining the impact of hash coding and classification on CIFAR-10. We based our experiments on the proposed DSHFMDF method, where β and γ are the key weights for J2 and J3. When β = 0, the model is referred to as DSHFMDF-J3, and when γ = 0, it is DSHFMDF-J2. As shown in Table 5, when neither β nor γ is zero, each component of the proposed loss contributes to hash code creation, yielding a 6.55% performance improvement. J2 and J3 give similar gains because they respectively reduce quantization errors and preserve semantics in the overall model; eliminating either one degrades performance.

Parameter Sensitivity Analysis
To examine how the hyperparameters β and γ affect retrieval performance, we conducted experiments spanning β and γ values from 0.0001 to 0.1, increasing tenfold, with 48-bit codes on the CIFAR-10 dataset. Figure 6 presents the mAP curves of DSHFMDF as β and γ vary. Initially, mAP improves as β and γ increase, and performance levels off after reaching its peak. Notably, our method maintains satisfactory performance over a wide range of β and γ values, demonstrating its robustness to these parameters.

Conclusions and Future Work
This paper presents an end-to-end approach called Deep Supervised Hashing by Fusing Multiscale Deep Features (DSHFMDF) for image retrieval. Our method generates robust binary codes by optimizing the similarity loss, quantization loss, and semantic loss. Moreover, the network leverages multiple hashing results based on multiscale features, enhancing retrieval recall while preserving precision. In summary, multiscale deep hashing offers a framework for efficiently incorporating structural information into hash representations, helping to balance structural information against image retrieval accuracy; the careful design of architectures, loss functions, and bit allocation strategies is crucial to achieving this balance for the specific requirements of an application. Through extensive experiments on two image retrieval datasets, we demonstrate the superiority of our method over other state-of-the-art hashing techniques. As a natural extension of this research, future work will explore the application of our framework to medical image datasets. Medical images often contain objects at various scales, many of them very small, and we believe our method will significantly enhance retrieval performance in that context.

Figure 1. The network architecture of the proposed deep hashing method for image retrieval. Our model architecture consists of five main components: (1) feature extraction; (2) feature reduction; (3) feature fusion; (4) hash coding; and (5) classification. Describing each component in detail clarifies how each contributes to the overall functionality of the model.

Figure 3. The results of comparing different approaches on the NUS-WIDE dataset using three evaluation metrics.

Figure 4. The top 20 retrieved results from the CIFAR-10 dataset using DSHFMDF with 48-bit hash codes. The query images are displayed in the first column, while the subsequent columns show the retrieval results obtained by DSHFMDF.

Table 1. Specifics of the feature extraction network. Note that we utilize the features from layers marked with '#'. For simplicity, the ReLU and Batch Normalization layers are omitted.

Table 3. Mean Average Precision (MAP) of Hamming ranking for different numbers of bits using VGG13, VGG16, and VGG19 as the fundamental feature extractors on the CIFAR-10 dataset. The best results are highlighted in bold.

Table 4. Mean Average Precision (MAP) of Hamming ranking for different scales and different numbers of bits on CIFAR-10.

Table 5. mAP values of different variants of our objective function on the CIFAR-10 dataset.