Deep Multi-Semantic Fusion-Based Cross-Modal Hashing

Abstract: Due to its low storage and search costs, cross-modal hashing has received much research interest in the big data era. Thanks to deep learning, cross-modal representation capabilities have risen markedly. However, existing deep hashing methods cannot consider multi-label semantic learning and cross-modal similarity learning simultaneously. As a result, potential semantic correlations among multimedia data are not fully excavated from multi-category labels, which also impairs the original similarity preservation of cross-modal hash codes. To this end, this paper proposes deep multi-semantic fusion-based cross-modal hashing (DMSFH), which uses two deep neural networks to extract cross-modal features and a multi-label semantic fusion method to improve cross-modal consistent semantic discrimination learning. Moreover, a graph regularization method is combined with inter-modal and intra-modal pairwise losses to preserve the nearest-neighbor relationships between data in Hamming subspace. Thus, DMSFH not only retains semantic similarity between multi-modal data, but also integrates multi-label information into modal learning. Extensive experimental results on two commonly used benchmark datasets show that DMSFH is competitive with state-of-the-art methods.


Introduction
In recent years, with the rapid development of information technology, massive amounts of multi-modal data (i.e., text [1], images [2], audio [3], video [4], and 3D models [5]) have been collected and stored on the Internet. How to utilize this extensive multi-modal data to improve cross-modal retrieval performance has attracted increasing attention [6,7]. Cross-modal retrieval, a hot issue in the multimedia community, is the use of queries from one modality to retrieve all semantically relevant instances from another modality [8][9][10]. In general, the structuring of data in different modalities is heterogeneous, but there are strong semantic correlations between these structures. Therefore, the main tasks of cross-modal retrieval are discovering how to narrow the semantic gap and exploring the common representations of multi-modal data, the former being the most challenging problem faced by researchers in this field [11][12][13][14].
Most of the existing cross-modal retrieval methods, including traditional statistical correlation analysis [15], graph regularization [16], and dictionary learning [17], learn a common subspace [18][19][20][21] for multi-modal samples, in which the semantic similarity between different modalities can be measured easily. For example, based on canonical correlation analysis (CCA) [22], several cross-modal retrieval methods [23][24][25] have been proposed to learn a common subspace in which the correlations between different modalities are easily measured. Besides, graph regularization has been applied in many studies [16,[26][27][28] to preserve the semantic similarity between cross-modal representations in the common subspace. The methods in [17,29,30] draw support from dictionary learning to learn consistent representations for multi-modal data. However, these methods usually have high computational costs and low retrieval efficiency [31]. To overcome these shortcomings, hashing-based cross-modal retrieval techniques are gradually replacing the traditional ones. A practical way to speed up similarity search is binary representation learning, referred to as hashing learning, which projects the high-dimensional feature representation of each modality to a compact hash code and maps similar instances to similar hash codes. In this paper, we focus on the cross-modal binary representation learning task, which can be applied to large-scale multimedia searches in the cloud [32][33][34].
Motivation. Although deep hashing algorithms have made remarkable progress in cross-modal retrieval, the semantic gap and heterogeneity gap between different modalities need to be further narrowed. On the one hand, most methods lack mining of ample semantic information from multiple category labels. That means these methods cannot completely retain multi-label semantic information during cross-modal representation learning. Taking [28] as an example, graph regularization is used to support intra-modal and inter-modal similarity learning, but the multi-label semantics are not mined fully during the cross-modal representation learning, which affects the semantic discrimination of hash codes. On the other hand, after the features learned from normal networks are quantized into binary representations, some semantic correlations may be lost in Hamming subspace. For instance, [59] studies the effective distance measurement of cross-modal binary representations in Hamming subspace. However, multi-label semantic learning is ignored, which leads to insufficient semantic discriminability of the hash codes. Therefore, to further improve the quality of cross-modal hash codes, two particularly important problems cannot be overlooked during hashing learning: (1) how to capture more semantically discriminative features, and (2) how to efficiently preserve cross-modal semantic similarity in a common Hamming subspace. In this work, we consider these two key issues simultaneously during cross-modal hashing learning to generate more semantically discriminative hash codes.
Our Method. To this end, we propose a novel end-to-end cross-modal hashing learning approach, named deep multi-semantic fusion-based cross-modal hashing (DMSFH for short), to efficiently capture multi-label semantics and generate high-quality cross-modal hash codes. Firstly, two deep neural networks are used to learn cross-modal representations. Then, intra-modal and inter-modal losses, built on a semantic similarity matrix, are utilized to preserve semantic similarity. To further capture the rich semantic information, a multi-label semantic fusion module follows the feature learning module; it fuses the multiple label semantics into cross-modal representations to preserve the semantic consistency across different modalities. In addition, we introduce a graph regularization method to preserve semantic similarity among cross-modal hash codes in Hamming subspace.
Contributions. The main contributions of this paper are summarized as follows:
• We propose a novel deep learning-based cross-modal hashing method, termed DMSFH, which integrates cross-modal feature learning, multi-label semantic fusion, and hash code learning into an end-to-end architecture.
• We combine the graph regularization method with inter-modal and intra-modal pairwise losses to enhance cross-modal similarity learning in Hamming subspace. Additionally, a multi-label semantic fusion module is developed to enhance cross-modal consistent semantics learning.
• Extensive experiments conducted on two well-known multimedia datasets demonstrate the outstanding performance of our method compared to other state-of-the-art cross-modal hashing methods.
Roadmap. The rest of this paper is organized as follows. The related work is summarized in Section 2. The problem definition and the details of the proposed method DMSFH are presented in Section 3. The experimental results and evaluations are reported in Section 4. We discuss the main contributions and characteristics of our research in Section 5. Finally, we conclude this paper in Section 6.

Related Work
According to the learning manner, the existing cross-modal hashing techniques fall into two categories: unsupervised approaches and supervised approaches. Due to the vigorous development of deep learning, cross-modal deep hashing approaches have also sprung up over the last decade. This section reviews the works related to our paper.
Unsupervised Methods. To learn a hash function, unsupervised hashing methods aim to mine unlabeled samples to discover the relationships between multi-modal data. One of the most typical techniques is collective matrix factorization hashing (CMFH) [60], which utilizes matrix decomposition to learn two view-specific hash functions, so that data from different modalities can be mapped into unified hash codes. The latent semantic sparse hashing (LSSH) method [35] uses sparse coding to find the salient structures of images and matrix factorization to learn the latent concepts from text. Then, the learned latent semantic features are mapped to a joint common subspace. Semantic topic multimodal hashing (STMH) [37] discovers clustering patterns of texts and factorizes the image matrix to acquire multiple semantic topics of texts and concepts of images; the learned multimodal semantic features are then mapped into a common subspace according to their correlations. Multi-modal graph regularized smooth matrix factorization hashing (MSFH) [61] utilizes a multi-modal graph regularization term, which includes an intra-modal similarity graph and an inter-modal similarity graph, to preserve the topology of the original instances. The latent structure discrete hashing factorization (LSDHF) [62] approach uses the Hadamard matrix to align all eigenvalues of the similarity matrix to generate a hash dictionary, and then straightforwardly distills the shared hash codes from the intrinsic structure of the modalities.
Supervised Methods. Supervised cross-modal hashing methods improve search performance by using supervised information, such as training data labels. Typical supervised approaches include cross-modal similarity sensitive hashing (CMSSH) [40], semantic preserving hashing for cross-view retrieval (SEPH) [41], semantic correlation maximization (SCM) [42], and discrete cross-modal hashing (DCH) [43]. CMSSH applies boosting techniques to preserve the intra-modal similarity. SEPH transforms the semantic similarity of training data into an affinity matrix by using labels as supervised information, and minimizes the Kullback-Leibler divergence to learn hash codes. SCM utilizes all the supervised information for training with linear-time complexity by avoiding explicitly computing the pairwise similarity matrix. DCH learns discriminative binary codes without relaxation, and label information is used to elevate the discriminability of binary codes through linear classifiers. Nevertheless, these cross-modal hashing methods are built on hand-crafted features [43,63], which makes it hard to explore the semantic relationships among multi-modal data and thus difficult to obtain satisfying retrieval results.
Deep Methods. In recent years, deep learning, as a powerful representation learning technique, has been widely used in cross-modal retrieval tasks. A number of methods integrating deep neural networks and cross-modal hashing have been developed. For example, deep cross-modal hashing (DCMH) [64] was the first to apply an end-to-end deep learning architecture to cross-modal hashing retrieval, utilizing the negative log likelihood loss to achieve great performance. Pairwise relationship-guided deep hashing (PRDH) [65] uses pairwise label constraints to supervise the similarity learning of inter-modal and intra-modal data. The correlation hashing network (CHN) [66] adapts a triplet loss measured by cosine distance to find the semantic relationships between pairwise instances. Cross-modal Hamming hashing (CMHH) [59] learns high-quality hash representations by significantly penalizing similar cross-modal pairs with Hamming distances larger than the Hamming radius threshold. The ranking-based deep cross-modal hashing approach (RDCMH) [49] integrates semantic ranking information into a deep cross-modal hashing model and jointly optimizes the compatible parameters of deep feature representations and hashing functions. In fusion-supervised deep cross-modal hashing (FDCH) [67], both pairwise similarity information and classification information are embedded in the hash model, which simultaneously preserves cross-modal similarity and reduces semantic inconsistency. Despite the above-mentioned benefits, most of these methods only use binary similarity to constrain the generation of hash codes for different instances. This causes low correlations between retrieval results and the inputs, as the semantic label information cannot be expressed adequately. Besides, most methods concentrate only on hash code learning, but ignore the deep mining of semantic features.
Thus, it is essential to keep sufficient semantic information in the modal structure and generate discriminative hash codes to enhance the cross-modal hashing learning.
To overcome the above challenges, this paper proposes a novel approach to excavate multi-label semantic information to improve the semantic discrimination of cross-modal hash codes. This approach not only uses the negative log likelihood loss, but also exploits multiple semantic labels' prediction losses based on cross entropy to enhance semantic information mining. Apart from this, we introduce graph regularization to preserve the semantic similarity of hash codes in Hamming subspace. Therefore, the proposed method is designed to generate high-quality hash codes that better reflect high-level cross-modal semantic correlations.

The Proposed Approach
In this section, we present our method DMSFH, including the model formulation and the learning algorithm. The framework of the proposed DMSFH is shown in Figure 1 and mainly consists of three parts. The first part is the feature learning module, in which multimedia samples are transformed into high-dimensional feature representations by corresponding deep neural networks. The second part is the multi-label semantic fusion module, which aims to embed rich multi-label semantic information into feature learning. The third part is the hashing learning module, which retains the semantic similarity of the cross-modal data in the hash codes using a carefully designed loss function. In the following, we introduce the problem definition first, and then discuss the DMSFH method in detail.

Figure 1. The framework of DMSFH: (1) a feature learning module in which cross-modal features are extracted by two deep neural networks; (2) a multi-label semantic information learning module that is realized by deep neural networks, which fuses rich semantic information from multiple labels to generate consistent semantic representations in label subspace; (3) a hash function module that is trained by inter-modal and intra-modal pairwise loss, quantization loss, and graph regularization loss to generate cross-modal hash codes.

Problem Definition
Without loss of generality, bold uppercase letters, such as W, represent matrices, and bold lowercase letters, such as w, represent vectors. Moreover, the ij-th element of W is denoted as W_ij, the i-th row of W is denoted as W_{i*}, and the j-th column of W is denoted as W_{*j}. W^T is the transpose of W. We use I for the identity matrix. tr(·) and ||·||_F denote the trace and the Frobenius norm of a matrix, respectively. sign(·) is the element-wise sign function, defined as sign(x) = +1 if x ≥ 0, and sign(x) = −1 otherwise. To facilitate easier reading, the frequently used mathematical notation is summarized in Table 1. This paper focuses on two common modalities: texts and images. Assume that a cross-modal training dataset consists of n instances, i.e., O = {o_1, o_2, . . . , o_n} with o_i = (v_i, t_i, L_i), where v_i and t_i are the i-th image and text, respectively. L_i = [L_i1, L_i2, . . . , L_ic] is the multi-label annotation assigned to o_i, where c is the number of categories. If o_i belongs to the j-th class, L_ij = 1; otherwise, L_ij = 0. In addition, a cross-modal similarity matrix S = {S^vt, S^vv, S^tt} is given. If image v_i and text t_j are similar, S_ij = 1; otherwise, S_ij = 0.
Given a set of training data O, the goal of cross-modal hashing is to learn two hashing functions, i.e., h^v(v) for the image modality and h^t(t) for the textual modality, each producing codes in {−1, +1}^K, where K is the length of the hash code. In addition, the hash codes should preserve the similarities in the similarity matrix S: if two instances are similar, the Hamming distance between their codes should be small, and large otherwise. To easily calculate the similarity between two binary codes b_i and b_j, we use the inner product ⟨b_i, b_j⟩ to measure the Hamming distance as follows: dis_H(b_i, b_j) = (1/2)(K − ⟨b_i, b_j⟩), where K is the length of the hash code.
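The inner-product identity above can be checked with a small sketch (illustrative code with helper names of our own choosing, not the authors' implementation):

```python
# Hamming distance between two {-1, +1} hash codes via the inner product:
# dis_H(b_i, b_j) = (K - <b_i, b_j>) / 2, where K is the code length.

def sign(x):
    """Binarize a real value: +1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1

def hamming_distance(b_i, b_j):
    K = len(b_i)
    inner = sum(a * b for a, b in zip(b_i, b_j))  # <b_i, b_j>
    return (K - inner) // 2

# Binarize two continuous feature vectors, then compare their codes.
b_i = [sign(x) for x in [0.3, -1.2, 0.7, -0.1]]   # -> [1, -1, 1, -1]
b_j = [sign(x) for x in [0.5, 1.1, -0.4, -0.9]]   # -> [1, 1, -1, -1]
print(hamming_distance(b_i, b_j))  # codes differ in 2 positions -> 2
```

Agreeing codes give distance 0, fully opposite codes give distance K, matching the usual bit-difference count.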

Feature Learning Networks
For cross-modal feature learning, deep neural networks are used to extract semantic features from each modality individually. Specifically, for the image modality, ResNet34 [46], a well-known deep convolutional network, is used to extract image features. The original ResNet was pre-trained on the ImageNet dataset and has achieved excellent results on image recognition tasks. We replaced its last layer with a layer of (k + c) hidden nodes, which is split into a hash layer and a label layer. The hash layer has k hidden nodes for generating binary representations; the label layer has c hidden nodes for generating predictive labels.
For the text modality, a deep model named TxtNet is used to generate textual feature representations; it is a three-layer network preceded by a multi-scale (MS) fusion model (T → MS → 4096 → 512 → k + c). The last layer of TxtNet is a fully-connected layer with (k + c) hidden nodes, which outputs deep textual features and prediction labels. The input of TxtNet is the Bag-of-Words (BoW) representation of each text sample. The BoW vector is very sparse, but the features extracted by the multi-scale fusion model are more abundant. Firstly, the BoW vectors are average-pooled at different scales; then, the semantic information is extracted by nonlinear mapping through a convolution operation and an activation function. Finally, the representations from different scales are fused to obtain richer semantic information. The MS fusion model contains 5 interpretation blocks. Each block contains a 1 × 1 convolutional layer and an average pooling layer. The filter sizes of the average pooling layers are set to 50 × 50, 30 × 30, 15 × 15, 10 × 10, and 5 × 5, respectively.
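A minimal sketch of the multi-scale pooling-and-fusion idea, assuming simple non-overlapping 1D average pooling and toy window sizes (the actual model uses 1 × 1 convolutions and the filter sizes listed above; all names and sizes here are illustrative):

```python
# Pool a sparse BoW vector at several scales, then concatenate the pooled
# views into one denser representation -- the core of the MS fusion idea.

def avg_pool(vec, window):
    """Non-overlapping average pooling over a 1D vector."""
    pooled = []
    for start in range(0, len(vec), window):
        chunk = vec[start:start + window]
        pooled.append(sum(chunk) / len(chunk))
    return pooled

def multi_scale_fuse(bow, windows=(8, 4, 2)):
    """Fuse pooled views at each scale by concatenation."""
    fused = []
    for w in windows:
        fused.extend(avg_pool(bow, w))
    return fused

bow = [0.0] * 16
bow[3] = 1.0                      # a single active word in a sparse BoW vector
features = multi_scale_fuse(bow)
print(len(features))              # 16/8 + 16/4 + 16/2 = 2 + 4 + 8 = 14
```

Coarse windows summarize broad co-occurrence structure while fine windows keep word-level detail, which is why fusing several scales yields richer features than the raw sparse vector.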

Hash Function Learning
In the network of the image modality, let f_v^1(v_{i*}; θ_v, θ_vh) ∈ R^{1×k} denote the learned image feature of the i-th sample v_i, where θ_v is all network parameters before the last layer of the deep neural network, and θ_vh is the network parameter of the hash layer. Furthermore, let f_v^2(v_{i*}; θ_v, θ_vl) ∈ R^{1×c} denote the output of the label layer for sample v_i, where θ_vl is the network parameter of the label layer. In the network of the text modality, let f_t^1(t_{i*}; θ_t, θ_th) ∈ R^{1×k} denote the learned text feature of the i-th sample t_i, where θ_t is all network parameters before the last layer of the deep neural network, and θ_th is the network parameter of the hash layer. Furthermore, let f_t^2(t_{i*}; θ_t, θ_tl) ∈ R^{1×c} denote the output of the label layer for sample t_i, where θ_tl is the network parameter of the label layer.
To capture the semantic consistency between different modalities, the inter-modal negative log likelihood function is used in our approach, which is formulated as:

L_1 = −Σ_{i,j=1}^{n} ( S_ij^{vt} φ_ij^{vt} − log(1 + e^{φ_ij^{vt}}) ),

where φ_ij^{vt} = (1/2) F_{i*} G_{j*}^T is the inner product of the image feature F_{i*} and the text feature G_{j*}. The likelihood function composed of the text feature F and image feature G is as follows:

p(S_ij^{vt} | F_{i*}, G_{j*}) = σ(φ_ij^{vt}) if S_ij^{vt} = 1, and 1 − σ(φ_ij^{vt}) otherwise,

where σ(x) = 1/(1 + e^{−x}) is the sigmoid function. To generate hash codes with rich semantic discrimination, two essential factors need to be considered: (1) the semantic similarity between different modalities should be preserved, and (2) the high-level semantics within each modality should be preserved, which can effectively raise the accuracy of cross-modal retrieval. To realize this strategy, we define the intra-modal pairwise loss as L_2 = L_2^v + L_2^t, where L_2^v is the intra-modal pairwise loss for image-to-image and L_2^t is the intra-modal pairwise loss for text-to-text, defined as:

L_2^v = −Σ_{i,j=1}^{n} ( S_ij^{vv} φ_ij^{vv} − log(1 + e^{φ_ij^{vv}}) ),
L_2^t = −Σ_{i,j=1}^{n} ( S_ij^{tt} φ_ij^{tt} − log(1 + e^{φ_ij^{tt}}) ),

where φ_ij^{vv} = (1/2) F_{i*} F_{j*}^T is the inner product of image data, and φ_ij^{tt} = (1/2) G_{i*} G_{j*}^T is the inner product of text data.
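This negative log likelihood can be computed in a numerically stable way. The sketch below assumes the DCMH-style form θ_ij = ½ F_i·G_j and evaluates log(1 + e^θ) with `np.logaddexp` to avoid overflow; the function name and toy data are ours, not the paper's code:

```python
import numpy as np

# Inter-modal pairwise loss sketch:
# L1 = sum_ij( log(1 + e^theta_ij) - S_ij * theta_ij ),  theta = 0.5 * F G^T.

def pairwise_nll(F, G, S):
    theta = 0.5 * F @ G.T                        # (n, n) pairwise scores
    # np.logaddexp(0, x) == log(1 + e^x), computed without overflow.
    return float(np.sum(np.logaddexp(0, theta) - S * theta))

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8))                  # toy image features
G = rng.standard_normal((4, 8))                  # toy text features
S = (rng.random((4, 4)) > 0.5).astype(float)     # toy similarity matrix
print(pairwise_nll(F, G, S) > 0.0)               # → True
```

Each term is log(1 + e^θ) − Sθ, which is log(1 + e^{−θ}) for similar pairs and log(1 + e^{θ}) for dissimilar ones, so the loss is always positive and shrinks as similar pairs gain large positive θ. The same function applies to the intra-modal terms by passing (F, F, S^vv) or (G, G, S^tt).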
Based on the negative log likelihood, the loss function can distinguish identical and completely dissimilar instances. However, to obtain more fine-grained hash features, we can extract higher-level semantic information by adding a label prediction layer, so that the network can learn hash features with deep semantics. The semantic label cross-entropy loss is:

L_3 = −Σ_{i=1}^{n} Σ_{j=1}^{c} L_ij ( log L̂_ij^v + log L̂_ij^t ),

where L_{i*} is the original semantic label information, and L̂_{i*}^v = f_v^2(v_{i*}; θ_v, θ_vl) and L̂_{i*}^t = f_t^2(t_{i*}; θ_t, θ_tl) represent the prediction labels of instance o_i in the image network and text network, respectively.
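As a sketch of such a label-prediction loss, the per-category sigmoid cross-entropy below is one common instantiation for multi-hot labels; the exact equation form is our assumption, not taken from the paper:

```python
import math

# Multi-label prediction loss sketch: per-category sigmoid cross-entropy
# between the label-layer logits and the multi-hot ground truth L_i.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def multilabel_ce(logits, labels):
    """Mean binary cross-entropy over the c label categories."""
    loss = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(logits)

labels = [1, 0, 1, 0]          # multi-hot ground truth over c = 4 categories
good = [4.0, -4.0, 4.0, -4.0]  # confident, correct logits
bad = [-4.0, 4.0, -4.0, 4.0]   # confident, wrong logits
print(multilabel_ce(good, labels) < multilabel_ce(bad, labels))  # → True
```

Minimizing this pushes each modality's label layer toward the shared multi-hot annotation, which is what injects multi-label semantics into the learned features.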
In order to enhance the correlation between the unified hash codes in Hamming subspace, we introduce graph regularization to establish the degree of correlation across the multi-modal dataset. We formulate a spectral graph learning loss from the label similarity matrix S as follows:

L_4 = (1/2) Σ_{i,j=1}^{n} S_ij^{vt} || b_i − b_j ||^2 = tr(B^T L B),

where S^{vt} is the similarity matrix and B = {b_i}_{i=1}^n represents the unified hash codes. We define the diagonal matrix D = diag(d_1, . . . , d_n) with d_i = Σ_j S_ij^{vt}, and L = D − S^{vt} is the graph Laplacian matrix.
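The equivalence between the Laplacian trace form tr(B^T L B) and the pairwise form ½ Σ S_ij ||b_i − b_j||² can be verified numerically; the sketch below uses a toy symmetric similarity matrix (all data here are illustrative):

```python
import numpy as np

# Graph-regularization sketch: with D = diag(row sums of S) and L = D - S,
# tr(B^T L B) = 0.5 * sum_ij S_ij * ||B_i - B_j||^2, so minimizing it pulls
# the codes of strongly similar pairs together.

def graph_loss(B, S):
    D = np.diag(S.sum(axis=1))
    L = D - S                      # graph Laplacian
    return float(np.trace(B.T @ L @ B))

rng = np.random.default_rng(1)
S = rng.random((5, 5))
S = (S + S.T) / 2                  # a symmetric toy similarity matrix
B = np.sign(rng.standard_normal((5, 8)))

pairwise = 0.5 * sum(S[i, j] * np.sum((B[i] - B[j]) ** 2)
                     for i in range(5) for j in range(5))
print(np.isclose(graph_loss(B, S), pairwise))  # the two forms agree
```

Because the pairwise form weights each squared code difference by S_ij, a large similarity entry forces the corresponding pair of hash codes close in Hamming subspace.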
We regard F and G as the continuous surrogates of the image-network hash code B^v and the text-network hash code B^t to reduce quantization loss. According to our empirical analysis, training works better if the same hash code is used for the different modalities of the same training instance, so we set B^v = B^t = B. Therefore, the quantization loss can be defined as:

L_5 = || B − F ||_F^2 + || B − G ||_F^2.

The overall objective function, combining the inter-modal pairwise loss L_1, the intra-modal pairwise loss L_2, the cross-entropy loss L_3 for the predicted labels, the graph regularization loss L_4, and the quantization loss L_5, is written as below:

min_{B, ϑ_v, ϑ_t} L = L_1 + L_2 + L_3 + γ L_4 + β L_5, (13)

where γ and β are hyper-parameters that control the weight of each part.
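A sketch of the quantization term, taking the unified codes as B = sign(F + G) purely for illustration (this particular choice of B, and the toy data, are our assumptions):

```python
import numpy as np

# Quantization loss sketch: L5 = ||B - F||_F^2 + ||B - G||_F^2 with a single
# unified binary code matrix B shared by both modalities (B^v = B^t = B).

def quantization_loss(B, F, G):
    return float(np.linalg.norm(B - F) ** 2 + np.linalg.norm(B - G) ** 2)

rng = np.random.default_rng(2)
F = rng.standard_normal((4, 16))   # continuous image-network outputs
G = rng.standard_normal((4, 16))   # continuous text-network outputs

B = np.sign(F + G)                 # codes aligned with both feature matrices
flipped = -B                       # the worst-case opposite codes
print(quantization_loss(B, F, G) < quantization_loss(flipped, F, G))
```

Expanding L_5 shows it equals a constant minus 2·tr(B^T(F + G)), so for fixed F and G the binary minimizer is exactly sign(F + G), which is why the flipped codes above score strictly worse.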

Optimization
The objective in Equation (13) can be solved iteratively by alternating optimization. We adopt the mini-batch stochastic gradient descent (SGD) method to learn the parameters ϑ_v = {θ_v, θ_vh, θ_vl} of the image network, the parameters ϑ_t = {θ_t, θ_th, θ_tl} of the text network, and B. Each time, we optimize one variable with the others fixed. The whole alternating learning algorithm for DMSFH is briefly outlined in Algorithm 1, and a detailed derivation is described in the following subsections.

Optimize ϑ v
When ϑ_t and B are fixed, we can learn the deep network parameters ϑ_v of the image modality by using SGD with back-propagation (BP). For the i-th image feature F_{i*}, we first calculate the gradient ∂L/∂F_{i*}. Then we can compute ∂L/∂θ_v, ∂L/∂θ_vh, and ∂L/∂θ_vl by utilizing the chain rule, based on which BP is used to update the parameters ϑ_v.

Optimize ϑ t
Similarly, when ϑ_v and B are fixed, we learn the network parameters ϑ_t of the text modality by using SGD and the BP algorithm. For the i-th text feature G_{i*}, we calculate the gradient ∂L/∂G_{i*}. Then we can compute ∂L/∂θ_t, ∂L/∂θ_th, and ∂L/∂θ_tl by utilizing the chain rule, based on which BP can be used to update the parameters ϑ_t.

Optimize B
When ϑ_v and ϑ_t are fixed, the objective in Equation (13) can be reformulated as a function of B alone, as shown in Equation (18). We compute the derivative of Equation (18) with respect to B and infer the closed-form update of B given in Equation (19), where γ and β are hyper-parameters, and I denotes the identity matrix.

The Optimization Algorithm
As shown in Algorithm 1, DMSFH's learning algorithm takes raw training data, including images, texts, and labels: O = {o_1, o_2, . . . , o_n}, with o_i = (v_i, t_i, L_i). Before training, the parameters ϑ_v and ϑ_t of the image network and text network were initialized, with mini-batch size N_v = N_t = 128, maximal number of epochs max_epoch = 500, and per-epoch iteration numbers iter_v = n/N_v and iter_t = n/N_t, where n is the total number of training instances. The training of each epoch consisted of three steps. Step 1: Randomly selecting N_v images from O and setting them as a mini-batch. For each datum in the mini-batch, we calculated F_{i*} = f_v^1(v_i; θ_v, θ_vh) and L_{i*}^v = f_v^2(v_i; θ_v, θ_vl) by forward propagation. After the gradient was calculated, the network parameters θ_v, θ_vh, and θ_vl were updated using SGD and back propagation.
Step 2: Randomly selecting N_t texts from O and setting them as a mini-batch. For each datum in the mini-batch, we calculated G_{i*} = f_t^1(t_i; θ_t, θ_th) and L_{i*}^t = f_t^2(t_i; θ_t, θ_tl) by forward propagation. After the gradient was calculated, the network parameters θ_t, θ_th, and θ_tl were updated using SGD and back propagation.
Step 3: Updating B by Equation (19). The above three steps were iterated repeatedly to alternate the training of the image hash network and the text hash network until the maximum number of epochs was reached.

Algorithm 1: The alternating learning algorithm for DMSFH.
Initialization: initialize parameters ϑ_v and ϑ_t, mini-batch size N_v = N_t = 128, the maximal number of epochs max_epoch = 500, and iteration numbers iter_v = n/N_v, iter_t = n/N_t.
repeat
  for iter = 1, 2, . . . , iter_v do
    Randomly sample N_v images from O to construct a mini-batch of images.
    For each instance v_i in the mini-batch, calculate F_{i*} = f_v^1(v_i; θ_v, θ_vh) and L_{i*}^v = f_v^2(v_i; θ_v, θ_vl) by forward propagation. Update F.
    Calculate the derivatives according to Equations (14) and (15).
    Update the network parameters θ_v, θ_vh, and θ_vl by applying backpropagation.
  end for
  for iter = 1, 2, . . . , iter_t do
    Randomly sample N_t texts from O to construct a mini-batch of texts.
    For each instance t_i in the mini-batch, calculate G_{i*} = f_t^1(t_i; θ_t, θ_th) and L_{i*}^t = f_t^2(t_i; θ_t, θ_tl) by forward propagation. Update G.
    Calculate the derivatives according to Equations (16) and (17).
    Update the network parameters θ_t, θ_th, and θ_tl by applying backpropagation.
  end for
  Update B using Equation (19).
until the max epoch number max_epoch is reached
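The alternating structure of Algorithm 1 can be sketched with stub linear "networks" standing in for ResNet34 and TxtNet. Only the control flow mirrors the algorithm; the toy SGD step minimizes a simple quantization-style objective, and all sizes, names, and learning rates are illustrative:

```python
import numpy as np

# Skeleton of the alternating optimization: update the image net, then the
# text net, then the unified codes B, repeating for max_epoch epochs.

rng = np.random.default_rng(3)
n, k, d_v, d_t = 64, 16, 32, 24
V = rng.standard_normal((n, d_v))            # toy image inputs
T = rng.standard_normal((n, d_t))            # toy text inputs
W_v = rng.standard_normal((d_v, k)) * 0.1    # stub image "network"
W_t = rng.standard_normal((d_t, k)) * 0.1    # stub text "network"
B = np.sign(rng.standard_normal((n, k)))     # unified binary codes
batch, max_epoch = 16, 3

for epoch in range(max_epoch):
    for _ in range(n // batch):              # Step 1: image-network updates
        idx = rng.choice(n, batch, replace=False)
        F = V[idx] @ W_v
        W_v -= 0.01 * V[idx].T @ (F - B[idx]) / batch  # SGD on 0.5||F - B||^2
    for _ in range(n // batch):              # Step 2: text-network updates
        idx = rng.choice(n, batch, replace=False)
        G = T[idx] @ W_t
        W_t -= 0.01 * T[idx].T @ (G - B[idx]) / batch  # SGD on 0.5||G - B||^2
    B = np.sign(V @ W_v + T @ W_t)           # Step 3: refresh unified codes

print(B.shape)  # → (64, 16)
```

Holding two of the three unknowns fixed at each step is what makes every sub-problem tractable: the network updates are plain SGD, and the B update has a closed binary form.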

Experiment
We conducted extensive experiments on two commonly used benchmark datasets, i.e., MIRFLICKR-25K [68] and NUS-WIDE [69], to evaluate the performance of our method, DMSFH. Firstly, we introduce the datasets, evaluation metrics, and implementation details, and then discuss performance comparisons between DMSFH and six state-of-the-art methods.

MIRFLICKR-25K:
The original MIRFLICKR-25K [68] dataset contains 25,000 image-text pairs, which were collected from the well-known photo-sharing website Flickr. Each of these images has several textual tags. We selected those instances that have at least 20 textual tags for our experiments. The textual tags of each selected instance were transformed into a 1386-dimensional BoW vector. In addition, each instance was manually annotated with at least one of 24 unique labels. We selected 20,015 instances for our experiments.

NUS-WIDE:
The NUS-WIDE [69] dataset is a large real-world Web image dataset comprising over 269,000 images with over 5000 user-provided tags and 81 concepts for the entire dataset. The text of each instance is represented as a 1000-dimensional BoW vector. In our experiment, we removed the instances without labels and selected the instances labeled with the 21 most frequent categories. This gave 190,421 image-text pairs. Table 2 presents the statistics of the above two datasets. Figure 2 shows some sample image-tag pairs from these two datasets.

Evaluation
Two widely used evaluation protocols, i.e., Hamming ranking and hash lookup, were utilized for cross-modal hash retrieval evaluation. Hamming ranking sorts the retrieved data in increasing order of their Hamming distance to the query. For Hamming ranking, mean average precision (MAP) is a commonly used metric for measuring the accuracy of the query results: the larger the MAP value, the better the retrieval performance. The topN precision curve reflects how precision changes with the number of retrieved instances. Hash lookup is likewise based on the Hamming distance between the query and the retrieved samples, but it only returns the samples within a specified Hamming radius as the final result. This can be measured by a precision-recall (PR) curve: the larger the area enclosed by the curve and the coordinate axes, the better the retrieval performance.
The value of MAP is defined as:

MAP = (1/|M|) Σ_{i=1}^{|M|} AP(q_i), (20)

where M is the query set and AP(q_i) is the average precision of query q_i. The average precision is calculated as shown in Equation (21):

AP(q) = (1/N) Σ_{r=1}^{R} p(r) d(r), (21)

where N is the number of relevant instances in the retrieved set, and R represents the total number of retrieved instances. p(r) denotes the precision of the top r retrieved instances, and d(r) = 1 if the r-th retrieved result is relevant to the query instance; otherwise, d(r) = 0. To comprehensively measure the retrieval performance, we utilize another important evaluation metric, the F-score, which jointly considers precision and recall:

F_β = (1 + β^2) · precision · recall / (β^2 · precision + recall).

If β = 1, this measurement is called the F1-score; precision and recall then have the same weight, i.e., they are equally important. In our experiments, we used the F1-score to evaluate the cross-modal retrieval performance.
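These metrics can be sketched directly from the definitions (helper names are ours; `average_precision` takes the 0/1 relevance flags of one ranked result list):

```python
# Evaluation-metric sketch: average precision over a ranked list, and the
# F-beta score, which reduces to the F1-score when beta = 1.

def average_precision(relevance):
    """relevance: list of 0/1 flags for the ranked retrieved items."""
    hits, precisions = 0, []
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / r)   # p(r) at each relevant position
    return sum(precisions) / max(hits, 1)

def f_score(precision, recall, beta=1.0):
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)

ranked = [1, 0, 1, 1, 0]                    # relevance of the top-5 results
print(round(average_precision(ranked), 4))  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.8056
print(f_score(0.8, 0.5))                    # 2PR / (P + R) ≈ 0.6154
```

MAP is then simply the mean of `average_precision` over all queries in the query set M.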

Baselines and Implementation Detail
Baselines. In this paper, the proposed DMSFH method is compared with several baselines, including SCM [42], SEPH [41], PRDH [65], CMHH [59], CHN [66], and DCMH [64]. SCM and SEPH use hand-crafted features, and the other approaches extract features through deep neural networks. Here is a brief introduction to these competitors:
• SCM integrates semantic labels into the hash learning process to conduct large-scale data modeling, which not only maintains the correlations between modalities, but also achieves good accuracy.
• SEPH transforms the semantic similarity of training data into an affinity matrix by using labels as supervised information, and minimizes the Kullback-Leibler divergence to learn hash codes.
• PRDH integrates two types of pairwise constraints, inter-modal and intra-modal, to enhance the similarities of the hash codes.
• CMHH learns high-quality hash representations by significantly penalizing similar cross-modal pairs with Hamming distances larger than the Hamming radius threshold.
• CHN is a hybrid deep architecture that jointly optimizes a cosine max-margin loss on semantically similar pairs and a quantization max-margin loss on compact hash codes.
• DCMH integrates feature and hash code learning into a general learning framework. The cross-modal similarities are preserved by using a negative log-likelihood loss.
Implementation Details. Our DMSFH approach was implemented with the PyTorch framework. All the experiments were performed on a workstation with an Intel(R) Xeon E5-2680_v3 2.5 GHz CPU, 128 GB RAM, 1 TB SSD and 3 TB HDD storage, and two NVIDIA GeForce RTX 2080Ti GPUs, running the Windows 10 64-bit operating system. We set max_epoch = 500; the learning rate was initialized to 10^−1.5 and gradually lowered to 10^−6 over the 500 epochs. We set the mini-batch size to 128, the iteration number of the outer loop in Algorithm 1 to 500, and the hyper-parameters γ = β = 1. Throughout the experiments, we use I → T to denote querying with an image to retrieve text, and T → I to denote querying with text to retrieve images.

Performance Comparisons
To evaluate the performance of the proposed method, we compare DMSFH with the six baselines in terms of MAP and PR curves on MIRFLICKR-25K and NUS-WIDE, respectively. Two query tasks, i.e., image-query-text and text-query-image, are considered. Tables 3 and 4 report the MAP scores, and Table 5 reports the F1-measure with a hash code length of 32 bits on the MIRFLICKR-25K dataset.
Hamming Ranking: Tables 3 and 4 report the MAP scores of the proposed method and its competitors for image-query-text and text-query-image on MIRFLICKR-25K and NUS-WIDE, where I → T and T → I represent text retrieval by image and image retrieval by text, respectively. It is clear from Tables 3 and 4 that the deep hashing methods perform better than the non-deep methods. Specifically, on MIRFLICKR-25K, we can see in Table 3 that DMSFH achieved the best scores in both directions, with T → I MAP reaching 63.89% (16 bits), 65.31% (32 bits), and 66.08% (64 bits). This superiority of DMSFH is due to the fact that it incorporates richer semantic information than the other techniques. In addition, DMSFH leverages graph regularization to measure the semantic correlation of the unified hash codes. That means it can capture more semantically consistent features between different modalities than other deep hashing models, such as CHN and DCMH. Therefore, the above results confirm that the hash codes generated by DMSFH have better semantic discrimination and can better adapt to the task of mutual retrieval of multi-modal data.

Hash Lookup: To further compare the proposed model with the baselines, we used PR curves to evaluate their retrieval performance. Figures 3-5 show the PR curves with different code lengths (16 bits, 32 bits, and 64 bits) on the MIRFLICKR-25K and NUS-WIDE datasets, respectively. As expected, the deep learning-based models performed better than the hand-crafted-feature-based models, mainly due to the powerful representation capabilities of deep neural networks. Moreover, regardless of the hash code length, our method clearly outperformed the other deep competitors on the PR curves. This is mainly because DMSFH has stronger cross-modal consistent semantic learning capabilities: it considers both the intra-modal and inter-modal semantic discriminative information and also integrates graph regularization into hash learning. In addition, we selected the best five methods and report their average precision, average recall, and average F1-measure with Hamming radius r = 0, 1, 2 in Table 5 on MIRFLICKR-25K for a code length of 32 bits. In all cases, our DMSFH achieved the best F1-measure.
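The hash-lookup metrics in Table 5 can be sketched as follows: retrieve every database item within Hamming radius r of the query code, then average precision and recall over queries and combine them into F1 (again assuming the shared-label relevance convention):

```python
import numpy as np

def hash_lookup_prf(query_codes, db_codes, query_labels, db_labels, radius=2):
    """Precision, recall, and F1 of hash lookup within a Hamming radius.

    Codes are 0/1 arrays of shape (n, bits); labels are multi-hot arrays.
    """
    precisions, recalls = [], []
    for q, ql in zip(query_codes, query_labels):
        dist = (q[None, :] != db_codes).sum(axis=1)
        retrieved = dist <= radius          # everything inside the ball
        relevant = (db_labels @ ql) > 0     # shares a label with the query
        hit = (retrieved & relevant).sum()
        precisions.append(hit / retrieved.sum() if retrieved.sum() else 0.0)
        recalls.append(hit / relevant.sum() if relevant.sum() else 0.0)
    p, r = float(np.mean(precisions)), float(np.mean(recalls))
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1
```

Smaller radii trade recall for precision, which is why the table reports r = 0, 1, 2 separately.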

Ablation Experiments of DMSFH
To verify the validity of the DMSFH components, we conducted ablation experiments on the MIRFLICKR-25K dataset; the results are shown in Table 6. We define DMSFH-P as the variant employing only the intra-modal and inter-modal pairwise losses, and DMSFH-S as the variant with the graph regularization loss removed. From Table 6, we can see that both the semantic prediction discriminant loss and the graph regularization loss employed by DMSFH effectively improve retrieval accuracy, and DMSFH obtains the best performance when all the designed modules are used.
Table 6. Ablation experiments of DMSFH on the MIRFLICKR-25K dataset. The best results are in bold font.
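The two ablated variants correspond to switching individual loss terms off in the overall objective. The sketch below is illustrative only: the term names and the exact weighting follow our reading of the text (pairwise losses always on; γ and β from the hyper-parameter setting), not the paper's precise formulation:

```python
def total_loss(pairwise, semantic, graph_reg, gamma=1.0, beta=1.0,
               use_semantic=True, use_graph=True):
    """Illustrative composition of the DMSFH objective.

    DMSFH-P : use_semantic=False, use_graph=False (pairwise losses only)
    DMSFH-S : use_graph=False (graph regularization removed)
    DMSFH   : both flags True (full model)
    """
    loss = pairwise                 # intra-modal + inter-modal pairwise loss
    if use_semantic:
        loss += gamma * semantic    # multi-label semantic prediction loss
    if use_graph:
        loss += beta * graph_reg    # graph regularization loss
    return loss
```

Running the same training loop with these flags toggled reproduces the ablation protocol described above.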

Discussion
This paper proposes deep multi-semantic fusion-based cross-modal hashing (DMSFH) for cross-modal retrieval tasks. Firstly, it preserves the semantic similarity between data through intra-modal loss and inter-modal loss, and then introduces a multi-label semantic fusion module to further capture more semantic discriminative features. In addition, semantic similarity in Hamming space is preserved by graph regularization loss.
We compared DMSFH with the other methods on the cross-modal multi-label datasets MIRFLICKR-25K and NUS-WIDE, which have 24 and 21 label categories, respectively. As Tables 3 and 4 show, the MAP scores of DMSFH are better than those of the other methods. Compared with DCMH and PRDH, two deep learning methods based on the same inter-modal and intra-modal pairwise loss, DMSFH performed better precisely because its additional losses capture more semantic information. Therefore, DMSFH is able to alleviate the semantic heterogeneity problem to a certain extent and improve accuracy. In addition, we measured the computational cost of the model in floating point operations (FLOPs); DMSFH requires approximately 3.67 billion FLOPs. Compared with real-valued cross-modal retrieval methods, the computational and retrieval cost of our method is quite low, owing to its short binary cross-modal representations (i.e., 64-bit hash codes) and Hamming distance measurements. Because they generate higher-dimensional feature representations (e.g., 1000-dimensional feature maps), real-valued cross-modal representation learning models always have higher complexity.
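The retrieval-cost argument can be made concrete: when a 64-bit hash code is packed into a single machine word, comparing two items is one XOR plus a popcount, versus roughly a thousand multiply-adds for a 1000-dimensional real-valued embedding. A minimal sketch:

```python
def hamming_distance(a, b):
    """Hamming distance between two hash codes stored as Python ints.

    For 64-bit codes this is one XOR plus a popcount per comparison,
    far cheaper than a dot product over a ~1000-dimensional float vector.
    """
    return bin(a ^ b).count("1")

def pack_bits(bits):
    """Pack a 0/1 bit sequence (MSB first) into an integer code."""
    code = 0
    for bit in bits:
        code = (code << 1) | bit
    return code
```

For example, `hamming_distance(pack_bits([1, 0, 1, 0]), pack_bits([0, 1, 1, 0]))` counts the two disagreeing bit positions.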
Although our study achieved some degree of performance improvement, there are limitations. First, when constructing the sample similarity matrix, our method does not fully exploit the fine-grained label information between data, so there is still room for improvement in fine-grained semantic information extraction. Second, our method focuses mainly on the construction and optimization of the loss function, but improving cross-modal semantic feature representation learning is also an important issue; deeper semantic mining in the feature learning stage is therefore a direction for our future research. Third, our method was tested on specific datasets, and common cross-modal hash retrieval methods use data of known categories; in practical applications, however, the rapid emergence of new, unlabeled items often degrades the accuracy of cross-modal retrieval. How to achieve high-precision cross-modal retrieval in the absence of annotation information is also an important research problem.

Conclusions
In this paper, we proposed an effective hashing approach dubbed deep multi-semantic fusion-based cross-modal hashing (DMSFH) to improve semantic discriminative feature learning and the similarity preservation of hash codes in a common Hamming subspace. This method learns an end-to-end framework that integrates feature learning and hash code learning. A multi-label semantic fusion method is used to realize cross-modal consistent semantic learning and enhance the semantic discriminability of hash codes. Moreover, we designed the loss function with graph regularization from inter-modal and intra-modal perspectives to enhance the similarity learning of hash codes in the Hamming subspace. Extensive experiments on two cross-modal datasets demonstrated that our proposed approach effectively improves cross-modal retrieval performance and is significantly superior to the other baselines.
In future work, we will consider the heterogeneous semantic correlations between multi-modal samples in both aspects of high-level semantics and fine-grained semantics, which can be formulated as heterogeneous information networks (HIN) to capture more semantic information and realize cross-modal semantic alignment in a more effective manner. In addition, how to measure the distance of the relation distribution of semantic details between different modalities will be studied. An essential problem will be enhancing the cross-modal semantic representation learning.

Conflicts of Interest:
The authors declare no conflict of interest.