Article

A Cross-Modal Hash Retrieval Method with Fused Triples

College of Electronic and Information Engineering, Liaoning University of Technology, Jinzhou 121001, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(18), 10524; https://doi.org/10.3390/app131810524
Submission received: 7 July 2023 / Revised: 14 September 2023 / Accepted: 20 September 2023 / Published: 21 September 2023

Abstract

Due to its fast retrieval speed and low storage cost, cross-modal hashing has become the primary method for cross-modal retrieval. Since the emergence of deep cross-modal hashing methods, cross-modal retrieval performance has improved significantly. However, existing cross-modal hash retrieval methods still fail to use the supervisory information in the dataset effectively and lack the ability to express similarity. As a result, the label information is not exploited to the fullest, and the latent semantic relationship between the two modalities cannot be fully explored, which affects the judgment of semantic similarity between them. To address these problems, this paper proposes Tri-CMH, a cross-modal hash retrieval method with fused triples, which is an end-to-end modeling framework consisting of two parts: feature extraction and hash learning. First, the multi-modal data are preprocessed into triples, and a data supervision matrix is constructed so that samples whose labels share the same meaning are aggregated together while samples whose labels have opposite meanings are separated. This avoids the under-utilization of the supervisory information in the data set and makes efficient use of the global supervisory information. Meanwhile, the loss function of the hash learning part is optimized by jointly considering the Hamming distance loss, the intra-modal loss, the cross-modal loss, and the quantization loss, which explicitly constrains semantically similar and semantically dissimilar hash codes and improves the model's ability to judge cross-modal semantic similarity. The method is trained and tested on the IAPR TC-12, MIRFLICKR-25K, and NUS-WIDE datasets, with mAP and PR curves as the experimental evaluation criteria, and the experimental results show the effectiveness and practicality of the method.

1. Introduction

With the constant updating and iteration of communication technologies, information transmission carriers are showing a diversified trend. Multimedia such as images, text, video, and audio are widely available as information carriers on the Internet. On many social networking sites such as Weibo and TikTok, these data, which include images, text, video, audio, and other forms, are known as multimodal data. With the rapid growth of multimodal data in practical applications, data storage and retrieval needs are becoming diverse. Cross-modal retrieval [1,2,3] is the mutual retrieval between different modalities, using data from one modality as query keywords to retrieve similar information from another modality. However, the different ways of representing data in different modalities lead to a semantic gap between them, so it is difficult to retrieve correct results in cross-modal retrieval tasks. Therefore, reducing the semantic gap between different modalities and mining the expression of semantic similarity between modalities is a challenging problem at this stage.
The hash algorithm is used in semantic similarity retrieval because of its fast retrieval speed and low storage cost. The hash algorithm maps data from the original space into the binary Hamming space while preserving the maximum similarity between the data. The binary coding representation reduces the storage cost of the computer, and computing Hamming distances increases the retrieval speed [2,4,5]. However, in hashing methods based on handcrafted features, feature learning and hash code learning are independent, which can lead to poor retrieval performance. With the development of deep learning, deep neural networks are widely used for feature learning. Compared with earlier hashing methods, deep cross-modal hashing models place modal features and hash functions in an end-to-end framework to learn together, which solves the incompatibility between manual features and hash code learning and makes the learned hash codes more efficient. Thus, the hashing algorithm is suitable for both single-modal and multimodal semantic similarity retrieval. There are two main classes of multimodal hashing algorithms, namely multi-source hashing [6] and cross-modal hashing [1,7,8], with cross-modal hashing methods performing relatively well in cross-modal retrieval tasks.
The key to the cross-modal retrieval approach is to construct a latent semantic space so that each point has definite semantic information reflecting the mapping relationship between the two modalities, allowing similarity calculations to be performed on the data in that space [9]. Most deep cross-modal hashing methods learn hash codes symmetrically, i.e., the hash codes of query instances and database instances are learned in the same way; the training process is very time-consuming, the supervisory information is difficult to use fully during training, and the generalization ability of the learned hash codes is not strong. Thus, the majority of current cross-modal retrieval methods have two issues. The first is that features are extracted directly from the data set without processing, and the second is that only the neural network's training speed is considered. As a result, the supervisory information in the data cannot be utilized effectively during training, and the label information is not exploited to its fullest extent. Another problem is that the private representation within the modalities is eliminated to facilitate training the network on different modalities, and only the common information shared between the data of different modalities is retained, which leads to an insufficient ability to express semantically similar information across modalities, an inability to effectively mine the semantic relationships between modalities, and weak generalization ability [10,11,12,13].
The problems mentioned above that need to be solved are the ineffective use of the supervisory information of the data set and the insufficient ability to express semantic similarity information. This paper first introduces a cross-modal hash retrieval method with fused triples (Tri-CMH) for cross-modal semantic similarity retrieval to solve the problem of ineffective use of supervised information. After that, the loss function in the hash learning process is optimized to give the model the best robustness, solving the problems of insufficient ability to express semantically similar information between different modalities and poor generalization ability [14]. The main contributions of this paper are as follows.
(1) An end-to-end supervised cross-modal hashing method is proposed. The data are selected through a triples approach to close the distance between positive sample pairs of different modalities while pushing the distance between negative sample pairs of the same modality farther apart. The embedded labels are also used to learn the respective similarity relations. The score matrix of each obtained similarity relation is used as the supervised information to train the hash function, allowing the global supervised information to be used effectively.
(2) The loss function is optimized to improve the model's training accuracy. This makes it possible to impose explicit constraints that group semantically related hash codes together and separate semantically unrelated ones, both within and across modalities. Thus, the training accuracy of the model is improved, and semantic information across modalities can be effectively retrieved.
(3) The method is verified on three general semantic similarity retrieval datasets, IAPR TC-12, MIRFLICKR-25K, and NUS-WIDE, and the results show that it obtains better retrieval results. Therefore, Tri-CMH is practical and effective.

2. Related Work

The early hash algorithms mainly performed semantic similarity retrieval on single-modal data. With the explosion of multimodal data in recent years, semantic similarity retrieval between multimodal data has received much attention. Researchers in this field have proposed many multimodal hashing methods to satisfy semantic similarity retrieval in large-scale data. The existing multimodal methods include multi-source hashing [6] and cross-modal hashing [1,7,8]. Multi-source hashing methods are severely limited in their application due to the difficulty of focusing on cross-modal global information. In contrast, cross-modal hashing methods generally require information from only one modality to retrieve semantically similar information from other modalities, so cross-modal hashing methods are more flexible and of greater interest than multi-source hashing methods. Two representative cross-modal hashing methods are unsupervised and supervised cross-modal hashing methods [2,15,16,17].
The unsupervised cross-modal hashing method involves direct feature extraction without labeling information while learning the binary hash code, intending to discover correlations between different modalities. Standard unsupervised cross-modal hashing methods include the CCA-ITQ approach [18] and the CMFH [19]. CCA-ITQ is a multivariate statistical method for measuring the similarity between two sets of variables. CMFH is a hash function for learning across views using collective matrix factorization. Unsupervised cross-modal hashing methods usually do not perform optimally due to their lack of label information and the effect of manual annotation.
Supervised cross-modal hashing methods are of greater interest as they include manual annotations. The hash function can be learned in conjunction with semantic tagging information during feature extraction, thereby reducing semantic differences between modalities to improve the performance of cross-modal retrieval [20]. Typical supervised cross-modal hashing methods include the discrete latent factor hashing (DLFH) cross-modal method [2], the semantic correlation maximization (SCM) method [7], the deep cross-modal hashing (DCMH) method [15], the generalized semantic preserving hashing for n-label cross-modal retrieval [21], the multimodal latent binary embedding (MLBE) method [22], the semantic preservation hashing (SePH) method [23], the cross-view hashing (CVH) method [24], the deep multi-semantic fusion-based cross-modal hashing (DMSFH) method [25], the deep visual-semantic hashing (DVSH) method [26] and the triplet-based deep hashing (TDH) method [27].

3. Description of Existing Cross-Modal Retrieval Methods

CCA-ITQ [18] is a canonical correlation analysis iterative quantization method proposed by Gong Y. et al. in 2012. It is an unsupervised cross-modal hash retrieval method based on multivariate statistical analysis that evaluates the similarity of two sets of variables.
CMFH [19] is the collective matrix factorization hashing method proposed by Ding G et al. in 2014. CMFH learns unified hash codes using collective matrix factorization with a latent factor model from the different modalities of one instance, which not only supports cross-view search but also increases the search accuracy by merging multiple view information sources. It belongs to the unsupervised cross-modal hash retrieval techniques.
SCM [7], a semantic correlation maximization method proposed by Zhang D et al. in 2014, uses label information to construct a matrix of similarity between different modalities, a classical approach in supervised cross-modal hash retrieval.
SePH [23], a semantics-preserving hashing method proposed by Lin Z et al. in 2015, uses label information to construct a supervised matrix and learns the hash function between the supervised matrix and the corresponding hash code using KL divergence to maintain the similarity between the learned hash code and the original data. This method is a classical approach in supervised cross-modal hash retrieval.
DVSH [26] was proposed in 2016 by Cao Y et al. It is a hybrid deep architecture comprising a visual-semantic fusion network for learning a joint embedding space of images and text sentences and two modality-specific hashing networks for learning hash functions that generate compact binary codes. It is a leading approach in the supervised cross-modal hash retrieval field.
DCMH [15], a deep cross-modal hashing approach proposed by Jiang QY et al. in 2017, employs deep learning methods to combine feature extraction and hash learning into an end-to-end network architecture. It is a leading approach in the supervised cross-modal hash retrieval field, and its emergence paved the way for deep cross-modal hash retrieval.
Deng C et al. proposed TDH [27] in 2018, a triplet-based deep hashing network for cross-modal retrieval. It uses triplet labels, which describe the relative relationships among three instances, as supervision to capture more general semantic correlations between cross-modal instances, and introduces graph regularization to preserve the original semantic similarity between hash codes in Hamming space. It is a leading approach in the supervised cross-modal hash retrieval field.
DLFH [2], proposed in 2019 by Jiang QY et al., is a discrete latent factor cross-modal hashing approach that can directly learn binary hash codes, making it a discrete method. This method is effective in the field of supervised cross-modal hash retrieval.
DMSFH [25], proposed in 2022 by Zhu X et al., is a deep multi-semantic fusion-based cross-modal hashing method that uses multi-label semantic fusion to improve cross-modal consistent semantic discrimination learning. Moreover, a graph regularization method combines inter-modal and intra-modal pairwise losses to preserve the nearest-neighbor relationships between data in the Hamming subspace. It is a leading approach in the supervised cross-modal hash retrieval field.

4. A Cross-Modal Hash Retrieval Method with Fused Triples

4.1. Model Framework

This paper proposes a new cross-modal hash retrieval method with fused triples, called Tri-CMH. The overall structure is shown in Figure 1. The method mainly consists of two parts: feature extraction and hash learning.
In the feature extraction part, the data set with supervised information is preprocessed by a triple data selection method. Then, two deep neural network models are used to extract features from the image and text modalities, effectively using the global supervised information while better extracting the feature information of each modality. The triple data selection can form a supervised matrix by learning the feature space to complete intra-modal and inter-modal sample matching: similar samples are pulled close together, and dissimilar samples are pushed further apart [13]. It effectively uses the supervised information and improves the network's cross-modal semantic similarity retrieval ability. The triple in the triple data selection method refers to the anchor, positive, and negative samples, where the anchor serves as the baseline for the positive and negative samples. The triple approach determines whether images and text with the same semantic information belong to the same class by learning features in the feature space. Within the same category, the distance between the anchor and the positive is smaller; across different categories, the distance between the anchor and the negative is larger. Therefore, triple training not only simplifies the training process of the network but also allows the training process to be controlled so that the global supervisory information can be used.
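To make the triple data selection concrete, the following is a minimal sketch for an image anchor, assuming multi-hot label vectors and treating two items as similar when they share at least one category; the function name `sample_triples`, the anchoring on the image modality, and the random sampling strategy are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def sample_triples(image_labels, text_labels, num_triples, seed=0):
    """Illustrative label-driven triple selection for an image anchor:
    positive = a text from the other modality sharing at least one label,
    negative = an image from the same modality sharing no label."""
    rng = np.random.default_rng(seed)
    n = image_labels.shape[0]
    # Cross-modal similarity: image i and text j share at least one label.
    S_xy = (image_labels @ text_labels.T) > 0
    # Intra-modal similarity: image i and image j share at least one label.
    S_xx = (image_labels @ image_labels.T) > 0
    triples = []
    while len(triples) < num_triples:
        a = rng.integers(n)                    # anchor image index
        pos = np.flatnonzero(S_xy[a])          # candidate positive texts
        neg = np.flatnonzero(~S_xx[a])         # candidate negative images
        if pos.size == 0 or neg.size == 0:
            continue
        triples.append((a, rng.choice(pos), rng.choice(neg)))
    return triples
```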
The hash learning part includes the Hamming distance loss, the intra-modal loss, the cross-modal triple loss, and the quantization loss. The cross-modal triple loss can reduce the distance between positive samples of different modalities for cross-modal sample pairs with the same semantic information, enabling the network model to learn more comparable cross-modal features and enhancing the retrieval accuracy. The four loss functions are realized in the feature space and the hash space, respectively, taking advantage of their respective strengths, alleviating the differences in cross-modal similarity retrieval, and compensating for the lack of semantic similarity expression capability.

4.2. Feature Extraction

In the feature extraction part, the data set is selected using the triple data selection method. The label information is used to close the distance between similar samples while pushing dissimilar samples farther apart, filtering out information that is similar to the semantics of the label. Afterward, features are extracted from the image modality using a convolutional neural network (CNN) model [28] and from the text modality using a multi-layer multivariate long short-term memory network model (Multi-LSTM) [29].
Following two excellent deep cross-modal hashing algorithms [2,15], for the feature extraction of the image modality we chose an eight-layer CNN model, including five convolutional layers and three fully connected layers. Figure 1 illustrates the feature extraction process of the CNN for the image modality. The model's convolutional layers are identified as "conv1-conv5" and its fully connected layers as "full6-full8"; Table 1 depicts its structure.
In Table 1, "f. num × size × size" indicates the number and size of the convolution kernels, "s" denotes the stride of the convolution, "pad" shows the number of pixels of padding, "LRN" denotes local response normalization to increase generalization performance, and "pool" indicates the pooling operation. The value in a fully connected layer indicates the number of nodes, which is also the output dimension of that layer.
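As a concrete reading of Table 1, the following PyTorch sketch assembles the eight-layer image network; the layer hyperparameters follow the table, while the ReLU activations, the use of LazyLinear for input-size inference, and the class name ImageHashNet are assumptions made for illustration.

```python
import torch.nn as nn

class ImageHashNet(nn.Module):
    """Sketch of the eight-layer CNN in Table 1: five convolutional layers
    (conv1-conv5) and three fully connected layers (full6-full8); the last
    layer outputs a c-dimensional code (hash code length)."""
    def __init__(self, code_length=32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=0),   # conv1
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),                            # LRN
            nn.MaxPool2d(kernel_size=2),                             # x2 pool
            nn.Conv2d(64, 256, kernel_size=5, stride=1, padding=2),  # conv2
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(size=5),
            nn.MaxPool2d(kernel_size=2),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), # conv3
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), # conv4
            nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1), # conv5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True),              # full6
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),            # full7
            nn.Linear(4096, code_length),                            # full8
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```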
For text-modal feature extraction, the text data are vectorized using a bag-of-words model, and feature extraction is then performed using Multi-LSTM. Three LSTM layers, one fully connected layer, and one one-dimensional convolutional layer comprise the multi-layer multivariate long short-term memory network model, known as Multi-LSTM. Each LSTM layer has an input gate, a forget gate, an output gate, and a memory unit. The forget gate discards useless information, the input gate controls the memory unit to store helpful information, and the output gate controls the output of useful information. Multi-LSTM combines the advantages of each layer to avoid the gradient vanishing and explosion problems to a certain extent and to obtain the semantic feature information of the text modality. Figure 1 illustrates the feature extraction process of Multi-LSTM for the text modality. Using this network model for text feature extraction is superior to using simple network models such as a single LSTM. The Multi-LSTM model used in this paper has three LSTM layers, and the best training results and smallest errors are achieved when the numbers of neurons in the hidden layers are 64, 256, and 64.
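The text branch can be pictured with the following PyTorch sketch, assuming the bag-of-words vector is projected into a short pseudo-sequence before the recurrent layers; the hidden sizes 64/256/64 come from the text above, while the sequence length, the placement of the Conv1d layer, and the class name TextHashNet are illustrative assumptions.

```python
import torch.nn as nn

class TextHashNet(nn.Module):
    """Sketch of the Multi-LSTM text branch: one 1-D convolutional layer,
    three stacked LSTM layers with 64/256/64 hidden units, and one fully
    connected layer producing a c-dimensional code. The exact ordering of
    the Conv1d layer relative to the LSTMs is an assumption."""
    def __init__(self, vocab_size, code_length=32, seq_len=32):
        super().__init__()
        self.seq_len = seq_len
        # Project the bag-of-words vector into a pseudo-sequence of seq_len steps.
        self.embed = nn.Linear(vocab_size, seq_len * 64)
        self.conv1d = nn.Conv1d(64, 64, kernel_size=3, padding=1)
        self.lstm1 = nn.LSTM(64, 64, batch_first=True)
        self.lstm2 = nn.LSTM(64, 256, batch_first=True)
        self.lstm3 = nn.LSTM(256, 64, batch_first=True)
        self.fc = nn.Linear(64, code_length)

    def forward(self, bow):                      # bow: (batch, vocab_size)
        x = self.embed(bow).view(-1, self.seq_len, 64)
        x = self.conv1d(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm1(x)
        x, _ = self.lstm2(x)
        x, _ = self.lstm3(x)
        return self.fc(x[:, -1])                 # last time step -> code
```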

4.3. Hash Learning

In this paper, we estimate inter- and intra-modal relationships throughout the training process, thus mitigating the differences in cross-modal similarity retrieval and enhancing the learning ability of the hash functions in the network. To achieve this, the Hamming distance loss $L_H$, the intra-modal loss $L_I$, the cross-modal triple loss $L_T$, and the quantization loss $L_S$ involved in the training process must be optimized.
When the loss function is minimized, the optimal result of the network model training is obtained, and thus the performance is optimal. According to the literature [2,7], the larger the likelihood function, the smaller the loss; equivalently, minimizing the negative log-likelihood minimizes the loss. From this, the objective function of Tri-CMH can be obtained as follows:
$$\min_{B^{(x)}, B^{(y)}, \theta_x, \theta_y} L = L_H + \lambda L_I + \beta L_T + \gamma L_S$$
where $\lambda$, $\beta$, and $\gamma$ are hyperparameters indicating the tuning scale factors.
Hyperparameter optimization allows the learning algorithm to select a set of optimal hyperparameters, usually with the aim of optimizing a measure of the algorithm's performance on the dataset. Therefore, when training the model, the hyperparameters must be optimized so that a set of optimal hyperparameters is selected for the learner to improve the performance and effect of learning, which requires the training process to be well regulated and the input parameters to be optimized.
In the experiments with this method, all candidate hyperparameters were tested using the traversal method, and the parameter values with the best effect were finally determined. We set the hyperparameters $\lambda = 100$, $\beta = 150$, $\gamma = 50$, and $\eta_1 = \eta_2 = 1$, fixed the input size of images and text at 128, fixed the number of iterations at 500, set the learning rates of the image network and the text network to range from $10^{-6}$ to $10^{-1}$, and averaged the data from three experimental runs before applying them to the algorithm.
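The traversal over candidate hyperparameter values can be sketched as a plain grid search; the candidate grids and the `validate` placeholder below are illustrative assumptions, and only the finally selected values ($\lambda = 100$, $\beta = 150$, $\gamma = 50$) come from the paper.

```python
import itertools

# Illustrative grid search ("traversal") over the loss weights.
# The candidate grids are assumptions; the values selected in the paper
# are lam = 100, beta = 150, gamma = 50.
grid = {
    "lam":   [10, 50, 100, 150],
    "beta":  [50, 100, 150, 200],
    "gamma": [10, 50, 100],
}

def validate(lam, beta, gamma):
    """Placeholder: train Tri-CMH with these weights and return validation mAP."""
    return 0.0  # plug in the actual training/validation loop here

best, best_map = None, -1.0
for lam, beta, gamma in itertools.product(*grid.values()):
    score = validate(lam, beta, gamma)
    if score > best_map:
        best, best_map = (lam, beta, gamma), score
print("best (lambda, beta, gamma):", best, "mAP:", best_map)
```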

4.3.1. Hamming Distance Loss

Assume there are $n$ sample pairs in the data set, each with features in both the image and text modalities. In this paper, we use $X = \{x_i\}_{i=1}^{n}$ to denote the modal image features and $Y = \{y_j\}_{j=1}^{n}$ the modal text features, where $x_i$ is the raw pixels of image $i$ and $y_j$ is the text information associated with image $i$. In addition, we define the cross-modal similarity matrix $S$. If image $x_i$ is similar to text $y_j$, then $S_{ij} = 1$; otherwise, $S_{ij} = 0$. The similarity is defined based on category labels with similar semantic information: if image $x_i$ and text $y_j$ have the same category label, then $x_i$ and $y_j$ are considered similar, and vice versa.
The cross-modal hash retrieval approach focuses on learning the hash functions, denoting the image-modality hash function as $h^{(x)}(x) \in \{-1, +1\}^c$ and the text-modality hash function as $h^{(y)}(y) \in \{-1, +1\}^c$, where $c$ is the length of the binary hash code. Mapping the feature vectors of the two modalities into a Hamming space ensures that the hash functions of the two modalities remain consistent with the semantic similarity $S$ while maintaining semantic similarity within and across modalities. If $S_{ij} = 1$, the Hamming distance between the binary hash codes $B^{(x)} = h^{(x)}(x_i)$ and $B^{(y)} = h^{(y)}(y_j)$ should be very small. Conversely, if $S_{ij} = 0$, the Hamming distance between the two binary codes should be large. $G = \{g_{y_i} \mid i = 1, 2, \dots, N\} \in \mathbb{R}$ denotes the hash representation learned in the image modality, and $F = \{f_{x_i} \mid i = 1, 2, \dots, N\} \in \mathbb{R}$ denotes the representation learned in the text modality. $G^{T}$ denotes the transpose of $G$, $\|\cdot\|_F^2$ denotes the squared Frobenius norm, and $\mathrm{sign}(\cdot)$ denotes the sign function defined as:
$$\mathrm{sign}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & x < 0 \end{cases}$$
Given the hash code pair $\{F_{i*}, G_{j*}\}$, define the logistic function $\sigma(\Theta_{ij}) = 1 / (1 + e^{-\Theta_{ij}})$ with $\Theta_{ij} = \frac{\delta}{c} F_{i*} G_{j*}^{T}$, where $\delta > 0$ is a hyperparameter indicating the tuning scale factor. Define the likelihood of the cross-modal similarity $S$ as
$$p(S \mid F, G) = \prod_{i,j=1}^{n} p(S_{ij} \mid F, G)$$
where $p(S_{ij} \mid F, G) = \begin{cases} \sigma(\Theta_{ij}), & S_{ij} = 1 \\ 1 - \sigma(\Theta_{ij}), & S_{ij} \ne 1 \end{cases}$.
Then, the negative log-likelihood of $F$ and $G$ gives the Hamming distance loss:
$$L_H = -\log p(S \mid F, G) = -\sum_{i,j=1}^{n} \left( S_{ij} \Theta_{ij} - \log\left(1 + e^{\Theta_{ij}}\right) \right)$$
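A minimal sketch of this negative log-likelihood term is given below, assuming `F_feat` and `G_feat` are mini-batch outputs of the two networks and `S` is the corresponding similarity sub-matrix; using softplus for $\log(1 + e^{x})$ is an implementation choice for numerical stability.

```python
import torch
import torch.nn.functional as nnf

def hamming_distance_loss(F_feat, G_feat, S, delta=1.0):
    """Negative log-likelihood loss L_H.

    F_feat: (n, c) real-valued codes from one modality,
    G_feat: (n, c) real-valued codes from the other modality,
    S:      (n, n) cross-modal similarity matrix with entries in {0, 1}.
    """
    c = F_feat.size(1)
    theta = (delta / c) * F_feat @ G_feat.t()        # Theta_ij = (delta/c) F_i* G_j*^T
    # L_H = -sum_ij ( S_ij * Theta_ij - log(1 + exp(Theta_ij)) )
    return -(S * theta - nnf.softplus(theta)).sum()
```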

4.3.2. Single-Mode Internal Loss

In the method of this paper, both the semantic similarity between the two modalities and the semantic similarity relationship within each modality are considered in order to fully exploit the semantic information of the data. The data set is processed using the triple data selection method, so the individual labels are closely related to the internal data semantics. The similarity of the corresponding modality's internal data samples to a given label is then judged based on the label information to generate sample data with a meaning similar to the label. The loss between the label matrix and the intra-modal data feature matrix can then be taken as:
$$L_I = \left\| L_x - L^{(x)} \right\|_F^2 + \left\| L_y - L^{(y)} \right\|_F^2$$
where $L^{(x)}$ and $L^{(y)}$ are the data sample matrices for the image and text modalities, and $L_x$ and $L_y$ are the label matrices for the image and text modalities.
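A sketch of $L_I$ under the assumption that the label matrices and the modality feature matrices have already been mapped to the same shape; the argument names are illustrative.

```python
import torch

def intra_modal_loss(Lx, feat_x, Ly, feat_y):
    """L_I: squared Frobenius distance between each modality's label matrix
    (Lx, Ly) and its data sample/feature matrix (feat_x, feat_y), assuming
    both have been projected to the same dimensionality."""
    return (torch.norm(Lx - feat_x, p="fro") ** 2
            + torch.norm(Ly - feat_y, p="fro") ** 2)
```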

4.3.3. Cross-Modal Losses

In this paper’s approach, a batch of network training contains many data samples with different semantic information. Here, the algorithm uses a triple loss function. According to the mathematical principle of the loss function, the triple loss function can minimize the distance between the target sample and similar samples and maximize the distance between the target sample and different samples.
For each sample, a positive sample from the other modality and a negative sample from the same modality are selected to form the cross-modal triple. In other words, positive samples from the text modality and negative samples from the image modality are chosen to form a cross-modal triple with the features of the image modality, while positive samples from the image modality and negative samples from the text modality are chosen to form a cross-modal triple with the features of the text modality. The cross-modal triple loss proposed in this paper is
$$L_T = \sum_{i,j,k} \max\left\{ 0, \ \varepsilon + H_i^T \hat{H}_j^I - H_i^T \hat{H}_k^T \right\} + \sum_{i,j,k} \max\left\{ 0, \ \varepsilon + H_i^I \hat{H}_j^T - H_i^I \hat{H}_k^I \right\}$$
where $\varepsilon$ denotes an empirical distance threshold. $H_i^T$ denotes the modal text features, $\hat{H}_j^I$ denotes positive samples of the image modality, and $\hat{H}_k^T$ denotes negative samples of the text modality; then, $H_i^T$, $\hat{H}_j^I$, and $\hat{H}_k^T$ form a cross-modal triple. $H_i^I$ denotes the modal image features, $\hat{H}_j^T$ denotes positive samples of the text modality, and $\hat{H}_k^I$ denotes negative samples of the image modality; then, $H_i^I$, $\hat{H}_j^T$, and $\hat{H}_k^I$ form a cross-modal triple. Therefore, the cross-modal triple loss $L_T$ achieves the goal of pulling positive sample pairs of different modalities closer while pushing negative sample pairs of the same modality farther apart, so that the retrieval results are optimal.
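The behaviour described here can be sketched with a standard margin-based triplet term over feature distances; note that this distance form is an interpretation of the pull-close/push-apart description rather than a verbatim implementation of the equation above, and the margin value is illustrative.

```python
import torch

def cross_modal_triple_loss(anchor, positive, negative, margin=0.5):
    """Margin-based triple loss: pull the anchor toward the positive sample
    from the other modality and push it away from the negative sample of the
    same modality. anchor, positive, negative: (n, c) feature batches."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)   # distance to cross-modal positive
    d_neg = (anchor - negative).pow(2).sum(dim=1)   # distance to intra-modal negative
    return torch.clamp(margin + d_pos - d_neg, min=0).sum()

# L_T sums both directions: text anchors with image positives and text
# negatives, plus image anchors with text positives and image negatives.
```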

4.3.4. Quantifying Losses

In the method of this paper, the features of the two modalities obtained in the feature extraction part are represented as vector codes. In the hash learning part, to achieve cross-modal hash retrieval, the vector encoding needs to be converted to binary hash codes to reduce storage space and achieve fast retrieval between modalities.
To ensure the vector encoding is as consistent as possible with the binary hash code, the quantization error needs to be kept to a minimum. The hash representations $G$ and $F$ are continuous real values generated by the image and text networks. They must be transformed into binary hash codes $B^{(x)}$ and $B^{(y)}$ during computation. At the same time, $B^{(x)}$ and $B^{(y)}$ should maintain the cross-modal similarity in $S$. The supervised information and hash code learning are integrated into a unified learning framework, effectively preserving the similarity information in the original data. Therefore, the quantization error loss of the method can be taken as:
$$L_S = \eta_1 \left( \left\| B^{(x)} - F \right\|_F^2 + \left\| B^{(y)} - G \right\|_F^2 \right) + \eta_2 \left( \left\| F \right\|_F^2 + \left\| G \right\|_F^2 \right)$$
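A minimal sketch of the quantization loss, following the pairing written in the equation above; generating the binary codes with sign() inside the function is a simplification, since in practice the binary codes are often kept as a separate optimization variable.

```python
import torch

def quantization_loss(F_feat, G_feat, eta1=1.0, eta2=1.0):
    """L_S: keep the real-valued codes F and G close to their binarized
    versions B = sign(.), plus a regularization term on F and G."""
    Bx = torch.sign(F_feat)                  # binary codes paired with F in the equation
    By = torch.sign(G_feat)                  # binary codes paired with G in the equation
    quant = (Bx - F_feat).pow(2).sum() + (By - G_feat).pow(2).sum()
    reg = F_feat.pow(2).sum() + G_feat.pow(2).sum()
    return eta1 * quant + eta2 * reg
```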

5. Experiment

This experiment was completed on a computer configured as follows: a Supermicro Intel C621A chipset motherboard, an Intel Xeon Gold 6330 CPU (28 cores/56 threads, 2.00–3.10 GHz), an RTX 3090 24 GB Turbo graphics card, 256 GB of RAM, and a Samsung 980 Pro 2 TB solid-state drive.
In this paper, eight advanced cross-modal retrieval methods were selected for comparison with Tri-CMH, namely CMFH [19], CCA-ITQ [18], SCM [7], SePH [23], DCMH [15], TDH [27], DLFH [2], and DMSFH [25]. Among them, the first four algorithms are based on shallow frameworks, and the last four are based on deep learning.

5.1. Experimental Programme

We experimented with the proposed scheme on three publicly available authoritative retrieval datasets, and the experiments validated the practicality and effectiveness of the method in this paper. The three datasets selected are the IAPR-TC12 [30] data set, the MIRFLICKR-25K [31] data set, and the NUS-WIDE [32] data set.
The IAPR TC-12 [30] data set, consisting of 20,000 images, each with textual information corresponding to it, was annotated using 255 labels. Through data processing, 17,652 images from 168 semantic tags were ultimately retained.
The MIRFLICKR-25K [31] data set consists of 25,000 images, each with textual information corresponding to it, with all samples annotated by one or more of the 24 semantic tags. Through data processing, 20,485 images from 18 semantic tags were finally retained.
The NUS-WIDE [32] data set consists of 269,648 images, each with corresponding textual information, and all samples were annotated with one or more of the 81 semantic labels. In this paper, experiments were conducted using 195,834 sample pairs of the 21 most common concepts, which is a vast data set compared to the two large datasets above.
For the IAPR TC-12 and MIRFLICKR-25K datasets, 3000 image text pairs were randomly selected as the test set, 10,000 image text pairs were selected from the rest of the data as the training set, and the remaining image text pairs were used as the cross-modal retrieval set. For the NUS-WIDE dataset, 10,000 picture text pairs were chosen at random as the test set, 30,000 image text pairs as the training set, and the remaining image text pairings were utilized as the cross-modal retrieval set.
To validate the performance of the cross-modal hash retrieval method incorporating triples in cross-modal retrieval tasks, we use Hamming ranking and hash lookup, which are widely used in retrieval, as evaluation criteria. A widely used metric for Hamming ranking is mean average precision (mAP) [33], where a higher mAP value indicates a higher average precision of the retrieved results. The precision-recall curve (PR curve) [34] is a commonly used metric to evaluate the precision of hash lookups, with recall as the independent variable and precision as the dependent variable, indicating the trend of retrieval precision with recall; generally, the higher the recall, the lower the corresponding precision.
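As a reference for how the Hamming-ranking mAP is computed, the following NumPy sketch assumes {-1, +1} binary codes and a ground-truth relevance matrix built from shared labels; the function name and the optional top_k truncation are illustrative.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, sim, top_k=None):
    """mAP for Hamming ranking. query_codes, db_codes: {-1, +1} binary codes;
    sim[i, j] = 1 if query i and database item j share a label."""
    n_bits = query_codes.shape[1]
    aps = []
    for i in range(query_codes.shape[0]):
        # Hamming distance via inner product of {-1, +1} codes.
        dist = 0.5 * (n_bits - query_codes[i] @ db_codes.T)
        order = np.argsort(dist)
        if top_k is not None:
            order = order[:top_k]
        rel = sim[i, order]                    # relevance of ranked results
        if rel.sum() == 0:
            continue
        hits = np.cumsum(rel)
        precision_at_hits = hits[rel > 0] / (np.flatnonzero(rel) + 1)
        aps.append(precision_at_hits.mean())
    return float(np.mean(aps))
```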

5.2. Experimental Results and Analysis

First, we analyzed the cross-modal retrieval performance for retrieving textual information based on image information and retrieving image information based on textual information, and we report the mAP results of the different approaches with hash codes of varied lengths on the different datasets.
In order to analyze the performance of the different methods, several cross-modal hash retrieval methods were experimentally compared on the different datasets in this paper. The mAP results are shown in Table 2, Table 3 and Table 4.
The mAP results in Table 2 show that this paper's method outperforms the other methods regardless of whether the hash code length is 16, 32, or 64. Among the compared baselines, the mainstream method DMSFH has the best overall performance on the IAPR-TC12 data set; taking image-to-text retrieval at a hash code length of 32 as an example, the method in this paper is 12.5 percentage points higher, a significant improvement in cross-modal retrieval performance. The performance of the unsupervised cross-modal retrieval methods CMFH and CCA-ITQ is lower than that of the other cross-modal retrieval methods. This indicates that supervised cross-modal hashing methods generally perform better on this dataset than unsupervised ones, and the lack of label information in the unsupervised cross-modal hashing algorithms is one of the critical reasons for this result.
As seen from the mAP results in Table 3, all methods are clearly effective on this data set, indicating that most methods adapt well to the MIRFLICKR-25K data set. Among the compared baselines, the mainstream method DMSFH has the best overall performance on this data set; taking image-to-text retrieval at a hash code length of 32 as an example, the method in this paper is 2.9 percentage points higher, with a significant improvement in cross-modal retrieval performance. Again, similar to the IAPR TC-12 results, the two unsupervised cross-modal hash retrieval methods perform significantly weaker than the other, supervised, cross-modal hash retrieval methods.
As seen from the mAP results in Table 4, the mAP results of all methods on the NUS-WIDE data set are in the middle of the range compared to the results on the other datasets. Among the compared baselines, the mainstream method DMSFH has the best overall performance on this data set; taking image-to-text retrieval at a hash code length of 16 as an example, the method in this paper is 7.1 percentage points higher, with a significant improvement in cross-modal retrieval performance. However, the results are slightly lower than those of the DLFH method in the case of text-to-image retrieval, so the adaptability of the proposed method on this data set is still somewhat deficient, and its mean average precision cannot be optimal compared with more adaptable methods.
Next, we set the hash code length to 32, analyzed the performance of the different methods for cross-modal retrieval on the different datasets, and plotted the PR curves. The proposed method was compared with the eight other mainstream methods on the different data sets, and the results are shown in Figure 2, Figure 3 and Figure 4.
As can be seen from Figure 2, for text retrieval based on image information on the IAPR TC-12 dataset, the accuracy of this paper's method is higher when the recall is at its initial value and continues to decrease as the recall increases. Around a recall of 0.1, the accuracy falls below that of the DLFH and DMSFH methods. However, when the recall reaches 0.25, the decline in accuracy starts to slow down, the accuracy again exceeds that of the DLFH and DMSFH methods, and it is significantly better than the other compared methods in the later stages. According to the principle of PR curves, when two curves intersect, the performance is determined by the area enclosed by the curve and the axes: the larger the area, the higher the performance. In cases where the area is not easy to determine, performance can be judged by the intersection of the curve with y = x, where a larger intersection coordinate indicates better performance. Therefore, although the accuracy at one point is lower than that of other methods, it does not mean that the accuracy of this paper's method is poor. It only means that the method was initially only partially adapted to the data set when trained on it; however, it soon adapted to the data set and reached an optimal accuracy that remained higher than that of the other methods until the recall reached 1. Retrieving images based on textual information differs slightly from retrieving text based on images, but the overall trend is broadly similar, with significant variation in the early stages followed by quick adaptation. It can also be seen from the figure that the PR curves of the unsupervised cross-modal retrieval methods CMFH and CCA-ITQ are significantly lower than those of the other methods, which is sufficient to show that the supervised cross-modal retrieval methods are superior to the unsupervised ones.
Figure 3 demonstrates that, on the MIRFLICKR-25K data set, the accuracy of this paper's method is much greater than that of the other methods from beginning to end, both when retrieving text based on image information and when retrieving images based on text information. This shows that the proposed method is adaptable and has high retrieval accuracy on this data set. It can also be seen from Figure 3 that the unsupervised cross-modal retrieval methods CMFH and CCA-ITQ continue to have significantly lower PR curves than the other methods. From this, it is clear that supervised cross-modal retrieval methods are more applicable to this data set than unsupervised ones. Furthermore, the precision of CMFH and CCA-ITQ remained between 0.55 and 0.6 regardless of the change in recall, enclosed by the PR curves of the other methods, so the unsupervised cross-modal retrieval methods do not generalize well on this data set.
From the experimental results in Figure 4, it can be seen that on the NUS-WIDE dataset, when retrieving text based on image information, the precision of this paper's method is higher than that of the other methods when the recall is less than 0.8, but its precision starts to drop faster than that of the other methods once the recall exceeds 0.8. The method also shows a faster precision decline when retrieving images based on textual information. According to the principle of the PR curve, although the precision of this method becomes lower than that of other methods, it does not mean that the precision is poor; it only means that the method of this paper does not adapt well to this dataset in the later stage. We attribute this phenomenon to the very large size of the dataset: the network training time is long, and as the training time increases, the adaptive ability of this paper's method on this dataset cannot reach its best, which imposes certain limitations. Therefore, when experimenting on large-scale datasets, the method in this paper needs operations such as splitting the samples to improve retrieval performance. Meanwhile, it can be seen that the unsupervised cross-modal retrieval methods CMFH and CCA-ITQ have a slight advantage in terms of accuracy on this dataset, which indicates that unsupervised cross-modal retrieval methods are more suitable for very large-scale datasets. In future research, unsupervised cross-modal retrieval methods can be considered for experiments on oversized datasets.

5.3. Training Time

In order to evaluate the training time of the proposed algorithm, the experiment selected CCA-ITQ as a representative unsupervised cross-modal hashing method, DCMH as a representative supervised cross-modal hashing method, and the proposed method as the comparison objects, and compared the training time of the three algorithms on the MIRFLICKR-25K dataset with a hash code length of 16. In the experiments, the entire data set was used for training. When the whole data set is used, the convergence of DCMH and the proposed method requires more than 7 h, while the unsupervised method needs less than 3 h. It can be seen that the unsupervised method CCA-ITQ is the fastest to train, whereas the supervised cross-modal hashing methods have longer training times. The training time of the unsupervised cross-modal hashing method is short because it has no supervision information and does not need to consider the labels. Although the training time of the unsupervised method is short, its accuracy is low, so the supervised cross-modal hashing method achieves a better overall effect.

5.4. Sample Adaptability Analysis

As the proposed method in this paper presents different results on different datasets, the sample adaptability of the method is analyzed. We investigate the effect of training sample size on the retrieval performance of the proposed algorithm on the MIRFLICKR-25K and NUS-WIDE datasets. The mAP values of Tri-CMH were recorded when the hash code length was set to 32, and the sample sizes were 2000, 5000, 7000, and 10,000. Figure 5 shows the variation of mAP with sample size for both datasets.
As shown in Figure 5a, on the MIRFLICKR-25K data set, a high mAP can be achieved with a small number of samples, both for text retrieval based on images and image retrieval based on text. Furthermore, the mAP tends to increase as the amount of training data increases. However, once the sample size exceeds 7000, the growth of the mAP slows down, although it continues to rise. The method proposed in this paper is therefore highly adaptable to this data set.
As shown in Figure 5b, on the NUS-WIDE data set, the mAP tends to increase when the number of samples is less than 7000, whether the text content is retrieved based on image information or the image content is retrieved based on text information. However, when the number of training samples exceeds 7000, the mAP tends to decrease. The method in this paper does not adapt well to a larger sample size on this data set.
Therefore, for the NUS-WIDE data set, the method in this paper is more adapted to small samples and can work best for experiments by extracting a portion of the data with convincing results.
The results of the experimental analysis show that, in the current stage of cross-modal retrieval tasks, the mAP results of Tri-CMH are higher than those of the other mainstream methods, and the area under its PR curve is generally larger than that of the other methods, which is sufficient to demonstrate the effectiveness and practicality of the method in this paper. However, some aspects of the method could still be improved. For example, the accuracy is better on the IAPR-TC12 and MIRFLICKR-25K datasets, whereas on the larger NUS-WIDE data set the accuracy falls slightly below that of individual better-performing methods as the recall rises. The analysis of the sample adaptation ability with different sample sizes on different datasets shows that the method in this paper works best when the sample size is less than 7000; for sample sizes larger than 7000, the growth of the mAP values slows down. Therefore, in future research, the method will continue to be optimized for larger data sets.

6. Conclusions

In this paper, we propose a new end-to-end cross-modal hash retrieval method with fused triples, called Tri-CMH. Specifically, triple data selection is first performed on the dataset to filter out the information that is truly similar to the semantics of the tags by bringing similar samples closer together while pushing different samples further apart based on the tag information. Then, the text data are vectorized using the bag-of-words model, and a multi-layer multivariate LSTM (Multi-LSTM) and a convolutional neural network (CNN) extract features from the text and image modalities, respectively. In order to fully exploit the semantic information of the data, we considered the semantic similarity relationships between and within modalities. For each sample, a positive sample of another modality and a negative sample of the same modality are selected to form a cross-modal triple, enhancing the learning ability of the hash function in the network. By judging the similarity of data samples within the modality corresponding to a label, sample data with similar meanings can be generated. At the same time, we convert the vector codes obtained from triple data selection into binary hash codes to reduce the storage space, thus realizing cross-modal hash retrieval and fast inter-modal retrieval. Finally, we conducted experiments with the proposed method on three public cross-modal retrieval datasets. The results show that the proposed Tri-CMH method performs better on regular-sized datasets but worse on ultra-large datasets, and this underperformance is further analyzed. We conclude that the method in this paper is better suited to datasets with a sample size of around 7000, and that larger datasets need to be decomposed into multiple small-sample datasets. Therefore, compared with other mainstream methods, the method in this paper has obvious advantages: it can effectively retrieve semantically similar information, improve inter-modal and intra-modal training accuracy, and provide a reference for new cross-modal retrieval techniques.

Author Contributions

Conceptualization, W.L.; methodology, W.L.; software, W.L.; validation, W.L., Y.L., J.Y. and X.Z.; formal analysis, H.M., J.Y., X.Z. and X.X.; investigation, W.L., Y.L., J.Y. and J.W.; resources, X.X.; data curation, W.L.; writing—original draft, W.L.; Writing—review & editing, W.L., H.M. and Y.L.; visualization, W.L.; supervision, H.M.; project administration, H.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Liaoning Education Department Scientific Research Project (JZL202015404, LJKZ0625), and General project of Liaoning Provincial Department of Education (No. LJKZ0618).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, K.; Yin, Q.; Wang, W.; Wu, S.; Wang, L. A comprehensive survey on cross-modal retrieval. arXiv 2016, arXiv:1607.06215. [Google Scholar]
  2. Jiang, Q.Y.; Li, W.J. Discrete latent factor model for cross-modal hashing. IEEE Trans. Image Process. 2019, 28, 3490–3501. [Google Scholar] [CrossRef] [PubMed]
  3. Zhen, L.; Hu, P.; Peng, X.; Goh RS, M.; Zhou, J.T. Deep supervised cross-modal retrieval. Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. 2019, 33, 10394–10403. [Google Scholar]
  4. Weiss, Y.; Torralba, A.; Fergus, R. Spectral hashing. Adv. Neural Inf. Process. Syst. 2008, 21, 8–11. [Google Scholar]
  5. Zhong, Z.; Zheng, L.; Luo, Z.; Li, S.; Yang, Y. Invariance matters: Exemplar memory for domain adaptive person reidentification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 598–607. [Google Scholar]
  6. Wang, D.; Cui, P.; Ou, M.; Zhu, W. Deep multimodal hashing with orthogonal regularization. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 2291–2297. [Google Scholar]
  7. Zhang, D.; Li, W.J. Large-scale supervised multimodal hashing with semantic correlation maximization. Proc. AAAI Conf. Artif. Intell. 2014, 28, 2177–2183. [Google Scholar] [CrossRef]
  8. Shen, H.T.; Liu, L.; Yang, Y.; Xu, X.; Huang, Z.; Shen, F.; Hong, R. Exploiting subspace relation in semantic labels for cross-modal hashing. IEEE Trans. Knowl. Data Eng. 2020, 33, 3351–3365. [Google Scholar] [CrossRef]
  9. Chamberlain, J.D.; Bowman, C.R.; Dennis, N.A. Age-related differences in encoding-retrieval similarity and their relationship to false memory. Neurobiol. Aging 2022, 113, 15–27. [Google Scholar] [CrossRef] [PubMed]
  10. Wang, Y.; Wang, L.; Yang, Y.; Lian, T. SemSeq4FD: Integrating global semantic relationship and local sequential order to enhance text representation for fake news detection. Expert Syst. Appl. 2021, 166, 114090. [Google Scholar] [CrossRef] [PubMed]
  11. Wang, X.; Zou, X.; Bakker, E.M.; Wu, S. Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval. Neurocomputing 2020, 400, 255–271. [Google Scholar] [CrossRef]
  12. Liu, F.; Gao, M.; Zhang, T.; Zou, Y. Exploring semantic relationships for image captioning without parallel data. In Proceedings of the 2019 IEEE International Conference on Data Mining (ICDM), Beijing, China, 8–11 November 2019; pp. 439–448. [Google Scholar]
  13. Wang, H.; Sahoo, D.; Liu, C.; Lim, E.P.; Hoi, S.C. Learning cross-modal embeddings with adversarial networks for cooking recipes and food images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11572–11581. [Google Scholar]
  14. Khan, A.; Hayat, S.; Ahmad, M.; Wen, J.; Farooq, M.U.; Fang, M.; Jiang, W. Cross-modal retrieval based on deep regularized hashing constraints. Int. J. Intell. Syst. 2022, 37, 6508–6530. [Google Scholar] [CrossRef]
  15. Jiang, Q.Y.; Li, W.J. Deep cross-modal hashing. In Proceedings of the IEEE Conference On Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3232–3240. [Google Scholar]
  16. Ma, L.; Li, H.; Meng, F.; Wu, Q.; Ngan, K.N. Global and local semantics-preserving based deep hashing for cross-modal retrieval. Neurocomputing 2018, 312, 49–62. [Google Scholar] [CrossRef]
  17. Hu, P.; Zhu, H.; Lin, J.; Peng, D.; Zhao, Y.P.; Peng, X. Unsupervised contrastive cross-modal hashing. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3877–3889. [Google Scholar] [CrossRef] [PubMed]
  18. Gong, Y.; Lazebnik, S.; Gordo, A.; Perronnin, F. Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 35, 2916–2929. [Google Scholar] [CrossRef] [PubMed]
  19. Ding, G.; Guo, Y.; Zhou, J. Collective matrix factorization hashing for multimodal data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–24 June 2014. [Google Scholar]
  20. Shu, Z.; Bai, Y.; Zhang, D.; Yu, J.; Yu, Z.; Wu, X.J. Specific class center guided deep hashing for cross-modal retrieval. Inf. Sci. 2022, 609, 304–318. [Google Scholar] [CrossRef]
  21. Mandal, D.; Chaudhury, K.N.; Biswas, S. Generalized semantic preserving hashing for n-label cross-modal retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4076–4084. [Google Scholar]
  22. Zhen, Y.; Yeung, D.Y. A probabilistic model for multimodal hash function learning. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 12 August 2012; pp. 940–948. [Google Scholar]
  23. Lin, Z.; Ding, G.; Hu, M.; Wang, J. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition, Honolulu, HI, USA, 21–26 July 2015; pp. 3864–3872. [Google Scholar]
  24. Kumar, S.; Udupa, R. Learning hash functions for cross-view similarity search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Catalonia, Spain, 16–22 July 2011; pp. 1360–1365. [Google Scholar]
  25. Zhu, X.; Cai, L.; Zou, Z.; Zhu, L. Deep Multi-Semantic Fusion-Based Cross-Modal Hashing. Mathematics 2022, 10, 430. [Google Scholar] [CrossRef]
  26. Cao, Y.; Long, M.; Wang, J.; Yang, Q.; Yu, P.S. Deep visual-semantic hashing for cross-modal retrieval. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 13–17 August 2016. [Google Scholar]
  27. Deng, C.; Chen, Z.; Liu, X.; Gao, X.; Tao, D. Triplet-based deep hashing network for cross-modal retrieval. IEEE Trans. Image Process. 2018, 27, 3893–3903. [Google Scholar] [CrossRef] [PubMed]
  28. Chatfield, K.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Return of the devil in the details: Delving deep into convolutional nets. arXiv 2014, arXiv:1405.3531. [Google Scholar]
  29. Xu, M.; Du, J.; Xue, Z.; Guan, Z.; Kou, F.; Shi, L. A scientific research topic trend prediction model based on multi-LSTM and graph convolutional network. Int. J. Intell. Syst. 2022, 37, 6331–6353. [Google Scholar] [CrossRef]
  30. Escalante, H.J.; Hernández, C.A.; Gonzalez, J.A.; López-López, A.; Montes, M.; Morales, E.F.; Sucar, L.E.; Villaseñor, L.; Grubinger, M. The segmented and annotated IAPR TC-12 benchmark. Comput. Vis. Image Underst. 2010, 114, 419–428. [Google Scholar] [CrossRef]
  31. Huiskes, M.J.; Lew, M.S. The mir flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information, New York, NY, USA, 30–31 October 2008; pp. 39–43. [Google Scholar]
  32. Chua, T.S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. Nus-wide: A real-world web image database from national university of Singapore. In Proceedings of the ACM International Conference On Image and Video Retrieval, New York, NY, USA, 8–10 July 2009; pp. 1–9. [Google Scholar]
  33. Henderson, P.; Ferrari, V. End-to-end training of object class detectors for mean average precision. In Proceedings of the Asian Conference on Computer Vision, Taipei, Taiwan, 20–24 November 2016; pp. 198–213. [Google Scholar]
  34. Goutte, C.; Gaussier, E. A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation; European Conference Springer: Berlin/Heidelberg, Germany, 2005; pp. 345–359. [Google Scholar]
Figure 1. Overall structure of the cross-modal hash retrieval method with fused triples.
Figure 2. Cross-modal retrieval PR curve on IAPR TC-12. (a) Image retrieval text on IAPR-TC12; (b) text search image on IAPR-TC12.
Figure 3. PR curve for cross-modal retrieval on MIRFLICKR-25K. (a) Image retrieval text on MIRFLICKR-25K; (b) MIRFLICKR-25K on text search images.
Figure 4. PR curve for cross-modal retrieval on NUS-WIDE. (a) NUS-WIDE on image retrieval text; (b) text retrieval images on NUS-WIDE.
Figure 5. Variation of mAP under different samples. (a) MIRFLICKR-25K data set; (b) NUS-WIDE dataset.
Table 1. Convolutional neural network model structure.
Layer | Structure
Conv1 | f. 64 × 11 × 11; s. 4 × 4, pad 0, LRN, ×2 pool
Conv2 | f. 256 × 5 × 5; s. 1 × 1, pad 2, LRN, ×2 pool
Conv3 | f. 256 × 3 × 3; s. 1 × 1, pad 1
Conv4 | f. 256 × 3 × 3; s. 1 × 1, pad 1
Conv5 | f. 256 × 3 × 3; s. 1 × 1, pad 1, ×2 pool
full6 | 4096
full7 | 4096
full8 | Hash code length c
Table 2. Comparison of methods on the IAPR TC-12 dataset.
Task | Method | 16 Bit | 32 Bit | 64 Bit
I→T | CMFH | 0.310 | 0.325 | 0.308
I→T | CCA-ITQ | 0.342 | 0.331 | 0.324
I→T | SCM | 0.387 | 0.397 | 0.410
I→T | SePH | 0.449 | 0.452 | 0.471
I→T | DCMH | 0.453 | 0.473 | 0.484
I→T | TDH | 0.462 | 0.477 | 0.498
I→T | DLFH | 0.450 | 0.468 | 0.527
I→T | DMSFH | 0.536 | 0.512 | 0.561
I→T | Tri-CMH | 0.629 | 0.637 | 0.648
T→I | CMFH | 0.325 | 0.330 | 0.319
T→I | CCA-ITQ | 0.342 | 0.332 | 0.324
T→I | SCM | 0.364 | 0.361 | 0.369
T→I | SePH | 0.446 | 0.458 | 0.473
T→I | DCMH | 0.518 | 0.537 | 0.546
T→I | TDH | 0.535 | 0.584 | 0.569
T→I | DLFH | 0.481 | 0.509 | 0.602
T→I | DMSFH | 0.589 | 0.612 | 0.615
T→I | Tri-CMH | 0.631 | 0.642 | 0.653
Table 3. Comparison of methods on the MIRFLICKR-25 dataset.
Task | Method | 16 Bit | 32 Bit | 64 Bit
I→T | CMFH | 0.576 | 0.574 | 0.569
I→T | CCA-ITQ | 0.576 | 0.574 | 0.569
I→T | SCM | 0.639 | 0.651 | 0.685
I→T | SePH | 0.718 | 0.722 | 0.723
I→T | DCMH | 0.741 | 0.746 | 0.750
I→T | TDH | 0.750 | 0.762 | 0.762
I→T | DLFH | 0.761 | 0.780 | 0.789
I→T | DMSFH | 0.815 | 0.819 | 0.831
I→T | Tri-CMH | 0.838 | 0.848 | 0.856
T→I | CMFH | 0.578 | 0.579 | 0.577
T→I | CCA-ITQ | 0.576 | 0.571 | 0.577
T→I | SCM | 0.655 | 0.670 | 0.698
T→I | SePH | 0.727 | 0.732 | 0.738
T→I | DCMH | 0.782 | 0.790 | 0.793
T→I | TDH | 0.802 | 0.825 | 0.844
T→I | DLFH | 0.825 | 0.851 | 0.870
T→I | DMSFH | 0.807 | 0.823 | 0.856
T→I | Tri-CMH | 0.842 | 0.851 | 0.872
Table 4. Comparison of methods on the NUS-WIDE dataset.
Task | Method | 16 Bit | 32 Bit | 64 Bit
I→T | CMFH | 0.384 | 0.390 | 0.406
I→T | CCA-ITQ | 0.396 | 0.382 | 0.370
I→T | SCM | 0.522 | 0.548 | 0.556
I→T | SePH | 0.610 | 0.616 | 0.628
I→T | DCMH | 0.590 | 0.603 | 0.609
I→T | TDH | 0.661 | 0.687 | 0.692
I→T | DLFH | 0.674 | 0.695 | 0.705
I→T | DMSFH | 0.697 | 0.722 | 0.731
I→T | Tri-CMH | 0.768 | 0.780 | 0.791
T→I | CMFH | 0.374 | 0.382 | 0.389
T→I | CCA-ITQ | 0.390 | 0.378 | 0.368
T→I | SCM | 0.516 | 0.540 | 0.549
T→I | SePH | 0.578 | 0.605 | 0.626
T→I | DCMH | 0.638 | 0.651 | 0.657
T→I | TDH | 0.728 | 0.743 | 0.752
T→I | DLFH | 0.780 | 0.802 | 0.821
T→I | DMSFH | 0.709 | 0.752 | 0.766
T→I | Tri-CMH | 0.778 | 0.805 | 0.818
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
