Article

Proxy-Based Semi-Supervised Cross-Modal Hashing

College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2390; https://doi.org/10.3390/app15052390
Submission received: 23 January 2025 / Revised: 18 February 2025 / Accepted: 21 February 2025 / Published: 23 February 2025

Abstract

Due to the difficulty in obtaining label information in practical applications, semi-supervised cross-modal retrieval has emerged. However, the existing semi-supervised cross-modal hashing retrieval methods mainly focus on exploring the structural relationships between data and generating high-quality discrete pseudo-labels while neglecting the relationships between data and categories, as well as the structural relationships between data and categories inherent in continuous pseudo-labels. Based on this, Proxy-based Semi-Supervised Cross-Modal Hashing (PSSCH) is proposed. Specifically, we propose a category proxy network to generate category center points in both feature and hash spaces. Additionally, we design an Adaptive Dual-Label Loss function, which applies different learning strategies to discrete ground truth labels and continuous pseudo-labels and adaptively increases the training weights of unlabeled data with more epochs. Experiments on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets show that PSSCH achieves the highest mAP improvements of 3%, 1%, and 4%, respectively, demonstrating better results than the latest baseline methods.

1. Introduction

With the exponential growth of multimedia data on the internet, the ability to efficiently search for and retrieve relevant information has become a crucial task. Deep cross-modal hashing, which maps different modalities (e.g., images, text, and videos) into a common low-dimensional space, has emerged as a powerful solution for fast and efficient retrieval [1,2,3,4,5,6,7,8,9]. This technique enables the use of binary codes to represent data, making retrieval both fast and scalable. However, the effectiveness of cross-modal hashing methods often relies on the availability of large amounts of labeled data for training, which is not always feasible in real-world applications [10,11,12,13].
Due to the difficulty in obtaining labeled data, especially for large-scale datasets, semi-supervised learning techniques have gained increasing attention in the field of cross-modal retrieval. These methods aim to leverage both labeled and unlabeled data, reducing the reliance on expensive and time-consuming manual labeling processes. Semi-supervised learning seeks to effectively utilize the wealth of unlabeled data, which are often more abundant and easier to collect, while maximizing the information extracted from the relatively smaller labeled dataset. By incorporating both labeled and unlabeled data into the training process, these approaches have shown great potential for improving retrieval performance in scenarios where labeled data are scarce [14,15,16].
Motivation. The existing semi-supervised cross-modal hashing retrieval methods focus on exploring the similarity relationships between different modalities, aiming to map similar data closer together and dissimilar data farther apart in a shared low-dimensional space while preserving these relationships [17,18]. However, they often neglect the “data–category” relationships, which leads to suboptimal retrieval performance. Moreover, these methods primarily concentrate on generating high-quality discrete pseudo-labels, overlooking the structural relationships between data and categories inherent in continuous pseudo-labels. For example, when predicting the pseudo-label of an unlabeled data sample A, a confidence distribution indicating its likelihood of belonging to certain categories is obtained. This distribution contains rich structural information about the relationships between data and categories. However, the existing methods fail to fully utilize this information, instead setting a threshold to discretize the continuous labels [19,20,21,22]. Therefore, effectively leveraging the information within continuous pseudo-labels remains a significant challenge.
Our Method. To address the above challenges, we propose a novel Proxy-based Semi-Supervised Cross-Modal Hashing (PSSCH) method for semi-supervised cross-modal retrieval. Our method introduces a category proxy network (CPNet) and a novel Adaptive Dual-Label Loss. The CPNet produces two key components: feature proxies and hash proxies. Feature proxies are learned and updated from labeled data features; each proxy represents the center point of a category in the feature space, providing a reliable basis for generating pseudo-labels for unlabeled data. Hash proxies represent category centers in the hash space and are obtained by forward-propagating the feature proxies. This process ensures that the hash proxies preserve, as far as possible, the structural relationships among the feature proxies in the feature space. By learning good feature proxies and hash proxies via the CPNet, we can consider both data–data and data–category relationships during model training.
The Adaptive Dual-Label Loss is specifically designed to handle the differences between discrete ground truth labels (labeled data) and continuous pseudo-labels (unlabeled data). By adopting different learning strategies, it effectively optimizes both types of labels. For discrete ground truth labels, it ensures that labeled data cluster around their corresponding categories in the hash spaces while preserving the structural relationships between the data. Unlike discrete labels, continuous pseudo-labels can capture subtle relationships between data and multiple categories. Directly discretizing pseudo-labels would result in a loss of information. Therefore, in our learning strategy, we fully leverage the rich structural information inherent in continuous pseudo-labels, thereby capturing the confidence relationships between the data and all the categories. Additionally, we dynamically adjust the weight of the loss for unlabeled data based on the epoch. This ensures that the model maintains the importance of labeled data while learning more stably from unlabeled data. In the early stages of training, the model relies more on true labels to avoid negative impacts from unlabeled data. Later, as the quality of the pseudo-labels improves, the weight gradually increases.
Contributions. To sum up, the main contributions of this article are threefold:
  • Category Proxy Network: We design a CPNet to generate feature proxies and hash proxies, which work with two-modal hashing networks during training. This enables the model to consider both the relationships between data and the relationships between data and categories.
  • Adaptive Dual-Label Loss: To capture the structural relationships between data and categories contained in continuous pseudo-labels, an Adaptive Dual-Label Loss is proposed.
  • Experimental Validation: Extensive experiments on three public datasets demonstrate the superiority of our PSSCH method.
Roadmap. The overall architecture of this paper is as follows. Section 2 introduces some representative methods in the three fields of supervised cross-modal hashing, unsupervised cross-modal hashing, and semi-supervised cross-modal hashing, respectively. Section 3 introduces the proposed method in detail. The experimental results and analysis are provided in Section 4. Section 5 summarizes the conclusions and future work.

2. Related Works

In recent years, cross-modal hashing has gained significant attention due to its potential to improve the efficiency and scalability of cross-modal retrieval tasks. In this section, we summarize representative approaches in three main categories: supervised cross-modal hashing in Section 2.1, unsupervised cross-modal hashing in Section 2.2, and semi-supervised cross-modal hashing in Section 2.3.

2.1. Supervised Hashing Methods

Supervised cross-modal hashing methods utilize labeled data to learn hash codes, aiming to maintain semantic similarity across different modalities based on known label information. Among these methods, DCMH made a groundbreaking contribution by introducing deep learning into cross-modal hashing. It employs an end-to-end approach to learn the feature representations of raw data and optimize hash codes simultaneously [23]. Building on this, the MESDCH method further considers the distinction in similarity magnitudes between data under a multi-label setting, designing a multi-label semantic similarity module to explore inter-modal similarity relationships [24]. Since mapping data into the hash space involves quantization, HHF addresses the incompatibility between metric loss and quantization loss; it alleviates this issue by setting different thresholds based on varying hash code lengths and the number of categories [25]. SCCGDH designs a label network to generate distinct centers for different labels and uses these label centers to guide data learning [26]. DCPH is the first to introduce the concept of category proxies, generating hash proxies through a proxy network to guide the model learning process, enabling the model to fully consider the relationships between data and categories [27]. Considering that the generated category proxies do not account for the data distribution, DAPH adopts a two-step approach: it first integrates data features and labels into a unified framework to learn category proxies, which are then used for guidance [28]. DSPH proposes trainable category proxies and further adds a negative sample constraint to the proxy loss [29]. SCH, addressing the oscillation problem of the loss function, sets upper and lower bounds for similarity measures between data points to mitigate oscillations [30]. Finally, DCGH simultaneously considers relationships between data points as well as between data and categories, ensuring intra-class cohesion and preserving inter-class structural relationships [31].

2.2. Unsupervised Hashing Methods

In the absence of supervision, unsupervised cross-modal hashing methods utilize structural relationships among data to guide the training of hash models. DBRC models the relationships between different modalities and introduces a reconstruction framework [32]. UDCMH dynamically assigns different weights to data during the optimization process [33]. DJSRH explores the structural relationships of data across modalities and integrates them to guide model training [34]. AGCH employs various metric methods to aggregate structural relationships between different modalities and generates a similarity matrix [35]. CIRH designs a multi-modal collaborative graph to establish heterogeneous multi-modal correlations and performs semantic aggregation on graph networks to generate a complementary multi-modal representation [36].

2.3. Semi-Supervised Hashing Methods

Semi-supervised cross-modal hashing methods leverage both the label information of labeled data and the structural information of unlabeled data, gaining wide attention in real-world applications. For example, MGCH [19] proposes a novel multi-view graph-based cross-modal hashing framework to generate hash codes in a semi-supervised manner, using a graph reasoning module to process the output of multi-view graphs. MCGCN [37] designs an additional network within the two modality-specific networks to generate joint feature representations of the two modalities and employs graph convolution to propagate label information to unlabeled data. SSCH [22] enhances semantic information through an unaligned pseudo-labeling process in the presence of incorrect pairings and learns hash representations for different data using a label enhancement strategy. SCMB [38] introduces a cross-modal memory bank, dynamically storing the feature representations of each cross-modal data instance in a shared space and class probability representations in a label space to guide model training. SKDCH [21] adopts a student–teacher network to propagate knowledge and improve triplet ranking loss, effectively alleviating the heterogeneity gap. SPAL [39] uses shared semantic prototypes to associate labeled and unlabeled data in both modalities, minimizing intra-class variation and maximizing inter-class variation to improve the discriminative representation of unlabeled data. GCSCH [20] employs graph convolution networks to capture semantic information between multi-modal data of both ground truth and pseudo-labels and uses a teacher–student learning framework to transfer knowledge from the fusion module to image and text hashing networks.

3. Methodology

This section introduces the method we propose. First, the problem is described, followed by an overview of the PSSCH framework, and finally, a detailed discussion of our PSSCH method.

3.1. Problem Formulation

Problem Definition. Cross-modal hashing methods aim to map multimodal data (such as images and text) into a shared hash space to enable efficient cross-modal retrieval. Given an image or a text query, the goal is to generate binary hash codes through the learned hashing functions and retrieve the most relevant items from the corresponding modality (for example, retrieving the corresponding text for an image or vice versa) based on their semantic similarity. This paper focuses on image–text cross-modal retrieval. Given a training dataset $\mathcal{D} = \{\mathcal{D}^l, \mathcal{D}^u\}$ containing $N$ image–text pairs, $\mathcal{D}^l = \{\mathbf{X}^l, \mathbf{Y}^l, \mathbf{L}^l\}$ represents the labeled sample set with $N^l$ labeled image–text pairs. The labeled images and texts are denoted as $\mathbf{X}^l \in \mathbb{R}^{N^l \times d_x}$ and $\mathbf{Y}^l \in \mathbb{R}^{N^l \times d_y}$, respectively, and $\mathbf{L}^l \in \{0, 1\}^{N^l \times C}$, where $C$ is the number of categories. If the $i$-th sample belongs to the $j$-th class, then $L^l_{ij} = 1$; otherwise, $L^l_{ij} = 0$ [20]. On the other hand, $\mathcal{D}^u = \{\mathbf{X}^u, \mathbf{Y}^u\}$ represents the unlabeled sample set, where $\mathbf{X}^u \in \mathbb{R}^{N^u \times d_x}$ and $\mathbf{Y}^u \in \mathbb{R}^{N^u \times d_y}$ denote the unlabeled images and texts, respectively, with $N^u$ being the number of unlabeled samples. Finally, the total number of samples is $N = N^l + N^u$. Our goal is to learn the hashing functions $\mathcal{H}_x$ for the image modality and $\mathcal{H}_y$ for the text modality, and to obtain high-quality hash codes $\mathbf{B}^x \in \{-1, 1\}^K$ and $\mathbf{B}^y \in \{-1, 1\}^K$ through these hashing functions, where $K$ is the number of bits in the hash code.
Notations. Without loss of generality, sets are denoted by math script uppercase letters (e.g., $\mathcal{D}$). Scalars or constants are denoted by uppercase letters (e.g., $C$). Matrices are represented by uppercase bold letters (e.g., $\mathbf{A}$), and the $(i, j)$-th element of $\mathbf{A}$ is denoted by $A_{ij}$ [31]. Vectors are denoted by lowercase bold letters (e.g., $\mathbf{a}$), and the $i$-th element of $\mathbf{a}$ is represented by $a_i$. The transpose of a matrix or vector is denoted by a superscript $T$ (e.g., $\mathbf{A}^T$). Functions are denoted by calligraphic uppercase letters (e.g., $\mathcal{L}$). The frequently used mathematical notations are summarized in Table 1 for readability [31].

3.2. Overview of PSSCH Framework

As illustrated in Figure 1, the proposed PSSCH framework consists of three networks: ImgNet, TxtNet, and CPNet.
ImgNet and TxtNet. Building on the ideas from [31,40,41], we introduce Transformer-encoder-based feature extraction into cross-modal retrieval, comprising 12 stacked encoder blocks. Each block includes Layer Normalization, Multi-Head Self-Attention, and MLP components, with a total of 12 attention heads. The extraction of image semantic features is represented as $\mathbf{F}_x^l = \mathcal{G}_x(\mathbf{X}^l, \theta_x)$ and $\mathbf{F}_x^u = \mathcal{G}_x(\mathbf{X}^u, \theta_x)$, where $\mathbf{F}_x^l$ and $\mathbf{F}_x^u$ represent the semantic features of labeled and unlabeled images, respectively, $\mathcal{G}_x$ denotes the image semantic encoder, and $\theta_x$ refers to its parameters.
Similarly, the text Transformer encoder comprises 12 encoder blocks, each with 8 attention heads for Multi-Head Self-Attention [29,31]. The extraction of text semantic features is expressed as $\mathbf{F}_y^l = \mathcal{G}_y(\mathbf{Y}^l, \theta_y)$ and $\mathbf{F}_y^u = \mathcal{G}_y(\mathbf{Y}^u, \theta_y)$, where $\mathbf{F}_y^l$ and $\mathbf{F}_y^u$ represent the semantic features of labeled and unlabeled text, respectively, $\mathcal{G}_y$ represents the text semantic encoder, and $\theta_y$ corresponds to its parameters.
CPNet. The CPNet consists of four simple fully connected layers, with the dimensions of the intermediate layers matching the dimensions of the image and text semantic features, and the output layer having a dimension equal to the hash code length $K$. Since one-hot encoding represents categorical information in a sparse discrete format that clearly distinguishes categories, we use the one-hot encodings of the categories as input; through forward propagation, the CPNet produces the feature proxies $\mathbf{P}^f = \{\mathbf{p}_i^f\}_{i=1}^C$ and hash proxies $\mathbf{P}^h = \{\mathbf{p}_i^h\}_{i=1}^C$.
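To make this structure concrete, the following is a minimal PyTorch sketch of a category proxy network of this kind; the exact layer widths, activations, and the Tanh on the hash output are assumptions not specified above.

```python
import torch
import torch.nn as nn

class CPNet(nn.Module):
    """Minimal sketch of a category proxy network: one-hot category codes in,
    feature proxies (category centers in the feature space) and hash proxies
    (category centers in the hash space) out."""

    def __init__(self, num_classes: int, feat_dim: int = 512, code_len: int = 64):
        super().__init__()
        # Two fully connected layers map one-hot codes into the feature space.
        self.to_feature = nn.Sequential(
            nn.Linear(num_classes, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )
        # Two further fully connected layers map feature proxies to K-dimensional
        # binary-like hash proxies.
        self.to_hash = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, code_len), nn.Tanh(),
        )

    def forward(self, one_hot: torch.Tensor):
        feature_proxies = self.to_feature(one_hot)    # P^f, shape (C, feat_dim)
        hash_proxies = self.to_hash(feature_proxies)  # P^h, shape (C, K)
        return feature_proxies, hash_proxies

# Usage: one proxy per category, driven by the identity matrix of one-hot codes.
# feature_proxies, hash_proxies = CPNet(num_classes=24)(torch.eye(24))
```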

3.3. Model Learning

For the generation of pseudo-labels, we first calculate the cosine similarity between the features of unlabeled data and each category center (feature proxy) in the feature space. The higher the similarity, the greater the probability we assign to the data belonging to that category. We set a threshold for this; we retain data with a cosine similarity above the threshold, and, for those below the threshold, we consider the data not to belong to the category and set it to 0. Feature proxies are learned from the features of labeled data in the feature space. Since at the initial training stage the feature proxies have not yet been well trained, the quality of the pseudo-labels is not guaranteed. To address this, we set up a dynamic weight adjustment function that gradually increases the weight of unlabeled training data as the number of epochs increases. The hyper-parameters α and β are used to explore the importance of considering the relationship between data and the relationships between data and categories during model learning. In this subsection, we will detail the generation method of pseudo-labels and the learning strategy of the model.

3.3.1. Pseudo-Label Generation

The pseudo-label generation process is based on cosine similarity [42,43]. First, we obtain the feature representations of the unlabeled data for the two modalities, $\mathbf{F}_x^u$ and $\mathbf{F}_y^u$, as well as the feature proxies $\mathbf{P}^f$. Then, we calculate the cosine similarity between the feature representations of the two modalities and the feature proxies using the pseudo-label generation module, as shown in Equation (1):

$$\mathbf{S}_x^u = \frac{\mathbf{F}_x^u}{\|\mathbf{F}_x^u\|_2} \cdot \left(\frac{\mathbf{P}^f}{\|\mathbf{P}^f\|_2}\right)^T, \quad \mathbf{S}_y^u = \frac{\mathbf{F}_y^u}{\|\mathbf{F}_y^u\|_2} \cdot \left(\frac{\mathbf{P}^f}{\|\mathbf{P}^f\|_2}\right)^T \quad (1)$$

where $\|\cdot\|_2$ is the $L_2$ norm, $(\cdot)^T$ is the transpose of a vector (or matrix), and $\mathbf{S}_x^u, \mathbf{S}_y^u \in \mathbb{R}^{N^u \times C}$ represent the similarity matrices between the unlabeled image and text features and the feature proxies, respectively. Then, we compute the average similarity $\mathbf{S}^u$ as in Equation (2):

$$\mathbf{S}^u = \frac{\mathbf{S}_x^u + \mathbf{S}_y^u}{2} \quad (2)$$

Hence, we generate the pseudo-labels $\mathbf{L}^u$ using $\mathbf{S}^u$ and a threshold, as shown in Equation (3):

$$L^u_{ij} = \begin{cases} S^u_{ij}, & S^u_{ij} \geq \text{threshold} \\ 0, & \text{otherwise} \end{cases} \quad (3)$$

In this paper, we set the threshold to 0.5. If $S^u_{ij} \geq \text{threshold}$, we retain the continuous value in $\mathbf{S}^u$. If $S^u_{ij} < \text{threshold}$, following [30], we set it to 0.
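A minimal PyTorch sketch of Equations (1)–(3), assuming row-wise $L_2$ normalization of features and proxies; variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def generate_pseudo_labels(feat_img_u, feat_txt_u, feature_proxies, threshold=0.5):
    """Sketch of Equations (1)-(3): cosine similarity of unlabeled features to the
    feature proxies, averaged over the two modalities and thresholded at 0.5.
    Similarities below the threshold are zeroed; those above are kept as
    continuous pseudo-label values."""
    sim_img = F.normalize(feat_img_u, dim=1) @ F.normalize(feature_proxies, dim=1).t()  # S_x^u
    sim_txt = F.normalize(feat_txt_u, dim=1) @ F.normalize(feature_proxies, dim=1).t()  # S_y^u
    sim_avg = (sim_img + sim_txt) / 2                                                   # S^u
    return torch.where(sim_avg >= threshold, sim_avg, torch.zeros_like(sim_avg))        # L^u
```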

3.3.2. Feature Proxy Learning

For the learning of feature proxies, we only constrain the relationship between the labeled data and the feature proxies when updating the feature proxies. Specifically, we use the label information from the labeled data to pull the relevant feature proxies closer to the data and push the irrelevant feature proxies away from the data. For feature proxies that are related to a sample, we reduce the distance between the feature and the relevant proxies using the cosine similarity in Equation (4):

$$\cos^{+}(\mathbf{f}, \mathbf{p}) = \frac{\mathbf{f} \cdot \mathbf{p}}{\|\mathbf{f}\| \cdot \|\mathbf{p}\|} \quad (4)$$

For feature proxies that are not related to a sample, we enforce a larger distance between the feature and the irrelevant proxies using Equation (5):

$$\cos^{-}(\mathbf{f}, \mathbf{p}) = \max\left(\frac{\mathbf{f} \cdot \mathbf{p}}{\|\mathbf{f}\| \cdot \|\mathbf{p}\|}, 0\right) \quad (5)$$

Hence, the image feature proxy loss $\mathcal{L}_f^I$ is defined in Equation (6):

$$\mathcal{L}_f^I = -\frac{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=1)\,\cos^{+}(\mathbf{f}_i^x, \mathbf{p}_j^f)}{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=1)} + \frac{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=0)\,\cos^{-}(\mathbf{f}_i^x, \mathbf{p}_j^f)}{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=0)} \quad (6)$$

where $\mathbb{I}$ is an indicator function. The denominators represent the numbers of relevant and irrelevant data–proxy pairs, respectively, and serve as normalization. Similarly, the text feature proxy loss $\mathcal{L}_f^T$ is defined in Equation (7):

$$\mathcal{L}_f^T = -\frac{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=1)\,\cos^{+}(\mathbf{f}_i^y, \mathbf{p}_j^f)}{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=1)} + \frac{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=0)\,\cos^{-}(\mathbf{f}_i^y, \mathbf{p}_j^f)}{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=0)} \quad (7)$$

The feature proxy loss $\mathcal{L}_f$ is defined in Equation (8):

$$\mathcal{L}_f = \mathcal{L}_f^I + \mathcal{L}_f^T \quad (8)$$
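The following PyTorch sketch implements Equations (4)–(8) for one modality under the reconstruction above (the sign convention on the relevant-pair term is an assumption); averaging over the masked entries reproduces the normalization by the pair counts.

```python
import torch
import torch.nn.functional as F

def feature_proxy_loss(features, labels, feature_proxies):
    """Sketch of Equations (4)-(7) for one modality: maximize the cosine between a
    labeled feature and the proxies of its categories, and penalize any positive
    cosine with the proxies of the other categories."""
    cos = F.normalize(features, dim=1) @ F.normalize(feature_proxies, dim=1).t()  # (N^l, C)
    relevant = labels > 0      # data-proxy pairs with l_ij = 1
    irrelevant = labels == 0   # data-proxy pairs with l_ij = 0
    pull = -cos[relevant].mean() if relevant.any() else cos.new_zeros(())
    push = cos[irrelevant].clamp(min=0).mean() if irrelevant.any() else cos.new_zeros(())
    return pull + push

# Equation (8): L_f = feature_proxy_loss(F_x^l, L^l, P^f) + feature_proxy_loss(F_y^l, L^l, P^f)
```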

3.3.3. Adaptive Dual-Label Loss

Through forward propagation, the features of labeled and unlabeled data $\mathbf{F}_*^l$ and $\mathbf{F}_*^u$ are transformed into binary-like codes $\mathbf{H}_*^l$ and $\mathbf{H}_*^u$, where $* \in \{x, y\}$; the feature proxies $\mathbf{P}^f = \{\mathbf{p}_i^f\}_{i=1}^C$ generate the hash proxies $\mathbf{P}^h = \{\mathbf{p}_i^h\}_{i=1}^C$. In the hash space, we treat labeled and unlabeled data separately. Specifically, for labeled data, which have true label information, we not only learn the relationship between the data and the hash proxies but also learn the relationships between the data points themselves through their true labels. We obtain the similarity relationship between labeled data using Equation (9):

$$\mathbf{S}^l = \frac{\mathbf{L}^l}{\|\mathbf{L}^l\|_2} \cdot \left(\frac{\mathbf{L}^l}{\|\mathbf{L}^l\|_2}\right)^T \quad (9)$$

The range of $S^l_{ij}$ is $[0, 1]$. If $S^l_{ij} > 0$, then $\mathbf{x}_i$ (or $\mathbf{y}_i$) and $\mathbf{x}_j$ (or $\mathbf{y}_j$) are called a relevant pair. If $S^l_{ij} = 0$, they are considered an irrelevant pair. Relevant pairs are pulled closer using $\cos_p(\mathbf{h}_i, \mathbf{h}_j)$, and irrelevant data pairs are pushed apart using $\cos_n(\mathbf{h}_i, \mathbf{h}_j)$:

$$\cos_p(\mathbf{h}_i, \mathbf{h}_j) = \max\left(S^l_{ij} - \frac{\mathbf{h}_i \cdot \mathbf{h}_j}{\|\mathbf{h}_i\| \cdot \|\mathbf{h}_j\|}, 0\right), \quad \cos_n(\mathbf{h}_i, \mathbf{h}_j) = \max\left(\frac{\mathbf{h}_i \cdot \mathbf{h}_j}{\|\mathbf{h}_i\| \cdot \|\mathbf{h}_j\|}, 0\right) \quad (10)$$

Therefore, the data-related loss for labeled data in the hash space, $\mathcal{L}_h^{ld}$, is computed using Equation (11):

$$\mathcal{L}_h^{ld} = \sum_{* \in \{x,y\}} \left[ \frac{\sum_{i,j=1}^{N^l} \mathbb{I}(S^l_{ij} > 0)\,\cos_p(\mathbf{h}_i^*, \mathbf{h}_j^*)}{\sum_{i,j=1}^{N^l} \mathbb{I}(S^l_{ij} > 0)} + \frac{\sum_{i,j=1}^{N^l} \mathbb{I}(S^l_{ij} = 0)\,\cos_n(\mathbf{h}_i^*, \mathbf{h}_j^*)}{\sum_{i,j=1}^{N^l} \mathbb{I}(S^l_{ij} = 0)} \right] \quad (11)$$

For the relationship between data and hash proxies in the hash space, we impose a constraint using the loss $\mathcal{L}_h^{lp}$. Similar to the loss $\mathcal{L}_f$ in the feature space, $\mathcal{L}_h^{lp}$ is defined as follows:

$$\mathcal{L}_h^{lp} = \sum_{* \in \{x,y\}} \left[ -\frac{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=1)\,\cos^{+}(\mathbf{h}_i^*, \mathbf{p}_j^h)}{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=1)} + \frac{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=0)\,\cos^{-}(\mathbf{h}_i^*, \mathbf{p}_j^h)}{\sum_{i=1}^{N^l}\sum_{j=1}^{C} \mathbb{I}(l_{ij}=0)} \right] \quad (12)$$
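A PyTorch sketch of Equations (9)–(12) for one modality, following the reconstruction above; handling of batches in which one of the masks is empty is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def labeled_hash_losses(hash_codes, labels, hash_proxies):
    """Sketch of Equations (9)-(12) for one modality: a data-data term that drives the
    cosine between labeled codes toward their label similarity S^l, and a data-proxy
    term that pulls codes toward relevant hash proxies and away from irrelevant ones.
    Assumes the batch contains both relevant and irrelevant pairs."""
    norm_labels = F.normalize(labels.float(), dim=1)
    sim_l = norm_labels @ norm_labels.t()                                        # S^l, Equation (9)
    cos_hh = F.normalize(hash_codes, dim=1) @ F.normalize(hash_codes, dim=1).t()
    rel, irr = sim_l > 0, sim_l == 0
    loss_ld = (sim_l - cos_hh).clamp(min=0)[rel].mean() \
              + cos_hh.clamp(min=0)[irr].mean()                                  # Equation (11)

    cos_hp = F.normalize(hash_codes, dim=1) @ F.normalize(hash_proxies, dim=1).t()
    pos, neg = labels > 0, labels == 0
    loss_lp = -cos_hp[pos].mean() + cos_hp[neg].clamp(min=0).mean()              # Equation (12)
    return loss_ld, loss_lp
```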
For unlabeled data, since their pseudo-labels directly encode the structural relationships between the data and the categories, learning the relationships between unlabeled data points through pseudo-label information would instead introduce errors. Therefore, we directly map the structural relationship between the unlabeled data features and the feature proxies in the feature space, as contained in the pseudo-labels, to the relationship between the binary-like codes of the unlabeled data and the hash proxies in the hash space. Specifically, we bring the binary-like codes of unlabeled data closer to the relevant hash proxies by calculating $\cos_r(\mathbf{h}_i, \mathbf{p}_j^h)$ and push them apart from irrelevant ones by calculating $\cos_e(\mathbf{h}_i, \mathbf{p}_j^h)$:

$$\cos_r(\mathbf{h}_i, \mathbf{p}_j^h) = \max\left(L^u_{ij} - \frac{\mathbf{h}_i \cdot \mathbf{p}_j^h}{\|\mathbf{h}_i\| \cdot \|\mathbf{p}_j^h\|}, 0\right), \quad \cos_e(\mathbf{h}_i, \mathbf{p}_j^h) = \max\left(\frac{\mathbf{h}_i \cdot \mathbf{p}_j^h}{\|\mathbf{h}_i\| \cdot \|\mathbf{p}_j^h\|}, 0\right) \quad (13)$$

Therefore, the loss for unlabeled data in the hash space, $\mathcal{L}_h^u$, is computed using Equation (14):

$$\mathcal{L}_h^u = \sum_{* \in \{x,y\}} \left[ \frac{\sum_{i=1}^{N^u}\sum_{j=1}^{C} \mathbb{I}(L^u_{ij} > 0)\,\cos_r(\mathbf{h}_i^*, \mathbf{p}_j^h)}{\sum_{i=1}^{N^u}\sum_{j=1}^{C} \mathbb{I}(L^u_{ij} > 0)} + \frac{\sum_{i=1}^{N^u}\sum_{j=1}^{C} \mathbb{I}(L^u_{ij} = 0)\,\cos_e(\mathbf{h}_i^*, \mathbf{p}_j^h)}{\sum_{i=1}^{N^u}\sum_{j=1}^{C} \mathbb{I}(L^u_{ij} = 0)} \right] \quad (14)$$

The proposed Adaptive Dual-Label Loss $\mathcal{L}_h$ in the hash space is defined in Equation (15):

$$\mathcal{L}_h = \alpha \mathcal{L}_h^{ld} + \beta \mathcal{L}_h^{lp} + \min\left(1, \frac{epoch}{epoch_{num}}\right) \cdot \mathcal{L}_h^u \quad (15)$$

where $\alpha$ and $\beta$ are hyper-parameters. We dynamically adjust the weight of the loss for unlabeled data based on the epoch, ensuring that the model maintains the importance of labeled data while learning more stably from unlabeled data. In the early stages of training, the model relies more on the true labels to avoid negative impacts from unlabeled data; later, as the quality of the pseudo-labels improves, the weight gradually increases.
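A sketch of Equations (13)–(15) in the same style, with the epoch-dependent ramp on the unlabeled term; the per-modality terms would be summed over the image and text codes as in the equations.

```python
import torch
import torch.nn.functional as F

def unlabeled_hash_loss(hash_codes_u, hash_proxies, pseudo_labels):
    """Sketch of Equations (13)-(14) for one modality: pull binary-like codes of
    unlabeled data toward relevant hash proxies up to the confidence stored in the
    continuous pseudo-label, and push them away from irrelevant proxies."""
    cos = F.normalize(hash_codes_u, dim=1) @ F.normalize(hash_proxies, dim=1).t()  # (N^u, C)
    rel, irr = pseudo_labels > 0, pseudo_labels == 0
    pull = (pseudo_labels - cos).clamp(min=0)[rel].mean() if rel.any() else cos.new_zeros(())
    push = cos.clamp(min=0)[irr].mean() if irr.any() else cos.new_zeros(())
    return pull + push

def adaptive_dual_label_loss(loss_ld, loss_lp, loss_u, epoch, epoch_num, alpha=0.5, beta=1.0):
    """Equation (15): the weight of the unlabeled term ramps up linearly with the epoch."""
    weight_u = min(1.0, epoch / epoch_num)
    return alpha * loss_ld + beta * loss_lp + weight_u * loss_u
```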

3.4. Optimization

The pseudo-code for our proposed PSSCH method is provided in Algorithm 1. For query samples, we utilize the well-trained PSSCH model to map them into the hash space and then generate their hash codes using the sign function:

$$\operatorname{sign}(x) = \begin{cases} +1, & x > 0 \\ -1, & x < 0 \end{cases} \quad (16)$$

For image data $\mathbf{x}_i$ or text data $\mathbf{y}_i$, we generate the hash codes by Equation (17):

$$\mathbf{b}_i^x = \operatorname{sign}\big(\mathcal{H}_x(\mathbf{x}_i)\big), \quad \mathbf{b}_i^y = \operatorname{sign}\big(\mathcal{H}_y(\mathbf{y}_i)\big) \quad (17)$$
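A small sketch of Equations (16) and (17); mapping outputs that are exactly zero to −1 is an implementation choice not specified above.

```python
import torch

@torch.no_grad()
def generate_hash_codes(hash_net, queries):
    """Forward a query batch through a trained hashing network and binarize the
    binary-like outputs with the sign function."""
    binary_like = hash_net(queries)  # H_x(x_i) or H_y(y_i)
    ones = torch.ones_like(binary_like)
    return torch.where(binary_like > 0, ones, -ones)  # b_i in {-1, +1}^K
```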
Algorithm 1 The Pseudo-Code of the PSSCH Method
Input: Training dataset $\mathcal{D}$; the number of bits in the hash code $K$; the one-hot encodings of the categories $\mathbf{P}$; threshold and hyper-parameters $\alpha$, $\beta$.
Output: Image and text network parameters $\Theta_x$, $\Theta_y$; category proxy network parameters $\Theta_c$.
 1: Initialize $\Theta_x$, $\Theta_y$, and $\Theta_c$; iteration number: $epoch_{num}$; batch size: 128; learning rate: 0.001.
 2: Construct a similarity matrix $\mathbf{S}^l$ by Equation (9).
 3: while $iter < epoch_{num}$ do
 4:   Obtain the features $\mathbf{F}_*^l$ and binary-like codes $\mathbf{H}_*^l$ for labeled samples, as well as the features $\mathbf{F}_*^u$ and binary-like codes $\mathbf{H}_*^u$ for unlabeled samples, through forward propagation, along with the feature proxies $\mathbf{P}^f = \{\mathbf{p}_i^f\}_{i=1}^C$ and hash proxies $\mathbf{P}^h = \{\mathbf{p}_i^h\}_{i=1}^C$.
 5:   Generate pseudo-labels for unlabeled data by Equation (3).
 6:   Compute the feature proxy loss $\mathcal{L}_f$ by Equation (8).
 7:   Compute the Adaptive Dual-Label Loss $\mathcal{L}_h$ by Equation (15).
 8:   Update the category proxy network parameters $\Theta_c$ by backpropagation.
 9:   Update the image and text network parameters $\Theta_x$ and $\Theta_y$ by backpropagation.
10: end while
11: return
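The following PyTorch-style sketch ties the steps of Algorithm 1 together for a single training iteration; the network objects, batch tensors, and optimizers are placeholders for the components described above, and the loss functions are the sketches given earlier.

```python
def train_one_epoch(img_net, txt_net, cpnet, one_hot, batch, optimizers,
                    epoch, epoch_num, alpha=0.5, beta=1.0, threshold=0.5):
    """One pass over a mixed batch of labeled and unlabeled image-text pairs,
    following steps 4-9 of Algorithm 1. Assumes img_net/txt_net return
    (features, binary_like_codes) and that the building blocks
    (generate_pseudo_labels, feature_proxy_loss, labeled_hash_losses,
    unlabeled_hash_loss, adaptive_dual_label_loss) are defined as sketched above."""
    feature_proxies, hash_proxies = cpnet(one_hot)                               # step 4
    f_img_l, h_img_l = img_net(batch["img_l"]); f_txt_l, h_txt_l = txt_net(batch["txt_l"])
    f_img_u, h_img_u = img_net(batch["img_u"]); f_txt_u, h_txt_u = txt_net(batch["txt_u"])

    pseudo = generate_pseudo_labels(f_img_u, f_txt_u, feature_proxies, threshold)  # step 5

    loss_f = feature_proxy_loss(f_img_l, batch["labels"], feature_proxies) \
           + feature_proxy_loss(f_txt_l, batch["labels"], feature_proxies)         # step 6

    loss_ld, loss_lp = 0.0, 0.0
    for codes in (h_img_l, h_txt_l):  # sum over the two modalities
        ld, lp = labeled_hash_losses(codes, batch["labels"], hash_proxies)
        loss_ld, loss_lp = loss_ld + ld, loss_lp + lp
    loss_u = unlabeled_hash_loss(h_img_u, hash_proxies, pseudo) \
           + unlabeled_hash_loss(h_txt_u, hash_proxies, pseudo)
    loss_h = adaptive_dual_label_loss(loss_ld, loss_lp, loss_u,
                                      epoch, epoch_num, alpha, beta)               # step 7

    total = loss_f + loss_h
    for opt in optimizers:
        opt.zero_grad()
    total.backward()
    for opt in optimizers:                                                         # steps 8-9
        opt.step()
    return float(total)
```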

4. Experiments

This section evaluates the effectiveness of the proposed PSSCH by performing image–text cross-modal retrieval and comparing the results with those of several excellent methods.

4.1. Experimental Settings

Datasets. We evaluated our method on three widely used benchmark datasets: MIRFLICKR-25K [44], NUS-WIDE [45], and MS COCO [46]. A brief description of each is provided below:
  • MIRFLICKR-25K. This small-scale cross-modal dataset consists of 24,581 image–text pairs, spanning 24 categories, with each sample belonging to at least one category.
  • NUS-WIDE. Comprising 269,648 image–text pairs, this dataset includes 81 categories. We filtered out categories with fewer samples and selected 21 common categories, resulting in 195,834 image–text pairs.
  • MS COCO. A large-scale dataset commonly used in computer vision, containing 82,785 training images and 40,504 validation images. Each image is associated with textual descriptions and labels across 80 categories. For our experiments, we combined the training and validation sets, with each sample belonging to at least one of these categories.
Implementation Details. To facilitate comparison, we follow the settings of the baseline methods. The threshold was set to 0.5, and $epoch_{num}$ was set to 100. We applied the same procedure across the three public datasets: we randomly selected 10,000 samples as the training set, which were then divided into labeled and unlabeled data according to a given ratio. In addition, we randomly selected 5000 samples as the query set, with the remaining samples used as the database. In this process, images are resized to 224 × 224, and text is represented using BPE encoding.
Experimental Environment. We implemented our PSSCH method using PyTorch 1.12.1 on an NVIDIA RTX 3090 GPU. The batch size was set to 128, with 80 labeled samples and 48 unlabeled samples. The hyper-parameters $\alpha$ and $\beta$ were set to 0.5 and 1, respectively. We used two Transformer encoders as the backbone of the PSSCH method: ViT [41] for images and GPT-2 [47] for text. The backbone network parameters were initialized using pre-trained CLIP (ViT-B/32) features [48]. The parameters of ImgNet and TxtNet were updated using the Adam optimizer, with a learning rate of 0.00001 for the backbone and 0.001 for the hashing layer. The CPNet parameters were updated using the SGD optimizer with a learning rate of 0.001.
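As a concrete illustration of this setup, the sketch below builds the optimizers with the stated learning rates; the `backbone` and `hash_layer` attribute names are assumptions about how the networks are organized.

```python
import torch

def build_optimizers(img_net, txt_net, cpnet):
    """Adam for the two hashing networks (small LR for the pre-trained backbones,
    larger LR for the hashing layers) and SGD for the category proxy network."""
    opt_img = torch.optim.Adam([
        {"params": img_net.backbone.parameters(), "lr": 1e-5},    # pre-trained ViT backbone
        {"params": img_net.hash_layer.parameters(), "lr": 1e-3},  # hashing head
    ])
    opt_txt = torch.optim.Adam([
        {"params": txt_net.backbone.parameters(), "lr": 1e-5},    # pre-trained GPT-2 backbone
        {"params": txt_net.hash_layer.parameters(), "lr": 1e-3},
    ])
    opt_proxy = torch.optim.SGD(cpnet.parameters(), lr=1e-3)
    return opt_img, opt_txt, opt_proxy
```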
Baseline Methods. We selected 10 excellent deep cross-modal hashing methods for comparison, including four supervised cross-modal hashing methods, i.e., LEMON [49], EDMH [50], HCCH [51], and HMAH [52]; four semi-supervised cross-modal hashing methods, i.e., SSCH [22], MGCH [19], TS3H [14], and GCSCH [20]; and two unsupervised cross-modal hashing methods, i.e., DGCPN [53] and UCCH [54]. For the supervised methods, we train using only the labeled data. Because some methods are not open source, we directly cite the results from their published papers. A brief introduction to each baseline follows:
  • LEMON embeds label information into the hash learning process to fully utilize the semantic information of labels to guide the learning of hash functions.
  • EDMH proposes a discrete optimization algorithm that seamlessly integrates three useful discrete constraints into a joint hashing learning model.
  • HCCH mitigates the loss of important discriminative information with a coarse-to-fine hierarchical hashing scheme that refines useful discriminative information step by step using a two-layer hashing function.
  • HMAH creates a hierarchical message aggregation network within a teacher–student framework, enhancing the alignment of heterogeneous modalities and modeling detailed cross-modal correlations.
  • SSCH obtains enhanced semantic information through a pseudo-labeling process that does not require alignment and learns the hash representations of various data via a label enhancement strategy.
  • MGCH employs a multi-view graph to connect the data, utilizing anchor points as a unified semantic hub to achieve semi-supervised cross-modal hashing.
  • TS3H utilizes supervised information to learn classifiers for the different modalities that predict the labels of unlabeled data, and then learns the hash codes by combining both the new and old labels.
  • GCSCH designs a fusion network to integrate the two modalities and uses a graph convolutional network to capture semantic information from both real-labeled and pseudo-labeled multi-modal data.
  • DGCPN utilizes graph models to explore graph-neighbor consistency, which helps to address inaccurate similarity calculation in unsupervised cross-modal hashing.
  • UCCH proposes a novel momentum optimizer for learnable hashing in contrastive learning and designs a cross-modal ranking learning loss.
Evaluation Protocols. We evaluated our method by comparing it with the baseline approaches on two cross-modal retrieval tasks: image-to-text retrieval (I→T) and text-to-image retrieval (T→I). For performance assessment, we employed standard evaluation metrics, including mean Average Precision (mAP) and precision–recall (PR) curves. mAP is the average of the Average Precision (AP) over all queries and is the most commonly used evaluation metric in cross-modal hashing. It is calculated as in Equation (18):

$$mAP = \frac{1}{N_q} \sum_{i=1}^{N_q} AP(i) \quad (18)$$

where $AP(i)$ represents the average precision of query sample $i$ and $N_q$ denotes the total number of query samples. The PR curve illustrates the relationship between recall and precision. The results on these metrics show that the PSSCH method performs exceptionally well in cross-modal similarity search.
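For reference, a NumPy sketch of Equation (18) for hashing retrieval, assuming codes in {−1, +1} and counting a database item as relevant when it shares at least one label with the query; whether to truncate the ranking at a top-k cutoff depends on the evaluation protocol.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels, topk=None):
    """Sketch of Equation (18): rank the database by Hamming distance for each query
    and average the per-query AP; relevance = sharing at least one label."""
    average_precisions = []
    for q_code, q_label in zip(query_codes, query_labels):
        hamming = 0.5 * (db_codes.shape[1] - db_codes @ q_code)   # codes in {-1, +1}
        order = np.argsort(hamming)
        relevant = (db_labels[order] @ q_label > 0).astype(np.float64)
        if topk is not None:
            relevant = relevant[:topk]
        hits = relevant.sum()
        if hits == 0:
            continue  # queries with no relevant item are skipped
        precision_at_k = np.cumsum(relevant) / (np.arange(relevant.size) + 1)
        average_precisions.append((precision_at_k * relevant).sum() / hits)
    return float(np.mean(average_precisions)) if average_precisions else 0.0
```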

4.2. Performance Comparison

We compared the PSSCH method with the baseline methods on three public datasets. The mAP results are shown in Table 2. PSSCH generally outperforms the other baseline methods, achieving satisfactory performance. On the MIRFLICKR-25K dataset, the PSSCH method achieves an average mAP that is 2.5% higher than that of the best baseline method in the image-to-text retrieval task, and it achieves the best performance in the 64-bit text-to-image retrieval task. However, in the text-to-image retrieval task, PSSCH performs slightly worse than GCSCH at 16 bits and 32 bits. We speculate that the size of the hash space may affect the performance of our method. Moreover, since the MIRFLICKR-25K dataset is not very large, the data augmentation that GCSCH applies to the training images may also explain why it performs better in the text-to-image retrieval task. On the NUS-WIDE dataset, PSSCH shows an average mAP improvement of about 1% over the best baseline method in both retrieval tasks. On the MS COCO dataset, the improvement is most pronounced, with an average mAP that is 3% higher than that of the best baseline method. We speculate that this is because MS COCO contains 80 categories, far more than the other two datasets; in scenarios with a larger number of categories, learning the relationships between data and categories may yield greater benefits. To demonstrate that the improvements of our method are significant and not due to randomness, we conducted experiments on the three datasets at 64 bits to obtain 10 mAP results and then performed a paired t-test with the best baseline method. The test results, presented in Table 3, confirm that the performance gain of the PSSCH method is significant.
Figure 2 shows the precision–recall curve results for the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets with 64-bit codes. Precision refers to the proportion of relevant items in the retrieval results, while recall indicates the proportion of all relevant items that are correctly retrieved by the model. The PR curves for the PSSCH method outperform other baseline methods in both retrieval tasks across the three datasets. Figure 3 shows the mAP of the semi-supervised cross-modal hashing methods under different percentages of labeled samples on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets. As can be seen, with the reduction in the number of labeled samples, the mAP of the PSSCH method decreases sharply, which was expected. We must acknowledge the importance of true labels. Our feature proxies require real labeled samples for constraints in order to ensure the accurate generation of pseudo-labels for unlabeled samples. Moreover, the mAP results of the PSSCH method with 50% labeled samples are close to those obtained with 90% labeled samples, further confirming the practicality of the PSSCH method.

4.3. Ablation Studies

To validate the effectiveness of the PSSCH method, we implemented three variants and calculated the mAP values for the I→T and T→I tasks. Specifically: (1) PSSCH-1: using discrete {0, 1} pseudo-labels and applying the labeled-data loss $\mathcal{L}_h^{lp}$ to the unlabeled data in the hash space; (2) PSSCH-2: not dynamically adjusting the weight of the unlabeled-data loss $\mathcal{L}_h^u$; (3) PSSCH-3: removing the loss function $\mathcal{L}_f$ in the feature space.
The results of the ablation experiments are shown in Table 4. Since PSSCH-1 uses discrete {0, 1} pseudo-labels, which require quantizing the results after pseudo-label generation, information is inevitably lost in this process. In contrast, continuous pseudo-labels can directly represent the structural relationship between data and the relevant categories. Comparing the results of PSSCH-1 and PSSCH on the three benchmark datasets shows that PSSCH effectively utilizes the structural relationships within continuous pseudo-labels to achieve better outcomes. PSSCH-2 eliminates the dynamic weight adjustment function $\min\left(1, \frac{epoch}{epoch_{num}}\right)$.
In the early stages of training, when the feature proxies have not been sufficiently trained, generating pseudo-labels based on the similarity between the pre-trained features of unlabeled data and the feature proxies introduces noise, and assigning a large weight to the unlabeled-data loss at this stage can negatively impact model performance. Comparing the results of PSSCH-2 and PSSCH indicates that, without dynamic adjustment of the unlabeled-data loss weight, the initial low-quality pseudo-labels severely affect training, resulting in suboptimal outcomes. Comparing the results of PSSCH-3 and PSSCH, we see that PSSCH-3 removes the loss function $\mathcal{L}_f$ in the feature space and only updates the feature proxies through the hash loss during backpropagation, without constraining the feature proxies at the feature level. The resulting lack of guarantee in pseudo-label generation leads to suboptimal results. Comparing these three variants verifies the effectiveness of each component of the PSSCH method.

4.4. Sensitivity to Hyper-Parameters

α and β represent the relative importance of two types of loss in the total loss. Specifically, α denotes the importance of the loss L h l d , which constrains the relationship between data and data. Meanwhile, β indicates the importance of the loss L h l p , which constrains the relationships between data and categories. We investigate the sensitivity of the parameters α and β . We set their ranges to 0.001, 0.01, 0.05, 0.1, 0.5, 0.8, 1. As shown in Figure 4, when either α or β is close to 0, the retrieval performance decreases. When α and β are set to 0.5 and 1, respectively, the model performs optimally. The model is still quite sensitive to hyper-parameters α and β . This also validates that, during model training, it is essential to consider both the relationship between data and data and the relationships between data and categories.

4.5. Training and Encoding Time

To investigate the efficiency of the PSSCH method, we compared its training time and encoding time with those of the baseline methods on the MIRFLICKR-25K dataset at 64 bits. For the training time, we analyzed the time complexity of the PSSCH algorithm. The time complexity of the feature proxy loss $\mathcal{L}_f$ is $O(N^l C)$, where $C$ represents the number of categories, while the time complexity of the Adaptive Dual-Label Loss is $O(NC + (N^l)^2)$, where $N^l$ denotes the number of labeled samples. Thus, the overall time complexity of PSSCH is $O(NC + (N^l)^2 + N^l C)$, where $N$ represents the total number of training samples. As shown in Figure 5a, our training time is slightly higher than that of other methods, but the training process is conducted offline, so the training time does not affect retrieval performance. In the field of cross-modal hashing retrieval, more attention is often paid to inference time, which mainly consists of encoding time and Hamming distance computation time. As shown in Figure 5b, the encoding time is in the millisecond range for all methods, demonstrating that the PSSCH method has a reasonable encoding time.

4.6. Visualization

We validate the proposed PSSCH method by visualizing the learned hash codes using t-SNE [55]. Specifically, we select samples from seven different single-label categories in the NUS-WIDE dataset with a hash code length of 32 bits and compare them with the TS3H and GCSCH methods. The results are shown in Figure 6, where dots of different colors represent different single-label categories. To further assess the clustering quality after dimensionality reduction, we computed the Davies–Bouldin Index (DBI) for the three methods. The DBI is a commonly used metric for evaluating clustering performance, as it measures the separation between clusters as well as the compactness within clusters; the smaller the DBI value, the better the clustering. The DBI scores for TS3H, GCSCH, and PSSCH are 1.9742, 1.7121, and 1.5935, respectively, which further demonstrates that the PSSCH method achieves superior performance.
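A short scikit-learn sketch of this visualization protocol, assuming the DBI is computed on the 2D t-SNE embedding with the single-label category as the cluster assignment.

```python
from sklearn.manifold import TSNE
from sklearn.metrics import davies_bouldin_score

def visualize_and_score(hash_codes, category_ids):
    """Project hash codes to 2D with t-SNE and report the Davies-Bouldin Index
    (lower is better), using the single-label categories as cluster assignments."""
    embedded = TSNE(n_components=2, init="pca", random_state=0).fit_transform(hash_codes)
    return embedded, davies_bouldin_score(embedded, category_ids)
```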

5. Conclusions

In this paper, we propose the PSSCH framework for semi-supervised cross-modal hashing retrieval, enabling the model to fully utilize the structural information in continuous pseudo-labels while simultaneously considering the relationships between data and the relationships between data and categories. However, the performance of the PSSCH method is not fully demonstrated on small-scale datasets or when the hash space is reduced. In future work, we plan to address this limitation by performing data augmentation on training data for small-scale datasets or applying balance constraints to hash codes. Additionally, we will explore extending the PSSCH method to other modalities, such as speech and video, to validate its versatility and effectiveness.

Author Contributions

Conceptualization, H.C.; methodology, H.C.; software, X.Z.; validation, H.C., Z.Z. and X.Z.; formal analysis, H.C.; investigation, H.C.; resources, X.Z.; data curation, Z.Z.; writing—original draft preparation, H.C.; writing—review and editing, X.Z.; visualization, H.C.; supervision, X.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The public datasets used in this paper can be accessed through the following links: MIRFLICKR-25K: https://press.liacs.nl/mirflickr/mirdownload.html, accessed on 19 December 2024; NUS-WIDE: https://pan.baidu.com/s/1BzEho9BvpWX93sMA4A_o-A?pwd=swq3, accessed on 19 December 2024; MS COCO: https://cocodataset.org, accessed on 19 December 2024.

Acknowledgments

The authors would like to express their gratitude to the technical staff at the Hunan Agricultural University, Teaching Building 13, Lab 511 and Lab 503 for their invaluable support throughout the experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, S.; Qian, S.; Guan, Y.; Zhan, J.; Ying, L. Joint-modal distribution-based similarity hashing for large-scale unsupervised deep cross-modal retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, 25–30 July 2020; pp. 1379–1388. [Google Scholar]
  2. Shi, Y.; You, X.; Zheng, F.; Wang, S.; Peng, Q. Equally-Guided Discriminative Hashing for Cross-modal Retrieval. In Proceedings of the IJCAI, Macao, China, 10–16 August 2019; pp. 4767–4773. [Google Scholar]
  3. Qin, Q.; Huo, Y.; Huang, L.; Dai, J.; Zhang, H.; Zhang, W. Deep Neighborhood-preserving Hashing with Quadratic Spherical Mutual Information for Cross-modal Retrieval. IEEE Trans. Multimed. 2024, 26, 6361–6374. [Google Scholar] [CrossRef]
  4. Wu, Q.; Zhang, Z.; Liu, Y.; Zhang, J.; Nie, L. Contrastive Multi-Bit Collaborative Learning for Deep Cross-Modal Hashing. IEEE Trans. Knowl. Data Eng. 2024, 36, 5835–5848. [Google Scholar] [CrossRef]
  5. Song, G.; Huang, K.; Su, H.; Song, F.; Yang, M. Deep Ranking Distribution Preserving Hashing for Robust Multi-Label Cross-modal Retrieval. IEEE Trans. Multimed. 2024, 26, 7027–7042. [Google Scholar] [CrossRef]
  6. Wang, X.; Zou, X.; Bakker, E.M.; Wu, S. Self-constraining and attention-based hashing network for bit-scalable cross-modal retrieval. Neurocomputing 2020, 400, 255–271. [Google Scholar] [CrossRef]
  7. Liu, Y.; Wu, Q.; Zhang, Z.; Zhang, J.; Lu, G. Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 893–902. [Google Scholar]
  8. Song, G.; Su, H.; Huang, K.; Song, F.; Yang, M. Deep self-enhancement hashing for robust multi-label cross-modal retrieval. Pattern Recognit. 2024, 147, 110079. [Google Scholar] [CrossRef]
  9. Gao, Z.; Wang, J.; Yu, G.; Yan, Z.; Domeniconi, C.; Zhang, J. Long-tail cross modal hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 7642–7650. [Google Scholar]
  10. Zhu, L.; Cai, L.; Song, J.; Zhu, X.; Zhang, C.; Zhang, S. MSSPQ: Multiple Semantic Structure-Preserving Quantization for Cross-Modal Retrieval. In Proceedings of the ICMR ’22: International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; Oria, V., Sapino, M.L., Satoh, S., Kerhervé, B., Cheng, W., Ide, I., Singh, V.K., Eds.; ACM: New York, NY, USA, 2022; pp. 631–638. [Google Scholar]
  11. Li, F.; Wang, B.; Zhu, L.; Li, J.; Zhang, Z.; Chang, X. Cross-Domain Transfer Hashing for Efficient Cross-Modal Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9664–9677. [Google Scholar] [CrossRef]
  12. Wang, Y.; Dong, F.; Wang, K.; Nie, X.; Chen, Z. Weighted cross-modal hashing with label enhancement. Knowl. Based Syst. 2024, 293, 111657. [Google Scholar] [CrossRef]
  13. Zhang, C.; Song, J.; Zhu, X.; Zhu, L.; Zhang, S. HCMSL: Hybrid Cross-modal Similarity Learning for Cross-modal Retrieval. ACM Trans. Multim. Comput. Commun. Appl. 2021, 17, 2:1–2:22. [Google Scholar] [CrossRef]
  14. Fan, W.; Zhang, C.; Li, H.; Jia, X.; Wang, G. Three-stage semisupervised cross-modal hashing with pairwise relations exploitation. IEEE Trans. Neural Netw. Learn. Syst. 2023. [Google Scholar] [CrossRef]
  15. Wang, J.; Li, G.; Pan, P.; Zhao, X. Semi-supervised semantic factorization hashing for fast cross-modal retrieval. Multimed. Tools Appl. 2017, 76, 20197–20215. [Google Scholar] [CrossRef]
  16. Wang, X.; Liu, X.; Peng, S.J.; Zhong, B.; Chen, Y.; Du, J.X. Semi-supervised discrete hashing for efficient cross-modal retrieval. Multimed. Tools Appl. 2020, 79, 25335–25356. [Google Scholar] [CrossRef]
  17. Liu, X.; Yu, G.; Domeniconi, C.; Wang, J.; Xiao, G.; Guo, M. Weakly supervised cross-modal hashing. IEEE Trans. Big Data 2019, 8, 552–563. [Google Scholar] [CrossRef]
  18. Yang, L.; Zhang, K.; Li, Y.; Chen, Y.; Long, J.; Yang, Z. S3ACH: Semi-Supervised Semantic Adaptive Cross-Modal Hashing. In Proceedings of the International Conference on Neural Information Processing, Changsha, China, 20–23 November 2023; Springer: Berlin/Heidelberg, Germany, 2023; pp. 252–269. [Google Scholar]
  19. Shen, X.; Zhang, H.; Li, L.; Yang, W.; Liu, L. Semi-supervised cross-modal hashing with multi-view graph representation. Inf. Sci. 2022, 604, 45–60. [Google Scholar] [CrossRef]
  20. Shen, X.; Yu, G.; Chen, Y.; Yang, X.; Zheng, Y. Graph Convolutional Semi-Supervised Cross-Modal Hashing. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October 2024–1 November 2024; pp. 5930–5938. [Google Scholar]
  21. Su, M.; Gu, G.; Ren, X.; Fu, H.; Zhao, Y. Semi-supervised knowledge distillation for cross-modal hashing. IEEE Trans. Multimed. 2021, 25, 662–675. [Google Scholar] [CrossRef]
  22. Zhang, X.; Liu, X.; Nie, X.; Kang, X.; Yin, Y. Semi-supervised semi-paired cross-modal hashing. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 6517–6529. [Google Scholar] [CrossRef]
  23. Jiang, Q.Y.; Li, W.J. Deep cross-modal hashing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 3232–3240. [Google Scholar]
  24. Zou, X.; Wu, S.; Bakker, E.M.; Wang, X. Multi-label enhancement based self-supervised deep cross-modal hashing. Neurocomputing 2022, 467, 138–162. [Google Scholar] [CrossRef]
  25. Xu, C.; Chai, Z.; Xu, Z.; Li, H.; Zuo, Q.; Yang, L.; Yuan, C. HHF: Hashing-guided hinge function for deep hashing retrieval. IEEE Trans. Multimed. 2022, 25, 7428–7440. [Google Scholar] [CrossRef]
  26. Shu, Z.; Bai, Y.; Zhang, D.; Yu, J.; Yu, Z.; Wu, X.J. Specific class center guided deep hashing for cross-modal retrieval. Inf. Sci. 2022, 609, 304–318. [Google Scholar] [CrossRef]
  27. Tu, R.C.; Mao, X.L.; Tu, R.X.; Bian, B.; Cai, C.; Wang, H.; Wei, W.; Huang, H. Deep cross-modal proxy hashing. IEEE Trans. Knowl. Data Eng. 2022, 35, 6798–6810. [Google Scholar] [CrossRef]
  28. Tu, R.C.; Mao, X.L.; Ji, W.; Wei, W.; Huang, H. Data-aware proxy hashing for cross-modal retrieval. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 686–696. [Google Scholar]
  29. Huo, Y.; Qin, Q.; Dai, J.; Wang, L.; Zhang, W.; Huang, L.; Wang, C. Deep semantic-aware proxy hashing for multi-label cross-modal retrieval. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 576–589. [Google Scholar] [CrossRef]
  30. Hu, Z.; Cheung, Y.m.; Li, M.; Lan, W. Cross-Modal Hashing Method with Properties of Hamming Space: A New Perspective. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7636–7650. [Google Scholar] [CrossRef] [PubMed]
  31. Chen, H.; Zhu, L.; Zhu, X. Deep Class-guided Hashing for Multi-label Cross-modal Retrieval. arXiv 2024, arXiv:2410.15387. [Google Scholar]
  32. Li, X.; Hu, D.; Nie, F. Deep binary reconstruction for cross-modal hashing. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 1398–1406. [Google Scholar]
  33. Wu, G.; Lin, Z.; Han, J.; Liu, L.; Ding, G.; Zhang, B.; Shen, J. Unsupervised Deep Hashing via Binary Latent Factor Models for Large-scale Cross-modal Retrieval. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; The AAAI Press: Cambridge, MA, USA, 2018; pp. 2854–2860. [Google Scholar]
  34. Su, S.; Zhong, Z.; Zhang, C. Deep joint-semantics reconstructing hashing for large-scale unsupervised cross-modal retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27–28 October 2019; pp. 3027–3035. [Google Scholar]
  35. Zhang, P.F.; Li, Y.; Huang, Z.; Xu, X.S. Aggregation-based graph convolutional hashing for unsupervised cross-modal retrieval. IEEE Trans. Multimed. 2021, 24, 466–479. [Google Scholar] [CrossRef]
  36. Zhu, L.; Wu, X.; Li, J.; Zhang, Z.; Guan, W.; Shen, H.T. Work together: Correlation-identity reconstruction hashing for unsupervised cross-modal retrieval. IEEE Trans. Knowl. Data Eng. 2022, 35, 8838–8851. [Google Scholar] [CrossRef]
  37. Wu, F.; Li, S.; Gao, G.; Ji, Y.; Jing, X.Y.; Wan, Z. Semi-supervised cross-modal hashing via modality-specific and cross-modal graph convolutional networks. Pattern Recognit. 2023, 136, 109211. [Google Scholar] [CrossRef]
  38. Huang, Y.; Hu, B.; Zhang, Y.; Gao, C.; Wang, Q. A semi-supervised cross-modal memory bank for cross-modal retrieval. Neurocomputing 2024, 579, 127430. [Google Scholar] [CrossRef]
  39. Wang, J.; Gong, T.; Yan, Y. Semi-supervised Prototype Semantic Association Learning for Robust Cross-modal Retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; pp. 872–881. [Google Scholar]
  40. Tu, J.; Liu, X.; Lin, Z.; Hong, R.; Wang, M. Differentiable cross-modal hashing via multimodal transformers. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 453–461. [Google Scholar]
  41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021; pp. 1–22. [Google Scholar]
  42. Rawat, A.; Dua, I.; Gupta, S.; Tallamraju, R. Semi-supervised Domain Adaptation by Similarity Based Pseudo-Label Injection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2022; pp. 150–166. [Google Scholar]
  43. Abdelfattah, R.; Guo, Q.; Li, X.; Wang, X.; Wang, S. Cdul: Clip-driven unsupervised learning for multi-label image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 1348–1357. [Google Scholar]
  44. Huiskes, M.J.; Lew, M.S. The mir flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, Vancouver, BC, Canada, 30–31 October 2008; pp. 39–43. [Google Scholar]
  45. Chua, T.S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. Nus-wide: A real-world web image database from national university of singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, Santorini Island, Greece, 8–10 July 2009; pp. 1–9. [Google Scholar]
  46. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  47. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  48. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Virtually, 13–15 December 2021; pp. 8748–8763. [Google Scholar]
  49. Wang, Y.; Luo, X.; Xu, X.S. Label embedding online hashing for cross-modal retrieval. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 871–879. [Google Scholar]
  50. Chen, Y.; Zhang, H.; Tian, Z.; Wang, J.; Zhang, D.; Li, X. Enhanced discrete multi-modal hashing: More constraints yet less time to learn. IEEE Trans. Knowl. Data Eng. 2020, 34, 1177–1190. [Google Scholar] [CrossRef]
  51. Sun, Y.; Ren, Z.; Hu, P.; Peng, D.; Wang, X. Hierarchical consensus hashing for cross-modal retrieval. IEEE Trans. Multimed. 2023, 26, 824–836. [Google Scholar] [CrossRef]
  52. Tan, W.; Zhu, L.; Li, J.; Zhang, H.; Han, J. Teacher-student learning: Efficient hierarchical message aggregation hashing for cross-modal retrieval. IEEE Trans. Multimed. 2022, 25, 4520–4532. [Google Scholar] [CrossRef]
  53. Yu, J.; Zhou, H.; Zhan, Y.; Tao, D. Deep graph-neighbor coherence preserving network for unsupervised cross-modal hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 4626–4634. [Google Scholar]
  54. Hu, P.; Zhu, H.; Lin, J.; Peng, D.; Zhao, Y.P.; Peng, X. Unsupervised contrastive cross-modal hashing. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3877–3889. [Google Scholar] [CrossRef] [PubMed]
  55. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. The proposed PSSCH framework. PSSCH consists of three networks: ImgNet, TxtNet, and CPNet. The CPNet takes the one-hot encoding of categories as input and, through forward propagation, generates feature proxies and hash proxies. The feature proxies are updated using the losses $\mathcal{L}_f^I$ and $\mathcal{L}_f^T$. The cosine similarity between the unlabeled data features and the feature proxies is calculated by the pseudo-label generation module to generate continuous pseudo-labels. Finally, the hash codes of the data, the hash proxies, and the continuous pseudo-labels are used to compute the Adaptive Dual-Label Loss, and the network parameters of TxtNet, ImgNet, and CPNet are updated via backpropagation.
Figure 2. Precision–recall curve results on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets. The code length is 64.
Figure 3. The mAP of the semi-supervised cross-modal hashing methods under different percentages of labeled samples on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets.
Figure 4. Sensitivity of parameters $\alpha$ and $\beta$ on the MIRFLICKR-25K, NUS-WIDE, and MS COCO datasets. The code length is 64.
Figure 5. Comparison of the training time and encoding time with baseline methods on the MIRFLICKR-25K dataset.
Figure 6. t-SNE visualization results of TS3H, GCSCH, and PSSCH on the NUS-WIDE dataset with 32-bit codes.
Table 1. A summary of common symbols and notations.

Notation | Definition
$\mathcal{D}$ | Training dataset
$N$ | Number of samples
$\mathcal{H}$ | Hash functions
$\mathcal{G}$ | Feature extractor
$\mathbb{I}$ | Indicator function
$\mathbf{x}_i$ | The $i$-th image data
$\mathbf{y}_i$ | The $i$-th text data
$C$ | The number of categories
$K$ | The number of bits in the hash code
$\mathbf{F}$ | Features of data
$\mathbf{P}^f$ | Feature proxies
$\mathbf{P}^h$ | Hash proxies
$\mathbf{B}$ | Hash code
$\mathbf{S}$ | Similarity matrix
$epoch_{num}$ | The total number of iterations
Table 2. Comparison of mAP with baseline methods on MIRFLICKR-25K, NUS-WIDE, and MS COCO based on 30% labeled samples. The best results are in bold font.

Task | Method | MIRFLICKR-25K (16 / 32 / 64 bits) | NUS-WIDE (16 / 32 / 64 bits) | MS COCO (16 / 32 / 64 bits)
I→T | DGCPN (AAAI 21) | 0.703 / 0.713 / 0.720 | 0.566 / 0.589 / 0.601 | 0.575 / 0.613 / 0.630
I→T | UCCH (TPAMI 23) | 0.734 / 0.741 / 0.739 | 0.590 / 0.610 / 0.618 | 0.562 / 0.569 / 0.590
I→T | LEMON (MM 20) | 0.651 / 0.670 / 0.682 | 0.460 / 0.491 / 0.507 | 0.492 / 0.438 / 0.527
I→T | EDMH (TKDE 22) | 0.651 / 0.657 / 0.646 | 0.460 / 0.477 / 0.461 | 0.502 / 0.497 / 0.427
I→T | HMAH (TMM 22) | 0.755 / 0.743 / 0.753 | 0.606 / 0.636 / 0.639 | 0.558 / 0.569 / 0.594
I→T | HCCH (TMM 23) | 0.719 / 0.730 / 0.736 | 0.625 / 0.638 / 0.649 | 0.560 / 0.606 / 0.634
I→T | MGCH (IS 22) | 0.689 / 0.705 / 0.729 | 0.525 / 0.514 / 0.595 | 0.615 / 0.562 / 0.607
I→T | SSCH (TCSVT 23) | 0.622 / 0.670 / 0.685 | 0.479 / 0.524 / 0.539 | 0.435 / 0.441 / 0.479
I→T | TS3H (TNNLS 23) | 0.717 / 0.741 / 0.742 | 0.613 / 0.642 / 0.671 | 0.618 / 0.624 / 0.690
I→T | GCSCH (MM 24) | 0.772 / 0.776 / 0.785 | 0.658 / 0.677 / 0.673 | 0.619 / 0.675 / 0.701
I→T | PSSCH (Ours) | 0.794 / 0.797 / 0.818 | 0.666 / 0.678 / 0.684 | 0.644 / 0.689 / 0.723
T→I | DGCPN (AAAI 21) | 0.692 / 0.701 / 0.710 | 0.578 / 0.596 / 0.601 | 0.572 / 0.609 / 0.625
T→I | UCCH (TPAMI 23) | 0.722 / 0.726 / 0.725 | 0.600 / 0.616 / 0.626 | 0.553 / 0.560 / 0.586
T→I | LEMON (MM 20) | 0.666 / 0.695 / 0.708 | 0.472 / 0.508 / 0.517 | 0.487 / 0.475 / 0.535
T→I | EDMH (TKDE 22) | 0.668 / 0.677 / 0.667 | 0.475 / 0.487 / 0.477 | 0.501 / 0.494 / 0.427
T→I | HMAH (TMM 22) | 0.721 / 0.703 / 0.705 | 0.546 / 0.578 / 0.559 | 0.549 / 0.558 / 0.578
T→I | HCCH (TMM 23) | 0.721 / 0.740 / 0.742 | 0.631 / 0.632 / 0.649 | 0.556 / 0.588 / 0.647
T→I | MGCH (IS 22) | 0.675 / 0.695 / 0.719 | 0.541 / 0.515 / 0.607 | 0.601 / 0.553 / 0.586
T→I | SSCH (TCSVT 23) | 0.623 / 0.664 / 0.688 | 0.482 / 0.526 / 0.557 | 0.440 / 0.443 / 0.474
T→I | TS3H (TNNLS 23) | 0.727 / 0.753 / 0.748 | 0.622 / 0.653 / 0.674 | 0.614 / 0.618 / 0.687
T→I | GCSCH (MM 24) | 0.780 / 0.791 / 0.791 | 0.661 / 0.673 / 0.684 | 0.620 / 0.661 / 0.688
T→I | PSSCH (Ours) | 0.774 / 0.787 / 0.803 | 0.671 / 0.683 / 0.692 | 0.657 / 0.702 / 0.728
Table 3. Paired t-test results comparing mAP values of PSSCH with GCSCH on MIRFLICKR-25K, NUS-WIDE, and MS COCO at 64 bits.

Task | Dataset | t-Statistic | p-Value | Conclusion
I→T | MIRFLICKR-25K | 19.42 | p < 0.001 | PSSCH is significantly better than GCSCH
I→T | NUS-WIDE | 5.74 | p < 0.001 | PSSCH is significantly better than GCSCH
I→T | MS COCO | 15.25 | p < 0.001 | PSSCH is significantly better than GCSCH
T→I | MIRFLICKR-25K | 11.00 | p < 0.001 | PSSCH is significantly better than GCSCH
T→I | NUS-WIDE | 9.08 | p < 0.001 | PSSCH is significantly better than GCSCH
T→I | MS COCO | 29.03 | p < 0.001 | PSSCH is significantly better than GCSCH
Table 4. mAP results of PSSCH and its variants on MIRFLICKR-25K, NUS-WIDE, and MS COCO. The best results are in bold font.

Task | Method | MIRFLICKR-25K (16 / 32 / 64 bits) | NUS-WIDE (16 / 32 / 64 bits) | MS COCO (16 / 32 / 64 bits)
Img2Txt | PSSCH-1 | 0.783 / 0.789 / 0.801 | 0.661 / 0.669 / 0.674 | 0.641 / 0.672 / 0.711
Img2Txt | PSSCH-2 | 0.711 / 0.723 / 0.732 | 0.604 / 0.615 / 0.624 | 0.610 / 0.623 / 0.644
Img2Txt | PSSCH-3 | 0.764 / 0.771 / 0.784 | 0.643 / 0.652 / 0.659 | 0.638 / 0.664 / 0.697
Img2Txt | PSSCH (Ours) | 0.794 / 0.797 / 0.818 | 0.666 / 0.678 / 0.684 | 0.644 / 0.689 / 0.723
Txt2Img | PSSCH-1 | 0.771 / 0.776 / 0.786 | 0.665 / 0.673 / 0.682 | 0.645 / 0.684 / 0.709
Txt2Img | PSSCH-2 | 0.703 / 0.709 / 0.713 | 0.594 / 0.603 / 0.609 | 0.604 / 0.618 / 0.625
Txt2Img | PSSCH-3 | 0.752 / 0.758 / 0.767 | 0.641 / 0.647 / 0.658 | 0.630 / 0.662 / 0.689
Txt2Img | PSSCH (Ours) | 0.774 / 0.787 / 0.803 | 0.671 / 0.683 / 0.692 | 0.657 / 0.702 / 0.728
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
