Article

Graph-Propagated Multi-Scale Hashing with Contrastive Learning for Unsupervised Cross-Modal Retrieval

College of Electronics and Information Engineering, Shanghai University of Electric Power, Shanghai 200090, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 389; https://doi.org/10.3390/app16010389
Submission received: 1 December 2025 / Revised: 24 December 2025 / Accepted: 27 December 2025 / Published: 30 December 2025
(This article belongs to the Special Issue New Advances in Information Retrieval)

Abstract

This paper introduces Graph-Propagated Multi-Scale Hashing with Contrastive Learning (GPMCL), a novel unsupervised cross-modal hashing framework designed to address the semantic deficiency in large-scale unlabeled multimodal data. GPMCL first constructs an initial similarity matrix via cross-modal graph propagation, effectively capturing potential inter-modal relationships. A multi-scale enhancement strategy is then employed to integrate both local and global similarities, resulting in a more informative and robust similarity representation. To adaptively distinguish sample relationships, a Gaussian Mixture Model (GMM) is utilized to determine dynamic thresholds. Additionally, contrastive learning is incorporated in the feature space to enhance intra-class compactness and inter-class separability. Extensive experiments conducted on three public benchmark datasets demonstrate that GPMCL consistently outperforms existing state-of-the-art unsupervised cross-modal hashing methods in terms of retrieval performance. These results validate the effectiveness and generalization capability of the proposed method, highlighting its potential for practical cross-modal retrieval applications.

1. Introduction

Multimodal learning is a significant area of research within deep learning. To effectively integrate information from different modalities, it is essential to address the inherent differences among them [1]. The volume of data available online is increasing rapidly, and finding efficient methods to retrieve multimodal information from this vast pool presents a major challenge in contemporary multimodal processing. Cross-modal hashing establishes a hashing relationship among various modalities to enhance the multimodal retrieval process [2], improving efficiency while ensuring a comprehensive approach.
Recent advancements in deep learning algorithms have accelerated the development of cross-modal hashing, making the integration of deep learning a prevalent trend in this field [3]. The research community has engineered a diverse assortment of learning strategies, covering both supervised and unsupervised frameworks; however, many supervised techniques rely heavily on a substantial number of manually labeled training samples [4,5]. The acquisition of annotated data entails substantial financial costs and considerable time investment, rendering it impractical for large-scale data applications.
Given the high cost and practical infeasibility of large-scale manual annotation, research has increasingly shifted towards fully unsupervised cross-modal hashing techniques, which require no labeled data for training. Various methodologies leverage multimodal feature vectors to integrate and derive a similarity matrix, which subsequently informs the generation of hash codes. These approaches include the following: Deep Adaptive Enhanced Hashing guided by Discriminative Similarity (DAEH) [6], which incorporates distance distribution and similarity ratio information, applying an optimization strategy under adaptive teacher guidance to refine retrieval outcomes; and Unsupervised Cross-Modal Hashing through Semantic Text Mining (UCHSTM) [7], which constructs a text-adaptive similarity matrix by investigating inter-word correlations and ultimately fuses the similarity matrices of image and text modalities to guide the training of the hash model. Another research stream, for instance Pseudo-Label Driven Deep Hashing (PDDH) [8], revises the joint semantic similarity matrix via clustering pseudo-labels and develops a similarity consistency loss function that highlights the disparities across different modalities to direct the training of the hash model.
In the realm of previously examined unsupervised cross-modal hashing methodologies, optimizing the multi-modal similarity matrix and ensuring spatial alignment are critical objectives. A significant challenge lies in enhancing the discriminative power of the similarity matrix. The lack of labeled data in unsupervised learning complicates the achievement of spatial alignment across diverse modalities, as the heterogeneous feature representations inherent to different modalities hinder their integration into a cohesive space, thereby complicating semantic alignment and similarity evaluation [9].
To address these challenges, this study proposes a graph propagation-guided multi-scale similarity learning mechanism. This innovative approach enhances the original similarity matrix by iteratively propagating information through the adjacency relationships among cross-modal samples. By incorporating both local and global scale information, the mechanism generates a similarity matrix that is rich in semantic content. It effectively addresses the heterogeneity between modalities, thereby facilitating a more semantically robust learning of hash codes and improving cross-modal retrieval performance. The proposed network architecture is illustrated in Figure 1. We have developed a similarity matrix that utilizes graph propagation and multi-scale enhancement, in conjunction with a hash contrastive learning module, which collectively ensures that the generated hash codes encapsulate richer semantic information, ultimately leading to improved retrieval outcomes.
We clarify that our work follows the standard unsupervised paradigm in cross-modal hashing: no semantic labels from the target datasets are used during training. To obtain robust feature representations, we employ pre-trained models (CLIP for vision, BoW/LDA for text) as fixed feature extractors. While these models were trained on external data with supervision or self-supervision, they are not fine-tuned with our target labels. During evaluation, ground-truth labels are used solely for calculating retrieval metrics, following the common protocol in unsupervised retrieval literature. This places our work within the practical and well-established framework of unsupervised learning with pre-trained backbones.
In summary, the contributions of this paper are as follows:
  • The primary focus of this research is the design of a multi-scale similarity learning module guided by graph propagation, which generates a comprehensive semantic similarity matrix without the need for labeled data.
  • We propose a hash contrastive learning module that establishes thresholds for positive and negative samples using GMM, providing more stable and distinguishable training signals for the entire model. This significantly enhances the semantic cohesion of the hash codes and improves retrieval accuracy.
  • Comprehensive experiments carried out on three benchmark datasets verify that GPMCL can optimize hash functions more efficiently in comparison with other unsupervised cross-modal hashing algorithms.

2. Related Work

Cross-modal retrieval requires addressing the semantic gap caused by the inherent differences between modalities [1]. A core research focus in cross-modal hashing lies in learning efficient hash codes and their semantic representations in Hamming space. Existing methods can be broadly divided into two categories based on supervision: supervised approaches that leverage semantic labels, and unsupervised methods that depend on the intrinsic structure of the data.

2.1. Supervised Methods

Supervised cross-modal hashing enhances hashing learning methods by utilizing manually labeled supervisory information. Its optimization objective is to minimize the Hamming distance between hash codes of similar samples while maximizing the distance between hash codes of dissimilar samples. Since supervised learning employs semantically labeled supervisory information, it can guide the development of more discriminative hash representations, thereby improving the accuracy of cross-modal retrieval. Semantics-preserving hashing for cross-view retrieval (SePH) [10] transforms semantics into probability distributions and employs sampling methods to learn hash functions. Supervised hashing based on matrix factorization (SMFH) [11] utilizes matrix factorization, labeled data, and manifold information to learn hash functions. Discrete cross-modal hashing (DCH) [12] directly learns discriminative binary codes while maintaining discrete constraints. Scalable discrete matrix factorization hashing for cross-modal retrieval (SCRATCH) [13] significantly improves cross-modal retrieval accuracy through an innovative iterative optimization strategy and discrete hash code generation mechanism while maintaining linear time complexity. Deep cross-modal hashing (DCMH) [14] integrates feature learning and hash code learning into a single framework. Differentiable cross-modal hashing through multimodal transformers (DCHMT) [15] proposes a differentiable cross-modal hashing method based on multimodal Transformers, achieving efficient cross-modal retrieval through position-aware encoding and selective binarization mechanisms. Deep semantic-aware proxy hashing (DSPH) [16] maps multimodal multi-label data into a unified discrete space and captures fine-grained semantic correlations between original samples.

2.2. Unsupervised Methods

Unsupervised cross-modal hashing methods perform hashing learning solely based on the data itself. When unsupervised cross-modal hashing first emerged, it primarily relied on traditional machine learning paradigms such as matrix decomposition and spectral clustering. These methods utilized potential statistical correlations and structural topological information inherent in multimodal data to construct a shared hashing space across different modalities. Spectral Hashing (SH) [17] employs the idea of spectral clustering for information retrieval. Cross-View Hashing (CVH) [18] maps semantically similar objects to nearby binary codes by learning hashing functions for each view, thereby enabling efficient cross-view similarity search. Inter-Media Hashing (IMH) [19] learns hashing functions through a linear regression model, effectively generating hash codes for new data points. Latent Semantic Sparse Hashing (LSSH) [20] extracts image and text features using sparse coding and matrix decomposition, respectively, and generates unified hash codes through joint space mapping and iterative optimization.
As research has progressed, methods based on deep learning have been increasingly adopted due to their capacity to adapt to complex cross-modal nonlinear relationships. Deep Joint-Semantics Reconstructing Hashing (DJSRH) [21] integrates original neighborhood relationships from various modalities, enabling the capture of latent semantics. The Deep Graph Neighborhood Consistency Preservation Network (DGCPN) [22] considers graph neighborhood relationships by accounting for both the data and its neighbors, addressing the limitations of existing methods in managing intricate data relationships. Joint Distribution-based Similarity Hashing (JDSH) [23] constructs a joint modal similarity matrix to fuse cross-modal information and employs a distribution-based similarity weighting strategy, allowing the generated hash codes to better preserve the original semantic relationships and enhance discriminability. Deep Semantic Aligned Hashing (DSAH) [24] introduces a semantic alignment loss function to coordinate the similarity between features and hash codes, innovatively utilizing cross-modal hash codes for feature reconstruction to bridge modal differences. Pseudo-Label Driven Deep Hashing (PDDH) [8] refines the joint semantic similarity matrix through clustering pseudo-labels. Unsupervised Contrastive Learning for Cross-Modal Hashing (UCCH) [25] proposes a novel cross-modal ranking learning loss for hashing. The Multi-Perspective Semantic Alignment Mechanism (MPSAM) [26] acquires the consistency of cross-modal similarity by reducing the quantization error of element consistency in multi-perspective similarity.

2.3. Multi-Modal Representation and Feature Fusion Paradigms

Beyond the specific domain of cross-modal hashing, a broader research thread in multi-modal learning focuses on the fundamental challenges of representation learning and feature fusion across diverse data types. Recent studies underscore the importance of constructing robust cross-modal similarities and implementing effective fusion strategies for downstream tasks. For instance, research on sentiment analysis demonstrates that combining textual content with contextual metadata (such as user engagement metrics) through a hybrid fusion mechanism can yield refined insights and measurable performance gains, highlighting the practical value of multi-modal feature integration beyond simple association [27]. Similarly, applications in educational technology showcase how multi-modal dictionaries leverage aligned textual, visual, and symbolic representations to enhance accessibility, pointing to the practical deployment requirements of lightweight yet effective cross-modal alignment modules that must also meet stringent accessibility standards [28]. These works reflect a contemporary emphasis on designing generalizable alignment frameworks that can adapt to varying data structures and interaction patterns.
Drawing from this paradigm, our proposed Graph-Propagated Multi-Scale Hashing with Contrastive Learning (GPMCL) method directly addresses these core challenges of cross-modal alignment and representation integration. Our graph propagation mechanism provides a structured approach to constructing and refining cross-modal similarity structures, which aligns with the need for robust similarity modeling in tasks such as sentiment analysis. The multi-scale similarity learning component further enables the fusion of both local and global semantic information to form a comprehensive representation, mirroring the hybrid fusion principles demonstrated in educational multi-modal tools. Finally, the contrastive separation strategy, enhanced by adaptive Gaussian Mixture Model (GMM) thresholding, offers a principled objective to improve feature discriminability in a shared representation space. This overall framework thus advances the broader goal of building effective, adaptable, and semantically-aware multi-modal systems for practical applications like retrieval and classification.

3. The Proposed Method

The network architecture of GPMCL is illustrated in Figure 1. On the image side, a pre-trained visual encoder is employed to extract image features, while the text side utilizes features processed through a textual feature extractor. The specific extraction method, such as Bag-of-Words (BoW) [29] or Latent Dirichlet Allocation (LDA), is selected according to the dataset characteristics. The features from these two modalities are subsequently integrated, and the resulting cross-modal similarity matrix is refined using graph propagation [30] and multi-scale learning techniques [31]. In the context of hash learning, this research integrates a feature reconstruction module [24] and a hash contrastive learning module, along with intra-class and inter-class structural loss [32], thus enabling more accurate reconstruction of hash codes. The following sections will present a detailed overview of these components.

3.1. Problem Definition

This section defines the symbols and problem formulation used by GPMCL. A dataset consisting of N image-text pairs is represented by the set $O = \{o_1, o_2, \ldots, o_N\}$, where $o_i = (I_i, T_i)$, $i \in [1, N]$, denotes the i-th image-text pair, with $I_i$ and $T_i$ representing the original image and text, respectively. The features extracted by the image network and text network are denoted as $F_I \in \mathbb{R}^{N \times 512}$ and $F_T \in \mathbb{R}^{N \times 512}$, where N indicates the number of samples in each batch. After processing through the hashing layer, the image and text features produce outputs $b_I \in \mathbb{R}^{N \times c_l}$ and $b_T \in \mathbb{R}^{N \times c_l}$, where $c_l$ represents the length of the hash code.

3.2. Network Architecture

The overall architecture consists of two branches: an image network and a text network, both designed to extract discriminative features and map them into a common Hamming space.

3.2.1. Image Network

This study implements an image encoding network for the visual component, utilizing the CLIP-based ViT-B/16 model [33] as a feature extractor while keeping the training parameters fixed. The input images are processed by a visual transformer, resulting in the generation of 512-dimensional semantic feature vectors. Following the feature extraction process, these vectors are mapped to a hash space via a fully connected network that incorporates binary constraints. Ultimately, the features are reconstructed through the application of a decoder:
$F_I = \mathrm{CLIP}(I, \theta_I) \in \mathbb{R}^{N \times 512}$
$h_I = fc_I(F_I, \delta_I)$
$b_I = \tanh(\mu h_I)$
$F_I' = \mathrm{Dec}(\mathrm{sign}(b_I), \omega_I)$
where N represents the sample size, while $\theta_I$, $\delta_I$, and $\omega_I$ are learnable parameters. The scaling parameter $\mu$ is a scheduled parameter that follows a fixed exponential update rule as defined in Equation (11). The model utilized is the CLIP ViT-B/16, referred to as $\mathrm{CLIP}(\cdot)$, while $fc_I$ denotes the fully connected layer that maps image features to the hash space. The reconstruction feature decoder, denoted $\mathrm{Dec}(\cdot)$, consists of two fully connected layers with the architecture $c_l \rightarrow 256 \rightarrow 512$; it restores the hash code to a representation that aligns with the dimensions of the original features.
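To make the data flow of the image branch concrete, the following minimal PyTorch sketch mirrors the equations above. The class and variable names (e.g., ImageHashBranch, fake_clip_feats) are illustrative rather than taken from a released implementation, and the straight-through handling of sign(·) is an assumption, since the text does not specify how the non-differentiable binarization is relaxed during training.

```python
import torch
import torch.nn as nn

class ImageHashBranch(nn.Module):
    """Illustrative image branch: frozen 512-d CLIP features -> hash layer -> decoder."""
    def __init__(self, hash_dim: int = 64):
        super().__init__()
        # fc_I: maps 512-d features to the hash space
        self.fc_hash = nn.Linear(512, hash_dim)
        # Dec(.): c_l -> 256 -> 512, restores codes to the feature dimension
        self.decoder = nn.Sequential(
            nn.Linear(hash_dim, 256), nn.ReLU(),
            nn.Linear(256, 512),
        )

    def forward(self, feats_512: torch.Tensor, mu: float):
        h = self.fc_hash(feats_512)            # h_I = fc_I(F_I)
        b = torch.tanh(mu * h)                 # b_I = tanh(mu * h_I)
        # sign(.) is non-differentiable; a straight-through pass-through is one
        # common workaround (an assumption, not stated in the paper).
        b_sign = torch.sign(b).detach() + b - b.detach()
        recon = self.decoder(b_sign)           # F'_I = Dec(sign(b_I))
        return b, recon

# Toy usage: random features stand in for frozen CLIP ViT-B/16 output.
branch = ImageHashBranch(hash_dim=64)
fake_clip_feats = torch.randn(32, 512)
codes, recon = branch(fake_clip_feats, mu=1.5)
print(codes.shape, recon.shape)  # torch.Size([32, 64]) torch.Size([32, 512])
```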

3.2.2. Text Network

The text inputs are preprocessed into feature vectors using either Bag-of-Words (BoW) or Latent Dirichlet Allocation (LDA), depending on the dataset. Consequently, we developed a hierarchical framework designed to enhance these features at the text end. The preprocessed text features $T \in \mathbb{R}^{N \times d_T}$ are initially processed through two Multi-Layer Perceptron (MLP) layers for dimensionality reduction and feature refinement:
$F_T = \mathrm{MLP}(T, \theta_T) \in \mathbb{R}^{N \times 512}$
To address the issue of the text network being weaker than the image network, this paper introduces residual connections with a dimension-matching projection layer [34] to effectively enhance text features:
$R_T = \mathrm{Proj}(T, \phi_T) \in \mathbb{R}^{N \times 512}$
$F_T = F_T + R_T \in \mathbb{R}^{N \times 512}$
$h_T = fc_T(F_T, \delta_T)$
$b_T = \tanh(\mu h_T)$
$F_T' = \mathrm{Dec}(\mathrm{sign}(b_T), \omega_T)$
where N represents the sample size, $d_T$ denotes the original text feature dimension, $\mathrm{Proj}(\cdot)$ is a linear projection layer that maps the original $d_T$-dimensional text feature to a 512-dimensional space, aligning the residual branch with the main branch, and $\theta_T$, $\phi_T$, $\delta_T$, $\omega_T$ are learnable parameters updated through gradient descent. The scaling parameter $\mu$ is a scheduled parameter that follows a fixed exponential update rule as defined in Equation (11). The preprocessed text features, designated as T, first undergo dimensionality reduction through the multilayer perceptron network $\mathrm{MLP}(\cdot)$. Subsequently, $fc_T$ is a fully connected layer that transforms text features into the hash space. The reconstruction feature decoder $\mathrm{Dec}(\cdot)$ shares its structure with the image counterpart and reverts the hash code to a representation with the same dimensionality as the original features.
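A corresponding sketch of the text branch is given below, assuming a BoW/LDA input of dimension d_T; the hidden width of the two-layer MLP (1024) and the class name TextHashBranch are illustrative choices not specified in the text.

```python
import torch
import torch.nn as nn

class TextHashBranch(nn.Module):
    """Illustrative text branch: BoW/LDA vector -> two-layer MLP + residual projection -> hash layer."""
    def __init__(self, d_t: int, hash_dim: int = 64):
        super().__init__()
        # Two MLP layers for dimensionality reduction and refinement (F_T)
        self.mlp = nn.Sequential(
            nn.Linear(d_t, 1024), nn.ReLU(),
            nn.Linear(1024, 512),
        )
        self.proj = nn.Linear(d_t, 512)           # Proj(.): residual dimension matching (R_T)
        self.fc_hash = nn.Linear(512, hash_dim)   # fc_T
        self.decoder = nn.Sequential(             # Dec(.): c_l -> 256 -> 512
            nn.Linear(hash_dim, 256), nn.ReLU(),
            nn.Linear(256, 512),
        )

    def forward(self, text_vec: torch.Tensor, mu: float):
        f = self.mlp(text_vec) + self.proj(text_vec)        # F_T = MLP(T) + Proj(T)
        h = self.fc_hash(f)
        b = torch.tanh(mu * h)
        b_sign = torch.sign(b).detach() + b - b.detach()    # straight-through sign (assumption)
        return b, self.decoder(b_sign)

# Toy usage with a 1386-dimensional BoW vector.
codes, recon = TextHashBranch(d_t=1386, hash_dim=64)(torch.rand(32, 1386), mu=1.5)
```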

3.2.3. Model Training Optimization

To enhance training stability and progressively enforce hash binarization, we introduce a scheduled scaling parameter μ that controls the sharpness of the hyperbolic tangent activation. Specifically, μ is defined as a model parameter but is updated according to a fixed exponential schedule with respect to the training epoch t, rather than through gradient-based optimization. This approach ensures smooth training in the early stages while gradually strengthening the binarization constraint in later stages.
Let μ t denote the value at epoch t. We initialize μ 0 = 1 and update it after each epoch according to the following rule:
$\mu_t = 1 + e^{\gamma t}$
where t represents the current training epoch number, and γ = 0.015 denotes the exponential growth rate that controls the schedule. This scheduled update rule ensures that μ increases smoothly, allowing the tanh function to progressively approximate the sign function, thereby facilitating stable hash code learning and effective binarization.

3.2.4. Tensor Dimension Reference

Table 1 summarizes tensor dimensions in GPMCL for batch size B:
where B denotes the batch size (32), $d_T$ the text feature dimension (1386 or 10, depending on the text representation), and $c_l$ the hash code length (16–128). All modules maintain dimensional consistency for gradient flow and cross-modal alignment.

3.3. Semantic Similarity Construction

3.3.1. Adaptive Graph Construction

After extracting features from each modality, addressing the heterogeneity gaps has emerged as a significant issue in the field of cross-modal hashing. The construction of a similarity matrix has become a commonly adopted approach in such scenarios. Consequently, this research introduces a method for building a similarity matrix via graph propagation techniques [30]. Once the relevant features are acquired from both the image network and the text network, the process of constructing the graph propagation similarity matrix is outlined below:
$F_I \in \mathbb{R}^{N \times 512}, \quad F_T \in \mathbb{R}^{N \times 512}$
$\hat{F}_I = \frac{F_I}{\|F_I\|_2}, \quad \hat{F}_T = \frac{F_T}{\|F_T\|_2}$
After obtaining the normalized features, this paper constructs a k-nearest neighbors (KNN) graph. First, the Euclidean distance matrix between all sample pairs is computed:
$D_{ij} = \|\hat{f}_i - \hat{f}_j\|_2^2, \quad D \in \mathbb{R}^{N \times N}$
In practice, the KNN graph construction and subsequent graph propagation are performed within each mini-batch during training for computational efficiency. For a batch of B samples, the distance matrix D has dimensions $B \times B$, and all subsequent operations scale with $O(B^3)$ complexity per batch. This batch-wise approach reduces the computational complexity from $O(N^3)$ for the entire dataset to $O(B^3)$ per batch, making the method scalable to large-scale datasets while maintaining the same algorithmic formulation.
For each node i, we select its k nearest neighbors based on the distance matrix $D_{ij}$. Here, k is a pre-defined hyperparameter that controls the sparsity of the graph and is typically set to a small constant to preserve local manifold structure. The neighbor selection is implemented using a threshold-based approach: for node i, we determine the distance threshold $\tau_i(k)$ as the distance to its k-th nearest neighbor, and construct the adjacency matrix as follows:
$A_{ij} = \begin{cases} 1, & \text{if } D_{ij} \le \tau_i(k) \text{ and } i \ne j \\ 0, & \text{otherwise} \end{cases}$
where $\tau_i(k)$ represents the distance between node i and its k-th nearest neighbor. This formulation connects each node to its k nearest neighbors (potentially more in the case of distance ties), resulting in a directed KNN graph.
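The threshold-based k-NN adjacency construction described above can be sketched as follows for a single mini-batch; the function name knn_adjacency is illustrative.

```python
import torch
import torch.nn.functional as F

def knn_adjacency(features: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Directed kNN adjacency: A_ij = 1 if D_ij <= tau_i(k) and i != j."""
    f = F.normalize(features, dim=1)                     # L2-normalised features F_hat
    dist = torch.cdist(f, f, p=2) ** 2                   # D_ij = ||f_i - f_j||_2^2
    n = dist.size(0)
    # tau_i(k): distance to the k-th nearest neighbour, excluding the node itself
    dist_no_self = dist + torch.eye(n) * 1e9
    tau = dist_no_self.topk(k, dim=1, largest=False).values[:, -1:]  # shape (B, 1)
    adj = ((dist <= tau) & ~torch.eye(n, dtype=torch.bool)).float()
    return adj

adj_img = knn_adjacency(torch.randn(32, 512), k=5)
print(adj_img.sum(dim=1))  # roughly k neighbours per node (more only on distance ties)
```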

3.3.2. Graph Propagation Process

In order to uncover the inherent semantic structure within the data, this paper employs an enhanced random walk propagation method [35] that updates the similarity matrix S through l layers of graph propagation. First, the directed adjacency matrix A is normalized row-wise:
$\hat{A} = D^{-1} A, \quad \text{where } D_{ii} = \textstyle\sum_j A_{ij}$
where D is the degree matrix, and A ^ is the row-normalized adjacency matrix. The similarity matrix is then iteratively updated through graph propagation:
$S_*^{(l+1)} = \frac{1}{2} S_*^{(l)} + \frac{1}{2} \hat{A}_* S_*^{(l)} \hat{A}_*^{\top}, \quad * \in \{I, T\}, \quad l = 0, 1, \ldots$
where $S_*^{(0)}$ is initialized as the identity matrix I. This paper derives stable similarity matrices $S_I$ and $S_T$ through two layers of iterative updates ($l = 2$) within the graph propagation process.
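A minimal sketch of the propagation step is shown below; note that the transpose on the second normalised adjacency factor reflects this sketch's reading of the update rule above.

```python
import torch

def graph_propagate(adj: torch.Tensor, layers: int = 2) -> torch.Tensor:
    """Random-walk style propagation: S <- 0.5 * S + 0.5 * A_hat @ S @ A_hat^T, starting from S = I."""
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-8)   # degree D_ii = sum_j A_ij
    a_hat = adj / deg                                     # A_hat = D^{-1} A (row-normalised)
    s = torch.eye(adj.size(0))                            # S^(0) = I
    for _ in range(layers):                               # l = 2 layers in the paper
        s = 0.5 * s + 0.5 * a_hat @ s @ a_hat.t()
    return s

# Toy usage with a random directed 0/1 adjacency standing in for the kNN graph.
adj = (torch.rand(16, 16) < 0.2).float()
s_prop = graph_propagate(adj, layers=2)
```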

3.3.3. Multi-Scale Structure Preservation

To enhance the robustness and semantic depth of the similarity matrices obtained through graph propagation, we integrate both local and global similarity information using a multi-scale structure preservation methodology. This study introduces the multi-scale structure preservation matrix as follows:
$S_*^{\mathrm{multi}} = \frac{1}{|K|} \sum_{k \in K} S_*^{(k)}, \quad * \in \{I, T\}$
where $S_*^{(k)} = \mathbb{I}\big(S_* \ge \theta^{(k)}\big) \odot S_*$ represents the multi-scale similarity matrix for the image or text modality at scale k, $\theta^{(k)}$ denotes the top-k similarity threshold, $\mathbb{I}(\cdot)$ is the indicator function, $\odot$ refers to the Hadamard product, and K represents the selected predefined scale set.
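The multi-scale structure preservation step can be sketched as a row-wise top-k masking of the propagated similarity matrix, averaged over the scale set K; reading the indicator as S ≥ θ^(k) is an interpretation of the formula above.

```python
import torch

def multi_scale_preserve(sim: torch.Tensor, scales=(1, 2, 3)) -> torch.Tensor:
    """S_multi = (1/|K|) * sum_k [ indicator(S >= theta^(k)) * S ] with row-wise top-k thresholds."""
    out = torch.zeros_like(sim)
    for k in scales:
        theta = sim.topk(k, dim=1).values[:, -1:]   # theta^(k): k-th largest similarity per row
        out = out + (sim >= theta).float() * sim    # indicator mask, Hadamard product with S
    return out / len(scales)

s_multi = multi_scale_preserve(torch.rand(16, 16), scales=(1, 2, 3))
```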

3.3.4. Semantic Similarity Fusion

Following the acquisition of the graph propagation similarity matrix and the multi-scale structure-preserving matrix, this study integrates both matrices to enable a multi-scale enhancement of the similarity matrix:
$S_{gp} = \alpha S_I + (1 - \alpha) S_T, \quad S_{ms} = \frac{1}{2}\big(S_I^{\mathrm{multi}} + S_T^{\mathrm{multi}}\big)$
$\hat{S} = \frac{1}{2}\big(S_{gp} + S_{ms}\big)$
$S = \frac{1}{2}\big(\hat{S} + \hat{S}^{\top}\big)$
where $S_{gp}$ represents the fused graph propagation similarity matrix, $\alpha$ is a predefined hyperparameter, $S_{ms}$ denotes the fused multi-scale structure-preserving matrix, $\hat{S}$ stands for the unsymmetrized similarity matrix, $\hat{S}^{\top}$ represents its transpose, and S is the symmetrized final result of semantic similarity fusion. The symmetrization operation enhances the robustness of the algorithm.
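Putting the two branches together, the fusion and symmetrisation step amounts to a few weighted averages, as in the sketch below (function name illustrative).

```python
import torch

def fuse_similarity(s_gp_img, s_gp_txt, s_ms_img, s_ms_txt, alpha: float = 0.6):
    """Fuse graph-propagated and multi-scale similarities, then symmetrise."""
    s_gp = alpha * s_gp_img + (1.0 - alpha) * s_gp_txt     # S_gp
    s_ms = 0.5 * (s_ms_img + s_ms_txt)                     # S_ms
    s_hat = 0.5 * (s_gp + s_ms)                            # S_hat
    return 0.5 * (s_hat + s_hat.t())                       # S = (S_hat + S_hat^T) / 2
```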

3.3.5. Computational Complexity Analysis

To ensure scalability, our method adopts batch-wise processing with batch size $B \ll N$, where N denotes the total number of samples. The per-batch complexity comprises three components:
  • KNN construction: $O\big(B^2 (d + \log k)\big)$ for distance computation and neighbor selection.
  • Graph propagation: $O(l B^3)$ for $l = 2$ layers of matrix multiplication.
  • Multi-scale processing: $O(|K| B^2)$ for $|K|$ scales.
The overall per-epoch complexity is $O\big(N B d + l N B^2 + |K| N B\big)$, which scales linearly with N. Memory consumption is reduced from $O(N^2)$ to $O(B^2)$ per batch, achieving a reduction factor of $(N/B)^2$. This batch-wise design preserves the theoretical formulation while enabling efficient processing of large-scale datasets.

3.4. Hashing Learning

3.4.1. Semantic Hashing Construction

To ensure dimensional consistency between hashing learning and the enhanced similarity matrix, this paper constructs a comprehensive hashing matrix [21] that includes an image hashing matrix, a text hashing matrix, and a cross-modal interaction hashing matrix:
$B_I = \cos(b_I, b_I) = \frac{b_I^{\top} b_I}{\|b_I^{\top}\|_2 \|b_I\|_2}, \quad B_T = \cos(b_T, b_T) = \frac{b_T^{\top} b_T}{\|b_T^{\top}\|_2 \|b_T\|_2}$
$B_{IT} = \cos(b_I, b_T) = \frac{b_T^{\top} b_I}{\|b_T^{\top}\|_2 \|b_I\|_2}, \quad B_{TI} = \cos(b_T, b_I) = \frac{b_I^{\top} b_T}{\|b_I^{\top}\|_2 \|b_T\|_2}$
where $B_I$ denotes the image hashing matrix, $B_T$ represents the text hashing matrix, and $B_{IT}$ and $B_{TI}$ correspond to the bidirectional cross-modal interaction hashing matrices.
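In batch form, these cosine-based matrices reduce to products of row-normalised hash codes, as in the sketch below (one reading of the element-wise cosine definitions above; the function name is illustrative).

```python
import torch
import torch.nn.functional as F

def hash_similarity_matrices(b_img: torch.Tensor, b_txt: torch.Tensor):
    """N x N cosine similarity matrices B_I, B_T, B_IT, B_TI from batch hash outputs."""
    bi = F.normalize(b_img, dim=1)   # row-normalised image codes
    bt = F.normalize(b_txt, dim=1)   # row-normalised text codes
    return bi @ bi.t(), bt @ bt.t(), bi @ bt.t(), bt @ bi.t()
```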

3.4.2. Structure Preserving

To preserve the semantic structure encoded in the similarity matrix, this paper employs a structure-preserving loss [32] to minimize the discrepancy between the hashing matrices and the enhanced similarity matrix, thereby maintaining semantic consistency in the hash codes. The loss function consists of two complementary components designed to balance structural preservation and cross-modal alignment:
$L_s = L_{s1} + L_{s2}$
The first component, $L_{s1}$, ensures that both intra-modal and cross-modal relationships in the Hamming space reflect the semantic similarities encoded in the enhanced similarity matrix S:
$L_{s1} = \|S - B_I\|_F^2 + \|S - B_T\|_F^2 + \|S - B_{IT}\|_F^2 + \|S - B_{TI}\|_F^2$
where $B_I = \mathrm{sign}(b_I) \in \{-1, 1\}^{N \times C}$ and $B_T = \mathrm{sign}(b_T) \in \{-1, 1\}^{N \times C}$ denote the binarized hash codes for the image and text modalities, respectively, $B_{IT}$ and $B_{TI}$ represent the cross-modal similarity matrices, and $\|\cdot\|_F$ denotes the Frobenius norm. This term preserves the global semantic structure by aligning the pairwise code similarities with the enhanced similarity matrix S.
The second component, $L_{s2}$, serves as an information entropy regularization term that specifically enhances the direct correlation between corresponding image-text pairs:
$L_{s2} = -\frac{1}{N} \sum_{i=1}^{N} B_I^{(i)} \cdot B_T^{(i)}$
where $B_I^{(i)}$ and $B_T^{(i)}$ represent the hash codes for the i-th sample from the image and text modalities, respectively. By maximizing the dot product between matching pairs (equivalent to minimizing the negative sum), this term encourages the hash codes of semantically related image-text pairs to converge in the Hamming space.
The combination of $L_{s1}$ and $L_{s2}$ provides a balanced approach to cross-modal hashing learning. While $L_{s1}$ preserves the overall semantic neighborhood structure captured by graph propagation and multi-scale fusion, $L_{s2}$ directly reinforces the alignment between corresponding samples across modalities. This dual formulation ensures that the generated hash codes are both structurally consistent and discriminatively aligned, with $L_{s1}$ maintaining the global data manifold structure and $L_{s2}$ enhancing pairwise matching accuracy.
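A minimal sketch of the combined loss is given below, assuming S and the four hashing matrices are the N × N batch matrices from Section 3.4.1 and that codes_img/codes_txt are the per-sample hash codes; the argument names are illustrative.

```python
import torch

def structure_preserving_loss(s, b_i, b_t, b_it, b_ti, codes_img, codes_txt):
    """L_s = L_s1 + L_s2: Frobenius alignment with S plus the pairwise dot-product term."""
    l_s1 = ((s - b_i) ** 2).sum() + ((s - b_t) ** 2).sum() \
         + ((s - b_it) ** 2).sum() + ((s - b_ti) ** 2).sum()
    l_s2 = -(codes_img * codes_txt).sum(dim=1).mean()   # -(1/N) * sum_i B_I(i) . B_T(i)
    return l_s1 + l_s2
```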

3.4.3. Feature Reconstruction Module

To guarantee that hash codes effectively retain the semantic features of original images and texts, this paper introduces a feature reconstruction loss term [24]. This term reduces the gap between the original features and those decoded or reconstructed from the hash codes:
$L_f = \|F_I - F_I'\|_F^2 + \|F_T - F_T'\|_F^2$
where $F_I$ and $F_T$ represent the original image and text semantic features, respectively, while $F_I'$ and $F_T'$ denote the reconstructed image and text features decoded from the hash codes. This module significantly enhances both the semantic preservation capabilities of hash representations and cross-modal consistency.

3.4.4. Adaptive Threshold Determination

To effectively distinguish between similar and dissimilar pairs in the search process, we propose an adaptive thresholding method based on a Gaussian Mixture Model (GMM) [36]. Considering that different datasets demand distinct thresholds, our approach models the data distribution to dynamically set thresholds for positive and negative samples.
First, we extract the off-diagonal elements $s_{ij}$ of S ($i \ne j$) from the similarity matrix and model their distribution using a two-component GMM:
$p(s \mid \phi) = \sum_{m=1}^{2} \pi_m \, \mathcal{N}\big(s \mid \mu_m, \sigma_m^2\big)$
Here, $\phi = \{\pi_m, \mu_m, \sigma_m\}_{m=1}^{2}$ denotes the model parameters (the mixing coefficients, means, and variances of the two Gaussian components), and $\mathcal{N}(\cdot)$ represents the probability density function of the Gaussian distribution.
To validate the rationality of the two-component GMM assumption, we visualize the similarity distributions of three benchmark datasets (WIKI, MIRFLICKR-25K, NUSWIDE) in Figure 2. The figure demonstrates that the similarity distributions for all datasets present a well-separated bimodal structure. The two-component GMM provides an excellent fit to this distribution, where the left and right modes correspond to dissimilar and similar pairs. This serves as empirical evidence supporting the validity of our modeling assumption.
After estimating the parameters ϕ via the Expectation-Maximization (EM) algorithm, we determine the thresholds for positive and negative samples as follows:
$\tau_p = \mu_2 - \lambda(\mu_2 - \mu_1), \quad \tau_n = \mu_1 + \lambda(\mu_2 - \mu_1)$
In this equation, $\tau_p$ and $\tau_n$ are the thresholds for similar and dissimilar pairs, respectively. We set $\lambda = 0.1$ as the boundary margin coefficient to avoid misclassification near the peak boundary. As shown in Figure 2, $\tau_p$ and $\tau_n$ effectively separate the two peaks of the distribution, which demonstrates the robustness of our threshold determination method.
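The adaptive thresholds can be obtained with an off-the-shelf GMM implementation, as sketched below with scikit-learn; the function name gmm_thresholds is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_thresholds(sim: np.ndarray, lam: float = 0.1):
    """Fit a two-component GMM to off-diagonal similarities and derive (tau_p, tau_n)."""
    n = sim.shape[0]
    off_diag = sim[~np.eye(n, dtype=bool)].reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=42).fit(off_diag)
    mu1, mu2 = np.sort(gmm.means_.ravel())        # mu1 < mu2
    tau_p = mu2 - lam * (mu2 - mu1)               # positive-pair threshold
    tau_n = mu1 + lam * (mu2 - mu1)               # negative-pair threshold
    return tau_p, tau_n

tau_p, tau_n = gmm_thresholds(np.random.rand(32, 32))
```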

3.4.5. Comparative Hashing Learning

To enhance the discriminative power of hash codes, we introduce a contrastive hashing [37] loss that utilizes adaptive thresholds. This method promotes greater similarity for semantically related pairs while diminishing similarity for unrelated pairs. Furthermore, by employing both positive and negative sample thresholds, this study integrates a dynamic safety margin [38] along with a robust filtering strategy to minimize interference from ambiguous boundary samples:
$m = 0.05\,(\tau_p - \tau_n)$
$M_p = \mathbb{I}\big(S > \tau_p + m\big), \quad M_n = \mathbb{I}\big(S < \tau_n - m\big)$
where m denotes the dynamic safety margin, $M_p$ and $M_n$ are the positive and negative sample selection masks, respectively, and $\mathbb{I}(\cdot)$ represents the indicator function.
After establishing the criteria for selecting positive and negative samples, this study employs a square-root-normalized balancing factor [39] to calculate the final contrastive loss. This method addresses sample imbalance more effectively than traditional ratio-based approaches:
$L_c = -\frac{1}{\sqrt{|M_p|}} \sum_{S_{ij} > \tau_p} \log \sigma\big(B_{IT}\big) - \frac{\eta}{\sqrt{|M_n|}} \sum_{S_{ij} < \tau_n} \log\big(1 - \sigma(B_{IT})\big)$
where $\sigma(\cdot)$ denotes the Sigmoid function, $S_{ij} > \tau_p$ indexes elements of the enhanced similarity matrix S exceeding the positive threshold $\tau_p$, $S_{ij} < \tau_n$ indexes elements below the negative threshold $\tau_n$, and $\eta = \sqrt{|M_n|} / \sqrt{|M_p|}$ is the square-root-normalized balancing factor.
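The following sketch implements one reading of the masked contrastive loss, with the square-root normalisation and the balancing factor η treated as assumptions given the ambiguity of the rendered formula; tensor names are illustrative.

```python
import torch

def contrastive_hash_loss(s: torch.Tensor, b_it: torch.Tensor,
                          tau_p: float, tau_n: float) -> torch.Tensor:
    """Masked contrastive loss with dynamic margin and sqrt-normalised balancing."""
    m = 0.05 * (tau_p - tau_n)                 # dynamic safety margin
    pos = s > (tau_p + m)                      # M_p: confident positive pairs
    neg = s < (tau_n - m)                      # M_n: confident negative pairs
    n_pos = pos.sum().float().clamp(min=1.0)
    n_neg = neg.sum().float().clamp(min=1.0)
    eps = 1e-8
    pos_term = -torch.log(torch.sigmoid(b_it[pos]) + eps).sum() / n_pos.sqrt()
    neg_term = -torch.log(1.0 - torch.sigmoid(b_it[neg]) + eps).sum() / n_neg.sqrt()
    eta = n_neg.sqrt() / n_pos.sqrt()          # square-root-normalised balancing factor
    return pos_term + eta * neg_term
```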

3.5. Optimization

The proposed model learns high-quality hash codes by minimizing the following composite loss function:
$\min_{B_I, B_T} L = L_s + L_f + L_c$
where $L_s$ is the structural preservation loss, $L_f$ is the feature reconstruction loss, and $L_c$ is the contrastive hashing learning loss. The mini-batch approach enables iterative optimization of GPMCL. Through minimizing the loss function, GPMCL effectively captures the neighborhood structure and co-occurrence information of original instances. Additionally, the graph propagation-generated similarity matrix, combined with multi-scale augmentation, directs the learning process to produce high-quality hash codes. The stochastic gradient descent (SGD) algorithm [40] is employed to optimize the entire GPMCL model, with the detailed procedure outlined in Algorithm 1.
Algorithm 1 Graph-Propagated Multi-Scale Hashing with Contrastive Learning
Input: Training set $O = \{(I_i, T_i)\}_{i=1}^{N}$; batch size N; hash length $c_l$; hyperparameters $\alpha$, K.
Output: Network parameters $\theta_I, \delta_I, \omega_I, \theta_T, \delta_T, \omega_T, \mu$.
1: Initialize $t \leftarrow 0$ and parameters randomly.
2: Repeat
3:      $t \leftarrow t + 1$; update $\mu_t \leftarrow 1 + e^{\gamma t}$ ($\gamma = 0.015$).
4:     Sample a batch $\{(I_i, T_i)\}_{i=1}^{N}$.
5:     Extract features and construct the enhanced similarity matrix S with Equations (14)–(21).
6:     Generate hash codes $b_I$, $b_T$ and reconstruct features with Equations (22) and (23).
7:     Compute the losses $L_s$, $L_f$, $L_c$ with Equations (24)–(32).
8:     Update parameters via SGD with the total loss $L = L_s + L_f + L_c$.
9: Until convergence.
10: Return learned parameters.

4. Experiment

4.1. DataSets

This study evaluates the proposed method on three benchmark datasets: Wikipedia [41], MIR-Flickr-25K [42], and NUS-WIDE [43]. To establish a consistent and powerful visual baseline, we uniformly employ the CLIP-ViT-B/16 (OpenAI, San Francisco, CA, USA) model to extract 512-dimensional image features for all datasets. For text representation, we follow common practice: Latent Dirichlet Allocation (LDA) is used for the Wikipedia dataset, while Bag-of-Words (BoW) representations are used for MIR-Flickr-25K and NUS-WIDE. This design allows us to fairly evaluate the core contribution of our cross-modal similarity learning and hashing framework, independent of legacy feature extractors. The detailed feature specifications and dataset statistics are provided in Table 2.
To ensure reproducibility, all random operations are controlled by fixing the random seed to 42 across the key software components: Python (3.8.19), NumPy (1.24.4), and PyTorch (2.4.0+cu118). The partitioning details for each dataset are as follows.
Wiki (Wikimedia Foundation, San Francisco, CA, USA) comprises 2866 image-text pairs from 10 categories. Following the standard protocol, the dataset is split into 2173 training pairs and 693 test pairs according to the official training/test labels. The test set serves as queries, while the training set forms the retrieval database, with no overlap between them.
MIR-Flickr-25K (Flickr/SmugMug Inc., San Francisco, CA, USA) contains 20,015 image-text pairs collected from Flickr. Samples with fewer than 20 text tags are filtered out. The dataset is partitioned using a category-stratified sampling strategy: for the first category, 160 samples are assigned to the test set and 400 to the training set; for each remaining category, 80 samples are added to the test set and 200 to the training set. Additional samples are randomly drawn from non-test data to form a final training set of 5000 pairs and a test set of 2000 pairs. The test set is used as queries, and the retrieval database consists of all non-test samples.
NUS-WIDE (National University of Singapore, Singapore) originally includes 269,648 pairs across 81 categories. We select 186,577 samples from the 10 most frequent categories. Partitioning follows the same category-stratified scheme as MIR-Flickr-25K: 200 samples from the first category are placed in the test set and 500 in the training set; for each subsequent category, 200 samples are added to the test set and 500 to the training set. This yields a training set of 5000 pairs and a test set of 2000 pairs. The test set acts as queries, and the retrieval database contains all non-test samples.
It is important to clarify that our work follows a standard unsupervised hashing paradigm: no semantic labels from these target datasets are used for training. Our method relies on fixed, pre-trained feature extractors (CLIP for vision, LDA/BoW for text), and ground-truth labels are used solely for final evaluation, consistent with the literature.

4.2. Implementation Details

This study is implemented using the PyTorch (2.4.0+cu118) framework [44] and trained on an NVIDIA RTX 4080 Laptop GPU (NVIDIA Corporation, Santa Clara, CA, USA) (12 GB). The learning rates for both the image and text models are set to 0.01, with a batch size of 32. We use Stochastic Gradient Descent (SGD) as the optimizer, with a momentum of 0.9 and a weight decay of 0.0005.
For the image branch, we adopt a pre-trained ViT-B/16 model as the encoder and freeze its parameters to serve as the backbone network. On the text side, we employ different feature extraction methods tailored to the dataset characteristics. Specifically, for the MIR-Flickr-25K and NUS-WIDE datasets, we utilize a Bag-of-Words (BoW) model; for the Wikipedia dataset, we employ Latent Dirichlet Allocation (LDA). The extracted features from these models are then passed through two fully connected layers and combined with the original pretrained features via a residual connection, forming the text branch backbone. Both modalities use two fully connected layers as decoders to generate hash codes.
In terms of experimental parameter settings: for the Wikipedia dataset, we set $\alpha = 0.6$, $K = \{1, 2, 4\}$, and $k = 5$, with 50 training epochs; for the MIR-Flickr-25K dataset, $\alpha = 0.7$, $K = \{1, 2, 3\}$, and $k = 5$, with 30 epochs; and for the NUS-WIDE dataset, $\alpha = 0.2$, $K = \{1, 2, 3\}$, and $k = 5$, with 20 epochs. A detailed sensitivity analysis of these parameters is provided in Section 4.6.

4.3. Baseline Methods and Evaluation Metrics

For the accurate evaluation of the proposed GPMCL method’s effectiveness, we conduct comparisons with several state-of-the-art deep cross-modal hashing techniques, such as DJSRH [21], JDSH [23], DGCPN [22], DSAH [24], PDDH [8], and MPSAM [26]. These comparative methods reflect some of the latest advancements in this field and all utilize deep neural networks for framework construction. Most of these methods have publicly available source codes and datasets in their respective publications.
Following prior research, we use mean Average Precision (mAP) [45] to evaluate the retrieval performance of all methods. We compute the mAP scores for two tasks, image-to-text retrieval and text-to-image retrieval, where image queries $Q = \{q_1, q_2, \ldots, q_N\}$ are used for retrieving text samples and vice versa. Given a set of queries, the mAP is calculated in the following manner:
$mAP = \frac{1}{|Q|} \sum_{i=1}^{N} AP_i$
where N represents the number of queries in Q, and $AP_i$ (Average Precision) is computed as follows:
$AP_i = \frac{1}{P_i} \sum_{k=1}^{n} \frac{P_i(k)}{k} \times \phi_i(k)$
where n represents the total number of samples in the database, $P_i$ denotes the number of samples in the database that are similar to query $q_i$, and $P_i(k)$ refers to the count of samples similar to query $q_i$ among the top k retrieved results. $\phi_i(k)$ is a binary indicator: $\phi_i(k) = 1$ signifies that the k-th retrieved sample is similar to query $q_i$, and $\phi_i(k) = 0$ signifies that it is dissimilar. Two samples are regarded as similar if they share at least one common semantic label.
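For reference, the sketch below computes AP and mAP on ranked binary relevance lists truncated at the top-k results, in the spirit of the mAP@50 protocol used in our tables; normalising by the number of relevant items found within the cut-off is an assumption about the truncated variant.

```python
import numpy as np

def average_precision_at_k(relevance, k: int = 50) -> float:
    """AP for one query; `relevance` is the binary relevance of retrieved items in ranked order."""
    rel = np.asarray(relevance[:k], dtype=float)
    if rel.sum() == 0:
        return 0.0
    precision_at = np.cumsum(rel) / (np.arange(len(rel)) + 1)   # precision at each rank
    return float((precision_at * rel).sum() / rel.sum())

def mean_ap(all_relevance, k: int = 50) -> float:
    """mAP@k: mean of the per-query AP values."""
    return float(np.mean([average_precision_at_k(r, k) for r in all_relevance]))

# Toy example: two queries with ranked binary relevance lists.
print(mean_ap([[1, 0, 1, 1], [0, 1, 0, 0]], k=4))
```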

4.4. Precision Comparison

To further validate the overall effectiveness of the proposed method, we conduct comprehensive cross-modal retrieval experiments with rigorous statistical evaluation for our proposed GPMCL method. The experiments are performed on three benchmark datasets—MIR-Flickr-25K, NUS-WIDE, and Wiki—using hash codes of lengths 16, 32, 64, and 128.
To ensure statistical reliability of our proposed method and mitigate the influence of random data partitioning, we employ 5 independent random seeds for dataset splitting and model initialization specifically for GPMCL. For each hash code length and dataset, we report the mean mAP@50 score for GPMCL with standard deviation and 95% confidence interval based on 5 independent runs. Baseline methods are evaluated under the same experimental protocols but with single runs as commonly reported in related literature.
The detailed mAP@50 results comparing GPMCL with state-of-the-art methods are summarized in Table 3, which includes the average performance for GPMCL and single-run results for baseline methods. This provides a comprehensive comparison while ensuring statistical rigor for our proposed approach.
To provide full transparency regarding the variability of our results, we present the complete mAP@50 scores from all 5 independent runs with different random seeds in Table 4, Table 5 and Table 6. These detailed breakdowns show the consistency of GPMCL’s performance across different random initializations and data splits, further demonstrating the robustness of our method against random factors.
On the MIR-Flickr-25K dataset at 64 bits, GPMCL achieves a 0.7% improvement over the second-best method in the image-to-text (I2T) retrieval task as shown in Table 3. On the NUS-WIDE dataset at 64 bits, it outperforms the next-best approach by 0.6% in the same task. Notably, on the Wiki dataset at 64 bits, GPMCL achieves a substantial gain of 15.8% over the previous best, demonstrating a significant performance advantage.
In the text-to-image (T2I) retrieval task, GPMCL also exhibits strong performance (Table 3). At 64 bits, it surpasses the second-best method by 0.6% on the MIR-Flickr-25K dataset and by 3.2% on the NUS-WIDE dataset, reflecting its robustness. Although GPMCL continues to lead on the Wiki dataset in the T2I task at 64 bits, the margin is relatively small, with only a 0.2% improvement over the next-best method—likely due to the dataset’s limited size and imbalanced class distribution.
Importantly, the statistical results in Table 4, Table 5 and Table 6 confirm the consistency of GPMCL’s performance, with small standard deviations and narrow 95% confidence intervals across all experimental settings. This indicates that GPMCL’s superior performance is statistically reliable and not due to random factors.
Overall, GPMCL consistently demonstrates superior performance across datasets and retrieval directions, exhibiting excellent stability, generalization capability, and statistical reliability under different hash code lengths.
Figure 3 and Figure 4 present a comparison of the proposed GPMCL method with various mainstream approaches on three benchmark datasets (MIR-Flickr-25K, NUS-WIDE, and Wiki) under a 64-bit code length. Specifically, Figure 3 shows the top-N retrieval curves [20], while Figure 4 illustrates the precision-recall (PR) curves [19].
To further assess the retrieval effectiveness of GPMCL, we present Top-K precision curves on three public cross-modal datasets: MIR-Flickr-25K, NUS-WIDE, and Wiki (see Figure 3). The horizontal axis denotes the number of top-K retrieved samples, while the vertical axis indicates the corresponding average precision. Both image-to-text (I2T) and text-to-image (T2I) retrieval tasks are considered.
On MIR-Flickr-25K, GPMCL achieves competitive results in both directions. For I2T retrieval, it attains an average precision of approximately 0.9 at K = 500, slightly surpassing other methods. In the T2I task, GPMCL outperforms PDDH, DSAH, and DGCPN at most K values, reflecting its strong semantic representation and retrieval reliability.
For the NUS-WIDE dataset, GPMCL consistently demonstrates superior performance with a more gradual precision decay as K increases, confirming its robustness in large-scale retrieval scenarios. This advantage is consistently observed in both I2T and T2I retrieval tasks.
On the Wiki dataset, GPMCL demonstrates particularly outstanding performance in image-to-text retrieval, achieving a precision of 0.49 at K = 100 and showing statistically significant superiority over all baseline methods, with particularly notable advantages against PDDH and DSAH. Although performance variations are less pronounced in text-to-image retrieval, GPMCL remains competitively comparable to other state-of-the-art methods throughout the mid-K range.
In summary, GPMCL consistently delivers strong performance across datasets and tasks, validating its robustness and generalization in modeling cross-modal semantic correlations.
To further assess the retrieval performance of different methods across varying recall levels, we plot the Precision–Recall (PR) curves for both I2T and T2I tasks on the MIR-Flickr-25K, NUS-WIDE, and Wiki datasets (see Figure 4). PR curves intuitively reflect a model’s ability to maintain high precision while improving recall, serving as a key metric for evaluating retrieval quality.
On the MIR-Flickr-25K dataset, GPMCL demonstrates the best performance in the T2I retrieval task. Although the improvement over competing methods is modest, its precision consistently remains superior across all recall levels, reflecting a strong capability in cross-modal semantic modeling. In contrast, for the I2T task, GPMCL performs on par with or slightly below other state-of-the-art methods, with a slightly lower initial precision yet maintaining a stable retrieval trend.
On the NUS-WIDE dataset, due to the larger scale and increased label complexity, GPMCL maintains competitive performance in the T2I task, showing leading precision across most recall levels. The curve remains consistently above competing methods, indicating robust semantic alignment and effective false positive suppression under large-scale retrieval conditions. For the I2T task, GPMCL performs comparably to other advanced methods, demonstrating stable precision-recall characteristics.
In contrast, on the Wiki dataset—which features significant class imbalance—GPMCL achieves substantial improvements in both retrieval directions. In the I2T task, it attains a notably higher initial precision compared to the second-best method and maintains its advantage across all recall levels. In the T2I task, GPMCL also consistently outperforms baseline methods, highlighting its strong generalization ability under sparse and imbalanced conditions.
In summary, GPMCL achieves a favorable precision–recall trade-off across diverse datasets and tasks. It effectively suppresses false positives while maintaining high recall, reflecting robust semantic alignment and strong generalization in cross-modal retrieval.

4.5. Ablation Study

GPMCL primarily leverages graph propagation to fuse multimodal features and construct a similarity matrix, which is then used to build a structural preservation module and a feature reconstruction module. Additionally, a Gaussian Mixture Model (GMM) is employed to generate thresholds for positive and negative samples, enabling contrastive learning.
In this section, to demonstrate the effectiveness of each component in GPMCL, seven model variants are designed to evaluate the contribution of backbone architecture and individual modules to the overall performance. The variants of GPMCL are as follows:
  • Architecture Variants:
    GPMCL-AlexNet: Replaces the CLIP visual backbone with AlexNet to evaluate the impact of visual backbone representational capacity while maintaining the same text encoder.
    GPMCL-VGG19: Replaces the CLIP visual backbone with the standard VGG19 network to create a controlled experimental setting for a fair comparison, while keeping the text encoder unchanged.
  • Module Ablation Variants:
    GPMCL-1: Removes the multi-scale structural preservation module. The model generates the similarity matrix solely through graph propagation.
    GPMCL-2: Removes the entire structural preservation module, i.e., $L = L_f + L_c$.
    GPMCL-2(a): Removes the $L_{s1}$ component from the structural preservation module, i.e., $L = L_f + L_{s2} + L_c$.
    GPMCL-2(b): Removes the $L_{s2}$ component from the structural preservation module, i.e., $L = L_f + L_{s1} + L_c$.
    GPMCL-3: Removes the feature reconstruction module, i.e., $L = L_s + L_c$.
    GPMCL-4: Removes the contrastive hashing learning module, i.e., $L = L_s + L_f$.
To comprehensively demonstrate the contribution of backbone architecture and each component in the GPMCL model to cross-modal retrieval performance, we conducted ablation experiments using these seven model variants on the MIR-Flickr-25K, NUS-WIDE, and Wiki datasets. The experimental results are summarized in Table 7.
GPMCL-AlexNet replaces the CLIP backbone with AlexNet, a shallower and computationally lighter architecture. On the Wiki dataset for the I2T task, its mAP@50 decreases from 0.623 to 0.530, a reduction of approximately 9.3%. This significant performance drop suggests that the limited representational capacity of AlexNet, particularly its inability to capture cross-modal semantic alignments learned through contrastive pre-training, severely constrains the model’s ability to handle complex multimodal retrieval tasks. Nevertheless, GPMCL-AlexNet still substantially outperforms JDSH, which uses the same backbone, demonstrating the effectiveness of our proposed modules.
GPMCL-VGG19 employs VGG19 as the backbone, a convolutional architecture with strong visual feature extraction capabilities but lacking CLIP’s cross-modal alignment properties. On the same Wiki I2T task, it achieves an mAP@50 of 0.552, showing a decrease of 7.1% compared to the CLIP-based backbone. This result indicates that while VGG19 provides powerful visual representations, the absence of pre-trained cross-modal alignment knowledge in CLIP limits the overall performance. The contrast highlights the importance of CLIP’s dual-encoder structure and contrastive pre-training for effective multimodal feature representation. Despite this limitation, GPMCL-VGG19 consistently outperforms DGCPN across all datasets and tasks.
The comparative evaluation under various backbone configurations demonstrates the consistent superiority of GPMCL’s core architecture. On MIR-Flickr-25K, GPMCL (AlexNet backbone) outperforms the AlexNet-based hashing baseline JDSH by 1.5% in I2T retrieval, while GPMCL (VGG19 backbone) surpasses the VGG19-based baseline DGCPN by 2.1% under the same I2T setting. The advantage is more pronounced on the challenging Wiki dataset, where GPMCL (AlexNet) achieves a 10.0% improvement over JDSH, and GPMCL (VGG19) shows an 11.1% improvement over DGCPN. Similarly, for T2I retrieval on NUS-WIDE, improvements of 2.2% against JDSH (with AlexNet) and 3.8% against DGCPN (with VGG19) are observed. These consistent advancements across datasets, retrieval tasks, and matched backbone architectures confirm that the performance gains are attributable to the proposed graph propagation and multi-scale framework, rather than merely stemming from backbone advantages.
GPMCL-1 is a variant without the multi-scale structural preservation module, relying solely on graph propagation to generate the similarity matrix between modalities. On the Wiki dataset for the I2T task, its mAP@50 drops from 0.623 to 0.610, a decrease of approximately 1.8%, indicating that this module is particularly important for handling complex structural and semantic distributions as well as limited training samples.
GPMCL-2 removes the entire structural preservation module. This version relies only on graph-propagated features and feature reconstruction, ignoring both structural constraints and the multi-scale preservation. It performs poorly across all tasks. For example, on the Wiki I2T task, the mAP@50 drops significantly from 0.623 to 0.577, a decline of 5.1%, showing that without structural constraints, the model struggles to capture high-order semantic relationships, leading to severe performance degradation.
Both GPMCL-2(a) and GPMCL-2(b) exhibit severely degraded performance, approaching the low level of GPMCL-2 on most datasets, with mAP@50 around 0.577 on Wiki I2T. Notably, the removal of $L_{s1}$ (in GPMCL-2(a)) leads to a more pronounced performance drop on the NUS-WIDE dataset. These results indicate that removing either $L_{s1}$ or $L_{s2}$ alone causes a performance decline nearly as severe as removing the entire structural preservation module. This underscores that both components are essential and highly interdependent within the multi-scale framework; their synergistic effect is crucial for maintaining structural consistency, and the absence of either component significantly undermines the module's overall effectiveness.
GPMCL-3 eliminates the feature reconstruction module, retaining only the structural preservation and contrastive learning components. Most tasks exhibit slight decreases in performance. For example, in the Wiki T2I task, the mAP@50 decreases by only 0.001. This suggests that the feature reconstruction module plays a subtle but consistent role in maintaining intra-modal structural consistency, although its contribution is less pronounced compared to the structural preservation component.
GPMCL-4 is the variant used to evaluate the role of contrastive learning. It removes the positive/negative sample partitioning and the contrastive loss, training the hash codes solely based on the graph-constructed similarity matrix and structural preservation. A consistent drop in mAP@50 of approximately 1.2-3.0% is observed across all datasets for both tasks. For example, in the MIR-Flickr-25K I2T task, the mAP@50 decreases from 0.939 to 0.925. This result indicates that contrastive learning contributes to identifying clearer feature distribution boundaries, thereby improving both discriminability and robustness.
In conclusion, the ablation studies validate the effectiveness and synergy of each component in GPMCL. While CLIP’s pre-trained cross-modal alignment provides a strong foundation, the proposed graph propagation, multi-scale structural preservation, feature reconstruction, and contrastive learning modules collectively deliver consistent improvements across diverse datasets and backbone architectures. These results demonstrate that GPMCL’s superiority stems from its integrated design rather than from any single component or pre-trained backbone.

4.6. Parameter Sensitivity

To further illustrate the stability of the proposed model under various configurations, this section conducts sensitivity experiments on three key hyperparameters:
  • α : the weighting coefficient for integrating image and text features into the similarity matrix.
  • K: the predefined size of the multi-scale set within the similarity matrix.
  • k: The graph sparsity hyperparameter. It specifies the number of nearest neighbors selected for each node during the construction of the similarity graph.
The coefficient α is varied from 0.1 to 0.9 to examine how the modality fusion ratio affects retrieval performance. For the multi-scale set K, we evaluate the combinations {1, 2, 3}, {1, 2, 4}, {1, 2, 3, 4}, and {1, 2, 3, 4, 5}, with the upper limit determined by the batch size (set to 32 in this study); following empirical guidelines, K is selected within the range 0.1 × BatchSize ≤ K ≤ BatchSize, which balances model effectiveness and computational efficiency. For the KNN neighbor number k, we test the values {3, 5, 7, 9, 11, 13, 15}. A minimal sketch of how α and k enter the similarity construction is given below.
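The following sketch, written under simplifying assumptions, shows where α and k act: intra-modal cosine similarity matrices are fused with weight α, the fused graph is sparsified to k nearest neighbors, and a simple row-normalized propagation step spreads similarity along the graph. The function names, the normalization, and the fixed number of propagation steps are illustrative; the exact graph propagation and the handling of the multi-scale set K follow the equations of the method section.

```python
import torch
import torch.nn.functional as F

def fused_knn_similarity(f_img, f_txt, alpha=0.6, k=5):
    # Intra-modal cosine similarity matrices.
    s_img = F.normalize(f_img, dim=1) @ F.normalize(f_img, dim=1).t()
    s_txt = F.normalize(f_txt, dim=1) @ F.normalize(f_txt, dim=1).t()
    # alpha balances the image and text modalities in the fused matrix.
    s = alpha * s_img + (1 - alpha) * s_txt
    # k-nearest-neighbor sparsification: keep only the strongest links per node.
    topk = s.topk(k + 1, dim=1).indices            # +1 keeps the self-link
    mask = torch.zeros_like(s).scatter_(1, topk, 1.0)
    mask = ((mask + mask.t()) > 0).float()         # symmetrize the graph
    return s * mask

def propagate(s, steps=2):
    # One simple form of graph propagation over the sparsified similarity graph.
    s = s.clamp(min=0)                                    # assume non-negative edge weights
    a = s / s.sum(dim=1, keepdim=True).clamp(min=1e-8)    # row-normalized adjacency
    out = s
    for _ in range(steps):
        out = a @ out
    return out
```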
Sensitivity experiments were conducted on the MIR-Flickr-25K, NUS-WIDE, and Wiki datasets. In these experiments, the number of neighbors k was fixed at k = 5 , while only one parameter ( α or K) was varied at a time, keeping the others constant, to observe its effect. Retrieval performance was evaluated using mAP@50 and Average mAP@50, where Average mAP@50 is defined as the mean of the mAP@50 scores for I2T and T2I retrieval, providing a more comprehensive measure of overall bidirectional retrieval effectiveness.
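For clarity, the snippet below shows one common way to compute mAP@50 by Hamming ranking of binary codes, along with the Average mAP@50 used in this section. It assumes codes in {−1, +1} and multi-label ground truth where two samples are relevant if they share at least one label; minor conventions of the actual evaluation protocol may differ.

```python
import numpy as np

def map_at_k(query_codes, db_codes, query_labels, db_labels, k=50):
    # mAP@K over Hamming ranking of binary codes in {-1, +1}.
    n_bits = query_codes.shape[1]
    aps = []
    for q, ql in zip(query_codes, query_labels):
        hamming = 0.5 * (n_bits - db_codes @ q)        # Hamming distance to the query
        order = np.argsort(hamming)[:k]                # top-K retrieved database items
        relevant = (db_labels[order] @ ql) > 0         # share at least one label
        if relevant.sum() == 0:
            aps.append(0.0)
            continue
        precision = np.cumsum(relevant) / np.arange(1, k + 1)
        aps.append((precision * relevant).sum() / relevant.sum())
    return float(np.mean(aps))

# Average mAP@50 is the mean of the two retrieval directions:
# avg = 0.5 * (map_at_k(img_q, txt_db, ...) + map_at_k(txt_q, img_db, ...))
```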
The sensitivity of parameter α, with K and k fixed and a 64-bit code length, is illustrated in Figure 5; the sensitivity of parameter K, with α and k fixed, is shown in Figure 6.
As shown in Figure 5, on the MIR-Flickr-25K dataset, when K is fixed at {1, 2, 3}, the model achieves its best performance at α = 0.7. On the NUS-WIDE dataset, with K set to {1, 2, 3}, the model performs best at α = 0.2. For the Wiki dataset, with K fixed at {1, 2, 4}, the highest performance is achieved at α = 0.6. Overall, performance varies only mildly as α changes, indicating that the model is robust to the modality fusion ratio; although the optimal α differs slightly among datasets, the fluctuations are minor, demonstrating good adaptability and tunability across different settings.
As shown in Figure 6, on the MIR-Flickr-25K dataset, with α fixed at 0.7, the model achieves the highest performance at K = {1, 2, 3}. On the NUS-WIDE dataset, with α = 0.2, peak performance is obtained at K = {1, 2, 3}. On the Wiki dataset, with α fixed at 0.6, the model reaches its optimum at K = {1, 2, 4}. In general, performance remains stable within a narrow range as K varies, indicating strong robustness to the choice of the multi-scale set and suggesting that GPMCL is reliable and effective under reasonable settings of K.
Based on the above sensitivity analysis, we fixed α and K to the combination that performed best on each dataset and further investigated the impact of the neighbor number k on retrieval performance. Under these fixed configurations, experiments were conducted with k set to {3, 5, 7, 9, 11, 13, 15}. The results in Figure 7 show the mAP@50 trends on the three datasets under different k values.
As shown in Figure 7, the three datasets exhibit similar sensitivity trends with respect to k: MIR-Flickr-25K, NUS-WIDE, and Wiki all achieve their best Average mAP@50 at k = 5. Moreover, retrieval performance remains relatively stable as long as k lies within a moderate range. This further verifies that GPMCL is robust to the neighbor number setting and provides a reliable reference for parameter tuning in practical applications.
The sensitivity analysis demonstrates that GPMCL exhibits strong robustness across all three hyperparameters. While the optimal values of α and K vary slightly across datasets, the performance remains stable within reasonable ranges. Notably, the neighbor number k shows consistent optimal values across all three datasets, suggesting this parameter is less dataset-dependent and can be set reliably in practical applications. This robustness simplifies the parameter tuning process and enhances the model’s practical utility.

4.7. Efficiency and Complexity Analysis

Table 8 presents the parameter counts of several unsupervised deep cross-modal hashing methods using 64-bit hash codes on the three benchmark datasets. Although GPMCL employs the CLIP-B/16 image encoder, whose parameter count is comparable to, or slightly larger than, those of the AlexNet- and VGG19-based baselines, this modest increase is accompanied by a significant improvement in retrieval performance, demonstrating favorable parameter efficiency and semantic modeling capability.
To comprehensively evaluate the practical deployment efficiency of our proposed GPMCL method, we report detailed efficiency metrics including: (1) indexing time for building the hash code database, (2) single query latency for both image-to-text (I2T) and text-to-image (T2I) retrieval tasks, and (3) memory consumption for storing hash codes. These metrics are measured under consistent hardware and software configurations (FAISS CPU index, identical computing environment) to ensure reliability.
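A minimal sketch of how such metrics can be obtained with a FAISS CPU binary index is given below; it matches the configuration described above only approximately. The helper names, the packed-code layout, and the timing protocol are illustrative assumptions rather than the exact measurement script (the faiss-cpu package is assumed to be installed).

```python
import time
import numpy as np
import faiss

def pack_codes(codes_pm1: np.ndarray) -> np.ndarray:
    # Convert {-1, +1} hash codes into the packed uint8 layout used by FAISS binary indexes.
    bits = (codes_pm1 > 0).astype(np.uint8)
    return np.packbits(bits, axis=1)

def measure_retrieval(db_codes, query_codes, n_bits=64, topk=50):
    index = faiss.IndexBinaryFlat(n_bits)                 # exhaustive Hamming search on CPU
    t0 = time.perf_counter()
    index.add(pack_codes(db_codes))
    index_time_ms = (time.perf_counter() - t0) * 1e3
    t0 = time.perf_counter()
    _, _ = index.search(pack_codes(query_codes), topk)
    latency_ms = (time.perf_counter() - t0) * 1e3 / len(query_codes)
    return index_time_ms, latency_ms                      # per-query latency in ms
```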
Table 9 presents the efficiency analysis of GPMCL with 64-bit hash codes across three benchmark datasets. For the Wiki dataset, the hash code storage requires only 543.25 KB for the database set and 173.25 KB for the query set. The index construction time is exceptionally fast at 0.01 ms, demonstrating efficient database preparation. Most importantly, the single query latency achieves 0.1528 ms/query for I2T and 0.1551 ms/query for T2I retrieval, indicating real-time retrieval capability suitable for large-scale applications.
On the MIR-Flickr-25K dataset, GPMCL maintains efficient performance with an index time of 2.00 ms, I2T latency of 0.5399 ms/query, and T2I latency of 0.5710 ms/query; the total database storage is 4505.60 KB, with query storage of 500.00 KB. For the large-scale NUS-WIDE dataset, the index construction time is 25.02 ms, I2T latency is 0.7598 ms/query, T2I latency is 0.7370 ms/query, and database storage is 46,131.20 KB — scaling appropriately with dataset size while remaining computationally feasible.
Similar efficiency advantages are observed across all three datasets. The low memory footprint, fast indexing, and sub-millisecond query latency confirm that GPMCL not only achieves superior retrieval accuracy but also maintains excellent computational efficiency. While comparable efficiency metrics for baseline methods are not provided in their original publications, the absolute efficiency values of GPMCL demonstrate its practical viability for deployment in resource-constrained environments and large-scale retrieval systems.
The efficiency analysis demonstrates that GPMCL achieves practical deployment characteristics with fast indexing, real-time query processing, and compact storage representation. These efficiency gains can be attributed to the optimized hash code generation process and the effectiveness of the graph propagation module in producing well-structured similarity matrices that facilitate fast nearest neighbor search operations.

5. Discussion

This paper presents GPMCL, a novel unsupervised cross-modal hashing method that integrates graph-propagated multi-scale similarity learning with contrastive learning. By iteratively refining semantic relations through graph structures at both local and global levels, GPMCL constructs a more informative and robust similarity matrix for hash code learning. This effectively mitigates the challenges of modality heterogeneity and improves retrieval performance.
In addition, the proposed contrastive learning module, guided by a Gaussian Mixture Model, introduces adaptive thresholding to enhance sample discrimination. This further improves the semantic compactness of hash codes and contributes to robust cross-modal alignment.
Extensive experiments on public datasets demonstrate that GPMCL consistently outperforms prior state-of-the-art methods, validating the effectiveness of the proposed framework in unsupervised semantic modeling and retrieval.
Future work will explore more advanced fusion mechanisms between graph convolutional networks and multi-scale feature integration, as well as adaptive graph learning strategies under diverse data distributions to further improve retrieval robustness.

Author Contributions

Conceptualization, G.S. and Y.Z.; methodology, G.S. and Y.Z.; software, G.S.; validation, G.S. and Y.Z.; formal analysis, G.S. and Y.Z.; investigation, G.S.; resources, Y.Z.; data curation, G.S. and Y.Z.; writing—original draft preparation, G.S. and Y.Z.; writing—review and editing, G.S. and Y.Z.; visualization, G.S. and Y.Z.; supervision, Y.Z.; project administration, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

All datasets used in this study are publicly available from their respective official websites as cited in the references. To ensure full reproducibility of the experiments, the deterministic sampling protocols (including the fixed random seed of 42 for all random operations), exact partition indices of training/test/retrieval database for MIR-Flickr-25K and NUS-WIDE datasets, and complete dataset partitioning code are available upon reasonable request to the corresponding author.

Acknowledgments

During the preparation of this manuscript, the authors utilized ChatGPT-4o and DeepSeek-V3.1 for language refinement and improving the readability of selected sections. The authors have thoroughly reviewed, verified, and edited all AI-generated content and assume full responsibility for the accuracy and integrity of the work presented in this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
GPMCL: Graph-Propagated Multi-Scale Hashing with Contrastive Learning
I2T: Image-to-Text
T2I: Text-to-Image
mAP: mean Average Precision
PR: Precision-Recall
GMM: Gaussian Mixture Model
CLIP: Contrastive Language-Image Pre-training
MLP: Multi-Layer Perceptron
KNN: K-Nearest Neighbors
BoW: Bag-of-Words
LDA: Latent Dirichlet Allocation

References

  1. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal Machine Learning: A Survey and Taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 423–443. [Google Scholar] [CrossRef]
  2. Rasiwasia, N.; Costa Pereira, J.; Coviello, E.; Doyle, G.; Lanckriet, G.R.G.; Levy, R.; Vasconcelos, N. A new approach to cross-modal multimedia retrieval. In Proceedings of the 18th ACM International Conference on Multimedia, Firenze, Italy, 25–29 October 2010; pp. 251–260. [Google Scholar]
  3. Liu, H.; Wang, R.; Shan, S.; Chen, X. Deep Supervised Hashing for Fast Image Retrieval. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2064–2072. [Google Scholar]
  4. Cao, W.; Feng, W.; Lin, Q.; Cao, G.; He, Z. A Review of Hashing Methods for Multimodal Retrieval. IEEE Access 2020, 8, 15377–15391. [Google Scholar] [CrossRef]
  5. Wang, T.; Li, F.; Zhu, L.; Li, J.; Zhang, Z.; Shen, H.T. Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proc. IEEE 2024, 112, 1716–1754. [Google Scholar] [CrossRef]
  6. Shi, Y.; Zhao, Y.; Liu, X.; Zheng, F.; Ou, W.; You, X.; Peng, Q. Deep Adaptively-Enhanced Hashing With Discriminative Similarity Guidance for Unsupervised Cross-Modal Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7255–7268. [Google Scholar] [CrossRef]
  7. Tu, R.C.; Mao, X.L.; Lin, Q.H.; Ji, W.J.; Qin, W.Z.; Wei, W.; Huang, H.Y. Unsupervised Cross-Modal Hashing via Semantic Text Mining. IEEE Trans. Multimed. 2023, 25, 8946–8957. [Google Scholar] [CrossRef]
  8. Zeng, X.; Xu, K.; Xie, Y. Pseudo-label driven deep hashing for unsupervised cross-modal retrieval. Int. J. Mach. Learn. Cybern. 2023, 14, 3437–3456. [Google Scholar] [CrossRef]
  9. Peng, Y.; Huang, X.; Zhao, Y. An Overview of Cross-Media Retrieval: Concepts, Methodologies, Benchmarks, and Challenges. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 2372–2385. [Google Scholar] [CrossRef]
  10. Lin, Z.; Ding, G.; Hu, M.; Wang, J. Semantics-preserving hashing for cross-view retrieval. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3864–3872. [Google Scholar]
  11. Tang, J.; Wang, K.; Shao, L. Supervised Matrix Factorization Hashing for Cross-Modal Retrieval. IEEE Trans. Image Process. 2016, 25, 3157–3166. [Google Scholar] [CrossRef]
  12. Xu, X.; Shen, F.; Yang, Y.; Shen, H.T.; Li, X. Learning Discriminative Binary Codes for Large-scale Cross-modal Retrieval. IEEE Trans. Image Process. 2017, 26, 2494–2507. [Google Scholar] [CrossRef] [PubMed]
  13. Chen, Z.D.; Li, C.X.; Luo, X.; Nie, L.Q.; Zhang, W.; Xu, X.S. SCRATCH: A Scalable Discrete Matrix Factorization Hashing Framework for Cross-Modal Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 2262–2275. [Google Scholar] [CrossRef]
  14. Jiang, Q.Y.; Li, W.J. Deep Cross-Modal Hashing. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3270–3278. [Google Scholar]
  15. Tu, J.; Liu, X.; Lin, Z.; Hong, R.; Wang, M. Differentiable Cross-modal Hashing via Multimodal Transformers. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 453–461. [Google Scholar]
  16. Huo, Y.; Qin, Q.; Dai, J.; Wang, L.; Zhang, W.; Huang, L.; Wang, C. Deep Semantic-Aware Proxy Hashing for Multi-Label Cross-Modal Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 576–589. [Google Scholar] [CrossRef]
  17. Weiss, Y.; Torralba, A.; Fergus, R. Spectral hashing. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–10 December 2008; pp. 1753–1760. [Google Scholar]
  18. Kumar, S.; Udupa, R. Learning hash functions for cross-view similarity search. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Barcelona, Spain, 16–22 July 2011; pp. 1360–1365. [Google Scholar]
  19. Song, J.; Yang, Y.; Yang, Y.; Huang, Z.; Shen, H.T. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 785–796. [Google Scholar]
  20. Zhou, J.; Ding, G.; Guo, Y. Latent semantic sparse hashing for cross-modal similarity search. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, Gold Coast, Queensland, Australia, 6–11 July 2014; pp. 415–424. [Google Scholar]
  21. Su, S.; Zhong, Z.; Zhang, C. Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3027–3035. [Google Scholar]
  22. Yu, J.; Zhou, H.; Zhan, Y.; Tao, D. Deep Graph-neighbor Coherence Preserving Network for Unsupervised Cross-modal Hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  23. Liu, S.; Qian, S.; Guan, Y.; Zhan, J.; Ying, L. Joint-modal Distribution-based Similarity Hashing for Large-scale Unsupervised Deep Cross-modal Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, China, 25–30 July 2020; pp. 1379–1388. [Google Scholar]
  24. Yang, D.; Wu, D.; Zhang, W.; Zhang, H.; Li, B.; Wang, W. Deep Semantic-Alignment Hashing for Unsupervised Cross-Modal Retrieval. In Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 44–52. [Google Scholar]
  25. Hu, P.; Zhu, H.; Lin, J.; Peng, D.; Zhao, Y.P.; Peng, X. Unsupervised Contrastive Cross-Modal Hashing. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3877–3889. [Google Scholar] [CrossRef]
  26. Chen, Y.; Tan, J.; Yang, Z.; Shi, Y.; Qin, J. Unsupervised multi-perspective fusing semantic alignment for cross-modal hashing retrieval. Multimed. Tools Appl. 2024, 83, 63993–64014. [Google Scholar] [CrossRef]
  27. Guerrero-Contreras, G.; Balderas-Díaz, S.; Serrano-Fernández, A.; Muñoz, A. Enhancing Sentiment Analysis on Social Media: Integrating Text and Metadata for Refined Insights. In Proceedings of the 2024 International Conference on Intelligent Environments (IE), Málaga, Spain, 9–12 September 2024; pp. 62–69. [Google Scholar]
  28. Balderas-Díaz, S.; Guerrero-Contreras, G.; Ramírez-Vela, M.; Toribio-Camuñas, S.; Gay, N.C.; Reguera, A.M. Inclusive Education and Cultural Heritage Access: The LECTPAT Platform Multimodal Online Dictionary. In Proceedings of the 2024 International Symposium on Computers in Education (SIIE), Madrid, Spain, 25–27 September 2024; pp. 1–6. [Google Scholar]
  29. Csurka, G.; Dance, C.R.; Fan, L.; Willamowski, J.; Bray, C. Visual categorization with bags of keypoints. In Proceedings of the Workshop on Statistical Learning in Computer Vision, ECCV, Prague, Czech Republic, 11–14 May 2004. [Google Scholar]
  30. Atwood, J.; Towsley, D. Diffusion-convolutional neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 2001–2009. [Google Scholar]
  31. Jiang, Z.A. Multi-Scale Contrastive Learning Networks for Graph Anomaly Detection. In Proceedings of the 2024 4th Asia-Pacific Conference on Communications Technology and Computer Science (ACCTCS), Harbin, China, 22–24 March 2024; pp. 618–625. [Google Scholar]
  32. Li, M.; Li, Y.; Ge, M.; Ma, L. CLIP-based fusion-modal reconstructing hashing for large-scale unsupervised cross-modal retrieval. Int. J. Multimed. Inf. Retr. 2023, 12, 2. [Google Scholar]
  33. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar] [CrossRef]
  34. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  35. Zhong, L.; Yang, J.; Chen, Z.; Wang, S. Contrastive Graph Convolutional Networks With Generative Adjacency Matrix. IEEE Trans. Signal Process. 2023, 71, 772–785. [Google Scholar] [CrossRef]
  36. Sohn, K.; Berthelot, D.; Li, C.L.; Zhang, Z.; Carlini, N.; Cubuk, E.D.; Kurakin, A.; Zhang, H.; Raffel, C. FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. arXiv 2020, arXiv:2001.07685. [Google Scholar]
  37. Cao, Z.; Long, M.; Wang, J.; Yu, P.S. HashNet: Deep Learning to Hash by Continuation. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5609–5618. [Google Scholar]
  38. Wang, F.; Liu, H. Understanding the Behaviour of Contrastive Loss. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2495–2504. [Google Scholar]
  39. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S.J. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 9260–9269. [Google Scholar]
  40. Ruder, S. An overview of gradient descent optimization algorithms. arXiv 2016, arXiv:1609.04747. [Google Scholar]
  41. Costa Pereira, J.; Coviello, E.; Doyle, G.; Rasiwasia, N.; Lanckriet, G.R.G.; Levy, R.; Vasconcelos, N. On the Role of Correlation and Abstraction in Cross-Modal Multimedia Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 521–535. [Google Scholar] [CrossRef]
  42. Huiskes, M.J.; Lew, M.S. The MIR flickr retrieval evaluation. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, Vancouver, BC, Canada, 30–31 October 2008; pp. 39–43. [Google Scholar]
  43. Chua, T.S.; Tang, J.; Hong, R.; Li, H.; Luo, Z.; Zheng, Y. NUS-WIDE: A real-world web image database from National University of Singapore. In Proceedings of the ACM International Conference on Image and Video Retrieval, Santorini, Greece, 8–10 July 2009; Article No. 48. [Google Scholar]
  44. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. arXiv 2019, arXiv:1912.01703. [Google Scholar]
  45. Wang, J.; Liu, W.; Kumar, S.; Chang, S.F. Learning to Hash for Indexing Big Data—A Survey. arXiv 2015, arXiv:1509.05472. [Google Scholar] [CrossRef]
Figure 1. Overview of the proposed GPMCL architecture. The image features are extracted by a CLIP model, while the text features are extracted using BoW or LDA. These features are then fused and refined via graph propagation and multi-scale learning to compute cross-modal similarity. The hash learning part integrates a feature reconstruction module and a contrastive learning module, optimized with intra-class and inter-class structural constraints, to generate the final hash codes.
Figure 2. Similarity distributions and GMM fitting results on three benchmark datasets. For each dataset, μ_1 and μ_2 are the means of the two Gaussian components; τ_p and τ_n are the positive and negative thresholds calculated by Equation (29); and Separation refers to μ_2 − μ_1.
Figure 3. Top-K precision comparison on three datasets.
Figure 4. Precision-Recall curves on three datasets.
Figure 5. Influence of hyperparameter α on mAP@50 across three benchmark datasets. The star (⋆) on each curve indicates the α value that achieves the best performance.
Figure 6. Effect of multi-scale parameter K on retrieval performance (mAP@50) across three datasets. The star (⋆) on each curve indicates the K value that achieves the best performance.
Figure 7. Effect of the number of neighbors k on retrieval performance (mAP@50) across three datasets. The star (⋆) on each curve indicates the k value selected for the best performance.
Table 1. Tensor dimensions in GPMCL pipeline.
Stage | Image | Text
Input | B × 3 × 224 × 224 | B × d_T
Feature Extraction | B × 512 | B × 512
Hash Code | B × c_l | B × c_l
Reconstruction | B × 512 | B × 512
Table 2. Statistics and Feature Representations of the Benchmark Datasets.
Datasets | Wiki | MIR-Flickr-25K | NUS-WIDE
Database | 2866 | 20,015 | 186,577
Training | 2173 | 5000 | 5000
Testing | 693 | 2000 | 2000
Image Feature | CLIP-ViT-B/16 | CLIP-ViT-B/16 | CLIP-ViT-B/16
Text Feature | LDA (10-dim) | BoW (1386-dim) | BoW (1000-dim)
Table 3. The mAP@50 results of natural datasets at various code lengths.
Task | Method | MIR-Flickr-25K (16/32/64/128 Bits) | NUS-WIDE (16/32/64/128 Bits) | Wiki (16/32/64/128 Bits)
I2T | DJSRH [21] | 0.810 / 0.843 / 0.862 / 0.876 | 0.724 / 0.773 / 0.798 / 0.817 | 0.388 / 0.403 / 0.412 / 0.421
I2T | JDSH [23] | 0.832 / 0.853 / 0.882 / 0.892 | 0.736 / 0.793 / 0.832 / 0.835 | 0.313 / 0.432 / 0.430 / 0.447
I2T | DGCPN [22] | 0.852 / 0.867 / 0.892 / 0.905 | 0.789 / 0.814 / 0.825 / 0.848 | 0.425 / 0.437 / 0.441 / 0.457
I2T | DSAH [24] | 0.863 / 0.877 / 0.895 / 0.903 | 0.775 / 0.805 / 0.818 / 0.827 | 0.416 / 0.430 / 0.438 / 0.445
I2T | PDDH [8] | 0.915 / 0.929 / 0.940 | 0.824 / 0.841 / 0.856 | 0.464 / 0.467 / 0.467 (reported at three code lengths)
I2T | MPSAM [26] | 0.847 / 0.864 / 0.874 / 0.887 | 0.758 / 0.784 / 0.805 / 0.820 | 0.412 / 0.421 / 0.425 / 0.428
I2T | GPMCL | 0.901 / 0.922 / 0.936 / 0.943 | 0.800 / 0.828 / 0.847 / 0.858 | 0.598 / 0.613 / 0.625 / 0.619
T2I | DJSRH [21] | 0.786 / 0.822 / 0.835 / 0.847 | 0.712 / 0.744 / 0.771 / 0.789 | 0.611 / 0.635 / 0.646 / 0.658
T2I | JDSH [23] | 0.825 / 0.864 / 0.878 / 0.880 | 0.721 / 0.795 / 0.794 / 0.804 | 0.379 / 0.590 / 0.632 / 0.651
T2I | DGCPN [22] | 0.831 / 0.858 / 0.875 / 0.884 | 0.742 / 0.767 / 0.783 / 0.808 | 0.618 / 0.626 / 0.632 / 0.654
T2I | DSAH [24] | 0.846 / 0.860 / 0.881 / 0.882 | 0.770 / 0.790 / 0.804 / 0.815 | 0.644 / 0.650 / 0.660 / 0.662
T2I | PDDH [8] | 0.890 / 0.901 / 0.898 | 0.792 / 0.797 / 0.797 | 0.635 / 0.652 / 0.649 (reported at three code lengths)
T2I | MPSAM [26] | 0.849 / 0.855 / 0.858 / 0.874 | 0.761 / 0.776 / 0.789 / 0.791 | 0.622 / 0.642 / 0.644 / 0.653
T2I | GPMCL | 0.876 / 0.899 / 0.907 / 0.911 | 0.789 / 0.823 / 0.836 / 0.843 | 0.640 / 0.657 / 0.662 / 0.666
Note: The best results are highlighted in bold.
Table 4. Cross-Modal Retrieval Performance of GPMCL on MIR-Flickr-25K Dataset with Statistical Evaluation.
Task | Metric | 16 Bits | 32 Bits | 64 Bits | 128 Bits
I2T | Seed 1 | 0.896 | 0.926 | 0.939 | 0.944
I2T | Seed 42 | 0.905 | 0.923 | 0.937 | 0.940
I2T | Seed 123 | 0.898 | 0.923 | 0.938 | 0.945
I2T | Seed 999 | 0.898 | 0.920 | 0.936 | 0.943
I2T | Seed 2025 | 0.908 | 0.919 | 0.930 | 0.943
I2T | Mean | 0.9010 | 0.9222 | 0.9360 | 0.9430
I2T | Std | 0.0049 | 0.0027 | 0.0034 | 0.0018
I2T | 95% CI | ±0.0061 | ±0.0034 | ±0.0042 | ±0.0022
I2T | Final Report | 0.901 ± 0.005 | 0.922 ± 0.003 | 0.936 ± 0.003 | 0.943 ± 0.002
T2I | Seed 1 | 0.877 | 0.898 | 0.907 | 0.909
T2I | Seed 42 | 0.870 | 0.898 | 0.904 | 0.911
T2I | Seed 123 | 0.876 | 0.897 | 0.903 | 0.909
T2I | Seed 999 | 0.874 | 0.900 | 0.908 | 0.916
T2I | Seed 2025 | 0.882 | 0.900 | 0.911 | 0.909
T2I | Mean | 0.8758 | 0.8986 | 0.9066 | 0.9108
T2I | Std | 0.0045 | 0.0013 | 0.0033 | 0.0033
T2I | 95% CI | ±0.0056 | ±0.0016 | ±0.0041 | ±0.0041
T2I | Final Report | 0.876 ± 0.004 | 0.899 ± 0.001 | 0.907 ± 0.003 | 0.911 ± 0.003
Table 5. Cross-Modal Retrieval Performance of GPMCL on NUS-WIDE Dataset with Statistical Evaluation.
Task | Metric | 16 Bits | 32 Bits | 64 Bits | 128 Bits
I2T | Seed 1 | 0.799 | 0.830 | 0.850 | 0.858
I2T | Seed 42 | 0.801 | 0.828 | 0.845 | 0.861
I2T | Seed 123 | 0.797 | 0.826 | 0.851 | 0.857
I2T | Seed 999 | 0.802 | 0.826 | 0.844 | 0.857
I2T | Seed 2025 | 0.800 | 0.832 | 0.846 | 0.859
I2T | Mean | 0.7998 | 0.8284 | 0.8472 | 0.8584
I2T | Std | 0.0019 | 0.0024 | 0.0028 | 0.0015
I2T | 95% CI | ±0.0024 | ±0.0030 | ±0.0035 | ±0.0019
I2T | Final Report | 0.800 ± 0.002 | 0.828 ± 0.002 | 0.847 ± 0.003 | 0.858 ± 0.001
T2I | Seed 1 | 0.805 | 0.823 | 0.837 | 0.840
T2I | Seed 42 | 0.758 | 0.822 | 0.840 | 0.839
T2I | Seed 123 | 0.799 | 0.820 | 0.831 | 0.846
T2I | Seed 999 | 0.791 | 0.825 | 0.835 | 0.845
T2I | Seed 2025 | 0.792 | 0.825 | 0.836 | 0.846
T2I | Mean | 0.7890 | 0.8230 | 0.8358 | 0.8432
T2I | Std | 0.0195 | 0.0021 | 0.0033 | 0.0036
T2I | 95% CI | ±0.0242 | ±0.0026 | ±0.0041 | ±0.0045
T2I | Final Report | 0.789 ± 0.020 | 0.823 ± 0.002 | 0.836 ± 0.003 | 0.843 ± 0.004
Table 6. Cross-Modal Retrieval Performance of GPMCL on Wiki Dataset with Statistical Evaluation.
Task | Metric | 16 Bits | 32 Bits | 64 Bits | 128 Bits
I2T | Seed 1 | 0.593 | 0.612 | 0.623 | 0.617
I2T | Seed 42 | 0.595 | 0.614 | 0.633 | 0.624
I2T | Seed 123 | 0.598 | 0.615 | 0.622 | 0.617
I2T | Seed 999 | 0.604 | 0.617 | 0.623 | 0.615
I2T | Seed 2025 | 0.600 | 0.606 | 0.625 | 0.621
I2T | Mean | 0.5980 | 0.6128 | 0.6252 | 0.6188
I2T | Std | 0.0043 | 0.0038 | 0.0046 | 0.0036
I2T | 95% CI | ±0.0053 | ±0.0047 | ±0.0057 | ±0.0045
I2T | Final Report | 0.598 ± 0.004 | 0.613 ± 0.004 | 0.625 ± 0.005 | 0.619 ± 0.004
T2I | Seed 1 | 0.637 | 0.653 | 0.664 | 0.665
T2I | Seed 42 | 0.639 | 0.665 | 0.653 | 0.667
T2I | Seed 123 | 0.635 | 0.649 | 0.668 | 0.664
T2I | Seed 999 | 0.647 | 0.662 | 0.655 | 0.665
T2I | Seed 2025 | 0.644 | 0.654 | 0.668 | 0.671
T2I | Mean | 0.6404 | 0.6566 | 0.6616 | 0.6664
T2I | Std | 0.0050 | 0.0074 | 0.0066 | 0.0027
T2I | 95% CI | ±0.0062 | ±0.0092 | ±0.0082 | ±0.0034
T2I | Final Report | 0.640 ± 0.005 | 0.657 ± 0.007 | 0.662 ± 0.007 | 0.666 ± 0.003
Table 7. The mAP@50 of ablation experiments with 64 bits on three benchmark datasets.
Task | Method | MIR-Flickr-25K | NUS-WIDE | Wiki
I2T | GPMCL | 0.939 | 0.850 | 0.623
I2T | JDSH (AlexNet) | 0.882 | 0.832 | 0.430
I2T | DGCPN (VGG19) | 0.892 | 0.825 | 0.441
I2T | GPMCL-AlexNet | 0.897 | 0.818 | 0.530
I2T | GPMCL-VGG19 | 0.913 | 0.839 | 0.552
I2T | GPMCL-1 | 0.925 | 0.836 | 0.610
I2T | GPMCL-2 | 0.829 | 0.724 | 0.577
I2T | GPMCL-2(a) | 0.823 | 0.634 | 0.570
I2T | GPMCL-2(b) | 0.918 | 0.833 | 0.585
I2T | GPMCL-3 | 0.936 | 0.848 | 0.614
I2T | GPMCL-4 | 0.925 | 0.836 | 0.592
T2I | GPMCL | 0.907 | 0.837 | 0.664
T2I | JDSH (AlexNet) | 0.878 | 0.794 | 0.632
T2I | DGCPN (VGG19) | 0.875 | 0.783 | 0.632
T2I | GPMCL-AlexNet | 0.877 | 0.816 | 0.643
T2I | GPMCL-VGG19 | 0.897 | 0.821 | 0.637
T2I | GPMCL-1 | 0.897 | 0.825 | 0.659
T2I | GPMCL-2 | 0.841 | 0.704 | 0.608
T2I | GPMCL-2(a) | 0.845 | 0.604 | 0.589
T2I | GPMCL-2(b) | 0.891 | 0.793 | 0.629
T2I | GPMCL-3 | 0.905 | 0.832 | 0.661
T2I | GPMCL-4 | 0.886 | 0.814 | 0.619
Table 8. Number of parameters (in millions) for different methods with 64-bit hash codes on three benchmark datasets.
Method | Backbone | MIR-Flickr-25K | NUS-WIDE | Wiki
DJSRH [21] | AlexNet + MLP | 145.8 | 144.2 | 140.1
JDSH [23] | AlexNet + MLP | 145.8 | 144.2 | 140.1
DSAH [24] | AlexNet + MLP | 144.5 | 146.1 | 140.4
DGCPN [22] | VGG19 + MLP | 160.9 | 162.6 | 156.9
PDDH [8] | VGG19 + MLP | 146.1 | 144.5 | 140.4
GPMCL | CLIP + MLP | 152.6 | 152.0 | 150.5
Table 9. Retrieval efficiency metrics for GPMCL with 64-bit hash codes on three benchmark datasets.
Dataset | Index Time (ms) | I → T Latency (ms/Query) | T → I Latency (ms/Query) | DB Storage (KB) | Query Storage (KB) | Samples (DB/Query)
MIR-Flickr-25K | 2.00 | 0.5399 | 0.5710 | 4505.60 | 500.00 | 18,015/2000
NUS-WIDE | 25.02 | 0.7598 | 0.7370 | 46,131.20 | 500.00 | 184,577/2000
Wiki | 0.01 | 0.1528 | 0.1551 | 543.25 | 173.25 | 2173/693