Article

Unsupervised Contrastive Graph Kolmogorov–Arnold Networks Enhanced Cross-Modal Retrieval Hashing

1 Dundee International Institute of Central South University, Central South University, Changsha 410083, China
2 School of Computer Science and Engineering, Central South University, Changsha 410083, China
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(11), 1880; https://doi.org/10.3390/math13111880
Submission received: 2 May 2025 / Revised: 26 May 2025 / Accepted: 29 May 2025 / Published: 4 June 2025

Abstract

To address modality heterogeneity and accelerate large-scale retrieval, cross-modal hashing strategies generate compact binary codes that enhance computational efficiency. Existing approaches often struggle with suboptimal feature learning due to fixed activation functions and limited cross-modal interaction. We propose Unsupervised Contrastive Graph Kolmogorov–Arnold Networks (GraphKAN) Enhanced Cross-modal Retrieval Hashing (UCGKANH), integrating GraphKAN with contrastive learning and hypergraph-based enhancement. GraphKAN enables more flexible cross-modal representation through enhanced nonlinear expression of features. We introduce contrastive learning that captures modality-invariant structures through sample pairs. To preserve high-order semantic relations, we construct a hypergraph-based information propagation mechanism, refining hash codes by enforcing global consistency. The efficacy of our UCGKANH approach is validated by thorough tests on the MIRFlickr-25K, NUS-WIDE, and MS COCO datasets, which show significant gains in retrieval accuracy coupled with strong computational efficiency.

1. Introduction

With the growing availability of multimodal data across diverse media platforms, there has been increasing academic attention on cross-modal retrieval in recent years [1]. Among various modalities, image and text data are the most prevalent and voluminous, and it is here that cross-modal hashing demonstrates superior retrieval efficiency [2,3]. Cross-modal hashing encodes data into unified binary hash representations and measures the Hamming distance over the binary codes derived from matched image and text features [4,5]. By exploiting the computational efficiency of Hamming-distance comparison and the compactness of hash codes, which also reduces storage, this paradigm scales well to large-scale cross-modal retrieval [6,7]. In contrast, cross-modal retrieval methods that project heterogeneous data into a shared embedding space often suffer from significant storage and computational burdens when handling large-scale datasets [8,9]. Therefore, cross-modal hashing approaches have shown their effectiveness in the era of multi-modal data, and a variety of new approaches building upon these methods have emerged in recent years [10].
The two main categories of cross-modal hashing techniques are shallow and deep learning-based models. Shallow methods rely on manually crafted features, which often fail to generate sufficiently representative and discriminative hash codes [11]. Deep cross-modal hashing techniques, on the other hand, combine hash function creation and feature representation learning into a single architecture, enhancing their capacity to capture complex latent semantics [12]. This typically leads to enhanced retrieval accuracy and better scalability [13]. Our study focuses on deep cross-modal hashing because of these benefits. Additionally, deep cross-modal hashing techniques can be further divided into supervised and unsupervised methods, depending on the presence or absence of explicit semantic supervision [14]. Unsupervised deep hashing leverages statistical correlations across modalities to learn hash functions, aiming to extract modality-invariant semantic representations (modality-invariant denotes feature representations that remain consistent across different data modalities while preserving semantic content) without requiring labeled data [15]. On the other hand, supervised deep hashing relies on predefined image–text similarity labels, which generally contribute to higher retrieval precision and stronger semantic consistency [16,17]. However, because large-scale labeled datasets are expensive and scarce, this work focuses on unsupervised deep hashing, exploring its potential in learning effective cross-modal hash representations without manual annotations [18,19].
Although deep cross-modal hashing approaches have achieved notable progress, they continue to encounter difficulties in feature representation, aligning global semantic embeddings, and addressing the brevity of text in image-oriented retrieval tasks [20,21]. First, the majority of models use convolutional neural networks (CNNs) for images and Bag-of-Words (BoW) for text as upstream feature extractors. These models use convolutional kernels to extract local features, but they cannot model long-range dependencies across the whole data instance [22]. In contrast, pre-trained transformer networks utilize the attention mechanism to model the global correlation of data patches with a built-in overall receptive field. Secondly, many cross-modal hashing techniques struggle to achieve precise alignment between the embedding spaces of heterogeneous modalities [23,24]. Many approaches only use pairwise similarity constraints, neglecting high-order relationships between samples, which can lead to suboptimal hash code learning [25]. Additionally, existing methods typically assume a one-to-one correspondence between image–text pairs, whereas real-world multimodal data often exhibit many-to-many relationships [26,27]. Without explicit modeling of such structures, the learned hash codes may not generalize well [28]. Finally, the effectiveness of text-to-image retrieval remains a key challenge. Because visual and textual information are represented differently, many models perform well in image-to-text retrieval, while text-to-image retrieval frequently suffers from semantic inconsistency [29]. Most deep hashing methods struggle to bridge the modality gap, particularly in cases where textual descriptions are abstract or contain high-level concepts that do not have direct visual counterparts [30].
To address the aforementioned limitations, we propose an Unsupervised Contrastive Graph Kolmogorov–Arnold Networks Enhanced Cross-modal Retrieval Hashing model, named UCGKANH. This framework leverages Graph Kolmogorov–Arnold Networks (GraphKAN) [31] to improve cross-modal feature learning and integrates unsupervised contrastive learning to optimize hash code generation. Additionally, a hypergraph-based structure is introduced to model high-order semantic relations between modalities, facilitating the generation of more robust and discriminative hash codes. Our contributions are as follows:
  • We present UCGKANH, an unsupervised cross-modal hashing framework that leverages contrastive learning and is further enhanced by GraphKAN and hypergraph-based modeling. Unlike existing deep unsupervised cross-modal hashing methods that rely solely on CNN-based encoders or basic graph structures [15,25], our model uniquely integrates GraphKAN to enhance feature expressiveness via learnable activation functions. By integrating Kolmogorov–Arnold Networks into the retrieval process, the model achieves more expressive and discriminative feature representations.
  • We design an unsupervised contrastive learning strategy tailored for cross-modal hashing. By leveraging instance-level contrastive learning without requiring explicit labels, our method significantly enhances the discrimination and consistency of hash codes across different modalities.
  • We incorporate hypergraph-based semantic structure modeling to capture high-order relationships across image–text pairs. This mitigates the shortcomings of traditional graph-based methods and enhances the generalization of the generated hash codes in challenging cross-modal retrieval environments. Specifically, our proposed method leverages the synergistic effect between GraphKAN and hypergraph via contrastive learning to enhance cross-modal hashing performance.

2. Related Work

Recent developments in deep cross-modal hashing are reviewed in this section, covering both supervised and unsupervised techniques, along with the integration of Kolmogorov–Arnold Networks.

2.1. Deep Cross-Modal Hashing

In order to close the semantic gap between various modalities, prior research on deep cross-modal hashing focuses on projecting multimodal data into a common representation space. Jiang et al. proposed DCMH, which, building on earlier cross-modal hashing (CMH) work, first introduced deep hashing into cross-modal retrieval [32,33]. Deep learning networks are capable of capturing complex data relationships and handling high-dimensional information. In recent years, the performance of deep cross-modal hashing has been substantially improved through the incorporation of advanced neural modules and enhanced semantic alignment (mapping cross-modal data into a shared embedding space) strategies. Deep cross-modal hashing techniques are generally categorized into supervised and unsupervised approaches. Supervised methods use semantic labels to guide the hashing model toward semantically consistent hash codes, typically treating the labels as a similarity metric; for example, EMCHL constructs cross-modal similarity relationships on different views [34]. Moreover, DSFSH innovates the similarity matrix construction to preserve both intrinsic inter-modality and intra-modality relationships simultaneously [35]. The semantic distributions of different modalities are also heterogeneous, and FedSCMR addresses this problem by integrating federated learning as an overall optimization module that extracts shared cross-modal semantic representations [36]. Furthermore, labels are often noisy due to constraints on human and time resources, and DHRL incorporates semantic concept refinement and a rank-switching mechanism to address scenarios involving noisy supervision [37]. While supervised methods take advantage of labels to generate rich and accurate semantic hash codes, unsupervised approaches utilize deep neural networks to mine common features and implicit relationships between modalities, often with better generalization, since supervised methods may overfit biased label distributions. Contrastive learning has also been integrated into deep hashing models such as URWMCH and SACH, which employ contrastive losses to improve cross-modal consistency [38,39]. Graph neural networks are also powerful for modeling complex cross-modal relationships, and SGRN builds cross-modal graph relationships at both local and global levels [40]. Other works, such as JMGCH and UGRDH, apply adaptive weight assignment on graph representation networks and combine teacher–student knowledge distillation [41,42]. Additionally, works including HEH and UAHCH integrate hypergraph neural networks that capture higher-order relationships across cross-modal data [43,44].
Although contrastive learning and graph-based approaches have shown progress, current cross-modal hashing methods continue to encounter difficulties in acquiring rich feature representations and effectively maintaining semantic structures. To address these issues, we propose an unsupervised contrastive graph Kolmogorov–Arnold networks enhanced cross-modal retrieval hashing framework. Unlike prior works, our method integrates Graph Kolmogorov–Arnold Networks (GraphKAN) to improve cross-modal feature learning, employs contrastive learning for robust hash optimization, and utilizes a hypergraph structure to capture high-order semantic relationships. Specifically, our approach differs from existing graph-based methods by employing learnable B-spline activation functions instead of fixed ReLU activations and from contrastive learning methods by using hypergraph-guided positive pair selection rather than random sampling strategies.

2.2. Kolmogorov–Arnold Networks

Kolmogorov–Arnold Networks (KANs), inspired by the Kolmogorov–Arnold representation theorem, have emerged as a promising alternative to Multi-Layer Perceptrons (MLPs) [45]. Unlike MLPs, which employ fixed activation mechanisms at each node, KANs utilize adaptive, edge-specific activation functions modeled via univariate spline transformations. This design enables KANs to maintain competitive or even superior performance with fewer parameters while also improving interpretability through more intuitive visualization of learned behaviors. KANs have demonstrated their potential in many areas and represent a notable innovation in deep learning. GraphKAN was proposed to integrate KANs with graph neural networks to improve their feature extraction ability [31]. The integration of KANs into cross-modal retrieval systems is an emerging research area. By leveraging KANs' ability to learn flexible and interpretable activation functions, this approach holds promise for improving retrieval accuracy and providing more explainable results, addressing some of the shortcomings of current cross-modal hashing techniques. However, existing KAN applications focus primarily on single-modal tasks, and our work represents the first integration of GraphKAN with hypergraph structures specifically designed for cross-modal hashing.

3. Methodology

We present our designed framework for unsupervised cross-modal hashing in this section. Our approach is designed to address the limitations of existing cross-modal hashing methods in high-order structure modeling, semantic alignment, and feature representation learning. Figure 1 illustrates the workflow of our designed approach. The training process is summarized in Algorithm 1. A detailed analysis of the computational complexity of this algorithm is provided in Appendix A, including the time complexity (Appendix A.1) and space complexity (Appendix A.2).
Algorithm 1: UCGKANH Algorithm
1: Input: Image features $\mathbf{X}_I \in \mathbb{R}^{n \times d_I}$, text features $\mathbf{X}_T \in \mathbb{R}^{n \times d_T}$, hash length $r$, layers $L$, $\lambda$, $\tau$, clusters $K$
2: Output: Trained hash model parameters $\theta = \{\mathbf{W}_h, \mathbf{b}_h, \mathbf{W}^{(l)}, \mathbf{b}^{(l)}, c_k\}$ and hash codes $\mathbf{B} \in \{-1, 1\}^{n \times r}$
3: Normalize $\mathbf{X}_I, \mathbf{X}_T$ to $\hat{\mathbf{X}}_I, \hat{\mathbf{X}}_T$
4: Compute $\mathbf{S}_I = \hat{\mathbf{X}}_I \hat{\mathbf{X}}_I^\top$, $\mathbf{S}_T = \hat{\mathbf{X}}_T \hat{\mathbf{X}}_T^\top$
5: Compute $\mathbf{S}$ using Equations (1) and (2)
6: Build $\mathbf{A}_{\mathrm{init}}$ from $\mathbf{S}$ and the graph $G = (V, E, W)$ using Equations (3)–(5)
7: Cluster $\mathbf{S}$ into $K$ groups to form $E_H$ and set $w(e_k)$ using Equation (7)
8: Build the hypergraph incidence matrix $\mathbf{H}$ using Equation (8)
9: Compute $d(v)$ and $\delta(e)$ using Equation (9); form $\mathbf{D}_v$, $\mathbf{D}_e$
10: Initialize model parameters $\theta = \{\mathbf{W}_h, \mathbf{b}_h, \mathbf{W}^{(l)}, \mathbf{b}^{(l)}, c_k\}$
11: Update $\mathbf{Z}$ from $[\mathbf{X}_I; \mathbf{X}_T]$ via the hypergraph convolution layer using Equation (10)
12: for $l = 1$ to $L$ do
13:   Compute $\mathbf{Q}_i = \mathbf{W}_Q \mathbf{z}_i$, $\mathbf{K}_j = \mathbf{W}_K \mathbf{z}_j$, $\mathbf{V}_{ij} = \mathbf{W}_V \mathbf{z}_j$
14:   Update $\mathbf{A}_{ij} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_i \mathbf{K}_j^\top}{\sqrt{d_k}}\right) \cdot \mathbf{V}_{ij}$ using Equation (6)
15:   for each $v \in V$ do
16:     Compute $\mathbf{z}_v^{(l+1)}$ using Equations (11) and (12)
17:   end for
18: end for
19: Set $\mathbf{B} = \mathrm{sign}(\mathbf{Z}^{(L)} \mathbf{W}_h + \mathbf{b}_h)$ using Equation (13), where $\mathbf{Z}^{(L)} = [\mathbf{Z}_I^{(L)}; \mathbf{Z}_T^{(L)}]$
20: Split $\mathbf{B}$ into $\mathbf{B}_I$, $\mathbf{B}_T$
21: Define $P(v_i)$ using Equation (15)
22: Compute $\mathcal{L}_{\mathrm{hg\text{-}contrast}}$ using Equation (16)
23: Compute $\mathcal{L} = \mathcal{L}_{\mathrm{hg\text{-}contrast}} + \lambda \|\mathbf{Z}^{(L)} - \mathbf{B}\|_F^2$ using Equation (17)
24: Optimize $\mathcal{L}$ to update parameters $\theta$
25: return Trained model parameters $\theta$ and hash codes $\mathbf{B}$

3.1. Notation

Throughout this paper, bold uppercase letters (e.g., $\mathbf{M}$) are used to represent matrices, and bold lowercase letters (e.g., $\mathbf{a}$) denote vectors. Additionally, the Frobenius norm of a matrix is expressed as $\|\cdot\|_F$, while the ReLU activation is denoted by $\sigma(\cdot)$ in the hypergraph convolutional neural network (HGCN). We define the dataset as $\mathcal{D} = \{(x_I^i, x_T^i)\}_{i=1}^{n}$, representing a set of paired image and text samples. Specifically, the image and text features are denoted as $\mathbf{X}_I \in \mathbb{R}^{n \times d_I}$ and $\mathbf{X}_T \in \mathbb{R}^{n \times d_T}$, where $d_I$ and $d_T$ indicate the dimensionality of the image and text features, respectively, and $n$ denotes the total number of image–text sample pairs. The primary objective is to train the two hash functions $f_I(\mathbf{X}_I; \theta_I)$ and $f_T(\mathbf{X}_T; \theta_T)$, which produce a unified binary code $\mathbf{B} \in \{-1, 1\}^{n \times r}$, where $r$ is the hash code's bit length.

3.2. Model Architecture

3.2.1. Similarity Matrix and Graph Relation Construction

Given the image and text features $\mathbf{X}_I \in \mathbb{R}^{n \times d_I}$ and $\mathbf{X}_T \in \mathbb{R}^{n \times d_T}$, we compute the intra-modal similarity matrices $\mathbf{S}_I = \hat{\mathbf{X}}_I \hat{\mathbf{X}}_I^\top \in [-1, +1]^{n \times n}$ and $\mathbf{S}_T = \hat{\mathbf{X}}_T \hat{\mathbf{X}}_T^\top \in [-1, +1]^{n \times n}$, where $\hat{\mathbf{X}}_I$ and $\hat{\mathbf{X}}_T$ are the normalized features. Then, we linearly combine these self-similarity matrices using a weighting factor $\alpha_1$:
$\mathbf{S}_1 = \alpha_1 \mathbf{S}_I + (1 - \alpha_1) \mathbf{S}_T.$ (1)
To enhance the discrimination of similarity scores, we apply a Gaussian kernel weighting followed by an exponential transformation and normalization:
$\mathbf{S}_{\mathrm{weighted}} = \exp\!\left(-\frac{(1 - \mathbf{S}_1)^2}{2\sigma^2}\right), \quad \mathbf{S}_{\exp} = \exp(\mathbf{S}_{\mathrm{weighted}}), \quad \mathbf{S} = (1 - \alpha_2)\,\mathbf{S}_{\mathrm{weighted}} + \alpha_2\,\mathbf{S}_{\exp} \in \mathbb{R}^{n \times n},$ (2)
where $\sigma$ is a bandwidth parameter estimated as the median of pairwise distances in $\mathbf{S}_1$, regularized as $\sigma = \max(\mathrm{median}, \epsilon)$ with $\epsilon = 0.01$ to prevent degeneration, and $\alpha_2 \in [0, 1]$ controls the influence of the exponential transformation. This transformation enhances the discrimination between high and low similarity pairs while maintaining numerical stability. The parameter $\alpha_2$ balances between preserving local neighborhood structure and amplifying global discriminative properties.
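A minimal PyTorch sketch of this similarity construction follows; the function and variable names are ours rather than from any released code, and the default $\alpha$ values simply mirror the MIRFlickr-25K setting reported in Section 4.1.

```python
import torch
import torch.nn.functional as F

def fused_similarity(x_img, x_txt, alpha1=0.6, alpha2=0.2, eps=0.01):
    # L2-normalize features so inner products lie in [-1, 1]
    xi, xt = F.normalize(x_img, dim=1), F.normalize(x_txt, dim=1)
    s_img, s_txt = xi @ xi.T, xt @ xt.T              # intra-modal similarities
    s1 = alpha1 * s_img + (1 - alpha1) * s_txt       # Eq. (1): linear fusion

    # Eq. (2): Gaussian-kernel weighting with a median-based bandwidth
    dist = 1 - s1
    sigma = torch.clamp(dist.median(), min=eps)      # sigma = max(median, eps)
    s_weighted = torch.exp(-dist.pow(2) / (2 * sigma ** 2))
    s_exp = torch.exp(s_weighted)
    return (1 - alpha2) * s_weighted + alpha2 * s_exp

# toy usage: 8 paired samples with 512-d image and 300-d text features
S = fused_similarity(torch.randn(8, 512), torch.randn(8, 300))
```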
To leverage the computed similarity measures for cross-modal learning, we construct a comprehensive graph structure $G = (V, E, W)$. The vertex set $V$ comprises image and text instances: $V = \{v_1, v_2, \ldots, v_{2n}\}$, where $v_i$ corresponds to the $i$-th image feature for $i \leq n$ and to the $(i - n)$-th text feature for $i > n$. The edge set $E$ is established based on the similarity matrix $\mathbf{S}$, forming connections between both inter-modal and intra-modal nodes. Specifically, we create an edge $e_{ij} \in E$ between vertices $v_i$ and $v_j$ if their similarity $S_{ij}$ exceeds a threshold $a$:
$E = \{(v_i, v_j) \mid S_{ij} > a, \; v_i, v_j \in V\},$ (3)
where $a$ is dynamically determined as the median value of all similarities in $\mathbf{S}$. This threshold-based edge construction ensures that the graph captures meaningful semantic relationships while filtering out weak or potentially noisy connections.
The edge weight set W directly adopts the corresponding similarity values from S :
$W_{ij} = S_{ij}, \quad \forall (v_i, v_j) \in E,$ (4)
which measures how strongly the connected nodes are related. This simply assigns the strength of the connection (edge weight) between two items to their calculated similarity score. The resulting graph $G$ serves as the foundation for our GraphKAN layers and subsequent hypergraph enhancement. The adjacency matrix $\mathbf{A} \in \mathbb{R}^{2n \times 2n}$ of graph $G$ is initially set as follows:
$\mathbf{A}_{\mathrm{init},ij} = \begin{cases} S_{ij}, & \text{if } (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases}$ (5)
Here, this matrix formally represents the graph’s structure, where an entry shows the similarity if two items are connected by an edge and zero otherwise.
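A compact sketch of this threshold-based graph construction (Equations (3)–(5)), assuming the fused similarity matrix computed above; dropping self-loops is our assumption and is not stated in the text.

```python
import torch

def build_graph(S):
    """Edge set, edge weights, and initial adjacency from a fused similarity matrix S."""
    a = S.median()                                              # dynamic threshold a
    mask = (S > a) & ~torch.eye(S.size(0), dtype=torch.bool)    # keep S_ij > a, drop self-loops
    A_init = torch.where(mask, S, torch.zeros_like(S))          # Eq. (5)
    edges = mask.nonzero(as_tuple=False)                        # (v_i, v_j) pairs, Eq. (3)
    weights = S[mask]                                           # W_ij = S_ij, Eq. (4)
    return A_init, edges, weights
```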
This graph structure effectively encodes both the local pairwise relationships and the global semantic structure across modalities, offering abundant relational information to support downstream feature extraction and hash code construction. In the subsequent GraphKAN layers, A is further refined using a self-attention mechanism to dynamically learn the affinity scores. The updated affinity matrix is computed as follows:
$\mathbf{A}_{ij} = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_i \mathbf{K}_j^\top}{\sqrt{d_k}}\right) \cdot \mathbf{V}_{ij},$ (6)
where $d_k$ is the dimension of the key vectors. Here, $\mathbf{Q}_i$, $\mathbf{K}_j$, and $\mathbf{V}_{ij}$ are query, key, and value vectors derived from the node features $\mathbf{Z}_i$ and $\mathbf{Z}_j$ in the GraphKAN layer. By enabling $\mathbf{A}$ to adaptively focus on the most pertinent cross-modal links, this self-attention mechanism enhances the graph structure's resilience for hashing.
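One way to realize the attention-based refinement of Equation (6) in PyTorch is sketched below; for brevity it returns dense attention weights over all node pairs and leaves the value term to the later GraphKAN aggregation, which is a simplification of the formula above and our own design choice.

```python
import math
import torch
import torch.nn as nn

class AttentiveAffinity(nn.Module):
    """Scaled dot-product attention used to refine the affinity matrix A (cf. Eq. (6))."""
    def __init__(self, dim, d_k=64):
        super().__init__()
        self.Wq = nn.Linear(dim, d_k, bias=False)    # W_Q
        self.Wk = nn.Linear(dim, d_k, bias=False)    # W_K
        self.d_k = d_k

    def forward(self, Z):                            # Z: (2n, dim) node features
        Q, K = self.Wq(Z), self.Wk(Z)
        scores = Q @ K.T / math.sqrt(self.d_k)       # Q_i K_j^T / sqrt(d_k)
        return torch.softmax(scores, dim=-1)         # row-wise softmax -> refined A
```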

3.2.2. Hypergraph Enhancement

To further improve the modeling of complex relationships across modalities, we introduce a hypergraph-based enhancement that shares the same vertex set $V$ with the previously constructed graph $G$ but captures different relationship patterns. The hypergraph $\mathcal{H} = (V, E_H, W_H)$ is defined, where $E_H$ denotes the set of hyperedges that can connect multiple nodes simultaneously, and $W_H$ specifies the corresponding hyperedge weights.
Hyperedges in $E_H$ are designed to link groups of nodes that exhibit strong cross-modal semantic coherence. We achieve this by performing clustering on the similarity matrix $\mathbf{S}$ using a spectral clustering technique, which partitions the nodes into $K$ clusters (where $K$ is a tunable parameter). Each cluster is treated as a hyperedge $e_k \in E_H$, connecting all nodes within that group. The weight of each hyperedge $w(e_k) \in W_H$ is calculated as the mean similarity among the nodes it connects:
$w(e_k) = \frac{1}{|e_k|^2} \sum_{i, j \in e_k} S_{ij},$ (7)
where $|e_k|$ is the number of nodes in the hyperedge $e_k$. This formulation ensures that hyperedges connecting highly similar nodes receive larger weights, thereby strengthening the influence of coherent semantic clusters in the graph structure. The hypergraph's structure is encoded using the incidence matrix $\mathbf{H} \in \mathbb{R}^{|V| \times |E_H|}$, with entries defined as follows:
$H_{v,e} = \begin{cases} 1, & \text{if node } v \in e, \\ 0, & \text{otherwise.} \end{cases}$ (8)
Here, $v \in V$ and $e \in E_H$. Specifically, $H_{v,e} = 1$ indicates that node $v$ belongs to hyperedge $e$, and $H_{v,e} = 0$ otherwise. This matrix acts like a lookup table, indicating whether a specific node belongs to a particular hyperedge. The following formula is used to calculate the node degree $d(v)$ and hyperedge degree $\delta(e)$:
$d(v) = \sum_{e \in E_H} H_{v,e}, \qquad \delta(e) = \sum_{v \in V} H_{v,e}.$ (9)
The diagonal degree matrices for nodes and hyperedges are denoted as $\mathbf{D}_v$ and $\mathbf{D}_e$, respectively, where $\mathbf{D}_e(e, e) = \delta(e)$ and $\mathbf{D}_v(v, v) = d(v)$.
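A sketch of the clustering-based hypergraph construction (Equations (7)–(9)), using scikit-learn spectral clustering on the fused similarity matrix; clipping negative similarities before clustering and the function name are our assumptions.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def build_hypergraph(S, K=5):
    """Cluster S into K hyperedges; return incidence H, weights w, and degree vectors."""
    labels = SpectralClustering(n_clusters=K, affinity='precomputed',
                                random_state=0).fit_predict(np.clip(S, 0, None))
    n = S.shape[0]
    H = np.zeros((n, K))                            # incidence matrix, Eq. (8)
    w = np.zeros(K)                                 # hyperedge weights, Eq. (7)
    for k in range(K):
        members = np.where(labels == k)[0]
        H[members, k] = 1.0
        w[k] = S[np.ix_(members, members)].mean()   # mean similarity inside e_k
    d_v = H.sum(axis=1)                             # node degrees, Eq. (9)
    d_e = H.sum(axis=0)                             # hyperedge degrees, Eq. (9)
    return H, w, d_v, d_e
```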
To incorporate the hypergraph into the hashing process, we apply a hypergraph convolutional layer that aggregates information across hyperedges. The updated feature representation Z for nodes is obtained by
$\mathbf{Z} = \sigma\!\left(\mathbf{D}_v^{-\frac{1}{2}} \mathbf{H} \mathbf{W}_H \mathbf{D}_e^{-1} \mathbf{H}^\top \mathbf{D}_v^{-\frac{1}{2}} \mathbf{X}\right),$ (10)
where $\mathbf{X} = [\mathbf{X}_I; \mathbf{X}_T]$ is the concatenated feature matrix, $\mathbf{W}_H$ is the learnable weight matrix, and $\sigma(\cdot)$ is the ReLU activation function. This operation normalizes the hypergraph structure while propagating high-order relational information to refine the feature representations, which are then used to create the final hash codes $\mathbf{B}$ by feeding them into the hash function $f(\cdot)$. This hypergraph enhancement improves the quality of the unified hash code $\mathbf{B} \in \{-1, 1\}^{n \times r}$ for retrieval tasks by allowing the model to capture complex cross-modal dependencies.
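A direct translation of the hypergraph convolution in Equation (10) into PyTorch is shown below, assuming dense tensors; here the weight matrix is taken to be a diagonal matrix built from the hyperedge weights $w(e_k)$, whereas the text additionally treats it as learnable.

```python
import torch
import torch.nn.functional as F

def hypergraph_conv(X, H, w):
    """Z = ReLU(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X), cf. Eq. (10).
    X: (n, d) node features, H: (n, m) incidence matrix, w: (m,) hyperedge weights."""
    Dv_inv_sqrt = torch.diag(H.sum(dim=1).clamp(min=1e-6).pow(-0.5))  # node degrees, Eq. (9)
    De_inv = torch.diag(H.sum(dim=0).clamp(min=1e-6).pow(-1.0))       # hyperedge degrees
    W = torch.diag(w)
    return F.relu(Dv_inv_sqrt @ H @ W @ De_inv @ H.T @ Dv_inv_sqrt @ X)
```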

3.2.3. GraphKAN-Based Hashing

To further enhance the cross-modal hashing process by leveraging the expressive power of Kolmogorov–Arnold Networks (KANs) on graph structures, we introduce GraphKAN, a novel graph-based neural network tailored for unsupervised hash code learning. GraphKAN integrates the hypergraph-enhanced features with a KAN-based architecture to capture intricate non-linear relationships across modalities, resulting in unified hash codes B with improved quality.
The GraphKAN model operates on the cross-modal graph $G = (V, E, W)$ and the hypergraph $\mathcal{H} = (V, E_H, W_H)$ constructed earlier. The input feature matrix $\mathbf{X} = [\mathbf{X}_I; \mathbf{X}_T] \in \mathbb{R}^{2n \times d}$ (where $d = d_I + d_T$) is first processed by the hypergraph convolution to obtain the enhanced feature matrix $\mathbf{Z}$. This $\mathbf{Z}$ serves as the input to the GraphKAN layer. The GraphKAN layer replaces traditional graph convolution operations with KAN-based transformations. For each node $v \in V$, the feature update is defined as follows:
$\mathbf{z}_v^{(l+1)} = \sum_{u \in N(v)} \mathbf{A}_{vu} \cdot \phi_{vu}\!\left(\mathbf{z}_u^{(l)}, \mathbf{z}_v^{(l)}\right) \cdot \mathbf{W}^{(l)} + \mathbf{b}^{(l)},$ (11)
where $\phi_{vu}(\cdot, \cdot)$ is parameterized with $K = 5$ learnable B-spline basis functions $B_k(\cdot)$ with coefficients $c_k$. This learnable activation function enables adaptive feature transformations that can capture complex non-linear relationships between neighboring nodes beyond traditional fixed activations. The neighborhood of node $v$ is indicated by $N(v)$, which includes both graph edges $E$ and hypergraph hyperedges $E_H$; the feature vector of node $v$ at layer $l$ is $\mathbf{z}_v^{(l)}$; $\mathbf{W}^{(l)}$ is a learnable weight matrix; and $\mathbf{b}^{(l)}$ is a bias term. The KAN-based activation function $\phi_{vu}(\cdot, \cdot)$ models the pairwise interaction between nodes $u$ and $v$. Specifically, $\phi_{vu}$ is parameterized as a sum of B-spline functions:
$\phi_{vu}(\mathbf{z}_u, \mathbf{z}_v) = \sum_{k=1}^{K} c_k B_k(\mathbf{z}_u - \mathbf{z}_v),$ (12)
where $B_k(\cdot)$ are B-spline basis functions, $c_k$ are learnable coefficients, and $K$ is the number of spline components. This formulation enables the network to learn adaptive activation patterns based on the feature differences between neighboring nodes, allowing for more flexible and expressive cross-modal feature transformations than traditional fixed activation functions.
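A schematic GraphKAN-style layer in the spirit of Equations (11) and (12) is sketched below; to keep it short, Gaussian radial basis functions stand in for the B-spline basis $B_k$, the aggregation is dense over all node pairs, and all class names are ours, so this is an illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class KANEdge(nn.Module):
    """phi_vu(z_u, z_v) = sum_k c_k B_k(z_u - z_v), with an RBF basis standing in for B-splines."""
    def __init__(self, dim, K=5):
        super().__init__()
        self.centers = nn.Parameter(torch.linspace(-1.0, 1.0, K))  # basis centers (knots)
        self.coeff = nn.Parameter(torch.zeros(dim, K))              # coefficients c_k per channel
        self.width = 2.0 / (K - 1)

    def forward(self, diff):                                        # diff = z_u - z_v, shape (..., dim)
        basis = torch.exp(-((diff.unsqueeze(-1) - self.centers) / self.width) ** 2)
        return (basis * self.coeff).sum(dim=-1)                     # sum_k c_k B_k(.)

class GraphKANLayer(nn.Module):
    """z_v^(l+1) = sum_u A_vu * phi_vu(z_u, z_v) * W^(l) + b^(l), cf. Eq. (11)."""
    def __init__(self, dim, K=5):
        super().__init__()
        self.phi = KANEdge(dim, K)
        self.lin = nn.Linear(dim, dim)               # W^(l), b^(l)

    def forward(self, Z, A):                         # Z: (n, d) features, A: refined affinity
        diff = Z.unsqueeze(0) - Z.unsqueeze(1)       # diff[v, u] = z_u - z_v
        msg = self.phi(diff)                         # (n, n, d) edge activations
        agg = (A.unsqueeze(-1) * msg).sum(dim=1)     # weighted neighbor aggregation
        return self.lin(agg)
```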
The hash codes are generated from the output of the last GraphKAN layer $\mathbf{Z}^{(L)}$. To ensure the binary nature of $\mathbf{B}$, we apply a sign activation function followed by a quantization step:
$\mathbf{B} = \mathrm{sign}\!\left(\mathbf{Z}^{(L)} \mathbf{W}_h + \mathbf{b}_h\right),$ (13)
where $\mathbf{W}_h$ and $\mathbf{b}_h$ are learnable parameters of the hashing layer, and the $\mathrm{sign}(\cdot)$ function maps values to $\{-1, 1\}$.
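A minimal hashing head matching Equation (13); using tanh during training as a differentiable surrogate for sign is a common relaxation and our assumption, not something stated above.

```python
import torch
import torch.nn as nn

class HashHead(nn.Module):
    """Project final GraphKAN features to r-bit codes, cf. Eq. (13)."""
    def __init__(self, dim, r):
        super().__init__()
        self.proj = nn.Linear(dim, r)                # W_h, b_h

    def forward(self, Z, binary=False):
        out = self.proj(Z)
        # sign(.) at inference; tanh(.) as a smooth surrogate during training (assumed)
        return torch.sign(out) if binary else torch.tanh(out)
```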

3.2.4. Contrastive Learning for Cross-Modal Alignment

We integrate a contrastive learning technique into our framework to guarantee semantic alignment across the image and text modalities while maintaining the discriminative properties of the hash codes. This approach leverages the GraphKAN-enhanced features and the hypergraph structure to align cross-modal representations in the Hamming space.
For the image and text modalities, let $\mathbf{Z}_I^{(L)} \in \mathbb{R}^{n \times d}$ and $\mathbf{Z}_T^{(L)} \in \mathbb{R}^{n \times d}$ be the enhanced features derived from the final GraphKAN layer, respectively. The modality-specific hash codes generated from these features are $\mathbf{B}_I = \mathrm{sign}(\mathbf{Z}_I^{(L)} \mathbf{W}_h + \mathbf{b}_h)$ and $\mathbf{B}_T = \mathrm{sign}(\mathbf{Z}_T^{(L)} \mathbf{W}_h + \mathbf{b}_h)$, where $\mathbf{W}_h$ and $\mathbf{b}_h$ are shared parameters to ensure consistency across modalities. We define the contrastive loss $\mathcal{L}_{\mathrm{contrast}}$ based on the InfoNCE loss, where $\cos(\mathbf{b}_i, \mathbf{b}_j) = \frac{\mathbf{b}_i \cdot \mathbf{b}_j}{\|\mathbf{b}_i\| \|\mathbf{b}_j\|}$ is the cosine similarity. To reduce computational cost, negative samples in $\mathcal{L}_{\mathrm{hg\text{-}contrast}}$ are selected using a random sampling strategy with a fixed size of 100 per node. For a given image–text pair $(x_I^i, x_T^i)$, the contrastive loss is formulated as follows:
$\mathcal{L}_{\mathrm{contrast}} = \frac{1}{n} \sum_{i=1}^{n} \left[ -\log \frac{\exp(\cos(\mathbf{b}_I^i, \mathbf{b}_T^i)/\tau)}{\sum_{j=1}^{n} \exp(\cos(\mathbf{b}_I^i, \mathbf{b}_T^j)/\tau)} - \log \frac{\exp(\cos(\mathbf{b}_T^i, \mathbf{b}_I^i)/\tau)}{\sum_{j=1}^{n} \exp(\cos(\mathbf{b}_T^i, \mathbf{b}_I^j)/\tau)} \right],$ (14)
where $\cos(\cdot, \cdot)$ is the cosine similarity function, $\tau > 0$ is a temperature parameter regulating the concentration of the distribution, and $\mathbf{b}_I^i$ and $\mathbf{b}_T^i$ are the hash codes of the $i$-th image and text instance. The first term encourages the paired image–text instance $(\mathbf{b}_I^i, \mathbf{b}_T^i)$ to have comparable hash codes, while the second term ensures symmetry by considering the text-to-image alignment.
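A PyTorch sketch of the symmetric contrastive loss in Equation (14), implemented with cross-entropy over cosine similarities of the (relaxed) hash codes, where the $i$-th image and text form the positive pair; the function name is ours.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(b_img, b_txt, tau=10.0):
    """Symmetric InfoNCE over cosine similarities, cf. Eq. (14)."""
    sim = F.normalize(b_img, dim=1) @ F.normalize(b_txt, dim=1).T / tau  # cos(b_I^i, b_T^j) / tau
    targets = torch.arange(sim.size(0), device=sim.device)               # diagonal = paired samples
    return F.cross_entropy(sim, targets) + F.cross_entropy(sim.T, targets)
```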
To further enhance the contrastive learning process, we leverage the hypergraph H to guide the selection of positive and negative pairs. Specifically, hyperedges in E H naturally define clusters of semantically related nodes across modalities. For each node v i V , we define its positive set P ( v i ) as the set of nodes connected through the same hyperedge:
$P(v_i) = \{ v_j \mid \exists\, e \in E_H \text{ such that } H_{v_i,e} = 1 \text{ and } H_{v_j,e} = 1 \},$ (15)
and the negative set is $N(v_i) = V \setminus P(v_i)$. The hypergraph-guided contrastive loss is then modified to focus on these sets:
$\mathcal{L}_{\mathrm{hg\text{-}contrast}} = -\frac{1}{n} \sum_{i=1}^{n} \log \frac{\sum_{v_j \in P(v_i)} \exp(\cos(\mathbf{b}_i, \mathbf{b}_j)/\tau)}{\sum_{v_k \in V} \exp(\cos(\mathbf{b}_i, \mathbf{b}_k)/\tau)},$ (16)
where $\mathbf{b}_i$ is the hash code of node $v_i$; the summation over $P(v_i)$ encourages similarity within the hyperedge cluster, while the denominator includes all nodes to contrast against negatives.
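A sketch of the hypergraph-guided variant in Equation (16), where two nodes count as positives if they share at least one hyperedge in the incidence matrix; excluding the anchor itself from its positive set is our assumption.

```python
import torch
import torch.nn.functional as F

def hypergraph_contrastive(B, H, tau=10.0):
    """Hypergraph-guided contrastive loss, cf. Eq. (16). B: (n, r) relaxed codes, H: (n, m) incidence."""
    Bn = F.normalize(B, dim=1)
    sim = torch.exp(Bn @ Bn.T / tau)                 # exp(cos(b_i, b_k) / tau)
    pos_mask = ((H @ H.T) > 0).float()               # v_j in P(v_i): share >= 1 hyperedge
    pos_mask.fill_diagonal_(0)                       # exclude the anchor itself (assumed)
    pos = (sim * pos_mask).sum(dim=1).clamp(min=1e-12)
    denom = sim.sum(dim=1)                           # denominator over all nodes
    return -torch.log(pos / denom).mean()
```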

3.3. Overall Objective Function

The final objective function minimizes the difference between the binary hash codes and the continuous features by combining the contrastive loss with a quantization regularization term:
$\mathcal{L} = \mathcal{L}_{\mathrm{hg\text{-}contrast}} + \lambda \left\| \mathbf{Z}^{(L)} - \mathbf{B} \right\|_F^2,$ (17)
where $\mathbf{Z}^{(L)} = [\mathbf{Z}_I^{(L)}; \mathbf{Z}_T^{(L)}]$ is the concatenated feature matrix from the GraphKAN layer, $\mathbf{B} = [\mathbf{B}_I; \mathbf{B}_T]$ is the matrix of unified hash codes, and $\lambda > 0$ is a hyperparameter balancing the two terms. The Frobenius norm $\|\cdot\|_F$ ensures that the continuous features $\mathbf{Z}^{(L)}$ stay close to the binary hash codes $\mathbf{B}$, reducing quantization errors. This contrastive learning strategy, enhanced by hypergraph guidance, ensures that the hash codes capture both pairwise semantic alignment and higher-order structural relationships across modalities, leading to improved retrieval performance.
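Putting the pieces together, a sketch of the overall objective in Equation (17), reusing `hypergraph_contrastive` from the sketch above; applying tanh to obtain relaxed codes for the contrastive term is our assumption.

```python
import torch

def total_loss(Z_final, B, H, lam=0.1, tau=10.0):
    """L = L_hg-contrast + lambda * ||Z^(L) - B||_F^2, cf. Eq. (17)."""
    l_contrast = hypergraph_contrastive(torch.tanh(Z_final), H, tau)  # defined in the sketch above
    l_quant = (Z_final - B).pow(2).sum()                              # squared Frobenius norm
    return l_contrast + lam * l_quant
```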

4. Experiment

The experimental setup and evaluation outcomes of the proposed model on two cross-modal retrieval tasks—text-to-image (T→I) and image-to-text (I→T)—are detailed in this section.

4.1. Implementation Details

Experiments with UCGKANH are conducted on a workstation with a 12-core Intel(R) Core i7-9700K CPU and an NVIDIA RTX 2080Ti GPU. The implementation uses PyTorch 1.9.0 on Ubuntu 20.04 LTS. For MIRFlickr-25K, $\alpha_1 = 0.6$, $\alpha_2 = 0.2$, $\lambda = 0.1$, $\tau = 10$, $L = 3$, and $K = 5$. For NUS-WIDE, $\alpha_1 = 0.7$, $\alpha_2 = 0.3$, $\lambda = 0.001$, $\tau = 100$, $L = 3$, and $K = 6$. For MS COCO, $\alpha_1 = 0.5$, $\alpha_2 = 0.5$, $\lambda = 0.01$, $\tau = 0.1$, $L = 4$, and $K = 3$. The batch size is set to 64 for MIRFlickr-25K and NUS-WIDE and 128 for MS COCO. The learning rate is set to $1 \times 10^{-3}$ with the Adam optimizer.
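For convenience, the per-dataset settings above can be collected into a single configuration structure; the dictionary below simply restates the values reported in this subsection and is not part of the released code.

```python
# Hyperparameters reported in Section 4.1: alpha1, alpha2, lambda, tau, L, K, batch size
CONFIGS = {
    "mirflickr25k": dict(alpha1=0.6, alpha2=0.2, lam=0.1,  tau=10.0,  L=3, K=5, batch_size=64),
    "nus_wide":     dict(alpha1=0.7, alpha2=0.3, lam=1e-3, tau=100.0, L=3, K=6, batch_size=64),
    "ms_coco":      dict(alpha1=0.5, alpha2=0.5, lam=1e-2, tau=0.1,   L=4, K=3, batch_size=128),
}
LEARNING_RATE = 1e-3  # Adam optimizer
```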

4.2. Dataset Description

As common benchmarks in cross-modal retrieval studies, MIRFlickr-25K, NUS-WIDE, and MS COCO are three well-known datasets that we experimented with to assess the performance of our proposed model. MIRFlickr-25K comprises 25,000 image–tag pairs spanning 24 different concepts. After filtering out instances with fewer than 20 tags, we retained 20,015 pairs, with subsets of 2000 for query, 18,015 for retrieval, and 5000 for training from retrieval sets. NUS-WIDE contains a total of 269,648 image–text pairs categorized under 81 distinct concepts. We focused on 186,577 pairs belonging to the top 10 categories, partitioning them into 2000 query pairs, 184,577 sets for retrieval, and 5000 training pairs from retrieval sets. MS COCO features 123,287 image–text pairs distributed across 80 categories. For evaluation, we designated 2000 pairs for querying, the remaining 121,287 for retrieval, and 5000 for training from retrieval sets.

4.3. Baselines and Evaluation Criteria

We compare our unsupervised cross-modal retrieval hashing method with several of the top unsupervised hashing techniques in cross-modal retrieval to assess its performance. The selected baselines include Deep Binary Reconstruction (DBRC) [46], Correlation Identity Representation Hashing (CIRH) [47], Collective Matrix Factorization Hashing (CMFH) [48], Cross-View Hashing (CVH) [49], Aggregation-based Graph Convolutional Hashing (AGCH) [50], CLIP-based Fusion-modal Reconstructing Hashing (CFRH) [51], Unsupervised Deep Cross-Modal Hashing (UDCMH) [52], Inter-Media Hashing (IMH) [53], Deep Joint-Semantics Reconstructing Hashing (DJSRH) [54], Joint-modal Distribution-based Similarity Hashing (JDSH) [47], rapid image–text cross-modal hash retrieval (RICH) [55], Latent Semantic Sparse Hashing (LSSH) [56], Unsupervised Dual Deep Hashing (UDDH) [57], and Unsupervised Deep Hashing with Multiple Similarity Preservation (UMSP) [58]. For performance comparison, we adopt commonly used cross-modal retrieval metrics that rely on Hamming distance. The main evaluation measures include top-50 Mean Average Precision (mAP@50) and top-K retrieval accuracy.

4.4. Comparing with Baseline Methods

4.4.1. mAP Analysis

Our UCGKANH approach outperforms the top-listed methods in terms of retrieval accuracy across a range of hash bit lengths, as evidenced by the mAP results in Table 1. For the I→T task on the MIRFlickr-25K, NUS-WIDE, and MS COCO datasets, our UCGKANH approach improves retrieval accuracy by 0.7%∼1.4%, 0.7%∼2.6%, and 1.3%∼2.4%, respectively. Similarly, for the T → I task on these datasets, UCGKANH achieves gains of 0.4%∼1.1%, 0.4%∼3%, and 0.3%∼1.4%, respectively. These consistent improvements across datasets and hash bit lengths highlight the robustness of UCGKANH. The results underscore the effectiveness of our hypergraph-based approach, which excels at capturing high-order relationships and preserving cross-modal similarities. The significant mAP gains confirm that UCGKANH enhances retrieval performance reliably, making it a strong solution for cross-modal retrieval tasks.

4.4.2. Top-K Analysis

To assess the ranking quality of UCGKANH in retrieval tasks, we investigate the top-K precision curves in the context of a 128-bit hash code. UCGKANH consistently beats the baseline approaches on all datasets and tasks, as shown by the top-K precision curves. On MIRFlickr-25K (Figure 2a,d), UCGKANH exhibits a steep initial decline in precision, indicating high accuracy for the top-ranked results, and maintains a significant lead over other methods as K increases, especially in the I→T task. On NUS-WIDE (Figure 2b,e), UCGKANH shows a more gradual decline in precision, reflecting the dataset's diverse categories, yet it consistently achieves higher precision than baselines, particularly in the T→I task where its curve remains above others across all K values. On MS COCO (Figure 2c,f), UCGKANH demonstrates stable and superior performance, with a slower precision drop compared to baselines, highlighting its robustness in handling complex semantic structures for both tasks. Overall, the trends in the top-K precision curves underscore the effectiveness of UCGKANH in delivering high-quality retrieval results across varying dataset complexities. The method's ability to maintain higher precision at larger K values, especially on challenging datasets like NUS-WIDE and MS COCO, suggests that the integration of hypergraph enhancement, GraphKAN, and contrastive learning enables UCGKANH to better capture cross-modal semantic relationships and produce more discriminative hash codes compared to existing approaches.

4.5. Computational Efficiency Analysis

To evaluate the computational efficiency of our proposed UCGKANH, we compare its time cost with several existing cross-modal hashing methods, as well as an ablation variant, UCGKANH-GCN. The UCGKANH-GCN variant replaces the GraphKAN layers in UCGKANH with standard Graph Convolutional Network (GCN) layers, serving as a direct baseline to assess the efficiency of the GraphKAN component. All experiments are conducted using 128-bit hash codes.
We focus on two primary time-based metrics:
  • Total Training Time (s): The overall time required to train the model for 50 epochs.
  • Query Time (s): The total time taken to generate hash codes for all samples in the query set during the inference phase.
Table 2 presents the computational time cost comparison. The training times for UCGKANH and UCGKANH-GCN are reported as the total for 50 epochs. Data for DJSRH, AGCH, CIRH, and CAGAN [59] are sourced from relevant literature benchmarks under similar settings.
The results in Table 2 demonstrate the computational efficiency of our proposed UCGKANH approach. Compared to its GCN-based variant (UCGKANH-GCN), UCGKANH shows notable improvements in efficiency, reducing training time by approximately 23% across all three datasets. This improvement can be attributed to the B-spline activation functions in GraphKAN layers, which enable more efficient computation compared to standard GCNs while maintaining representation quality. For query time, UCGKANH also demonstrates better performance, being about 17–22% faster than UCGKANH-GCN across the three datasets. This efficiency gain during inference is particularly important for real-world retrieval systems where query response time affects user experience. When compared with other state-of-the-art methods, UCGKANH shows competitive or superior efficiency. For training time, UCGKANH is significantly faster than AGCH, DJSRH, and CAGAN, and even outperforms CIRH. Similarly, the query time of UCGKANH is faster than all baseline methods across all datasets.
These efficiency gains are particularly notable given that UCGKANH simultaneously achieves higher retrieval accuracy than these baselines, as demonstrated in previous sections. The results suggest that the integration of GraphKAN not only enhances the model’s representation capability but also improves computational efficiency through its adaptive activation functions, making it well-suited for large-scale cross-modal retrieval tasks.

4.6. Parameter Sensitivity Analysis

To evaluate the robustness of UCGKANH, a parameter sensitivity analysis is conducted on the hyperparameters $\lambda$ (quantization loss weight) and $\tau$ (temperature parameter), together with $L$, $K$, $\alpha_1$, and $\alpha_2$, using 128-bit hash codes on the MIRFlickr-25K dataset. The parameters $\lambda, \tau \in \mathbb{R}$ are tuned within the range $[10^{-5}, 10^{4}]$. The parameter $L \in \mathbb{N}$ is tuned within $[1, 5]$, and $K \in \mathbb{N}$ within $[2, 6]$. The parameters $\alpha_1, \alpha_2 \in \mathbb{R}$ are tuned within $[0.1, 1]$. We vary each parameter across a wide range while keeping the other parameters fixed ($\lambda = 0.1$, $\tau = 10$, $L = 3$, $K = 5$, $\alpha_1 = 0.6$, and $\alpha_2 = 0.2$). The impact is illustrated in Figure 3.
For λ (Figure 3a,g), the mAP increases with larger values, peaking at a moderate level, then gradually declines, indicating that a small λ fails to enforce quantization, while a large λ overemphasizes quantization at the expense of semantic alignment. Similarly, τ (Figure 3b,h) shows optimal mAP at a medium value, with performance dropping at both extremes, as a small τ sharpens the contrastive loss excessively, risking overfitting, and a large τ flattens the loss, reducing discriminative power. The trends for L (Figure 3c,i) and K (Figure 3d,j) reveal that mAP improves as these parameters increase, reaching a peak before stabilizing or slightly decreasing, suggesting that moderate values enhance feature learning and high-order relationship modeling, while excessive values may introduce noise or redundancy. In contrast, α 1 (Figure 3e,k) and α 2 (Figure 3f,l) exhibit minimal impact on mAP, with relatively flat curves across the tested range, indicating that the model is less sensitive to these weights for similarity combination and exponential transformation. Overall, UCGKANH shows greater sensitivity to λ , τ , L, and K, requiring careful tuning to balance quantization, semantic alignment, feature depth, and high-order relationships while demonstrating robustness to variations in α 1 and α 2 .

4.7. Ablation Study

Initially, we define the Base Model representing a fundamental baseline using a standard Graph Convolutional Network (GCN) without any of our proposed enhancements. Additionally, through a selective exclusion of key modules including hypergraph enhancement, GraphKAN, and contrastive learning, we conduct an ablation study to examine the individual effects of each component in our system. Three benchmark datasets—MIRFlickr-25K, NUS-WIDE, and MS COCO—are utilized for the experiments, and the hash code length is set to 128 bits, while results are demonstrated in Table 3. We define the following variants of our model for the ablation study, where “w/o” means “without”:
  • Base Model: This is a fundamental baseline using a standard Graph Convolutional Network (GCN) without any of our proposed enhancements. It directly concatenates image and text features $\mathbf{Z} = [\mathbf{X}_I; \mathbf{X}_T]$ and applies standard graph convolution with ReLU activation, optimizing only the quantization loss $\mathcal{L}_{\mathrm{base}} = \lambda \|\mathbf{Z}^{(L)} - \mathbf{B}\|_F^2$.
  • w/o HGNN: We remove the hypergraph enhancement module, which utilizes a Hypergraph Neural Network (HGNN), directly using the concatenated features $\mathbf{Z} = [\mathbf{X}_I; \mathbf{X}_T]$ as input to the GraphKAN layer. The contrastive learning step does not use hypergraph-guided positive pairs.
  • w/o GraphKAN: We replace the GraphKAN module with a standard graph convolutional network (GCN), where the feature update is simplified to $\mathbf{z}_v^{(l+1)} = \sigma\!\left(\sum_{u \in N(v)} \mathbf{A}_{vu} \mathbf{z}_u^{(l)} \mathbf{W}^{(l)}\right)$, with $\sigma$ being the ReLU activation.
  • w/o CL: We remove the contrastive learning (CL) loss, optimizing the model solely with the quantization loss $\mathcal{L} = \lambda \|\mathbf{Z}^{(L)} - \mathbf{B}\|_F^2$.
Table 3. Ablation study results showing mAP across three datasets with different hash code lengths.

| Task | Method | MIRFlickr-25K (16/32/64/128 bits) | NUS-WIDE (16/32/64/128 bits) | MS COCO (16/32/64/128 bits) |
|------|--------|-----------------------------------|------------------------------|------------------------------|
| I→T | Base Model | 0.798 / 0.821 / 0.835 / 0.842 | 0.721 / 0.743 / 0.756 / 0.768 | 0.752 / 0.789 / 0.805 / 0.819 |
| I→T | UCGKANH w/o HGNN | 0.859 / 0.898 / 0.932 / 0.939 | 0.783 / 0.817 / 0.829 / 0.845 | 0.845 / 0.893 / 0.913 / 0.927 |
| I→T | UCGKANH w/o GraphKAN | 0.856 / 0.893 / 0.928 / 0.932 | 0.784 / 0.812 / 0.834 / 0.843 | 0.831 / 0.889 / 0.914 / 0.928 |
| I→T | UCGKANH w/o CL | 0.869 / 0.896 / 0.917 / 0.928 | 0.782 / 0.811 / 0.829 / 0.837 | 0.809 / 0.872 / 0.916 / 0.923 |
| I→T | UCGKANH | 0.908 / 0.922 / 0.940 / 0.948 | 0.818 / 0.837 / 0.857 / 0.865 | 0.860 / 0.919 / 0.929 / 0.946 |
| T→I | Base Model | 0.762 / 0.784 / 0.798 / 0.806 | 0.695 / 0.718 / 0.731 / 0.745 | 0.721 / 0.765 / 0.782 / 0.798 |
| T→I | UCGKANH w/o HGNN | 0.837 / 0.869 / 0.876 / 0.895 | 0.782 / 0.793 / 0.807 / 0.819 | 0.839 / 0.915 / 0.927 / 0.932 |
| T→I | UCGKANH w/o GraphKAN | 0.817 / 0.862 / 0.883 / 0.891 | 0.781 / 0.793 / 0.813 / 0.818 | 0.841 / 0.896 / 0.927 / 0.935 |
| T→I | UCGKANH w/o CL | 0.842 / 0.865 / 0.883 / 0.891 | 0.771 / 0.796 / 0.805 / 0.811 | 0.836 / 0.871 / 0.917 / 0.929 |
| T→I | UCGKANH | 0.879 / 0.896 / 0.907 / 0.914 | 0.792 / 0.807 / 0.815 / 0.826 | 0.861 / 0.917 / 0.923 / 0.949 |
In summary, the ablation study demonstrates that our UCGKANH framework significantly outperforms the Base Model, highlighting the overall effectiveness of our approach. Furthermore, it confirms that all three components—hypergraph construction, GraphKAN, and contrastive learning—are essential for UCGKANH’s high retrieval performance. Among these, contrastive learning has the most significant impact when removed, followed by hypergraph enhancement and GraphKAN, underscoring their complementary roles in capturing high-order relationships, enhancing non-linear feature representation, and ensuring cross-modal semantic alignment.

4.8. Convergence Analysis

To examine the training dynamics of our model, we analyze the convergence curves of the normalization loss $\mathcal{L}$ over 100 epochs for 128-bit hash codes on MIRFlickr-25K, NUS-WIDE, and MS COCO, as shown in Figure 4. On MIRFlickr-25K (Figure 4a), the loss decreases rapidly from 0.8 to below 0.05 within 100 epochs, reflecting the dataset's simpler semantic structure. On NUS-WIDE (Figure 4b), the loss starts at 0.9 and converges to around 0.05, showing a slightly slower decline due to its diverse categories. On MS COCO (Figure 4c), the loss decreases more gradually from 1.0 to above 0.1, indicating higher complexity. These trends demonstrate that UCGKANH adapts effectively to varying dataset complexities, achieving stable convergence across all datasets.

4.9. Case Study

We conduct a case study on the T→I task using the MIRFlickr-25K dataset with 128-bit hash codes to further demonstrate the efficacy of UCGKANH in cross-modal retrieval. Two example queries are evaluated, and the top-30 retrieved images are displayed in Figure 5. The results demonstrate the model's ability to capture semantic relationships between text queries and images, reflecting its robustness in handling diverse concepts. For the first query, "blue cloud mountain still british columbia hiking" (Figure 5a), UCGKANH successfully retrieves images that align with the described scene, predominantly featuring mountainous landscapes with blue skies and clouds, which are consistent with the British Columbia setting. The retrieved images capture the essence of hiking environments, showcasing natural scenery with clear skies, lakes, and peaks, indicating that the model effectively bridges the modality gap by mapping the textual description to visually relevant content. This highlights the strength of the hypergraph enhancement and contrastive learning components in preserving high-order semantic relationships. For the second query, "books laptop mouse white pink phone college summer" (Figure 5b), UCGKANH retrieves images that reflect a college or study environment, including books, laptops, and related items, with some images incorporating white and pink elements. The results capture the summer college context by including bright, casual settings with study materials, demonstrating the model's capability to handle complex, multi-concept queries. Overall, these case studies underscore the effectiveness of UCGKANH in achieving semantically consistent cross-modal retrieval, leveraging its integrated components to align diverse textual and visual representations.

5. Conclusions and Future Improvements

We propose UCGKANH, a novel unsupervised cross-modal hashing framework that integrates hypergraph enhancement, GraphKAN, and contrastive learning to successfully close the modality gap between images and texts. Our approach leverages hypergraph structures to capture high-order semantic relationships, employs GraphKAN for non-linear feature representation, and aligns cross-modal representations in the Hamming space using contrastive learning. Numerous tests on the MIRFlickr-25K, NUS-WIDE, and MS COCO benchmark datasets show that UCGKANH achieves superior retrieval performance, with significant improvements in mAP across various hash code lengths, as well as stable convergence behavior as shown in the normalization loss curves.
For future improvements, we plan to explore the integration of adaptive hyperparameter tuning to further enhance the model's performance on diverse datasets. Additionally, incorporating pre-trained multimodal models, such as CLIP, could improve the initial feature representations, potentially leading to better hash code quality. Finally, extending UCGKANH to handle dynamic datasets with shifting data distributions could broaden its applicability to practical scenarios such as online cross-modal retrieval.

Author Contributions

H.L.: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Writing—Original Draft. S.S.: Methodology, Software, Validation, Writing—Review and Editing. Y.Z.: Data Curation, Visualization, Writing—Review and Editing. R.X.: Data Collection, Experimental Support, Technical Assistance. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hunan Provincial Natural Science Foundation of China (2024JJ6533), in part by the High Performance Computing Center of Central South University, and in part by Professor Jizhong Zhao from Xi’an Jiaotong University.

Data Availability Statement

The data will be made available upon reasonable request to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Computational Complexity Analysis

This section provides a detailed theoretical analysis of the computational complexity of the UCGKANH model. We break down the complexity of each major component, where $N$ is the number of nodes in the graph, $d_{in}$ and $d_{out}$ are the input and output feature dimensions, respectively, $L$ is the number of GraphKAN layers, $M$ is the number of hyperedges, $B$ is the batch size, and $K$ is the number of basis functions in each KAN layer.

Appendix A.1. Time Complexity Analysis

The overall time complexity is a summation of costs from several stages: hypergraph construction, GraphKAN layer processing, contrastive learning, and final hash code generation.
The initial hypergraph construction phase involves multiple steps. First, similarity computation between all $N$ nodes, typically using dot-product similarity, requires $O(N^2 \cdot d_{in})$ operations. Following this, top-$k$ selection is performed for each node to identify its closest neighbors, contributing $O(N \log k)$ operations per node, which aggregates to a total complexity of $O(N^2 \log k)$. Finally, hyperedge formation from these selected neighbors has a complexity of $O(M \cdot k)$; in typical cases, $M$ is on the order of $O(N)$. Thus, the overall complexity for hypergraph construction is $O(N^2 \cdot d_{in} + N^2 \log k)$.
Processing through the GraphKAN layers forms the core of the model. Each of the $L$ layers performs several computations. Basis function evaluation for each node across $K$ basis functions has a complexity of $O(N \cdot d_{in} \cdot K)$ per layer. Subsequently, message aggregation from neighbors, where $\bar{d}$ is the average node degree, involves $O(N \cdot \bar{d} \cdot d_{out})$ operations. A feature transformation, typically linear, adds $O(N \cdot d_{in} \cdot d_{out})$ operations. Consequently, for $L$ GraphKAN layers, the total complexity is $O(L \cdot N \cdot (K \cdot d_{in} + \bar{d} \cdot d_{out} + d_{in} \cdot d_{out}))$.
The contrastive learning component also contributes to the computational load. Identifying positive pairs using the hypergraph structure, termed positive pair identification, requires $O(M \cdot k^2)$ operations. Within a batch of size $B$, similarity computation between all pairs has a complexity of $O(B^2 \cdot d_{out})$. The loss computation, specifically for the InfoNCE loss, which includes softmax normalization, incurs $O(B^2)$ complexity. Therefore, the total complexity for contrastive learning sums to $O(M \cdot k^2 + B^2 \cdot d_{out})$.
Finally, the hash code generation stage involves two main operations. Feature projection of the final node features to $h$-bit hash codes requires $O(N \cdot d_{out} \cdot h)$ operations. This is followed by quantization, such as applying the sign function, which has a complexity of $O(N \cdot h)$.
Aggregating these components, the total computational complexity per training iteration, $O_{\mathrm{total}}$, is given by the sum of the complexities from hypergraph construction ($O_{\mathrm{hypergraph}}$), GraphKAN layers ($O_{\mathrm{GraphKAN}}$), contrastive learning ($O_{\mathrm{contrastive}}$), and hash code generation ($O_{\mathrm{hash}}$). This can be expressed as follows:
$O_{\mathrm{total}} = O_{\mathrm{hypergraph}} + O_{\mathrm{GraphKAN}} + O_{\mathrm{contrastive}} + O_{\mathrm{hash}} = O(N^2 \cdot d_{in}) + O(L \cdot N \cdot K \cdot d_{in}) + O(B^2 \cdot d_{out}) + O(N \cdot d_{out} \cdot h),$
where the final expression retains the dominant term from each component.

Appendix A.2. Space Complexity

The space complexity analysis considers the memory required for storing the model's structures and parameters. This includes hypergraph storage, where the adjacency information of $M$ hyperedges, each of approximate size $k$, requires $O(M \cdot k)$ space. Storing node features across $L$ layers, with $d_{max}$ as the maximum feature dimension, occupies $O(L \cdot N \cdot d_{max})$ space. The GraphKAN parameters themselves also contribute significantly: each of the $L$ layers stores $K$ basis functions, each with $d_{in} \times d_{out}$ parameters, resulting in $O(L \cdot K \cdot d_{in} \cdot d_{out})$ space. Furthermore, during the backpropagation phase, gradient storage requires additional space comparable to the parameters, i.e., $O(L \cdot K \cdot d_{in} \cdot d_{out})$. Therefore, the total space complexity of the UCGKANH model is $O(L \cdot N \cdot d_{max} + L \cdot K \cdot d_{in} \cdot d_{out} + M \cdot k)$.

References

  1. Wang, T.; Li, F.; Zhu, L.; Li, J.; Zhang, Z.; Shen, H.T. Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proc. IEEE 2024, 112, 1716–1754. [Google Scholar] [CrossRef]
  2. Bin, Y.; Li, H.; Xu, Y.; Xu, X.; Yang, Y.; Shen, H.T. Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023. [Google Scholar] [CrossRef]
  3. Huang, H.; Nie, Z.; Wang, Z.; Shang, Z. Cross-modal and uni-modal soft-label alignment for image-text retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; AAAI Press: Washington, DC, USA, 2024. [Google Scholar] [CrossRef]
  4. Liu, K.; Gong, Y.; Cao, Y.; Ren, Z.; Peng, D.; Sun, Y. Dual semantic fusion hashing for multi-label cross-modal retrieval. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI ’24), Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar] [CrossRef]
  5. Hu, Z.; Cheung, Y.M.; Li, M.; Lan, W. Cross-Modal Hashing Method With Properties of Hamming Space: A New Perspective. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 7636–7650. [Google Scholar] [CrossRef] [PubMed]
  6. Sun, Y.; Dai, J.; Ren, Z.; Chen, Y.; Peng, D.; Hu, P. Dual Self-Paced Cross-Modal Hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 26–27 February 2024; Volume 38, pp. 15184–15192. [Google Scholar] [CrossRef]
  7. Li, F.; Wang, B.; Zhu, L.; Li, J.; Zhang, Z.; Chang, X. Cross-Domain Transfer Hashing for Efficient Cross-Modal Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 9664–9677. [Google Scholar] [CrossRef]
  8. Li, B.; Li, Z. Large-Scale Cross-Modal Hashing with Unified Learning and Multi-Object Regional Correlation Reasoning. Neural Netw. 2024, 171, 276–292. [Google Scholar] [CrossRef]
  9. Chen, Y.; Long, J.; Guo, L.; Yang, Z. Supervised Semantic-Embedded Hashing for Multimedia Retrieval. Knowl.-Based Syst. 2024, 299, 112023. [Google Scholar] [CrossRef]
  10. Chen, H.; Zou, Z.; Liu, Y.; Zhu, X. Deep Class-Guided Hashing for Multi-Label Cross-Modal Retrieval. Appl. Sci. 2025, 15, 3068. [Google Scholar] [CrossRef]
  11. Wu, Y.; Li, B.; Li, Z. Revising similarity relationship hashing for unsupervised cross-modal retrieval. Neurocomputing 2025, 614, 128844. [Google Scholar] [CrossRef]
  12. Liu, H.; Xiong, J.; Zhang, N.; Liu, F.; Zou, X.; Köker, R. Quadruplet-Based Deep Cross-Modal Hashing. Intell. Neurosci. 2021, 2021, 9968716. [Google Scholar] [CrossRef]
  13. Shen, X.; Huang, Q.; Lan, L.; Zheng, Y. Contrastive transformer cross-modal hashing for video-text retrieval. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI ’24), Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar] [CrossRef]
  14. Zhang, M.; Li, J.; Zheng, X. Semantic embedding based online cross-modal hashing method. Sci. Rep. 2024, 14, 736. [Google Scholar] [CrossRef]
  15. Wu, R.; Zhu, X.; Yi, Z.; Zou, Z.; Liu, Y.; Zhu, L. Multi-Grained Similarity Preserving and Updating for Unsupervised Cross-Modal Hashing. Appl. Sci. 2024, 14, 870. [Google Scholar] [CrossRef]
  16. Su, H.; Han, M.; Liang, J.; Liang, J.; Yu, S. Deep supervised hashing with hard example pairs optimization for image retrieval. Vis. Comput. 2023, 39, 5405–5420. [Google Scholar] [CrossRef]
  17. Chen, Y.; Long, Y.; Yang, Z.; Long, J. Parameter Adaptive Contrastive Hashing for multimedia retrieval. Neural Netw. 2025, 182, 106923. [Google Scholar] [CrossRef] [PubMed]
  18. Qin, Q.; Huo, Y.; Huang, L.; Dai, J.; Zhang, H.; Zhang, W. Deep Neighborhood-Preserving Hashing With Quadratic Spherical Mutual Information for Cross-Modal Retrieval. IEEE Trans. Multimed. 2024, 26, 6361–6374. [Google Scholar] [CrossRef]
  19. Kang, X.; Liu, X.; Zhang, X.; Xue, W.; Nie, X.; Yin, Y. Semi-Supervised Online Cross-Modal Hashing. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 17770–17778. [Google Scholar] [CrossRef]
  20. Jiang, D.; Ye, M. Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 2787–2797. [Google Scholar] [CrossRef]
  21. Luo, H.; Zhang, Z.; Nie, L. Contrastive Incomplete Cross-Modal Hashing. IEEE Trans. Knowl. Data Eng. 2024, 36, 5823–5834. [Google Scholar] [CrossRef]
  22. Chen, B.; Wu, Z.; Liu, Y.; Zeng, B.; Lu, G.; Zhang, Z. Enhancing cross-modal retrieval via visual-textual prompt hashing. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI ’24), Jeju, Republic of Korea, 3–9 August 2024. [Google Scholar] [CrossRef]
  23. Zhu, J.; Ruan, X.; Cheng, Y.; Huang, Z.; Cui, Y.; Zeng, L. Deep Metric Multi-View Hashing for Multimedia Retrieval. In Proceedings of the 2023 IEEE International Conference on Multimedia and Expo (ICME), Brisbane, Australia, 10–14 July 2023; pp. 1955–1960. [Google Scholar] [CrossRef]
  24. Xie, X.; Li, Z.; Li, B.; Zhang, C.; Ma, H. Unsupervised cross-modal hashing retrieval via Dynamic Contrast and Optimization. Eng. Appl. Artif. Intell. 2024, 136, 108969. [Google Scholar] [CrossRef]
  25. Wang, J.; Shi, H.; Luo, K.; Zhang, X.; Cheng, N.; Xiao, J. RREH: Reconstruction Relations Embedded Hashing for Semi-paired Cross-Modal Retrieval. In Advanced Intelligent Computing Technology and Applications; Lecture Notes in Computer Science; Springer: Singapore, 2024; Volume 14879. [Google Scholar] [CrossRef]
  26. Jiang, X.; Hu, F. Multi-scale Adaptive Feature Fusion Hashing for Image Retrieval. Arab. J. Sci. Eng. 2024. [Google Scholar] [CrossRef]
  27. Li, Y.; Zhen, L.; Sun, Y.; Peng, D.; Peng, X.; Hu, P. Deep Evidential Hashing for Trustworthy Cross-Modal Retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 27 February–2 March 2025; AAAI Press: Washington, DC, USA, 2025; Volume 39, pp. 18566–18574. [Google Scholar] [CrossRef]
  28. Li, Y.; Long, J.; Huang, Y.; Yang, Z. Adaptive Asymmetric Supervised Cross-Modal Hashing with consensus matrix. Inf. Process. Manag. 2025, 62, 104037. [Google Scholar] [CrossRef]
  29. Yang, Y.; Wang, Y.; Wang, Y. SDA: Semantic Discrepancy Alignment for Text-conditioned Image Retrieval. In Findings of the Association for Computational Linguistics: ACL 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 5250–5261. [Google Scholar] [CrossRef]
  30. Liu, N.; Wu, G.; Huang, Y.; Chen, X.; Li, Q.; Wan, L. Unsupervised Contrastive Hashing With Autoencoder Semantic Similarity for Cross-Modal Retrieval in Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 6047–6059. [Google Scholar] [CrossRef]
  31. Zhang, F.; Zhang, X. GraphKAN: Enhancing Feature Extraction with Graph Kolmogorov Arnold Networks. arXiv 2024, arXiv:2406.13597. [Google Scholar]
  32. Jiang, Q.Y.; Li, W.J. Deep Cross-Modal Hashing. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 3270–3278. [Google Scholar] [CrossRef]
  33. Cao, Y.; Liu, B.; Long, M.; Wang, J. Cross-Modal Hamming Hashing. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  34. Tan, J.; Yang, Z.; Ye, J.; Chen, R.; Cheng, Y.; Qin, J.; Chen, Y. Cross-modal hash retrieval based on semantic multiple similarity learning and interactive projection matrix learning. Inf. Sci. 2023, 648, 119571. [Google Scholar] [CrossRef]
  35. Ng, W.W.Y.; Xu, Y.; Tian, X.; Wang, H. Deep supervised fused similarity hashing for cross-modal retrieval. Multimed. Tools Appl. 2024, 83, 86537–86555. [Google Scholar] [CrossRef]
  36. Li, A.; Li, Y.; Shao, Y. Federated learning for supervised cross-modal retrieval. World Wide Web 2024, 27, 41. [Google Scholar] [CrossRef]
  37. Shu, Z.; Bai, Y.; Yong, K.; Yu, Z. Deep Cross-Modal Hashing With Ranking Learning for Noisy Labels. IEEE Trans. Big Data 2025, 11, 553–565. [Google Scholar] [CrossRef]
  38. Chen, Y.; Long, Y.; Yang, Z.; Long, J. Unsupervised random walk manifold contrastive hashing for multimedia retrieval. Complex Intell. Syst. 2025, 11, 193. [Google Scholar] [CrossRef]
  39. Cui, J.; He, Z.; Huang, Q.; Fu, Y.; Li, Y.; Wen, J. Structure-aware contrastive hashing for unsupervised cross-modal retrieval. Neural Netw. 2024, 174, 106211. [Google Scholar] [CrossRef] [PubMed]
  40. Yao, D.; Li, Z.; Li, B.; Zhang, C.; Ma, H. Similarity Graph-correlation Reconstruction Network for unsupervised cross-modal hashing. Expert Syst. Appl. 2024, 237, 121516. [Google Scholar] [CrossRef]
  41. Meng, H.; Zhang, H.; Liu, L.; Liu, D.; Lu, X.; Guo, X. Joint-Modal Graph Convolutional Hashing for unsupervised cross-modal retrieval. Neurocomputing 2024, 595, 127911. [Google Scholar] [CrossRef]
  42. Sun, L.; Dong, Y. Unsupervised graph reasoning distillation hashing for multimodal hamming space search with vision-language model. Int. J. Multimed. Inf. Retr. 2024, 13, 16. [Google Scholar] [CrossRef]
  43. Chen, Y.; Long, Y.; Yang, Z.; Long, J. Unsupervised Adaptive Hypergraph Correlation Hashing for multimedia retrieval. Inf. Process. Manag. 2025, 62, 103958. [Google Scholar] [CrossRef]
  44. Zhong, F.; Chu, C.; Zhu, Z.; Chen, Z. Hypergraph-Enhanced Hashing for Unsupervised Cross-Modal Retrieval via Robust Similarity Guidance. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 3517–3527. [Google Scholar] [CrossRef]
  45. Liu, Z.; Wang, Y.; Vaidya, S.; Ruehle, F.; Halverson, J.; Soljačić, M.; Hou, T.Y.; Tegmark, M. KAN: Kolmogorov-Arnold Networks. arXiv 2025, arXiv:2404.19756. [Google Scholar]
  46. Hu, D.; Nie, F.; Li, X. Deep Binary Reconstruction for Cross-Modal Hashing. IEEE Trans. Multimed. 2019, 21, 973–985. [Google Scholar] [CrossRef]
  47. Zhu, L.; Wu, X.; Li, J.; Zhang, Z.; Guan, W.; Shen, H.T. Work Together: Correlation-Identity Reconstruction Hashing for Unsupervised Cross-Modal Retrieval. IEEE Trans. Knowl. Data Eng. 2023, 35, 8838–8851. [Google Scholar] [CrossRef]
  48. Ding, G.; Guo, Y.; Zhou, J. Collective Matrix Factorization Hashing for Multimodal Data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2083–2090. [Google Scholar] [CrossRef]
49. Kumar, S.; Udupa, R. Learning hash functions for cross-view similarity search. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Catalonia, Spain, 16–22 July 2011; pp. 1360–1365. [Google Scholar]
  50. Zhang, P.F.; Li, Y.; Huang, Z.; Xu, X.S. Aggregation-Based Graph Convolutional Hashing for Unsupervised Cross-Modal Retrieval. IEEE Trans. Multimed. 2022, 24, 466–479. [Google Scholar] [CrossRef]
  51. Liu, M.; Liu, Y.; Guo, M.; Longfei, M. CLIP-based Fusion-modal Reconstructing Hashing for Large-scale Unsupervised Cross-modal Retrieval. Int. J. Multimed. Inf. Retr. 2023, 12, 139–149. [Google Scholar] [CrossRef]
  52. Wu, G.; Lin, Z.; Han, J.; Liu, L.; Ding, G.; Zhang, B.; Shen, J. Unsupervised deep hashing via binary latent factor models for large-scale cross-modal retrieval. In Proceedings of the IJCAI International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018; pp. 2854–2860. [Google Scholar]
  53. Song, J.; Yang, Y.; Yang, Y.; Huang, Z.; Shen, H.T. Inter-media hashing for large-scale retrieval from heterogeneous data sources. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 22–27 June 2013; pp. 785–796. [Google Scholar] [CrossRef]
  54. Su, S.; Zhong, Z.; Zhang, C. Deep Joint-Semantics Reconstructing Hashing for Large-Scale Unsupervised Cross-Modal Retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3027–3035. [Google Scholar] [CrossRef]
  55. Li, B.; Yao, D.; Li, Z. RICH: A rapid method for image-text cross-modal hash retrieval. Displays 2023, 79, 102489. [Google Scholar] [CrossRef]
  56. Zhou, J.; Ding, G.; Guo, Y. Latent semantic sparse hashing for cross-modal similarity search. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, Washington, DC, USA, 14–18 July 2014; pp. 415–424. [Google Scholar] [CrossRef]
  57. Zhang, B.; Zhang, Y.; Li, J.; Chen, J.; Akutsu, T.; Cheung, Y.M.; Cai, H. Unsupervised Dual Deep Hashing With Semantic-Index and Content-Code for Cross-Modal Retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 387–399. [Google Scholar] [CrossRef]
  58. Xiong, S.; Pan, L.; Ma, X.; Beckman, E. Unsupervised deep hashing with multiple similarity preservation for cross-modal image-text retrieval. Int. J. Mach. Learn. Cybern. 2024, 15, 4423–4434. [Google Scholar] [CrossRef]
  59. Li, Y.; Ge, M.; Li, M.; Li, T.; Xiang, S. CLIP-Based Adaptive Graph Attention Network for Large-Scale Unsupervised Multi-Modal Hashing Retrieval. Sensors 2023, 23, 3439. [Google Scholar] [CrossRef]
Figure 1. The overall framework of the proposed UCGKANH method.
Figure 2. The top-K precision of the UCGKANH method with 128-bit hash codes on the MIRFlickr, NUS-WIDE, and MS COCO datasets.
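Curves such as those in Figure 2 can be computed directly from the learned binary codes. The snippet below is a minimal sketch of precision@K under the usual shared-label relevance convention (two items count as relevant if they share at least one label); the array names query_codes, db_codes, query_labels, and db_labels are hypothetical inputs used only for illustration and are not part of the UCGKANH implementation.

```python
import numpy as np

def topk_precision(query_codes, db_codes, query_labels, db_labels, ks):
    """Precision@K for hashing retrieval (illustrative sketch, assumed inputs).

    query_codes, db_codes: {0, 1} code matrices of shape (n, n_bits).
    query_labels, db_labels: multi-hot label matrices of shape (n, n_classes).
    A retrieved item is counted as relevant if it shares at least one label
    with the query.
    """
    precisions = np.zeros(len(ks), dtype=np.float64)
    for q in range(query_codes.shape[0]):
        # Hamming distance from this query to every database code.
        dist = (query_codes[q] != db_codes).sum(axis=1)
        order = np.argsort(dist, kind="stable")               # nearest first
        relevant = (db_labels[order] @ query_labels[q]) > 0   # shared label?
        for i, k in enumerate(ks):
            precisions[i] += relevant[:k].mean()
    return precisions / query_codes.shape[0]

# Example call with hypothetical arrays:
# curve = topk_precision(Bq, Bd, Lq, Ld, ks=[100, 500, 1000, 2000, 5000])
```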
Figure 3. The effects of the parameters with 128-bit hash code length on MIRFlickr-25K.
Figure 4. The convergence curves of UCGKANH with 128-bit hash codes on the three datasets.
Figure 5. Multimedia retrieval results on the text-to-image (T→I) task with a hash code length of 128 bits. (a) Query: "blue cloud mountain still be British-columbia Canada hiking". The retrieved images demonstrate strong semantic relevance to natural landscapes. (b) Query: "books laptop mouse white phone pink college summer". The retrieved results predominantly reflect indoor and educational scenes. Both (a) and (b) correspond to the T→I task conducted on the MIRFlickr-25K dataset.
Table 1. The mAP results of each method across the three datasets at code lengths of 16, 32, 64, and 128 bits ("–" indicates results that are not available).

Task | Method  | MIRFlickr-25K               | NUS-WIDE                    | MS COCO
     |         | 16     32     64     128    | 16     32     64     128    | 16     32     64     128
I→T  | CVH     | 0.606  0.599  0.596  0.598  | 0.372  0.362  0.406  0.390  | 0.505  0.509  0.519  0.510
     | IMH     | 0.612  0.601  0.592  0.579  | 0.470  0.473  0.476  0.459  | 0.570  0.615  0.613  0.587
     | LCMH    | 0.559  0.569  0.585  0.593  | 0.354  0.361  0.389  0.383  | –      –      –      –
     | CMFH    | 0.621  0.624  0.625  0.627  | 0.455  0.459  0.465  0.467  | 0.621  0.669  0.525  0.562
     | LSSH    | 0.584  0.599  0.602  0.614  | 0.481  0.489  0.507  0.507  | 0.652  0.707  0.746  0.773
     | DBRC    | 0.617  0.619  0.620  0.621  | 0.424  0.459  0.447  0.447  | 0.567  0.591  0.617  0.627
     | RFDH    | 0.632  0.636  0.641  0.652  | 0.488  0.492  0.494  0.508  | –      –      –      –
     | UDCMH   | 0.689  0.698  0.714  0.717  | 0.511  0.519  0.524  0.558  | –      –      –      –
     | DJSRH   | 0.810  0.843  0.862  0.876  | 0.724  0.773  0.798  0.817  | 0.678  0.724  0.743  0.768
     | AGCH    | 0.865  0.887  0.892  0.912  | 0.809  0.830  0.831  0.852  | 0.741  0.772  0.789  0.806
     | CIRH    | 0.901  0.913  0.929  0.937  | 0.815  0.836  0.854  0.862  | 0.797  0.819  0.830  0.849
     | RICH    | 0.869  0.875  0.908  0.925  | 0.790  0.806  0.842  0.852  | –      –      –      –
     | CFRH    | 0.902  0.914  0.936  0.945  | 0.807  0.824  0.854  0.859  | 0.845  0.895  0.916  0.928
     | UMSP    | 0.901  0.905  0.929  0.942  | 0.814  0.831  0.847  0.858  | –      –      –      –
     | UDDH    | 0.844  0.899  0.912  –      | 0.791  0.801  0.822  –      | –      –      –      –
     | UCGKANH | 0.908  0.922  0.940  0.948  | 0.818  0.837  0.857  0.865  | 0.860  0.919  0.929  0.946
T→I  | CVH     | 0.591  0.583  0.576  0.576  | 0.401  0.384  0.442  0.432  | 0.543  0.553  0.560  0.542
     | IMH     | 0.603  0.595  0.589  0.580  | 0.478  0.483  0.472  0.462  | 0.641  0.709  0.705  0.652
     | LCMH    | 0.561  0.569  0.582  0.582  | 0.376  0.387  0.408  0.419  | –      –      –      –
     | CMFH    | 0.642  0.662  0.676  0.685  | 0.529  0.577  0.614  0.645  | 0.627  0.667  0.554  0.595
     | LSSH    | 0.637  0.659  0.659  0.672  | 0.577  0.617  0.642  0.663  | 0.612  0.682  0.742  0.795
     | DBRC    | 0.618  0.626  0.626  0.628  | 0.455  0.459  0.468  0.473  | 0.635  0.671  0.697  0.735
     | RFDH    | 0.681  0.693  0.698  0.702  | 0.612  0.641  0.658  0.680  | –      –      –      –
     | UDCMH   | 0.692  0.704  0.718  0.733  | 0.637  0.653  0.695  0.716  | –      –      –      –
     | DJSRH   | 0.786  0.822  0.835  0.847  | 0.712  0.744  0.771  0.789  | 0.650  0.753  0.805  0.823
     | AGCH    | 0.829  0.849  0.852  0.880  | 0.769  0.780  0.798  0.802  | 0.746  0.774  0.797  0.817
     | CIRH    | 0.867  0.885  0.900  0.901  | 0.774  0.803  0.810  0.817  | 0.811  0.847  0.872  0.895
     | RICH    | 0.830  0.843  0.885  0.902  | 0.771  0.777  0.802  0.822  | –      –      –      –
     | CFRH    | 0.874  0.885  0.896  0.910  | 0.780  0.791  0.798  0.817  | 0.852  0.903  0.920  0.937
     | UMSP    | 0.862  0.866  0.879  0.886  | 0.772  0.783  0.794  0.805  | –      –      –      –
     | UDDH    | 0.835  0.858  0.869  –      | 0.771  0.785  0.802  –      | –      –      –      –
     | UCGKANH | 0.879  0.896  0.907  0.914  | 0.792  0.807  0.815  0.826  | 0.861  0.917  0.923  0.949
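The mAP values in Table 1 follow the standard evaluation protocol for cross-modal hashing: for each query, the database is ranked by Hamming distance over the binary codes, and precision is averaged over the ranks at which relevant items appear. The sketch below is one way to compute this metric; the input names and the shared-label relevance criterion are assumptions for illustration, not the authors' released evaluation code.

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels, topk=None):
    """Mean average precision for hashing retrieval (illustrative sketch, assumed inputs).

    Ranks the database by Hamming distance for every query and averages the
    precision at each rank where a relevant (label-sharing) item occurs.
    topk=None evaluates the full ranking (mAP@all).
    """
    ap_sum = 0.0
    n_queries = query_codes.shape[0]
    for q in range(n_queries):
        dist = (query_codes[q] != db_codes).sum(axis=1)       # Hamming distance
        order = np.argsort(dist, kind="stable")
        if topk is not None:
            order = order[:topk]
        relevant = (db_labels[order] @ query_labels[q]) > 0   # shared label?
        if relevant.any():
            hits = np.cumsum(relevant)
            ranks = np.arange(1, relevant.size + 1)
            ap_sum += np.mean(hits[relevant] / ranks[relevant])
        # Queries with no relevant item in the ranking contribute zero.
    return ap_sum / n_queries
```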
Table 2. Time cost comparison of UCGKANH, UCGKANH-GCN, and various baseline methods with 128-bit hash codes. Training times for UCGKANH and UCGKANH-GCN are totals over 50 epochs.

Method      | Total Training Time (s)             | Query Time (s)
            | MIRFlickr-25K   NUS-WIDE   MS COCO  | MIRFlickr-25K   NUS-WIDE   MS COCO
DJSRH       | 743.68          783.42     935.41   | 12.67           91.34      85.64
AGCH        | 826.32          865.92     958.96   | 25.36           152.98     108.65
CIRH        | 309.86          304.43     377.33   | 11.35           93.74      72.12
CAGAN       | 817.74          861.13     947.37   | 20.15           112.84     94.30
UCGKANH-GCN | 279.15          263.17     347.42   | 11.84           106.48     83.26
UCGKANH     | 215.62          203.21     268.48   | 9.26            88.53      69.05
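The query times in Table 2 largely reflect how cheap Hamming ranking becomes once 128-bit codes are packed into bytes. The sketch below shows one plausible way to implement such a query step in NumPy (XOR followed by a byte-wise popcount); the function and variable names are hypothetical and independent of the UCGKANH implementation.

```python
import numpy as np

# Precomputed popcount (number of set bits) for every possible byte value.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)

def pack_codes(binary_codes):
    """Pack {0, 1} codes of shape (n, 128) into (n, 16) uint8 words."""
    return np.packbits(binary_codes.astype(np.uint8), axis=1)

def hamming_rank(packed_query, packed_db, topk=100):
    """Return indices of the topk database codes closest to one packed query.

    Assumes topk is smaller than the number of database items.
    """
    xor = np.bitwise_xor(packed_query[None, :], packed_db)    # differing bits, (n, 16)
    dist = _POPCOUNT[xor].sum(axis=1)                         # Hamming distance per item
    return np.argpartition(dist, topk)[:topk]                 # top-k indices (unsorted)

# Example usage with hypothetical code matrices Bq of shape (1, 128) and Bd of shape (n, 128):
# idx = hamming_rank(pack_codes(Bq)[0], pack_codes(Bd), topk=100)
```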
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
