CLIP-Based Adaptive Graph Attention Network for Large-Scale Unsupervised Multi-Modal Hashing Retrieval

With the proliferation of multi-modal data generated by various sensors, unsupervised multi-modal hashing retrieval has been extensively studied due to its advantages in storage, retrieval efficiency, and label independence. However, two obstacles remain for existing unsupervised methods: (1) they cannot fully capture the complementary and co-occurrence information of multi-modal data, which leads to inaccurate similarity measures; and (2) they suffer from unbalanced multi-modal learning, and the semantic structure of the data is corrupted during hash code binarization. To address these obstacles, we devise an effective CLIP-based Adaptive Graph Attention Network (CAGAN) for large-scale unsupervised multi-modal hashing retrieval. Firstly, we use the multi-modal model CLIP to extract fine-grained semantic features, mine similarity information from different perspectives of the multi-modal data, and perform similarity fusion and enhancement. In addition, this paper proposes an adaptive graph attention network to assist the learning of hash codes, which uses an attention mechanism to learn adaptive graph similarity across modalities. It further aggregates the intrinsic neighborhood information of neighboring data nodes through a graph convolutional network to generate more discriminative hash codes. Finally, this paper employs an iterative approximate optimization strategy to mitigate the information loss in the binarization process. Extensive experiments on three benchmark datasets demonstrate that the proposed method significantly outperforms several representative hashing methods in unsupervised multi-modal retrieval tasks.


Introduction
With advances in sensing technology and the proliferation of multi-modal data from different sources (e.g., images, voice and video), it is essential to analyze and process these cross-modal data. People are no longer content with a single form of access to data, which makes it urgent to retrieve multimedia data swiftly and efficiently. However, multi-modal data are massive, heterogeneous and high-dimensional, and retrieving them consumes a great deal of time and storage space [1]. Therefore, it is crucial to reduce the storage requirements of multi-modal data and improve retrieval performance. Among the many retrieval methods, hashing-based retrieval [2,3] has attracted extensive research for its storage and retrieval efficiency.
The basic idea of cross-modal hash retrieval [4,5] is to use the paired sample information of different modalities to learn a hash transform for each modality and map the data of different modalities into a Hamming binary space [6] while preserving the similarity of the data during the mapping: data whose original semantics are more similar are projected to hash codes that are closer together in the common Hamming space. Fast cross-modal retrieval is then performed in the Hamming space. Figure 1 illustrates the cross-modal hash retrieval task.

Figure 1. A brief illustration of multi-modal hashing retrieval. The cross-modal hashing method maps the original data into a unified Hamming (binary code) space while preserving the semantic similarity of the data: the more semantically similar two multimedia items are, the closer their hash codes lie in the common Hamming space, and vice versa.

Cross-modal hashing can be classified into two categories: supervised and unsupervised methods. Supervised methods [7][8][9] use semantic labels to bridge the heterogeneity and semantic gaps and often achieve better retrieval accuracy. Jiang et al. first proposed deep cross-modal hashing (DCMH) [7], which integrates feature learning and hash code learning into an end-to-end framework. However, large amounts of manually annotated label information are expensive, noisy, and difficult to obtain in practical scenarios. Unsupervised methods [10][11][12][13] eliminate the dependence on label information and only consider paired multimedia data. Due to the lack of label information for joint training, unsupervised cross-modal hashing methods suffer from inaccurate training objectives and limited retrieval accuracy. Unsupervised methods have been explored far less than supervised methods, and this work aims to enhance the retrieval performance of cross-modal hashing under unsupervised conditions.

In recent years, owing to the powerful feature extraction capabilities of deep neural networks [14][15][16], deep-learning-based unsupervised cross-modal hash retrieval methods have made great progress. Deep unsupervised cross-modal hashing methods [11,17,18] use deep neural networks to extract feature representations of different modalities and establish high-level semantic associations between modalities, thereby achieving large performance improvements. Liu et al. proposed Joint-modal Distribution-based Similarity Hashing (JDSH) [11], which preserves the semantic relevance between data by constructing a joint-modal similarity matrix and designing a similarity weighting scheme.
Although these unsupervised methods achieve impressive performance, most of them suffer from inaccurate similarity measures and modality imbalance, leading to sub-optimal retrieval performance. In particular, it is difficult to comprehensively measure complex data correlations with the simple features of individual modalities. Moreover, the original semantic structure is damaged and information is lost when real-valued representations are binarized into hash codes. In addition, multi-modal learning suffers from imbalance problems due to the modality gap and data bias [19,20], and training efficiency is still limited.
To address these issues, we propose a novel CLIP-based Adaptive Graph Attention Network (CAGAN) for large-scale unsupervised multi-modal hashing retrieval. The framework is shown in Figure 2, and the contributions are as follows:
• We propose a novel unsupervised cross-modal hashing method that uses the multi-modal model CLIP (Contrastive Language-Image Pre-Training) [21,22] to extract cross-modal features and designs a cross-modal similarity enhancement module to integrate the similarity information of different modalities, thereby providing a better supervisory signal for hash code learning.
• To alleviate the problem of unbalanced multi-modal learning, an adaptive graph attention module is designed to act as an auxiliary network that assists the learning of the hash functions. The module employs an attention mechanism to enhance similar information and suppress irrelevant information, and it mines graph neighborhood correlations through graph convolutional networks.
• In addition, an iterative approximate optimization strategy is used to reduce the information loss in the hash code binarization process. Extensive experiments on three benchmark datasets show that the proposed method outperforms other state-of-the-art deep unsupervised cross-modal hashing methods.

Related Work
We briefly introduce related multi-modal hashing work in this section, including deep unsupervised multi-modal hashing, attention-based hashing, and graph-based methods.

Deep Unsupervised Multi-Modal Hashing
Cross-modal hashing methods fall into two categories: supervised methods for labeled multi-modal data and unsupervised methods for paired multimedia data. Unsupervised hashing methods have greater research value and application prospects because of their label independence. Deep neural networks [23,24] have shown a remarkable ability to encode deep features of different modalities; thus, deep unsupervised cross-modal hashing retrieval has attracted increasing research attention. One of the most representative works is deep joint-semantics reconstructing hashing (DJSRH) [10], which reconstructs multi-modal hash codes from a joint semantic affinity matrix that unifies the similarity relations of different modalities. High-order non-local hashing (HNH) [25] constructs a more comprehensive similarity matrix by considering the similarity relationships between multi-modal data from both local and non-local perspectives. DGCPN, proposed by Yu et al. [17], preserves the consistency of graph neighbors by integrating information between data and their neighbors and moderates the combined similarity-preserving loss using three different forms of data similarity. Deep adaptively enhanced hashing (DAEH) [26] proposes a strategy with discriminative similarity guidance and adaptive enhancement optimization that uses information theory to discover weaker hash functions and augment them with additional teacher networks. However, these unsupervised methods suffer from inaccurate similarity measures, which limits retrieval performance. Inspired by vision-and-language pre-training (VLP) and related works [21,27,28], we extract cross-modal features using contrastive language-image pre-training (CLIP), which uses the Transformer [16] to achieve fine-grained semantic alignment of image patches and text words and employs contrastive learning for large-scale training. Furthermore, a multi-modal similarity enhancement module is designed to fuse and enhance the similarity information of different modalities, which effectively alleviates the inaccurate similarity measurement of multi-modal data.

Attention-Based Methods
Recently, attention mechanisms [16,29] have attracted extensive interest due to their strong performance in various domains, such as machine translation and image processing. By focusing on the information that is most critical to the current target among many inputs, and reducing or even filtering out attention to irrelevant information, the attention mechanism alleviates information redundancy and improves the efficiency and accuracy of task processing. In recent years, attention-based cross-modal retrieval methods [30][31][32] have begun to be explored. Attention-aware deep adversarial hashing (ADAH) [33] proposes an adversarial hashing network with an attention mechanism that enhances the measurement of content similarity by selectively attending to the informative parts of multi-modal data. The self-constraining and attention-based hashing network (SCAHN) [29] proposes bit-scalable cross-modal hashing that incorporates early and late label constraints into both hash code learning and hash representations. Attention-guided semantic hashing (AGSH) [30] adopts an attention mechanism that attends to the associated features; it preserves the semantic information in different modal features through the attention module so as to construct an attention-aware semantic affinity matrix. However, unsupervised cross-modal hash retrieval based on attention mechanisms has rarely been explored.

Graph-Based Methods
Graph Convolutional Networks (GCNs) [34] have shown excellent performance in learning representations of graph-structured data and have generated extensive research interest in areas such as intelligent transportation, social networks and pharmaceutical research. Graph neural networks [34,35] use a recursive neighborhood aggregation strategy to compute the features of each data node. In recent years, cross-modal hashing methods based on GCNs have received extensive attention [36,37]. In particular, graph convolutional network hashing (GCNH) [38] introduces an asymmetric graph convolution layer that addresses the problems of scalability and out-of-sample extension when exploiting affinity graphs for hashing. Graph convolutional multi-modal hashing (GCMH) [39] proposes multiple modality-individual GCNs under semantic guidance that act on each modality independently to preserve intra-modality similarity and then fuses the output representations into a fusion graph with an adaptive weighting scheme. Aggregation-based graph convolutional hashing (AGCH) [36] designs an aggregation strategy that leverages multiple similarity measures to build an accurate semantic similarity matrix and employs graph convolutional networks to aggregate similarity information across modalities, further mining the semantic relevance of different modal data. However, these methods cannot comprehensively utilize the features of different modalities to build semantic affinity graphs, resulting in inaccurate relationships between data nodes. Hence, an adaptive graph attention module is designed to address this problem. It uses an attention mechanism to learn a semantic affinity graph and aggregates information between similar nodes through graph convolution, thereby enabling similar data to generate more consistent hash codes. In addition, using GCNs to assist the learning of the hash network can effectively alleviate the problem of multi-modal learning imbalance [40].

Methodology
In this section, we will elaborate on the proposed CAGAN model, including the following subsections: problem definition and notation, an overview of the model framework, objective function and optimization of the network. It is worth noting that our approach uses batch training and the variables will be represented in a batch manner.

Notation and Problem Definition
To better understand the cross-modal retrieval task and the proposed method, we first introduce the notation used in this paper. Given a cross-modal dataset in which v_i and t_i denote an image-text pair, we divide the data into mini-batches of training samples O = {o_1, o_2, ..., o_j}. For each randomly sampled batch of training samples {o_k = [v_k, t_k]}_{k=1}^{m}, where m denotes the batch size, we use F_v ∈ R^{m×512} and F_t ∈ R^{m×d_t} for the visual and textual representations. Meanwhile, we denote the hash codes generated by the hash coding network as B_v ∈ {−1, +1}^{m×c} and B_t ∈ {−1, +1}^{m×c}, and the hash codes generated by the graph convolutional network as B^g_v ∈ {−1, +1}^{m×c} and B^g_t ∈ {−1, +1}^{m×c}, where c is the length of the hash code. In the phase of building the similarity matrices, we first normalize F_v and F_t to F̄_v and F̄_t; then, we use cosine similarity to calculate the visual and textual modality similarity matrices S_v = cos(F̄_v, F̄_v) and S_t = cos(F̄_t, F̄_t), which describe the inherent similarity of the original image and text data. Furthermore, we can regard the generated hash codes B_v and B_t as feature vectors lying on the vertices of a high-dimensional hypercube. From this perspective, neighboring vertices correspond to similar hash codes; that is, the Hamming distance between two hash codes can be expressed through their cosine angular distance. The cosine distance of vectors x and y is defined as

cos(x, y) = (x^T y) / (‖x‖_2 ‖y‖_2),

where ‖·‖_2 denotes the l_2-norm of a vector. The cosine matrix of the samples reflects the cosine similarity relation between the hash codes, which is equivalent to their Hamming distance relation discussed below. The Hamming distance is the number of positions at which two binary codes of equal length differ and can be computed from the dot product of the codes. Given two hash codes h_i and h_j, the Hamming distance is

D_H(h_i, h_j) = (c − h_i^T h_j) / 2,

where c is the length of the hash code and h_i^T h_j is the dot product of the hash codes h_i and h_j. Cross-modal hashing improves retrieval speed and reduces storage consumption by projecting data of different modalities into a unified Hamming space; importantly, the original semantic similarity of the data is preserved in this projection.
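To make the relation between the dot product of binary codes and the Hamming distance concrete, the following minimal PyTorch sketch (our own illustration, not the authors' released code) computes a pairwise cosine similarity matrix for a batch of features and the Hamming distance between {−1, +1} hash codes:

```python
import torch
import torch.nn.functional as F

def cosine_similarity_matrix(feats: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarity S = F_norm @ F_norm^T for a batch of row features."""
    feats_norm = F.normalize(feats, p=2, dim=1)   # l2-normalize each row
    return feats_norm @ feats_norm.t()

def hamming_distance(b_i: torch.Tensor, b_j: torch.Tensor) -> torch.Tensor:
    """Hamming distance between batches of {-1,+1} codes: D_H = (c - <b_i, b_j>) / 2."""
    c = b_i.shape[-1]
    return 0.5 * (c - b_i @ b_j.t())

# toy usage with a batch of 4 random 16-bit codes
codes = torch.randint(0, 2, (4, 16)).float() * 2 - 1   # values in {-1, +1}
print(hamming_distance(codes, codes))                   # diagonal is 0
```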

Framework Overview
As depicted in Figure 2, the CAGAN framework is an end-to-end model, which includes four main modules, i.e., the deep feature encoding module, the multi-perspective similarity aggregation module, the adaptive graph attention module, and the hash code reconstruction module. We will elaborate on the implementation process of each module below.
Deep feature encoding module. The deep encoding module contains two networks: a visual encoding network and a text encoding network. The vision-language pre-trained (VLP) model represented by CLIP has proven effective at learning both textual and visual representations. In this paper, we adopt the CLIP encoder and multi-layer perceptrons (MLPs) as the backbone network, which can extract richer cross-modal semantic features. We denote the visual encoder as Enc_v and the textual encoder as Enc_t:

F_v = Enc_v(V; θ_v),    F_t = Enc_t(T; θ_t),

where V and T represent batches of image and text training samples, and θ_v and θ_t represent the parameters of the visual and textual feature encoding networks. Then, we use MLPs to learn the hash functions, relaxing the discrete sign function with a scaled hyperbolic tangent:

B_* = tanh(α · MLP_*(F_*; θ_{H*})),    * ∈ {v, t},

where α denotes the number of iterations. As the number of iterations increases, the hyperbolic tangent function converges to the sign function: lim_{α→∞} tanh(αx) = sign(x). In this way, we can encode the rich semantic features of different modalities to better describe the semantic similarity between the original data and further guide the learning of hash codes.
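As a hedged illustration of this relaxation (our own sketch; the exact schedule for α is not specified in the text), the snippet below shows how increasing α makes tanh(αx) approach sign(x):

```python
import torch

x = torch.linspace(-1.0, 1.0, steps=5)             # pre-binarization activations
for alpha in (1, 5, 50):                            # alpha grows as training iterations increase
    relaxed = torch.tanh(alpha * x)                 # differentiable surrogate for sign(x)
    print(alpha, [round(v, 3) for v in relaxed.tolist()])
print("sign:", torch.sign(x).tolist())              # discrete limit as alpha -> infinity
```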
This iterative approximate optimization strategy is used to mitigate information loss in the hash code binarization process.
Multi-modal similarity enhancement module. Unsupervised hashing methods cannot construct a multi-label similarity matrix to guide the learning of hash codes because sample labels are unavailable. As described in [31,38,41], building a similarity matrix from deep features to capture the complementary and co-occurrence information of the original data is an effective alternative, which provides self-supervision for the learning of the hash functions. In particular, we use the mini-batch visual features F_v = {f^v_i}_{i=1}^{m} ∈ R^{m×512} to compute the image cosine similarity matrix S_v, and for the textual modality we directly leverage the bag-of-words features F_t = {f^t_i}_{i=1}^{m} ∈ R^{m×d_t} to compute the text cosine similarity matrix S_t. Subsequently, we construct a cross-modal similarity matrix to capture the co-occurrence similarity of instances across modalities. In particular, the visual modality similarity matrix S_v and the textual modality similarity matrix S_t are fused into a cross-modal cosine similarity matrix S_c that preserves the co-occurrence information between image and text instances, where (·)^T indicates matrix transposition in the fusion. In addition, we construct a semantically preserved affinity matrix S_A that integrates the information of the different matrices:

S_A = η S_v + β S_t + λ S_c,

where η, β, λ are balancing hyper-parameters that trade off the importance of the similarity matrices of the image and text modalities. Finally, we perform similarity enhancement on the fused affinity matrix S_A by thresholding it with s_max, s_min and s_mean, which denote the maximum, minimum and mean of the similarity matrix, respectively, yielding the enhanced matrix S_E. Compared with previous unsupervised methods, this similarity enhancement pulls similar data closer and pushes dissimilar data further apart by setting a threshold, thereby providing a better supervision signal for the learning of hash codes.
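A minimal sketch of this module is given below. The paper's exact fusion and enhancement formulas are not reproduced in the text, so the symmetric fusion used for S_c and the mean-based threshold used in the enhancement step are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def affinity_matrix(feat_v, feat_t, eta=1.0, beta=1.0, lam=0.1):
    """Build a fused affinity matrix S_A = eta*S_v + beta*S_t + lam*S_c."""
    s_v = F.normalize(feat_v, dim=1) @ F.normalize(feat_v, dim=1).t()  # image cosine similarity
    s_t = F.normalize(feat_t, dim=1) @ F.normalize(feat_t, dim=1).t()  # text cosine similarity
    s_c = 0.5 * (s_v @ s_t.t() + s_t @ s_v.t())         # assumed symmetric cross-modal fusion
    return eta * s_v + beta * s_t + lam * s_c

def enhance(s_a):
    """Illustrative enhancement: push entries above the mean up, the rest down."""
    s_mean = s_a.mean()
    s_max, s_min = s_a.max(), s_a.min()
    high = (s_a - s_mean) / (s_max - s_mean + 1e-8)      # rescale similar pairs toward +1
    low = (s_a - s_mean) / (s_mean - s_min + 1e-8)       # rescale dissimilar pairs toward -1
    return torch.where(s_a >= s_mean, high, low)
```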
Adaptive graph attention module. The module employs an attention mechanism to learn an adaptive cross-modal similarity matrix from the projected features, where W_v and W_t represent the projection matrices of the visual and textual modalities and γ is a trade-off parameter. In our experiments, we found that a two-layer graph convolutional network has the best expressiveness; because graph convolutional networks use fixed filters for learning, too many layers generally limit the expressive power of the network [41]. Subsequently, we pass the attention similarity matrix into a two-layer graph convolutional network that aggregates information between similar nodes to generate more consistent hash codes:

Z^{(1)}_* = σ_1(D^{-1/2} S D^{-1/2} F_* W^{(1)}),    Z^{(2)}_* = σ_2(D^{-1/2} S D^{-1/2} Z^{(1)}_* W^{(2)}),

where D_ii = Σ_j s_ij, W^{(1)} and W^{(2)} are parameter matrices, σ_1 and σ_2 denote the activation functions of the first and second layers, and Z^{(i)}_* represents the output of the i-th layer of the visual or textual graph convolutional network. Therefore, we can use the attention mechanism to learn the similarity between data. During training, the attention matrix is iteratively updated to maximize the similarity relationship between instances, and the information of similar nodes is then aggregated through the graph convolutional network to generate more consistent hash codes, which helps improve image and text retrieval performance. The hash codes generated by graph convolution are

B^g_* = tanh(α Z^{(2)}_*),    * ∈ {v, t},

where α denotes the number of iterations. We use the iterative approximate optimization strategy to optimize the hash codes: since lim_{α→∞} tanh(αx) = sign(x), the discrete problem is transformed into a series of continuous optimization problems, which effectively alleviates the information loss and instability of the binarization process.
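The following PyTorch sketch (an illustration under the standard GCN formulation, not the authors' code) shows how a two-layer graph convolution aggregates neighborhood information over a learned similarity matrix S and outputs relaxed hash codes:

```python
import torch
import torch.nn as nn

class TwoLayerGCN(nn.Module):
    """Two-layer GCN over a dense similarity matrix S (the mini-batch acts as the graph)."""
    def __init__(self, in_dim, hid_dim, code_len):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)
        self.w2 = nn.Linear(hid_dim, code_len, bias=False)

    @staticmethod
    def normalize(s):
        d = s.sum(dim=1)                                       # D_ii = sum_j s_ij
        d_inv_sqrt = d.clamp(min=1e-8).pow(-0.5)
        return d_inv_sqrt[:, None] * s * d_inv_sqrt[None, :]   # D^-1/2 S D^-1/2

    def forward(self, feats, s, alpha=1.0):
        a = self.normalize(s)
        z1 = torch.relu(a @ self.w1(feats))                    # first graph convolution layer
        z2 = a @ self.w2(z1)                                   # second graph convolution layer
        return torch.tanh(alpha * z2)                          # relaxed hash codes B^g
```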
Hash code reconstruction module. We construct cosine similarity matrices from the generated hash codes B_v, B_t, B^g_v and B^g_t, and then build the loss functions from these matrices and the similarity-enhanced matrix S_E. The objective contains three terms: L_Intra and L_Cross denote the intra-modal and cross-modal reconstruction losses, respectively, and L_Gcn represents the graph convolution reconstruction loss. Each term aligns the corresponding hash code similarity matrix with the scaled enhanced matrix, where µ is a scale hyper-parameter that regulates the quantization scope of the enhanced matrix and the symbol ⊗ indicates the Hadamard matrix product.
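As a hedged sketch of such reconstruction losses (the exact weighting and the use of the Hadamard product in the paper are not recoverable from the text, so this follows the common DJSRH-style formulation with an assumed scale µ):

```python
import torch
import torch.nn.functional as F

def cos_matrix(b):
    b = F.normalize(b, dim=1)
    return b @ b.t()

def reconstruction_losses(b_v, b_t, bg_v, bg_t, s_e, mu=1.5):
    """Align hash-code similarity matrices with the scaled enhanced matrix mu * S_E."""
    target = mu * s_e
    l_intra = F.mse_loss(cos_matrix(b_v), target) + F.mse_loss(cos_matrix(b_t), target)
    l_cross = F.mse_loss(F.normalize(b_v, dim=1) @ F.normalize(b_t, dim=1).t(), target)
    l_gcn = F.mse_loss(cos_matrix(bg_v), target) + F.mse_loss(cos_matrix(bg_t), target)
    return l_intra, l_cross, l_gcn
```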

Objective Function and Optimization
The parameters of the entire network are iteratively updated through back-propagation until the network converges and the reconstruction of the hash codes is finished. The total loss is

L = L_Cross + ε L_Intra + φ L_Gcn,

where ε and φ are trade-off hyper-parameters. Minimizing this loss encourages similar data to generate more consistent hash codes. The proposed CAGAN method can be optimized iteratively batch by batch. By minimizing the loss in Equation (14), CAGAN assists the learning of the hash network through the adaptive graph attention network, effectively capturing the neighborhood structure and co-occurrence information of the original instances to generate high-quality hash codes. The entire CAGAN model can be optimized with the SGD and Adam optimizers, and the procedure of CAGAN is detailed in Algorithm 1.

4: Randomly sample m image-text pairs O_m = {v_k, t_k}_{k=1}^{m} to construct a mini-batch of training data; then perform data augmentation and normalization on the images.
5: Feed the batches of images and texts through the image and text encoding networks to obtain the visual features F_v and textual features F_t; compute the cosine similarities of F_v and F_t to construct the similarity matrices S_v, S_t and the affinity matrix S_A according to Equations (6) and (7).
6: Enhance the affinity matrix S_A to obtain the matrix S_E according to Equations (8) and (9); then use S_E as the input of the adaptive graph attention module to learn the hash codes.
7: Generate the hash codes B_v, B_t, B^g_v and B^g_t according to Equations (11) and (12).
8: Compute the loss of the whole network according to Equation (14).
9: Back-propagate with stochastic gradient descent and the Adam algorithm to optimize the parameters of the network.
10: until convergence
11: return the parameters of the entire network θ_*, * ∈ {v, t, Hv, Ht, Gv, Gt}.
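Putting the pieces together, a hedged end-to-end training sketch might look as follows. It reuses the helper sketches shown earlier (affinity_matrix, enhance, reconstruction_losses), and the module names clip_encoder, hash_mlp_v, hash_mlp_t and gcn are placeholders for the components described above, not the authors' released implementation:

```python
import torch

def train_epoch(loader, clip_encoder, hash_mlp_v, hash_mlp_t, gcn, optimizer,
                alpha, eps=1.0, phi=1.0):
    for images, texts in loader:                       # mini-batch of image-text pairs
        with torch.no_grad():                          # CLIP weights are kept frozen
            f_v, f_t = clip_encoder(images, texts)
        s_a = affinity_matrix(f_v, f_t)                # Eqs. (6)-(7): fused affinity matrix
        s_e = enhance(s_a)                             # Eqs. (8)-(9): similarity enhancement
        b_v = torch.tanh(alpha * hash_mlp_v(f_v))      # relaxed codes from the hash network
        b_t = torch.tanh(alpha * hash_mlp_t(f_t))
        bg_v = gcn(f_v, s_e, alpha)                    # relaxed codes from the GCN branch
        bg_t = gcn(f_t, s_e, alpha)
        l_intra, l_cross, l_gcn = reconstruction_losses(b_v, b_t, bg_v, bg_t, s_e)
        loss = l_cross + eps * l_intra + phi * l_gcn   # Eq. (14): total objective
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```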

Experiments
In this part, to evaluate the effectiveness of the proposed CAGAN, comprehensive experiments were carried out on three multimedia benchmark datasets (MS COCO, MIRFLICKR-25K and NUS-WIDE). Firstly, we briefly introduce the datasets and evaluation metrics. Secondly, the proposed CAGAN is compared with several advanced baseline methods, including LSSH [42], IMH [43], CMFH [44], RFDH [45], UDCMH [6], AGCH [36], UGACH [46], SRCH [37], UKD [47], DJSRH [10], JDSH [11], DSAH [18], HNH [25], DGCPN [17], DUCH [48] and DAEH [26]. For fairness, baseline methods with the same experimental settings are grouped together for comparison. Finally, the proposed method is empirically analyzed through parameter sensitivity analysis, an ablation study, training efficiency and visualization.

MIRFLICKR-25K [49]: This multi-label dataset from the Flickr website contains 25,000 photos and associated textual description labels from 24 different categories. To represent the textual content, it provides a 1386-dimensional feature vector obtained from principal component analysis of the binary label vectors. For a fair comparison, we adopted the same setting as previous methods [26] and randomly selected 5000 and 2000 samples as the training and test sets, respectively.

NUS-WIDE [50]: This dataset contains 269,648 images collected from real scenes with their corresponding textual descriptions and labels. In this paper, we follow the setup of previous work and select the 10 most widely used concepts and their associated 186,577 image-text pairs. For each text, a 1000-dimensional BOW feature representation is provided by principal component analysis. We randomly selected 2000 samples and 5000 samples as the test and training sets, respectively.

MS COCO [51]: This is a widely used and diverse dataset for object recognition, multimedia retrieval and semantic segmentation. The dataset contains 123,287 images obtained from intricate everyday scenes, in which the objects are localized by careful segmentation. In our experiments, we used 87,081 photographs covering 91 categories, with each corresponding text represented by a 2000-dimensional bag-of-words vector. In addition, we randomly selected 5000 image-text pairs as the query set and used the remaining pairs as the retrieval set; 10,000 image-text pairs from the retrieval set were randomly selected as the training set. We show the structure of the MS COCO dataset in Figure 3 and summarize the statistics of the three datasets in Table 1.

Evaluation Metrics
Cross-modal image-text retrieval focuses on two search tasks: "Text-query-Image (T → I)" and "Image-query-Text (I → T)". Both use an instance of one modality as a query to retrieve similar data of the other modality from the database. In the experiments, we employ two widely used retrieval metrics, Mean Average Precision (MAP) and the top-N precision curve, to measure the retrieval performance of the proposed model against other methods; both metrics reflect precision as well as ranking information. In particular, given a query set Q = [q_1, q_2, ..., q_M], the MAP is defined as

MAP = (1/M) Σ_{i=1}^{M} AP(q_i),    AP(q_i) = (1/L_q) Σ_{k=1}^{n} P(k) ΔR(k),

where q_i indicates a query instance and M indicates the total number of query instances. In addition, n represents the number of instances in the dataset, k denotes the number of instances returned during the search procedure and L_q indicates the number of instances in the dataset relevant to the query. P(k) is the precision of the top k retrieved samples, and ΔR(k) is the change in recall from position k − 1 to k. The average precision is therefore the average retrieval precision of a single query. The top-N precision curve is another important indicator that reports the precision at various numbers of retrieved instances; it represents the average precision of the top N results after the retrieval results are sorted and reflects the generalization ability and comprehensive performance of a model.
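The MAP metric can be computed directly from ranked relevance lists; the snippet below is a small self-contained illustration evaluating MAP@k on toy data (it is not tied to the paper's evaluation code):

```python
def average_precision(relevance, k=None):
    """AP for one query, given a binary relevance list ordered by Hamming distance."""
    ranked = relevance[:k] if k else relevance
    hits, precision_sum = 0, 0.0
    for idx, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precision_sum += hits / idx          # P(k) accumulated only at relevant positions
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_relevance, k=None):
    return sum(average_precision(r, k) for r in all_relevance) / len(all_relevance)

# toy example: two queries with ranked relevance lists
print(mean_average_precision([[1, 0, 1, 0], [0, 1, 1, 1]], k=4))
```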
In the experiments, we use the CLIP (ViT-B/16) [21] model pre-trained on 400 million image-text pairs together with a bag-of-words model for the feature encoding module. ViT-B/16 is one of the CLIP variants, which uses a vision transformer [24] to model the image modality. Subsequently, we construct the similarity matrices of the different modalities from the encoded image and text features. The adaptive graph attention module performs adaptive attention learning on the similarity-enhanced affinity matrix and consists of two graph convolutional layers and a multi-layer perceptron (512 → 1024 → C). Finally, we reconstruct the hash codes generated by the multi-layer perceptron (D_f → 4096 → C, where D_f represents the dimension of the image or text features and C represents the length of the hash code) and by the graph convolutional network to generate more consistent hash codes for related data through Equations (11) and (12). For optimization, we adopt the SGD and Adam optimizers with a learning rate of 0.01, a weight decay of 5e-4 and a momentum of 0.9. The batch size is set to 32 for the three benchmark datasets during training.
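For reference, extracting frozen CLIP ViT-B/16 image and text features with the openai/CLIP package could look roughly like the following (a usage sketch rather than the authors' pipeline; the paper additionally uses bag-of-words text features, which are omitted here):

```python
import torch
import clip                      # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)
model.eval()                     # CLIP weights stay frozen during hash learning

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
tokens = clip.tokenize(["a dog playing on the beach"]).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)   # shape: (1, 512) for ViT-B/16
    text_feat = model.encode_text(tokens)    # shape: (1, 512)
```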

Comparison Results and Discussions
In the experiments, we compare two cross-modal retrieval tasks, I → T and T → I: using an image to query texts and vice versa. In this subsection, we compare the retrieval performance of all baselines and CAGAN in terms of MAP and top-N precision curves on the two retrieval tasks.

MAP comparison results: Table 2 displays the MAP@5000 results of the proposed CAGAN compared with other state-of-the-art unsupervised cross-modal hashing methods at hash code lengths from 16 to 128 bits on the three benchmark datasets (MIRFLICKR-25K, NUS-WIDE and MS COCO). As can be seen from Table 2, our proposed method outperforms all compared baselines. It is worth noting that the first four approaches are traditional methods and the rest are deep-neural-network-based methods; the latter achieve large performance gains thanks to the strong nonlinear feature extraction capability of neural networks. Compared with several advanced unsupervised cross-modal hashing baselines, our method achieves about a 1.5-3% performance improvement, which confirms the superiority of the proposed CAGAN. The improvement of our method on the NUS-WIDE dataset is relatively small, about 0.6-2.2%, because NUS-WIDE contains a small number of categories. The performance improvement is more obvious on MS COCO, which has a large number of categories, and our method still performs well at lower hash code lengths. This reflects the excellent fine-grained retrieval ability of the proposed model and makes it more suitable for practical applications. Table 2. The MAP@5000 results on cross-modal retrieval tasks (I→T indicates the image-search-text task and vice versa) on three benchmark datasets. The best outcomes are highlighted in bold, and sub-optimal results are underlined.

To further verify the effectiveness of the proposed CAGAN, we compare five additional deep cross-modal unsupervised hashing methods on the MIRFLICKR-25K and NUS-WIDE datasets, and the comparison results are presented in Tables 3 and 4. Methods with the same experimental setup are compared together in Table 3. It can be seen that our proposed method outperforms all comparison methods under both the MAP@50 and MAP@ALL settings. On the MIRFLICKR-25K dataset, the proposed method outperforms the existing methods in MAP@50, achieving a 2-4% performance improvement even when compared with the state-of-the-art AGCH method. On the NUS-WIDE dataset, CAGAN's MAP@50 retrieval accuracy is 1-2% higher than that of the compared methods. In addition, the MAP@ALL performance on both datasets is also significantly improved. The results in Table 3 further illustrate the effectiveness of the CAGAN method.

Top-N precision curves: Figure 4 shows the top-N precision curves of the proposed method and all eleven compared baseline methods on the three multimedia datasets. The top-N precision curves are drawn by varying the number of retrieved samples from 1 to 5000 and reflect how retrieval precision fluctuates as the number of retrievals increases. As can be seen from the curves in Figure 4, our method outperforms all contrasting baselines, which intuitively reflects the efficiency of our CAGAN. It is worth noting that the top-N precision curve decreases only slowly as the number of retrieved instances increases.
A reasonable explanation is that our proposed adaptive graph attention module can assist the learning of hash codes, thereby generating more high-quality hash codes. Finally, together with the MAP comparison results, the top-N precision curve can also illustrate that our proposed method mitigates the loss of accuracy in the process of binarization, thus improving retrieval performance and maintaining a high accuracy rate as the number of retrieved samples grows.

Ablation Study
To demonstrate the effectiveness and contribution of each module in our proposed approach, ablation experiments were carried out for each module. To this end, five variants of the model were designed to verify the impact of each module on the overall model, and the results of the ablation experiments are shown in Table 5. Based on the MAP results of the different variants on the three multimedia datasets in Table 5, we can conclude the following:

• Analysis of Table 5 shows that each module plays a significant role in the overall model. Among them, CAGAN-2 shows the most obvious performance drop, because language is human-refined information and the similarity matrix constructed from text alone is sparse. CAGAN-1, which only uses image features to build the similarity matrix, degrades less; one potential reason is that images contain richer, fine-grained semantic information. The results of CAGAN-1 and CAGAN-2 demonstrate the effectiveness of our proposed multi-modal similarity enhancement module.
• The adaptive graph attention module also affects the performance of the proposed CAGAN. Specifically, the results of CAGAN-3 and CAGAN-5 show that both the graph convolutional network and the attention mechanism contribute about 1.5-2.5% to the performance of the model.
In addition, we performed ablation experiments on different backbone networks, and the MAP results on MIRFLICKR-25K are shown in Table 6. We find that using CLIP as the backbone network has the best performance, followed by ResNet-152, which reflects that CLIP has excellent visual-linguistic feature extraction ability and is well-suited for cross-modal tasks.

Parameter Sensitivity Analysis
We analyze several hyper-parameters that could affect the results of the proposed method, and the analysis results are shown in Figure 5. The analysis was carried out with the controlled-variable method, where one parameter is changed while the values of the other parameters are fixed. η and β modulate the effect of image and text similarity on model performance, respectively. It is observed that η and β remain relatively stable in the range of 0.01 to 2, and a large drop occurs when they are larger than 2. λ is a trade-off parameter for cross-modal similarity; it remains stable around 0.1, and when λ > 0.1, there is a significant decline. Therefore, properly adjusting the similarity weights leads to satisfactory results. Analysis of the results in Figure 5 also shows that our method is not sensitive to the choice of ε and ϕ in the range [0.1, 2]; ε and ϕ weigh the contributions of the intra-modal loss and the graph convolution loss, and proper adjustment allows the model to achieve optimal performance. µ is a scale hyper-parameter that regulates the quantization scope of the matrix, adjusting the matrix values to a reasonable range and improving retrieval performance. In summary, reasonable tuning of the parameters allows the model to maintain advanced retrieval performance, and the proposed method is robust to hyper-parameters within a reasonable interval.

Training Efficiency and Convergence Testing
In this subsection, we investigate the convergence and training efficiency of the proposed CAGAN on the three benchmark datasets. Figure 6a shows the loss convergence curve at a 16-bit hash code length, and Figure 6b displays how the MAP changes as the number of iterations increases. The following conclusions can be drawn from Figure 6. First, as the number of optimization iterations increases, the loss function gradually decreases, showing that the optimization process improves the encoding ability of the hash function. In addition, the loss function converges to a good result after dozens of iterations, illustrating that our method reduces training time and improves training efficiency. Finally, the network converges within dozens of iterations, validating that the proposed network is suitable for unsupervised hash retrieval tasks. Table 7 shows the computational complexity and training and inference time of the proposed method and several advanced models on the MIRFLICKR-25K dataset. Although our proposed method has more parameters than the other methods, it converges faster because the multi-modal backbone is used with frozen weights. In summary, the proposed CAGAN is advantageous in terms of both retrieval accuracy and training time.

Figure 7 shows examples of the visualization results of CAGAN on the image and text retrieval tasks. The first column is the query sample; its hash code is generated, Hamming distances to the database are computed from the hash codes, and the five most similar results after Hamming ranking are displayed in the remaining columns. It is worth noting that the items boxed in red in Figure 7 do not fully match the semantics of the query. One potential reason is that, due to data bias, there is not enough data similar to the query. Although these semantically mismatched items are retrieved, they are still somewhat related to the query. Overall, the proposed method returns plausible retrieval results through Hamming ranking.

Conclusions
In this paper, to address the retrieval of multi-modal data generated by different sensors, we proposed an effective and novel CLIP-based Adaptive Graph Attention Network for unsupervised multi-modal hashing retrieval tasks. To the best of our knowledge, this is the first application of CLIP to unsupervised multi-modal hashing. We designed a multi-modal similarity enhancement module to enhance data similarity, which helps improve retrieval accuracy. In addition, an iterative approximate optimization strategy is used to reduce the information loss during hash code binarization. Finally, a well-designed adaptive graph attention module assists the learning of the hash network and alleviates the problem of unbalanced multi-modal learning. Extensive experiments on three benchmark datasets demonstrate that the proposed method outperforms several representative advanced methods. In the future, we will further investigate the performance of CAGAN on other retrieval tasks.

Conflicts of Interest:
The authors declare no conflict of interest.