1. Introduction
With the increasing sophistication of Internet technologies, users generate vast amounts of multimodal data, including images, text, audio, and video, on a daily basis. This proliferation has led to exponential growth in multimedia repositories, which exhibit inherent heterogeneity and complex semantic relationships across modalities [1]. To enable effective cross-modal information fusion, it is essential to bridge the representational gaps between diverse data types, mitigating discrepancies in feature structure and distribution [2]. In this context, cross-modal hashing has emerged as a critical technique that learns unified binary hash codes to establish semantic correspondences across modalities. By mapping heterogeneous data into a common Hamming space, it significantly improves retrieval efficiency and scalability while preserving cross-modal semantic consistency, thereby supporting large-scale multimedia search applications [3].
In the past few years, the rapid evolution of deep learning methodologies has profoundly accelerated developments in cross-modal hashing. The incorporation of deep neural networks, particularly through architectures such as convolutional and recurrent networks, has established itself as a predominant research avenue. These models facilitate more effective representation learning by capturing high-level semantic correlations across heterogeneous data modalities, thereby substantially enhancing retrieval accuracy and computational efficiency. Furthermore, the synergy between the feature abstraction capabilities of deep learning and hashing techniques offers promising potential for scalable multimedia search in real-world applications [4].
Despite significant progress in deep cross-modal hashing, critical challenges persist. A primary issue is the difficulty of constructing semantically coherent and structure-preserving mappings across heterogeneous modalities. Moreover, existing methods struggle to reconcile inherent semantic discrepancies between diverse data types (e.g., images and text) due to their distinct feature distributions and structural properties. Consequently, semantic information is often lost during cross-modal alignment, leading to incomplete or distorted representations. These limitations collectively hinder semantic understanding and constrain the performance and generalizability of cross-modal retrieval systems.
To overcome the challenges identified in existing approaches, we introduce a feature fusion-based proxy hashing framework designed for cross-modal retrieval tasks. Raw image and text data are fed into a feature extraction network to obtain image and text features, respectively. These features are then integrated through a feature fusion module to combine multimodal semantic information, generating fused features that are both discriminative and robust. Finally, the hash function learning process is optimized through a carefully designed joint loss function that integrates three complementary components: cross-modal proxy loss, cross-modal irrelevant loss, and cross-modal consistency loss. In the feature extraction stage, the proposed method employs the CLIP model as its backbone architecture, replacing conventional convolutional neural networks (CNNs). CLIP incorporates a dual-encoder architecture comprising a visual encoder for image processing and a transformer-based encoder for textual input. On the image side, the ViT-B/16 pre-trained model of CLIP is used for feature extraction. On the text side, Byte-Pair Encoding (BPE) is employed for tokenization, and the processed text vectors are input into the text transformer encoder. For the feature fusion module, inspired by the multi-head attention mechanism in the Transformer architecture, a cross-modal feature fusion mechanism is designed. It integrates semantic information from images and text via an attention-weighted strategy to produce a unified feature representation with both discriminability and robustness. For the hash learning module, we propose a joint loss function comprising cross-modal proxy loss, cross-modal irrelevant loss, and cross-modal consistency loss, which replaces traditional pairwise or triplet losses. This design effectively preserves inter-sample similarity rankings and mitigates inter-modal semantic gaps. Specifically, the cross-modal proxy loss aligns heterogeneous features via proxy representations to bridge semantic gaps; the cross-modal irrelevant loss reduces modality-specific noise by penalizing spurious correlations between irrelevant pairs; and the cross-modal consistency loss ensures the compatibility and effectiveness of fused features, providing stable and semantically rich inputs for hash code generation. This multi-objective optimization framework enhances the discriminability of hash codes while improving the model’s robustness and generalization capability in large-scale cross-modal retrieval.
In summary, the contributions of this paper are as follows:
We propose a cross-modal feature fusion module incorporating attention mechanisms to integrate semantic information from images and text. This module effectively eliminates heterogeneity between modalities, producing discriminative and robust fused features that mitigate the semantic gap.
Based on the feature fusion module, a cross-modal consistency loss is proposed and proxy hashing is introduced, thereby constructing a joint loss function composed of cross-modal proxy loss, cross-modal irrelevant loss, and cross-modal consistency loss. Compared to traditional pairwise and triplet losses, this loss function effectively preserves the accuracy of inter-sample similarity ranking and mitigates the semantic gap between modalities.
Extensive experiments conducted on three widely used benchmark datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art approaches in cross-modal retrieval tasks.
3. Our Method
3.1. Notation Definition
This paper primarily focuses on the cross-modal retrieval task between image and text modalities. Given a dataset containing image-text pairs, it can be represented as a set $O = \{o_i\}_{i=1}^{n}$, where $o_i = (x_i, y_i)$, with $x_i$ denoting the $i$-th image sample and $y_i$ representing the $i$-th text sample. $l_i \in \{0, 1\}^{c}$ is the label corresponding to $o_i$, where $c$ indicates the number of categories. The core objective of deep cross-modal hashing is to learn hash functions $H^{I}(\cdot)$ and $H^{T}(\cdot)$ for images and text, respectively, and then map the image and text data into corresponding binary hash codes $b^{I} \in \{-1, +1\}^{k}$ and $b^{T} \in \{-1, +1\}^{k}$, where $k$ represents the length of the binary hash codes.
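For intuition, the sketch below shows one common way such hash functions are realized in deep hashing: a linear hash layer with a tanh relaxation during training and a sign operation to obtain binary codes of length $k$ at retrieval time. This is a generic construction for illustration only; the class name HashHead and its structure are assumptions, not the specific design of FFCPH.

```python
# Generic illustration (not the paper's exact design): a deep hash function
# as a linear layer with tanh relaxation during training and a sign
# operation producing {-1, +1} codes of length k at retrieval time.
import torch
import torch.nn as nn

class HashHead(nn.Module):
    def __init__(self, feat_dim: int, k: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, k)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.fc(features))       # relaxed codes in (-1, 1)

    def binarize(self, features: torch.Tensor) -> torch.Tensor:
        return torch.sign(self.forward(features))  # binary codes in {-1, +1}
```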
3.2. Overall Framework
The overall framework of FFCPH is illustrated in Figure 1. First, raw image and text data are input into the feature extraction network to obtain image features $F_I$ and text features $F_T$. Subsequently, a feature fusion module integrates multimodal semantic information to generate fused features $F_{fusion}$ that exhibit both discriminative and robust properties. Finally, hash function learning is accomplished under the guidance of a joint loss function composed of cross-modal proxy loss, cross-modal irrelevant loss, and cross-modal consistency loss. Implementation details of each component are elaborated in subsequent sections.
3.3. Feature Extraction Module
This paper employs the CLIP pre-trained network for feature extraction of images and text [26]. The CLIP model utilizes contrastive learning to train on a large-scale dataset of image-text pairs, enabling effective alignment of image and text features. The architecture of CLIP incorporates two primary encoding branches: a visual encoder and a linguistic encoder. The visual component commonly utilizes a Vision Transformer (ViT) as its backbone, whereas the linguistic encoder is built upon a standard Transformer-based framework. For the image modality, CLIP employs a pre-trained ViT-B/16 model to obtain discriminative visual representations. For the text modality, Byte Pair Encoding (BPE) is applied for tokenization, and the processed token sequences are fed into the text encoder.
3.3.1. Image-Side Network
Feature extraction for images is performed by the image encoder of CLIP. First, the input image undergoes preliminary feature extraction through convolutional layers. The feature maps output by the convolutional layers are then flattened into one-dimensional vectors and combined with trainable positional encodings to preserve spatial information. Finally, the processed feature vectors are fed into the image encoder composed of 12 encoding blocks. After processing through these 12 blocks, the final image feature $F_I$ is obtained.
The computation can be summarized as follows:
$$Z_0 = \mathrm{Concat}\big(\mathrm{Flatten}(\mathrm{Conv}(I))\big) + E_{pos}, \qquad F_I = \mathrm{Encoder}_{\times 12}(Z_0),$$
where $I$ denotes the input image, $E_{pos}$ denotes the learnable positional encoding, and $\mathrm{Concat}(\cdot)$ represents the concatenation operation.
3.3.2. Text-Side Network
Text feature extraction is performed using the CLIP text transformer encoder. Initially, the input text is tokenized via Byte Pair Encoding (BPE), resulting in a sequence of 512 tokens. Each token is then augmented with a trainable positional encoding. The resulting text vectors are fed into the transformer-based text encoder, which produces high-dimensional semantic embeddings through multi-layer self-attention and feed-forward computation, ultimately yielding the final distributed representations of the input text. The final extracted text feature is denoted as $F_T$.
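The following sketch illustrates this feature extraction stage using the publicly available openai/CLIP package with the ViT-B/16 weights; the helper name extract_features and the single-sample usage are illustrative assumptions rather than the paper's exact pipeline.

```python
# Minimal sketch of CLIP-based feature extraction, assuming the openai/CLIP
# package (pip install git+https://github.com/openai/CLIP.git).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

def extract_features(image_path: str, caption: str):
    # Image side: ViT-B/16 visual encoder.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    # Text side: BPE tokenization followed by the transformer text encoder.
    tokens = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        f_img = model.encode_image(image)   # shape (1, 512)
        f_txt = model.encode_text(tokens)   # shape (1, 512)
    return f_img, f_txt
```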
3.4. Feature Fusion Module
Traditional cross-modal methods often perform interaction between image and text features by simply concatenating their feature vectors or merging them directly through fully connected layers. While such approaches combine information from both modalities, they essentially represent a linear and semantically agnostic fusion strategy. In complex semantic scenarios, the fused representations often lack the capacity to adequately model high-level semantic interactions across modalities, which consequently results in the omission of pivotal semantic cues during the integration process.
To address this issue, this paper draws inspiration from the multi-head attention mechanism in the Transformer and designs a feature fusion module incorporating an attention mechanism [27]. The core idea is to treat text features as “sequential semantic descriptions” (serving as key vectors $K$ and value vectors $V$), while image features are regarded as objects to be “queried” and “enhanced” by contextual semantics (serving as query vectors $Q$). The attention mechanism empowers the feature fusion module to adaptively quantify the correlation strength, represented as attention weights, among individual elements of both visual and textual features. Based on these weights, the textual semantic information is weighted and aggregated, ultimately producing fused features that retain the original image information while incorporating relevant textual semantics.
3.4.1. The Standard Multi-Head Attention Mechanism
To elucidate the underlying principle of the feature fusion module, it is imperative to first revisit the mathematical formulation of the canonical multi-head attention mechanism, a core component of Transformer-based architectures. In standard multi-head attention, the input is structured into three distinct vector representations: queries ($Q$), keys ($K$), and values ($V$), typically projected from the same feature space or modality. For each attention head, the mechanism computes a weighted sum of value vectors, where the weights are derived from a scaled dot-product affinity between queries and keys. The computation of a single attention head is defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, and $V$ represent the query matrix, key matrix, and value matrix, respectively; $d_k$ denotes the dimensionality of the key vectors, and $\sqrt{d_k}$ is the scaling factor used to prevent excessively large dot product results that may lead to vanishing gradients in the softmax operation [27]. The attention weight distribution is obtained through the above formula. Subsequently, $Q$, $K$, and $V$ are projected $h$ times using different linear projection matrices $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$, and $h$ attention heads are computed in parallel. The outputs from all attention heads are concatenated and subsequently transformed through a linear projection layer to form the final multi-head attention representation. This process can be formally expressed as:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\big(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\big),$$
where $W^{O}$ represents the trainable output projection matrix that integrates information captured by different attention heads.
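As a reference point, the sketch below implements the standard multi-head attention defined above in PyTorch; it is a didactic restatement of the canonical formulation, not the FFCPH fusion module itself.

```python
# Reference implementation of standard multi-head attention (Vaswani et al.).
import math
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_k = d_model // num_heads
        self.h = num_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v):
        b = q.size(0)
        # Project and split into h heads: (B, h, seq_len, d_k).
        q = self.w_q(q).view(b, -1, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(k).view(b, -1, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(v).view(b, -1, self.h, self.d_k).transpose(1, 2)
        # Scaled dot-product attention per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        weights = scores.softmax(dim=-1)
        out = weights @ v
        # Concatenate heads and apply the output projection W^O.
        out = out.transpose(1, 2).contiguous().view(b, -1, self.h * self.d_k)
        return self.w_o(out)
```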
3.4.2. Module Architecture
The architecture of the feature fusion module is shown in Figure 2. The purpose of this module is to combine text and image features through a specific method, thereby eliminating the heterogeneity between images and text and obtaining fused features that incorporate semantic information from both modalities. Considering that image and text features in this context are global feature vectors rather than long sequences, we adopt the computational logic of single-head attention to meet the efficiency and discriminative requirements of the cross-modal hashing task. Additionally, in the following sections, we define the image features as $Q$ (the query) and the text features as both $K$ (the matched keys) and $V$ (the aggregated values).
The previously obtained image and text features are denoted as $F_I \in \mathbb{R}^{B \times L}$ and $F_T \in \mathbb{R}^{B \times L}$, respectively, where $B$ represents the batch size and $L$ is the length of the feature vectors, consistent with the output dimensionality of the CLIP model. The attention map is first computed as follows:
$$A = F_I \, F_T^{\top}.$$
Here, matrix multiplication is used to compute the attention map between image features and text features. The shape of $F_I$ changes from $(B, L)$ to $(B, L, 1)$, while the shape of $F_T$ also changes from $(B, L)$ to $(B, 1, L)$, so the product is computed per sample. Therefore, the resulting attention map $A$ has the shape $(B, L, L)$. The attention map represents the global semantic correlation strength between image features (as queries) and text features (as keys).
To transform the raw scores into valid probability distribution weights, we apply the $\mathrm{softmax}$ function to each row of the attention map $A$ along the column direction for normalization:
$$\hat{A} = \mathrm{softmax}(A).$$
Subsequently, the normalized attention weights $\hat{A}$ are employed to perform a weighted summation on the original text features $F_T$, generating a semantically filtered text vector $V'$ tailored to the current image:
$$V' = \hat{A}\, F_T.$$
At this point, the text vector $V'$ is no longer an independent text feature but a dynamically reorganized representation enriched with cross-modal correlation information guided by the image features.
Finally, by adding the text feature $V'$ (which has been filtered and reconstructed through the attention mechanism) to the original image feature $F_I$, the final fused feature $F_{fusion}$ is obtained:
$$F_{fusion} = F_I + \lambda\, V',$$
where $\lambda$ serves as a learnable parameter that modulates the contribution of the weighted textual features to the resulting fused representation.
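A minimal sketch of the single-head fusion described above is given below, assuming the per-sample reading in which the image feature (reshaped to $(B, L, 1)$) attends over its paired text feature (reshaped to $(B, 1, L)$); the module name CrossModalFusion and the scalar parameterization of $\lambda$ are illustrative assumptions.

```python
# Sketch of the single-head cross-modal fusion: the image feature acts as
# the query, the text feature as key and value, and a learnable scalar
# lambda weights the attended text contribution.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self):
        super().__init__()
        # Learnable scalar weighting the attended text contribution.
        self.lam = nn.Parameter(torch.ones(1))

    def forward(self, f_img: torch.Tensor, f_txt: torch.Tensor) -> torch.Tensor:
        # f_img, f_txt: (B, L) global features from the CLIP encoders.
        q = f_img.unsqueeze(2)                         # (B, L, 1) image as query
        k = f_txt.unsqueeze(1)                         # (B, 1, L) text as key
        attn = torch.bmm(q, k).softmax(dim=-1)         # (B, L, L) attention map
        v = f_txt.unsqueeze(2)                         # (B, L, 1) text as value
        txt_filtered = torch.bmm(attn, v).squeeze(2)   # (B, L) weighted sum
        return f_img + self.lam * txt_filtered         # fused feature
```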
Within the feature fusion module, the attention mechanism serves as a semantic filter that dynamically identifies and emphasizes the most relevant elements from the input representations. For a given image, it automatically assigns higher weights to text describing its core content while suppressing irrelevant or redundant textual information, thereby achieving effective information filtering. By allowing image features to actively ‘query’ text features and performing a weighted summation in the feature space, this module mathematically establishes a nonlinear mapping relationship from the visual space to the textual semantic space. Such an interactive mapping more effectively narrows the distance between different modalities compared to simply projecting features into a common subspace.
It is noteworthy that the standard Transformer architecture typically utilizes Multi-Head Attention (MHA) to jointly attend to information from different representation subspaces at different positions, thereby avoiding the averaging effect that may occur in a Single-Head mechanism. In our design, we adopted a Single-Head attention mechanism for the feature fusion module. This choice was primarily driven by considerations of computational efficiency and model compactness, particularly given that the features extracted by the CLIP backbone are already high-level, semantically aligned representations. While the Single-Head mechanism offers a favorable trade-off between retrieval accuracy and computational overhead for our specific FFCPH framework, we acknowledge that it may have inherent limitations in capturing diverse semantic interactions compared to MHA. Exploring lightweight multi-head strategies to further enhance feature fusion capability remains a promising direction for future work.
3.5. Proxy Hash
Proxy hashing is a method for learning hash codes, often applied to problems involving multimodal data, such as images and text. Compared to conventional pairwise and triplet losses, the proxy hashing approach introduces a set of learnable proxy vectors to simulate the semantic centers of different classes, enabling direct optimization of the relationships between proxies and samples. This reduces computational complexity while achieving competitive performance.
The core idea of proxy hashing is to introduce a set of learnable proxy vectors, which represent different semantic categories in a shared embedding space. The similarity relationships between these vectors and sample points serve as constraints for optimization. Specifically, as illustrated in Figure 3, four proxies $p_1$, $p_2$, $p_3$, and $p_4$ are embedded in the common space, while $x_1$ and $x_2$ are two sample points. If sample $x_1$ is associated with proxy $p_1$ and unrelated to other proxies, the proxy-based method will minimize the distance between $x_1$ and $p_1$ while maximizing the distances between $x_1$ and the other, semantically irrelevant proxies. Similarly, if sample $x_2$ is associated with proxy $p_2$ and unrelated to other proxies, the method will likewise minimize the distance between $x_2$ and $p_2$ while pushing $x_2$ away from unrelated proxies. This mechanism of attracting similar points and repelling dissimilar proxies effectively promotes clustering of cross-modal data in the shared semantic space, thereby laying the foundation for generating well-structured binary hash codes.
The initialization of proxy vectors is critical to ensuring training stability and convergence efficiency. In this work, the Kaiming initialization method is employed to randomly generate the initial values of each proxy vector. The specific formulation is as follows:
$$p_{ij} \sim \mathcal{N}\!\left(0,\, \frac{2}{d}\right), \quad i = 1, \dots, C,\; j = 1, \dots, d,$$
where $P = \{p_1, p_2, \dots, p_C\} \in \mathbb{R}^{C \times d}$, $C$ denotes the total number of categories, $d$ represents the dimensionality of the proxy vectors for each category, and $p_{ij}$ stands for the feature value of the $j$-th dimension of the $i$-th category proxy. For a given tensor $P$, the Kaiming initialization method samples values from a normal distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma^2 = 2/d$.
One advantage of this initialization method is that, in neural networks with ReLU-family activation functions, it maintains consistent variance distributions of inputs and outputs across layers, thereby helping to alleviate problems such as gradient explosion and facilitating faster convergence during early training. Furthermore, the proxy vectors can be jointly optimized with the encoding network during training and adaptively adjusted according to changes in the semantic distribution of multimodal samples.
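The sketch below shows how the proxy matrix can be created with Kaiming normal initialization and kept trainable; the function name init_proxies is illustrative.

```python
# Sketch of proxy-vector initialization with Kaiming (He) normal
# initialization; the proxies remain trainable parameters and are
# optimized jointly with the encoders.
import torch
import torch.nn as nn

def init_proxies(num_classes: int, dim: int) -> nn.Parameter:
    proxies = torch.empty(num_classes, dim)
    nn.init.kaiming_normal_(proxies)   # samples from N(0, 2 / dim)
    return nn.Parameter(proxies)
```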
3.6. Loss Function Design
To generate discriminative binary hash codes, this paper constructs a joint loss function composed of cross-modal proxy loss, cross-modal irrelevant loss, and cross-modal consistency loss. This function aims to explicitly model multi-label semantic relationships through learnable proxy vectors, while leveraging consistency constraints to ensure the effectiveness of multimodal information fusion. Compared to traditional pairwise loss and triplet loss, the proposed loss function more effectively preserves similarity ranking among samples and alleviates semantic discrepancies across modalities. The subsequent sections will introduce these three loss functions separately. Additionally, the previously extracted image feature $F_I$, text feature $F_T$, and fused feature $F_{fusion}$ will be denoted as $f^{I}$, $f^{T}$, and $f^{F}$, respectively, for notational convenience.
3.6.1. Cross-Modal Proxy Loss
The fundamental principle of cross-modal proxy loss is to guide hash code learning through a set of learnable proxy vectors. For multi-label samples, which may be associated with multiple proxies, the loss function aims to minimize the distance between image features, text features, fused features, and semantically relevant proxies while maximizing their distance from semantically irrelevant proxies. Specifically, let $P = \{p_1, p_2, \dots, p_C\}$ denote the learnable proxies and $C$ represent the number of categories. The cosine similarity between image features, text features, fused features, and the proxies is calculated as follows:
$$S_{ij}^{m} = \frac{f_i^{m} \cdot p_j}{\|f_i^{m}\|\,\|p_j\|}, \quad m \in \{I, T, F\},$$
where $S_{ij}^{m}$ denotes the cosine similarity between the $i$-th feature of modality $m$ and the $j$-th proxy.
After obtaining the cosine similarity scores among the image features, text features, fused features, and the corresponding proxy vectors, the positive sample proxy loss for each feature type can be calculated as follows:
$$\mathcal{L}_{p}^{+} = \frac{1}{N_{pos}} \sum_{m \in \{I, T, F\}} \sum_{i=1}^{B} \sum_{j=1}^{C} L_{ij}\,\big(1 - S_{ij}^{m}\big),$$
where $S_{ij}^{I}$, $S_{ij}^{T}$, and $S_{ij}^{F}$ denote the cosine similarity between the $i$-th image, text, and fused features and the $j$-th proxy, respectively; $L$ is the one-hot encoded label matrix indicating whether the $i$-th sample belongs to the $j$-th category; and $N_{pos}$ is the number of positive samples, i.e., the count of elements with a value of 1 in the label matrix. The positive sample proxy loss is designed to reduce the distance between image features, text features, fused features, and semantically relevant proxies.
Similarly, to increase the distance between samples and irrelevant proxies, the negative sample proxy loss is formulated as follows:
$$\mathcal{L}_{p}^{-} = \frac{1}{N_{neg}} \sum_{m \in \{I, T, F\}} \sum_{i=1}^{B} \sum_{j=1}^{C} \big(1 - L_{ij}\big)\,\max\big(S_{ij}^{m} - \delta,\, 0\big),$$
where $N_{neg}$ is the number of negative samples, i.e., the count of elements with a value of 0 in the label matrix, and $\delta$ is a threshold parameter used to control the boundary of similarity, such that only cosine similarities exceeding $\delta$ incur penalties. The negative sample proxy loss serves to maximize the distance between image features, text features, fused features, and semantically irrelevant proxies.
In summary, the total cross-modal proxy loss is obtained by summing the positive sample proxy loss and the negative sample proxy loss:
$$\mathcal{L}_{proxy} = \mathcal{L}_{p}^{+} + \mathcal{L}_{p}^{-}.$$
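A sketch of the cross-modal proxy loss consistent with the reconstruction above is shown below: cosine similarities between each feature type and the proxies, a pull term for relevant proxies, and a hinged push term with threshold $\delta$ for irrelevant ones. The exact functional form is an assumption inferred from the description, and the function name proxy_loss is illustrative.

```python
# Sketch of the cross-modal proxy loss: pull relevant proxies closer with a
# (1 - s) term and push irrelevant ones away with a hinge max(s - delta, 0).
import torch
import torch.nn.functional as F

def proxy_loss(feats, proxies, labels, delta: float = 0.1):
    # feats: list of (B, L) tensors for image, text, and fused features.
    # proxies: (C, L); labels: (B, C) multi-hot label matrix.
    labels = labels.float()
    n_pos = labels.sum().clamp(min=1)
    n_neg = (1 - labels).sum().clamp(min=1)
    loss_pos, loss_neg = 0.0, 0.0
    for f in feats:
        s = F.normalize(f, dim=1) @ F.normalize(proxies, dim=1).t()  # (B, C)
        loss_pos = loss_pos + (labels * (1 - s)).sum() / n_pos
        loss_neg = loss_neg + ((1 - labels) * (s - delta).clamp(min=0)).sum() / n_neg
    return loss_pos + loss_neg
```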
3.6.2. Cross-Modal Irrelevant Loss
To further enhance the discriminative ability of hash codes, it is necessary to explicitly maximize the distance between irrelevant sample pairs. Therefore, a cross-modal irrelevant loss is designed. This work follows the definition of irrelevant pairs in HyP$^2$ Loss [28]: if the dot product of the label vectors of two samples is zero and the magnitudes of their respective label vectors are both greater than 1, the two samples are considered an irrelevant pair.
For cross-modal datasets, three types of irrelevant pairs may exist: irrelevant pairs between image samples, irrelevant pairs between text samples, and irrelevant pairs between images and texts. Accordingly, three irrelevant losses are designed in this work: two intra-modal irrelevant losses (for image-image and text-text pairs) and one inter-modal irrelevant loss (for image-text pairs).
For image samples, the irrelevant loss is formulated as follows:
$$\mathcal{L}_{irr}^{I} = \frac{1}{N_{irr}} \sum_{(i,j) \in \mathcal{D}} \max\big(S_{ij}^{II} - \gamma,\, 0\big),$$
where $N_{irr}$ is the number of irrelevant pairs, $\gamma$ is a margin hyperparameter, $\mathcal{D}$ is the set of sample pairs drawn from different categories, and $S_{ij}^{II}$ is the cosine similarity between the $i$-th image feature and the $j$-th image feature.
Similarly, the irrelevant loss for text samples is formulated as follows:
$$\mathcal{L}_{irr}^{T} = \frac{1}{N_{irr}} \sum_{(i,j) \in \mathcal{D}} \max\big(S_{ij}^{TT} - \gamma,\, 0\big),$$
where $S_{ij}^{TT}$ is the cosine similarity between the $i$-th and $j$-th text features.
The irrelevant loss between image and text samples is formulated as follows:
$$\mathcal{L}_{irr}^{IT} = \frac{1}{N_{irr}} \sum_{(i,j) \in \mathcal{D}} \max\big(S_{ij}^{IT} - \gamma,\, 0\big),$$
where $S_{ij}^{IT}$ is the cosine similarity between the $i$-th image feature and the $j$-th text feature.
In summary, the final cross-modal irrelevant loss is obtained by summing the three irrelevant losses:
$$\mathcal{L}_{irr} = \mathcal{L}_{irr}^{I} + \mathcal{L}_{irr}^{T} + \mathcal{L}_{irr}^{IT}.$$
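The following sketch computes one irrelevant-loss term under the reconstruction above: pairs whose label vectors share no category are treated as irrelevant and pushed below a margin $\gamma$ in cosine similarity. Applying it to image-image, text-text, and image-text pairs and summing yields $\mathcal{L}_{irr}$; the function name and margin handling are assumptions.

```python
# Sketch of one cross-modal irrelevant-loss term: a hinge on the cosine
# similarity of pairs whose label vectors have zero dot product.
import torch
import torch.nn.functional as F

def irrelevant_loss(f_a, f_b, labels_a, labels_b, gamma: float = 0.8):
    # f_a, f_b: (B, L) features (same modality or different modalities).
    sim = F.normalize(f_a, dim=1) @ F.normalize(f_b, dim=1).t()   # (B, B)
    # Irrelevant pair: the two label vectors share no category.
    irrelevant = (labels_a.float() @ labels_b.float().t()) == 0   # (B, B)
    n_irr = irrelevant.sum().clamp(min=1)
    return ((sim - gamma).clamp(min=0) * irrelevant).sum() / n_irr
```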
3.6.3. Cross-Modal Consistency Loss
To guarantee that the fused feature $f^{F}$ generated by the feature fusion module effectively preserves semantic information from both the original image and text modalities while promoting semantic alignment and matching between the two modalities, this paper designs a cross-modal consistency loss. This loss function constrains the distribution consistency between the fused feature and the original unimodal features in the semantic space, achieving complementary enhancement of multimodal information.
The specific formulation of the cross-modal consistency loss is as follows:
$$\mathcal{L}_{con} = \mathrm{MSE}\big(f^{F}, f^{I}\big) + \mathrm{MSE}\big(f^{F}, f^{T}\big),$$
where $\mathrm{MSE}(\cdot, \cdot)$ denotes the mean squared error, calculated as follows:
$$\mathrm{MSE}(a, b) = \frac{1}{L} \sum_{k=1}^{L} \big(a_k - b_k\big)^2,$$
where $L$ denotes the dimensionality of the feature vector.
By minimizing the cross-modal consistency loss, the fused features can maintain numerical distribution proximity to the original image and text features, ensuring no loss of semantic information. This provides stable and semantically rich multimodal input for subsequent hash code generation, thereby enhancing the accuracy and robustness of cross-modal retrieval.
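A corresponding sketch of the consistency term, assuming the MSE-based formulation reconstructed above:

```python
# Sketch of the cross-modal consistency loss: mean squared error between the
# fused feature and each of the unimodal features.
import torch.nn.functional as F

def consistency_loss(f_fused, f_img, f_txt):
    return F.mse_loss(f_fused, f_img) + F.mse_loss(f_fused, f_txt)
```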
3.6.4. Total Loss
By summing the cross-modal proxy loss, cross-modal irrelevant loss, and cross-modal consistency loss, the final total loss can be obtained as:
$$\mathcal{L} = \mathcal{L}_{proxy} + \beta\,\mathcal{L}_{irr} + \mathcal{L}_{con},$$
where $\mathcal{L}_{proxy}$ denotes the cross-modal proxy loss, $\mathcal{L}_{irr}$ represents the cross-modal irrelevant loss, and $\mathcal{L}_{con}$ corresponds to the cross-modal consistency loss. The hyperparameter $\beta$ controls the influence of the cross-modal irrelevant loss on the total loss.
The cross-modal proxy loss establishes semantic associations between samples and category proxies to enhance the discriminability of hash codes. It aligns heterogeneous samples with their corresponding category-level proxies in a shared space, improving both intra-class compactness and inter-class separability. Concurrently, the cross-modal irrelevant loss increases the distance between irrelevant pairs through margin constraints, effectively reducing false correlations. Lastly, the cross-modal consistency loss preserves semantic and structural compatibility between original and fused features, ensuring the fused representations retain critical multimodal information while enabling effective hash space alignment. This loss provides a robust feature foundation that supports the performance of other objective functions, ultimately improving cross-modal retrieval accuracy and efficiency.
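Putting the pieces together, the sketch below shows how the three terms could be combined into one training objective, reusing the illustrative functions introduced in the previous sketches; the weighting by $\beta$ mirrors the total-loss formula, while the function name training_step is an assumption.

```python
# Sketch of one training step of an FFCPH-style objective, reusing the
# illustrative helpers defined earlier (CrossModalFusion, proxy_loss,
# irrelevant_loss, consistency_loss).
def training_step(f_img, f_txt, labels, proxies, fusion, beta: float = 0.8):
    f_fused = fusion(f_img, f_txt)
    l_proxy = proxy_loss([f_img, f_txt, f_fused], proxies, labels)
    l_irr = (irrelevant_loss(f_img, f_img, labels, labels)
             + irrelevant_loss(f_txt, f_txt, labels, labels)
             + irrelevant_loss(f_img, f_txt, labels, labels))
    l_con = consistency_loss(f_fused, f_img, f_txt)
    return l_proxy + beta * l_irr + l_con
```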
4. Experimental Results and Analysis
4.1. Experimental Setup
Experiments were performed on three widely adopted datasets: MIRFLICKR-25K, NUS-WIDE, and MS COCO. These datasets are commonly employed in a variety of vision-language research tasks, such as image classification, object recognition, and cross-modal retrieval between images and text.
The MIRFLICKR-25K dataset consists of 24,581 images, each associated with a corresponding textual description. The images are annotated with 24 distinct labels, corresponding to 24 semantic categories.
NUS-WIDE is a larger dataset comprising 269,648 image-text pairs from 81 categories. Owing to the severe class imbalance, with many categories containing only a limited number of samples, the 21 most common categories were selected in this study, yielding a subset of 195,834 image-text pairs.
MS COCO is a dataset designed to advance research in various computer vision domains, particularly object detection and image captioning. It includes 40,504 validation images and 82,785 training images, each paired with textual descriptions. The dataset covers 80 object categories.
All experiments were performed on an NVIDIA RTX 3090Ti GPU using the Adam optimizer for parameter updates. The hyperparameter $\beta$ was set to 0.8, the initial learning rate was 0.001, and the batch size was uniformly set to 128.
4.2. Baseline Methods and Evaluation Metrics
To evaluate the efficacy of the proposed FFCPH approach, ten deep cross-modal hashing techniques were employed as baseline methods for comparative analysis, including DCMH [29], SSAH [30], AGAH [31], DADH [32], MSSPQ [14], DCHMT [13], MIAN [25], DAPH [33], MCCMR [18], and DNPH [34]. All compared methods utilize deep neural networks as their backbone frameworks, and their source codes and data are mostly provided in the corresponding publications.
To ensure comprehensive and credible performance evaluation, three commonly used evaluation metrics from information retrieval were utilized: mean Average Precision (MAP) [35], Precision-Recall (PR) curves, and TopN-Precision (TopN-P) curves.
MAP serves as a standard metric for assessing comprehensive retrieval effectiveness; a higher MAP value reflects better retrieval accuracy of the deep hashing method. The formula for MAP is as follows:
$$\mathrm{MAP} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{R_i} \sum_{k=1}^{n} P_i(k)\,\delta_i(k),$$
where $|Q|$ is the number of queries, $R_i$ is the number of database items relevant to the $i$-th query, $P_i(k)$ is the precision of the top-$k$ returned results for that query, and $\delta_i(k) = 1$ if the item at rank $k$ is relevant to the query and 0 otherwise.
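For concreteness, the sketch below computes MAP over Hamming-ranked retrieval lists for multi-label data in the standard way; it is a generic evaluation routine, not code released with the paper.

```python
# Sketch of mean Average Precision over Hamming-ranked retrieval lists.
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    # Codes are {-1, +1} arrays; two items are relevant if they share a label.
    aps = []
    for q_code, q_label in zip(query_codes, query_labels):
        hamming = 0.5 * (db_codes.shape[1] - db_codes @ q_code)
        order = np.argsort(hamming)
        relevant = (db_labels[order] @ q_label) > 0
        if relevant.sum() == 0:
            continue
        ranks = np.arange(1, len(relevant) + 1)
        precision_at_k = np.cumsum(relevant) / ranks
        aps.append((precision_at_k * relevant).sum() / relevant.sum())
    return float(np.mean(aps))
```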
The Precision-Recall (PR) curve visually represents the trade-off between precision and recall across different retrieval thresholds, such as Hamming distance thresholds or similarity thresholds. By traversing different thresholds, a series of data points is obtained, and connecting these points yields the PR curve.
The TopN curve serves as a graphical tool to assess the effectiveness of a ranking model by focusing on its top-N returned items. TopN precision quantifies the ratio of relevant instances within these top-N retrieved results. By plotting the precision curve as $N$ increases from 1, the model’s performance at different retrieval depths can be intuitively assessed. A steeper initial slope and a consistently higher overall curve indicate better performance of the model in the top-N results.
4.3. Experimental Results Comparison
To further substantiate the thoroughness and efficacy of the proposed approach, cross-modal retrieval trials were performed across the three datasets employing hash codes with lengths of 16, 32, and 64 bits. As shown in Table 1, FFCPH demonstrates the best overall retrieval performance on all three widely used datasets (MIRFLICKR-25K, NUS-WIDE, and MS COCO) in cross-modal retrieval applications.
For the I→T task on MIRFLICKR-25K, FFCPH slightly underperforms DCHMT at 16 bits but outperforms DNPH by 1.8% and 2.8% at 32 and 64 bits, respectively. In the T→I task, it also trails SSAH at 16 bits, while surpassing MCCMR by 0.2% and 1.7% at 32 and 64 bits, respectively.
For the I→T task on the NUS-WIDE dataset, FFCPH achieves improvements of 1.5%, 2.3%, and 2.4% over DNPH, DAPH, and DNPH at 16, 32, and 64 bits, respectively. In the T→I task, it outperforms DNPH by 1.8%, 1.9%, and 1.0% at the three bit lengths.
For the I→T task on the MS COCO dataset, FFCPH surpasses DNPH by 4.5%, 4.8%, and 4.5% at 16, 32, and 64 bits, respectively. In the T→I task, it outperforms DNPH by 4.6%, 4.2%, and 3.5% at the corresponding bit lengths.
These results show that the performance improvement of the proposed method is more pronounced on MS COCO than on NUS-WIDE and MIRFLICKR-25K. This is due to the fact that MS COCO contains 80 categories, significantly more than the 21 in NUS-WIDE and 24 in MIRFLICKR-25K. While previous methods are highly sensitive to the number of categories, the proposed approach demonstrates greater robustness and broader applicability.
Overall, FFCPH performs best on all three datasets, further confirming its stability and generalization ability under varying hash code dimensions.
Figure 4 and Figure 5 present a comparison of Top-N Precision curves and PR curves between the FFCPH method and several mainstream approaches at a 32-bit encoding length on the three datasets.
To further observe the retrieval performance of FFCPH, Top-N Precision curves were generated for the three datasets, as shown in Figure 4.
For the MIRFLICKR-25K and MS COCO datasets, FFCPH consistently demonstrates superior performance compared to other methods at all retrieval points in both I→T and T→I tasks, reflecting its strong semantic modeling capability and retrieval stability. In the I→T task on NUS-WIDE, FFCPH initially exhibits lower precision than DADH but surpasses it once N exceeds 100 and maintains its lead thereafter, indicating superior robustness compared to other methods.
Overall, FFCPH exhibits excellent performance across all tasks on the three mainstream datasets. This performance shows the model’s exceptional adaptability and reliable effectiveness in learning and transferring semantic associations across different modalities.
To further observe the retrieval performance of FFCPH, PR curves were generated for the three datasets, as shown in Figure 5.
On the MIRFLICKR-25K dataset, FFCPH achieved the best results in both I→T and T→I tasks, consistently outperforming other methods across all retrieval points. This demonstrates FFCPH’s ability to stably retrieve more relevant samples.
On the NUS-WIDE dataset, the overall precision of FFCPH in both tasks was slightly lower than that on MIRFLICKR-25K. Nevertheless, it maintained optimal performance and stability across the entire recall range.
On the MS COCO dataset, FFCPH exhibited slightly lower precision than DNPH and MIAN at low recall rates (Recall < 0.4) in the I→T task but consistently outperformed other methods when Recall > 0.4, with significantly higher precision. In the T→I task, FFCPH showed marginally lower precision than DNPH when Recall < 0.2 but maintained superiority thereafter. These results indicate the high robustness and strong generalization capacity of the FFCPH model.
In summary, FFCPH demonstrates high precision-recall balance across different datasets and tasks, effectively controlling false positives while ensuring sufficiently high recall rates.
4.4. Ablation Study
The FFCPH framework processes raw image and text data through a feature extraction network to obtain image features and text features separately. These features are then integrated via a feature fusion module to generate fused features that exhibit both discriminative power and robustness. Finally, hash function learning is accomplished under the guidance of a joint loss function composed of cross-modal proxy loss, cross-modal irrelevant loss, and cross-modal consistency loss.
In this section, to validate the effectiveness of individual components in FFCPH, we design four variants of the model to evaluate the contribution of each component to overall performance. The variants are constructed as follows:
FFCPH-1: Replaces the CLIP-based image encoder with a ResNet-18 network.
FFCPH-2: Removes the feature fusion module and the cross-modal consistency loss. Only the cross-modal proxy loss and cross-modal irrelevant loss are used for optimization. In the cross-modal proxy loss, only image and text features are considered and fused features are excluded, i.e., $\mathcal{L} = \mathcal{L}_{proxy} + \beta\,\mathcal{L}_{irr}$.
FFCPH-3: Removes the cross-modal irrelevant loss and optimizes parameters using only the cross-modal proxy loss and cross-modal consistency loss, i.e., $\mathcal{L} = \mathcal{L}_{proxy} + \mathcal{L}_{con}$.
FFCPH-4: Removes the cross-modal proxy loss and optimizes the model using only the cross-modal irrelevant loss and cross-modal consistency loss, i.e., $\mathcal{L} = \beta\,\mathcal{L}_{irr} + \mathcal{L}_{con}$.
To evaluate the impact of individual components in the FFCPH model on cross-modal retrieval performance, experiments were carried out using the four model variants across the three datasets, focusing on the I→T task. The corresponding results are provided in Table 2.
FFCPH-1, a variant that replaces the CLIP image encoder with ResNet18, exhibits a significant decline in mAP across all three datasets for various hash code lengths. Specifically, the average mAP decreases by 3.0% on MIRFLICKR-25K, 2.1% on NUS-WIDE, and 5.8% on MS COCO. This indicates that the CLIP image encoder, by modeling long-range visual dependencies, extracts higher-quality semantic features compared to ResNet18.
FFCPH-2, which removes the feature fusion module and cross-modal consistency loss and optimizes the model solely with the cross-modal proxy loss and cross-modal irrelevant loss, shows a marked reduction in mAP across all datasets and hash code lengths. The average mAP drops by 0.8% on MIRFLICKR-25K, 0.6% on NUS-WIDE, and 1.3% on MS COCO. This demonstrates that the feature fusion module and cross-modal consistency loss ensure the effectiveness and modality compatibility of fused features, providing stable and semantically rich multimodal inputs for subsequent hash code generation, and thereby enhancing retrieval accuracy and robustness.
FFCPH-3, which eliminates the cross-modal irrelevant loss and optimizes parameters using only the cross-modal proxy loss and cross-modal consistency loss, also experiences a noticeable decline in mAP. The average reduction is 0.7% on MIRFLICKR-25K, 0.4% on NUS-WIDE, and 1.0% on MS COCO. This confirms that the cross-modal irrelevant loss captures fine-grained semantic relationships between data samples and disentangles irrelevant pairs.
FFCPH-4, a variant that removes the cross-modal proxy loss and optimizes the model using only the cross-modal irrelevant loss and cross-modal consistency loss, exhibits a significant decrease in mAP. The average reduction is 1.1% on MIRFLICKR-25K, 1.0% on NUS-WIDE, and 1.9% on MS COCO. This indicates that the cross-modal proxy loss ensures that relevant data-proxy pairs are embedded closely while pushing apart irrelevant pairs.
In summary, the ablation study validates the contributions of all modules in FFCPH to overall retrieval performance. The cross-modal proxy loss establishes semantic associations between samples and category proxies, ensuring the discriminability of hash codes. The cross-modal irrelevant loss explicitly separates irrelevant sample pairs, enhancing the distinctiveness of hash codes. The feature fusion module and cross-modal consistency loss focus on guaranteeing the effectiveness and modality compatibility of fused features, providing a high-quality feature representation foundation for the aforementioned losses. These modules collaboratively contribute to the superiority of FFCPH in cross-modal retrieval.
4.5. Parameter Sensitivity Analysis
This section primarily conducts a parameter analysis for $\beta$. Figure 6 illustrates the variation in mAP across different tasks and hash code lengths on the MIRFLICKR-25K dataset when $\beta$ takes different values. The values of $\beta$ range from 0.1 to 0.9 and are used to evaluate the impact of the cross-modal irrelevant loss on retrieval performance. Specific results are shown in Figure 6.
As shown in Figure 6, in the I→T task, the model achieves optimal performance at $\beta = 0.2$ for 16-bit, $\beta = 0.8$ for 32-bit, and $\beta = 0.8$ for 64-bit codes.
In the T→I task, the model achieves optimal performance at $\beta = 0.2$ for 16-bit, $\beta = 0.1$ for 32-bit, and $\beta = 0.8$ for 64-bit codes.
Based on comparative analysis, we set the hyperparameter $\beta = 0.8$ in our experiments.