3.1. Overall Framework
As illustrated in Figure 1, which presents the foundational learning framework of our proposed method, our objective is to guide the learning of hash functions using the learned hash codes of cross-modal features. This approach enables the conversion of image and text features into binary hash codes, thereby achieving efficient and accurate cross-modal retrieval.
Our framework for learning cross-modal hash codes comprises three main components: an image feature enhancement network, a self-attention-based graph feature fusion module, and a hash function learning component. We utilize AlexNet and a bag-of-words (BoW) model to extract image and text features, respectively. First, we feed the image features into the image feature enhancement network to improve their quality, resulting in enhanced image features. Next, we concatenate these enhanced image features with the text features to create a fused feature representation. We also construct an adjacency matrix over the enhanced image features and the text features to enable the subsequent GCN-based feature fusion. Following this, we apply GCN to perform feature fusion on the enhanced image features, the text features, and the fused features. The GCN-processed enhanced image features, text features, and fused features then undergo a self-attention operation that produces an enhanced representation for each. After completing the self-attention process, we concatenate the three enhanced features to create a new fused feature representation. Finally, a unified binary hash code is generated from these new features, which is used alongside a loss function to guide the learning of the hash functions.
The image feature enhancement network enriches the representation of image features by performing multi-scale learning on the features extracted from the images. In the graph feature fusion module, we achieve cross-modal feature fusion by applying graph convolutional networks across different modalities, which strengthens the interaction between features of different modalities. Additionally, we introduce a self-attention mechanism to dynamically adjust the weights of the image features, text features, and complementary (fused) features. This adjustment improves the representation of important information, leading to a more refined feature representation.
In the hash function learning component, the new features obtained from the self-attention-based feature fusion are processed by a hash function to generate a unified binary hash code, which in turn guides the learning of both the image hash function and the text hash function. Ultimately, the input features are efficiently mapped into a compact and semantically consistent hash space, allowing us to obtain both the image hash codes and the text hash codes and thereby enabling effective cross-modal retrieval.
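To make the overall data flow concrete, the following PyTorch-style sketch traces the forward pass described above. The module and variable names (enhance_net, gcn_img, attn_img, hash_layer, and so on) are illustrative assumptions rather than the actual implementation, and the features are assumed to be flattened (n × d) matrices at the fusion stage.

```python
import torch

def amgfh_forward(img_feat, txt_feat, modules):
    """Sketch of the forward pass of the proposed framework (Section 3.1).

    img_feat, txt_feat: (n, d_img) and (n, d_txt) feature matrices from AlexNet/BoW.
    modules: dict holding the sub-networks introduced in later subsections."""
    # 1. Multi-scale enhancement of the image features (Section 3.2).
    img_enh = modules["enhance_net"](img_feat)

    # 2. Concatenate image and text features into fused features, and build
    #    the adjacency matrix from modality similarities (Section 3.3).
    fused = torch.cat([img_enh, txt_feat], dim=1)
    adj = modules["build_adjacency"](img_enh, txt_feat)

    # 3. GCN-based feature fusion for each representation.
    img_g = modules["gcn_img"](img_enh, adj)
    txt_g = modules["gcn_txt"](txt_feat, adj)
    fus_g = modules["gcn_fus"](fused, adj)

    # 4. Self-attention refinement, then concatenation into the new fused features.
    img_a = modules["attn_img"](img_g)
    txt_a = modules["attn_txt"](txt_g)
    fus_a = modules["attn_fus"](fus_g)
    new_fused = torch.cat([img_a, txt_a, fus_a], dim=1)

    # 5. Hash layer maps the new fused features to c-bit binary codes (Section 3.4).
    return torch.sign(torch.tanh(modules["hash_layer"](new_fused)))
```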
3.2. Image Feature Enhancement Network
Multi-scale feature learning has garnered significant attention in academic research and has been applied across various fields [29]. This approach enhances the semantic information of images by extracting local features at different scales, allowing images in complex scenes to be represented more effectively. However, existing studies might not fully capture the comprehensive and rich semantic information needed for more intricate image scenarios. Therefore, this section focuses on strategies for better expressing this rich semantic information at the initial stage, to aid the subsequent experiments.
Firstly, the image feature enhancement network is mainly designed to dynamically fuse multi-scale local and global features through multi-scale feature learning, so as to obtain an enhanced expression of the image features. Specifically, after image feature extraction is completed, multi-scale image feature representations $M_k$ are extracted from the image features $F_I$ by convolution kernels of different sizes (1 × 1, 3 × 3, 5 × 5, 7 × 7, 9 × 9) and combined by dynamically weighted fusion, ultimately yielding the enhanced representation of the image features $\hat{F}_I$, where $M$ denotes the intermediate feature output of multi-scale feature learning. Thus, we can derive the calculation below:

$$
M_k = \mathrm{Conv}_{k \times k}(F_I), \qquad M = \sum_{k} w_k \, M_k,
$$

where $k \in \{1, 3, 5, 7, 9\}$ denotes these different kernel sizes. The weights $w_k$ for each scale are dynamically learned by applying the constructed MLP network to $F_I$, thus resulting in the following:

$$
w = \mathrm{softmax}\big(\mathrm{MLP}(F_I)\big).
$$
The final convolution is performed through a 1 × 1 kernel (i.e., $\mathrm{Conv}_{1 \times 1}$), whose output is linked (concatenated) with $F_I$ to obtain the enhanced version of the image feature representation $\hat{F}_I$, after which the dimensions are unified. The computational formula is expressed as follows:

$$
\hat{F}_I = \big[\, \mathrm{Conv}_{1 \times 1}(M);\; F_I \,\big].
$$
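As a concrete illustration of this enhancement step, the PyTorch sketch below implements one plausible realization under stated assumptions: the input is a convolutional feature map, the dynamic scale weights come from a small MLP applied to globally pooled features followed by a softmax, and a final 1 × 1 convolution fuses the weighted multi-scale output with the original features. The class name MultiScaleEnhance and these design details are assumptions, not the exact architecture of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleEnhance(nn.Module):
    """Hypothetical sketch of the image feature enhancement network."""
    def __init__(self, channels, kernel_sizes=(1, 3, 5, 7, 9)):
        super().__init__()
        # One convolution branch per kernel size; padding keeps the spatial size fixed.
        self.branches = nn.ModuleList(
            [nn.Conv2d(channels, channels, k, padding=k // 2) for k in kernel_sizes]
        )
        # Small MLP that predicts one dynamic weight per scale from pooled features.
        self.weight_mlp = nn.Sequential(
            nn.Linear(channels, channels // 2),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 2, len(kernel_sizes)),
        )
        # Final 1x1 convolution that fuses the multi-scale output with the input features.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):                                        # x: (B, C, H, W)
        pooled = F.adaptive_avg_pool2d(x, 1).flatten(1)          # (B, C) global context
        w = torch.softmax(self.weight_mlp(pooled), dim=1)        # (B, K) dynamic weights
        multi = torch.stack([b(x) for b in self.branches], dim=1)  # (B, K, C, H, W)
        m = (w[:, :, None, None, None] * multi).sum(dim=1)       # weighted fusion
        return self.fuse(torch.cat([m, x], dim=1))               # enhanced image features
```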
3.3. Graph Feature Fusion Module
Recently, GCNs have garnered significant attention; they excel at capturing complex relationships and features among graph nodes and have advanced rapidly in both image processing and natural language processing [30,31]. In this work, we utilize GCN to fuse cross-modal features and apply varying weights through a self-attention mechanism to enhance the important information within these features. This approach aims to improve the expressive capability of the features and enrich the semantic representation among cross-modal features. In this process, symmetry is reflected in the use of symmetrical image–text pairs during training; we utilize similarity matrices to establish and preserve this symmetrical relationship when applying GCN for feature learning, which helps to capture the correlation between multi-modal features more accurately and enhances the effectiveness of feature learning. In this context, the nodes of the graph represent feature points extracted from the images or text, and the edges connecting these nodes are encoded by an adjacency matrix that indicates the similarity between the image or text data points. Initially, the image and text features are processed by their respective GCN encoders, and these features are then combined to create the fused features. The processing can be mathematically detailed using feature fusion as an example, as outlined below.
First, we construct the adjacency matrix $S$ for the enhanced image features $\hat{F}_I$ and the text features $F_T$, and $S$ is calculated by the following formula:

$$
S = \beta S_I + (1 - \beta) S_T,
$$

where $S_I$ denotes the similarity matrix of the image modality, $S_T$ denotes the similarity matrix of the text modality, and $\beta$ is the balance parameter. The computation rules of $S_I$ and $S_T$ are the same, and the formula for each element of $S_I$ is as follows:

$$
S^{I}_{ij} = \cos\big(\hat{f}_i, \hat{f}_j\big) = \frac{\hat{f}_i^{\top} \hat{f}_j}{\|\hat{f}_i\| \, \|\hat{f}_j\|},
$$

where $\cos(\cdot, \cdot)$ denotes the cosine similarity between two vectors, and $\hat{f}_i$ and $\hat{f}_j$ denote the instance features of the $i$-th and $j$-th images, respectively.
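A minimal sketch of this adjacency construction is given below, assuming row-wise instance features and a scalar balance parameter; the function name build_adjacency is illustrative.

```python
import torch
import torch.nn.functional as F

def build_adjacency(img_feat, txt_feat, beta=0.5):
    """Combine image and text cosine-similarity matrices into one adjacency matrix.

    img_feat: (n, d_img) enhanced image features, txt_feat: (n, d_txt) text features,
    beta: balance parameter between the two modality similarity matrices."""
    img_n = F.normalize(img_feat, dim=1)       # unit-norm rows for cosine similarity
    txt_n = F.normalize(txt_feat, dim=1)
    s_img = img_n @ img_n.t()                  # similarity matrix of the image modality
    s_txt = txt_n @ txt_n.t()                  # similarity matrix of the text modality
    return beta * s_img + (1 - beta) * s_txt   # balanced cross-modal adjacency matrix
```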
In the feature fusion stage, we first concatenate the obtained image feature representation $\hat{F}_I$ and the text feature representation $F_T$ to obtain the new feature representation $F_C$ (features fused from $\hat{F}_I$ and $F_T$), and we use GCN to obtain the new feature after fusing $F_C$ with the image–text graph structure. The new feature representation is given by the following formula:

$$
G^{(l+1)} = \sigma\!\Big( D^{-\frac{1}{2}}\, S\, D^{-\frac{1}{2}}\, G^{(l)}\, W^{(l)} \Big), \qquad G^{(0)} = F_C,
$$

where $S$ denotes the adjacency matrix, $D$ is the degree matrix of $S$, and $W^{(l)}$ is the layer-$l$ learnable weight matrix.
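The sketch below shows one GCN propagation step of this form, treating the instances as graph nodes and the matrix S from the previous formula as the graph; the class name GCNLayer and the choice of ReLU as the activation are assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: G' = ReLU(D^{-1/2} S D^{-1/2} G W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)   # learnable W^(l)

    def forward(self, g, s):              # g: (n, d) node features, s: (n, n) adjacency
        deg = s.sum(dim=1).clamp(min=1e-8)          # node degrees (clamped to avoid /0)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        s_norm = d_inv_sqrt @ s @ d_inv_sqrt        # symmetric normalization
        return torch.relu(s_norm @ self.weight(g))
```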
In order to further enhance the representation of the features, we subsequently apply a self-attention operation to the GCN-processed image features, text features, and fused features, respectively, in which the weights of the features are dynamically adjusted by calculating the interrelationships among the features, thus enhancing the representation of each of the three. Taking the fused features as an example, the input feature is $X$ with dimension $n \times d$, where $n$ is the number of samples and $d$ is the feature dimension. We define three linear transformations, $Q = X W_Q$, $K = X W_K$, and $V = X W_V$, where $W_Q$, $W_K$, and $W_V$ are learnable weight matrices used to compute the query, key, and value, respectively. Subsequently, the dot product of the query and key is computed and normalized by the softmax function to obtain the attention weights with dimension $n \times n$. The attention weights are calculated by the following formula:

$$
A = \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right),
$$

where $d_k$ is the dimension of the key, which is used to scale the dot-product result to prevent the gradient from vanishing. The values are then weighted and summed using the attention weights to obtain the enhanced feature representation. The formula for the weighted summation is the following:

$$
Z = A V.
$$

Among them, $V$ and $X$ have the same dimension $n \times d$, so the output dimension of $Z$ is also $n \times d$. Finally, the enhanced image, text, and fused feature representations obtained by the self-attention operation are subjected to feature fusion (concatenation) to obtain the new representation $F^{*}$. In these self-attention calculations, we enhance the representation of important features by incorporating an attention mechanism into the image features, text features, and fused features; the self-attention weights are computed and dynamically adjusted through fully connected layers.
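For reference, a single-head self-attention block over an (n, d) feature matrix, followed by the concatenation of the three enhanced representations, might look as follows; the class and variable names are illustrative.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head self-attention: Z = softmax(Q K^T / sqrt(d_k)) V."""
    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Linear(dim, dim, bias=False)   # W_Q
        self.w_k = nn.Linear(dim, dim, bias=False)   # W_K
        self.w_v = nn.Linear(dim, dim, bias=False)   # W_V

    def forward(self, x):                            # x: (n, d)
        q, k, v = self.w_q(x), self.w_k(x), self.w_v(x)
        attn = torch.softmax(q @ k.t() / math.sqrt(k.size(-1)), dim=-1)  # (n, n) weights
        return attn @ v                              # (n, d) enhanced features

# The three enhanced representations are then concatenated into F*:
# new_fused = torch.cat([attn_img(img_g), attn_txt(txt_g), attn_fus(fus_g)], dim=1)
```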
3.4. Hash Function Learning
This research aims to develop an effective cross-modal hash retrieval algorithm using hash functions. After completing the image–text feature fusion process, the newly generated features will undergo further processing with hash functions to create unique binary hash codes. These codes will guide the development of both image and text hash functions, which will be optimized by minimizing target functions. This approach will enable the efficient mapping of input features into a compact and semantically consistent hash space. By leveraging hash functions, we aim to preserve the semantic information of multi-modal data, thereby improving both the efficiency and accuracy of cross-modal retrieval.
Specifically, hash learning is performed on the new feature matrix $F^{*} \in \mathbb{R}^{n \times d^{*}}$ obtained after feature fusion, where $n$ is the number of samples and $d^{*}$ is the feature dimension; $F^{*}$ is mapped to a lower dimension $c$ (the hash code length) by a hash function $f_h(\cdot)$, denoted by Equation (7):

$$
H = f_h\big(F^{*}; \theta_h\big) \in \mathbb{R}^{n \times c}.
$$
The fused features after hash coding can be expressed as $B = \mathrm{sign}(H) \in \{-1, +1\}^{n \times c}$. The hash code $B$ of the fused features guides the learning of the image hash function $f_x(\cdot)$ and the text hash function $f_y(\cdot)$. The optimization of the image hash function $f_x(\cdot)$ and the text hash function $f_y(\cdot)$ is adjusted through the target function, and the formula is as follows:

$$
\min_{\theta_x,\, \theta_y} \; \big\| B - f_x(X; \theta_x) \big\|_F^2 + \big\| B - f_y(Y; \theta_y) \big\|_F^2,
$$

where $X$ denotes the image features, $Y$ denotes the text features, $\theta_x$ denotes the parameters of the image hash function, and $\theta_y$ denotes the parameters of the text hash function. The final image hash code is denoted as $B_x = \mathrm{sign}\big(f_x(X; \theta_x)\big)$, and the text hash code is denoted as $B_y = \mathrm{sign}\big(f_y(Y; \theta_y)\big)$. Overall, we first learn the unified hash code representation $B$ through the feature learning process described in the sections above; the hash function learning process is then implemented by using this hash code as supervision. Furthermore, we introduce a series of loss terms in the next subsection to ensure that the proximity of the data points is preserved during the learning process.
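As an illustration of this supervision scheme, the sketch below derives a unified binary code from the fused features and trains separate image and text hash networks to regress toward it. The network shapes, the tanh relaxation, and the squared-error fitting term are assumptions about one common way to realize such hash functions, not the exact design of the paper.

```python
import torch
import torch.nn as nn

class HashNet(nn.Module):
    """Maps modality features to c continuous code values in (-1, 1)."""
    def __init__(self, in_dim, code_len):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(inplace=True),
            nn.Linear(512, code_len), nn.Tanh(),
        )

    def forward(self, x):
        return self.net(x)

def hash_learning_step(new_fused, img_feat, txt_feat, hash_layer, img_net, txt_net):
    """new_fused: (n, d*) fused features F*; img_net / txt_net: modality hash networks."""
    b = torch.sign(hash_layer(new_fused)).detach()     # unified binary code B (supervision)
    loss_img = ((img_net(img_feat) - b) ** 2).mean()   # image hash function fits B
    loss_txt = ((txt_net(txt_feat) - b) ** 2).mean()   # text hash function fits B
    return loss_img + loss_txt
```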
3.5. Loss Functions
The overall loss function of AMGFH consists of several components. First, the discretization loss (i.e., DIS) ensures that the generated hash code stays close to binary values, thereby enhancing the retrieval efficiency of the hash code. The specific formula is as follows:

$$
\mathcal{L}_{DIS} = \big\| H - \mathrm{sign}(H) \big\|_F^2,
$$

where $H$ denotes the generated hash code. By minimizing this loss function, the model learns to generate hash codes that are close to binary, thus improving retrieval efficiency.
Subsequently, the regularization loss (i.e., REG) is used to prevent model overfitting by constraining the model parameters through $\ell_2$ regularization. The specific formula is as follows:

$$
\mathcal{L}_{REG} = \lambda \, \|\theta\|_2^2,
$$

where $\lambda$ is the regularization parameter and $\theta$ denotes the model parameters. By minimizing this loss function, the model is able to avoid overfitting, thus improving its generalization ability.
Then, the adversarial loss (i.e., ADV) is used to enhance the generation of hash codes; it is inspired by the training idea of generative adversarial networks (GANs) and makes the generated hash codes more effective. The specific formula for this term is as follows:

$$
\mathcal{L}_{ADV} = \mathbb{E}_{x \sim p_{data}}\big[\log D(x)\big] + \mathbb{E}_{\hat{x} \sim p_{G}}\big[\log\big(1 - D(\hat{x})\big)\big],
$$

where $D$ is the discriminator, $G$ is the generator, $p_{data}$ is the true data distribution, and $p_{G}$ is the generated data distribution. By optimizing this adversarial objective, which the generator minimizes and the discriminator maximizes, the model learns to generate more realistic hash codes, thus improving retrieval performance. By combining Equations (8)–(11), the overall loss function is optimized in an iterative training process, and the network eventually converges for the subsequent cross-modal retrieval.
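To show how these terms could be assembled in practice, the sketch below computes each loss in PyTorch; the weighting coefficients, the discriminator interface (assumed to output probabilities), and the choice of which codes count as "real" versus "generated" are assumptions, since the excerpt does not fully specify the adversarial setup.

```python
import torch

def discretization_loss(h):
    """DIS: push continuous codes toward binary values (+1 / -1)."""
    return ((h - torch.sign(h)) ** 2).mean()

def regularization_loss(model, lam=1e-4):
    """REG: l2 penalty on model parameters to limit overfitting."""
    return lam * sum(p.pow(2).sum() for p in model.parameters())

def adversarial_loss(discriminator, real_codes, fake_codes):
    """ADV: GAN objective; the discriminator maximizes it, the generator minimizes it."""
    real = torch.log(discriminator(real_codes) + 1e-8).mean()
    fake = torch.log(1 - discriminator(fake_codes) + 1e-8).mean()
    return real + fake

# During iterative training, the hash-learning objective and these terms are combined
# with trade-off hyperparameters into the overall loss and optimized until convergence.
```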