HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval

Image-text retrieval aims to retrieve relevant results in one modality given a query in another modality. As a fundamental problem in cross-modal retrieval, image-text retrieval remains challenging owing to the complementary and imbalanced relationship between different modalities (i.e., image and text) and different granularities (i.e., global level and local level). However, existing works have not fully considered how to effectively mine and fuse the complementarities between images and texts at different granularities. Therefore, in this paper, we propose a hierarchical adaptive alignment network, whose contributions are as follows: (1) We propose a multi-level alignment network, which simultaneously mines global-level and local-level data, thereby enhancing the semantic association between images and texts. (2) We propose an adaptive weighted loss to flexibly optimize the image-text similarity in two stages within a unified framework. (3) We conduct extensive experiments on three public benchmark datasets (Corel 5K, Pascal Sentence, and Wiki) and compare our method with eleven state-of-the-art methods. The experimental results verify the effectiveness of the proposed method.


Introduction
In the information age, images and texts are two of the most important types of data for understanding the natural world. Designing efficient retrieval methods has therefore become an essential prerequisite for obtaining multi-modal information. For example, when users are interested in an image, they can use it to retrieve related texts through image-text retrieval technologies, and vice versa. However, the heterogeneous properties of images and texts make mutual retrieval between them challenging. To realize high-precision image-text retrieval [1], the heterogeneous gap [2] must be bridged.
To bridge this gap, mainstream research on image-text retrieval focuses on learning a common embedding space for the two modalities. Existing methods can be roughly divided into (1) global-level alignment methods [3][4][5][6] and (2) local-level alignment methods [7][8][9][10][11]. Specifically, global-level alignment methods map the whole image and the whole text into a common latent embedding space and then calculate the image-text similarity. However, images contain visual objects and texts contain key words, and these local-level features are important when calculating the image-text similarity. To exploit them, a number of local-level alignment methods have been proposed that learn the relationship between the features of image patches and words. For example, SCAN [7] adopts a stacked cross-attention module to conduct local-level alignment between visual objects and textual key words, capturing more comprehensive cross-modal associations. In addition, most image-text retrieval methods use the triplet loss [12] to optimize model parameters and achieve better retrieval performance.
Although these methods perform well, we believe that two limitations prevent them from achieving better retrieval performance. These limitations motivate this work:
• Global-level features contain general information about images or texts, while local-level features focus on their details. However, most existing methods take only a single-level feature into account when calculating the image-text similarity, ignoring the different roles and effects of different features. Therefore, we propose to fully explore the integration of hierarchical alignment features so that image-text retrieval can provide more accurate results.
• Using the triplet loss for optimization has two disadvantages. First, the training samples are organized into triplets, which produces a large number of redundant pairs that carry little information. Randomly sampling these training pairs results in slow convergence. Second, the triplet loss optimizes all training pairs with the same strength, which fails to exploit the differences among training samples and leads to performance degradation. We therefore suggest considering two iterative stages, sampling and weighting, in the design of the loss function.
To filter out redundant information, we only select some representative samples: (1) to generate positive pairs, samples that are farther away from the anchor are chosen, and (2) to generate negative pairs, samples that are closer to the anchor are utilized. For better performance, we assign different weights to different positive pairs and negative pairs by fully exploiting the discriminative training samples.
Inspired by the above discussions, we propose a Hierarchical Adaptive Alignment Network (HAAN) for image-text retrieval, which combines hierarchical alignment and adaptive optimization to enhance retrieval performance.
Specifically, as shown in Figure 1, when matching an image with a text, we integrate global-level alignment and local-level alignment to learn a better image-text correlation. Global-level alignment learns the similarity between the whole image and the whole text based on global-level features. In contrast, local-level alignment learns the similarity between image blocks and keywords based on local-level features. We then carry out an adaptive optimization of the global-level similarity and the local-level similarity and fuse them together.
Figure 1. An example of hierarchical alignment for image-text correlation learning, which not only explores the global-level alignment between the whole image and text, but also considers the local-level alignment between image blocks and keywords. Example captions: "A boy doing a wheelie on a plank with the beach in the background." "A boy is doing a wheelie on a bike." "A boy on a bike doing a wheelie at the end of a diving board." "A boy wearing a helmet does a trick on his dirt bike near a beach." "A boy on bike popping a wheelie from a rooftop."
The main contributions of this paper are summarized as follows:
• A hierarchical adaptive alignment network is proposed to exploit multi-level clues within images and texts. It fully explores the integration of global-level and local-level features to improve the performance of image-text retrieval.
• We put forward an adaptive weighted loss to accurately optimize image-text similarity in two stages. In stage 1, we select positive and negative pairs that contain rich information to accelerate convergence. In stage 2, we design different weights for different pairs to achieve better performance.
• Extensive experiments on three widely used benchmark datasets show that, compared with several state-of-the-art image-text retrieval approaches, the proposed method tends to achieve the best performance.
The rest of this paper is organized as follows. Section 2 briefly reviews the work related to this study. Section 3 details our method. Section 4 presents a series of experiments that verify and analyze the proposed method. Finally, Section 5 summarizes the whole paper.

Related Work
In this section, we briefly review representative methods for image-text retrieval. Specifically, we discuss the mainstream methods in Section 2.1, and then discuss the application of metric learning for image-text retrieval in Section 2.2.

Image-Text Retrieval
In order to bridge the heterogeneous gap between images and texts, the mainstream of existing methods focuses on building a common embedding space to calculate the similarity between different modalities. Learning image-text correlation with global-level or local-level features is very commonly seen in previous works.

Image-Text Retrieval Using Global-Level Features
Many image-text retrieval methods concentrate on global-level information to match images and texts, that is, on capturing the global-level visual-textual correspondence. In 2017, Faghri et al. [5] encoded the image with a CNN and proposed a GRU-based text encoder to extract sentence features. Wang et al. [4] put forward a two-branch network to learn the correspondence between different modalities. By incorporating a generative model into image-text embedding, Gu et al. [3] explored richer representations. Wen et al. [6] put forward a cross-memory network with pair discrimination, which captures the common knowledge between the image and text modalities. Although these methods have made great achievements, they ignore the local-level alignment between different modalities.

Image-Text Retrieval Using Local-Level Features
Image-text retrieval methods that depend on local-level features have also become predominant in recent years. They align distinct image regions with the corresponding words that describe particular objects. To discover the latent region-word correspondences, Lee et al. [7] proposed a stacked cross-attention module. Liu et al. [8] presented a bi-directional focal attention network, in which image-text alignment is learned by emphasizing relevant fragments. Li et al. [9] introduced a visual reasoning network that produces relationship-enhanced visual features. Additionally, Chen et al. [11] put forward an iterative matching scheme with a recurrent attention memory module designed to capture image-text correspondences. Zhang et al. [10] proposed a negative-aware attention framework, in which both the positive influence of matched fragments and the negative effect of mismatched fragments are used together to infer image-text similarity.
In conclusion, most current methods take only single-level information into consideration when calculating the image-text similarity. Different from the previous studies mentioned above, we fully mine and fuse the global-level and local-level information in images and texts, yielding more semantic information for image-text retrieval.

Metric Learning of Image-Text Retrieval
Recently, metric learning has become a hot topic. It uses a loss function to measure similarity and improves performance by pulling semantically relevant samples closer together and pushing semantically irrelevant samples apart. Schroff et al. [12] proposed the triplet loss, which learns a feature space where positive samples stay closer to the anchor and negative samples stay farther from it. For an image-text retrieval task with N image-text pairs, the time complexity of the triplet loss is $O(N^3)$, so it is not feasible to traverse all sample pairs during training. Selecting typical samples is therefore of great significance to metric learning.
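To make the cost argument concrete, the following is a minimal sketch of a hinge-style triplet loss on similarities together with a count of candidate triplets. The function names and the margin value of 0.2 are illustrative assumptions and are not taken from [12] or from our implementation.

```python
def triplet_loss(sim_pos, sim_neg, margin=0.2):
    """Hinge-style triplet loss on similarities (margin value is an assumption).

    The loss is zero once the anchor-positive similarity exceeds the
    anchor-negative similarity by at least `margin`.
    """
    return max(0.0, margin - sim_pos + sim_neg)


def count_ordered_triplets(n):
    """Number of ordered triples of distinct indices among n samples.

    This grows cubically in n, which is why exhaustive traversal of all
    triplets is impractical for large training sets.
    """
    return n * (n - 1) * (n - 2)


if __name__ == "__main__":
    print(triplet_loss(0.9, 0.1))        # easy triplet, contributes no loss
    print(triplet_loss(0.5, 0.6))        # hard triplet, positive loss
    print(count_ordered_triplets(1000))  # close to 10**9 candidate triplets
```

Most easy triplets yield zero loss, which illustrates why uniformly sampled triplets carry little training signal.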
Liong et al. [13] proposed deep coupled metric learning, which reduces the modality gap through two nonlinear transformations. Faghri et al. [5] introduced a variant of the triplet loss for image-text matching and reported improved results. Xu et al. [14] proposed a modality classifier, which is used to ensure that the transformed features are statistically indistinguishable. Nevertheless, the methods discussed above all treat positive and negative pairs in a balanced way.
In conclusion, the previous works mentioned above cannot precisely distinguish samples according to their significance. Some works even treat the optimization of all samples equally, resulting in poor retrieval performance and slow convergence. In this paper, we propose an adaptive weighted loss that integrates pair mining and pair weighting in a unified framework to optimize image-text similarity more accurately.

Our Method
This section details the proposed method. First, the general framework and feature extraction are explained in Sections 3.1 and 3.2, respectively. Next, the Global-level Image-Text Similarity Computation Module (GCM) and the Local-level Image-Text Similarity Computation Module (LCM) are elaborated in Section 3.3. Finally, the Adaptive Weighted Loss (AWL) is explained in detail in Section 3.4. All the important notations are listed in Table 1. Figure 2 provides the pipeline diagram of the whole solution, which illustrates the calculation process of HAAN. Solid lines represent the global-level data streams, while dotted lines represent the local-level data streams. These two types of data streams are optimized by AWL and then fused with linear weights to obtain the image-text similarity. For the global-level data streams, we extract the features of whole images and whole texts using a CNN and a Bi-GRU, respectively, and then calculate the cosine similarity. For the local-level data streams, we likewise extract the features of image patches and words with a CNN and a Bi-GRU, and estimate the similarity between them using the cross-attention mechanism.
Table 1. Important notations.
Notation            Description
$L^C$               the global-level similarity matrix
$L^F$               the local-level similarity matrix
$L^C$ (optimal)     the optimal global-level similarity matrix
$L^F$ (optimal)     the optimal local-level similarity matrix
$L$                 the matrix used to perform image-text retrieval

Framework of HAAN
The HAAN consists of a Global-level Alignment Network (GAN) and a Local-level Alignment Network (LAN) based on GCM and LCM, respectively. Note that, the proposed AWL τ(·) is used to optimize the image-text similarity matrix in each subnetwork.
Firstly, we define the global-level objective function C in GAN as follows:
where the global-level similarity matrix $L^C$ is calculated by GCM and $\vartheta^C$ are the parameters of GAN. Then, we derive the gradient of the function C with respect to $\vartheta^C$ as follows:
where B is the batch size. Afterwards, the optimal global-level similarity matrix is solved as follows:
Secondly, the local-level objective function F in LAN is represented as follows:
where the local-level similarity matrix $L^F$ is calculated by LCM and $\vartheta^F$ are the parameters of LAN. Similarly, we derive the gradient of the function F with respect to $\vartheta^F$ as follows.
Thus, the optimal local-level similarity matrix is obtained as follows.
Finally, in Equation (7) global-level and local-level optimal similarity matrices are fused by a linear weighted fusion strategy.
where L is used to perform image-text retrieval, and υ 1 , υ 2 represent the fusion coefficients. The framework of HAAN is shown in Figure 3. Noticeably, the features of the images and texts are extracted by CNN and Bi-GRU respectively, and then sent to GCM and LCM to obtain the corresponding similarity matrices. Through the optimization of AWL, the optimal similarity matrices are further obtained and fused. In addition, solid lines and dotted lines are used to represent the global-level data streams and the local-level data streams, respectively.
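The fusion and retrieval step can be sketched as follows. This is a minimal illustration that assumes the fusion is an element-wise weighted sum of the two optimized similarity matrices; the function names and the default coefficients of 0.5 are hypothetical and chosen only for the example.

```python
def fuse_similarities(sim_global, sim_local, v1=0.5, v2=0.5):
    """Element-wise linear weighted fusion of two similarity matrices.

    Both inputs are lists of rows (images) by columns (texts). The
    coefficients v1 and v2 play the role of the fusion coefficients;
    their default values here are assumptions.
    """
    if len(sim_global) != len(sim_local):
        raise ValueError("similarity matrices must have the same shape")
    return [
        [v1 * g + v2 * l for g, l in zip(row_g, row_l)]
        for row_g, row_l in zip(sim_global, sim_local)
    ]


def rank_texts(fused, image_index):
    """Return text indices ordered from most to least similar to one image."""
    row = fused[image_index]
    return sorted(range(len(row)), key=lambda q: row[q], reverse=True)


if __name__ == "__main__":
    sim_global = [[0.9, 0.2], [0.1, 0.8]]
    sim_local = [[0.5, 0.6], [0.3, 0.7]]
    fused = fuse_similarities(sim_global, sim_local)
    print(fused)
    print(rank_texts(fused, 0))
```

Ranking the columns of a row gives I2T retrieval; ranking the rows of a column would give T2I retrieval in the same way.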

Figure 3. The overall framework of the HAAN method. It is composed of a global-level alignment network (shown in the left half) and a local-level alignment network (shown in the right half). GCM and LCM compute the global-level and local-level image-text similarity matrices, respectively. AWL adaptively optimizes each similarity matrix before the linear weighted operation fuses the multi-level information.
To further illustrate the details of HAAN, its deep learning network architecture is shown in Figure 4.
Figure 4. The deep learning network architecture of HAAN. Legend: propagation direction, matrix multiplication, dot product, linear weighting, backpropagation (and produced derivatives), and attention mechanism (AM). Example input text: "A boy on bike popping a wheelie from a rooftop near a beach."
The left part and the right part of Figure 4 correspond to the GAN and the LAN, respectively. In the left part, we input the one-hot vector of each word to the Bi-GRU, and the feature of the whole text is obtained by averaging the word features output by the Bi-GRU; meanwhile, we use VGGnet to extract image features. Afterward, the initial image-text similarity is calculated as the dot product between the feature vectors of images and texts, and it is then optimized by AWL. The right part is similar to the left one, except that fine-grained keywords and image patches are input to the LAN, and the image-to-text attention mechanism is used to compute the initial image-text similarity. In the end, the optimized global-level and local-level similarities are linearly weighted and fused to obtain the final image-text similarity. The gradient descent procedure used in HAAN is also shown in the figure.

Feature Extraction
Consider a dataset $\{(I_p, T_q)\}_{p,q=1}^{N}$ consisting of N pairs of images and texts, where image $I_p$ contains α patches and text $T_q$ contains β words. We first extract global-level and local-level features and then encode them into a common embedding space. We use a Convolutional Neural Network (CNN) for image feature extraction and a Bidirectional GRU (Bi-GRU) for text feature extraction.

Global-Level Feature Extraction
In our work, the global-level feature vectors of $I_p$ and $T_q$ are denoted as $n^C_p \in \mathbb{R}^{1024}$ and $m^C_q \in \mathbb{R}^{1024}$, respectively. The detailed process of global-level feature extraction is explained as follows.
Global-level feature of image. We extract the feature vector $u^C_p \in \mathbb{R}^{4096}$ of image $I_p$ from the FC-7 layer of a pre-trained VGGnet [15]. Then, the feature vector is projected into the 1024-d embedding space through a fully connected layer as in Equation (8):
$$n^C_p = K^C u^C_p + b^C \quad (8)$$
where $K^C$ and $b^C$ are the weight matrix and bias term to be optimized, and $n^C_p \in \mathbb{R}^{1024}$ is the global-level feature vector of the image.
Global-level feature of text. First, the words in each text are represented as one-hot vectors and then embedded into a 300-d feature space. Formally, the y-th word in $T_q$ is denoted as $k_{q,y} \in \mathbb{R}^{300}$. Second, the original textual features are mapped into a 1024-d embedding space so that images and texts can be compared directly. As shown in Equation (9), we use the Bi-GRU to model the textual context of $T_q$ in both directions, where $\overrightarrow{h}_{q,y}$ and $\overleftarrow{h}_{q,y}$ denote the forward and backward hidden states of the Bi-GRU. The feature vector of the y-th word in $T_q$ is computed as $m_{q,y} = (\overrightarrow{h}_{q,y} + \overleftarrow{h}_{q,y})/2$. Finally, the global-level feature vector of $T_q$ is obtained by averaging all word vectors:
$$m^C_q = \frac{1}{\beta} \sum_{y=1}^{\beta} m_{q,y}$$
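The two pooling steps for text can be sketched as follows. This is a minimal illustration on plain Python lists that takes the forward and backward hidden states as given; the function names are hypothetical, and the recurrent network itself is outside the scope of the sketch.

```python
def word_features(forward_states, backward_states):
    """Average the forward and backward hidden states for each word.

    Each argument is a list with one hidden-state vector per word.
    """
    return [
        [(f + b) / 2 for f, b in zip(fwd, bwd)]
        for fwd, bwd in zip(forward_states, backward_states)
    ]


def global_text_feature(word_vectors):
    """Mean-pool all word vectors into a single global text vector."""
    count = len(word_vectors)
    dim = len(word_vectors[0])
    return [sum(vec[d] for vec in word_vectors) / count for d in range(dim)]


if __name__ == "__main__":
    # Two words with 2-d hidden states (toy sizes, not the 1024-d used here).
    forward = [[1.0, 2.0], [3.0, 4.0]]
    backward = [[3.0, 0.0], [1.0, 2.0]]
    words = word_features(forward, backward)
    print(words)
    print(global_text_feature(words))
```

The per-word vectors serve as local-level text features, and their mean serves as the global-level text feature.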

Local-Level Feature Extraction
Additionally, the local-level feature vectors of $I_p$ and $T_q$ are denoted as $N^F_p = \{n^F_{p,x} \mid x = 1, \ldots, \alpha,\ n^F_{p,x} \in \mathbb{R}^{1024}\}$ and $M^F_q = \{m^F_{q,y} \mid y = 1, \ldots, \beta,\ m^F_{q,y} \in \mathbb{R}^{1024}\}$, respectively. The detailed process of local-level feature extraction is as follows.
Local-level feature of image. Similarly, we extract the local-level feature vectors of $I_p$ with VGGnet, which are denoted as $U^F_p = \{u^F_{p,x} \mid x = 1, \ldots, \alpha,\ u^F_{p,x} \in \mathbb{R}^{4096}\}$. They are then projected into the 1024-d embedding space through a fully connected layer as in Equation (8), and $N^F_p = \{n^F_{p,x} \mid x = 1, \ldots, \alpha,\ n^F_{p,x} \in \mathbb{R}^{1024}\}$ denotes the local-level feature vectors of $I_p$.
Local-level feature of text. As in Equation (9), we use the Bi-GRU to extract the feature of each word in each text: $m_{q,y} = (\overrightarrow{h}_{q,y} + \overleftarrow{h}_{q,y})/2$. Thus, the local-level feature vectors of $T_q$ are represented as $M^F_q = \{m^F_{q,y} \mid y = 1, \ldots, \beta,\ m^F_{q,y} \in \mathbb{R}^{1024}\}$.

Image-Text Similarity Calculation
GCM and LCM are used to calculate the image-text similarity at the global level and the local level, respectively. Notably, to learn local-level correlations between images and texts more accurately, the attention mechanism is employed in LCM, which can fully aggregate local-level matches between patches and words.

GCM: Global-Level Image-Text Similarity Computation Module
The global-level feature vectors $n^C_p$ and $m^C_q$ are input to GCM to calculate the global-level image-text similarity as their cosine similarity:
$$L^C(p, q) = \frac{(n^C_p)^\top m^C_q}{\|n^C_p\| \, \|m^C_q\|}$$
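A minimal sketch of this module is given below, assuming that the global-level similarity is the cosine similarity between image and text feature vectors. The function names are hypothetical, and toy 2-d vectors stand in for the 1024-d embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two vectors; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def global_similarity_matrix(image_feats, text_feats):
    """Similarity of every image (row) with every text (column)."""
    return [
        [cosine_similarity(img, txt) for txt in text_feats]
        for img in image_feats
    ]


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[2.0, 0.0], [1.0, 1.0]]
    for row in global_similarity_matrix(images, texts):
        print(row)
```

Because cosine similarity ignores vector length, it coincides with the dot product whenever the embeddings are normalized to unit length.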

LCM: Local-Level Image-Text Similarity Computation Module
Correspondingly, the local-level feature vectors $N^F_p$ and $M^F_q$ are fed into LCM to obtain the local-level image-text similarity. In LCM, we learn a cross-attention embedding space to discover the latent alignment between the local-level features of images and texts.
First, we calculate the cosine similarity matrix U from $N^F_p$ and $M^F_q$ to reveal the associations between all possible patch-word pairs. Equation (11) gives the association between the x-th patch and the y-th word:
$$U_{xy} = \frac{(n^F_{p,x})^\top m^F_{q,y}}{\|n^F_{p,x}\| \, \|m^F_{q,y}\|} \quad (11)$$
Then we normalize U along its column dimension as in Equation (12), which uses relu(n) = max(0, n).
Afterward, for the x-th patch in $I_p$, the text-context feature vector $\kappa_{p,x}$ is defined as a weighted combination of the word representations through the attention mechanism and is computed using Equation (13), where λ is the inverse temperature of the softmax function and adjusts the smoothness of the attention distribution. To evaluate the importance of each image patch in a given text context, we compute a cosine relevance score as in Equation (14).
The similarity $L^F(p, q)$ is obtained as in Equation (15) by averaging all relevance scores.
Lastly, the global-level image-text similarity matrix $L^C$ and the local-level image-text similarity matrix $L^F$ are obtained and then optimized by the proposed AWL.
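The steps of LCM can be sketched as below. The sketch follows the image-to-text attention of SCAN [7] and makes two assumptions that should be checked against Equations (12) and (14): the normalization is an L2 normalization over patches for each word after clipping at zero, and the relevance score is the cosine between a patch and its text-context vector. All function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0.0 and nb > 0.0 else 0.0


def local_similarity(patches, words, lam=9.0, eps=1e-8):
    """Local-level similarity between one image and one text (a sketch)."""
    n_patches, n_words = len(patches), len(words)
    # Step 1: cosine similarity between every patch x and every word y.
    u = [[cosine(p, w) for w in words] for p in patches]
    # Step 2 (assumed form): clip at zero, then L2-normalize each column.
    clipped = [[max(0.0, v) for v in row] for row in u]
    col_norm = [
        math.sqrt(sum(clipped[x][y] ** 2 for x in range(n_patches))) + eps
        for y in range(n_words)
    ]
    u_bar = [[clipped[x][y] / col_norm[y] for y in range(n_words)]
             for x in range(n_patches)]
    relevance = []
    for x, patch in enumerate(patches):
        # Step 3: softmax over words with inverse temperature lam.
        logits = [lam * v for v in u_bar[x]]
        peak = max(logits)
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        attn = [e / total for e in exps]
        # Step 4: text-context vector as the attention-weighted sum of words.
        kappa = [sum(attn[y] * words[y][d] for y in range(n_words))
                 for d in range(len(patch))]
        # Step 5 (assumed form): relevance of the patch to its text context.
        relevance.append(cosine(patch, kappa))
    # Step 6: average the relevance scores over all patches.
    return sum(relevance) / n_patches


if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0]]
    print(local_similarity(patches, [[1.0, 0.0], [0.0, 1.0]]))
    print(local_similarity(patches, [[-1.0, 0.0], [0.0, -1.0]]))
```

With a larger λ the attention concentrates on the best-matching word, while a smaller λ spreads it more evenly over all words.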

The Adaptive Weighted Loss
The proposed AWL is used to optimize the image-text similarity matrix more precisely in two stages, which not only has the characteristic of fast convergence but also adaptively optimizes the image-text similarity to improve the performance of image-text retrieval.

Image-Text Pairs Sampling
Given an image or a text as an anchor, the texts or images from the same class form positive pairs with it, while the texts or images from different classes form negative pairs. The least similar positive pair and the most similar negative pair are used as references for sampling informative pairs.
Formally, assume that $s_i$ is an anchor and $s_j$ is a candidate, and that $s_i$ and $s_j$ belong to classes $h_i$ and $h_j$, respectively. If $s_i$ and $s_j$ are from the same class, i.e., $h_i = h_j$, they form a positive pair and the similarity between them is denoted as $L^+_{ij}$. Otherwise, when $h_i \neq h_j$, they form a negative pair and the similarity between them is denoted as $L^-_{ij}$. In our work, we sample informative positive and negative pairs through two margin-based conditions, one for each pair type, where ρ is a given margin. For anchor $s_i$, we denote the sets of sampled positive and negative pairs as $S^+_i$ and $S^-_i$, respectively.
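One way to realize such sampling is sketched below. The rule used here is an assumption modeled on margin-based hard-pair mining in metric learning: a positive is kept if it is less similar than the hardest negative plus the margin, and a negative is kept if it is more similar than the hardest positive minus the margin. Whether the sampling conditions in this paper take exactly this form is not confirmed, and the function name is hypothetical.

```python
def sample_informative_pairs(sim_row, labels, anchor_label, margin=0.6):
    """Select informative positive and negative candidates for one anchor.

    sim_row[j] is the similarity between the anchor and candidate j, and
    labels[j] is the class of candidate j. The comparison rule is an
    assumed form, not a confirmed reproduction of the paper's conditions.
    """
    positives = [j for j, h in enumerate(labels) if h == anchor_label]
    negatives = [j for j, h in enumerate(labels) if h != anchor_label]
    if not positives or not negatives:
        return [], []
    least_similar_pos = min(sim_row[j] for j in positives)
    most_similar_neg = max(sim_row[j] for j in negatives)
    sampled_pos = [j for j in positives
                   if sim_row[j] < most_similar_neg + margin]
    sampled_neg = [j for j in negatives
                   if sim_row[j] > least_similar_pos - margin]
    return sampled_pos, sampled_neg


if __name__ == "__main__":
    sims = [0.95, 0.3, 0.8, 0.1]
    labels = ["a", "a", "b", "b"]
    print(sample_informative_pairs(sims, labels, "a", margin=0.1))
```

In this toy example, the easy positive (similarity 0.95) and the easy negative (similarity 0.1) are filtered out, leaving only the pairs that still violate the margin.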

Image-Text Pairs Weighting
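For the input image-text similarity matrix L, the gradient of the proposed AWL τ(L, ϑ) with respect to the model parameters is as follows.
$$\frac{\partial \tau(L, \vartheta)}{\partial \vartheta} = \sum_{i=1}^{B} \sum_{j=1}^{B} \frac{\partial \tau(L, \vartheta)}{\partial L_{ij}} \frac{\partial L_{ij}}{\partial \vartheta}$$
where ϑ are the model parameters to be learned, and B is the batch size in the training process. Notably, we define $W_{ij} = \partial \tau(L, \vartheta) / \partial L_{ij}$ as a weight that indicates the role of each similarity in parameter optimization. In the following part, we elaborate on how to obtain different weights for different similarities.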
To fully exploit the imbalanced information in different pairs, we design two weighting schemes for the sampled positive and negative pairs, respectively. The first gives the weight $W^+_{ij}$ between the anchor $s_i$ and a positive candidate $s_j$, and the second gives the weight $W^-_{ij}$ between the anchor $s_i$ and a negative candidate $s_j$. To integrate $W^+_{ij}$ and $W^-_{ij}$ into a unified representation, we introduce the indicator function 1(·) in AWL.
Afterwards, the weight between the anchor $s_i$ and the candidate $s_j$ is redefined in this unified form, where $s_k$ and $s_j$ both belong to the positive candidates or both belong to the negative candidates.
To improve the performance of image-text retrieval, larger weights are given to positive pairs with lower similarity and to negative pairs with higher similarity. This strategy takes advantage of potential interactions among different pairs to learn the adaptive weights, which are used to optimize the image-text similarity. Finally, the proposed AWL is built from Equations (18)-(20), and the gradient of τ(L, ϑ) with respect to $L_{ij}$ is calculated case by case, according to whether the pair is a sampled positive pair or a sampled negative pair.
Note that AWL is adopted in both the GAN module and the LAN module of HAAN.
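The qualitative behaviour of the weighting stage can be illustrated with the sketch below. It is not the weighting formula of AWL: it uses a softmax-style relative weight within one sampled set, which is an assumption chosen only because it reproduces the stated property that harder pairs receive larger weights. The function name and the `scale` parameter are hypothetical.

```python
import math


def relative_weights(sims, positive, scale=2.0):
    """Illustrative softmax-style weights within one sampled pair set.

    For positive pairs, lower similarity gives a larger weight; for
    negative pairs, higher similarity gives a larger weight. Each weight
    depends on the other similarities in the same set, so the weights
    reflect interactions among pairs. This form is an assumption.
    """
    sign = -1.0 if positive else 1.0
    logits = [sign * scale * s for s in sims]
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    print(relative_weights([0.9, 0.3], positive=True))   # second is larger
    print(relative_weights([0.8, 0.1], positive=False))  # first is larger
```

A larger `scale` concentrates the weight on the hardest pair, while a scale of zero recovers the uniform weighting of a standard triplet-style loss.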

Experiment
In this section, we conduct experiments on three widely used cross-modal datasets. We compare HAAN with 11 state-of-the-art methods to highlight its advantages. Furthermore, parameter sensitivity analysis, convergence analysis, and ablation studies are presented to demonstrate the effectiveness of HAAN and the contribution of each of its components.

Implementation Details
In this section, we provide the model settings and training settings of HAAN used in the experiments.

Simulation Parameters
Table 2 lists the simulation parameters and their descriptions to assist in understanding the HAAN model.

Model Settings
As mentioned in Section 3.2, α is set to 9. Specifically, we split each image into 3 × 3 patches to balance the computational cost and data capacity in local-level feature extraction. As mentioned in Section 3.3, we follow [7,19] and set λ to 9. The parameter sensitivity of HAAN is elaborated in Section 4.5.

Training Settings
Our hierarchical alignment networks (i.e., GAN and LAN) are trained for E epochs in mini-batches of size B with the Adam optimizer [20]. We normalize the common embedding features in each mini-batch by the $\ell_2$-norm as described in [21], which regularizes the model and helps prevent overfitting. In addition, gradient clipping with a maximum gradient norm of 2 is applied to avoid gradient explosion.
For all models on all datasets, we set the learning rate to 0.0002 for the first E/2 epochs and decay it by a factor of 0.1 for the remaining epochs. The mini-batch size is set to 100 for Corel 5K with 100 epochs, to 10 for Pascal Sentence with 30 epochs, and to 20 for Wiki with 20 epochs. As the training sets of these datasets differ in size, the number of iterations in each epoch is not fixed. At each epoch, we evaluate the model on the validation set and keep the snapshot with the best mAP score. We then report the results of this best model on the testing set. HAAN is implemented in PyTorch [22] and run on an NVIDIA GeForce RTX 2080 GPU.
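The learning-rate schedule and gradient clipping described above can be sketched in plain Python as follows. The sketch assumes that the decay means multiplying the learning rate by 0.1 and that epochs are counted from zero; the function names are hypothetical and the code is independent of any deep learning framework.

```python
import math


def learning_rate(epoch, total_epochs, base_lr=0.0002, decay=0.1):
    """Step schedule: base rate for the first half, decayed rate afterwards.

    `epoch` is zero-based. Interpreting the decay as a multiplicative
    factor of 0.1 is an assumption.
    """
    return base_lr if epoch < total_epochs / 2 else base_lr * decay


def clip_by_norm(gradient, max_norm=2.0):
    """Rescale a gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    factor = max_norm / norm
    return [g * factor for g in gradient]


if __name__ == "__main__":
    # Corel 5K setting: 100 epochs in total.
    print(learning_rate(0, 100), learning_rate(50, 100))
    print(clip_by_norm([3.0, 4.0]))
```

Clipping by norm rescales the whole gradient and therefore preserves its direction, unlike clipping each component separately.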

Evaluation Metric and Compared Methods
We perform image-text retrieval tasks on the above three datasets, and the tasks are divided into the following two types: (1) searching text by image (I2T) and (2) searching image by text (T2I). The mean Average Precision (mAP) is used to evaluate the overall performance of the algorithms. To compute mAP, we first compute the average precision (AP) of a set of R retrieved documents by Equation (25):
$$AP = \frac{1}{T} \sum_{r=1}^{R} P(r)\,\delta(r) \quad (25)$$
where T is the number of relevant documents in the retrieved set and P(r) is the precision of the top r retrieved documents. If the r-th retrieved document is relevant (i.e., it belongs to the class of the query), then δ(r) = 1; otherwise δ(r) = 0. We then average the AP values over all queries in the query set to obtain the mAP. Methods with larger mAP are more effective. In addition, the precision-recall (PR) curve is another metric for measuring the effectiveness of different methods. PR curves show how retrieval precision varies across all recall values. As with mAP, a curve that encloses a larger area indicates a better result.
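The metric can be computed as in the following sketch, where each query is represented by a ranked list of 0/1 relevance flags. The function names are hypothetical, and returning 0.0 for a query with no relevant retrieved document is an assumed convention.

```python
def average_precision(relevance):
    """AP of one ranked list of 0/1 relevance flags.

    Sums the precision at every rank where a relevant document appears
    and divides by the number of relevant documents retrieved.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the AP values over all queries."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


if __name__ == "__main__":
    queries = [[1, 0, 1, 0], [0, 1]]
    print([average_precision(q) for q in queries])
    print(mean_average_precision(queries))
```

For the first query, relevant documents appear at ranks 1 and 3, so AP = (1/1 + 2/3) / 2; for the second, the only relevant document is at rank 2, so AP = 1/2.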

Comparison Results
Our HAAN method and 11 compared methods are evaluated on all datasets in terms of (1) I2T mAP scores, (2) T2I mAP scores, and (3) mAP(AVG) scores (i.e., the average of (1) and (2)), as shown in Table 3. In the table, traditional methods and deep learning methods are marked with separate symbols, and the best results are shown in bold. From Table 3, we can see that HAAN achieves the best retrieval performance. HAAN improves the mAP(AVG) scores by 1.83%, 1.20%, and 1.89% over the previous best model, VSRN++, on Corel 5K, Pascal Sentence, and Wiki, respectively. VSRN++ performs better than HAAN on I2T, but only by 0.57%, while HAAN achieves similarly high performance on both I2T and T2I, which indicates that HAAN is better suited to practical problems.
It is worth noting that the text in Pascal Sentence appears as a set of sentences, whereas in Corel 5K and Wiki it is represented as a set of tags. In terms of mAP scores, HAAN performs better in image-text retrieval regardless of whether sentences or tags are used. We also find that deep learning-based image-text retrieval methods perform better than traditional ones. Next, the I2T and T2I tasks are conducted on all datasets, and the PR curves are shown in Figure 5. From Figure 5, we can see that HAAN has the best overall performance because the area under its PR curve tends to be larger than that of the other methods. VSRN++ is superior to HAAN only on the I2T task on Pascal Sentence, as shown in Figure 5c; HAAN is superior to VSRN++ in all other respects.
To better evaluate our method, we compare the training time of the deep learning methods. Specifically, the source code of all methods is run on the same machine with a single GPU. From Table 4, our findings are as follows. First, DCCA and SCAN require the shortest training time but are less competitive than the other deep learning methods in image-text retrieval. Second, although MAVA, SGRAF, SCL, CGMN, and NAAF have nearly the same training time as HAAN, HAAN outperforms them on image-text retrieval tasks. Finally, VSRN++ is second only to HAAN in image-text retrieval, though it requires the longest training time.
... 29% compared with that of VSRN++ on the three datasets, respectively, which is very significant. (3) SCL, CGMN, and NAAF show outstanding performance, but they are not as good as HAAN. The reason is that these three methods do not consider both the global-level information and the local-level information. Therefore, HAAN, which considers both global-level and local-level information and further optimizes the two kinds of information, outperforms these three methods for roughly the same amount of training time. (4) The comprehensive performance of HAAN is the best on all datasets. The reason is that HAAN can mine and fuse complementarities in multi-level data to cross the heterogeneous gap. Specifically, HAAN can accurately describe complex nonlinear image-text relationships, which is a distinct advantage over traditional methods. Since HAAN utilizes both global-level and local-level information, it also significantly outperforms SCAN. Although both MAVA and SGRAF use global-level and local-level data, HAAN keeps its advantage owing to the proposed AWL, which can accurately optimize image-text similarity by integrating pair mining and pair weighting in a unified framework.
In conclusion, HAAN fuses global-level and local-level information and uses the proposed AWL to mine and enhance the two kinds of information, so it achieves the best retrieval accuracy among the compared methods. In addition, the first stage of AWL (i.e., image-text pair sampling) selects valuable information while filtering out redundant information, which accelerates convergence and reduces the training time. HAAN thus achieves both fast training and high precision.
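The comparison above relies on mAP. As a reference point, the following is a minimal sketch of how mAP is commonly computed for cross-modal retrieval. It assumes that each query and each gallery item carries a single class label and that a retrieved item counts as relevant when the labels match; whether the paper truncates the ranked list is not stated here, so the full ranking is used. The function names and the toy labels are illustrative assumptions and are not taken from the HAAN code.

```python
def average_precision(ranked_relevance):
    """AP for one query, given 0/1 relevance flags in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this hit
    return precision_sum / hits if hits else 0.0


def mean_average_precision(similarity, query_labels, gallery_labels):
    """mAP over all queries; similarity[q][g] scores gallery item g for query q."""
    ap_values = []
    for q, scores in enumerate(similarity):
        # Rank gallery items from most to least similar.
        order = sorted(range(len(scores)), key=lambda g: scores[g], reverse=True)
        flags = [gallery_labels[g] == query_labels[q] for g in order]
        ap_values.append(average_precision(flags))
    return sum(ap_values) / len(ap_values)


# Toy example (assumed data): 2 image queries against 3 texts.
sim_i2t = [[0.9, 0.2, 0.6],
           [0.1, 0.8, 0.7]]
image_labels = ["cow", "dog"]
text_labels = ["cow", "dog", "cow"]

map_i2t = mean_average_precision(sim_i2t, image_labels, text_labels)
# T2I reuses the same scores with queries and gallery swapped (transpose).
sim_t2i = [list(col) for col in zip(*sim_i2t)]
map_t2i = mean_average_precision(sim_t2i, text_labels, image_labels)
map_avg = (map_i2t + map_t2i) / 2  # the averaged score, mAP (AVG)
```

In this sketch, the averaged score is simply the mean of the I2T and T2I mAP values.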

Parameter Sensitivity and Convergence Analyses
In this section, we conduct a sensitivity analysis of the parameters and a convergence analysis of the hierarchical alignment network. The parameters involved in the proposed method are υ1 and υ2, mentioned in Section 3.1, and ρ, mentioned in Section 3.4.1. The parameter sensitivity analyses are evaluated using mAP (AVG).
First, we set ρ to {0.2, 0.4, 0.6, 0.8, 1}, and the experimental results are shown in Figure 6. When ρ is 0.6, the average mAP scores of I2T and T2I are the highest on all three datasets. Specifically, the highest mAP scores on Corel 5K, Pascal Sentence, and Wiki are 0.5751, 0.6410, and 0.5546, respectively.
Second, the results for different values of υ1 and υ2 are shown in Figure 7. According to the experimental results, when the ratio of υ1 to υ2 is 1:1, the mAP scores on the three datasets are the highest or close to the highest, which suggests that our two networks are of roughly equal importance. When the value of υ1 is fixed, the mAP first increases and then decreases as υ2 increases. The closer the values of υ1 and υ2 are to each other, the larger the mAP, which is consistent with the conclusion above. When the value of υ2 differs greatly from that of υ1, the mAP decreases rapidly. This indicates that the complementarity of global-level and local-level information is necessary to enhance the performance of image-text retrieval.
Finally, the results of the convergence experiment for GAN are shown in Figure 8. The objective function value of C decreases monotonically at each iteration, which indicates that the proposed AWL is effective. The convergence of LAN is not reported because it is similar to that of GAN.
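The sensitivity study for ρ can be read as a simple grid search scored by mAP (AVG). Below is a minimal sketch, assuming a hypothetical `evaluate(rho)` callback that trains and evaluates a model for a given ρ and returns the pair (I2T mAP, T2I mAP). The stand-in `toy_evaluate` is purely illustrative, with a peak placed at 0.6 by construction; it does not reproduce the numbers behind Figure 6.

```python
def sweep(values, evaluate):
    """Return (best_value, scores), where scores maps each value to mAP (AVG)."""
    scores = {}
    for v in values:
        map_i2t, map_t2i = evaluate(v)
        scores[v] = (map_i2t + map_t2i) / 2  # mAP (AVG)
    best = max(scores, key=scores.get)
    return best, scores


def toy_evaluate(rho):
    """Hypothetical stand-in for training plus evaluation; NOT real HAAN results."""
    peak = 0.6  # assumed peak, chosen only for illustration
    score = 1.0 - (rho - peak) ** 2
    return score, score


rho_grid = [0.2, 0.4, 0.6, 0.8, 1.0]
best_rho, avg_scores = sweep(rho_grid, toy_evaluate)
```

The same loop could be nested over candidate values of υ1 and υ2 to sketch the study in Figure 7.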

Ablation Study
In this section, a series of ablation studies are conducted under different configurations of critical components of HAAN, in order to study the contribution of each component in the model.
As shown in Table 5, several models are provided for the ablation studies to reveal the effectiveness of GCM, LCM, AWL (Stage 1), and AWL (Stage 2). Each entry of Table 5 marks whether the module (or loss function) is contained in the model. To further demonstrate the effectiveness of AWL, we compare it with the triplet loss (TRI) [12]; in the corresponding ablation model, TRI is used to replace AWL. We provide 7 combinations of the above 5 components (e.g., HAAN-GCM represents HAAN with only the GCM module). The experimental results of the ablation studies are shown in Table 6, where the best results are shown in bold. The following conclusions can be drawn.
• The mAP value of HAAN-LCM is 1.45%, 2.09%, and 1.7% higher than that of HAAN-GCM on Corel 5K, Pascal Sentence, and Wiki, respectively. This is because the LAN captures more details through the attention mechanism and thus obtains more valuable information.
• The performance of HAAN-GCM-LCM is better than that of HAAN-GCM and HAAN-LCM. This shows that global-level and local-level information are complementary and that better performance can be achieved by integrating the two networks (i.e., GCM and LCM).
• The mAP scores of HAAN-GCM-LCM-AWL (Stage 1) and HAAN-GCM-LCM-AWL (Stage 2) are very close, indicating that the two stages of AWL are almost equally important in image-text similarity optimization. Furthermore, both are significantly better than HAAN-GCM-LCM-TRI. It is worth noting that the mAP value obtained with either stage of AWL is higher than that of the triplet loss on the three datasets by at least 1.3%, 2.63%, and 2.4%, respectively. This is because the two stages of AWL each address one of the two major flaws of the triplet loss.
• The mAP score of HAAN is much higher than that of HAAN-GCM-LCM-AWL (Stage 1) and HAAN-GCM-LCM-AWL (Stage 2). This is because integrating the two stages (i.e., AWL (Stage 1) and AWL (Stage 2)) compensates for the shortcomings of each single stage.
To further verify the effectiveness of each factor, we add a column titled "AVG of all datasets" to Table 6. First, the performances of HAAN-GCM and HAAN-LCM are close to each other, which suggests that fine-grained and coarse-grained data are of comparable importance in image-text retrieval. The values of HAAN-GCM-LCM-AWL (Stage 1) and HAAN-GCM-LCM-AWL (Stage 2) are approximately the same, meaning that the effects of Stage 1 and Stage 2 are almost equal. Furthermore, when one stage of AWL is added, the performance improves by about 3% compared with using the global-level or local-level module alone. This large improvement shows that AWL is very effective in promoting the performance of HAAN.
At the same time, the performance of HAAN is about 2% better than that obtained with a single stage of AWL, which confirms the benefit of combining the two stages.
Overall, all four components are important. GCM and LCM lay the foundation for subsequent optimization and further improvement of the model. AWL, when applied to fully aggregated information (i.e., global-level and local-level information), quickly improves the overall performance of the model. When the two stages are employed together, AWL improves on TRI by roughly 4% or more, which confirms its strong optimization effect. The main findings are summarized below.
In summary, we draw the following conclusions: (1) each component of HAAN plays a positive role in the image-text retrieval task; (2) HAAN effectively mines and fuses the complementarity in multi-granularity data, which provides essential clues for bridging the heterogeneous gap.
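For reference, the TRI baseline used in the ablation can be sketched as a hinge-based bidirectional triplet ranking loss over a batch similarity matrix. This is an assumed, commonly used formulation; the exact variant from [12] and its margin are not specified in this section, so `margin=0.2` and the `hardest_only` switch are illustrative choices. The sketch does not implement AWL.

```python
def triplet_ranking_loss(sim, margin=0.2, hardest_only=False):
    """sim[i][j] = similarity of image i and text j; matched pairs lie on the diagonal."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Image i as anchor: every non-matching text j is a negative.
        i2t = [max(0.0, margin - pos + sim[i][j]) for j in range(n) if j != i]
        # Text i as anchor: every non-matching image j is a negative.
        t2i = [max(0.0, margin - pos + sim[j][i]) for j in range(n) if j != i]
        if hardest_only:
            # Keep only the most violating negative in each direction.
            total += max(i2t, default=0.0) + max(t2i, default=0.0)
        else:
            # Sum the hinge terms of all negatives with equal weight.
            total += sum(i2t) + sum(t2i)
    return total


# Toy similarity matrix (assumed values) for a batch of 3 image-text pairs.
sim = [[0.9, 0.8, 0.1],
       [0.3, 0.7, 0.2],
       [0.4, 0.6, 0.5]]
loss_sum = triplet_ranking_loss(sim)
loss_hard = triplet_ranking_loss(sim, hardest_only=True)
```

In both variants of this sketch, a violating pair is either counted with a fixed weight of one or ignored.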

Qualitative Results
We provide typical examples of image-text retrieval on the Pascal Sentence dataset for two state-of-the-art image-text retrieval methods (i.e., VSRN++ and NAAF) as well as HAAN. Figure 9 shows the top ten results for specific I2T and T2I queries. In particular, in Figure 9a, we select two I2T queries, "cow" and "dog". In Figure 9b, we select two T2I queries, "aeroplane" and "train".
Figure 9. (a) Top ten I2T results for the queries "cow" and "dog"; (b) top ten T2I results for the queries "aeroplane" and "train" (columns: Query, Method, Top 10 Results).
For the I2T task, HAAN shows the best performance because its query results contain the fewest errors. It is worth noting that, as in the retrieval of "cow", the erroneous texts still contain some correct words (e.g., "black", "white" and "face") that match the semantic information in the query image. For the T2I task, HAAN again returns the fewest mistakes, and its mistakes only partially deviate from the correct semantic information while sharing similar features with it. For example, images of birds in flight appear in the retrieval of "aeroplane". In contrast, VSRN++ and NAAF both make more errors, and their errors deviate largely from the correct semantic information.
From this, we can conclude that HAAN clearly outperforms VSRN++ and NAAF on both the I2T and T2I tasks in these examples. NAAF is the worst performer among the three methods: it returns many more wrong results in both retrieval tasks, and the semantic concepts of its wrong results are totally different from the correct semantic information. For example, when searching for "aeroplane", the results include pictures of motorcycles and trucks; when searching for "train", the results include pictures of buildings and room interiors. Results that deviate this far from the correct semantic information are unacceptable.
Overall, HAAN is superior to both state-of-the-art methods and achieves the best performance.
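The top-ten lists discussed above come from ranking gallery items by their similarity to the query. A minimal sketch follows, assuming cosine similarity between already-computed embedding vectors. The embeddings and the helper names are illustrative, and HAAN's own similarity, which draws on global-level and local-level information, is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query, gallery, k=10):
    """Indices of the k gallery vectors most similar to the query, best first."""
    ranked = sorted(range(len(gallery)),
                    key=lambda i: cosine(query, gallery[i]),
                    reverse=True)
    return ranked[:k]


# Toy embeddings (assumed values): one image query and four candidate texts.
query_image = [1.0, 0.0]
text_gallery = [[0.0, 1.0], [1.0, 0.1], [0.7, 0.7], [-1.0, 0.0]]
ranking = top_k(query_image, text_gallery, k=3)
```

For T2I, the same function applies with a text embedding as the query and image embeddings as the gallery.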

Conclusions
In this paper, we put forward HAAN to explore image-text alignment. First, hierarchical alignment networks (i.e., GCM and LCM) are proposed to exploit the rich complementary information in global-level and local-level features for image-text correlation learning. Second, our AWL integrates pair mining and pair weighting to optimize the image-text similarity calculated by the two modules (i.e., GCM and LCM). Experimental results show that the proposed HAAN achieves the best performance in image-text retrieval tasks and that each component of HAAN is effective.
In the future, we will try more levels of alignment and verify the scalability of HAAN to retrieval tasks across other types of modalities (e.g., retrieving text with video queries) for more practical applications.
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Data will be made available on request.

Acknowledgments: The authors would like to thank the researchers who provided their source code.

Conflicts of Interest: The authors declare no conflict of interest.