On the Limitations of Visual-Semantic Embedding Networks for Image-to-Text Information Retrieval

Visual-semantic embedding (VSE) networks create joint image–text representations that map images and texts into a shared embedding space, enabling information retrieval tasks such as image–text retrieval, image captioning, and visual question answering. The most recent state-of-the-art VSE networks are VSE++, SCAN, VSRN, and UNITER. This study evaluates the performance of these networks for the task of image-to-text retrieval and identifies and analyses their strengths and limitations to guide future research on the topic. The experimental results on Flickr30K revealed that the pre-trained network, UNITER, achieved 61.5% average Recall@5 for the task of retrieving all relevant descriptions, whereas the traditional networks, VSRN, SCAN, and VSE++, achieved 50.3%, 47.1%, and 29.4% average Recall@5, respectively, for the same task. An additional analysis was performed on image–text pairs from the top 25 worst-performing classes of a Flickr30K-based subset to identify the limitations of the best-performing models, VSRN and UNITER. These limitations are discussed from the perspectives of image scenes, image objects, image semantics, and the basic functions of neural networks, with the aim of guiding further research into the use of VSE networks for cross-modal information retrieval tasks.

The state-of-the-art VSE networks are VSE++ [1], SCAN [2], VSRN [3], and UNITER [4]. To tackle cross-modal information retrieval with VSE-based networks, Faghri et al. [1] proposed a hard-negative loss function as part of VSE++, with which the network learns to place the relevant target closer to the query than other items in the corpus. Lee et al. [2] applied the stacked cross attention (SCAN) mechanism to align image regions and words to improve VSE networks. Li et al. [3] proposed the visual semantic reasoning network (VSRN) for extracting high-level visual semantics with the aid of a graph convolution network (GCN) [11]. Chen et al. [4] introduced a network pre-trained with the transformer [12], namely the universal image-text representation (UNITER), to unify various cross-modal tasks such as VQA and image-text matching.
Existing literature on VSE networks provides an evaluation of the performance of these networks on benchmark datasets such as MSCOCO [13] and Flickr30K [14], but to the best of the authors' knowledge, there are no papers that provide an in-depth investigation of the performance of the methods and their limitations. Some of the known limitations are that the network architectures of VSE++ and SCAN have not been specifically designed to extract high-level visual semantics [3]. Additionally, VSRN lacks a suitable attention mechanism [4] to align the image and text in the same latent space, and UNITER requires large amounts of data for pre-training [15]. However, it is important to understand the limitations of these networks in order to devise strategies for mitigating these limitations when developing new and improved VSE networks and/or when extending the capabilities of existing ones. Therefore, this study aims to identify, investigate, and classify the limitations of state-of-the-art VSE networks (VSE++, SCAN, VSRN, and UNITER) when they are applied to information retrieval tasks.
This paper is organised as follows. Section 2 provides an overview of VSE networks. Section 3 discusses the dataset, evaluation measures, and experiment methodologies. Section 4 presents the results of VSE++, SCAN, VSRN, and UNITER, summarises the limitations of VSE networks, and discusses their strengths and limitations to guide further research. Section 5 provides a conclusion and future work.

Related Methods
VSE networks aim to align the representations of relevant images and descriptions in the same latent space for cross-modal information retrieval. As shown in Figure 1, a typical VSE network embeds the features of image regions of interest (ROIs) and the word vectors of descriptions. The ROIs are produced by object detection models [16][17][18]; faster R-CNN [16] is the detector most commonly used by VSE networks, including VSE++, SCAN, VSRN, and UNITER.
For image-to-text retrieval, a VSE network finds the descriptions most relevant to an image query by ranking the similarity scores between the image query's embedding and a set of description embeddings. For text-to-image retrieval, the network finds the images most relevant to a description query by ranking the similarity scores between the description query's embedding and a set of image embeddings. In the literature, VSE++ [1] was a milestone for VSE networks. It deployed a fully connected neural network to extract image features from ROIs obtained by a faster R-CNN [16], and a gated recurrent unit (GRU) network [19] to embed textual features. In addition, online hard-negative mining was proposed to improve the network's convergence in this joint embedding space [1]. In particular, the hardest-negative triplet loss (MH) function defined in Formula (1) was used in the training phase [1]:

ℓ_MH(i, c) = max_c' [α + s(i, c') − s(i, c)]+ + max_i' [α + s(i', c) − s(i, c)]+ (1)

where s(i, c) is the similarity between image i and its matching description c, c' and i' are negative descriptions and images, α is the margin, and [x]+ = max(x, 0). Attention-based models have been exploited to improve on the performance of VSE++. SCAN [2] was proposed to emphasise salient regions and keywords during alignment when building the embedding space; the importance of each image region is learned based on the similarity between individual attended sentence word vectors and image region features. Wang et al. [15] proposed the position focused attention network (PFAN++), which utilises a position-focused attention module to enhance image feature representations of visual semantics. Recently, the GCN has also been utilised in VSE networks. VSRN [3] was proposed to explore object spatial relationships in an image to strengthen image feature representations. After identifying the ROIs in an image, the relationship between each pair of regions is learned using (2) for feature enhancement.
R = φ(V) · ϕ(V)^T (2)

where R represents the correlation (affinity) between a set of image region features denoted as a matrix V, and the weight parameters of the projections φ and ϕ are learnt through back-propagation. Then, a residual connection to the GCN is added for enhancement, based on Formula (3):

V* = W_r(RVW_g) + V (3)

where W_g is the weight matrix of the GCN layer, W_r is the weight matrix of the residual structure, and R is the affinity matrix from Formula (2), so the output V* is the relation-enhanced representation of the image regions. In addition, VSRN further connects the output of the GCN to a GRU [19] for global visual semantic reasoning. The transformer architecture is also emerging in VSE networks, since it has achieved great success in natural language processing [12]. UNITER [4] adopts a transformer model for joint multi-modal embedding learning. UNITER was pre-trained on large amounts of data with four tasks: masked language modeling (MLM), masked region modeling (MRM), image-text matching (ITM), and word-region alignment (WRA). These tasks facilitate the creation of a generalisable embedding space with contextual attention on both image and textual features, as well as modeling of their joint distribution.
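A minimal NumPy sketch of the hardest-negative loss in Formula (1) and the relation enhancement in Formulas (2) and (3) is given below. The matrix shapes, the identity-style projection weights, and the absence of any normalisation of the affinity matrix R are simplifying assumptions for illustration, not details taken from the original implementations.

```python
import numpy as np

def mh_triplet_loss(S, margin=0.2):
    """Hardest-negative (MH) triplet loss, as in Formula (1).

    S is an n x n similarity matrix between n images and their n matching
    descriptions; S[i, i] is the positive-pair score s(i, c).
    """
    n = S.shape[0]
    pos = np.diag(S)                       # s(i, c) for matching pairs
    off = S - np.eye(n) * 1e9              # mask out the positives
    cost_c = np.maximum(0.0, margin + off.max(axis=1) - pos)  # hardest negative description
    cost_i = np.maximum(0.0, margin + off.max(axis=0) - pos)  # hardest negative image
    return (cost_c + cost_i).sum()

def relation_enhance(V, W_phi, W_psi, W_g, W_r):
    """VSRN-style relation enhancement, as in Formulas (2) and (3).

    V is a k x d matrix of k region features. The affinity matrix is
    R = phi(V) psi(V)^T with learned projections, and the output is the
    residual GCN response V* = W_r(R V W_g) + V.
    """
    R = (V @ W_phi) @ (V @ W_psi).T        # Formula (2): pairwise region affinities
    return (R @ V @ W_g) @ W_r + V         # Formula (3): residual GCN enhancement
```

With a perfectly diagonal similarity matrix the loss is zero; any off-diagonal score within the margin of the positive contributes a penalty.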

Materials and Methods
Image-to-text retrieval refers to the task of retrieving relevant image descriptions when given an image as a query. This section describes two experiments. Experiment 1 evaluates the performance of state-of-the-art VSE networks (VSE++ [1], SCAN [2], VSRN [3], and UNITER [4]) for the task of image-to-text retrieval using the Flickr30K dataset. Experiment 2 provides a limitation analysis of the performance of two networks, VSRN [3] and UNITER [4]. Note that for experiment 2, VSRN and UNITER were selected because they outperformed other networks based on the results of experiment 1 (Section 4.1) and because recent comparisons found in the literature [3,4] also show those to be the best-performing VSE networks.

Dataset and VSE Network Preparation
Flickr30K [14], a benchmark dataset that is typically used for evaluating the performance of deep learning models, was utilised for the experiments. Every image in the Flickr30K dataset is associated with five relevant textual descriptions, as shown in Figure 2. Flickr30K was split into training, validation, and test sets containing 29,000, 1014, and 1000 images, respectively [7]. For a fair comparison of VSE++, SCAN, VSRN, and UNITER, the following adaptations were made. For VSE++ [1], the backbone of the faster R-CNN object detection method was changed from VGG19 [20] to ResNet-101 [21] to be consistent with SCAN, VSRN, and UNITER. For SCAN [2], image-to-text LogSumExp (i-t LSE) pooling was adopted because it was found to be the most suitable model setting for image-to-text attention models when tested on Flickr30K [2]. UNITER's image-text matching (ITM) model was selected for the task. No adaptations were made to VSRN [3] or to UNITER's ITM model. Note that UNITER's [4] ITM model has been pre-trained on four large datasets, namely, MS COCO [13], Visual Genome [22], Conceptual Captions [23], and SBU Captions [24]. VSE++, SCAN, and VSRN have not been pre-trained on other datasets.

Performance Evaluation Measures
The measures adopted for evaluating the image-text retrieval performance of the VSE networks are Recall, Precision, F1-score, and interpolated Precision–Recall (PR) curves. These are described below.
Recall is the percentage of relevant textual descriptions retrieved over the total number of textual descriptions relevant to the query. Recall is computed using (4).

Recall = (Number of relevant textual descriptions retrieved) / (Total number of relevant textual descriptions) (4)

Precision is the percentage of relevant textual descriptions retrieved over the total number of textual descriptions retrieved. Precision is computed using (5).

Precision = (Number of relevant textual descriptions retrieved) / (Total number of textual descriptions retrieved) (5)

The Fβ-score combines the results of Recall and Precision using (6):

Fβ = ((1 + β²) × Precision × Recall) / (β² × Precision + Recall) (6)

The Fβ-score can be adjusted to weight Precision or Recall more highly. As Recall and Precision are equally important in the image-text retrieval task, the experiments set β = 1, giving the F1-score shown in (7):

F1 = (2 × Precision × Recall) / (Precision + Recall) (7)

A PR curve is a plot of the Precision (y-axis) against the Recall (x-axis) for different thresholds. An interpolated PR curve [25] shows the Precision P interpolated at each standard Recall level R_j (e.g., R_j ∈ {0.0, 0.1, …, 1.0}), as shown in (8).
Specifically, P(R_j) is taken as the maximum Precision at any Recall between the jth and (j + 1)th standard Recall level, as shown in (9):

P(R_j) = max { P(R) : R_j ≤ R ≤ R_(j+1) } (9)
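The measures above can be sketched in a few lines of NumPy; the list-based query representation below is an illustrative assumption, not the paper's implementation.

```python
import numpy as np

def recall_precision_f1_at_k(ranked_ids, relevant_ids, k):
    """Recall@K, Precision@K, and F1@K, as in Formulas (4), (5), and (7)."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    recall = hits / len(relevant_ids)               # Formula (4)
    precision = hits / k                            # Formula (5)
    f1 = (2 * precision * recall / (precision + recall)
          if hits else 0.0)                         # Formula (7), beta = 1
    return recall, precision, f1

def interpolated_pr(recalls, precisions, levels=np.linspace(0.0, 1.0, 11)):
    """Interpolated Precision, as in Formula (9): at each standard Recall
    level R_j, take the maximum Precision observed between R_j and R_{j+1}."""
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    interp = []
    for j, r in enumerate(levels):
        upper = levels[j + 1] if j + 1 < len(levels) else levels[-1]
        mask = (recalls >= r) & (recalls <= upper)
        interp.append(float(precisions[mask].max()) if mask.any() else 0.0)
    return interp
```

For a query with five relevant descriptions of which three appear in the top five results, Recall@5 and Precision@5 are both 0.6.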

Experiment 1 Methodology for the Comparison of VSE Networks for Image-to-Text Retrieval
Experiment 1 compares the image-to-text retrieval performance of VSE++, SCAN, VSRN, and UNITER for 1000 image queries from Flickr30K's test set. As previously mentioned, every Flickr30K image has five textual descriptions, and therefore there are 5000 descriptions in the query set (1000 images × 5 descriptions per image = 5000 descriptions).
Given an image query, the top n descriptions relevant to the query are retrieved. The retrieval performance of each model is then evaluated using three strategies: the relevance of the first retrieved description (i.e., Recall@1); whether any one of the 5 relevant descriptions is retrieved in the top n results (i.e., Recall@5, @10, @20), which is the evaluation strategy followed by [1][2][3][4]; and the performance of the model in retrieving all 5 relevant descriptions within the top n retrieved descriptions (i.e., Recall, Precision, and F1-score @5, @10, @20, @50, @100), which is a tougher strategy than that followed by [1][2][3][4].
Furthermore, the performance of the algorithms in retrieving all 5 descriptions is evaluated for two main reasons: (1) the ability to retrieve all relevant descriptions is important for information retrieval (IR) systems [26]; and (2) given that this study focuses on analysing the limitations of VSE networks, evaluating them against the more challenging criterion of retrieving all 5 descriptions is more revealing of their limitations when they are used for IR tasks.
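The evaluation protocol above can be sketched as follows, assuming precomputed embeddings and cosine similarity as the scoring function; this is a simplification, since the actual similarity function varies by network.

```python
import numpy as np

def rank_descriptions(img_emb, txt_embs):
    """Rank all candidate descriptions for one image query by cosine similarity."""
    img = img_emb / np.linalg.norm(img_emb)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    return np.argsort(-(txt @ img))          # description indices, best first

def recall_any_at_k(ranking, relevant, k):
    """Strategy of [1-4]: 1.0 if any relevant description appears in the top k."""
    return float(len(set(ranking[:k]) & set(relevant)) > 0)

def recall_all_at_k(ranking, relevant, k):
    """Tougher strategy: fraction of all relevant descriptions found in the top k."""
    return len(set(ranking[:k]) & set(relevant)) / len(relevant)
```

In the experiments each image query is ranked against all 5000 descriptions, and both measures are averaged over the 1000 test queries.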

Experiment 2 Methodology for Finding the Limitations of VSE Networks
Experiment 2 analyses the performance of the two VSE networks that performed best in experiment 1, i.e., VSRN and UNITER, in order to identify their limitations. Figure 3 illustrates the methodology for experiment 2. A description of each step is provided below.

Step 1: The query set comprises images and their corresponding relevant descriptions from the test (n = 1000 queries) and validation (n = 1014 queries) sets of Flickr30K, combined into a single set containing 2014 queries.
Step 2: To evaluate the performance of VSRN and UNITER across different image classes, the images in the query set were grouped into classes. The query images were classified into 453 classes using the ImageNet [27] class labels (up to 1000) with the aid of a trained ResNet [21] model. Table 1 shows the labels of the 40 largest classes (i.e., those containing the largest number of images).
Step 3: VSRN and UNITER were evaluated on different image classes from Step 2 using the evaluation measures of average Recall@5 and average Precision@1 (see Section 3.2).
Step 4: The top 25 worst-performing classes (and their images), based on average Recall@5 results, were extracted to be used for the task of identifying the limitations of the models. Classes with fewer than 10 images were removed.
Step 5: All images with irrelevant retrieved descriptions when using the Precision@1 evaluation measure were taken and analysed manually to identify reasons that the models did not retrieve those descriptions, and to further summarise those reasons into a set of limitations.
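Steps 2 to 4 can be sketched as follows; the (label, Recall@5) pair representation of a query is an illustrative assumption, with the label taken from the ResNet classifier.

```python
from collections import defaultdict

def worst_classes(queries, n_worst=25, min_images=10):
    """Group per-query Recall@5 scores by predicted ImageNet class and
    return the worst-performing classes (Steps 2-4 of experiment 2).

    queries: iterable of (class_label, recall_at_5) pairs.
    """
    by_class = defaultdict(list)
    for label, r5 in queries:
        by_class[label].append(r5)
    # Average Recall@5 per class, dropping classes with too few images.
    avg = {c: sum(v) / len(v) for c, v in by_class.items()
           if len(v) >= min_images}
    return sorted(avg, key=avg.get)[:n_worst]   # lowest average Recall@5 first
```

The returned classes (and their images) are then passed to the manual analysis of Step 5.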

Results
This section describes the results of experiments 1 and 2; the methodology of each experiment is presented in Sections 3.3 and 3.4, respectively. Initially, VSE++, SCAN, VSRN, and UNITER were evaluated on their performance in retrieving any one of the five relevant textual descriptions for each query. Performance was averaged across all n = 1000 queries to obtain the average Recall@1, @5, @10, and @20 values shown in Table 2. UNITER achieved the highest average Recall (average Recall@1 = 80.8%); VSRN, SCAN, and VSE++ achieved 69.3%, 67.5%, and 40.0% average Recall@1, respectively. These results are consistent with those reported in [1][2][3][4].

Next, VSE++, SCAN, VSRN, and UNITER were evaluated on their performance in retrieving all five relevant textual descriptions for each query. Table 3 compares the performance of the models when considering the top K results, where K is a predefined number of retrieved descriptions. The results show that UNITER consistently achieved the best performance across all evaluation measures and all K values. Figure 4 presents the interpolated PR curves of each image-to-text retrieval model. UNITER outperformed the other three networks, followed by VSRN; Figure 4 highlights that UNITER and VSRN are the more effective models when considering both the Recall and Precision evaluation measures. In conclusion, the results of experiment 1 demonstrate that the best-performing image-to-text retrieval model is UNITER, followed by VSRN.

Figure 5 shows the computation time of VSE++, SCAN, and VSRN against the number of Flickr30K [14] training samples for one epoch. The running time per epoch was stable for each individual algorithm, and therefore, for ease of comparison, the computation time was measured for a single epoch at each training-set size. The number of training samples was increased from 2900 to 29,000 in steps of 2900.
The three lines fitted to the data points of VSE++, SCAN, and VSRN follow the equations T_VSE++(n) = 0.0018n − 0.2173, T_SCAN(n) = 0.0119n + 0.5239, and T_VSRN(n) = 0.0159n + 0.0365, respectively. UNITER was not included in this comparison because it was pre-trained on four other datasets [13,22-24], whereas the other three models, VSE++, SCAN, and VSRN, were trained only on Flickr30K.
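Lines such as these can be obtained with an ordinary least-squares fit over (sample-count, time) measurements. The sketch below uses synthetic points lying exactly on the reported VSRN line rather than the paper's raw timings, which are not reproduced here.

```python
import numpy as np

# Synthetic timing points on the reported VSRN line T(n) = 0.0159n + 0.0365;
# the paper's raw per-epoch measurements are not available here.
n = np.arange(2900, 29001, 2900)     # training-set sizes, 2900 to 29,000
t = 0.0159 * n + 0.0365              # seconds per epoch (illustrative)

# Least-squares fit of a line T(n) = slope * n + intercept.
slope, intercept = np.polyfit(n, t, 1)
```

On real measurements, the residuals of the fit would indicate how closely each network's training time scales linearly with the number of samples.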

Results of Experiment 2: Limitations of VSE Networks
This experiment concerns an analysis of the limitations of VSE networks. Only the UNITER and VSRN models were utilised for this experiment since the experiment 1 results and the literature agreed that these are the best-performing VSE models. Table 4 shows the 25 worst-performing classes for the query set using the methodology described in Section 3.4.
The results of VSRN and UNITER contain 16 identical classes, which suggests that they share some common limitations. Focusing on these worst-performing classes, an in-depth analysis of the retrieved descriptions that are irrelevant to the image queries in these classes revealed 10 limitations of VSE networks, which are summarised in Table 5. These limitations were generalised into four groups; the discussion that follows refers to the limitations listed in Table 5.

The limitations of group 1 occur when a VSE network does not globally understand the image scene. Limitation 1 shows that VSRN cannot distinguish the relative importance of foreground and background information in an image. For example, in Figure 6a, the people and room facilities in the background caused VSRN to misinterpret the image as 'mopping in a subway station' rather than 'counting change in a store'. UNITER overcomes this limitation by using the self-attention mechanism of the multi-layer transformer to learn the relations between image objects. Furthermore, limitation 2 reveals that both VSRN and UNITER miss key image objects when interpreting an image. For example, the textual descriptions retrieved by VSRN and UNITER for Figure 6b do not mention that a person is standing in front of the restaurant. Therefore, VSE networks need to consider the background and object information in an image from a global perspective.

Table 5. The limitations of VSE networks identified in experiment 2, organised into four groups.

Group 1: The VSE networks do not globally understand the image scene
1. Background confusion: The network cannot distinguish the importance of foreground and background information in an image (applies to VSRN).
2. Missing key objects: Key objects, which are important to the image content, are ignored (applies to VSRN and UNITER).

Group 2: The VSE networks do not give enough attention to detailed visual information (all limitations apply to VSRN and UNITER)
3. Errors in retrieved descriptions: Details of objects in the retrieved textual descriptions do not match the details of the image.
4. Partially redundant descriptions: Only part of the retrieved textual description is relevant to the image.
5. Object counting error: The networks cannot correctly count the objects in an image.

Group 3: The VSE networks' capability to extract higher-level visual semantics needs to be improved (all limitations apply to VSRN and UNITER)
6. Visual reasoning error: The networks' capability to extract higher-level semantics for visual reasoning is inadequate.
7. Imprecise descriptions: Retrieved descriptions do not provide enough detail to describe the rich content of images.
8. Action recognition issue: Actions and postures of objects in the retrieved textual descriptions sometimes do not match the image content.

Group 4: The basic functions of neural networks, i.e., object detection and recognition, need to be improved (all limitations apply to VSRN and UNITER)
9. Detection error: Some key objects are missed at the object detection stage.
10. Recognition error: Image object attributes are recognised incorrectly.
The limitations in group 2 reveal that the VSE networks do not give enough attention to the details of image objects, and hence the details in the retrieved textual descriptions do not correctly match the image object features. For example, with regard to limitation 3, Figure 6c shows that the results retrieved by both VSRN and UNITER mistakenly described too many details about the woman. Limitation 4 occurs when one part of a textual description relates to the image but the other part does not match it. For example, in Figure 6d, the description retrieved by VSRN contains no text about the wedding shown in the image, and Figure 6e shows that UNITER missed 'people are controlled by officers' in its textual description. Limitation 5 reveals that the VSE networks have no accurate concept of the number of main objects in an image: there are two people in Figure 6f, but VSRN and UNITER described three people and one person, respectively. These limitations suggest that learning image details is currently a challenge for VSE networks.
Group 3 covers limitations 6, 7, and 8, which relate to a network's ability to extract higher-level visual semantics. For Figure 6g, VSRN and UNITER retrieved descriptions related to 'rhythmic gymnasts' and 'clutching yellow ski handles by a man', respectively, errors related to limitation 6. Limitation 7 was derived after observing many irrelevant cases in which the retrieved textual descriptions were too simple to describe the rich content of the image. Taking Figure 6h as an example, the results retrieved by VSRN and UNITER did not describe the man's smile or his exact action. Limitation 8 shows that VSRN and UNITER did not perform well in retrieving descriptions for images containing human postures and actions; for example, in Figure 6i, the action of 'roping an animal' was recognised as 'laying' by both VSRN and UNITER. These limitations suggest that VSE networks are currently limited in their ability to extract higher-level and more complex visual semantic information.

Group 4 shows that the basic functions of neural networks, namely object detection (limitation 9) and recognition (limitation 10), also influence the performance of the networks. The small 'monk' objects in Figure 6j, an example of limitation 9, were not detected, and neither VSRN nor UNITER could give a correct description. 'The man being shaved' in Figure 6k, an example of limitation 10, was mistakenly recognised as a 'child' and a 'woman' by VSRN and UNITER, respectively.

Discussion on the Strengths and Limitations of VSE Networks
This subsection summarises the strengths and limitations of the VSE++, SCAN, VSRN, and UNITER networks. Figure 7 shows that for 60% of the 2014 image queries, VSRN and UNITER both retrieved the same relevant description at first rank; for 10% of the queries, both retrieved irrelevant descriptions at first rank; and for the remaining 30% of the queries, there was no agreement between VSRN and UNITER in the retrieved descriptions.
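A breakdown of this kind can be computed by comparing the two networks' rank-1 results per query. The sketch below is a simplified version that tracks only whether each network's first retrieved description was relevant; the per-query boolean representation is an illustrative assumption.

```python
def agreement_breakdown(vsrn_rel, uniter_rel):
    """Fractions of queries where both models' rank-1 result is relevant,
    both are irrelevant, and the two models disagree.

    vsrn_rel, uniter_rel: per-query booleans, True if the first retrieved
    description was relevant to the image query.
    """
    n = len(vsrn_rel)
    both_rel = sum(a and b for a, b in zip(vsrn_rel, uniter_rel)) / n
    both_irr = sum(not a and not b for a, b in zip(vsrn_rel, uniter_rel)) / n
    disagree = 1.0 - both_rel - both_irr    # exactly one model was relevant
    return both_rel, both_irr, disagree
```

The three fractions correspond to the 60%, 10%, and 30% segments of Figure 7.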
The discussion that follows focuses on comparing the attention mechanisms of VSE networks as a strategy for understanding how these mechanisms impact image-to-text retrieval performance. Five attention mechanisms are utilised by the networks for attending to important information across images and text: (1) image-text attention aligns image regions and words across modalities; (2) image-self attention weights the relations between image regions; (3) text-self attention weights the relations between words; (4) detailed visual attention weights the detail features of image objects; and (5) global visual reasoning attends to the relations between a group, rather than a pair, of image objects to reason about visual semantics globally. Table 6 indicates how the attention mechanisms impact the performance (i.e., average Precision@1) of VSE++, SCAN, VSRN, and UNITER. The joint analysis of Tables 5 and 6 reveals the strengths and limitations of VSE++, SCAN, VSRN, and UNITER, described below:

1. VSRN applies a GRU for global visual reasoning based on the pairwise relations between image regions extracted by the GCN. The limitations in group 1 indicate that the global visual reasoning of VSRN still needs to be improved. Compared to VSRN, UNITER benefits from the multiple layers of the transformer and has overcome limitation 1. However, the fact that both VSRN and UNITER miss key image objects indicates that global visual reasoning remains a challenging problem for VSE networks.

2. Table 6 shows that none of the networks has an attention mechanism that achieves detailed visual attention. Misclassified cases of VSRN and UNITER among the group 2 limitations reveal that current VSE networks do not use detailed information for cross-modal information retrieval. In further research, matched details between image and text should play a positive role in retrieval, while unmatched parts should contribute negatively to matching.

3. VSRN performs image-self attention by using the GCN to compute the relations between image regions, and its average Precision@1 is 1.8% higher than that of SCAN, as shown in Table 6. UNITER applies transformers to achieve image-self, text-self, and image-text attention, and it outperformed the other networks by more than 11% in average Precision@1. This progress shows that the extraction of high-level visual semantics can improve VSE networks. According to the group 3 limitations described in Table 5, the extraction of visual semantics by VSRN and UNITER still needs to be improved, so higher-level visual semantics are necessary for VSE networks. In addition, SCAN outperformed VSE++ by using stacked cross attention on image-text pairs, improving average Precision@1 by almost 27.5%. UNITER also uses the transformer for image-text attention, so cross-modal attention is effective in VSE networks. However, cross-modal attention requires the network to iteratively process image and text pairs: the retrieval time for 1000 queries is 187.3 s for SCAN and 4379 s for UNITER, which is too slow for practical use.

4. The group 4 limitations illustrate that VSE networks still need to perfect the basic functions of neural networks, such as object detection and recognition. At present, two-stage VSE networks depend on the reliability of the object detection stage.

Figure 6. Example image queries and retrieved results illustrating the limitations in Table 5. Each image has 5 relevant descriptions, and the first description retrieved by VSRN and UNITER is denoted as the retrieved result at rank 1 (i.e., Result@1).

Conclusions and Future Work
This study evaluates and compares the performance of four VSE networks, namely VSE++, SCAN, VSRN, and UNITER, for the task of image-to-text retrieval using the Flickr30K dataset. Two experiments were carried out. The first experiment evaluated the retrieval performance of the VSE networks and the second experiment analysed the performance of two of the best-performing VSE networks (i.e., VSRN and UNITER) to determine their limitations. The results of the first experiment revealed that the pre-trained UNITER network achieved the highest retrieval performance across all evaluation measures, followed by the VSRN network. The results of experiment 2 revealed that VSE networks suffer from various limitations which are mainly related to global reasoning, background confusion, attention to detail, and extraction of higher-level visual semantics. Furthermore, the overall retrieval efficiency of the networks needs to be improved for them to be adopted for cross-modal information retrieval tasks, and hence to be embedded in search engines. Understanding the limitations can help researchers advance the area of VSE networks by utilising that knowledge to build future VSE networks that overcome these limitations.
Images can contain various objects and interactions between objects, relative positions of objects, and other high-level semantic concepts, and therefore understanding image content is important in VSE networks for information retrieval [3]. The progress of VSE networks for image-text retrieval is currently determined by comparing the Recall of networks on the public Flickr30K dataset [1][2][3][4]. Most work on VSE networks thus far has focused on particular challenges such as image-text alignment [2,4], visual position attention [15], and visual reasoning [3]. To the best of the authors' knowledge, there is no comprehensive analysis of the limitations of state-of-the-art VSE networks. This study experimentally analyses the performance of VSE algorithms and provides a summary of their limitations from the perspective of image content understanding. Most limitations discussed in this paper can be independently extended to a research direction. The analysis of these limitations will benefit the cross-modal research community and guide future research directions for VSE networks.
Future work includes developing methods for the extraction of higher-level visual semantics based on in-depth relations between image regions, and also developing suitable attention mechanisms that will enable networks to attend to the details of image objects. Future work also includes developing algorithms for improving the efficiency of the pre-trained network which uses the cross-modal attention mechanism and evaluating these networks in real practice. Importantly, there is a lack of research and evaluations of VSE networks when using adversarial data samples. Hence, future work can also include comparing the performance of VSE networks when adopting various perturbation approaches to generate adversarial images and descriptions. Such an analysis can provide an understanding of how adversarial samples can affect the retrieval results of VSE networks, which can aid the development of algorithms and solutions for overcoming the limitations of VSE networks on adversarial samples.

Data Availability Statement:
The code for the experiments presented in this paper can be found in the project's GitHub repository https://github.com/yangong23/VSEnetworksIR (accessed on 25 July 2021).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: