Article

Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding

by Ruigeng Zeng, Wentao Ma, Xiaoqian Wu, Wei Liu and Jie Liu
1 Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
2 School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei 230036, China
3 School of Management Science and Engineering, Anhui University of Finance & Economics, Bengbu 233030, China
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2024, 13(2), 300; https://doi.org/10.3390/electronics13020300
Submission received: 16 November 2023 / Revised: 2 January 2024 / Accepted: 3 January 2024 / Published: 9 January 2024
(This article belongs to the Special Issue Deep Learning in Multimedia and Computer Vision)

Abstract:
Image–text cross-modal retrieval aims to bridge the semantic gap between different modalities, allowing for the search of images based on textual descriptions or vice versa. Existing efforts in this field concentrate on coarse-grained feature representation and then utilize pairwise ranking loss to pull image–text positive pairs closer, pushing negative ones apart. However, using pairwise ranking loss directly on coarse-grained representation lacks reliability as it disregards fine-grained information, posing a challenge in narrowing the semantic gap between image and text. To this end, we propose an Instance Contrastive Embedding (IConE) method for image–text cross-modal retrieval. Specifically, we first transfer the multi-modal pre-training model to the cross-modal retrieval task to leverage the interactive information between image and text, thereby enhancing the model’s representational capabilities. Then, to comprehensively consider the feature distribution of intra- and inter-modality, we design a novel two-stage training strategy that combines instance loss and contrastive loss, dedicated to extracting fine-grained representation within instances and bridging the semantic gap between modalities. Extensive experiments on two public benchmark datasets, Flickr30k and MS-COCO, demonstrate that our IConE outperforms several state-of-the-art (SoTA) baseline methods and achieves competitive performance.

1. Introduction

With the popularity of the Internet and mobile imaging devices, multimedia data (including text, images, videos, and audio) are growing exponentially [1,2,3,4,5]. Fueled by this data surge, image–text cross-modal retrieval has emerged as a fundamental and challenging task in information retrieval. Formally, cross-modal retrieval uses an instance from one modality as a query to retrieve semantically related instances from another modality [6,7,8,9]. Although data from different modalities are diverse and multi-sourced, they exhibit shared characteristics and semantic relationships when depicting identical scenes, entities, or interactions. Hence, the crucial challenge in image–text cross-modal retrieval is bridging the semantic gap between image and text.
Existing methods in image–text cross-modal retrieval aim to map features from different modalities into a shared representation space. This allows for direct comparison between items of various modalities: positive instances are drawn closer together, while negative ones are distanced apart. Some methods [6,7,10] extract coarse-grained global features from images and text, calculating similarity based on these global features. However, these methods only capture the rough semantic correlation between different modalities and fail to describe the local semantic correspondence between image regions and text words effectively. To address this limitation, fine-grained cross-modal retrieval methods [3,8,9,11,12,13,14,15,16,17] have been proposed for modelling the local similarity between image regions and text words. Currently, fine-grained image–text cross-modal retrieval methods can be roughly divided into two categories: (1) Graph-free paradigm [6,7,10,11,12,13,14,18,19,20,21,22,23,24,25,26,27]: These methods typically encode multi-level feature representations using the output of the last layer of the encoder and then fuse these multi-level similarities to obtain the final cross-modal similarity. Additionally, region-level features from visual target detectors (e.g., Faster R-CNN [28]) are employed to establish semantic alignment between image regions and text words. (2) Graph-based paradigm [8,9,15,16,17,29,30,31,32,33]: These methods build hierarchical semantic graph nodes from the last layer of the encoder, bridging the semantic gap between images and text via complicated graph reasoning and multi-level feature fusion.
While coarse-grained image–text matching methods have significantly advanced cross-modal retrieval techniques and showcased promising performance, two persistent challenges remain:
  • Insufficient semantic interaction between image and text. Most existing image–text matching methods follow a dual-tower structure [2,3,7,34,35], in which single-modal pre-trained models act as encoders that extract visual and textual features separately (for instance, a CNN for images and BERT for text). However, these methods have two limitations: (1) single-modal pre-trained models are typically trained as semantic classifiers and overlook fine-grained visual details in the image, such as color relationships and entity interactions; (2) each encoder processes its own modality independently, without modeling inter-modality correlations, leading to a loss of interaction information between different modalities. Hence, enhancing the feature representation between image and text within the dual-tower structure remains a challenge.
  • The intra-modality feature representation distribution is overlooked. Each training pair within a batch consists of one image and one text, and the pairwise ranking loss measures the distance between them. However, this similarity measure predominantly focuses on inter-modality distance and rarely considers the intra-modality feature representation distribution explicitly. For example, the pairwise ranking loss used in the training phase might not discern subtle differences between semantically similar images, potentially causing the model to retrieve the same sentences. Consequently, simultaneously regulating the intra- and inter-modality feature distributions poses a challenge.
To address the aforementioned challenges, this paper proposes the Instance Contrastive Embedding method, called IConE, for image–text cross-modal retrieval. Specifically, we harness the knowledge from existing multi-modal pre-training models and transfer it to the cross-modal retrieval task, enhancing the feature representation and compensating for the lack of inter-modality interaction in the dual-tower structure. Additionally, a contrastive loss is proposed to narrow the semantic gap between modalities by leveraging the natural pairing of multi-modal data (for Challenge 1). Furthermore, to comprehensively consider the intra- and inter-modality feature representation distribution, each image and its corresponding sentences are labeled as an “image–text group” (namely, each “image–text group” is assumed to represent a distinct class, as illustrated in Figure 1). Based on this assumption, we design an instance loss to discern differences between images and sentences from different groups, providing a better initialization for the contrastive loss (for Challenge 2). The main contributions of this work can be summarized as follows:
  • We propose a fine-grained cross-modal retrieval method driven by multi-modal pre-training knowledge. Namely, the knowledge from CLIP is transferred to the image–text matching task, addressing the lack of interaction between modalities in the dual-tower structure while bolstering the feature representation capabilities.
  • To regulate the intra- and inter-modality feature representation distribution, we design an instance loss specifically tailored for the “image–text group”, which explicitly considers instance-level classification. This instance loss is then integrated with a contrastive loss via a two-stage training strategy to bridge the semantic gap between images and text.
  • On two widely used public benchmark datasets for image–text cross-modal retrieval, our proposed IConE model is evaluated against 20 SoTA baseline methods and a series of ablation variants. The results demonstrate significant performance improvements of up to 99.5 and 34.4 points on the Rsum metric on Flickr30k and MS-COCO, respectively.
The remainder of this paper is organized as follows: Section 2 provides a brief overview of related work, followed by the introduction of our proposed IConE model in Section 3. Section 4 and Section 5 outline the experimental configurations and results analysis, respectively. Finally, conclusions are given in Section 6.

2. Related Work

In this section, we provide a brief review of the most relevant literature to this research, focusing on image–text cross-modal retrieval and the contrastive language-vision pre-training model.

2.1. Image–Text Cross-Modal Retrieval

The image–text cross-modal retrieval task is designed to explore the correspondence between image and text. The existing matching methods can be roughly divided into two categories: graph-free paradigm [6,7,10,11,12,13,14,18,19,20,21,22,23,24,25,26,27] and graph-based paradigm [8,9,15,16,17,29,30,31,32,33].
Graph-free paradigm. In early studies following the graph-free paradigm, global features from image and text are extracted individually and then mapped into a common representation space for similarity measurement. For instance, Frome et al. [18] pioneered the unification of image and text feature representations via linear mapping, using a CNN and an LSTM, respectively, to extract features from image and text and then embedding these features into a common representation space for cross-modal matching [20]. Another significant work, by Faghri et al. [6], introduced a novel visual-semantic embedding for cross-modal retrieval; this method enhances the widely used ranking loss with a regularization term for hard negative sample mining, and improves image–text cross-modal retrieval performance through fine-tuning and data augmentation. Zheng et al. [7] propose a Dual-Path model which, similar to the approach discussed in this paper, is based on the unsupervised assumption that each “image–text group” represents a class and is trained using an instance loss; Dual-Path employs a CNN+CNN structure for effective end-to-end fine-tuning. Similarly, Vendrov et al. [19] design a sequential embedding learning strategy that maintains a well-ordered map in a visual-semantic hierarchical structure. While these methods show promising results in image–text cross-modal retrieval, they mainly embed global feature representations and overlook the fine-grained semantic associations between image and text. To tackle this limitation, recent research concentrates on learning correspondences between image regions and text words, achieving coarse-to-fine semantic coverage [11,12,13,14]. For instance, Xu et al. [11] propose a cross-modal hybrid feature fusion method that captures interactions between image and text; it learns image–text similarity by fusing intra- and inter-modality feature representations, providing robust semantic interactions between image regions and text words. Another work by Lan et al. [13] proposes a multi-level matching network model that incorporates multi-level similarity between image and text via adaptive matching integration strategies.
Graph-based paradigm. To more effectively leverage the intra-modality relationships among image regions and among text words, extensive work has been devoted to capturing intricate structural information in the data by employing graph nodes and Graph Convolutional Networks (GCN). For instance, in the VSRN model, Li et al. [30] utilize a GCN to understand relationships between image regions and fuse the locally extracted feature representations into a global feature representation. Wang et al. [29] aim to capture entities in both images and texts, as well as the relationships between these entities, by embedding the feature representations of images and texts into visual and textual scene graphs, respectively; their Scene Graph Matching (SGM) model then utilizes two graph encoders to extract entity-level and relation-level features from the graphs, enabling effective image–text cross-modal matching. Li et al. [8] propose a Multi-Level Similarity Learning (MLSL) model, which exploits semantic-level, structure-level, and context-level information of image–text pairs to bridge the heterogeneity and semantic gaps between different modalities. Additionally, Cheng et al. [16] design a Cross-Modal Graph Matching Network (CGMN) capable of exploring intra- and inter-modality relationships without additional interactive networks; it also introduces a novel graph node matching loss to facilitate fine-grained cross-modal alignment and intricate inter-modal reasoning.

2.2. Language-Vision Pre-Training Model

In recent years, cross-modal retrieval has seen significant advancements through the integration of Contrastive Language-Image Pre-training (CLIP) [41]. Large-scale multi-modal pre-training models such as WenLan [36], Google’s ALIGN [37], and Microsoft’s Florence [38] have been trained on billions of image–text pairs. Notably, recent research has highlighted CLIP’s exceptional visual representation capabilities, leading to its application in various fine-grained visual tasks such as video-text retrieval [1,39] and text-based cross-modal person ReID [40]. Given CLIP’s success in capturing nuanced visual information, it is natural to apply it to tasks like image–text cross-modal matching, which require fine-grained understanding. Therefore, this paper extends the concepts introduced by X-CLIP [1] and CLIP4Clip [39], harnessing the established capabilities of the CLIP model for cross-modal retrieval. By transferring CLIP’s knowledge to the domain of image–text cross-modal matching, this work aims to extract interactive information between modalities, significantly enhancing the feature representation abilities of the data.

3. Design of the IConE Model

Figure 2 illustrates the overall architecture of our proposed IConE framework for image–text cross-modal retrieval task. This framework consists of three modules: image–text feature representation, instance classification loss, and InfoNCE contrastive loss of common space.

3.1. Image–Text Feature Representation

Image Feature Representation. For image feature representation encoding, this paper follows the practice of previous works [1,39]: a standard Vision Transformer (ViT) with 12 layers is employed as the image encoder. Specifically, given an image $v \in \mathcal{V}^{h \times w \times c}$, the image encoder is first initialized using the open-source checkpoints of CLIP [41]. Then, a ViT-based visual word segmentation process is applied: the image is first split into a sequence of $n = h \times w / p^2$ discrete, fixed-size, non-overlapping patches, where $p$ denotes the patch size and $n$ is the number of patches in an image; the patch sequence is then mapped into a 1D token sequence $\{f_i^v\}_{i=1}^{n}$ by a trainable linear projection; finally, the token sequence, prefixed with a [CLS] token, is fed into the ViT. The output of the last layer of the encoder is taken as the feature representation of the image, denoted as $f^v = \{v_g^{12}, v_1^{12}, v_2^{12}, \ldots, v_n^{12}\}$, where $v_g^{12}$ is the coarse-grained visual feature representation at the image level and $\{v_1^{12}, v_2^{12}, \ldots, v_n^{12}\}$ are the fine-grained visual feature representations at the patch level.
Text Feature Representation. Since the proposed IConE adopts a typical dual-tower structure, text encoding mirrors image encoding. Given any text sentence $t \in \mathcal{T}^{1 \times m}$, we directly adopt a pre-trained BERT [42] with 12 layers as the text encoder. Concretely, a tokenizer discretizes the original text into tokens, a [CLS] token is prepended to the sequence, and the sequence is fed into the text encoder. This yields the coarse-grained feature representation $t_g^{12}$ at the sentence level (the output of the [CLS] token) and the fine-grained text feature representations $\{t_1^{12}, t_2^{12}, \ldots, t_m^{12}\}$ at the word level (the outputs of the corresponding word tokens), where $m$ is the number of words in the text. The feature representation output by the text encoder (i.e., the output of the last layer of BERT) is thus recorded as $f^t = \{t_g^{12}, t_1^{12}, t_2^{12}, \ldots, t_m^{12}\}$.
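As a concrete illustration of this dual-tower encoding, the sketch below extracts the [CLS]-level and token-level features with the HuggingFace transformers library; the specific classes and checkpoints (openai/clip-vit-base-patch32, bert-base-uncased) are assumptions chosen to match the ViT-B/32 and 12-layer BERT described above, not the authors’ exact code.

```python
import torch
from transformers import CLIPVisionModel, CLIPImageProcessor, BertModel, BertTokenizer

# Assumed checkpoints: the paper initializes the ViT from CLIP (ViT-B/32) and uses a 12-layer BERT.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

@torch.no_grad()
def encode_image(pil_image):
    """Return f^v: the [CLS] (image-level) feature plus the n patch-level features."""
    pixel_values = image_processor(images=pil_image, return_tensors="pt").pixel_values
    tokens = vision_encoder(pixel_values=pixel_values).last_hidden_state  # (1, 1 + n, d)
    v_g, v_patches = tokens[:, 0], tokens[:, 1:]   # coarse-grained / fine-grained features
    return v_g, v_patches

@torch.no_grad()
def encode_text(sentence):
    """Return f^t: the [CLS] (sentence-level) feature plus the word-level features."""
    inputs = tokenizer(sentence, return_tensors="pt")
    tokens = text_encoder(**inputs).last_hidden_state  # (1, sequence length, d)
    t_g, t_words = tokens[:, 0], tokens[:, 1:]     # [CLS] output / word-token outputs
    return t_g, t_words
```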

3.2. Instance Classification Loss

Inspired by prior work [2,7], to comprehensively and explicitly consider the intra- and inter-modality feature representation distribution, we introduce an instance-level classification objective, termed instance loss, for image–text cross-modal retrieval. Specifically, an image and its corresponding text sentences are defined as an “image/text group”, i.e., each “image/text group” is assumed to belong to a distinct category. Essentially, the instance loss is a softmax loss that classifies an “image/text group” into one of numerous classes, enabling the model to distinguish differences between images and sentences from different groups. Given the mixed data of both modalities in the image–text cross-modal retrieval task, the instance loss can be formalized as follows:
$P_v = \mathrm{softmax}(W_{share} f_v),$  (1)
$L_v = -\log P_v(c),$  (2)
$P_t = \mathrm{softmax}(W_{share} f_t),$  (3)
$L_t = -\log P_t(c).$  (4)
In the above equations, $f_v$ and $f_t$ represent the feature representation vectors of the image and text, respectively, while $W_{share}$ denotes the parameters of the last fully connected layer of the encoder, conceptualized as the weight matrix $W_{share} = [W_1, W_2, \ldots, W_z]$, where each $W_i$ is a 2048-dimensional vector and $z$ is the number of “image/text groups” in a dataset. $P(c)$ signifies the predicted probability of the correct class $c$. It is important to note that this paper enforces a shared weight $W_{share}$ for the last fully connected layer of both modalities. This shared-weight mechanism ensures that the model learns image–text features in a common representation space, preventing the learning of features that do not align across modalities. Consequently, the final instance loss over both modalities is formulated as follows:
$L_{ins} = L_v + L_t.$  (5)
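A minimal PyTorch sketch of this instance loss (Equations (1)–(5)) is given below; the shared classifier plays the role of $W_{share}$, and the feature dimension and number of groups are illustrative placeholders (29,783 corresponds to one class per Flickr30k training group).

```python
import torch.nn as nn
import torch.nn.functional as F

class InstanceLoss(nn.Module):
    """Instance loss: classify both modalities of an image/text group into the same
    instance class through one fully connected layer shared across modalities (W_share)."""
    def __init__(self, feat_dim: int = 2048, num_groups: int = 29783):
        super().__init__()
        self.w_share = nn.Linear(feat_dim, num_groups, bias=False)  # shared W_share

    def forward(self, f_v, f_t, group_ids):
        # f_v, f_t: (B, feat_dim) global image / text features; group_ids: (B,) class indices
        logits_v = self.w_share(f_v)                   # pre-softmax scores for P_v
        logits_t = self.w_share(f_t)                   # pre-softmax scores for P_t
        loss_v = F.cross_entropy(logits_v, group_ids)  # L_v = -log P_v(c)
        loss_t = F.cross_entropy(logits_t, group_ids)  # L_t = -log P_t(c)
        return loss_v + loss_t                         # L_ins = L_v + L_t
```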

3.3. Contrastive Loss

To comprehensively and explicitly consider the intra- and inter-modality feature representation distribution, the proposed instance loss provides a better initialization for fine-grained cross-modal retrieval; meanwhile, we incorporate contrastive learning to bridge the semantic gap between modalities. Specifically, for any given image–text pair $(v_i, t_i)$ in each batch, the feature encoding backbone network generates a pair of image–text feature representation embeddings $(f_{v_i}, f_{t_i})$. Following previous work [1,39], we directly employ the InfoNCE loss [43] to optimize the $B \times B$ similarity matrix. Hence, the jointly minimized symmetric bi-directional retrieval loss is expressed as:
$L_{v2t} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(S(f_{v_i}, f_{t_i})/\tau)}{\sum_{j=1}^{B} \exp(S(f_{v_i}, f_{t_j})/\tau)},$  (6)
$L_{t2v} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(S(f_{v_i}, f_{t_i})/\tau)}{\sum_{j=1}^{B} \exp(S(f_{v_j}, f_{t_i})/\tau)},$  (7)
$L_{vt} = L_{v2t} + L_{t2v},$  (8)
where $S(f_{v_i}, f_{t_i})$ is the cosine similarity between the image and text embeddings, $B$ is the batch size, and $\tau$ is the temperature hyperparameter that controls the sharpness of the probability distribution and thus affects how well the model discriminates between negative samples. The sum over index $j$ covers one positive and $(B-1)$ negative samples for a given image query $v_i$ or text query $t_i$.
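A sketch of this symmetric InfoNCE objective (Equations (6)–(8)) is shown below, assuming L2-normalized global embeddings so that the dot product equals cosine similarity.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f_v, f_t, tau: float = 0.05):
    """Symmetric bi-directional InfoNCE over a batch of B paired embeddings."""
    f_v = F.normalize(f_v, dim=-1)                  # (B, d)
    f_t = F.normalize(f_t, dim=-1)                  # (B, d)
    sim = f_v @ f_t.t() / tau                       # (B, B) cosine similarities scaled by temperature
    targets = torch.arange(sim.size(0), device=sim.device)  # diagonal entries are the positive pairs
    loss_v2t = F.cross_entropy(sim, targets)        # image query -> text candidates (Equation (6))
    loss_t2v = F.cross_entropy(sim.t(), targets)    # text query -> image candidates (Equation (7))
    return loss_v2t + loss_t2v                      # L_vt = L_v2t + L_t2v (Equation (8))
```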

3.4. Objective Function and Training Strategy

Objective Function. Instance loss can explicitly consider the distribution of feature representation within the modalities, and contrastive loss can bridge the gap between image and text. Therefore, by combining the advantages of instance loss and contrastive loss, this paper comprehensively considers the feature representation distribution of intra- and inter-modality, realizing superior performance in image–text cross-modal retrieval. Thus, the objective function that needs to be optimized in the training stage is as follows:
$L = \lambda_1 L_{ins} + \lambda_2 L_{vt},$  (9)
where $\lambda_1$ and $\lambda_2$ are the pre-defined weights of the two loss functions.
The Strategy of Training. Inspired by prior work [2,7], our IConE adopts a novel two-stage training strategy:
  • Stage I: In this stage, the pre-trained dual-tower backbone network is frozen; that is, the weight parameters of the ViT image encoder and the BERT text encoder are fixed, and only the remaining parameters are fine-tuned using the proposed instance loss $L_{ins}$ (i.e., $\lambda_1$ = 1, $\lambda_2$ = 0). The purpose of this stage is to thoroughly consider the intra-modality feature representation distribution, providing a better initialization for the subsequent contrastive loss.
  • Stage II: Once Stage I has converged, the entire IConE model is fine-tuned end-to-end, i.e., the instance loss $L_{ins}$ and the contrastive loss $L_{vt}$ are combined (i.e., $\lambda_1$ = 1, $\lambda_2$ = 1). The primary objective of Stage II is to leverage both instance classification and matching ranking, maximizing the benefits of the two loss functions and effectively bridging the semantic gap between image and text. A minimal sketch of this two-stage schedule is given after this list.
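The overall objective $L = \lambda_1 L_{ins} + \lambda_2 L_{vt}$ and the two-stage schedule could be wired together roughly as in the sketch below; the attribute names (image_encoder, text_encoder), the simplified freezing logic, and the omission of the learning-rate schedule are illustrative assumptions rather than the authors’ exact implementation (learning rates and epoch counts follow Section 4.3).

```python
import itertools
import torch

def train_epochs(model, loader, optimizer, instance_loss, contrastive_loss,
                 lambda_1, lambda_2, epochs):
    for _ in range(epochs):
        for images, texts, group_ids in loader:
            f_v, f_t = model(images, texts)                    # global image / text embeddings
            loss = lambda_1 * instance_loss(f_v, f_t, group_ids) \
                 + lambda_2 * contrastive_loss(f_v, f_t)       # L = lambda_1*L_ins + lambda_2*L_vt
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def two_stage_training(model, loader, instance_loss, contrastive_loss):
    # Stage I: freeze the ViT/BERT backbones and tune the rest with L_ins only
    # (lambda_1 = 1, lambda_2 = 0), lr = 1e-4, 25 epochs.
    for p in itertools.chain(model.image_encoder.parameters(),
                             model.text_encoder.parameters()):
        p.requires_grad = False
    head_params = [p for p in model.parameters() if p.requires_grad]
    train_epochs(model, loader, torch.optim.Adam(head_params, lr=1e-4),
                 instance_loss, contrastive_loss, lambda_1=1.0, lambda_2=0.0, epochs=25)

    # Stage II: unfreeze everything and fine-tune end-to-end with both losses
    # (lambda_1 = 1, lambda_2 = 1), lr = 1e-5, 15 epochs.
    for p in model.parameters():
        p.requires_grad = True
    train_epochs(model, loader, torch.optim.Adam(model.parameters(), lr=1e-5),
                 instance_loss, contrastive_loss, lambda_1=1.0, lambda_2=1.0, epochs=15)
```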

4. Experimental Settings

This section will briefly introduce two widely used public datasets, evaluation metrics, experimental settings, and baseline methods.

4.1. Datasets

Flickr30k [44] is a large-scale multimodal dataset comprising 31,783 images collected from the Flickr website, with each image associated with 5 text descriptions. Following the experimental partitioning criteria outlined in VSE++ [6], 29,783 images are allocated for training, while 1000 images each are used for validation and testing.
MS-COCO [45] encompasses 123,287 images, each accompanied by five text descriptions. Adhering to the standard experimental partitioning guideline, 113,287 images are earmarked for training, 5000 for validation, and another 5000 for testing. In addition, this paper reports multiple evaluations: testing on 1k test images (referred to as MS-COCO 1k), where results are averaged over 5 folds, and testing on the complete 5k test images (known as MS-COCO 5k).

4.2. Evaluation Metric

For the image–text cross-modal retrieval task, we employ widely adopted metrics [2,6,35], including Recall@K, Median Rank (MedR), and Mean Rank (MnR), to assess the model’s performance. Recall@K represents the probability that a correct match appears in the top-K of the sorted result list, with higher scores indicating better performance; following existing work, K is set to 1, 5, and 10 (abbreviated as R@1, R@5, and R@10). MedR and MnR are, respectively, the median and mean rank of the ground-truth item in the sorted result list, with lower values signifying better performance. Additionally, we also utilize “Rsum” to measure the overall retrieval quality of the model, which is defined as:
$\mathrm{Rsum} = \underbrace{(R@1 + R@5 + R@10)}_{\text{Text} \rightarrow \text{Image}} + \underbrace{(R@1 + R@5 + R@10)}_{\text{Image} \rightarrow \text{Text}}.$
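These metrics can be computed from a query–candidate similarity matrix as in the sketch below; for brevity it assumes one ground-truth candidate per query (candidate i matches query i), whereas the actual image-to-text protocol has five correct captions per image.

```python
import numpy as np

def retrieval_metrics(sim):
    """sim: (num_queries, num_candidates) similarity matrix, candidate i being the match of query i."""
    order = np.argsort(-sim, axis=1)                  # candidates sorted by descending similarity
    ranks = np.array([np.where(order[i] == i)[0][0] for i in range(sim.shape[0])])
    metrics = {f"R@{k}": 100.0 * np.mean(ranks < k) for k in (1, 5, 10)}
    metrics["MedR"] = float(np.median(ranks) + 1)     # +1: convert 0-based ranks to 1-based
    metrics["MnR"] = float(np.mean(ranks) + 1)
    return metrics

def rsum(text_to_image, image_to_text):
    # Rsum = (R@1 + R@5 + R@10)_{Text->Image} + (R@1 + R@5 + R@10)_{Image->Text}
    keys = ("R@1", "R@5", "R@10")
    return sum(text_to_image[k] for k in keys) + sum(image_to_text[k] for k in keys)
```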

4.3. Implementation Details

Our IConE model, as proposed in this paper, is implemented using PyTorch, and all experiments are conducted on a server equipped with 4 Nvidia Tesla V100 32 GB GPUs. For image and text encoding, CLIP’s ViT-B/32 and BERT are utilized to extract image and text features, respectively (refer to the image–text feature representation in Section 3.1 for details). Training follows the two-stage strategy described in Section 3.4. The ADAM optimizer [46,47] is used during training, with a learning rate of $1 \times 10^{-4}$ for Stage I and $1 \times 10^{-5}$ for Stage II; the same learning rate adjustment strategy [48] is applied to both stages. IConE undergoes a total of 40 epochs, comprising 25 epochs of Stage I initialization training and 15 epochs of Stage II fine-tuning. The batch size is set to 128, and the temperature hyperparameter $\tau$ is set to 0.05 by default for all datasets. In the testing phase, the epoch that yields the best Rsum metric on the two validation sets is selected for inference.
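As a rough picture of this configuration, the snippet below sets up the Stage I optimizer with a step-decay schedule; the decay interval and factor are illustrative assumptions (the paper cites the schedule of [48] without listing them here), and a plain linear layer stands in for the trainable IConE parameters.

```python
import torch
import torch.nn as nn

trainable = nn.Linear(512, 512)                    # placeholder for the IConE parameters tuned in Stage I
optimizer = torch.optim.Adam(trainable.parameters(), lr=1e-4)   # 1e-4 in Stage I, 1e-5 in Stage II
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # assumed decay values

for epoch in range(25):                            # 25 Stage I epochs (15 in Stage II)
    # ... iterate over batches of size 128 with temperature tau = 0.05 and call optimizer.step() ...
    scheduler.step()                               # step-decay learning-rate adjustment per epoch
```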

4.4. Baselines

The IConE is compared with two paradigms (graph-based and graph-free) that include 20 classical or SoTA baselines for image–text cross-modal retrieval tasks.
Graph-free paradigms: VSE++ [6], Dual-Path [7], GSLS [12], CMHF [11], SCAN [21], CAMP [22], CASC [26], CAAN [23], CVSE [24], Meta-SPN [25], and MLMN [13].
Graph-based paradigms: SGM [29], VSRN [30], SGSIN [31], MLSL [8], GSMN [32], ReSG [9], CSCC [15], CGMN [16], and HSLM [17].

5. Experimental Results and Analysis

This section will demonstrate the results of the experiment on two public benchmark datasets and report corresponding analyses, i.e., to guide the evaluation objectives by attempting to answer the following five research questions (RQs):
  • RQ1: Does IConE outperform the SoTA baselines overall on two public benchmark datasets?
  • RQ2: How does the two-stage training strategy influence the performance of cross-modal retrieval?
  • RQ3: What is the sensitivity of the hyperparameter  τ ?
  • RQ4: What is the generalization of the two-stage training strategy?
  • RQ5: What are the qualitative results for feature representation, cross-modal bi-directional retrieval, and heat map visualization?

5.1. Comparison with SoTA Baselines

To answer RQ1, we quantitatively show the superiority and effectiveness of the proposed IConE, by comparing it with several SoTA baseline methods on two widely used benchmark datasets, Flickr30k and MS-COCO. From the experimental results shown in Table 1 and Table 2, we can draw the following two observations:
  • Table 1 illustrates the comparison between IConE and the SoTA baseline methods on the Flickr30k dataset: (1) IConE surpasses 11 graph-free-paradigm methods across all evaluation metrics. Specifically, compared to the traditional method VSE++ [6], IConE improves the R@1 index for text-to-image and image-to-text retrieval by 21.5% and 25.4%, respectively, while the Rsum index, which reflects overall retrieval quality, sees a substantial improvement of 99.5 points. Notably, compared to MLMN [13], the representative SoTA graph-free method, IConE achieves superior results across all indicators, with a relative improvement of 22.9 points in the Rsum indicator. In a departure from MLMN [13], our proposed IConE leverages only one level of feature representation for modality alignment, avoiding the decomposition of image and text data into multiple levels of representation through intricate parsing methods, thereby significantly reducing model complexity and memory consumption. (2) Benefiting from effective graph node modeling and graph reasoning, graph-based paradigm methods generally exceed graph-free ones. Our IConE, which belongs to the graph-free paradigm, nevertheless outperforms 9 graph-based methods in most evaluation metrics. In particular, compared to the earlier SGM [29], IConE achieves the best performance across all indicators, with a remarkable improvement of 30.7 points in the Rsum index. Compared to the representative baseline CSCC [15], IConE achieves sub-optimal results in the R@1 and R@5 indicators for text-to-image retrieval; a similar trend is observed against HSLM [17], where IConE yields sub-optimal results for R@1 and R@5 in image-to-text retrieval. Despite trailing some strong baselines on a few indicators, IConE consistently maintains superior performance in the Rsum index, which reflects the overall retrieval quality.
  • Table 2 provides a comparison between IConE and SoTA baselines on the MS-COCO dataset: IConE showcases superior performance across nearly all metrics. Compared to the traditional Dual-Path [7], on the 1k test set the R@1 index for text-to-image and image-to-text retrieval improves by 10.1% and 8.7%, respectively, and the Rsum is enhanced by 34.4 points. On the 5k test set, the bi-directional retrieval R@1 index improves by 9.6% and 11.1%, respectively, and the Rsum index increases by 56.6 points. While some metrics of IConE on the MS-COCO 1k and MS-COCO 5k test sets are lower than those of the strong baselines GSLS [12] and SGM [29], they remain comparable. In particular, IConE achieves competitive results against GSLS, most notably a 5.4% improvement in the R@1 index for image-to-text retrieval. Against SGM, while IConE may not achieve the best performance in certain bi-directional retrieval metrics, it consistently maintains the most advanced results in Rsum, which reflects the overall retrieval quality.
In a word, IConE consistently outperforms SoTA baselines, particularly excelling in comparison to graph-free paradigm methods. This superiority can be attributed to IConE’s effective exploration of feature representation distributions of intra- and inter-modality, leading to competitive results.

5.2. Ablation Studies

To answer RQ2, we conduct a series of ablation experiments on two datasets, Flickr30k and MS-COCO, to analyze the effect of different components, including the two-stage training strategy, the instance classification loss $L_{ins}$, and the contrastive loss $L_{vt}$. The results shown in Table 3 lead to the following conclusions:
  • The Stage I training results on both datasets show that the ablation variant with only $L_{ins}$ and the one with only $L_{vt}$ both achieve promising improvements. In particular, when $L_{ins}$ is used alone, despite its focus on the intra-modality feature representation distribution and its limitation in bridging the semantic gap between modalities, it still attains competitive performance; specifically, it exhibits a 45.4-point improvement in the Rsum indicator on Flickr30k compared to the traditional SoTA baseline Dual-Path. On the other hand, $L_{vt}$, designed to bridge the semantic gap between modalities, achieves superior performance compared to using only $L_{ins}$.
  • By utilizing both losses for joint fine-tuning training in Stage II, the overall performance of the full IConE model significantly improves compared to Stage I. Specifically, the Rsum index on Flickr30k, MS-COCO1k, and MS-COCO 5k reaches 509.3, 502.3, and 394.5, respectively. This underscores that the instance loss in Stage I effectively regularizes the model, providing a better initialization for the contrastive loss. As a result, the model can comprehensively consider both instance classification and matching ranking, effectively bridging the semantic gap in image–text cross-modal retrieval.

5.3. Sensitivity Analysis of the Hyperparameter  τ

To answer RQ3, we also investigate the impact of the temperature hyperparameter $\tau$ on IConE. As depicted in Figure 3, $\tau$ regulates the penalty intensity of the contrastive loss in Equations (6) and (7). By manually tuning $\tau$ and observing its effect, we find that the Rsum performance of IConE increases slightly as $\tau$ rises, peaks at $\tau$ = 0.05, and then declines. The main reason is that the similarity is very sensitive to $\tau$: values that are too large or too small adversely affect performance. Consequently, $\tau$ is set to 0.05 by default in this paper.
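The effect can be reproduced in isolation: the toy snippet below (with made-up similarity scores) shows how $\tau$ reshapes the contrastive softmax in Equations (6) and (7), with very small values over-sharpening the distribution and very large values flattening it.

```python
import torch
import torch.nn.functional as F

# One positive (first entry) and three negative similarities; values are illustrative only.
sims = torch.tensor([0.60, 0.55, 0.30, 0.10])
for tau in (0.01, 0.05, 0.5, 1.0):
    probs = F.softmax(sims / tau, dim=0)           # the distribution the InfoNCE loss operates on
    print(f"tau={tau:4.2f}  P(positive)={probs[0]:.3f}")
```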

5.4. Generalization of the Two-Stage Training Strategy

To answer RQ4, we adjust the text feature representation encoder (utilizing BERT, GRU and LSTM) and the image feature representation encoder (utilizing ResNet-50, ResNet-152 and ViT), respectively, to assess the generalization of the two-stage training strategy. As can be seen from Table 4:
  • The performance trends for both the image encoder backbone networks (ResNet-50 and ResNet-152) and the text encoder backbone networks (GRU and LSTM) are as expected: using the two losses $L_{ins}$ and $L_{vt}$ separately in Stage I already produces gratifying results, and combining them in Stage II yields a more significant improvement than the $L_{vt}$ loss alone, further enhancing model performance compared to Stage I and demonstrating the strong generalization of the two-stage training strategy.
  • Furthermore, the experimental results in both Table 3 and Table 4 also validate the effectiveness of the CLIP multi-modal pre-training model for cross-modal retrieval tasks. In this paper, the ViT initialized with CLIP is employed as the backbone network for the image encoder. This approach leverages the knowledge from the existing multi-modal pre-training model for the cross-modal retrieval task, enabling the utilization of interactive information between images and text, thereby enhancing the model’s feature representation capabilities. In contrast, ResNet-50 and ResNet-152 are single-modal pre-training models for images, operating independently from text encoders to extract image features without interaction. Consequently, IConE utilizing CLIP-initialized ViT as the image encoder achieves superior performance.

5.5. Qualitative Visualization Results

To answer RQ5, we also conduct a series of visualization experiments, including feature representation visualization, image–text cross-modal retrieval visualization, and regional heat map visualization, to offer a more intuitive illustration of the IConE’s performance.
For feature representation visualization: t-SNE [49] is employed to embed Flickr30k and MS-COCO feature representations into the visual–text joint semantic common representation space. As shown in Figure 4 and Figure 5, colors represent semantics and shapes denote modalities (i.e., images and text). In Figure 4a and Figure 5a, it is apparent that the image–text feature representations extracted by the pre-trained backbone network lack consistency: the original features of images and texts are mixed together, making them difficult to distinguish and hindering superior cross-modal retrieval performance. In contrast, Figure 4b and Figure 5b depict the distribution of image–text feature representations learned by IConE. This distribution exhibits strong differentiation, compressing semantically related positive pairs together and dispersing semantically unrelated negative pairs; in other words, samples of different instance categories are divided into distinct clusters. Consequently, the feature representation visualization demonstrates that IConE effectively bridges the semantic gap between image and text, learning discriminative representations of different instance classes.
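A minimal version of this visualization, assuming scikit-learn’s t-SNE and one caption per image for simplicity (random vectors stand in for the learned IConE features), could look as follows.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_joint_space(image_feats, text_feats, labels):
    """t-SNE view of the joint space: color = instance/semantic class, marker = modality."""
    feats = np.concatenate([image_feats, text_feats], axis=0)
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    n = len(image_feats)
    plt.scatter(emb[:n, 0], emb[:n, 1], c=labels, marker="o", s=14, label="image")
    plt.scatter(emb[n:, 0], emb[n:, 1], c=labels, marker="^", s=14, label="text")
    plt.legend()
    plt.axis("off")
    plt.show()

# Toy usage: 50 image/text pairs from 10 instance classes, with random stand-in embeddings.
rng = np.random.default_rng(0)
plot_joint_space(rng.normal(size=(50, 512)), rng.normal(size=(50, 512)),
                 labels=np.repeat(np.arange(10), 5))
```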
For cross-modal retrieval visualization: both text-to-image and image-to-text retrieval results are visualized on Flickr30k and MS-COCO. Figure 6 and Figure 7 present examples of text-to-image retrieval on the two datasets, showcasing the top 5 similarity matches for each given query text. As each image corresponds to 5 text sentences in Flickr30k and MS-COCO, only one correct image can be retrieved for each text query. Correct matches are indicated by a green box, while incorrect matches are marked with a red one. In the last row of Figure 6 and Figure 7, the top-left image is semantically close to both the query sentence and the ground-truth image.
Figure 8 and Figure 9 illustrate examples of image retrieval text on two datasets. Similar to text retrieval images, only matches ranked in the top 5 in terms of similarity are listed for each given query image. Correct matches are indicated in green font, while incorrect matches are marked in red. The example demonstrates that IConE can retrieve almost all sentences related to the semantics of the queried image, and even some incorrect sentences show similarities in local fine-grained semantics. For instance, in the first example of Figure 8, although the fifth sentence is a false match, the entities in the image are semantically related to the sentences “girls playing volleyball” and “striking the ball”. This indicates that IConE is adept at focusing on the feature representations distribution of intra- and inter-modality, enabling fine-grained matching.
For regional heat map visualization: We also list the regional heat maps learned by IConE on Flickr30k. As can be seen from Figure 10, IConE places more emphasis on the fine-grained information of image and text modalities, such as the interaction between entities.

6. Conclusions

In this paper, an instance contrastive embedding model, referred to as IConE, is proposed for image–text cross-modal retrieval. IConE alleviates the issue of missing inter-modal interaction information in the dual-tower structure by leveraging knowledge from multi-modal pre-training models. Additionally, we design an instance loss, treating each “image/text group” as a class, to explicitly consider the intra-modality feature representation distribution. A novel two-stage training strategy combines the strengths of the instance loss and the contrastive loss. Extensive experiments on two public benchmark datasets demonstrate that IConE achieves competitive results.

Author Contributions

Conceptualization, R.Z. and W.M.; methodology, R.Z. and W.M.; software, R.Z.; validation, R.Z., W.M. and X.W.; formal analysis, W.M. and W.L.; writing—original draft preparation, W.M.; writing—review and editing, W.M. and R.Z.; visualization, X.W.; supervision, J.L.; funding acquisition, J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China under Grants 62202001 and 3177167, and the National Key Research and Development Program of China under Grant 2021YFB0300101.

Data Availability Statement

Data is contained within the article.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Ma, Y.; Xu, G.; Sun, X.; Yan, M.; Zhang, J.; Ji, R. X-CLIP: End-to-end multi-grained contrastive learning for video-text retrieval. In Proceedings of the ACM International Conference on Multimedia, Lisbon, Portugal, 10–14 October 2022; pp. 638–647. [Google Scholar]
  2. Ma, W.; Chen, Q.; Zhou, T.; Zhao, S.; Cai, Z. Using Multimodal Contrastive Knowledge Distillation for Video-Text Retrieval. IEEE Trans. Circuits Syst. Video Technol. 2023, 33, 5486–5497. [Google Scholar] [CrossRef]
  3. Chen, S.; Zhao, Y.; Jin, Q.; Wu, Q. Fine-grained video-text retrieval with hierarchical graph reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10638–10647. [Google Scholar]
  4. Ma, W.; Wu, X.; Zhao, S.; Zhou, T.; Guo, D.; Gu, L.; Cai, Z.; Wang, M. FedSH: Towards Privacy-preserving Text-based Person Re-Identification. IEEE Trans. Multimed. 2023; early access. [Google Scholar]
  5. Wu, X.; Ma, W.; Guo, D.; Tongqing, Z.; Zhao, S.; Cai, Z. Text-based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning. In Proceedings of the AAAI, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
  6. Faghri, F.; Fleet, D.J.; Kiros, J.R.; Fidler, S. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv 2017, arXiv:1707.05612. [Google Scholar]
  7. Zheng, Z.; Zheng, L.; Garrett, M.; Yang, Y.; Xu, M.; Shen, Y.D. Dual-path convolutional image-text embeddings with instance loss. ACM Trans. Multimed. Comput. Commun. Appl. 2020, 16, 1–23. [Google Scholar] [CrossRef]
  8. Li, W.H.; Yang, S.; Wang, Y.; Song, D.; Li, X.Y. Multi-level similarity learning for image-text retrieval. Inf. Process. Manag. 2021, 58, 102432. [Google Scholar] [CrossRef]
  9. Liu, X.; He, Y.; Cheung, Y.M.; Xu, X.; Wang, N. Learning relationship-enhanced semantic graph for fine-grained image–text matching. IEEE Trans. Cybern. 2022; early access. [Google Scholar]
  10. Chen, J.; Hu, H.; Wu, H.; Jiang, Y.; Wang, C. Learning the best pooling strategy for visual semantic embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 15789–15798. [Google Scholar]
  11. Xu, X.; Wang, Y.; He, Y.; Yang, Y.; Hanjalic, A.; Shen, H.T. Cross-modal hybrid feature fusion for image-sentence matching. ACM Trans. Multimed. Comput. Commun. Appl. 2021, 17, 1–23. [Google Scholar] [CrossRef]
  12. Li, Z.; Ling, F.; Zhang, C.; Ma, H. Combining global and local similarity for cross-media retrieval. IEEE Access 2020, 8, 21847–21856. [Google Scholar] [CrossRef]
  13. Lan, H.; Zhang, P. Learning and integrating multi-level matching features for image-text retrieval. IEEE Signal Process. Lett. 2021, 29, 374–378. [Google Scholar] [CrossRef]
  14. Li, Z.; Guo, C.; Feng, Z.; Hwang, J.N.; Xue, X. Multi-view visual semantic embedding. In Proceedings of the International Joint Conference on Artificial Intelligence, Vienna, Austria, 23–29 July 2022; pp. 1130–1136. [Google Scholar]
  15. Zeng, P.; Gao, L.; Lyu, X.; Jing, S.; Song, J. Conceptual and syntactical cross-modal alignment with cross-level consistency for image-text matching. In Proceedings of the ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 2205–2213. [Google Scholar]
  16. Cheng, Y.; Zhu, X.; Qian, J.; Wen, F.; Liu, P. Cross-modal graph matching network for image-text retrieval. ACM Trans. Multimed. Comput. Commun. Appl. 2022, 18, 1–23. [Google Scholar] [CrossRef]
  17. Zeng, S.; Liu, C.; Zhou, J.; Chen, Y.; Jiang, A.; Li, H. Learning hierarchical semantic correspondences for cross-modal image-text retrieval. In Proceedings of the International Conference on Multimedia Retrieval, Newark, NJ, USA, 27–30 June 2022; pp. 239–248. [Google Scholar]
  18. Frome, A.; Corrado, G.S.; Shlens, J.; Bengio, S.; Dean, J.; Ranzato, M.; Mikolov, T. DeViSE: A deep visual-semantic embedding model. In Proceedings of the International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 2121–2129. [Google Scholar]
  19. Vendrov, I.; Kiros, R.; Fidler, S.; Urtasun, R. Order-embeddings of images and language. arXiv 2015, arXiv:1511.06361. [Google Scholar]
  20. He, Y.; Xiang, S.; Kang, C.; Wang, J.; Pan, C. Cross-modal retrieval via deep and bidirectional representation learning. IEEE Trans. Multimed. 2016, 18, 1363–1377. [Google Scholar] [CrossRef]
  21. Lee, K.H.; Chen, X.; Hua, G.; Hu, H.; He, X. Stacked cross attention for image-text matching. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 201–216. [Google Scholar]
  22. Wang, Z.; Liu, X.; Li, H.; Sheng, L.; Yan, J.; Wang, X.; Shao, J. Camp: Cross-modal adaptive message passing for text-image retrieval. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5764–5773. [Google Scholar]
  23. Zhang, Q.; Lei, Z.; Zhang, Z.; Li, S.Z. Context-aware attention network for image-text retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 3536–3545. [Google Scholar]
  24. Wang, H.; Zhang, Y.; Ji, Z.; Pang, Y.; Ma, L. Consensus-aware visual-semantic embedding for image-text matching. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 18–34. [Google Scholar]
  25. Wei, J.; Xu, X.; Wang, Z.; Wang, G. Meta self-paced learning for cross-modal matching. In Proceedings of the ACM International Conference on Multimedia, Chengdu, China, 20–24 October 2021; pp. 3835–3843. [Google Scholar]
  26. Xu, X.; Wang, T.; Yang, Y.; Zuo, L.; Shen, F.; Shen, H.T. Cross-modal attention with semantic consistence for image-text matching. IEEE Trans. Neural Netw. Learn. Syst. 2020, 31, 5412–5425. [Google Scholar] [CrossRef] [PubMed]
  27. Liu, Y.; Liu, H.; Wang, H.; Liu, M. Regularizing visual semantic embedding with contrastive learning for image-text matching. IEEE Signal Process. Lett. 2022, 29, 1332–1336. [Google Scholar] [CrossRef]
  28. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 91–99. [Google Scholar]
  29. Wang, S.; Wang, R.; Yao, Z.; Shan, S.; Chen, X. Cross-modal scene graph matching for relationship-aware image-text retrieval. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1508–1517. [Google Scholar]
  30. Li, K.; Zhang, Y.; Li, K.; Li, Y.; Fu, Y. Visual semantic reasoning for image-text matching. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 4654–4662. [Google Scholar]
  31. Pei, J.; Zhong, K.; Yu, Z.; Wang, L.; Lakshmanna, K. Scene graph semantic inference for image and text matching. Acm Trans. Asian-Low-Resour. Lang. Inf. Process. 2022, 22, 144. [Google Scholar] [CrossRef]
  32. Liu, C.; Mao, Z.; Zhang, T.; Xie, H.; Wang, B.; Zhang, Y. Graph structured network for image-text matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10921–10930. [Google Scholar]
  33. Long, S.; Han, S.C.; Wan, X.; Poon, J. Gradual: Graph-based dual-modal representation for image-text matching. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3459–3468. [Google Scholar]
  34. Wang, L.; Li, Y.; Huang, J.; Lazebnik, S. Learning two-branch neural networks for image-text matching tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 394–407. [Google Scholar] [CrossRef] [PubMed]
  35. Ma, W.; Chen, Q.; Liu, F.; Zhou, T.; Cai, Z. Query-adaptive late fusion for hierarchical fine-grained video-text retrieval. IEEE Trans. Neural Netw. Learn. Syst. 2022; early access. [Google Scholar]
  36. Huo, Y.; Zhang, M.; Liu, G.; Lu, H.; Gao, Y.; Yang, G.; Wen, J.; Zhang, H.; Xu, B.; Zheng, W.; et al. WenLan: Bridging vision and language by large-scale multi-modal pre-training. arXiv 2021, arXiv:2103.06561. [Google Scholar]
  37. Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
  38. Yuan, L.; Chen, D.; Chen, Y.L.; Codella, N.; Dai, X.; Gao, J.; Hu, H.; Huang, X.; Li, B.; Li, C.; et al. Florence: A new foundation model for computer vision. arXiv 2021, arXiv:2111.11432. [Google Scholar]
  39. Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; Li, T. CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing 2022, 508, 293–304. [Google Scholar] [CrossRef]
  40. Yan, S.; Dong, N.; Zhang, L.; Tang, J. CLIP-driven fine-grained text-image person re-identification. arXiv 2022, arXiv:2210.10276. [Google Scholar] [CrossRef]
  41. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  42. Kenton, J.D.M.W.C.; Toutanova, L.K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  43. Oord, A.v.d.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  44. Young, P.; Lai, A.; Hodosh, M.; Hockenmaier, J. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Trans. Assoc. Comput. Linguist. 2014, 2, 67–78. [Google Scholar] [CrossRef]
  45. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  46. Hu, P.; Peng, X.; Zhu, H.; Zhen, L.; Lin, J. Learning cross-modal retrieval with noisy labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 5403–5413. [Google Scholar]
  47. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  48. Ge, R.; Kakade, S.M.; Kidambi, R.; Netrapalli, P. The step decay schedule: A near optimal, geometrically decaying learning rate procedure for least squares. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019. [Google Scholar]
  49. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
Figure 1. We define an “image/text group” as an image with its associated sentences, thus we view every “image/text group” as a distinct class during training, yielding the instance loss.
Figure 2. Overview of the proposed IConE framework.
Figure 3. The sensitivity of hyperparameter $\tau$ on Flickr30k and MS-COCO 1k.
Figure 4. Visualization of semantic common representation space for Flickr30k by using the t-SNE [49]. The same color indicates relevant semantics, the shapes represent different modalities. (a) The pre-train backbone. (b) Our IConE.
Figure 5. Visualization of semantic common representation space for MS-COCO 1k by using the t-SNE [49]. The same color indicates relevant semantics, the shapes represent different modalities. (a) The pre-train backbone. (b) Our IConE.
Figure 6. Text-to-image retrieval examples on Flickr30k testing set. We visualize top 5 retrieved images (green: correct; red: incorrect).
Figure 7. Text-to-image retrieval examples on MS-COCO testing set. We visualize top 5 retrieved images (green: correct; red: incorrect).
Figure 8. Image-to-text retrieval examples on Flickr30k testing set with top 3 retrieved texts (green: correct; red: incorrect).
Figure 9. Image-to-text retrieval examples on MS-COCO testing set with top 3 retrieved texts (green: correct; red: incorrect).
Figure 10. Heat map of our IConE on Flickr30k. Key phrases within the query text are highlighted in blue.
Table 1. Comparison results on the Flickr30k. Bold-labeled data refers to the best results.

| Methods | Text-to-Image (R@1 / R@5 / R@10 / MedR / MnR) | Image-to-Text (R@1 / R@5 / R@10 / MedR / MnR) | Rsum | Parameter Size |
|---|---|---|---|---|
| Graph-free paradigm | | | | |
| VSE++ [6] | 39.6 / 70.1 / 79.5 / 2 / - | 52.9 / 80.5 / 87.2 / 1 / - | 409.8 | - |
| Dual-Path [7] | 39.1 / 69.2 / 80.9 / 2 / - | 55.6 / 81.9 / 89.5 / 1 / - | 416.2 | - |
| GSLS [12] | 43.4 / 73.5 / 82.5 / 2 / - | 68.2 / 89.1 / 94.5 / 1 / - | 451.2 | - |
| CMHF [11] | 45.4 / 76.6 / 85.0 / - / - | 63.6 / 88.6 / 94.0 / - / - | 453.2 | - |
| SCAN [21] | 48.6 / 77.7 / 85.2 / - / - | 67.4 / 90.3 / 95.8 / - / - | 465.0 | - |
| CAMP [22] | 51.5 / 77.1 / 85.3 / - / - | 68.1 / 89.7 / 95.2 / - / - | 466.9 | - |
| CASC [26] | 50.2 / 78.3 / 86.3 / - / - | 68.5 / 90.6 / 95.6 / - / - | 469.5 | - |
| CAAN [23] | 52.8 / 79.0 / 87.9 / - / - | 70.1 / 91.6 / 97.2 / - / - | 478.6 | - |
| CVSE [24] | 52.9 / 80.4 / 87.8 / - / - | 73.5 / 92.1 / 95.8 / - / - | 482.5 | - |
| Meta-SPN [25] | 53.3 / 80.2 / 87.2 / - / - | 72.5 / 93.2 / 96.7 / - / - | 483.1 | - |
| MLMN [13] | 55.3 / 80.2 / 85.6 / - / - | 75.9 / 93.3 / 96.1 / - / - | 486.4 | - |
| Graph-based paradigm | | | | |
| SGM [29] | 53.5 / 79.6 / 86.5 / 1 / - | 71.8 / 91.7 / 95.5 / 1 / - | 478.6 | - |
| VSRN [30] | 54.7 / 81.8 / 88.2 / - / - | 71.3 / 90.6 / 96.0 / - / - | 482.6 | - |
| SGSIN [31] | 53.9 / 80.1 / 87.2 / - / - | 73.1 / 93.6 / 96.8 / - / - | 484.7 | - |
| MLSL [8] | 56.8 / 83.3 / 91.3 / - / - | 72.2 / 92.4 / 98.2 / - / - | 494.2 | - |
| GSMN [32] | 57.4 / 82.3 / 89.0 / - / - | 76.4 / 94.3 / 97.3 / - / - | 496.7 | - |
| ReSG [9] | 58.0 / 83.1 / 88.7 / - / - | 77.2 / 94.2 / 98.2 / - / - | 499.4 | - |
| CGMN [16] | 59.9 / 85.1 / 90.6 / - / - | 77.9 / 93.8 / 96.8 / - / - | 504.1 | - |
| HSLM [17] | 60.7 / 84.7 / 90.1 / - / - | 79.9 / 95.7 / 97.5 / - / - | 508.6 | - |
| Our IConE (Graph-free paradigm) | 61.1 / 86.0 / 91.7 / 1 / 6.1 | 78.3 / 94.6 / 97.6 / 1 / 2.6 | 509.3 | 225M |
Table 2. Performance comparison between our proposed IConE and recent SoTA on MS-COCO. MS-COCO 5K and MS-COCO 1K denote the evaluation settings of the full 5K and average of 5-fold 1K test images. Bold-labeled data refers to the best results.

| Methods | Text-to-Image (R@1 / R@5 / R@10 / MedR / MnR) | Image-to-Text (R@1 / R@5 / R@10 / MedR / MnR) | Rsum | Parameter Size |
|---|---|---|---|---|
| MS-COCO 1k | | | | |
| Dual-Path [7] | 47.1 / 79.9 / 90.0 / 2 / - | 65.6 / 89.8 / 95.5 / 1 / - | 467.9 | - |
| VSE++ [6] | 52.0 / 84.3 / 92.0 / 1 / - | 64.6 / 90.0 / 95.7 / 1 / - | 478.6 | - |
| SCAN [21] | 53.0 / 85.4 / 92.9 / - / - | 67.5 / 92.9 / 97.6 / - / - | 489.3 | - |
| GSLS [12] | 58.6 / 88.2 / 94.9 / 1 / - | 68.9 / 94.1 / 98.0 / 1 / - | 502.7 | - |
| SGM [29] | 57.5 / 87.3 / 94.3 / 1 / - | 73.4 / 93.8 / 97.8 / 1 / - | 504.1 | - |
| Our IConE | 57.2 / 86.2 / 93.5 / 1 / 4.6 | 74.3 / 93.8 / 97.3 / 1 / 2.5 | 502.3 | - |
| MS-COCO 5k | | | | |
| Dual-Path [7] | 25.3 / 53.4 / 66.4 / 5 / - | 41.2 / 70.5 / 81.1 / 2 / - | 337.9 | - |
| VSE++ [6] | 30.3 / 59.4 / 72.4 / 4 / - | 41.3 / 71.1 / 81.2 / 2 / - | 355.7 | - |
| SCAN [21] | 34.4 / 63.7 / 75.7 / - / - | 46.4 / 77.4 / 87.2 / - / - | 384.8 | - |
| CASC [26] | 34.7 / 64.8 / 76.8 / - / - | 47.2 / 78.3 / 87.4 / - / - | 389.2 | - |
| SGM [29] | 35.3 / 64.9 / 76.5 / 3 / - | 50.0 / 79.3 / 87.9 / 2 / - | 393.9 | - |
| Our IConE | 34.9 / 64.4 / 75.7 / 3 / 18.4 | 52.3 / 79.5 / 87.7 / 1 / 7.7 | 394.5 | 225M |
Table 3. Pairwise ranking loss and instance loss retrieval results on Flickr30k and MS-COCO. Except for the different losses, we apply the same backbone and hyperparameters.

| Methods | Stage | Text-to-Image (R@1 / R@5 / R@10 / MedR / MnR) | Image-to-Text (R@1 / R@5 / R@10 / MedR / MnR) | Rsum |
|---|---|---|---|---|
| Flickr30k | | | | |
| Only $L_{ins}$ | I | 48.2 / 76.6 / 84.7 / 2 / 11.6 | 68.3 / 89.9 / 93.9 / 1 / 3.6 | 461.6 |
| Only $L_{vt}$ | I | 52.8 / 81.5 / 88.3 / 1 / 7.5 | 68.8 / 91.1 / 95.5 / 1 / 3.3 | 478.0 |
| Full IConE (with $L_{ins}$ and $L_{vt}$) | II | 61.1 / 86.0 / 91.7 / 1 / 6.1 | 78.3 / 94.6 / 97.6 / 1 / 2.6 | 509.3 |
| MS-COCO 1k | | | | |
| Only $L_{ins}$ | I | 41.3 / 75.2 / 86.3 / 2 / 9.1 | 59.2 / 85.4 / 92.9 / 1 / 4.4 | 440.3 |
| Only $L_{vt}$ | I | 49.8 / 81.7 / 91.4 / 1.6 / 5.2 | 60.2 / 87.4 / 94.2 / 1 / 3.8 | 464.7 |
| Full IConE (with $L_{ins}$ and $L_{vt}$) | II | 57.2 / 86.2 / 93.5 / 1 / 4.6 | 74.3 / 93.8 / 97.3 / 1 / 2.5 | 502.3 |
| MS-COCO 5k | | | | |
| Only $L_{ins}$ | I | 21.4 / 47.1 / 60.2 / 6 / 40.9 | 35.9 / 64.1 / 74.8 / 3 / 17.9 | 303.5 |
| Only $L_{vt}$ | I | 28.4 / 56.4 / 68.5 / 4 / 21.7 | 36.3 / 65.8 / 77.2 / 3 / 14.1 | 332.6 |
| Full IConE (with $L_{ins}$ and $L_{vt}$) | II | 34.9 / 64.4 / 75.7 / 3 / 18.4 | 52.3 / 79.5 / 87.7 / 1 / 7.7 | 394.5 |
Table 4. The generalization of our two-stage strategy on three datasets (including Flickr30k, MS-COCO 1k, and MS-COCO 5k) using different image encoders and text encoders.

| Methods | Stage | Image Encoder | Text Encoder | Text-to-Image (R@1 / R@5 / R@10) | Image-to-Text (R@1 / R@5 / R@10) | Rsum |
|---|---|---|---|---|---|---|
| Flickr30k | | | | | | |
| Only $L_{ins}$ | I | ResNet-50 | BERT | 20.7 / 45.9 / 58.4 | 30.5 / 57.5 / 69.2 | 282.2 |
| Only $L_{vt}$ | I | ResNet-50 | BERT | 25.4 / 52.9 / 65.6 | 34.8 / 63.8 / 75.8 | 318.3 |
| Full IConE | II | ResNet-50 | BERT | 32.0 / 59.7 / 70.8 | 46.3 / 72.6 / 81.6 | 363.0 |
| Only $L_{ins}$ | I | ResNet-152 | BERT | 22.3 / 49.2 / 61.5 | 34.8 / 62.4 / 72.8 | 303.0 |
| Only $L_{vt}$ | I | ResNet-152 | BERT | 27.1 / 55.9 / 68.3 | 38.4 / 67.9 / 79.1 | 336.7 |
| Full IConE | II | ResNet-152 | BERT | 36.4 / 65.0 / 75.2 | 51.8 / 77.2 / 84.9 | 390.5 |
| Only $L_{ins}$ | I | ViT | BERT | 48.2 / 76.6 / 84.7 | 68.3 / 89.9 / 93.9 | 461.6 |
| Only $L_{vt}$ | I | ViT | BERT | 52.8 / 81.5 / 88.3 | 68.8 / 91.1 / 95.5 | 478.0 |
| Full IConE | II | ViT | BERT | 61.1 / 86.0 / 91.7 | 78.3 / 94.6 / 97.6 | 509.3 |
| Only $L_{ins}$ | I | ViT | GRU | 46.8 / 78.5 / 86.6 | 69.4 / 90.0 / 94.4 | 465.7 |
| Only $L_{vt}$ | I | ViT | GRU | 51.4 / 80.1 / 87.4 | 69.6 / 89.3 / 94.9 | 472.7 |
| Full IConE | II | ViT | GRU | 52.0 / 80.1 / 87.5 | 70.5 / 89.3 / 94.4 | 473.8 |
| Only $L_{ins}$ | I | ViT | LSTM | 43.7 / 76.1 / 85.4 | 64.6 / 88.6 / 93.3 | 451.7 |
| Only $L_{vt}$ | I | ViT | LSTM | 49.9 / 79.4 / 86.6 | 66.7 / 89.8 / 95.1 | 467.5 |
| Full IConE | II | ViT | LSTM | 66.9 / 90.7 / 95.0 | 50.6 / 79.5 / 87.0 | 469.7 |
| MS-COCO 1k | | | | | | |
| Only $L_{ins}$ | I | ResNet-50 | BERT | 24.2 / 57.4 / 73.0 | 37.4 / 69.2 / 81.8 | 343.0 |
| Only $L_{vt}$ | I | ResNet-50 | BERT | 32.0 / 67.7 / 82.1 | 40.0 / 74.4 / 85.4 | 381.6 |
| Full IConE | II | ResNet-50 | BERT | 37.7 / 72.3 / 84.3 | 54.1 / 83.2 / 90.9 | 422.5 |
| Only $L_{ins}$ | I | ResNet-152 | BERT | 25.4 / 59.9 / 75.1 | 39.1 / 71.1 / 83.8 | 354.4 |
| Only $L_{vt}$ | I | ResNet-152 | BERT | 34.3 / 70.0 / 83.5 | 41.9 / 75.8 / 86.7 | 392.2 |
| Full IConE | II | ResNet-152 | BERT | 43.7 / 78.7 / 89.1 | 60.3 / 86.8 / 93.4 | 452.0 |
| Only $L_{ins}$ | I | ViT | GRU | 49.9 / 83.4 / 92.2 | 68.0 / 91.1 / 96.2 | 480.8 |
| Only $L_{vt}$ | I | ViT | GRU | 53.5 / 85.4 / 93.3 | 67.2 / 91.3 / 96.5 | 487.2 |
| Full IConE | II | ViT | GRU | 54.5 / 85.7 / 93.6 | 68.0 / 91.7 / 96.9 | 490.4 |
| Only $L_{ins}$ | I | ViT | LSTM | 48.9 / 82.5 / 91.8 | 67.4 / 90.9 / 96.4 | 477.9 |
| Only $L_{vt}$ | I | ViT | LSTM | 52.8 / 85.3 / 93.4 | 67.1 / 91.2 / 96.5 | 486.3 |
| Full IConE | II | ViT | LSTM | 53.7 / 85.6 / 93.6 | 66.9 / 91.8 / 96.8 | 488.4 |
| MS-COCO 5k | | | | | | |
| Only $L_{ins}$ | I | ResNet-50 | BERT | 9.7 / 27.7 / 40.3 | 17.0 / 40.2 / 53.4 | 188.3 |
| Only $L_{vt}$ | I | ResNet-50 | BERT | 14.1 / 37.1 / 50.6 | 18.7 / 44.2 / 59.2 | 223.9 |
| Full IConE | II | ResNet-50 | BERT | 18.5 / 43.8 / 57.4 | 31.4 / 60.1 / 72.5 | 283.7 |
| Only $L_{ins}$ | I | ResNet-152 | BERT | 10.2 / 29.5 / 42.6 | 18.4 / 42.6 / 55.9 | 199.2 |
| Only $L_{vt}$ | I | ResNet-152 | BERT | 15.7 / 39.6 / 53.1 | 21.0 / 45.9 / 60.1 | 235.4 |
| Full IConE | II | ResNet-152 | BERT | 23.1 / 50.9 / 64.3 | 37.9 / 66.2 / 77.1 | 319.5 |
| Only $L_{ins}$ | I | ViT | GRU | 28.1 / 57.0 / 70.2 | 44.5 / 73.5 / 83.4 | 356.7 |
| Only $L_{vt}$ | I | ViT | GRU | 31.8 / 60.7 / 73.1 | 43.3 / 71.6 / 82.9 | 363.4 |
| Full IConE | II | ViT | GRU | 32.7 / 61.6 / 73.8 | 43.4 / 73.0 / 83.4 | 367.9 |
| Only $L_{ins}$ | I | ViT | LSTM | 27.2 / 55.9 / 69.5 | 42.8 / 72.4 / 82.6 | 350.4 |
| Only $L_{vt}$ | I | ViT | LSTM | 30.8 / 60.1 / 72.6 | 43.3 / 71.5 / 82.9 | 361.2 |
| Full IConE | II | ViT | LSTM | 31.5 / 61.1 / 73.4 | 43.5 / 72.2 / 83.7 | 365.4 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zeng, R.; Ma, W.; Wu, X.; Liu, W.; Liu, J. Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding. Electronics 2024, 13, 300. https://doi.org/10.3390/electronics13020300

