Improving News Retrieval with a Learnable Alignment Module for Multimodal Text–Image Matching

Song, Rui; Tian, Jiwei; Zhu, Peican; Chen, Bin

doi:10.3390/electronics14153098

Open AccessArticle

Improving News Retrieval with a Learnable Alignment Module for Multimodal Text–Image Matching

¹

School of Journalism and Communication, Xi’an International Studies University, Xi’an 710128, China

²

Air Traffic Control and Navigation College, Air Force Engineering University, Xi’an 710051, China

³

School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China

⁴

School of Computer Science, Northwestern Polytechnical University, Xi’an 710072, China

⁵

Unit 93212 of People’s Liberation Army of China, Dalian 116000, China

^*

Author to whom correspondence should be addressed.

Electronics 2025, 14(15), 3098; https://doi.org/10.3390/electronics14153098

Submission received: 30 June 2025 / Revised: 23 July 2025 / Accepted: 27 July 2025 / Published: 3 August 2025

(This article belongs to the Topic Graph Neural Networks and Learning Systems)

Download

Browse Figures

Versions Notes

Abstract

With the diversification of information retrieval methods, news retrieval tasks have gradually evolved towards multimodal retrieval. Existing methods often encounter issues such as inaccurate alignment and unstable feature matching when handling cross-modal data like text and images, limiting retrieval performance. To address this, this paper proposes an innovative multimodal news retrieval method by introducing the Learnable Alignment Module (LAM), which establishes a learnable alignment relationship between text and images to improve the accuracy and stability of cross-modal retrieval. Specifically, the LAM, through trainable label embeddings (TLEs), enables the text encoder to dynamically adjust category information during training, thereby enhancing the alignment capability of text and images in the shared embedding space. Additionally, we propose three key alignment strategies: logits calibration, parameter consistency, and semantic feature matching, to further optimize the model’s multimodal learning ability. Extensive experiments conducted on four public datasets—Visual News, MMED, N24News, and EDIS—demonstrate that the proposed method outperforms existing state-of-the-art approaches in both text and image retrieval tasks. Notably, the method achieves significant improvements in low-recall scenarios (R@1): for text retrieval, R@1 reaches 47.34, 44.94, 16.47, and 19.23, respectively; for image retrieval, R@1 achieves 40.30, 38.49, 9.86, and 17.95, validating the effectiveness and robustness of the proposed method in multimodal news retrieval.

Keywords:

multimodal retrieval; text–image matching; multimodal feature fusion; news reporting; deep learning

1. Introduction

In today’s information-driven society, news retrieval has become one of the core tools for information acquisition [1,2]. With the widespread use of the internet, the volume of news content has grown explosively, making it increasingly difficult for users to quickly filter out the relevant information they need from millions of news articles [3,4]. Traditional news retrieval systems typically rely on keyword matching and rule-based retrieval methods, which are inefficient when handling vast amounts of information and fail to capture the diversity and complexity of news content [5]. In recent years, the research focus of news retrieval has shifted from pure textual information retrieval to more diversified, multimodal information retrieval. This shift not only requires extracting semantics from news text but also effectively processing news images, videos, and other modalities, allowing systems to more accurately capture user needs [6,7]. With the advancement of technology, enhancing the accuracy and robustness of news retrieval systems has become a key research direction in both academia and industry [8].

Deep learning, particularly neural network-based models, has brought revolutionary progress to the field of information retrieval in recent years [9]. Traditional keyword-based retrieval methods often fail to understand the semantic relationships between words, leading to inaccurate retrieval results [10]. In contrast, deep learning significantly improves retrieval performance by automatically extracting features and learning the underlying patterns in the data. Especially in the field of natural language processing (NLP), deep learning models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer-based models have been successfully applied to tasks like news retrieval, recommendation systems, and question answering [11]. At the same time, advances in computer vision (CV) have provided new opportunities for multimodal learning. In multimodal retrieval, text and images are the most common information sources [12]. Recent work, leveraging pretrained multimodal models like CLIP (Contrastive Language-Image Pretraining) and ALIGN, allows researchers to map images and text into a shared embedding space where related text and images can be aligned, significantly improving retrieval accuracy [13].

However, despite the significant progress of deep learning in information retrieval, existing multimodal retrieval methods still face challenges and shortcomings [14]. Firstly, most of the existing methods rely on feature extraction from single modalities or use simple alignment methods for cross-modal learning, failing to fully leverage the complex semantic relationships between different modalities [15]. For example, there is rich cross-modal information between images and text, and single-modal representations cannot comprehensively capture the semantic connections between images and text. Therefore, how to process and fuse this information remains a challenge [16]. Secondly, although recent works have attempted to introduce multimodal alignment strategies, there is still a lack of systematic and in-depth research on the feature representation and alignment methods for different modalities [17]. Most models adopt rough alignment strategies, ignoring the fine-grained alignment of multimodal information, which leads to instability in performance during real-world applications, especially when facing different data sources and scenarios [18]. Therefore, optimizing the alignment mechanism for multimodal features and improving cross-modal semantic understanding is still a pressing issue in news retrieval tasks.

To address the aforementioned issues, we propose a novel multimodal news retrieval model—NewNet. Unlike existing methods, NewNet adopts a more refined cross-modal alignment mechanism by introducing the trainable label embedding (TLE) mechanism and multimodal feature fusion strategies, significantly improving the semantic alignment between text and images. Furthermore, NewNet introduces three innovative alignment strategies to optimize multimodal learning. Firstly, we apply parameter consistency by jointly learning the parameters of both text and image modalities. This allows the model to better capture the semantic relationships between modalities through shared parameters, enabling text and image features to complement each other and enhancing the overall retrieval performance. Next, we enhance feature similarity alignment by calculating the similarity between text and image features, which improves the semantic feature matching between the two modalities and enables more accurate representations in multimodal learning. Finally, we utilize logits calibration through knowledge distillation, aligning the output logits to enhance the model’s generalization ability and stability. This strategy optimizes the model’s classification capabilities by aligning the outputs of smaller models with those of larger, pretrained models, particularly in scenarios with fewer samples or imbalanced data distributions.

The three contributions of this paper are as follows:

This study effectively integrates the features of text and images by designing a shared multimodal embedding space and enhances the matching accuracy between different modalities in the retrieval system through optimized similarity comparisons within this space.
This paper introduces an LAM (Learnable Alignment Module) to generate TLEs. By explicitly modeling the semantic relationships between text and categories, this module enhances the stability of text representations and improves the cross-modal matching between text and images, thereby increasing the retrieval accuracy.
This paper proposes three cross-modal alignment losses: parameter consistency, semantic feature matching, and logits calibration. These losses optimize text–image alignment at three different levels: parameter, feature, and logits, allowing the model to learn cross-modal relationships more accurately. As a result, the retrieval system maintains high performance across various news retrieval tasks.
The structure of this paper is as follows: Section 2 reviews the related work, focusing on recent advances in text–image retrieval and multimodal retrieval. It highlights the strengths and limitations of existing approaches, particularly in terms of cross-modal alignment and semantic matching. Section 3 presents the proposed method in detail, introducing the overall architecture of the multimodal news retrieval model, NewNet, and elaborating on the design and implementation of the Learnable Alignment Module (LAM) and the three key alignment strategies. Section 4 describes the experimental evaluation, where the proposed method is thoroughly assessed on four public datasets—Visual News, MMED, N24News, and EDIS—across multiple performance metrics. Finally, Section 5 concludes the paper by summarizing the main contributions and outlining potential future directions in multimodal retrieval research.

2. Related Work

2.1. Advancements in Image Retrieval

Multimodal information retrieval has become a key research area in recent years, aiming to improve the accuracy and efficiency of retrieving relevant data from various modalities, such as text, images, and even videos [19,20]. The rapid development of deep learning and neural networks has significantly advanced this field, making the effective integration of different modalities possible [21].

One of the most representative models in multimodal learning is CLIP, introduced by OpenAI [22]. CLIP is trained on a large-scale dataset of text–image pairs to create a shared embedding space where both text and image features can be aligned [23]. This enables the model to perform zero-shot image classification, as well as text-to-image retrieval and image-to-text retrieval tasks, without requiring task-specific training [24]. CLIP has greatly advanced multimodal retrieval by leveraging contrastive learning, which aligns textual descriptions with visual representations [25]. Following CLIP, several other models have been proposed to further improve the integration of text and images. BLIP (Bootstrapping Language-Image Pretraining) enhances multimodal retrieval by introducing a novel pretraining method, where a generative language model is used to generate descriptive captions for images [26]. This not only improves text-to-image retrieval but also enhances image captioning tasks. Additionally, ALBEF (Align Before Fuse) proposed an effective method for aligning visual and textual features at the feature level before fusion, resulting in improved performance in downstream tasks like visual question answering (VQA) and image retrieval [27]. Another notable model is CoCa (Contrastive Captioner), which takes a contrastive approach for aligning text and image representations while generating captions [28]. CoCa’s ability to model the relationship between images and text has proven effective in various retrieval tasks. Similarly, Flamingo utilizes a few-shot learning approach, enabling the model to adapt to new tasks with minimal examples, which makes it highly effective for tasks requiring generalization from limited data. More recently, BLIP-2 introduced a new paradigm in multimodal retrieval by using vision–language pretraining to incorporate both dense image features and text descriptions in a more efficient and scalable manner [29,30]. Kosmos-1, another important model, extended multimodal learning to more complex reasoning tasks by integrating textual and visual information into a single model capable of more comprehensive decision-making [31]. LLaVA (Large Language and Vision Assistant) further pushed the boundary by enhancing the conversational abilities in multimodal settings, offering improved multimodal dialogue generation and information retrieval based on both images and text (Liu et al., 2024) [32].

These models, while differing in their specific architectures and objectives, all aim to improve the alignment between textual and visual data [33]. They demonstrate significant advancements in both feature extraction and fusion techniques, enhancing the quality of multimodal retrieval tasks such as text-to-image and image-to-text retrieval, as well as tasks like visual question answering and caption generation [34]. Despite their success, challenges still remain in handling more complex, dynamic, and cross-modal relationships, and improving scalability for real-world applications. Our work, NewNet, builds upon these advancements by proposing novel alignment strategies and a more robust framework for multimodal news retrieval [35].

2.2. Advancements in Multimodal Retrieval

In the field of multimodal information retrieval, aside from the alignment of text and images, many other research directions focus on improving the performance of multimodal systems [36,37]. In recent years, more and more studies have focused on how to utilize cross-modal information for deeper semantic understanding and reasoning, offering new perspectives and methods for multimodal retrieval tasks [38].

One important direction in this field is vision–language models. Many advanced models have made significant breakthroughs in combining large-scale language models such as T5 and GPT-3 with visual features [39]. By integrating image content with textual descriptions, these models are capable of understanding and generating natural language text that describes images, and can also query relevant images through natural language input. For example, models like ViLT and VisualBERT directly embed visual features into text encoders, enhancing the alignment between text and images through a shared latent space, thus improving the performance of multimodal tasks [40]. In addition to the integration of language and vision, audio and video modalities have also become an important part of multimodal retrieval research. For instance, models like MERLOT and LXMERT improve the model’s ability to understand complex multimodal data by jointly learning the relationships between video, text, and audio signals. In multimodal video retrieval tasks, these methods improve the accuracy and efficiency of video retrieval by handling temporal sequence information in videos and the associations between audio and visual content [41]. Furthermore, improving cross-modal alignment strategies has also been a key research area in multimodal retrieval [42]. Traditional methods primarily rely on feature learning from a single modality, but recent advancements in contrastive learning and self-supervised learning have provided new approaches for cross-modal alignment [43,44]. Contrastive learning methods such as SimCLR and BYOL, by optimizing the distance between image and text representations, have successfully enhanced the alignment between images and text [45]. Self-supervised learning methods like CLIP and ALIGN, through pretraining on large-scale unlabeled data, have further strengthened the capability of cross-modal learning, enabling models to generalize more effectively to new tasks [43].

Despite these advancements, challenges remain in improving retrieval performance in more complex multimodal scenarios, such as those involving different data sources and information types, as found in news retrieval [46]. Most existing methods focus on single modalities or basic cross-modal alignment. The deeper exploration of complex relationships between modalities, as well as improving model stability and generalization ability, are still open problems in the future of multimodal retrieval [47].

3. Methodology

3.1. Overall Network

This method aims to construct an efficient multimodal news retrieval model named NewNet. As shown in Figure 1, we propose a dual-stream framework. In this framework, the text encoder uses a Transformer network to process the input text and generate a vector representation of the text. On the other hand, the image encoder adopts a Vision Transformer (ViT) architecture to process the input image and extract its features. The features of both the text and the image are then mapped into a shared multimodal embedding space, enabling similarity comparisons within this space.

To further optimize the alignment between text and image, we introduce a trainable label embedding mechanism to enhance the expressiveness of the text representation. Specifically, the text input passes through the LAM, which is responsible for semantically expanding the text input and generating label embeddings related to categories or features. These embeddings are then fed into the text encoder for processing, greatly enhancing the representational power of the text.

To better achieve cross-modal modeling, we propose three cross-modal alignment strategies aimed at improving the model’s performance in cross-modal understanding. Firstly, the parameter consistency strategy employs the LAM mechanism to ensure consistency in the parameter learning of both the text encoder and image encoder. We introduce frozen label embeddings as a supervisory signal and optimize the loss function to adjust the features of both modalities, ensuring that they maintain the same distribution structure in the shared space. This enhances the expressiveness of cross-modal features. Secondly, the semantic feature matching strategy calculates the cosine similarity loss between text and image features to ensure that semantic information from both modalities aligns as closely as possible in the shared space. This optimization process promotes stronger associations between text and image representations, thereby improving the accuracy of cross-modal retrieval. Finally, the logits calibration strategy employs knowledge distillation to use the classification output of a pretrained model as a guiding signal to refine the classification probabilities of the target model. By minimizing the knowledge distillation loss, the model achieves better generalization to unseen data, particularly in cases of imbalanced datasets or scarce samples, thus further enhancing the robustness of the cross-modal retrieval task.

3.2. Learnable Alignment Module

In multimodal news retrieval tasks, textual and visual representations exhibit significant modality differences, making their alignment in a shared space a critical challenge. To address this issue, we propose the LAM, whose core function is to generate TLEs. This enhances the category awareness of text representations and facilitates effective cross-modal information fusion. Given an input text sequence,

X = {x_{1}, x_{2}, \dots, x_{N}}

, a traditional text encoder

f_{T} (\cdot)

learns textual representations solely based on word context:

h_{T} = f_{T} (X)

(1)

However, this approach lacks global category information modeling, leading to misalignment in cross-modal tasks where textual features may not directly correspond to visual features. Therefore, we introduce the LAM at the input stage of the text encoder. This module generates trainable label embeddings, enriching text features with category-related information to enhance semantic representation.

The core idea of the LAM is to learn a category embedding matrix

L \in ℝ^{C \times d}

, where

C

is the number of categories, and

d

is the embedding dimension. For each input text, we first compute its category weight distribution:

w_{X} = g (X) \in ℝ^{C}

(2)

where

g (X)

is a category prediction network, such as an MLP or an attention-based weighting generator.

Finally, we integrate the original text features with the trainable label embedding to obtain the final text representation:

h_{L A M} = α h_{T} + (1 - α) W_{X}

(3)

where the parameter

α

is a learnable balancing factor used to adjust the contribution of the original text encoding

h_{T}

and the category embedding

w_{X}

.

By explicitly modeling category information, text representations become more discriminative, facilitating alignment with visual features in the shared multimodal space. The LAM allows category embeddings to be optimized during training, enabling adaptive adjustment of text–category relationships and improving the retrieval accuracy. In multimodal tasks, the LAM serves as a bridge, allowing text to express not only its semantic meaning but also align better with visual features.

To ensure the stability of the LAM, we introduce a regularization constraint during training:

L_{r e g} = λ ‖ L ‖^{2}

(4)

where

λ

is the regularization coefficient.

The final training objective combines text encoding loss and regularization loss, jointly optimizing the trainable label embeddings generated by the LAM to effectively adapt to cross-modal retrieval tasks.

3.3. Transfer Learning Strategy

In the multimodal news retrieval task, to efficiently learn joint representations of text and images, we adopt a transfer learning strategy by leveraging the CLIP pretrained model as the backbone for the text encoder and image encoder. CLIP aligns text and images in a shared embedding space through contrastive learning on a large-scale text–image paired dataset. Therefore, we utilize CLIP’s pretrained weights for parameter transfer and fine-tune them within our multimodal framework to optimize performance for the news retrieval task.

Assume the text–image paired dataset is given by

D = {(T_{i}, I_{i})}_{i = 1}^{N}

where

T_{i}

represents the text input and

I_{i}

represents the corresponding image input. The CLIP pretrained text encoder

f_{T} (\cdot)

and image encoder

f_{I} (\cdot)

learn a shared representation space by maximizing the contrastive similarity between text and its matching image. Therefore, in our framework, we directly initialize the text encoder and image encoder with the pretrained CLIP model:

h_{T} = f_{T} (T), h_{I} = f_{I} (I)

(5)

where

h_{T}, h_{I} \in ℝ^{d}

are the embedding vectors for text and image, respectively, and

d

is the dimension of the shared representation space.

To adapt to the news retrieval task, we perform task-adaptive fine-tuning (TAF) based on the pretrained CLIP encoder. Specifically, we design a Parameter Adaptation Layer (PAL) on top of the CLIP pretrained model to perform lightweight adjustments on the encoder outputs:

h_{T^{'}} = W_{T} h_{T} + b_{T}

(6)

h_{I^{'}} = W_{I} h_{I} + b_{I}

(7)

where

W_{T}, W_{I} \in ℝ^{d \times d}

and

b_{T}, b_{I} \in ℝ^{d}

are learnable parameter matrices and bias vectors, which adjust the original CLIP features for task-specific adaptation.

3.4. Loss Function

In the multimodal news retrieval task, to ensure that text and image features are effectively aligned in the shared representation space, we design three key loss functions: parameter consistency, semantic feature matching, and logits calibration. These loss functions optimize cross-modal learning from three aspects: model parameter alignment, feature representation matching, and classification output alignment, thereby improving the retrieval accuracy and stability.

Parameter Consistency. In multimodal learning, feature spaces of different modalities often exhibit distribution discrepancies, which can affect cross-modal matching effectiveness. To address this issue, we introduce the parameter consistency loss

L_{OP}

, which constrains the mapping between the TLE and the frozen label embedding, ensuring the stability of text features in the category space.

Let

L_{trainable} \in ℝ^{C \times d}

denote the trainable class embeddings and

L_{frozen} \in ℝ^{C \times d}

denote the frozen class embeddings. The parameter consistency loss is defined as follows:

L_{OP} = \frac{1}{C} \sum_{i = 1}^{C} ‖ L_{trainable, i} - L_{frozen, i} ‖_{2}^{2}

(8)

where

C

is the number of categories, and

L_{trainable, i}

and

L_{frozen, i}

represent the trainable and frozen embeddings for the

i

-th category, respectively.

This loss ensures that during training, the trainable class embeddings do not deviate significantly from the pretrained model’s class representations, preventing overfitting or representation drift. Additionally, it enhances the category awareness of text features, making them more suitable for cross-modal matching tasks.

Semantic Feature Matching. Semantic consistency between text and images is crucial for cross-modal retrieval. Therefore, we design the semantic feature matching loss

L_{\cos}

, which measures the similarity between text and image features in the shared space and optimizes their alignment.

Let

h_{T} \in ℝ^{d}

denote the text feature generated by the text encoder, and

h_{I} \in ℝ^{d}

denote the image feature generated by the image encoder. We employ cosine similarity loss to measure their semantic similarity:

S (h_{T}, h_{I}) = \frac{h_{T} \cdot h_{I}}{‖ h_{T} ‖ ‖ h_{I} ‖}

(9)

The goal is to maximize the similarity of positive pairs (matching text–image samples) while minimizing the similarity of negative pairs. We adopt a contrastive learning formulation and define the loss function as follows:

L_{\cos} = - \sum_{i = 1}^{N} \log \frac{\exp (S (h_{T_{i}}, h_{I_{i}}) / τ)}{\sum_{j = 1}^{N} \exp (S (h_{T_{i}}, h_{I_{j}}) / τ)}

(10)

where

τ

is the temperature parameter, and

(h_{T_{i}}, h_{I_{i}})

forms a positive sample pair, while

h_{I_{j}}

represents a distractor sample from all image features.

This loss function encourages the model to learn aligned cross-modal features, making text and images closer in the shared embedding space while ensuring clear separation between negative samples to prevent semantic confusion.

Logits Calibration. To further enhance model classification stability and generalization, we introduce the logits calibration loss

L_{KD}

, utilizing knowledge distillation (KD) to align the classification prediction distribution of the current model with that of a pretrained frozen teacher model.

Let the teacher model output logits be

z_{t}

and the student model output logits be

z_{s}

. We employ Kullback–Leibler (KL) divergence to measure the distribution difference:

L_{KD} = \sum_{i = 1}^{N} D_{KL} (σ (z_{s}^{i} / T) ‖ σ (z_{t}^{i} / T))

(11)

where

σ (\cdot)

denotes the softmax function,

T

is the temperature parameter for smoothing the distribution and improving robustness, and

D_{KL} (p ‖ q)

denotes the KL divergence, defined as follows:

D_{KL} (p ‖ q) = \sum_{j} p_{j} \log \frac{p_{j}}{q_{j}}

(12)

The optimization of this loss function aims to make the student model’s output probability distribution as close as possible to that of the teacher model, allowing it to learn the classification decisions of the teacher model and reducing generalization errors caused by data noise or class imbalance.

Final Loss Function. Combining the three loss components above, our final optimization objective is formulated as follows:

L = λ_{1} L_{OP} + λ_{2} L_{\cos} + λ_{3} L_{KD}

(13)

where

λ_{1}, λ_{2}, λ_{3}

are hyperparameters controlling the weights of parameter consistency, semantic feature matching, and logits calibration losses, respectively.

4. Experiments and Results

4.1. Datasets

This study uses four public multimodal news datasets, Visual News, MMED (Multi-domain and Multi-modality Event Dataset), N24News, and EDIS (Entity-Driven Image Search).

The Visual News dataset [48] is a large-scale news dataset that contains more than 1 million news pictures and their corresponding news articles, covering multiple news fields such as politics, economy, science and technology, sports, and culture. In this study, we screened 217,430 high-quality text–image pairs from the dataset, including 150,616 training sets, 30,588 validation sets, and 36,226 test sets.

The MMED dataset [49] focuses on cross-domain and cross-modal news event analysis, covering 412 real-world news events, consisting of 25,165 news articles and 76,516 images. This study selected 30,480 data from the MMED dataset and divided them into 20,365 training sets, 4690 validation sets, and 5425 test sets. When screening data, we ensure that the selected samples cover multiple news events to enhance the generalization ability of the model, and remove news illustrations with low resolution or blur to ensure the quality of training data.

The N24News dataset [50] consists of news from the New York Times, including 24 categories, covering topics such as politics, finance, technology, health, entertainment, and sports, and is suitable for multimodal news classification and cross-modal retrieval tasks. This study selected 28,216 news samples from the N24News dataset and divided them into 19,324 training sets, 3467 validation sets, and 5425 test sets.

The EDIS dataset [51] is a cross-modal image retrieval dataset for the news field, containing 1 million web page images, each with a detailed text description. This study screened 19,674 samples from the dataset and divided them into 14,002 training sets, 2678 validation sets, and 2994 test sets.

Data Preprocessing. In order to ensure the consistency of different datasets, we performed unified text and image preprocessing on all data. In terms of text processing, we removed irrelevant information such as special symbols and HTML tags, and used WordPiece word segmentation (for English datasets) or BPE (Byte Pair Encoding) word segmentation (for mixed language data). In addition, we normalized the text length to ensure that short and long texts maintained a uniform length distribution when input encoding. In terms of image processing, we uniformly scaled images to 224 × 224 to adapt to the input requirements of the CLIP pretrained model, and used standard normalization and random augmentation to improve the generalization ability of the model.

In the process of data selection and division, we primarily considered the following key factors: Firstly, the original dataset is large, and directly using all the data for training and testing would result in excessively high computational costs. Therefore, we selected representative samples to ensure the feasibility of the experiment. Specifically, we used data diversity and computational resource limitations as selection criteria, ensuring that the chosen samples cover the main features of the dataset and avoid computational bottlenecks caused by the large scale of the dataset. Secondly, to improve data quality, we removed low-quality samples, with criteria including samples with missing content, those with mismatched text and images, and other data that did not meet the expected quality standards. Thirdly, we ensured that the samples from different news categories were evenly distributed to avoid overfitting or bias in the model caused by data imbalance. By evenly distributing the samples, we enhanced the model’s ability to adapt to all news categories, preventing performance degradation caused by imbalanced categories. Additionally, we selected datasets from multiple sources to enhance the model’s generalization ability, making it adaptable to various news retrieval scenarios. Through careful data screening and division, we ensured that the model could maintain excellent performance under different data distributions and application scenarios.

The datasets selected for this study cover mainstream news reports (Visual News, N24News), social media event data (MMED), and search engine news image data (EDIS), ensuring data diversity and enabling the model to better adapt to actual news retrieval tasks. Through careful data screening and division, we ensure that the model can maintain good performance under different data distributions and application scenarios.

4.2. Experimental Details

Experimental Environment. This study was conducted in a high-performance computing environment, utilizing a multi-GPU server for deep learning model training and inference. An efficient data processing framework was employed to ensure the reproducibility and stability of experiments. The experimental environment consists of both hardware and software configurations, detailed as follows. The experiments were performed on a high-performance computing server equipped with four NVIDIA A100 GPUs (80 GB HBM2e memory per GPU (NVIDIA Corporation, Santa Clara, CA, USA)), supporting large-scale deep learning tasks with parallel computation. The server’s CPU is an AMD EPYC 7742 (64 cores, 2.25 GHz (AMD, Santa Clara, CA, USA)), providing efficient computational support for data preprocessing and model training. Additionally, the system is equipped with 1 TB DDR4 memory (CXMT, Hefei, China) and an 8 TB NVMe SSD (SK Hynix, Seoul, Republic of Korea) for data storage, ensuring fast data loading and training processes. The experiments were conducted on Ubuntu 20.04 LTS as the operating system, with Python 3.9.12 used for model development and training. PyTorch 2.0.1 was chosen as the deep learning framework, with CUDA 11.8 and cuDNN 8.6 utilized for GPU acceleration to enhance computational efficiency. During multi-GPU training, NVIDIA NCCL (NVIDIA Collective Communications Library) was used for inter-GPU communication, and PyTorch DistributedDataParallel (DDP) was employed for distributed training. Pretrained models were loaded using Hugging Face Transformers 4.29.2, supporting CLIP encoder transfer learning. For efficient data processing, Pandas 1.5.3 and NumPy 1.23.5 were used for data manipulation, while OpenCV 4.6.0 was applied for image preprocessing. Additionally, TorchVision 0.15.2 was used for data augmentation. To facilitate large-scale text–image retrieval, Weaviate 3.1 was employed as a vector database, and Faiss 1.7.4 was used for efficient similarity search.

To address reproducibility on lower-end hardware, we report that our base model requires approximately 42 GB VRAM during training with a batch size of 32. On more accessible GPUs such as the NVIDIA RTX 3090 (24 GB), training can still be performed by reducing the batch size to 8 and enabling gradient accumulation. Additionally, all models support mixed-precision (FP16) training to significantly reduce memory usage.

Throughout the experiments, FP16 (half-precision computation) was used to improve computational efficiency. The AdamW optimizer was applied for gradient optimization, with an initial learning rate of

3 \times 10^{- 5}

and a weight decay of 0.01. All experiments were conducted within a Docker 23.0.1 container environment to ensure reproducibility. Additionally, Weights & Biases (WandB) was used for experiment tracking and hyperparameter tuning.

Baseline Model. In this paper, we compare several representative baseline models to comprehensively evaluate the performance of different methods in the visual–text matching task. These models include VSE++ [52], which emphasizes optimizing the visual–text embedding space; SCAN [53], which leverages stacked cross-attention mechanisms to capture fine-grained semantic alignment; and MTFN [54], which employs a multi-task fusion strategy and demonstrates outstanding performance across various vision–language tasks. Additionally, we incorporate SAM [55], which enhances feature alignment through a dynamic attention mechanism; AMFMN [56], which focuses on multi-scale feature encoding and fusion; and LW-MCR [57], a lightweight model designed for resource-constrained environments. Furthermore, we include MAFA-Net [57], which employs a multi-head attention fusion strategy to address complex scenarios; and FBCLM [57], which leverages contrastive learning strategies to improve generalization performance. Each of these models has unique strengths, covering a range of optimization strategies from accuracy enhancement to lightweight design, providing a comprehensive benchmark for our comparative analysis.

4.3. Experimental Results

Results on Visual News. On the Visual News dataset, our method achieved significant performance improvements in text retrieval. As shown in Table 1, our model achieved an R@10 of 91.25%, significantly outperforming other approaches. Similarly, in image retrieval, our model excelled, reaching an R@1 of 40.30% and R@5 of 75.15%, surpassing existing methods by several percentage points. These results indicate that our approach effectively captures the semantic relationship between news text and images, thereby improving the retrieval accuracy. Since Visual News primarily consists of news from mainstream media, where the text is well-structured and the image–text alignment is relatively high, our method leverages parameter consistency alignment and semantic feature matching to enhance the consistency of textual and visual features, leading to improved retrieval performance.

Results on MMED. As shown in Table 1, our method achieved substantial improvements in R@1 and mR compared to other methods on the MMED dataset. Particularly in text retrieval, our model reached an R@10 of 69.09%, significantly higher than existing methods. In image retrieval, our model also performed exceptionally well, achieving an R@10 of 83.27%. These results demonstrate the strong generalization ability of our method on social media and multi-domain news event datasets.

The MMED dataset includes news articles from diverse sources, featuring varied text styles and inconsistent image quality, posing a greater challenge for model robustness. The experimental results indicate that our proposed TLE mechanism effectively enhances the category awareness of text representations, allowing the model to maintain high accuracy even in complex cross-modal matching tasks.

Results on N24News. On the N24News dataset, as shown in Table 2, our method achieved an R@10 of 87.53% in text retrieval, outperforming the best existing method by several percentage points. In image retrieval, the R@5 and R@10 increased to 78.65% and 92.14%, respectively. These results demonstrate that our model adapts well to category-structured news data, ensuring more precise text–image matching.

Since the N24News dataset includes 24 different news categories, traditional methods struggle to model cross-category semantic relationships effectively. However, our logits calibration mechanism optimizes the distribution across different categories, allowing the model to achieve high performance across all news types.

Results on EDIS. As shown in Table 2, our method achieved an R@10 of 88.42% in image retrieval on the EDIS dataset, significantly outperforming other approaches. In text retrieval, the R@5 and R@10 reached 65.28% and 81.56%, respectively, surpassing all competing methods.

Since EDIS is a search engine-based cross-modal retrieval dataset, it contains substantial noise in data quality and text–image alignment, making retrieval particularly challenging. The experimental results demonstrate that our method maintains stable performance on this dataset, validating the effectiveness of our multimodal feature alignment strategy. Specifically, in handling imbalanced search engine data, our method leverages cross-modal matching optimization to reduce false matches, thereby enhancing the retrieval accuracy.

Figure 2 and Figure 3 provide visual representations of Table 1 and Table 2, respectively, allowing a more intuitive view of the advantages of our model. Our method achieved the best performance across all four datasets due to the introduction of the LAM, combined with parameter consistency alignment, semantic feature matching, and logits calibration. These components significantly optimize text–image cross-modal matching. Moreover, our model exhibited remarkable improvements in low-recall scenarios (R@1), demonstrating its capability to retrieve the most relevant news results with high precision without relying on post-filtering steps. Compared to traditional methods, our model not only captures deep semantic relationships between news text and images but also maintains high generalization ability across different types of news data. This advantage makes our approach particularly effective in real-world news retrieval applications, providing a superior user experience.

Our model performs excellently on four datasets (Visual News, MMED, N24News, and EDIS), mainly due to several innovations and optimizations. Firstly, the LAM and TLE mechanism significantly improve the cross-modal alignment accuracy. Particularly in low-recall scenarios (R@1), the model effectively captures the deep semantic relationships between text and images, thereby enhancing the retrieval performance. Secondly, by introducing multi-level alignment strategies such as logits calibration, parameter consistency, and semantic feature matching, the model further optimizes cross-modal learning ability, ensuring consistency and stability in complex datasets like MMED. In handling diverse text styles and inconsistent image quality, the TLE mechanism strengthens the model’s category awareness, enabling efficient retrieval across different news categories, with notable performance in the MMED dataset. Additionally, the cross-modal matching optimization mechanism improves the model’s robustness, maintaining high retrieval accuracy even in noisy data and imperfect alignments, especially demonstrating excellent stability and adaptability on the EDIS dataset.

4.4. Ablation Study

To assess the impact of each key module on the overall model performance, we conducted an ablation study, where we systematically removed logits calibration, parameter consistency, and semantic feature matching, and evaluated their effects on text retrieval and image retrieval tasks. The experimental results are shown in Table 3. Removing any one of these modules results in a decrease in model performance, indicating the importance of these modules in cross-modal retrieval tasks.

As shown in Table 3, when the logits calibration mechanism was removed, the text retrieval R@1 dropped to 16.91%, image retrieval R@1 dropped to 14.62%, and mR decreased to 38.8. This result indicates that logits calibration plays a crucial role in optimizing category distribution and enhancing the model’s generalization ability. The mechanism adjusts the model’s prediction distribution through knowledge distillation, making the retrieval targets across different categories more balanced, which effectively reduces the impact of class imbalance on model performance. Without this module, the model’s ability to retrieve low-frequency categories decreases, resulting in lower overall recall.

Removing Parameter Consistency. After removing parameter consistency, the text retrieval R@1 dropped to 17.57%, image retrieval R@1 dropped to 17.66%, and mR decreased to 41.05. The role of this mechanism is to constrain the learning of TLE, ensuring that it does not deviate from pretrained category representations during training. The experimental results show that removing this module leads to a significant shift in the feature distribution of text and image, making it difficult for the model to establish stable cross-modal representations, which in turn affects retrieval accuracy.

Removing Semantic Feature Matching. As shown in Table 3, removing semantic feature matching led to a significant decrease in performance, with the text retrieval R@1 dropping to 13.59%, image retrieval R@1 dropping to 11.86%, and mR decreasing to 35.83. This indicates that this module has the most significant impact on model performance. The role of semantic feature matching is to enforce consistency between text and image features in the shared embedding space using cosine similarity loss, which enhances the cross-modal matching ability. Without this module, the model struggles to effectively align text and image features, resulting in a decrease in recall for retrieval tasks.

Compared to the results of the ablation study, the full NewNet model achieved the best retrieval performance, with the text retrieval R@1 increasing to 19.23%, image retrieval R@1 increasing to 17.95%, and mR reaching 41.23. The full model combines logits calibration, parameter consistency, and semantic feature matching, ensuring the stability and accuracy of cross-modal retrieval. The significant improvement in low-recall rates (R@1) further demonstrates that our method can retrieve the most relevant results more precisely, without relying on candidate set filtering.

The ablation study shows that the three modules—logits calibration, parameter consistency, and semantic feature matching—play a critical role in improving the retrieval performance. Among them, the semantic feature matching module is the most crucial for text–image alignment, as its removal results in the most significant performance drop. The parameter consistency mechanism ensures the stability of text and image features in the shared space, improving the matching accuracy. The logits calibration mechanism optimizes the category distribution, making the retrieval results more balanced across different categories. These results further validate the effectiveness of the alignment strategy proposed in this paper for multimodal news retrieval tasks.

4.5. Limitations and Future Directions

Although the multimodal news retrieval method proposed in this paper has achieved significant performance improvements on multiple datasets, there are still some limitations. Firstly, while the proposed multimodal alignment strategy has yielded good results in handling cross-modal data noise and inconsistency, there is still room for improvement in certain scenarios with strong noise. For example, in the EDIS dataset, where the image–text matching is relatively low, our model performs robustly but is still challenged by image quality and text diversity. Future work could focus on enhancing noise filtering and robustness training, allowing the model to maintain high retrieval accuracy when facing low-quality and inconsistent data.

Secondly, the model’s computational efficiency and training time still have room for improvement. Although we have utilized multi-GPU training and some acceleration strategies, the complexity of the model, especially when handling large-scale datasets, results in longer training times. Future work could explore lightweight model designs and techniques such as knowledge distillation to reduce the computational burden of the model and improve the inference speed, particularly for scenarios that require quick responses in real-world applications.

In addition, this paper primarily focuses on news image and text retrieval tasks. Future work could extend this method to more complex multimodal tasks, such as multimodal sentiment analysis, cross-modal reasoning, and multimodal dialogue generation. The richness and timeliness of news data make it an important application scenario. However, with the continuous development of multimodal data, more cross-domain tasks and multi-task learning will become new research directions. To this end, the model presented in this paper could be further optimized to handle the fusion of more modalities, such as incorporating video and audio information, providing richer contextual support for multimodal understanding.

Additionally, while the proposed model achieves good results on several publicly available datasets, these datasets often have certain biases and limitations, making it difficult to cover the complexities of all real-world application scenarios. Therefore, future research could focus on collecting and annotating more diverse, large-scale multimodal datasets, including more varied news sources and event types, thus expanding the application of multimodal news retrieval methods to a wider range of real-world scenarios.

Finally, with the application of the model to large-scale datasets, future research could further explore lightweight strategies to enhance the model’s efficiency and scalability. Knowledge distillation, as an effective lightweight technique, reduces the model’s parameter size and computational complexity by training a smaller student model to mimic the behavior of a larger teacher model. In the context of multimodal news retrieval, the application of knowledge distillation not only allows for maintaining a high retrieval performance while reducing computational resources but also accelerates the inference speed to meet the fast-response demands of real-world applications. Future research could combine knowledge distillation with other optimization techniques to further improve the performance of multimodal retrieval systems in resource-constrained environments, facilitating their widespread deployment in practical applications.

5. Conclusions

This paper proposes a news retrieval method based on multimodal alignment, combining text and image information. By introducing trainable label embedding and multimodal alignment strategies (logits calibration, parameter consistency, semantic feature matching), the method effectively improves the performance of multimodal news retrieval. The experimental results demonstrate that our method achieves significant performance improvements on four datasets, Visual News, MMED, N24News, and EDIS, especially excelling in low-recall rates (R@1), where it can more precisely retrieve the most relevant results.

Through ablation experiments, we further verify the importance of logits calibration, parameter consistency, and semantic feature matching in cross-modal retrieval. These experiments prove the key role each module plays in enhancing the alignment ability of text and image features. Furthermore, the pretraining model transfer learning strategy based on CLIP proposed in this paper provides an effective solution for multimodal retrieval tasks. In conclusion, the method proposed in this paper offers an effective solution for multimodal news retrieval, achieving good performance across different datasets and tasks, and providing valuable insights for future research and applications in multimodal retrieval.

Author Contributions

Conceptualization, R.S. and J.T.; methodology, R.S.; software and validation, P.Z.; formal analysis, B.C.; writing—original draft preparation, R.S.; writing—review and editing, R.S. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the China Postdoctoral Science Foundation under Grant Number 2024M752586, Young Talent Fund of Association for Science and Technology in Shaanxi under Grant 20240105, Shaanxi Provincial Natural Science Foundation under Grant 2024JC-YBQN-0620, and Shaanxi Province Postdoctoral Research Funding Project under Grant 2023BSHYDZZ20.

Data Availability Statement

Data available on request from the authors.

Conflicts of Interest

The authors declare no conflict of interest.

References

Mi, L.; Li, S.; Chappuis, C.; Tuia, D. Knowledge-aware cross-modal text-image retrieval for remote sensing images. In Proceedings of the Second Workshop on Complex Data Challenges in Earth Observation (CDCEO 2022), Vienna, Austria, 23–25 July 2022. [Google Scholar]
Mishra, S.; Shukla, P.; Agarwal, R. Analyzing machine learning enabled fake news detection techniques for diversified datasets. Wirel. Commun. Mob. Comput. 2022, 2022, 1575365. [Google Scholar] [CrossRef]
Cheng, L.; Zhu, P.; Tang, K.; Gao, C.; Wang, Z. GIN-SD: Source detection in graphs with incomplete nodes via positional encoding and attentive fusion. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
Li, Y.; Pan, Y. A novel ensemble deep learning model for stock prediction based on stock prices and news. Int. J. Data Sci. Anal. 2022, 13, 139–149. [Google Scholar] [CrossRef]
Aslam, N.; Ullah Khan, I.; Alotaibi, F.S.; Aldaej, L.A.; Aldubaikil, A.K. Fake detect: A deep learning ensemble model for fake news detection. Complexity 2021, 2021, 5557784. [Google Scholar] [CrossRef]
Kaliyar, R.K.; Goswami, A.; Narang, P. FakeBERT: Fake news detection in social media with a BERT-based deep learning approach. Multimed. Tools Appl. 2021, 80, 11765–11788. [Google Scholar] [CrossRef] [PubMed]
Zhu, P.; Hua, J.; Tang, K.; Tian, J.; Xu, J.; Cui, X. Multimodal fake news detection through intra-modality feature aggregation and inter-modality semantic fusion. Complex Intell. Syst. 2024, 10, 5851–5863. [Google Scholar] [CrossRef]
Palani, B.; Elango, S.; Viswanathan, K.V. CB-Fake: A multimodal deep learning framework for automatic fake news detection using capsule neural network and BERT. Multimed. Tools Appl. 2022, 81, 5587–5620. [Google Scholar] [CrossRef]
Choudhary, M.; Chouhan, S.S.; Pilli, E.S.; Vipparthi, S.K. BerConvoNet: A deep learning framework for fake news classification. Appl. Soft Comput. 2021, 110, 107614. [Google Scholar] [CrossRef]
Dubey, S.R. A decade survey of content based image retrieval using deep learning. IEEE Trans. Circuits Syst. Video Technol. 2021, 32, 2687–2704. [Google Scholar] [CrossRef]
Tuyet, V.T.H.; Binh, N.T.; Quoc, N.K.; Khare, A. Content based medical image retrieval based on salient regions combined with deep learning. Mob. Netw. Appl. 2021, 26, 1300–1310. [Google Scholar] [CrossRef]
Li, X.; Yang, J.; Ma, J. Recent developments of content-based image retrieval (CBIR). Neurocomputing 2021, 452, 675–689. [Google Scholar] [CrossRef]
Selvakanmani, S.; Ashreetha, B.; Devi, G.N.R.; Misra, S.; Jayavadivel, R.; Perli, S.B. Deep learning approach to solve image retrieval issues associated with IOT sensors. Meas. Sens. 2022, 24, 100458. [Google Scholar] [CrossRef]
Kannagi, A.; Lanke, R. Image Retrieval based on Deep Learning-Convolutional Neural Networks. In Proceedings of the 2022 International Interdisciplinary Humanitarian Conference for Sustainability (IIHC), Bengaluru, India, 18–19 November 2022. [Google Scholar]
Srinivasan, K.; Raman, K.; Chen, J.; Bendersky, M.; Najork, M. Wit: Wikipedia-based image text dataset for multimodal multilingual machine learning. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Montreal, QC, Canada, 11–15 July 2021. [Google Scholar]
Li, Y.; Fan, H.; Hu, R.; Feichtenhofer, C.; He, K. Scaling language-image pre-training via masking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Li, X.; Jin, J.; Zhou, Y.; Zhang, Y.; Zhang, P.; Zhu, Y.; Dou, Z. From matching to generation: A survey on generative information retrieval. ACM Trans. Inf. Syst. 2024, 43, 1–62. [Google Scholar] [CrossRef]
Luo, H.; Ji, L.; Zhong, M.; Chen, Y.; Lei, W.; Duan, N.; Li, T. Clip4clip: An empirical study of clip for end to end video clip retrieval. arXiv 2021, arXiv:2104.08860. [Google Scholar] [CrossRef]
Wang, Z.; Wu, Z.; Agarwal, D.; Sun, J. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; The Association for Computational Linguistics: Stroudsburg, PA, USA, 2022. [Google Scholar]
Xia, Z.; Feng, X.; Lin, J.; Hadid, A. Deep convolutional hashing using pairwise multi-label supervision for large-scale visual search. Signal Process. Image Commun. 2017, 59, 109–116. [Google Scholar] [CrossRef]
Shafiq, M.; Gu, Z. Deep residual learning for image recognition: A survey. Appl. Sci. 2022, 12, 8972. [Google Scholar] [CrossRef]
Lee, J.; Kim, J.; Shon, H.; Kim, B.; Kim, S.H.; Lee, H.; Kim, J. Uniclip: Unified framework for contrastive language-image pre-training. Adv. Neural Inf. Process. Syst. 2022, 35, 1008–1019. [Google Scholar]
Xing, X.; Wang, B.; Ning, X.; Wang, G.; Tiwari, P. Short-term OD flow prediction for urban rail transit control: A multi-graph spatiotemporal fusion approach. Inf. Fusion 2025, 118, 102950. [Google Scholar] [CrossRef]
Liu, R.; Vakharia, V. Optimizing supply chain management through BO-CNN-LSTM for demand forecasting and inventory management. J. Organ. End User Comput. 2024, 36, 1–25. [Google Scholar] [CrossRef]
Ran, J.; Zou, G.; Niu, Y. Deep Learning in Carbon Neutrality Forecasting: A Study on the SSA-Attention-BIGRU Network. J. Organ. End User Comput. 2024, 36, 1–23. [Google Scholar] [CrossRef]
Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, Baltimore, MD, USA, 17–23 July 2022. [Google Scholar]
Li, J.; Selvaraju, R.; Gotmare, A.; Joty, S.; Xiong, C.; Hoi, S.C.H. Align before fuse: Vision and language representation learning with momentum distillation. Adv. Neural Inf. Process. Syst. 2021, 34, 9694–9705. [Google Scholar]
Yu, J.; Wang, Z.; Vasudevan, V.; Yeung, L.; Seyedhosseini, M.; Wu, Y. Coca: Contrastive captioners are image-text foundation models. arXiv 2022, arXiv:2205.01917. [Google Scholar]
Li, H.; Xiong, W.; Cui, Y.; Xiong, Z. A fusion-based contrastive learning model for cross-modal remote sensing retrieval. Int. J. Remote Sens. 2022, 43, 3359–3386. [Google Scholar] [CrossRef]
Li, W.; Cai, Y.; Hanafiah, M.H.; Liao, Z. An empirical study on personalized product recommendation based on cross-border e-commerce customer data analysis. J. Organ. End User Comput. 2024, 36, 1–16. [Google Scholar] [CrossRef]
Li, J.; Li, D.; Savarese, S.; Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023. [Google Scholar]
Liu, S.; Cheng, H.; Liu, H.; Zhang, H.; Li, F.; Ren, T.; Zou, X.; Yang, J.; Su, H.; Zhu, J. Llava-plus: Learning to use tools for creating multimodal agents. In European Conference on Computer Vision; Springer Nature: Cham, Switzerland, 2024. [Google Scholar]
Zhang, L.; Liu, J.; Wei, Y.; An, D.; Ning, X. Self-supervised learning-based multi-source spectral fusion for fruit quality evaluation: A case study in mango fruit ripeness prediction. Inf. Fusion 2025, 117, 102814. [Google Scholar] [CrossRef]
Huang, J.; Yu, X.; An, D.; Ning, X.; Liu, J.; Tiwari, P. Uniformity and deformation: A benchmark for multi-fish real-time tracking in the farming. Expert Syst. Appl. 2025, 264, 125653. [Google Scholar] [CrossRef]
Zhang, H.; Yu, L.; Wang, G.; Tian, S.; Yu, Z.; Li, W.; Ning, X. Cross-modal knowledge transfer for 3D point clouds via graph offset prediction. Pattern Recognit. 2025, 162, 111351. [Google Scholar] [CrossRef]
Tufchi, S.; Yadav, A.; Ahmed, T. A comprehensive survey of multimodal fake news detection techniques: Advances, challenges, and opportunities. Int. J. Multimed. Inf. Retr. 2023, 12, 28. [Google Scholar] [CrossRef]
Yin, S.; Zhu, P.; Wu, L.; Gao, C.; Wang, Z. GAMC: An unsupervised method for fake news detection using graph autoencoder with masking. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024. [Google Scholar]
Müller-Budack, E.; Theiner, J.; Diering, S.; Idahl, M.; Hakimov, S.; Ewerth, R. Multimodal news analytics using measures of cross-modal entity and context consistency. Int. J. Multimed. Inf. Retr. 2021, 10, 111–125. [Google Scholar] [CrossRef]
Pandeya, Y.R.; Lee, J. Deep learning-based late fusion of multimodal information for emotion classification of music video. Multimed. Tools Appl. 2021, 80, 2887–2905. [Google Scholar] [CrossRef]
Bandyopadhyay, D.; Hasanuzzaman, M.; Ekbal, A. Seeing through VisualBERT: A causal adventure on memetic landscapes. arXiv 2024, arXiv:2410.13488. [Google Scholar] [CrossRef]
Zellers, R.; Lu, X.; Hessel, J.; Yu, Y.; Park, J.S.; Cao, J.; Farhadi, A.; Choi, Y. Merlot: Multimodal neural script knowledge models. Adv. Neural Inf. Process. Syst. 2021, 34, 23634–23651. [Google Scholar]
Ma, X.; Li, Y.; Asif, M. E-commerce review sentiment analysis and purchase intention prediction based on deep learning technology. J. Organ. End User Comput. 2024, 36, 1–29. [Google Scholar] [CrossRef]
Hambarde, K.A.; Proenca, H. Information retrieval: Recent advances and beyond. IEEE Access 2023, 11, 76581–76604. [Google Scholar] [CrossRef]
Hua, J.; Cui, X.; Li, X.; Tang, K.; Zhu, P. Multimodal fake news detection through data augmentation-based contrastive learning. Appl. Soft Comput. 2023, 136, 110125. [Google Scholar] [CrossRef]
Wu, H.; Zhang, Z.; Zhu, Z. Cross-Checking-Based Trademark Image Retrieval for Hot Company Detection. J. Organ. End User Comput. 2024, 36, 1–12. [Google Scholar] [CrossRef]
Almayyan, W.I.; AlGhannam, B.A. Detection of Kidney Diseases: Importance of Feature Selection and Classifiers. Int. J. E-Health Med. Commun. 2024, 15, 1–21. [Google Scholar] [CrossRef]
DeMatteo, C.; Jakubowski, J.; Stazyk, K.; Randall, S.; Perrotta, S.; Zhang, R. The Headaches of Developing a Concussion App for Youth: Balancing Clinical Goals and Technology. Int. J. E-Health Med. Commun. 2024, 15, 1–20. [Google Scholar] [CrossRef]
Bondielli, A.; Dell’Oglio, P.; Lenci, A.; Marcelloni, F.; Passaro, L. Dataset for multimodal fake news detection and verification tasks. Data Brief 2024, 54, 110440. [Google Scholar] [CrossRef]
Yang, Z.; Luo, D.; You, J.; Guo, Z.; Yang, Z. Multimodal Conditional VAE for Zero-Shot Real-World Event Discovery. In Lecture Notes in Computer Science, Proceedings of the International Conference on Advanced Data Mining and Applications, Shenyang, China, 27–29 August 2023; Springer Nature: Cham, Switzerland, 2023. [Google Scholar]
Wang, Z.; Shan, X.; Zhang, X.; Yang, J. N24news: A new dataset for multimodal news classification. arXiv 2021, arXiv:2108.13327. [Google Scholar]
Liu, S.; Feng, W.; Fu, T.-J.; Chen, W.; Wang, W.Y. Edis: Entity-driven image search over multimodal web content. arXiv 2023, arXiv:2305.13631. [Google Scholar]
Zhang, H.; Yang, M. Asymmetric Polysemous Reasoning for Image-Text Matching. In Proceedings of the 2023 IEEE International Conference on Data Mining Workshops (ICDMW), Shanghai, China, 1–4 December 2023. [Google Scholar]
Li, L.; Kong, X.; Zhao, X.; Huang, T.; Li, W.; Wen, F.; Zhang, H.; Liu, Y. SSC: Semantic scan context for large-scale place recognition. In Proceedings of the 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Prague, Czech Republic, 27 September–1 October 2021. [Google Scholar]
Cheng, Q.; Zhou, Y.; Huang, H.; Wang, Z. Multi-attention fusion and fine-grained alignment for bidirectional image-sentence retrieval in remote sensing. IEEE/CAA J. Autom. Sin. 2022, 9, 1532–1535. [Google Scholar] [CrossRef]
Yan, K.; Cai, J.; Jin, D.; Miao, S.; Guo, D.; Harrison, A.P.; Tang, Y.; Xiao, J.; Lu, J.; Lu, L. SAM: Self-supervised learning of pixel-wise anatomical embeddings in radiological images. IEEE Trans. Med. Imaging 2022, 41, 2658–2669. [Google Scholar] [CrossRef] [PubMed]
Yuan, Z.; Zhang, W.; Rong, X.; Li, X.; Chen, J.; Wang, H.; Fu, K.; Sun, X. A lightweight multi-scale crossmodal text-image retrieval method in remote sensing. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–19. [Google Scholar] [CrossRef]
Yuan, Z.; Zhang, W.; Tian, C.; Rong, X.; Zhang, Z.; Wang, H.; Fu, K.; Sun, X. Remote sensing cross-modal text-image retrieval based on global and local information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]

Figure 1. NewNet architecture. The text encoder is based on the Transformer network, and the image encoder is based on Vision Transformer. Both modalities are mapped to a shared embedding space and the retrieval performance is enhanced through cross-modal alignment strategies.

Figure 2. Comparison of experimental results on Visual News and MMED datasets.

Figure 3. Comparison of experimental results on N24News and EDIS datasets.

Table 1. Comparison of experimental results on Visual News and MMED datasets. The optimal results are highlighted in bold.

Method	Visual News							MMED
	Text Retrieval			Image Retrieval			mR	Text Retrieval			Image Retrieval			mR
	R@1	R@5	R@10	R@1	R@5	R@10	mR	R@1	R@5	R@10	R@1	R@5	R@10	mR
VSE++ [52]	12.41	44.85	65.82	10.2	31.9	56.95	37.03	24.24	53.55	67.33	6.32	33.66	51.13	39.36
SCAN [53]	14.39	45.82	67.17	12.86	50.48	77.34	44.77	19.06	51.82	74.24	17.69	57.22	76.31	49.35
MTFN [54]	10.57	47.72	64.39	14.29	52.48	79.05	44.75	20.79	51.82	68.33	13.89	55.61	77.69	48.15
SAM [55]	12.38	47.25	76.31	10.65	47.71	93.90	47.95	9.71	34.7	53.9	7.83	28.91	59.76	32.45
AMFMN [57]	16.77	45.81	68.67	12.96	53.34	79.53	46.18	29.41	58.74	67.32	13.55	60.9	71.72	51.82
LW-MCR [57]	13.22	50.46	71.53	15.15	50.48	77.65	46.42	20.79	53.14	77.69	15.62	58.98	80.44	51.11
MAFA-Net [57]	14.61	56.2	95.8	10.4	48.3	84.2	51.53	20.7	76.5	13.2	31.5	82	52.7	50.25
FBCLM [57]	28.65	63.91	82.96	27.43	72.77	94.48	61.72	25.96	56.5	75.91	27.2	70.42	89.78	57.67
Ours	47.34	78.21	91.25	40.30	75.15	94.79	72.11	44.94	69.09	86.32	38.49	69.45	83.27	61.621

Table 2. Comparison of experimental results on N24News and EDIS datasets. The optimal results are highlighted in bold.

Method	N24News							EDIS
	Text Retrieval			Image Retrieval			mR	Text Retrieval			Image Retrieval			mR
	R@1	R@5	R@10	R@1	R@5	R@10	mR	R@1	R@5	R@10	R@1	R@5	R@10	mR
VSE++ [52]	4.58	10.71	18.66	4.02	12.52	19.3	11.63	11.58	28.85	40.8	8.99	26.07	39.87	26.03
SCAN [53]	7.05	14.09	21.04	5.91	17.6	27.93	15.44	12.26	29.78	40.58	11.02	30.58	43.32	27.47
MTFN [54]	6.22	13.72	20.94	6.1	18.37	30.69	16.01	11.6	28.85	38.18	11.16	32.57	47.04	28.12
SAM [55]	14.78	32.8	48.5	12.7	36.9	54.6	33.65	10.58	22.75	38.64	10.79	32.75	46.38	27.14
AMFMN [57]	6.59	16.28	24.6	6.1	19.48	32.64	17.62	11.83	25.98	43.01	12.71	35.89	56.07	30.92
LW-MCR [57]	5.59	14.55	21.49	5.5	20.05	33.67	16.79	10.93	28.97	38.81	10.45	35.27	55.23	29.78
MAFA-Net [57]	13.51	36.9	55.61	14.1	33.6	48.8	33.75	10.87	28.79	40.35	11.25	38.25	50.24	27.65
FBCLM [29]	14.47	28.37	38.8	14.74	39.94	58.14	32.41	14.04	31.73	47.09	11.64	38.21	59.14	33.64
GaLR [1]	7.79	21.1	32.3	6.7	20.8	33.5	20.16	11.62	31.62	37.31	10.55	37.76	52.88	31.61
Ours	16.47	39.25	53.82	9.86	29.76	45.15	31.85	19.23	43.27	54.17	17.95	49.75	68.78	41.23

Table 3. Ablation study on the loss functions of NewNet. The optimal results are highlighted in bold.

Task	Text Retrieval			Image Retrieval			mR
Task	R@1	R@5	R@10	R@1	R@5	R@10	mR
w/o Logits Calibration	16.91	36.82	51.64	14.62	45.96	66.86	38.8
w/o Parameter Consistency	17.57	39.25	54.08	17.66	49.12	68.63	41.05
w/o Semantic Feature Matching	13.59	34.39	50.76	11.86	41.55	62.84	35.83
NewNet	19.23	43.27	54.17	17.95	49.75	68.78	41.23

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Song, R.; Tian, J.; Zhu, P.; Chen, B. Improving News Retrieval with a Learnable Alignment Module for Multimodal Text–Image Matching. Electronics 2025, 14, 3098. https://doi.org/10.3390/electronics14153098

AMA Style

Song R, Tian J, Zhu P, Chen B. Improving News Retrieval with a Learnable Alignment Module for Multimodal Text–Image Matching. Electronics. 2025; 14(15):3098. https://doi.org/10.3390/electronics14153098

Chicago/Turabian Style

Song, Rui, Jiwei Tian, Peican Zhu, and Bin Chen. 2025. "Improving News Retrieval with a Learnable Alignment Module for Multimodal Text–Image Matching" Electronics 14, no. 15: 3098. https://doi.org/10.3390/electronics14153098

APA Style

Song, R., Tian, J., Zhu, P., & Chen, B. (2025). Improving News Retrieval with a Learnable Alignment Module for Multimodal Text–Image Matching. Electronics, 14(15), 3098. https://doi.org/10.3390/electronics14153098

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Improving News Retrieval with a Learnable Alignment Module for Multimodal Text–Image Matching

Abstract

1. Introduction

2. Related Work

2.1. Advancements in Image Retrieval

2.2. Advancements in Multimodal Retrieval

3. Methodology

3.1. Overall Network

3.2. Learnable Alignment Module

3.3. Transfer Learning Strategy

3.4. Loss Function

4. Experiments and Results

4.1. Datasets

4.2. Experimental Details

4.3. Experimental Results

4.4. Ablation Study

4.5. Limitations and Future Directions

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI