1. Introduction
With the proliferation of large-scale digital literature repositories, there has been an increasing demand for information retrieval systems capable of leveraging diverse cues such as cover images, Optical Character Recognition (OCR)-extracted text, and multilingual content. This trend poses dual challenges for retrieval systems in handling both multimodal and cross-lingual information [1,2]. Cross-modal retrieval (e.g., image–text matching) and cross-lingual retrieval (e.g., Chinese–English document search) have emerged as two critical subfields of information retrieval, each achieving significant advances in recent years [3,4]. For cross-modal retrieval, early efforts focused on learning joint embedding spaces between images and textual descriptions. For instance, Contrastive Language–Image Pretraining (CLIP), proposed by OpenAI, demonstrated that contrastive learning on 400 million image–text pairs enables the acquisition of general-purpose visual–semantic representations without requiring additional labels [5]. The success of CLIP showcased the potential of large-scale pretraining and proved that visual models can benefit directly from natural language supervision. Similarly, Google's ALIGN model, trained on billions of web image–alt-text pairs, further enhanced the alignment between visual and textual modalities [6]. In the realm of cross-lingual retrieval, multilingual pretrained language models have significantly improved semantic alignment across languages [7], enabling users to issue queries in one language while retrieving relevant documents in another. Such capabilities have been widely adopted in cross-lingual information retrieval and question answering.
However, most existing cross-modal retrieval models are developed within a monolingual (primarily English) context [4,8,9], rendering them unsuitable for multilingual scenarios. Conversely, conventional cross-lingual retrieval approaches typically rely solely on textual alignment, failing to incorporate visual information. Consequently, integrating cross-modal and cross-lingual retrieval into a unified framework has become an emerging research focus [3]. One critical obstacle lies in the highly imbalanced distribution of multilingual image–text datasets: over 90% of available pairs are skewed toward high-resource languages such as English, while cross-modal data for languages such as Chinese remain scarce [4]. To mitigate this issue, various strategies involving machine translation and multilingual corpora have been proposed to augment and align datasets. For example, SMALR, introduced by Burns et al., incorporates cross-lingual consistency constraints so that translated queries retrieve similar results, supporting ten languages and outperforming previous models with minimal additional parameters [10]. Similarly, the MURAL model proposed by Jain et al. leverages large-scale translation pairs within contrastive learning, significantly improving retrieval performance across more than 100 languages and particularly benefiting low-resource languages compared to models trained only on English image–text data [4].
On the architectural side, achieving unified alignment across modalities and languages remains technically challenging. Cross-encoder-based multimodal pretraining models such as M3P [1] and UC2 [2] are capable of jointly processing images and multilingual text, thereby capturing deep cross-modal and cross-lingual interactions. Despite their expressive capacity, these models impose substantial computational overhead during inference, which hinders their deployment in large-scale retrieval scenarios. In contrast, dual-encoder architectures enable efficient retrieval by independently encoding images and text, yet they face significant challenges in aligning representation spaces across different modalities and languages [5,6]. To address this, recent research has explored a variety of alignment strategies, including contrastive learning for joint embedding alignment, multitask loss functions combining image–text matching with translation consistency, and reranking mechanisms guided by translated queries [4,11,12]. Nevertheless, issues of performance instability and cross-modal inconsistency persist. For example, Nie et al. observed that cross-lingual pretraining often yields uneven retrieval performance across languages, and conventional cross-modal models may suffer from ranking inconsistency when retrieving image–text pairs described in multiple languages, owing to optimization bias during training [3]. In addition, most existing studies focus on aligning global image semantics with textual descriptions while overlooking the rich information embedded in scene text within images; only a few works have attempted to incorporate scene text into retrieval models [13]. For instance, Miyawaki et al. proposed scene-text-enhanced dual encoders to improve English image–text retrieval [14]; however, their approach did not address cross-lingual scenarios.
To address the aforementioned challenges, a cross-modal and cross-lingual retrieval model that integrates visual features, OCR text, and bilingual semantics is proposed. Specifically, a multi-encoder architecture based on a unified embedding space is designed, comprising visual, OCR, and multilingual text encoders. These encoders enable unified mapping of image features, scene text, and Chinese–English semantic representations, facilitating efficient retrieval across modalities and languages. For the first time, PubLayNet and WikiCLIR datasets are jointly utilized for end-to-end training, bridging visual–OCR alignment with Chinese–English cross-lingual retrieval and enhancing the model’s generalization across tasks. The proposed architecture consists of four key modules: visual encoding, OCR encoding, language encoding, and a multimodal fusion layer, supporting flexible retrieval using combinations of image, OCR text, Chinese queries, and English documents. Furthermore, a multi-loss training strategy is introduced, incorporating contrastive losses for visual–text, OCR–text, and cross-lingual pairs. These components collaboratively optimize the model to enhance representational consistency and robustness across modalities and languages, effectively reducing retrieval biases introduced by semantic and modality mismatches. The main contributions of this work can be summarized as follows:
We propose a novel architecture that seamlessly integrates visual, OCR, and bilingual semantic features into a shared embedding space, enabling retrieval across arbitrary modality and language combinations.
We are the first to jointly train on PubLayNet (image–OCR) and WikiCLIR (Chinese–English) datasets, bridging visual–OCR alignment with cross-lingual retrieval in a single unified model.
We design a structure-aware OCR encoder and a cross-lingual alignment module based on deformable attention, enhancing fine-grained semantic consistency across modalities and languages.
We introduce a joint contrastive loss combining visual–text, OCR–text, and cross-lingual objectives, significantly improving robustness and mitigating semantic bias in cross-modal and cross-lingual retrieval.
Extensive experiments on image→OCR, Chinese→English, and joint multimodal retrieval tasks demonstrate superior results compared to strong baselines such as CLIP, LayoutLMv3, and UNITER.
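To make the proposed multi-encoder arrangement concrete, a minimal PyTorch sketch is given below, showing how visual, OCR, and multilingual text encoders can project their outputs into a shared embedding space. The backbone checkpoints follow the experimental settings reported in Section 4.1.1 (ViT-B/16, LayoutLMv3, XLM-R); all module names, the 512-dimensional shared space, and the use of [CLS]-token pooling are illustrative assumptions rather than the exact implementation.

```python
# Minimal sketch of the multi-encoder retrieval model described above.
# Backbone choices (ViT-B/16, LayoutLMv3, XLM-R) follow Section 4.1.1;
# the module names and the 512-d shared space are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class UnifiedRetrievalEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.visual = AutoModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.ocr = AutoModel.from_pretrained("microsoft/layoutlmv3-base")
        self.text = AutoModel.from_pretrained("xlm-roberta-large")
        # Linear projections map each modality into the shared embedding space.
        self.proj_v = nn.Linear(self.visual.config.hidden_size, dim)
        self.proj_o = nn.Linear(self.ocr.config.hidden_size, dim)
        self.proj_t = nn.Linear(self.text.config.hidden_size, dim)

    def encode_image(self, pixel_values):
        h = self.visual(pixel_values=pixel_values).last_hidden_state[:, 0]  # [CLS] token
        return nn.functional.normalize(self.proj_v(h), dim=-1)

    def encode_ocr(self, **ocr_inputs):
        h = self.ocr(**ocr_inputs).last_hidden_state[:, 0]
        return nn.functional.normalize(self.proj_o(h), dim=-1)

    def encode_text(self, input_ids, attention_mask):
        h = self.text(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state[:, 0]
        return nn.functional.normalize(self.proj_t(h), dim=-1)
```

Because all three encoders emit L2-normalized vectors in the same space, retrieval over any combination of image, OCR, Chinese, and English inputs reduces to cosine-similarity search in that space.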
4. Results and Discussion
4.1. Experimental Settings
4.1.1. Hardware and Software Platform
All experiments were conducted on a high-performance computing workstation equipped with an NVIDIA RTX 3090 GPU (24 GB VRAM), an Intel Core i9-12900K CPU, and 128 GB of memory, operating under Ubuntu 20.04. All models were implemented using the PyTorch 1.13.1 framework, with training accelerated by CUDA 11.6. The multilingual encoders were instantiated from the HuggingFace Transformers library using XLM-R large, while the visual encoders included pretrained ResNet-50 and ViT-B/16 models. The OCR module was implemented using Tesseract and LayoutLMv3. Data preprocessing and augmentation were performed with OpenCV, NLTK, and jieba. The AdamW optimizer was employed with an initial learning rate of , a batch size of 32, and 20 total training epochs. All models were evaluated under identical settings to ensure reproducibility and fair comparison.
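For reference, a minimal sketch of the optimizer configuration described above is shown below. The learning-rate value is missing from the text, so the value used here, along with the weight decay and the stand-in module, is a placeholder assumption.

```python
# Optimizer configuration from Section 4.1.1 (AdamW, batch size 32, 20 epochs).
# The learning rate is not given in the text; 1e-5 and the weight decay are
# placeholder assumptions, and the stand-in module keeps the snippet self-contained.
import torch
from torch import nn

model = nn.Linear(768, 512)  # stand-in for the full multi-encoder framework

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
num_epochs = 20
batch_size = 32
```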
4.1.2. Baselines
To comprehensively validate the effectiveness of the proposed method in multimodal and multilingual retrieval tasks, several representative models were selected as baselines. These included CLIP [5], LayoutLMv3 [20], UNITER (single-stream architecture) [16], an image-to-OCR baseline [30], and a Chinese-to-English document retrieval baseline (CLIR baseline) [31]. CLIP, as a representative dual-encoder model, demonstrates strong performance in image–text alignment, especially in image-to-OCR tasks. LayoutLMv3 integrates visual, textual, and layout information, making it suitable for OCR understanding in structured document scenarios. UNITER, as a single-stream cross-modal pretraining model, captures fine-grained associations between image regions and language, yielding stable retrieval performance across joint tasks. The CLIR baseline utilizes multilingual encoders such as mBERT or XLM-R to model cross-lingual representations between Chinese queries and English documents, serving as a fundamental benchmark for language alignment. The image-to-OCR baseline defines a visual-only alignment task between document images and text regions, allowing evaluation of visual retrieval performance without linguistic input.
4.1.3. Evaluation Metrics
A variety of standard metrics were adopted to evaluate the model from multiple dimensions, including retrieval precision, ranking relevance, and overall retrieval capability. These included Precision@K, Recall@K, Mean Average Precision (mAP), Mean Reciprocal Rank (MRR), Normalized Discounted Cumulative Gain (nDCG@K), and F1@K. Their mathematical definitions are given as follows:

\[
\mathrm{Precision@}K = \frac{1}{K}\sum_{i=1}^{K} rel_i, \qquad
\mathrm{Recall@}K = \frac{1}{R}\sum_{i=1}^{K} rel_i, \qquad
\mathrm{F1@}K = \frac{2\cdot\mathrm{Precision@}K\cdot\mathrm{Recall@}K}{\mathrm{Precision@}K+\mathrm{Recall@}K},
\]
\[
\mathrm{mAP} = \frac{1}{Q}\sum_{q=1}^{Q}\frac{1}{R_q}\sum_{k=1}^{n} P(k)\,rel_k, \qquad
\mathrm{MRR} = \frac{1}{Q}\sum_{i=1}^{Q}\frac{1}{rank_i},
\]
\[
\mathrm{nDCG@}K = \frac{\mathrm{DCG@}K}{\mathrm{IDCG@}K}, \qquad
\mathrm{DCG@}K = \sum_{i=1}^{K}\frac{rel_i}{\log_2(i+1)}.
\]

Here, \(rel_i\) denotes the relevance of the \(i\)-th ranked document (typically 0 or 1), \(R\) is the total number of relevant documents for a given query (\(R_q\) for query \(q\)), \(P(k)\) is the precision at the \(k\)-th position, \(n\) is the number of retrieved documents, \(Q\) represents the total number of queries, \(rank_i\) indicates the rank position of the first relevant document for the \(i\)-th query, and \(\mathrm{IDCG@}K\) is the ideal DCG obtained when all relevant documents are perfectly ranked. Precision@K measures the proportion of relevant items among the top \(K\) results, reflecting retrieval accuracy. Recall@K evaluates the coverage of relevant documents among the retrieved results. mAP averages the precision across recall thresholds for all queries. MRR emphasizes the rank of the first relevant document, offering insight into real-world user experience. nDCG@K assesses ranking quality by penalizing lower-ranked relevant items. F1@K harmonizes precision and recall. These complementary metrics enable a comprehensive assessment of performance in complex multimodal and multilingual retrieval scenarios.
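For clarity, the sketch below implements the per-query form of these metrics; averaging the per-query values over all queries yields mAP, MRR, and the reported @K scores. Function names and the toy relevance list are illustrative, not taken from the actual evaluation code.

```python
# Per-query reference implementations of the ranking metrics defined above.
# `rels` is a 0/1 relevance list in ranked order; `total_relevant` is R.
import math

def precision_at_k(rels, k):
    return sum(rels[:k]) / k

def recall_at_k(rels, k, total_relevant):
    return sum(rels[:k]) / total_relevant if total_relevant else 0.0

def f1_at_k(rels, k, total_relevant):
    p, r = precision_at_k(rels, k), recall_at_k(rels, k, total_relevant)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def reciprocal_rank(rels):
    for i, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def dcg_at_k(rels, k):
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(rels[:k], start=1))

def ndcg_at_k(rels, k):
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal else 0.0

def average_precision(rels, total_relevant):
    hits, score = 0, 0.0
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / total_relevant if total_relevant else 0.0

# Example: one query with relevant documents at ranks 1 and 4 (R = 2).
rels = [1, 0, 0, 1, 0]
print(precision_at_k(rels, 5), recall_at_k(rels, 5, 2), ndcg_at_k(rels, 5))
```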
4.2. Performance Comparison Across Tasks and Models in Multimodal and Multilingual Retrieval
This experiment was designed to evaluate the overall performance of various model types across three retrieval tasks: image-to-OCR unimodal retrieval, Chinese-to-English cross-lingual retrieval, and joint multimodal retrieval. Representative models including CLIP, LayoutLMv3, UNITER, task-specific baselines, and the proposed method were selected to systematically compare their adaptability and expressive capacity under multilingual and multimodal scenarios. The objective of the experimental design was to investigate the performance boundaries of unimodal structures, single-stream architectures, and dual-stream cross-modal frameworks under different retrieval contexts and to assess the advantages of the proposed unified alignment framework in complex practical tasks.
As shown in Table 2 and Figure 4, the proposed model achieved superior performance across all three tasks. Notably, in the joint multimodal setting, a Precision@10 of 0.693 was attained, surpassing UNITER (0.566) and CLIP (0.528) by a considerable margin. Although CLIP, as a typical dual-encoder model, offers efficient inference, its reliance on contrastive loss as the sole cross-modal constraint limits its ability to capture local structures and linguistic details, leading to suboptimal performance in the image-to-OCR task (Precision@10 = 0.621). LayoutLMv3, owing to its integration of OCR and layout features, adapts well to structured document scenarios, achieving a Precision@10 of 0.778 on the same task, but exhibits limitations in cross-lingual contexts. UNITER, a single-stream cross-modal Transformer, provides stronger semantic modeling, reflected in higher nDCG@10 scores than CLIP, but lacks structural awareness. The proposed method unifies the image, OCR, and language modalities via a shared semantic embedding space and a joint contrastive learning strategy, forming a compact and well-aligned representation path in low-dimensional space. This design mitigates the projection shift caused by modality switching in conventional frameworks and supports optimal performance under multi-task conditions.
4.3. Performance on Long-Tail Queries and Noisy Inputs
To better approximate real-world retrieval scenarios, we further evaluated the proposed framework and baseline models under two challenging conditions: long-tail queries and noisy document inputs. Long-tail queries refer to infrequent or domain-specific terms that rarely appear in the training corpus and are representative of niche user information needs in practical retrieval settings. To simulate this, we extracted a subset of evaluation queries with low occurrence frequency and assessed retrieval performance specifically on this subset. Noisy document inputs were generated to mimic realistic degradations that occur in scanned or historical materials, including Gaussian noise, JPEG compression artifacts, random skew, and synthetic OCR errors such as character substitutions or deletions. These conditions were applied to both the image → OCR and Chinese → English retrieval tasks. The resulting performance comparison between the proposed framework and representative baselines is presented in Table 3, illustrating the robustness of different models when facing rare query distributions and degraded document quality.
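As an illustration of how such degradations can be simulated, the sketch below applies Gaussian noise, JPEG recompression, random skew, and character-level OCR corruption using OpenCV and NumPy. All parameter values (noise level, JPEG quality, skew range, error rates) are illustrative assumptions rather than the exact settings used in the experiments.

```python
# Sketch of a document-degradation pipeline: Gaussian noise, JPEG compression
# artifacts, random skew, and synthetic OCR errors. Parameter values are
# illustrative assumptions, not the paper's exact configuration.
import random
import cv2
import numpy as np

def degrade_image(img, noise_std=10.0, jpeg_quality=30, max_skew_deg=3.0):
    """Apply Gaussian noise, JPEG recompression, and a small random rotation."""
    noisy = np.clip(img.astype(np.float32) +
                    np.random.normal(0, noise_std, img.shape), 0, 255).astype(np.uint8)
    _, buf = cv2.imencode(".jpg", noisy, [cv2.IMWRITE_JPEG_QUALITY, jpeg_quality])
    noisy = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    h, w = noisy.shape[:2]
    angle = random.uniform(-max_skew_deg, max_skew_deg)
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    return cv2.warpAffine(noisy, M, (w, h), borderValue=(255, 255, 255))

def corrupt_ocr_text(text, sub_rate=0.05, del_rate=0.02,
                     charset="abcdefghijklmnopqrstuvwxyz"):
    """Randomly substitute or delete characters to mimic OCR recognition errors."""
    out = []
    for ch in text:
        r = random.random()
        if r < del_rate:
            continue  # simulated deletion
        out.append(random.choice(charset) if r < del_rate + sub_rate else ch)
    return "".join(out)
```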
4.4. Ablation Study of Submodules (Joint Multimodal Task)
To validate the contribution of key submodules—including the visual encoder, OCR module, and language alignment module—an ablation study was conducted within the joint multimodal retrieval setting. Each submodule was removed individually from the full model, and the impact on Precision@10, nDCG@10, and F1@10 was evaluated. This experiment quantitatively assessed the role of each modality within the fusion framework and provided empirical evidence for the effectiveness of the proposed collaborative mechanism in supporting unified retrieval.
As presented in Table 4 and Figure 5, the full model outperformed all ablated variants, indicating that each submodule provides complementary benefits. Removing the visual encoder reduced Precision@10 to 0.621, highlighting the importance of global structure and visual semantics from the document cover. Removing the OCR module likewise caused a drop in F1@10, demonstrating the indispensability of localized text-region features for fine-grained semantic modeling. Excluding the language alignment module led to a notable decline in nDCG@10, emphasizing its core role in aligning multilingual queries and ensuring accurate semantic ranking. From a modeling perspective, the visual encoder contributes compressed representations of global image features, the OCR module introduces auxiliary alignment through local structural modeling, and the language alignment module enhances cross-lingual consistency through shared embeddings and contrastive optimization. Jointly embedding these modalities in a unified space enables multi-center, multi-scale feature representation, improving convergence stability and query robustness under multimodal conditions. Thus, the synergistic integration of submodules not only improves local task-specific outcomes but also systematically enhances global semantic alignment and retrieval quality.
4.5. Ablation Results of Joint Loss Functions for Unified Embedding Retrieval
This experiment was conducted to investigate the specific contributions of different loss functions in the learning of a unified multimodal embedding space. A series of ablation configurations were designed by selectively removing key alignment terms across modalities and languages, with the aim of evaluating the impact of each loss component on retrieval performance. The experiment focused on metrics such as MRR, Recall@10, and nDCG@10, to assess the role of complete versus partial loss structures in achieving semantic mapping stability and consistent ranking performance in multilingual and multimodal settings. This design provides insight into the objective function’s role in modeling semantic alignment across modalities.
As presented in Table 5 and Figure 6, the complete joint loss achieved the best performance across all metrics, with an MRR of 0.642, significantly surpassing all partial loss settings. When the cross-modal language loss was removed, thereby excluding the direct alignment between the language modality and the image/OCR representations, nDCG@10 dropped to 0.632, indicating a weakened semantic boundary across modalities and reduced ranking precision. Further excluding the visual loss and retaining only language-level alignment between Chinese and English OCR resulted in an MRR of 0.589, emphasizing the importance of the visual modality in constructing global semantic representations. The configuration retaining only the visual (image–OCR) contrastive term yielded the poorest performance (MRR = 0.561), which can be attributed to the absence of language-guided semantic abstraction: without linguistic supervision, the embedding space failed to decode query semantics effectively, leading to degenerate semantic projection structures. Mathematically, the joint contrastive loss not only expands the pool of intra-modal positive–negative pairs but also forms hierarchical geometric relationships across modality-specific subspaces through multi-objective optimization. This experiment demonstrates that relying solely on either visual or linguistic alignment is insufficient to achieve comprehensive semantic consistency; a complete loss framework is therefore essential for building a robust and generalizable retrieval system across modalities.
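A minimal sketch of such a joint objective is given below, combining symmetric InfoNCE terms for the visual–text, OCR–text, and Chinese–English pairs over L2-normalized embeddings. The equal weighting and the temperature of 0.07 are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a joint contrastive objective with visual-text, OCR-text, and
# Chinese-English (cross-lingual) terms. Symmetric InfoNCE per pair type;
# equal weights and temperature 0.07 are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of aligned, L2-normalized embeddings."""
    logits = a @ b.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def joint_contrastive_loss(img, ocr, zh, en, weights=(1.0, 1.0, 1.0)):
    """Combine the three alignment terms discussed in the ablation above."""
    w_vt, w_ot, w_xl = weights
    loss_visual_text = info_nce(img, en)   # visual  <-> text
    loss_ocr_text    = info_nce(ocr, en)   # OCR     <-> text
    loss_cross_ling  = info_nce(zh, en)    # Chinese <-> English
    return w_vt * loss_visual_text + w_ot * loss_ocr_text + w_xl * loss_cross_ling

# Example with random normalized embeddings (batch of 8, dimension 512).
B, D = 8, 512
img, ocr, zh, en = (F.normalize(torch.randn(B, D), dim=-1) for _ in range(4))
print(joint_contrastive_loss(img, ocr, zh, en).item())
```

Dropping any one of the three terms reproduces the partial-loss configurations examined in the ablation.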
4.6. Discussion
4.6.1. Practical Application Analysis
The proposed unified multimodal and multilingual retrieval framework demonstrates significant advantages across diverse real-world scenarios, particularly in library information services, cross-language document retrieval, and multimodal educational resource recommendation. In traditional university library systems, users typically rely on cover images, OCR-extracted content tables, or chapter information for fuzzy search. However, existing systems are predominantly unimodal and incapable of handling mixed queries involving images, text, and language. By incorporating visual encoders and OCR alignment modules, the proposed method enables structured retrieval through simple cover image inputs. The system automatically parses layout and text content, facilitating linkage to relevant documents, papers, or collections. On international platforms such as multilingual digital libraries or Wikimedia Commons, users often encounter retrieval barriers between Chinese queries and English content. Through the proposed cross-lingual alignment mechanism, Chinese queries can be mapped to English documents, images, or OCR content without the need for external translation tools, significantly enhancing retrieval accuracy and efficiency. In educational domains, particularly for digitalized K–12 content management, educators can upload textbook covers or screenshots to automatically retrieve related subjects or multilingual reference materials. Moreover, in domains like news media management and digital copyright monitoring, the joint modeling of cover images, OCR abstracts, and multilingual retrieval supports consistent cross-modal copyright verification.
4.6.2. Computational Efficiency Analysis
Although the proposed framework integrates multiple components—visual encoding, structure-aware OCR processing, and multilingual language modeling—it has been designed with computational efficiency in mind. The architecture adopts a dual-encoder retrieval paradigm, allowing visual and textual inputs to be encoded independently. This enables pre-computation of embeddings and efficient similarity search at inference time, avoiding the need for computationally expensive joint query–document encoding for every retrieval request. In addition, the modular structure of the framework allows the visual encoder, OCR module, and language encoder to operate in parallel. During both training and inference, these modules process their respective modalities concurrently before feature fusion in the unified semantic space. This parallelism reduces end-to-end latency and improves throughput in large-scale retrieval tasks. Furthermore, training efficiency is improved by leveraging pretrained backbone models for the visual and multilingual encoders, which reduces convergence time and minimizes redundant computation in early layers. The OCR component is similarly integrated into the pipeline in a way that avoids repeated text extraction for static document collections, further saving processing time in deployment scenarios. Through these design choices, the framework achieves strong retrieval performance while keeping computational demands manageable for large-scale multimodal and multilingual document retrieval applications.
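The precompute-then-search pattern described above can be sketched as follows, assuming document-side embeddings have already been produced offline by the encoders. The function names are illustrative, and a production system would typically replace the brute-force matrix multiply with an approximate-nearest-neighbor index.

```python
# Sketch of precomputed-embedding retrieval: document embeddings are built once
# offline, so each query needs only one encoder pass plus a matrix multiply.
import torch

@torch.no_grad()
def build_index(doc_embeddings):
    """L2-normalize precomputed document embeddings into an (N, D) index matrix."""
    return torch.nn.functional.normalize(doc_embeddings, dim=-1)

@torch.no_grad()
def search(index, query_embedding, k=10):
    """Cosine-similarity top-k search; index is (N, D), query is (D,)."""
    q = torch.nn.functional.normalize(query_embedding, dim=-1)
    scores = index @ q                                   # (N,) similarities
    top = torch.topk(scores, k=min(k, index.size(0)))
    return top.indices.tolist(), top.values.tolist()

# Example with random embeddings standing in for encoder outputs.
index = build_index(torch.randn(1000, 512))
doc_ids, sims = search(index, torch.randn(512), k=10)
```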
4.6.3. Generalization and Robustness Analysis
The proposed unified multimodal multilingual retrieval framework is inherently language-agnostic in design. By projecting visual, OCR, and multilingual textual representations into a shared embedding space, the architecture can in principle accommodate languages beyond Chinese and English without requiring structural changes. The use of multilingual pretrained language encoders further facilitates transfer to additional languages, as such encoders already embed semantic knowledge across a wide variety of language families. We expect that, with appropriate parallel corpora or translation-based augmentation, the framework could be adapted to low-resource languages, even in scenarios where annotated multimodal training data is scarce. The design also supports robustness to variations in document quality. Because retrieval relies on complementary signals from both visual layout features and OCR-extracted text, the system is likely to remain effective even when one modality is partially degraded, as might occur with scanned receipts, historical documents, or other noisy inputs. For example, when OCR accuracy drops due to low resolution or complex layouts, visual features can still provide discriminative cues, and conversely, clean textual cues can compensate for less informative visual patterns. This multimodal complementarity positions the framework to better handle challenging real-world document conditions compared to unimodal approaches. In summary, while the current evaluation focuses on Chinese–English retrieval with clean document inputs, the framework’s modularity, multilingual foundation, and cross-modal design suggest promising potential for extension to low-resource languages and for robust performance under noisy document conditions in future applications.
4.7. Limitation and Future Work
Despite the superior performance exhibited by the proposed unified multimodal multilingual retrieval framework across multiple tasks, certain limitations remain. First, model training currently relies on predefined modality pairs and alignment labels (e.g., cover–OCR pairs or Chinese–English document alignments), which may limit scalability in scenarios with insufficient coverage. Second, the current system lacks user interaction feedback mechanisms, hindering dynamic adaptation and online optimization based on real-time user behaviors such as clicks, corrections, or ratings. Future research directions include introducing generative multimodal augmentation and multilingual knowledge graph supervision to improve robustness under low-resource conditions. In addition, human-in-the-loop approaches and reinforcement learning from human feedback (RLHF) can be explored to incorporate explicit and implicit user feedback into the training process, enabling the system to learn personalized retrieval preferences and adapt to evolving information needs. Such user-driven online learning mechanisms would allow the model to continuously refine its performance, thereby promoting the widespread deployment of intelligent retrieval systems in real-world multimodal and multilingual environments.
5. Conclusions
With the rapid growth of library information systems, digital literature platforms, and cross-lingual knowledge retrieval demands, conventional retrieval systems based on single modality or single language have become inadequate for supporting efficient information access in multimodal and multilingual environments. To address this challenge, a unified multimodal and multilingual aligned retrieval framework has been proposed. By constructing a shared semantic space for visual inputs, OCR texts, and Chinese–English language representations, end-to-end cross-modal alignment and high-precision retrieval have been achieved across images, texts, and languages. The proposed model architecture integrates multimodal embedding encoders, a cross-lingual attention mechanism, and a joint contrastive learning loss, enabling support for arbitrary combinations of query modalities—such as cover image queries, OCR content alignment, and cross-language retrieval—across complex real-world scenarios. The effectiveness of the proposed framework has been extensively validated through experimental evaluations. In tasks including image → OCR, Chinese → English document retrieval, and joint multimodal input scenarios, the proposed model consistently outperformed existing baselines across multiple key metrics such as Precision@10, Recall@10, and nDCG@10. Specifically, in the joint multimodal input task, a Precision@10 of 0.693 and an F1@10 of 0.685 were achieved, demonstrating significant improvements over mainstream baseline models. Additional ablation studies confirmed the complementary contributions of the visual, OCR, and language alignment modules, while the loss function ablation results further highlighted the critical role of multimodal contrastive learning strategies in constructing a unified semantic space.