Advancing Cross-Language Information Retrieval Through Shared Semantic Models: Applications in Public Cultural Resources

Xia, Zishuo; Liang, Shaobo; Wu, Dan; Lv, Siyu

doi:10.3390/app16094158


¹ School of Information Management, Wuhan University, Wuhan 430072, China

² Center for Studies of Human-Computer Interaction and User Behavior, Wuhan University, Wuhan 430072, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(9), 4158; https://doi.org/10.3390/app16094158

Submission received: 30 March 2026 / Revised: 14 April 2026 / Accepted: 20 April 2026 / Published: 23 April 2026

(This article belongs to the Special Issue New Advances in Information Retrieval)


Abstract

With the rapid development of public digital cultural resources, the lack of cross-lingual information retrieval (CLIR) services catering to multilingual users in practical applications has created significant language barriers. This hinders the promotion of public digital culture and results in the underutilization of relevant resources. To address this need, this paper constructs M-APE, a shared semantic model that operates without reliance on parallel corpora. Through a three-step process comprising the generation, fine-tuning, and optimization of a shared semantic space, M-APE establishes a common semantic framework for diverse languages. The model utilizes a Chinese semantic space, transferred and trained on authentic public cultural corpora, as its input. Evaluation based on bilingual dictionary induction quality demonstrates that M-APE significantly enhances semantic sharing performance between Chinese and Indo-European languages, represented here by English and French, achieving an average cross-family transformation accuracy of 56.6%. Furthermore, focusing on the CLIR needs of multilingual users within China’s public cultural engineering projects, this study develops a Chinese-English-French cross-lingual information retrieval framework by integrating M-APE into public cultural domain tasks. Experimental results indicate that the proposed method achieves superior cross-lingual retrieval performance in terms of average metrics.

Keywords: information retrieval; cross-lingual information retrieval (CLIR); public digital culture; semantic model

1. Introduction

The continuous accumulation of multilingual data and the growing demand for high-quality retrieval results have increasingly highlighted the limitations of monolingual information retrieval, creating a need for search results that encompass relevant information in multiple languages [1]. In recent years, research on Cross-Lingual Information Retrieval (CLIR) has become highly active. CLIR allows users to retrieve relevant content in languages different from their query language, addressing the common desire among users to access multilingual content by querying solely in their native tongue [2]. With the advancement of digital technologies and digital humanities, the transformation of public cultural services into digital, globalized public digital cultural services has become a prevailing trend. The quantity and variety of accessible public digital cultural resources have increased significantly, and information retrieval provides the public with a crucial pathway to access this wealth of valuable cultural information.

Most existing CLIR systems rely on machine translation; however, translation theories that prioritize formal conversion over semantic transformation fail to reflect the essential characteristics of cross-lingual interaction [3]. With the rapid development of Deep Learning (DL) technologies, it has become possible to accurately generate distributed embeddings of natural language vocabulary within deep models, known as Word Embeddings [4]. These embeddings capture linguistic regularities within the trained language, and such regularities can be transferred across different languages. Recent advances in transformer-based multilingual models, such as multilingual BERT and XLM-R, have significantly improved cross-lingual representation learning through large-scale multilingual pretraining [5]. These models demonstrate strong performance in various CLIR tasks without requiring explicit word-level alignment [6]. However, they typically rely on large-scale multilingual corpora and substantial computational resources, and often lack adaptability to domain-specific scenarios, particularly in specialized fields such as public digital culture [7]. In short, both traditional embedding-based approaches and transformer-based multilingual models still face key challenges in cross-lingual settings, including limited effectiveness in low-resource scenarios and insufficient modeling of domain-specific semantic structures.

Existing methods often yield suboptimal cross-lingual performance when dealing with cross-family languages characterized by feature space heterogeneity [8], such as Chinese (Sino-Tibetan) and Indo-European languages like English and French. Consequently, there is an urgent need to investigate semantic information sharing among Chinese, English, and French. Furthermore, domain-specific information retrieval faces significant challenges, including the difficulty of extracting domain-specific semantic relations [9] and the scarcity of easily accessible bilingual supervised corpora for specific domains [10]. These issues pose substantial obstacles to domain-specific CLIR. Particularly in the field of public culture, research on cross-lingual information retrieval remains relatively underdeveloped in current literature [11].

Therefore, this study aims to develop a shared semantic framework that does not depend on parallel corpora and that enables effective cross-lingual semantic alignment and domain-specific retrieval. To address the gaps identified above, this paper poses two primary research questions and explores solutions within the context of the public culture domain:

(1) How can a Chinese-English-French shared semantic model be constructed without relying on parallel corpora to achieve semantic information sharing across different language families?

(2) How can this shared semantic model be applied to realize cross-lingual information retrieval technology specifically for the public culture domain?

To further formalize the research design, this study is guided by the following hypotheses:

H1. A shared semantic space constructed without parallel corpora can achieve effective cross-lingual semantic alignment across language families.

H2. The proposed semantic model can outperform baseline methods in cross-lingual retrieval performance within domain-specific scenarios.

2. Related Work

2.1. Domain-Specific Information Retrieval

Knowledge-based approaches in domain-specific information retrieval leverage expert domain knowledge to effectively handle nuances within specific fields. A prominent example is the Unified Medical Language System (UMLS) [12] used in the medical domain. Knowledge-based query expansion techniques for specific domains utilize new terms extracted from initially retrieved documents to expand the original query, subsequently extracting relevant terms and semantic relationships between them from knowledge bases [13].

Ontology-based methods have been widely applied in domain-specific information retrieval, with domain-dependent ontologies existing in commerce, agricultural law, medicine, and numerous other fields. Among these, fuzzy ontologies are employed to identify words associated with the most frequently occurring terms in a specific domain [14], thereby supporting the characteristics of domain ontologies [15]. Lee et al. [16] argued that, compared to crisp ontologies, fuzzy ontologies are better suited for describing domain knowledge and developed a fuzzy ontology specifically for the news summarization domain. Similarly, Abulaish et al. [17] utilized fuzzy ontologies in the medical field, employing text mining techniques for their construction. Fuzzy ontologies are also applied to query expansion by integrating domain-specific and global ontologies to identify the most semantically relevant terms surrounding query keywords [18].

Machine Learning (ML) models, particularly deep learning models, are often regarded as “black boxes” because their inference processes are opaque. This lack of transparency affects user trust in, and acceptance of, the retrieved information [19]. Consequently, the explainability of information retrieval systems has garnered increasing attention in domain-specific contexts. The interpretation of IR algorithms is highly dependent on the application domain, implying that explainability features must be tailored to specific industrial applications [20]. To address the needs for both explainability and domain scalability, Knowledge Graphs (KGs) can be utilized to model domain knowledge [21]. KGs enable the generation of textual and visual explanations for retrieved information, allowing algorithms to adapt to domain requirements while supporting transferability to other specific domain environments, thus facilitating multi-domain applicability [22].

2.2. Research on Semantic Spaces

Semantic spaces in the field of Natural Language Processing (NLP) aim to create representations that capture the essence of natural language. Before raw text is input into neural networks, a technique is employed to map human language into a geometric space, referred to as a Semantic Space.

One of the most fundamental methods for representing words numerically is One-Hot Encoding [23], a count-based vectorization approach. In one-hot encoding, the vector dimension equals the number of unique word types in the corpus, represented as binary vectors. This method is often associated with statistical language modeling via the Bag of Words (BoW) representation. The BoW model maintains a dictionary of all possible words in a language and records their frequency within a specific corpus; however, it fails to capture word similarity or contextual meaning [24]. To address the issue of word similarity, Term Frequency-Inverse Document Frequency (TF-IDF) was introduced [25]. TF-IDF associates each word in a document with a numerical weight to measure word similarity and facilitates the comparison of similarity across multiple documents. Unlike the high-dimensional sparse vectors (often comprising thousands or millions of dimensions) generated by one-hot encoding, word embedding technologies produce semantic spaces that are lower-dimensional and dense, typically consisting of only dozens or hundreds of dimensions.
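The contrast between raw Bag-of-Words counts and TF-IDF weights can be illustrated with a minimal sketch. The toy documents and the plain `tf × log(N/df)` weighting are assumptions chosen for illustration; practical systems often use smoothed variants.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Count how often each word type occurs in one document."""
    return Counter(tokens)


def tf_idf(docs):
    """Return one {word: weight} dict per document.

    tf  = count of the word in the document / document length
    idf = log(N / number of documents that contain the word)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = bag_of_words(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy tokenised documents (made up for this example).
docs = [
    ["library", "opens", "new", "reading", "room"],
    ["museum", "opens", "new", "exhibition"],
    ["library", "hosts", "reading", "club"],
]
weights = tf_idf(docs)
```

In this toy corpus, a word that occurs in a single document (such as "room") receives a higher weight than one shared by several documents (such as "opens"), which raw counts alone cannot express.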

Cross-lingual word embeddings represent an extension of the monolingual word embedding concept, mapping equivalent words from two or more languages into a shared cross-lingual semantic space [26,27]. The objective of this paper is to learn a shared embedded semantic space for words across all target languages. With such a vector space, models can be trained on data from any language. By projecting available examples from one language into this shared semantic space, the model simultaneously acquires the capability to perform predictions in all other languages.

3. Construction of a Chinese Semantic Space for the Public Culture Domain

3.1. Construction of a Chinese Corpus for the Public Culture Domain

A document stores information related to a specific domain or topic of interest, while a corpus represents the aggregate collection of such documents [28]. In the current era of rapid Natural Language Processing (NLP) development, raw corpora have become increasingly critical. This study begins by constructing a small-scale Chinese corpus specifically tailored to the public culture domain.

3.1.1. Data Sources

To ensure the validity and domain-specific characteristics of the corpus, all collected materials must be strictly oriented towards the public culture sector. Furthermore, to guarantee that the data originates from officially recognized, endorsed, and authorized sources, this study selects data recorded by the National Public Cultural Cloud (https://www.culturedc.cn/) as the primary source. The platform is operated under the leadership of the National Center for Public Cultural Development of the Ministry of Culture and Tourism, and its data are openly accessible. Consequently, we opted to crawl content directly from this website.

The National Public Cultural Cloud is accessible via four channels: Official Accounts, Mini-programs, Mobile Apps, and its Website. The first three channels revolve around six core functions: “Watch Live,” “Enjoy Activities,” “Learn Arts,” “Book Venues,” “Read Good Books,” and “Visit Markets.” Among these, the “Watch Live” and “Learn Arts” modules primarily consist of video resources indexed by time, while the “Visit Markets” module serves as a trading platform. Therefore, this study narrows the scope of the corpus to the textual content within four specific modules: “News & Information,” “Learn Arts,” “Book Venues,” and “Read Good Books.”

3.1.2. Corpus Construction Process

Initially, this study identified the unique identifier (ID) for each document within the website’s JavaScript client scripts. Following the URL composition patterns of the website, we reconstructed the specific web links corresponding to each document.

Because the website serves its data through an Application Programming Interface (API) that is queried on every user access, we simulated user requests in our code. By analyzing the type, headers, and payloads of the POST requests, we sent simulated requests to the reconstructed URLs and queried data directly from the API, which returned the data in JSON format. In total, we crawled 5459 updated news items from the “News & Information” module. Table 1 presents the classification and count statistics for the other crawled resource types, where the “Total” figure is the count after deduplication.

The raw data, originally in JSON format, was further parsed and stored into corresponding fields to create a Corpus Data Frame Object. This process converts unstructured data into a structured format. A Corpus Data Frame Object represents a method of organizing data in computing; specifically, it is a data frame containing columns of text type, as illustrated in Table 1.
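The conversion from JSON responses to a structured, text-typed table can be sketched as follows. The response layout and the field names (`id`, `module`, `title`, `content`) are hypothetical, since the actual API schema is not given here; the point is only the flattening and deduplication pattern.

```python
import json

# Hypothetical API response; the field names are illustrative assumptions.
raw_response = """
{"data": [
  {"id": "n001", "title": "Library opens reading room",
   "module": "News & Information", "content": "Full text of the news item"},
  {"id": "v014", "title": "Community art centre",
   "module": "Book Venues", "content": "Address and booking notes"},
  {"id": "n001", "title": "Library opens reading room",
   "module": "News & Information", "content": "Full text of the news item"}
]}
"""

COLUMNS = ("id", "module", "title", "content")


def to_rows(response_text):
    """Parse a JSON response into flat records that share the same text columns."""
    payload = json.loads(response_text)
    rows = []
    for item in payload.get("data", []):
        # Missing fields become empty strings so every row has every column.
        rows.append({col: str(item.get(col, "")) for col in COLUMNS})
    return rows


def deduplicate(rows, key="id"):
    """Keep only the first record seen for each identifier."""
    seen, unique = set(), []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            unique.append(row)
    return unique


corpus_rows = deduplicate(to_rows(raw_response))
```

Each dictionary in `corpus_rows` plays the role of one row of the data frame; no text cleaning is applied at this stage, in keeping with the raw-corpus design.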

Given that different NLP tasks impose varying requirements on raw text processing [29], this study deliberately refrained from performing further text processing during the corpus construction phase. This approach preserves the versatility of the collected corpus for diverse NLP tasks, resulting in the creation of a Raw Corpus.

3.2. Pre-Trained Chinese Semantic Space

3.2.1. Semantic Space Pathway Design

There are two primary approaches to generating word embedding feature representations from a raw corpus to construct a semantic space:

(1) Learning from Scratch: Word embeddings are learned simultaneously with a primary task (e.g., document classification or sentiment prediction). In this scenario, embeddings are initialized with random numbers and updated alongside the neural network weights during the training of the primary task.

(2) Using Pre-trained Models: Word embedding models pre-trained on large-scale corpora are utilized. Existing pre-trained embeddings, developed for machine learning tasks distinct from the target problem, are loaded into the model. These are known as Pre-trained Word Embeddings, and the space they constitute is referred to as a Pre-trained Semantic Space.

Considering the limited scale of corpora available in our target domain (public culture), this study opts to reuse word embedding features pre-trained on large-scale corpora for initialization. Instead of learning embeddings from scratch while solving the target problem, we load embedding vectors from a pre-computed semantic space. We then employ transfer learning by fine-tuning the entire model on a smaller, domain-specific dataset. This strategy accelerates the optimization process and enhances the model’s adaptability to the specific task.

3.2.2. FastText Pre-Trained Model

This study utilizes the FastText method to load embedding vectors from the pre-computed space. FastText trains a vector representation for each character-level n-gram, encompassing whole words, misspelled words, word fragments, and even individual characters. Zhang et al. [30] demonstrated that FastText’s character-level n-gram encoding yields ideal results for both Indo-European languages (e.g., English, French) and logographic languages (e.g., Chinese, Japanese, Korean).

Furthermore, FastText possesses an inherent capability to generate vectors for out-of-vocabulary (OOV) words. This allows for continuous online updates to the corpus based on training outcomes, supporting ongoing model training and dictionary expansion. These features make FastText an ideal candidate for transfer learning.

As illustrated in Figure 1, the FastText pre-trained model operates as follows. The input layer receives the individual words and n-gram features of the text, which a Look-Up Layer maps to word embeddings. The word embeddings are then averaged (summed and divided by their number) to form a vector representation of the sentence or long text. Finally, a linear classifier categorizes the sentence or text by calculating the probability distribution over predefined classes using methods such as the Softmax function, Hierarchical Softmax, or Negative Sampling.
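The look-up, averaging, and classification pipeline described above can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: n-gram features are omitted, the full Softmax is used rather than Hierarchical Softmax or Negative Sampling, and the 3-dimensional vectors and class weights are made-up values.

```python
import math


def average_embedding(tokens, lookup):
    """Look up each known token and average the vectors component-wise."""
    vectors = [lookup[t] for t in tokens if t in lookup]
    dim = len(next(iter(lookup.values())))
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    peak = max(scores)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, lookup, weights, bias):
    """Apply a linear classifier to the averaged text vector."""
    text_vec = average_embedding(tokens, lookup)
    scores = [sum(w * x for w, x in zip(row, text_vec)) + b
              for row, b in zip(weights, bias)]
    return softmax(scores)


# Toy look-up table and a two-class linear layer (illustrative values only).
lookup = {
    "library": [1.0, 0.0, 0.0],
    "book": [0.0, 1.0, 0.0],
    "concert": [0.0, 0.0, 1.0],
}
weights = [[2.0, 2.0, 0.0],   # class 0, e.g. "reading"
           [0.0, 0.0, 2.0]]   # class 1, e.g. "performance"
bias = [0.0, 0.0]
probs = classify(["library", "book"], lookup, weights, bias)
```

With these toy values the averaged vector for the tokens "library" and "book" lies closer to class 0, so `probs[0]` exceeds `probs[1]`.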

The pre-trained Chinese word embeddings used in this study were trained on Wikipedia using the Continuous Bag of Words (CBOW) algorithm. The model contains 332,647 tokens, with each token represented by a 300-dimensional vector. The model is available in both binary (.bin) and text (.txt or .vec) formats.
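For the text format, a minimal reader can be sketched as below. It assumes the usual fastText `.vec` layout, in which the first line gives the vocabulary size and the dimension and each later line gives a token followed by its space-separated components; the function name `load_vec` and the tiny in-memory sample are illustrative.

```python
import io


def load_vec(handle, limit=None):
    """Read word vectors from a text (.vec) stream.

    The first line holds the vocabulary size and the dimension; every
    following line holds a token followed by its vector components.
    """
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
        if limit is not None and len(vectors) >= limit:
            break
    return vocab_size, dim, vectors


# Tiny stand-in for the real file, whose header would read "332647 300".
sample = io.StringIO("2 3\n图书馆 0.1 0.2 0.3\n博物馆 0.4 0.5 0.6\n")
vocab_size, dim, vectors = load_vec(sample)
```

For a file on disk, the same function could be called on `open(path, encoding="utf-8")`; the `limit` argument is a convenience for loading only the most frequent tokens, which such files typically list first.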

3.3. Transfer Learning of the Chinese Semantic Space for the Public Culture Domain

Transfer learning, also known as domain adaptation, helps overcome discrepancies between domains, allowing a classifier trained on a source domain to generalize effectively to a target domain [31]. In this study, the source domain is the pre-trained general semantic space, while the target domain is the public culture sector.

3.3.1. Transfer Learning Model Based on Public Culture Corpora

The transfer learning model based on public culture corpora is depicted in Figure 2. Prior to transfer training, we must address linguistic challenges inherent in text embeddings: Chinese is written in logograms, and unlike Indo-European languages, it lacks explicit whitespace delimiters to define word boundaries. Additionally, text data is often noisy, containing symbols, emojis, URLs, and invisible characters. Previous research [32] indicates that filtering and cleaning data before developing pre-trained language models is highly beneficial for text classification tasks.

Therefore, before conducting unsupervised FastText language model transfer training, the public culture corpus dataset collected in Section 3.1 undergoes preprocessing. Using regular expressions, we remove numbers, emojis, punctuation, URLs, HTML tags, and label markers from the text. This process not only reduces noise but also decreases the vocabulary size by eliminating sparse terms (such as specific English names or places), thereby reducing memory consumption and improving training speed.
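A minimal sketch of such regular-expression cleaning is given below. The specific patterns are assumptions, as the expressions used in the study are not listed: URLs and HTML tags are removed first, then digits, and finally every character that is not a CJK ideograph, an ASCII letter, or whitespace.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")
DIGIT_RE = re.compile(r"\d+")
# Keep CJK ideographs, ASCII letters, and whitespace; everything else
# (punctuation, emojis, symbols, invisible characters) becomes a space.
KEEP_RE = re.compile(r"[^\u4e00-\u9fffA-Za-z\s]")
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Strip URLs, HTML tags, digits, punctuation, and emojis from raw text."""
    text = URL_RE.sub(" ", text)   # URLs go first, before their punctuation is touched
    text = TAG_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = KEEP_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


sample = "<p>国家图书馆2024年开放!详情见 https://example.org/news?id=1 😀</p>"
cleaned = clean_text(sample)
```

One limitation of this sketch is that the URL pattern stops only at whitespace, so Chinese text written directly after a URL without a space would be removed with it.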

For segmentation, this study employs the open-source jieba tool, which utilizes a Hidden Markov Model (HMM) and the Viterbi algorithm [33] for dynamic programming-based word segmentation. Jieba is highly flexible and supports loading stop-word lists and user-defined dictionaries; terms in a user dictionary are treated as single units rather than being split. After processing, the dataset comprises 283,511 paragraphs, with a minimum length of 3 characters and a maximum of 6099 characters; paragraph lengths are approximately normally distributed. The segmented text is converted into a 2D tensor and fed into the FastText model for transfer training.

Regarding the specific parameter configuration, this study utilized the default hyperparameters of FastText for training, resulting in 300-dimensional domain-oriented Chinese word embeddings through transfer learning. Similarly, the English and French monolingual vectors used in this study are 300-dimensional unsupervised word embeddings trained via FastText. During the corpus preprocessing phase, noise was eliminated using regular expressions, and subword lengths were carefully defined to optimize parsing performance across various processing tasks.

3.3.2. Chinese Semantic Space After Domain Transfer Learning

To evaluate the extent of change in the word embeddings before and after training, we set a threshold of 1 × 10⁻⁴. Changes exceeding this threshold are considered significant, indicating that the Chinese semantic space has successfully undergone learning via transfer training. Taking representative terms from the public culture domain, such as “Library” and “Museum,” we calculated the top 10 most similar words for each. The results and similarity scores are presented in Table 2. The analysis reveals that the word embeddings obtained after transfer learning capture semantics more accurately, demonstrating improved applicability within the public culture domain. Hereinafter, this space is referred to as the Chinese word vectors after transfer learning in public culture, abbreviated as CTL-PC.

The qualitative shift observed in Table 2 provides empirical evidence of the necessity of domain adaptation. In the ‘before’ state, the neighbors for ‘Library’ are primarily related to generic architectural or administrative terms. However, in the CTL-PC space, terms such as ‘Reading Room’ and ‘Library Science’ emerge with higher relevance scores, indicating that the model successfully captured the ‘service-oriented’ and ‘academic’ semantics specific to the public cultural sector. This domain sensitivity is crucial for reducing noise during subsequent cross-lingual mapping.
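The two checks used in this subsection, the change threshold and the nearest-neighbor comparison, can be sketched in a few lines. The text does not specify how the change is measured, so the largest absolute per-component difference is assumed here; the 2-dimensional vectors are made-up stand-ins for the 300-dimensional spaces.

```python
import math

THRESHOLD = 1e-4  # change threshold stated in the text


def max_abs_change(before, after):
    """Largest absolute per-component difference between two vectors."""
    return max(abs(a - b) for a, b in zip(before, after))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, space, k=10):
    """Return the k nearest neighbours of `word` by cosine similarity."""
    query = space[word]
    scored = [(other, cosine(query, vec))
              for other, vec in space.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy spaces before and after transfer training (illustrative values only).
before = {"library": [1.0, 0.0], "building": [0.9, 0.1], "reading_room": [0.2, 0.8]}
after = {"library": [0.6, 0.6], "building": [0.9, 0.1], "reading_room": [0.3, 0.8]}

changed = [w for w in before if max_abs_change(before[w], after[w]) > THRESHOLD]
top_before = most_similar("library", before, k=1)
top_after = most_similar("library", after, k=1)
```

In this toy setup the nearest neighbor of "library" moves from "building" to "reading_room", mirroring the kind of shift reported in Table 2.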

4. Construction of a Cross-Lingual Shared Semantic Model Based on Chinese Semantic Space

4.1. Framework of the Chinese-English-French Cross-Lingual Shared Semantic Model

This paper proposes a shared semantic model designed to generate a multilingual shared word embedding semantic space. The cross-lingual transfer of word embeddings aims to establish semantic mapping relationships between words in different languages by learning a transformation function for the corresponding word embedding spaces. Successfully addressing this challenge will benefit numerous downstream cross-lingual learning tasks, such as Cross-Lingual Information Retrieval (CLIR).

Facebook proposed MUSE (Multilingual Unsupervised and Supervised Embeddings), which achieves “word translation” by aligning the distributions of source and target word embeddings without relying on any parallel corpora. While this model achieves favorable results within Indo-European languages, it exhibits limitations when performing cross-family embeddings between Chinese and Indo-European languages.

Building upon MUSE, this paper introduces a machine translation method, focusing on combining domain adaptation and deep feature learning within a single training process. We propose the MUSE-Additional Pseudo Embedding (M-APE) model.

The M-APE model can be decomposed into three steps: cross-lingual adversarial learning, Procrustes solution fine-tuning, and pseudo-embedding with vector optimization. These correspond to the generation, adjustment, and optimization of the shared semantic space, respectively. Since the input data for the M-APE model consists of Chinese word embeddings that have undergone transfer learning (as described in Section 3) for the public culture domain, the FastText model serves as a necessary and fundamental component of the Chinese-English-French cross-lingual shared semantic model. This paper refers to the output cross-lingual embedding space as the Cross-Lingual Embedding—Additional Pseudo Embedding (CLE-APE) space, as illustrated in Figure 3.

4.2. Construction Steps of the Chinese-English-French Cross-Lingual Shared Semantic Model

4.2.1. Shared Semantic Space Generation

Generally, learning the cross-lingual transfer of word embeddings can be viewed as a domain transfer problem, where each domain is the vocabulary set of one language, either the source or the target [34]. This paper employs a linear projection with a transformation matrix W to convert the vector space of one language into that of another. For domain transfer of word embeddings, Ganin et al. [35] proposed an adaptive representation learning method that uses Domain Adversarial Learning to obtain latent representations invariant to the input domain. Accordingly, an initial proxy for W can be learned using an adversarial criterion.

During the training of the Generative Adversarial Network (GAN), the transformation matrix W (of size 300 × 300) was initialized as an identity matrix. For the discriminator, we constructed a Multi-Layer Perceptron (MLP) featuring two hidden layers with 2048 units each, employing Leaky-ReLU as the activation function. To enhance model robustness, a dropout noise of 0.1 was applied to the discriminator’s input, and a smoothing coefficient of 0.2 was added to its output.

The adversarial learning process was optimized using Stochastic Gradient Descent (SGD) with the following configurations:

(1) Batch Size: 32;

(2) Learning Rate: Initially set to 0.1, with a decay rate of 0.95;

(3) Learning Rate Shrinkage: When the supervised validation criterion decreased, the learning rate was shrunk by a factor of 0.5 to mitigate oscillations in gradient descent.

The training process covered the entire dataset and was iteratively repeated over different mini-batches until the parameters reached full convergence.
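The learning-rate configuration listed above can be sketched as a small update rule. How decay and shrinkage interact is an assumption here: the rate is multiplied by 0.95 after every epoch and additionally by 0.5 whenever the validation criterion falls below its best value so far. The function name and the validation values are illustrative.

```python
def next_learning_rate(lr, criterion, best_criterion, decay=0.95, shrink=0.5):
    """Return (new_lr, new_best) after one epoch.

    The rate is multiplied by `decay` every epoch and additionally by
    `shrink` whenever the validation criterion drops below its best value.
    """
    lr *= decay
    if best_criterion is not None and criterion < best_criterion:
        lr *= shrink
    best = criterion if best_criterion is None else max(best_criterion, criterion)
    return lr, best


lr, best = 0.1, None
history = []
for criterion in [0.40, 0.45, 0.43, 0.47]:  # made-up validation values
    lr, best = next_learning_rate(lr, criterion, best)
    history.append(lr)
```

In this run the third epoch is the only one in which the criterion decreases, so it is the only epoch in which the rate is halved in addition to being decayed.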

To provide a clear overview and facilitate the reproducibility of the proposed M-APE model, the key architectural configurations and hyperparameters utilized during the training process are summarized in Table 3.

$$W^{*} = \underset{W \in M_{d}(\mathbb{R})}{\operatorname{argmin}} \, \lVert WX - Y \rVert_{F} \tag{1}$$

Let $X = \{x_{1}, \dots, x_{n}\}$ and $Y = \{y_{1}, \dots, y_{m}\}$ be two sets of n and m word embeddings from the source and target languages, respectively. Let $Wx_{i}$ be the mapping of the source language word embedding $x_{i}$ into the semantic space shared with the target language. By randomly sampling elements from the mapping $WX = \{Wx_{1}, \dots, Wx_{n}\}$ and from Y, a model is trained to distinguish elements from the two sources; this model is called a Discriminator. Let the parameters of the discriminator be $\theta_{D}$. The probability that the discriminator judges vector z as a mapping of a source language word embedding is $P_{\theta_{D}}(\mathrm{source} = 1 \mid z)$, and the probability that it judges vector z as a target language word embedding is $P_{\theta_{D}}(\mathrm{source} = 0 \mid z)$. Thus, the cost function of the discriminator can be expressed as the sum of misclassification losses:

$$L_{D}(\theta_{D} \mid W) = -\frac{1}{n} \sum_{i=1}^{n} \log P_{\theta_{D}}(\mathrm{source} = 1 \mid Wx_{i}) - \frac{1}{m} \sum_{i=1}^{m} \log P_{\theta_{D}}(\mathrm{source} = 0 \mid y_{i}) \tag{2}$$

On the other hand, unsupervised learning trains W to confuse the discriminator, preventing it from making accurate predictions, thereby continuously making WX more similar to Y. The cost function for the mapping is:

$$L_{W}(W \mid \theta_{D}) = -\frac{1}{n} \sum_{i=1}^{n} \log P_{\theta_{D}}(\mathrm{source} = 0 \mid Wx_{i}) - \frac{1}{m} \sum_{i=1}^{m} \log P_{\theta_{D}}(\mathrm{source} = 1 \mid y_{i}) \tag{3}$$

Consequently, this constitutes a Two-Player Game: the discriminator aims to maximize its ability to identify the source of word embeddings (WX or Y), while W aims to prevent effective identification by the discriminator by making WX and Y as similar as possible.
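As a numerical illustration of Eqs. (2) and (3), the sketch below evaluates both losses from discriminator outputs. It assumes the discriminator returns P(source = 1 | z) for each sampled vector; the function names and probability values are illustrative and are not taken from the M-APE implementation.

```python
import math


def discriminator_loss(p_src_mapped, p_src_target):
    """Eq. (2): loss of the discriminator.

    p_src_mapped[i] = P(source = 1 | W x_i) for mapped source embeddings
    p_src_target[j] = P(source = 1 | y_j)   for target embeddings
    """
    n, m = len(p_src_mapped), len(p_src_target)
    mapped_term = -sum(math.log(p) for p in p_src_mapped) / n
    target_term = -sum(math.log(1.0 - p) for p in p_src_target) / m
    return mapped_term + target_term


def mapping_loss(p_src_mapped, p_src_target):
    """Eq. (3): the same terms with the labels swapped, minimised over W."""
    n, m = len(p_src_mapped), len(p_src_target)
    mapped_term = -sum(math.log(1.0 - p) for p in p_src_mapped) / n
    target_term = -sum(math.log(p) for p in p_src_target) / m
    return mapped_term + target_term


# A confident discriminator: mapped vectors look "source", targets look "target".
confident = ([0.9, 0.8], [0.1, 0.2])
# A fully confused discriminator outputs 0.5 everywhere.
confused = ([0.5, 0.5], [0.5, 0.5])

l_d_confident = discriminator_loss(*confident)
l_w_confident = mapping_loss(*confident)
l_d_confused = discriminator_loss(*confused)
l_w_confused = mapping_loss(*confused)
```

When the discriminator is confident, its own loss is small while the mapping loss is large; at the point of full confusion both losses equal 2 log 2, which is the state that training W aims to approach.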

As shown in Figure 4, there are two word embedding distributions: blue English words denoted by X and red French words denoted by Y. The goal of adversarial learning is to align points with corresponding semantics in the two spaces. Each point represents a word in that space. The size of the point is positively correlated with the word frequency in the training corpus of that language. Through adversarial learning, a rotation matrix W is obtained. During training, word embeddings are randomly selected and fed into the discriminator to determine if the two embeddings originate from the same distribution.

4.2.2. Shared Semantic Space Adjustment

Since adversarial training results in higher alignment accuracy for high-frequency word pairs, these words are termed hubs, and their corresponding vectors are hub vectors. Under the assumption that the mapping is linear, it is optimal to infer the global mapping using only these hubs. Therefore, it is necessary to generate reliable matching pairs of high-frequency words between the two languages to serve as anchor points for Procrustes solution fine-tuning.

MUSE considers a bipartite neighborhood graph, in which each word in a given vocabulary is connected to its K nearest neighbors in the other language. Let $NN_{T}(Wx_{s})$ be the K nearest neighbors of the source language word embedding mapping $Wx_{s}$, all from the target language. Then, the average similarity between the mapped source language word embedding $Wx_{s}$ and its nearest neighbors in the target language is:

$$\mathrm{sim}_{T}(Wx_{s}) = \frac{1}{K} \sum_{y_{t} \in NN_{T}(Wx_{s})} \cos(Wx_{s}, y_{t}) \tag{4}$$

where $\cos(\cdot, \cdot)$ is the cosine similarity. Similarly, let $NN_{S}(y_{t})$ be the K nearest neighbors of the target language embedding $y_{t}$ among the mapped source language embeddings. Then, the average similarity between the target language word embedding $y_{t}$ and its nearest neighbors in the source language is:

$$\mathrm{sim}_{S}(y_{t}) = \frac{1}{K} \sum_{Wx_{s} \in NN_{S}(y_{t})} \cos(y_{t}, Wx_{s}) \tag{5}$$

Cross-Domain Similarity Local Scaling (CSLS) integrates $\mathrm{sim}_{T}(Wx_{s})$ and $\mathrm{sim}_{S}(y_{t})$ to measure the similarity between the source language word embedding mapping and the target language embedding. This is used to prune the high-frequency word bilingual dictionary, retaining only word pairs that satisfy the bidirectional translation property.

$$\mathrm{CSLS}(Wx_{s}, y_{t}) = 2 \cos(Wx_{s}, y_{t}) - \mathrm{sim}_{T}(Wx_{s}) - \mathrm{sim}_{S}(y_{t}) \tag{6}$$
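Eqs. (4) to (6) and the bidirectional pruning step can be sketched in pure Python as follows. The vectors are toy 2-dimensional values, K is set to 2 for readability, and the helper names are illustrative; an actual implementation would operate on 300-dimensional matrices with vectorized operations.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mean_top_k(query, candidates, k):
    """Average cosine similarity between `query` and its k nearest candidates."""
    top = sorted((cosine(query, c) for c in candidates), reverse=True)[:k]
    return sum(top) / len(top)


def csls(mapped_src, target, all_mapped_src, all_targets, k=2):
    sim_t = mean_top_k(mapped_src, all_targets, k)     # Eq. (4)
    sim_s = mean_top_k(target, all_mapped_src, k)      # Eq. (5)
    return 2 * cosine(mapped_src, target) - sim_t - sim_s  # Eq. (6)


def mutual_pairs(all_mapped_src, all_targets, k=2):
    """Keep (i, j) only if j is the best target for i and i the best source for j."""
    scores = [[csls(s, t, all_mapped_src, all_targets, k) for t in all_targets]
              for s in all_mapped_src]
    n_src, n_tgt = len(all_mapped_src), len(all_targets)
    best_t = [max(range(n_tgt), key=lambda j: scores[i][j]) for i in range(n_src)]
    best_s = [max(range(n_src), key=lambda i: scores[i][j]) for j in range(n_tgt)]
    return [(i, j) for i, j in enumerate(best_t) if best_s[j] == i]


# Toy mapped source vectors W x_s and target vectors y_t (illustrative values).
mapped_sources = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
targets = [[0.9, 0.1], [0.1, 0.9], [0.6, 0.8]]
pairs = mutual_pairs(mapped_sources, targets)
```

Subtracting the two neighborhood averages lowers the score of vectors that are close to many others, which is why CSLS tends to yield more reliable mutual pairs than plain cosine similarity.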

Procrustes solution fine-tuning changes the metric of the space via CSLS to generate a new bidirectional dictionary. Subsequently, adversarial training is applied to this generated dictionary to update W. The training results improve performance for lower-frequency words and increase similarity associated with isolated word embeddings.
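For reference, the refinement step takes its name from the Procrustes problem in Eq. (1). When W is additionally constrained to be orthogonal, as in MUSE (an assumption relative to Eq. (1), which is stated over all of $M_{d}(\mathbb{R})$), and X and Y denote the d × k matrices whose columns are the embeddings of the k dictionary pairs, the minimizer has a closed form obtained from a singular value decomposition:

```latex
W^{*} = \underset{W \in O_{d}(\mathbb{R})}{\operatorname{argmin}} \, \lVert WX - Y \rVert_{F} = U V^{\top},
\qquad \text{where } U \Sigma V^{\top} = \operatorname{SVD}\left(Y X^{\top}\right)
```

Because the solution is available in closed form, each refinement iteration only requires rebuilding the dictionary and computing one SVD of a d × d matrix.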

4.2.3. Shared Semantic Space Optimization

To design the trilingual shared space, this paper adopts the following strategy: generate an English-French shared semantic space and then obtain Chinese word embeddings for cross-lingual retrieval through a linear combination of the English and French word embeddings. This ensures that the newly generated vector is as similar as possible (i.e., has minimal difference) to the two existing vectors. The Chinese word embeddings generated in this way are termed Pseudo Embeddings, denoted as PE; this is the strategy proposed by M-APE, as shown in Figure 5. Closeness between vectors in a vector space can be measured by Euclidean distance, Manhattan distance, Chebyshev distance, or cosine similarity.

(1) Pseudo Embedding:

Existing research indicates that cosine similarity is the more advantageous metric for retrieval tasks and for determining the text most similar to a given document. This paper therefore connects the two existing vectors and selects the point on the connecting line at which the angles to both vectors are identical. Among the points of minimum cosine distance, this is the one with equal cosine similarity to the two vectors, and it also keeps the sum of Euclidean distances between the three pairs at a minimum. PE can be expressed as:

\mathrm{PE}_{k} = (Wx)_{k} + \lambda \, (y_{k} - (Wx)_{k})

(7)

\text{and} \quad \frac{\sum_{k=1}^{n} \mathrm{PE}_{k} (Wx)_{k}}{\sqrt{\sum_{k=1}^{n} \mathrm{PE}_{k}^{2}} \sqrt{\sum_{k=1}^{n} (Wx)_{k}^{2}}} = \frac{\sum_{k=1}^{n} \mathrm{PE}_{k} y_{k}}{\sqrt{\sum_{k=1}^{n} \mathrm{PE}_{k}^{2}} \sqrt{\sum_{k=1}^{n} y_{k}^{2}}}

(8)

\text{where} \quad \lambda \in [0, 1], \quad k = 1, 2, \dots, n

(9)

Therefore, for a specific Chinese word t, the steps to obtain its Chinese word embedding PE_t in the shared semantic space are as follows: first, obtain the corresponding French and English words for t via machine translation, thereby acquiring the French word vector Wx and the English word vector y in the shared semantic space. Then calculate λ using Equations (7)–(9), perform a linear combination of Wx and y, and thus determine the word embedding vector PE_t for t.
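As a minimal sketch of this step, the equal-cosine condition in Equation (8) can be solved in closed form when Wx and y are non-zero and not collinear: substituting Equation (7) gives λ = ‖Wx‖ / (‖Wx‖ + ‖y‖), which is the angle bisector of the two vectors. The source reports a cyclic traversal rather than a closed form, so the function below is an illustrative alternative under these assumptions, with hypothetical names and toy vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def pseudo_embedding(wx, y):
    """Return (PE, lambda) with PE = Wx + lambda * (y - Wx), as in Equation (7).

    Assumes wx and y are non-zero and not collinear, so that the
    equal-cosine condition of Equation (8) has a unique solution.
    """
    norm_wx = np.linalg.norm(wx)
    norm_y = np.linalg.norm(y)
    lam = norm_wx / (norm_wx + norm_y)  # always lies in [0, 1], Equation (9)
    return wx + lam * (y - wx), float(lam)

# Toy vectors (hypothetical) standing in for the French and English embeddings.
wx = np.array([3.0, 0.0, 4.0])
y = np.array([0.0, 2.0, 0.0])
pe, lam = pseudo_embedding(wx, y)
```

For these toy vectors, `cosine(pe, wx)` and `cosine(pe, y)` agree up to floating-point error, which is the property Equation (8) requires.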

(2): Neighbor Equidistant Generation Method:

In cross-lingual machine translation, a Lexical Gap refers to the phenomenon where a word existing in one language lacks a corresponding equivalent in another due to cultural and linguistic differences [36]. If a Chinese word lg cannot be interpreted in the English-French shared space due to a lexical gap, this paper proposes the Neighbor Equidistant Generation Method, as shown in Figure 6.

As shown in Figure 6, in the CTL-PC space (subgraph (a)), let sec denote the vertical intersection, i.e., the foot of the perpendicular dropped from word lg onto the line connecting n_i and n_j, and let distance(lg, sec) be the distance from lg to this base. In the CLE-APE space, let PE(sec) denote the corresponding vertical intersection of PE(lg) on the line connecting PE(n_i) and PE(n_j); the distance from PE(lg) to the base is then distance(PE(lg), PE(sec)). We can generate pseudo embeddings for words with lexical gaps using the vector triangle rule and similarity scaling.

\mathrm{PE}(\mathit{sec}) = \mathrm{PE}(n_{i}) + \frac{l_{1}}{l_{2}} \times (\mathrm{PE}(n_{j}) - \mathrm{PE}(n_{i}))

(10)

\mathrm{distance}(\mathrm{PE}(\mathit{lg}), \mathrm{PE}(\mathit{sec})) = \mathit{scale} \times \mathrm{distance}(\mathit{lg}, \mathit{sec}) = \frac{d_{i}^{'} + d_{j}^{'}}{d_{i} + d_{j}} \times \mathrm{distance}(\mathit{lg}, \mathit{sec})

(11)

\mathrm{PE}(\mathit{lg}) = \mathrm{PE}(\mathit{sec}) + \mathrm{distance}(\mathrm{PE}(\mathit{lg}), \mathrm{PE}(\mathit{sec}))

(12)

While Figure 6 provides a simplified two-dimensional visualization for intuitive understanding, Equations (10)–(12) are fundamentally derived from N-dimensional vector algebra, which maintains mathematical consistency in high-dimensional embedding spaces (e.g., 300 dimensions). In the M-APE framework, the ‘intersection point’ sec is calculated by projecting the source-space vector lg onto the hyperplane (or specifically, the vector segment) spanned by its nearest neighbors n_i and n_j. By utilizing cosine similarity as the optimization metric, which ensures that the generated vector PE(lg) - PE(sec) is orthogonal to the baseline PE(n_j) - PE(n_i), we preserve the local semantic topology during the transfer from CTL-PC to CLE-APE space. This iterative projection approach ensures that even in 300-dimensional space, the relative semantic distances captured in the domain-specific Chinese space are reconstructed within the trilingual shared space, thereby mitigating information loss caused by lexical gaps.

As indicated by the formula, the improved neighbor equidistant generation method normalizes the distance between identical word pairs in the two spaces through scaling. By multiplying d_{n_i - lg} by scaleRate, which represents the size ratio of the two spaces, it is possible to locate the vertical intersection point in a general direction, minimizing the impact caused by the difference in spaces.

In the implementation of the Pseudo-Embedding method and the neighbor equidistant generation method, this study utilized geometric relationship mapping within the vector space rather than direct numerical solving to ensure stability in engineering implementation. Specifically, when identifying the vertical intersection point sec in the CTL-PC space, we employed an iterative search with small-step movements to locate a point on the line such that the cosine similarity between the line connecting lg to this point and the base approached zero (falling below a set threshold). Similarly, during the general pseudo-embedding generation process, the optimal word embedding was determined through a cyclic traversal to ensure identical angles between the vectors.
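The search for the vertical intersection point can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the step size, the threshold, and the restriction of the search to the segment between n_i and n_j are all assumptions. A closed-form orthogonal projection is included for comparison; identifying its coefficient t with the ratio l_1/l_2 in Equation (10) is likewise an assumption.

```python
import numpy as np

def foot_closed_form(lg, n_i, n_j):
    """Orthogonal projection of lg onto the line through n_i and n_j."""
    base = n_j - n_i
    t = float(np.dot(lg - n_i, base) / np.dot(base, base))
    return n_i + t * base, t

def foot_iterative(lg, n_i, n_j, step=1e-3, threshold=1e-3):
    """Small-step search along the segment n_i -> n_j.

    Stops once the absolute cosine between (lg - point) and the base
    falls below `threshold`; otherwise returns the best point visited.
    """
    base = n_j - n_i
    base_norm = np.linalg.norm(base)
    best_t, best_abs_cos = 0.0, float("inf")
    for t in np.arange(0.0, 1.0 + step, step):
        point = n_i + t * base
        v = lg - point
        v_norm = np.linalg.norm(v)
        if v_norm == 0.0:  # lg lies on the segment itself
            return point, float(t)
        abs_cos = abs(float(np.dot(v, base)) / (v_norm * base_norm))
        if abs_cos < best_abs_cos:
            best_t, best_abs_cos = float(t), abs_cos
        if abs_cos < threshold:
            break
    return n_i + best_t * base, best_t

# Toy vectors (hypothetical) in a 3-dimensional source space.
lg = np.array([1.0, 2.0, 0.0])
n_i = np.array([0.0, 0.0, 0.0])
n_j = np.array([4.0, 0.0, 0.0])
sec_exact, t_exact = foot_closed_form(lg, n_i, n_j)
sec_search, t_search = foot_iterative(lg, n_i, n_j)
```

On well-conditioned inputs the two routines agree to within the step size, so the iterative variant mainly trades accuracy for the simplicity and robustness the text attributes to it.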

In summary, the process and operating mechanism of machine translation and vector optimization are illustrated in Figure 7.

4.3. Validation of the Chinese-English-French Cross-Lingual Shared Semantic Model

As mentioned above, constructing a cross-lingual shared semantic embedding space requires inputting word embeddings corresponding to Chinese, English, and French, using adversarial training to learn the mapping W from source to target space, and continuously iterating for optimization.

The characteristic of cross-lingual word embeddings in a shared semantic space lies in their ability to compare word meanings across languages. One of their primary use cases is Bilingual Dictionary Induction (BDI), which learns cross-lingual translation correspondences. This also facilitates core language technology development for resource-poor languages and domains and can be viewed as an intrinsic evaluation type of cross-lingual information retrieval [37].

The comparison of MUSE and M-APE experimental results in Table 4 shows that, for the cross-lingual shared semantic space between English and French, high-quality cross-lingual word embeddings can be obtained regardless of the strategy, with accuracy (×100) values hovering around 82.1.

The comparative data in Table 4 highlight the core contribution of the M-APE strategy. While the vanilla MUSE model (Strategy 1) maintains high accuracy for the EN-FR pair (82.3%), whose languages both belong to the Indo-European family, it suffers a drastic performance drop when handling Chinese as the source language (31.4% for ZH-EN). In contrast, M-APE (Strategy 2) elevates this accuracy to 84.6%, nearly matching the same-family performance. This contrast confirms that our pseudo-embedding and equidistant generation methods are specifically effective in bridging the lexical and structural gaps between Sino-Tibetan and Indo-European languages.

Strategy 2, the model optimized with machine translation and vector operations, achieves significant performance improvements in BDI tasks between Chinese and English/French. Particularly when Chinese is the source language and English/French are the target languages, it achieves results comparable to those obtained between languages of the same family. Regardless of whether Chinese is the source or target language, M-APE not only surpasses the MUSE model but also outperforms existing supervised models. Cross-lingual models are therefore no longer limited to same-family languages and gain greater cross-family generality.

For the two tasks CH-EN and CH-FR, where Chinese is the target language and the performance improvement is less pronounced, this paper sets K as the number of semantically closest words allowed to be returned and tests performance under K = 5 and K = 10, as shown in Table 5. As the number of allowed results increases, accuracy improves by over 10%, indicating that the model has the potential to reach a high performance ceiling after tuning.
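The BDI evaluation with K returned candidates can be sketched as below. The sketch assumes plain cosine nearest-neighbour retrieval and a gold dictionary mapping each source index to a set of acceptable target indices; the function name, the data layout, and the toy vectors are hypothetical and do not reproduce the paper's evaluation set.

```python
import numpy as np

def bdi_precision_at_k(src_vecs, tgt_vecs, gold, k=1):
    """Share of source words whose top-k nearest targets contain a gold translation.

    src_vecs: (n_src, dim) source embeddings already mapped into the shared space
    tgt_vecs: (n_tgt, dim) target embeddings
    gold:     dict {source index: set of correct target indices}
    """
    s = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    t = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sims = s @ t.T
    # Indices of the k most similar target words for every source word.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = sum(
        1 for i, answers in gold.items()
        if set(answers) & set(top_k[i].tolist())
    )
    return hits / len(gold)

# Toy data (hypothetical): two source words, three candidate target words.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[1.0, 0.1], [0.9, 0.5], [0.0, 1.0]])
gold = {0: {1}, 1: {2}}
p_at_1 = bdi_precision_at_k(src, tgt, gold, k=1)
p_at_2 = bdi_precision_at_k(src, tgt, gold, k=2)
```

In this toy case the score rises from 0.5 at K = 1 to 1.0 at K = 2, mirroring the way a larger K admits correct translations that are ranked just below the top position.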

4.4. Experimental Design for Cross-Lingual Information Retrieval in the Public Culture Domain

To verify the effectiveness of Chinese-English-French cross-lingual retrieval in the public culture domain based on the unsupervised cross-lingual shared semantic model, this study establishes two types of baselines:

Weak Baseline [39]: A baseline achievable by a simple model. We conduct monolingual experiments using Chinese, English, and French monolingual queries and document collections, respectively, and average the results to set a reference weak baseline for the cross-lingual information retrieval task.

Strong Baseline: A baseline achievable by an ideal model. We use machine translation-based Chinese-English-French cross-lingual information retrieval as the strong baseline.

This paper selects the following as experimental baselines:

Monolingual Information Retrieval for Chinese, English, and French vocabularies (hereinafter referred to as SLIR).

Monolingual Information Retrieval based on supervised monolingual word embeddings (hereinafter referred to as SESLIR).

Traditional machine translation-based Chinese-English-French cross-lingual information retrieval (hereinafter referred to as MCLIR).

Machine translation combined with supervised monolingual word embeddings for information retrieval (MCLIR + SESLIR).

Cross-lingual Information Retrieval based on supervised cross-lingual word embeddings (hereinafter referred to as SECLIR).

The retrieval performance of Cross-Lingual Information Retrieval based on the shared semantic model (hereinafter referred to as UECLIR) is compared with the above five experimental baselines.
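To make the idea of retrieval in a shared semantic space concrete, the following sketch ranks documents of any language against a query by cosine similarity. The source does not specify how queries and documents are represented, so mean pooling of word vectors is an assumption here, as are the function names and the toy two-dimensional vectors.

```python
import numpy as np

def embed(tokens, vectors):
    """Mean-pool the shared-space vectors of known tokens and L2-normalize.

    Returns None when no token is in the vocabulary or the mean is zero.
    """
    known = [vectors[tok] for tok in tokens if tok in vectors]
    if not known:
        return None
    mean = np.mean(known, axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else None

def rank_documents(query_tokens, docs, vectors):
    """Return (doc_id, cosine score) pairs sorted from most to least similar."""
    q = embed(query_tokens, vectors)
    if q is None:
        return []
    scored = []
    for doc_id, tokens in docs.items():
        d = embed(tokens, vectors)
        if d is not None:
            scored.append((doc_id, float(np.dot(q, d))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy shared space (hypothetical vectors for English, Chinese, and French words).
vectors = {
    "art": np.array([1.0, 0.0]),
    "艺术": np.array([0.9, 0.1]),
    "santé": np.array([0.0, 1.0]),
    "health": np.array([0.1, 0.9]),
}
docs = {"zh_art": ["艺术"], "fr_health": ["santé"]}
ranking = rank_documents(["art"], docs, vectors)
```

Because all three languages share one vector space, an English query can score a Chinese or French document directly, without translating the query first.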

This paper adopts the text topic classifications from the Fudan University Corpus [40] and The New York Times News Corpus [41] as query words. Titles reflect topic content, and existing research typically determines article topics from titles and abstracts [42]. The top five categories ranked by the number of relevant documents (Art, History, Film, Health, and Education) are designated as Q1–Q5, respectively. The corresponding 736 public culture digital resource text documents serve as the target document set. Retrieval is performed on the document content excluding the title. The retrieval tasks are divided into 10 groups, which are executed separately.

The retrieval results are shown in Table 6, where QAVG represents the mean of the evaluation metrics across different queries. Figure 8 visualizes the mean values of different retrieval technologies across various evaluation metrics. The length of the bars represents performance quality, and different retrieval technologies are assigned different colors according to the order in Table 6.

The weak baselines SLIR and SESLIR, although achieving relatively ideal results on metrics requiring a small number of returned results like P@10 and P@20, have a cross-lingual document ratio of nearly 0% in the returned retrieval documents; they can only identify and return documents in the language corresponding to the query words. In cross-lingual retrieval tasks, all documents in languages corresponding to the formulated query are set as relevant documents; thus, the mAP values of SLIR and SESLIR are limited by their monolingual nature. Comparing SLIR and SESLIR, introducing monolingual word embeddings improves performance on almost every metric in monolingual contexts, but cross-lingual retrieval remains difficult to achieve.

MCLIR + SESLIR, on metrics requiring a small number of returned results and high precision like P@10 (0.84 vs. 0.92) and P@20 (0.86 vs. 0.87), does not demonstrate the advantage of word embeddings in terms of precision due to query expansion and generalization. However, it performs better on metrics with higher recall requirements and those measuring the ranking level of all relevant documents, such as R-prec (0.649 vs. 0.628) and mAP (0.651 vs. 0.613).

The performance metrics visualized in Figure 8 and detailed in Table 6 serve as a direct validation of the UECLIR (M-APE) framework’s efficacy in real-world retrieval scenarios. While the strong baseline MCLIR achieves high precision at top ranks (P@10), the UECLIR model consistently demonstrates superior performance in terms of mean average precision (mAP) and the overall AVG metric (0.764). This indicates that, by shifting from formal translation to shared semantic matching, the system not only retrieves exact keyword matches but also effectively ranks semantically relevant documents across different languages that traditional translation-based methods might overlook due to ‘translation drift’.

The improved cross-lingual information retrieval breaks free from the constraints of specific terms, achieving semantic matching. This improves translation accuracy during query translation and effectively resolves phenomena such as translation drift and vocabulary mismatch. It demonstrates good performance across various retrieval metrics for Chinese-English-French cross-lingual retrieval based on public culture corpora.
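For reference, the ranking metrics discussed in this section (P@K, R-prec, and mAP) can be computed as in the sketch below. It uses the standard textbook definitions with binary relevance; the function names and the toy ranking are illustrative and are not taken from the experiments reported here.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k returned documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r

def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """Average AP over a list of (ranked list, relevant set) pairs, one per query."""
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)

# Toy ranking (hypothetical document ids); "d6" is relevant but never retrieved.
ranked = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
p2 = precision_at_k(ranked, relevant, 2)
rp = r_precision(ranked, relevant)
ap = average_precision(ranked, relevant)
```

P@K only inspects the head of the ranking, whereas R-prec and mAP penalize relevant documents that are ranked low or missed entirely, which is why a monolingual system can score well on P@10 yet poorly on mAP in a cross-lingual setting.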

5. Discussion

5.1. Enhancing Language Scalability and Domain Flexibility

Although the Chinese-English-French cross-lingual retrieval framework for the public culture domain serves as an application of the shared semantic model in cross-lingual information retrieval (CLIR), the shared semantic model solution we propose is, in essence, a technical foundation. It can be utilized to develop CLIR systems that are both domain-flexible and language-scalable.

We use the term “Domain Flexible” to reflect the solution’s capability to transfer across multiple domains; being targeted at a specific domain does not inherently conflict with the capability for transfer. Specifically, in this study, the shared semantic model is tailored to the public culture domain. Through public culture corpora, we performed transfer learning on an existing Chinese semantic space, enabling it to capture semantic relationships between words within the context of public culture. When this space is input into the shared semantic model, these semantic relationships are “replicated” or “mapped” into the Chinese-English-French shared semantic space, allowing them to be utilized for relevance calculation during cross-lingual retrieval.

We use the term “Language Scalable” to reflect the solution’s capability to extend to languages beyond Chinese, English, and French. The algorithmic calculations of the Pseudo Embedding method and the Neighbor Equidistant Generation method proposed in this paper are currently limited to trilingual scenarios. However, on the one hand, the three languages involved can be replaced arbitrarily. On the other hand, and more importantly, the approach of introducing machine translation to generate pseudo-embeddings and incorporating geometric derivation into cross-lingual embeddings when generating cross-family shared semantic spaces can be generalized to many other languages. While this may impose higher requirements on mathematical theory, in the long run it plays a foundational role in establishing a paradigm for constructing shared models across varying numbers of languages.

5.2. Comparative Interpretation of Semantic Alignment Performance

Experimental results demonstrate a significant performance leap of the M-APE model over baseline unsupervised methods like MUSE, particularly in cross-family language pairs. While previous studies often struggled with the feature space heterogeneity between Sino-Tibetan and Indo-European languages, our findings suggest that the integration of Pseudo-Embeddings (PE) acts as a “semantic anchor” that bridges the structural gap. By leveraging the geometric regularities of the shared space, M-APE achieves a conversion accuracy of 56.6%, which is comparable to some supervised models. This suggests that for specialized fields where parallel corpora are non-existent, geometric derivation within a trilingual shared space is a robust alternative to traditional formal translation methods.

The superior performance of UECLIR (M-APE) in precision metrics (P@10 and P@20) compared to the MCLIR + SESLIR baseline is attributed to its focus on holistic semantic clusters rather than literal word-to-word translation. By ensuring that identical concepts across languages are positioned closely in the 300-dimensional shared space, the model effectively eliminates “translation drift,” allowing it to prioritize documents within the core semantic cluster. However, the mixed results in recall-oriented metrics (R-prec and mAP) highlight that semantic-based retrieval is more “selective” than keyword-based methods. While vector similarity ensures high relevance for top-ranked results, it may occasionally overlook documents relying on fringe vocabulary that sits at the periphery of the trained semantic clusters, especially when cultural dimensions are filtered to resolve lexical gaps.

5.3. Improving Cross-Lingual Information Retrieval and Services in the Public Culture Domain

The CLIR framework proposed in this paper can be constructed for specific domains. Taking the public culture domain as an example, the corpora, queries, and document collections used in this study are all derived from the public culture domain, and domain information is embedded into the shared space following transfer learning on China’s National Public Cultural Cloud. Such smaller, localized frameworks oriented towards low-resource domains have been proven necessary for the foreseeable future [43]. For example, a foreign language search for “opera” on the National Public Cultural Cloud typically returns results only if the keyword is in the title, whereas M-APE identifies a much wider range of semantically relevant resources.

Furthermore, the model effectively bridges lexical gaps through multidimensional semantic mapping. For culturally specific terms like “Lion Dance” (舞狮), the system extracts multiple semantic dimensions such as “acrobatics,” “celebration,” and “dance”. Foreign users can successfully retrieve related Chinese documents using these descriptive combinations, with experimental results showing that 50% of the top 10 results are relevant. The system also maintains high semantic consistency; for terms like “Museum,” the top results are nearly identical across languages, while for broader terms like “Concert,” it provides diverse but valid documents tailored to each language.

The application scheme on the National Public Cultural Cloud can be designed to include periodic online updates to the corpus based on the latest resources. Considering the effectiveness of shared language models, the system can recommend semantically similar Chinese query strings and definitions to foreign users, facilitating the understanding of both queries and results. This provides robust support for improving CLIR functions and multilingual resource discovery on public service platforms.

5.4. Limitations and Future Directions

Despite the promising performance of UECLIR in our retrieval tasks, we must acknowledge a critical limitation regarding the experimental dataset. Due to the scarcity of authentic multilingual parallel resources in the public cultural domain, the target document set was constructed using machine-translated versions of the National Public Cultural Cloud data. This resulted in a “pseudo-multilingual” dataset that serves as a synthetic testing environment. While such datasets may inherit the systematic biases or stylistic uniformity inherent in machine translation engines, potentially inflating the semantic alignment scores, their use in this study is strategically justified as a controlled feasibility proof.

By utilizing translated documents, we maintained semantic parity across different languages, which allowed us to isolate and rigorously evaluate the M-APE model’s core mechanism—specifically its ability to achieve geometric alignment and bridge lexical gaps between Sino-Tibetan and Indo-European language families without supervised signals. The significant improvements observed in the CTL-PC space (Table 2) and retrieval metrics (Table 6) indicate that the model has successfully captured the underlying semantic structures of the public culture domain, which is a prerequisite for handling real-world, high-variance documents.

5.5. Strengthening the Evaluation of Readability in Cross-Lingual Information Retrieval

For ordinary users, a typical challenge frequently faced in CLIR and domain-specific information retrieval is that search results are often a mixture of resources with varying levels of readability, whereas users prefer to encounter highly readable resources in the top results [44].

Word embeddings can better assist in evaluating readability because they are typically designed to capture the semantics and topics of words. Since topics can serve as good indicators of whether a document is difficult to understand, embedding layers can be used within deep neural network architectures to determine the readability level of a given document. However, such studies primarily focus on Latin-script languages such as English and are not easily portable to other languages, particularly Asian languages.

The shared semantic space generated by the shared semantic model can ameliorate the issue where current word embedding-based readability research relies heavily on specific languages, shifting the focus from language-specific to multilingual transfer applicability. Since the shared semantic space has learned the domain knowledge captured by monolingual models, it is possible to further explore how domain-specific concepts contained within documents affect their readability.

6. Conclusions

This paper addresses the lack of semantic information in current machine translation-based cross-lingual information retrieval by proposing the M-APE shared semantic model. We utilize adversarial training to learn a linear mapping from source to target space, generating a shared semantic space composed of cross-lingual word embeddings. We then extract a synthetic dictionary from the obtained shared embedding space and use the Procrustes solution to adjust the shared semantic space. Finally, we optimize the shared semantic space using the Pseudo Embedding method and the Neighbor Equidistant Generation method.

Taking the public culture domain as a case study, this paper constructs a Chinese-English-French cross-lingual information retrieval framework and applies the shared semantic model to CLIR tasks within this domain. Through comparative analysis with experimental baselines and case studies, the proposed CLIR method improves retrieval performance and enhances the semantic association between retrieval results and query strings to a certain extent, providing support for CLIR in domains such as public culture.

Due to the scarcity of multilingual parallel resources in the public culture domain, and to facilitate simulated experiments for CLIR, this paper utilized machine translation to convert Chinese text crawled from the National Public Cultural Cloud (and the subsequently established Chinese corpus) into English and French text. This generated pseudo-multilingual data, which was populated into the target document set; consequently, the multilingual resources remain limited to the context of Chinese public culture. The updating and refinement of corpora will therefore remain an ongoing task in future research. Compared with the construction of monolingual digital resource corpora in the public culture domain, future work demands a greater focus on the ability to generate, link, and acquire genuine multilingual public culture resources.

Author Contributions

Conceptualization, Z.X. and D.W.; methodology, Z.X. and S.L. (Shaobo Liang); software, Z.X. and S.L. (Siyu Lv); validation, Z.X. and S.L. (Siyu Lv); formal analysis, Z.X.; investigation, Z.X.; data curation, Z.X. and S.L. (Siyu Lv); writing—original draft preparation, Z.X.; writing—review and editing, S.L. (Shaobo Liang) and D.W.; visualization, Z.X. and S.L. (Siyu Lv); supervision, S.L. (Shaobo Liang) and D.W.; project administration, D.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 92370112, No. 72574176), Innovative Research Group Project of Hubei Provincial Natural Science Foundation (No. 2023AFA012), and the Natural Science Foundation of Hubei Province (No. 2025AFB770).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Litschko, R.; Vulić, I.; Glavaš, G. On cross-lingual retrieval with multilingual text encoders. Inf. Retr. J. 2022, 25, 149–183.
2. Cheon, J.; Ko, Y. Parallel sentence extraction to improve cross-language information retrieval from Wikipedia. J. Inf. Sci. 2021, 47, 281–293.
3. Agarwal, S.; Barry, J.; Boschee, E.; Miller, S. What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models. arXiv 2025, arXiv:2511.19324.
4. Wang, Y.; Wu, A.; Neubig, G. English contrastive learning can learn universal cross-lingual sentence embeddings. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; pp. 9122–9133.
5. Goworek, R.; Macmillan-Scott, O.; Özyiğit, E.B. Bridging language gaps: Advances in cross-lingual information retrieval with multilingual llms. arXiv 2025, arXiv:2510.00908.
6. Lawrie, D.; Mayfield, J.; Yang, E.; Yates, A.; MacAvaney, S.; Pradeep, R.; Miller, S.; McNamee, P.; Soldani, L. NeuCLIRBench: A modern evaluation collection for cross-language and multilingual information retrieval. arXiv 2025, arXiv:2511.14758.
7. Litschko, R.; Kraus, O.; Blaschke, V.; Plank, B. Cross-dialect information retrieval: Information access in low-resource and high-variance languages. In Proceedings of the 31st International Conference on Computational Linguistics, Abu Dhabi, United Arab Emirates, 19–24 January 2025; pp. 10158–10171.
8. Oro, E.; Granata, F.M.; Ruffolo, M. A comprehensive evaluation of embedding models and LLMs for IR and QA across English and Italian. Big Data Cogn. Comput. 2025, 9, 141.
9. Sun, G.; Deng, Y.; Liang, S.; Wu, D. Chinese-Tibetan Bilingual Knowledge Organization in the Cultural Heritage Domain: A Practice for Traditional Tibetan Festivals. Proc. Assoc. Inf. Sci. Technol. 2023, 60, 1137–1139.
10. Elayeb, B. Arabic word sense disambiguation: A review. Artif. Intell. Rev. 2019, 52, 2475–2532.
11. Wu, D.; Fan, S.; Yao, S.; Xu, S. An exploration of ethnic minorities’ needs for multilingual information access of public digital cultural services. J. Doc. 2023, 79, 1–20.
12. Tognola, G.; Murri, A.; Cuda, D. Cognitive computing for the automated extraction and meaningful use of health data in narrative medical notes: An application to the clinical management of hearing impaired aged patients. In Proceedings of the 2018 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Las Vegas, NV, USA, 4–7 March 2018; pp. 299–302.
13. Litschko, R.; Vulić, I.; Glavaš, G. Parameter-efficient neural reranking for cross-lingual and multilingual retrieval. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 1071–1082.
14. Raza, M.A.; Mokhtar, R.; Ahmad, N.; Pasha, M.; Pasha, U. A taxonomy and survey of semantic approaches for query expansion. IEEE Access 2019, 7, 17823–17833.
15. Calegari, S.; Sanchez, E. A fuzzy ontology-approach to improve semantic information retrieval. URSW 2007, 327, 1–6.
16. Lee, C.S.; Jian, Z.W.; Huang, L.K. A fuzzy ontology and its application to news summarization. IEEE Trans. Syst. Man Cybern. B 2005, 35, 859–880.
17. Abulaish, M. An ontology enhancement framework to accommodate imprecise concepts and relations. J. Emerg. Technol. Web Intell. 2009, 1, 22–36.
18. Jain, S.; Seeja, K.R.; Jindal, R. A fuzzy ontology framework in information retrieval using semantic query expansion. Int. J. Inf. Manag. Data Insights 2021, 1, 100009.
19. Polley, S.; Koparde, R.R.; Gowri, A.B.; Perera, M.; Nuernberger, A. Towards trustworthiness in the context of explainable search. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 11–15 July 2021; pp. 2580–2584.
20. Abu-Rasheed, H.; Weber, C.; Zenkert, J.; Krumm, R.; Fathi, M. Explainable graph-based search for lessons-learned documents in the semiconductor industry. In Intelligent Computing; Springer: Cham, Switzerland, 2022; pp. 1097–1106.
21. Yang, Z. Biomedical information retrieval incorporating knowledge graph for explainable precision medicine. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual, 25–30 July 2020; p. 2486.
22. Abu-Rasheed, H.; Weber, C.; Zenkert, J.; Dornhöfer, M.; Fathi, M. Transferrable framework based on knowledge graphs for generating explainable results in domain-specific intelligent information retrieval. Informatics 2022, 9, 6.
23. Majeed, A.; Mujtaba, H.; Beg, M.O. Emotion detection in roman urdu text using machine learning. In Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, Virtual, 21–25 September 2020; pp. 125–130.
24. Chen, J.; Xiao, S.; Zhang, P.; Luo, K.; Lian, D.; Liu, Z. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv 2024, arXiv:2402.03216.
25. Khan, S.U.R.; Farooq, M.U.; Beg, M.O. Big data analysis of stack overflow for energy consumption of android framework. In Proceedings of the 2019 International Conference on Innovative Computing (ICIC), Lahore, Pakistan, 1–2 November 2019; pp. 1–9.
26. Guo, P.; Wei, X.; Hu, Y.; Yang, B.; Liu, D.; Huang, F.; Xie, J. EMMA-X: An EM-like multilingual pre-training algorithm for cross-lingual representation learning. Adv. Neural Inf. Process. Syst. NeurIPS 2023, 36, 1–15.
27. Hämmerl, K.; Libovický, J.; Fraser, A. Understanding cross-lingual alignment—A survey. In Findings of the Association for Computational Linguistics: ACL 2024, Bangkok, Thailand, 11–16 August 2024; pp. 10922–10943.
28. Gudivada, A.; Tabrizi, N. A literature review on machine learning based medical information retrieval systems. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI), Bangalore, India, 18–21 November 2018; pp. 250–257.
29. Kotei, E.; Thirunavukarasu, R. A systematic review of transformer-based pre-trained language models through self-supervised learning. Information 2023, 14, 187.
30. Zhang, X.; Le Cun, Y. Which encoding is best for text classification in Chinese, English, Japanese and Korean? arXiv 2017, arXiv:1708.02657.
31. Mozaffari, M.H.; Lee, W.S. Domain adaptation for ultrasound tongue contour extraction using transfer learning: A deep learning approach. J. Acoust. Soc. Am. 2019, 146, 431–437.
32. Kowsher, M.; Sobuj, M.S.; Shahriar, M.F.; Prottasha, N.J.; Arefin, M.S.; Dhar, P.K.; Koshiba, T. An enhanced neural word embedding model for transfer learning. Appl. Sci. 2022, 12, 2848.
33. Forney, G.D. The Viterbi algorithm. Proc. IEEE 1973, 61, 268–278.
34. Huang, Z.; Yu, P.; Allan, J. Cross-lingual knowledge transfer via distillation for multilingual information retrieval. arXiv 2023, arXiv:2302.13400.
35. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; March, M.; Lempitsky, V. Domain-adversarial training of neural networks. J. Mach. Learn. Res. 2016, 17, 1–35.
36. Ebrahimi, A.; von der Wense, K. Zero-shot vs. translation-based cross-lingual transfer: The case of lexical gaps. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers); Association for Computational Linguistics: Mexico City, Mexico, 2024; pp. 443–458.
37. Karan, M.; Vulić, I.; Korhonen, A.; Glavaš, G. Classification-based self-learning for weakly supervised bilingual lexicon induction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; pp. 6915–6922.
38. Xu, R.; Yang, Y.; Otani, N.; Wu, Y. Unsupervised cross-lingual transfer of word embedding spaces. arXiv 2018, arXiv:1809.03633.
39. Lin, J. The neural hype and comparisons against weak baselines. ACM SIGIR Forum 2019, 52, 40–51.
40. Qiu, X.; Zhang, Q.; Huang, X.J. Fudannlp: A toolkit for chinese natural language processing. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Sofia, Bulgaria, 4–9 August 2013; pp. 49–54.
41. TextData.cn. New York Times News Dataset from 2000 to 2025. 2025. Available online: https://textdata.cn/blog/2025-03-05-nytimes-news-dataset-from-2000-to-2025/ (accessed on 2 June 2025).
42. Abdelrazek, A.; Eid, Y.; Gawish, E.; Medhat, W.; Hassan, A. Topic modeling algorithms and applications: A survey. Inf. Syst. 2023, 112, 102131.
43. Pohorec, S.; Verlič, M.; Zorman, M. Domain specific information retrieval system. In Proceedings of the 8th WSEAS International Conference on Computer, Hangzhou, China, 20–22 May 2009; pp. 465–469.
44. Yan, X.; Song, D.; Li, X. Concept-based document readability in domain specific information retrieval. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, Arlington, VA, USA, 6–11 November 2006; pp. 540–549.

Figure 1. FastText Pre-trained Model Architecture.

Figure 2. Transfer training model built on public cultural corpus. The model consists of three main stages: (1) Public Cultural Corpus Preprocessing, where raw text is cleaned and segmented; (2) the Transfer Training process, which involves word embedding and aggregation via a stacking average method; and (3) the Generation of Chinese Word Embeddings, which includes exporting the vectors to a .vec file and updating the dictionary to create traditional and simplified Chinese versions.

Figure 3. Overall framework of the M-APE model.

Figure 4. Schematic diagram of cross-lingual adversarial learning. The distributions of word embeddings are shown, where blue denotes English words and red denotes French words. The objective is to align semantically equivalent points across the two spaces. Each point represents a word in that space.

Figure 5. Schematic diagram of pseudo-embedding and vector operation optimization. The blue and red points represent English and French word embeddings in the shared semantic space, respectively. The green points represent the generated Chinese pseudo-embeddings. The diagram illustrates how vector operations align these embeddings to optimize cross-lingual retrieval.

Figure 6. Improved version of the neighbor equidistant generation method. (a) The geometric relationship in the original CTL-PC space; (b) the generation of pseudo-embeddings in the CLE-APE space.

Figure 7. Flowchart of pseudo-embedding and equidistant generation operation.

Figure 8. QAVG of different retrieval techniques on different evaluation metrics.

Table 1. Statistics of Resource Types on the Public Cultural Cloud Platform.

Reading Section Book Type	Count	Venue Booking Section Venue Category	Count	Activity Section Venue Category	Count
Chinese and Foreign Classics	67	Cultural Center	3160	Public Culture and Arts
Children’s Literature	62	Library	1348	COVID-19 Prevention Works	20
Literature	54	Tourist Attraction	391	Works on Building a Moderately Prosperous Society in All Respects	15
Classic Chinese Studies	46	Museum	290	Works on Poverty Alleviation	8
Poetry and Prose	42	Art Museum	70	Folk Culture and Arts
Fiction and Stories	40	City Book Bar	27	Works Exhibition	76
Humanities and History	36	Visitor Center	12	Folk Custom Showcase	55
Biography	23			Art Heritage	10
Art	17			Unclassified	17,926
Local Chronicles	16
Others	14
Chinese History	13
Essays and Prose	10
History of Literature	4
Literary Criticism and Appreciation	1
Total	430	Total	5298	Total	18,100

Table 2. Similarity Ranking Results Before and After Transfer Training.

Similarity Rank	Library (Before Transfer)	Similarity Score	Library (After Transfer)	Similarity Score	Museum (Before Transfer)	Similarity Score	Museum (After Transfer)	Similarity Score
1	Branch Library	0.936	Reading Room	0.812	Art Museum	0.923	Art Museum	0.804
2	Collection	0.926	Library Science	0.804	Collection	0.922	Natural History Museum	0.806
3	Library (archaic)	0.925	Branch Library	0.800	Inside the Museum	0.916	Museum (abbr.)	0.778
4	Reading Room	0.923	Library Catalog	0.788	Museum (abbr.)	0.916	Exhibition Hall	0.764
5	Book	0.921	Library (archaic)	0.772	Open to the Public	0.915	Artwork	0.764
6	Book Collection	0.918	Reading Room	0.771	Branch Museum	0.914	Branch Museum	0.747
7	New Library	0.913	Collection Size	0.765	Exhibition	0.913	Collection	0.746
8	Library Building	0.911	Ancient Library	0.756	Gallery	0.913	Botanical Garden	0.745
9	Various Libraries	0.910	Collection	0.753	Museum Director	0.911	Memorial Hall	0.743
10	Borrowing	0.909	Book	0.752	History Museum	0.910	Artwork	0.742

Table 3. Summary of Hyperparameters and Architectural Configurations for the M-APE Model.

Category	Parameter	Value/Description
Embeddings	Vector Dimension	300
Discriminator	Number of Hidden Layers	2
	Hidden Layer Size	2048
	Activation Function	Leaky-ReLU
	Input Dropout	0.1
	Output Smoothing Coefficient	0.2
Training	Optimizer	SGD
	Batch Size	32
	Initial Learning Rate	0.1
	Learning Rate Decay Rate	0.95
	Learning Rate Shrinkage Factor	0.5
	Mapping Matrix $W$ Initialization	Identity Matrix
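
The settings in Table 3 can be gathered into a small configuration object, which also makes the learning-rate behaviour easier to reason about. The Python sketch below is illustrative only: the class name `MapeConfig`, its field names, and both helper functions are hypothetical, and two rules are assumptions rather than details stated in the table, namely that the decay rate is applied after every epoch and that the shrinkage factor is applied whenever a validation score drops. The smoothed discriminator targets are likewise one common reading of an output smoothing coefficient.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapeConfig:
    """Values copied from Table 3; the field names are illustrative."""
    emb_dim: int = 300
    dis_hidden_layers: int = 2
    dis_hidden_size: int = 2048
    dis_input_dropout: float = 0.1
    dis_smoothing: float = 0.2
    batch_size: int = 32
    initial_lr: float = 0.1
    lr_decay: float = 0.95
    lr_shrink: float = 0.5


def lr_schedule(cfg, validation_scores):
    """Return the learning rate used in each epoch.

    Assumption: the rate is multiplied by lr_decay after every epoch and
    additionally by lr_shrink whenever the validation score decreases.
    """
    lr = cfg.initial_lr
    previous = None
    rates = []
    for score in validation_scores:
        rates.append(lr)
        lr *= cfg.lr_decay
        if previous is not None and score < previous:
            lr *= cfg.lr_shrink
        previous = score
    return rates


def smoothed_targets(cfg):
    """Assumed discriminator targets after smoothing: (1 - s, s)."""
    return 1.0 - cfg.dis_smoothing, cfg.dis_smoothing


if __name__ == "__main__":
    cfg = MapeConfig()
    # Hypothetical validation scores, with a drop in the third epoch.
    print(lr_schedule(cfg, [0.50, 0.60, 0.55, 0.70]))
    print(smoothed_targets(cfg))
```

Under these assumptions, a single drop in the validation score roughly halves the rate on top of the regular 5% decay per epoch.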

Table 4. Bilingual Dictionary Induction Experiments for Cross-Lingual Embeddings between Chinese and Indo-European Languages under Different Strategies.

Model	EN-FR	FR-EN	ZH-EN	EN-ZH	ZH-FR	FR-ZH
Best Supervised Model [38]	81.1	82.4	49.9	45.4	-	-
MUSE (Strategy 1)	82.3	82.1	31.4	32.5	-	-
M-APE (Strategy 2)	82.3	82.1	84.6	61.8	84.5	55.4

Note: ZH represents Chinese, EN represents English, and FR represents French.
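
As a reading aid for the scores in Table 4, the following Python sketch shows how a bilingual dictionary induction accuracy can be computed in principle: each source word is mapped to its nearest target words in the shared space, and a hit is counted when a gold translation appears among the top k candidates. The function names, the toy two-dimensional vectors, and the small gold dictionary are hypothetical, and plain cosine similarity is assumed here; the retrieval criterion actually used in the experiments may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def induction_accuracy(src_vecs, tgt_vecs, gold, k=1):
    """Percentage of source words with a gold translation in the top k neighbours."""
    hits = 0
    for word, translations in gold.items():
        ranked = sorted(
            tgt_vecs,
            key=lambda t: cosine(src_vecs[word], tgt_vecs[t]),
            reverse=True,
        )
        if any(t in translations for t in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(gold)


# Toy English and French vectors, assumed to live in one shared space.
EN = {"library": [1.0, 0.1], "museum": [0.1, 1.0], "book": [0.9, 0.3]}
FR = {"bibliothèque": [0.9, 0.2], "musée": [0.2, 0.9], "livre": [0.8, 0.5]}
GOLD = {"library": {"bibliothèque"}, "museum": {"musée"}, "book": {"livre"}}

if __name__ == "__main__":
    print(induction_accuracy(EN, FR, GOLD, k=1))
    print(induction_accuracy(EN, FR, GOLD, k=2))
```

In this toy setup "book" is closer to "bibliothèque" than to "livre", so it counts as a miss at k = 1 and as a hit at k = 2, which illustrates why accuracy rises as more candidates are admitted.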

Table 5. Bilingual Dictionary Induction Experiments with Chinese as the Target Language.

	EN-ZH	FR-ZH	AVG
K5	74.7	73.7	74.2
K10	87.9	85.9	86.9
K15	95.3	91.9	93.6

Table 6. Information Retrieval Experimental Results.

Category	Retrieval Technique	Query List	P@10	P@20	R-prec	AP/mAP	AVG
Monolingual Information Retrieval	SLIR	Q1	1.000	0.983	0.525	0.689	0.799
		Q2	0.833	0.783	0.264	0.221	0.525
		Q3	0.733	0.783	0.461	0.386	0.591
		Q4	0.567	0.567	0.189	0.172	0.374
		Q5	0.733	0.533	0.233	0.196	0.424
		QAVG	0.773	0.730	0.334	0.333	0.543
	SESLIR	Q1	1.000	0.983	0.535	0.665	0.796
		Q2	0.733	0.750	0.297	0.241	0.505
		Q3	1.000	0.950	0.523	0.491	0.741
		Q4	0.533	0.533	0.182	0.160	0.352
		Q5	0.733	0.533	0.233	0.196	0.424
		QAVG	0.800	0.750	0.354	0.351	0.564
Cross-Lingual Information Retrieval	MCLIR	Q1	1.000	0.950	0.902	0.916	0.942
		Q2	0.900	0.900	0.490	0.495	0.696
		Q3	0.700	0.750	0.686	0.645	0.695
		Q4	1.000	0.950	0.545	0.488	0.746
		Q5	1.000	0.800	0.517	0.521	0.709
		QAVG	0.920	0.870	0.628	0.613	0.758
	SESLIR + MCLIR	Q1	1.000	1.000	0.865	0.883	0.937
		Q2	0.600	0.700	0.552	0.542	0.598
		Q3	1.000	1.000	0.784	0.871	0.914
		Q4	0.600	0.800	0.525	0.437	0.591
		Q5	1.000	0.800	0.517	0.521	0.709
		QAVG	0.840	0.860	0.649	0.651	0.750
	SECLIR	Q1	1.000	0.950	0.902	0.916	0.942
		Q2	0.700	0.700	0.490	0.495	0.596
		Q3	1.000	1.000	0.686	0.645	0.833
		Q4	0.800	0.900	0.545	0.488	0.683
		Q5	0.900	0.650	0.517	0.521	0.647
		QAVG	0.880	0.840	0.628	0.613	0.740
	UECLIR (M-APE)	Q1	1.000	1.000	0.857	0.883	0.935
		Q2	0.800	0.700	0.563	0.550	0.653
		Q3	1.000	0.900	0.725	0.801	0.857
		Q4	0.800	0.900	0.535	0.435	0.668
		Q5	1.000	0.800	0.517	0.521	0.709
		QAVG	0.920	0.860	0.640	0.638	0.764
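
For readers less familiar with the metrics in Table 6, the Python sketch below gives the standard definitions of precision at k, R-precision, and average precision for a single ranked result list, together with the row average. The function names and the toy document identifiers are hypothetical. The AVG column is consistent with an unweighted mean of the four metric columns (for example, the SLIR Q1 row), which is what `row_average` reproduces.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document occurs."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def row_average(p10, p20, r_prec, ap):
    """Unweighted mean of the four metric columns of one table row."""
    return (p10 + p20 + r_prec + ap) / 4


if __name__ == "__main__":
    ranked = ["d1", "d2", "d3", "d4", "d5"]   # toy ranking
    relevant = {"d1", "d3", "d6"}             # toy relevance judgements
    print(precision_at_k(ranked, relevant, 2))
    print(r_precision(ranked, relevant))
    print(average_precision(ranked, relevant))
    # Metric values taken from the SLIR Q1 row of Table 6.
    print(round(row_average(1.000, 0.983, 0.525, 0.689), 3))
```

Because relevant documents that are never retrieved still count in the denominators of R-precision and average precision, these two metrics can stay low even when P@10 is perfect, as in several rows of the table.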
