1. Introduction
The advent of large language models (LLMs) has profoundly reshaped the landscape of natural language processing (NLP), demonstrating remarkable capabilities in multilingual understanding and generation [1,2]. A cornerstone of their multilingual prowess lies in the ability to represent words and concepts from different languages within a shared or alignable semantic space, facilitating cross-lingual transfer for various downstream tasks [3,4,5]. Despite these achievements, the quality of the underlying cross-lingual representations, particularly for low-resource or typologically distant languages, remains critically dependent on effective methods for aligning semantic spaces [6]. A fundamental challenge obstructing robust cross-lingual transfer, especially in unsupervised or low-resource scenarios, is the pervasive issue of structural asymmetry (non-isomorphism) between the embedding spaces of different languages [7]. This asymmetry arises not only from inherent linguistic differences but also from disparities in training data quality and domain, which are amplified for underrepresented languages. As we probe the geometric properties of these high-dimensional representations, the interplay between symmetry and asymmetry emerges as a critical factor influencing model performance, particularly in unsupervised settings [8,9].
Historically, cross-lingual representation methods such as Bilingual Word Embeddings (BWEs) have relied on the isomorphism assumption [10]. This posits structural symmetry: word geometries (e.g., semantic distances and angles) are preserved across languages, typically via an orthogonal transformation. This presumed symmetry simplifies the learning problem, allowing for effective mapping strategies, particularly offline methods that align pre-trained monolingual embeddings [11,12]. While this assumption holds reasonably well for closely related languages with similar linguistic structures and substantial data overlap, its validity diminishes significantly for distant language pairs [13,14]. Consequently, applications relying on cross-lingual alignment, such as unsupervised machine translation for low-resource pairs, bilingual lexicon induction for specialized domains, or zero-shot cross-lingual transfer in critical tasks (e.g., public health information access), face significant performance degradation or even failure when the underlying isomorphism assumption is violated [15]. This performance gap highlights the urgent need for unsupervised methods specifically designed to handle structural asymmetry.
Increasing evidence points toward inherent structural asymmetry, or non-isomorphism, between embedding spaces, especially for languages that are typologically distant or culturally divergent [16]. This asymmetry arises from fundamental differences in linguistic structures (syntax, morphology), variations in cultural conceptualizations reflected in language use, and disparities in the size and domain distribution of the monolingual corpora used for training [17,18]. Such asymmetry manifests as inconsistencies in the geometric neighborhoods and relational structures within the embedding spaces, rendering the simple symmetric mapping assumption inadequate. Consequently, unsupervised methods predicated on isomorphism often experience a sharp decline in performance when confronted with these asymmetric realities [19]. Addressing this structural asymmetry is crucial for building truly robust multilingual systems, especially in low-resource scenarios where parallel data, the traditional remedy via supervised joint training [20,21], is scarce.
Existing attempts to mitigate this asymmetry challenge in unsupervised settings often involve iterative refinement [22] or leveraging synthetic parallel data generated by unsupervised machine translation systems for joint training [23]. While the latter approach moves toward joint learning, potentially better accommodating asymmetry, it introduces a dependency on the quality of the synthetic data, which can be noisy and error-prone, particularly for distant language pairs where translation itself is challenging [24]. These approaches, whether reliant on the fragile isomorphism assumption or on the noisy output of unsupervised machine translation systems, severely constrain the practical deployment of robust multilingual NLP solutions, particularly for the vast majority of language pairs lacking abundant parallel resources. This motivates the need for alternative unsupervised joint training paradigms that can directly handle structural asymmetry without relying on potentially flawed synthetic supervision or the overly strong symmetry assumption.
In this paper, we propose a novel unsupervised joint training methodology specifically designed to navigate the complexities of asymmetric, non-isomorphic embedding spaces. Our core idea is to bypass the need for explicit parallel corpora or synthetic data by mining weak bilingual signals directly from large monolingual corpora. We introduce a dynamic programming algorithm, incorporating nearest-neighbor search within the embedding space, to identify and extract parallel phrase segments across the two monolingual datasets. These mined phrases, while not perfectly parallel sentences, provide valuable anchor points for concurrently training BWEs in a shared space. We jointly optimize the embeddings using both monolingual context (e.g., a skip-gram objective) and the mined cross-lingual phrase alignments, fostering shared representations that respect cross-lingual correspondences even under structural asymmetry. This method directly confronts the asymmetry challenge by learning the alignment implicitly during training, rather than assuming symmetry beforehand or relying on noisy external components.
Our contributions are as follows: (1) We introduce an unsupervised joint training framework for BWEs that operates solely on monolingual corpora, directly addressing the low-resource challenge. (2) We propose a dynamic programming-based phrase-mining technique to extract bilingual signals, offering a novel alternative to synthetic data generation for unsupervised joint training. (3) Through extensive experiments on both closely related and distant language pairs (representing varying degrees of structural asymmetry), we demonstrate empirically that our method significantly outperforms existing unsupervised approaches, particularly in challenging non-isomorphic scenarios, thereby effectively mitigating the negative impact of structural asymmetry in cross-lingual representation learning. We believe this approach offers valuable insights into handling inherent data and structural asymmetries, a pertinent challenge highlighted by this special issue.
3. Methodology
Our objective is to learn high-quality BWEs in an unsupervised manner, effectively navigating the challenges posed by structural asymmetry (non-isomorphism) between language embedding spaces, particularly in low-resource settings. Our approach operates solely on monolingual corpora from a source language $L_s$ and a target language $L_t$, denoted as $C_s$ and $C_t$, respectively. The overall process, depicted in Figure 1, involves three key stages: unsupervised initialization, parallel phrase mining, and unsupervised joint training.
3.1. Unsupervised Initialization of BWEs
Before joint training can commence, we need an initial estimate of the cross-lingual alignment. This initial step provides a starting point for the nearest-neighbor search crucial for our phrase-mining algorithm. We employ a state-of-the-art unsupervised offline mapping technique to establish this initial foundation. Let $X \in \mathbb{R}^{|V_s| \times d}$ and $Z \in \mathbb{R}^{|V_t| \times d}$ be the pre-trained monolingual word embedding matrices for the source and target languages, respectively, where $|V_s|$ and $|V_t|$ are the vocabulary sizes and $d$ is the embedding dimension. These are typically trained using algorithms such as FastText [28] on $C_s$ and $C_t$.
We adopt the robust self-learning approach, which aims to find transformation matrices $W_X$ and $W_Z$ that align the spaces. The core idea often involves optimizing a criterion of the form
$$\max_{W_X, W_Z} \sum_{i=1}^{|V_s|} \sum_{j=1}^{|V_t|} D_{ij}\, \mathrm{sim}\!\left((X W_X)_{i}, (Z W_Z)_{j}\right),$$
where $(X W_X)_{i}$ is the $i$-th row of the transformed source matrix, $(Z W_Z)_{j}$ is the $j$-th row of the transformed target matrix, $D_{ij}$ indicates likely translation pairs, and $\mathrm{sim}(\mathbf{u}, \mathbf{v})$ is a similarity function measuring the similarity between two vectors $\mathbf{u}$ and $\mathbf{v}$; commonly, the cosine similarity $\cos(\mathbf{u}, \mathbf{v})$ is used. The summation iterates over all possible pairs of words in the source and target vocabularies. A common solution involves the SVD of $X^{\top} D Z = U \Sigma V^{\top}$, yielding orthogonal transformations $W_X = U$ and $W_Z = V$. Let $X' = X W_X$ and $Z' = Z W_Z$ represent the initial $d$-dimensional embeddings aligned in a common space.
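As a concrete illustration of the SVD-based solution above, the following minimal NumPy sketch maps both embedding matrices into a common space given a (hypothetical) translation indicator matrix D; it is a sketch under the stated assumptions, not the authors' exact implementation.

```python
import numpy as np

def svd_align(X, Z, D):
    """Align X (|Vs| x d) and Z (|Vt| x d) via the SVD of X^T D Z, where D is a
    |Vs| x |Vt| indicator matrix of likely translation pairs (assumed given here)."""
    U, _, Vt = np.linalg.svd(X.T @ D @ Z)
    W_x, W_z = U, Vt.T              # orthogonal transformations W_X and W_Z
    return X @ W_x, Z @ W_z         # X' and Z' in the common space
```

In the self-learning variant, D is re-estimated from nearest neighbors in the current common space and the alignment is recomputed until convergence.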
3.2. Parallel Phrase Mining via Dynamic Programming
This stage extracts weak bilingual signals by identifying continuous text segments in $C_s$ and $C_t$ that are likely translations, based on word similarity in the “current” embedding space. Consider a segment $s_1 s_2 \dots s_m$ from $C_s$ and a segment $t_1 t_2 \dots t_n$ from $C_t$. Let $X'$ and $Z'$ be the current embeddings (initially $X W_X$ and $Z W_Z$).
Similarity Criterion: We define a binary similarity function $\mathrm{sim}(s_i, t_j)$ using nearest neighbors:
$$\mathrm{sim}(s_i, t_j) = \begin{cases} 1, & \text{if } e(t_j) \in \mathrm{NN}_k\!\left(e(s_i), Z'\right) \text{ and } e(s_i) \in \mathrm{NN}_k\!\left(e(t_j), X'\right), \\ 0, & \text{otherwise}, \end{cases}$$
where $s_i$ is the $i$-th word in a source language text segment, $t_j$ is the $j$-th word in a target language text segment, $e(s_i)$ is the embedding vector of the source word $s_i$ from the current source embedding matrix $X'$, $e(t_j)$ is the embedding vector of the target word $t_j$ from the current target embedding matrix $Z'$, $\mathrm{NN}_k(e, E)$ is the set of the $k$ nearest neighbors of $e$ within the embedding matrix $E$ based on cosine similarity, and $k$ is a parameter. The value $\mathrm{sim}(s_i, t_j) = 1$ indicates that the words $s_i$ and $t_j$ are considered potential mutual translation candidates based on their current embedding proximity.
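A minimal sketch of this mutual nearest-neighbor check is given below, assuming row-wise L2-normalized embedding matrices and brute-force cosine search; at scale, an approximate nearest-neighbor index would typically replace the exhaustive search.

```python
import numpy as np

def knn_indices(query, matrix, k):
    """Indices of the k nearest rows of `matrix` to `query` under cosine similarity
    (both assumed L2-normalized, so the dot product equals the cosine)."""
    sims = matrix @ query
    return set(np.argpartition(-sims, k)[:k])

def mutual_sim(i, j, X_prime, Z_prime, k):
    """Binary similarity sim(s_i, t_j): 1 iff source word i and target word j
    fall within each other's k nearest neighbors across the two aligned spaces."""
    forward = j in knn_indices(X_prime[i], Z_prime, k)
    backward = i in knn_indices(Z_prime[j], X_prime, k)
    return int(forward and backward)
```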
Dynamic Programming: We compute a matrix $M$ whose entry $M[i][j]$ stores the length of the longest contiguous similar sequence ending at the position pair $(i, j)$. The transition is as follows:
$$M[i][j] = \begin{cases} M[i-1][j-1] + 1, & \text{if } \mathrm{sim}(s_i, t_j) = 1, \\ 0, & \text{otherwise}, \end{cases}$$
with base cases $M[0][j] = 0$ and $M[i][0] = 0$. An illustration is given in Figure 2.
Phrase Extraction: We find entries with $M[i][j] \geq \ell$, where $\ell$ is a minimum length threshold (e.g., $\ell = 3$), and extract the corresponding source phrase $s_{i-M[i][j]+1} \dots s_i$ and target phrase $t_{j-M[i][j]+1} \dots t_j$. These pairs form the set $P$.
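The sketch below combines the DP transition and the extraction rule for a single pair of segments; it reuses the `mutual_sim` check sketched above, and the default values of `k` and `min_len` are illustrative only.

```python
def mine_phrases(src_seg, tgt_seg, X_prime, Z_prime, k=10, min_len=3):
    """DP over one source/target segment pair (word indices into X'/Z').
    Returns (source phrase, target phrase) spans of length >= min_len."""
    m, n = len(src_seg), len(tgt_seg)
    M = [[0] * (n + 1) for _ in range(m + 1)]   # base cases M[0][j] = M[i][0] = 0
    pairs = []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if mutual_sim(src_seg[i - 1], tgt_seg[j - 1], X_prime, Z_prime, k):
                M[i][j] = M[i - 1][j - 1] + 1   # extend the contiguous match
            if M[i][j] >= min_len:
                length = M[i][j]
                # nested sub-spans could be deduplicated in post-processing
                pairs.append((src_seg[i - length:i], tgt_seg[j - length:j]))
    return pairs
```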
Discussion on Validity of Mined Phrases: A critical challenge in mining bilingual phrase segments from monolingual corpora lies in distinguishing valid cross-lingual alignments from coincidental similarities. Due to the high dimensionality and potential structural asymmetry of embedding spaces, especially between typologically distant languages, isolated word-level nearest neighbor matches can be unreliable. To address this, our method enforces two criteria, as follows: (1) a minimum length threshold ensures that only sufficiently long continuous segments are extracted, filtering out spurious matches; and (2) a mutual nearest neighbor check guarantees that word pairs exhibit consistent similarity in both directions. These constraints significantly reduce noise in the mined bilingual signals, ensuring that the resulting phrases provide meaningful supervision for unsupervised joint training.
3.3. Unsupervised Joint Training Framework
We jointly optimize $X'$ and $Z'$ using the monolingual corpora ($C_s$, $C_t$) and the mined phrases $P$. The combined objective is as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{mono}}^{s} + \mathcal{L}_{\mathrm{mono}}^{t} + \lambda\, \mathcal{L}_{\mathrm{cross}}(X', Z', P),$$
where $\lambda$ balances the cross-lingual signal, $\mathcal{L}$ is the combined loss function minimized during joint training, $\mathcal{L}_{\mathrm{mono}}^{s}$ is the monolingual loss for the source language, $\mathcal{L}_{\mathrm{mono}}^{t}$ is the monolingual loss for the target language, and $\mathcal{L}_{\mathrm{cross}}$ is the cross-lingual loss function (defined below) evaluated using the source embeddings $X'$, the target embeddings $Z'$, and the mined phrase pairs $P$.
Monolingual Objective ($\mathcal{L}_{\mathrm{mono}}$): We use skip-gram with negative sampling (SGNS):
$$\mathcal{L}_{\mathrm{mono}} = -\sum_{(w, c)} \left[ \log \sigma\!\left(e_c^{\top} e_w\right) + \sum_{n=1}^{N} \mathbb{E}_{w_n \sim P_n}\!\left[\log \sigma\!\left(-e_{w_n}^{\top} e_w\right)\right] \right],$$
where $e_w$, $e_c$, and $e_{w_n}$ are the embeddings for the center, context, and negative sample words, $\sigma$ is the sigmoid function, $N$ is the number of negative samples, and $P_n$ is the noise distribution.
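For concreteness, a minimal SGNS loss for a single (center, context) pair could look as follows; it is a sketch rather than the actual training code, and drawing the negatives from the noise distribution is left to the caller.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(e_center, e_context, e_negatives):
    """SGNS loss for one (center, context) pair; `e_negatives` is a list of
    embeddings for words sampled from the noise distribution."""
    positive = np.log(sigmoid(e_context @ e_center))
    negative = sum(np.log(sigmoid(-e_neg @ e_center)) for e_neg in e_negatives)
    return -(positive + negative)
```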
Cross-lingual Objective ($\mathcal{L}_{\mathrm{cross}}$): For each mined phrase pair $(p_s, p_t) \in P$, let $A(p_s, p_t)$ be the set of implicitly aligned word pairs identified during DP. The loss maximizes their similarity:
$$\mathcal{L}_{\mathrm{cross}}(X', Z', P) = -\sum_{(p_s, p_t) \in P} \; \sum_{(w_s, w_t) \in A(p_s, p_t)} \cos\!\left(e(w_s), e(w_t)\right).$$
This pulls aligned word embeddings closer.
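The sketch below assembles the combined objective for one training step; the aligned word pairs are those matched diagonally by the DP pass, and `lambda_` stands in for the balancing weight, whose value is a hyperparameter.

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def cross_lingual_loss(aligned_pairs, X_prime, Z_prime):
    """Negative sum of cosine similarities over DP-aligned (source, target) word indices."""
    return -sum(cosine(X_prime[ws], Z_prime[wt]) for ws, wt in aligned_pairs)

def joint_loss(mono_loss_src, mono_loss_tgt, aligned_pairs, X_prime, Z_prime, lambda_):
    """Combined objective: monolingual SGNS terms plus the weighted cross-lingual term."""
    return mono_loss_src + mono_loss_tgt + lambda_ * cross_lingual_loss(
        aligned_pairs, X_prime, Z_prime)
```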
Optimization and Normalization: We minimize $\mathcal{L}$ using SGD (or Adam). Embedding normalization and mean centering are applied periodically during training [39] for stability.
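The periodic stabilization step can be as simple as the following sketch, applied to both embedding matrices at fixed intervals; the exact ordering of centering and normalization follows [39].

```python
import numpy as np

def normalize_embeddings(E):
    """Mean-center the matrix, then L2-normalize each row (applied periodically)."""
    E = E - E.mean(axis=0, keepdims=True)
    norms = np.linalg.norm(E, axis=1, keepdims=True)
    return E / np.maximum(norms, 1e-8)
```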
Handling Structural Asymmetry: Unlike traditional unsupervised methods that assume isomorphic embedding spaces, our approach explicitly addresses non-isomorphism through two design choices, as follows: (1) The dynamic programming mining step identifies bilingual segments without relying on symmetric neighborhood structures, avoiding biases toward language-pair-specific isomorphism. (2) The joint training framework iteratively refines embeddings in a shared space, allowing asymmetric adjustments to accommodate linguistic divergences. This eliminates the need for restrictive orthogonality constraints common in mapping-based methods.
This joint framework learns alignment dynamically from mined signals combined with monolingual context, making it adaptable to structural asymmetries without assuming strict isomorphism. The complete unsupervised training procedure integrates the initialization, mining, and joint learning steps.
4. Experimental Settings
This section details the datasets, implementation parameters, evaluation protocols, and baseline methods used to assess the performance of our proposed unsupervised joint training approach for BWEs.
4.1. Datasets and Preprocessing
Monolingual Corpora: We utilized large monolingual corpora extracted from Wikipedia for our primary experiments, ensuring comparability with many previous studies. Additionally, to test robustness across different data sources and potential domain mismatch (reflecting real-world asymmetry), we also conducted experiments using corpora derived from Common Crawl. Following common practice [22,30], we used the WikiExtractor tool to obtain plain text from Wikipedia dumps. For each language, we used approximately 100 million words (or a comparable amount based on typical dataset sizes in the literature) for training the embeddings.
Language Pairs: We evaluated our method on six language pairs, categorized by linguistic distance, to investigate performance under varying degrees of expected structural asymmetry:
Closely related: English–German (En-De), English–Italian (En-It), German–Italian (De-It). These pairs are expected to exhibit relatively higher structural symmetry.
Distant: English–Russian (En-Ru), English–Turkish (En-Tr), English–Chinese (En-Zh). These pairs are typologically more distant and are expected to exhibit greater structural asymmetry (non-isomorphism).
All experiments involving English use it as the source language unless otherwise specified (e.g., De-It).
Preprocessing: Standard text preprocessing steps were applied, including tokenization (using the Moses tokenizer scripts), lowercasing, and the removal of rare words (e.g., keeping words occurring at least five times). We constructed vocabulary sizes appropriate for each language pair, ensuring coverage of evaluation dictionaries.
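A minimal sketch of this preprocessing (lowercasing, whitespace tokenization standing in for the Moses scripts, and a frequency cutoff of five) is shown below; the actual pipeline used the Moses tokenizer.

```python
from collections import Counter

def preprocess(lines, min_count=5):
    """Lowercase, tokenize, and drop words occurring fewer than `min_count` times."""
    tokenized = [line.lower().split() for line in lines]   # Moses tokenizer in practice
    counts = Counter(w for sent in tokenized for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w for w in sent if w in vocab] for sent in tokenized], vocab
```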
4.2. Implementation Details
Monolingual Embeddings: We used the FastText [28] implementation to train the initial 300-dimensional monolingual embeddings on the preprocessed corpora, with a window size of 5, 10 negative samples, and 5 training epochs.
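Comparable settings can be reproduced with, for example, the gensim FastText wrapper; this is an illustrative call rather than the exact command used, and the file paths are hypothetical.

```python
from gensim.models import FastText

# Skip-gram FastText with the reported settings:
# 300 dimensions, window 5, 10 negative samples, 5 epochs.
model = FastText(
    corpus_file="corpus.tokenized.txt",   # hypothetical path to the preprocessed corpus
    sg=1, vector_size=300, window=5, negative=10, epochs=5,
)
model.wv.save("embeddings.src.kv")        # hypothetical output path
```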
Initialization for Our Method: The initial alignment for our unsupervised joint training framework (Stage 1 in Section 3.1) was obtained using the robust self-learning unsupervised mapping method [30], specifically using their publicly available implementation with default settings. We also experimented with other initializations, as reported in the ablation studies.
Phrase-Mining Parameters: For the dynamic programming phrase mining (Section 3.2), we fixed the number of nearest neighbors $k$ used in the similarity criterion and set the minimum phrase length threshold to three words.
Joint Training Parameters: Our joint training framework (Section 3.3) was implemented by building upon standard embedding training procedures. We used the Adam optimizer with a learning rate tuned on a held-out validation set (or a standard default such as 0.001). The skip-gram objective within the joint training used a window size of 5 and 5 negative samples. The weight $\lambda$ for the cross-lingual loss term was set based on preliminary experiments. The joint training ran for a fixed number of iterations or epochs (e.g., 5 epochs over the effective combined data). Normalization and mean centering were applied periodically, as suggested in [39]. The primary results reported use a single pass of phrase mining followed by joint training, unless specified as iterative.
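A plausible configuration summarizing these settings is sketched below; entries whose exact values are not stated in the text (e.g., the cross-lingual weight and the normalization interval) are placeholders.

```python
# Illustrative hyperparameter configuration for the joint training stage.
JOINT_TRAINING_CONFIG = {
    "optimizer": "adam",
    "learning_rate": 1e-3,            # standard default; tuned on held-out data in practice
    "window_size": 5,                 # skip-gram context window
    "negative_samples": 5,
    "epochs": 5,                      # over the effective combined data
    "lambda_cross": None,             # cross-lingual weight, set via preliminary experiments
    "min_phrase_length": 3,
    "normalize_every_n_steps": 1000,  # hypothetical interval for normalization + centering
}
```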
4.3. Evaluation Tasks and Metrics
We evaluated the learned BWEs on intrinsic and extrinsic tasks:
- 1.
Bilingual Lexicon Induction (BLI): This is the primary intrinsic evaluation.
Dataset: We used the standard test dictionaries provided by MUSE [22], which contain 5000 source words and their translations for training and 1500 for testing (we used the test split).
Protocol: For a source word embedding, we retrieved its nearest neighbor in the target embedding space using cosine similarity. We evaluated using Accuracy@1 (P@1), i.e., the percentage of source words whose retrieved nearest neighbor is the correct translation according to the dictionary. We report average accuracies across all test words (see the retrieval sketch after this list).
- 2.
Downstream Task Evaluation: We assessed the utility of the BWEs in three cross-lingual transfer scenarios, following protocols similar to [17]:
Cross-lingual Natural Language Inference (XNLI): We trained a classifier (e.g., a BiLSTM with attention) on the English MultiNLI training data and evaluated its zero-shot performance on the translated XNLI test sets for the target languages, using the learned BWEs as input features.
Cross-lingual Document Classification (CLDC): We trained a document classifier (e.g., averaging BWEs + linear SVM) on the English portion and tested its zero-shot performance on target language documents.
Cross-lingual Information Retrieval (CLIR): We represented queries (English) and documents (target languages) by averaging the embeddings of their constituent words (after stopword removal). We ranked documents based on cosine similarity to the query.
- 3.
Unsupervised Machine Translation (UMT): We initialized the embedding layer of a standard Transformer model with the learned BWEs and trained a UNMT system following [39].
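The BLI protocol from item 1 can be illustrated with the following retrieval sketch (brute-force cosine nearest neighbor and P@1 over a test dictionary); CSLS or approximate search could be substituted without changing the metric.

```python
import numpy as np

def bli_precision_at_1(X_prime, Z_prime, test_dict):
    """P@1: fraction of source words whose nearest target neighbor (by cosine)
    is a gold translation. `test_dict` maps source index -> set of gold target indices."""
    Xn = X_prime / np.linalg.norm(X_prime, axis=1, keepdims=True)
    Zn = Z_prime / np.linalg.norm(Z_prime, axis=1, keepdims=True)
    hits = 0
    for src_idx, gold in test_dict.items():
        prediction = int(np.argmax(Zn @ Xn[src_idx]))   # nearest neighbor by cosine
        hits += prediction in gold
    return hits / len(test_dict)
```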
4.4. Baseline Methods
We compared our proposed method against representative state-of-the-art approaches:
Supervised Mapping:
Mapper [43]: Learns a linear projection using a seed dictionary (we used 1k or 5k pairs for supervised runs).
VecMap (Supervised) [27]: The supervised variant of VecMap, using a seed dictionary.
Unsupervised Offline Mapping: These methods rely on the isomorphism assumption.
VecMap (Unsupervised) [30]: Robust self-learning approach (used for our initialization).
MUSE (Adversarial + Refinement) [22]: Adversarial training followed by Procrustes refinement.
Unsupervised Joint Training:
Fusion [44]: Combines offline mapping with joint training on synthetic data from UNMT.
For all baselines, we utilized publicly available code and default parameters where possible. Crucially, all methods (ours and all baselines) were trained on the same preprocessed monolingual corpora and vocabularies for each language pair to ensure a fair comparison. For supervised baselines requiring seeds, we used standard bilingual dictionary pairs.
6. Conclusions
In this paper, we systematically addressed the critical challenge of learning high-quality Bilingual Word Embeddings (BWEs) in unsupervised settings, with a focus on mitigating the pervasive issue of structural asymmetry between embedding spaces of typologically diverse languages. Structural asymmetry, particularly pronounced in distant language pairs and exacerbated by domain mismatches in monolingual corpora, has been shown to degrade the performance of traditional unsupervised methods reliant on the isomorphism assumption in bilingual lexicon induction (BLI) tasks. To overcome these limitations, we proposed a novel unsupervised joint training framework that circumvents the need for parallel corpora or synthetic data, operating solely on monolingual texts.
Our approach leverages a dynamic programming algorithm to mine parallel phrase segments directly from monolingual corpora. By exploiting nearest-neighbor relationships in the evolving embedding space, the algorithm identifies contiguous blocks of semantically related words, generating robust, data-driven bilingual signals. These mined phrases guide the concurrent optimization of source and target embeddings within a shared space, enabling the model to learn cross-lingual alignments dynamically while preserving monolingual structure. Extensive experiments across six language pairs (En-De, En-It, De-It, En-Ru, En-Tr, En-Zh) demonstrated significant improvements over state-of-the-art methods. On BLI tasks, our framework achieved 74.15% accuracy@1 for English–German (En-De) and 74.33% accuracy@1 for English–Italian (En-It) on the Wikipedia corpora. For cross-lingual natural language inference (XNLI), our method attained 58.6% test accuracy on average, surpassing unsupervised offline mapping baselines by 11.82%.
The contributions of this work are threefold. First, we introduced an unsupervised joint training framework that explicitly addresses structural asymmetry without relying on synthetic data or external resources, achieving competitive performance even for distant language pairs like English–Turkish. Second, our dynamic programming-based phrase-mining technique provides a scalable alternative to conventional synthetic data generation, reducing computational overhead compared to back-translation-based methods. Third, we validated the effectiveness of our approach across diverse corpora, including clean (Wikipedia) and noisy (Common Crawl) domains, demonstrating robustness to data quality variations.
Despite these advancements, the proposed framework has limitations. The dynamic programming algorithm’s time complexity scales quadratically with sentence length, limiting its applicability to longer texts. Additionally, while our method excels in aligning lexical semantics, it does not explicitly model syntactic structures, potentially affecting performance in tasks requiring fine-grained grammatical transfer. Future work will focus on optimizing the algorithm’s efficiency via approximate nearest-neighbor search and integrating syntactic constraints into the alignment process. Furthermore, extending this framework to multilingual settings and exploring semi-supervised variants with minimal seed lexicons could further bridge the gap between unsupervised and fully supervised performance.
Meanwhile, our method’s design is extensible to multilingual and low-resource scenarios. Similar to multilingual joint training paradigms [45,46], our phrase-mining algorithm could be adapted to mine multi-view alignments across multiple monolingual corpora, leveraging shared embedding spaces for distant languages. Further, the minimal reliance on external resources (only monolingual corpora) aligns with low-resource settings [47]. For languages with extreme data scarcity, subword-enhanced initialization (e.g., FastText) combined with our phrase mining could mitigate lexical sparsity, an avenue for future exploration.
By directly addressing the challenges of structural asymmetry in cross-lingual representation learning, this work advances the feasibility of unsupervised methods for low-resource language pairs, offering a practical pathway toward equitable multilingual NLP systems.