Article

Rectifying Ill-Formed Interlingual Space: A Framework for Zero-Shot Translation on Modularized Multilingual NMT

1 School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
2 Microsoft Cognitive Services Research Group, Redmond, WA 98052, USA
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(22), 4178; https://doi.org/10.3390/math10224178
Submission received: 13 September 2022 / Revised: 18 October 2022 / Accepted: 31 October 2022 / Published: 9 November 2022
(This article belongs to the Special Issue Artificial Neural Networks: Design and Applications)

Abstract

The multilingual neural machine translation (NMT) model can handle translation between more than one language pair. From the perspective of industrial applications, the modularized multilingual NMT model (M2 model) that only shares modules between the same languages is a practical alternative to the model that shares one encoder and one decoder (1-1 model). Previous works have proven that the M2 model can benefit from multiway training without suffering from capacity bottlenecks and exhibits better performance than the 1-1 model. However, the M2 model trained on English-centric data is incapable of zero-shot translation due to the ill-formed interlingual space. In this study, we propose a framework to help the M2 model form an interlingual space for zero-shot translation. Using this framework, we devise an approach that combines multiway training with a denoising autoencoder task and incorporates a Transformer attention bridge module based on the attention mechanism. We experimentally show that the proposed method can form an improved interlingual space in two zero-shot experiments. Our findings further extend the use of the M2 model for multilingual translation in industrial applications.

1. Introduction

Multilingual neural machine translation (NMT) trains a single model to translate multiple source languages into multiple target languages. Multilingual NMT models have demonstrated significant improvements in translation quality for low-resource languages [1,2] and the capability for zero-shot translation [3,4,5], where the model can translate language pairs unseen in training. Due to its compactness, the most popular approach for multilingual NMT is the sequence-to-sequence model [6], composed of one encoder and one decoder (hereafter referred to as the 1-1 model, Figure 1a), that can translate in all directions [3,4].
However, it is well recognized that adding a large number of languages to a 1-1 model results in capacity bottlenecks [2,5,7], which is particularly undesirable for translation systems in industry. There are two approaches to scaling the model to provide capacity for more languages. One approach is to directly scale the 1-1 model size, for example by increasing the depth and width of standard Transformer architectures [8,9]. However, the scaled 1-1 model still has issues with inference cost and latency, both of which matter in practice. Furthermore, a 1-1 model must also be trained from scratch as a single module, which requires considerable time and effort. This makes modifying a 1-1 model, such as adding a language, time-consuming.
Another approach to expanding the model capacity is the use of language-specific modules [1]. Figure 1b illustrates the architecture overview of this model. Because this design only shares language-independent modules (encoders or decoders), rather than the full model, we refer to it as the modularized multilingual NMT model (henceforth the M2 model). Although the M2 model has received little attention due to the linear increase in its parameters as the number of languages grows, its modularized architecture allows it to maintain a reasonable inference cost and low maintenance effort (e.g., when adding a new language), which makes it appealing in industrial applications.
Some works have revisited the M2 model as an alternative to the 1-1 model in the industrial setting [10,11]. These works explored the M2 model trained in a many-to-many (M2M) environment, i.e., with training data that include parallel pairs of all languages (Figure 2a). However, this type of data is difficult to obtain in practice, especially for low-resource languages, whereas parallel data to or from English are common. As a continuation of work using the M2 model in industrial settings, we investigate the M2 model trained in a joint one-to-many and many-to-one (JM2M) environment with English-centric training data (Figure 2b). Specifically, we focus on zero-shot translation in the JM2M environment, which is an important measure of model generalization.
We first explore the interlingual space formed by the M2 model trained in the JM2M environment and analyze the reason for the model’s inability to perform zero-shot translation. Based on this observation, we propose a framework with two steps to rectify the ill-formed interlingual space. First, we add mono-directional translation to exploit the effect of multilingual translation training (hereafter called multiway training). This essential step provides the M2 model with the ability to perform zero-shot translation. Then, based on the first step, we incorporate a neural interlingual module to help the M2 model form a language-independent representation, which further improves its transfer learning ability, leading to an improvement in zero-shot translation performance.
Using this framework, we propose an approach that combines injecting noise into mono-directional translation with a novel neural interlingual module based on the Transformer attention mechanism. By visualizing multi-parallel sentence embeddings, we show that the proposed approach rectifies the ill-formed interlingual space. Furthermore, two zero-shot translation experiments prove the effectiveness of our proposed method in forming a language-independent interlingual space.
Our significant contributions are outlined here:
  • We explore the interlingual space of the M2 model and identify the ill-formed interlingual space formed by this model when trained on English-centric data.
  • We propose a framework to help the M2 model form a good interlingua for zero-shot translation. Under this framework, we devise an approach that combines multiway training with the denoising task and incorporates a novel neural interlingual module.
  • We verify the performance of our method using two zero-shot experiments. The experimental results show that our method can outperform the 1-1 model for zero-shot translation by a large margin and surpasses the pivot-based method when combined with back-translation.

2. Related Works

2.1. Multilingual Neural Machine Translation

The architecture of the multilingual NMT model was classified by Dabre et al. [12] based on the degree of parameter sharing. By sharing a language-specific encoder [13,14], decoder [15], or both encoder and decoder [1], early multilingual NMT models minimally shared parameters. Because of its compactness, the fully shared 1-1 model, which employs only a single encoder and decoder to translate all language pairs, became the mainstream model in multilingual NMT research [2,3,4,5,16,17]. Partially shared models have been widely investigated to balance the capacity bottleneck and model size in order to ease the capacity constraint of a fully shared model [18,19,20,21,22].
The modularized multilingual NMT (M2) model defined in our paper belongs to the category of models that minimally share parameters by sharing a language-specific encoder and decoder. Although the M2 model was proposed prior to the 1-1 model, it has received little attention due to the linear growth of its parameters as the number of languages grows. Recently, several works [10,11] have revisited the application of the M2 model in industrial settings because of its bottleneck-free and low-maintenance properties.

2.2. Zero-Shot Neural Machine Translation

Zero-shot neural machine translation can translate unseen language pairs and has received increasing interest in recent years. Gu et al. [23] employed back-translation and decoder pretraining to disregard spurious correlations in zero-shot translation. Ji et al. [24] proposed cross-lingual pretraining of the encoder before training the whole model with parallel data. Al-Shedivat and Parikh [25] developed a training strategy based on consistent agreement that encourages the model to create similar translations of parallel phrases in auxiliary languages. Arivazhagan et al. [26] used an explicit alignment loss to align sentence representations from multiple languages with the same meaning in the latent space. Sestorain et al. [27] proposed a fine-tuning technique that uses dual learning [28], in which a language model is employed to generate a signal for fine-tuning zero-shot translation directions. Chen et al. [29] initialized the encoder with a multilingual pretrained encoder and then trained the model with a two-stage training strategy, yielding a many-to-English model that supports 100 source languages but is trained with a parallel dataset in just six source languages. Liu et al. [30] dropped the residual connections in one layer of the Transformer to disentangle positional information that hinders the encoder from outputting a language-agnostic representation. To increase translation accuracy in zero-shot directions, Wang et al. [31] incorporated a denoising autoencoder objective based on the pivot language into the usual training objective. Raganato et al. [32] proposed an auxiliary supervised objective that explicitly aligns the words in language pairs to alleviate the off-target problem of zero-shot translation. Wu et al. [33] empirically compared several language tag strategies and recommended adding the target language tag to the encoder for zero-shot translation. Gonzales et al. [34] improved zero-shot translation by performing language-specific subword segmentation and including the non-English pairs in English-centric training data.
All of the above studies of zero-shot translation use the 1-1 model for its inherent cross-lingual transferability due to shared parameters. Only a few works have studied the zero-shot translation approach using the M2 model. Lu et al. [35] and Vázquez et al. [36] both incorporated an explicit neural interlingual module into the M2 model and verified its effect on multilingual translation. Nevertheless, the proposed neural interlingual modules were the focus of these papers. The usage of reconstruction tasks in their works was understated, as they were only mentioned within the experimental settings. In contrast, we reveal the relationship between adding mono-direction translation and incorporating a neural interlingual module. We identify that the former is an essential component in enabling zero-shot translation for the M2 model. A neural interlingual module can only be effective when mono-direction translation is added.

2.3. Leveraging Denoising Task in Multilingual NMT

In our proposed framework, we adopt two steps to help the M2 model form an interlingual space. The first step, namely, adding mono-direction translation, leverages monolingual data. To efficiently exploit monolingual data, we propose a denoising autoencoder task and combine it with multiway training for the M2 model. Some recent works have also used similar denoising tasks in multilingual NMT. Liu et al. [16] employed a denoising objective and self-supervision to pretrain a 1-1 model, which was followed by fine-tuning on small amounts of supervised data. These authors concentrated on multilingualism during the pretraining step but fine-tuned the learned model in a typical bilingual environment. Siddhant et al. [37] trained a 1-1 model on supervised parallel data with the translation objective and on monolingual data with the masked denoising objective at the same time. To improve performance on zero-shot directions, Wang et al. [31] added a basic denoising autoencoder objective based on the English language into the usual training objective of multilingual NMT and analyzed the proposed method from a perspective of latent variables. All these works used denoising tasks for the 1-1 model. In their work, the denoising task acted as an auxiliary task in multitask learning, improving the zero-shot performance of the 1-1 model. However, for the M2 model trained in the JM2M environment, the denoising task plays a more critical role in “rectifying” the ill-formed interlingual space, enabling zero-shot translation for the M2 model. Without translation directions added by introducing the denoising task, this model is incapable of zero-shot translation.

2.4. Incorporating a Neural Interlingual Module into the M2 Model

Due to the minimally shared architecture, it is difficult for the M2 model to form a shared semantic interlingual space when trained in a JM2M environment. However, interlingual space is required for zero-shot translation [4,25,26] and incremental training [10,38]. To solve this issue, some works incorporated a neural interlingual module. Firat et al. [1] used a single attention mechanism shared across all language pairs. Lu et al. [35] incorporated an attentional LSTM encoder as an explicit neural interlingua that converts language-specific embeddings to language-independent ones. Vázquez et al. [36] hypothesized a neural interlingua consisting of a self-attention layer shared by all language pairings. These works were all based on multiple language-dependent LSTM encoder–decoders. Different from their methods, we adopt Transformer as the elemental encoder–decoder module and integrate an independent Transformer-like network that acts as a bridge to connect the encoders and decoders via a cross-attention mechanism.

3. Exploring the Interlingual Space of the M2 Model

3.1. Background: Modularized Multilingual NMT

The sequence-to-sequence model underpins the modularized multilingual NMT. However, as seen in Figure 1b, each language module has its own encoder and decoder [1]. Both the encoder and decoder are regarded as separate modules that may be freely swapped to work in all translation directions. The encoder and decoder for the $i$th language in the system are denoted as $enc_i$ and $dec_i$, respectively. We employ $(x^i, y^j)$, where $i, j \in \{1, \dots, K\}$, to indicate a pair of sentences translated from a source language $i$ to a target language $j$, and $K$ languages are taken into account. The M2 model is trained by maximizing the probability across all available language pairings $S$ in training sets $D_{i,j}$. Formally, our goal is to maximize $\mathcal{L}_{mt}$, where $mt$ denotes the machine translation task:

$$\mathcal{L}_{mt}(\theta) = \sum_{(x^i, y^j) \in D_{i,j},\ (i,j) \in S} \log p(y^j \mid x^i; \theta),$$

where the probability $p(y^j \mid x^i)$ is modeled as

$$p(y^j \mid x^i) = dec_j\big(enc_i(x^i)\big).$$
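To make the modular routing concrete, the following is a minimal PyTorch-style sketch of how an M2 model can dispatch a sentence pair to the source-language encoder and the target-language decoder. The class and factory names (M2Model, make_encoder, make_decoder) are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class M2Model(nn.Module):
    """Sketch of a modularized multilingual NMT model: one encoder and one
    decoder per language, with no parameters shared across languages."""

    def __init__(self, languages, make_encoder, make_decoder):
        super().__init__()
        # Language-specific modules, e.g. standard Transformer encoders/decoders.
        self.encoders = nn.ModuleDict({lang: make_encoder() for lang in languages})
        self.decoders = nn.ModuleDict({lang: make_decoder() for lang in languages})

    def forward(self, src_tokens, src_lang, tgt_tokens, tgt_lang):
        # p(y^j | x^i) = dec_j(enc_i(x^i)): route through the source-language
        # encoder and the target-language decoder.
        enc_out = self.encoders[src_lang](src_tokens)
        return self.decoders[tgt_lang](tgt_tokens, enc_out)
```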
Johnson et al. [4] demonstrated that if both the source and target languages are included in the training, a trained multilingual NMT model can automatically translate between unseen pairings without any direct supervision. In other words, a model trained on Spanish → English and English → French can translate directly from Spanish to French. In a multilingual system, this sort of emergent characteristic is known as zero-shot translation. It is hypothesized that zero-shot NMT is conceivable because the optimization pushes multiple languages to be encoded into a shared language-independent space, allowing the decoder to be separated from the source languages. Because all language pairings share the same encoder and decoder, the 1-1 model automatically possesses this attribute. However, in the M2 model, there is no shared module among languages, so a shared semantic space is barely formed when trained in a JM2M environment.

3.2. Comparison of Interlingual Space in M2M and JM2M

Interlingual space is the ground for transfer learning, which is critical to zero-shot translation and incremental learning. Because the M2 input contains no information about the target language, encoders must encode it into a language-agnostic representation from which any decoder can translate. Simultaneously, M2 decoders should be able to operate on the output of any M2 encoder. As a result, Lyu et al. [10] assumed that the M2 encoders’ output space is interlingual and that a new module is compatible with the existing modules if the interlingual space is well formed.
Figure 3a illustrates the interlingual space of the M2 model trained in the M2M environment with three languages (En, De, and Fr) that is shared by six modules. This space is retained because the M2’s weights are frozen. During the incremental training step, a new module (Es) is adapted to the interlingual space by one of the frozen modules (En) using a single parallel corpus (En-Es). Because of the well-formed interlingual space, the zero-shot learning of the incremented language module (Es) is even comparable to a supervised model [10,38]. However, when the interlingual space is formed by the M2 model trained in the JM2M environment (En-De, Fr), the situation is quite different. Figure 3b illustrates this setting. Because the JM2M environment only has language pairs to or from English, intuitively, we speculate that the interlingual space formed in the JM2M environment is weaker than that formed by the M2M environment. For example, the De encoder only needs to encode sentences so that the En decoder can decode it, and other decoders have never used its output. The same applies to the Fr encoder. The output of these two encoders can only form a common space that is compatible with the En decoder. However, this space may not be the common space needed by all decoders to generate properly.

3.3. Interlingua Visualization

We can verify the above analysis by visualizing the interlingual space formed by the M2 model trained in the M2M and JM2M environments. We plot 100 multi-parallel sentence embeddings of three languages (En, De, and Fr) from the M2 model trained in the M2M and JM2M environments in Figure 4 (Section 6.2). These embeddings are obtained by mean-pooling each language encoder’s output to a 256-dimensional vector and projecting it to $\mathbb{R}^2$ using t-SNE [39]. From this figure, we can see that the sentence embeddings from the M2M environment cluster together, while those from the JM2M environment are divided into two groups. There is a clear separation between the En embeddings and the other two languages in the JM2M environment. These misaligned embeddings show that the interlingual space of the M2 model trained in the JM2M environment is not well formed.
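As a concrete illustration of how such a plot can be produced, the sketch below mean-pools encoder outputs and projects them with scikit-learn's t-SNE. The array shapes and function names are our assumptions rather than the exact pipeline used in the paper.

```python
import numpy as np
from sklearn.manifold import TSNE

def mean_pool(enc_out, token_mask):
    # enc_out: (seq_len, hidden) encoder output for one sentence;
    # token_mask: boolean (seq_len,), True for non-padding positions.
    return enc_out[token_mask].mean(axis=0)

def project_to_2d(pooled_vectors, seed=0):
    # pooled_vectors: (n_sentences, 256) mean-pooled embeddings collected
    # from the En, De, and Fr encoders; returns 2-D points for plotting.
    tsne = TSNE(n_components=2, init="pca", perplexity=30, random_state=seed)
    return tsne.fit_transform(np.asarray(pooled_vectors, dtype=np.float32))
```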
To view the sentence embedding alignment in more detail, we visualize the embeddings for four groups of parallel sentences in the M2M and JM2M environments separately (Figure 5). Sentences from the same group are colored the same. Each group has a parallel translation of one English, one German, and one French sentence. Table 1 contains the text of the embedded sentences. We see a strong boundary between various groups of sentences in the M2M environment (Figure 5a), whereas sentences within the same group remain near to each other in space. This means that the M2 model trained in the M2M environment can capture language-independent semantic information in its sentence representation. In the JM2M environment (Figure 5b), in each group of sentences, the De embedding and Fr embedding are close to each other in space, but the En embedding is far away from the other two languages. This ill-formed interlingua makes the M2 model trained in the JM2M environment fail to perform zero-shot translation, as we show in the experiment in Section 6.2.

4. A Framework for Rectifying Ill-Formed Interlingua of the M2 Model Trained in the JM2M Environment

4.1. Adding a Mono-Direction Translation

The encoders’ interlingual output representation should be able to be translated by any decoder, including the source language’s decoder. Lyu et al. [10] empirically proved that the translation score of mono-direction increases when the language invariance of interlingual space improves with more languages. This inspires us to use mono-direction translation (where the source and target languages are the same) to help form an interlingual space. Additionally, learning the same language in several orientations to encode or decode it may lead to better representation learning and less overfitting in one direction. This is the regularization effect that has previously been noted in works [1,2]. In the JM2M environment (Figure 6a), when adding mono-direction translation, the encoder and decoder (excluding En) can learn to encode or decode the same language in two directions (English and itself), which results in less overfitting of English and forms a better interlingual space. This step is fundamental to enable the M2 model to have cross-lingual transfer for zero-shot translation. We prove this hypothesis experimentally in Section 6.2.

4.2. Incorporating a Neural Interlingual Module

The M2 model trained in the M2M environment can form a good interlingual space, which is attributed to the effect of data diversification and regularization [10]. This space should be language-independent because any decoders can use the representation in the space to generate a meaningful sentence. However, the effect of data diversification and regularization of the M2 model trained in the JM2M environment is too weak to form a good interlingua due to the shrunken parallel pairs. For example, for an M2 model supporting four languages, the M2M environment has 12 directions, while the JM2M environment has only 6 directions. Even after adding three mono-directions, the JM2M environment still has three directions less than the M2M environment. To compensate for this difference, we consider incorporating a neural interlingual module (NIM) to explicitly model the shared semantic space for all languages and act as a bridge between the encoder and decoder networks (Figure 6b) [35,36,40]. In addition, an explicit neural interlingual module can facilitate the incremental addition of languages. We can train the NIM with the newly added language module, which helps the old interlingual space adapt to a new space that is compatible with new languages (Section 6.3).

5. Approaches

Utilizing the proposed framework, we explore some specific methods to rectify the ill-formed interlingual space trained in the JM2M environment. In each step of the framework, we first present a simple and intuitive method. Then, based on an analysis of the shortcomings of this method, we propose a more sophisticated method that can overcome these drawbacks to form a better interlingual space.

5.1. Adding a Mono-Direction Translation

5.1.1. Reconstruction Task (REC)

The most straightforward method to add a mono-direction translation to the model is the reconstruction task [35,36,40]. The language-specific encoder encodes the monolingual sentence to its embedding representation, and the language-specific decoder generates the same sentence from the source embedding.

5.1.2. Denoising Autoencoder Task (DAE)

The reconstruction task uses the same sentence as the source and target. As a result, the language-specific encoder and decoder tend to learn to simply copy the input, which may hinder translation training. Corrupting the source sentence can prevent the module from learning a simple copy. Furthermore, if we randomly inject noise to corrupt the source sentences in each batch, the encoder will learn from more diverse sentences over the whole training and benefit from the data-diversification effect. Considering the above two points and inspired by the denoising pretraining task for the sequence-to-sequence model [16,41], we propose using the denoising autoencoder task (DAE) for mono-direction translation. In more technical terms, our training dataset includes $K$ languages (excluding English), and each $D_k$ is a collection of monolingual sentences in language $k$. We suppose that we have access to a noise function $g$ that corrupts text and train the model to predict the original text $x^k$ given $g(x^k)$. We intend to maximize $\mathcal{L}_{dae}$ as
$$\mathcal{L}_{dae}(\theta) = \sum_{x^k \in D_k,\ k \in K} \log p(x^k \mid g(x^k); \theta),$$

where $x^k$ is a sample of language $k$ and the probability $p(x^k \mid g(x^k))$ is defined as

$$p(x^k \mid g(x^k)) = dec_k\big(enc_k(g(x^k))\big).$$
To generate randomly altered text, the noise function g injects three forms of noise: first, tokens of the sentence are randomly discarded with a certain probability; second, tokens are substituted by a special masking token with another probability; and third, the token order is locally shuffled. Candidate tokens might be subwords or complete words that span n-grams.
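A minimal sketch of such a noise function is given below. The deletion and masking rates match the settings reported in Section 6.2.1, while the local-shuffle window and the function name are our own illustrative choices.

```python
import random

def add_noise(tokens, p_delete=0.2, p_mask=0.1, mask_token="<mask>", shuffle_window=3):
    """Corrupt a token sequence for the DAE task: random token deletion, random
    masking, and local shuffling of the remaining tokens."""
    noised = []
    for tok in tokens:
        r = random.random()
        if r < p_delete:
            continue                   # drop the token entirely
        if r < p_delete + p_mask:
            noised.append(mask_token)  # replace with the special masking token
        else:
            noised.append(tok)
    # Local shuffle: each surviving token may move at most `shuffle_window`
    # positions, which perturbs word order without destroying the sentence.
    keys = [i + random.uniform(0, shuffle_window) for i in range(len(noised))]
    return [tok for _, tok in sorted(zip(keys, noised), key=lambda kv: kv[0])]
```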
After including the DAE task, our learning algorithm’s objective function becomes
$$\mathcal{L} = \mathcal{L}_{mt} + \mathcal{L}_{dae}.$$
We train the multiple encoders and decoders together by picking one of the two tasks, namely the multilingual translation task and the DAE task, at random with equal probability.
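The alternating procedure can be sketched as follows. The batch layout, iterator interfaces, and loss computation are illustrative assumptions rather than the exact fairseq implementation; the model is assumed to follow the M2Model sketch above.

```python
import random
import torch.nn.functional as F

def joint_training_step(model, mt_batch_iter, dae_batch_iter, optimizer):
    """One optimization step of L = L_mt + L_dae: sample the translation task
    or the DAE task with equal probability and update on one batch of it."""
    if random.random() < 0.5:
        src_lang, tgt_lang, src, tgt = next(mt_batch_iter)   # parallel pair (x^i, y^j)
    else:
        lang, noised_src, tgt = next(dae_batch_iter)         # noised pair (g(x^k), x^k)
        src_lang, tgt_lang, src = lang, lang, noised_src     # mono-direction: same language's modules
    logits = model(src, src_lang, tgt[:, :-1], tgt_lang)     # teacher forcing
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```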

5.2. Incorporating a Neural Interlingual Module

5.2.1. Sharing Encoder Layers (SEL)

The output of language-specific encoders is mapped into a shared space by a neural interlingual module (NIM), which is a shared component between encoders and decoders. This produces an intermediate universal representation that is used as the input to decoders. Previous works [42,43,44] showed that the top layers of the multilingual Transformer [45] encoder tend to learn language-independent knowledge. Based on their findings, a natural and simple solution to this problem is sharing the top layers of language-specific encoders as the NIM. Through these means, we hope to encode sentences with different languages into a shared semantic space and take the output embedding at the top layer of the NIM as the language-independent semantic representation.
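A minimal sketch of this sharing scheme is shown below: each language keeps its own bottom encoder layers, while a single top stack is reused by every language as the NIM. The class and factory names are assumptions for illustration.

```python
import torch.nn as nn

class SELEncoder(nn.Module):
    """Sketch of the SEL variant: language-specific bottom layers plus a top
    stack of layers shared by all languages; `make_layer` stands in for a
    Transformer encoder layer factory."""

    def __init__(self, make_layer, private_depth, shared_top):
        super().__init__()
        self.private = nn.ModuleList([make_layer() for _ in range(private_depth)])
        self.shared_top = shared_top  # the same ModuleList object for every language

    def forward(self, x):
        for layer in self.private:
            x = layer(x)
        for layer in self.shared_top:
            x = layer(x)
        return x  # output of the shared top layers serves as the interlingua

# Usage sketch: one shared top stack, language-specific bottoms.
# shared_top = nn.ModuleList([make_layer() for _ in range(3)])
# encoders = {lang: SELEncoder(make_layer, 6, shared_top) for lang in ("en", "de", "fr", "fi")}
```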

5.2.2. Transformer Attention Bridge (TAB)

Sharing top encoder layers is a simple yet effective implementation for an NIM (Section 6.2). However, there is a problem with this method. A case can be considered where sentences with the same meaning in different languages are not the same token length, which is a common occurrence. For example, the English sentence, “Mary had a little lamb. Its fleece was white as snow.” has 13 tokens (we use each word as a token for simplicity), while its Chinese translation, “玛丽有只小羊羔。它的羊毛像雪一样白。” has 18 tokens. Because the sequence length of the output embedding at the shared top layers is the same as the input sentence length, these two sentences have output embeddings with different lengths. Intuitively, an output embedding with different lengths is unlikely to be a language-independent representation coming from a common semantic space.
Based on the above observation and inspired by Zhu et al. [40], who incorporated an independent language-aware interlingua module into the 1-1 model to enhance the language-specific signal, we propose an independent neural network module as an NIM for the M2 model. This neural network module is made up of a stack of feed-forward and multi-attention sub-layers, which are similar to the Transformer architecture and act as a bridge connecting the encoders and decoders via an attention mechanism. Therefore, we call this module the Transformer attention bridge (TAB) (Figure 7).
In the original M2 model, every position of the language-specific decoder attends over all positions in the input sequence of the corresponding encoder when translating from one language to another. In the M2 model incorporated with the TAB module (Figure 7), the encoders and decoders have no direct connections; instead, they both perform cross-attention with the TAB module. For encoders, the TAB module acts as a decoder, and every position of the TAB module attends over all positions of the encoder output for the input sequence. For decoders, the TAB module acts as an encoder, and every position of the decoders attends over all positions of the TAB module’s output. Formally, we define the TAB module as:
$$H_{tab}^{l} = \mathrm{FFN}\big(\mathrm{ATT}(Q, K, V)\big), \quad l \in [1, L],$$
$$Q = H_{tab}^{l-1} \in \mathbb{R}^{d \times r}, \qquad K, V = H_{enc}^{i} \in \mathbb{R}^{d \times n},$$

where $H_{tab}^{l}$ denotes the hidden states of the $l$th TAB layer, $H_{tab}^{L}$ is the output hidden states used in the cross-attention between the TAB module and the decoders, and $H_{tab}^{0}$ is the input embedding of the TAB module, which is the positional encoding in our proposed module. There are different choices of positional encodings, learned or fixed [45]. Additionally, $\mathrm{FFN}(\cdot)$ is the position-wise feed-forward network used in the Transformer model, while $\mathrm{ATT}(\cdot)$ is the multi-head attention mechanism. The queries $Q$ originate from the preceding TAB layer, and the memory keys $K$ and values $V$ are obtained from the output of the $i$th language encoder $H_{enc}^{i}$; $d$ is the hidden size, and $n$ is the input sequence length of the encoder. Similar to $n$, $r$ denotes the sequence length of the TAB module. However, different from $n$, $r$ is a fixed value and does not change with the input.
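The following PyTorch-style sketch shows one way to realize the TAB as defined above: $r$ learned query positions cross-attend to the encoder output in each of $L$ layers, followed by a feed-forward sub-layer. Residual and normalization details are simplified relative to a full Transformer, and the hyperparameters mirror those reported in Section 6.2.1; this is a sketch under these assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class TransformerAttentionBridge(nn.Module):
    """Sketch of the TAB: r learned query positions (H_tab^0) cross-attend to
    the encoder output in each layer, followed by a feed-forward sub-layer."""

    def __init__(self, num_layers=3, r=50, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        # H_tab^0: learned positional encoding of fixed length r.
        self.queries = nn.Parameter(torch.randn(r, d_model))
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(d_model, n_heads) for _ in range(num_layers)])
        self.ffn = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_layers)])
        self.norm1 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_layers)])
        self.norm2 = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(num_layers)])

    def forward(self, enc_out, enc_padding_mask=None):
        # enc_out: (n, batch, d_model) output of any language-specific encoder.
        h = self.queries.unsqueeze(1).expand(-1, enc_out.size(1), -1)  # (r, batch, d)
        for attn, ffn, n1, n2 in zip(self.attn, self.ffn, self.norm1, self.norm2):
            # Q = H_tab^{l-1}; K, V = encoder output (see the equations above).
            a, _ = attn(h, enc_out, enc_out, key_padding_mask=enc_padding_mask)
            h = n1(h + a)
            h = n2(h + ffn(h))
        return h  # fixed-length interlingual representation, (r, batch, d_model)
```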
Through the cross-attention between the encoders and the TAB module, different length input sequences from the language-specific encoders are mapped into a fixed-size representation of r hidden states at the output of the module. Based on the analysis at the beginning of this section, the fixed-length output embedding represents a better language-independent semantic space than the variable-length output embedding that is obtained from the top layers of shared encoders. The experimental results also prove our speculation (see the comparison between M2 (JM2M + DAE + SEL) and M2 (JM2M + DAE + TAB) in Section 6.2.2).
Each position of the TAB module uses its position embedding as the query to attend over all positions of $H_{enc}^{i}$ and forms a semantic subspace, which represents the same semantic components of different sentences from both the same and different languages. The sequence length of the TAB module, $r$, represents the number of subspaces and is worth further discussion. If $r$ is small, we can expect that the information in each subspace is “dense”. This helps to form a common semantic space, but it can also cause information bottlenecks, leading to information loss. If $r$ is large, the information in each subspace is “sparse”. This type of space can retain more information, but a large $r$ will also increase the computational complexity because, for the multi-head attention mechanism, the complexity is $O(r^2)$. Therefore, we have to choose the appropriate $r$ carefully. In the experimental settings of the zero-shot translation experiment (Section 6.2.1), we perform a hyperparameter search and set $r$ to 50 during all experiments.
The M2 model incorporated with the proposed TAB module only needs a trivial modification for decoders to adapt to this module. We only need to change the encoder–decoder attention to TAB-decoder attention, which means that the memory keys and values originate from the output of the TAB module instead of the encoder in the original M2 model. This process allows every position in the decoder to attend over all positions in the hidden states of the TAB module.
After incorporating TAB as a neural interlingual module, Equation (2) is updated to
$$p(y^j \mid x^i) = dec_j\big(\mathrm{TAB}(enc_i(x^i))\big),$$
and Equation (4) is updated to
$$p(x^k \mid g(x^k)) = dec_k\big(\mathrm{TAB}(enc_k(g(x^k)))\big),$$
where TAB denotes the proposed Transformer attention bridge module.

6. Experiments

To show that our proposed method rectifies the ill-formed interlingual space, we projected the language-independent sentence embeddings into two dimensions to visualize them. In this low-dimensional space, parallel translations of German, English, and French sentences remain close to each other.
Interlingual space is the basis for zero-shot translation and incremental training. Thus, we conducted experiments with both settings. The results further prove the effectiveness of our method in forming a language-independent interlingual space for the M2 model trained in the JM2M environment.

6.1. Interlingua Visualization

We followed the steps in Section 3.3 to project multi-parallel sentences of German, English, and French into two dimensions. As Figure 8 shows, in addition to plotting the sentence embeddings from the M2 model trained in the M2M and JM2M environments, we also visualize the sentence embedding from our proposed model by adding a denoising autoencoder task and incorporating a Transformer attention bridge to the M2 model trained in the JM2M environment (JM2M + DAE + TAB). We can see that the embeddings of JM2M + DAE + TAB are clustered together, similar to the M2M environment, which implies that the interlingua of JM2M + DAE + TAB is formed better than in the JM2M environment.
Let us take a closer look at the embeddings produced by the JM2M + DAE + TAB method. As in the preliminary experiment in Section 3.3, we selected four groups of parallel sentences from the Europarl dataset [46]. Each group is represented by one color and includes German, English, and French sentences that are translations of each other (see Table 1). The embeddings of these sentences were projected to $\mathbb{R}^2$ using t-SNE. As shown in Figure 9, the groups are separated from each other, while the embeddings of the three sentences within each group are close together. This shows that our proposed model learns language-independent semantic information and rectifies the ill-formed interlingual space formed by the M2 model trained in the JM2M environment (compared with Figure 5b).
To provide a much larger evaluation of latent space alignment, we measured the representation similarity of parallel sentences from a parallel corpus. Specifically, we selected English, German, and French sentences from the test2006 set of the Europarl corpus, which includes 2000 multi-parallel sentences per language. We averaged the cosine similarity over the 2000 pairs of sentence embeddings. As shown in Table 2, the results are consistent with our observations from the interlingua visualization. For example, the M2M environment, trained on all language pairs, aligns the three languages well. The similarity scores of the JM2M environment, trained on English-centric pairs, are almost zero between English and German and between English and French. Our proposed model, JM2M + DAE + TAB, obtains scores similar to the M2M environment for all three language pairs and even outperforms it in the average score.
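The similarity score can be computed as the mean pairwise cosine similarity over the multi-parallel test sentences, e.g. as in the following sketch (array names are illustrative).

```python
import numpy as np

def average_cosine_similarity(emb_a, emb_b):
    """Average cosine similarity between the sentence embeddings of two languages.
    emb_a, emb_b: (n_sentences, hidden) arrays of mean-pooled encoder outputs for
    the same multi-parallel sentences (e.g., the 2000 test2006 sentences)."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))
```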

6.2. Zero-Shot Translation

6.2.1. Experimental Settings

Dataset

We used the Europarl corpus [46] to collect multi-parallel data in four languages: German, English, Finnish, and French. From 1.56 M multi-parallel sentences, we generated 500 K, 10 K, and 10 K (training, validation, and testing, respectively) non-sharing pairs for each of the twelve potential translation directions. For the M2M environment, we used all language pairs. For the JM2M environment, we only used English-centered language pairs, namely, En-De, En-Fi, and En-Fr. All the remaining directions were used to evaluate zero-shot performance. We exclusively utilized monolingual data taken from the parallel training data for the mono-direction translation task. No additional monolingual data were used.

Model

We utilized the Aharoni et al. [2] model for the 1-1 model, which is based on Johnson et al. [4]’s Transformer implementation. We adjusted Firat et al. [1]’s M2 model to avoid sharing the attention module. The encoder and decoder share language-specific embeddings. Transformer [45] is used to implement all models. For our model, we utilize a Transformer with a hidden dimension of 256 and a feed-forward size of 1024. For NIM-TAB, we set three layers for the TAB module and $r = 50$ (searched over 20, 50, 100, and 1024). For the positional encoding of the TAB input, we experimented with learned and sinusoidal positional encodings and found that the learned version performs slightly better than the fixed version. Therefore, we utilized the learned positional encoding in all our experiments. For NIM-SEL, we set the encoders of the M2 model to nine layers and shared the top three layers as an interlingua, which makes the NIM comparable to the TAB module in terms of model size. Except for the attention and activation dropouts of 0.1, the remainder of the setup is the same as the base model used by Vaswani et al. [45]. The 1-1 model uses a shared vocabulary of 32K tokens, whilst the M2 model employs a language-specific vocabulary of 16K tokens per language, both of which are processed using the BPE [47] tokenizer of the sentencepiece package (https://github.com/google/sentencepiece, accessed on 22 June 2022) [48].

Training

All models were trained and tested using the fairseq framework (https://github.com/pytorch/fairseq, accessed on 8 June 2022) [49]. We selected the batch size such that each encoder/decoder module may learn up to 6144 tokens per GPU. Four NVIDIA Tesla V100 GPUs were used to train all models. We used the default settings of the Adam optimizer [50]. We employed 4 K warm-up steps, during which the learning rate increased to 0.002, followed by the inverse square root learning rate schedule [45]. The model was optimized using cross-entropy loss with a label smoothing value of $\epsilon_{ls} = 0.1$ [51]. Within the same maximum number of epochs, the best model was chosen according to the best validation loss. The sacreBLEU [52] metric was calculated using a beam size of four and a length penalty of 0.6. To train the M2 model, we used round-robin scheduling of all directions excluding the English mono-direction. The noise function employed a token deletion rate of 0.2, a token masking rate of 0.1, a token unit of WordPiece, and a span width of unigram.
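For reference, a sketch of the warm-up plus inverse square root schedule described above (the function name and exact formulation are ours; fairseq provides an equivalent built-in scheduler):

```python
def inverse_sqrt_lr(step, warmup_steps=4000, peak_lr=0.002):
    """Linear warm-up to peak_lr over warmup_steps, then decay proportional to
    the inverse square root of the update step."""
    step = max(step, 1)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps ** 0.5) * (step ** -0.5)
```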

Baselines

We compare our methods with three baselines. (1) The 1-1 model trained in the JM2M environment, which has the ability to perform zero-shot translation [4]. We empirically find that appending the language token to the source and target sentences simultaneously results in the best performance, so we use this setting to train this model. In multilingual translation, for directions without parallel training data, pivot-based translation is a commonly used method. Pivot-based translation translates the source language to a pivot language and then to the target language in two steps, which makes the translation process inefficient. However, pivot-based translation usually acts as a gold standard for evaluating the zero-shot performance of multilingual NMT. Therefore, we use two pivot-based baselines in our experiments: (2) PIV-S uses two single-pair NMT models trained on each pair to conduct pivot-based translation; and (3) PIV-M uses the M2 model trained in the JM2M environment to perform a two-step pivot-based translation. Both pivot methods use English as the pivot language.
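The two-step pivot procedure amounts to the following sketch, where `translate` is a stand-in for whichever model (single-pair NMT for PIV-S, the M2 model for PIV-M) decodes a single supervised direction.

```python
def pivot_translate(src_sentence, src_lang, tgt_lang, translate, pivot_lang="en"):
    """Two-step pivot-based translation: source -> pivot (English) -> target."""
    pivot_sentence = translate(src_sentence, src_lang, pivot_lang)
    return translate(pivot_sentence, pivot_lang, tgt_lang)
```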

6.2.2. Results and Analysis

Table 3 displays the results. Because of the ill-formed interlingual space, the vanilla M2 model trained in the JM2M environment is incapable of zero-shot translation. The BLEU scores of zero-shot translation are all less than one. Nevertheless, the 1-1 model trained in the JM2M environment has the ability for zero-shot translation to some extent, which was first observed in [4] and researched in many works [4,25,26].
In the third group of Table 3, as we expected, adding mono-direction translation provides the M2 model with the ability to perform zero-shot translation. Adding only a reconstruction task already yields a notable average zero-shot score of 11.34. Adding our proposed DAE task further improves the performance (15.08). Note that with only the DAE task, the M2 model can outperform the 1-1 model (15.08 vs. 13.76) in average zero-shot performance. We can also observe that adding a reconstruction task is a disadvantage for translation directions to and from English (henceforth parallel translation): the average score drops (by 1.69) because the modules learn to merely replicate the input, which impedes translation training. The DAE task avoids this problem because it uses corrupted sentences on the source side, so the language-specific module cannot simply copy the source to the target. Due to the superiority of the DAE task over the reconstruction task, we use the DAE task in the subsequent experiments.
In the fourth group of Table 3, based on expanding the JM2M environment with three DAE tasks, we integrate the two kinds of NIM into the M2 model and compare their performance on zero-shot translation. We can see that both NIM methods further improve the zero-shot performance compared to using only the DAE tasks (approximately 5∼6 BLEU points on average). Although SEL behaves slightly better than TAB on parallel translation (an average score 0.51 higher), TAB achieves much better performance in every direction of zero-shot translation (1∼2 BLEU points higher, 1.45 on average). This shows that our proposed Transformer attention bridge can form a better interlingual space than only sharing the top encoder layers, which benefits the transfer learning ability for zero-shot translation. We obtain better performance when combining the DAE task and TAB, which outperforms the 1-1 baseline by a large margin (8.22) and is similar to the pivot-based methods. Furthermore, the M2 (JM2M + DAE + TAB) method surpasses the pivot-based methods in all zero-shot directions when combined with back-translation [53]: the model itself constructs pseudo parallel sentence pairs for all zero-shot translation directions and is then further trained on the mixture of these pseudo data and the English-centric data. Gu et al. [23] introduced back-translation to ignore spurious correlations for zero-shot translation of the 1-1 model, while we adopt it for the M2 model.
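The back-translation step can be sketched as follows; the data structures and the `translate(sentence, from_lang, to_lang)` stand-in (decoding with the trained M2 (JM2M + DAE + TAB) model) are illustrative assumptions.

```python
def build_backtranslation_data(mono_sentences, zero_shot_pairs, translate):
    """For each zero-shot direction i -> j, translate monolingual target-side
    sentences j -> i to produce pseudo sources, paired with the original targets."""
    pseudo_data = []
    for src_lang, tgt_lang in zero_shot_pairs:
        for tgt_sentence in mono_sentences[tgt_lang]:
            pseudo_src = translate(tgt_sentence, tgt_lang, src_lang)
            pseudo_data.append((pseudo_src, src_lang, tgt_sentence, tgt_lang))
    return pseudo_data  # mixed with the English-centric data for further training
```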
To reveal the relationship between two steps within the proposed framework, i.e., adding mono-direction translation and incorporating an NIM, we experimented on integrating an NIM without using the DAE task. In the second group from the bottom of Table 3, we see that solely incorporating an NIM in the M2 model has no noticeable improvement on zero-shot performance. The results show that the two steps of the proposed framework are not independent. Mono-direction translation is essential in the framework that rectifies ill-formed interlingual space for the M2 model trained in the JM2M environment. Based on multiway training in the JM2M environment and mono-direction translations, the NIM further helps the M2 model form the shared semantic interlingual space. In fact, an NIM is particularly suitable for use with the DAE task, which injects noise that breaks the sentence’s syntax but keeps the semantics. An NIM can learn to keep only language-independent semantic information and discard language-specific syntax information. This makes the NIM exploit semantic information from the DAE task and ignore the perturbed syntax.

6.3. Zero-Shot Learning of Adding a New Language Incrementally

6.3.1. Experimental Settings

Dataset

We re-partitioned the multi-parallel data of the Europarl corpus to increase the number of languages. A 1.25 M multi-parallel dataset was split into 250 K non-sharing sentences for each direction, using five languages: German, English, Spanish, French, and Dutch. Most other aspects were carried over from the previous experiments.

Model

In this experiment, we used the best model found in zero-shot translation experiments (Section 6.2), namely, M2 (JM2M + DAE + TAB). The other details were generally the same as in former experiments.

Training

In the initial training stage, we trained an M2 model with the available translation directions among four languages (German, English, Spanish, and Dutch). In the incremental training stage, we added French to the trained model. We experimented with two settings for training the added French module: (1) we trained the French encoder and decoder using English–French pairs and the French DAE task, while the parameters of the English and TAB modules remained frozen, i.e., only the parameters of the French module were trained; and (2) we trained the French encoder and decoder on the same data, but only the English module remained frozen, i.e., the parameters of the French module and the TAB were trained. For comparison, we also trained an M2 (JM2M + DAE + TAB) model using five languages from scratch, which serves as an upper bound for the incremental training. The other details were generally the same as in the former experiments.

6.3.2. Result and Analysis

The results are shown in Table 4. From an overall view, we can see that both settings of incremental training can reach comparable performance with joint training from scratch with much less effort. Especially for setting (1), by only training the Fr module for approximately 3.9 h, the average performance of incremental zero-shot translation is marginally lower than joint training from scratch, which takes approximately three times longer than incremental training (12.9 h). This result shows that our approach can form a good interlingual space that is independent of language. The interlingual space trained in the initial training stage is preserved as the weights of language modules and NIM are frozen. The new module (Fr) is adjusted to the initially formed interlingual space by training it with a parallel corpus (En-Fr) and DAE task (Fr-Fr) using one of the frozen modules (En). The new module (Fr) can be compatible with the existing modules (De, Es, and Nl) if this interlingual space is formed well.
Comparing the two settings of incremental training, some findings are of interest. Considering the impact of incremental training on the translation directions from the initial training stage, it is no surprise that in the Fr setting, the performance of these directions is unchanged because all modules except Fr are frozen. However, their performance in the Fr + TAB setting decreased slightly (1.09 for P-AVG and 1.57 for Z-AVG), which implies that adapting the TAB to the Fr module hurts the performance of the initially trained modules due to alteration of the interlingual space formed in the initial training stage. For the translation directions added in the incremental training stage, the situation is more complicated. For parallel translation (En-Fr, Fr-En), Fr + TAB has more parameters than Fr trained on these two parallel pairs, which benefits the performance of parallel translations. However, for zero-shot translation, the results are mixed. Fr shows better performance in directions from Fr (Fr→De, Fr→Es, Fr→Nl), while Fr + TAB performs better in directions to Fr (De→Fr, Es→Fr, Nl→Fr) than JOINT, which is the upper bound for incremental training. The results indicate that when TAB is frozen, the Fr encoder has to adapt to the interlingual space formed in the initial training stage, which is more compatible with the decoders other than Fr. If TAB is trained together, the initially learned interlingual space is transformed to adapt to the Fr decoder, which makes it less compatible with the other decoders. For this reason, we observe that the performance in the XX→Fr directions increases while the performance in the Fr→XX directions decreases. This result suggests that neither Fr nor Fr + TAB is the best strategy for incremental training, and a better method should be considered. We leave this to future study.

7. Discussion

In this section, we first provide some takeaways about model choice for readers. Then, some comparisons and discussions with the existing similar conclusions are made to show the innovation and contribution of the proposed method. Finally, we discuss some potential directions for future research.

7.1. Takeaways for Model Choice

The 1-1 model is the mainstream approach in multilingual NMT, which possesses merits such as a compact model architecture and the ability of cross-lingual transfer. The latter is essential for improving translation performance for low-resource languages and for the emerging capability of zero-shot translation. However, we can use the M2 model as an alternative to the 1-1 model in two circumstances. The first is when models need to accommodate large numbers of languages and large amounts of data. An increased number of languages and data tends to cause a capacity bottleneck for the 1-1 model, which leads to the degeneration of translation quality for high-resource languages. Moreover, for languages that come from different language families and do not share scripts (e.g., Chinese, Arabic, Russian, and Greek), the shared vocabulary becomes larger due to less vocabulary overlap, which increases the inference cost and reduces the speed of translating sentences. M2 models are free from these problems because the parameters and vocabulary of each language are independent thanks to the modular architecture. Another scenario suitable for the M2 model is the situation where we need to quickly extend an existing machine translation system to support new languages. In industrial settings, a rapid extension of an existing multilingual NMT system is critical in some situations, such as assisting in a catastrophe where international assistance is necessary in a limited territory, or building a translation system for a new customer whose mother language is not covered by the existing multilingual NMT system. To support new languages, the 1-1 model needs to expand the vocabulary and train from scratch using all training data. This process consumes much time and cost, which makes it difficult to meet the requirements of rapid deployment. In contrast, the M2 model naturally satisfies this need through incremental learning, which only requires training the newly added language modules on the new language pairs.

7.2. Comparison with Similar Conclusions

7.2.1. Language Adapter

The modularity of the M2 model endows it with the ability to perform incremental learning, where the model only needs to train the newly added language module on new language data to support translation between the new language and the original languages while retaining the performance of the original translation directions. Another solution similar to this idea is the usage of adapters [22]. The adapter is a sub-network with fewer parameters than the adapted model. When adding new languages, the model with adapters can adapt to them by training the adapters with the new language data while keeping the model parameters invariant. Due to the relatively small size of adapters, the new data can be fit with less effort than training the whole model. Furthermore, because the original network remains unchanged, the translation quality of the original directions is not affected. Moreover, Philip et al. [54] added language-specific adapters to the encoder and decoder and trained the adapters with English-centric data to perform zero-shot translation between language pairs excluding English. One limitation of the adapter technique is that it can only expand the input–output distribution of the original pretrained model to include all language pairs that we are interested in supporting. In other words, we cannot add any new language pairs to the model during adaptation; rather, we utilize the adaptation stage to enhance performance on languages that were pretrained. In contrast, the M2 model, which can be seen as taking the language module (including encoder and decoder) as the adapter, provides additional versatility because the first phase of training does not change depending on which new language is subsequently fine-tuned. This could be appropriate for a real-world scenario where data for the new language are acquired later.

7.2.2. Positional Disentangled Encoder

In Section 5.2, we mention that the motivation for introducing the TAB to replace the SEL is that the hidden state sequence of the Transformer, i.e., the interlingua, which should be a language-agnostic representation, is prone to language specificity because a sentence in different languages is likely to have varying lengths. Therefore, we use the cross-attention mechanism to map the variable-length hidden state sequences of the encoders to the fixed-length hidden state sequence of the TAB, which is more likely to be a language-agnostic representation thanks to the removal of language-specific information such as sentence length. Based on a similar idea, Liu et al. [30] revealed that the encoder output has a strong positional correspondence to the input tokens in a typical Transformer encoder. The same semantic meaning is encoded into different hidden state sequences because a sentence in a different language is likely to have a different length and word order, which hinders the creation of a language-independent representation. The authors hypothesize that the residual connections present throughout all layers are the main factor causing the outputs to be positionally aligned with the input tokens, and they loosen this constraint by dropping the residual connections in one encoder layer. Because their approach is orthogonal to ours, both methods could be combined to help the encoder output a better language-agnostic representation. We leave that for future study.

7.2.3. Shared or Language-Specific Parameters

The M2 model is a practical alternative to the 1-1 model. In this work, we introduce the DAE task and the NIM to address the problem that the M2 model cannot perform zero-shot translation due to its lack of cross-lingual transfer capability. The NIM is shared among the language-specific modules, which turns the M2 model equipped with an NIM into a partially shared model, situated between a completely language-specific model and a fully shared model. This partially shared model not only enjoys the benefit of positive transfer across languages enabled by the shared parameters (NIM) but also avoids task interference between dissimilar languages and the capacity bottleneck caused by an increased number of languages and data, thanks to the language-specific parameters (language modules). This naturally leads to a question: Is there a better strategy for sharing part of the M2 model’s parameters so as to achieve better performance? Zhang et al. [55] used a conditional computation method [56] to enable the 1-1 model to automatically learn which sub-layers of Transformer parameters are language-specific to maximize translation quality under a given budget constraint. Inspired by their work, we can also use conditional computation to investigate which parts of the M2 model’s parameters can be shared to achieve the best translation quality and reduce the model parameters without compromising performance. We leave this for future research.

7.3. Future Research

Based on the above discussion and the current work, we have the following directions worthy of further research:
  • We can use methods that are orthogonal to the proposed methods to further improve the zero-shot translation quality for the M2 model, such as the positional disentangled encoder.
  • We can study the strategy of shared parameters for the M2 model to reduce the number of model parameters without losing the translation quality and model flexibility. We can use conditional computation to allow the model to automatically learn the best strategy of shared parameters from the data.
  • We need to explore the potential capacity bottleneck created by the introduction of NIM and find a solution.
  • The use of the NIM as shared parameters undermines the modular structure of the M2 model, which prevents the NIM from incremental learning when adding new languages. One possible research direction is to modularize NIM while maintaining its cross-lingual transfer capability in order to support incremental updating of partial model parameters.
  • Thanks to the modularity of the M2 model, we can explore the possibility of adding multiple modalities such as image or speech with modality-specific modules.

8. Conclusions

The modularized multilingual NMT (M2) model is a practical alternative to the 1-1 model, which shares one encoder and one decoder, in industrial settings. However, the M2 model trained in the JM2M environment cannot perform zero-shot translation, which limits its industrial application. In this paper, we studied zero-shot translation with the M2 model trained in the JM2M environment.
Because the interlingual space is the basis for cross-lingual transfer and zero-shot translation, we first compared the interlingual spaces formed by the M2 model trained in the M2M and JM2M environments and identified the ill-formed interlingual space produced by training in the JM2M environment. To rectify this ill-formed interlingual space, we proposed a framework comprising two steps: adding mono-direction translation and incorporating a neural interlingual module. Using this framework, we devised an approach that extends the multiway training of the M2 model with a denoising autoencoder task and incorporates a Transformer attention bridge as the neural interlingual module. Visualizing the interlingua of the proposed model shows that the ill-formed interlingual space is improved. Furthermore, two experiments show that the improved interlingual space benefits zero-shot translation: in the zero-shot translation experiment, our model outperforms the 1-1 model by a large margin and surpasses pivot-based translation when combined with back-translation; in the incremental training experiment, our model achieves performance comparable to the model trained from scratch at about one-third of the training time.
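As an illustration of the denoising autoencoder task mentioned above, the sketch below shows one common way a monolingual sentence can be corrupted before a language module is asked to reconstruct it; the particular noise types and rates (token dropping and local shuffling) are assumptions for illustration and not necessarily the exact noise used in our experiments.

```python
# Illustrative noise function for a denoising autoencoder (DAE) objective.
import random

def add_noise(tokens, drop_prob=0.1, shuffle_window=3, seed=None):
    rng = random.Random(seed)
    # Randomly drop tokens (keep at least one).
    kept = [t for t in tokens if rng.random() > drop_prob] or tokens[:1]
    # Local shuffle: each token moves at most `shuffle_window - 1` positions.
    keys = [i + rng.uniform(0, shuffle_window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

# The DAE task then trains each language module on (add_noise(sent), sent) pairs,
# i.e., mono-direction "translation" from the noised sentence back to the original.
print(add_noise("the charter should be incorporated into the treaties".split(), seed=0))
```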

Author Contributions

Conceptualization, Y.S.; methodology, J.L.; software, J.L.; validation, J.L.; formal analysis, J.L.; data curation, J.L.; writing—original draft preparation, J.L.; writing—review and editing, Y.S.; visualization, J.L.; supervision, Y.S.; project administration, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw corpora used in this article are from publicly available datasets, which can be obtained from the links given in the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Firat, O.; Cho, K.; Bengio, Y. Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 866–875. [Google Scholar] [CrossRef] [Green Version]
  2. Aharoni, R.; Johnson, M.; Firat, O. Massively Multilingual Neural Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 3874–3884. [Google Scholar] [CrossRef] [Green Version]
  3. Ha, T.; Niehues, J.; Waibel, A.H. Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder. arXiv 2016, arXiv:1611.04798. [Google Scholar]
  4. Johnson, M.; Schuster, M.; Le, Q.V.; Krikun, M.; Wu, Y.; Chen, Z.; Thorat, N.; Viégas, F.B.; Wattenberg, M.; Corrado, G.; et al. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. Trans. Assoc. Comput. Linguist. 2017, 5, 339–351. [Google Scholar] [CrossRef] [Green Version]
  5. Arivazhagan, N.; Bapna, A.; Firat, O.; Lepikhin, D.; Johnson, M.; Krikun, M.; Chen, M.X.; Cao, Y.; Foster, G.F.; Cherry, C.; et al. Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. arXiv 2019, arXiv:1907.05019. [Google Scholar]
  6. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, Montreal, QC, Canada, 8–13 December 2014; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N.D., Weinberger, K.Q., Eds.; Curran Associates: Red Hook, NY, USA, 2014; pp. 3104–3112. [Google Scholar]
  7. Zhang, B.; Williams, P.; Titov, I.; Sennrich, R. Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1628–1639. [Google Scholar] [CrossRef]
  8. Huang, Y.; Cheng, Y.; Bapna, A.; Firat, O.; Chen, D.; Chen, M.X.; Lee, H.; Ngiam, J.; Le, Q.V.; Wu, Y.; et al. GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H.M., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E.B., Garnett, R., Eds.; Curran Associates: Red Hook, NY, USA, 2019; pp. 103–112. [Google Scholar]
  9. Lepikhin, D.; Lee, H.; Xu, Y.; Chen, D.; Firat, O.; Huang, Y.; Krikun, M.; Shazeer, N.; Chen, Z. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, 3–7 May 2021. [Google Scholar]
  10. Lyu, S.; Son, B.; Yang, K.; Bae, J. Revisiting Modularized Multilingual NMT to Meet Industrial Demands. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5905–5918. [Google Scholar] [CrossRef]
  11. Escolano, C.; Costa-jussà, M.R.; Fonollosa, J.A.R.; Artetxe, M. Multilingual Machine Translation: Closing the Gap between Shared and Language-specific Encoder-Decoders. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, EACL 2021, Online, 19–23 April 2021; Merlo, P., Tiedemann, J., Tsarfaty, R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 944–948. [Google Scholar]
  12. Dabre, R.; Chu, C.; Kunchukuttan, A. A Survey of Multilingual Neural Machine Translation. ACM Comput. Surv. 2020, 53, 99. [Google Scholar] [CrossRef]
  13. Dong, D.; Wu, H.; He, W.; Yu, D.; Wang, H. Multi-Task Learning for Multiple Language Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China, 26–31 July 2015; Association for Computational Linguistics: Stroudsburg, PA, USA, 2015; Volume 1, pp. 1723–1732. [Google Scholar] [CrossRef] [Green Version]
  14. Lee, J.; Cho, K.; Hofmann, T. Fully Character-Level Neural Machine Translation without Explicit Segmentation. Trans. Assoc. Comput. Linguist. 2017, 5, 365–378. [Google Scholar] [CrossRef] [Green Version]
  15. Zoph, B.; Knight, K. Multi-Source Neural Translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; pp. 30–34. [Google Scholar] [CrossRef]
  16. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual Denoising Pre-training for Neural Machine Translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
  17. Wang, X.; Pham, H.; Arthur, P.; Neubig, G. Multilingual Neural Machine Translation with Soft Decoupled Encoding. In Proceedings of the 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  18. Blackwood, G.; Ballesteros, M.; Ward, T. Multilingual Neural Machine Translation with Task-Specific Attention. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 3112–3122. [Google Scholar]
  19. Sachan, D.; Neubig, G. Parameter Sharing Methods for Multilingual Self-Attentional Translation Models. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 261–271. [Google Scholar] [CrossRef] [Green Version]
  20. Platanios, E.A.; Sachan, M.; Neubig, G.; Mitchell, T. Contextual Parameter Generation for Universal Neural Machine Translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 425–435. [Google Scholar] [CrossRef] [Green Version]
  21. Zaremoodi, P.; Buntine, W.; Haffari, G. Adaptive Knowledge Sharing in Multi-Task Learning: Improving Low-Resource Neural Machine Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; Volume 2, pp. 656–661. [Google Scholar] [CrossRef] [Green Version]
  22. Bapna, A.; Firat, O. Simple, Scalable Adaptation for Neural Machine Translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 1538–1548. [Google Scholar]
  23. Gu, J.; Wang, Y.; Cho, K.; Li, V.O. Improved Zero-shot Neural Machine Translation via Ignoring Spurious Correlations. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 1258–1268. [Google Scholar] [CrossRef] [Green Version]
  24. Ji, B.; Zhang, Z.; Duan, X.; Zhang, M.; Chen, B.; Luo, W. Cross-Lingual Pre-Training Based Transfer for Zero-Shot Neural Machine Translation. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, the Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, 7–12 February 2020; AAAI Press: Palo Alto, CA, USA, 2020; pp. 115–122. [Google Scholar]
  25. Al-Shedivat, M.; Parikh, A. Consistency by Agreement in Zero-Shot Neural Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1, pp. 1184–1197. [Google Scholar] [CrossRef]
  26. Arivazhagan, N.; Bapna, A.; Firat, O.; Aharoni, R.; Johnson, M.; Macherey, W. The Missing Ingredient in Zero-Shot Neural Machine Translation. arXiv 2019, arXiv:1903.07091. [Google Scholar]
  27. Sestorain, L.; Ciaramita, M.; Buck, C.; Hofmann, T. Zero-Shot Dual Machine Translation. arXiv 2018, arXiv:1805.10338. [Google Scholar]
  28. He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; Ma, W. Dual Learning for Machine Translation. In Proceedings of the Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Barcelona, Spain, 5–10 December 2016; Lee, D.D., Sugiyama, M., von Luxburg, U., Guyon, I., Garnett, R., Eds.; Curran Associates: Red Hook, NY, USA, 2016; pp. 820–828. [Google Scholar]
  29. Chen, G.; Ma, S.; Chen, Y.; Zhang, D.; Pan, J.; Wang, W.; Wei, F. Towards Making the Most of Cross-Lingual Transfer for Zero-Shot Neural Machine Translation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 142–157. [Google Scholar]
  30. Liu, D.; Niehues, J.; Cross, J.; Guzmán, F.; Li, X. Improving Zero-Shot Translation by Disentangling Positional Information. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Virtual Event, 1–6 August 2021; Volume 1, pp. 1259–1273. [Google Scholar]
  31. Wang, W.; Zhang, Z.; Du, Y.; Chen, B.; Xie, J.; Luo, W. Rethinking Zero-shot Neural Machine Translation: From a Perspective of Latent Variables. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 4321–4327. [Google Scholar]
  32. Raganato, A.; Vázquez, R.; Creutz, M.; Tiedemann, J. An Empirical Investigation of Word Alignment Supervision for Zero-Shot Multilingual Neural Machine Translation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 8449–8456. [Google Scholar]
  33. Wu, L.; Cheng, S.; Wang, M.; Li, L. Language Tags Matter for Zero-Shot Neural Machine Translation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Online Event, 1–6 August 2021; pp. 3001–3007. [Google Scholar]
  34. Gonzales, A.R.; Müller, M.; Sennrich, R. Subword Segmentation and a Single Bridge Language Affect Zero-Shot Neural Machine Translation. In Proceedings of the Fifth Conference on Machine Translation, Online Event, 19–20 November 2020; pp. 528–537. [Google Scholar]
  35. Lu, Y.; Keung, P.; Ladhak, F.; Bhardwaj, V.; Zhang, S.; Sun, J. A neural interlingua for multilingual machine translation. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 84–92. [Google Scholar] [CrossRef] [Green Version]
  36. Vázquez, R.; Raganato, A.; Tiedemann, J.; Creutz, M. Multilingual NMT with a Language-Independent Attention Bridge. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), Florence, Italy, 2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 33–39. [Google Scholar] [CrossRef]
  37. Siddhant, A.; Bapna, A.; Cao, Y.; Firat, O.; Chen, M.X.; Kudugunta, S.R.; Arivazhagan, N.; Wu, Y. Leveraging Monolingual Data with Self-Supervision for Multilingual Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 2827–2835. [Google Scholar] [CrossRef]
  38. Escolano, C.; Costa-jussà, M.R.; Fonollosa, J.A.R. From Bilingual to Multilingual Neural Machine Translation by Incremental Training. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Florence, Italy, 28 July–2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 236–242. [Google Scholar] [CrossRef]
  39. Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  40. Zhu, C.; Yu, H.; Cheng, S.; Luo, W. Language-aware Interlingua for Multilingual Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1650–1655. [Google Scholar] [CrossRef]
  41. Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J.R., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
  42. Conneau, A.; Wu, S.; Li, H.; Zettlemoyer, L.; Stoyanov, V. Emerging Cross-lingual Structure in Pretrained Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 6022–6034. [Google Scholar] [CrossRef]
  43. Artetxe, M.; Ruder, S.; Yogatama, D. On the Cross-lingual Transferability of Monolingual Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4623–4637. [Google Scholar] [CrossRef]
  44. Kudugunta, S.R.; Bapna, A.; Caswell, I.; Firat, O. Investigating Multilingual NMT Representations at Scale. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, 3–7 November 2019; Inui, K., Jiang, J., Ng, V., Wan, X., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 1565–1575. [Google Scholar] [CrossRef] [Green Version]
  45. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; Curran Associates: Red Hook, NY, USA, 2017; pp. 5998–6008. [Google Scholar]
  46. Koehn, P. Europarl: A parallel corpus for statistical machine translation. In Proceedings of the MT Summit, Phuket, Thailand, 13–15 September 2005; Citeseer: Princeton, NJ, USA, 2005; Volume 5, pp. 79–86. [Google Scholar]
  47. Kudo, T. Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; Volume 1, pp. 66–75. [Google Scholar] [CrossRef] [Green Version]
  48. Kudo, T.; Richardson, J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 66–71. [Google Scholar] [CrossRef] [Green Version]
  49. Ott, M.; Edunov, S.; Baevski, A.; Fan, A.; Gross, S.; Ng, N.; Grangier, D.; Auli, M. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations): Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Ammar, W., Louis, A., Mostafazadeh, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 48–53. [Google Scholar] [CrossRef]
  50. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  51. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar] [CrossRef] [Green Version]
  52. Post, M. A Call for Clarity in Reporting BLEU Scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, 31 October–1 November 2018; Association for Computational Linguistics: Stroudsburg, PA, USA, 2018; pp. 186–191. [Google Scholar] [CrossRef] [Green Version]
  53. Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Stroudsburg, PA, USA, 2016; Volume 1, pp. 86–96. [Google Scholar] [CrossRef]
  54. Philip, J.; Berard, A.; Gallé, M.; Besacier, L. Monolingual adapters for zero-shot neural machine translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 4465–4470. [Google Scholar]
  55. Zhang, B.; Bapna, A.; Sennrich, R.; Firat, O. Share or Not? Learning to Schedule Language-Specific Capacity for Multilingual Translation. In Proceedings of the 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, 3–7 May 2021. [Google Scholar]
  56. Bengio, Y.; Léonard, N.; Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv 2013, arXiv:1308.3432. [Google Scholar]
Figure 1. Overview of two common forms of multilingual NMT models for three languages: English (En), German (De), and French (Fr). (a) The 1-1 model, which shares all model parameters for all six directions. (b) The M2 model, in which only language-specific modules are shared.
Figure 2. Source–target translation graphs in multilingual NMT. Lines indicate that direct parallel data exist; for any pair of languages not connected by a line, the zero-shot translation approach can be applied. (a) M2M; (b) JM2M.
Figure 3. Interlingual space formed by training in the M2M and JM2M environments. (a) The well-formed interlingual space in the M2M environment enables zero-shot translation in the incremental learning stage. (b) The ill-formed interlingual space in the JM2M environment disables zero-shot translation.
Figure 4. An interlingual representation of multi-parallel sentences from the M2 model trained in the M2M and JM2M environments.
Figure 5. Interlingual embeddings for four groups of parallel En, De, and Fr sentences from the M2 model trained in the M2M and JM2M environments, respectively. (a) M2M; (b) JM2M. t-SNE is used to project the 256-dimensional mean-pooled interlingual sentence embeddings down to R^2. The colors and content of the sentences are shown in Table 1.
Figure 6. The two steps of the framework for rectifying ill-formed interlingua of the M2 model trained in the JM2M environment. (a) Adding mono-direction translation (b) Incorporating a neural interlingual module.
Figure 7. The architecture of the M2 model incorporated with the proposed Transformer attention bridge (TAB) module.
Figure 8. Interlingual representation of multi-parallel sentences from the M2 model trained in the M2M and JM2M environments and the proposed JM2M + DAE + TAB model.
Figure 9. Interlingual embeddings from the proposed method (JM2M + DAE + TAB) for four sets of parallel German, English, and French sentences. t-SNE was used to project the 256-dimensional mean-pooled interlingual sentence embeddings to R^2. The representative sentence colors and contents are shown in Table 1.
Table 1. Multi-parallel sentences used for interlingua visualization.
Color | Lang. | Text
Red | En | I hope with all my heart, and I must say this quite emphatically, that an opportunity will arise when this document can be incorporated into the Treaties at some point in the future.
Red | De | Ich hoffe unbedingt—und das sage ich mit allem Nachdruck-, dass es sich durchaus als möglich erweisen wird, diese Charta einmal in die Verträge aufzunehmen.
Red | Fr | J’espère vraiment, et j’insiste très fort, que l’on verra se présenter une occasion réelle d’incorporer un jour ce document dans les Traités.
Green | En | Should this fail to materialise, we should not be surprised if public opinion proves sceptical about Europe, or even rejects it.
Green | De | Anderenfalls darf man sich über den Skeptizismus gegenüber Europa oder gar seine Ablehnung durch die Öffentlichkeit nicht wundern.
Green | Fr | Faute de quoi comment s’étonner du scepticisme, voire du rejet de l’Europe dans l’opinion publique.
Blue | En | The Intergovernmental Conference—to address a third subject—on the reform of the European institutions is also of decisive significance for us in Parliament.
Blue | De | Die Regierungskonferenz—um ein drittes Thema anzusprechen—zur Reform der europäischen Institutionen ist für uns als Parlament ebenfalls von entscheidender Bedeutung.
Blue | Fr | Pour nous, en tant que Parlement—et j’aborde là un troisième thème-, la Conférence intergouvernementale sur la réforme des institutions européennes est aussi éminemment importante.
Orange | En | At present I feel there is a danger that if the proposal by the Belgian Government on these sanction mechanisms were to be implemented, we would be hitting first and examining only afterwards.
Orange | De | Derzeit halte ich es für bedenklich, dass zuerst besiegelt und dann erst geprüft wird, wenn sich der Vorschlag der belgischen Regierung in Bezug auf die Sanktionsmechanismen durchsetzt.
Orange | Fr | En ce moment, si la proposition du gouvernement belge devait être adoptée pour ces mécanismes de sanction, on courrait selon moi le risque de sévir avant d’enquêter.
Table 2. Cosine similarity scores of sentence embeddings for three parallel language pairs. The highest score in the average column is in bold.
Model | En-De | En-Fr | De-Fr | Avg
M2M | 0.8441 | 0.8526 | 0.8421 | 0.8463
JM2M | 0.0161 | 0.0034 | 0.8062 | 0.2752
JM2M + DAE + TAB | 0.8424 | 0.8492 | 0.8526 | 0.8481
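For reference, similarity scores of this kind, and the projections shown in Figures 5 and 9, can be obtained from interlingual hidden states roughly as sketched below: mean-pool each sentence's interlingual states into a 256-dimensional embedding, compare parallel sentences with cosine similarity, and project a batch of embeddings to two dimensions with t-SNE [39]. The array shapes, the scikit-learn call, and the random data standing in for real model outputs are assumptions for illustration, not our exact evaluation code.

```python
# Illustrative computation of mean-pooled sentence embeddings, cosine similarity,
# and a 2-D t-SNE projection (random arrays stand in for real interlingual states).
import numpy as np
from sklearn.manifold import TSNE

def sentence_embedding(states):
    # states: (seq_len, 256) interlingual hidden states for one sentence.
    return states.mean(axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
h_en, h_de = rng.normal(size=(20, 256)), rng.normal(size=(24, 256))
print(cosine(sentence_embedding(h_en), sentence_embedding(h_de)))

# Project 12 pooled embeddings (4 sentence groups x 3 languages) to 2-D for plotting.
pooled = rng.normal(size=(12, 256))
coords = TSNE(n_components=2, perplexity=5, init="pca", random_state=0).fit_transform(pooled)
```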
Table 3. Zero-shot translation results. For all M2 models except M2 (M2M), a fully supervised “upper bound”, the top two scores in each column are highlighted in bold. REC represents reconstruction task, SEL represents sharing top encoder layers as an NIM, TAB represents the Transformer attention bridge as an NIM. P-AVG and Z-AVG represent average scores of parallel translation and zero-shot translation, respectively.
Model | De-Fi | Fi-De | De-Fr | Fr-De | Fi-Fr | Fr-Fi | Z-AVG | P-AVG
PIV-S | 18.66 | 20.75 | 30.04 | 24.07 | 26.96 | 18.84 | 23.22 | –
PIV-M | 18.66 | 20.90 | 30.40 | 24.35 | 27.53 | 19.31 | 23.53 | –
1-1 (JM2M) | 10.51 | 11.75 | 18.48 | 15.08 | 15.75 | 11.00 | 13.76 | 33.01
M2 (JM2M) | 0.27 | 0.13 | 0.11 | 0.11 | 0.12 | 0.28 | 0.17 | 33.71
M2 (JM2M + REC) | 8.99 | 8.56 | 15.11 | 13.44 | 12.32 | 9.64 | 11.34 | 32.02
M2 (JM2M + DAE) | 10.77 | 12.83 | 21.49 | 15.39 | 18.57 | 11.44 | 15.08 | 33.88
M2 (JM2M + DAE + SEL) | 16.37 | 17.96 | 27.58 | 20.70 | 24.35 | 16.21 | 20.53 | 33.93
M2 (JM2M + DAE + TAB) | 17.54 | 19.46 | 28.76 | 22.77 | 25.30 | 18.05 | 21.98 | 33.42
  +Back-translation | 19.03 | 21.64 | 30.76 | 24.62 | 27.90 | 19.55 | 23.92 | 34.10
M2 (JM2M + SEL) | 0.48 | 0.35 | 1.34 | 0.31 | 1.20 | 0.46 | 0.69 | 33.91
M2 (JM2M + TAB) | 1.03 | 0.62 | 0.57 | 0.64 | 0.54 | 0.96 | 0.73 | 33.48
M2 (M2M) | 20.99 | 23.18 | 32.93 | 26.61 | 29.70 | 21.19 | 25.77 | 34.57
Table 4. Zero-shot translation results of the incremental addition of the Fr module. For the two methods of incremental training, the highest scores are marked in bold. INIT represents initial training, INCR represents incremental training, and JOINT represents jointly training all language pairs from scratch. P-AVG and Z-AVG represent the average scores of parallel translation and zero-shot translation, respectively.
Training Stage | Pairs | INIT | INCR: Fr | INCR: Fr + TAB | JOINT
INIT | P-AVG | 34.33 | 34.33 | 33.24 | 34.77
INIT | Z-AVG | 24.28 | 24.28 | 22.71 | 24.44
INCR | En-Fr | – | 36.12 | 37.59 | 38.29
INCR | Fr-En | – | 37.74 | 38.04 | 38.15
INCR | P-AVG | – | 36.93 | 37.82 | 38.22
INCR | De-Fr | – | 27.31 | 27.76 | 27.53
INCR | Fr-De | – | 20.92 | 19.61 | 21.24
INCR | Es-Fr | – | 33.97 | 34.99 | 33.99
INCR | Fr-Es | – | 34.84 | 33.52 | 34.93
INCR | Nl-Fr | – | 25.46 | 25.90 | 25.82
INCR | Fr-Nl | – | 22.61 | 21.70 | 23.14
INCR | Z-AVG | – | 27.52 | 27.25 | 27.78
– | train #para (M) | 79.4 | 19.3 | 21.6 | 98.6
– | train time (h) | 9.6 | 3.9 | 5.0 | 12.9
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
