Exploring the Data Efficiency of Cross-Lingual Post-Training in Pretrained Language Models

Abstract: Language model pretraining is an effective method for improving the performance of downstream natural language processing tasks. Even though language modeling is unsupervised and thus collecting data for it is relatively less expensive, it is still a challenging process for languages with limited resources. This results in great technological disparity between high- and low-resource languages for numerous downstream natural language processing tasks. In this paper, we aim to make this technology more accessible by enabling data-efficient training of pretrained language models. This is achieved by formulating language modeling of low-resource languages as a domain adaptation task using transformer-based language models pretrained on corpora of high-resource languages. Our novel cross-lingual post-training approach selectively reuses parameters of the language model trained on a high-resource language and post-trains them while learning language-specific parameters in the low-resource language. We also propose implicit translation layers that can learn linguistic differences between languages at a sequence level. To evaluate our method, we post-train a RoBERTa model pretrained in English and conduct a case study for the Korean language. Quantitative results from intrinsic and extrinsic evaluations show that our method outperforms several massively multilingual and monolingual pretrained language models in most settings and improves the data efficiency by a factor of up to 32 compared to monolingual training.


Introduction
Bidirectional Encoder Representations from Transformers (BERT) [1] is a Transformer network [2] pretrained with a language modeling objective and a vast amount of raw text. BERT was able to obtain state-of-the-art performance in many challenging natural language understanding tasks by a sizable margin. Thus, BERTology has become one of the most influential and active research areas in Natural Language Processing (NLP). This led to the development of many improved architectures and training methodologies for Pretrained Language Models (PLMs), such as RoBERTa [3], ALBERT [4], BART [5], ELECTRA [6], and GPT [7,8], improving various NLP systems and even achieving superhuman performance [1,9,10].
The language modeling objective can be optimized via unsupervised training, requiring only a raw corpus without costly annotation. However, among the more than 7000 languages spoken worldwide [11], only a handful have raw corpora large enough for training. As Wikipedia is actively updated and uses mostly formal language, it serves as a reasonable resource for obtaining a raw corpus and is thus useful for measuring language resource availability. Figure 1 shows the number of documents per language in Wikipedia as of September 2020. These data were collected from 303 languages with at least 100 documents, which is only a fraction of all existing languages. Still, as can be seen in Figure 1a, the number of documents generally follows a power-law distribution in which the majority of documents come from a handful of languages. Additionally, while there are more than 6 million documents in the most used language, 154 of the 303 languages have fewer than 10,000 documents. Figure 1b visualizes the share of documents belonging to the top 10 languages by document count. More than half of all documents are from these 10 languages, which also indicates that language resource availability remains highly imbalanced. To tackle this problem, transfer learning via multilingual PLMs has been proposed [1,12,13]. In the multilingual PLM literature, transfer learning mostly focuses on zero-shot transfer, which assumes that there is no labeled data available in the target language. This is, however, unrealistic, as in real-world scenarios labeled data are in fact available in many cases [14,15]. Furthermore, zero-shot scenarios force the model to maintain its performance in the source language, which might prevent the model from fully adapting to the target language. Multilingual PLMs are also much more costly to train due to increased data and model size [13].
The most practical solution would be to use a PLM trained in the source language and fully adapt it to the target language to perform supervised fine-tuning in that language. However, this approach is largely unexplored.
To overcome such limitations, we propose cross-lingual post-training (XPT), which formulates language modeling as a pretraining and post-training problem. Starting from a monolingual PLM trained in a high-resource language, we fully adapt it to a low-resource language via unsupervised post-training, and the adapted model is then fine-tuned in the target language. To aid in adaptation, we introduce Implicit Translation Layers (ITLs), which aim to learn linguistic differences between the two languages. To evaluate our proposed method, we conduct a case study for Korean, using English as the source language. We limit the target language to Korean for two reasons. First, Korean is a language isolate [16-18] that was shown to be challenging to transfer to from English [19]. Second, by evaluating on a single language, we hold linguistic characteristics fixed as a control variable. This lets us focus on data efficiency, the primary metric that we aim to measure.
Evaluating our method on four challenging Korean natural language understanding tasks, we find that cross-lingual post-training is extremely effective at increasing data efficiency. Compared to training a language model from scratch, data efficiency improvement of up to 32 times was observed. Further, XPT outperformed or matched the performance of publicly available massively multilingual and monolingual PLMs while utilizing only a fraction of the data used to train those models.

Related Work
Pretraining a neural network with some variant of the language modeling objective has become a popular trend in NLP due to its effectiveness. While not the first, BERT [1] has arguably become the most successful in generating contextualized representations, leading to a new research field termed BERTology with hundreds of publications [20]. However, the success is largely centered around English and a few other high-resource languages [21], limiting the use of this technology in most of the world's languages [14].
To overcome this limit, there has been a focus on bringing these advancements to more languages by learning multilingual representations. In the case of token-level representations such as word2vec [22,23], this was achieved by aligning monolingual representations [24-26] or jointly training on multilingual data [27,28]. Aligning monolingual embeddings was also attempted in contextualized representations [29,30], but the most successful results were obtained from joint training. Initially, these joint models were trained using explicit supervision from sentence-aligned data [31], but later it was discovered that merely training with a language modeling objective on a concatenation of raw corpora from multiple languages can yield multilingual representations [12,32]. This approach was later extended by incorporating more pretraining tasks [33,34] and even learning a hundred languages using a single model [13]. While these massively multilingual language models are effective at increasing the sample efficiency in low-resource languages, they are prohibitively expensive to train since the training cost increases linearly with the size of the data in use. Further, learning from many languages requires the model to have higher capacity [13]. This leads to difficulties when trying to adapt this method to more efficient and capable architectures or to deploy it on devices with limited computing resources.
The fact that mBERT [35] and XLM-R [13] learn multilingual representations without any explicit supervision has led to more research investigating their zero- and few-shot performance on various tasks. In [36], the authors concluded that the overlap in subword vocabulary between different languages plays an important role in acquiring multilinguality. On the other hand, it has also been reported that these models generalize even to languages written in different scripts, which have no such overlap [37], and to settings where the overlap is intentionally removed [38]. Lauscher et al. [19] demonstrated that the performance of these models can be improved in a few-shot scenario with as few as 10 annotated sentences. UDapter [39] and MAD-X [40] improve data efficiency even further by limiting the parameter update to a small set of Adapter modules [41-43]. However, despite their strong zero- and few-shot performance, all approaches in this category inherit the limitations of massively multilingual language models.
Training a bilingual language model by transferring from a monolingual one is a much cheaper alternative to multilingual pretraining, as most publications regarding pretrained Transformers publish a trained model checkpoint as well. Despite this advantage, it is a less explored approach. In [29], monolingual ELMo embeddings are aligned to a common space to perform zero-shot dependency parsing. MonoTrans [44] transfers English BERT to other languages by learning a new token embedding from scratch for each target language. Transformer encoder layers are frozen while learning the embeddings to prevent catastrophic forgetting. RAMEN [45] takes a similar approach, but initializes each target language embedding as a linear combination of English embeddings. These approaches are the closest to ours, but they are limited in the sense that the model is forced to maintain its ability in the source language, which restricts its adaptation to the target language.

Proposed Method
In this section, we describe our proposed XPT in detail. The overall process is illustrated in Figure 2a, alongside illustrations of the multilingual pretraining and monolingual transfer learning approaches.

Transfer Learning as Post-Training
Our proposed method aims at transferring the knowledge from a high-resource language $L_S$ to a low-resource language $L_T$. Transfer learning in the context of multilingual PLMs has mostly revolved around zero-shot learning, and the small number of existing few-shot and supervised learning approaches limit the performance in $L_T$ by forcing the model to maintain its ability in $L_S$. This is based on the assumption that there are no or few labeled examples in $L_T$ while unlabeled data is abundant. However, it has been suggested that this assumption is neither realistic nor practical [14,15].
Instead of this limited transfer learning approach, we assume that both unlabeled and labeled data are available in $L_T$ and formulate this as a pretraining and post-training problem. Post-training refers to the process of performing additional unsupervised training on a PLM such as BERT using unlabeled domain-specific data, prior to fine-tuning. It has been shown that this leads to improved performance by helping the PLM to adapt to the target domain [46-49]. We start with a monolingual PLM in $L_S$ and completely adapt it to $L_T$.
Another key advantage of this approach is that it makes it possible to completely skip the training in $L_S$. This is because most recent publications in the PLM literature make the trained model checkpoint publicly available, and the model architecture and training objectives in $L_S$ are inherited by $L_T$ when post-training. Besides the cost savings, this also enables the use of this method when the pretraining data in $L_S$ is not available (e.g., because of privacy concerns or licensing).

Selecting Parameters to Transfer
A language model consumes a word sequence and emits a contextualized vector representation of it. Then these vectors can be used to assign some probability to a word or to perform some tasks such as sequence classification.
More formally, assume we have an input as a sequence of tokens $T = [t_1, t_2, \cdots, t_n]$, where $V$ is the vocabulary size. Then, the output of a language model LM is given by where $E$ is the embedding matrix, $\mathrm{Embedding}(\cdot)$ is a lookup function, $\theta_l$ denotes the parameters of the $l$-th encoder layer, $\Theta = [E_{L_S}, \theta_1, \theta_2, \cdots, \theta_C]$, and $C$ is the number of encoder layers in LM. Then, the probability of a token is computed as for the masked language modeling task, and for the next word prediction task. Among the parameters in $\Theta$, some could be helpful in modeling $L_T$ while some could be harmful. The most important part of the modeling process is the contextualization of embedding vectors, performed by the encoder layers. We reuse them in post-training as these layers are known to acquire mostly language-independent knowledge [34,36-38]. On the other hand, embedding vectors project the tokens in a language into the semantic space and thus cannot be directly transferred to another language. It is possible to use them indirectly through bilingual word embedding techniques [26,45], but we randomly initialize the word vectors of $L_T$ for simplicity.
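As a reading aid, the language-model computation and token probabilities described above can be written out as follows. This is a sketch under assumed notation rather than the original formulation: $H_l$ (the hidden states after layer $l$), $h_i$ (the vector at position $i$ of $H_C$), $e_v$ (the row of $E$ for token $v$), and the use of tied input/output token vectors are assumptions, and the numbering of the original equations is not reproduced.

```latex
\begin{align*}
H_0 &= \mathrm{Embedding}(T, E), \\
H_l &= \mathrm{Encoder}(H_{l-1}; \theta_l), \qquad l = 1, \dots, C, \\
\mathrm{LM}(T; \Theta) &= H_C, \\
p(t_i = v \mid T_{\setminus i}) &= \frac{\exp(h_i^\top e_v)}{\sum_{v'=1}^{V} \exp(h_i^\top e_{v'})}
  && \text{(masked language modeling)}, \\
p(t_{i+1} = v \mid t_1, \dots, t_i) &= \frac{\exp(h_i^\top e_v)}{\sum_{v'=1}^{V} \exp(h_i^\top e_{v'})}
  && \text{(next word prediction)}.
\end{align*}
```

Written this way, the only language-specific quantities are the token vectors $e_v$, which is consistent with the choice to re-learn the embeddings while reusing the encoder parameters $\theta_1, \dots, \theta_C$.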
Modern transformer-based architectures have some additional parameters such as positional embeddings and the language modeling head [1,3]. We also reuse them in $L_T$, as they are not language-dependent and were shown to improve performance in our preliminary experiments.
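The parameter selection described above can be illustrated with a minimal sketch. It is not the authors' implementation: the parameter names (`token_embedding`, `positional_embedding`, `encoder.0.weight`, `lm_head.bias`), the nested-list stand-in for tensors, and the initialization standard deviation of 0.02 are all assumptions made for illustration.

```python
import copy
import random


def init_target_params(source_params, target_vocab_size, seed=0, std=0.02):
    """Build initial target-language parameters from a source checkpoint.

    source_params maps (hypothetical) parameter names to nested lists of
    floats, a toy stand-in for tensors. Every entry except the token
    embedding is copied; the token embedding is drawn at random with one
    row per target-language token.
    """
    rng = random.Random(seed)
    dim = len(source_params["token_embedding"][0])
    target = {}
    for name, value in source_params.items():
        if name == "token_embedding":
            # Language-specific: cannot be transferred directly.
            target[name] = [
                [rng.gauss(0.0, std) for _ in range(dim)]
                for _ in range(target_vocab_size)
            ]
        else:
            # Encoder layers, positional embeddings, LM head: reused as-is.
            target[name] = copy.deepcopy(value)
    return target


# Toy checkpoint with a 3-token source vocabulary and 2-dimensional vectors.
source = {
    "token_embedding": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    "positional_embedding": [[0.0, 1.0], [1.0, 0.0]],
    "encoder.0.weight": [[1.0, 0.0], [0.0, 1.0]],
    "lm_head.bias": [0.0, 0.0],
}
target = init_target_params(source, target_vocab_size=5)
print(len(target["token_embedding"]))  # 5 rows, one per target-language token
```

The copied entries are deep copies, so later updates to the target model leave the source checkpoint untouched.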

Implicit Translation Layer
Reusing the encoder layers trained in the source language and only learning the word embeddings in the target language can be seen as finding a token-level mapping between the two languages. However, this is suboptimal for two reasons. First, such a mapping is likely impossible because of ambiguities such as homographs. Second, linguistic differences beyond the token level, such as word order, cannot be learned with this method.
To overcome these shortcomings, we propose the Implicit Translation Layer (ITL) to find this mapping at a sequence level. The ITL takes a sequence of vectors as input and outputs contextualized vectors of equal length. To maximize compatibility, we use the same architecture as the encoder layer. Two ITLs are added to the language model, one before the first encoder layer (input-to-encoder) and another after the last encoder layer (encoder-to-output). These ITLs can be seen as implicitly translating from $L_T$ to $L_S$ and from $L_S$ to $L_T$, respectively. With the addition of ITLs, the computation by an LM in $L_T$ becomes where $\Theta = [E_{L_T}, \theta_{\mathrm{ITL_{in}}}, \theta_{\mathrm{ITL_{out}}}, \theta_1, \theta_2, \cdots, \theta_C]$. $H_C$ is computed using Equation (3), and the token vectors in Equations (5) and (6) are also replaced with the respective vectors from $E_{L_T}$. This configuration allows a more flexible mapping compared to modules that operate on a token level, such as Adapters. This is a great advantage, as multilingual contextualized representations are known to be highly sensitive to word order [37].
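The forward computation with ITLs described above can be sketched as follows. The intermediate symbols $H_0$, $H_{\mathrm{in}}$, and $H_l$ are assumed notation introduced here for illustration, and the exact form of the original equation may differ.

```latex
\begin{align*}
H_0 &= \mathrm{Embedding}(T, E_{L_T}), \\
H_{\mathrm{in}} &= \mathrm{Encoder}(H_0; \theta_{\mathrm{ITL_{in}}}), \\
H_1 &= \mathrm{Encoder}(H_{\mathrm{in}}; \theta_1), \qquad
H_l = \mathrm{Encoder}(H_{l-1}; \theta_l), \quad l = 2, \dots, C, \\
\mathrm{LM}(T; \Theta) &= \mathrm{Encoder}(H_C; \theta_{\mathrm{ITL_{out}}}).
\end{align*}
```

Because both ITLs share the encoder-layer architecture, each position of their output can attend to every position of their input, which is what lets them model sequence-level differences such as word order.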

Two-Phase Post-Training
The parameters $E_{L_T}$, $\theta_{\mathrm{ITL_{in}}}$, and $\theta_{\mathrm{ITL_{out}}}$ are randomly initialized and learned during the post-training phase. The noise introduced by this randomness can negatively impact the tuned parameters from $L_S$. To prevent this, we split the post-training into two phases, similar to gradual unfreezing [50,51]. In the first phase, the parameters copied from the $L_S$ model are frozen, and only the $L_T$ embeddings and ITLs are learned using the training examples in $L_T$. This is analogous to the method used for zero-shot transfer learning in the multilingual PLM literature, where only the parameters responsible for learning $L_T$ are updated.
Phase two of our proposed method proceeds further and completely adapts the language model to $L_T$. This is achieved by unfreezing the parameters from $L_S$ and fine-tuning the entire model using data in $L_T$. Here, it is assumed that the randomness and the resulting noise are minimized in the first phase. Each phase inherits the training objective from $L_S$ training, and the model is optimized until convergence using the unlabeled data in $L_T$.
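The two-phase schedule can be summarized by which parameters receive updates in each phase. The following is a minimal sketch, assuming hypothetical parameter-name prefixes (`token_embedding`, `itl_in`, `itl_out`) for the newly initialized parameters; the actual names depend on the implementation.

```python
# Hypothetical prefixes of the randomly initialized, target-language parameters.
NEW_PARAM_PREFIXES = ("token_embedding", "itl_in", "itl_out")


def trainable_names(param_names, phase):
    """Return the parameter names that are updated in the given phase.

    Phase 1: only the new target-language embeddings and the two ITLs.
    Phase 2: every parameter, including those copied from the source model.
    """
    if phase == 1:
        return [n for n in param_names if n.startswith(NEW_PARAM_PREFIXES)]
    if phase == 2:
        return list(param_names)
    raise ValueError("phase must be 1 or 2")


names = [
    "token_embedding.weight",
    "positional_embedding.weight",
    "itl_in.attention.weight",
    "encoder.0.attention.weight",
    "itl_out.attention.weight",
    "lm_head.bias",
]
print(trainable_names(names, phase=1))
print(trainable_names(names, phase=2))
```

In PyTorch, such a schedule is typically applied by setting `requires_grad` to `False` on the frozen parameters during the first phase and back to `True` before the second.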

Overview
We conduct a case study for Korean, transferring from English RoBERTa [3] as the PLM in $L_S$. This model has almost the same structure as BERT, but incorporates improved training techniques such as dynamic masking, longer training, and larger batch size. The BASE configuration is used, which has 768-dimensional word embeddings and 12 encoder layers. Applying our proposed XPT adds the two ITLs, resulting in 14 encoder layers in total. To understand the effect of two-phase training, we also train a variation of XPT, termed XPT-SP, where the entire model is post-trained in a single phase without freezing any parameters. We skip the pretraining in $L_S$ and use the model checkpoint released by the authors instead [52].

Baselines
For intrinsic evaluations, we compare our proposed method to the following two baseline models.
Scratch: This model does not utilize the knowledge from $L_S$ and is trained from scratch using data from $L_T$. To match the number of parameters with our proposed method, two additional encoder layers are added to this model.
Adapters: Instead of using ITLs, Adapter modules are added to the encoder layers as in [43]. This setting is similar to MonoTrans [44], except that the entire model is post-trained without freezing to maximize the performance in $L_T$.
In addition to the aforementioned baselines, we consider the following models as baselines for the extrinsic evaluations.
mBERT: The massively multilingual version of BERT, trained on the Wikipedia dump for the top 100 languages with the largest number of documents.
XLM-R: Another massively multilingual language model, which is based on RoBERTa and trained on the cleaned Common Crawl dataset [53] with 295 billion tokens covering 100 languages.
KoBERT: A publicly available monolingual BERT trained from scratch using Korean corpora [54]. The training corpora consist of a Korean Wikipedia dump with 54 million tokens and a Korean news dataset with 270 million tokens.

Dataset
We use Korean Wikipedia (Wiki-ko) for post-training in $L_T$. The Wikipedia dump from September 2020 was downloaded and extracted using the WikiExtractor [55] tool. This raw text is split into sentences and tokenized using SentencePiece [56] with a vocabulary size of 50 K tokens. This resulted in 4.19 M sentences with 61 M words before tokenization. The dataset is split into 4 M/100 K/88 K sentences to be used as train/valid/test splits, respectively.
For extrinsic evaluations, we use the following four tasks to quantitatively compare different models. Detailed statistics of each dataset are summarized in Table 1.
PAWS-X: We use the Korean portion of the PAWS-X [57] dataset. The goal is to identify whether the sentences in a given pair are paraphrases of each other. We report the classification accuracy (%) for this dataset.
KorSTS: The Semantic Textual Similarity (STS) [58] dataset aims at assessing the semantic similarity between a pair of sentences. Each pair is assigned a score ranging from 0 to 5, and the model's performance is measured using Spearman's rank correlation coefficient against the gold labels. The KorSTS [59]
KQP: The Korean Question Pairs (KQP) [60] dataset is analogous to the Quora Question Pairs (QQP) [61] dataset, in which the model needs to identify whether two given questions convey the same meaning. KQP consists of 7.6 K human-annotated question pairs. Accuracy (%) is used as the evaluation metric.
KHS: The Korean Hate Speech (KHS) [62] dataset is a collection of 8367 news comments, each labeled as one of "hate", "offensive", or "none" by human annotators. As gold labels for the test split are unavailable, we report the best F1 score on the validation split for this task.
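For reference, the Spearman's rank correlation coefficient used for KorSTS can be computed as the Pearson correlation of the ranks of the two score lists. The sketch below is a plain-Python illustration with average ranks for ties; the function names and the toy scores are assumptions for this example, not part of the evaluation code used in the paper.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy gold similarity scores (0 to 5) and model predictions.
gold = [0.0, 1.0, 2.0, 3.0]
predicted = [0.1, 0.4, 0.3, 0.9]
print(spearman(gold, predicted))
```

Only the ordering of the predictions matters for this metric, so a model need not reproduce the 0 to 5 scale exactly. The function is undefined when either list is constant, since the rank variance is then zero.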

Implementation Details
We implement our proposed method and the baselines using PyTorch [63], Fairseq [64], and Hugging Face Transformers [65]. All models are trained until convergence with early stopping. We mostly use the suggested hyperparameters from [3] with a few exceptions, which are summarized in Tables A1 and A2.

Results and Discussion
In this section, we summarize the key findings from quantitative evaluations, both intrinsic and extrinsic. For the intrinsic evaluations and the experiments testing the data efficiency (i.e., Figures 3 and 4, and Table 2), results for mBERT, XLM-R, and KoBERT are not available, because these models are not trained from scratch here and their publicly available checkpoints are used instead.

Intrinsic Evaluation
To quantitatively measure the learning process of each model, we first perform intrinsic evaluations. While performing better on intrinsic tests does not guarantee the model to be better at downstream tasks as well, they are good estimates and relatively cheap to evaluate. In this study, perplexity, hits@k, mean rank, and mean reciprocal rank are calculated on the masked tokens and used as intrinsic metrics.
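The four intrinsic metrics can be computed from two quantities per masked position: the rank of the gold token in the model's predicted distribution and the log-probability assigned to it. The following sketch assumes natural-log probabilities and uses hypothetical cutoffs `ks=(1, 5, 10)`; the cutoffs used in the experiments are not specified in this section.

```python
import math


def intrinsic_metrics(ranks, log_probs, ks=(1, 5, 10)):
    """Compute masked-token metrics.

    ranks: 1-based rank of the gold token at each masked position.
    log_probs: natural-log probability assigned to the gold token.
    """
    n = len(ranks)
    metrics = {
        # Perplexity is the exponential of the mean negative log-likelihood.
        "perplexity": math.exp(-sum(log_probs) / n),
        "mean_rank": sum(ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
    }
    for k in ks:
        # Fraction of positions where the gold token is within the top k.
        metrics[f"hits@{k}"] = sum(r <= k for r in ranks) / n
    return metrics


# Toy example with three masked positions.
example = intrinsic_metrics(
    ranks=[1, 2, 4],
    log_probs=[math.log(0.5), math.log(0.5), math.log(0.5)],
)
print(example)
```

Lower is better for perplexity and mean rank, while higher is better for hits@k and mean reciprocal rank.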
The intrinsic performance of each model after using all available 4 M training examples is summarized in Table 2. It can be seen that XPT performs the best, followed by Adapters and Scratch. Further, the improvement from Adapters to XPT is as large as the improvement from Scratch to Adapters, indicating that our proposed XPT compares favorably with Adapters. To understand the data efficiency during the post-training process, we exponentially vary the number of training examples from 100 K to 3200 K. The result for each metric is plotted in Figure 3. As can be seen, the performance increases linearly as the dataset size increases exponentially. This shows how data-hungry these language models are, demonstrating the importance of increasing the data efficiency. Adapters and XPT perform comparably, with XPT performing slightly better when there are more than 400 K training examples and slightly worse with fewer than 400 K examples.
On the other hand, we find that transferring from English (i.e., Adapters and XPT) is roughly as effective as doubling the number of training examples in the Scratch setting. Further, the difference between transferred and non-transferred settings is most pronounced when the amount of data is minimal, suggesting that low-resource languages are likely to benefit the most from XPT. The common underlying hypothesis in the multilingual language modeling literature is that simultaneously learning from multiple languages is the key to improving performance in low-resource languages by learning polyglot representations [12,13,36-38]. However, our results suggest that some part of the knowledge acquired by monolingual models is language-agnostic and thus can be effectively transferred to other languages. Based on this, we argue that more emphasis should be put on transferring monolingual representations [44,45], as this is more sustainable than multilingual training.

Extrinsic Evaluation
The results after using all 4 M post-training examples are summarized in Table 3, alongside the results from fine-tuning publicly available models. All models are trained 10 times with different random seeds, and we report the mean and standard deviation. It can be seen that the proposed XPT outperforms Adapters and Scratch by a large margin across all tasks, demonstrating its effectiveness given the same amount of training data. Further, it also outperforms mBERT in all tasks and XLM-R in three out of four tasks. The fact that XPT is post-trained with only a fraction of the data used to train these massively multilingual language models makes this result more encouraging. When compared to KoBERT, a monolingual model trained from scratch with approximately 5.3 times more training examples, XPT still performs better in all tasks except KQP, with a 22.30% relative error reduction in KorSTS. Interestingly, the Scratch model outperforms all models in the KHS dataset. We believe that this is caused by a domain mismatch between the pre-/post-training data and the fine-tuning data, as the KHS dataset is collected from social media.
Similar to the intrinsic evaluations, we also experiment with varying the number of post-training sentences in the downstream evaluations and plot the results in Figure 4. From Figure 4a,b, it can be seen that the Scratch setting shows minimal gains when the data size increases from 100 K sentences to 800 K sentences. On the other hand, the models transferred from English show consistent improvements with more data. This shows that without any prior knowledge, these tasks cannot benefit from pretraining until a certain amount of data is available. We find a similar yet less pronounced trend in the KQP task as well. However, the results from the KHS task are different from those of the other three tasks, with all models performing comparably across all dataset sizes. Again, this is likely caused by the domain shift.
Investigating the change in data efficiency, it can be seen that to reach the performance of transferred models post-trained with 100 K examples, the Scratch model requires approximately 32, 16, and 16 times more pre-training data in the PAWS-X, KorSTS, and KQP tasks, respectively. Between the transferred models, XPT consistently outperforms Adapters in PAWS-X, KorSTS, and KHS, while performing on par in KQP. This indicates that regardless of the amount of available post-training data, XPT can be expected to perform better than Adapters.
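One way to read these factors is as the ratio between the amount of data the Scratch model needs to reach a given score and the 100 K examples used by the transferred model. The sketch below illustrates this calculation; the function name is hypothetical and the scores are placeholder values chosen for illustration, not results from the paper.

```python
def data_efficiency_factor(scratch_curve, target_score, reference_size):
    """Ratio of the smallest Scratch data size reaching target_score
    to reference_size.

    scratch_curve: dict mapping number of pretraining sentences to score.
    Returns None if the target score is never reached.
    """
    for size in sorted(scratch_curve):
        if scratch_curve[size] >= target_score:
            return size / reference_size
    return None


# Data sizes doubled from 100 K to 3200 K sentences.
sizes = [100_000 * 2 ** i for i in range(6)]

# Placeholder Scratch scores at each size (illustrative only).
scratch_scores = dict(zip(sizes, [60.0, 61.0, 62.0, 64.0, 68.0, 73.0]))

# Placeholder score of a transferred model post-trained on 100 K sentences.
transferred_at_100k = 72.0

factor = data_efficiency_factor(scratch_scores, transferred_at_100k, 100_000)
print(factor)
```

Because the data sizes are doubled at each step, the factor obtained this way is always a power of two, and 32 is the largest value that can be observed when the sizes range from 100 K to 3200 K.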
Existing cross-lingual language model pretraining approaches force the model to maintain its multilinguality [13,40,44,45]. However, in a realistic and practical scenario, the goal is often to maximize the performance on a single language of interest. Our results demonstrate that under this scenario, XPT, and more generally completely adapting the model to a single language, is superior to polyglot models.

Effect of Two-Phase Training
To understand the effect of two-phase training, we trained XPT-SP, where the entire model is post-trained without freezing any parameters. The intrinsic results and downstream results are shown in Tables 2 and 3, respectively. Overall, XPT-SP outperforms Adapters in all cases, but performs suboptimally compared to two-phase training. This demonstrates that ITLs are better than Adapter modules at learning linguistic differences. Incorporating two-phase training into Adapters could bring some improvements. However, based on the fact that XPT-SP performs better than Adapters, we expect this variant to be less effective than XPT.

Conclusions and Future Directions
While being highly effective across a wide range of NLP tasks, pretrained Transformers are extremely data-hungry. For the majority of the over 7000 languages spoken worldwide, it is difficult to secure sufficient data for training such models. In this paper, we tackled this problem by proposing an approach for data-efficient training of pretrained language models in a low-resource language. Our approach, termed XPT, achieves this goal by post-training a PLM from another high-resource language. Language-agnostic parameters of a model trained in the high-resource language are selectively reused and tuned while learning the language-specific parameters in the target language. We also proposed ITL, which is designed to learn linguistic differences between the two languages at a sequence level instead of a token level.
To evaluate our method in a challenging and controlled scenario, we conducted a case study for Korean by post-training English RoBERTa with a varying amount of post-training examples. Intrinsic results have shown that post-training an English model in $L_T$ is roughly as effective as using twice as much data in $L_T$ and training from scratch. Further, downstream evaluations on four natural language understanding tasks demonstrated that our approach can improve the data efficiency by a factor of up to 32. When compared to monolingual and massively multilingual PLMs trained with several orders of magnitude more data, XPT still outperformed or matched the performance. This suggests that completely adapting a model to a single language of interest is more effective and efficient, and more focus should be put on this direction of research.
As for future directions, experimenting with other target languages with different resource availability and linguistic characteristics is an important step. Building a systematic approach for selecting the source language depending on the target language is also a promising research direction.

Conflicts of Interest:
The authors declare no conflicts of interest.