Electronics
  • Editor’s Choice
  • Article
  • Open Access

18 February 2023

Distilling Monolingual Models from Large Multilingual Transformers

LT3, Language and Translation Technology Team, Department of Translation, Interpreting and Communication, Ghent University, Groot-Brittanniëlaan 45, 9000 Ghent, Belgium
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue AI for Text Understanding

Abstract

Although language modeling has been advancing steadily, the models available for low-resourced languages are often limited to large multilingual models such as mBERT and XLM-RoBERTa, which come with significant overheads for deployment with regard to their model size, inference speed, etc. We attempt to tackle this problem by proposing a novel methodology that applies knowledge distillation techniques to filter language-specific information from a large multilingual model into a small, fast monolingual model that can often outperform the teacher model. We demonstrate the viability of this methodology on two downstream tasks each for six languages. We further explore possible modifications to the basic setup for low-resourced languages by investigating ways to tune the final vocabulary of the distilled models. Lastly, we perform a detailed ablation study to better understand the different components of the setup and find out what works best for the two under-resourced languages, Swahili and Slovene.

1. Introduction

The advent of extremely large language models (LLMs) in the past decade has pushed Natural Language Processing (NLP) for under-resourced languages beyond all expectations. While the building and training of these LLMs has been an impetus for low-resource NLP, the deployability and sustainability of these technologies for real-world use cases is an often-ignored secondary aspect. Even though multilingual models such as mBERT [1] and XLM-R [2] excel at low-resource and multilingual NLP, they often fail on this second aspect because they are extremely large models with vocabularies covering hundreds of languages, which may not be necessary for the deployment of a model for a single low-resourced language. Unlike high-resourced languages, under-resourced languages often lack a single monolingual language model, such as CamemBERT [3] for French or RobBERT [4] for Dutch, thus making large jointly trained multilingual models a necessary evil. While one can argue that mBERT and XLM are still deployment-friendly in some ways, the trend toward an exponential rise in parameters will soon make it impossible to deploy research-grade released models, as exemplified by the mT5-XXL (13 billion parameters) [5] and the Turing ULR (4.6 billion parameters) [6] series of models, which are currently state of the art on the XTREME [7] data set—a comprehensive benchmark for cross-lingual transfer learning across a large variety of NLP tasks and languages.
While there have been significant strides forward in reducing model footprints, inference times, and training times with methodologies such as Distillation, Quantization, and Pruning, these methodologies are usually applied either in a general direction, i.e., reducing a multilingual model as a whole, or in a task-specific setting, i.e., creating a smaller model specialized for a particular task. In this work, we explore the consequences of applying the ideas behind knowledge distillation to large pre-trained multilingual models in order to filter knowledge specific to a target language into a new, smaller, and faster student language model which performs on par with, and in some cases even outperforms, the teacher. The main contribution of this paper is to dive deep into standard knowledge distillation practices and explore optimal strategies to distill individual target languages from a large multilingual model.
The first objective of the proposed research is to explore the standard knowledge distillation setup designed for generic full-model distillation for two widely used multilingual models, i.e., multilingual-BERT (mBERT) and XLM-RoBERTa (XLM-R). It is important to note that we attempt to keep only the information for a single target language in the student. We build upon the pilot experiments for Eliquare, first proposed in Singh and Lefever [8], and perform all experiments on a set of six carefully selected languages accounting for as much variation as possible with regard to their typologies, language families, and available resources. We consider Dutch and French representative of high-resourced languages, Hindi and Hebrew of moderately resourced languages, and Swahili and Slovene of low-resourced languages. For each language, we evaluate the obtained distilled students on a set of two downstream tasks: a syntactic word-level task such as Part-of-Speech Tagging and a semantic sentence-level task such as Sentiment Analysis.
A second, and perhaps more vital, objective of this research is to propose ideas that specifically benefit the construction of students for low-resourced languages, i.e., Swahili and Slovene in our case. We attempt to do this in two stages. Firstly, we explore the principles behind altering the vocabularies of the final student to better suit the low-resource setting. While joint models have large combined vocabularies which assist in multilingual aspects, a distilled student model only requires the vocabulary of a single target language. While the high-resourced languages used in our work (Dutch and French) have enough sub-words in the multilingual vocabulary to adequately represent the language space, the middle- and under-resourced languages have an extremely poor representation. In mBERT, for example, a medium-resourced language such as Hebrew has 2483 sub-words in the vocabulary, accounting for approximately 2% of the whole vocabulary, while Thai only has 370 sub-words, amounting to around 0.3% of the vocabulary. We therefore explore techniques to reduce the vocabulary sizes both pre- and post-distillation while keeping the performance consistent across all benchmarks. Secondly, we perform a detailed ablation study to explore which components and hyper-parameters specifically impact the performance of the distilled student in the low-resource setting. We dive deeper into the two most vital components of the distillation framework: the loss function and the softmax temperature.
The remainder of this paper is structured as follows. Section 2 describes relevant related research on knowledge distillation and on the shortcomings of large multilingual language models for low-resourced settings, and Section 3 discusses the fundamental principles of classic knowledge distillation and builds from the DistilBERT [9] setup towards a language-specific distillation setting. Section 4 discusses the experimental setup and results of the basic setups and demonstrates the viability of the proposed Eliquare methodology. Section 5 and Section 6 venture further into advanced modifications of the distillation setup to suit a low-resource language setting: Section 5 discusses altering the vocabularies of the multilingual models to only accommodate a low-resourced language, while also speeding up the distillation process, and Section 6 discusses the impact of some of the key hyper-parameters on the student models for Slovene and Swahili. Section 7 concludes this paper by summarizing our findings and suggesting ideas for future research.

3. Language Distillation Setup

We begin the system description by explaining the fundamental principles behind a distillation setup in more detail. While there are exceptions that forgo the standard logit-based setup and instead use ideas such as mutual information [28] or graph-based methods [29], most distillation methodologies share a few common principles at their core. Distillation, as previously described, can simply be thought of as the task of finding the approximation
$$ f_s(x) \approx f_t(x) $$
where $f_s(x)$ is the student model's final output for the training data $x$, and $f_t(x)$ is the teacher model's final output on the same data. There are three broad variables in a distillation setup. The first is the data used for distillation, which determines the type of knowledge being distilled. For instance, to distill a specialized model for Natural Language Inference (NLI) (NLI is a sentence-pair task that, given a premise, evaluates whether a hypothesis is an entailment, a contradiction, or unrelated to the premise), only information vital for NLI needs to be filtered from the teacher. This can be done by imitating the teacher's knowledge on an NLI dataset, implying that any stored information not relevant to NLI can be forgotten. A few approaches experiment with augmenting data to boost the learning towards a target task, but this is usually useful in task-specific settings where labeled data is a requirement for distillation. The second variable is the loss function, which essentially determines how we choose to compare the student to the teacher during the learning stage. Given a loss metric $L(x,y)$ and a teacher and student prediction on a sample $i$ represented by $f_t(i)$ and $f_s(i)$, a minimization objective over a dataset of size $N$ can be defined as
$$ \min \sum_{i=0}^{N} L(f_s(i), f_t(i)) $$
The third and final variable is how the student model is set up, primarily its architecture and initialization. While most approaches work with an architecture identical to the teacher but with a smaller number of layers, there has been work that adopts simpler architectures for the student than for the teacher. A number of initialization strategies have also been explored, since a better initialization can heavily impact the distillation outcome, as shown by Turc et al. [30].
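To make these three variables concrete, the following minimal sketch, which is not the authors' implementation, shows a single generic distillation step in which a frozen teacher provides the targets that the student learns to approximate; the checkpoint names and the soft cross-entropy formulation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Illustrative choice of teacher/student checkpoints (both share mBERT's vocabulary)
teacher = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased").eval()
student = AutoModelForMaskedLM.from_pretrained("distilbert-base-multilingual-cased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
optimizer = torch.optim.Adam(student.parameters(), lr=5e-4)

def distillation_step(batch_texts):
    enc = tokenizer(batch_texts, return_tensors="pt", padding=True, truncation=True)
    inputs = {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
    with torch.no_grad():                          # the teacher f_t is never updated
        t_logits = teacher(**inputs).logits
    s_logits = student(**inputs).logits
    # L(f_s(i), f_t(i)): soft cross-entropy between student and teacher distributions
    loss = -(F.softmax(t_logits, dim=-1) * F.log_softmax(s_logits, dim=-1)).sum(-1).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```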
Regarding the first variable, i.e., the distillation data, our goal is to distill knowledge relevant to a single target language, which is why we use the entire latest Wikipedia dump for that target language. The minimization objective for a given target language $t$ can then simply be modified as
$$ \min \sum_{i \,:\, i \in N_t}^{N} L(f_s(i), f_t(i)) $$
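As an illustration of this first variable, the distillation corpus can be restricted to one target language; the following short sketch assumes the Hugging Face wikimedia/wikipedia dataset and its snapshot naming.

```python
from datasets import load_dataset

# Illustrative: the Swahili ("sw") Wikipedia dump; the other target languages are
# loaded the same way by changing the language code (snapshot name is an assumption).
target_lang = "sw"
wiki = load_dataset("wikimedia/wikipedia", f"20231101.{target_lang}", split="train")
# One training sample per non-empty line of article text
sentences = [line for article in wiki["text"] for line in article.split("\n") if line.strip()]
```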
For the second variable, i.e., the distillation objective, Hinton et al. [11] introduced two vital contributions, which have since become fundamental building blocks of most distillation setups. Firstly, the error function $L(x,y)$ is defined as the cross-entropy between the student and teacher logits:
$$ L_{CE}(f_s(i), f_t(i)) = f_t(i) \times \log(f_s(i)) $$
Secondly, Hinton et al. also introduced the concept of softmax temperature. Instead of using logits from the teacher directly in the error function $L_{CE}$, they propose using soft targets, determined by a preset temperature value. Given a temperature value τ and $f(x)_k$ representing the $k$-th output logit given $K$ classes, the soft targets can be generated with
$$ g(x)_k = \frac{\exp(f(x)_k / \tau)}{\sum_{j} \exp(f(x)_j / \tau)}, \quad k = 0, \ldots, K $$
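A minimal sketch of this soft-target computation on dummy logits illustrates the effect of the temperature.

```python
import torch
import torch.nn.functional as F

def soft_targets(logits: torch.Tensor, tau: float = 2.0) -> torch.Tensor:
    """Softmax with temperature tau over the class dimension K."""
    return F.softmax(logits / tau, dim=-1)

# Illustration on dummy logits for K = 5 classes:
logits = torch.tensor([[4.0, 2.0, 1.0, 0.5, 0.0]])
print(soft_targets(logits, tau=1.0))  # unchanged softmax
print(soft_targets(logits, tau=2.0))  # softened (more uniform)
print(soft_targets(logits, tau=0.5))  # hardened (more peaked)
```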
Applying the temperature softens the probability distribution of the logits if τ > 1 or hardens the distribution if τ < 1. Softening the targets can produce stable training that reduces the impact of noisy labels from the teacher model, while hardening can be more useful for faster convergence when distillation data is hard to come by. Sanh et al.'s [9] setup inherits the temperature-based soft targets and uses the cross-entropy between the soft targets as the error function $L(x,y)$. Additional loss functions, $L_{cosine}$ and a standard $L_{mlm}$ for Masked Language Modeling, defined below, are used in addition to $L_{CE}$:
$$ L_{cosine}(f_s(i), f_t(i)) = \sum_{i \in L} \left( 1 - \cos(f_t(i), f_s(i)) \right) $$
$$ L_{mlm}(f_s(i), y(i)) = y(i) \log(f_s(i)) $$
While $L_{cosine}$ is expected to minimize the cosine distance between the soft targets and the student logits, $L_{mlm}$ adds an additional component that learns directly from the data $y(i)$ instead of the teacher outputs. This can serve as a self-correction for those examples where the teacher is not always reliable, while also speeding up training by adding an additional learning signal directly from the ground truth. The three losses are combined with a preset weighted sum,
$$ L = \alpha_{CE} L_{CE} + \alpha_{cosine} L_{cosine} + \alpha_{mlm} L_{mlm} $$
While for the initial setups we inherit the preset weights ($\alpha_{CE} = 5$, $\alpha_{cosine} = 1$, $\alpha_{mlm} = 2$) and softmax temperature (τ = 2.0) from Sanh et al. [9], we discuss the impact of these components further in Section 6.
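The combined objective can be sketched as follows; this follows the DistilBERT-style formulation rather than the authors' exact code, and the hidden-state inputs for the cosine term as well as the mean reductions are assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(s_logits, t_logits, s_hidden, t_hidden, labels,
                      alpha_ce=5.0, alpha_cos=1.0, alpha_mlm=2.0, tau=2.0):
    # L_CE: cross-entropy between temperature-softened teacher and student distributions
    t_soft = F.softmax(t_logits / tau, dim=-1)
    s_log_soft = F.log_softmax(s_logits / tau, dim=-1)
    loss_ce = -(t_soft * s_log_soft).sum(dim=-1).mean()

    # L_cosine: align the directions of the student and teacher hidden states
    target = torch.ones(s_hidden.size(0) * s_hidden.size(1), device=s_hidden.device)
    loss_cos = F.cosine_embedding_loss(
        s_hidden.view(-1, s_hidden.size(-1)),
        t_hidden.view(-1, t_hidden.size(-1)),
        target)

    # L_mlm: standard masked-language-modeling loss against the ground-truth tokens;
    # label -100 marks unmasked positions, which are ignored here
    loss_mlm = F.cross_entropy(s_logits.view(-1, s_logits.size(-1)),
                               labels.view(-1), ignore_index=-100)

    return alpha_ce * loss_ce + alpha_cos * loss_cos + alpha_mlm * loss_mlm
```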
For the third and final variable, i.e., the student model's setup, we use an architecture identical to the teacher's, but with 6 encoder layers instead of the teacher's 12. We attempt two alternate setups by changing the vocabulary of the teacher pre-distillation or of the student directly post-distillation; this is covered further in Section 5, as the initial experiments did not involve any changes to the vocabulary. Another important part of this variable is the initialization of the student. We follow the general approach [9] where the student is initialized from the teacher's layers. The authors explore initializing the student with the first 6 layers or the final 6 layers of the teacher model, but conclude that using alternating layers of the teacher offers the best initialization, i.e., layer $n$ of the student is initialized from layer $2n-1$ of the teacher. We therefore adopt an identical initialization.
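The layer-copying initialization can be sketched as follows, assuming the BERT-style module names used in the Hugging Face transformers library; it is an illustrative sketch rather than the authors' code.

```python
from transformers import BertConfig, BertForMaskedLM

teacher = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# A 6-layer student with an otherwise identical architecture
student_config = BertConfig.from_pretrained("bert-base-multilingual-cased",
                                            num_hidden_layers=6)
student = BertForMaskedLM(student_config)

# Copy the embeddings and the MLM head directly from the teacher
student.bert.embeddings.load_state_dict(teacher.bert.embeddings.state_dict())
student.cls.load_state_dict(teacher.cls.state_dict())

# Student layer n is copied from teacher layer 2n-1 (1-indexed),
# i.e. 0-indexed teacher layers 0, 2, 4, 6, 8, 10
for n in range(6):
    teacher_layer = teacher.bert.encoder.layer[2 * n]
    student.bert.encoder.layer[n].load_state_dict(teacher_layer.state_dict())
```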

4. Experiments

For the experiments, we build upon the pilot experiments discussed in Singh and Lefever [8] using mBERT [1], and additionally experiment with another state-of-the-art multilingual teacher, i.e., XLM-RoBERTa [31]. We name our approach Eliquare, the Latin word for 'distillation, filtering or refining'. For both setups, the student (Eliquare) is initialized from the given teacher (mBERT or XLM-RoBERTa) using the $2n-1$ approach described in Section 3.
We experiment with six target languages for distillation: French, Dutch, Hindi, Hebrew, Slovene, and Swahili. As can be derived from Table 2, these languages have been selected because they are varied in terms of typology (they all belong to different language groups), script, and resources available (expressed in the number of available Wikipedia pages; for reference, English has 57.29 million Wikipedia pages). Based on this latter column, we consider Dutch and French as representative of high-resourced languages, Hindi and Hebrew as moderately-resourced languages and Swahili and Slovene as low-resourced in our experiments and analyses.
Table 2. An overview of the target languages used for distillation, their genus, scripts and available Wikipedia pages (in millions).
The same Wikipedia dumps of these target languages are used as distillation data in order to construct the Eliquare student models with the basic distillation setup. For each language, we obtain the latest Wikipedia XML dump and pre-process it for MLM, with a masking probability of 0.15 and word masking, word replacement, and unchanged word proportions of 0.8, 0.1, and 0.1, respectively. We also employ an MLM smoothing parameter (set to 0.7) to emphasize masking of less frequent words. Next, the pre-processed data is split into two parts for training and validation with a 90:10 split. All students are trained for 10 iterations over this processed data, using a starting learning rate of 5 × 10⁻⁴. As learning from larger batches works better for distillation, we opted for a batch size of 32 (8 per device) and performed gradient accumulation for 50 steps (effective batch size of 32 × 50 = 1600). We use the Adam optimizer with an ε of 1 × 10⁻⁶. The position embeddings in XLM-RoBERTa are frozen to save some computing time. We store the student model after every epoch and use the version with the best distillation loss on the held-out validation set for the evaluation step.
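The masking step can be sketched as follows; this is an illustrative re-implementation rather than the authors' pipeline, and token_counts is assumed to be a dictionary mapping token ids to their corpus frequencies.

```python
import numpy as np

def mask_tokens(token_ids, token_counts, vocab_size, mask_id,
                mlm_prob=0.15, smoothing=0.7, proportions=(0.8, 0.1, 0.1)):
    """Select ~15% of tokens for prediction, favouring rare sub-words (count^-0.7)."""
    token_ids = np.array(token_ids)
    # Smoothed, count-based selection probabilities emphasize less frequent tokens
    weights = np.array([(token_counts.get(int(t), 0) + 1) ** -smoothing for t in token_ids])
    weights = weights / weights.sum()
    n_pred = max(1, int(round(len(token_ids) * mlm_prob)))
    pred_positions = np.random.choice(len(token_ids), size=n_pred, replace=False, p=weights)

    labels = np.full_like(token_ids, -100)           # -100 positions are ignored by the MLM loss
    labels[pred_positions] = token_ids[pred_positions]

    for pos in pred_positions:
        r = np.random.rand()
        if r < proportions[0]:                       # 80%: replace with [MASK]
            token_ids[pos] = mask_id
        elif r < proportions[0] + proportions[1]:    # 10%: replace with a random token
            token_ids[pos] = np.random.randint(vocab_size)
        # remaining 10%: keep the original token unchanged
    return token_ids, labels
```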
For the evaluation step, a logical choice could be to look at perplexity and validation loss. However, these are not the best metrics to assess the overall language understanding of an LLM, since they evaluate the Masked Language Modeling objective rather than general language understanding. Instead, we decided to assess the six monolingual students by fine-tuning them for different language-specific downstream tasks. For each target language, two downstream tasks have been selected, as summarized in Table 3. One task requires higher-level (semantic) sentence understanding (such as Sentiment Analysis or News Classification), while the other is highly syntax-dependent (such as Part-of-Speech Tagging). Please note that for the two under-resourced target languages, Slovene and Swahili, it was not always possible to find available datasets for these tasks. In those cases, we fell back to the task of Named Entity Recognition (NER), which can be perceived as requiring both semantic (which real-world entities are referred to) and syntactic (named entities often consist of more than one token) understanding.
Table 3. An overview of the two downstream tasks that have been used to evaluate the language understanding of the Eliquare student models for each target language.
For Task 1 we employed Sentiment Analysis data from various sources for three languages: Le et al. [32] for French, Van der Burgh and Verberne [33] for Dutch, and Amram et al. [34] for Hebrew. For Hindi and Swahili we relied on News Genre Classification data from Hindi2Vec (https://github.com/NirantK/hindi2vec, accessed on 1 January 2023), comprising 14 news classes, and from SNCD (https://huggingface.co/datasets/swahili_news, accessed on 1 January 2023) comprising 6 news classes in Swahili. Due to the unavailability of a suitable semantic sentence-level task for Slovene, we used NER data from Rahimi et al. [35] as an alternative. For Task 2 we relied on the Universal Dependencies (https://universaldependencies.org, accessed on 1 January 2023) (UD) project, which comprises treebanks with unified POS-tagged data for French (GSD), Dutch (Lassy-small), Hebrew (HTB), Hindi (HDTB), and Slovene (SSJ). Since there is no UD (or other) treebank publicly available for Swahili, we fell back to NER and used NER data from the Masakhane initiative [36].
We train the student of the respective language individually for each downstream task for 10 epochs with a starting learning rate of 5 × 10⁻⁵, with a decay of 0.01 after 500 warmup steps. We select the best validation model (train-validation-test splits are used as provided by the datasets; when no split is provided, an 80-10-10 split is used). All tasks are evaluated using F1-score, except Task 1 for Dutch (DBRD), which is evaluated with accuracy to allow comparison with the upper bound.
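The fine-tuning configuration can be sketched as follows, assuming the Hugging Face Trainer API; model, dataset, and metric wiring are omitted.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="eliquare-finetune",     # illustrative output path
    num_train_epochs=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=500,
    evaluation_strategy="epoch",        # evaluate once per epoch
    save_strategy="epoch",
    load_best_model_at_end=True,        # keep the best validation checkpoint
)
```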
The results of these experiments are presented in Table 4. Each time we compare the performance of our student models (Eliquare-mBERT and Eliquare-XLM) to a similarly sized reference, namely distilmBERT, which serves as our baseline. Moreover, a comparison is made with the two teacher models, and we also report the upper bound (row in gray), which is each time a monolingual transformer of the same size as the standard BERT-base-uncased for English. These upper bounds are therefore much larger, are trained with multitudes more monolingual data for the target language, and have a significantly larger and specialized vocabulary for the script in question. The best results per transformer algorithm (BERT/RoBERTa) for each language and task are indicated in bold. From the table, we can observe that the Eliquare models often perform similarly to, or in some cases even better than, their respective teachers, i.e., mBERT and XLM, which are much larger in size. The statistical significance of Eliquare-mBERT's improvement over the teacher mBERT was validated using the Wilcoxon Signed-Rank (left-tailed) test (p = 0.017, statistically significant at p < 0.05). Moreover, in a number of low-resourced settings, specifically for Hebrew (Task 2), Slovene (Task 1), and Hindi (Tasks 1 and 2), the students sometimes even outperform the upper bound. The monolingual performance of the Eliquare student models further emphasizes the added value of language-specific distillation: in low-resource settings (Slovene and Swahili), the much more efficient and sustainable student models are able to compete in performance with their larger upper bounds trained on extensive amounts of monolingual data, making them a better choice for deployment in practical scenarios. It is important to stress the advantages of Eliquare students for sustainability and efficiency. The base Eliquare student after vocabulary reduction (Section 5) has 66 million parameters, 2.5 times fewer than mBERT and 2 times fewer than distilmBERT, while also having a significantly faster inference speed of 0.066 s, compared to mBERT's 0.384 s (single V100 GPU with a batch size of 32).
Table 4. A summary of the results for the basic distillation setups for all six languages with mBERT and XLM-RoBERTa as the teachers, respectively, for Eliquare-mBERT and Eliquare-XLM.
These results clearly demonstrate that, even with a vanilla distillation setup, it is possible to obtain better monolingual models for low-resourced languages from a multilingual teacher. In the next sections, we further explore changes that can be made to the vanilla setup to make the language-specific application of distillation even more viable.

5. Vocabulary Manipulation for mBERT

While the Eliquare distilled student models achieve results on par with their respective multilingual teacher models (see Table 4), there are still issues that need to be addressed when using them in a monolingual setting. The most vital of these pertains to the vocabulary of these huge multilingual LLMs.
As visualized in Figure 1, the vocabulary of multilingual models, in this case mBERT, heavily favors Latin-script languages, while containing only a meager few thousand sub-words for large language groups such as Indic (6545 to be exact, covering the circa 12 languages from the Indian sub-continent included in mBERT) and Cyrillic (13,782 sub-words for 10 languages). Having a smaller vocabulary for these languages means less diverse sub-words, which inevitably results in some semantically meaningless alphabet-based tokens in the vocabulary, such as ##a.
Figure 1. A summary of the distribution of vocabulary for 6 different scripts in mBERT (where CJK stands for Chinese–Japanese–Korean).
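The per-script counts can be approximated directly from the mBERT vocabulary by mapping each sub-word to a Unicode range, as in the following sketch; the ranges shown are simplified illustrations rather than a full script taxonomy.

```python
from collections import Counter
from transformers import AutoTokenizer

def script_of(token: str) -> str:
    token = token.lstrip("#")          # drop the WordPiece continuation marker "##"
    for ch in token:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x0DFF:                               # Devanagari ... Sinhala
            return "Indic"
        if 0x0400 <= cp <= 0x04FF:
            return "Cyrillic"
        if 0x0590 <= cp <= 0x05FF:
            return "Hebrew"
        if 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF or 0xAC00 <= cp <= 0xD7AF:
            return "CJK"
        if ch.isascii() and ch.isalpha():
            return "Latin"
    return "Other"

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
counts = Counter(script_of(tok) for tok in tokenizer.get_vocab())
print(counts)
```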
As an example, Figure 2 shows the tokenization of long words in English, another high-resourced language (Dutch), a medium-resourced language (Hindi), and a low-resourced language (Farsi). We compare the tokenization by mBERT's WordPiece tokenizer to that of a monolingual model in the respective language. As illustrated in the figure, the tokenization is consistent between mBERT and the monolingual model for English, with sub-words of around three characters on average. However, this changes as we go down the resource ladder. For Dutch, some sub-words are only two characters long, especially sub-words that do not carry much semantic meaning; the final two sub-words for Dutch, however, still have 4–5 characters and retain some abstract sense. For the last two examples, in Hindi and Farsi, mBERT ends up breaking the word down into individual characters, whereas the monolingual model treats the example as a single, whole sub-word.
Figure 2. Examples of tokenization for long words in English (EN), Dutch (NL), Hindi (HI), and Urdu (UR), to show the contrast between the obtained sub-words from mBERT and a state-of-the-art monolingual transformer for the respective language.
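The contrast can be reproduced with a few lines of code; the monolingual checkpoint and the example word below are illustrative choices, not the ones used in the figure.

```python
from transformers import AutoTokenizer

mbert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
# Illustrative choice of a monolingual Dutch model for comparison
mono_tok = AutoTokenizer.from_pretrained("GroNLP/bert-base-dutch-cased")

word = "arbeidsongeschiktheidsverzekering"   # a long Dutch compound
print(mbert_tok.tokenize(word))   # many short, often semantically meaningless sub-words
print(mono_tok.tokenize(word))    # fewer, semantically richer sub-words
```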
These tokenization issues, combined with the poor overall representation of low-resourced languages in the vocabulary space, are a motivation to investigate strategies to alter a multilingual vocabulary for use in a monolingual target-language setting. While XLM-RoBERTa suffers from many of the same issues, the WordPiece tokenizer of mBERT allows some flexibility to alter the vocabulary even after pre-training, whereas the Byte-Pair Encoding (BPE) tokenizer of XLM-RoBERTa is more rigid and does not allow vocabulary deletions or additions as easily. This is why, for this and the next section, we only experiment with mBERT to alleviate this vocabulary issue. However, we do hope to transfer the methodologies to XLM-RoBERTa in future work.
Hypothetically, two stages can be discerned when building a monolingual student with the ideal low-resourced vocabulary. Firstly, mBERT can be purged of any additional sub-words that are not needed for a particular target language. We call this the VocabReduce step. Two alternate methodologies can be used for this step. On the one hand, the distillation can work identically to the basic setup, and the vocabulary can be reduced post-distillation directly from the student by removing unnecessary tokens (as proposed by Abdaoui et al. [37]). On the other hand, the vocabulary can be reduced pre-distillation, i.e., directly from the teacher. By purging additional sub-tokens from the teacher, we ensure that the student does not initialize the vocabulary for the additional sub-words. Pre-distillation reduction has a significant advantage over post-distillation in that the distillation itself runs significantly faster, since the sizes of both the student and the teacher are reduced beforehand, thus reducing the number of parameters and, by extension, the computing time for each iteration.
In the second stage, additional, richer sub-words for the target language could be added to prevent the tokenization from producing meaningless character-based sub-words; we call this step VocabAmp. The vocabulary setups for all the discussed methodologies are summarized in Figure 3. It should be noted that the VocabAmp step is more complex: in order to learn representations for sub-words that do not yet exist, one needs to rely exclusively on external data, since the teacher does not possess representations for these missing sub-words. Moreover, given a mismatch between the logits of the teacher and the student, the standard distillation loss cannot be computed, since it relies on the divergence between the teacher and student logits. Due to these additional challenges, we consider VocabAmp beyond the scope of this work and focus on VocabReduce.
Figure 3. A visual representation of the different vocabulary setups for basic distillation, post-distillation VocabReduce, pre-distillation VocabReduce, and VocabAmp.
We perform experiments with pre- and post-distillation VocabReduce for all six languages with mBERT as the teacher. We initialize a list of sub-words for the target language that we would like to retain by tokenizing the respective Wikipedia dump and selecting sub-words that occur in at least 0.05% of the sentences. We then reinitialize the transformer's embedding layer and tokenizer so that only the selected sub-words are retained. For pre-distillation we apply this technique directly to the teacher, while for post-distillation we apply it to the student after the distillation process. Table 5 shows the results of the experiments for the two tasks for each of the six target languages. Post-distillation VocabReduce results in near-identical performance to the basic distillation setup, since the reduction only takes place afterwards. Pre-distillation comes with minor variance, sometimes better and sometimes worse compared to post-distillation; however, it is consistently faster to train due to the significant reduction in the model's embedding layer size. Since the performance difference is barely noticeable, pre-distillation VocabReduce should be the go-to methodology due to the additional advantages it comes with.
Table 5. An overview of results for the more advanced post- and pre-distill VocabReduce techniques for all six languages.
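A simplified sketch of the VocabReduce procedure is given below; it omits rewriting the tokenizer's vocabulary file and other bookkeeping handled in the actual pipeline, and the threshold corresponds to the 0.05% sentence-frequency criterion described above.

```python
import torch
from collections import Counter
from transformers import AutoTokenizer, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
teacher = BertForMaskedLM.from_pretrained("bert-base-multilingual-cased")

def select_subwords(sentences, threshold=0.0005):
    """Retain sub-words that occur in at least 0.05% of the target-language sentences."""
    doc_freq = Counter()
    for sent in sentences:
        doc_freq.update(set(tokenizer.tokenize(sent)))
    keep = {tok for tok, freq in doc_freq.items() if freq / len(sentences) >= threshold}
    keep.update(tokenizer.all_special_tokens)        # never drop [CLS], [SEP], [MASK], ...
    return keep

def reduce_vocabulary(model, keep_tokens):
    """Shrink the embedding matrix (and the tied MLM head) to the retained sub-words."""
    old_vocab = tokenizer.get_vocab()                # token -> old id
    kept_ids = sorted(old_vocab[t] for t in keep_tokens if t in old_vocab)

    old_emb = model.get_input_embeddings().weight.data
    new_emb = torch.nn.Embedding(len(kept_ids), old_emb.size(1))
    new_emb.weight.data = old_emb[kept_ids].clone()
    model.set_input_embeddings(new_emb)

    # The MLM head bias has one entry per vocabulary item and must be sliced as well
    old_bias = model.cls.predictions.bias.data
    model.cls.predictions.bias = torch.nn.Parameter(old_bias[kept_ids].clone())
    model.cls.predictions.decoder.bias = model.cls.predictions.bias

    model.tie_weights()                              # re-tie the decoder weight to the new embeddings
    model.config.vocab_size = len(kept_ids)
    return kept_ids      # used to rewrite the tokenizer's vocabulary in the same order
```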

6. Analysis for Low-Resourced Settings

While the basic distillation setups seem to be quite robust in obtaining comparable performance to the multilingual counterparts and in some cases even comparable to the respective upper-bounds, we look into further adaptations that can be made to make distillation setups more suitable for the low-resourced setting. To this end, we perform an ablation study with two vital parts of the distillation pipeline:
  • Loss Components: we attempt to find the most and least impactful components of the three-fold loss function to better tune loss weights for low-resourced settings.
  • Softmax Temperature: while softening the distribution with a temperature of 2.0 is standard practice in most distillation settings, we dig deeper and see if hardening or further softening can have an impact in the low-resourced setting.
To study the impact of these two variables, we perform additional experiments for the two low-resourced languages: Slovene and Swahili. For the baseline setup, we use the distilled student from the previous section with pre-distillation vocabulary reduction using mBERT as the teacher.
For the first ablation study, we thus experiment with the three-fold loss function. The results are presented in Table 6, where the baseline scores (row 1) represent the setup from Section 4 with losses weighted with alpha values of 5.0, 1.0, and 2.0, respectively. The second row gives a general indication of the performance when all losses are weighted equally, while the next three rows show the impact of the individual loss components by removing them from the setup one by one; in each case we notice a drop in performance. Figure 4 provides a visual intuition of these trends by visualizing the drop in performance relative to the equal-weights setting (row 2). The figure demonstrates that each loss component is vital to the setup, which is in line with the consistent drop in performance (rows 3–5) when removing any of the losses. It can also be inferred from the figure that $L_{mlm}$ is the most pivotal component of the loss function. This is an intuitive finding, since the student models often perform better than mBERT, their teacher. For the students to learn information missing from the teacher, they have to rely on knowledge that is not present in the teacher but comes from external sources, and $L_{mlm}$ is the only component able to provide such an external signal. This especially holds in a low-resourced setting, where mBERT's signals may not always be reliable.
Table 6. Results for Slovene and Swahili for both tasks for the first ablation study. The first row refers to the results of the baseline from Section 4. The second row represents losses with equal weights to set up a comparison for each of the next three experiments where one of the losses is removed from the setup one by one.
Figure 4. A representation of the results from Table 6 to visualize the drop in performance (in F1-score) from the equal-weights setup of $L_{CE} + L_{cosine} + L_{mlm}$ compared to the baseline.
For our second ablation study, we experiment with the softmax temperature (τ). The results are presented in Table 7. While a τ of 2.0 was used in the baseline experiments in Section 4, four additional experiments have been performed: for two, the distribution was further softened with a τ of 3.0 and 4.0; one uses the unchanged logits from the teacher (τ = 1.0); and for another, the distribution was hardened (τ = 0.5). While at first sight the other setups seem only marginally worse, the baseline setup with a τ of 2.0 is consistently better. This indicates that further softening or hardening the logits does not benefit the student in a low-resourced setting.
Table 7. Results for Slovene and Swahili for both tasks illustrating the impact of either softening or hardening the softmax temperature (τ); τ = 2.0 refers to the baseline setup from Section 4.
Figure 5 elaborates on this finding, as it shows the drop in performance from the peak F1-score at τ = 2.0. While there are some anomalies, it appears that the further we move from the optimal τ of 2.0, the worse the performance becomes. It is also important to note that tasks such as POS-tagging for Slovene seem to be quite robust to changes in τ. However, this might simply be because the dataset is comparatively easy and performance is already quite saturated, with extremely high scores of the order of 0.984.
Figure 5. A visualization of the results in Table 7 expressing the relative drops in F1-score from the best value of softmax temperature, τ = 2.0 .

7. Conclusions

In this work, we have further explored and improved upon the novel language-distillation methodology first introduced in Singh and Lefever [8], where it was tested for mBERT [1]. In this research, we have extended the approach to the more robust and state-of-the-art XLM-RoBERTa [31] and demonstrated its efficacy. As with the language-distillation systems developed from mBERT, the approach yields consistent Eliquare student models from XLM-RoBERTa for all six languages. These languages were carefully selected to account for as much variation as possible with regard to their typologies, language families, and available resources. We considered Dutch and French as representative of high-resourced languages, Hindi and Hebrew as moderately resourced languages, and Swahili and Slovene as low-resourced languages. The experimental results confirmed that language distillation is viable, especially in low-resourced settings, and the resulting students were often able to outperform their multilingual teachers while being up to four times smaller and six times faster at inference.
The objective of this research was to further progress research in low-resourced languages, in particular by creating systems for these languages building on existing large multilingual models. This area of research was explored further by looking into the manipulation of the vocabulary of the resulting student models. Two different strategies were proposed to reduce a multilingual vocabulary into a monolingual one as part of the distillation process. We showed that pre-distillation VocabReduce is a consistently better strategy since it performs just as well and saves computing time over the alternative, post-distillation VocabReduce.
In addition, we explored the impact of the different loss components on the students for the two low-resourced languages. We discovered that $L_{mlm}$ is the most impactful component of the three-fold loss. However, all losses contribute to the performance, and the ablation of any component results in a drop in performance.
Finally, we also investigated optimal softmax temperatures in the low-resourced setting and concluded that the default value of τ = 2.0 is optimal; further softening or hardening of the logits results in a drop in performance.
In future work, we would like to venture into more advanced distillation setups described in Section 2, such as TinyBERT [21] and MobileBERT [38], with additional loss components such as Feature Map Transfer and Attention Transfer. We also aim to explore alternate teacher–student setups with multiple teachers, and the construction of bilingual students for two typologically related languages. A logical next step then will be to research strategies for VocabAmp, while also modifying the VocabReduce technique for application to XLM-RoBERTa.

Author Contributions

Conceptualization, P.S., O.D.C. and E.L.; methodology, P.S., O.D.C. and E.L.; software, P.S.; validation, P.S., O.D.C. and E.L.; formal analysis, P.S.; investigation, P.S.; resources, P.S.; data curation, P.S.; writing—original draft preparation, P.S.; writing—review and editing, P.S., O.D.C. and E.L.; visualization, P.S.; supervision, O.D.C. and E.L.; project administration, E.L.; funding acquisition, E.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Association for Computational Linguistics: Minneapolis, MN, USA, 2019; Volume 1, pp. 4171–4186. [Google Scholar]
  2. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8440–8451. [Google Scholar] [CrossRef]
  3. Martin, L.; Muller, B.; Ortiz Suárez, P.J.; Dupont, Y.; Romary, L.; de la Clergerie, É.; Seddah, D.; Sagot, B. CamemBERT: A Tasty French Language Model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7203–7219. [Google Scholar]
  4. Delobelle, P.; Winters, T.; Berendt, B. RobBERT: A Dutch RoBERTa-based Language Model. arXiv 2020, arXiv:cs.CL/2001.06286. [Google Scholar]
  5. Xue, L.; Constant, N.; Roberts, A.; Kale, M.; Al-Rfou, R.; Siddhant, A.; Barua, A.; Raffel, C. mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Online, 6–11 July 2021; pp. 483–498. [Google Scholar] [CrossRef]
  6. Chi, Z.; Huang, S.; Dong, L.; Ma, S.; Zheng, B.; Singhal, S.; Bajaj, P.; Song, X.; Mao, X.L.; Huang, H.; et al. XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, Dublin, Ireland, 22–27 May 2022; Volume 1, pp. 6170–6182. [Google Scholar] [CrossRef]
  7. Hu, J.; Ruder, S.; Siddhant, A.; Neubig, G.; Firat, O.; Johnson, M. XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation. In Proceedings of the 37th International Conference on Machine Learning, Virtual Event, 13–18 July 2020; Volume 119, pp. 4411–4421. [Google Scholar]
  8. Singh, P.; Lefever, E. When the Student Becomes the Master: Learning Better and Smaller Monolingual Models from mBERT. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 29–31 July 2022; pp. 4434–4441. [Google Scholar]
  9. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2019, arXiv:1910.01108. [Google Scholar]
  10. Sachdeva, N.; McAuley, J. Data Distillation: A Survey. arXiv 2023, arXiv:2301.04272. [Google Scholar] [CrossRef]
  11. Hinton, G.E.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  12. Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model Compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 535–541. [Google Scholar] [CrossRef]
  13. Tang, Z.; Wang, D.; Zhang, Z. Recurrent Neural Network Training with Dark Knowledge Transfer. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5900–5904. [Google Scholar] [CrossRef]
  14. Xu, J.; Wang, P.; Yang, H.; L’opez, A.M. Training a Binary Weight Object Detector by Knowledge Transfer for Autonomous Driving. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 2379–2384. [Google Scholar]
  15. Mun, J.; Lee, K.; Shin, J.; Han, B. Learning to Specialize with Knowledge Distillation for Visual Question Answering. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Curran Associates Inc.: Red Hook, NY, USA, 2018; pp. 8092–8102. [Google Scholar]
  16. Zhou, C.; Neubig, G.; Gu, J. Understanding Knowledge Distillation in Non-autoregressive Machine Translation. arXiv 2020, arXiv:1911.02727. [Google Scholar]
  17. Fetahu, B.; Veeragouni, A.; Rokhlenko, O.; Malmasi, S. Distilling multilingual transformers into CNNs for scalable intent classification. In Proceedings of the EMNLP 2022, Abu Dhabi, United Arab Emirates, 7–11 December 2022. [Google Scholar]
  18. Chen, Y.C.; Gan, Z.; Cheng, Y.; Liu, J.; Liu, J. Distilling Knowledge Learned in BERT for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 7893–7905. [Google Scholar] [CrossRef]
  19. Tang, R.; Lu, Y.; Liu, L.; Mou, L.; Vechtomova, O.; Lin, J. Distilling Task-Specific Knowledge from BERT into Simple Neural Networks. arXiv 2019, arXiv:1903.12136. [Google Scholar]
  20. Sun, S.; Cheng, Y.; Gan, Z.; Liu, J. Patient Knowledge Distillation for BERT Model Compression. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 3–7 November 2019; pp. 4323–4332. [Google Scholar] [CrossRef]
  21. Jiao, X.; Yin, Y.; Shang, L.; Jiang, X.; Chen, X.; Li, L.; Wang, F.; Liu, Q. TinyBERT: Distilling BERT for Natural Language Understanding. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 4163–4174. [Google Scholar] [CrossRef]
  22. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, Brussels, Belgium, 1 November 2018; pp. 353–355. [Google Scholar] [CrossRef]
  23. Üstün, A.; Berard, A.; Besacier, L.; Gallé, M. Multilingual Unsupervised Neural Machine Translation with Denoising Adapters. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 6650–6662. [Google Scholar] [CrossRef]
  24. Libovický, J.; Rosa, R.; Fraser, A. On the Language Neutrality of Pre-trained Multilingual Representations. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1663–1674. [Google Scholar] [CrossRef]
  25. Conneau, A.; Wu, S.; Li, H.; Zettlemoyer, L.; Stoyanov, V. Emerging Cross-lingual Structure in Pretrained Language Models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 6022–6034. [Google Scholar] [CrossRef]
  26. Pires, T.; Schlinger, E.; Garrette, D. How Multilingual is Multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 4996–5001. [Google Scholar] [CrossRef]
  27. Wu, S.; Dredze, M. Are All Languages Created Equal in Multilingual BERT? arXiv 2020, arXiv:2005.09093. [Google Scholar]
  28. Peng, B.; Jin, X.; Liu, J.; Zhou, S.; Wu, Y.; Liu, Y.; Li, D.; Zhang, Z. Correlation Congruence for Knowledge Distillation. arXiv 2019, arXiv:1904.01802. [Google Scholar]
  29. Lee, S.; Song, B.C. Graph-based Knowledge Distillation by Multi-head Attention Network. arXiv 2019, arXiv:1907.02226. [Google Scholar]
  30. Turc, I.; Chang, M.; Lee, K.; Toutanova, K. Well-Read Students Learn Better: The Impact of Student Initialization on Knowledge Distillation. arXiv 2019, arXiv:1908.08962. [Google Scholar]
  31. Lample, G.; Conneau, A. Cross-lingual Language Model Pretraining. arXiv 2019, arXiv:1901.07291. [Google Scholar]
  32. Le, H.; Vial, L.; Frej, J.; Segonne, V.; Coavoux, M.; Lecouteux, B.; Allauzen, A.; Crabbé, B.; Besacier, L.; Schwab, D. FlauBERT: Unsupervised Language Model Pre-training for French. arXiv 2019, arXiv:1912.05372. [Google Scholar] [CrossRef]
  33. Van der Burgh, B.; Verberne, S. The merits of Universal Language Model Fine-tuning for Small Datasets—A case with Dutch book reviews. arXiv 2019, arXiv:1910.00896. [Google Scholar]
  34. Amram, A.; Ben David, A.; Tsarfaty, R. Representations and Architectures in Neural Sentiment Analysis for Morphologically Rich Languages: A Case Study from Modern Hebrew. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 21–25 August 2018; pp. 2242–2252. [Google Scholar]
  35. Rahimi, A.; Li, Y.; Cohn, T. Massively Multilingual Transfer for NER. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 151–164. [Google Scholar]
  36. Adelani, D.I.; Abbott, J.; Neubig, G.; D’souza, D.; Kreutzer, J.; Lignos, C.; Palen-Michel, C.; Buzaaba, H.; Rijhwani, S.; Ruder, S.; et al. MasakhaNER: Named Entity Recognition for African Languages. Trans. Assoc. Comput. Linguist. 2021, 9, 1116–1131. [Google Scholar] [CrossRef]
  37. Abdaoui, A.; Pradel, C.; Sigel, G. Load What You Need: Smaller Versions of Multilingual BERT. In Proceedings of the SustaiNLP/EMNLP, Online, 20 November 2020. [Google Scholar]
  38. Sun, Z.; Yu, H.; Song, X.; Liu, R.; Yang, Y.; Zhou, D. MobileBERT: A Compact Task-Agnostic BERT for Resource-Limited Devices. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA, USA, 5–10 July 2020; pp. 2158–2170. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
