Article

Confidence-Based Knowledge Distillation to Reduce Training Costs and Carbon Footprint for Low-Resource Neural Machine Translation

Maria Zafar, Patrick J. Wall, Souhail Bakkali and Rejwanul Haque

1 Department of Computing, South East Technological University, R93 V960 Carlow, Ireland
2 Faculty of Business, Technological University Dublin, D02 HW71 Dublin, Ireland
3 L3i, University of La Rochelle, 17000 La Rochelle, France
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(14), 8091; https://doi.org/10.3390/app15148091
Submission received: 13 May 2025 / Revised: 7 July 2025 / Accepted: 15 July 2025 / Published: 21 July 2025
(This article belongs to the Special Issue Deep Learning and Its Applications in Natural Language Processing)

Abstract

The transformer-based deep learning approach represents the current state-of-the-art in machine translation (MT) research. Large-scale pretrained transformer models produce state-of-the-art performance across a wide range of MT tasks for many languages. However, such deep neural network (NN) models are often data-, compute-, space-, power-, and energy-hungry, typically requiring powerful GPUs or large-scale clusters to train and deploy. As a result, they are often regarded as “non-green” and “unsustainable” technologies. Distilling knowledge from large deep NN models (teachers) to smaller NN models (students) is a widely adopted sustainable development approach in MT as well as in broader areas of natural language processing (NLP), speech, and image processing. However, distilling large pretrained models presents several challenges. First, training time and cost increase with the volume of data used to train a student model. This can pose a challenge for translation service providers (TSPs), as they may have limited budgets for training. Moreover, the CO2 emissions generated during model training are typically proportional to the amount of data used, contributing to environmental harm. Second, when querying teacher models, including encoder–decoder models such as NLLB, the translations they produce for low-resource languages may be noisy or of low quality. This can undermine sequence-level knowledge distillation (SKD), as student models may inherit and reinforce errors from inaccurate labels. In this study, the teacher model’s confidence estimation is employed to filter out those instances of the distilled training data for which the teacher exhibits low confidence. We tested our methods on a low-resource Urdu-to-English translation task operating within a constrained training budget in an industrial translation setting. Our findings show that confidence estimation-based filtering can significantly reduce the cost and CO2 emissions associated with training a student model without a drop in translation quality, making it a practical and environmentally sustainable solution for TSPs.

1. Introduction

Knowledge distillation (KD) is a popular technique for transferring knowledge from a large NN (teacher) to a typically smaller NN (student) [1,2]. In this approach, the student model is trained on unlabeled data to mimic the teacher model’s output, typically the soft labels generated by the teacher on that data. In the case of an MT task, translations generated by the teacher model are used as soft labels for training the student model.
Although KD has proven effective in many MT tasks [3,4], training time and cost tend to increase with the volume of data used to train the student model. This presents a significant challenge for TSPs operating under constrained training budgets. Moreover, the CO2 emissions produced during model training are typically proportional to the data size. While large state-of-the-art pretrained models such as NLLB [5] can be used to generate soft target labels to create the distilled training data, these labels are often not of good quality when low-resource languages are involved. The poor quality and noisy translations produced by the large pretrained models (e.g., encoder–decoder models such as NLLB) can become a critical issue for SKD, since the noisy labels can propagate errors and the student models can learn incorrect patterns from such distilled data.
Because training cost usually scales proportionally with the number of instances used for learning, keeping noisy instances in the training data may unnecessarily increase training cost without any performance gain [6]. Moreover, removing instances with low-quality and erroneous labels from the distilled data can prevent the student model from learning incorrect patterns. In this regard, obtaining powerful hardware and relevant data to train neural models and deploying them in production within tight timelines and limited training budgets remains a critical challenge for TSPs seeking to quickly build efficient and high-quality MT systems for their clients.
In our study, we aim to provide TSPs with an effective low-cost KD training solution for MT, balancing training data size and label quality with the goal of achieving little to no loss in translation quality. We exploit model confidence estimation measures for selecting instances with good-quality soft labels (translations) generated by the teacher model. Translations for which the teacher model exhibits strong confidence should ideally contribute more to the distillation process, while those with lower confidence should have a reduced impact. This research focuses on recommending suitable confidence thresholds that TSPs should use for KD in specific use cases under a limited training budget. In addition, we assess whether this strategy can substantially reduce training costs, particularly in terms of CO2 emissions. For our investigation, we carried out experiments on the Urdu-to-English language pair, focusing on the following two specific use cases in an industrial translation project operating under constrained training budget conditions:
  • A small amount of bilingual data similar to the project data to be translated is available for training.
  • Only limited source-language monolingual data similar to the project data to be translated are available for training.
We found that the student models trained on the distilled data prepared following our strategy performed comparably to or even outperformed the baseline in certain cases, while also significantly reducing carbon emissions during the training process. The baseline models were trained following the standard KD training setup.
The rest of this paper is organized as follows: in Section 2, we describe the related work; in Section 3, we describe the background; our datasets are explained in Section 4; Section 5 details the experimental setups; in Section 6, we discuss our results; finally, Section 7 concludes our work and discusses potential directions for future research.

2. Related Work

Hinton et al. [1] successfully employed KD for the image classification task to compress large networks or ensembles of networks into smaller models that achieved similar performance to the large networks. Kim and Rush [2] were the first to apply KD techniques to neural MT (NMT). Subsequently, many researchers have explored various applications of sequence-level KD for NMT [7,8,9,10,11,12]. More specifically, sequence-level KD [2] is an effective technique which involves training small student models on pseudo-target sequences generated by large teacher models. This technique significantly reduces both model size and inference time, typically with minimal loss in performance [9,10,11,13,14,15]. In other words, this approach is straightforward, as it only requires generating forward-translated synthetic data (i.e., soft target labels) using the teacher model, and has proven effective in training smaller models with minimal loss in performance [16,17,18,19]. In sum, the KD pipeline has two stages: the first involves generating distilled data using large teacher models, and the second involves training and fine-tuning the student models on the distilled data created in the first stage.
The literature on KD for NMT is rich and extensive. For example, researchers have fine-tuned student models on teacher-generated chains of thought (CoT) (a CoT is a series of intermediate reasoning steps incorporated in LLMs to improve their ability to perform complex reasoning) [20,21,22,23,24,25] and other prompting trajectories (rationales) [20,21]. Researchers have also explored alternative prompting methods [20,21] to determine how sequence-level distillation data generated by the teacher can more effectively transfer reasoning or decision-making capabilities to students. In fact, most sequence-level KD uses known correct or incorrect trajectories to fine-tune student models [26,27,28]. Multi-domain KD, proposed in [29], distills multiple expert models into a single student. The authors kept the model architecture and capacity fixed, showing improvements in BLEU [30] scores across all domains over multi-domain models without any increase in translation time or memory usage. They used a fixed depth and architecture for the teachers, making their approach architecture-independent and easy to combine with other multi-domain NMT models. Yu et al. [31] enriched fine-tuning data by rephrasing noisy rationales; however, they did not select annotated samples in order to save on the teacher budget. Jooste et al. [32] investigated the impact of KD training on CO2 emissions and conducted empirical evaluations of various KD strategies. Their study covered key parameters (CO2 emissions, translation time, and translation quality (BLEU)), with explicit measurement of GPU power consumption during translation. By experimenting with different teacher model variants and continuously monitoring energy usage, they estimated the computational cost of translation and assessed student model performance. Their findings suggest that the proposed methods offer a computationally efficient alternative to standard KD approaches, achieving similar accuracy levels while lowering energy demands. Rather than just aiming to reduce model size or improve inference speed, their primary focus was on improving the overall efficiency of the KD process.
More recently, [6] tested several pseudo-label filtering methods in a speech translation task, including proxy models, uncertainty quantification (entropy and the geometric mean of confidence scores), negative log-likelihood, multimodal embeddings, and perceptual evaluation of speech quality. Their experiments demonstrate that these unsupervised techniques enable distilled models to outperform or match the performance of supervised distillation setups while also being more computationally and memory efficient than the larger teacher models, particularly on dialectal speech. In their paper “Learning With Less: Computational Resources and Less Data for Knowledge Distillation from LLMs”, the authors of [33] introduced LLKD, a novel method designed to improve the training of smaller language models by leveraging the extensive knowledge of larger and more powerful LLMs. The core challenge they addressed is the scarcity of labeled data for training smaller models, which are preferred for practical applications due to their lower computational demands. More specifically, LLKD enables LLMs to generate pseudo-labels for readily available unlabeled data, effectively acting as a teacher to the smaller student models. Their proposed adaptive sample selection method prioritizes data where the teacher LLM demonstrates high confidence in its labeling and where the student model indicates a high need for information, leading to more efficient and effective knowledge transfer and superior performance even with less training data. The work of Koneru et al. [34] is closely related to ours; there, the authors proposed a training procedure that leverages readily available monolingual data along with small and inexpensive dictionaries to pretrain NMT models. They introduced dictionary-preserving byte-pair encoding to better integrate rare dictionary words, as well as a new “cross-entropy difference” active learning (AL) strategy to intelligently select the most informative sentences for human annotation. Their research demonstrates that this combined approach can significantly improve translation quality even with very small annotation budgets, outperforming conventional AL methods.
In this context, applications of deep learning algorithms can also be seen in different but related research domains [35], such as optimizing residential electricity costs and reducing CO2 emissions via renewable energy integration. For instance, [36] developed a deep learning-based system that monitors and manages household electrical appliance usage with the aim of reducing electricity consumption. Their proposed approach demonstrated significant performance improvements, successfully lowering both energy costs and carbon emissions. Although our research focuses on minimizing computational costs in MT model training through techniques such as KD, our underlying goal aligns with that of [36], namely reducing energy consumption and emissions, albeit in the context of computational systems rather than smart grids.
Tiny NMT models usually have benefits [16] such as faster training times, reduced computational requirements, lower energy consumption, and the ability to be deployed on resource-constrained devices without significant performance degradation, leading to significantly lower CO2 emissions. However, tiny models can suffer significantly compromised quality compared to larger models. These issues become worse when dealing with low-resource languages, for which building NMT systems is difficult due to the limited availability of high-quality data. As pointed out above, while large pretrained models such as NLLB can be used to generate soft target labels in order to create distilled training data, these labels are often not of good quality when low-resource languages are involved. The poor-quality and noisy translations of large pretrained models can become a critical issue for sequence-level KD, since noisy or inaccurate labels can lead to the propagation of errors and the student model learning incorrect patterns from the distilled data.
The studies discussed above primarily focus on scenarios involving large fine-tuning datasets, and do not address situations where the teacher has a fixed budget and provides flawed annotations for the unlabeled data (e.g., in case of low-resource languages). In contrast, our work focuses on helping TSPs to operate under limited budgets by leveraging confidence-based filtering to mitigate flawed teacher annotations and reduce training costs without sacrificing translation quality.

3. Background

3.1. Transformer

Transformer [37] currently represents the state-of-the-art in MT research. It is an encoder–decoder architecture based on stacked self-attention and position-wise fully connected layers in both its encoder and decoder components. Figure 1 shows the transformer architecture. The left side of Figure 1 depicts the Nth layer of the encoder, while the right side shows the Nth layer of the decoder.
Each encoder layer contains two sub-layers: a multi-head self-attention mechanism layer, and a position-wise fully connected feedforward network layer. Residual connections [38] are employed around each sub-layer, followed by layer normalization [39]. This configuration results in the output of each sub-layer being expressed as LayerNorm(x + Sublayer(x)), where Sublayer(x) denotes the function implemented by the specific sub-layer. Both the sub-layers and the embedding layers are designed to produce outputs of dimension d_model to accommodate the residual connections.
The decoder structure is similar to the encoder but includes an additional third sub-layer. This sub-layer applies multi-head attention over the output of the encoder stack, known as encoder–decoder attention. Multi-head attention enables the model to process information from different representation subspaces across different positions. The decoder’s self-attention sub-layer is masked and the output embeddings are offset by one position, ensuring that predictions for position i can only depend on the known outputs at positions smaller than i. As in the encoder, a residual connection, which carries gradients from the input layers to the output layers and thereby helps prevent vanishing gradients, is added around each sub-layer within the decoder layer, followed by layer normalization.
Furthermore, ref. [37] proposed scaled dot-product attention, which is significantly faster and more memory-efficient than additive attention when implemented via optimized matrix multiplication. By scaling the computed dot products, dot-product attention maintains its advantage over additive attention even as the dot product values become larger. The attention function is simultaneously computed on a set of queries, keys, and values packed into matrices Q, K, and V, respectively. The matrix of outputs that is fed into the decoder is computed as shown in Equation (1), where d_k denotes the dimension of the query and key vectors. Scaling is performed by dividing the dot products by \sqrt{d_k}. This process is illustrated in Figure 2.

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \quad (1)

Multi-head attention is a mechanism enabling neural network models to simultaneously process information from distinct representation subspaces across various positions within the input. This process begins by linearly projecting the queries, keys, and values (Q, K, and V) h times using unique learned linear transformations to specific dimensions: d_k for Q and K, and d_v for V. The core attention function is then applied in parallel to each of these projected Q, K, V sets, yielding d_v-dimensional output values for each parallel head. These outputs are subsequently concatenated and subjected to a final linear projection to produce the final values. The attention mechanism is applied independently for each of the h attention heads, as shown in Equation (3). The resulting outputs from each head are then concatenated and passed through a final linear transformation, as shown in Equation (2):

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O} \quad (2)

\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) \quad (3)

where W_i^{Q} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}, W_i^{K} \in \mathbb{R}^{d_{\mathrm{model}} \times d_k}, W_i^{V} \in \mathbb{R}^{d_{\mathrm{model}} \times d_v}, and W^{O} \in \mathbb{R}^{h d_v \times d_{\mathrm{model}}}.
The diagram in Figure 3 shows how multi-head attention is implemented.
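To make the computations in Equations (1)–(3) concrete, the following is a minimal NumPy sketch of scaled dot-product attention and multi-head self-attention; the function names, shapes, and toy weights are illustrative assumptions rather than the actual NLLB implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Equation (1): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, len_q, len_k)
    return softmax(scores) @ V                          # (batch, len_q, d_v)

def multi_head_attention(x, W_q, W_k, W_v, W_o):
    """Equations (2) and (3): self-attention with h heads (here Q = K = V = x)."""
    heads = [scaled_dot_product_attention(x @ W_q[i], x @ W_k[i], x @ W_v[i])
             for i in range(W_q.shape[0])]
    return np.concatenate(heads, axis=-1) @ W_o         # (batch, len, d_model)

# Toy self-attention pass: batch of 2 sequences, length 5, d_model = 8, h = 2 heads.
rng = np.random.default_rng(0)
h, d_model, d_k = 2, 8, 4
x = rng.normal(size=(2, 5, d_model))
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(x, W_q, W_k, W_v, W_o).shape)   # (2, 5, 8)
```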
Because transformer models do not use recurrence or convolution, positional encodings are added to the input embeddings before they are passed to the first layers of the encoder and decoder stacks. These positional encodings allow the model to incorporate information about the relative or absolute positions of tokens in the input sequence.
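As an illustration, below is a minimal sketch of the sinusoidal positional encoding proposed in [37]; the function name is ours, and the resulting matrix is simply added to the token embeddings before the first encoder or decoder layer.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)); PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    assert d_model % 2 == 0, "this sketch assumes an even model dimension"
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# embeddings = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
print(sinusoidal_positional_encoding(512, 8).shape)   # (512, 8)
```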

3.2. Sequence-Level Knowledge Distillation

The methods described by [1] can be used for word-level KD, since NMT models make use of multi-class prediction at the word level. However, these models also need to predict complete sequences that are dependent on previous predictions. Kim and Rush [2] introduced sequence-level KD, in which a new training set is created by translating a dataset with the teacher model using beam search. The new training set is then used to train a smaller student model. They demonstrated how the standard training objective for multi-class classifiers can be adapted to develop a function for KD, which can be extended further to support word-level and ultimately sequence-level KD.
Assume that, given a dataset (x, y), we want to assign x to one of the classes in a set V. The aim is to minimize the cross-entropy between the data distribution and the model distribution p parameterized by \theta. This can be accomplished by minimizing the negative log-likelihood (NLL) for each training example:

\mathcal{L}_{\mathrm{NLL}}(\theta) = -\sum_{k=1}^{|V|} \mathbb{1}\{y = k\} \log p(y = k \mid x; \theta) \quad (4)

where \mathbb{1}\{\cdot\} is the indicator function. In the case of KD, we instead have a distribution q(y = k \mid x; \theta_T) learned by the teacher; therefore, Equation (4) can be rewritten as Equation (5):

\mathcal{L}_{\mathrm{KD}}(\theta, \theta_T) = -\sum_{k=1}^{|V|} q(y = k \mid x; \theta_T) \log p(y = k \mid x; \theta). \quad (5)

We can now use \mathcal{L}_{\mathrm{KD}} to define loss functions for KD in NMT. Standard KD can be applied to NMT models because word-level NLL is minimized during training. The word-level loss function is then defined as in Equation (6):

\mathcal{L}_{\mathrm{WORD\text{-}KD}}(\theta, \theta_T) = -\sum_{j=1}^{J} \sum_{k=1}^{|V|} q(t_j = k \mid s, t_{<j}) \log p(t_j = k \mid s, t_{<j}) \quad (6)

where V is the target vocabulary and t and s are the target and source sentences, respectively. Finally, a loss function for sequence-level KD is formulated, as word-level KD can often lead to the propagation of incorrect predictions during forward passes. Again, the teacher model’s probability distribution can be used to define this loss function. Sequence distributions from the teacher model are used instead of word distributions, allowing Equation (5) to be rewritten as

\mathcal{L}_{\mathrm{SEQ\text{-}KD}}(\theta, \theta_T) = -\sum_{t \in \mathcal{T}} q(t \mid s) \log p(t \mid s), \quad (7)

where q(t \mid s) is the teacher’s distribution over the set \mathcal{T} of all possible target sequences. This loss function is intractable to compute exactly, since it sums over an exponential number of sequences. In [2], the authors suggested using beam search to approximate Equation (7), which reduces the complexity of \mathcal{L}_{\mathrm{SEQ\text{-}KD}}.
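As an illustrative sketch only, the PyTorch snippet below computes the word-level KD loss of Equation (6) from teacher and student logits; the tensor names are hypothetical. Sequence-level KD, as used in this paper, is instead approximated by training the student with the ordinary cross-entropy loss on (source, beam-searched teacher translation) pairs.

```python
import torch
import torch.nn.functional as F

def word_level_kd_loss(student_logits, teacher_logits, pad_mask):
    """Equation (6): cross-entropy between teacher distributions q and student distributions p.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    pad_mask: (batch, seq_len), 1.0 for real target tokens and 0.0 for padding.
    """
    q = F.softmax(teacher_logits, dim=-1)            # q(t_j = k | s, t_<j)
    log_p = F.log_softmax(student_logits, dim=-1)    # log p(t_j = k | s, t_<j)
    token_loss = -(q * log_p).sum(dim=-1)            # (batch, seq_len)
    return (token_loss * pad_mask).sum() / pad_mask.sum()

# Toy check with random logits for a batch of 2 sequences of length 7.
student = torch.randn(2, 7, 32000)
teacher = torch.randn(2, 7, 32000)
mask = torch.ones(2, 7)
print(word_level_kd_loss(student, teacher, mask))
```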

4. Datasets

FLORES-200 (https://github.com/facebookresearch/flores/blob/main/flores200/README.md; accessed on 1 June 2025) Urdu–English test data were chosen as our translation project data. These are general domain data covering a wide range of topics, including news, culture, and daily life. Additionally, we considered IN22-Gen (https://huggingface.co/datasets/ai4bharat/IN22-Gen; accessed on 1 June 2025), consisting of general-purpose multi-domain test data, as alternative translation project test data. The statistics of FLORES-200 and IN22-Gen are shown in the last rows of Table 1.
For training, we used Urdu–English (Ur–En) parallel data from OPUS (https://opus.nlpl.eu/NLLB/ur&en/v1/NLLB; accessed on 1 June 2025). We sampled 200,000 bilingual sentence pairs from NLLB (https://opus.nlpl.eu/NLLB/en&ur/v1/NLLB; accessed on 1 June 2025) to simulate a low-resource scenario and removed duplicates; the resulting training data are shown in the first row of Table 1. We sampled a set of 1000 source–target sentence pairs from the same data source for our validation data.
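A minimal sketch of the kind of sampling and de-duplication step described above is shown below; the file names and random seed are illustrative assumptions.

```python
import random

def load_pairs(src_path, tgt_path):
    """Read a parallel corpus stored as two aligned plain-text files."""
    with open(src_path, encoding="utf-8") as fs, open(tgt_path, encoding="utf-8") as ft:
        return [(s.strip(), t.strip()) for s, t in zip(fs, ft)]

random.seed(42)
pairs = load_pairs("NLLB.ur-en.ur", "NLLB.ur-en.en")   # OPUS NLLB Urdu-English files (names assumed)
sampled = random.sample(pairs, 200_000)                # simulate a low-resource scenario
deduped = list(dict.fromkeys(sampled))                 # drop exact duplicate sentence pairs
valid, train = deduped[:1000], deduped[1000:]          # small held-out validation set
```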
As mentioned above, we wanted to test two translation industry-specific use cases (see Section 1). The bilingual data in Table 1 serve as limited bilingual data similar to the project data to be translated. The monolingual data (Urdu) serve only as source monolingual data similar to the project data to be translated. FLORES-200 and IN22-Gen both consist of general domain data, as was the case with NLLB.

5. Experimental Setup

This section details our experimental setup, including the configurations of the teacher and student MT models used in our experiments.

5.1. NLLB

NLLB is a cutting-edge multilingual translation model developed to support many languages, particularly low-resource languages. The model was trained on diverse multilingual data that included various underrepresented languages. Due to this comprehensive pretraining, NLLB can effectively handle translation tasks across many languages that typically lack sufficient data. For our experiments, we used the facebook/nllb-200-3.3B checkpoint as our teacher model and the facebook/nllb-200-distilled-600M checkpoint to build our student models. The training configuration was as follows: batch size = 8; maximum sequence length = 512 tokens; epochs = 2; learning rate = 1 × 10^{-5}; weight decay applied; and the model was saved at every epoch [40]. We limited the number of training epochs to two for all student models. This decision reflected our aim of providing practical and sustainable solutions for TSPs with limited training budgets.
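The sketch below shows how such a configuration might be expressed with the Hugging Face Seq2SeqTrainer API; it is a rough illustration under the assumption that the distilled data have already been tokenized into train_dataset and valid_dataset (truncated to 512 tokens), and the exact weight-decay value is our assumption.

```python
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

student_ckpt = "facebook/nllb-200-distilled-600M"       # student initialization
tokenizer = AutoTokenizer.from_pretrained(student_ckpt, src_lang="urd_Arab", tgt_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(student_ckpt)

args = Seq2SeqTrainingArguments(
    output_dir="student-urd-eng",
    per_device_train_batch_size=8,     # batch size = 8
    num_train_epochs=2,                # limited to two epochs
    learning_rate=1e-5,
    weight_decay=0.01,                 # weight decay applied; the exact value is an assumption
    save_strategy="epoch",             # save the model at every epoch
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,       # assumed: tokenized distilled training data
    eval_dataset=valid_dataset,        # assumed: tokenized validation data
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```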

5.2. Soft Labels for Distillation

As mentioned above, NLLB-200-3.3B was used as the teacher in our experiments. The Urdu sentences of the training data (see Section 4) were translated into English by NLLB-200-3.3B. We sampled a few hundred English translations, manually examined them, and found that some of the English translations were of low quality, often featuring repetitive words (over-generation) and contextually incorrect words. We captured the confidence score for each translation in order to understand how uncertain the model was when generating its outputs. For this, we used the negative log-likelihood (NLL) [2] loss of the predicted sequence. By applying the exponential function to the negative of this loss, we approximated the probability of the sequence, as shown in Equation (8):
\mathrm{Confidence} = \exp\!\left(\sum_{j=1}^{J} \log p(t_j \mid s, t_{<j})\right) \quad (8)

where j represents a position in the target sequence of length J, s is the source sentence, and p(t_j \mid s, t_{<j}) represents the probability of the target token at position j given s and the previously generated tokens. In this way, the sequence-level negative log-likelihood yields the joint probability of the entire sequence [2].
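As an illustrative sketch, the confidence of Equation (8) can be recovered from a teacher-forced forward pass of the teacher model over its own translation, since the cross-entropy loss returned by the model is the token-averaged NLL of the target sequence; the checkpoint names follow those listed in Section 5.1, while the helper function and variables are ours.

```python
import math
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

teacher_ckpt = "facebook/nllb-200-3.3B"
tokenizer = AutoTokenizer.from_pretrained(teacher_ckpt, src_lang="urd_Arab", tgt_lang="eng_Latn")
teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher_ckpt).eval()

def sequence_confidence(src_sentence, translation):
    """Approximate exp(sum_j log p(t_j | s, t_<j)) for the teacher's own translation."""
    inputs = tokenizer(src_sentence, return_tensors="pt")
    labels = tokenizer(text_target=translation, return_tensors="pt").input_ids
    with torch.no_grad():
        out = teacher(**inputs, labels=labels)
    # out.loss is the mean NLL per target token; multiply by the length to recover the sum.
    sum_log_prob = -out.loss.item() * labels.size(1)
    return math.exp(sum_log_prob)   # a length-normalized variant would return math.exp(-out.loss.item())

# confidence = sequence_confidence(urdu_sentence, teacher_translation)   # hypothetical variables
```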
We observed that the teacher model usually produced low confidence scores during inference when generating low-quality translations. The opposite was also observed, i.e., high confidence scores were associated with good-quality translations. In Figure 4, we show English translations of four Urdu sentences produced by the teacher with both low and high confidence scores. As can be seen from Figure 4, translations with high confidence scores are generally of good quality, whereas those with low confidence scores tend to be of lower quality. This encouraged us to leverage the confidence scores estimated by the teacher for KD. We discarded those instances for which the teacher was highly uncertain when generating translations. For our experiments, we used a grid of threshold values (the first quartile, mean, and third quartile) in the filtering process in order to examine the tradeoffs between model performance and training cost.
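A minimal sketch of the quartile-based filtering step is given below, assuming a list of (source, teacher translation, confidence) triples has already been collected with the scoring function above.

```python
import numpy as np

# distilled: list of (urdu_source, teacher_translation, confidence) triples (assumed already built)
confidences = np.array([conf for _, _, conf in distilled])
q1, q2, q3 = np.quantile(confidences, [0.25, 0.50, 0.75])   # e.g., 0.1295, 0.2175, 0.3802 in our data

def filter_by_confidence(distilled, threshold):
    """Keep only instances for which the teacher was sufficiently confident."""
    return [(src, hyp) for src, hyp, conf in distilled if conf >= threshold]

train_q3 = filter_by_confidence(distilled, q3)   # strictest filter, smallest distilled training set
```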
We followed the standard student–teacher setup in order to build our baseline student models; in other words, we treated the English translations of Urdu training sentences as pseudo-target labels for model training, appended them to the original training data, and built the baseline student model. The process of training the student model is shown in Figure 5.

6. Results and Discussion

We evaluated our MT systems on the test sets described in Section 4 and report results using three evaluation metrics: BLEU, computed with SacreBLEU [41], chrF [42], and COMET [43]. To measure the training costs of the student models, we recorded the energy consumption and carbon emissions [44] during training using the Machine Learning Emissions Calculator (https://mlco2.github.io/impact/; accessed on 1 June 2025). The CO2 emissions generated during training were computed using Equation (9):
\mathrm{CO_2\ Emissions} = \mathrm{Power\ Usage} \times TT \times C \quad (9)

where “Power Usage” is a constant determined by the hardware type (we used an A100 GPU for model training, with a power draw of 0.4 kW), TT represents the training time in hours, and C represents the carbon intensity of the power grid (C = 0.43 kg CO2/kWh). Below, we present and discuss the obtained results.
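Before turning to the results, the short sketch below gives a worked example of Equation (9), reproducing the figures reported in Table 3 for the standard distilled data (SDD) student and the third-quartile filtered student.

```python
POWER_KW = 0.4           # power draw assumed for an A100 GPU (kW)
CARBON_INTENSITY = 0.43  # carbon intensity of the power grid (kg CO2 per kWh)

def co2_emissions(training_hours):
    """Equation (9): power usage x training time x carbon intensity."""
    return POWER_KW * training_hours * CARBON_INTENSITY

print(round(co2_emissions(4.0), 2))  # SDD student, 4.0 h of training -> 0.69 kg CO2 (Table 3)
print(round(co2_emissions(0.8), 2))  # 3rd-quartile FDD student, 0.8 h -> 0.14 kg CO2 (Table 3)
```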
The performance of the teacher (NLLB-200-3.3B) and baseline (NLLB-600M) models can be seen in Table 2. The vanilla baseline rows of Table 2 show results for the NLLB-600M model used without any modifications. The final rows of Table 2 refer to the NLLB-600M baseline model trained on the original parallel data only.
It can be seen from Table 2 that the performance of the vanilla baseline model is quite good and close to the performance of the teacher model. As pointed out above, our experimental setups included two specific use cases involving industrial translation setups with access to only (i) a bilingual corpus of limited size, or (ii) a monolingual corpus of the source language. The following sections discuss the findings for each of these scenarios.

6.1. Use Case 1: Only Monolingual (Urdu) Data Available

Table 3 shows the performance of our student models. Note that the same hyperparameter configuration was chosen for all our student models and the baseline (see the last rows of Table 2), ensuring that their performance results are directly comparable. As mentioned above, all student models, including the baseline (shown in the last rows of Table 2), were trained for a maximum of two epochs. The CO2 emissions generated during training of the student models were recorded; the second-last column of Table 3 shows the CO2 emissions generated during the training of our models.
The student models in Table 3 were built using distilled data without any bilingual parallel sentences. As mentioned in Section 5.2, we considered different threshold values for the estimated confidence scores produced by the teacher model in order to filter out noisy instances from the distilled data. Specifically, we used a grid of threshold values corresponding to the first quartile (0.1295), second quartile (0.2175), and third quartile (0.3802). Our aim was to examine the relationship between the teacher’s confidence and the quality of the resulting student models. This also helped us to identify suitable threshold values for confidence-based filtering in low-resource KD scenarios. From now on, we refer to the student models built using the standard and confidence estimation-based KD setups as SKD and CEKD, respectively. The SKD model serves as the KD baseline model in our experimental setup.
As can be seen from Table 3, the student models outperform the vanilla baseline across all metrics on IN22-Gen (the baseline model (Baseline; cf. Table 2) trained on the original parallel corpus is not directly comparable to the student models reported in this section, as this use case does not include access to the parallel corpus used for training in that case). The student models also outperform the vanilla baseline in chrF and COMET on the FLORES-200 test data. SKD achieved a 0.22 BLEU point (corresponding to 0.59% relative) improvement on IN22-Gen compared to the vanilla baseline. The best-performing CEKD model was the one trained with the threshold set to the third quartile, which achieved a 0.69 BLEU point improvement (corresponding to a 1.84% relative gain) on IN22-Gen compared to the vanilla baseline. This CEKD model outperformed SKD by reasonable margins in BLEU, chrF, and COMET scores. Similar improvements were generally observed on the FLORES-200 test set as well. Figure 6 presents six plots showing how the student models performed at different confidence thresholds on both test sets in terms of BLEU, chrF, and COMET. The student model corresponding to a confidence threshold of 0.0 represents SKD, while the student models corresponding to the confidence thresholds of the first (0.1295), second (0.2175), and third (0.3802) quartiles represent CEKD. As can be seen from Figure 6, the performance of the student models generally tends to improve as the threshold values increase.
We evaluated the statistical significance of the improvements using bootstrap resampling [45]. For this, we used the standard MT evaluation tool compare-mt [46] (https://github.com/neulab/compare-mt/tree/master; accessed on 1 July 2025). We found that the differences in BLEU scores between SKD and CEKD were not statistically significant.

6.2. Use Case 2: Limited Bilingual (Urdu–English) Data Available

Table 4 shows the performance of the student models in the experiments corresponding to the second use case, in which only limited bilingual Urdu–English data were available. Notably, the student models referred to in this table were built using distilled data that contain sentence pairs from the original parallel data. As expected, the performance of these student models is significantly better than that of the student models reported in Section 6.1 (see Table 3).
As can be seen from Table 4, SKD (SDD+OPD) outperformed the baseline model across all three metrics on both test sets. Interestingly, unlike the findings in Section 6.1, there are significant performance differences between the SKD model and the baseline across all evaluation metrics on both the IN22-Gen and FLORES-200 test sets. Similar trends can be seen in the performance differences between the three CEKD (FDD+OPD) models and baseline.
We again refer to Figure 6 to discuss the impact of confidence thresholds on the performance of the student models. As above, the performance of the student models generally tends to improve as the threshold values increase, with only a few exceptions. In terms of BLEU scores, the best-performing CEKD (FDD+OPD) model is the one trained with the threshold set to the second quartile. More specifically, compared to SKD, CEKD achieves an improvement of 0.61 BLEU points (corresponding to a 1.57% relative improvement) on IN22-Gen and 0.33 BLEU points (corresponding to a 0.85% relative improvement) on FLORES-200. We see very similar performance trends with the chrF and COMET metrics. Note that no statistically significant differences in BLEU evaluation scores were found between SKD and CEKD.

6.3. Training Costs of the Student Models

The SKD and CEKD models differ significantly in terms of their training costs when considering the carbon emissions generated during the training process. We refer to the second-last columns of Table 3 and Table 4 for the differences in carbon emissions between SKD and CEKD. Figure 7 further illustrates the CO2 emissions plotted against the different confidence thresholds, providing a clear visualization of this trend. We found that the confidence estimation-based filtering method in a low-resource KD training task can reduce CO2 emissions by up to 79.7% without any performance loss. Additionally, the last columns of Table 3 and Table 4 show the time required to train the student models. It can be seen from the tables that the training time is significantly lower when using the filtered data. We note that the same architecture was used for all of our student models; as a result, their inference speeds are identical.
These findings are especially encouraging for TSPs, which often operate under resource constraints and are increasingly mindful of sustainability goals. By adopting confidence-guided knowledge distillation, TSPs can achieve high-quality translation models while substantially reducing the environmental impact of training.

7. Conclusions

In this paper, we have employed model-based confidence estimation for choosing pseudo-target labels generated by a teacher model for training student models in an NMT knowledge distillation task. We investigated two translation industry use cases operating under a constrained training budget: (i) scenarios with limited bilingual data, and (ii) scenarios where only source-language monolingual data were available for student training. We conducted experiments for a low-resource translation task (Urdu-to-English translation) both with and without taking the teacher model’s confidence measure into account. We considered NLLB-3.3B as our teacher model.
Experimental results showed (i) that student models trained on the distilled data prepared using the confidence estimation-based filtering strategy performed on par or in some instances better than the student models trained on the standard distilled data, and (ii) that the filtering strategy can significantly reduce the cost of training the student model in terms of CO2 emissions. The proposed methods can help SMEs or MT users to build compact MT systems and deploy them on resource-limited devices either via the cloud or offline to provide greener translation service that is both low-cost and high-quality. We note that the primary goal of this research was to improve the efficiency of student training rather than to improve the quality or inference speeds of the student models themselves. Our strategy can be employed in any standard KD setup; for instance, if a high-quality compact NMT model is available as the student, then our methods can be employed to reduce training costs and CO2 emissions.
We experimented with different threshold values in order to determine the impact of confidence-based data filtering in KD. We found that low threshold values (e.g., first quartile) can allow lower-quality examples to pass through the filter, which can reduce the student model’s performance. The key findings from our threshold-based experiments are as follows: (i) when only monolingual data similar to the project data were available, a high confidence threshold (e.g., third quartile) for filtering led to the best KD-based MT systems; (ii) when limited bilingual data were available, an average confidence threshold (e.g., mean) was found to be most effective for KD in MT. In future research, we aim to further fine-tune the threshold parameter. This might involve randomly sampling a set of sentences translated by the teacher model and manually annotating them for quality, either using binary labels (e.g., good vs. bad) or continuous scores via direct assessment. This will make it possible to analyse the relationship between confidence scores and human quality labels, then choose the threshold that best separates high-quality examples from low-quality ones.
In addition, we plan to explore the use of multiple teachers and to investigate the extent to which multiple teachers can be beneficial in scenarios involving limited bilingual data and training budgets. Importantly, one of the limitations of this work is that our research focused on a single low-resource translation task. In future work, we plan to explore a larger number of languages, domains, and datasets. This research aimed to develop sustainable and budget-efficient neural networks for low-resource MT in order to provide TSPs with practical solutions for deploying compact MT models that maintain high translation quality while reducing CO2 emissions; however, this study did not address other critical challenges in MT, such as mitigating model bias and reducing hallucinations in translations. In future work, we plan to further investigate these aspects using student–teacher learning frameworks.

Author Contributions

Investigation, M.Z.; Writing—original draft, M.Z.; Writing—review & editing, S.B. and R.H.; Supervision, P.J.W., S.B. and R.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research is partially funded by South East Technological University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available in OPUS [OPUS NLLB] [https://opus.nlpl.eu/NLLB/ur&en/v1/NLLB] [accessed on 1 July 2025].

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  2. Kim, Y.; Rush, A.M. Sequence-Level Knowledge Distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 1317–1327. [Google Scholar] [CrossRef]
  3. Wei, J.; Sun, L.; Leng, Y.; Tan, X.; Yu, B.; Guo, R. Sentence-Level or Token-Level? A Comprehensive Study on Knowledge Distillation. arXiv 2024, arXiv:2404.14827. [Google Scholar]
  4. Moslem, Y. Efficient Speech Translation through Model Compression and Knowledge Distillation. arXiv 2025, arXiv:2505.20237. [Google Scholar]
  5. Team, N.; Costa-jussà, M.R.; Cross, J.; Çelebi, O.; Elbayad, M.; Heafield, K.; Heffernan, K.; Kalbassi, E.; Lam, J.; Licht, D.; et al. No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv 2022, arXiv:2207.04672. [Google Scholar]
  6. Waheed, A.; Kadaoui, K.; Raj, B.; Abdul-Mageed, M. uDistil-Whisper: Label-Free Data Filtering for Knowledge Distillation in Low-Data Regimes. arXiv 2025, arXiv:2407.01257. [Google Scholar]
  7. Britz, D.; Le, Q.; Pryzant, R. Effective Domain Mixing for Neural Machine Translation. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark, 7–8 September 2017; pp. 118–126. [Google Scholar] [CrossRef]
  8. Gu, J.; Bradbury, J.; Xiong, C.; Li, V.O.K.; Socher, R. Non-Autoregressive Neural Machine Translation. arXiv 2018, arXiv:1711.02281. [Google Scholar] [PubMed]
  9. Kasai, J.; Pappas, N.; Peng, H.; Cross, J.; Smith, N.A. Deep encoder, shallow decoder: Reevaluating non-autoregressive machine translation. arXiv 2020, arXiv:2006.10369. [Google Scholar]
  10. Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  11. Wang, F.; Yan, J.; Meng, F.; Zhou, J. Selective Knowledge Distillation for Neural Machine Translation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 1–6 August 2021; pp. 6456–6466. [Google Scholar] [CrossRef]
  12. Zhou, C.; Neubig, G.; Gu, J. Understanding Knowledge Distillation in Non-autoregressive Machine Translation. arXiv 2021, arXiv:1911.02727. [Google Scholar]
  13. Bucila, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006. [Google Scholar]
  14. Mirzadeh, S.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; Ghasemzadeh, H. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI 2020—34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 5191–5198. [Google Scholar]
  15. Sanh, V.; Debut, L.; Chaumond, J.; Wolf, T. DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter. arXiv 2020, arXiv:1910.01108. [Google Scholar]
  16. Kim, Y.J.; Junczys-Dowmunt, M.; Hassan, H.; Fikri Aji, A.; Heafield, K.; Grundkiewicz, R.; Bogoychev, N. From Research to Production and Back: Ludicrously Fast Neural Machine Translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, China, 4 November 2019; pp. 280–288. [Google Scholar] [CrossRef]
  17. Yoo, K.M.; Park, D.; Kang, J.; Lee, S.W.; Park, W. GPT3Mix: Leveraging Large-scale Language Models for Text Augmentation. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 16–20 November 2021; pp. 2225–2239. [Google Scholar] [CrossRef]
  18. Schick, T.; Schütze, H. Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference. arXiv 2021, arXiv:2001.07676. [Google Scholar]
  19. Zhou, Y.; Maharjan, S.; Liu, B. Scalable Prompt Generation for Semi-supervised Learning with Language Models. arXiv 2023, arXiv:2302.09236. [Google Scholar]
  20. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv 2023, arXiv:2210.03629. [Google Scholar]
  21. Ho, N.; Schmid, L.; Yun, S.Y. Large Language Models Are Reasoning Teachers. arXiv 2023, arXiv:2212.10071. [Google Scholar]
  22. He, N.; Lai, H.; Zhao, C.; Cheng, Z.; Pan, J.; Qin, R.; Lu, R.; Lu, R.; Zhang, Y.; Zhao, G.; et al. TeacherLM: Teaching to Fish Rather Than Giving the Fish, Language Modeling Likewise. arXiv 2024, arXiv:2310.19019. [Google Scholar]
  23. Shridhar, K.; Stolfo, A.; Sachan, M. Distilling Reasoning Capabilities into Smaller Language Models. In Proceedings of the Findings of the Association for Computational Linguistics: ACL, Toronto, ON, Canada, 9–14 July 2023; pp. 7059–7073. [Google Scholar] [CrossRef]
  24. Hsieh, C.Y.; Li, C.L.; Yeh, C.k.; Nakhost, H.; Fujii, Y.; Ratner, A.; Krishna, R.; Lee, C.Y.; Pfister, T. Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes. In Proceedings of the Findings of the Association for Computational Linguistics: ACL, Toronto, ON, Canada, 9–14 July 2023; pp. 8003–8017. [Google Scholar] [CrossRef]
  25. Wang, P.; Li, L.; Chen, L.; Song, F.; Lin, B.; Cao, Y.; Liu, T.; Sui, Z. Making Large Language Models Better Reasoners with Alignment. arXiv 2023, arXiv:2309.02144. [Google Scholar]
  26. Li, Y.; Yuan, P.; Feng, S.; Pan, B.; Sun, B.; Wang, X.; Wang, H.; Li, K. Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data. arXiv 2023, arXiv:2312.12832. [Google Scholar] [CrossRef]
  27. Chen, H.; Wu, S.; Quan, X.; Wang, R.; Yan, M.; Zhang, J. MCC-KD: Multi-CoT Consistent Knowledge Distillation. arXiv 2023, arXiv:2310.14747. [Google Scholar]
  28. Liu, W.; Li, G.; Zhang, K.; Du, B.; Chen, Q.; Hu, X.; Xu, H.; Chen, J.; Wu, J. Mind’s Mirror: Distilling Self-Evaluation Capability and Comprehensive Thinking from Large Language Models. arXiv 2024, arXiv:2311.09214. [Google Scholar]
  29. Currey, A.; Mathur, P.; Dinu, G. Distilling Multiple Domains for Neural Machine Translation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; pp. 4500–4511. [Google Scholar] [CrossRef]
  30. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar] [CrossRef]
  31. Yu, D.; Backurs, A.; Gopi, S.; Inan, H.; Kulkarni, J.; Lin, Z.; Xie, C.; Zhang, H.; Zhang, W. Training Private and Efficient Language Models with Synthetic Data from LLMs. In Proceedings of the Socially Responsible Language Modelling Research, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  32. Jooste, W.; Haque, R.; Way, A. Knowledge Distillation: A Method for Making Neural Machine Translation More Efficient. Information 2022, 13, 88. [Google Scholar] [CrossRef]
  33. Li, J.; Nag, S.; Liu, H.; Tang, X.; Sarwar, S.; Cui, L.; Gu, H.; Wang, S.; He, Q.; Tang, J. Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data. arXiv 2025, arXiv:2411.08028. [Google Scholar]
  34. Koneru, S.; Liu, D.; Niehues, J. Cost-Effective Training in Low-Resource Neural Machine Translation. arXiv 2022, arXiv:2201.05700. [Google Scholar]
  35. Dong, S.; Wang, P.; Abbas, K. A survey on deep learning and its applications. Neural Comput. Appl. 2021, 33, 6273–6293. [Google Scholar] [CrossRef]
  36. Balakrishnan, R.; Geetha, V.; Kumar, M.; Leung, M.F. Reduction in Residential Electricity Bill and Carbon Dioxide Emission through Renewable Energy Integration Using an Adaptive Feed-Forward Neural Network System and MPPT Technique. Sustainability 2023, 15, 14088. [Google Scholar] [CrossRef]
  37. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv 2023, arXiv:1706.03762. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  39. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar]
  40. Zafar, M.; Castaldo, A.; Nayak, P.; Haque, R.; Way, A. The SETU-ADAPT Submissions to WMT 2024 Chat Translation Tasks. In Proceedings of the Ninth Conference on Machine Translation, Miami, FL, USA, 15–16 November 2024; pp. 1023–1030. [Google Scholar] [CrossRef]
  41. Post, M. A call for clarity in reporting BLEU scores. arXiv 2018, arXiv:1804.08771. [Google Scholar]
  42. Popović, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 17–18 September 2015; pp. 392–395. [Google Scholar] [CrossRef]
  43. Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A Neural Framework for MT Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–18 November 2020; pp. 2685–2702. [Google Scholar] [CrossRef]
  44. Lacoste, A.; Luccioni, A.; Schmidt, V.; Dandres, T. Quantifying the carbon emissions of machine learning. arXiv 2019, arXiv:1910.09700. [Google Scholar]
  45. Koehn, P. Statistical Significance Tests for Machine Translation Evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2004; pp. 388–395. [Google Scholar]
  46. Neubig, G.; Dou, Z.; Hu, J.; Michel, P.; Pruthi, D.; Wang, X.; Wieting, J. Compare-mt: A Tool for Holistic Comparison of Language Generation Systems. arXiv 2019, arXiv:1903.07926. [Google Scholar]
Figure 1. Transformer model architecture as in [37].
Figure 2. Scaled dot-product attention, as in [37].
Figure 3. Multi-head attention, as in [37].
Figure 4. Examples of English translations of Urdu sentences produced by the teacher model with high and low confidence scores. The Urdu source sentences are accompanied by their English transliterations.
Figure 5. Confidence filtering in KD. SDD: Standard Distilled Data, FDD: Filtered Distilled Data, OPD: Original Parallel Data.
Figure 6. Student model performance vs. confidence threshold. The models corresponding to a confidence threshold of 0.0 represent SKD, while the models corresponding to the confidence thresholds of the first (0.1295), second (0.2175), and third (0.3802) quartiles represent CEKD.
Figure 7. CO2 emissions vs. confidence threshold. The models corresponding to a confidence threshold of 0.0 represent SKD, while the models corresponding to the confidence thresholds of the first (0.1295), second (0.2175), and third (0.3802) quartiles represent CEKD.
Table 1. Data statistics.

Dataset | Sentences | Vocabulary (Urdu) | Vocabulary (English)
Train | 188,130 | 152,943 | 166,848
Valid | 1000 | 5488 | 4996
IN22-Gen | 1023 | 6708 | 7387
FLORES-200 | 994 | 5775 | 5336
Table 2. Performance of the teacher and baseline models. Teacher model: NLLB-200-3.3B; baseline models: NLLB-200-600M.

Model | Test Set | SacreBLEU | chrF | COMET
NLLB (teacher) | IN22-Gen | 40.57 | 66.62 | 85.55
NLLB (teacher) | FLORES-200 | 39.02 | 60.99 | 87.35
Vanilla Baseline (NLLB-600M) | IN22-Gen | 37.48 | 63.31 | 83.73
Vanilla Baseline (NLLB-600M) | FLORES-200 | 31.87 | 58.80 | 82.63
Baseline (NLLB-600M) | IN22-Gen | 37.38 | 65.17 | 84.56
Baseline (NLLB-600M) | FLORES-200 | 31.05 | 60.62 | 83.72
Table 3. Performance of the student models. SDD: Standard Distilled Data, FDD: Filtered Distilled Data. Train: number of training sentences. CO2 is calculated in kg. TT is the training time in hours. All student models used the same architecture and had identical size and memory usage.

Model | Test Set | SacreBLEU | chrF | COMET | Train | CO2 (kg) | TT (h)
Vanilla Baseline | IN22-Gen | 37.48 | 63.31 | 83.73 | - | - | -
Vanilla Baseline | FLORES-200 | 31.87 | 58.80 | 82.63 | - | - | -
SDD | IN22-Gen | 37.70 | 65.46 | 84.65 | 188,130 | 0.69 | 4.0
SDD | FLORES-200 | 31.19 | 60.87 | 83.72 | | |
FDD (1st Quartile) | IN22-Gen | 37.85 | 65.42 | 84.62 | 153,754 | 0.47 | 2.7
FDD (1st Quartile) | FLORES-200 | 30.89 | 60.65 | 83.66 | | |
FDD (2nd Quartile) | IN22-Gen | 37.74 | 65.51 | 84.74 | 85,298 | 0.25 | 1.4
FDD (2nd Quartile) | FLORES-200 | 31.02 | 60.81 | 83.68 | | |
FDD (3rd Quartile) | IN22-Gen | 38.17 | 65.94 | 85.10 | 49,517 | 0.14 | 0.8
FDD (3rd Quartile) | FLORES-200 | 31.38 | 61.15 | 84.01 | | |
Table 4. Performance of the fine-tuned NLLB-200-600M student models. SDD: Standard Distilled Data, FDD: Filtered Distilled Data, OPD: Original Parallel Data. Train: number of training sentences. CO2 is calculated in kg. TT is the training time in hours. All student models used the same architecture and had identical size and memory usage.

Model | Test Set | SacreBLEU | chrF | COMET | Train | CO2 (kg) | TT (h)
Baseline | IN22-Gen | 37.38 | 65.17 | 84.56 | - | - | -
Baseline | FLORES-200 | 31.05 | 60.62 | 83.72 | - | - | -
SDD+OPD | IN22-Gen | 38.70 | 66.10 | 84.90 | 376,260 | 1.08 | 4.5
SDD+OPD | FLORES-200 | 31.86 | 61.06 | 83.90 | | |
FDD+OPD (1st Quartile) | IN22-Gen | 39.12 | 66.34 | 84.81 | 341,884 | 0.67 | 3.86
FDD+OPD (1st Quartile) | FLORES-200 | 31.99 | 61.11 | 83.91 | | |
FDD+OPD (2nd Quartile) | IN22-Gen | 39.31 | 66.43 | 85.05 | 273,428 | 0.66 | 3.83
FDD+OPD (2nd Quartile) | FLORES-200 | 32.19 | 61.31 | 84.01 | | |
FDD+OPD (3rd Quartile) | IN22-Gen | 39.19 | 66.40 | 85.09 | 237,647 | 0.61 | 3.5
FDD+OPD (3rd Quartile) | FLORES-200 | 32.16 | 61.29 | 84.01 | | |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

