Article

Reference-Less Evaluation of Machine Translation: Navigating Through the Resource-Scarce Scenarios

by Archchana Sindhujan 1,2, Diptesh Kanojia 1,2 and Constantin Orăsan 3,*
1 Surrey Institute for People-Centred AI (PAI), Guildford GU2 7XH, UK
2 Department of Computer Science and Electronic Engineering, University of Surrey, Guildford GU2 7JN, UK
3 Centre for Translation Studies, University of Surrey, Guildford GU2 7XH, UK
* Author to whom correspondence should be addressed.
Information 2025, 16(10), 916; https://doi.org/10.3390/info16100916
Submission received: 29 August 2025 / Revised: 3 October 2025 / Accepted: 5 October 2025 / Published: 18 October 2025

Abstract

Reference-less evaluation of machine translation, or Quality Estimation (QE), is vital for low-resource language pairs where high-quality references are often unavailable. In this study, we investigate segment-level QE methods, comparing encoder-based models such as MonoTransQuest, CometKiwi, and xCOMET with various decoder-based methods (Tower+, ALOPE, and other instruction-fine-tuned language models). Our work focuses on eight low-resource language pairs, with English on either the source or the target side of the translation. Results indicate that while fine-tuned encoder-based models remain strong performers across most low-resource language pairs, decoder-based Large Language Models (LLMs) show clear improvements when adapted through instruction tuning. Importantly, the ALOPE framework further enhances LLM performance beyond standard fine-tuning, demonstrating its effectiveness in narrowing the gap with encoder-based approaches and highlighting its potential as a viable strategy for low-resource QE. In addition, our experiments demonstrate that with adaptation techniques such as LoRA (Low-Rank Adapters) and quantization, decoder-based QE models can be trained with competitive GPU memory efficiency, though they generally require substantially more disk space than encoder-based models. Our findings highlight the effectiveness of encoder-based models for low-resource QE and suggest that advances in cross-lingual modeling will be key to improving LLM-based QE in the future.

1. Introduction

Machine translation (MT) has become increasingly essential due to the significant digitization of data over the last decade. MT has also become more accurate thanks to advancements in Natural Language Processing (NLP) and the availability of large-scale parallel corpora [1,2,3]. The availability of hardware capable of processing data streams in parallel is another significant contributing factor to the rise of large-scale MT systems. The growth of MT usage can be observed in the success of Google Translate, which surpassed a billion installations in March 2021 [4]. Furthermore, it translates more than 100 billion words a day, which highlights the demand for and use of publicly available MT systems [5].
Although MT has come a long way in recent years, it still has limitations, and the translated output is not perfect. However, most commercial MT systems do not explicitly state this to the user, leaving the user to assume that the output is accurate. Most large-scale MT systems perform better for high-resource languages such as English, whereas languages with limited available resources, even those that are widely spoken, often exhibit lower translation quality. Limited resources pose a challenge for achieving high-quality translations [6].
Most automatic evaluation metrics for machine translation output are reference-based metrics, which assess the quality of a machine-generated translation by comparing it against one or more gold standard human translations. While well-established metrics like BLEU [7] were instrumental in the early progress of statistical and neural MT, the field has increasingly recognised the limitations inherent in these reference-based metrics. The main challenge is the prohibitive cost and effort associated with creating high-quality reference translations. This process is labour-intensive and time-consuming, requiring skilled professional translators, which is not feasible for many real-world applications.
Given the limitations of reference-dependent methods, the field has refocused towards quality estimation (QE) as the primary paradigm for reference-less evaluation. QE is formally defined as the task of predicting the quality of a machine translation system’s output, using only the source text and the generated translation as input, without access to any human reference. This reframes the problem from a direct comparison against a “correct” answer to a prediction task, where a model learns to mimic the quality judgments of human evaluators.
The primary focus of recent QE research has been at the segment level (i.e., a sentence) [8], where QE is modeled as a regression problem. The objective is to train a model that can predict a continuous quality score for a given source–target segment pair. To train and evaluate these QE models, the field relies on human-annotated data that serves as the ground truth. This ground truth is usually annotated either as Direct Assessment (DA) [8] scores or Multidimensional Quality Metrics (MQM) scores [9], which have become standard for segment-level QE. While MQM [9] enables detailed identification of translation errors, it is cognitively demanding for annotators and requires substantial effort to implement. DA scores, on the other hand, provide a simpler and less time-consuming approach that can effectively complement these frameworks in assessing segment-level MT quality. In this study, we focus exclusively on DA scores as the reference standard for segment-level QE.
The primary objective of this paper is to provide a comprehensive empirical comparison of current state-of-the-art approaches for segment-level quality estimation of low-resource languages. We begin by outlining the necessary background in Section 2 to contextualize the different QE methodologies later. Section 3 describes the eight low-resource language pair datasets employed in our experiments. Next, in Section 4, we detail both encoder-based approaches (such as CometKiwi, TransQuest, and xCOMET) and decoder-based approaches (large language models) for QE. Section 5 presents our experimental setup and the evaluation metrics adopted for model assessment. We then report and analyze the performance of these methods and models across all language pairs in Section 6, offering insights into their strengths and limitations. Finally, Section 7 summarises our key findings and discusses their implications for future research in reference-less QE for low-resource languages.

2. Background

2.1. Quality Estimation

The first methods proposed in quality estimation for machine translation involved feature-based engineering along with incorporating standard machine learning approaches like Support Vector Machines and simple neural network architectures [10,11]. Later, deep learning-based approaches were utilized for QE, introducing Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs) [12,13,14] for the task. In recent years, Transformer-based architectures have gained prominence in the field of machine translation and quality estimation [15,16,17,18,19]. XLM-R [20], mBART [21], BERT (Bidirectional Encoder Representations from Transformers) [22] are some of the encoder-based transformer approaches used for MT quality estimation. Researchers have also proposed decoder-based QE approaches, which solely focus on generative large language models (LLMs) for estimating the quality of translation [23,24,25]. We examine the transformer-based QE approaches in detail in Section 4.
In addition to single-model approaches, various ensemble methods have been proposed, where the outputs of multiple models are combined to yield more robust and reliable estimates of machine translation quality [12,13,15,16,19,26,27,28,29]. Furthermore, several studies have introduced multi-task learning strategies that jointly optimize related objectives, such as sentence-level and word-level quality estimation to enhance overall performance [15,26,28,30].

2.2. Encoder- and Decoder-Based Language Models

Language models derived from the Transformer architecture can be broadly classified into two principal paradigms based on their underlying structure: encoder-based and decoder-based models. Each paradigm possesses a distinct architectural design that influences its operational mechanism and functional specialization.
Encoder-based architectures, exemplified by models like BERT, leverage the encoder stack of the Transformer to process input sequences in their entirety [22]. This architecture facilitates the learning of deep bidirectional contexts, where the representation of each token is informed by both preceding and subsequent tokens. A prevalent pre-training objective for these models is Masked Language Modeling (MLM), which involves training the model to predict masked tokens from their surrounding unmasked context.
In contrast, decoder-based architectures are constructed exclusively from the decoder stack and are designed for autoregressive generation [31]. These models operate sequentially, generating each token conditioned on the previously generated sequence. This unidirectional, forward-looking process makes them highly adept at natural language generation, and they can also be adapted to a broad range of other tasks.
Therefore, the fundamental architectural difference between these models, bidirectional context processing in encoders versus unidirectional, causal modeling in decoders, results in a clear functional specialization. Encoder models are optimized for tasks requiring input comprehension, whereas decoder models are specialized for sequential content generation.

2.3. Direct Assessment Scores (DA Scores)

DA [8] is a widely adopted evaluation protocol in which human annotators provide a quality judgment on a continuous scale ranging from 0 to 100 for each translation. This single score is intended to capture an overall assessment of the translation based on fluency, adequacy, and general comprehensibility. Given the inherently subjective nature of human evaluation, DA mitigates individual bias by aggregating scores from multiple annotators (usually three or more) through mean averaging. This results in a more reliable and stable ground truth signal for training and evaluating QE models.
The FLORES annotation method [32] is commonly used as the guideline for the DA score-based quality estimation. According to the provided guidelines for annotators, the scoring range from 0 to 10 indicates an incorrect translation, 11 to 29 represents a translation with few correct keywords but a different overall meaning from the source, 30 to 50 indicates a translation with major mistakes, 51 to 70 suggests a translation that is understandable but contains typos or grammatical errors, 71 to 90 signifies a translation that closely preserves the semantics of the source sentence, and 91 to 100 indicates a perfect translation. Figure 1 shows an example of the whole DA score annotation process. The DA framework has become a cornerstone in recent QE shared tasks, offering a standardised and scalable approach for benchmarking model performance across diverse language pairs and evaluation settings.

3. Data

Our work investigates low-resource language pairs by employing subsets of the datasets from the WMT quality estimation shared tasks, which include human-annotated DA scores. The data encompasses several language pairs from the WMT23 QE shared task [33]: English to Hindi (En-Hi), Gujarati (En-Gu), Marathi (En-Mr), Telugu (En-Te), and Tamil (En-Ta). Additionally, the study incorporates language pairs from the WMT22 task [34], including Estonian to English (Et-En), Nepali to English (Ne-En), and Sinhala to English (Si-En). Although Hindi and Estonian are classified as fairly-resourced for machine translation, they are considered low-resource in the context of translation evaluation and QE due to a lack of sufficient data. In the experimental design, the training splits were utilised for fine-tuning models, while the test splits were allocated for zero-shot and inference experiments (data split details can be found in Table 1).

4. Methods

This section outlines the approaches employed for segment-level QE in our study. We consider both encoder-based and decoder-based approaches, reflecting the two dominant methods in current QE research. We start with the state-of-the-art encoder-based methods for QE in Section 4.1, where we examine MonoTransQuest, CometKiwi, and xCOMET, each incorporating distinct architectural designs and pre-training strategies. The decoder-based approaches, on the other hand, build upon large generative language models that were originally designed for text generation but have recently shown promise in reference-less evaluation. These methods, discussed in Section 4.2, include instruction-fine-tuned LLMs, the translation-optimized Tower+ LLM, and the ALOPE framework. By systematically studying these methods, we aim to provide a comprehensive comparison of how different architectures, ranging from lightweight encoder models to large-scale decoder LLMs, address the challenges of QE, particularly in low-resource settings.

4.1. Encoder-Based Methods

4.1.1. TransQuest

The TransQuest framework provides two neural architectures for segment-level quality estimation: MonoTransQuest and SiameseTransQuest [16]. We adopt MonoTransQuest, which encodes the source and translation jointly with a single transformer, as it consistently outperformed the Siamese variant [16]. The source and the machine-translated sentences are concatenated with a [SEP] token, and the [CLS] embedding is fine-tuned to regress a quality score. MonoTransQuest avoids multi-stage complexity, requires fewer resources, and has shown strong performance in WMT shared tasks [33,35,36], making it widely adopted for multilingual QE. For our experiments, we adapt MonoTransQuest with three pre-trained multilingual encoders:
XLM-Roberta (XLM-R) is a multilingual transformer trained on CommonCrawl data [37] across 100 languages using a masked language modeling objective [20]. Its broad coverage of languages and shared understanding of linguistic structures make it a strong baseline for QE.
XLM-V improves upon XLM-R by constructing a one-million-token vocabulary through clustering, leading to more precise and diverse tokenization across languages [38].
InfoXLM-Large extends XLM-R with a contrastive learning objective that maximizes semantic alignment between translation pairs, enhancing cross-lingual understanding [39].
These specific models were selected for our encoder-based QE experiments because their distinct pre-training strategies offer diverse approaches to estimating translation quality. All of these models have approximately 550M parameters, and we fine-tuned them with a learning rate of 2 × 10−5 and a batch size of 8 for three epochs, using the AdamW optimizer. Our TransQuest-based models are made publicly available on HuggingFace (TransQuest-Models, https://huggingface.co/collections/surrey-nlp/monotransquest-68daba485984dc51f5887764, accessed on 4 October 2025). By examining these varied models, the research aims to identify which pre-training methods of encoder-based transformers are most effective for QE and how they leverage linguistic similarities in the training data to improve performance and efficiency. Our adapted models with the TransQuest architecture are trained utilizing the data described in Section 3.
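To make the setup concrete, the sketch below shows a MonoTransQuest-style regressor built with the Hugging Face transformers library: the source and translation are encoded jointly by a single encoder, and a linear head regresses a quality score from the [CLS] representation. This is an illustrative reconstruction, not the TransQuest implementation; the dropout value and the example sentence pair are our assumptions, while the optimizer and hyperparameters restate those above.

```python
# Minimal sketch of a MonoTransQuest-style QE regressor (illustrative only,
# not the official TransQuest code). Assumes an XLM-R-large encoder, as above.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class MonoQERegressor(nn.Module):
    def __init__(self, encoder_name: str = "xlm-roberta-large"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Sequential(nn.Dropout(0.1), nn.Linear(hidden, 1))

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]       # [CLS] representation
        return self.head(cls).squeeze(-1)       # one quality score per pair

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
model = MonoQERegressor()

# Source and MT output are encoded jointly; the tokenizer inserts the
# separator token between the two segments.
batch = tokenizer(
    ["The weather is lovely today."],           # source (illustrative)
    ["आज मौसम बहुत सुहावना है।"],                  # machine translation (illustrative)
    padding=True, truncation=True, return_tensors="pt",
)
scores = model(batch["input_ids"], batch["attention_mask"])

# Fine-tuning minimises MSE against DA scores with AdamW,
# lr = 2e-5, batch size 8, 3 epochs, as described above.
loss_fn = nn.MSELoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```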

4.1.2. CometKiwi

The CometKiwi [15,40] architecture employs a multi-stage process involving large-scale pre-trained encoders, task-specific fine-tuning, and an optimized ensembling technique. The system processes a machine-translated sentence and its original source sentence by concatenating them into a single input sequence formatted as [cls] target [sep] source [eos]. This combined sequence is then fed into a pre-trained multilingual encoder (i.e., InfoXLM, XLM-R-XL or XLM-R-XXL). Instead of utilizing the final layer’s output, the model calculates a learned, weighted sum of the hidden states from all encoder layers using a “scalar mix” module. This module employs sparsemax [41] to identify and prioritize the most relevant layers for the task, effectively ignoring unhelpful ones. For sentence-level quality score prediction, the resulting aggregated hidden state for the special [cls] token is passed through a two-layer feed-forward network to produce a final score.
The latest CometKiwi models [40] are trained on a massive corpus of over 940K samples covering 38 different language pairs. This data includes Direct Assessment annotations from past WMT tasks (2017–2020), the MLQE-PE dataset, and the data from the 2023 QE shared task. After the generic training phase, the models are further fine-tuned for the sentence-level task [40]. For our study, we utilised the publicly available fine-tuned models CometKiwi-22 (Unbabel/wmt22-CometKiwi-da, https://huggingface.co/Unbabel/wmt22-CometKiwi-da, accessed on 4 October 2025) (w/InfoXLM) and CometKiwi-23 (Unbabel/wmt23-CometKiwi-da-xl, https://huggingface.co/Unbabel/wmt23-CometKiwi-da-xl, accessed on 4 October 2025) (w/XLM-R-XL) from HuggingFace, where CometKiwi-22 contains approximately 550M parameters and CometKiwi-23 contains 3.5B parameters.
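Segment-level scores from these checkpoints can be obtained with the unbabel-comet Python package; the snippet below is a minimal usage sketch (the example sentences and the need to accept the model licence on the Hugging Face Hub are assumptions on our part).

```python
# Minimal sketch of reference-less scoring with CometKiwi-22 via the
# unbabel-comet package (pip install unbabel-comet). Access to the checkpoint
# may require accepting the model licence on the Hugging Face Hub first.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Reference-less QE: each sample needs only the source and the MT output.
data = [
    {"src": "The weather is lovely today.", "mt": "आज मौसम बहुत सुहावना है।"},
    {"src": "Good morning!", "mt": "शुभ प्रभात!"},
]

output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # one segment-level score per sample
print(output.system_score)  # corpus-level average
```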

4.1.3. xCOMET

xCOMET [42] is an open-source model that extends beyond sentence-level scoring by also detecting and categorizing error spans with severity labels. Its unified architecture supports multiple modes of evaluation, including reference-free quality estimation, which is central to our study. Built on pre-trained multilingual encoders, xCOMET is trained starting with Direct Assessment scores from WMT17–20 [43] for regression, and then incorporating datasets such as IndicMT [44], DEMETR [45], and MLQE-PE [43], along with synthetic data for error span detection. Through this staged training, it achieves robust performance in reference-less QE, benefiting from direct supervision on human-provided DA scores. We have utilized this pre-trained model (xCOMET-XL, https://huggingface.co/Unbabel/XCOMET-XL, accessed on 4 October 2025) with 3.5B parameters for our experiments.

4.2. Decoder-Based Methods

To experiment with decoder-based methods of QE, we selected different types of LLM-based methods: instruction-fine-tuned LLMs [25], Tower+ [46], and ALOPE [47].

4.2.1. Instruction-Fine-Tuned LLMs for QE

In this work, we choose decoder-based LLMs to evaluate their effectiveness in segment-level reference-less QE, particularly in low-resource settings. Our goal is to assess how well these models, despite being originally trained for text generation, can be adapted to predict scalar quality scores for translations. We specifically selected the following models based on their promising performance in multilingual contexts and open-source availability: Llama-2-7B (meta-llama/Llama-2-7b-chat-hf, https://huggingface.co/meta-llama/LLaMA-2-7b-chat-hf, accessed on 4 October 2025), OpenChat-7B (openchat/openchat-3.5, https://huggingface.co/OpenChat/OpenChat-3.5-1210, accessed on 4 October 2025), Gemma-2-9B (google/gemma-2-9b, https://huggingface.co/google/gemma-2-9b, accessed on 4 October 2025), Llama-3.1-8B (meta-llama/Llama-3.1-8B-Instruct, https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct, accessed on 4 October 2025), Aya-Expanse-8B (CohereLabs/aya-expanse-8b, https://huggingface.co/CohereForAI/aya-expanse-8b, accessed on 4 October 2025) and Llama-3.2-3B (meta-llama/Llama-3.2-3B-Instruct, https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct, accessed on 4 October 2025).
Our selection covers a range of sizes (9B parameters or fewer) and architectures within the decoder-only paradigm, allowing us to systematically explore the extent to which generative LLMs can be repurposed for QE tasks while remaining efficient and deployable under realistic compute budgets. For fine-tuning, we configured a learning rate of 5 × 10−5, a batch size of 2, and two training epochs, and employed LoRA adapters with a rank of 64 to optimize the models efficiently. All our instruction-tuned models are made publicly available on HuggingFace (IFT-Models, https://huggingface.co/collections/surrey-nlp/instruction-fine-tuned-models-gemba-prompt-68dd1ba561450d3d08a4bce2, accessed on 4 October 2025).
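A minimal sketch of this adapter configuration, using the transformers and peft libraries, is shown below; the LoRA target modules and alpha value are illustrative assumptions, while the rank and the training hyperparameters restate those above.

```python
# Sketch of parameter-efficient instruction fine-tuning for QE with LoRA.
# The target modules shown here are a common choice for Llama-style models
# and are an assumption, not a detail taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model = "meta-llama/Llama-3.2-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)   # used to build prompts
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=64,                        # adapter rank used in our experiments
    lora_alpha=128,              # illustrative scaling value
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the adapters are trainable

# Training then proceeds with a standard causal-LM objective over
# GEMBA-style prompts (Box 1): lr = 5e-5, batch size 2, 2 epochs.
```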

4.2.2. TOWER+

Tower+ [46] is a family of multilingual LLMs built with a four-stage post-training pipeline to combine state-of-the-art translation ability with general-purpose instruction-following and reasoning. The pipeline begins with Continued Pretraining (CPT) on monolingual (FineWeb-Edu [48]) and parallel (OPUS [49]) data, followed by Supervised Fine-Tuning (SFT) on curated instruction datasets such as OpenHermes-2.5, Aya [50], and Tülu [51]. Model alignment is then refined with Weighted Preference Optimization (WPO) [52] using prompts, human feedback, and post-edits, and finally with Reinforcement Learning with Verifiable Rewards (RLVR) on the Tülu 3 dataset and programmatically generated tasks.
We employ the Tower+ model (Tower-Plus-72B, https://huggingface.co/Unbabel/Tower-Plus-72B, accessed on 4 October 2025) in our study because its training explicitly included tasks for evaluating translation quality, with the final “translation preference evaluation” stage teaching the model to compare translations and reason about their quality, making it particularly well-suited for QE tasks.

4.2.3. ALOPE

The Adaptive Layer Optimization for Translation Quality Estimation (ALOPE) framework is a novel approach designed to adapt LLMs for segment-level quality estimation by restructuring their internal representations for regression-specific tasks [47]. Standard instruction-tuned LLMs are optimized for next-token prediction and often struggle with fine-grained regression outputs. ALOPE addresses this gap by integrating regression heads with low-rank adapters (LoRAs) [53] across selected Transformer layers, enabling efficient and targeted fine-tuning for QE. Unlike general instruction-tuned LLMs, which rely solely on the final Transformer layer embeddings to produce outputs, ALOPE allows access to embeddings from any Transformer layer. This flexibility enables the identification of the layer that yields the most effective cross-lingual representations for low-resource languages.
Within the ALOPE framework, regression heads attached to specific Transformer layers are denoted using negative indices to indicate their position relative to the final Transformer layer. For instance, a regression head at TL(-1) extracts representations from the last Transformer layer, whereas TL(-5) refers to the fifth layer from the end. This negative indexing provides a uniform and model-agnostic way to reference layers across different LLMs with varying depths. To determine the most effective layer for quality estimation, we experimented with several Transformer layers, as detailed in Section 6.2. Rather than relying solely on the final Transformer layer, ALOPE systematically explores intermediate and lower layers as well.
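A minimal sketch of this layer-specific idea, using the hidden states exposed by the Hugging Face transformers library, is given below; it is an illustrative reconstruction rather than the ALOPE implementation, and the pooling over the last non-padded token as well as the omission of LoRA and 4-bit quantization are our simplifications.

```python
# Sketch of a regression head attached to an intermediate decoder layer,
# in the spirit of ALOPE's layer-specific approach (illustrative only).
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

class LayerSpecificQEHead(nn.Module):
    def __init__(self, base_model: str, layer_index: int = -7):
        super().__init__()
        self.layer_index = layer_index   # negative index: position from the end
        self.backbone = AutoModelForCausalLM.from_pretrained(
            base_model, output_hidden_states=True, torch_dtype=torch.bfloat16
        )
        hidden = self.backbone.config.hidden_size
        self.regressor = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states[-1] is the final layer; hidden_states[-7] is treated
        # here as TL(-7). The exact layer mapping is an assumption.
        states = out.hidden_states[self.layer_index]
        # Pool the representation of the last non-padded token (assumption).
        last_pos = attention_mask.sum(dim=1) - 1
        pooled = states[torch.arange(states.size(0)), last_pos]
        return self.regressor(pooled.float()).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
model = LayerSpecificQEHead("meta-llama/Llama-3.2-3B-Instruct", layer_index=-7)
batch = tokenizer("English source: Good morning! Hindi translation: शुभ प्रभात!",
                  return_tensors="pt")
score = model(batch["input_ids"], batch["attention_mask"])
```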
In addition to the layer-specific approach, ALOPE introduces two strategies: (i) dynamic weighting, where trainable scalar weights determine the relative contribution of multiple layers, and (ii) multi-head regression, where regression heads are attached to several layers simultaneously and their outputs are aggregated during training. Both strategies enable the model to leverage multi-layer contextual information. All ALOPE-based models were trained with a learning rate of 5 × 10−5, a batch size of 2, and a single training epoch, using LoRA adapters with a rank of 32 and an alpha value of 16, and with all weights quantized to 4-bit precision. ALOPE models are made publicly available on HuggingFace (ALOPE-Models, https://huggingface.co/collections/surrey-nlp/alope-models-68dd1d660e3249cb6b3f4977, accessed on 4 October 2025).
Overall, ALOPE provides a flexible, scalable, and model-agnostic framework that allows existing generative LLMs to be effectively repurposed for reference-less QE. By exploiting intermediate representations to obtain the best cross-lingual representations and incorporating regression-specific objectives, it offers an efficient and practical solution for deploying LLMs for low-resource translation QE.

5. Experiment Setup

All our encoder-based evaluations are conducted using the pre-fine-tuned models. The methods we evaluate (TransQuest, CometKiwi, and xCOMET) are built upon distinct underlying architectures which are pre-trained and fine-tuned with the labeled DA score data, as detailed in Section 4.1. However, for comparability, all models are evaluated on the same test datasets, as shown in Table 1.
In all our decoder LLM-based experiments, we employed the GEMBA Prompt (see Box 1), a simple and effective template introduced by Kocmi and Federmann [54], to guide the model in generating a quality score for each machine-translated segment. These experiments were conducted under two main configurations: zero-shot inference and instruction fine-tuning. The instruction fine-tuning setting included two variants: instruction fine-tuning of LLMs and ALOPE-based instruction fine-tuning. In both variants, models were fine-tuned on the combined training data from all eight language pairs, and inference was performed on the language-specific test sets shown in Table 1. All selected LLMs in our study have a parameter count of 9B or fewer, with the exception of Tower+ 72B. Despite its larger size, Tower+ was included in the zero-shot setting due to its recent release and specialized pre-training on translation data, which we hypothesize may offer advantages for the QE task.
Box 1. GEMBA Prompt.
Score the following translation from {Source Language} to {Target Language} on a continuous scale from 0 to 100, where a score of zero means “no meaning preserved” and a score of one hundred means “perfect meaning and grammar”.
{Source Language} source: {Source Sentence}
{Target Language} translation: {Translated Sentence}
Score:
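In practice, we instantiate this template for each source–translation pair; the helper below is an illustrative sketch (the function name and the example sentences are ours).

```python
# Illustrative helper that fills the GEMBA prompt template from Box 1.
GEMBA_TEMPLATE = (
    "Score the following translation from {src_lang} to {tgt_lang} on a "
    "continuous scale from 0 to 100, where a score of zero means \"no meaning "
    "preserved\" and a score of one hundred means \"perfect meaning and grammar\".\n"
    "{src_lang} source: {src}\n"
    "{tgt_lang} translation: {mt}\n"
    "Score:"
)

def build_gemba_prompt(src_lang, tgt_lang, src, mt):
    """Return the GEMBA prompt for one source-translation pair."""
    return GEMBA_TEMPLATE.format(src_lang=src_lang, tgt_lang=tgt_lang, src=src, mt=mt)

print(build_gemba_prompt("English", "Hindi",
                         "Good morning!", "शुभ प्रभात!"))
```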
Evaluation metrics—Our primary method for evaluating model performance is Spearman’s correlation (ρ) [55], calculated between the model’s predicted scores and the ground-truth quality scores. This metric measures the monotonic relationship between the two variables. Instead of looking at the raw score values, it assesses how well the QE model ranks a set of translations from best to worst compared to the human ranking. This approach aligns with a primary use case of QE: comparing different MT systems or ranking multiple translation hypotheses. For a comprehensive analysis, we also compute Pearson’s (r) [56] and Kendall’s Tau (τ) [57] correlations. These correlation-based metrics are widely adopted standards in MT evaluation, ensuring that our results remain consistent and directly comparable with prior work in the field.
Pearson’s correlation (r) [56] measures the strength and direction of the linear relationship between the model’s predicted scores and the ground truth quality scores. Unlike rank-based metrics, Pearson’s directly compares the numerical values produced by the model and the human annotations, providing insight into how closely the predicted scores align with the true scores on an absolute scale. This metric is particularly informative when it is important to assess not just the ranking, but the actual magnitude of the predictions, such as when models are used for downstream tasks that rely on accurate, real-valued quality estimates.
Kendall’s Tau (τ) [58] measures the ordinal association between two ranked variables by quantifying the number of concordant and discordant pairs within the rankings. Rather than relying on the absolute values of the predicted and reference scores, this metric focuses on the consistency of pairwise ranking order between the model’s outputs and human annotations. As a result, Kendall’s Tau is particularly well-suited for assessing quality estimation models where the goal is to preserve the correct ordering of translations according to their quality, making it valuable for tasks such as comparing multiple systems [59,60].
A pre-processing step was required before computing the correlation scores, as model outputs often included extraneous text alongside the numerical score; we used regular expressions to isolate the predicted scores.
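The post-processing and evaluation steps can be summarised in a short sketch using SciPy; the regular expression and the toy scores below are illustrative (we assume the first number in a generation is the intended score).

```python
# Sketch of post-processing model generations and computing correlation metrics.
import re
from scipy.stats import spearmanr, pearsonr, kendalltau

def extract_score(generation):
    """Return the first number in the model output, or None if absent."""
    match = re.search(r"-?\d+(?:\.\d+)?", generation)
    return float(match.group()) if match else None

generations = ["Score: 85", "The translation is poor. 20", "70.5"]
predicted = [extract_score(g) for g in generations]
gold_da = [90.0, 25.0, 68.0]   # human DA scores (toy values)

rho, _ = spearmanr(predicted, gold_da)
r, _ = pearsonr(predicted, gold_da)
tau, _ = kendalltau(predicted, gold_da)
print(f"Spearman={rho:.3f}  Pearson={r:.3f}  Kendall={tau:.3f}")
```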

6. Results and Discussion

6.1. Encoder-Based

Table 2 presents the Spearman’s correlation scores obtained from encoder-based QE experiments (see Appendix B for Pearson and Kendall’s Tau scores). Among the TransQuest-based methods, the majority of language pairs demonstrated superior performance when the MonoTransQuest framework was paired with the InfoXLM-Large encoder. InfoXLM’s contrastive learning strategy, which maximises mutual information, enhances cross-lingual representation by aligning semantically equivalent translated pairs. This ability to effectively model linguistic nuances across languages makes InfoXLM-Large particularly well-suited for quality estimation tasks.
CometKiwi-23 outperforms its predecessor CometKiwi-22 across most language pairs, despite the latter also utilizing InfoXLM. The improved performance of CometKiwi-23 can be attributed to its use of a larger pre-trained encoder (XLM-R-XL with 3.5B parameters) and additional supervised data from the WMT23 QE shared task. While xCOMET demonstrates competitive results, CometKiwi-23 consistently outperforms it across all evaluated language pairs. Furthermore, CometKiwi surpasses the performance of all TransQuest variants in every language pair except for En-Ta.
It is important to note that while our comparison with CometKiwi highlights relative performance against the state-of-the-art, this system was trained on substantially larger datasets than our TransQuest-based models (see Section 4.1.2 for details of its training data). As such, performance differences cannot be attributed solely to architectural or methodological factors. Nevertheless, we considered it essential to include CometKiwi in our analysis, as it provides the strongest available benchmark for QE.

6.2. Decoder-Based

Table 3 presents the Spearman correlation scores obtained from decoder-based methods for QE, with corresponding graphical visualizations provided in Appendix D. See Appendix C for Pearson and Kendall’s Tau scores. In the zero-shot setting, the Gemma model achieved the highest Spearman correlation scores for five language pairs, while the Tower+ model performed best for the remaining three. Notably, the Gemma-2 and OpenChat models demonstrated consistently strong performance across all language pairs, outperforming the Tower+ 72B model in several cases. This is particularly significant given that Tower+ is substantially larger in size and specifically pre-trained on a large amount of translation-related data [46].
Instruction fine-tuning experiments show overall better performance compared to zero-shot experiments. In this setting, Gemma-2 again achieves the highest correlation score for four language pairs, Llama-3.1 achieves the best performance for Et-En and Si-En, and OpenChat achieves the best for En-Mr and En-Te. All models evaluated in the instruction fine-tuning setting show improved correlation scores compared to their zero-shot counterparts. We did not perform instruction fine-tuning with Tower+ because it has already been pre-trained with the WMT shared task datasets from 2017 to 2023.
The ALOPE-based results presented in Table 3 show only the best result obtained by each large language model for each language pair. We empirically investigated the Transformer layers TL(-1), TL(-7), TL(-11), TL(-16), TL(-20), and TL(-24), and discuss results for the three best-performing layers [TL(-1), TL(-7), and TL(-16)] in Table 4. Appendix A presents the results for the remaining three layers on the initial subset of LLMs used in ALOPE. For most language pairs (six out of eight), the best-performing layer was TL(-7). This finding suggests that the final layer TL(-1), commonly employed in instruction-fine-tuned LLMs, may not always offer the most informative contextual representation for cross-lingual regression tasks. Instead, intermediate layers may capture more relevant semantic features for low-resource language pairs.
In addition to the layer-specific approach with ALOPE, we also conducted experiments with the multi-layer approaches of ALOPE (dynamic weighting and multi-head regression) to investigate the advantages of utilizing information from multiple Transformer layers. We used only the Llama-3.2 model for the multi-layer experiments, as it showed the most stable performance across language pairs in the earlier layer-specific ALOPE experiments (see Table 4). The evaluation considered combining three distinct ranges of Transformer layers: TL-(1 to 7), TL-(8 to 11), and TL-(12 to 16). As shown in Table 5, both multi-layer strategies outperformed the instruction-fine-tuned models in the majority of instances. However, the layer-specific ALOPE configuration with selected Transformer layer representations achieved stronger results than the multi-layer variants (dynamic weighting and multi-head regression).
As shown in Table 3, overall performance in the zero-shot setting is substantially lower across all low-resource language pairs, except for Et–En when compared to both instruction-fine-tuned models and those fine-tuned using ALOPE. For Et–En, Tower+ achieves the highest correlation, likely because this model was pre-trained with translation data from the WMT shared task, which may not have been available to other models. Furthermore, when comparing ALOPE to instruction-fine-tuned LLMs, ALOPE consistently delivers higher Spearman correlation scores for nearly all language pairs, with the exception of En–Ta.

6.3. Comparative Analysis: Encoder vs. Decoder

We compared the best Spearman correlation scores between encoder-based methods (TransQuest, CometKiwi, and xCOMET) and decoder-based approaches (LLMs under both zero-shot and instruction-tuned settings), as illustrated in Figure 2. The results indicate that encoder-based models achieved the highest scores in five out of the eight language pairs, while decoder-based models outperformed encoders for only three language pairs (En-Mr, En-Ta, En-Te). We also performed the Williams significance test [61,62] to determine whether the best-performing model for each language pair produced Spearman correlations that were statistically higher than those of the other systems, as shown in Figure 2. For the three language pairs where decoder models performed best, their advantage over the best encoder-based models was not statistically significant; conversely, where encoder models performed best, their advantage over decoder-based models was statistically significant. These findings suggest that encoder-based approaches remain more effective and reliable for quality estimation in low-resource language settings compared to decoder-based methods.
Tokenization Analysis: To investigate why the encoder-based models (CometKiwi, TransQuest) generally outperform decoder LLMs in reference-less QE, we carried out a tokenization study comparing token counts across the two model families. For this analysis, 100 sentences were sampled per language pair from the test sets, and both the source and translated segments of each instance were processed with a tokenization pipeline built from the selected language models. Figure 3 presents results for four language pairs, with detailed outcomes for all pairs provided in Appendix F. Our findings show that decoder-based LLMs produce token counts that diverge substantially from the original word counts for low-resource, non-English languages. By contrast, encoder models such as InfoXLM, XLMR-XL, and XLMV show much smaller deviations. The issue is especially pronounced in morphologically rich languages such as Marathi, Tamil, and Telugu, where agglutination is common, and in Hindi, which frequently involves compounding. These discrepancies in tokenization distort semantic alignment between source and translation (Appendix F). For English, however, token counts generated by both encoders and decoders closely approximate word counts, indicating more reliable tokenization. This suggests that improving tokenization strategies for low-resource languages is critical to strengthening cross-lingual semantic matching in QE tasks, especially for decoders. Another noteworthy observation is that Et–En delivers the best Spearman scores under the majority of experimental settings. As illustrated in Figure 3, the gap between language-model token counts and the original word counts (for both source and translation) is notably smaller for Et–En than for the other pairs, for both encoder- and decoder-based models. This trend is likely due to the fact that both Estonian and English use the Latin script, which may facilitate more effective cross-lingual alignment between the source and translated texts. The shared script likely contributes to improved tokenization and semantic matching [25], thereby resulting in higher correlation scores for the Et–En pair across the majority of experimental settings.
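The token-count comparison is straightforward to reproduce; the sketch below contrasts whitespace word counts with subword token counts from one encoder and one decoder tokenizer (the example sentence and the specific checkpoints are illustrative, and access to gated checkpoints may require authentication).

```python
# Sketch of the tokenization analysis: compare whitespace word counts with
# subword token counts from an encoder and a decoder tokenizer (illustrative).
from transformers import AutoTokenizer

checkpoints = {
    "encoder (InfoXLM)": "microsoft/infoxlm-large",
    "decoder (Llama-3.2-3B)": "meta-llama/Llama-3.2-3B-Instruct",
}

segment = "आज मौसम बहुत सुहावना है।"   # Hindi example sentence (illustrative)
word_count = len(segment.split())

for name, ckpt in checkpoints.items():
    tok = AutoTokenizer.from_pretrained(ckpt)
    n_tokens = len(tok.tokenize(segment))
    print(f"{name}: {n_tokens} tokens for {word_count} words")
```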
English as the target language: A consistent trend across all the experiments is that language pairs with English as the target language of the translation (e.g., Et–En, Ne–En, Si–En) achieve higher Spearman correlations than those where the target language is non-English. Importantly, this effect holds across both encoder- and decoder-based systems, suggesting that models handle English targets more effectively regardless of architecture. This finding aligns with the observations of Nguyen et al. [63], who report that language models perform better when English serves as the target language in machine translation. The primary reason for this behaviour is English-heavy pre-training, which gives the models strong generative and representational ability in English. As a result, both model families tend to achieve higher performance in such settings, raising important considerations about the extent of multilingual capabilities claimed in recent model releases.
Computational efficiency—To compare the computational efficiency of different QE approaches, we measured GPU memory utilization for each model. As summarized in Figure 4b, encoder-based models such as TransQuest and CometKiwi exhibit relatively modest memory requirements, with most configurations using between 3 GB and 18 GB. Decoder-based LLMs, in contrast, are generally larger in parameter count and, by default, would require substantially more GPU memory. However, by leveraging parameter-efficient adaptation techniques such as LoRA [53] and 4-bit quantization, we were able to reduce the memory footprint of these LLMs to levels comparable to, or in some cases even lower than, certain encoder-based models. For instance, the instruction-fine-tuned Llama-2-7B and OpenChat models required only 9.2 GB and 8.7 GB, respectively, while some encoder-based variants such as TransQuest-XLM-V, CometKiwi-23 (w/XLM-R-XL) and xCOMET-XL (w/XLM-R-XL) exceeded 15 GB. These results demonstrate that, despite their larger architectural size, LLM-based QE models can be made computationally efficient and practical for low-resource settings by employing appropriate adaptation and quantization strategies.
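The peak-memory figures reported here can be collected with PyTorch’s CUDA statistics; the helper below is a minimal sketch and not necessarily identical to our logging setup (the function name and measurement scope are illustrative).

```python
# Sketch of measuring peak GPU memory around a fine-tuning or inference run.
import torch

def peak_gpu_memory_gb(fn, *args, **kwargs):
    """Run fn and report the peak GPU memory it allocated, in gigabytes."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    result = fn(*args, **kwargs)
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    return result, peak_gb

# Example (illustrative): _, peak = peak_gpu_memory_gb(trainer.train)
# print(f"Peak GPU memory: {peak:.1f} GB")
```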
We also assessed the disk space requirements of each model, as summarized in Figure 4a. The results show a marked difference between encoder-based and decoder-based approaches. Encoder-based models such as TransQuest and CometKiwi-22, which leverage InfoXLM or similar architectures, require as little as 2.4 to 3.8 GB of disk space. In contrast, larger variants like CometKiwi-23 and xCOMET-XL demand up to 17 GB, approaching the storage footprint of some decoder-based models. Decoder-based LLMs exhibit substantially larger disk space utilization, ranging from 8.1 GB (Llama-3.2-3B) to as high as 23 GB (Gemma-2-9B). In general, decoder-based models require more disk space than their encoder-based counterparts, posing additional challenges for on-device deployment. Notably, however, Llama-3.2-3B, the smallest decoder-based model with the fewest parameters and the most modest computational requirements, obtained the highest correlation among decoder-based approaches for the majority of the language pairs. These results underscore the importance of model selection for QE.
In summary, our experiments provide a comprehensive analysis of both translation quality and computational efficiency across encoder-based and decoder-based QE models in low-resource settings. While encoder-based models consistently deliver strong quality estimation performance with moderate computational demands, recent advances in parameter-efficient adaptation have enabled large language models to achieve comparable efficiency and, in some cases, competitive quality scores. Nonetheless, our results highlight that the choice of architecture entails important trade-offs between predictive accuracy, memory requirements, and disk-storage costs—factors that are especially critical when scaling QE solutions to practical, resource-constrained environments. Ultimately, these findings underscore the continued value of well-optimized encoder-based models, while also illustrating the potential for advanced LLMs to become viable in low-resource QE tasks through careful adaptation and efficient deployment strategies like ALOPE.

7. Conclusions

This study systematically evaluated segment-level reference-less quality estimation (QE) methods for low-resource machine translation, providing a comprehensive comparison between state-of-the-art encoder-based and decoder-based approaches. Fine-tuned encoder-based models consistently achieved the highest correlation scores across most language pairs and demonstrated strong robustness. Our findings reaffirm the enduring strengths of well-optimized encoder-based models, while also indicating that advanced LLMs, when carefully adapted with frameworks such as ALOPE, offer significant potential for advancing low-resource QE tasks. In addition to translation quality, our analysis highlighted the importance of computational efficiency. Despite their larger size, decoder-based LLMs, when adapted with LoRA and 4-bit quantization, were able to achieve GPU memory footprints comparable to, or sometimes even lower than, those of larger encoder-based models. As future work, we plan to explore even more efficient adaptation strategies for LLMs, such as reinforcement learning, and to broaden the evaluation to a wider range of language pairs, to ensure equitable progress in machine translation quality estimation for all languages.

Author Contributions

Conceptualization, A.S. and D.K.; methodology, A.S., D.K. and C.O.; validation, A.S.; formal analysis, A.S., D.K. and C.O.; investigation, A.S.; writing—original draft preparation, A.S.; writing—review and editing, A.S., D.K. and C.O.; visualization, A.S. and D.K.; supervision, D.K. and C.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset is available at https://huggingface.co/datasets/surrey-nlp/Low-resource-QE-DA-dataset (accessed on 4 October 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

MT      Machine Translation
DA      Direct Assessment
QE      Quality Estimation
ALOPE   Adaptive Layer Optimization for Translation Quality Estimation
LLMs    Large Language Models
MQM     Multidimensional Quality Metrics

Appendix A. Experiment Results of ALOPE with Additional Transformer Layers

Table A1. Spearman correlation scores obtained for the experiments conducted with the additional Transformer layers (TL) using ALOPE fine-tuning: TL (-11), TL (-20), and TL (-24).
TL        Model            En-Gu   En-Hi   En-Mr   En-Ta   En-Te   Et-En   Ne-En   Si-En
TL (-11)  Llama-2-7B       0.360   0.301   0.361   0.254   0.293   0.405   0.164   0.049
          Llama-3.1-8B     0.514   0.412   0.609   0.438   0.304   0.148   0.554   0.493
          Llama-3.2-3B     0.594   0.476   0.605   0.610   0.373   0.748   0.678   0.560
          Aya-Expanse-8B   0.490   0.411   0.572   0.445   0.336   0.569   0.453   0.439
TL (-20)  Llama-2-7B       0.470   0.405   0.544   0.460   0.338   0.684   0.508   0.534
          Llama-3.1-8B     0.484   0.394   0.553   0.321   0.172   0.649   0.524   0.494
          Llama-3.2-3B     0.430   0.408   0.579   0.303   0.286   0.601   0.488   0.464
          Aya-Expanse-8B   0.437   0.300   0.488   0.263   0.287   0.483   0.438   0.395
TL (-24)  Llama-2-7B       0.500   0.421   0.538   0.379   0.239   0.630   0.507   0.472
          Llama-3.1-8B     0.421   0.378   0.552   0.330   0.290   0.515   0.530   0.464
          Llama-3.2-3B     0.443   0.376   0.507   0.367   0.299   0.559   0.528   0.487
          Aya-Expanse-8B   0.375   0.319   0.440   0.337   0.220   0.393   0.407   0.345

Appendix B. Pearson’s and Kendall’s Tau Correlation Scores of Encoder-Based QE Approaches

Table A2. Pearson’s (r) and Kendall’s ( τ ) correlation scores of encoder-based QE models across 8 low-resource language pairs.
Model              En-Gu        En-Hi        En-Mr        En-Ta        En-Te        Et-En        Ne-En        Si-En
                   r/τ          r/τ          r/τ          r/τ          r/τ          r/τ          r/τ          r/τ
MonoTQ-InfoXLM     0.680/0.460  0.610/0.336  0.658/0.434  0.650/0.435  0.330/0.247  0.755/0.560  0.767/0.530  0.627/0.413
MonoTQ-XLM-V       0.586/0.396  0.518/0.312  0.608/0.389  0.654/0.435  0.312/0.246  0.764/0.565  0.743/0.517  0.643/0.425
MonoTQ-XLM-R-XL    0.652/0.432  0.617/0.346  0.489/0.317  0.690/0.454  0.292/0.230  0.729/0.536  0.727/0.514  0.607/0.409
CometKiwi-22       0.615/0.416  0.511/0.283  0.717/0.449  0.601/0.404  0.233/0.162  0.807/0.624  0.750/0.566  0.678/0.469
CometKiwi-23       0.678/0.467  0.618/0.390  0.711/0.455  0.648/0.446  0.310/0.235  0.852/0.661  0.783/0.599  0.730/0.515
xCOMET-XL          0.517/0.351  0.305/0.235  0.439/0.304  0.443/0.357  0.213/0.174  0.771/0.609  0.705/0.467  0.597/0.409

Appendix C. Pearson’s and Kendall’s Tau Correlation Scores of Decoder-Based QE Approaches

Table A3. Pearson’s (r) and Kendall’s ( τ ) correlation scores of various decoder-based LLMs across 8 low-resource language pairs.
Model            En-Gu          En-Hi          En-Mr          En-Ta          En-Te          Et-En          Ne-En          Si-En
                 r/τ            r/τ            r/τ            r/τ            r/τ            r/τ            r/τ            r/τ
Zero-shot
Llama-2-7B       0.015/0.005    −0.031/−0.001  0.054/0.042    0.034/0.054    0.005/−0.013   0.173/0.129    0.144/0.119    0.155/0.113
OpenChat-7B      0.267/0.187    0.315/0.188    0.323/0.137    0.400/0.270    0.155/0.109    0.550/0.411    0.476/0.320    0.439/0.299
Gemma-2-9B       0.093/0.254    0.482/0.289    0.430/0.311    0.400/0.353    0.052/0.134    0.620/0.524    0.329/0.416    0.427/0.332
Llama-3.1-8B     0.183/0.115    0.261/0.135    0.289/0.171    0.254/0.152    0.153/0.092    0.500/0.354    0.333/0.191    0.291/0.190
Llama-3.2-3B     0.100/0.068    0.148/0.067    0.163/0.084    0.128/0.091    0.082/0.071    0.355/0.255    0.231/0.155    0.136/0.085
Aya-Expanse-8B   0.013/0.096    0.055/0.102    0.027/0.069    −0.047/0.088   −0.006/0.038   0.240/0.229    0.299/0.259    0.309/0.240
Tower+-72B       0.278/0.184    0.478/0.286    0.500/0.322    0.386/0.189    0.156/0.106    0.754/0.586    0.604/0.462    0.368/0.296
Instruction Fine-Tuning
Llama-2-7B       0.539/0.342    0.477/0.234    0.515/0.335    0.526/0.333    0.228/0.195    0.597/0.420    0.558/0.335    0.389/0.260
OpenChat-7B      0.597/0.383    0.568/0.314    0.621/0.429    0.582/0.396    0.261/0.221    0.667/0.501    0.665/0.430    0.479/0.337
Gemma-2-9B       0.660/0.438    0.602/0.352    0.625/0.388    0.647/0.474    0.223/0.200    0.654/0.502    0.731/0.492    0.471/0.330
Llama-3.1-8B     0.613/0.416    0.517/0.272    0.608/0.388    0.625/0.438    0.206/0.167    0.693/0.533    0.684/0.444    0.557/0.399
Llama-3.2-3B     0.587/0.374    0.558/0.289    0.593/0.381    0.612/0.424    0.177/0.162    0.657/0.489    0.653/0.421    0.458/0.319
Aya-Expanse-8B   0.624/0.396    0.532/0.287    0.591/0.373    0.560/0.368    0.221/0.209    0.587/0.431    0.653/0.409    0.385/0.271
ALOPE-TL (-1)
Llama-2-7B       0.630/0.409    0.554/0.290    0.643/0.436    0.581/0.372    0.306/0.242    0.736/0.540    0.647/0.425    0.605/0.401
OpenChat-7B      0.043/0.033    0.111/0.059    −0.174/−0.080  0.159/0.021    0.125/0.117    0.132/0.090    −0.064/−0.003  −0.027/0.000
Gemma-2-9B       0.367/0.229    0.333/0.197    0.441/0.277    0.267/0.204    0.200/0.162    0.435/0.296    0.407/0.283    0.362/0.220
Llama-3.1-8B     0.630/0.430    0.572/0.331    0.606/0.445    0.593/0.403    0.330/0.248    0.710/0.534    0.672/0.470    0.584/0.388
Llama-3.2-3B     0.662/0.437    0.608/0.337    0.692/0.457    0.634/0.417    0.314/0.243    0.734/0.534    0.727/0.490    0.594/0.385
Aya-Expanse-8B   0.086/0.047    0.172/0.123    0.284/0.148    0.039/−0.006   0.231/0.190    0.105/0.077    0.048/0.007    0.080/0.051
ALOPE-TL (-7)
Llama-2-7B       0.623/0.409    0.407/0.231    0.569/0.382    0.540/0.348    0.240/0.218    0.733/0.539    0.641/0.433    0.613/0.408
OpenChat-7B      0.008/0.024    0.160/0.073    −0.133/−0.046  0.063/0.093    0.146/0.128    0.146/0.098    0.167/0.087    0.132/0.038
Gemma-2-9B       0.412/0.260    0.386/0.210    0.481/0.289    0.315/0.187    0.237/0.187    0.446/0.310    0.432/0.300    0.405/0.235
Llama-3.1-8B     0.641/0.428    0.579/0.337    0.654/0.444    0.593/0.377    0.363/0.266    0.740/0.542    0.669/0.459    0.596/0.387
Llama-3.2-3B     0.657/0.441    0.598/0.338    0.660/0.443    0.640/0.423    0.319/0.255    0.733/0.551    0.701/0.483    0.602/0.394
Aya-Expanse-8B   0.608/0.388    0.576/0.312    0.615/0.426    0.588/0.374    0.301/0.240    0.727/0.539    0.671/0.470    0.582/0.386
ALOPE-TL (-16)
Llama-2-7B       0.603/0.388    0.493/0.266    0.615/0.414    0.540/0.344    0.262/0.212    0.748/0.550    0.640/0.413    0.611/0.404
OpenChat-7B      0.489/0.323    0.477/0.241    0.592/0.378    0.482/0.295    0.260/0.208    0.185/0.120    0.471/0.296    0.395/0.242
Gemma-2-9B       0.427/0.270    0.431/0.232    0.483/0.290    0.358/0.223    0.142/0.092    0.463/0.323    0.419/0.293    0.406/0.239
Llama-3.1-8B     0.609/0.402    0.562/0.315    0.656/0.430    0.568/0.366    0.283/0.237    0.697/0.537    0.668/0.473    0.552/0.364
Llama-3.2-3B     0.611/0.400    0.587/0.322    0.677/0.426    0.619/0.390    0.331/0.233    0.744/0.544    0.721/0.498    0.609/0.405
Aya-Expanse-8B   0.552/0.333    0.522/0.268    0.611/0.392    0.560/0.337    0.271/0.215    0.724/0.528    0.656/0.410    0.580/0.383
ALOPE-TL (-11)
Llama-2-7B       0.385/0.245    0.293/0.206    0.393/0.241    0.246/0.174    0.175/0.200    0.406/0.274    0.130/0.108    0.078/0.032
Llama-3.1-8B     0.565/0.367    0.546/0.288    0.615/0.436    0.530/0.304    0.282/0.206    0.141/0.100    0.590/0.396    0.533/0.346
Llama-3.2-3B     0.654/0.430    0.577/0.335    0.673/0.434    0.659/0.444    0.354/0.257    0.745/0.548    0.718/0.496    0.604/0.400
Aya-Expanse-8B   0.555/0.347    0.563/0.286    0.598/0.403    0.528/0.313    0.317/0.233    0.561/0.397    0.501/0.316    0.492/0.306
ALOPE-TL (-20)
Llama-2-7B       0.485/0.329    0.465/0.277    0.600/0.383    0.480/0.317    0.262/0.232    0.647/0.488    0.537/0.358    0.560/0.378
Llama-3.1-8B     0.560/0.346    0.524/0.276    0.576/0.389    0.504/0.220    0.212/0.119    0.617/0.463    0.553/0.371    0.522/0.347
Llama-3.2-3B     0.487/0.304    0.508/0.281    0.604/0.410    0.475/0.207    0.299/0.195    0.540/0.424    0.493/0.341    0.498/0.324
Aya-Expanse-8B   0.538/0.308    0.516/0.207    0.528/0.340    0.493/0.179    0.282/0.201    0.471/0.327    0.439/0.305    0.441/0.273
ALOPE-TL (-24)
Llama-2-7B       0.526/0.356    0.544/0.293    0.610/0.375    0.486/0.259    0.241/0.164    0.618/0.445    0.526/0.357    0.513/0.331
Llama-3.1-8B     0.458/0.294    0.507/0.261    0.575/0.388    0.456/0.223    0.284/0.198    0.456/0.362    0.513/0.374    0.477/0.324
Llama-3.2-3B     0.516/0.311    0.449/0.257    0.548/0.355    0.448/0.251    0.270/0.205    0.549/0.392    0.510/0.371    0.523/0.343
Aya-Expanse-8B   0.433/0.262    0.484/0.220    0.528/0.300    0.506/0.235    0.221/0.153    0.409/0.264    0.408/0.281    0.418/0.237

Appendix D. Spearman Correlation Scores of Decoder-Based Methods Across Selected Models

Figure A1. This image visualizes the Spearman correlation scores obtained with decoder-based methods of zero-shot, instruction-fine-tuned (IFT) LLMs and ALOPE (only the best values obtained) across selected models and language pairs.

Appendix E. Spearman Correlation Scores Across Selected Transformer Layers (TL) with ALOPE Fine-Tuning

Figure A2. This image visualizes Spearman correlation scores obtained with ALOPE layer-specific approach with selected transformer layers (TL).

Appendix F. Tokenization Analysis with All the Language Pairs

Figure A3. Comparison of source and translation segment word counts with token counts produced by different language models across the language pairs. The X-axis shows the model names, while the Y-axis denotes the corresponding word/token counts.

References

  1. Kumar, A.; Kunchukuttan, A.; Puduppully, R.; Dabre, R. In-context Example Selection for Machine Translation Using Multiple Features. arXiv 2023, arXiv:2305.14105. [Google Scholar]
  2. Costa-jussà, M.R.; Cross, J.; Çelebi, O.; Elbayad, M.; Heafield, K.; Heffernan, K.; Kalbassi, E.; Lam, J.; Licht, D.; Maillard, J.; et al. No language left behind: Scaling human-centered machine translation. arXiv 2022, arXiv:2207.04672. [Google Scholar] [CrossRef]
  3. Kocmi, T.; Bawden, R.; Bojar, O.; Dvorkovich, A.; Federmann, C.; Fishel, M.; Gowda, T.; Graham, Y.; Grundkiewicz, R.; Haddow, B.; et al. Findings of the 2022 conference on machine translation (WMT22). In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 1–45. [Google Scholar]
  4. Pitman, J. Google Translate: One Billion Installs, One Billion Stories—Blog.Google. 2021. Available online: https://blog.google/products/translate/one-billion-installs/ (accessed on 12 April 2023).
  5. Almahasees, Z.; Meqdadi, S.; Albudairi, Y. Evaluation of Google Translate in Rendering English COVID-19 Texts into Arabic. J. Lang. Linguist. Stud. 2021, 17, 2065–2080. [Google Scholar] [CrossRef]
  6. Ranathunga, S.; Lee, E.S.A.; Prifti Skenduli, M.; Shekhar, R.; Alam, M.; Kaur, R. Neural Machine Translation for Low-Resource Languages: A Survey. ACM Comput. Surv. 2023, 55, 229. [Google Scholar] [CrossRef]
  7. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7 –12 July 2002; Isabelle, P., Charniak, E., Lin, D., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 311–318. [Google Scholar] [CrossRef]
  8. Graham, Y.; Baldwin, T.; Moffat, A.; Zobel, J. Continuous Measurement Scales in Human Evaluation of Machine Translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, Sofia, Bulgaria, 8–9 August 2013; Pareja-Lora, A., Liakata, M., Dipper, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2013; pp. 33–41. [Google Scholar]
  9. Lommel, A.R.; Burchardt, A.; Uszkoreit, H. Multidimensional quality metrics: A flexible system for assessing translation quality. In Proceedings of the Translating and the Computer 35, London, UK, 28–29 November 2013. [Google Scholar]
  10. Specia, L.; Paetzold, G.; Scarton, C. Multi-level translation quality prediction with quest++. In Proceedings of the ACL-IJCNLP 2015 System Demonstrations, Beijing, China, 26–31 July 2015; pp. 115–120. [Google Scholar]
  11. Scarton, C.; Specia, L. Document-level translation quality estimation: Exploring discourse and pseudo-references. In Proceedings of the 17th Annual Conference of the European Association for Machine Translation, Dubrovnik, Croatia, 16–18 June 2014; pp. 101–108. [Google Scholar]
  12. Kepler, F.; Trénous, J.; Treviso, M.; Vera, M.; Martins, A.F. OpenKiwi: An open source framework for quality estimation. arXiv 2019, arXiv:1902.08646. [Google Scholar] [CrossRef]
  13. Kepler, F.; Trénous, J.; Treviso, M.; Vera, M.; Góis, A.; Farajian, M.A.; Lopes, A.V.; Martins, A.F.T. Unbabel’s Participation in the WMT19 Translation Quality Estimation Shared Task. In Proceedings of the Fourth Conference on Machine Translation, Florence, Italy, 1–2 August 2019; Volume 3: Shared Task Papers, Day 2, pp. 78–84. [Google Scholar] [CrossRef]
  14. Specia, L.; Blain, F.; Logacheva, V.; Astudillo, R.F.; Martins, A.F.T. Findings of the WMT 2018 Shared Task on Quality Estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels, 31 October–1 November 2018; pp. 689–709. [Google Scholar] [CrossRef]
  15. Rei, R.; Treviso, M.; Guerreiro, N.M.; Zerva, C.; Farinha, A.C.; Maroti, C.; de Souza, J.G.C.; Glushkova, T.; Alves, D.; Lavie, A.; et al. CometKiwi: IST-Unbabel 2022 Submission for the Quality Estimation Shared Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; Koehn, P., Barrault, L., Bojar, O., Bougares, F., Chatterjee, R., Costa-jussà, M.R., Federmann, C., Fishel, M., Fraser, A., Freitag, M., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 634–645. [Google Scholar]
  16. Ranasinghe, T.; Orasan, C.; Mitkov, R. TransQuest: Translation Quality Estimation with Cross-lingual Transformers. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain, 13–18 September 2020; Scott, D., Bel, N., Zong, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5070–5081. [Google Scholar] [CrossRef]
  17. Perrella, S.; Proietti, L.; Scirè, A.; Campolungo, N.; Navigli, R. MATESE: Machine Translation Evaluation as a Sequence Tagging Problem. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 569–577. [Google Scholar]
  18. Moura, J.; Vera, M.; van Stigt, D.; Kepler, F.; Martins, A.F. Ist-unbabel participation in the wmt20 quality estimation shared task. In Proceedings of the Fifth Conference on Machine Translation, Virtual, 19–20 November 2020; pp. 1029–1036. [Google Scholar]
  19. Baek, Y.; Kim, Z.M.; Moon, J.; Kim, H.; Park, E. PATQUEST: Papago Translation Quality Estimation. In Proceedings of the Fifth Conference on Machine Translation, Virtual, 19–20 November 2020; pp. 991–998. [Google Scholar]
  20. Conneau, A.; Khandelwal, K.; Goyal, N.; Chaudhary, V.; Wenzek, G.; Guzmán, F.; Grave, E.; Ott, M.; Zettlemoyer, L.; Stoyanov, V. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Virtual, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8440–8451. [Google Scholar] [CrossRef]
  21. Liu, Y.; Gu, J.; Goyal, N.; Li, X.; Edunov, S.; Ghazvininejad, M.; Lewis, M.; Zettlemoyer, L. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguist. 2020, 8, 726–742. [Google Scholar] [CrossRef]
  22. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; Volume 1 (Long and Short Papers), pp. 4171–4186. [Google Scholar] [CrossRef]
  23. Kocmi, T.; Federmann, C. Large Language Models Are State-of-the-Art Evaluators of Translation Quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, Tampere, Finland, 12–15 June 2023; Nurminen, M., Brenner, J., Koponen, M., Latomaa, S., Mikhailov, M., Schierl, F., Ranasinghe, T., Vanmassenhove, E., Vidal, S.A., Aranberri, N., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 193–203. [Google Scholar]
  24. Mujadia, V.; Mishra, P.; Ahsan, A.M.; Sharma, D. Towards Large Language Model driven Reference-less Translation Evaluation for English and Indian Language. In Proceedings of the 20th International Conference on Natural Language Processing (ICON), Goa, India, 14–17 December 2023; Pawar, J.D., Lalitha Devi, S., Eds.; Goa University: Goa, India, 2023; pp. 357–369. [Google Scholar]
  25. Sindhujan, A.; Kanojia, D.; Orasan, C.; Qian, S. When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, Abu Dhabi, United Arab Emirates, 19–20 January 2025; Hettiarachchi, H., Ranasinghe, T., Rayson, P., Mitkov, R., Gaber, M., Premasiri, D., Tan, F.A., Uyangodage, L., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 437–459. [Google Scholar]
  26. Lim, S.; Park, J. Papago’s Submission to the WMT22 Quality Estimation Shared Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 627–633. [Google Scholar]
  27. Bao, K.; Wan, Y.; Liu, D.; Yang, B.; Lei, W.; He, X.; Wong, D.F.; Xie, J. Alibaba-Translate China’s Submission for WMT 2022 Quality Estimation Shared Task. arXiv 2022, arXiv:2210.10049. [Google Scholar]
  28. Geng, X.; Zhang, Y.; Huang, S.; Tao, S.; Yang, H.; Chen, J. NJUNLP’s Participation for the WMT2022 Quality Estimation Shared Task. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; pp. 615–620. [Google Scholar]
  29. Sindhujan, A.; Kanojia, D.; Orăsan, C. Optimizing quality estimation for low-resource language translations: Exploring the role of language relatedness. In Proceedings of the International Conference on New Trends in Translation and Technology Conference 2024, Varna, Bulgaria, 3–6 July 2024; Incoma Ltd.: Sevilla, Spain, 2024; pp. 170–190. [Google Scholar]
  30. Deoghare, S.; Choudhary, P.; Kanojia, D.; Ranasinghe, T.; Bhattacharyya, P.; Orăsan, C. A Multi-task Learning Framework for Quality Estimation. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2023, Toronto, ON, Canada, 9–14 July 2023; pp. 9191–9205. [Google Scholar]
  31. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A Survey of Large Language Models. arXiv 2025, arXiv:2303.18223. [Google Scholar] [PubMed]
  32. Guzmán, F.; Chen, P.J.; Ott, M.; Pino, J.; Lample, G.; Koehn, P.; Chaudhary, V.; Ranzato, M. The flores evaluation datasets for low-resource machine translation: Nepali-english and sinhala-english. arXiv 2019, arXiv:1902.01382. [Google Scholar]
  33. Blain, F.; Zerva, C.; Rei, R.; Guerreiro, N.M.; Kanojia, D.; de Souza, J.G.C.; Silva, B.; Vaz, T.; Jingxuan, Y.; Azadi, F.; et al. Findings of the WMT 2023 Shared Task on Quality Estimation. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; Koehn, P., Haddow, B., Kocmi, T., Monz, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 629–653. [Google Scholar] [CrossRef]
  34. Zerva, C.; Blain, F.; Rei, R.; Lertvittayakumjorn, P.; de Souza, J.G.C.; Eger, S.; Kanojia, D.; Alves, D.; Orăsan, C.; Fomicheva, M.; et al. Findings of the WMT 2022 Shared Task on Quality Estimation. In Proceedings of the Seventh Conference on Machine Translation (WMT), Abu Dhabi, United Arab Emirates, 7–8 December 2022; Koehn, P., Barrault, L., Bojar, O., Bougares, F., Chatterjee, R., Costa-Jussà, M.R., Federmann, C., Fishel, M., Fraser, A., Freitag, M., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022. [Google Scholar]
  35. Specia, L.; Blain, F.; Fomicheva, M.; Fonseca, E.; Chaudhary, V.; Guzmán, F.; Martins, A.F.T. Findings of the WMT 2020 Shared Task on Quality Estimation. In Proceedings of the Fifth Conference on Machine Translation, Virtual, 19–20 November 2020; Barrault, L., Bojar, O., Bougares, F., Chatterjee, R., Costa-Jussà, M.R., Federmann, C., Fishel, M., Fraser, A., Graham, Y., Guzman, P., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 743–764. [Google Scholar]
  36. Sindhujan, A.; Kanojia, D.; Orasan, C.; Ranasinghe, T. SurreyAI 2023 Submission for the Quality Estimation Shared Task. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; Koehn, P., Haddow, B., Kocmi, T., Monz, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 849–855. [Google Scholar] [CrossRef]
  37. Wenzek, G.; Lachaux, M.A.; Conneau, A.; Chaudhary, V.; Guzmán, F.; Joulin, A.; Grave, E. CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4003–4012. [Google Scholar]
  38. Liang, D.; Gonen, H.; Mao, Y.; Hou, R.; Goyal, N.; Ghazvininejad, M.; Zettlemoyer, L.; Khabsa, M. XLM-V: Overcoming the Vocabulary Bottleneck in Multilingual Masked Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Bouamor, H., Pino, J., Bali, K., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 13142–13152. [Google Scholar] [CrossRef]
  39. Chi, Z.; Dong, L.; Wei, F.; Yang, N.; Singhal, S.; Wang, W.; Song, X.; Mao, X.L.; Huang, H.; Zhou, M. InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Virtual, 6–11 June 2021; Toutanova, K., Rumshisky, A., Zettlemoyer, L., Hakkani-Tur, D., Beltagy, I., Bethard, S., Cotterell, R., Chakraborty, T., Zhou, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 3576–3588. [Google Scholar] [CrossRef]
  40. Rei, R.; Guerreiro, N.M.; Pombal, J.; van Stigt, D.; Treviso, M.; Coheur, L.; de Souza, J.G.C.; Martins, A. Scaling up CometKiwi: Unbabel-IST 2023 Submission for the Quality Estimation Shared Task. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; Koehn, P., Haddow, B., Kocmi, T., Monz, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 841–848. [Google Scholar] [CrossRef]
  41. Martins, A.F.T.; Astudillo, R.F. From Softmax to Sparsemax: A Sparse Model of Attention and Multi-Label Classification. arXiv 2016, arXiv:1602.02068. [Google Scholar] [CrossRef]
  42. Guerreiro, N.M.; Rei, R.; Stigt, D.V.; Coheur, L.; Colombo, P.; Martins, A.F.T. xcomet: Transparent Machine Translation Evaluation through Fine-grained Error Detection. Trans. Assoc. Comput. Linguist. 2024, 12, 979–995. [Google Scholar] [CrossRef]
  43. Fomicheva, M.; Sun, S.; Fonseca, E.; Zerva, C.; Blain, F.; Chaudhary, V.; Guzmán, F.; Lopatina, N.; Specia, L.; Martins, A.F.T. MLQE-PE: A Multilingual Quality Estimation and Post-Editing Dataset. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, Marseille, France, 20–25 June 2022; Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., et al., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 4963–4974. [Google Scholar]
  44. Sai B, A.; Dixit, T.; Nagarajan, V.; Kunchukuttan, A.; Kumar, P.; Khapra, M.M.; Dabre, R. IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation Metrics for Indian Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, Toronto, ON, Canada, 9–14 July 2023; Rogers, A., Boyd-Graber, J., Okazaki, N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; Volume 1: Long Papers, pp. 14210–14228. [Google Scholar] [CrossRef]
  45. Karpinska, M.; Raj, N.; Thai, K.; Song, Y.; Gupta, A.; Iyyer, M. DEMETR: Diagnosing Evaluation Metrics for Translation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7–11 December 2022; Goldberg, Y., Kozareva, Z., Zhang, Y., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 9540–9561. [Google Scholar] [CrossRef]
  46. Rei, R.; Guerreiro, N.M.; Pombal, J.; Alves, J.; Teixeirinha, P.; Farajian, A.; Martins, A.F.T. Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs. arXiv 2025, arXiv:2506.17080. [Google Scholar]
  47. Sindhujan, A.; Qian, S.; Matthew, C.C.C.; Orasan, C.; Kanojia, D. ALOPE: Adaptive Layer Optimization for Translation Quality Estimation using Large Language Models. arXiv 2025, arXiv:2508.07484. [Google Scholar] [CrossRef]
  48. Xie, S.M.; Santurkar, S.; Ma, T.; Liang, P. Data Selection for Language Models via Importance Resampling. arXiv 2023, arXiv:2302.03169. [Google Scholar] [CrossRef]
  49. Tiedemann, J. Parallel Data, Tools and Interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, 23–25 May 2012; Calzolari, N., Choukri, K., Declerck, T., Doğan, M.U., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; pp. 2214–2218. [Google Scholar]
  50. Singh, S.; Vargus, F.; D’souza, D.; Karlsson, B.F.; Mahendiran, A.; Ko, W.Y.; Shandilya, H.; Patel, J.; Mataciunas, D.; O’Mahony, L.; et al. Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; Volume 1: Long Papers, pp. 11521–11567. [Google Scholar] [CrossRef]
  51. Lambert, N.; Morrison, J.; Pyatkin, V.; Huang, S.; Ivison, H.; Brahman, F.; Miranda, L.J.V.; Liu, A.; Dziri, N.; Lyu, S.; et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv 2025, arXiv:2411.15124. [Google Scholar]
  52. Zhou, W.; Agrawal, R.; Zhang, S.; Indurthi, S.R.; Zhao, S.; Song, K.; Xu, S.; Zhu, C. WPO: Enhancing RLHF with Weighted Preference Optimization. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 8328–8340. [Google Scholar] [CrossRef]
  53. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
  54. Kocmi, T.; Federmann, C. GEMBA-MQM: Detecting Translation Quality Error Spans with GPT-4. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 6–7 December 2023; Koehn, P., Haddow, B., Kocmi, T., Monz, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 768–775. [Google Scholar] [CrossRef]
  55. Sedgwick, P. Spearman’s rank correlation coefficient. BMJ 2014, 349, g7327. [Google Scholar] [CrossRef] [PubMed]
  56. Cohen, I.; Huang, Y.; Chen, J.; Benesty, J.; Benesty, J.; Chen, J.; Huang, Y.; Cohen, I. Pearson correlation coefficient. In Noise Reduction in Speech Processing; Springer: Berlin/Heidelberg, Germany, 2009; pp. 1–4. [Google Scholar]
  57. Lapata, M. Automatic Evaluation of Information Ordering: Kendall’s Tau. Comput. Linguist. 2006, 32, 471–484. [Google Scholar] [CrossRef]
  58. Chok, N.S. Pearson’s Versus Spearman’s and Kendall’s Correlation Coefficients for Continuous Data. Ph.D. Thesis, University of Pittsburgh, Pittsburgh, PA, USA, 2010. [Google Scholar]
  59. Rei, R.; Stewart, C.; Farinha, A.C.; Lavie, A. COMET: A neural framework for MT evaluation. arXiv 2020, arXiv:2009.09025. [Google Scholar] [CrossRef]
  60. Sai, A.B.; Nagarajan, V.; Dixit, T.; Dabre, R.; Kunchukuttan, A.; Kumar, P.; Khapra, M.M. IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for Indian Languages. arXiv 2022, arXiv:2212.10180. [Google Scholar]
  61. Graham, Y.; Baldwin, T. Testing for Significance of Increased Correlation with Human Judgment. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; Moschitti, A., Pang, B., Daelemans, W., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2014; pp. 172–176. [Google Scholar] [CrossRef]
  62. Williams, E. Regression Analysis; Wiley Series in Probability and Statistics: Applied Probability and Statistics Section Series; Wiley: Hoboken, NJ, USA, 1959. [Google Scholar]
  63. Nguyen, X.P.; Aljunied, M.; Joty, S.; Bing, L. Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, Bangkok, Thailand, 11–16 August 2024; Ku, L.W., Martins, A., Srikumar, V., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; Volume 1, pp. 3501–3516. [Google Scholar] [CrossRef]
Figure 1. Annotation process for assigning a Direct Assessment (DA) score to a machine-translated output sentence.
Figure 2. Comparison of the highest Spearman scores obtained for each language pair from different approaches. Within each group of bars (each language pair), the green asterisk (*) denotes the best-performing approach, while yellow asterisks indicate approaches with performance not significantly different from the best.
Figure 3. Comparison of original word counts with model-generated token counts for source and translated segments across selected language pairs. Complete results for all pairs appear in Appendix F.
Figure 4. Comparison of computational resource utilization for encoder-based and decoder-based QE models. (a) Disk space utilization of encoder-based versus decoder-based models; (b) GPU memory utilization of encoder-based versus decoder-based models (instruction-fine-tuned LLMs and ALOPE).
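The GPU memory figures for decoder-based models in Figure 4 rest on parameter-efficient adaptation: loading the backbone in 4-bit precision and training only LoRA adapters. The sketch below shows a common way to set this up with transformers and peft; the model name, LoRA rank, and target modules are assumptions for illustration, not the exact configuration reported in this work.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization keeps the frozen backbone small in GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",          # hypothetical base model
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters add a small number of trainable parameters on top of the frozen backbone.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only (illustrative choice)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable parameters are a small fraction of the total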
Table 1. Dataset splits for translation quality estimation across 8 low-resource language pairs. Each dataset includes human-annotated DA scores for supervised model training and evaluation.
Data Split | En-Gu | En-Hi | En-Mr  | En-Ta | En-Te | Et-En | Ne-En | Si-En
Training   | 7000  | 7000  | 26,000 | 7000  | 7000  | 7000  | 7000  | 7000
Testing    | 1000  | 1000  | 699    | 1000  | 1000  | 1000  | 1000  | 1000
Table 2. Spearman correlation scores of encoder-based QE methods evaluated on the low-resource language pairs. MonoTQ indicates MonoTransQuest; CometKiwi-22 and -23 refer to the respective models described in Section 4.1. The bolded values indicate the highest Spearman correlation score obtained for each language pair.
Method | En-Gu | En-Hi | En-Mr | En-Ta | En-Te | Et-En | Ne-En | Si-En
MonoTQ-InfoXLM-Large | 0.630 | 0.478 | 0.606 | 0.603 | 0.358 | 0.760 | 0.718 | 0.579
MonoTQ-XLM-V | 0.552 | 0.446 | 0.556 | 0.607 | 0.358 | 0.766 | 0.705 | 0.595
MonoTQ-XLM-R-Large | 0.599 | 0.491 | 0.457 | 0.627 | 0.339 | 0.737 | 0.700 | 0.575
CometKiwi-22 (InfoXLM) | 0.574 | 0.408 | 0.627 | 0.567 | 0.231 | 0.827 | 0.755 | 0.648
CometKiwi-23 (XLM-R-XL) | 0.637 | 0.546 | 0.635 | 0.616 | 0.338 | 0.860 | 0.789 | 0.703
xCOMET-XL | 0.490 | 0.346 | 0.448 | 0.503 | 0.253 | 0.810 | 0.646 | 0.576
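All scores in Tables 2–5 are segment-level Spearman correlations between model-predicted quality scores and human DA annotations [55]; Pearson and Kendall correlations [56,57] are computed analogously. The snippet below is a minimal sketch of this evaluation step using SciPy; the score arrays are made up for illustration.

from scipy.stats import spearmanr, pearsonr, kendalltau

gold_da   = [72.0, 85.5, 40.0, 91.0, 63.5]   # human Direct Assessment scores (illustrative)
predicted = [0.68, 0.81, 0.35, 0.88, 0.70]   # model-predicted quality scores (illustrative)

rho, _ = spearmanr(predicted, gold_da)
r, _   = pearsonr(predicted, gold_da)
tau, _ = kendalltau(predicted, gold_da)
print(f"Spearman={rho:.3f}  Pearson={r:.3f}  Kendall={tau:.3f}")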
Table 3. Spearman correlation scores of decoder-based methods across selected models. Bold values indicate the highest Spearman scores obtained for each language pair among all the methods. An asterisk (*) marks the best score among zero-shot configurations, while a dagger (†) denotes the best result from instruction fine-tuning. Underlined values represent the highest scores achieved using ALOPE fine-tuning for each language pair. The ALOPE-Best reports the Spearman score of the top-performing transformer layer of each model for each language pair.
Approach | Model | En-Gu | En-Hi | En-Mr | En-Ta | En-Te | Et-En | Ne-En | Si-En
Zero-shot | Llama-2-7B | 0.006 | −0.002 | 0.053 | 0.067 | −0.016 | 0.168 | 0.153 | 0.144
 | OpenChat-7B | 0.249 | 0.254 | 0.183 | 0.358 | 0.145 | 0.571 | 0.448 | 0.417
 | Gemma-2-9B | * 0.321 | * 0.373 | 0.404 | * 0.468 | * 0.177 | 0.695 | 0.560 | * 0.456
 | Llama-3.1-8B | 0.164 | 0.194 | 0.245 | 0.220 | 0.132 | 0.503 | 0.262 | 0.267
 | Llama-3.2-3B | 0.095 | 0.095 | 0.116 | 0.128 | 0.098 | 0.359 | 0.216 | 0.120
 | Aya-Expanse-8B | 0.123 | 0.135 | 0.089 | 0.117 | 0.049 | 0.315 | 0.352 | 0.330
 | Tower+-72B | 0.238 | 0.369 | * 0.422 | 0.254 | 0.142 | * 0.753 | * 0.580 | 0.384
Instruction Fine-Tuned | Llama-2-7B | 0.454 | 0.319 | 0.466 | 0.449 | 0.266 | 0.591 | 0.478 | 0.374
 | OpenChat-7B | 0.513 | 0.423 | † 0.587 | 0.528 | † 0.302 | 0.692 | 0.599 | 0.477
 | Gemma-2-9B | † 0.580 | † 0.471 | 0.533 | † 0.631 | 0.272 | 0.696 | † 0.672 | 0.468
 | Llama-3.1-8B | 0.555 | 0.368 | 0.530 | 0.586 | 0.233 | † 0.728 | 0.618 | † 0.558
 | Llama-3.2-3B | 0.501 | 0.393 | 0.524 | 0.567 | 0.224 | 0.678 | 0.587 | 0.451
 | Aya-Expanse-8B | 0.528 | 0.388 | 0.515 | 0.496 | 0.290 | 0.605 | 0.568 | 0.384
ALOPE-Best (Ours) | Llama-2-7B | 0.567 | 0.414 | 0.609 | 0.525 | 0.356 | 0.751 | 0.606 | 0.573
 | OpenChat-7B | 0.461 | 0.350 | 0.536 | 0.422 | 0.301 | 0.180 | 0.427 | 0.352
 | Gemma-2-9B | 0.387 | 0.339 | 0.426 | 0.328 | 0.278 | 0.471 | 0.433 | 0.346
 | Llama-3.1-8B | 0.594 | 0.477 | 0.625 | 0.567 | 0.388 | 0.744 | 0.652 | 0.547
 | Llama-3.2-3B | 0.606 | 0.479 | 0.636 | 0.610 | 0.373 | 0.751 | 0.682 | 0.567
 | Aya-Expanse-8B | 0.538 | 0.447 | 0.597 | 0.528 | 0.347 | 0.741 | 0.646 | 0.544
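The zero-shot and instruction-fine-tuned rows in Table 3 rely on prompting a decoder LLM to judge a translation without a reference. The template below is a hypothetical example of such a prompt, shown only to illustrate the setup; the exact wording used in these experiments may differ.

# Hypothetical reference-less QE prompt; not the exact template used in the paper.
PROMPT = (
    "You are evaluating a machine translation without a reference.\n"
    "Source ({src_lang}): {src}\n"
    "Translation ({tgt_lang}): {mt}\n"
    "Rate the translation quality with a score between 0 and 100. Score:"
)

def build_prompt(src, mt, src_lang="English", tgt_lang="Gujarati"):
    return PROMPT.format(src_lang=src_lang, src=src, tgt_lang=tgt_lang, mt=mt)

print(build_prompt("The weather is pleasant today.", "આજે હવામાન સુખદ છે."))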
Table 4. Spearman correlation scores across selected transformer layers (TL) using ALOPE fine-tuning. TL (-1), TL (-7), and TL (-16) represent the best-performing transformer layers at different depths for each model. The table compares layer-wise performance across 8 language pairs. The bolded values indicate the highest Spearman score obtained for each language pair. A graphical representation of these results is provided in Appendix E.
TL | Model | En-Gu | En-Hi | En-Mr | En-Ta | En-Te | Et-En | Ne-En | Si-En
TL (-1) | Llama-2-7B | 0.563 | 0.414 | 0.609 | 0.525 | 0.356 | 0.742 | 0.596 | 0.565
 | OpenChat-7B | 0.048 | 0.088 | −0.117 | 0.030 | 0.171 | 0.137 | −0.005 | −0.001
 | Gemma-2-9B | 0.331 | 0.288 | 0.403 | 0.289 | 0.238 | 0.431 | 0.411 | 0.322
 | Llama-3.1-8B | 0.594 | 0.469 | 0.620 | 0.567 | 0.363 | 0.734 | 0.647 | 0.547
 | Llama-3.2-3B | 0.604 | 0.477 | 0.636 | 0.580 | 0.348 | 0.735 | 0.674 | 0.543
 | Aya-Expanse-8B | 0.068 | 0.178 | 0.219 | −0.006 | 0.275 | 0.115 | 0.012 | 0.077
TL (-7) | Llama-2-7B | 0.567 | 0.336 | 0.542 | 0.484 | 0.317 | 0.739 | 0.606 | 0.573
 | OpenChat-7B | 0.034 | 0.108 | −0.069 | 0.129 | 0.185 | 0.144 | 0.128 | 0.057
 | Gemma-2-9B | 0.375 | 0.304 | 0.426 | 0.273 | 0.278 | 0.454 | 0.433 | 0.342
 | Llama-3.1-8B | 0.590 | 0.477 | 0.625 | 0.528 | 0.388 | 0.744 | 0.638 | 0.544
 | Llama-3.2-3B | 0.606 | 0.479 | 0.617 | 0.585 | 0.369 | 0.751 | 0.664 | 0.553
 | Aya-Expanse-8B | 0.538 | 0.447 | 0.597 | 0.528 | 0.347 | 0.741 | 0.646 | 0.544
TL (-16) | Llama-2-7B | 0.540 | 0.381 | 0.585 | 0.482 | 0.308 | 0.751 | 0.580 | 0.569
 | OpenChat-7B | 0.461 | 0.350 | 0.536 | 0.422 | 0.301 | 0.180 | 0.427 | 0.352
 | Gemma-2-9B | 0.387 | 0.339 | 0.425 | 0.328 | 0.134 | 0.471 | 0.422 | 0.346
 | Llama-3.1-8B | 0.558 | 0.453 | 0.602 | 0.523 | 0.350 | 0.737 | 0.652 | 0.513
 | Llama-3.2-3B | 0.557 | 0.459 | 0.597 | 0.547 | 0.338 | 0.745 | 0.682 | 0.567
 | Aya-Expanse-8B | 0.467 | 0.390 | 0.557 | 0.481 | 0.314 | 0.727 | 0.576 | 0.540
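The layer-specific results in Table 4 come from attaching a regression head to the hidden state of one chosen transformer layer rather than only the final layer. The sketch below is an illustrative rewrite of that idea, not the released ALOPE implementation; the last-non-padded-token pooling is an assumption.

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class LayerSpecificQE(nn.Module):
    """Regression head on the hidden state of a selected transformer layer (sketch)."""

    def __init__(self, model_name: str, layer_index: int = -7):
        super().__init__()
        self.backbone = AutoModelForCausalLM.from_pretrained(
            model_name, output_hidden_states=True
        )
        self.layer_index = layer_index
        self.regressor = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        states = out.hidden_states[self.layer_index]      # (batch, seq_len, hidden)
        last_idx = attention_mask.sum(dim=1) - 1          # last non-padded position per example
        pooled = states[torch.arange(states.size(0)), last_idx]
        return self.regressor(pooled).squeeze(-1)         # one quality score per segment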
Table 5. Spearman correlation scores obtained with multi-layer strategies (multi-layer regression and dynamic weighting) of ALOPE, compared with the results obtained for the instruction-fine-tuned (IFT) model and the best results of layer-specific ALOPE across all language pairs. All experiments were conducted with the Llama-3.2 model. Bolded value indicates the highest Spearman score obtained for each language pair.
Language Pair | Multi-Layer Regression (−1 to −7) | Multi-Layer Regression (−8 to −11) | Multi-Layer Regression (−12 to −16) | Dynamic Weighting (−1 to −7) | Dynamic Weighting (−8 to −11) | Dynamic Weighting (−12 to −16) | IFT | Best of Layer-Specific ALOPE
En–Gu | 0.569 | 0.528 | 0.565 | 0.532 | 0.565 | 0.563 | 0.501 | 0.606
En–Hi | 0.458 | 0.440 | 0.416 | 0.465 | 0.382 | 0.462 | 0.393 | 0.479
En–Mr | 0.597 | 0.596 | 0.566 | 0.603 | 0.587 | 0.599 | 0.524 | 0.636
En–Ta | 0.496 | 0.524 | 0.534 | 0.532 | 0.525 | 0.492 | 0.567 | 0.610
En–Te | 0.351 | 0.303 | 0.347 | 0.339 | 0.325 | 0.355 | 0.224 | 0.373
Et–En | 0.654 | 0.668 | 0.682 | 0.679 | 0.681 | 0.673 | 0.678 | 0.751
Ne–En | 0.550 | 0.597 | 0.575 | 0.566 | 0.579 | 0.606 | 0.587 | 0.682
Si–En | 0.504 | 0.475 | 0.523 | 0.493 | 0.468 | 0.482 | 0.451 | 0.567
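The dynamic-weighting columns in Table 5 combine pooled representations from a group of layers with learned weights before a single regression head. The module below is a sketch of that idea under stated assumptions (softmax-normalized per-layer weights, last-token pooling, a fixed layer group); it is not the paper's exact implementation.

import torch
import torch.nn as nn

class DynamicLayerWeighting(nn.Module):
    """Learned softmax weights over a group of layers' pooled states (illustrative sketch)."""

    def __init__(self, hidden_size: int, layer_ids=(-1, -2, -3, -4, -5, -6, -7)):
        super().__init__()
        self.layer_ids = layer_ids
        self.layer_logits = nn.Parameter(torch.zeros(len(layer_ids)))  # one logit per layer
        self.regressor = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of (batch, seq_len, hidden) tensors from the backbone
        batch = attention_mask.size(0)
        last_idx = attention_mask.sum(dim=1) - 1
        pooled = torch.stack(
            [hidden_states[i][torch.arange(batch), last_idx] for i in self.layer_ids],
            dim=1,
        )                                                   # (batch, n_layers, hidden)
        weights = torch.softmax(self.layer_logits, dim=0)   # (n_layers,)
        combined = (weights.unsqueeze(0).unsqueeze(-1) * pooled).sum(dim=1)
        return self.regressor(combined).squeeze(-1)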
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
