Article

Evaluating Translation Quality: A Qualitative and Quantitative Assessment of Machine and LLM-Driven Arabic–English Translations

by
Tawffeek A. S. Mohammed
Department of Foreign Languages, University of the Western Cape, Bellville 7535, South Africa
Information 2025, 16(6), 440; https://doi.org/10.3390/info16060440
Submission received: 11 April 2025 / Revised: 18 May 2025 / Accepted: 21 May 2025 / Published: 26 May 2025
(This article belongs to the Special Issue Machine Translation for Conquering Language Barriers)

Abstract

This study investigates translation quality between Arabic and English, comparing traditional rule-based machine translation systems, modern neural machine translation tools such as Google Translate, and large language models like ChatGPT. The research adopts both qualitative and quantitative approaches to assess the efficacy, accuracy, and contextual fidelity of translations. It particularly focuses on the translation of idiomatic and colloquial expressions as well as technical texts and genres. Using well-established evaluation metrics such as bilingual evaluation understudy (BLEU), translation error rate (TER), and character n-gram F-score (chrF), alongside the qualitative translation quality assessment model proposed by Juliane House, this study investigates the linguistic and semantic nuances of translations generated by different systems. This study concludes that although metric-based evaluations like BLEU and TER are useful, they often fail to fully capture the semantic and contextual accuracy of idiomatic and expressive translations. Large language models, particularly ChatGPT, show promise in addressing this gap by offering more coherent and culturally aligned translations. However, both systems demonstrate limitations that necessitate human post-editing for high-stakes content. The findings support a hybrid approach, combining machine translation tools with human oversight for optimal translation quality, especially in languages with complex morphology and culturally embedded expressions like Arabic.

1. Introduction

Machine translation (MT) systems have witnessed tremendous changes in the last few years, particularly with the advent of generative AI. A recent report revealed that a translation program similar to Google Translate (GT) can be used to translate Akkadian cuneiform into English [1]. It has successfully facilitated the decoding and translation of thousands of previously unread digitized tablets, although the report acknowledged that the accuracy of the translations remains a subject of debate. Like GT, Baidu Translate, and many other translation engines, the program is based on neural machine translation. Unlike rule-based systems that translate word-for-word, it converts words into numerical representations and processes them with neural networks, which produce an output sentence in the target language that reads more accurately and naturally [2,3]. Neural machine translation (NMT), a core task in natural language processing (NLP), has advanced significantly but still faces challenges such as handling idiomatic expressions, translating low-resource languages, managing uncommon words, and maintaining coherence [4].
The advent of large language models (LLMs) like ChatGPT has further driven progress in MT. LLMs are capable of zero-shot translation, achieving performances comparable to supervised systems, and are applicable to a wide range of tasks beyond translation [5,6]. These advanced AI systems are designed to process and generate human-like texts. They have demonstrated remarkable capabilities across various NLP tasks. These include answering questions, crafting narratives, applying logical reasoning, debugging code, performing MT, and more [6,7]. LLMs can interpret prompts, provide detailed responses, engage in follow-up questions, acknowledge errors, challenge flawed assumptions, and decline inappropriate requests [5].
In fact, MT models and systems can benefit from LLMs, and many translation management systems have already begun integrating generative AI into their workflows, with impressive levels of output. However, human insight and supervision are still needed to ensure quality, accurate translations. The quality of these systems and models is language-specific: MT output can still be literal and awkward, and LLMs are sometimes prone to malfunctions and can make grave errors [8,9]. Against this background, this study is concerned with the assessment of MT and LLM (particularly ChatGPT) translations of texts from Arabic into English and vice versa. This study focuses mainly on comparing the capabilities of LLMs with those of widely used systems like GT. Undoubtedly, MT systems and LLMs have made the translation of some genres much easier. At the same time, they remain underdeveloped and untrained for other genres and for low-resource languages [10].
This study assesses the translation quality of texts produced by early rule-based MT systems and by advanced MT systems. In addition, it compares the quality of machine-translated texts and their LLM-generated counterparts against a reference translation, drawing on evidence from well-known translation metrics. Moreover, a qualitative assessment of translation quality in line with established models such as House’s model [11,12,13] is attempted. This study also tests whether the differences in the output of these systems are statistically significant by means of the following null hypotheses:
  • There is no statistically significant difference between the performance of the earlier rule-based MT systems and the advanced MT systems.
  • There is no statistically significant difference between the output of GT and ChatGPT in terms of bilingual evaluation understudy (BLEU) and translation error rate (TER) scores.

2. Literature Review

A growing body of recent research highlights both the significance and complexity of assessing MT and outputs generated by LLMs through metrical analysis. In general, these studies reveal a marked progression from conventional lexical-based evaluation metrics toward more sophisticated neural-based approaches, thereby emphasizing the necessity for comprehensive and multidimensional evaluation frameworks. Two distinct studies by Munková et al. [14,15] extensively investigated MT quality assessment across different language pairs and contexts. Munková et al. [14], for instance, examine the effectiveness of various automatic evaluation metrics for MT quality between Slovak and English. The study uses multiple metrics, including BLEU, precision, recall, word error rate (WER), and position-independent error rate (PER), to evaluate the output, and it underscores the significance of using multiple measures to comprehensively assess MT quality. The findings suggest that individual metrics provide distinct insights into translation accuracy and error rates and that a multifaceted evaluation approach is necessary for accurate MT quality assessment. The second study by Munková et al. [15] examines the effectiveness of automatic evaluation metrics in assessing machine-translated and post-edited MT sentence-level translations from Slovak to German. The study compares various accuracy and error-rate metrics such as BLEU, precision, recall, WER, and PER. Significant quality improvements were documented in post-edited output compared to raw MT, with some sentences showing higher accuracy scores and lower error rates.
Juraska et al. [16] further advanced MT evaluation by introducing Google’s MetricX-23 framework. The study deals with three submissions: MetricX-23, MetricX-23-b, and MetricX-23-c. These three submissions employ different configurations of pretrained language models (mT5 and PaLM 2) and fine-tuning data (Direct Assessment and MQM). The study highlights the importance of synthetic data, two-stage fine-tuning, and model size in improving evaluation metrics. The findings demonstrate a significant correlation between MetricX-23 and human ratings; that is, this metric has great potential as a robust MT evaluation tool. Supporting this shift, López-Caro [17] analyzed the transition from traditional metrics like BLEU and metric for evaluation of translation with explicit ordering (METEOR) to neural-based metrics such as crosslingual optimized metric for evaluation of translation (COMET) and bilingual evaluation understudy with representations from transformers (BLEURT). The study demonstrated the superior correlation between neural-based metrics and human judgment, especially in evaluating complex and domain-specific texts, advocating their broader adoption for future assessments.
Perrella et al. [18] critique traditional evaluation methods, which rely mainly on correlation with human judgment, as lacking actionable insights. The study proposes alternative measures using precision, recall, and F-score and argues that this novel evaluation framework can improve the interpretability of MT metrics. The framework assesses two evaluation scenarios, namely data filtering and translation re-ranking, and thus aims to align metric scores more closely with practical use cases. In addition, the study highlights the limitations of manually curated datasets and the potential for metrics like COMET and MetricX-23 to outperform others under specific conditions. Lee et al. [19] categorized existing metrics into lexical-based, embedding-based, and supervised types, underscoring the limitations of traditional metrics like BLEU, particularly their failure to adequately capture semantic similarity. They recommend developing universal, language-agnostic metrics to ensure robust evaluations across different languages, addressing the current disparity in evaluation capabilities.
From a practical standpoint, van Toledo et al. [20] investigate the quality of Dutch translations produced by Google, Azure, and IBM MT systems. The study enriches MT evaluation by integrating readability metrics such as T-Scan, revealing GT’s superiority in readability and coherence over other MT systems. Their findings suggest that readability metrics may significantly complement traditional quality assessment methods. Lastly, Munková et al. [21] investigate the use of automated evaluation metrics in training future translators in an online environment. The study proposes OSTPERE, an online system for translation, post-editing, and assessment, which enables students to collaboratively practice and evaluate translations. Residuals of accuracy and error-rate metrics (e.g., BLEU, WER, TER) were used to identify translation errors and assess student performance. The study concludes that automated metrics effectively support formative assessment: they highlight critical, major, and minor errors in post-edited machine translations and can thus enhance teaching efficiency and translation competence.
In short, these studies illustrate a clear trend towards integrating neural-based and multifaceted metric evaluations to promote more accurate, practical translation quality assessments in both research and educational contexts. While these studies have addressed evaluations of translation across various directionalities, to date, no prior research has comprehensively employed these diverse metrics to assess translation quality between Arabic and English using MT systems and LLMs. This study represents an original endeavor aimed at filling this gap and contributing unique insights specifically tailored to Arabic–English translation evaluation.

3. Theoretical and Conceptual Framework

Translation quality has been given considerable attention in translation studies. Holmes’s map of translation studies (Figure 1) sketched the scope and structure of the field, in which translation criticism appears as a branch of applied translation studies [22].
Qualitative translation quality assessment (TQA) methods, such as House’s model, adopt a functional, pragmatic approach to assessing translations. House’s model in its various versions [11,12,13] considers the text as a unit of translation, assessing translation quality based on the concept of textual equivalence. That is, the original text’s meaning, register, and stylistic features must be preserved in the translation. House’s model distinguishes between two types of translations, namely overt and covert translations. A translation is overt when the cultural and contextual elements of the source text are highlighted, and it is covert when the text is adapted to the target audience’s cultural norms [13].
House’s model provides a systematic framework for analyzing translations beyond numerical scores, ensuring functional equivalence within the translation’s specific communicative context. Aspects such as field, tenor, and mode are duly considered in this analysis.
Metrics-based TQA has also been given considerable attention in the translation industry. Like descriptive TQA models, it also aims to evaluate a translated text based on linguistic, semantic, contextual, and technical standards. In metrics-based TQA, machine translations are often benchmarked against human reference translations. Common metrics for TQA include BLEU, which measures n-gram overlap between the candidate and reference translations. BLEU may offer consistency, but it is often criticized for overlooking semantic adequacy and fluency [23]. METEOR, on the other hand, incorporates stemming, synonyms, and word order, and thus it provides better semantic alignment [24]. TER quantifies the number of edits needed to match the reference, focusing on effort-based evaluation [25]. Advanced methods such as COMET and bidirectional encoder representations from transformers (BERT) leverage neural networks and contextual embeddings to capture semantic and contextual nuances [26,27].
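In their standard formulations [23,25], these two word-level metrics can be written as follows, where p_n is the modified n-gram precision, w_n the n-gram weight, and c and r the candidate and reference lengths; the TER edit operations include insertions, deletions, substitutions, and shifts:
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_{n} \log p_{n} \right), \qquad \mathrm{BP} = \min\left(1,\ e^{\,1 - r/c}\right)
\mathrm{TER} = \frac{\text{number of edits}}{\text{average number of reference words}}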
Hence, metric models are essential for efficiency and objectivity, particularly in MT evaluations. However, the choice of metric depends on the language pair, translation objectives, and the type of text. Hybrid or neural-based metrics are recommended for comprehensive evaluation. Hybridity may imply the use of more than one metric or human feedback.
Unlike quantitative metrics, House’s model allows for a deeper understanding of cultural, stylistic, and functional fidelity, making it particularly useful for assessing translations where nuance, tone, and cultural alignment are critical [12]. Qualitative methods like House’s model are often paired with quantitative metrics to provide a comprehensive evaluation, addressing not only measurable accuracy but also the contextual and interpretive quality of translations.

4. Materials and Methods

This qualitative and quantitative study aims to analyze the TQA of some Arabic texts that have been translated into English by rule-based machine translation systems as well as by neural-based translation systems, particularly GT. To attempt both quantitative and qualitative analyses, two main tasks were used in this study. Task 1 includes 14 short texts (sentences) extracted from online posts on humorous translations that have been in circulation since 2004, nearly two years before the introduction of GT. The selection of these texts is motivated by the fact that there is evidence of MT output for these sentences dating back to 2004, when MT systems were largely underdeveloped. The same texts have been freshly translated by the current version of GT, which is based on neural systems, as well as by a recent version of ChatGPT. Task 2 involves a technical text sourced from Wikipedia. All the translations have been compared with reference translations for the sake of quality assessment.

Software

The first software that is used in this study is GT, which was established in 2006. The software was accessed via its public web interface (https://translate.google.com) in March 2024. GT is a leading machine translation tool supporting 133 languages, with 24 added in 2022 [28]. Its accuracy varies by language pair and content, achieving up to 94% accuracy in some cases [29]. A significant advancement occurred in 2016 with the adoption of NMT, which reduced translation errors by over 60% for major language pairs [30].
The second system is ChatGPT, specifically ChatGPT-4 (March 2024 release), accessed through OpenAI’s ChatGPT web platform (https://chat.openai.com). ChatGPT is a generative AI LLM that demonstrates significant strengths in translation tasks, leveraging its transformer-based architecture and fine-tuning via reinforcement learning from human feedback (RLHF) to excel in nuanced and contextually complex translations. Its zero-shot and few-shot capabilities enable effective translation for low-resource language pairs, surpassing traditional MT systems that rely heavily on parallel corpora [31]. Additionally, ChatGPT maintains contextual coherence, handles idiomatic expressions correctly, and adapts to user feedback for improved quality, making it ideal for domain-specific translations [32]. Its versatility across tasks like summarization and multilingual content generation further enhances its usefulness [33]. Performance benchmarks indicate competitive results, particularly in qualitative measures like tone and naturalness, positioning ChatGPT as a robust alternative to traditional translation systems [34].
To evaluate the performance of ChatGPT and GT, this study employs various metrics. The first is the BLEU metric, a widely used tool in MT for assessing translation quality through n-gram precision [23]. Alternative metrics such as TER, which focus on edit distance, are often argued to outperform BLEU in assessing nuanced translations involving morphologically rich languages like Arabic.
Another metric that may perform better with colloquial and idiomatic texts is chrF, because it evaluates similarity at the character level and thus it can ignore slight changes in phrasing or word forms. It is also better suited for colloquial language and idiomatic phrases, where word-level overlap may not capture nuanced similarity [35]. In any case, even perfect translations may be scored poorly by these metrics [36], and thus human oversight is a must.
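To illustrate how such sentence-level scores can be obtained in practice, the following minimal Python sketch uses the sacrebleu library on segment 1 from Appendix A; the paper does not specify which implementation was used, so this tooling choice is an assumption.

  # Minimal sketch: sentence-level BLEU, TER, and chrF for one segment
  # (segment 1 from Appendix A); sacrebleu is an assumed implementation,
  # not necessarily the one used in this study.
  import sacrebleu

  hypothesis = "Solve my sky"        # GT output for segment 1
  references = ["Get off my back"]   # reference translation for segment 1

  bleu = sacrebleu.sentence_bleu(hypothesis, references)
  ter = sacrebleu.sentence_ter(hypothesis, references)
  chrf = sacrebleu.sentence_chrf(hypothesis, references)

  print(f"BLEU={bleu.score:.1f}  TER={ter.score:.1f}  chrF={chrf.score:.1f}")

Note that sacrebleu reports BLEU, chrF, and TER on a 0–100 scale, whereas the TER values discussed in Section 5 are expressed as proportions.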

5. Data Analysis

To assess the translation of older rule-based MT systems and recent neural-based systems like GT, this study compared the output of 14 short Arabic idiomatic expressions. The translation quality was assessed using the chrF metric along with a qualitative evaluation of the translation. ChrF works at the character level, making it ideal for morphologically rich languages like Arabic. It evaluates character-level similarity to capture minor structural differences. In order to mitigate potential distortion of Arabic fonts, this study employs the International Phonetic Alphabet (IPA) system for linguistic representation. The original Arabic source texts and their corresponding translations are provided in the appendices.
Table 1 lists the Arabic expressions, their two machine translations, as well as a reference translation.
The chrF scores for both the earlier MT and GT outputs are presented in Table 2. As stated earlier, chrF is a character-level metric that evaluates translation quality based on the overlap of character n-grams between the candidate and reference translations. This approach is especially useful for morphologically rich languages like Arabic. It also offers greater sensitivity to minor lexical or inflectional variations than word-based metrics such as BLEU. Given the fragmented, idiomatic, and colloquial nature of the short Arabic expressions examined in this section, chrF was selected as the most appropriate quantitative measure, as it better captures nuanced similarities that surface-level word matches may overlook. However, this study acknowledges that chrF scores alone may not fully capture the semantic or cultural fidelity of the translations. To address this, this study’s analysis extends beyond the numeric scores to include qualitative evaluations that consider idiomatic equivalence, contextual meaning, and the pragmatic appropriateness of the translations in each segment.
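Formally, chrF is the character n-gram F-score defined in [35] as
\mathrm{chrF}_{\beta} = (1 + \beta^{2}) \cdot \frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^{2} \cdot \mathrm{chrP} + \mathrm{chrR}}
where chrP and chrR are character n-gram precision and recall averaged over the n-gram orders (typically n = 1–6), and β = 2 in the common chrF2 variant, weighting recall twice as heavily as precision.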
The scores listed in Table 2 reflect the character-level similarity between the translations and the reference, offering insights into how well each system captures nuances. Hence, in segment 1, both translations are literal and fail to capture the idiomatic phrase. The small chrF scores reflect slight character overlap (e.g., “my sky”), but neither conveys the intended meaning. In segment 2, both translations are literal, resulting in low chrF scores. GT’s slightly higher score is due to minor improvements in phrasing at the character level, but the idiomatic meaning is still missed. In segment 3, both translations fail to capture the colloquial meaning. GT’s higher score reflects better character alignment, but both remain literal and do not convey the intended meaning. In segment 4, GT significantly outperforms earlier MT, with a much closer translation. The chrF score for GT reflects a substantial improvement in capturing the intended meaning at the character level. In segment 5, both scores are low, indicating neither translation captures the imperative tone of the expression. Earlier MT scores were slightly higher due to character overlap, but both miss the intended directive nature of the phrase. In segment 6, both translations are close but fail to fully capture the idiomatic tone. Earlier MT scores are slightly higher due to better character overlap, but the meaning conveyed by GT is more aligned. In segment 7, GT significantly outperforms earlier MT, capturing the intended meaning better. Earlier MT misinterprets maːlek as “money”, resulting in poor alignment. In segment 8, both translations are literal and fail to capture the idiomatic expression. GT’s higher score reflects better character-level alignment. In segment 9, GT captures the meaning perfectly, resulting in a significantly higher score. Earlier MT’s literal translation leads to a nonsensical result. In segment 10, GT provides a semantically accurate translation, aligning well with the reference. Earlier MT misinterprets qaːhira as “Cairo”, producing a nonsensical result. In segment 11, GT provides a more contextually appropriate translation. Earlier MT’s translation is literal and misses the idiomatic nuance. In segment 12, GT captures the emotional sentiment better, providing a more meaningful translation. Earlier MT fails to convey the intended sentiment. In segment 13, GT aligns closely with the reference, capturing the intended meaning. The literal translation of the earlier MT system distorts the phrase. Finally, in segment 14, both translations struggle, but GT performs slightly better. Neither captures the cultural nuance of getting engaged/married.
GT consistently provides better translations at the character level, aligning more closely with the references in both meaning and form. However, both systems exhibit significant limitations in capturing idiomatic and cultural nuances, which require contextual understanding beyond direct translation.
To confirm whether a statistically significant relation exists in the performance of the two systems, a normality test using the Shapiro–Wilk test was conducted as shown in Table 3:
The Shapiro–Wilk test confirms that neither dataset follows a normal distribution. Since the normality assumption is violated, a Wilcoxon signed-rank test (non-parametric alternative to the paired t-test) should be used to assess paired differences. (See Table 4).
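This two-step procedure can be sketched as follows. The study’s own analyses were run in R, as noted in the Data Availability Statement; the Python equivalent below uses placeholder vectors rather than the actual per-segment chrF scores reported in Table 2.

  # Sketch of the Task 1 significance testing; the chrF vectors below are
  # placeholders, not the values reported in Table 2.
  from scipy.stats import shapiro, wilcoxon

  earlier_mt_chrf = [12.0, 15.5, 10.2, 20.1, 18.7, 25.3, 9.8,
                     14.6, 11.0, 8.5, 16.2, 13.4, 17.9, 12.8]   # 14 segments (placeholders)
  google_chrf = [14.1, 18.0, 16.5, 55.2, 15.9, 22.4, 48.7,
                 19.3, 62.5, 58.1, 30.6, 28.4, 45.0, 17.2]      # 14 segments (placeholders)

  # Shapiro-Wilk normality check for each system's scores (cf. Table 3)
  print(shapiro(earlier_mt_chrf), shapiro(google_chrf))

  # Paired, non-parametric comparison of the two systems (cf. Table 4)
  print(wilcoxon(earlier_mt_chrf, google_chrf))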
Results of the Wilcoxon signed-rank test show that the test statistic is 19.0 and the p-value is 0.035 (p < 0.05). That is, the Wilcoxon signed-rank test indicates a statistically significant difference between the chrF scores of the earlier MT system and GT renderings (p < 0.05), suggesting that the two systems differ in their performance. The null hypothesis that there is no statistically significant difference (at p < 0.05) between the performance of the earlier MT system and GT in translating the above idiomatic expressions from Arabic into English can therefore be rejected. The chrF scores reveal GT’s advantage in certain segments, especially those involving short, direct phrases. However, both systems struggle significantly with idiomatic and colloquial expressions, which require more than character-level alignment to capture meaning accurately.
While the first section of analysis in this study dealt with the quantitative metric-based and qualitative analyses of translation quality of some machine translations of Arabic allegorical, proverbial, and idiomatic expressions and clichés, the following section is devoted to the translation quality assessment of longer texts. The neural-based machine translation using GT and the LLM translation with ChatGPT of a technical text are examined. Both translations are analyzed quantitatively and qualitatively by comparing them with a human reference translation. The text, its GT rendering, its ChatGPT rendering, and the reference translation are given in Appendix B.
To assess the quality of the Arabic translations provided by GT and ChatGPT against the reference translation, BLEU is used. In addition, a qualitative analysis of the translation quality is provided as well. BLEU is a metric that evaluates the overlap between the candidate and reference translations, considering n-gram precision. The BLEU scores for both GT and ChatGPT translations, sentence by sentence, are given in Table 5.
In sentence 1, GT adheres closely to the reference, with minimal variation. However, ChatGPT uses ʔanˈðˤɪma watadafːuˈlaːt ʕaˈmal ‘systems and workflows’ instead of alˈʔanðˤɪma wˈsajr alˈʕamal ‘the systems and workflow’, which is slightly more verbose but maintains the meaning. The GT scores are higher due to closer n-gram matches, and ChatGPT scores are slightly lower due to different word choices and order. Similarly, in sentence 2, GT is concise and closer to the reference. ChatGPT, on the other hand, uses ʔad͡ʒˈzaːʔ min nuˈsˤuːs ‘parts of texts’ which diverges more from the reference but still conveys the meaning accurately. The translation is also semantically correct. Likewise, in sentence 3, GT is more aligned with the reference and ChatGPT uses haːðaː alʔusˈluːb ‘this method/style’ instead of haːðiːhi alʕamaˈlijja ‘this process’, which is a better stylistic choice but changes the nuance slightly. The scenario is not too different in sentence 4, where GT matches the reference more closely, whereas ChatGPT uses anːuˈsˤuːs almutagajˈjira ‘the variable texts’ with a comma for clarity. This is a minor stylistic change, which returns lower scores but is still accurate. In sentence 5, GT matches the reference translation more closely while ChatGPT uses tuˈtiːħ ‘enable’ instead of tasmaħ ‘allow’, which has a slight semantic shift. Finally, in sentence 6, GT has a closer match with the reference while ChatGPT includes commas and uses alˈhadar ‘the waste’ instead of annifaːˈjaːt ‘the garbage/trash’, slightly changing the tone.
To quantify how far the GT and ChatGPT translations structurally deviate from the reference, TER was used, which measures the number of edits needed to transform the candidate translations into the reference translations. As shown in Table 5 above, the TER results for GT translations were [0.222, 0.450, 0.429, 0.167, 0.391, 0.435] for the six sentences, respectively, with an average TER score of 0.349. As for the ChatGPT renderings, the sentence TER scores are [0.556, 0.650, 0.571, 0.222, 0.609, 0.783], with an average TER score of 0.565. Figure 2 below provides the BLEU vs. TER scores for the GT and ChatGPT renderings.
In TER, lower scores indicate closer alignment with the reference. This shows that GT achieved better TER scores overall, with smaller deviations from the reference. ChatGPT, on the other hand, demonstrated higher TER scores, indicating more edits are needed to align the translations with the reference.
To statistically test the significance of the differences between BLEU and TER scores, normality is first assessed using the Shapiro–Wilk test for both BLEU and TER scores. (See Table 6).
Since normality is not rejected, the data are approximately normal. However, to remain conservative with small sample sizes, the Wilcoxon signed-rank test was used to assess paired differences between BLEU and TER scores. (See Table 7).
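A parallel sketch for this comparison is shown below; the TER vectors are those reported above, while the sentence-level BLEU vectors are placeholders standing in for the values in Table 5.

  # Sketch of the Task 2 BLEU-vs-TER comparison; the BLEU vectors are placeholders.
  from scipy.stats import shapiro, wilcoxon

  gt_ter = [0.222, 0.450, 0.429, 0.167, 0.391, 0.435]
  chatgpt_ter = [0.556, 0.650, 0.571, 0.222, 0.609, 0.783]
  gt_bleu = [0.60, 0.35, 0.40, 0.70, 0.45, 0.38]        # placeholders for Table 5 values
  chatgpt_bleu = [0.30, 0.25, 0.33, 0.55, 0.28, 0.15]   # placeholders for Table 5 values

  print(sum(gt_ter) / 6, sum(chatgpt_ter) / 6)          # average TER: ~0.349 and ~0.565

  # Normality checks (cf. Table 6) and paired Wilcoxon tests (cf. Table 7) per system
  for bleu, ter in [(gt_bleu, gt_ter), (chatgpt_bleu, chatgpt_ter)]:
      print(shapiro(bleu), shapiro(ter))
      print(wilcoxon(bleu, ter))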
The results show that the p-value for GT BLEU vs. TER is 0.437 (not significant). The same is applicable to ChatGPT BLEU vs. TER, as the p-value = 0.093 (not significant, though closer to the threshold).
The differences between BLEU and TER scores for both GT and ChatGPT translations are not statistically significant. This suggests that the variations between BLEU and TER metrics are not large enough to warrant meaningful differences in performance evaluations for these translations. Hence, the second null hypothesis, which asserts that there is no statistically significant difference between the GT and ChatGPT translations of the text, is accepted. This indicates that the stylistic differences in the ChatGPT translation do not necessarily make it poor or non-communicative. (See Table 8).

6. Qualitative Analysis

In addition to metric-based quantitative analysis of the text under investigation, this study attempts a qualitative TQA based on House’s model [11,12,13]. A register analysis of the GT and ChatGPT translations of the text shows that both reflect the same tone, level of formality, and technical precision of the source text to a great extent. However, a few inconsistencies are noted in phrases like wataʃmal haːðiː alˈʔanðˤima alˈqaːʔima ʕalaː alˈmantˤiq ‘these logic-based systems include’ in GT and taʃmal haːðiː alˈʔanðˤima alˈʔanðˤima alˈqaːʔima ʕalaː alˈmantˤiq ‘these systems include the systems based on logic’ in ChatGPT. Both clauses are technically correct but lack a natural tone in technical contexts. An alternative translation could be taʃmal haːðiː alˈʔanðˤima alˈʔanðˤima allatiː taˈquːm ʕalaː alˈmantˤiq ‘these systems include the systems that are based on logic’.
Similarly, GT uses correct terminology but lacks sophistication. For example, anːuˈsˤuːs aʃʃarˈtˤijja wa anːuˈsˤuːs almutagajˈjira ‘conditional texts and variable texts’ appears slightly repetitive and unrefined, and the same applies to the ChatGPT version, even though the latter tried to use some punctuation marks (,) to show that the terms are different. A more appropriate translation may consider more contextual information, as shown in Table 9.
The genre remains constant to a great extent in the two translations. The text is technical, and the function of this text includes ideational clauses of being and doing. In both translations, the lexical items are technical and concise and in line with the conventions of Arabic professional discourse. However, the repetition of taqˈliːl ‘reduction or minimization’ in segment 5 could be streamlined by restructuring the sentence. In addition, lists need to be formatted for clarity, and transitions between ideas could be smoother, which could be achieved by adding some expressions in Arabic, such as maː ˈjaliː ‘What follows’, ataːˈliː ‘the following’, and alˈʔaːtiː ‘the following’.
As a technical text, the tenor indicates that the author’s personal stance is formal. As for the mode, the text is written to be read. Even though both translations are grammatically and syntactically acceptable, the translation is primarily overt. Overt errors in House’s model are direct linguistic errors, including grammar, lexis, syntax, or textual omissions/additions. In GT, grammatical structures are mostly accurate, but the translation exhibits minor awkwardness in phrasing. Lexical issues are few in both translations. Some lexical choices lack precision, e.g., anːuˈsˤuːs aʃʃarˈtˤijja ‘the conditional texts’ is correct but does not differentiate enough from anːuˈsˤuːs almutagajˈjira ‘the variable texts’ (both terms feel redundant without nuanced distinction). The use of proper punctuation, as is the case in the ChatGPT translation (i.e., anːuˈsˤuːs aʃʃarˈtˤijja, wa anːuˈsˤuːs almutagajˈjira), ‘the conditional texts, and the variable texts’ ensures clarity and logical segmentation.
As for textual additions or omissions, few instances are found in both translations. albaˈriːd/aʃʃaħn ‘mail/shipping’ is accurate but could have been localized better with ʔirsaːl albaˈriːd ʔaw aʃʃaħn ‘sending the mail or the shipment’ for fluidity. While there are no significant omissions or additions, some subtle rephrasing enhances clarity: wataʃmal alfaːwaːʔid alʔiḍaːfiːja maː ˈjalɪː: tawfiːr alwaqt wa alˈmaːl bisabab taqliːl attaʕaːmul maʕ alwaraq ‘additional benefits include the following: saving time and money by reducing paper handling’ is more readable and stylistically appropriate.
The use of alʔaxˈtˤaːʔ albaʃaˈrijja ‘human errors’ instead of alˈxatˤaʔ alˈbaʃariː ‘the human error’ in the ChatGPT translation is not wrong but shifts the meaning slightly to imply multiple errors rather than the concept of human error. Similarly, alˈhaːtɪf ‘the phone’ in GT in sentence 6 should be almukaːlɑˈmaːt alhaːtɪˈfijja ‘the phone calls’ for clarity and parallelism with alfaːkˈsaːt ‘the faxes’. Similarly, alˈhadr ‘the waste’ is more ambiguous than an.ni.faːˈjaːt ‘trash/garbage’ and could create ambiguity in ChatGPT translation.
In GT, some phrasing feels overtly literal, e.g., ʔinxɪˈfaːdˤ attaʕaːˈmul maʕ alˈwaraq ‘a decrease in handling paper’ translates accurately but does not fully capture the nuanced benefit implied in the source text. In addition, the translation lacks the subtlety of professional Arabic texts, especially in technical and formal contexts. ChatGPT translation is somewhat more localized. For example, taqliːl attaʕaːˈmul maʕ alˈwaraq ‘reducing paper handling’ subtly shifts the emphasis in a way that feels more natural and professional. On the other hand, more idiomatic phrasing is needed in both translations; wajutam ʔistixˈdaːm haːðihi alʕamalijˈjaː biʃaklin mutazaːjid ‘this process is increasingly being used’, is smoother and aligns with Arabic formal writing norms better than juːstaχdam haːðaː alʔusˈluːb biʃaklin mutazaːjid ‘this style is increasingly used’ from ChatGPT and tustaxdam haːðiː alʕamalijˈjaː biʃaklin mutazaːjid ‘this process is increasingly used’ from GT. In addition, alʔusˈluːb ‘the style’ is less precise than alʕamalijˈja ‘the process’ because the text focuses on the automation process rather than a general method or style.
As for cohesion, textual cohesion and flow are somewhat lacking in both translations. Sentence divisions occasionally feel disconnected, e.g., taʃmal haːðiː alʔanðˤima alqaːʔima ʕalaː alˈmantˤiq walˈlatiː tastaʕmil ʔadʒˈzaːʔ min anˈnˤaṣˤ ‘these systems include those based on logic and that use parts of the text’ has an awkward structure and the repetition of phrases like taqliːl ʔidˈxɑːl alˌbajaːnaːt ‘reducing data entry’ without slight variation impacts the fluency of the text. Smooth transitions and appropriate use of conjunctions make translations easier to read.
Even though the metrics-based data may imply that GT is far better than ChatGPT, the qualitative analysis shows that ChatGPT translations may outperform GT in some respects, including enhanced lexico-grammatical precision, greater textual cohesion and flow, and a more professional stylistic register. However, both systems could still improve by refining phrasing for enhanced readability and tone.

7. Discussion

This study has attempted quantitative metric-based and qualitative translation assessments of texts translated between Arabic and English by early MT systems, GT, and ChatGPT. The findings of this study have shown that early MT systems (i.e., rule-based MT systems) as well as neural-based systems still encounter many problems in the translation of idioms, proverbs, clichés, and colloquial expressions. This finding is consistent with the findings of Alzeebaree [37], who concludes that GT, which is more advanced than most earlier MT systems, fails to render culture-specific aspects of most idioms. In addition, GT renders some collocations literally, and the results are non-standard collocations in the target language. This finding is in line with the findings of Abdullah Naeem’s study [38], which show that the translated collocations are not adapted to the beliefs and experiences of the target audience. In addition, the findings of this study show that GT generally fails to achieve adequate translations of complex colloquial and informal expressions, a finding confirmed in other studies [39]. The performance of other MT systems is similar to that of GT. Findings from other studies show that Microsoft Bing and other systems also encounter problems in the rendering of idiomatic expressions [40,41]. The qualitative analysis of Arabic–English translations shows that LLMs, and in this case ChatGPT, work well in the translation of these expressions, outperforming earlier MT systems and GT in terms of accuracy, fluency, and cultural sensitivity [42].
Furthermore, the findings of this study are in conformity with those of other studies that tackled MT in various directionalities, between English and other languages such as French, Swedish, Farsi, and Chinese. Some weaknesses, irregularities, and limitations in MT in general are reported in many studies in the literature [43,44,45,46,47,48].
Another important finding of this study is that while MT metrics are important in the assessment of translation quality, their scores might be misleading. Findings from the translation of Task 2 show that GT translations are consistently closer to the reference translation, both in structure and terminology. ChatGPT’s translations are accurate but generally deviate stylistically, introducing changes that lower fidelity to the reference while ultimately reading as more professional. This finding confirms those of other studies that tackled various language pairs, including [19,49,50]. In general, these studies and the present one highlight that a higher BLEU score does not always correlate with human judgment of translation quality. Traditional metrics (e.g., BLEU, TER) over-reward lexical similarity while ignoring deeper aspects of translation quality, such as fluency and context-appropriateness.
The qualitative translation assessment shows that ChatGPT has produced better overall translation outputs. This finding is in line with a recent study on the translation of expressive genres such as poetry [51], which found that ChatGPT outperformed GT and DeepL Translator across various evaluation criteria, highlighting its potential in handling complex literary translations. However, while ChatGPT-4 may achieve competitive translation quality for specific genres and high-resource languages, as is the case for English–Arabic translations, it may have limited capabilities for other genres or low-resource languages, a finding that has been highlighted in some recent studies, including [10,52].
While this study relies primarily on traditional metrics such as chrF, BLEU, and TER, it is increasingly clear that these may not adequately reflect semantic fidelity, especially in expressive or idiomatic translations. Recent work on generative AI evaluation, such as the Pass@k metric developed by OpenAI [53], highlights alternative frameworks for assessing meaning preservation across multiple plausible outputs. Though originally designed for code generation, Pass@k’s underlying logic, capturing whether a correct or acceptable output is produced among a set of completions, could be adapted to translation tasks, particularly for languages with high variability or nuance like Arabic. Incorporating such probabilistic metrics could provide richer insights into the strengths and limitations of LLM outputs beyond deterministic match scores.
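For concreteness, the unbiased Pass@k estimator from [53] can be written in a few lines of Python; how an "acceptable" candidate translation would be operationalized is left open here and would itself require human or metric-based judgment. The example values of n, c, and k are purely illustrative.

  # Unbiased pass@k estimator from [53] (originally for code generation):
  # n candidates are generated, c of them are judged acceptable, and we
  # estimate the probability that a random sample of k candidates contains
  # at least one acceptable output.
  from math import comb

  def pass_at_k(n: int, c: int, k: int) -> float:
      if n - c < k:   # every possible sample of k contains an acceptable output
          return 1.0
      return 1.0 - comb(n - c, k) / comb(n, k)

  # e.g., 10 candidate translations, 3 judged acceptable, samples of 5
  print(pass_at_k(n=10, c=3, k=5))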
Moreover, the growing use of RLHF, as demonstrated in recent studies [54], suggests a promising pathway for refining translation evaluation. Crowd-sourced or expert human ratings on fluency, semantic accuracy, and cultural appropriateness can be used not only to validate model outputs but to inform weighting schemes in hybrid metrics. These insights open important future possibilities for integrating structured human judgments with automatic evaluation tools to produce more context-appropriate and functionally relevant translation assessments.
This study has several limitations. Firstly, it has evaluated the translation quality of specific text types and genres, necessitating further research to examine the performance of MT systems and LLMs across other text types and genres. Secondly, this study has focused on GT and ChatGPT. The performance outcomes of these systems may differ slightly or significantly from other neural-based MT systems and generative AI models. Additionally, the scope of this study is restricted to certain metrics and language pairs, which complicates the generalization of its findings. Future research should incorporate new metrics, language pairs, MT systems, and LLMs. Another limitation is the reliance on standard automatic evaluation metrics, which, while useful, may not fully capture nuanced aspects of meaning, idiomatic accuracy, or cultural appropriateness in translations. As noted in recent generative AI research, alternative frameworks like Pass@k and RLHF offer more robust methods for evaluating meaning preservation in flexible output scenarios [53,54]. Future work should consider integrating these frameworks and employing structured human evaluations to augment metric-based assessments.
Future studies may also explore more comprehensive integration of human evaluative feedback, potentially crowd-sourced or expert-based, to refine or adjust automatic metrics like chrF. It is also possible to explore dynamic evaluation strategies based on transfer learning principles, where meaning transfer and contextual adaptation are assessed through targeted translation tasks. Such hybrid frameworks would not only evaluate output accuracy but also model learning behaviors and generalizations, offering more scalable and human-aligned metrics for LLM evaluation in translation.

8. Conclusions

MT has notably minimized language barriers and excels in the translation of many text types and genres. Its latest systems have enhanced productivity, and with proper interaction with human translators, these systems may enhance the translation quality of texts that would have been otherwise lost in machine translation. Findings from metrics such as chrF and BLEU show that MT systems that have long been struggling with the translation of simple texts have undergone tremendous developments and have become capable of translating complex structures and complicated genres.
Specialized neural-based MT systems and LLMs typically offer superior translation quality for widely spoken languages and standard texts. Nevertheless, the optimal scenario in the translation industry involves a hybrid approach that integrates LLMs with other translation systems, while also ensuring sufficient attention is given to human-machine interaction. LLMs, particularly ChatGPT, show promise in addressing this gap by offering more coherent and culturally aligned translations. However, both systems demonstrate limitations that necessitate human post-editing for high-stakes content. While chrF and similar metrics provide helpful quantitative benchmarks, their true value lies in how they are complemented by qualitative evaluations of meaning preservation, idiomaticity, and cultural appropriateness. Hybrid assessment frameworks are crucial, particularly in Arabic–English translation, due to the significant morphological complexity and cultural distinctions involved. This study emphasizes that the combination of metric-based and human-centered approaches yields a more complete picture of LLM translation capabilities. Drawing on concepts akin to in-context learning and transfer learning, future research could leverage adaptive evaluation strategies grounded in human feedback to enhance both model performance and interpretability.

Funding

This research received no external funding.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

The full R scripts (https://rdrr.io/snippets/) used for the statistical analyses (including the Shapiro–Wilk and Wilcoxon signed-rank tests) are publicly available. The R code is hosted at the following permanent URLs: https://pastecode.io/s/jg6moeqt (accessed on 10 May 2025), https://pastecode.io/s/8vsgrw6o (accessed on 10 May 2025), https://pastecode.io/s/74fyfc6p (accessed on 10 May 2025), and https://pastecode.io/s/37jq2rc3 (accessed on 10 May 2025).

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Arabic Expressions and Their Machine and Reference Translations

Segment | Original Arabic | IPA | Earlier MT | Google Translation | Reference Translation
1 | حل عن سماي | ħal ʕan samaːj | Dilute my sky | Solve my sky | Get off my back/Leave me alone
2 | بتؤمرني أمر | bitʔuʔmirni ʔamr | You order me order | You command me an order | At your service/Whatever you say
3 | ياليل ياعين | ja leːl ja ʕeːn | Oh night oh eye | Oh night, oh eye | Oh, what a night!/Oh dear!
4 | مالك؟ | maːlek | Your money | What’s wrong? | What’s wrong with you?/What’s the matter?
5 | بلا حكي فاضي | bala ħaki faːdˤi | No need for empty speech | No empty talk | Stop with the nonsense
6 | مش فاضيلك | miʃ faːdˤilak | Not empty for you | I don’t have time for you | I’m not free for you/I don’t have time for this
7 | شو مالك | ʃuː maːlek | What’s your money | What’s wrong with you | What’s up with you?/What’s your problem?
8 | على راسي | ʕalaː raːsi | On my head | On my head | Of course/With pleasure
9 | يستر على عرضك | justr ʕala ʕarḍak | Cover your width | May God protect your honor | May God protect your honor (used as a plea)
10 | ظروف قاهره | ʒˤuruːf qaːhira | Cairo envelopes | Compelling circumstances | Overwhelming circumstances
11 | شو الدعوه | ʃuː idːaʕwa | What’s the invitation? | What’s the matter | What’s the deal?/What’s going on?
12 | يابعد عمري | ja baʕd ʕumri | After my age | Oh my life | You’re my everything
13 | ما تدفع أنا بدفع | maː tidfaʕ ʔanaː badfaʕ | Don’t push .. I will push | Don’t pay, I’ll pay | Don’t pay, I’ll pay
14 | فتح الله كتب كتابه على فيفي | fataħ allaːh kataba kitaːbahu ʕala fiːfiː | Open God wrote his book on in in | Fatah Allah wrote his book on Fifi | Fathallah got engaged/married to Fifi

Appendix B. English Technical Text and Its GT, ChatGPT, and Reference Translations

Sentence | English | Google Translation | ChatGPT Translation | Reference Translation
1 | Document automation (also known as document assembly) is the design of systems and workflows that assist in the creation of electronic documents. | أتمتة المستندات (المعروفة أيضًا باسم تجميع المستندات) هي تصميم الأنظمة وسير العمل التي تساعد في إنشاء المستندات الإلكترونية. | أتمتة المستندات (المعروفة أيضًا بتجميع المستندات) هي تصميم أنظمة وتدفقات عمل تساعد في إنشاء المستندات الإلكترونية. | أتمتة المستندات (تُعرف أيضًا باسم تجميع المستندات) هو تصميم الأنظمة وسير العمل التي تساعد في إنشاء مستندات إلكترونية.
2 | These include logic-based systems that use segments of pre-existing text and/or data to assemble a new document. | وتشمل هذه الأنظمة القائمة على المنطق والتي تستخدم أجزاء من النص و/أو البيانات الموجودة مسبقًا لتجميع مستند جديد. | تشمل هذه الأنظمة الأنظمة القائمة على المنطق التي تستخدم أجزاء من نصوص و/أو بيانات موجودة مسبقًا لتجميع مستند جديد. | وتشمل هذه النظم القائمة على المنطق التي تستخدم شرائح من النص و/أو البيانات الموجودة مسبقا لتجميع وثيقة جديدة.
3 | This process is increasingly used within certain industries to assemble legal documents, contracts and letters. | تُستخدم هذه العملية بشكل متزايد في بعض الصناعات لتجميع المستندات القانونية والعقود والرسائل. | يُستخدم هذا الأسلوب بشكل متزايد في بعض الصناعات لتجميع المستندات القانونية والعقود والرسائل. | ويتم استخدام هذه العملية بشكل متزايد في بعض الصناعات لتجميع الوثائق والعقود والرسائل القانونية.
4 | Document automation systems can also be used to automate all conditional text, variable text, and data contained within a set of documents. | يمكن أيضًا استخدام أنظمة أتمتة المستندات لأتمتة جميع النصوص الشرطية والنصوص المتغيرة والبيانات الموجودة داخل مجموعة من المستندات. | يمكن أيضًا استخدام أنظمة أتمتة المستندات لأتمتة جميع النصوص الشرطية، والنصوص المتغيرة، والبيانات الموجودة ضمن مجموعة من المستندات. | يمكن أيضًا استخدام أنظمة أتمتة المستندات لأتمتة جميع النصوص الشرطية والنص المتغير والبيانات الموجودة في مجموعة من المستندات.
5 | Automation systems allow companies to minimize data entry, reduce the time spent proofreading and reduce the risks associated with human error. | تسمح أنظمة الأتمتة للشركات بتقليل إدخال البيانات وتقليل الوقت المستغرق في التدقيق اللغوي وتقليل المخاطر المرتبطة بالخطأ البشري. | تتيح أنظمة الأتمتة للشركات تقليل إدخال البيانات، وتقليل الوقت المستغرق في التدقيق اللغوي، وتقليل المخاطر المرتبطة بالأخطاء البشرية. | تسمح أنظمة الأتمتة للشركات بتقليل إدخال البيانات إلى الحد الأدنى، وكذلك تقليل الوقت الذي تستغرقه في قراءة الدليل، وتقليل المخاطر المرتبطة بالخطأ البشري.
6 | Additional benefits include: time and financial savings due to decreased paper handling, document loading, storage, distribution, postage/shipping, faxes, telephone, labor and waste. | تشمل الفوائد الإضافية: توفير الوقت والمال بسبب انخفاض التعامل مع الورق وتحميل المستندات والتخزين والتوزيع والبريد/الشحن والفاكسات والهاتف والعمالة والنفايات. | وتشمل الفوائد الإضافية: توفير الوقت والمال بسبب تقليل التعامل مع الورق، وتحميل المستندات، وتخزينها، وتوزيعها، والشحن/البريد، والفاكسات، والمكالمات الهاتفية، والعمالة، والهدر. | وتشمل المزايا الإضافية ما يلي: توفير الوقت والمال بسبب انخفاض معالجة الورق وتحميل المستندات والتخزين والتوزيع والبريد/الشحن والفاكسات والهاتف والعمالة والنفايات.

References

  1. Mair, V. AI for Akkadian. 2023. Available online: https://languagelog.ldc.upenn.edu/nll/?p=59411 (accessed on 20 May 2025).
  2. Doan, C. Comparing Encoder-Decoder Architectures for Neural Machine Translation: A Challenge set Approach. Ph.D. Thesis, Université d’Ottawa/University of Ottawa, Ottawa, ON, Canada, 2021. Available online: https://ruor.uottawa.ca/items/bd124184-0c9e-484a-8973-434adc1f9adc (accessed on 15 April 2025).
  3. Konyk, M.; Vysotska, V.; Goloshchuk, S.; Holoshchuk, R.; Chyrun, S.; Budz, I. Technology of Ukrainian-English machine translation based on recursive neural network as LSTM. COLINS 2023, 1, 357–370. [Google Scholar]
  4. He, J.; Neubig, G.; Berg-Kirkpatrick, T. Efficient nearest neighbor language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online, 3 November 2021; Moens, M.-F., Huang, X., Specia, L., Yih, S.W.-T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 5703–5714. [Google Scholar] [CrossRef]
  5. Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
  6. Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned language models are zero-shot learners. arXiv 2022, arXiv:2109.01652. [Google Scholar]
  7. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  8. Gandolfi, A. GPT-4 in education: Evaluating aptness, reliability, and loss of coherence in solving calculus problems and grading submissions. Int. J. Artif. Intell. Educ. 2025, 35, 367–397. [Google Scholar] [CrossRef]
  9. Zhou, Z.; Gan, W.; Xie, J.; Guo, Z.; Zhang, Z. Harnessing the potential of large language models in medicine: Opportunities, challenges, and ethical considerations. Int. J. Surg. 2024, 110, 5850–5851. [Google Scholar] [CrossRef]
  10. Mohammed, T.A. From Google Translate to ChatGPT: The use of large language models in translating, editing, and revising. In Role of AI in Translation and Interpretation; IGI Global Scientific Publishing: Hershey, PA, USA, 2025; pp. 1–32. [Google Scholar]
  11. House, J. A Model for Translation Quality Assessment; Gunter Narr Verlag: Berlin, Germany, 1977. [Google Scholar]
  12. House, J. Translation Quality Assessment: A Model Revisited; Gunter Narr: Tübingen, Germany, 1997. [Google Scholar]
  13. House, J. Translation Quality Assessment: Past and Present; Routledge: Oxfordshire, UK, 2014. [Google Scholar]
  14. Munkova, D.; Hajek, P.; Munk, M.; Skalka, J. Evaluation of machine translation quality through the metrics of error rate and accuracy. Procedia Comput. Sci. 2020, 171, 1327–1336. [Google Scholar] [CrossRef]
  15. Munková, D.; Munk, M.; Skalka, J.; Kasaš, K. Automatic evaluation of MT output and post-edited MT output for genealogically related languages. In Innovation in Information Systems and Technologies to Support Learning Research; Serrhini, M., Silva, C., Aljahdali, S., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; Volume 7, pp. 416–425. [Google Scholar] [CrossRef]
  16. Juraska, J.; Deutsch, D.; Finkelstein, M.; Freitag, M. MetricX-24: The Google submission to the WMT 2024 metrics shared task. arXiv 2024, arXiv:2410.03983. [Google Scholar]
  17. López Caro, Á. Machine Translation Evaluation Metrics Benchmarking: From Traditional MT to LLMs. Master’s Thesis, The University of Barcelona, Barcelona, Spain, 2023. Available online: https://hdl.handle.net/2445/214303 (accessed on 20 May 2025).
  18. Perrella, S.; Proietti, L.; Huguet Cabot, P.-L.; Barba, E.; Navigli, R. Beyond correlation: Interpretable evaluation of machine translation metrics. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.-N., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 20689–20714. [Google Scholar] [CrossRef]
  19. Lee, S.; Lee, J.; Moon, H.; Park, C.; Seo, J.; Eo, S.; Koo, S.; Lim, H. A survey on evaluation metrics for machine translation. Mathematics 2023, 11, 1006. [Google Scholar] [CrossRef]
  20. van Toledo, C.; Schraagen, M.; van Dijk, F.; Brinkhuis, M.; Spruit, M. Readability metrics for machine translation in Dutch: Google vs. Azure & IBM. Appl. Sci. 2023, 13, 4444. [Google Scholar] [CrossRef]
  21. Munkova, D.; Munk, M.; Benko, L.; Hajek, P. The role of automated evaluation techniques in online professional translator training. PeerJ Comput. Sci. 2021, 7, e706. [Google Scholar] [CrossRef] [PubMed]
  22. Toury, G. Descriptive Translation Studies–and Beyond; John Benjamins Publishing: Amsterdam, The Netherlands, 1995. [Google Scholar]
  23. Maučec, M.S.; Donaj, G. Machine translation and the evaluation of its quality. In Recent Trends in Computational Intelligence; IntechOpen: London, UK, 2019; p. 143. [Google Scholar]
  24. Banerjee, S.; Lavie, A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the Acl Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 5–9 June 2005; pp. 65–72. Available online: https://aclanthology.org/W05-0909.pdf (accessed on 20 May 2025).
  25. Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, Cambridge, MA, USA, 8–12 April 2006; pp. 223–231. Available online: https://aclanthology.org/2006.amta-papers.25/ (accessed on 13 March 2025).
  26. Sellam, T.; Das, D.; Parikh, A. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 10 June 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7881–7892. [Google Scholar] [CrossRef]
  27. Zhao, W.; Glavaš, G.; Peyrard, M.; Gao, Y.; West, R.; Eger, S. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; Jurafsky, D., Chai, J., Schluter, N., Tetreault, J., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 1656–1671. [Google Scholar] [CrossRef]
  28. Caswell, I. 24 New Languages added to Google Translate. Available online: https://blog.google/products/translate/24-new-languages/?utm_source=chatgpt.com (accessed on 11 May 2022).
  29. Alkhawaja, L. Unveiling the new frontier: ChatGPT-3 powered translation for Arabic-English language pairs. Theory Pract. Lang. Stud. 2024, 14, 347–357. [Google Scholar] [CrossRef]
  30. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Dean, J. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  31. Robinson, N.; Ogayo, P.; Mortensen, D.R.; Neubig, G. ChatGPT MT: Competitive for high- (but not low-) resource languages. In Proceedings of the Eighth Conference on Machine Translation, Singapore, 1 December 2023; Koehn, P., Haddow, B., Kocmi, T., Monz, C., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 392–418. [Google Scholar] [CrossRef]
  32. Sahari, Y.; Al-Kadi, A.M.T.; Ali, J.K.M. A cross sectional study of ChatGPT in translation: Magnitude of use, attitudes, and uncertainties. J. Psycholinguist. Res. 2023, 52, 2937–2954. [Google Scholar] [CrossRef]
  33. Alawida, M.; Mejri, S.; Mehmood, A.; Chikhaoui, B.; Isaac Abiodun, O. A comprehensive study of ChatGPT: Advancements, limitations, and ethical considerations in natural language processing and cybersecurity. Information 2023, 14, 462. [Google Scholar] [CrossRef]
  34. Jiang, Z.; Lv, Q.; Zhang, Z.; Lei, L. Distinguishing translations by human, NMT, and ChatGPT: A linguistic and statistical approach. arXiv 2024, arXiv:2312.10750. [Google Scholar]
  35. Popović, M. chrF: Character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, Lisbon, Portugal, 15 September 2015; pp. 392–395. [Google Scholar] [CrossRef]
  36. Castaldo, A.; Monti, J. Prompting large language models for idiomatic translation. In Proceedings of the First Workshop on Creative-Text Translation and Technology, Sheffield, UK, 4 June 2024; pp. 37–44. Available online: https://unora.unior.it/handle/11574/231020 (accessed on 20 May 2025).
  37. Alzeebaree, Y. Machine translation and issues of multiword units: Idioms and collocations. East. J. Lang. Linguist. Lit. 2020, 1, 1–23. [Google Scholar]
  38. Abdullah Naeem, D.A. Machine translation problems in English-Arabic collocations and post-editing: Google Translate as a case study. Humanit. Educ. Sci. J. 2023, 28, 487–515. [Google Scholar] [CrossRef]
  39. Aizouky, Z. Arabic-English Google Translation Evaluation and Arabic Sentiment Analysis. Master’s Thesis, University of Alberta, Edmonton, AB, Canada, 2020. Available online: https://era.library.ualberta.ca/items/55622b1e-bed3-4260-ad86-042408dff84d (accessed on 20 May 2025).
  40. Mounassar, A.A.A. The effectiveness of machine translation systems in translating English fixed phrases into Arabic. Univ. Sci. Technol. J. Manag. Hum. Sci. 2024, 2, 2. [Google Scholar] [CrossRef]
  41. Musaad, D.M.M.A.M.; Towity, D.A.A.A. Translation evaluation of three machine translation systems, with special references to idiomatic expressions. Humanit. Educ. Sci. J. 2023, 29, 678–708. [Google Scholar] [CrossRef]
  42. Hamdan, S. Exploring the Efficiency of ChatGPT vs. Google Translate in Translating Idioms and Idiomatic Expressions: “The Catcher in the Rye” as a Case Study; Effat University: Jeddah, Saudi Arabia, 2024; Available online: https://repository.effatuniversity.edu.sa/handle/20.500.14131/1795 (accessed on 16 March 2025).
  43. Farrús, M.; Costa-Jussa, M.R.; Marino, J.B.; Poch, M.; Hernández, A.; Henríquez, C.; Fonollosa, J.A. Overcoming statistical machine translation limitations: Error analysis and proposed solutions for the Catalan–Spanish language pair. Lang. Resour. Eval. 2011, 45, 181–208. [Google Scholar] [CrossRef]
  44. Geer, D. Statistical machine translation gains respect. Computer 2005, 38, 18–21. [Google Scholar] [CrossRef]
  45. Kenny, D.; Doherty, S. Statistical machine translation in the translation curriculum: Overcoming obstacles and empowering translators. Interpret. Transl. Train. 2014, 8, 276–294. [Google Scholar] [CrossRef]
  46. Popović, M.; Ney, H. Towards automatic error analysis of machine translation output. Comput. Linguist. 2011, 37, 657–688. [Google Scholar] [CrossRef]
  47. Sheppard, F. Medical writing in English: The problem with Google Translate. Presse Medicale 2011, 40, 565–566. [Google Scholar] [CrossRef]
  48. Stahlberg, F. Neural machine translation: A review. J. Artif. Intell. Res. 2020, 69, 343–418. [Google Scholar] [CrossRef]
  49. Koehn, P.; Knowles, R. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver, BC, Canada, August 2017; Luong, M.T., Birch, A., Neubig, G., Finch, A., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2017; pp. 28–39. [Google Scholar] [CrossRef]
  50. Reiter, E. A structured review of the validity of BLEU. Comput. Linguist. 2018, 44, 393–401. [Google Scholar] [CrossRef]
  51. Gao, R.; Lin, Y.; Zhao, N.; Cai, Z.G. Machine translation of Chinese classical poetry: A comparison among ChatGPT, Google Translate, and DeepL Translator. Humanit. Soc. Sci. Commun. 2024, 11, 835. [Google Scholar] [CrossRef]
  52. Hendy, A.; Abdelrehim, M.; Sharaf, A.; Raunak, V.; Gabr, M.; Matsushita, H.; Kim, Y.J.; Afify, M.; Awadalla, H.H. How good are GPT models at machine translation? A comprehensive evaluation. arXiv 2023, arXiv:2302.09210. [Google Scholar]
  53. Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; de Oliveira Pinto, H.P.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating large language models trained on code. arXiv 2021, arXiv:2107.03374. [Google Scholar]
  54. Wong, M.F.; Jain, A.; Nikolov, A.; White, T.; Bembenek, J.; Cai, L.; Wu, L.; Mackenzie, T.; Klyman, A. Aligning crowd-sourced human feedback for reinforcement learning on code generation by large language models. arXiv 2024, arXiv:2503.15129. [Google Scholar] [CrossRef]
Figure 1. Holmes’ map of translation studies.
Figure 2. BLEU vs. TER scores for GT and ChatGPT.
Table 1. Arabic expressions and their machine and reference translations.

| Segment | Original Arabic | IPA | Earlier MT | GT Translation | Reference Translation |
|---|---|---|---|---|---|
| 1 | حل عن سماي | ħal ʕan samaːj | Dilute my sky | Solve my sky | Get off my back/Leave me alone |
| 2 | بتؤمرني أمر | bitʔuʔmirni ʔamr | You order me order | You command me an order | At your service/Whatever you say |
| 3 | ياليل ياعين | ja leːl ja ʕeːn | Oh Night Oh Eye | Oh night, oh eye | Oh, what a night!/Oh dear! |
| 4 | مالك؟ | maːlek | Your Money | What’s wrong? | What’s wrong with you?/What’s the matter? |
| 5 | بلا حكي فاضي | bala ħaki faːdˤi | No need for empty speech | No empty talk | Stop with the nonsense |
| 6 | مش فاضيلك | miʃ faːdˤilak | Not empty for you | I don’t have time for you | I’m not free for you/I don’t have time for this |
| 7 | شو مالك | ʃuː maːlek | What’s your money | What’s wrong with you | What’s up with you?/What’s your problem? |
| 8 | على راسي | ʕalaː raːsi | On my Head | On my head | Of course/With pleasure |
| 9 | يستر على عرضك | justr ʕala ʕarḍak | cover your width | May God protect your honor | May God protect your honor (used as a plea) |
| 10 | ظروف قاهره | ʒˤuruːf qaːhira | Cairo envelopes | Compelling circumstances | Overwhelming circumstances |
| 11 | شو الدعوه | ʃuː idːaʕwa | what’s the invitation??? | What’s the matter | What’s the deal?/What’s going on? |
| 12 | يابعد عمري | ja baʕd ʕumri | after my age | Oh my life | You’re my everything |
| 13 | ما تدفع أنا بدفع | maː tidfaʕ ʔanaː badfaʕ | don’t push .. I will push | Don’t pay, I’ll pay | Don’t pay, I’ll pay |
| 14 | فتح الله كتب كتابه على فيفي | fataħ allaːh kataba kitaːbahu ʕala fiːfiː | Open god wrote his book on in in | Fatah Allah wrote his book on Fifi | Fathallah got engaged/married to Fifi |
Table 2. chrF translation scores against the reference.

| Segment | chrF Score (Earlier MT) | chrF Score (GT) |
|---|---|---|
| 1 | 0.085 | 0.086 |
| 2 | 0.103 | 0.112 |
| 3 | 0.139 | 0.174 |
| 4 | 0.045 | 0.276 |
| 5 | 0.096 | 0.045 |
| 6 | 0.092 | 0.596 |
| 7 | 0.515 | 0.302 |
| 8 | 0.044 | 0.058 |
| 9 | 0.137 | 1.0 |
| 10 | 0.098 | 0.630 |
| 11 | 0.218 | 0.279 |
| 12 | 0.088 | 0.059 |
| 13 | 0.136 | 0.484 |
| 14 | 0.105 | 0.199 |
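Per-segment chrF values of the kind reported in Table 2 can be computed with standard tooling. The snippet below is a minimal sketch using the sacrebleu library, which is an assumption rather than the study's documented setup; sacrebleu reports chrF on a 0–100 scale, so the value is divided by 100 to match the 0–1 range above, and results may differ slightly from Table 2 depending on parameter choices. The hypothesis and reference strings are taken from segment 1 of Table 1.

```python
# Minimal sketch: per-segment chrF scoring with sacrebleu (pip install sacrebleu).
from sacrebleu.metrics import CHRF

chrf = CHRF()  # defaults: character 6-grams, beta = 2

# Segment 1 from Table 1: two system outputs and the reference alternatives.
outputs = {
    "Earlier MT": "Dilute my sky",
    "GT": "Solve my sky",
}
references = ["Get off my back", "Leave me alone"]

for system, hypothesis in outputs.items():
    score = chrf.sentence_score(hypothesis, references).score / 100  # 0-100 -> 0-1
    print(f"{system:10s} chrF = {score:.3f}")
```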
Table 3. Shapiro–Wilk test results.

| System | Statistic | p-Value |
|---|---|---|
| Earlier MT | 0.619889 | 6.13496 × 10⁻⁵ |
| GT | 0.8528956 | 0.02432052 |
Table 4. Wilcoxon signed-rank test result.

| Comparison | Statistic | p-Value |
|---|---|---|
| Earlier MT vs. GT | 19 | 0.03527832 |
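As a reading aid for Tables 3 and 4, the sketch below shows how the normality check and the paired non-parametric comparison could be run on the fourteen chrF pairs from Table 2. SciPy is assumed here; the paper does not state which statistical package was used, so exact figures may differ marginally from those reported.

```python
# Sketch of the tests behind Tables 3-4: Shapiro-Wilk normality check,
# then a paired Wilcoxon signed-rank test on the per-segment chrF scores.
from scipy.stats import shapiro, wilcoxon

# Per-segment chrF scores from Table 2 (Earlier MT vs. GT).
earlier_mt = [0.085, 0.103, 0.139, 0.045, 0.096, 0.092, 0.515,
              0.044, 0.137, 0.098, 0.218, 0.088, 0.136, 0.105]
gt = [0.086, 0.112, 0.174, 0.276, 0.045, 0.596, 0.302,
      0.058, 1.0, 0.630, 0.279, 0.059, 0.484, 0.199]

# Normality: both score distributions depart from normal at alpha = 0.05,
# which motivates the non-parametric paired test.
for name, scores in [("Earlier MT", earlier_mt), ("GT", gt)]:
    stat, p = shapiro(scores)
    print(f"{name}: W = {stat:.4f}, p = {p:.5g}")

# Paired Wilcoxon signed-rank test between the two systems.
stat, p = wilcoxon(earlier_mt, gt)
print(f"Earlier MT vs. GT: W = {stat:.1f}, p = {p:.5g}")
```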
Table 5. BLEU and TER scores for GT and ChatGPT translations.

| Translation System | Sentence | BLEU Score | TER Score |
|---|---|---|---|
| GT | 1 | 0.563 | 0.222 |
| GT | 2 | 0.083 | 0.45 |
| GT | 3 | 0.533 | 0.429 |
| GT | 4 | 0.627 | 0.167 |
| GT | 5 | 0.417 | 0.391 |
| GT | 6 | 0.413 | 0.435 |
| ChatGPT | 1 | 0.076 | 0.556 |
| ChatGPT | 2 | 0.202 | 0.65 |
| ChatGPT | 3 | 0.366 | 0.571 |
| ChatGPT | 4 | 0.563 | 0.222 |
| ChatGPT | 5 | 0.077 | 0.609 |
| ChatGPT | 6 | 0.114 | 0.783 |
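Sentence-level BLEU and TER scores of the kind shown in Table 5 can likewise be obtained with off-the-shelf metrics. The snippet below is a sketch using sacrebleu with an illustrative sentence pair (the study's six technical sentences are not reproduced here, and sacrebleu is an assumed rather than documented choice); sacrebleu reports both metrics on a 0–100 scale, hence the division by 100, and sentence-level BLEU uses an effective n-gram order to avoid zero scores on short segments.

```python
# Sketch: sentence-level BLEU and TER with sacrebleu (placeholder sentences).
from sacrebleu.metrics import BLEU, TER

bleu = BLEU(effective_order=True)  # effective n-gram order for short segments
ter = TER()

# Hypothetical sentence pair; the study's own sentences and references would go here.
hypothesis = "Document automation systems can also automate conditional text."
reference = "Document automation systems can also be used to automate all conditional text."

print("BLEU:", round(bleu.sentence_score(hypothesis, [reference]).score / 100, 3))
print("TER: ", round(ter.sentence_score(hypothesis, [reference]).score / 100, 3))
```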
Table 6. Shapiro–Wilk test results.

| System | Metric | Statistic | p-Value |
|---|---|---|---|
| GT | BLEU | 0.8626287 | 0.1983686 |
| GT | TER | 0.8008779 | 0.0598519 |
| ChatGPT | BLEU | 0.8475236 | 0.1502916 |
| ChatGPT | TER | 0.8738806 | 0.2421162 |
Table 7. Shapiro–Wilk test p-values.

| System | Metric | p-Value |
|---|---|---|
| GT | BLEU | 0.1983686 |
| GT | TER | 0.0598519 |
| ChatGPT | BLEU | 0.1502916 |
| ChatGPT | TER | 0.2421162 |
Table 8. Wilcoxon signed-rank test results.

Between Systems (GT vs. ChatGPT)

| Metric | p-Value |
|---|---|
| BLEU | 0.09375 |
| TER | 0.03125 |

Within Systems (BLEU vs. TER)

| System | Comparison | p-Value |
|---|---|---|
| GT | BLEU vs. TER | 0.4375 |
| ChatGPT | BLEU vs. TER | 0.09375 |
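Table 8's comparisons amount to paired Wilcoxon signed-rank tests over the six sentence-level scores in Table 5. The sketch below assumes SciPy, which the paper does not name explicitly. With six uniformly signed differences the exact two-sided p-value bottoms out at 2/64 = 0.03125, consistent with the TER row; tie handling for the identical BLEU pair in sentence 4 may shift that comparison's p-value slightly relative to the table.

```python
# Sketch of the paired comparisons in Table 8 (SciPy assumed; n = 6 sentences).
from scipy.stats import wilcoxon

# Sentence-level scores from Table 5.
gt_bleu      = [0.563, 0.083, 0.533, 0.627, 0.417, 0.413]
gt_ter       = [0.222, 0.450, 0.429, 0.167, 0.391, 0.435]
chatgpt_bleu = [0.076, 0.202, 0.366, 0.563, 0.077, 0.114]
chatgpt_ter  = [0.556, 0.650, 0.571, 0.222, 0.609, 0.783]

# Between systems: does GT differ from ChatGPT on each metric?
print("BLEU (GT vs. ChatGPT):", wilcoxon(gt_bleu, chatgpt_bleu).pvalue)
print("TER  (GT vs. ChatGPT):", wilcoxon(gt_ter, chatgpt_ter).pvalue)

# Within systems: does each system's BLEU differ from its TER?
print("GT (BLEU vs. TER):     ", wilcoxon(gt_bleu, gt_ter).pvalue)
print("ChatGPT (BLEU vs. TER):", wilcoxon(chatgpt_bleu, chatgpt_ter).pvalue)
```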
Table 9. An alternative translation for segment 4.

Source Text: Document automation systems can also be used to automate all conditional text, variable text, and data contained within a set of documents.

Target Text: يمكن أيضًا استخدام أنظمة أتمتة المستندات لأتمتة جميع النصوص الشرطية (التي تعتمد على شروط أو منطق (مثل قواعد “إذا-فإن”)، والنصوص المتغيرة، (التي تركز على التخصيص أو استبدال البيانات بدلاً من التكيف المعتمد على المنطق) والبيانات الموجودة ضمن مجموعة من المستندات.

Target Text (IPA): ʕalaː ʃuruːtˤ ʔaw ˈmantˤiq (mitˤl qawaːʕid “ʔiðaː–faʔinːa”)), wa anːuˈsˤuːs almutagajˈjira (allatiː turakːizu ʕalaː attaxˈsˤiːs ʔaw ʔistibdaːl albajaːnaːt badalan min attaˈkajjuf almuʕˈtamid ʕalaː alˈmantˤiq), wa albajaːnaːt almad͡ʒuːda ḍimna mad͡ʒmuːʕa min almustandaːt

Back Translation: Document automation systems can also be used to automate all conditional texts (those that depend on conditions or logic, such as “if-then” rules), variable texts (which focus on personalization or data substitution rather than logic-based adaptation), and the data contained within a collection of documents.
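For readers who want to elicit this kind of explicative rendering from an LLM themselves, the following is a rough sketch using the OpenAI Python client. The prompt wording, the model name, and the client call are illustrative assumptions only; the study does not document the exact prompt or model configuration behind the Table 9 output.

```python
# Rough sketch: prompting an LLM for an explicative Arabic translation.
# The model name and prompt are illustrative; they are not the study's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

source = ("Document automation systems can also be used to automate all "
          "conditional text, variable text, and data contained within a set of documents.")

prompt = (
    "Translate the following English sentence into Modern Standard Arabic. "
    "Add brief parenthetical glosses for technical terms such as 'conditional text' "
    "and 'variable text' so that a non-specialist reader can follow:\n\n" + source
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```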