Bringing Context into MT Evaluation: Translator Training Insights from the Classroom
Abstract
1. Introduction
- Q1: How can context-aware translation evaluation activities be effectively implemented within the constraints of an MA translation technology lab setting?
- Q2: What is the impact of context-aware MT evaluation on students’ perceived ability to assess translation quality compared to sentence-level evaluation?
2. Related Work
2.1. MT Evaluation in the Classroom
2.2. The Role of Context in MT Evaluation
3. Context of This Study and Educational Setting
3.1. The Translation Technology Module
- Understanding the principles underlying contemporary translation technologies;
- Enhancing texts for machine translation or human translation facilitated by a translation memory tool;
- Grasping the principles of contemporary machine translation;
- Critically assessing contemporary translation technologies and their outputs;
- Making informed decisions regarding the utilization of these technologies while considering associated risks;
- Understanding current professional, socio-technical, and ethical issues and trends related to technology in the profession, research, and industry/market.
3.1.1. Lectures
3.1.2. Student Demographics
4. Methodological Setup: Sentence vs. Document Evaluation Lab
4.1. Metrics
- None of it
- Little of it
- Most of it
- All of it
- No fluency
- Little fluency
- Near native
- Native
- Mistranslation: The target content does not accurately represent the source content.
- Untranslated: Content that should have been translated has been left untranslated.
- Word Form: There is a problem in the form of a word (e.g., the English target text has “becomed” instead of “became”).
- Word Order: The word order is incorrect (syntax) (e.g., “The house beautiful is” instead of “The house is beautiful”).
- No errors: The target content is flawless.
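To make the setup concrete, the adequacy scale (None of it to All of it), the fluency scale (No fluency to Native), and the error taxonomy above can be written down as a small annotation schema. The Python sketch below is illustrative only: the `Annotation` class and its field names are assumptions for illustration, not part of the lab materials.

```python
from dataclasses import dataclass, field

# The two 4-point scales and the error taxonomy used in the lab
ADEQUACY = ["None of it", "Little of it", "Most of it", "All of it"]
FLUENCY = ["No fluency", "Little fluency", "Near native", "Native"]
ERROR_TYPES = ["Mistranslation", "Untranslated", "Word Form", "Word Order", "No errors"]

@dataclass
class Annotation:
    """One student's judgement of one MT output segment (hypothetical schema)."""
    sentence_id: int
    system: str                                      # which MT system produced the output
    adequacy: int                                    # index into ADEQUACY (0..3)
    fluency: int                                     # index into FLUENCY (0..3)
    errors: list[str] = field(default_factory=list)  # subset of ERROR_TYPES

# A made-up record, just to show the shape of the data
a = Annotation(sentence_id=7, system="system-A", adequacy=2, fluency=3,
               errors=["Word Order"])
print(ADEQUACY[a.adequacy], "|", FLUENCY[a.fluency], "|", a.errors)
```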
4.2. Materials
- Tab 1: Sentence-level evaluation of fluency, adequacy, and error (Figure 1)
- Tab 2: Sentence-level evaluation of ranking (Figure 2)
- Tab 3: Inter-annotator agreement (Figure 3)
- Tab 4: Document-level evaluation of fluency, adequacy, and error (Figure 4)
- Tab 5: Document-level evaluation of ranking (Figure 5)
4.3. Corpus
4.4. Translation Systems
4.5. The Lab
- Task 1: Sentence-Level Evaluation—Students evaluated a selection of individual sentences for fluency, adequacy, error identification, and ranking, without access to surrounding context. They then collaborated with a peer working in the same language pair to compute inter-annotator agreement (IAA) scores.
- Task 2: Document-Level Evaluation—Students were presented with the full documents from which the same sentences had been extracted. They re-evaluated the same sentences, now situated in their original context, and again computed IAA and compared both sets of scores (from Task 1 and Task 2).
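Because both tasks ask pairs of students to compute and compare inter-annotator agreement, a minimal sketch of that computation is given below. It assumes agreement is measured with Cohen's kappa over nominal labels; the judgements are invented purely to show the mechanics, not taken from the lab.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items (nominal scale)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # degenerate case: both annotators always used one label
        return 1.0
    return (observed - expected) / (1 - expected)

# Invented adequacy judgements (0-3) from two students on ten sentences,
# first in isolation (Task 1) and then with full-document context (Task 2)
task1_a = [3, 2, 3, 3, 1, 2, 3, 3, 2, 3]
task1_b = [3, 3, 2, 3, 1, 3, 3, 2, 2, 3]
task2_a = [3, 2, 2, 3, 1, 2, 3, 2, 2, 3]
task2_b = [3, 2, 2, 3, 1, 2, 3, 3, 2, 3]

print(f"Task 1 kappa: {cohens_kappa(task1_a, task1_b):.2f}")  # ~0.26
print(f"Task 2 kappa: {cohens_kappa(task2_a, task2_b):.2f}")  # ~0.83
```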
4.5.1. Task 1: Sentence-Level Evaluation
4.5.2. Task 2: Document-Level Evaluation
5. Insights from Student Reflections
- What were your expectations in terms of translation quality before doing this exercise?
- Were the results as you had expected?
- How did you approach evaluating the quality of the MT output(s) with single random sentences, in contrast to evaluating sentences in context (with the full texts)?
- As a translator, how did you feel assessing the translation quality with the two different methodologies (single sentence vs. sentence in context) in terms of the following:
- a. effort;
- b. how easy it was to understand the meaning of the source;
- c. how easy it was to recognise translation errors.
- Did you have any cases of misevaluation? If so, explain why you think they happened and give examples (source, target, gloss translation).
- Were the IAA results (if any) as you expected? Comment on the differences.
- With which methodology were you most satisfied and confident in the evaluation you provided? Why?
- Will this change how you use MT?
5.1. Expectations
5.2. Approach to Sentence vs. Document-Level Evaluation
“I tackled each sentence individually, concentrating on their linguistic correctness”
“When assessing the translation quality, I found it is harder to evaluate the accuracy of a single random sentence than that of a sentence in full context, thus I was inclined to focus more on the assessment of fluency. For example, the word ‘they’ in the sentence ‘They will test your will’ can be translated as 他们 or 它们 depending on the context, the former one generally refers to human beings, while the latter refers to non-human things, animals or abstract concepts. In the absence of context, it can be assumed that both translations are correct.”
5.3. Perception of Difficulty
“I felt less strict on my interpretation and evaluation and more indulgent with imprecision and ambiguities [for the sentence-level evaluation]—and this exactly because I did not have the context. My perception of the task was positive: the work seemed to flow more easily and to demand small cognitive effort. Evaluating sentences in context in its turn, was a much more intellectually demanding task, requiring rereadings and a heavier load of interpretation that made my approach to the sentences more subjective and the categorisation of mistakes more demanding. However, it gave me the lacking context and allowed me to perceive mistakes that were invisible before.”
“some of the mistakes were very obvious and in that sense it did not really matter if it was a single sentence or not, but for more complex or subtle mistakes having context surrounding the sentence that was being analysed did help to point them out.”
5.4. Misevaluation
“The initial confusion was caused by the lack of context which prevented me from identifying the gender. Nevertheless, there was no ground to mark down the MTs’ outputs as incorrect as there was no background information to prove this.”
“The presence of the context influenced some of my decisions made on a sentence-based level. E.g., it was only after I had read the whole text, did I understand the full meaning of ‘I am going to have my son try to turn the legs the other way and hope it helps’ and, as a result, was able to adjust my final evaluation.”
- It is pretty.
- They will test your will.
- They are AWFUL.
- Bring it!
- I am going to have my son try to turn the legs the other way and hope it helps.
- The massive organ fills the space and adds to the romance!
- Warriors hope fans’ return to Chase Center gives team juice
5.5. Inter-Annotator Agreement (IAA)
“These results may be explained by the fact that translators do not stress the aspect of gender as much on a sentence-level, but they become duly aware of that aspect once the evaluation is carried out on a document-level.”
5.6. Satisfaction
“I was more confident evaluating translations on sentence-level because it has less emphasis on the gender aspect. The other methodology [document-level] required more attentiveness from the translator to catch translation errors that manifest as minor gender inflection issues, which can be daunting especially for big projects.”
“I was confident and satisfied with document level because I knew exactly what was expected of me. I was confident at sentence level until I got to document level and context was brought in.”
“The methodology I am most satisfied with is the sentence-in-context assessment with the adequacy metric. This approach allows for a comprehensive understanding of the source text and a clearer evaluation of the referential relationship. It also reduces inaccurate evaluations.”
“Overall, I preferred the evaluation at the document level, as I was able to understand the context and be more critical of the quality of the MT output, whereas I was not confident when evaluating the output at the sentence-level since I had no context to understand fully the source sentences or to judge if there was a gender error.”
5.7. Future Use of MT
“I never considered the term “quality” to be so difficult to operationalise. It is now clear that when trying to measure quality one should always define what is perceived as “high” or “low” in their evaluation, as well as state the framework it takes place in.”
“This will change my thought on how to use MT. As translators, we should be more aware how to link different MT systems and evaluation methods to different situations.”
“In sum, this type of evaluation already makes me perceive MT with gentler eyes and understand it has limitations since it cannot always deal with nuances related to grammar errors properly, which is rather up for human evaluation.”
“I am more aware of the errors that are prevalent in machine translation regarding fluency and accuracy problems, and it will be interesting to see the progression of Irish in various translation engines in the future.”
“This will change the way I use machine translation. When I use machine translation, I will put the entire text into the online machine translation system instead of copying and pasting sentence by sentence into the machine translation system.”
“In order to get better outputs from the online translation engine, I suppose it is better to input all the sentences with context rather than single random sentences extracted from a text, since the engine can recognize the context and then generate better translations.”
“I think the way I use MT in the future will change in the way that I will try to give the MT more context instead of just one sentence, since I saw that it can change the translation.”
“This highlights the importance of providing a MT engine with context to ensure a higher possibility of receiving accurate output.”
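A practical takeaway running through these reflections is to give the engine the whole text rather than isolated sentences. The sketch below contrasts the two ways of calling an MT system; `translate` is a placeholder for whatever engine or API a student might use, not a real function.

```python
def translate(text: str) -> str:
    """Placeholder for a call to an MT system; swap in a real engine or API."""
    raise NotImplementedError

source_sentences = [
    "It is pretty.",
    "They will test your will.",
]

# Sentence by sentence: each call sees no surrounding context, so
# ambiguous items (pronouns, gender, number) are resolved by guesswork.
isolated = [translate(s) for s in source_sentences]

# Whole text in one call: the engine can use the surrounding sentences
# to resolve references, which is what the students observed.
in_context = translate(" ".join(source_sentences))
```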
6. Insights from the Exercise
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest