Evaluating Intralingual Machine Translation Quality: Application of an Adapted MQM Scheme to German Plain Language
Abstract
1. Introduction
2. Related Work
2.1. Plain Language
2.2. Plain Language in Automated Tasks
2.3. Error Analysis for Intralingual Translation
3. Materials and Methods
3.1. Corpus Data
3.2. Evaluation Framework for Intralingual Translation
3.3. Annotation Process
3.3.1. First Round of Annotation
- Which parts of the categorization did not work and why?
- Where does the categorization lack clear definitions?
- Where are better examples needed?
- Which errors could not be fit into the existing categories and why?
- What kinds of categories are missing?
- Which categories have not been used at all?
3.3.2. Updated Annotation Categories
4. Results
4.1. Error Distribution
1. Mistranslation (1909 errors).
2. Completeness (743 errors).
3. Missing addition (109 errors).
4. Wrong addition (91 errors).
- “Mikropillen mit neuen Gestagenen wie zum Beispiel Gestoden, Desogestrel oder Drospirenon haben ein höheres Risiko für eine Venenthrombose als andere Mikropillen.” (EN: Micro-pills containing new progestogens such as gestodene, desogestrel or drospirenone carry a higher risk of venous thrombosis than other micro-pills.) The phrasing of the German MT text implies that the micro-pills themselves are at risk of venous thrombosis, when in fact it is the people taking these pills who are at higher risk. This could confuse readers with lower reading skills.
- “Wenn ein großer Teil der Bevölkerung gegen eine Krankheit immun ist, gibt es auch weniger Krankheiten.” (EN: When a large proportion of the population is immune to a disease, there are fewer diseases overall.) The German MT output wrongly implies that the more people are vaccinated and thus immune to a certain disease, the fewer diseases exist overall. This misrepresents the fact that if more people are immune to a disease, fewer cases of this disease should (in theory) occur.
- “Die Studien müssen gut sein, wenn sie für die evidenzbasierte Medizin gelten sollen.” (EN: Studies must be of good quality if they are to be considered valid for evidence-based medicine.) The machine translation misinterprets the fact that studies must be conducted according to the criteria of evidence-based medicine in order to be reliable.
- “An den Kniekehlen und dem Ende des Darmes ist ein Gelenk.” (EN: There is a joint at the back of the knees and at the end of the intestine.) This is simply not true, as there is no joint at the end of the human intestine.
- In one case, the MT system even hallucinated information in English: “OPIATES, COX inhibitors and other medicines.” The MT system put this English fragment as the heading to a German text on painkillers.
1. Textual conventions (1822 errors).
2. Spelling (356 errors).
3. Grammar (294 errors).
4. Unclear reference (229 errors).
5. Punctuation (137 errors).
- “Diese Beschwerden können entstehen.” (EN: These complaints may arise.) The sentence does not fit into the context of the text because a reference and a conjunction are missing.
- “Manche Antibiotika machen Menschen mit Diabetes zu Diabetikerinnen oder Diabetikern mit Zuckerstürzen.” (EN: Some antibiotics cause people with diabetes to become diabetic with hypoglycaemia.) This sentence is redundant and wrong: it tells the reader that people who already have diabetes might develop diabetes because of the antibiotics.
- “Was hilft Pfefferminze?” (EN: What does peppermint help?) This sentence is grammatically wrong, as it should be “Gegen was hilft Pfefferminze?” (EN: What does peppermint help with?).
- “Die Schleimhaut von der Gebärmutter geht weg.” (EN: The lining of the uterus goes away.) The German phrasing is very child-like and not appropriate for adult audiences.
- “Die Pille hat Vor- und Nachteile. Die Mikropille hat viele Vorteile.” (EN: The pill has advantages and disadvantages. The micro pill has many advantages.) This is unnecessary repetition, which can be stigmatising for readers.
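
The two ranked lists of error counts above can be turned into relative shares with a few lines of Python. The following is a minimal sketch, not part of the study's tooling: the group keys `first_list` and `second_list` are placeholder names chosen here, and the percentages are relative to the categories listed above only, not to all annotated errors.

```python
# Counts copied from the two ranked lists above. The group keys are
# hypothetical labels for this sketch, not terms taken from the article.
ERROR_COUNTS = {
    "first_list": {
        "Mistranslation": 1909,
        "Completeness": 743,
        "Missing addition": 109,
        "Wrong addition": 91,
    },
    "second_list": {
        "Textual conventions": 1822,
        "Spelling": 356,
        "Grammar": 294,
        "Unclear reference": 229,
        "Punctuation": 137,
    },
}


def shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's percentage of the listed total, rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    for group, counts in ERROR_COUNTS.items():
        print(f"{group}: {sum(counts.values())} listed errors")
        for name, pct in shares(counts).items():
            print(f"  {name}: {pct}%")
```

Computing the shares per list keeps the two rankings comparable even though they contain a different number of categories.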
4.2. Error Category “Non-Classifiable”
- Unnecessary repetition of information
- Inappropriate addressing of readers (i.e., sudden change from formal “Sie” (“you”) to informal “Du” (“you”) or informal “Wir” (“we”))
- Complex sentence structure
- Audience appropriateness
- (1) Sie sprechen mit dem Arzt. Und der Arzt untersucht Sie. (EN: You talk to the doctor. And the doctor examines you.)
4.3. Further Observations
- 2.1.6 Date/Time (4 annotated segments).
- 2.1.7 Entity (4 annotated segments).
- 2.5.3 Incomplete procedure (10 annotated segments).
4.4. Difficulties
5. Discussion
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest





| | Sources | Baseline | M1 | M2 | GPT4o |
|---|---|---|---|---|---|
| Tokens | 62,050 | 66,641 | 43,140 | 40,475 | 13,486 |
| Sentences | 3167 | 5639 | 3587 | 3726 | 949 |

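
The token and sentence counts in the table above allow a rough comparison of average sentence length across the sub-corpora. The sketch below is illustrative only: it reads the five values in each row as the Sources, Baseline, M1, M2 and GPT4o columns in that order, which is an assumption about the table layout, and the helper `tokens_per_sentence` is a name introduced here.

```python
# (tokens, sentences) per sub-corpus, copied from the corpus table.
# The assignment of values to column names is an assumption of this sketch.
CORPUS = {
    "Sources": (62_050, 3_167),
    "Baseline": (66_641, 5_639),
    "M1": (43_140, 3_587),
    "M2": (40_475, 3_726),
    "GPT4o": (13_486, 949),
}


def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length in tokens."""
    return tokens / sentences


if __name__ == "__main__":
    for name, (tok, sent) in CORPUS.items():
        print(f"{name}: {tokens_per_sentence(tok, sent):.2f} tokens per sentence")
```

Average sentence length is only a coarse indicator and does not replace the error annotation described in the article.
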
| Error Type | Definition |
|---|---|
| Terminology | The use of a term does not fit the field conventions, is incorrect in the target text or is not equivalent to the term in the source text. |
| Inconsistent terminology | Multiple terms are used to describe the same concept when just one term is needed or appropriate. |
| Wrong term | Use of a term that is not the term a domain expert would use or that gives rise to a conceptual mismatch. |
| Accuracy | Content in the target text does not match the propositions from the source text. |
| Mistranslations | Target content does not accurately represent the source content. |
| Ambiguous content | Ambiguity is introduced where specification is needed. |
| Hallucination | Machine translation produces an output that is totally decoupled from the source text. |
| Wrong or missing explanation | Explanation is necessary and added, but does not represent the information from the source text (wrong) or an explanation is needed but is not present in the target text (missing). |
| Incomplete information | Relevant information from the source text is missing in the target text. |
| Linguistic conventions | Errors related to the linguistic level of the source text. |
| Grammar | Grammatical rules are violated in the source text. |
| Punctuation | Punctuation is used incorrectly. |
| Spelling | Words are misspelled. |
| Cohesion and coherence | Connectors necessary to understand the text as a whole are missing or incorrect (cohesion). Semantic relationships within the text are not clear (coherence). |
| Audience appropriateness | Content in the target text is not valid, appropriate or acceptable for the target audiences. |
| Inaccurate advice | Target text contains advice that is not in the source text or that is not suitable for the situation in question. |
| Stigmatising content | Content can lead to stigmatization of end users. |
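
For annotation tooling, the error types in the table above could be stored as a small two-level mapping. The sketch below is one possible encoding and not the authors' implementation: the nesting of sub-types under Terminology, Accuracy, Linguistic conventions and Audience appropriateness is inferred from the row order of the table, and the names `TAXONOMY`, `top_level` and `roll_up` are hypothetical.

```python
from collections import Counter

# Top-level categories and their sub-types, following the row order of the
# table above. The nesting is an assumption inferred from that order.
TAXONOMY = {
    "Terminology": ["Inconsistent terminology", "Wrong term"],
    "Accuracy": [
        "Mistranslations",
        "Ambiguous content",
        "Hallucination",
        "Wrong or missing explanation",
        "Incomplete information",
    ],
    "Linguistic conventions": [
        "Grammar",
        "Punctuation",
        "Spelling",
        "Cohesion and coherence",
    ],
    "Audience appropriateness": ["Inaccurate advice", "Stigmatising content"],
}

# Reverse lookup from sub-type to its top-level category.
PARENT = {sub: top for top, subs in TAXONOMY.items() for sub in subs}


def top_level(label: str) -> str:
    """Map any error type to its top-level category, rejecting unknown labels."""
    if label in TAXONOMY:
        return label
    if label in PARENT:
        return PARENT[label]
    raise ValueError(f"unknown error type: {label!r}")


def roll_up(labels: list[str]) -> dict[str, int]:
    """Count annotated labels per top-level category."""
    return dict(Counter(top_level(label) for label in labels))


if __name__ == "__main__":
    # Made-up labels for illustration; these are not data from the study.
    print(roll_up(["Spelling", "Grammar", "Hallucination"]))
```

Rejecting unknown labels at lookup time is a design choice that surfaces typos in annotation files early instead of silently creating new categories.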
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Deilen, S.; Hernández Garrido, S.; Lapshinova-Koltunski, E.; Maaß, C.; Werner, A. Evaluating Intralingual Machine Translation Quality: Application of an Adapted MQM Scheme to German Plain Language. Information 2026, 17, 53. https://doi.org/10.3390/info17010053

