  • Article
  • Open Access

6 April 2023

Four Million Segments and Counting: Building an English-Croatian Parallel Corpus through Crowdsourcing Using a Novel Gamification-Based Platform

1 Faculty of Mathematics and Computer Science, Adam Mickiewicz University in Poznań, 61-712 Poznań, Poland
2 Department of Information and Communication Sciences, Faculty of Humanities and Social Sciences, University of Zagreb, 10000 Zagreb, Croatia
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Machine Translation for Conquering Language Barriers

Abstract

Parallel corpora have been widely used in the fields of natural language processing and translation as they provide crucial multilingual information. They are used to train machine translation systems, compile dictionaries, or generate inter-language word embeddings. Many corpora are publicly available; however, support for some languages is still limited. In this paper, the authors present a framework for collecting, organizing, and storing corpora. The solution was originally designed to obtain data for less-resourced languages, but it also proved to work very well for the collection of high-value domain-specific corpora. The scenario is based on the collective work of a group of people who are motivated by means of gamification. The rules of the game motivate the participants to submit large resources, and a peer-review process ensures quality. More than four million translated segments have been collected so far.

1. Introduction

The use of machine translation services has become a standard way of acquiring and comprehending information and data written in foreign languages. In a globalized world with more than 7000 languages [], multilingual communication is essential regardless of the type of business, research, or education. Therefore, building language resources and tools, such as digital corpora and machine translation systems, which can be used independently or integrated into other tools such as Computer-Assisted Translation (CAT) tools, represents an important element in business and research.
It is estimated that 50% of all languages are low-resourced, although this term encompasses various definitions, including being a language with limited language resources and rarely used in language technologies, having a limited number of labeled datasets, having a limited online presence, having a small number of speakers, etc. []. Even if parallel data exists, it is oftentimes of lower quality or originates from very specific sources, such as religious texts or IT documentation, which are usually very different from the desired domain. The domain, however, is crucial for the implementation of effective machine translation systems. Therefore, the lack of data, the low data quality, and the noisiness of the data are common problems for many languages.
Parallel corpora represent a fundamental resource for many different research tasks and scientific analyses, for building various applications, and for educational purposes. They are an indispensable source in the field of Natural Language Processing (NLP) []. Typical uses include building machine translation systems, creating translation memories that are commonly used in Computer-Assisted Translation (CAT) tools, and developing Named Entity Recognition (NER) systems, as well as building dictionaries, text mining, and extracting collocations []. Software that is based on parallel corpora, such as concordance searching applications [], machine translation systems, and CAT tools, directly depends on the quality of the parallel corpora, their size, domain, and language pair.
Research conducted by [] shows that translators who use machine translation systems in a real translation environment are generally more productive.
These systems have been analyzed by applying several quality assurance and evaluation methods, as in [], where the authors performed an extensive quality assessment of parallel resources used in CAT tools, or by using automatic quality metrics for evaluating machine translation systems [,,,].
Parallel corpora can be used in the education process, especially in the domain of computer and information sciences, for research on NLP [], in language studies on tasks of evaluating and assessing semantics [], for the translation process and terminology analysis [], for post-editing tasks after the use of machine translation [], or CAT technology []. However, building scalable, high-quality parallel corpora is a challenging and resource-intensive task in terms of time, effort, cost, and knowledge.
One of the main issues with statistical machine translation (SMT) and neural machine translation (NMT), which have become dominant approaches for building machine translation systems, is the lack of large-scale parallel data, which is especially relevant for low-resource languages [,,].
SMT relies on statistical models that use parallel corpora to identify patterns and relationships between words in the source and target languages. This information is then used to translate new sentences [].
NMT, on the other hand, uses deep learning techniques to generate translations []. An NMT system is trained on large amounts of bilingual data and intends to learn a shared representation of the source and target languages that can be used for translation.
As one of the 24 official EU languages, Croatian still suffers from a significant lack of bilingual data and has limited data resources available, which are necessary for developing a variety of language technologies, including machine translation systems.
In order to facilitate the acquisition and management of parallel corpora, it is desirable to have a digital platform that is language independent, easy to use, and accessible to a large number of users. For this purpose, a web-based application called “TMrepository” was designed in an effort to provide a straightforward, customizable, and free service needed for collecting and storing parallel corpora. The platform is based on the concepts of crowdsourcing and gamification, which make the tedious task of collecting parallel data more effective, appealing, and pleasant.
The main goal of this paper is to present an English-Croatian parallel corpus that was created by using a specially built web-based platform. Furthermore, the specific aims of this paper are as follows:
(i)
To analyze the importance of parallel corpora, specifically for machine translation purposes.
(ii)
To present the integration of crowdsourcing and gamification methods into a new web-based platform for creating and organizing parallel corpora.
(iii)
To demonstrate the functionalities of the created system.
(iv)
To analyze the resulting English-Croatian parallel corpus, which contains more than four million segments or, more precisely, translation units.
It should also be noted that segments can be understood as text chunks, i.e., text lines that do not necessarily end with a sentence delimiter. In a corpus, they are fundamental logical units that have the tendency to be repetitive, and they come in the form of a whole sentence, parts of sentences, multi-word units, phrases, or even abbreviations. A typical parallel corpus consists of corresponding pairs of segments in the source and target languages. Since they are stored line by line in a corpus, they can be treated as translation units.
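The notion of a translation unit described above can be illustrated with a minimal sketch. It assumes that the source and target sides are available as two line-aligned sequences; the `TranslationUnit` type, the `pair_segments` helper, and the sample segments are hypothetical and are not part of the platform.

```python
from typing import Iterable, List, NamedTuple


class TranslationUnit(NamedTuple):
    source: str  # segment in the source language
    target: str  # corresponding segment in the target language


def pair_segments(source_lines: Iterable[str],
                  target_lines: Iterable[str]) -> List[TranslationUnit]:
    """Pair line i of the source side with line i of the target side."""
    sources = [line.rstrip("\n") for line in source_lines]
    targets = [line.rstrip("\n") for line in target_lines]
    if len(sources) != len(targets):
        raise ValueError(
            f"line count mismatch: {len(sources)} source vs {len(targets)} target"
        )
    return [TranslationUnit(s, t) for s, t in zip(sources, targets)]


# Hypothetical sample data: a full sentence, a phrase, and an abbreviation.
english = ["Good morning.\n", "user manual\n", "EU\n"]
croatian = ["Dobro jutro.\n", "korisnički priručnik\n", "EU\n"]
units = pair_segments(english, croatian)
```

Note that a segment does not have to be a sentence: the second and third units in the sample are a phrase and an abbreviation.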
This paper is organized in the following way: In the Introduction section, the motivation for building a web-based platform for collecting and storing parallel corpora is discussed. The second section presents related work and research on building parallel corpora for machine translation, crowdsourcing, and gamification. In the third section, the crowdsourcing platform of “TMrepository” is presented, along with its main functionalities. The fourth section exhibits details of the experimental setup, and the research results are elaborated in terms of the harvested parallel corpus, which contains more than four million parallel segments. Finally, in the last section, conclusions are stated and suggestions for further research are given.

3. Novel Crowdsourcing and Gamification-Based Corpus Management Platform

This research focuses on building an English-Croatian parallel corpus using the crowdsourcing approach and a novel gamification-based platform called “TMrepository”. It is a unique online application developed with the objective of collecting parallel data, which is needed for various analyses and developing machine translation systems and new CAT and NLP tools.
Although it was mainly built with the Croatian-English language pair in mind, it can be applied to other languages as well. It is publicly available at http://concordia.vm.wmi.amu.edu.pl/tmrepository/ (accessed on 2 February 2023), and its primary purpose is to provide a user-friendly, sentence-level repository of translation memories. In this particular research, it was set up to store and manage Croatian-English and English-Croatian translation memories.
“TMrepository” additionally has gamification-inspired elements to increase the motivation of contributors, who are mostly students. The platform is intended to be used in the future by computer science and information science students and researchers who work on machine translation and other NLP-related tasks, and by students of translation studies who create their own translation memories during their studies.
In order to attract a larger crowd, registration for the system is free and open to anyone. The user is shown a list of their own contributions after logging in (Figure 1).
Figure 1. The main window of the system: a list of uploaded translation memories.
Uploading new resources to the system is the user’s main activity, which is performed by means of an upload form. Here the user is asked to provide relevant information, such as the title of the translation memory, a brief description, and the type of resource. Available types are as follows:
  • manual translation,
  • manual translation—automatically aligned,
  • corpus,
  • corpus—automatically aligned.
The distinction between options “manual translation” and “corpus” depends on the author of the translation: the first option indicates that the translations were performed by the contributing user or collected from public sources (e.g., the web), whereas “corpus” indicates use of already existing translations.
The label “automatically aligned” refers to resources that the user aligned with appropriate software (e.g., hunalign) before initiating the upload, as opposed to resources that were already available in aligned form with perfect or nearly perfect alignment quality.
This study focuses mainly on collecting translation memories of the type “corpus—automatically aligned”, which is understood as resources automatically aligned by the contributing user and containing translations completed by other people. However, exceptions to this are possible.
Available import formats include a pair of text files, a TMX file, and a pair of Word documents. Import from TXT files assumes that two text files with UTF-8 encoding are provided, each having an equal number of lines, where one of the files contains sentences in L1 and the other in L2. Alternatively, the system is able to import TMX files. A custom-built stream TMX parser was developed to prevent problems with large TMX files using a lot of memory. The last option, a pair of Word documents in DOC or DOCX format, automatically aligns any two Word documents on the sentence level. There are no assumptions about how documents should be formatted.
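The idea of stream parsing a large TMX file can be sketched as follows. This is not the platform's custom parser, only a minimal illustration that assumes the usual TMX layout (`<tu>` elements containing `<tuv xml:lang="...">` with a `<seg>` child) and uses the standard library's `iterparse`. The function name `iter_tmx_units` and the sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Tuple

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_tmx_units(stream: BinaryIO, source_lang: str,
                   target_lang: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) segment pairs, handling one <tu> at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "tu":
            continue
        segments = {}
        for tuv in elem.findall("tuv"):
            # Fall back to a plain "lang" attribute if xml:lang is absent.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            primary = lang.split("-")[0]  # "en-US" is treated as "en"
            seg = tuv.find("seg")
            if seg is not None:
                segments[primary] = "".join(seg.itertext()).strip()
        if source_lang in segments and target_lang in segments:
            yield segments[source_lang], segments[target_lang]
        # Drop the children of the finished <tu> to keep memory use low.
        elem.clear()


SAMPLE_TMX = b"""<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          creationtool="example" creationtoolversion="1"
          adminlang="en" o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="hr"><seg>Dobro jutro.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="hr"><seg>Hvala.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = list(iter_tmx_units(io.BytesIO(SAMPLE_TMX), "en", "hr"))
```

Because the generator yields each pair as soon as its `<tu>` element is complete, a caller can write units to storage incrementally instead of loading the whole file first.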
“TMrepository” automatically extracts text out of uploaded documents, splits them into sentences, and performs automatic alignment with the use of the hunalign algorithm. This algorithm does not require any linguistic resources to perform the alignment. It operates in two phases: In the first pass, the source and target files are analyzed, and a rudimentary bilingual dictionary is created. Then, in the second pass, the dictionary is used to calculate the best sentence matches between the source and target sentences.
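The role of the dictionary in the second pass can be illustrated with a deliberately simplified toy. The sketch below is not hunalign's actual implementation: it assumes a small, already available dictionary, scores a sentence pair by the share of source tokens that have a translation in the target sentence, and picks the best target greedily. All names and the sample data are hypothetical.

```python
import re
from typing import Dict, List, Set


def tokenize(sentence: str) -> List[str]:
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def match_score(source_tokens: List[str], target_tokens: List[str],
                dictionary: Dict[str, Set[str]]) -> float:
    """Share of source tokens with a dictionary translation in the target."""
    if not source_tokens:
        return 0.0
    target_set = set(target_tokens)
    hits = sum(
        1 for token in source_tokens
        if dictionary.get(token, set()) & target_set
    )
    return hits / len(source_tokens)


def best_matches(source_sentences: List[str], target_sentences: List[str],
                 dictionary: Dict[str, Set[str]]) -> List[int]:
    """For each source sentence, return the index of the best target sentence."""
    targets = [tokenize(sentence) for sentence in target_sentences]
    result = []
    for sentence in source_sentences:
        tokens = tokenize(sentence)
        scores = [match_score(tokens, target, dictionary) for target in targets]
        result.append(max(range(len(targets)), key=scores.__getitem__))
    return result


# Hypothetical rudimentary dictionary, as a first pass might produce it.
toy_dictionary = {"good": {"dobro"}, "morning": {"jutro"}, "thank": {"hvala"}}
matches = best_matches(
    ["Good morning.", "Thank you."],
    ["Hvala.", "Dobro jutro."],
    toy_dictionary,
)
```

A real aligner additionally enforces that matches follow the document order and combines dictionary evidence with sentence-length information; the greedy choice here only shows how dictionary overlap can rank candidate pairs.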
The ranking page lists all contributing users, sorted in descending order by the total number of sentence pairs uploaded. The top three users receive virtual medals and are visually highlighted (Figure 2). This gamification element introduces competition among the users and thus increases their motivation.
Figure 2. List of the best contributors.
The rankings are also valuable because they list the names of the translation memories provided by the three most productive users. This is intended to inspire other users when they look for suitable corpus sources. For instance, noticing that people are uploading “Harry Potter” and other books might direct the corpus search to different book titles. Similarly, the fact that one user uploaded TV manuals may encourage other contributors to search for additional translated user manuals and technical documents. Every user has access to the most recent ranking at all times.
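The ranking logic described above can be sketched in a few lines. This is an illustration only: the medal labels, the alphabetical tie-breaking rule, the `build_ranking` helper, and the sample uploads are assumptions, not details of the platform.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MEDALS = ("gold", "silver", "bronze")  # assumed labels for the top three places


def build_ranking(uploads: Iterable[Tuple[str, int]]) -> List[Tuple[int, str, int, str]]:
    """Return (place, user, total sentence pairs, medal) rows, best first."""
    totals = Counter()
    for user, sentence_pairs in uploads:
        totals[user] += sentence_pairs
    # Sort by total in descending order; break ties alphabetically (assumption).
    ordered = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [
        (place, user, total, MEDALS[place - 1] if place <= len(MEDALS) else "")
        for place, (user, total) in enumerate(ordered, start=1)
    ]


# Hypothetical upload log: (user, number of sentence pairs in one upload).
sample_uploads = [("ana", 1200), ("marek", 800), ("ana", 300),
                  ("ivo", 950), ("ola", 100)]
ranking = build_ranking(sample_uploads)
```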
The resources collected on the platform can be exported to various popular formats, including:
  • TXT,
  • TMX,
  • Moses parallel files (a format widely used in machine translation system training).
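As an illustration of the last format, Moses-style parallel files are two plain-text files in which line i of one file corresponds to line i of the other. The sketch below shows one way such an export could look; the `export_moses` helper, the file naming scheme, and the sample units are assumptions rather than the platform's actual export code.

```python
import os
import tempfile
from typing import Iterable, Tuple


def export_moses(units: Iterable[Tuple[str, str]], prefix: str,
                 source_lang: str, target_lang: str) -> Tuple[str, str]:
    """Write line-aligned files <prefix>.<source_lang> and <prefix>.<target_lang>."""
    source_path = f"{prefix}.{source_lang}"
    target_path = f"{prefix}.{target_lang}"
    with open(source_path, "w", encoding="utf-8", newline="\n") as src, \
            open(target_path, "w", encoding="utf-8", newline="\n") as tgt:
        for source, target in units:
            # One segment per line: collapse embedded line breaks and runs of
            # whitespace, since a stray newline would shift the alignment.
            src.write(" ".join(source.split()) + "\n")
            tgt.write(" ".join(target.split()) + "\n")
    return source_path, target_path


# Hypothetical usage: export two units to a temporary directory.
out_dir = tempfile.mkdtemp()
paths = export_moses(
    [("Good morning.", "Dobro jutro."), ("Thank\nyou.", "Hvala.")],
    os.path.join(out_dir, "corpus"),
    "en",
    "hr",
)
```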
The total size of the resources collected in the TMX format exceeds 200 MB. However, it is important to note that it is possible to export individual corpora as well as groups of corpora filtered by specific conditions, such as:
  • source and target languages (the platform is not limited to English-Croatian corpora; the languages can be customized),
  • type of corpus (manual translation, corpus, automatically aligned, etc.),
  • domain (news, manuals, tourism, song lyrics, etc.).
The export feature allows for the creation of domain-specific corpora for the needs of various natural language processing experiments and the training of machine translation systems.
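Filtering by the conditions listed above can be sketched as a simple selection over metadata records. The record layout, the field names, and the `select_memories` helper below are hypothetical; the sketch only shows how language, type, and domain conditions might be combined before an export.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class TranslationMemory:
    # Hypothetical metadata record; field names are illustrative only.
    title: str
    source_lang: str
    target_lang: str
    resource_type: str
    domain: str


def select_memories(
    memories: Iterable[TranslationMemory],
    source_lang: Optional[str] = None,
    target_lang: Optional[str] = None,
    resource_types: Optional[Set[str]] = None,
    domains: Optional[Set[str]] = None,
) -> List[TranslationMemory]:
    """Keep memories that satisfy every supplied condition (None = no filter)."""
    selected = []
    for tm in memories:
        if source_lang is not None and tm.source_lang != source_lang:
            continue
        if target_lang is not None and tm.target_lang != target_lang:
            continue
        if resource_types is not None and tm.resource_type not in resource_types:
            continue
        if domains is not None and tm.domain not in domains:
            continue
        selected.append(tm)
    return selected


catalogue = [
    TranslationMemory("Daily news 2016", "en", "hr", "corpus", "News"),
    TranslationMemory("TV manual", "en", "hr", "corpus", "Manuals"),
    TranslationMemory("Hotel brochure", "hr", "en", "manual translation", "Tourism"),
]
manuals_and_tourism = select_memories(catalogue, domains={"Manuals", "Tourism"})
english_source = select_memories(catalogue, source_lang="en",
                                 resource_types={"corpus"})
```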
The outcome of the “TMrepository” project is an extensive corpus that is meant to be applied to various natural language processing tasks. The main purpose, however, was to use it in machine translation. This purpose strongly influenced the design of the platform and the type of data it stores.
First of all, machine translation requires extensive data. State-of-the-art neural models are able to generalize over vast amounts of information to provide nearly human-quality translations. To enable this generalization, it is necessary to use significantly sized datasets for training. Hence, “TMrepository” was designed as a web platform accessible by multiple researchers at once. The collective work of many researchers allowed for the collection of data in an order of magnitude suitable for statistical and neural machine translation training.
The other aspect of the “TMrepository” that was specifically crafted for machine translation training is the organization of data by domains. Machine translation models are known to perform better when used on data coming from a single domain. This platform allows for the export of data filtered by one or more domains.
Most importantly, it was created to collect datasets for a low-resourced language pair, English-Croatian. As opposed to projects focused solely on the accumulation of parallel corpora by crawling and aligning texts from the internet, “TMrepository” also values the quality of the corpora. The result is a dataset that is potentially very interesting from the point of view of machine translation system developers.

4. Experimental Scenario

This paper presents a study on applying the concepts of crowdsourcing and gamification to a group of students with the use of “TMrepository”. The initial experiments were conducted in Poland. The students taking part in the experiment were participating in an academic course on Natural Language Processing, as part of their computer science studies.
They had completed four to six semesters of study prior to the experiment. Thus, their backgrounds covered areas such as basic algorithms and C++ programming, object-oriented programming, web applications, mathematical analysis, algebra, logic, and set theory. The students, however, had no previous training in linguistics, machine translation, or NLP. Moreover, none of them spoke Croatian, and all were native speakers of Polish. Despite certain similarities between Croatian and Polish, two Slavic languages, speakers of just one of these languages cannot fully understand the other.
During the course, the fundamentals of web scraping were covered in lectures using command-line tools such as wget and Python’s urllib module. The web crawling software framework PyCrawler, which can be used to crawl corpora from the web, was also presented during a lecture. After the lectures, students were asked to start an NLP project of their choice. Building translation memories for the Croatian-English language pair was one of the suggested tasks. The students that selected this assignment were told to create translation memories by “any means necessary” and by using knowledge learned during lectures.
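A minimal sketch of the kind of scraping covered in such a course is shown below. It uses only the Python standard library (`urllib.request` and `html.parser`); the class and function names are hypothetical, and the code is not taken from the course materials. Any real crawl should also respect the terms of use and robots.txt of the target site.

```python
from html.parser import HTMLParser
from typing import List
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text chunks, skipping <script> and <style> content."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.chunks.append(text)


def extract_text(html: str) -> List[str]:
    """Return the visible text chunks of an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def fetch_text(url: str, timeout: float = 10.0) -> List[str]:
    """Download a page with urllib and return its visible text chunks."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_text(response.read().decode(charset, errors="replace"))


# Offline demonstration on a hypothetical page fragment.
sample_html = (
    "<html><head><style>p{}</style><script>var x=1;</script></head>"
    "<body><h1>Dobro došli</h1><p>Welcome</p></body></html>"
)
sample_chunks = extract_text(sample_html)
```

Running the extractor on the Croatian and English versions of the same page yields two lists of text chunks that can then be passed to a sentence aligner.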
After the initial experiments in Poland, additional corpus acquisition was carried out in Croatia with the help of students and experienced researchers focusing on natural language processing. Here, the researchers were recruited for the project predominantly based on their experience with automated tools for corpus creation. This was the single mandatory skill that enabled the participants to produce valuable linguistic resources. In summary, the profiles of the contributors varied by:
  • nationality—all participating students and researchers came from Poland and Croatia;
  • level of language understanding—all students and researchers had at least intermediate English understanding skills, but only Croats could read and fully understand Croatian (even though the Polish and Croatian languages exhibit some similarities, they are not mutually intelligible);
  • experience—from students in the early stages of their studies and graduate students in their twenties to experienced natural language processing researchers;
  • gender—the distribution of women and men among the participants was nearly even;
  • occupation—participants were either studying or researching the fields of information and communication sciences, computer science, linguistics, or data science with a special focus on natural language processing.
The goal of this paper was to present a platform for language resource acquisition and analyze the main characteristics of the collected data. As the work on “TMrepository” is still ongoing, the authors plan to conduct more experiments with regard to its usability, user-friendliness, and effectiveness. Furthermore, once the quality assurance phase is complete, the resulting corpus will be used to train and evaluate a group of machine translation engines for multiple translation domains.

5. Collected Corpus

In addition to the web-based platform “TMrepository”, which was designed to facilitate the collection of parallel corpora, the results of this research include a four-million-segment English-Croatian parallel corpus. Precisely 4,091,227 translation units, comprising almost 110 million words, were collected during this study (Table 2).
Table 2. Collected corpus—broken down by domains.
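A per-domain breakdown of this kind can be computed with a short script. The sketch below assumes that each unit is tagged with its domain and that words are counted as whitespace-separated tokens on both language sides; the paper does not state how its word counts were obtained, so both choices are assumptions, as are the function name and the sample data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def corpus_statistics(
    units: Iterable[Tuple[str, str, str]]
) -> Dict[str, Dict[str, int]]:
    """Count translation units and whitespace-separated words per domain."""
    stats = defaultdict(lambda: {"units": 0, "words": 0})
    for domain, source, target in units:
        stats[domain]["units"] += 1
        # Assumption: words on both the source and the target side are counted.
        stats[domain]["words"] += len(source.split()) + len(target.split())
    return dict(stats)


# Hypothetical sample: (domain, English segment, Croatian segment).
sample_units = [
    ("News", "Good morning.", "Dobro jutro."),
    ("News", "Thank you.", "Hvala."),
    ("Manuals", "Press the button.", "Pritisnite gumb."),
]
domain_stats = corpus_statistics(sample_units)
```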
The most common sources for parallel corpora include the following:
  • Croatian-English and English-Croatian parallel corpora, such as SETIMES or TED Talks,
  • technical documentation for various products,
  • tourism websites,
  • manuals,
  • song lyrics,
  • legal documents.
All the resources collected on “TMrepository” originate from publicly available data and open-source materials that were gathered by researchers who were willingly participating in this open-source project. Their original work is therefore not copyrighted and does not violate laws or regulations. In addition, collecting data from the internet is a standard procedure in web crawling.
The main characteristics of the acquired parallel data are the differences in translation memory size, the diversity of domains and domain independence, the variations in language register and style, and the ability to update the collected resources. As initially expected by the authors, the most represented domain is “General”, since this type of corpus is easy to access. It contains non-specific data that covers a wide range of generic topics (e.g., from news), and this is usually a good starting point for building general-purpose machine translation systems.
The following domains are similar in size due to the availability of bilingual resources on the internet: “Technical”, “Tourism”, “Manuals”, and “Books”. The domain “Technical” consists mostly of standards and guidelines related to the ICT industry, while “Tourism” was predominantly collected from tourism websites. The domain “Manuals” contains multilingual manuals for devices and home appliances, whereas data from the domain “Books” was mainly collected from open libraries and e-book platforms.
Next in line is the “News” domain, which was primarily collected by scraping online newspapers and internet portals. The least represented domains were “Song lyrics” and “Legal—law” (again, similar in size), followed by “Film subtitles” and “Literature—creative”.
Each collected domain can further be used for conducting task-oriented research, e.g., for building domain-specific machine translation systems, CAT tools, topic detection, terminology extraction, data analyses, NLP, etc.

6. Conclusions and Future Research

The main goal of this paper was to present a four-million-segment (almost 110 million words) English-Croatian corpus. It was built using a newly created web-based platform that works with other languages as well. In order to investigate the viability of a realistic implementation of software for collecting, storing, and organizing linguistic data, the authors examined the significance and various use-cases of parallel corpora, particularly for the purpose of machine translation. The platform integrates and combines the concepts of crowdsourcing and gamification, making it appropriate for both practical use by large audiences and for educational purposes. This is especially true for students that deal with natural language processing, machine translation, linguistics, CAT tools, language resources, etc. The platform has a user-friendly interface, is free to use, and is available to all users regardless of language. It facilitates international collaboration since the number of domains and language pairs can be expanded arbitrarily.
However, the platform is an ongoing project. In future work, the authors plan to add further functionalities and to motivate new potential users to actively contribute to the growth of the Croatian-English corpus, especially in domains for which data is scarce. In addition, the authors intend to incorporate more gamification elements into the web platform, such as challenges, avatars, levels, and points, in order to maximize the positive aspects of this methodology, make the platform more attractive, and encourage more users to participate by turning a tedious task into an entertaining one.
The corpus collected on “TMrepository” is due to be made public under an appropriate open-source license in the near future. The quality assurance process is still in progress, but once it is finished, it will be possible to release resources of optimal quality. This is all being implemented in order to create a place for preserving the resources of endangered languages.

Author Contributions

Conceptualization, R.J., S.S. and I.D.; methodology, R.J., S.S. and I.D.; software, R.J.; validation, R.J., S.S. and I.D.; formal analysis, R.J., S.S. and I.D.; investigation, R.J., S.S. and I.D.; resources, R.J., S.S. and I.D.; data curation, R.J.; writing—original draft preparation, R.J., S.S. and I.D.; writing—review and editing, I.D.; visualization, R.J., S.S. and I.D.; supervision, I.D.; funding acquisition, I.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Joshi, P.; Santy, S.; Budhiraja, A.; Bali, K.; Choudhury, M. The State and Fate of Linguistic Diversity and Inclusion in the NLP World. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5–10 July 2020; pp. 6282–6293. [Google Scholar] [CrossRef]
  2. Haddow, B.; Bawden, R.; Barone, A.V.M.; Helcl, J.; Birch, A. Survey of Low-Resource Machine Translation. Comput. Linguist. 2022, 48, 673–732. [Google Scholar] [CrossRef]
  3. Hedderich, M.A.; Lange, L.; Adel, H.; Strötgen, J.; Klakow, D. A Survey on Recent Approaches for Natural Language Processing in Low-Resource Scenarios. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Mexico City, Mexico, 6–11 June 2021; pp. 2545–2568. [Google Scholar] [CrossRef]
  4. Volk, M. Parallel Corpora, Terminology Extraction and Machine Translation. In Proceedings of the 16. DTT-Symposion. Terminologie und Text(e), Mannheim, Germany, 22–24 March 2018; pp. 3–14. [Google Scholar] [CrossRef]
  5. Jaworski, R.; Seljan, S.; Dunđer, I. Usability Analysis of the Concordia Tool Applying Novel Concordance Searching. In Proceedings of the International Conference on Information Technology & Systems (ICITS 2021), Libertad, Ecuador, 4–6 February 2021; pp. 128–138. [Google Scholar] [CrossRef]
  6. Macken, L.; Prou, D.; Tezcan, A. Quantifying the Effect of Machine Translation in a High-Quality Human Translation Production Process. Informatics 2020, 7, 12. [Google Scholar] [CrossRef]
  7. Seljan, S.; Erdelja, N.Š.; Kučiš, V.; Dunđer, I.; Bach, M.P. Quality Assurance in Computer-Assisted Translation in Business Environments. In Natural Language Processing for Global and Local Business; Pinarbasi, F., Nurdan Taskiran, M., Eds.; IGI Global Hershey: Hershey, PA, USA, 2021; pp. 242–270. [Google Scholar] [CrossRef]
  8. Eo, S.; Park, C.; Moon, H.; Seo, J.; Lim, H. Comparative Analysis of Current Approaches to Quality Estimation for Neural Machine Translation. Appl. Sci. 2021, 11, 6584. [Google Scholar] [CrossRef]
  9. Elmakias, I.; Vilenchik, D. An Oblivious Approach to Machine Translation Quality Estimation. Mathematics 2021, 9, 2090. [Google Scholar] [CrossRef]
  10. Wang, Y.; Li, X.; Yang, Y.; Anwar, A.; Dong, R. Hybrid System Combination Framework for Uyghur–Chinese Machine Translation. Information 2021, 12, 98. [Google Scholar] [CrossRef]
  11. Seljan, S.; Dunđer, I. Automatic quality evaluation of machine-translated output in sociological-philosophical-spiritual domain. In Proceedings of the Iberian Conference on Information Systems and Technologies (CISTI 2015), Aveiro, Portugal, 17–20 June 2015; pp. 1–4. [Google Scholar] [CrossRef]
  12. Jaworski, R.; Seljan, S.; Dunđer, I. Towards educating and motivating the crowd—A crowdsourcing platform for harvesting the fruits of NLP students’ labour. In Proceedings of the 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics (LTC 2017), Poznań, Poland, 17–19 November 2017; pp. 332–336. [Google Scholar]
  13. Kučiš, V.; Seljan, S. The role of online translation tools in language education. Babel 2014, 60, 303–324. [Google Scholar] [CrossRef]
  14. Gašpar, A.; Seljan, S.; Kučiš, V. Measuring Terminology Consistency in Translated Corpora: Implementation of the Herfindahl-Hirshman Index. Information 2022, 13, 43. [Google Scholar] [CrossRef]
  15. Béchara, H.; Orăsan, C.; Parra Escartín, C.; Zampieri, M.; Lowe, W. The Role of Machine Translation Quality Estimation in the Post-Editing Workflow. Informatics 2021, 8, 61. [Google Scholar] [CrossRef]
  16. Han, B. Translation, from Pen-and-Paper to Computer-Assisted Tools (CAT Tools) and Machine Translation (MT). Proceedings 2020, 63, 56. [Google Scholar] [CrossRef]
  17. Wang, R.; Tan, X.; Luo, R.; Qin, T.; Liu, T.-Y. A Survey on Low-Resource Neural Machine Translation. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI-21), Online, 19–26 August 2021; pp. 4636–4643. [Google Scholar] [CrossRef]
  18. Ngo, T.V.; Nguyen, P.-T.; Ha, T.-L.; Dinh, K.-Q.; Nguyen, L.-M. Improving Multilingual Neural Machine Translation For Low-Resource Languages: French, English—Vietnamese. In Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages, Suzhou, China, 4 December 2020; pp. 55–61. [Google Scholar] [CrossRef]
  19. Ranathunga, S.; Lee, E.-S.A.; Prifti Skenduli, M.; Shekhar, R.; Alam, M.; Kaur, R. Neural Machine Translation for Low-Resource Languages: A Survey. ACM Comput. Surv. 2023, 55, 1–37. [Google Scholar] [CrossRef]
  20. Koehn, P.; Och, F.J.; Marcu, D. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL ’03), Edmonton, AB, Canada, 27 May–1 June 2003; Volume 1, pp. 127–133. [Google Scholar] [CrossRef]
  21. Kamath, U.; Liu, J.; Whitaker, J. Deep Learning for NLP and Speech Recognition; Springer: Berlin/Heidelberg, Germany, 2019. [Google Scholar] [CrossRef]
  22. Bahdanau, D.; Cho, K.; Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. In Proceedings of the 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, 7–9 May 2015; pp. 1–15. [Google Scholar] [CrossRef]
  23. Koehn, P. Statistical Machine Translation; Cambridge University Press: New York, NY, USA, 2010. [Google Scholar] [CrossRef]
  24. Dong, D.; Wu, H.; He, W.; Yu, D.; Wang, H. Multi-Task Learning for Multiple Language Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP), Beijing, China, 27–31 July 2015; Volume 1, pp. 1723–1732. [Google Scholar] [CrossRef]
  25. Gehring, J.; Auli, M.; Grangier, D.; Dauphin, Y. A Convolutional Encoder Model for Neural Machine Translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; Volume 1, pp. 123–135. [Google Scholar] [CrossRef]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar] [CrossRef]
  27. Singh, T.D. Building Parallel Corpora for SMT System: A Case Study of English-Manipuri. Int. J. Comput. Appl. 2012, 52, 47–51. [Google Scholar] [CrossRef]
  28. Dunđer, I. Statistical Machine Translation System and Computational Domain Adaptation (Sustav za Statističko Strojno Prevođenje i Računalna Adaptacija Domene). Ph.D. Thesis, University of Zagreb, Zagreb, Croatia, 2015. [Google Scholar]
  29. Parida, S.; Bojar, O.; Dash, S.R. OdiEnCorp: Odia–English and Odia-Only Corpus for Machine Translation. In Proceedings of the Third International Conference on Smart Computing and Informatics (SCI 2018-19), Bhubaneswar, India, 21–22 December 2018; Volume 1, pp. 495–504. [Google Scholar] [CrossRef]
  30. Ambati, V.; Vogel, S. Can Crowds Build Parallel Corpora for Machine Translation Systems? In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk (CSLDAMT ’10), Los Angeles, CA, USA, 6 June 2010; pp. 62–65. [Google Scholar]
  31. Abdurakhmonova, N. Linguistic Issues of Creating Parallel Corpora for Uzbek Multilingual Machine Translation System. BuxDU Ilmiy Axborotnomasi 2020, 6, 60–68. [Google Scholar]
  32. Doğru, G.; Martín-Mor, A.; Aguilar-Amat, A. Parallel Corpora Preparation for Machine Translation of Low-Resource Languages: Turkish to English Cardiology Corpora. In Proceedings of the LREC 2018 Workshop ‘MultilingualBIO: Multilingual Biomedical Text Processing’, Miyazaki, Japan, 7–12 May 2018; pp. 12–15. [Google Scholar]
  33. Shearing, S.; Kirov, C.; Khayrallah, H.; Yarowsky, D. Improving Low Resource Machine Translation using Morphological Glosses. In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas, Boston, MA, USA, 17–21 March 2018; Volume 1, pp. 132–139. [Google Scholar]
  34. Forcada, M.L. Free/Open-Source Machine Translation for the Low-Resource Languages of Spain. In Proceedings of the 3rd Conference on Language, Data and Knowledge (LDK 2021), Zaragoza, Spain, 1–4 September 2021. [Google Scholar] [CrossRef]
  35. Chu, C.; Wang, R. A Survey of Domain Adaptation for Neural Machine Translation. In Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Santa Fe, NM, USA, 20–26 August 2018; pp. 1304–1319. [Google Scholar] [CrossRef]
  36. Dabre, R.; Chu, C.; Kunchukuttan, A. A survey of multilingual neural machine translation. ACM Comput. Surv. 2020, 53, 99. [Google Scholar] [CrossRef]
  37. Maruf, S.; Saleh, F.; Haffari, G. A Survey on Document-level Neural Machine Translation: Methods and Evaluation. ACM Comput. Surv. 2021, 54, 45. [Google Scholar] [CrossRef]
  38. Kuwanto, G.; Akyürek, A.F.; Tourni, I.C.; Li, S.; Jones, A.G.; Wijaya, D. Low-Resource Machine Translation Training Curriculum Fit for Low-Resource Languages. arXiv 2021, arXiv:2103.13272. [cs.CL], Computation and Language. [Google Scholar] [CrossRef]
  39. Sen, S.; Hasanuzzaman, M.; Ekbal, A.; Bhattacharyya, P.; Way, A. Neural machine translation of low-resource languages using SMT phrase pair injection. Nat. Lang. Eng. 2020, 27, 271–292. [Google Scholar] [CrossRef]
  40. Beloucif, M.; Gonzalez, A.V.; Bollmann, M.; Søgaard, A. Naive Regularizers for Low-Resource Neural Machine Translation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, 2–4 September 2019; pp. 102–111. [Google Scholar] [CrossRef]
  41. Koehn, P.; Knowles, R. Six Challenges for Neural Machine Translation. In Proceedings of the First Workshop on Neural Machine Translation, Vancouver, BC, Canada, 3–4 July 2017; pp. 28–39. [Google Scholar] [CrossRef]
  42. Seljan, S.; Dunđer, I.; Pavlovski, M. Human Quality Evaluation of Machine-Translated Poetry. In Proceedings of the International Convention on Information, Communication and Electronic Technology (MIPRO), Opatija, Croatia, 28 September–2 October 2020; pp. 1040–1045. [Google Scholar] [CrossRef]
  43. Lambebo, A.; Woldeyohannis, M.; Yigezu, M. A Parallel Corpora for bi-directional Neural Machine Translation for Low Resourced Ethiopian Languages. In Proceedings of the 2021 International Conference on Information and Communication Technology for Development for Africa (ICT4DA), Bahir Dar, Ethiopia, 22–24 November 2021; pp. 71–76. [Google Scholar] [CrossRef]
  44. Zhang, J.; Tian, Y.; Mao, J.; Han, M.; Wen, F.; Guo, C.; Gao, Z.; Matsumoto, T. WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation. Electronics 2023, 12, 1140. [Google Scholar] [CrossRef]
  45. Ha, T.-L.; Niehues, J.; Waibel, A. Toward Multilingual Neural Machine Translation with Universal Encoder and Decoder. In Proceedings of the 13th International Conference on Spoken Language Translation (IWSLT 2016), Seattle, WA, USA, 8–9 December 2016. [Google Scholar] [CrossRef]
  46. Lakew, S.M.; Cettolo, M.; Federico, M. A Comparison of Transformer and Recurrent Neural Networks on Multilingual Neural Machine Translation. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, NM, USA, 20–26 August 2018; pp. 641–652. [Google Scholar] [CrossRef]
  47. Tan, X.; Ren, Y.; He, D.; Qin, T.; Zhao, Z.; Liu, T.-Y. Multilingual Neural Machine Translation with Knowledge Distillation. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019; pp. 1–15. [Google Scholar] [CrossRef]
  48. Aji, A.F.; Bogoychev, N.; Heafield, K.; Sennrich, R. In Neural Machine Translation, What Does Transfer Learning Transfer? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), Online, 5–10 July 2020; pp. 7701–7710. [Google Scholar] [CrossRef]
  49. Kim, Y.; Gao, Y.; Ney, H. Effective Cross-lingual Transfer of Neural Machine Translation Models without Shared Vocabularies. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), Florence, Italy, 28 July–2 August 2019; pp. 1246–1257. [Google Scholar] [CrossRef]
  50. Dabre, R.; Nakagawa, T.; Kazawa, H. An Empirical Study of Language Relatedness for Transfer Learning in Neural Machine Translation. In Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation (PACLIC 2017), Manila, Philippines, 16–18 November 2017; pp. 282–286. [Google Scholar]
  51. Zoph, B.; Yuret, D.; May, J.; Knight, K. Transfer Learning for Low-Resource Neural Machine Translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP 2016), Austin, TX, USA, 1–5 November 2016; pp. 1568–1575. [Google Scholar] [CrossRef]
  52. Wang, W.; Zhang, Z.; Du, Y.; Chen, B.; Xie, J.; Luo, W. Rethinking Zero-shot Neural Machine Translation: From a Perspective of Latent Variables. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2021, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 4321–4327. [Google Scholar] [CrossRef]
  53. Currey, A.; Heafield, K. Zero-Resource Neural Machine Translation with Monolingual Pivot Data. In Proceedings of the 3rd Workshop on Neural Generation and Translation (NGT 2019), Hong Kong, China, 3–7 November 2019; pp. 99–107. [Google Scholar] [CrossRef]
  54. O’Brien, S. Collaborative translation. In Handbook of Translation Studies; Gambier, Y., van Doorslaer, L., Eds.; John Benjamins Publishing Company: Amsterdam, The Netherlands, 2011; Volume 2, pp. 17–20. [Google Scholar] [CrossRef]
  55. Howe, J. The Rise of Crowdsourcing. Wired Mag. 2006, 14. Available online: http://www.wired.com/wired/archive/14.06/crowds.html (accessed on 11 February 2023).
  56. Howe, J. Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business, 1st ed.; Crown Publishing Group: New York, NY, USA, 2008. [Google Scholar]
  57. Quinn, A.J.; Bederson, B.B. Human Computation: A Survey and Taxonomy of a Growing Field. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11), New York, NY, USA, 7–12 May 2011; pp. 1403–1412. [Google Scholar] [CrossRef]
  58. Sabou, M.; Bontcheva, K.; Derczynski, L.; Scharl, A. Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; pp. 859–866. [Google Scholar]
  59. Li, H.; Shen, H.; Xu, S.; Zhang, C. Visualizing NLP annotations for Crowdsourcing. arXiv 2015, arXiv:1508.06044. [cs.CL], Computation and Language. [Google Scholar] [CrossRef]
  60. Munro, R.; Gunasekara, L.; Nevins, S.; Polepeddi, L.; Rosen, E. Tracking Epidemics with Natural Language Processing and Crowdsourcing. In Proceedings of the AAAI Spring Symposium—Wisdom of the Crowd (AAAI 2012), Palo Alto, CA, USA, 26–28 March 2012. [Google Scholar]
  61. Sabou, M.; Bontcheva, K.; Scharl, A. Crowdsourcing Research Opportunities: Lessons from Natural Language Processing. In Proceedings of the 12th International Conference on Knowledge Management and Knowledge Technologies (i-KNOW ’12), Graz, Austria, 5–7 September 2012; pp. 1–8. [Google Scholar] [CrossRef]
  62. Ambati, V.; Vogel, S.; Carbonell, J. Collaborative Workflow for Crowdsourcing Translation. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work (CSCW ’12), Seattle, WA, USA, 11–15 February 2012; pp. 1191–1194. [Google Scholar]
  63. Zaidan, O.F.; Callison-Burch, C. Crowdsourcing Translation: Professional Quality from Non-professionals. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (HLT ’11), Portland, OR, USA, 19–24 June 2011; Volume 1, pp. 1220–1229. [Google Scholar]
  64. Muntés-Mulero, V.; Paladini, P.; Solé, M.; Manzoor, J. Multiplying the Potential of Crowdsourcing with Machine Translation. In Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Commercial MT User Program (AMTA 2012), San Diego, CA, USA, 28 October–1 November 2012. [Google Scholar]
  65. Muegge, U. Teaching computer-assisted translation in the 21st century. In TransÜD. Arbeiten zur Theorie und Praxis des Übersetzens und Dolmetschens (Alles Hängt Mit Allem Zusammen: Translatologische Interdependenzen. Festschrift für Peter A. Schmitt); Ende, A.-K., Herold, S., Weilandt, A., Eds.; Frank & Timme: Berlin, Germany, 2013; Volume 59. [Google Scholar]
  66. Canovas, M.; Samson, R. Open source software in translator training. Rev. Tradumàtica 2011, 9, 6–56. [Google Scholar] [CrossRef]
  67. Robson, K.; Plangger, K.; Kietzmann, J.H.; McCarthy, I.; Pitt, L. Game on: Engaging customers and employees through gamification. Bus. Horiz. 2016, 59, 29–36. [Google Scholar] [CrossRef]
  68. Morschheuser, B.; Werder, K.; Hamari, J.; Abe, J. How to gamify? Development of a method for gamification. In Proceedings of the 50th Annual Hawaii International Conference on System Sciences (HICSS), Hawaii, HI, USA, 4–7 January 2017; Volume 50, pp. 1298–1307. [Google Scholar] [CrossRef]
  69. Abdelali, A.; Durrani, N.; Guzmán, F. iAppraise: A Manual Machine Translation Evaluation Environment Supporting Eye-tracking. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations (NAACL 2016), San Diego, CA, USA, 13–15 June 2016; pp. 17–21. [Google Scholar] [CrossRef]
  70. Graliński, F.; Jaworski, R.; Borchmann, Ł.; Wierzchon, P. Gonito.net—Open platform for research competition, cooperation and reproducibility. In Proceedings of the 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language, Portorož, Slovenia, 28 May 2016; pp. 13–20. [Google Scholar]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
