WCC-JC: A Web-Crawled Corpus for Japanese-Chinese Neural Machine Translation

Featured Application: This research constructed a sizable Japanese-Chinese bilingual corpus by crawling websites. As a necessary resource for Japanese-Chinese neural machine translation (NMT), it helps researchers advance natural language processing research related to the Japanese and Chinese languages. Specific topics include comparative analysis of grammar, comparative studies of the Chinese and Japanese languages, compilation of dictionaries, etc. This is of great significance to the cultural exchange and industrial cooperation between China and Japan, and it also has important theoretical significance and application value for the industrialization of Japanese-Chinese machine translation. In addition, the application of this research can strengthen civil communication and enhance mutual understanding between China and Japan, at a time when relations between the two countries are not well perceived by their citizens. We hope that the construction and methodology of the Japanese-Chinese bilingual corpus in this research will help to overcome the language barrier in Japanese-Chinese people-to-people communication and mutual understanding. We offer the WCC-JC as a free download on the premise that it is used for research purposes only.

Abstract: Currently, only a limited number of Japanese-Chinese bilingual corpora are of sufficient size to serve as training data for neural machine translation (NMT). In particular, few corpora include spoken language such as daily conversation. In this research, we constructed a Japanese-Chinese bilingual corpus of a certain scale by crawling the subtitle data of movies and TV series from websites. We calculated the BLEU scores of the constructed WCC-JC (Web Crawled Corpus—Japanese and Chinese) and of the other compared corpora. We also manually evaluated the translation results of the model trained on the WCC-JC to confirm its quality and effectiveness.


Introduction
In recent years, Japan has been the second largest trading partner of China, while China is the largest trading partner of Japan. As important partners in cultural exchange and industrial cooperation, China and Japan pay close attention to each other. However, the language barrier has become a serious challenge that limits the two countries from strengthening their communication, and both expect a high-quality Chinese-Japanese machine translation system. Establishing such a system would help overcome the language barrier in Chinese-Japanese exchange activities and is of great significance to the technological and economic development of both countries.
Machine translation, which renders a source language into a target language, is an important field of artificial intelligence research and one of the most effective means of overcoming language barriers. After years of development, NMT has emerged as a new machine translation paradigm with great potential, exhibiting translation results superior to traditional frameworks for various language pairs. NMT models can be trained on large-scale parallel corpora, and the amount of training data greatly affects the quality of the translation results. With great potential for industrialization as well as significant research value, NMT is the most active topic in machine translation research today.
In the field of machine translation, Chinese-Japanese translation is difficult due to the complex intertwining of the two languages. Obtaining high-quality translations requires a large Japanese-Chinese bilingual corpus. For language pairs involving English, or between European languages, there are corpora of tens or even hundreds of millions of sentence pairs containing sentences from many different fields. However, few Japanese-Chinese bilingual corpora have been made public. For example, the Japanese Patent Office (JPO) Japanese-Chinese bilingual corpus has 130 million entries (about 26 GB) and 0.1 billion entries (about 1.4 GB), but research results using all of the data must be submitted (https://alaginrc.nict.go.jp/jpo-outline.html (accessed on 30 June 2017)), and the corpus contains only patent text. The ASPEC-JC corpus contains approximately 670k sentences [1], but only abstracts of scientific papers. These two corpora already rank among the largest publicly available Japanese-Chinese corpora in number of sentences. However, there are still few Japanese-Chinese bilingual corpora suitable for translating everyday spoken language.
Currently, most research on NMT focuses on improving the translation quality of translation models. However, the lack of corpora with the necessary basic properties remains an unresolved problem. Although approaches such as back-translation have been proposed to generate pseudo-data, they do not address the root of the problem. It is necessary to perform this basic research and lead the way by releasing the data. Professor Feifei Li of Stanford University released the famous ImageNet dataset, which led the field of computer vision and triggered the wave of AI that is sweeping the world today [2]. As she said, "One thing ImageNet changed in the field of AI is suddenly people realized the thankless work of making a dataset was at the core of AI research". This is exactly what our research set out to do.
Most of the current research focuses on translating sentences that fall into the category of written language. Spoken language, by contrast, is more ambiguous than written language and requires a greater understanding of context. For instance, spoken sentences are usually shorter than written ones, and slang and dialect may be mishandled by conventional translators. In addition, multimodal translation is recognized as a future research trend, and it requires a suitable bilingual corpus of spoken language.
In this article, our research goal is to construct a Japanese-Chinese bilingual corpus by crawling data from websites. The corpus (WCC-JC) is built from movie, anime, and TV series subtitles, which are then aligned between Japanese and Chinese. It also covers spoken language, which has not been well covered by the existing Chinese-Japanese bilingual corpora. Once completed, the corpus is expected to attract researchers' attention as a crucial resource for Japanese-Chinese NMT and to promote the progress of NMT research.
In the remainder of this article, Section 2 presents the related works. Section 3 describes the construction of WCC-JC. Section 4 reports the experimental framework, the results of the Japanese-Chinese and Chinese-Japanese translation experiments, and the manual evaluations. Section 5 discusses the legitimacy of WCC-JC. Finally, Section 6 concludes with a discussion of the contributions of this article and future work.

Related Works
The corpus was systematically introduced and analyzed in Anatol Stefanowitsch's book Corpus Linguistics: A Guide to the Methodology [3], which surveys several examples from corpus linguistics to show how they fit into the outlined methodology.
The Japanese-Chinese bilingual corpus at Beijing Normal University contains about 80 full texts of novels, poems, essays, biographies, political commentaries, and legal treatises from the modern to the contemporary period [4]. However, due to copyright restrictions, this corpus was only partially available for research, and could not be trained and used in NMT.
Zhang et al. constructed a Japanese-Chinese parallel corpus by human translation, which was part of the NICT multilingual corpus. The quality was high and annotated, but there were only about 40,000 sentence pairs [5].
Koehn collected a corpus of parallel text in 11 languages from the proceedings of the European Parliament, which are published on the Web. This corpus had found widespread use in the NLP community [6].
Lavecchia et al. proposed a method to automatically build a bilingual corpus from movie subtitle files, and also created a translation table by the method [7].
Baroni et al. introduced ukWaC, deWaC, and itWaC, three very large corpora of English, German, and Italian built by Web crawling, and described the methodology and tools used in their construction. The corpora contained more than a billion words each, and were thus among the largest resources for the respective languages [8].
Smith et al. used their open-source extension of the STRAND algorithm to mine 32 terabytes of the Common Crawl, a public Web crawl hosted on Amazon's Elastic Compute Cloud. Even with minimal cleaning and filtering, the resulting data boosted translation performance across the board for five different language pairs in the news domain [9].
Chu et al. proposed a bilingual sentence extraction system to construct a Japanese-Chinese bilingual corpus from Wikipedia [10]. Using the system, they constructed a Japanese-Chinese bilingual corpus containing more than 126k highly accurate bilingual sentences from Wikipedia.
Benko reported on the first phase of an ongoing project to create a Web corpus and summarized the problems encountered in the process [11].
The United Nations Parallel Corpus v1.0 was composed of official records and other parliamentary documents of the United Nations that were in the public domain. These documents were mostly available in the six official languages of the United Nations. The current version of the corpus contained content that was produced and manually translated between 1990 and 2014, including sentence-level alignments [12].
Pryzant et al. constructed a Japanese-English correspondence database of movie and TV program subtitles crawled from the Internet. The JESC was the largest freely available Japanese-English bilingual corpus (about 2.8 million sentences) [13].
OpenSubtitles2018 was a multilingual parallel corpus of movie subtitle data [14]. Its Japanese-English portion was a parallel corpus of two million sentences from approximately 2000 movies, intended for machine translation and other tasks that exploit the characteristics of movie subtitles.
Park et al. proposed a simple, linguistically motivated solution to improve the performance of Korean-Chinese neural machine translation models by using a common vocabulary [15].
Morishita et al. constructed JParaCrawl, a large-scale Web-based Japanese-English bilingual corpus, by crawling the Web on a large scale, automatically collecting Japanese-English bilingual sentences, and filtering out noisy bilingual pairs [16].
Guokun et al. automatically built a corpus by crawling language resources from the Internet, but the data were not well filtered [17].
Hasan et al. built a sentence segmentation method for Bengali, a low-resource language, and constructed a non-English bilingual corpus [18]. Likewise, the construction of an open Japanese-Chinese bilingual corpus for NMT has significant implications for the resource-scarcity problem.
Václav et al. compared two corpora of Czech. One was a traditional corpus and the other was a Web-crawled corpus, which had been extensively compared and analyzed for quality [19].
The EuroparlTV Multimedia Parallel Corpus (EMPAC) was a collection of subtitles in English and Spanish for videos from the European Parliament's Multimedia Centre. The corpus covered a time span from 2009 to 2017 and it was made up of 4000 texts amounting to two and half million tokens for every language, corresponding to approximately 280 h of video [20].
Nakazawa et al. introduced the BSD corpus and the results of the BSD translation task in WAT2020. Additionally, they discussed the challenges of dialogue translation based on the analysis of the translation results [21]. The BSD corpus was constructed using "dialogue" in "business" as the domain of the bilingual corpus.
Liu et al. conducted a parallel corpus in the field of biomedicine for English-Chinese translation [22]. They compared the effectiveness of different algorithms/tools for sentence boundary detection and sentence alignment, and used the constructed corpus, fine-tuning the NMT models.
Dou et al. used pretrained language models, but by fine-tuning them on parallel texts with the aim of improving alignment quality, they proposed a method for efficiently extracting alignments from these fine-tuned models and demonstrated that their models can consistently outperform all previous state-of-the-art models of the species [23].
Schwenk et al. presented an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, extracting 135M parallel sentences for 16,720 different language pairs, and achieving strong BLEU scores for many language pairs [24].
For Japanese-Chinese translation, Zhang et al. proposed the following three data augmentation methods to improve the quality of Japanese-Chinese NMT: (1) radicals as an additional input feature [25]; (2) the created Chinese character decomposition table [26]; (3) a corpus augmentation approach [27], considering the lack of resources in bilingual corpora.
The related works above are sorted by year. Table 1 summarizes and classifies them into five aspects: (1) corpus linguistics; (2) Japanese-Chinese bilingual corpora; (3) Web-crawled corpora; (4) other corpora; (5) corpus augmentation.

Table 1. Summary of related works.

Construction of Japanese-Chinese Bilingual Corpus
The corpus to be constructed, WCC-JC, is a collection of Japanese-Chinese bilingual sentences from the Web. This method discovers websites that may contain Japanese-Chinese bilingual sentences, and attempts to extract bilingual sentences from the Web data.

Web Crawling
Considering a website that contains many Japanese-Chinese bilingual texts, we use Scrapy (https://scrapy.org/ (accessed on 10 October 2021)) to retrieve subtitle files from the website (http://assrt.net/ (accessed on 10 June 2020)) that contains subtitle files of movies, dramas, and TV series. In these subtitle files, there are bilingual translations of slang, spoken language, explanatory text, and story commentary. These are areas that have not been dealt with much in the existing corpora.

Extraction of Bilingual Sentences
Most of the acquired subtitle files are Advanced SubStation Alpha (ASS) files. Figure 1 shows an example of the contents of an ASS file (dummy contents). As shown in the figure, each dialogue line of an ASS file records the display start time, end time, style, subtitle text, and so on. The language information often appears in the layer field (the first "0" in the figure) and the style field ("DefaultJp" in the figure). If a dialogue line's style name contains one of "ja", "jp", or "日 (Japan)", the style is judged to be Japanese.
If the style name contains one of "cn", "ch", "zh", "中 (China)", or "default", the style is judged to be Chinese. If neither a Japanese style nor a Chinese style is found, the subtitle file is judged not to contain Japanese-Chinese bilingual subtitles.
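As an illustration, the dialogue-line layout and the style heuristics above can be sketched in Python (the function names and marker tuples here are ours for illustration, not the paper's implementation):

```python
import re

# Style-name markers used to guess the language of a dialogue line.
# Japanese markers are checked first, so a style such as "DefaultJp"
# (which also contains "default") is classified as Japanese.
JA_MARKERS = ("ja", "jp", "日")
ZH_MARKERS = ("cn", "ch", "zh", "中", "default")

def parse_dialogue(line):
    """Parse an ASS 'Dialogue:' line into (start, end, style, text).
    Field order: Layer, Start, End, Style, Name, MarginL, MarginR,
    MarginV, Effect, Text (the text itself may contain commas)."""
    if not line.startswith("Dialogue:"):
        return None
    fields = line[len("Dialogue:"):].split(",", 9)
    if len(fields) < 10:
        return None
    start = fields[1].strip()
    end = fields[2].strip()
    style = fields[3].strip()
    # Drop inline override tags such as {\an8} before keeping the text.
    text = re.sub(r"\{[^}]*\}", "", fields[9]).strip()
    return start, end, style, text

def style_language(style):
    """Return 'ja', 'zh', or None according to the style-name heuristics."""
    s = style.lower()
    if any(m in s for m in JA_MARKERS):
        return "ja"
    if any(m in s for m in ZH_MARKERS):
        return "zh"
    return None
```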
After determining whether the dialogue lines are Japanese or Chinese, the Japanese-style and Chinese-style dialogue lines are each sorted in ascending order of start time (the value of the start field). Corresponding bilingual sentences are then extracted by matching the start (value of the start field) and end (value of the end field) timestamps, and the correspondences are stored. If the timelines do not correspond exactly, we also consider contextual one-to-many and many-to-many relationships and check whether the timelines across multiple sentence pairs correspond correctly; if they do, we treat the multiple sentence pairs as one bilingual pair. This allows us to match as many pairs of bilingual sentences as possible. Finally, we obtained the raw parallel corpus.
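The exact-match core of this alignment step might look as follows (a simplified sketch; the one-to-many and many-to-many merging described above requires additional interval logic that is omitted here):

```python
def align_by_time(ja_lines, zh_lines):
    """Pair Japanese and Chinese subtitle lines whose (start, end)
    timestamps match exactly. Each input is a list of (start, end, text)
    tuples sorted by start time, as produced from the ASS dialogue lines."""
    # Index the Chinese lines by their display interval.
    zh_by_span = {}
    for start, end, text in zh_lines:
        zh_by_span.setdefault((start, end), []).append(text)
    # Emit a bilingual pair for every Japanese line with a matching interval.
    pairs = []
    for start, end, ja_text in ja_lines:
        for zh_text in zh_by_span.get((start, end), []):
            pairs.append((ja_text, zh_text))
    return pairs
```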

Text Processing
Before corpus segmentation, the text in the raw parallel corpus needs to be normalized. We convert traditional Chinese characters to simplified Chinese characters and use zenhan (https://pypi.org/project/zenhan/ (accessed on 10 October 2021)) to normalize Japanese katakana to full-width form. Finally, the text is sorted and deduplicated to obtain the filtered parallel corpus.
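A minimal sketch of this normalization and deduplication step, using the standard library's NFKC normalization as a stand-in for zenhan (NFKC folds half-width katakana into full-width); traditional-to-simplified Chinese conversion needs an external mapping such as OpenCC and is omitted here:

```python
import unicodedata

def clean_pairs(pairs):
    """Normalize, deduplicate, and sort (ja, zh) sentence pairs.
    NFKC normalization converts half-width katakana to full-width form,
    approximating the zenhan step described in the text. Empty sentences
    and duplicate pairs are dropped."""
    seen, out = set(), []
    for ja, zh in pairs:
        ja = unicodedata.normalize("NFKC", ja).strip()
        zh = unicodedata.normalize("NFKC", zh).strip()
        if ja and zh and (ja, zh) not in seen:
            seen.add((ja, zh))
            out.append((ja, zh))
    return sorted(out)
```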

Corpus Segmentation
Since the generated corpus WCC-JC consists of subtitle data from many works, validation data (development data) and test data for NMT must be extracted. Therefore, using other corpora as a reference, we randomly extracted 2000 sentence pairs of ten characters or more twice, once as development data and once as test data, and used the remaining sentence pairs as training data. Figure 2 shows the whole workflow of the corpus construction, which is divided into the four steps covered in this section: (1) Web crawling; (2) extraction; (3) text processing; (4) corpus segmentation. Table 2 shows the number of sentences in ASPEC-JC, OpenSubtitles, and the constructed corpus. In terms of capacity, ASPEC-JC > OpenSubtitles > WCC-JC, which is mainly due to the sentence lengths in each corpus. Figure 3 shows a right-skewed sentence length distribution; the average lengths of the Chinese and Japanese sentences are 10.6 and 13.7 characters, respectively. The WCC-JC consists of spoken subtitles, so its sentences are generally shorter: even though it has more sentences than ASPEC-JC, it does not contain as much information.
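This splitting recipe can be sketched as follows (the function name and fixed seed are our own choices for illustration):

```python
import random

def split_corpus(pairs, n_held_out=2000, min_chars=10, seed=42):
    """Split (ja, zh) pairs into train/dev/test following the recipe above:
    randomly draw n_held_out pairs of min_chars or more characters twice
    (once for development data, once for test data); everything else
    becomes training data. The seed is arbitrary, for reproducibility."""
    rng = random.Random(seed)
    eligible = [i for i, (ja, zh) in enumerate(pairs)
                if len(ja) >= min_chars and len(zh) >= min_chars]
    drawn = rng.sample(eligible, 2 * n_held_out)
    dev_idx, test_idx = set(drawn[:n_held_out]), set(drawn[n_held_out:])
    train = [p for i, p in enumerate(pairs)
             if i not in dev_idx and i not in test_idx]
    dev = [pairs[i] for i in sorted(dev_idx)]
    test = [pairs[i] for i in sorted(test_idx)]
    return train, dev, test
```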

Experiment and Evaluation
In order to confirm the effectiveness of this corpus, we conducted experiments. In Section 4.1, we describe the configuration of the NMT systems used in the following experiments. In Section 4.2, we calculate the BLEU scores of ASPEC-JC, OpenSubtitles, and the constructed corpus with their own trained NMT models, and use JPO adequacy scores to manually evaluate the translation results on test data W.
We also show the ability of the constructed corpus to generalize to different domains. For Japanese-Chinese translation, character-level and LSTM models were previously found to be the most effective [28]. In our experiments, we compared the character level, the subword level, the LSTM model, and the transformer model.

Configuring the NMT System
In this experiment, we trained the NMT models using fairseq [29]. We used two predefined fairseq architectures, lstm-wiseman-iwslt-de-en [30] and transformer-iwslt-de-en [31], as the LSTM model and the transformer model. The LSTM model had an embedding size of 512, 1 encoder layer, and 1 decoder layer. The transformer model had an embedding size of 512, 6 encoder layers with 8 encoder attention heads, and 6 decoder layers with 8 decoder attention heads. The remaining parameters of the two models were the same: a dropout rate of 0.1; the Adam optimizer with betas of 0.9 and 0.98; a learning rate of 1 × 10⁻⁷; max tokens of 4096; max update steps of 150,000; a batch size of 128; and, for the translation process, a beam size of 5. The subword level was realized with subword-nmt (https://github.com/rsennrich/subword-nmt (accessed on 23 May 2017)), with a vocabulary size of 32,000.
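For illustration, a fairseq training invocation with these settings might look like the following. The data and checkpoint paths are placeholders, and the flags follow the fairseq command-line interface (where predefined architecture names are written with underscores); this is a sketch, not the exact command used in the paper.

```shell
# Train on a binarized bilingual dataset (placeholder path data-bin/wcc_jc).
fairseq-train data-bin/wcc_jc \
    --arch transformer_iwslt_de_en \
    --optimizer adam --adam-betas '(0.9, 0.98)' \
    --lr 1e-7 --dropout 0.1 \
    --max-tokens 4096 --max-update 150000 --batch-size 128 \
    --save-dir checkpoints/wcc_jc

# Translate the test set with beam size 5.
fairseq-generate data-bin/wcc_jc \
    --path checkpoints/wcc_jc/checkpoint_best.pt --beam 5
```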
BiLingual Evaluation Understudy (BLEU) is an algorithm for evaluating the quality of text that has been machine-translated from one natural language to another [32]. The BLEU scores were calculated for each experiment using the "fairseq-score" command after word segmentation; in other words, we adopted word-level evaluation.
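At its core, BLEU combines clipped n-gram precisions with a brevity penalty. A minimal sentence-level sketch in plain Python (not the fairseq-score implementation, which scores whole corpora and applies smoothing) is:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of the given order in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions,
    multiplied by a brevity penalty for short candidates. This sketch
    returns 0.0 as soon as one n-gram order has no overlap."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        overlap = sum((cand & ref).values())  # clipped matches
        if overlap == 0:
            return 0.0
        log_prec += math.log(overlap / sum(cand.values()))
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_prec / max_n)
```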

Evaluation
In this section, the human evaluators are native speakers of Chinese and Japanese. The Chinese evaluators had studied abroad in Japan and had passed the N2 or higher level of the Japanese Language Proficiency Test (JLPT).

Alignment Evaluation
We checked the validity of the bilingual sentence alignments following the procedure of [33]. Human evaluators inspected 1000 randomly sampled sentence pairs. On average, 88% of these pairs were perfectly aligned, 7% partially aligned, and 5% misaligned. Thus, we may conclude that WCC-JC is noisy but carries a significant signal, with considerable room for improvement.

Translation Evaluation
In addition to alignment, we evaluated the quality of the translations in WCC-JC. Our evaluators used the JPO adequacy criterion, which measures the level of content transfer. This is a subjective evaluation of how accurately a machine translation conveys the substantive content of the source text in light of the reference translation, scored on a 5-point scale (5 being the best and 1 being the worst) (https://www.jpo.go.jp/system/laws/sesaku/kikaihonyaku/tokkyohonyaku_hyouka.html (accessed on 5 September 2021)). Table 3 shows the JPO adequacy criterion from 5 to 1. This criterion is also used in Section 4.2.3.

Table 3. The JPO adequacy criterion.

Scores
We sampled and evaluated 1000 sentence pairs from the pool of non-misaligned sentences, observing an average JPO adequacy score of 4.67, implying that the amateur and crowd-sourced translations are of high quality.

Machine Translation Performance and Manual Evaluation
In Tables 4-7, A denotes the test data of ASPEC-JC; O the test data of OpenSubtitles; W the test data of WCC-JC; and N a test set of 185 sentences extracted from the text of NHK Radio's "まいにち中国語 (Everyday Chinese)" (https://www2.nhk.or.jp/gogaku/onayami/chinese/ (accessed on 15 November 2021)), used to verify the generalization capability of the translation models. The two predefined fairseq architectures, lstm-wiseman-iwslt-de-en and transformer-iwslt-de-en, are abbreviated as LSTM and Transformer, respectively. From Tables 4-7, we can conclude that each corpus achieved the highest BLEU values on its own test data, except for test data O at the character level with the transformer under Japanese→Chinese (J→C), where our WCC-JC was more effective than OpenSubtitles' own model. This also shows the generalizability of WCC-JC. Additionally, WCC-JC achieved better results than ASPEC-JC and OpenSubtitles on both test data W and N in all cases. Moreover, by comparing the results across the tables, we can conclude that the character-level transformer model was the most effective in both the J→C and Chinese→Japanese (C→J) directions.
There are many possible reasons for the low BLEU values. The most likely cause is that one Japanese sentence was translated into many different Chinese sentences, i.e., a one-to-many situation. To investigate why the BLEU values in Tables 4-7 are not very high, we examined the duplicate Japanese sentences, that is, one-to-many Japanese-Chinese sentences. Table 8 shows the top 10 duplicate Japanese sentences. As can be seen, these sentences are all very common daily-use utterances in spoken language, and they were translated differently when used in different scenarios, i.e., one-to-many. Table 9 shows the results of one-to-many translations (only ten kinds are shown here); the corresponding English translations show that these are colloquial short sentences. Although the characters differ, the meanings are actually very similar: spoken language varies slightly across scenarios even when the English translations are essentially the same. However, when evaluating the translation results, we cannot include all possible translation references. This is a problem that is difficult to avoid with machine translation evaluation metrics.

Figure 4. Results of the manual evaluation (Japanese→Chinese).

Figure 5. Results of the manual evaluation (Chinese→Japanese).
We set up three groups, A, B, and C. The two people in each group evaluated the translations independently with the criteria from Section 4.2.2. The specific information is given in Table 10. Then, the average scores were calculated. The numbers in the bar charts indicate the percentage of each evaluation value within each group, and the numbers in parentheses after the group names represent the averages of the JPO scores. Our average JPO scores were 3.87, 3.90, and 4.00 on J→C and 3.97, 4.04, and 4.07 on C→J, which are relatively high-level results. However, due to the nature of the WCC-JC, many of the sentences are very short, so a deeper analysis is needed to determine how satisfactory the results are. We should also evaluate the other experiments to identify differences in the results and their causes.

Dataset Publication and Copyright Law
When data collected from the Web are published, there is a risk of copyright infringement because the data may contain the copyrighted works of others.
We have consulted with relevant professional lawyers, and WCC-JC does not violate the Copyright Law of the People's Republic of China.
According to Article 30-4 of the revised Copyright Law of Japan, which came into effect in 2019, "It is permissible to exploit a work, in any way and to the extent considered necessary, in any of the following cases, or in any other case in which it is not a person's purpose to personally enjoy or cause another person to enjoy the thoughts or sentiments expressed in that work", and even the works of others can be widely used for data analysis (translation model creation, image recognition model creation, etc.) (https://www.bunka.go.jp/seisaku/chosakuken/hokaisei/h30_hokaisei, https://www.japaneselawtranslation.go.jp/en/laws/view/3379 (both accessed on 15 November 2021)). Our data were obtained for Japanese-Chinese NMT research through an automated acquisition process without any human intervention, which does not violate the Copyright Law of Japan.

Conclusions
In this research, we introduced a Japanese-Chinese bilingual corpus: WCC-JC. This corpus was constructed by crawling the Web on a large scale and automatically collecting Japanese-Chinese bilingual sentences. In the end, about 753k sentence pairs of Japanese-Chinese bilingual data were obtained. The corpus is one of the largest Japanese-Chinese corpora available at present, and includes bilingual texts in spoken languages, which have not been widely treated in existing corpora.
In the experiments using conversational sentences extracted from language course textbooks, we confirmed that although the BLEU values were low, the translation accuracy was the highest among the compared Japanese-Chinese corpora. We also obtained relatively high-level results from the manual evaluations.
Future work includes constructing a larger-scale Web-crawled corpus. Another important issue is to improve the accuracy of the bilingual sentence alignment based on subtitle display times. We are also considering adding more language pairs in the future.