WCC-EC 2.0: Enhancing Neural Machine Translation with a 1.6M+ Web-Crawled English-Chinese Parallel Corpus

Abstract: This research introduces WCC-EC 2.0 (Web-Crawled Corpus for English and Chinese), a parallel corpus for Neural Machine Translation (NMT) containing over 1.6 million English-Chinese sentence pairs gathered via web crawling. The corpus spans the news and music domains, which are rarely covered by existing English-Chinese resources, and their inclusion broadens the scope of data available for NMT research and development. Translation experiments and manual evaluations show that models trained on WCC-EC 2.0 achieve the highest average BLEU score across the test sets considered, together with high human ratings. WCC-EC 2.0 is therefore a useful resource for researchers and developers who aim to improve translation accuracy, train more resilient models, and promote interlingual communication.


Introduction
In recent years, Neural Machine Translation (NMT) [1][2][3][4] has become the leading technology in the machine translation field, providing many improvements over older methods. The core of NMT is an encoder-decoder neural network, improved by an attention mechanism that captures the complex relationships and nuances between languages. This mechanism greatly increases the accuracy and fluency of translations. As a result, NMT is better than traditional statistical methods at translating long sentences, words with multiple meanings, and sentences with complicated grammar. An important feature of NMT is its end-to-end training process, which allows direct learning from the source to the target language without the multiple stages typical of older methods. This unified approach simplifies model design and training and greatly improves the efficiency of the translation system.
However, NMT also has its weaknesses. It depends heavily on large, high-quality parallel corpora, which is a challenge for languages with fewer resources. NMT models are also sensitive to mistakes or ambiguities in the input, which can lead to less accurate translations, and they can struggle with specialized terms and specific contexts. By contrast, Large Language Models (LLMs) such as ChatGPT [5] and GPT-4 [6], trained on large text collections, are good at learning a wide range of language knowledge and understanding context. This broad training gives LLMs an advantage in producing coherent translations, especially for complex or ambiguous text. Their ability to translate is impressive, and several studies [7][8][9][10] have highlighted their effectiveness. In fact, some LLMs have produced translations that are as good as the best systems from the Workshop on Machine Translation (WMT) [11]. Researchers have been actively examining how these LLMs perform on both widely spoken and low-resource languages [9,12,13], multilingual translation [8], and document-level translation [14,15]. There has also been a trend towards improving LLM translation through fine-tuning [16,17] and better prompting strategies [18,19]. Moreover, NMT still needs improvement for some language structures and low-resource languages [20]. It is vital to find effective methods that make translation models perform better across different areas of language processing.
The quality and size of a corpus greatly affect how well NMT and LLMs work. Creating large, high-quality corpora is still a key focus of NMT research and practice. In this situation, large, high-quality English-Chinese corpora that are specific to certain fields are very valuable. They give researchers more diverse data, which is crucial for advancing NMT.
WCC-EC 1.0 [21] contains approximately 340,000 pairs of English-Chinese news data. This corpus is useful because it covers many subjects and has formal language features, but it lacks casual language and everyday expressions. To make up for this, this study added about 1.3 million pairs of lyrics data to WCC-EC 1.0, creating the WCC-EC 2.0 parallel corpus, which includes both news and music content. There were several reasons for adding the lyrics:

• Firstly, there is a clear gap in the creation of bilingual English-Chinese corpora for music, a gap not addressed by major projects such as WMT and OPUS [22], even though these projects include corpora with tens of millions of pairs. Furthermore, lyrics tend to be more conversational, full of spoken phrases, slang, and common expressions that are closer to everyday language. The common use of words with multiple meanings in spoken language gives NMT more context to work with, helping it deal with ambiguity and improve translation accuracy.

• Secondly, the scaling laws of language modeling suggest that the effectiveness of a model depends on how much data it is trained on. In recent years, increasing the amount of data has been a common way to make language models better. However, the Epoch AI Research team [23] predicted that the stock of high-quality language data would run out by 2026, and that even the stock of lower-quality data would start to run out between 2030 and 2050. Therefore, combining a music-domain parallel corpus with WCC-EC 1.0 to create a domain-specific parallel corpus, WCC-EC 2.0, is very important.

The main contributions of this paper are as follows:

• We introduce WCC-EC 2.0, an expanded parallel corpus that combines approximately 1.3 million pairs of lyrics data with existing news data, significantly enhancing the linguistic diversity of the corpus and its applicability to NMT research. Additionally, we fill a notable gap by including texts from the music domain, which are currently scarce. The corpus is freely available for download for non-commercial research purposes, providing a valuable resource for the NMT community.

• We set up a human evaluation procedure that complements automatic measures such as BLEU scores to give a more detailed view of translation quality. This procedure makes it possible to judge translations on how natural they sound, how complete they are, and how well they use everyday language, giving deeper insight into the performance of WCC-EC 2.0.

This study mainly focuses on how WCC-EC 2.0 was built, how its quality was checked, and the evaluations of both the lyrics and news data. The paper is organized as follows. Section 2 discusses how parallel corpora are built and aligned. Section 3 describes the complete process of building the corpus in this study, including the problems faced and the solutions found. Section 4 explains the experimental steps and gives a detailed analysis of the results. Section 5 concludes with a review of WCC-EC 2.0 and looks at possible directions for future work.

Related Works
The importance of a corpus in NMT cannot be overstated. As the fundamental base for training NMT models, a large bilingual corpus allows the model to learn the language patterns and links between the source and target languages. This learning is key to achieving high-quality translations.

Corpus Construction
The scale of a corpus is directly correlated with the performance of NMT models. A substantial bilingual dataset helps models better capture the correspondences between languages, thereby enhancing translation quality. Over recent years, the development of WMT has been pivotal: its annual dataset releases provide standardized benchmarks for researchers. Tiedemann [22] presented OPUS, an extensive, freely available parallel corpus encompassing over 200 languages, with tools for exploration and integration that support research and development in linguistic studies. Mackenzie et al. [24] created the CC-News-En corpus from Common Crawl Foundation data, which mitigated the shortage of journalism corpora to an extent. To clarify corpus evaluation, Lefer's [25] chapter on parallel corpora in "A Practical Handbook of Corpus Linguistics" outlined the main features of parallel corpora. It also explored methods of analysis, including the combined use of parallel and comparable corpora, and addressed challenges in corpus design and analysis. ParaCrawl [26], developed by Bañón et al., used open-source web-crawling software to assemble what was the largest publicly available parallel corpus as of 2020, and it served as an inspiration for this research. Ziemski et al. [27] detailed the establishment of the official United Nations Parallel Corpus, the first parallel corpus built from UN documents in its six official languages, accessible under a liberal license. Liu et al. [28] developed a pipeline for acquiring and processing an English-Chinese parallel corpus from the New England Journal of Medicine, demonstrating significant translation quality improvements with targeted training data. Liu et al. [29] introduced DuRecDial 2.0, a bilingual parallel dialog dataset for English and Chinese aimed at advancing monolingual, multilingual, and cross-lingual conversational recommendation systems, and showed the benefits of incorporating additional English data for Chinese conversational recommendation. Furthermore, Zhang et al. [30,31] developed the WCC-JC Japanese-Chinese translation corpus and the manually aligned WCC-JC 2.0, a large-scale Japanese-Chinese parallel corpus built through web crawling, providing considerable support for Japanese-Chinese translation research.
Increases in corpus data volume have significantly contributed to the field of NMT. Sugiyama et al. [32] introduced context-aware neural machine translation (CAMT), which involved generating a large-scale pseudo-parallel corpus by back-translating monolingual data, supported by a substantial amount of parallel corpora. In 2023, Li et al. [33] enhanced traditional back-translation techniques with a novel approach, instruction back-translation. This method involved fine-tuning a language model on a minimal dataset and then using it to generate instructional cues for web documents, selecting only high-quality examples for training. To augment low-resource corpora, Morita et al. [34] proposed hybrid and dynamic hybrid sampling, which merge optimal and random sampling. Experimental results showed that dynamic hybrid sampling consistently outperformed previous optimal sampling methods. Zhang and Matsumoto [35] developed a strategy to expand the parallel corpus available for Japanese-Chinese NMT systems, aiming to significantly raise translation quality in situations where bilingual corpora are limited. Zhang et al. [21] introduced a technique for corpus extension that divides long sentences into shorter segments, recombines them, and then back-translates them. This method not only expanded the corpus but also improved the quality of the predictions. Additionally, we conducted a review and comparative analysis of various corpora released in recent years. The basic profiles of these corpora are presented in Table 1, which summarizes their characteristics and differences.

Text Alignment
Discussing corpora inevitably brings up the topic of text alignment. Historically, methods for word alignment in parallel texts through unsupervised learning were common. Another approach involved using pre-trained contextual word embeddings from multilingual language models. Dou et al. [36] combined these methods by fine-tuning pre-trained models and improving alignment quality with specific objectives, achieving stable performance across different language pairs. For aligning texts involving Chinese, embedding techniques such as Word2Vec [37] and FastText [38] proved effective in calculating sentence vectors for alignment.
Li et al. [39] observed that although word-based models are more susceptible to data sparsity and out-of-vocabulary words, they generally performed better across tasks, regardless of Chinese word segmentation. This observation provided a strong argument for re-evaluating the necessity of Chinese word segmentation in deep learning models. Jiang et al. [40] proposed a novel neural Conditional Random Field (CRF) alignment model, which not only took advantage of the sequential nature of sentences in parallel documents but also used a neural sentence pair model to capture semantic similarities, outperforming previous methods in monolingual sentence alignment tasks.
Web crawling has become a popular method for corpus acquisition, though the resulting corpus often contains considerable noise and impurities. Zhang et al. [41] utilized the multilingual capabilities of BERT for sentence alignment and employed the Generative Pre-Training (GPT) language model as a domain filter to achieve data domain balance. Cao et al.'s [42] analysis of BERT identified systematic issues, such as misalignments in open-class lexemes and word pairs across different character sets, which were corrected through a series of alignment procedures.
Overall, these diverse investigations not only laid the foundation for the development of NMT, but also offered valuable insights for overcoming challenges in corpus construction and text alignment. In-depth research in these areas has improved our understanding of NMT mechanisms and continuously enhanced translation quality. Future efforts can explore innovative approaches in corpus construction, alignment methods, and data expansion, further propelling the field of NMT forward.

Construction of the WCC-EC 2.0
Our corpus, featuring texts from both the news and lyrics domains, was originally established as WCC-EC 1.0, comprising about 340,000 English-Chinese news data pairs. This section discusses the construction process of WCC-EC 2.0, highlighting the main challenges we encountered during the calibration phase and the strategies we implemented to overcome them.

Web Crawling
For the English-Chinese bilingual news texts in WCC-EC 1.0, we selected the KEKE English website (http://www.kekenet.com (accessed on 15 February 2024)) as our primary data source. This website provides a broad spectrum of news content across various fields, such as campus life, entertainment, international affairs, economics, sports, social issues, and factual reporting, ensuring the timeliness and relevance of our data. We utilized the crawler frameworks BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/ (accessed on 15 February 2024)) and Scrapy (https://scrapy.org/ (accessed on 15 February 2024)) to navigate and download the bilingual articles, successfully constructing a diverse corpus of bilingual news articles.
The English-Chinese bilingual lyrics corpus was acquired with LrcHelper (https://github.com/ludoux/LrcHelper (accessed on 15 February 2024)), a tool designed for downloading lyrics from NetEase Cloud Music based on song list IDs. Through LrcHelper, we downloaded approximately 230,000 LyRiCs (LRC) files over nearly 15 days, covering around 2200 English song lists. A comprehensive de-duplication process was carried out on both the news and lyrics texts to ensure the uniqueness and quality of our corpus.

Extraction of Bilingual Sentences
In constructing the news corpus, we began with segment alignment, followed by sentence splitting, and introduced the concept of a translation matrix during sentence alignment. A detailed methodology is documented in WCC-EC 1.0 [21], which extensively described the process of acquiring news texts, data augmentation methods, and corpus quality measurement, leading to the creation of a high-quality news parallel corpus.
For version 2.0, the lyrics texts were obtained as LRC files, which contain a significant amount of extraneous non-lyrical content. Extracting usable data without disrupting the original alignment of the lyrics required the development of a dedicated proofreading program. This program pairs each English lyric line with the Chinese translation whose timestamp falls within a one-second interval of it, which allowed the bilingual text to be extracted efficiently. Despite these efforts, the data still contained considerable noise. We applied regular expressions to remove annotations, primarily those within "【】" and "[ ]", and eliminated special symbols such as "/n" and any non-standardized Chinese symbols, yielding more than 2 million sentence pairs. As depicted in Figure 1, the left side presents a sample of the lyric data, while the right side shows a sample of the news text before data cleaning.

Figure 1. A sample of the lyric data (left) and a sample of the news text (right) before data cleaning.
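The timestamp-based extraction described above can be illustrated with a minimal sketch. It is not the authors' proofreading program: the function names `parse_lrc` and `pair_lines`, the sample text, and the rule of choosing the closest unused Chinese line are all assumptions made for illustration. Only the one-second window and the removal of "【】" and "[ ]" annotations come from the text.

```python
import re

# "[mm:ss.xx]" timestamps at the start of an LRC line.
TIMESTAMP = re.compile(r"\[(\d+):(\d+(?:\.\d+)?)\]")
# Annotations in full-width or square brackets, removed during cleaning.
ANNOTATION = re.compile(r"【[^】]*】|\[[^\]]*\]")
# Any CJK unified ideograph marks a line as Chinese.
CJK = re.compile(r"[\u4e00-\u9fff]")


def parse_lrc(text):
    """Return (seconds, lyric) tuples for every timestamped line."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        match = TIMESTAMP.match(line)
        if not match:
            continue  # metadata such as [ti:...] or untimed credits
        seconds = int(match.group(1)) * 60 + float(match.group(2))
        lyric = ANNOTATION.sub("", line[match.end():]).strip()
        if lyric:
            entries.append((seconds, lyric))
    return entries


def pair_lines(entries, max_gap=1.0):
    """Pair each English line with the closest Chinese line within max_gap seconds."""
    english = [(t, s) for t, s in entries if not CJK.search(s)]
    chinese = [(t, s) for t, s in entries if CJK.search(s)]
    used, pairs = set(), []
    for t_en, en in english:
        best = None
        for i, (t_zh, _) in enumerate(chinese):
            gap = abs(t_zh - t_en)
            if i in used or gap > max_gap:
                continue
            if best is None or gap < best[0]:
                best = (gap, i)
        if best is not None:
            used.add(best[1])
            pairs.append((en, chinese[best[1]][1]))
    return pairs


# Hypothetical LRC content, written only for this example.
SAMPLE = """[ti:Example]
[00:45.10]Looking for your smile
[00:45.10]寻找你的微笑
[00:50.00]【Chorus】Stay with me
[00:50.40]留在我身边
[01:10.00]La la la
"""

PAIRS = pair_lines(parse_lrc(SAMPLE))
```

In this sketch the untranslated line "La la la" is dropped because no Chinese line lies within one second of it, which is one way such a window can discard non-parallel content.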

Text Alignment
In the process of aligning news texts, we used the Sentence BERT (SBERT) model [43] to calculate the similarity between sentences, thereby identifying pairs of sentences with a similarity exceeding a predefined threshold. In the sentence embedding model, a higher similarity value indicates a closer semantic resemblance between two sentences. To improve the performance of the SBERT model, we used an intra-paragraph matching method that took paragraph information into account to enhance the accuracy of matching parallel sentences [21]. To validate this method, we randomly selected 1000 sentence pairs and conducted manual verification. With intra-paragraph matching, the alignment rate increased from 92.1% to 95.2%.
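A minimal sketch of threshold-based matching restricted to one paragraph pair is given below. It assumes the sentence embeddings have already been computed (for example by an SBERT model) and are passed in as plain lists of floats. The greedy one-to-one strategy, the function names, and the threshold value of 0.8 are illustrative assumptions, since the text does not state the threshold or the exact matching procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_paragraph(en_vecs, zh_vecs, threshold=0.8):
    """Greedy one-to-one matching of sentences inside one aligned paragraph pair.

    Returns sorted (english_index, chinese_index) tuples whose similarity
    is at least `threshold` (a hypothetical value).
    """
    candidates = [
        (cosine(u, v), i, j)
        for i, u in enumerate(en_vecs)
        for j, v in enumerate(zh_vecs)
    ]
    candidates.sort(reverse=True)  # most similar candidates first
    used_en, used_zh, matches = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # all remaining candidates are below the threshold
        if i in used_en or j in used_zh:
            continue  # each sentence is matched at most once
        used_en.add(i)
        used_zh.add(j)
        matches.append((i, j))
    return sorted(matches)
```

Limiting the candidate set to sentences from the same paragraph pair keeps the comparison small and prevents a sentence from matching a similar sentence elsewhere in the article.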
The lyrics text was inherently well aligned. However, the NetEase Cloud Music platform did not have strict regulations for lyric uploads, which led to complex, redundant noise that could affect the corpus's alignment rate. Therefore, data cleaning was essential before conducting alignment experiments. This involved resolving issues such as heavily mixed-language lyrics and errors that left both sides of a bilingual pair exclusively in English or exclusively in Chinese. Because English lyrics often contained profanity and the text needed to conform to moral and legal standards, we systematically removed sentences containing offensive language based on predefined keywords. Finally, we randomly selected a sample of 2000 sentence pairs for manual alignment verification, which showed an alignment rate of approximately 98.4%.
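The cleaning rules described above could be expressed as a simple pair filter. This is a sketch under stated assumptions: the function name `keep_pair`, the placeholder keyword list, and the 0.3 cap on Latin letters in the Chinese side are hypothetical, and the actual keyword list and mixed-language criteria are not given in the text.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")

# Placeholder entries; the real predefined keyword list is not published here.
BLOCKLIST = {"badword", "swearword"}


def keep_pair(en, zh, max_latin_ratio=0.3):
    """Return True if an (English, Chinese) lyric pair passes the cleaning rules."""
    if CJK.search(en) or not LATIN.search(en):
        return False  # English side is mixed-language or not English at all
    if not CJK.search(zh):
        return False  # "translation" is still in English
    if len(LATIN.findall(zh)) / len(zh) > max_latin_ratio:
        return False  # Chinese side is heavily mixed with English
    words = set(re.findall(r"[a-z']+", en.lower()))
    if words & BLOCKLIST:
        return False  # contains a blocked keyword
    return True
```

Matching whole lowercase words rather than substrings avoids discarding harmless sentences that merely contain a blocked string inside a longer word.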
Ultimately, the lyrics text and news text were combined to form the WCC-EC 2.0 corpus.

Corpus Segmentation
In the final phase of corpus construction, selecting sentence pairs for the corpus's validation and test sets was crucial. We adopted the conventional NMT parallel corpus construction method for this task. Given that the corpus is composed of lyrics and news articles, we conducted proportional random sampling so that each set of 2000 selected sentence pairs reflected the proportional distribution of lyrics and news articles throughout the entire corpus. Moreover, to maintain data quality, we exclusively chose sentence pairs containing sentences with at least 10 characters. As a result, a total of 4000 sentence pairs were sampled, with 2000 pairs forming the development set and 2000 pairs forming the test set. The remaining data were designated as the training set. Table 2 presents the dataset breakdown across different corpora: WCC-EC 2.0, denoted as WCC-EC 2.0-Full; the lyrics segment of WCC-EC 2.0, referred to as WCC-EC 2.0-Lyrics; and the news segment of WCC-EC 2.0, identified as WCC-EC 2.0-News (also known as WCC-EC 1.0) [21]. Additionally, it includes the V16 Dataset (https://data.statmt.org/news-commentary/v16/training/news-commentary-v16.en-zh.tsv.gz (accessed on 15 February 2024)).
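Proportional sampling with a minimum-length filter might look like the following sketch. The function name `split_corpus`, the fixed random seed, and the choice to apply the 10-character minimum to both sides of a pair are assumptions; rounding the per-domain quotas can also shift the totals by a pair or two when the shares are not exact.

```python
import random


def split_corpus(pairs_by_domain, n_dev=2000, n_test=2000, min_chars=10, seed=0):
    """Split a corpus into train/dev/test with domain-proportional dev and test sets.

    `pairs_by_domain` maps a domain name (e.g. "lyrics", "news") to a list of
    (english, chinese) tuples.
    """
    rng = random.Random(seed)
    total = sum(len(pairs) for pairs in pairs_by_domain.values())
    train, dev, test = [], [], []
    for pairs in pairs_by_domain.values():
        share = len(pairs) / total
        k_dev, k_test = round(n_dev * share), round(n_test * share)
        # Only sufficiently long pairs may enter the dev or test set.
        eligible = [
            i for i, (en, zh) in enumerate(pairs)
            if len(en) >= min_chars and len(zh) >= min_chars
        ]
        chosen = rng.sample(eligible, k_dev + k_test)
        dev_idx, test_idx = set(chosen[:k_dev]), set(chosen[k_dev:])
        for i, pair in enumerate(pairs):
            if i in dev_idx:
                dev.append(pair)
            elif i in test_idx:
                test.append(pair)
            else:
                train.append(pair)  # everything else stays in training
    return train, dev, test
```

Sampling the development and test indices in a single draw per domain guarantees that the two held-out sets never share a sentence pair.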

Experiment and Evaluation
To assess the corpus's quality, we conducted the following experiments. The datasets employed are described in Section 4.1, and the experimental framework is explained in Section 4.2. Section 4.3 presents a comparative evaluation of the quality of WCC-EC 2.0.

Dataset
We employed two datasets for our experiments. The first dataset is derived from News Commentary v16, a news dataset provided by WMT2022 (https://www.statmt.org/wmt22 (accessed on 15 February 2024)), containing about 313,000 English-Chinese sentence pairs. From this dataset, we randomly extracted 4000 sentence pairs, allocating 2000 to the development set and the remaining 2000 to the test set.
The second dataset is our self-developed WCC-EC 2.0, which consists of approximately 1.6 million English-Chinese sentence pairs. We ensured the independence of WCC-EC 2.0-Full, WCC-EC 2.0-Lyrics, and WCC-EC 2.0-News during the experiments. To achieve this, we randomly selected 4000 sentence pairs from each of WCC-EC 2.0-Lyrics and WCC-EC 2.0-News, allocating 2000 pairs as the test set and the other 2000 as the development set for each segment. As a result, the pooled test sets and the pooled development sets comprised 4000 sentence pairs each. We then randomly chose 2000 sentence pairs from the 4000 pooled development pairs as the test set for WCC-EC 2.0-Full, and another 2000 from the 4000 pooled test pairs as its validation set. This method ensured data independence while maximizing the use of the available data.

Setting Up the NMT Framework
For our subsequent experiments, we chose the fairseq framework (https://github.com/facebookresearch/fairseq (accessed on 15 February 2024)) and employed its Transformer pre-trained model. This model is equipped with 6 encoder and 6 decoder layers, featuring 8 encoder attention heads and a word vector size of 512. The architecture of the Transformer is shown in Figure 3. The remaining hyperparameters were a dropout rate of 0.3, beta values of 0.9 and 0.98, a learning rate of 1 × 10⁻⁷, a token limit of 4096, a batch size of 128, and a maximum of 200,000 update steps. For subword preprocessing, we applied the BPE algorithm (https://github.com/rsennrich/subword-nmt (accessed on 15 February 2024)) with a vocabulary size of 32,000. This configuration reflects the typical setup of deep learning models in machine translation tasks. It gives the model enough depth and capacity to capture the intricate relationships between languages, while the training parameters help prevent overfitting and improve training stability.
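To make the configuration concrete, the sketch below assembles a `fairseq-train` argument list from the values reported above. The mapping from the text to specific flags is an assumption: the paper does not list its command line, the Adam optimizer is inferred only from the reported beta values, and the word vector size is assumed to apply to both encoder and decoder embeddings. The `data-bin` directory name and the helper `build_train_command` are hypothetical.

```python
# Hyperparameters taken from the text, keyed by the fairseq-train flag
# they are assumed to correspond to.
HPARAMS = {
    "arch": "transformer",
    "encoder-layers": 6,
    "decoder-layers": 6,
    "encoder-attention-heads": 8,
    "encoder-embed-dim": 512,   # assumed mapping of "word vector size"
    "decoder-embed-dim": 512,   # assumed to match the encoder
    "dropout": 0.3,
    "optimizer": "adam",        # assumption: implied by the beta values
    "adam-betas": "(0.9, 0.98)",
    "lr": 1e-7,
    "max-tokens": 4096,
    "batch-size": 128,
    "max-update": 200000,
}


def build_train_command(data_dir, hparams):
    """Return the argument list for a fairseq-train invocation."""
    command = ["fairseq-train", data_dir]
    for flag, value in hparams.items():
        command += [f"--{flag}", str(value)]
    return command


# "data-bin" is a placeholder for the directory of binarized training data.
COMMAND = build_train_command("data-bin", HPARAMS)
```

The resulting list can be passed to `subprocess.run` once fairseq is installed and the data have been binarized; keeping the values in one dictionary makes it easy to log the exact configuration alongside each experiment.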
During the prediction phase, a beam size of 8 was used to produce the translation outputs. Because Chinese text contains no spaces between words, Jieba (https://github.com/fxsjy/jieba (accessed on 15 February 2024)) was used for word segmentation, and Moses (https://github.com/moses-smt/mosesdecoder (accessed on 15 February 2024)) was applied for punctuation and case normalization. The quality of machine translation was evaluated using the BLEU (Bilingual Evaluation Understudy) metric [44], a recognized standard, calculated with the "fairseq-score" command following word segmentation.
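For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU with a single reference per sentence: clipped n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It is a generic illustration, not the "fairseq-score" implementation, so tokenization and smoothing details may differ from the scores reported in this paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) for tokenized hypotheses and single references."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if min(matched) == 0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * penalty * math.exp(log_precision)
```

Because the statistics are summed over the whole test set before the precisions are computed, short sentences such as lyric lines contribute few higher-order n-grams, which is one reason segmentation and sentence length influence the final score.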
The specific procedure integrated into our methodology is outlined as follows:

1. Word segmentation and tokenization: Initially, Jieba was applied to segment Chinese sentences, followed by the use of Moses for tokenizing both Chinese and English sentences. This step transformed the original continuous text into individual words or tokens, allowing the model to better comprehend the structure and semantics of the text.

2. BPE: We utilized the subword-nmt tool for encoding the bilingual files with Byte-Pair Encoding. BPE iteratively merges the most frequent adjacent symbol pairs into subword units, so that low-frequency words can be represented as sequences of known subwords, which helps the model process rare vocabulary and enhances the accuracy of translation.

3. Length limit: The clean-corpus-n.perl tool in Moses was employed to clean the data and eliminate sentences exceeding 256 words. This step reduces the computational burden on the model and avoids potential interference from overly long sentences during training. Although it may decrease the size of the dataset, we believe it improves the training efficiency and translation quality of the model.

4. Generate input text: We generated the vocabulary and binaries needed for model training using the "fairseq-preprocess" preprocessing function in fairseq.
By meticulously following these steps, we aimed to refine our model's training and prediction processes, enhancing the overall accuracy and quality of our NMT system.
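The BPE step in the procedure above can be illustrated with a toy merge-learning routine in the style of the original algorithm. This is a teaching sketch rather than the subword-nmt implementation: the function names are chosen for this example, and the experiments in this paper use subword-nmt with a vocabulary size of 32,000 instead of a handful of merges.

```python
import re
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _: merged, word): freq for word, freq in vocab.items()}


def learn_bpe(word_freqs, num_merges):
    """Learn `num_merges` merge operations from a word-frequency dictionary."""
    # Words start as space-separated characters plus an end-of-word marker.
    vocab = {" ".join(word) + " </w>": freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair; ties keep the first seen
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


MERGES, VOCAB = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, 3)
```

With this toy vocabulary, the frequent suffix "est" is merged into a single unit after three operations, which shows how rare words such as "widest" end up sharing subword units with more frequent ones.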

Evaluation

Machine Translation Performance and Analysis
We used BLEU scores, calculated with the fairseq-generate command, to evaluate our model's performance. In Tables 3 and 4, the test set from the "V16 Dataset" is referred to as "V", the test set from "WCC-EC 2.0-Full" as "W", the test set from "WCC-EC 2.0-Lyrics" as "WL", and the test set from "WCC-EC 2.0-News" as "WN". Because the experimental English-Chinese datasets are low-resource domain material, the BLEU scores observed in our experiments were relatively low. In the English→Chinese translation task (Table 3), the results of the model trained on the V16 Dataset were relatively poor. WCC-EC 2.0, despite being a low-resource dataset, has a considerably larger volume of data, which contributed to better translation outcomes than the V16 Dataset. However, translation quality declined when training on only the news or only the music portion of the corpus, with the news segment showing the more marked decrease. Surprisingly, the BLEU score of the model trained on WCC-EC 2.0-News was higher on the "V" test set than on its own test set. Although WCC-EC 2.0-Full did not achieve the lead across all test sets, the gap with the other corpora was minimal. This can be attributed to our strict deduplication process, which ensured that the training set did not contain test sets from other corpora. Notably, the average result of WCC-EC 2.0-Full was the best among all the corpora considered, demonstrating good generalizability.
In the Chinese→English translation experiment (Table 4), WCC-EC 2.0-Full achieved the highest BLEU score of 24.72 on the "W" test set. This score significantly surpassed the performance of V16, which scored 12.8 on the "V" test set. The performance of WCC-EC 2.0-Full underscores its effectiveness in Chinese→English translation tasks and indicates the utility of WCC-EC 2.0 and its potential to enhance the accuracy of machine translation. Considering the BLEU results as a whole, we believe the lower scores were mainly due to the inclusion of colloquial texts from the music domain. These texts tend to be shorter and contain less knowledge and contextual information, which likely led to lower BLEU scores.

Manual Evaluation Results and Analysis
To substantiate the validity of WCC-EC 2.0, we employed the evaluation criteria defined by the Japan Patent Office (JPO) (https://www.jpo.go.jp/system/laws/sesaku/kikaihonyaku/tokkyohonyaku_hyouka.html (accessed on 15 February 2024)). These criteria assess the fidelity of translated content through a five-tier grading system, where a score of 5 represents the highest level of accuracy. The JPO grading methodology is detailed in Table 5. During the manual evaluation phase, we enlisted experts fluent in both Chinese and English for an in-depth review. The participants' profiles, outlined in Table 6, range from researchers to professors. Importantly, the evaluators who are native Chinese speakers held at least a master's degree and had significant exposure to academic English. Their proficiency in English was further validated by high scores on English proficiency exams, including CET-6, a TOEFL score of at least 90, and an IELTS score of 6.5 or higher. This combination of linguistic skill and scholarly background supports the thoroughness and reliability of our evaluation process.
During the manual evaluation, we placed particular emphasis on applying the evaluation criteria consistently and accurately to ensure objective and fair assessment results. Additionally, during data analysis, rigorous checks and filtering were applied to the evaluation data to eliminate potential biases and errors. These efforts enhanced the reliability of the evaluation results.
We undertook a manual evaluation of the translations that obtained the highest BLEU scores on the "W" test set, detailed in Tables 3 and 4, for both the English → Chinese and Chinese → English directions. To achieve this, we formed three evaluator groups, designated as X, Y, and Z, with their particulars specified in Table 6. Each group was chosen based on its areas of expertise to ensure a balanced and thorough assessment.
The analysis, depicted in Figures 4 and 5, showed minimal variation in the average scores among the groups, with the greatest difference being only 0.15. Group X tended to assign marginally higher scores, whereas Groups Y and Z exhibited similar scoring trends. The consistency in scores from the latter two groups may indicate that their evaluations were more informative. Given that all groups' scores exceeded 4.0, the critical components of the translations appear to have been effectively conveyed. Additionally, considering the colloquial and accessible nature of the language used, we believe that the manual evaluations may have tended to award higher scores. These results affirm the effectiveness and value of WCC-EC 2.0 in producing high-quality translations, while also suggesting that the approachable language style contributed to somewhat more generous scoring in the manual evaluations.

Table 6. Profiles of participants in the assessment.

Summary and Future Work
In this study, we outlined the detailed process used to develop WCC-EC 2.0, a large bilingual English-Chinese corpus. With more than 1.6 million sentence pairs across the news and music domains, this corpus was carefully assembled and aligned. Notably, it achieves the highest average BLEU score across various test sets, complemented by an average manual evaluation score exceeding 4 points. Its large size and availability to the public make it one of the most comprehensive resources for NMT involving the English-Chinese language pair.
Given the online source of our data, we addressed potential copyright issues. The process of extracting and sharing content from the internet often involves the risk of copyright infringement, especially with materials that belong to others. To ensure we were following the law, we consulted with legal experts and confirmed that WCC-EC 2.0 complied with Chinese copyright laws.
Our validation of the corpus included thorough English-Chinese translation experiments and detailed manual reviews, confirming its adaptability and textual authenticity. Future research will focus on employing data augmentation techniques to improve the corpus quality and open up new possibilities in NMT. Additionally, we plan to use WCC-EC 2.0 for fine-tuning open-source LLMs such as BLOOM [45] and LLaMA 2 [46], thus boosting their translation performance. With the backing of WCC-EC 2.0, we expect to make significant progress in English-Chinese translation. This advancement is expected to support cooperative projects in infrastructure development, trade dynamics, and policy coordination.

Figure 2 illustrates the comprehensive process of constructing WCC-EC 2.0, encompassing four key phases: (1) web crawling; (2) data extraction; (3) text alignment; and (4) corpus segmentation. Detailed descriptions of these stages are provided in Sections 3.1-3.4, covering the various phases involved in building the corpus.

Figure 3. The architecture diagram of the Transformer used in the experiment.

Figure 5. Manual evaluation results for the Chinese → English translation.

Table 1. Basic information of some parallel corpora, where EU, AR, ZH, EN, FR, RU, and ES, respectively, represent the European Union, Arabic, Chinese, English, French, Russian, and Spanish.

Table 2. Overview of the English-Chinese parallel corpora.

Table 3. English→Chinese translation results (BLEU scores); bold numbers indicate the best scores.

Table 4. Chinese→English translation results (BLEU scores); bold numbers indicate the best scores.

Table 5. Details of the JPO evaluation criteria.

Figure 4. Manual evaluation results for the English → Chinese translation.