MDPI - Publisher of Open Access Journals

15 pages, 557 KiB

Open Access | Editor’s Choice | Article

WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation

by Jinyi Zhang, Ye Tian, Jiannan Mao, Mei Han, Feng Wen, Cong Guo, Zhonghui Gao and Tadahiro Matsumoto

Electronics 2023, 12(5), 1140; https://doi.org/10.3390/electronics12051140 - 26 Feb 2023

Cited by 8 | Viewed by 3615

Abstract

Movie and TV subtitles are frequently used in natural language processing (NLP) applications, but few Japanese-Chinese bilingual corpora are available as datasets for training neural machine translation (NMT) models. In our previous study, we constructed a sizable Japanese-Chinese bilingual corpus by collecting subtitle text from websites that host movies and television series. The unsatisfactory translation performance obtained with that initial corpus, the Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was caused predominantly by its limited number of sentence pairs. To address this shortcoming, we analyzed the issues in the construction of WCC-JC 1.0 and built the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites and then manually aligning a large number of high-quality sentence pairs. The resulting corpus contains about 1.4 million sentence pairs, an 87% increase over WCC-JC 1.0, making WCC-JC 2.0 one of the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess WCC-JC 2.0, we compared its BLEU scores with those of other corpora and manually evaluated the translations produced by models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research purposes only.

(This article belongs to the Special Issue Natural Language Processing and Information Retrieval)
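
The two size figures quoted in the abstract above (about 1.4 million sentence pairs, an 87% increase) imply an approximate size for WCC-JC 1.0. The following is a minimal sketch of that arithmetic; the function name is hypothetical, both inputs are rounded values taken from the abstract, and the result is an estimate rather than a figure reported in this listing.

```python
def implied_previous_size(new_size: float, relative_increase: float) -> float:
    """Back out the earlier size from the new size and the relative increase.

    If new = old * (1 + r), then old = new / (1 + r).
    """
    return new_size / (1.0 + relative_increase)


# Rounded inputs from the abstract: ~1.4 million pairs, 87% increase.
estimate = implied_previous_size(1_400_000, 0.87)
print(f"Implied WCC-JC 1.0 size: about {estimate / 1e6:.2f} million sentence pairs")
```

Because both inputs are rounded, the estimate (roughly 0.75 million pairs) should be read as an order-of-magnitude check only.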

12 pages, 950 KiB

Open Access | Article

Estimation of the Underlying F0 Range of a Speaker from the Spectral Features of a Brief Speech Input

by Wei Zhang, Yanlu Xie, Binghuai Lin, Liyuan Wang and Jinsong Zhang

Appl. Sci. 2022, 12(13), 6494; https://doi.org/10.3390/app12136494 - 27 Jun 2022

Viewed by 2488

Abstract

From a very brief stretch of speech, human listeners can estimate the pitch range of the speaker and normalize pitch perception. Spectral features, which inherently involve both articulatory and phonatory characteristics, have been speculated to play a role in this process, but few have been reported to correlate directly with a speaker’s F0 range. To mimic this human auditory capability and test the speculation, in a preliminary study we proposed an LSTM-based method to estimate a speaker’s F0 range from a 300 ms speech input, which outperformed the conventional method. Through two further experiments, this study improved the method and verified its validity in estimating the speaker-specific underlying F0 range. Experiment 1 showed that, after incorporating a novel measurement of F0 range and a multi-task training approach, the refined model gave more accurate estimates than the initial model. Using a Japanese-Chinese bilingual parallel speech corpus, Experiment 2 found that the F0 ranges estimated with the model from the Chinese speech and from the Japanese speech produced by the same set of speakers did not differ significantly, whereas the conventional method showed a significant difference. The results indicate that the proposed spectrum-based method captures the speaker-specific underlying F0 range, which is independent of linguistic content.

(This article belongs to the Special Issue Machine Learning for Language and Signal Processing)

16 pages, 794 KiB

Open Access | Article

Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora

by Jinyi Zhang and Tadahiro Matsumoto

Appl. Sci. 2019, 9(10), 2036; https://doi.org/10.3390/app9102036 - 17 May 2019

Cited by 30 | Viewed by 5450

Abstract

The translation quality of Neural Machine Translation (NMT) systems depends strongly on the size of the training data. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method with two variations: one applies to all language pairs, and the other is specific to the Chinese-Japanese language pair. The method uses both the source and target sentences of an existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with “shared Chinese character rates” in segments of the sentence pairs. The experimental results for Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply code (see Supplementary Materials) that reproduces the proposed method.

(This article belongs to the Section Computing and Artificial Intelligence)
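
The three steps listed in the abstract above (split, back-translate, replace) can be illustrated with a short sketch. This is not the authors' implementation: all function names are hypothetical, the `back_translate` callable is a placeholder for a real target-to-source translation model, and segments are aligned purely by position at punctuation marks, whereas the abstract states that split points come from word alignment adjusted by shared Chinese character rates. Pairing each pseudo-source sentence with the unchanged target sentence is also an assumption about how the pseudo-parallel pairs are formed.

```python
from typing import Callable, List, Tuple

# Assumed set of split marks; the abstract only says "punctuation marks".
DEFAULT_MARKS = "，、,;；"


def split_at_punctuation(sentence: str, marks: str = DEFAULT_MARKS) -> List[str]:
    """Split a sentence after each mark, keeping the mark with its segment."""
    segments, current = [], ""
    for ch in sentence:
        current += ch
        if ch in marks:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


def augment_pair(
    source: str,
    target: str,
    back_translate: Callable[[str], str],
    marks: str = DEFAULT_MARKS,
) -> List[Tuple[str, str]]:
    """Generate pseudo-parallel pairs from one long parallel sentence pair."""
    # Step 1: split both sides into partial sentences.
    src_parts = split_at_punctuation(source, marks)
    tgt_parts = split_at_punctuation(target, marks)
    # Simplification: require equal segment counts and align by position.
    if len(src_parts) < 2 or len(src_parts) != len(tgt_parts):
        return []
    pairs = []
    for i, tgt_part in enumerate(tgt_parts):
        # Step 2: back-translate the i-th target partial sentence.
        replacement = back_translate(tgt_part)
        # Step 3: substitute it for the i-th source partial sentence.
        pseudo_source = "".join(src_parts[:i] + [replacement] + src_parts[i + 1:])
        pairs.append((pseudo_source, target))
    return pairs


# Toy usage with a stand-in back-translator that only tags its input.
toy_pairs = augment_pair("s1,s2.", "t1,t2.", lambda seg: f"bt({seg})", marks=",")
for pseudo_source, tgt in toy_pairs:
    print(pseudo_source, "=>", tgt)
```

In this sketch a sentence pair with n aligned segments yields n pseudo-source sentences, each differing from the original source in exactly one segment.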

Search Results (3)