Search Results (3)

Search Parameters:
Keywords = Japanese-Chinese parallel corpus

15 pages, 557 KiB  
Article
WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation
by Jinyi Zhang, Ye Tian, Jiannan Mao, Mei Han, Feng Wen, Cong Guo, Zhonghui Gao and Tadahiro Matsumoto
Electronics 2023, 12(5), 1140; https://doi.org/10.3390/electronics12051140 - 26 Feb 2023
Cited by 8 | Viewed by 3615
Abstract
Movie and TV subtitles are frequently used in natural language processing (NLP) applications, but few Japanese-Chinese bilingual corpora are available as datasets for training neural machine translation (NMT) models. In our previous study, we constructed a sizable Japanese-Chinese bilingual corpus by collecting subtitle text from websites that host movies and television series. The unsatisfactory translation performance of that initial corpus, the Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was predominantly caused by its limited number of sentence pairs. To address this shortcoming, we thoroughly analyzed the issues in the construction of WCC-JC 1.0 and built the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites and then manually aligning a large number of high-quality sentence pairs. The new corpus includes about 1.4 million sentence pairs, an 87% increase over WCC-JC 1.0, which places WCC-JC 2.0 among the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess WCC-JC 2.0, we compared its BLEU scores against those of other corpora and manually evaluated the translations produced by models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research purposes only.
(This article belongs to the Special Issue Natural Language Processing and Information Retrieval)

12 pages, 950 KiB  
Article
Estimation of the Underlying F0 Range of a Speaker from the Spectral Features of a Brief Speech Input
by Wei Zhang, Yanlu Xie, Binghuai Lin, Liyuan Wang and Jinsong Zhang
Appl. Sci. 2022, 12(13), 6494; https://doi.org/10.3390/app12136494 - 27 Jun 2022
Viewed by 2488
Abstract
From a very brief speech sample, human listeners can estimate the pitch range of the speaker and normalize pitch perception. Spectral features, which inherently involve both articulatory and phonatory characteristics, have been speculated to play a role in this process, but few have been reported to correlate directly with a speaker’s F0 range. To mimic this human auditory capability and test the speculation, in a preliminary study we proposed an LSTM-based method to estimate a speaker’s F0 range from a 300 ms speech input, which outperformed the conventional method. Through two further experiments, this study improved the method and verified its validity in estimating the speaker-specific underlying F0 range. After incorporating a novel measurement of F0 range and a multi-task training approach, Experiment 1 showed that the refined model gave more accurate estimates than the initial model. Using a Japanese-Chinese bilingual parallel speech corpus, Experiment 2 found that the F0 ranges estimated with the model from the Chinese speech and the model from the Japanese speech produced by the same set of speakers did not differ significantly, whereas the conventional method showed a significant difference. The results indicate that the proposed spectrum-based method captures the speaker-specific underlying F0 range, which is independent of the linguistic content.
(This article belongs to the Special Issue Machine Learning for Language and Signal Processing)

16 pages, 794 KiB  
Article
Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora
by Jinyi Zhang and Tadahiro Matsumoto
Appl. Sci. 2019, 9(10), 2036; https://doi.org/10.3390/app9102036 - 17 May 2019
Cited by 30 | Viewed by 5450
Abstract
The translation quality of Neural Machine Translation (NMT) systems depends strongly on the size of the training data. Sufficient amounts of parallel data are, however, not available for many language pairs. This paper presents a corpus augmentation method with two variations: one for all language pairs and the other for the Chinese-Japanese language pair. The method uses both source and target sentences of the existing parallel corpus and generates multiple pseudo-parallel sentence pairs from a long parallel sentence pair containing punctuation marks as follows: (1) split the sentence pair into parallel partial sentences; (2) back-translate the target partial sentences; and (3) replace each partial sentence in the source sentence with the back-translated target partial sentence to generate pseudo-source sentences. The word alignment information, which is used to determine the split points, is modified with “shared Chinese character rates” in segments of the sentence pairs. Experimental results for Japanese-Chinese and Chinese-Japanese translation with ASPEC-JC (Asian Scientific Paper Excerpt Corpus, Japanese-Chinese) show that the method substantially improves translation performance. We also supply code (see Supplementary Materials) that reproduces the proposed method.
(This article belongs to the Section Computing and Artificial Intelligence)
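
The three augmentation steps listed in the abstract above can be illustrated with a rough Python sketch. This is not the authors' code (their implementation is in the Supplementary Materials). It makes several simplifying assumptions: segments are split at a fixed set of punctuation marks rather than at split points derived from word alignment, source and target segments are assumed to line up one-to-one, one segment is replaced per pseudo-source sentence, and `back_translate` is a hypothetical caller-supplied function standing in for a target-to-source translation model. The names `split_segments` and `augment` are likewise illustrative.

```python
import re
from typing import Callable, List, Tuple

# Assumed split points for this sketch: ASCII and full-width commas and
# semicolons, plus the ideographic comma. The paper determines split points
# from word alignment instead.
_SEGMENT = re.compile(r"[^,，、;；]*[,，、;；]|[^,，、;；]+")


def split_segments(sentence: str) -> List[str]:
    """Step 1: split a sentence into partial sentences.

    Each punctuation mark stays attached to the segment it ends.
    """
    return _SEGMENT.findall(sentence)


def augment(
    src: str,
    tgt: str,
    back_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Generate pseudo-parallel pairs from one long sentence pair.

    Returns an empty list when the pair cannot be split into the same
    number of segments on both sides (a simplification of this sketch).
    """
    src_parts = split_segments(src)
    tgt_parts = split_segments(tgt)
    if len(src_parts) != len(tgt_parts) or len(src_parts) < 2:
        return []

    pairs = []
    for i, tgt_part in enumerate(tgt_parts):
        # Step 2: back-translate the i-th target partial sentence.
        replacement = back_translate(tgt_part)
        # Step 3: substitute it for the i-th source partial sentence.
        pseudo_parts = src_parts.copy()
        pseudo_parts[i] = replacement
        pairs.append(("".join(pseudo_parts), tgt))
    return pairs


if __name__ == "__main__":
    # Toy stand-in for a translation model: wraps the segment in brackets.
    def fake_bt(segment: str) -> str:
        return "<" + segment + ">"

    for pseudo_src, target in augment("a,b,c", "x,y,z", fake_bt):
        print(pseudo_src, "|", target)
```

Each pseudo-source sentence is paired with the unchanged target sentence, so a pair with n segments yields n additional training pairs under these assumptions.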