Open Access Article

Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages

by Yang Yuan 1,2,3, Xiao Li 1,2,3,* and Ya-Ting Yang 1,2,3
1 Xinjiang Technical Institute of Physics & Chemistry, Chinese Academy of Sciences, Urumqi 830011, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 Xinjiang Laboratory of Minority Speech and Language Information Processing, Urumqi 830011, China
* Author to whom correspondence should be addressed.
Information 2020, 11(1), 24; https://doi.org/10.3390/info11010024
Received: 15 November 2019 / Revised: 18 December 2019 / Accepted: 26 December 2019 / Published: 29 December 2019
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)
To overcome data sparseness when training word embeddings for low-resource languages, we propose a punctuation and parallel corpus based word embedding model. Specifically, we generate a global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train word embeddings. Experimental results show that, compared with widely used baseline models such as GloVe and Word2vec, our model significantly improves word embedding performance for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves by 0.71 percentage points on the word analogy task and achieves the best results on all of the word similarity tasks.
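To make the co-occurrence step concrete, here is a minimal sketch of a punctuation-attenuated co-occurrence count. The paper does not specify the exact attenuation function, so the `1/d` distance weight and the `punct_decay` factor per intervening punctuation mark are assumptions for illustration, not the authors' formula:

```python
from collections import defaultdict

PUNCT = {",", ".", ";", "!", "?", ":"}

def cooccurrence(tokens, window=5, punct_decay=0.5):
    """Accumulate weighted word-pair co-occurrence counts.

    Each ordered pair (w_i, w_j) within `window` contributes 1/d
    (d = token distance), further attenuated by `punct_decay` for
    every punctuation mark between the two words. This stands in
    for the punctuation-based distance attenuation function; the
    exact form used in the paper is not given on this page.
    """
    counts = defaultdict(float)
    for i, w in enumerate(tokens):
        if w in PUNCT:
            continue
        for j in range(i + 1, min(i + 1 + window, len(tokens))):
            c = tokens[j]
            if c in PUNCT:
                continue
            d = j - i
            n_punct = sum(1 for t in tokens[i + 1:j] if t in PUNCT)
            counts[(w, c)] += (1.0 / d) * (punct_decay ** n_punct)
    return counts
```

Pairs separated by punctuation are thus down-weighted relative to pairs inside the same clause, which is the intuition behind using punctuation as a proximity signal; the resulting matrix would then play the role of GloVe's global co-occurrence statistics.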
Keywords: word embedding; word alignment probability; distance attenuation function; Word2vec; GloVe
MDPI and ACS Style

Yuan, Y.; Li, X.; Yang, Y.-T. Punctuation and Parallel Corpus Based Word Embedding Model for Low-Resource Languages. Information 2020, 11, 24.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
