Article

English–Welsh Cross-Lingual Embeddings

1 School of Computer Science and Informatics, Cardiff University, Cardiff CF24 3AA, UK
2 School of Mathematics, Cardiff University, Cardiff CF24 4AG, UK
3 School of English, Communication and Philosophy, Cardiff University, Cardiff CF10 3EU, UK
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Academic Editors: Arturo Montejo-Ráez and Salud María Jiménez-Zafra
Appl. Sci. 2021, 11(14), 6541; https://doi.org/10.3390/app11146541
Received: 18 May 2021 / Revised: 4 July 2021 / Accepted: 5 July 2021 / Published: 16 July 2021
(This article belongs to the Special Issue Current Approaches and Applications in Natural Language Processing)
Cross-lingual embeddings are vector space representations in which word translations tend to be co-located. These representations enable learning transfer across languages, thus bridging the gap between data-rich languages such as English and others. In this paper, we present and evaluate a suite of cross-lingual embeddings for the English–Welsh language pair. To train the bilingual embeddings, a Welsh corpus of approximately 145 M words was combined with an English Wikipedia corpus. We used a bilingual dictionary to frame the problem of learning bilingual mappings as a supervised machine learning task, in which a word vector space is first learned independently on a monolingual corpus, after which a linear alignment strategy is applied to map the monolingual embeddings to a common bilingual vector space. Two approaches were used to learn monolingual embeddings, namely word2vec and fastText. Three cross-language alignment strategies were explored, namely cosine similarity, inverted softmax and cross-domain similarity local scaling (CSLS). We evaluated different combinations of these approaches using two tasks: bilingual dictionary induction and cross-lingual sentiment analysis. The best results were achieved using monolingual fastText embeddings and the CSLS metric. We also demonstrated that by including a few automatically translated training documents, the performance of a cross-lingual text classifier for Welsh can be increased by approximately 20 percentage points.
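The supervised pipeline the abstract describes (independent monolingual embeddings, a linear map learned from bilingual dictionary pairs, and CSLS-based retrieval) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the orthogonal Procrustes solution and the CSLS formula follow their standard formulations, and all array names and the toy data are illustrative.

```python
import numpy as np

def cosine(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def procrustes_align(X, Y):
    """Orthogonal map W minimising ||XW - Y||_F over dictionary pairs.

    X, Y: (n_pairs, dim) source/target vectors of known translations.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def csls_scores(src_mapped, tgt, k=10):
    """Cross-domain similarity local scaling (CSLS) score matrix.

    Penalises 'hub' words by subtracting each vector's mean cosine
    similarity to its k nearest neighbours in the other language.
    """
    sim = cosine(src_mapped, tgt)
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1, keepdims=True)
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0, keepdims=True)
    return 2 * sim - r_src - r_tgt

# Toy dictionary induction: recover a known rotation between two
# synthetic vector spaces, then retrieve each source word's translation
# as the highest-scoring target word under CSLS.
rng = np.random.default_rng(0)
dim, n = 20, 30
X = rng.normal(size=(n, dim))                 # "source" embeddings
W_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
Y = X @ W_true                                # "target" embeddings (rotated)
W = procrustes_align(X, Y)                    # learn map from the pairs
preds = csls_scores(X @ W, Y, k=5).argmax(axis=1)
```

In the bilingual dictionary induction task, `preds[i]` is the index of the predicted translation of source word `i`; accuracy is then the fraction of dictionary entries retrieved correctly.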
Keywords: natural language processing; distributional semantics; machine learning; language model; word embeddings; machine translation; sentiment analysis
MDPI and ACS Style

Espinosa-Anke, L.; Palmer, G.; Corcoran, P.; Filimonov, M.; Spasić, I.; Knight, D. English–Welsh Cross-Lingual Embeddings. Appl. Sci. 2021, 11, 6541. https://doi.org/10.3390/app11146541

