Article

Creating Welsh Language Word Embeddings

1 School of Computer Science & Informatics, Cardiff University, Cardiff CF24 3AA, UK
2 School of Mathematics, Cardiff University, Cardiff CF24 4AG, UK
3 School of English, Communication & Philosophy, Cardiff University, Cardiff CF10 3EU, UK
* Author to whom correspondence should be addressed.
Academic Editors: Rafael Valencia-Garcia and Francisco García-Sánchez
Appl. Sci. 2021, 11(15), 6896; https://doi.org/10.3390/app11156896
Received: 14 March 2021 / Revised: 18 July 2021 / Accepted: 21 July 2021 / Published: 27 July 2021
Word embeddings are representations of words in a vector space that model semantic relationships between words by means of distance and direction. In this study, we adapted two existing methods, word2vec and fastText, to automatically learn Welsh word embeddings, taking into account the syntactic and morphological idiosyncrasies of this language. These methods exploit the principles of distributional semantics and, therefore, require a large corpus to be trained on. However, Welsh is a minoritised language, hence significantly less Welsh-language data are publicly available in comparison to English. Consequently, assembling a sufficiently large text corpus is not a straightforward endeavour. Nonetheless, we compiled a corpus of 92,963,671 words from 11 sources, which represents the largest corpus of Welsh. The relative complexity of Welsh punctuation made tokenising this corpus challenging, as punctuation alone could not be relied upon for boundary detection. We considered several tokenisation methods, including one designed specifically for Welsh. To account for rich inflection, we used a method for learning word embeddings that is based on subwords and, therefore, can more effectively relate different surface forms during the training phase. We conducted both qualitative and quantitative evaluations of the resulting word embeddings, which outperformed previously described Welsh word embeddings produced as part of a larger study covering 157 languages. Our study was the first to focus specifically on Welsh word embeddings.
Keywords: Welsh language; natural language processing; human language technology; machine learning; word embeddings
MDPI and ACS Style

Corcoran, P.; Palmer, G.; Arman, L.; Knight, D.; Spasić, I. Creating Welsh Language Word Embeddings. Appl. Sci. 2021, 11, 6896. https://doi.org/10.3390/app11156896
