Creating Welsh Language Word Embeddings

Word embeddings are representations of words in a vector space that models semantic relationships between words by means of distance and direction. In this study, we adapted two existing methods, word2vec and fastText, to automatically learn Welsh word embeddings, taking into account syntactic and morphological idiosyncrasies of this language. These methods exploit the principles of distributional semantics and, therefore, require a large corpus to be trained on. However, Welsh is a minoritised language, hence significantly less Welsh language data are publicly available in comparison to English. Consequently, assembling a sufficiently large text corpus is not a straightforward endeavour. Nonetheless, we compiled a corpus of 92,963,671 words from 11 sources, which represents the largest corpus of Welsh to date. The relative complexity of Welsh punctuation made the tokenisation of this corpus challenging, as punctuation could not be used reliably for boundary detection. We considered several tokenisation methods, including one designed specifically for Welsh. To account for rich inflection, we used a method for learning word embeddings that is based on subwords and, therefore, can more effectively relate different surface forms during the training phase. We conducted both qualitative and quantitative evaluation of the resulting word embeddings, which outperformed previously described Welsh word embeddings created as part of a larger study covering 157 languages. Our study was the first to focus specifically on Welsh word embeddings.


Introduction
Natural language processing (NLP) studies the ways in which the analysis and synthesis of information expressed in a natural language can be automated. In recent years, most breakthroughs and improvements in the field have been the result of applying machine-learning techniques. One such case is that of word embeddings [1]. A word embedding is a mapping from the lexico-semantic space of words to the n-dimensional real-valued vector space. Here, the dimensionality n is a hyper-parameter, i.e., a parameter whose value is set before the learning process begins. Compared to a traditional document-term matrix, whose second dimension corresponds to the size of the vocabulary, the dimension of the word embeddings is typically chosen to be relatively small, e.g., 300. Unlike document-term matrices, which are sparse, i.e., have a great many zero values, word embedding vectors are dense. The dimensions of word embedding vectors correspond to latent variables sampled from the distribution of words in a large corpus. As such, word embeddings tend to arrange semantically related words in similar spatial patterns. For example, the distance between the words 'shoe' and 'sock' should be relatively small compared to the distance between the words 'shoe' and 'butter'. Similarly, the vectors between 'foot' and 'sock' on one hand and 'hand' and 'glove' on the other should be near equal, i.e., have similar direction and magnitude. Owing to these latent semantic properties, it has been demonstrated that in many cases the use of word embeddings improves the performance of downstream NLP tasks such as named entity recognition and sentiment analysis.
To date, there has been much research on the creation of word embeddings for the English language [2]. In this study, however, we focus specifically on the Welsh language. Welsh is the native language of Wales, a country that is part of the United Kingdom (UK), in which it has the status of an official language alongside English. According to the 2011 UK Census, 19% of residents in Wales aged three and over were able to speak Welsh. Subsequently, the Office for National Statistics Annual Population Survey for the year ending in March 2019 determined that 896,900 Welsh residents (30% of the total population) aged three or over were able to speak Welsh. Nonetheless, Welsh is considered a low resource language in the sense that, relative to English, there are fewer corpora and NLP tools that are readily available. Empirical evidence suggests that the observance of lexico-semantic patterns in word embeddings is correlated with the size of the corpus used for training [2].
Once a large corpus of Welsh has been assembled, the next challenge in training word embeddings is the recognition of words as discrete units of text, the process commonly known as tokenisation. The relative complexity of Welsh punctuation, particularly the extensive use of apostrophes, which differs from their typical use in English, makes tokenisation challenging. Finally, Welsh is a morphologically rich language, where inflection can give rise to multiple surface forms of a single word. Moreover, Welsh words can be inflected at their beginning as well as their ending, rendering traditional stemming approaches ineffective in linking together related surface forms. Arguably, word embeddings trained on the original surface forms can capture the patterns of their inflection. However, such an approach is not feasible for languages that are highly inflected yet low resourced, as different surface forms may not occur frequently enough to establish a pattern of inflection.
Once these challenges are overcome, the actual process of training word embeddings is relatively straightforward, as most of the state-of-the-art algorithms are, in fact, language independent. Several generic methods for learning word embeddings have been developed and applied successfully to different languages. However, such a general approach is not optimised with respect to the specific characteristics of individual languages, and in turn the resulting word embeddings may not be optimal. In this study, we describe a novel workflow for training Welsh word embeddings that has been developed to overcome the above challenges. Specifically, we assembled a large corpus of Welsh. We considered several tokenisation methods, including one designed specifically for Welsh. To account for word inflection, we opted for a generic word embedding method that is based on subwords and, therefore, can effectively relate different surface forms that share subwords.
The remainder of the paper is organised as follows. Section 2 presents a review of the state of the art in Welsh NLP. Section 3 presents the proposed workflow for learning Welsh word embeddings. Section 4 presents a qualitative and quantitative evaluation of the resulting word embeddings. The quantitative evaluation required us to create a Welsh word embedding benchmark. Finally, Section 5 draws conclusions from this study and discusses future research directions.

Related Work
This section reviews language resources that can support NLP in Welsh and, in particular, the creation of word embeddings in this language. The first step in training word embeddings is to assemble a large corpus. Corpus-based language studies provide empirically based, objective analyses of patterns of language as it is actually used, using evidence from a corpus (singular) or corpora (plural). CorCenCC (Corpws Cenedlaethol Cymraeg Cyfoes, the National Corpus of Contemporary Welsh) is a major corpus [3] containing 10 million words of written, spoken and digital (or 'e') Welsh language. It contains multiple language samples from real-life communication, allowing linguists to explore Welsh as it is actually used. It organises data into multiple facets, which can be used to study sublanguages as defined by [4]. All data are also annotated with different types of linguistic information including morphological units, tokens, part-of-speech (POS) [5] and semantic categories [6,7]. In addition to linguistic research, the corpus can support a range of other applications, such as learning and teaching of Welsh, but also NLP. In this study, together with other data sources, samples of data from an early release of CorCenCC were used to train word embeddings (see Section 3.1).
Having identified the relevant sources of data, the next step towards the creation of word embeddings is the process of identifying individual words as discrete units of text, known as tokenisation. The Welsh Natural Language Toolkit (WNLT) [8] implements a set of rule-based Welsh NLP tools for tokenisation, lemmatisation, POS tagging and named entity recognition (NER), which are embedded into the GATE framework [9]. Similar NLP capabilities have been implemented to support pre-processing of documents stored in CorCenCC, which are tokenised and tagged using CyTag, a rule-based POS tagger [5]. Another POS tagger, which can be used as a web service without the need to install it locally, can tag lexical categories (e.g., verbs and nouns) as well as features specific to the Welsh language, such as mutations [10]. The same team developed a lemmatiser, which can be used to normalise any inflected, mutated and/or conjugated word into its lemma [11].
Members of the CorCenCC team also developed downstream NLP methods for multiword term recognition [12] and semantic tagging [6,7]. These methods were originally developed for English and successfully adapted for Welsh [13][14][15]. They can also be useful for improving the performance of downstream machine translation. For example, verbatim translations often deviate from the established terminology in the target language. Therefore, high-quality translations, performed by either humans or machines, require management of terminologies. Most machine translation systems require a terminology dictionary, e.g., [16,17], and/or the ability to extract terms dynamically [12] to support translations that use established terminology in the target language. In general, phrase-based statistical machine translation can improve the levels of translation quality where sufficiently large parallel corpora can be used for training, as demonstrated in the case of English and Welsh [18]. In particular, the ability to align translated texts into paired sentences in the two languages [19] can support training of cross-lingual word embeddings [20], which can allow existing English language resources to be re-used for applications in Welsh.
Welsh word embeddings were first described in a study that presented a general method for creating word embeddings that was tested across 157 languages [2]. They have since been used to support a machine-learning approach to a joint task of POS and semantic tagging [21]. However, these embeddings were created using a generic approach, which does not take into account specific characteristics of the Welsh language. For example, the text was segmented by the ICU tokeniser, which is language agnostic and not entirely appropriate for Welsh, which features an extensive use of apostrophes that differs from their typical use in other languages. Furthermore, a single method for creating word embeddings was considered, whereas alternative methods could prove to be more suitable for Welsh.

Methods
This section describes the proposed workflow for training word embeddings in Welsh. The workflow consists of three main steps. First, we assembled a large text corpus of the Welsh language. Next, the corpus was pre-processed to identify individual words as discrete units of language. Finally, different methods for training word embeddings were applied to the pre-processed corpus. The following sections describe these steps in more detail.

Corpus Collection
Welsh is considered a low resource language in the sense that, relative to English, there are fewer corpora that are readily available. In particular, no single Welsh text corpus is large enough to train word embeddings. To support this goal specifically, we compiled a large corpus of 92,963,671 words from 11 sources. Their summaries are provided in Table 1. Additional details are provided in the remainder of this section.

• CorCenCC-CorCenCC is the first large-scale general corpus of the Welsh language. The corpus currently contains over 10 million words of spoken, written and electronic language, and collection is still ongoing. The corpus is designed to provide resources for the Welsh language that can be used in language technology (speech recognition, predictive text, etc.), pedagogy, lexicography and academic research contexts, among others. The development of CorCenCC was informed, from the outset, by representatives of all anticipated academic and community user groups. It therefore represents a user-driven model that will inform future corpus design, by providing a template for corpus development in any language, and in particular lesser-used or minoritised languages. We obtained samples of some of the raw electronic text from an early release of the corpus, which included HTML web pages, and personal email and instant messaging correspondence, for use in the present study.

• Wikipedia-Wikipedia is a multilingual crowdsourced encyclopaedia. The English version was the first edition of Wikipedia, founded in January 2001. As of 29 September 2019, it consists of 5,938,555 entries covering a wide range of subjects. Given its size and diversity, English Wikipedia is commonly used to train word embeddings in English. Welsh Wikipedia was founded in July 2003, but it is still significantly smaller than its English counterpart. As of 29 September 2019, it consists of 106,128 entries.

• National Assembly for Wales 1999-2006-The National Assembly for Wales is the devolved parliament of Wales, which has many powers, including those to make legislation and set taxes. The Welsh Language Act 1993 obliges all public sector bodies to give equal importance to both Welsh and English when delivering services to the public in Wales. This means that all documents shared by the National Assembly are available in both languages. By performing a web crawl, Jones et al. [18] assembled a parallel corpus from the public Proceedings of the Plenary Meetings of the Assembly between the years 1999 and 2006 inclusive. The authors used this corpus to support the development of a statistical machine translation method. For the purposes of our current study, we only used the Welsh language portion of the corpus.

• National Assembly for Wales 2007-2011-Similarly, Donnelly [22] created a parallel corpus from the same source, but covering the period from 2007 until 2011. Again, we used the Welsh language portion of the corpus in the present study.

• Cronfa Electroneg o Gymraeg-This corpus consists of 500 articles of approximately 2000 words each, selected from a representative range of text types to illustrate modern (mainly post-1970) fiction and factual prose [23]. It includes articles from novels and short stories, religious writing, children's literature, non-fiction material from education, science, business and leisure activities, public lectures, newspapers and magazines, reminiscences, academic writing, and general administrative materials.

• An Crúbadán-This corpus was created by [24] by crawling Welsh text from Wikipedia, Twitter, blogs, the Universal Declaration of Human Rights and a Jehovah's Witnesses website (JW.org) [25]. To prevent duplication of data, we removed all Wikipedia articles from this corpus before using it in the present study.

• DECHE-The Digitisation, E-publishing and Electronic Corpus (DECHE) project publishes e-versions of Welsh scholarly books that are out of print and unlikely to be re-printed in traditional paper format [26]. Books are nominated by lecturers working through the medium of Welsh and prioritised by the Coleg Cymraeg Cenedlaethol, which funds the project. We collected the text data from this project by downloading all available e-books.

• BBC Cymru Fyw-BBC Cymru Fyw is an online Welsh language service provided by BBC Wales containing news and magazine-style articles. Using the Corpus Crawler tool [27], we constructed a corpus containing all articles published on BBC Cymru Fyw between 1 January 2011 and 17 October 2019 inclusive.

• Gwerddon-Gwerddon is a Welsh-medium academic e-journal, which publishes research in the arts, humanities and sciences. We downloaded all articles published in 29 editions of this journal.

• Beibl.net-The website beibl.net contains articles corresponding to all books of the Bible translated into an accessible variety of modern standard Welsh, along with informational pages.

Pre-Processing
Given a text corpus represented as a sequence of characters, tokenisation is the task of segmenting this sequence into tokens, which roughly correspond to words. Brute-force tokenisation, which removes punctuation and then identifies tokens as continuous character sequences between white spaces, oversimplifies the task and consequently achieves subpar results [32]. For example, consider the following sentence: "Mae'r haul yn taro'r paneli'n gyson â'n systemau'n rhedeg ar drydan wedi'i gynhyrchu yn y mis d'wetha' " (Engl. "The sun hits the panels consistently and our systems run on electricity produced in the last month"). The use of apostrophes in this example represents three different processes. For example, 'r is a form of the definite article which must follow a vowel (e.g., mae'r and taro'r). Similarly, 'n is either the function word yn reduced following a vowel (e.g., paneli'n and systemau'n) or a possessive pronominal, in this case ein (e.g., â'n systemau). Furthermore, 'i in wedi'i is another function word, ei, an agreement proclitic following wedi. So far, these examples represent words or grammatical items separated from other words using an apostrophe. However, the last word, d'wetha', represents a different use of the apostrophe, which reflects a less conservative written variety of standard Welsh by omitting sounds of the full word diwethaf and shortening it to d'wetha', as is most commonly heard in standard speech. Clearly, one cannot assume that the apostrophe represents a word boundary.
We considered three tokenisation methods for the Welsh language. The first tokenisation method, used as a baseline, is a brute-force method consisting of the following steps. First, all characters are converted to lowercase. Next, all punctuation characters are removed. Finally, tokens are identified using white spaces. The second tokenisation method considered was the one from the Gensim library. This tokeniser returns tokens corresponding to maximal contiguous sequences of alphabetic characters. We chose not to remove accentuation from the lowercased text. The final tokeniser we considered was the one from the WNLT, which has been developed specifically for Welsh.
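To illustrate why apostrophes are problematic, the two generic strategies above can be sketched in Python (a minimal sketch using the standard library; the WNLT tokeniser itself is not reproduced here). Note that both generic approaches split the clitic 'r off as a stand-alone token:

```python
import re

def brute_force_tokenise(text):
    """Baseline: lowercase, strip punctuation, split on white space."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return text.split()

def alpha_tokenise(text):
    """Gensim-style: maximal contiguous runs of alphabetic characters.
    Accented characters such as 'â' are kept, matching the decision
    not to remove accentuation."""
    return re.findall(r"[^\W\d_]+", text.lower())

sentence = "Mae'r haul yn taro'r paneli'n gyson"
# Both split every clitic off at the apostrophe:
print(brute_force_tokenise(sentence))
print(alpha_tokenise(sentence))
```

On the example sentence, both tokenisers produce the stand-alone tokens 'r' and 'n', which a Welsh-aware tokeniser such as the WNLT's can instead treat as part of larger units.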
Following tokenisation, we removed rarely occurring tokens by setting a frequency threshold of 5 occurrences. This step helps to remove misspelled words from the corpus. No stemming or lemmatisation was performed, because previous studies have found that these operations remove information that can be used by machine learning to create language models [33].
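The frequency filter can be sketched as follows (a minimal illustration; the token names are hypothetical):

```python
from collections import Counter

def filter_rare_tokens(tokens, min_count=5):
    """Remove tokens occurring fewer than min_count times in the corpus;
    this mirrors the 5-occurrence threshold used to drop likely misspellings."""
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] >= min_count]

# 'cii' stands in for a misspelling of 'ci' (Engl. dog).
tokens = ["ci"] * 6 + ["cath"] * 5 + ["cii"]
print(filter_rare_tokens(tokens))  # the single 'cii' is dropped
```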

Training
Traditionally, word embeddings are computed by minimising the distance between words that appear in similar contexts. Prominent examples of such word embedding methods include word2vec [34], GloVe [35] and fastText [36]. Contextual word embeddings take this approach to the next level by creating different embeddings for the same word used in different contexts to convey different meanings. For example, the word 'bank' can be interpreted as either 'financial bank' or 'river bank' depending on the context. Examples of such word embedding methods include ELMo [37] and BERT [38]. Finally, word embeddings can be enriched with different types of information. For example, sentiment embeddings [39] incorporate the sentiment of words into their embeddings. The benefit of sentiment embeddings over standard embeddings is that the distance between opposite words such as 'good' and 'bad', which tend to appear in similar contexts, will become larger to reflect their semantics more appropriately.
In this study, we focused solely on traditional word embedding methods, specifically word2vec [34] and fastText [36]. These approaches produce one vector per word, which enabled us to make a direct comparison to the existing baseline approach [2]. Word2vec has two versions, known as skip-gram and continuous bag of words (CBOW), respectively [34,40]. Given a target word w_t, skip-gram aims to predict its context. Formally, the objective of the skip-gram method is to maximise the log-likelihood defined in Equation (1), where C_t is the set of indices of context words surrounding the target word w_t.
For each word w, the skip-gram method defines two vectors u_w and v_w in R^n, which are learnt automatically. These vectors are commonly referred to as input and output vectors, respectively [36]. Given this, the skip-gram version estimates the probability p(w_c|w_t) using the softmax function defined in Equation (2), where s : R^n × R^n → R is the scoring function defined in Equation (3) and u_t^⊤ denotes the transpose of the vector u_t.
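Equations (1)-(3) do not survive in this text; reconstructed from the original skip-gram formulation [34,36,40] and using the notation of the surrounding text, they read:

```latex
% Equation (1): skip-gram log-likelihood over a corpus of T words
\frac{1}{T} \sum_{t=1}^{T} \sum_{c \in C_t} \log p(w_c \mid w_t)

% Equation (2): softmax over the vocabulary of size W
p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}

% Equation (3): scoring function from input and output vectors
s(w_t, w_c) = u_{t}^{\top} v_{c}
```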
This formulation of p(w_c|w_t) renders the learning of the vectors u_w and v_w impractical, because the cost of computing the derivatives is proportional to the size of the vocabulary W. To overcome this challenge, [40] proposed two approximations known as hierarchical softmax and negative sampling.
Conversely, given a context, CBOW aims to predict the target word. Formally, instead of modelling p(w_c|w_t), CBOW models p(w_t|w_c) [34]. The fastText method generalises the two versions of word2vec, i.e., skip-gram and CBOW, by considering the subwords within the words [36]. The authors argue that this method is useful for morphologically rich languages such as Turkish and Finnish. Welsh is also morphologically rich, where inflection can give rise to multiple surface forms of a single word. Consider, for example, the word 'ci' (Engl. dog). In the phrase 'ei gi' (Engl. his dog), soft mutation applies to the word 'ci'. On the other hand, in the phrase 'ei chi' (Engl. her dog), aspirate mutation applies to the word 'ci' [41]. Therefore, both 'gi' and 'chi' correspond to the same lemma, 'ci'. Mutations occur frequently in Welsh. Subword information has the potential to allow a word embedding method to relate different mutations of the same word. Therefore, fastText represents an appropriate choice of word embedding method for this language.
FastText inserts special boundary characters < and > at the beginning and end, respectively, of each word. This allows the model to distinguish prefixes and suffixes from other character sequences. Each word w is then represented as a set G_w of character m-grams, where the original word is also included in the set. For example, consider the case where w is the word 'sheep' and m = 3. In this case, G_w = {<sh, she, hee, eep, ep>, <sheep>}. For each m-gram g in the vocabulary, the fastText method defines a corresponding vector z_g, which is learnt. Given this, the fastText method is identical to the word2vec model, except that the scoring function in Equation (3) is replaced by the scoring function in Equation (4).
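The construction of G_w can be sketched as follows (a minimal illustration of the boundary-marker scheme described above; the mutation pair 'cath'/'gath' is our own example, not from the text):

```python
def char_ngrams(word, m_min=3, m_max=6):
    """Character m-grams of a word wrapped in fastText-style < > boundary
    markers; the full bracketed word is also included, giving the set G_w."""
    wrapped = "<" + word + ">"
    grams = {wrapped[i:i + m] for m in range(m_min, m_max + 1)
             for i in range(len(wrapped) - m + 1)}
    grams.add(wrapped)
    return grams

# The worked example from the text: w = 'sheep', m = 3.
print(sorted(char_ngrams("sheep", 3, 3)))

# Mutated surface forms share subwords: 'cath' (Engl. cat) soft-mutates
# to 'gath', and the two forms share the grams 'ath' and 'th>'.
print(char_ngrams("cath", 3, 3) & char_ngrams("gath", 3, 3))
```

This shared-gram overlap is precisely what allows fastText to relate mutated forms of the same lemma during training.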
In our experiments, the hyperparameters of the word2vec and fastText methods were set to the following values. For the fastText method, we considered all m-grams with 3 ≤ m ≤ 6. For both word2vec and fastText, we used a value of 300 for the dimension n of the word embedding vectors, following the best practices described in [36], and trained for 20 epochs. We used the implementations of the word2vec and fastText methods available in the Gensim library.

Results and Analysis
Using different combinations of the methods described in the previous section, we trained a total of 12 versions of Welsh word embeddings. We compared them against the Welsh word embeddings described by [2,42] as the baseline.
Word embeddings are commonly evaluated using a combination of qualitative and quantitative methods. Qualitative methods involve manual selection of prototype words and inspection of their neighbourhood in the vector space. Quantitative methods can be divided into two categories, intrinsic and extrinsic methods [43,44]. Extrinsic methods evaluate word embeddings with respect to their effect on downstream NLP applications such as NER and sentiment analysis. Intrinsic methods evaluate how accurately word embeddings capture the semantic similarity of words, under the assumption that semantically similar words will be close spatially in the vector space. In this study, we evaluated the word embeddings quantitatively using intrinsic methods. We considered four intrinsic methods, which are based on similarity, clustering, synonymy and analogy, respectively. Several of the evaluation methods considered involved translating a corresponding dataset from English into Welsh. In all cases, the translation in question was performed by a bilingual Welsh-English speaker. These translations were verified by a second bilingual Welsh-English speaker, with any disagreement resolved through discussion.
The remainder of the section provides further details of the experimental setup together with the corresponding results.

Word Similarity
Similarity-based methods for evaluating word embeddings use ground-truth pairwise semantic similarity of words, where semantic similarity is represented by a value within a fixed range, with higher values indicating greater semantic similarity. The correlation between the semantic similarity of words and the cosine similarity of the corresponding word embeddings is used to gauge the utility of the word embeddings. The higher the correlation, the higher the utility of the word embeddings. The ground-truth semantic similarity is typically estimated by native speakers.
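The evaluation procedure can be sketched as follows; the word pairs, human scores and two-dimensional vectors are hypothetical, and the Spearman implementation ignores ties for brevity:

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def spearman(xs, ys):
    """Spearman rank correlation (no tie correction, for illustration)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical ground truth: esgid/hosan (shoe/sock) rated similar,
# esgid/menyn (shoe/butter) rated dissimilar; toy 2-d vectors.
pairs = [("esgid", "hosan", 8.0), ("esgid", "menyn", 1.5)]
emb = {"esgid": [1.0, 0.1], "hosan": [0.9, 0.2], "menyn": [0.0, 1.0]}

human = [s for _, _, s in pairs]
model = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
print(spearman(human, model))  # → 1.0
```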
We considered two word similarity datasets. First, the WordSimilarity-353 dataset contains a total of 353 word pairs in English [45,46]. Each word pair is associated with the mean of semantic similarity estimates made independently by multiple individuals on a Likert scale from 0 to 10 inclusive. We adapted this dataset by translating it into Welsh using bilingual Welsh-English speakers. We used Spearman's rank correlation coefficient to measure the correlation between semantic similarity and cosine similarity. Table 2 provides the results. The naive tokenisation combined with CBOW performed best. In fact, all CBOW methods, regardless of tokenisation, outperformed Grave's model. On the other hand, all methods based on skip-gram performed worse than Grave's model. The second dataset we considered was SimLex-999, which contains a total of 999 word pairs consisting of 666 noun pairs, 222 verb pairs and 111 adjective pairs in English [47,48]. Each word pair is associated with the mean semantic similarity estimated independently by 50 individuals on a Likert scale from 0 to 10 inclusive. We adapted this dataset by translating it into Welsh using bilingual Welsh-English speakers. Direct translation may introduce minor inaccuracies or other biases into the ground truth. For example, the word 'bank' has two interpretations, one similar to the word 'river' and the other to the word 'money'. However, this homonymy is not observed in Welsh. Other considerations are cultural. For example, references to American concepts such as 'baseball', 'dollar' and 'buck' may occur rarely, if at all, in the Welsh corpus. Two word pairs with no Welsh equivalents, 'football-soccer' and 'dollar-buck', were removed from the dataset.
In addition to semantic similarity, SimLex-999 also provides the strength of free association, represented as a value in the range 0 to 10 inclusive. For example, the words 'car' and 'petrol' are not semantically similar but have high free association. The strength of free association was calculated using the University of South Florida Free Association Dataset [49]. This dataset was generated by presenting human subjects with one of 5000 cue concepts and asking them to write the first word that comes to mind. Table 3 provides the results. The baseline performed the best, although all values are very low. Note that the Spearman correlation coefficients obtained in all cases were very low. This is expected, especially on the SimLex-999 dataset: even state-of-the-art English word embeddings, trained on corpora around 1000 times larger than our Welsh corpus, also achieve very low correlations [47], between 0.2 and 0.45. There has also been discussion of the problems of using word similarity metrics such as these for evaluating word embeddings [50]. Thus, in the following sections we provide further alternative evaluation metrics.

Word Clustering
Concept categorisation is a common method for evaluating word embeddings [51]. It checks whether words can be grouped into natural categories from their vectors alone. For example, the words 'bear' and 'bull' belong to an animal class, while 'cupboard' and 'chair' belong to a furniture class.
We adapted a concept categorisation dataset from [52] by translating it into Welsh using bilingual Welsh-English speakers. It consists of 214 words assorted into 13 categories, which are provided in Appendix A of this article. The vectors for each of the 214 words were clustered using k-means clustering combined with cosine distance and Euclidean distance, respectively, with k = 13 to match the number of categories. Ideally, the 13 clusters should map directly to the 13 categories. Three measures were used to evaluate the clustering results.

• Purity measures the extent to which clusters contain words of the same category. It is calculated using Equation (5), where N is the total number of words, M is the set of clusters and D is the set of known categories. In effect, it is the average count of the majority category per cluster. Purity is commonly used to evaluate vector semantics, e.g., [51,53]. Its main shortcoming is that it does not penalise a single category being distributed over more than one cluster, for example, words belonging to an education category being spread over several clusters.
• Rand index measures the extent to which pairs of words that do or do not belong to the same category end up in the same cluster or not. For each word pair, clustering can produce a true positive (the same category and the same cluster), true negative (different categories and different clusters), false positive (different categories, but the same cluster), or false negative (the same category, but different clusters). The counts are given by TP, TN, FP and FN, respectively. These measures were used to measure accuracy in [52]. Rand index is calculated as the proportion of correctly predicted pairs, as prescribed by Equation (6).
• Entropy measures how words from the same categories are distributed across the clusters. Low entropy indicates that words of the same category tend to be grouped within the same cluster. This measure, used in [53], is given in Equation (7).
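Purity and the Rand index can be sketched as follows on hypothetical cluster assignments (entropy is omitted for brevity); `clusters` and `categories` are parallel lists of labels, one per word:

```python
from collections import Counter

def purity(clusters, categories):
    """For each cluster, count its majority category; divide by N."""
    n = len(clusters)
    total = 0
    for c in set(clusters):
        members = [categories[i] for i in range(n) if clusters[i] == c]
        total += Counter(members).most_common(1)[0][1]
    return total / n

def rand_index(clusters, categories):
    """Proportion of word pairs whose cluster and category co-membership
    agree, i.e. (TP + TN) / (TP + TN + FP + FN)."""
    n = len(clusters)
    agree, pairs = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            same_cluster = clusters[i] == clusters[j]
            same_category = categories[i] == categories[j]
            agree += same_cluster == same_category
            pairs += 1
    return agree / pairs

# A perfect clustering of two animal words and two furniture words.
clusters = [0, 0, 1, 1]
categories = ["anifail", "anifail", "dodrefn", "dodrefn"]
print(purity(clusters, categories), rand_index(clusters, categories))  # 1.0 1.0
```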
Only the fastText models can be evaluated in this way. As word2vec models do not capture subword information, vectors for unseen words cannot be inferred, and so the k-means algorithm fails. The results for the fastText models are given in Table 4.
WNLT tokenisation combined with skip-gram performed best when Euclidean distance was used for clustering, while Gensim tokenisation combined with skip-gram performed best when cosine distance was used for clustering. Grave's model performed the worst in both cases, indicating that a larger and more representative corpus yields word embeddings that perform better at concept categorisation tasks.

Word Synonyms
The ability to link synonyms has been used to evaluate word embeddings [44,51] as well as other machine-learning tasks [54]. A dataset similar to the one based on multiple-choice synonym questions used in the Test of English as a Foreign Language was created for Welsh [55]. This is not a case of simple translation of English data, as synonymy is unique to a language and cannot be mapped easily from one language to another. Therefore, a new dataset was constructed with 50 questions (including nouns, adjectives, and verbs), given in Appendix A of this article. Given a word (e.g., rusty) and a set of related words, one of which is a (near-)synonym (e.g., {corroded, black, dirty, painted}), an answer is selected as the word with the smallest cosine distance. We measured the percentage of questions for which the correct synonym (in this case, corroded) was chosen.
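Selecting the answer with the highest cosine similarity to the cue can be sketched as follows; the Welsh words (rhydlyd for 'rusty', and its candidates) and their two-dimensional vectors are hypothetical, chosen only to illustrate the selection rule:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

def pick_synonym(cue, candidates, emb):
    """Return the candidate closest to the cue, i.e. with the highest
    cosine similarity (equivalently, the smallest cosine distance)."""
    return max(candidates, key=lambda w: cosine(emb[cue], emb[w]))

# Hypothetical toy vectors; in practice these come from the trained model.
emb = {"rhydlyd": [1.0, 0.0], "cyrydedig": [0.9, 0.1],
       "du": [0.0, 1.0], "budr": [0.3, 0.8]}
print(pick_synonym("rhydlyd", ["cyrydedig", "du", "budr"], emb))  # cyrydedig
```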
Again, only the fastText models can be evaluated in this way, as those without subword information would be biased (having fewer word choices) when presented with words not in the original dataset. The results for the fastText models are given in Table 5.
WNLT tokenisation combined with skip-gram performed the best on synonym prediction, while Grave's model performed the worst.

Word Analogies
Word analogies can be used to evaluate whether the semantic relationships between words correspond to the mathematical relationships between their respective embeddings. For example, the relationship between the words 'king' and 'queen' should be identical to the relationship between the words 'actor' and 'actress'. Therefore, if x_king, x_queen, x_actor and x_actress are the trained vectors for the words 'king', 'queen', 'actor' and 'actress', respectively, then we would expect:

x_queen − x_king ≈ x_actress − x_actor

Given a set of examples of various language-specific relationships, the Gensim library allows us to measure the proportion of examples for which the above equation holds, where proximity is calculated using the five nearest neighbours of the vector on the left-hand side. We translated language-independent relationships, such as those between nations and nationalities, into Welsh. A dataset of grammatical relationships was constructed by a native Welsh-speaking linguist and included adjective-opposite (435 pairs), adjective-comparative (55 pairs), adjective-superlative (55 pairs), adjective-equative (55 pairs), nationalities (210 pairs), languages (120 pairs), noun-plural (6555 pairs), noun-singular (105 pairs), noun-gender (703 pairs), adjective-gender (105 pairs), adjective-plural (528 pairs), verb-non-finite (325 pairs), verb-past-1st-singular (45 pairs), verb-past-3rd-singular (45 pairs), verb-past-impersonal (45 pairs), verb-present-impersonal (45 pairs), inflectional-preposition-1st-singular (36 pairs), and inflectional-preposition-2nd-plural (36 pairs).
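The analogy test can be sketched as follows. The vectors are toy stand-ins for trained embeddings, but the check mirrors the procedure above: form the offset vector and ask whether the expected answer falls within its five nearest neighbours (the query words themselves are excluded):

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy_holds(vecs, a, b, c, d, k=5):
    # test whether vec(b) - vec(a) + vec(c) has the expected answer d
    # among its k nearest neighbours (excluding the three query words)
    target = vecs[b] - vecs[a] + vecs[c]
    ranked = sorted((w for w in vecs if w not in {a, b, c}),
                    key=lambda w: cosine(target, vecs[w]), reverse=True)
    return d in ranked[:k]

# toy embeddings (hypothetical values) encoding a shared relational direction
vecs = {
    "king":    np.array([1.0, 1.0, 0.0]),
    "queen":   np.array([1.0, 0.0, 1.0]),
    "actor":   np.array([0.0, 1.0, 1.0]),
    "actress": np.array([0.0, 0.1, 2.0]),
    "table":   np.array([5.0, 5.0, 5.0]),
}
print(analogy_holds(vecs, "king", "queen", "actor", "actress"))  # True with these toy vectors
```

Overall accuracy is then the proportion of analogy pairs for which this check succeeds.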
The accuracy over all 9503 pairs, for each of the models, is given in Table 6. All models perform much better than Grave's model. There are also clear rankings between tokenisation and training methods: Gensim tokenisation performed better than WNLT, which in turn performed better than naive tokenisation; CBOW yields much more accurate models than skip-gram here; and, surprisingly, word2vec performs marginally better than fastText in all cases.

Qualitative Evaluation
We considered 30 nearest neighbours of a small set of prototype words. All prototype words were present in the text corpus used in training the word embeddings. Nearest neighbours were identified using the cosine similarity measure, which was calculated using the formula in Equation (9).
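Equation (9) is not reproduced in this excerpt; it is the standard cosine similarity between two embedding vectors:

```latex
\cos(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\| \, \|\mathbf{v}\|}
  = \frac{\sum_{i=1}^{n} u_i v_i}
         {\sqrt{\sum_{i=1}^{n} u_i^2} \, \sqrt{\sum_{i=1}^{n} v_i^2}}
```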
Given the cost of inspecting the neighbourhood manually, we limited this aspect of the evaluation to a comparison of a single model, namely the fastText skip-gram model with WNLT tokenisation, against the baseline. This model was chosen as it demonstrated the best results in the quantitative evaluation presented above. The following is a small selection of the words used for comparison, with comments on the performance of the models. For the Welsh towns listed below, nearest neighbours cluster geographically; this pattern is not evident in Grave's model, which gives nearest neighbours to these towns most commonly as mutations and misspellings of the original word, and less commonly as Welsh towns further afield.

Conclusions
In this paper, we have presented a systematic evaluation of Welsh word embeddings trained using different combinations of word embedding and tokenisation approaches.
Although Welsh word embeddings have been created in the past, this is the first study that focuses solely on the Welsh language and evaluates the embeddings with respect to its own patterns of syntax and semantics. In this respect, our model outperformed the only other existing model and, as such, sets a new baseline for the Welsh NLP community. To train the embeddings, we assembled the largest corpus of the Welsh language to date. Although the corpus itself cannot be re-shared publicly due to data access restrictions, its constituent texts can be collected from the original sources and used to re-create the corpus. Nonetheless, the word embeddings are made publicly available together with the associated code at [3].
Based on accuracy and consistency of performance on a wide variety of tasks, the recommendation arising from this study is to use fastText embeddings trained on WNLT-tokenised text using the skip-gram method. The only exception is word analogy, for which word2vec embeddings trained on Gensim-tokenised text using the CBOW method performed better. These observations need to be taken into account when selecting the type of embeddings to support specific downstream tasks. For example, tasks such as document similarity may benefit from using word2vec embeddings, whereas tasks such as named entity recognition, which may require reasoning about newly encountered words, may benefit from using fastText embeddings.
In addition to this resource, we also created several datasets for the evaluation of word embeddings in Welsh. This study laid a foundation for developing cross-lingual word embeddings in which the vector space is shared between words in Welsh and English [56], where we demonstrated how NLP tools originally developed for English can be re-purposed for Welsh. Our future work will focus on learning contextual word embeddings in Welsh using approaches such as BERT [57], where the same word can have a different vector representation depending on the current context. In addition, BERT generates embeddings at a subword level, which would help alleviate the out-of-vocabulary problem associated with small training datasets. In particular, using BERT to learn cross-lingual embeddings could effectively address the problem of code-switching, and especially intra-word switching.

Table 1. Data sources used to collect the training corpus.

Table 2. The results achieved on the WordSimilarity-353 dataset.

Table 3. The results achieved on the SimLex-999 dataset.

Table 4. The results for the concept categorisation task.

Table 5. The results achieved on the synonymy detection task.

Table 6. The results achieved for the word analogy task.
• glaw (rain): Our model lists a variety of weather phenomena including eira (snow), gwyntoedd (winds), cawodydd (showers), cenllysg (hail), gwlyb (wet), stormydd (storms), corwyntoedd (hurricanes), and taranau (thunder). Grave's model does list some related words such as monswn (monsoon), but mainly lists derivations of the original word, e.g., glawiog (rainy), and other unrelated words such as car-boot and sgubai (sweep), although this may relate to rain sweeping across the land.
• hapus (happy): Our model lists several synonyms or related adjectives, including lwcus (lucky), falch (glad), ffodus (fortunate), and bodlon (satisfied). Grave's model lists some of these, but contains many other less similar words that could appear in the same context, such as anhapus (unhappy), eisiau (want), teimlon (felt), and grac (angry). It also lists words of similar spelling, but unrelated semantically: siapus (shapely) and napus (Brassica napus; a species of rapeseed), which may indicate that their model relies too heavily on subword information.
• meddalwedd (software): Both models list many words related to computing and technology, including salwedd (malware), amgryptio (encrypting), cyfrifiadurol (computational), metaddata (metadata), telegyfathrebu (telecommunication), and rhyngwyneb (interface). Our model provides a greater variety of words, while Grave's model provides some English words and product names, e.g., DropBox. This may be because a more recent corpus was used by our model, as more technological articles will have been published, and more technological terminology developed, in recent years.
• ffrangeg (the French language): There was a stark difference in the lists produced by the two models. Our model returned several other western European languages, including llydaweg (Breton), isalmaeneg (Dutch), galaweg (Gallo) and sbaeneg (Spanish). Grave's model, however, gives several compound names, e.g., Arabeg-Ffrangeg (Arabic-French) and Ffrangeg-Saesneg (French-English), while also returning several foreign words.
• croissant (the loan word 'croissant'): Again, there was a stark difference between the models here. Our model listed other foreign or loan words for food, including gefrüstuckt, brezel, müsli and spaghetti, along with some unrelated foreign words. Grave's model lists several unrelated foreign words, many with similar spellings to the original word, e.g., Eblouissant, Pourrissant, and Florissant, again indicating the model's possible over-reliance on subword information.
• Caerfyrddin, a large town in West Wales, has nearest neighbours made up of other towns in West Wales, for example Llanelli, Aberteifi, Hwlffordd, Llambed, Penfro, Bwlchclawdd, Castellnewyddemlyn, Aberystwyth and Ceredigion.
• Caernarfon, a large town in North Wales, has nearest neighbours made up of other towns in North Wales, for example Dolgellau, Cricieth, Porthmadog, Llanllyfni, Pwllheli, Llangefni, Llandudno, Felinheli and Biwmares.
• Pontypridd, a large town in the South Wales valleys, has nearest neighbours made up of other towns in the South Wales valleys, for example Pontypŵl, Aberdâr, Pontyclun, Rhymni, Pontygwaith, Rhondda, Tonypandy, Abercynon and Trefforest.