Creating Welsh Language Word Embeddings
Abstract
1. Introduction
2. Related Work
3. Methods
3.1. Corpus Collection
- CorCenCC—CorCenCC is the first large-scale general corpus of the Welsh language. The corpus currently contains over 10 million words of spoken, written and electronic language, and collection is still ongoing. The corpus is designed to provide resources for the Welsh language that can be used in language technology (speech recognition, predictive text, etc.), pedagogy, lexicography and academic research contexts, among others. The development of CorCenCC was informed, from the outset, by representatives of all anticipated academic and community user groups. It therefore represents a user-driven model that will inform future corpus design by providing a template for corpus development in any language, in particular lesser-used or minoritised languages. We obtained samples of the raw electronic text from an early release of the corpus, which included HTML web pages and personal email and instant messaging correspondence, for use in the present study.
- Wikipedia—Wikipedia is a multilingual crowdsourced encyclopaedia. The English version, founded in January 2001, was the first edition of Wikipedia. As of 29 September 2019, it consisted of 5,938,555 entries covering a wide range of subjects. Given its size and diversity, English Wikipedia is commonly used to train word embeddings in English. Welsh Wikipedia was founded in July 2003 and remains significantly smaller than its English counterpart: as of 29 September 2019, it consisted of 106,128 entries.
- National Assembly for Wales 1999–2006—The National Assembly for Wales is the devolved parliament of Wales, which has many powers including those to make legislation and set taxes. The Welsh Language Act 1993 obliges all public sector bodies to give equal importance to both Welsh and English when delivering services to the public in Wales. This means that all documents shared by the National Assembly are available in both languages. By crawling the web, Jones and Eisele [18] assembled a parallel corpus from the public Proceedings of the Plenary Meetings of the Assembly between the years 1999 and 2006 inclusive. The authors used this corpus to support the development of a statistical machine translation method. For the purposes of our current study, we only used the Welsh language portion of the corpus.
- National Assembly for Wales 2007–2011—Similarly, Donnelly [22] created a parallel corpus from the same source but covering the period from 2007 until 2011. Again, we used the Welsh language portion of the corpus in the present study.
- Cronfa Electroneg o Gymraeg—This corpus consists of 500 articles of approximately 2000 words each, selected from a representative range of text types to illustrate modern (mainly post-1970) fiction and factual prose [23]. It includes articles from novels and short stories, religious writing, children’s literature, non-fiction material from education, science, business and leisure activities, public lectures, newspapers and magazines, reminiscences, academic writing, and general administrative materials.
- An Crúbadán—This corpus was created by Scannell [24] by crawling Welsh text from Wikipedia, Twitter, blogs, the Universal Declaration of Human Rights and a Jehovah’s Witnesses website (JW.org) [25]. To prevent duplication of data, we removed all Wikipedia articles from this corpus before using it in the present study.
- DECHE—The Digitisation, E-publishing and Electronic Corpus (DECHE) project publishes e-versions of Welsh scholarly books that are out of print and unlikely to be re-printed in traditional paper format [26]. Books are nominated by lecturers working through the medium of Welsh and prioritised by the Coleg Cymraeg Cenedlaethol, which funds the project. We collected the text data from this project by downloading all e-books available.
- BBC Cymru Fyw—BBC Cymru Fyw is an online Welsh language service provided by BBC Wales containing news and magazine-style articles. Using the Corpus Crawler tool [27], we constructed a corpus containing all articles published on BBC Cymru Fyw between 1 January 2011 and 17 October 2019 inclusive.
- Gwerddon—Gwerddon is a Welsh-medium academic e-journal, which publishes research in arts, humanities and sciences. We downloaded all articles published in 29 editions of this journal.
- Beibl.net—The website beibl.net contains articles corresponding to all books of the Bible translated into an accessible variety of modern standard Welsh, along with informational pages.
3.2. Pre-Processing
3.3. Training
4. Results and Analysis
4.1. Word Similarity
4.2. Word Clustering
- Purity measures the extent to which clusters contain words of the same category. It is calculated using Equation (5), where N is the total number of words, M is the set of clusters and D is the set of known categories. Each cluster is matched to the category that is most frequent among its words, and purity is the proportion of all N words that belong to the matched category of their cluster. Purity is commonly used to evaluate vector semantics, e.g., [51,53]. Its main shortcoming is that it does not penalise a single category being distributed over more than one cluster, for example, words belonging to an education category being split across several clusters.
- Rand Index measures the extent to which pairs of words that do or do not belong to the same category end up in the same cluster or not. For each word pair, clustering can produce a true positive (the same category and the same cluster), true negative (different categories and different clusters), false positive (different categories, but the same cluster), or false negative (the same category, but different clusters). The counts are given by TP, TN, FP and FN, respectively. These counts were used to measure accuracy in [52]. Rand index is calculated as the proportion of correctly predicted pairs, as prescribed by Equation (6).
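The two measures above can be illustrated with a minimal Python sketch. It assumes the standard definitions: purity as the share of words that fall in the majority category of their cluster, and Rand index as the share of word pairs on which clustering and categories agree. The function names, the six-word sample and the cluster assignments are hypothetical and chosen only for illustration; the gold categories are taken from Appendix A.

```python
from collections import Counter
from itertools import combinations


def purity(clusters, categories):
    """Share of words whose category is the most frequent one in their cluster."""
    by_cluster = {}
    for cluster, category in zip(clusters, categories):
        by_cluster.setdefault(cluster, Counter())[category] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(clusters)


def rand_index(clusters, categories):
    """Share of word pairs on which the clustering and the categories agree."""
    agree = total = 0
    for i, j in combinations(range(len(clusters)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_category = categories[i] == categories[j]
        # Agreement covers true positives (same/same) and true negatives (different/different).
        agree += same_cluster == same_category
        total += 1
    return agree / total


# Hypothetical toy data: cluster ids are made up, categories follow Appendix A.
words = ["cath", "ci", "llew", "afal", "oren", "banana"]
categories = ["anifail", "anifail", "anifail", "ffrwyth", "ffrwyth", "ffrwyth"]
clusters = [0, 0, 1, 1, 1, 1]

print(purity(clusters, categories))
print(rand_index(clusters, categories))
```

In this toy case, llew is placed in the fruit cluster, so 5 of the 6 words match the majority category of their cluster, and 10 of the 15 word pairs are either true positives or true negatives.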
4.3. Word Synonyms
4.4. Word Analogies
4.5. Qualitative Evaluation
- nofio (to swim): Both models list several sports and water-related activities as nearest neighbours, although each model gives different words, including beicio (cycling), caiacio (kayaking), syrffio (surfing), plymio (diving), nofwyr (swimmers) and pwll (pool). Grave’s model also lists several mutations and misspellings of bwrcini (burkini), which may indicate a small or insufficiently diverse training corpus.
- glaw (rain): Our model lists a variety of weather phenomena including eira (snow), gwyntoedd (winds), cawodydd (showers), cenllysg (hail), gwlyb (wet), stormydd (storms), corwyntoedd (hurricanes), and taranau (thunder). Grave’s model does list some related words such as monswn (monsoon), but mainly lists derivations of the original word e.g., glawiog (rainy), and other unrelated words such as car-boot and sgubai (sweep), although this may relate to rain sweeping across the land.
- hapus (happy): Our model lists several synonyms or related adjectives, including lwcus (lucky), falch (glad), ffodus (fortunate), and bodlon (satisfied). Grave’s model lists some of these, but contains many other less similar words that could appear in the same context, such as anhapus (unhappy), eisiau (want), teimlon (felt), and grac (angry). It also lists words that are similar in spelling but semantically unrelated: siapus (shapely) and napus (brassica napus; a species of rapeseed), which may indicate that the model relies too heavily on subword information.
- meddalwedd (software): Both models list many words related to computing and technology here, including salwedd (malware), amgryptio (encrypting), cyfrifiadurol (computational), metaddata (meta-data), telegyfathrebu (telecommunication), and rhyngwyneb (interface). Our model provides a greater variety of words, while Grave’s model provides some English words and product names e.g., DropBox. This may be due to the fact that a more recent corpus was used by our model, as there will have been more technological articles published, and more technological terminology developed, in recent years.
- ffrangeg (the French language): There was a stark difference in the lists produced by the two models. Our model returned several other western European languages including llydaweg (Breton), isalmaeneg (Dutch), galaweg (Gallo) and sbaeneg (Spanish). Grave’s model, however, gives several compound names, e.g., Arabeg-Ffrangeg (Arabic-French) and FfrangegSaesneg (French-English), while also returning several foreign words.
- croissant (the loan word ‘croissant’): Again, there was a stark difference between the models here. Our model listed other foreign or loan words for food including gefrüstuckt, brezel, müsli and spaghetti, along with some unrelated foreign words. Grave’s model lists several unrelated foreign words, many with similar spellings to the original word, e.g., Eblouissant, Pourrissant, and Florissant, again indicating the model’s possible over-reliance on subword information.
- gwario (spend): Here Grave’s model seems to group related words better. Our model lists mutations and conjugations of the original verb in addition to verbs formed by prefixation, e.g., orwario (overspend) and tanwario (underspend), and some other related words e.g., wastraffu (waste), arbed (save), and talu (pay). Grave’s model lists other words related to finance including Bitcoins, miliwnau (millions), arbedir (to have saved), chyllidebu (budgeting) and arian (money).
- Caerfyrddin, a large town in West Wales, has nearest neighbours made up from other towns in West Wales, for example Llanelli, Aberteifi, Hwlffordd, Llambed, Penfro, Bwlchclawdd, Castellnewyddemlyn, Aberystwyth and Ceredigion.
- Caernarfon, a large town in North Wales, has nearest neighbours made up from other towns in North Wales, for example Dolgellau, Cricieth, Porthmadog, Llanllyfni, Pwllheli, Llangefni, Llandudno, Felinheli and Biwmares.
- Pontypridd, a large town in the South Wales valleys, has nearest neighbours made up from other towns in the South Wales valleys, for example Pontypŵl, Aberdâr, Pontyclun, Rhymni, Pontygwaith, Rhondda, Tonypandy, Abercynon and Trefforest.
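Nearest-neighbour lists of the kind discussed above can be obtained by ranking all vocabulary words by their similarity to a query word. The following is a minimal sketch that assumes cosine similarity as the ranking measure; the function names and the three-dimensional vectors are made up for illustration and are not taken from either trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbours(query, vectors, k=3):
    """Return the k words most similar to `query`, excluding the query itself."""
    scored = [
        (cosine_similarity(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


# Hypothetical 3-dimensional vectors; real embeddings have far more dimensions.
toy_vectors = {
    "glaw": [0.9, 0.1, 0.0],
    "eira": [0.8, 0.2, 0.1],
    "stormydd": [0.7, 0.3, 0.0],
    "hapus": [0.0, 0.9, 0.4],
    "meddalwedd": [0.1, 0.0, 0.9],
}

print(nearest_neighbours("glaw", toy_vectors, k=2))
```

With these toy vectors, the two weather words rank above hapus and meddalwedd for the query glaw. Because the ranking depends only on the vectors, a model whose vectors are dominated by subword overlap would return similarly spelled words instead.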
5. Conclusions
Appendix A
Class | Words |
---|---|
anifail | arth, tarw, camel, cath, buwch, carw, ci, eleffant, ceffyl, cath bach, llew, mwnci, llygoden, wystrysen, ci bach, llygoden mawr, dafad, teigr, crwban, sebra |
adeilad | ladd-dy, canolfan, clwb, dortur, tŷ gwydr, cyntedd, ysbyty, gwesty, tŷ, tafarn, llyfrgell, meithrin, bwyty, ysgol, nendwr, tafarndy, theatr, fila, puteindy |
dillad | trôns, blows, cot, costiwm, megyg, het, siaced, jîns, mwclis, pyjamas, cochl, scarff, crys, siwt, trywsus, gwisg |
crëwr | pensaer, arlunydd, adeiladwr, lluniwr, crefftwr, dylunydd, datblygwr, ffermwr, dyfeisiwr, crëwr, gwneuthurwr, cerddor, cychwynnwr, paentiwr, ffotograffydd, cynhyrchydd, teilwr |
afiechyd | acne, anthracs, arthritis, asthma, canser, colera, sirosis, clefyd siwgwr, ecsema, ffliw, glawcoma, hepatitis, lewcemia, camfaethiad, llid yr ymennydd, pla, cryd cymalau, brech wen |
teimlad | dicter, awydd, ofn, hapusrwydd, llawenydd, cariad, poen, angerdd, pleser, tristwch, sensitifrwydd, cywilydd, rhyfeddod |
ffrwyth | afal, banana, aeron, ceirios, grawnwin, ciwi, lemwn, mango, melon, olewydd, oren, eirinen wlanog, gellygen, pinafal, mefys, dyfrfelon |
dodrefn | gwely, silf llyfrau, cwpwrdd, cadair, cowtsh, crud, desg, tresel, lamp, lolfa, sedd, soffa, bwrdd, cwpwrdd dillad |
corff | pigwrn, braich, clyst, llygad, gwyneb, bys, troed, llaw, pen, coes, trwyn, ysgwydd, byd troed, tafod, dant, addwrn |
cyhoeddiad | atlas, llyfr, llyfryn, pamffled, catalog, llyfr coginio, geiriadur, gwyddoniadur, llawllyfr, cyfnodolyn, cylchgrawn, seinglawr, llyfr ffôn, cyfeirlyfr, gwerslyfr, llyfr gwaith |
teulu | bachgen, plentyn, cefnithr, merch, tad, geneth, ŵyr, tadcu, nain, gŵr, crwt, mam, epil, sibling, mab, gwraig |
amser | canrif, degawd, oed, noswaith, hydref, awr, mis, broe, nos, goramser, chwarter, tymor, semester, gwanwyn, haf, wythnos, penwythnos, gaeaf, blwyddyn |
cerbyd | awyren, llong awyr, cerbyd, beic, cwch, car, criwser, hofrennydd, beic modur, tryc, roced, llong, lori, fan |
Question | Synonym | Related Words |
---|---|---|
doeth | call | twp, ffôl, hapus |
budr | brwnt | glân, gwyn, du |
cyflym | clou | araf, hir, byr |
bwrw | taro | mwytho, cyffwrdd, cicio |
llefrith | llaeth | cwrw, caws, buwch |
hawdd | rhwydd | anodd, arferol, rhydd |
hynod | mor | tipyn, eithaf, bach |
adnabyddus | enwog | dieithr, cerddor, amlwg |
dolur | poen | moddion, salwch, meddyg |
distaw | tawel | swnllyd, uchel, cyffroes |
mân | bach | mawr, lled, cul |
hogyn | bachgen | menyw, gwraig, myfyriwr |
teyrngar | ffyddlon | celwydd, diafol, duw |
merch | geneth | dyn, disgybl, gwr |
blinedig | cysglyd | egnïol, gwelu, cysgu |
creu | cynhyrchu | dinistrio, torri, adeiladu |
rhyfel | brwydr | heddwch, byddin, milwr |
rhwystro | atal | gadael, caniatáu, eisiau |
edrych | sbio | clywed, methu, dweud |
rhyfedd | anarferol | normal, lliwgar, doeth |
anibendod | llanast | taclus, ystafell, brwnt |
cnoi | brathu | bwyta, yfed, dannedd |
ceisio | trio | llwyddo, methu, cysgu |
ffeindio | canfod | colli, nofio, chwilio |
lleol | agos | pell, estron, gwyrdd |
cwtogi | lleihau | ehangu, plygu, ymestyn |
ehangu | tyfu | neidio, magu, meithrin |
mur | wal | drws, ffenest, to |
diogel | saff | peryglus, agored, rhwystredig |
cyflawni | cwblhau | methu, gwagio, colli |
deunydd | defnydd | dillad, pren, gwydr |
gweiddi | bloeddio | sibrwd, siarad, peswch |
pili pala | glöyn byw | pryfed, prycopyn, morgrugyn |
teisen | cacen | bara, bisced, tost |
digalon | trist | siriol, doniol, hwyl |
cweryla | dadlau | cytuno, cusanu, bloeddio |
hybu | hyrwyddo | digaloni, rhwystro, cynyddu |
hardd | prydferth | hyll, esmwyth, caled |
cymorth | help | rhwystr, moddion, athro |
ffug | afreal | gwirioneddol, ffeithiol, stori |
gwagle | gofod | llawn, awyr, bydysawd |
bad | cwch | awyren, car, hofrennydd |
dryll | gwn | cyllell, cleddyf, arf |
cyfarfod | cwrdd | croesawi, siarad, ffarwelio |
swydd | gwaith | gwyliau, rheswm, grym |
holi | gofyn | ateb, gorymateb, meddwl |
hanfodol | angenrheidiol | diangen, gwreiddiol, gwahanol |
lleoliad | man | sefyllfa, amser, ystafell |
creulon | cas | cwrtais, neis, crac |
oherwydd | achos | rheswm, pam, esboniad |
References
- Young, T.; Hazarika, D.; Poria, S.; Cambria, E. Recent trends in deep learning based natural language processing. IEEE Comput. Intell. Mag. 2018, 13, 55–75.
- Grave, E.; Bojanowski, P.; Gupta, P.; Joulin, A.; Mikolov, T. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018.
- Corpws Cenedlaethol Cymraeg Cyfoes (Corcencc). 2020. Available online: https://github.com/CorCenCC (accessed on 26 July 2021).
- Harris, Z. A Theory of Language and Information: A Mathematical Approach, 1st ed.; Clarendon Press: Oxford, UK, 1991.
- Neale, S.; Donnelly, K.; Watkins, G.; Knight, D. Leveraging lexical resources and constraint grammar for rule-based part-of-speech tagging in Welsh. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018.
- Piao, S.S.; Rayson, P.; Knight, D.; Watkins, G. Towards a Welsh semantic annotation system. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, Miyazaki, Japan, 7–12 May 2018; European Language Resources Association: Reykjavik, Iceland, 2018.
- Piao, S.S.; Neale, S.; Ezeani, I.; Rayson, P.E.; Knight, D.; Donnelly, K. Open Welsh language resources for a corpus annotation framework. In Proceedings of the 10th International Corpus Linguistics Conference, Wales, UK, 23–27 July 2019.
- Welsh Natural Language Toolkit (Wnlt). 2020. Available online: https://sourceforge.net/projects/wnlt-project/ (accessed on 26 July 2021).
- Cunningham, H.; Maynard, D.; Bontcheva, K. Text Processing with GATE (Version 6); University of Sheffield: Sheffield, UK, 2011.
- Jones, D.B.; Robertson, P.; Prys, G. Welsh Language Parts-of-Speech Tagger Api Service. 2015. Available online: http://techiaith.cymru/api/parts-of-speech-tagger-api/?lang=en (accessed on 26 July 2021).
- Jones, D.B.; Robertson, P.; Prys, G. Welsh Language Lemmatizer Api Service. 2015. Available online: http://techiaith.cymru/api/lemmatizer/?lang=en (accessed on 26 July 2021).
- Spasić, I.; Owen, D.; Knight, D.; Artemiou, A. Unsupervised multi–word term recognition in Welsh. In Proceedings of the Celtic Language Technology Workshop, Dublin, Ireland, 19 August 2019; pp. 1–6.
- Spasić, I.; Greenwood, M.; Preece, A.; Francis, N.; Elwyn, G. Flexiterm: A flexible term recognition method. J. Biomed. Semant. 2013, 4, 27.
- Spasić, I. Acronyms as an integral part of multi-word term recognition—A token of appreciation. IEEE Access 2018, 6, 8351–8363.
- Rayson, P.; Archer, D.; Piao, S.; McEnery, T. The UCREL semantic analysis system. In Proceedings of the LREC-04 Workshop, Beyond Named Entity Recognition Semantic Labelling for NLP, Lisbon, Portugal, 25 May 2004; pp. 7–12.
- Welsh National Language Technologies Portal. Terminology Dictionary Widget. 2020. Available online: http://techiaith.cymru/cloud/widgets/terminology-dictionary-widget/?lang=en (accessed on 26 July 2021).
- Welsh Government. Termcymru. 2020. Available online: https://gov.wales/bydtermcymru (accessed on 26 July 2021).
- Jones, D.; Eisele, A. Phrase-based statistical machine translation between English and Welsh. In Proceedings of the 5th SALTMIL Workshop on Minority Languages at the 5th International Conference on Language Resources and Evaluation, Genoa, Italy, 24–26 May 2006.
- Welsh National Language Technologies Portal. Welsh-English Aligner. 2020. Available online: http://techiaith.cymru/translation/aligner/?lang=en (accessed on 26 July 2021).
- Ruder, S.; Vulić, I.; Søgaard, A. A survey of cross-lingual word embedding models. J. Artif. Intell. Res. 2019, 65, 569–631.
- Ezeani, I.; Piao, S.S.; Neale, S.; Rayson, P.; Knight, D. Leveraging pre-trained embeddings for Welsh taggers. In Proceedings of the 4th Workshop on Representation Learning for NLP, Florence, Italy, 2 August 2019; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 270–280.
- Donnelly, K. Kynulliad3: A Corpus of 350,000 Aligned Welsh-English Sentences from the Third Assembly (2007–2011) of the National Assembly for Wales. 2013. Available online: http://cymraeg.org.uk/kynulliad3 (accessed on 26 July 2021).
- Ellis, N.C.; O’Dochartaigh, C.; Hicks, W.; Morgan, M.; Laporte, N. Cronfa Electroneg o Gymraeg (ceg): A 1 Million Word Lexical Database and Frequency Count for Welsh. 2001. Available online: https://www.bangor.ac.uk/canolfanbedwyr/ceg.php.en (accessed on 26 July 2021).
- Scannell, K.P. The crúbadán project: Corpus building for under-resourced languages. In Proceedings of the 3rd Web as Corpus Workshop, Louvain-la-Neuve, Belgium, 15–16 September 2007.
- Tystion Jehofa. 2020. Available online: https://www.jw.org/cy/ (accessed on 26 July 2021).
- Prys, D.; Jones, D.; Roberts, M. Deche and the Welsh national corpus portal. In Proceedings of the First Celtic Language Technology Workshop, Dublin, Ireland, 23 August 2014; pp. 71–75.
- Corpus Crawler. 2020. Available online: https://github.com/google/corpuscrawler (accessed on 26 July 2021).
- golwg360. 2020. Available online: https://golwg360.cymru (accessed on 26 July 2021).
- O’r Pedwar Gwynt. 2020. Available online: https://pedwargwynt.cymru (accessed on 26 July 2021).
- Pobl Caerdydd. 2020. Available online: https://poblcaerdydd.com/ (accessed on 26 July 2021).
- Cylchgrawn Barn. 2020. Available online: https://barn.cymru/ (accessed on 26 July 2021).
- Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to information retrieval. In An Introduction To Information Retrieval; Cambridge University Press: Cambridge, MA, USA, 2008; Volume 151, p. 5.
- Howard, J. Lesson 12. In Practical Deep Learning for Coders; fast.ai; Available online: https://course.fast.ai/ (accessed on 26 July 2021).
- Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781.
- Pennington, J.; Socher, R.; Manning, C.D. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Bojanowski, P.; Grave, E.; Joulin, A.; Mikolov, T. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 2017, 5, 135–146.
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, LA, USA, 21–23 June 2018; pp. 2227–2237.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Tang, D.; Wei, F.; Qin, B.; Yang, N.; Liu, T.; Zhou, M. Sentiment embeddings with applications to sentiment analysis. IEEE Trans. Knowl. Data Eng. 2015, 28, 496–509.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119.
- King, G. Modern Welsh: A Comprehensive Grammar; Routledge: London, UK, 2015.
- Fasttext Word Vectors. 2020. Available online: https://fasttext.cc/docs/en/crawl-vectors.html (accessed on 26 July 2021).
- Schnabel, T.; Labutov, I.; Mimno, D.; Joachims, T. Evaluation methods for unsupervised word embeddings. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 298–307.
- Bakarov, A. A survey of word embeddings evaluation methods. arXiv 2018, arXiv:1801.09536.
- Finkelstein, L.; Gabrilovich, E.; Matias, Y.; Rivlin, E.; Solan, Z.; Wolfman, G.; Ruppin, E. Placing search in context: The concept revisited. ACM Trans. Inf. Syst. 2002, 20, 116–131.
- The Wordsimilarity-353 Test Collection. Available online: http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/ (accessed on 26 July 2021).
- Hill, F.; Reichart, R.; Korhonen, A. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. Comput. Linguist. 2015, 41, 665–695.
- Simlex-999. 2020. Available online: https://fh295.github.io/simlex.html (accessed on 26 July 2021).
- Nelson, D.L.; McEvoy, C.L.; Schreiber, T.A. The university of south florida free association, rhyme, and word fragment norms. Behav. Res. Methods Instrum. Comput. 2004, 36, 402–407.
- Faruqui, M.; Tsvetkov, Y.; Rastogi, P.; Dyer, C. Problems with evaluation of word embeddings using word similarity tasks. arXiv 2016, arXiv:1605.02276.
- Baroni, M.; Dinu, G.; Kruszewski, G. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, MD, USA, 22–27 June 2014; pp. 238–247.
- Almuhareb, A.; Poesio, M. Attribute-based and value-based clustering: An evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; pp. 158–165.
- Almuhareb, A.; Poesio, M. Concept learning and categorization from the web. In Proceedings of the Annual Meeting of the Cognitive Science Society, Stresa, Italy, 21–23 July 2005; Volume 27.
- Turney, P.D. Mining the web for synonyms: Pmi-ir versus lsa on toefl. In European Conference on Machine Learning; Springer: Berlin/Heidelberg, Germany, 2001; pp. 491–502.
- Landauer, T.K.; Dumais, S.T. A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 1997, 104, 211–240.
- Espinosa-Anke, L.; Palmer, G.; Corcoran, P.; Filimonov, M.; Spasić, I.; Knight, D. English-Welsh cross-lingual embeddings. Appl. Sci. 2021, in press.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, Minneapolis, MN, USA, 2–9 June 2019; pp. 4171–4186.
Source | Number of Words |
---|---|
CorCenCC | 1,875,540 |
Welsh Wikipedia | 21,233,177 |
National Assembly for Wales 1999–2006 | 11,527,963 |
National Assembly for Wales 2007–2011 | 8,883,970 |
Cronfa Electroneg o Gymraeg | 1,046,800 |
An Crúbadán | 22,572,066 |
DECHE | 2,126,153 |
BBC Cymru Fyw | 14,791,835 |
Gwerddon | 749,573 |
Welsh-medium websites | 7,388,917 |
The Bible | 749,573 |
Word Embedding Method | Tokenisation | Version | Semantic Correlation |
---|---|---|---|
fastText | naive | skip-gram | 0.0495 |
fastText | naive | CBOW | 0.1164 |
fastText | Gensim | skip-gram | 0.0681 |
fastText | Gensim | CBOW | 0.1326 |
fastText | WNLT | skip-gram | 0.0632 |
fastText | WNLT | CBOW | 0.1108 |
word2vec | naive | skip-gram | 0.0448 |
word2vec | naive | CBOW | 0.1157 |
word2vec | Gensim | skip-gram | 0.0692 |
word2vec | Gensim | CBOW | 0.1285 |
word2vec | WNLT | skip-gram | 0.0604 |
word2vec | WNLT | CBOW | 0.1067 |
Grave’s | ⋯ | ⋯ | 0.0785 |
Word Embedding Method | Tokenisation | Version | Semantic Correlation | Free Association Correlation |
---|---|---|---|---|
fastText | naive | skip-gram | 0.1131 | 0.0263 |
fastText | naive | CBOW | 0.0692 | 0.0415 |
fastText | Gensim | skip-gram | 0.1373 | 0.0471 |
fastText | Gensim | CBOW | 0.0967 | 0.0466 |
fastText | WNLT | skip-gram | 0.1246 | 0.0358 |
fastText | WNLT | CBOW | 0.0941 | 0.0496 |
word2vec | naive | skip-gram | 0.1075 | 0.0265 |
word2vec | naive | CBOW | 0.0700 | 0.0427 |
word2vec | Gensim | skip-gram | 0.1374 | 0.0461 |
word2vec | Gensim | CBOW | 0.0975 | 0.0452 |
word2vec | WNLT | skip-gram | 0.1247 | 0.0321 |
word2vec | WNLT | CBOW | 0.0964 | 0.0491 |
Grave’s | ⋯ | ⋯ | 0.1466 | 0.0546 |
Method | Tokenisation | Version | Purity (Euclidean) | Rand Index (Euclidean) | Entropy (Euclidean) | Purity (Cosine) | Rand Index (Cosine) | Entropy (Cosine) |
---|---|---|---|---|---|---|---|---|
fastText | naive | skip-gram | 0.4860 | 0.8305 | 3.3298 | 0.6542 | 0.9178 | 2.2509 |
fastText | naive | CBOW | 0.3832 | 0.7451 | 4.2892 | 0.5140 | 0.8826 | 3.4872 |
fastText | Gensim | skip-gram | 0.5701 | 0.8811 | 2.8500 | 0.7242 | 0.9319 | 2.1092 |
fastText | Gensim | CBOW | 0.4533 | 0.6958 | 3.9220 | 0.5140 | 0.8984 | 3.1224 |
fastText | WNLT | skip-gram | 0.5794 | 0.8689 | 2.8448 | 0.6636 | 0.9225 | 2.2045 |
fastText | WNLT | CBOW | 0.3972 | 0.7479 | 4.2486 | 0.5514 | 0.8692 | 3.2374 |
Grave’s | ⋯ | ⋯ | 0.2383 | 0.5640 | 5.3069 | 0.4486 | 0.7965 | 3.9540 |
Method | Tokenisation | Version | % Correct |
---|---|---|---|
fastText | naive | skip-gram | 38% |
fastText | naive | CBOW | 32% |
fastText | Gensim | skip-gram | 36% |
fastText | Gensim | CBOW | 30% |
fastText | WNLT | skip-gram | 42% |
fastText | WNLT | CBOW | 30% |
Grave’s | ⋯ | ⋯ | 28% |
Method | Tokenisation | Version | Accuracy |
---|---|---|---|
fastText | naive | skip-gram | 16.64% |
fastText | naive | CBOW | 23.67% |
fastText | Gensim | skip-gram | 18.75% |
fastText | Gensim | CBOW | 28.01% |
fastText | WNLT | skip-gram | 17.97% |
fastText | WNLT | CBOW | 25.37% |
word2vec | naive | skip-gram | 16.66% |
word2vec | naive | CBOW | 23.88% |
word2vec | Gensim | skip-gram | 19.43% |
word2vec | Gensim | CBOW | 28.22% |
word2vec | WNLT | skip-gram | 18.21% |
word2vec | WNLT | CBOW | 25.21% |
Grave’s | ⋯ | ⋯ | 9.00% |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Corcoran, P.; Palmer, G.; Arman, L.; Knight, D.; Spasić, I. Creating Welsh Language Word Embeddings. Appl. Sci. 2021, 11, 6896. https://doi.org/10.3390/app11156896