Developing Core Technologies for Resource-Scarce Nguni Languages
Abstract
1. Introduction
2. Data Resources
2.1. Corpora
2.2. Protocols
2.3. Annotated Data
3. Core Technologies
3.1. Part-of-Speech Taggers
3.2. Lemmatizers
3.3. Morphological Analyzers
3.3.1. Tier 1: Morphological Decomposition
3.3.2. Tier 2: Morphological Tagging
4. Evaluation Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
Token | Lemma | POS (Full Set) | Morphological Analysis |
---|---|---|---|
Umuntu | ntu | N01 | u[NPrePre1]-mu[BPre1]-ntu[NStem] |
ozosiza | siza | REL | o[RelConc1]-zo[Fut]-siz[VRoot]-a[VerbTerm] |
ngemali | mali | ADV | nga[AdvPre]-i[NPrePre9]-mali[NStem] |
kudingeka | dinga | V | ku[SC15]-ding[VRoot]-ek[NeutExt]-a[VerbTerm] |
aqoqe | qoqa | REL | a[RelConc1]-qoq[VRoot]-e[VerbTerm] |
izimali | mali | N10 | i[NPrePre10]-zi[BPre10]-mali[NStem] |
zokukhokhela | khokhela | POSS10 | za[PossConc10]-u[NPrePre15]-ku[BPre15]-khokhel[VRoot]-a[VerbTerm] |
amaphasela | phasela | N06 | a[NPrePre6]-ma[BPre6]-phasela[NStem] |
okudla | dla | REL | oku[RelConc15]-dla[NStem] |
. | . | PUNC | .[Punc] |
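The Morphological Analysis column in the table above follows a regular pattern: morphemes joined by hyphens, each followed by its tag in square brackets. A minimal sketch of how such strings could be split into (morpheme, tag) pairs is shown below. The function name `parse_analysis` and the regular expression are illustrative assumptions, not part of the authors' tooling, and the pattern assumes that morphemes never contain a hyphen or a square bracket.

```python
import re
from typing import List, Tuple

# Assumed format (inferred from the table): morphemes are joined by "-",
# and each morpheme is immediately followed by its tag in square brackets.
MORPHEME_PATTERN = re.compile(r"([^\[\]-]+)\[([^\]]+)\]")


def parse_analysis(analysis: str) -> List[Tuple[str, str]]:
    """Split an analysis string into (morpheme, tag) pairs.

    Hypothetical helper for illustration only.
    """
    return MORPHEME_PATTERN.findall(analysis)


if __name__ == "__main__":
    for morpheme, tag in parse_analysis("ku[SC15]-ding[VRoot]-ek[NeutExt]-a[VerbTerm]"):
        print(morpheme, tag)
```

Applied to the analysis of "Umuntu", the helper returns the three pairs `("u", "NPrePre1")`, `("mu", "BPre1")`, and `("ntu", "NStem")`.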
Language | POS Tag Count (Simplified Set) | POS Tag Count (Full Set) |
---|---|---|
NR | 16 | 95 |
SS | 16 | 102 |
XH | 16 | 105 |
ZU | 16 | 105 |
Instance Number | Left Context | Point of Focus | Right Context | Class | |||
---|---|---|---|---|---|---|---|
1 | - | - | - | n | g | o | = |
2 | - | - | n | g | o | k | = |
3 | - | n | g | o | k | u | = |
4 | n | g | o | k | u | p | o > a*u* |
5 | g | o | k | u | p | h | = |
6 | o | k | u | p | h | a | * |
7 | k | u | p | h | a | t | = |
8 | u | p | h | a | t | h | = |
9 | p | h | a | t | h | e | = |
10 | h | a | t | h | e | l | = |
11 | a | t | h | e | l | e | = |
12 | t | h | e | l | e | n | = |
13 | h | e | l | e | n | e | 0 > an*il* |
14 | e | l | e | n | e | - | = |
15 | l | e | n | e | - | - | ne > 0 |
16 | e | n | e | - | - | - | = |
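The character cells of the instance table above can be reproduced with a sliding window over the padded word. The sketch below assumes three "-" padding symbols on each side and six character cells per instance, both inferred from the table rather than stated by the authors; the name `make_instances` is hypothetical. It generates only the character cells, not the class labels, and does not decide which cell is the point of focus.

```python
from typing import List

PAD = "-"    # padding symbol, as it appears in the table
CONTEXT = 3  # padding symbols added on each side (assumption inferred from the table)
WIDTH = 6    # character cells per instance in the table (assumption)


def make_instances(word: str) -> List[List[str]]:
    """Return one window of WIDTH characters per instance (hypothetical helper)."""
    padded = PAD * CONTEXT + word + PAD * CONTEXT
    return [list(padded[i:i + WIDTH]) for i in range(len(padded) - WIDTH + 1)]


if __name__ == "__main__":
    for number, cells in enumerate(make_instances("ngokuphathelene"), start=1):
        print(number, " | ".join(cells))
```

For the 15-letter word "ngokuphathelene" this yields 16 windows, matching the 16 instances in the table.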
Language | Morpheme Tag Count |
---|---|
NR | 70 |
SS | 68 |
XH | 62 |
ZU | 71 |
Dataset | Lemmatization | POS Tagging (Simplified Set) | POS Tagging (Full Set) | Morpheme Decomposition |
---|---|---|---|---|
NR | 90.35 | 91.54 | 85.28 | 86.71 |
SS | 90.20 | 91.42 | 87.46 | 84.94 |
XH | 92.99 | 95.91 | 93.99 | 94.13 |
ZU | 90.33 | 92.65 | 88.60 | 86.87 |
NCHLT Text Accuracy | | | | |
NR | 80.32 | - | 82.57 | 82.26 |
SS | 81.60 | - | 82.08 | 83.42 |
XH | 79.82 | - | 84.18 | 84.66 |
ZU | 81.56 | - | 83.83 | 85.19 |
Language | Morpheme Decomposition (Instance-Level) | Morpheme Decomposition (Word-Level) | Morpheme Tagging (Instance-Level) | Morpheme Tagging (Word-Level) | Morphological Analysis |
---|---|---|---|---|---|
NR | 94.32 | 86.71 | 93.07 | 83.63 | 84.75 |
SS | 94.21 | 84.94 | 90.70 | 80.61 | 81.48 |
XH | 97.97 | 94.13 | 96.10 | 92.27 | 93.83 |
ZU | 94.60 | 86.87 | 91.77 | 83.46 | 84.37 |
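The table above reports instance-level and word-level figures side by side. One plausible reading, stated here as an assumption rather than the authors' definition, is that instance-level accuracy counts every individual label, while word-level accuracy counts a word as correct only if all of its labels are correct. Under that reading word-level accuracy can never exceed instance-level accuracy when all words have the same number of labels, which is consistent with the direction of the gap in the table. The function `accuracies` below is a hypothetical sketch of that computation.

```python
from typing import List, Tuple


def accuracies(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float]:
    """Return (instance-level, word-level) accuracy as percentages.

    Assumption: each inner list holds the labels of one word, and a word
    counts as correct only if every one of its labels matches.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must contain the same number of words")
    total_labels = correct_labels = correct_words = 0
    for gold_word, pred_word in zip(gold, pred):
        if len(gold_word) != len(pred_word):
            raise ValueError("each word needs the same number of labels in gold and pred")
        matches = sum(g == p for g, p in zip(gold_word, pred_word))
        total_labels += len(gold_word)
        correct_labels += matches
        correct_words += matches == len(gold_word)
    return 100.0 * correct_labels / total_labels, 100.0 * correct_words / len(gold)


if __name__ == "__main__":
    gold = [["=", "*"], ["=", "="]]
    pred = [["=", "*"], ["=", "*"]]
    print(accuracies(gold, pred))  # (75.0, 50.0)
```

In the toy example three of four labels are correct but only one of two words is fully correct, so a single wrong label costs a whole word at word level.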
Part-of-Speech | NR | SS | XH | ZU |
---|---|---|---|---|
Abbreviation | 0 | 9.09 | 0 | 0 |
Adjective | 2.33 | 3.65 | 0 | 3.49 |
Adverb | 8.77 | 6.12 | 1.36 | 3.69 |
Class-indicating demonstrative | 16.67 | 16.45 | 2.11 | 12.86 |
Conjunction | 0.57 | 42.34 | 1.02 | 0 |
Copulative | 41.38 | 21.13 | 21.04 | 15.28 |
Foreign | 0.00 | 70.00 | 6.25 | 33.33 |
Ideophone | 30.00 | 0 | 11.11 | 25.00 |
Interjection | 11.76 | 24.14 | 5.88 | 0.00 |
Noun | 4.36 | 4.52 | 0.72 | 4.39 |
Numerative | 0 | 5.56 | 0 | 0 |
Possessive | 7.34 | 12.15 | 2.05 | 4.96 |
Pronoun | 4.80 | 8.00 | 1.94 | 0 |
Relative | 6.40 | 9.10 | 4.14 | 8.14 |
Verb | 9.37 | 6.93 | 4.57 | 5.66 |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
du Toit, J.S.; Puttkammer, M.J. Developing Core Technologies for Resource-Scarce Nguni Languages. Information 2021, 12, 520. https://doi.org/10.3390/info12120520