Annotating a Low-Resource Language with LLOD Technology: Sumerian Morphology and Syntax
Abstract
1. Introduction
1.1. Linked Open Data for Sumerian
1.2. The MTAAC Project
2. Corpus Data
3. Technical Setup
3.1. CoNLL Format
- ID: Unique identifier composed of side (o/r), line number, and token number.
- FORM: Transliteration of the token in C-ATF format.
- SEGM: Dash-separated morphological segmentation. Affix standardization follows ETCSRI (http://oracc.museum.upenn.edu/etcsri/glossing/).
- XPOSTAG: Part-of-speech and morpheme glosses, sequentially aligned with SEGM. Tags mostly follow ETCSRI.
- HEAD: Head of the current token in dependency syntax, i.e., either the head's ID or 0 (for the root).
- DEPREL: Dependency label of the relation to HEAD, following CoNLL-U specifications.
- MISC: Comments; other content.
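As a rough illustration of how the columns listed above might be read, the following sketch parses one line per token into a dictionary. It assumes that the fields are tab-separated and appear in exactly the order of the list, that empty lines and lines starting with `#` can be skipped, and that the sample values (including the ID `o.1.1`) are invented for demonstration; none of this is confirmed by the text above.

```python
from typing import Dict, List

# Column names as listed above; the order is an assumption of this sketch.
COLUMNS = ["ID", "FORM", "SEGM", "XPOSTAG", "HEAD", "DEPREL", "MISC"]


def parse_line(line: str) -> Dict[str, str]:
    """Split one tab-separated token line into a column -> value mapping."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(
            f"expected {len(COLUMNS)} columns, got {len(fields)}: {line!r}"
        )
    return dict(zip(COLUMNS, fields))


def parse_block(text: str) -> List[Dict[str, str]]:
    """Parse several lines, skipping blank lines and '#' comments (assumed)."""
    rows = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        rows.append(parse_line(line))
    return rows


def root_ids(rows: List[Dict[str, str]]) -> List[str]:
    """IDs of tokens whose HEAD is 0, i.e., the roots."""
    return [row["ID"] for row in rows if row["HEAD"] == "0"]


# Invented sample row; the values are placeholders, not corpus data.
SAMPLE = "o.1.1\tlugal\tlugal\tN\t0\troot\t_\n"
```

Calling `parse_block(SAMPLE)` yields a single dictionary keyed by the seven column names. HEAD is kept as a string because it holds either a token ID or `0`.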
3.2. CoNLL-RDF
3.3. Annotation Workflow
4. Annotating Morphology
4.1. Dictionary-Based Pre-Annotation
4.2. Rule-Based Pre-Annotation with SPARQL
4.3. Application and Evaluation
5. Annotating Syntax
5.1. RDF-Based Pre-Annotation
No. | Rule | Pattern | Example |
---|---|---|---|
1 | Reduce adjective to preceding noun with adjectival modifier relation | NOUN_0 ADJ_CASE ⇒ NOUN_CASE ADJ | nita kalag-ga “strong male” |
2 | Reduce noun in the genitive to preceding noun with genitive modifier relation | NOUN NOUN_GEN ⇒ NOUN NOUN | lugal urim5ki-ma “king of Ur” |
3 | Reduce noun with case marker to preceding noun with no case marker with appositional modifier relation | NOUN_0 NOUN_CASE ⇒ NOUN_CASE NOUN | dinana_DAT nin-a-ni “to Inanna, his lady” |
4 | Reduce noun to preceding noun with case relation | NOUN_0 NOUN_{CASE1+CASE2} ⇒ NOUN_CASE1 NOUN | lugal_ERG urim5ki-ma-ke4 “king of Ur” |
5 | Reduce noun to preceding numeral with numeral modifier relation | NUM_0 NOUN_(CASE) ⇒ NUM_(CASE) NOUN | 3(u) sila3 “thirty sila (measuring unit)” |
6 | Reduce noun in case to following verb with absolutive relation | NOUN_ABS VERB ⇒ NOUN VERB | numun-na-ni he2-eb-til-le-ne “may they end his lineage” |
7 | Reduce noun in case to following verb with case relation | NOUN_CASE VERB ⇒ NOUN VERB | |
8 | Reduce a sequence of numerals to the first | NU NU ⇒ NU NU | |
9 | Render mathematical operators as prepositions (note that rule 9 extends beyond the Shift-Reduce framework by considering non-adjacent elements) | NU minus NU ⇒ NU (minus NU) | |
10 | A numeral interval after time unit (day, month, or year) is analyzed like its numeral modifier | year NU ⇒ year NU | |
11 | Reduce a numeral to its unit of measurement | NU N ⇒ NU N | |
12 | Reduce post-nominal elements to the nominal | N X ⇒ N X | |

Rule 12 is a generic fall-back rule that considers unattached post-nominal elements as appositions.
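To make the reduction step more concrete, the following sketch applies rules 1 and 2 above to a flat token sequence. It is written in plain Python over a small list of tokens rather than over RDF, so it illustrates only the logic of the rules, not the project's implementation. The `Token` class, the relation labels `adjectival_modifier` and `genitive_modifier`, and the case value `ABS` in the usage example are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    idx: int                    # 1-based position in the line
    form: str
    pos: str                    # e.g. "NOUN", "ADJ"
    case: Optional[str] = None  # None means "no case marker"
    head: int = 0               # 0 = not (yet) attached
    deprel: str = "_"           # hypothetical labels, see lead-in


def reduce_once(stack: List[Token]) -> bool:
    """Apply rule 1 or rule 2 to the first matching adjacent pair."""
    for i in range(len(stack) - 1):
        left, right = stack[i], stack[i + 1]
        # Rule 1: caseless noun + case-marked adjective.
        # The case moves to the noun; the adjective is attached to it.
        if (left.pos == "NOUN" and left.case is None
                and right.pos == "ADJ" and right.case is not None):
            left.case, right.case = right.case, None
            right.head, right.deprel = left.idx, "adjectival_modifier"
            del stack[i + 1]    # the reduced token leaves the stack
            return True
        # Rule 2: noun + noun in the genitive.
        if left.pos == "NOUN" and right.pos == "NOUN" and right.case == "GEN":
            right.head, right.deprel = left.idx, "genitive_modifier"
            del stack[i + 1]
            return True
    return False


def pre_annotate(tokens: List[Token]) -> List[Token]:
    """Reduce until no rule fires; tokens are annotated in place."""
    stack = list(tokens)        # shallow copy: same Token objects
    while reduce_once(stack):
        pass
    return tokens
```

For example, `pre_annotate([Token(1, "lugal", "NOUN"), Token(2, "urim5ki-ma", "NOUN", "GEN")])` attaches the second token to the first. For `nita kalag-ga`, the case of the adjective has to be supplied by the caller, for instance `Token(2, "kalag-ga", "ADJ", "ABS")`, where `ABS` is assumed for illustration only.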
5.2. Application and Evaluation
5.3. Limits of Syntactic Pre-Annotation
1. Nominal clause. Clauses that do not contain an independent verbal form might not be parsed correctly in some cases.
   urdu2 lu2-še lugal-zu-u3
   slave man=that=ABS master=your=ABS
   ‘Slave! Is that man your master?’ [29], 716, no. 7
2. Word order. Sumerian normally has an SOV word order, with the verb in final position. However, exceptional right-dislocated clauses are known. Clause boundaries will not be correctly recognized in such cases.
   i3-ĝu10 i3-gu7-e dnisaba-ke4
   fat=my=ABS VP-eat-3SG.A:IPFV Nisaba=ERG
   ‘She will eat my cream, Nisaba.’ [29], 300, no. 27
3. Enclitic copula. The Sumerian copula me can be both independent and enclitic. In the latter case, the analysis of the token in the context of other words is ambiguous, as it contains both nominal and verbal annotation:
   še dub-sar-ne-kam
   barley scribe=PL=GEN=ABS=be:3N.S
   ‘This is barley of the scribes.’
   nagar-me-eš2
   carpenter=ABS=be-3PL.S
   ‘They are carpenters.’ [29], 681-2, nos. 24 and 27
4. Enclitic possessive pronouns. To facilitate subsequent dependency parsing, enclitic possessives are analyzed in terms of their morphosyntactic characteristics, not on the grounds of their semantics. In their function, enclitic possessives are referential, and this could be expressed explicitly with links between possessor and possessum within UD, using the language-specific but popular nmod:poss relation. However, such links cannot be easily integrated into UD-compliant syntactic annotation, as they may easily lead to non-projective trees (i.e., crossing edges):
   sipa-de3-ne / gu2-ne-ne-a / e-ne-ĝar
   shepherd=PL=DAT neck=their=LOC VP-3PL.OO-3SG.A-place-3N.S/DO
   ‘He placed this (as a burden) on the shepherds, on their necks.’ [29], 686, no. 21a
   In this example, the locative argument syntactically depends on the verb; at the same time, the enclitic possessive (glossed as ‘their’) refers to the preceding argument. Therefore, these semantic relations are to be captured in a subsequent processing step akin to anaphor resolution in other languages.
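The notion of crossing edges mentioned under point 4 can be checked mechanically. The sketch below is a generic test over token positions, not part of the annotation pipeline described here: it takes a list of `(dependent, head)` arcs and reports every pair of arcs that cross. The function name and the four-token example are hypothetical, and arcs to the artificial root (head 0) are ignored as a simplification.

```python
from typing import List, Tuple

Arc = Tuple[int, int]  # (dependent position, head position), 1-based


def crossing_edges(arcs: List[Arc]) -> List[Tuple[Arc, Arc]]:
    """Return all pairs of arcs whose spans cross each other.

    Two arcs cross if exactly one endpoint of the second arc lies
    strictly inside the span of the first. Root arcs (head 0) are
    skipped in this sketch.
    """
    spans = [(min(d, h), max(d, h)) for d, h in arcs if h != 0]
    crossings = []
    for i, (a, b) in enumerate(spans):
        for c, d in spans[i + 1:]:
            if a < c < b < d or c < a < d < b:
                crossings.append(((a, b), (c, d)))
    return crossings


# Abstract example with four token positions (not Sumerian data):
# a projective tree, and the same tree plus one additional link.
TREE = [(1, 3), (2, 3), (4, 3), (3, 0)]
TREE_PLUS_LINK = TREE + [(2, 4)]
```

In this abstract example, `crossing_edges(TREE)` is empty, while adding the extra link between positions 2 and 4 makes it cross the arc between positions 1 and 3.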
6. Beyond Syntax
6.1. Annotating Semantics
6.2. Machine Translation
7. Summary
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Chiarcos, C.; McCrae, J.; Cimiano, P.; Fellbaum, C. Towards Open Data for linguistics: Linguistic Linked Data. In New Trends of Research in Ontologies and Lexical Resources: Ideas, Projects, Systems; Oltramari, A., Vossen, P., Qin, L., Hovy, E., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 7–25.
2. Chiarcos, C.; Pagé-Perron, É.; Khait, I.; Schenk, N.; Reckling, L. Towards a Linked Open Data Edition of Sumerian Corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018; European Language Resources Association (ELRA): Miyazaki, Japan, 2018.
3. Buil Aranda, C.; Corby, O.; Das, S.; Feigenbaum, L.; Gearon, P.; Glimm, B.; Harris, S.; Hawke, S.; Herman, I.; Humfrey, N.; et al. SPARQL 1.1 Overview. 2013. Available online: https://www.w3.org/TR/sparql11-overview/ (accessed on 5 June 2016).
4. Alivernini, S.; D’Agostino, F.; Romano, M.; Severini, L. Ur_Namma, an OWL Ontology of a Sumerian Grammar. 2006. Available online: http://www.epistematica.com/2012/05/ur_namma-an-owl-ontology-of-a-sumerian-grammar/ (accessed on 5 June 2016).
5. Jaworski, W. Ontology-Based Knowledge Discovery from Documents in Natural Language. Ph.D. Thesis, Uniwersytet Warszawski, Warszawa, Poland, 2008.
6. Nurmikko-Fuller, T. Telling Ancient tales to Modern Machines: Ontological Representation of Sumerian Literary Narratives. Ph.D. Thesis, University of Southampton, Southampton, UK, 2015.
7. Pagé-Perron, É.; Sukhareva, M.; Khait, I.; Chiarcos, C. Machine Translation and Automated Analysis of the Sumerian Language. In LaTeCH-CLfL Workshop, Association for Computational Linguistics (ACL) Anthology; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 10–16.
8. Cimiano, P.; McCrae, J.; Buitelaar, P. Lexicon Model for Ontologies. Available online: https://www.w3.org/2016/05/ontolex/ (accessed on 5 June 2016).
9. Crofts, N.; Doerr, M.; Gill, T.; Stead, S.; Stiff, M. Definition of the CIDOC Conceptual Reference Model; Version 5.0.4. 2011. Available online: http://old.cidoc-crm.org/docs/cidoc_crm_version_5.0.4.pdf (accessed on 5 June 2016).
10. de Melo, G. Lexvo.org: Language-Related Information for the Linguistic Linked Data Cloud. Semant. Web J. 2015, 6, 393–400.
11. Elliott, T.; Gillies, S. Pleiades: The un-GIS for ancient geography. J. Geogr. Inf. Sci. 2008, 22, 1091–1108.
12. Chiarcos, C.; Sukhareva, M. OLiA—Ontologies of Linguistic Annotation. Semant. Web 2015, 6, 379–386.
13. Goetze, A. Cuneiform Texts from Various Collections; Yale Oriental Series, Babylonian Texts; Yale University Press: New Haven, CT, USA, 2009.
14. Black, J.A.; Cunningham, G.; Ebeling, G.; Flückiger-Hawker, J.; Robson, E.; Taylor, J.; Zólyomi, G. The Electronic Text Corpus of Sumerian Literature. 1998–2006. Available online: http://etcsl.orinst.ox.ac.uk (accessed on 22 February 2015).
15. Molina, M. Syntactic annotation for a Hittite corpus: Problems and principles. In Proceedings of the Workshop on Computational Linguistics and Language Science (CLLS 2016), Moscow, Russia, 26 April 2016.
16. Smith, E. Query-Based Annotation and the Sumerian Verbal Prefixes. Ph.D. Thesis, University of Toronto, Toronto, ON, Canada, 2010.
17. Nivre, J.; Agić, Ž.; Ahrenberg, L.; Aranzabe, M.J.; Asahara, M.; Atutxa, A.; Ballesteros, M.; Bauer, J.; Bengoetxea, K.; Berzak, Y.; et al. Universal Dependencies 1.4. 2016. Available online: http://hdl.handle.net/11234/1-1827 (accessed on 5 June 2016).
18. Bamman, D.; Crane, G.R. The Ancient Greek and Latin Dependency Treebanks. In Language Technology for Cultural Heritage; Springer: New York, NY, USA, 2011; pp. 79–98.
19. Zeldes, A.; Schroeder, C.T. An NLP Pipeline for Coptic. In Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, Berlin, Germany, 11 August 2016; pp. 146–155.
20. Chiarcos, C.; Fäth, C. CoNLL-RDF: Linked corpora done in an NLP-friendly way. In International Conference on Language, Data and Knowledge; Springer: New York, NY, USA, 2017; pp. 74–88.
21. Chiarcos, C.; Schenk, N. The ACoLi CoNLL Libraries: Beyond tab-separated values. In Proceedings of the 11th Language Resources and Evaluation Conference (LREC-2018), Miyazaki, Japan, 7–12 May 2018; European Language Resources Association (ELRA): Miyazaki, Japan, 2018.
22. Owen, D.I. Garshana Studies; CDL Press: Bethesda, MD, USA, 2011.
23. Das, S.; Sundara, S.; Cyganiak, R. R2RML: RDB to RDF Mapping Language; Technical Report; 2012. Available online: https://www.w3.org/TR/r2rml/ (accessed on 5 June 2016).
24. Tinney, S. Sumerian Lemmatization Primer; Technical Report; 2017. Available online: http://oracc.museum.upenn.edu/doc/help/languages/sumerian/sumerianprimer/index.html (accessed on 5 June 2016).
25. Tablan, V.; Peters, W.; Maynard, D.; Cunningham, H.; Bontcheva, K. Creating tools for morphological analysis of Sumerian. In Proceedings of the 5th Language Resources and Evaluation Conference (LREC-2006), Genoa, Italy, 22–28 May 2006.
26. Jaworski, W. Contents modelling of neo-Sumerian Ur III economic text corpus. In Proceedings of the 22nd International Conference on Computational Linguistics—Volume 1, Manchester, UK, 18–22 August 2008; Association for Computational Linguistics: Stroudsburg, PA, USA, 2008; pp. 369–376.
27. Tinney, S. Annotation of Sumerian Syntax; 2017. Available online: http://oracc.museum.upenn.edu/doc/help/languages/sumerian/syntax/index.html (accessed on 14 September 2018).
28. Nivre, J.; Hall, J.; Nilsson, J.; Chanev, A.; Eryigit, G.; Kübler, S.; Marinov, S.; Marsi, E. MaltParser: A language-independent system for data-driven dependency parsing. Nat. Lang. Eng. 2007, 13, 95–135.
29. Jagersma, A.H. A Descriptive Grammar of Sumerian. Ph.D. Thesis, Faculty of the Humanities, Leiden University, Leiden, The Netherlands, 2010.
30. Hayes, J.L. A Manual of Sumerian Grammar and Texts. Second Revised and Expanded Edition; Number 5 in Artanes, Undena Publications: Malibu, CA, USA, 2000.
31. Björkelund, A.; Bohnet, B.; Hafdell, L.; Nugues, P. A high-performance syntactic and semantic dependency parser. In Proceedings of the Coling 2010: 23rd International Conference on Computational Linguistics, Beijing, China, 23–27 August 2010; pp. 33–36.
32. Koehn, P. Statistical Machine Translation; Cambridge University Press: Cambridge, UK, 2009.
33. Klein, G.; Kim, Y.; Deng, Y.; Senellart, J.; Rush, A. OpenNMT: Open-Source Toolkit for Neural Machine Translation. In Proceedings of the ACL 2017, System Demonstrations, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 67–72.
Training Set (Tokens) | Correct (% of 2000 Tokens) | None (% of 2000 Tokens) | Incorrect (% of 2000 Tokens) |
---|---|---|---|
1000 | 48.0 | 50.4 | 1.7 |
2000 | 63.9 | 33.3 | 2.8 |
5000 | 71.9 | 19.7 | 8.5 |
10,000 | 77.7 | 16.9 | 5.5 |
15,000 | 81.7 | 12.1 | 6.3 |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Chiarcos, C.; Khait, I.; Pagé-Perron, É.; Schenk, N.; Jayanth; Fäth, C.; Steuer, J.; Mcgrath, W.; Wang, J. Annotating a Low-Resource Language with LLOD Technology: Sumerian Morphology and Syntax. Information 2018, 9, 290. https://doi.org/10.3390/info9110290