3. Related Work
Existing approaches to the problem of tokenization and word segmentation can be largely divided into rule-based and data-driven methods. Data-driven systems may be further subdivided into lexicon-based systems and those employing statistical language models or machine learning.
In space-delimited languages, rule-based tokenizers—such as the Stanford Tokenizer (https://nlp.stanford.edu/software/tokenizer.html; accessed on 26 September 2019) [14]—are sufficient for most applications. In languages where word boundaries are not explicitly marked in text (such as Chinese and Japanese), on the other hand, word segmentation is a challenging task that has received a great deal of attention from the research community. For such languages, a variety of data-driven word segmentation systems have been proposed. Among dictionary-based algorithms, one of the most popular approaches is the longest match method (also referred to as the maximum matching algorithm or MaxMatch) [15] and its variations [16,17].
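For illustration, the following is a minimal sketch of the greedy longest match procedure; the toy lexicon (built from the Ainu phrase *sine an to ta* appearing in Table 4) and the `max_len` cap are assumptions made for the example, not parameters taken from the cited works:

```python
def max_match(text, lexicon, max_len=8):
    """Greedy longest-match (MaxMatch) sketch: at each position, take the
    longest lexicon entry matching the remaining input, falling back to a
    single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

# Toy example (hypothetical lexicon):
# max_match("sineantota", {"sine", "an", "to", "ta", "sineanto"})
# -> ['sineanto', 'ta']
```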
In more recent work, however, statistical and machine learning methods prevail [18,19,20,21,22]. Furthermore, as in many other Natural Language Processing tasks, the past few years have witnessed increasing interest in artificial neural networks among researchers studying word segmentation, especially for Chinese. A substantial part of the advancements in this area stems from the use of large external resources, such as raw text corpora, for pretraining neural models [23,24,25,26,27]. Unfortunately, such large-scale data are not available for many lesser-studied languages, including Ainu. For Japanese and Chinese, word segmentation is sometimes modelled jointly with part-of-speech tagging, as the output of the latter task can provide useful information to the segmenter [21,28,29,30].
Outside of the East Asian context, word segmentation research focuses mainly on languages with complex morphology and/or extensive compounding—such as Finnish, Turkish, German, Arabic and Hebrew—where splitting coarse-grained surface forms into smaller units leads to a significant reduction in vocabulary size and thus a lower proportion of out-of-vocabulary words [31,32,33,34,35]. Apart from that, even in languages normally using explicit word delimiters, there exist special types of text specific to the web domain, such as Uniform Resource Locators (URLs) and hashtags, whose analysis requires the application of a word segmentation procedure [35,36].
In 2016, Grant Jenks released WordSegment—a Python module for word segmentation utilizing a Stupid Backoff model (http://www.grantjenks.com/docs/wordsegment/; accessed on 26 September 2019). Owing to its relatively low computational cost, Stupid Backoff [37] is well suited to working with extremely large models, such as Google's trillion-word corpus (https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html; accessed on 26 September 2019), which serves as WordSegment's default training data. In terms of accuracy, however, other language modelling methods—in particular the approach proposed by Kneser and Ney [38] and enhanced by Chen and Goodman [39]—have proved to perform better, especially with smaller amounts of data [37]. For that reason, in this research, apart from comparing our word segmentation algorithm to WordSegment, we carried out additional experiments with a segmentation algorithm based on an n-gram model with modified Kneser-Ney smoothing. In the context of word segmentation, Kneser-Ney smoothing has previously been used by Doval and Gómez-Rodríguez [35].
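As a rough sketch of the Stupid Backoff scoring function (our notation: `counts` is assumed to map word n-gram tuples of all orders to corpus frequencies; the fixed discount of 0.4 follows Brants et al. [37]):

```python
def stupid_backoff(ngram, counts, total_words, alpha=0.4):
    """Stupid Backoff score: relative frequency if the n-gram was seen,
    otherwise the score of the shorter context, discounted by alpha.
    Returns a score, not a normalized probability."""
    if len(ngram) == 1:
        return counts.get(ngram, 0) / total_words      # unigram base case
    if counts.get(ngram, 0) > 0:
        # the (n-1)-gram prefix is necessarily present if the n-gram is
        return counts[ngram] / counts[ngram[:-1]]
    return alpha * stupid_backoff(ngram[1:], counts, total_words, alpha)
```

A segmenter can then score candidate segmentations as products of such scores and keep the highest-scoring one; the experiments in Section 6 also vary the backoff factor per backoff step (e.g., applying a small factor only when backing off to unigrams) instead of keeping it fixed.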
Apart from models concerned directly with words, a widely practised approach to word segmentation is to define it as a character sequence labelling task, in which each character is assigned a tag representing its position relative to word boundaries. While early works in this category relied on "traditional" classification techniques, such as maximum entropy models [40] and Conditional Random Fields [41], recent studies actively explore neural architectures [23,27,28,30,42]. In 2018, Shao et al. [43] released a language-independent character sequence tagging model based on recurrent neural networks with a Conditional Random Fields interface, designed for performing word segmentation in the Universal Dependencies framework. It obtained state-of-the-art accuracies on a wide range of languages. One of the key components of their methodology (originally proposed in [30]) is the concatenated n-gram character representation, which offers a significant performance boost over conventional character embeddings without resorting to external data sources. We used their implementation in the experiments described later in this paper, in order to verify how a character-based neural model performs under extremely low-resource conditions, such as those of the Ainu language, and how it compares with segmenters utilizing lexical n-grams, including ours.
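To make the labelling scheme concrete, here is a minimal encoder for the widely used B/I/E/S position tags (the exact tag inventories differ somewhat between the cited works; the Ainu example *sir pirka* is taken from Table 4):

```python
def to_position_tags(words):
    """Encode a gold segmentation as per-character position tags:
    S for a single-character word; B/I/E mark the beginning, inside
    and end of a multi-character word."""
    chars, tags = [], []
    for w in words:
        chars.extend(w)
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return chars, tags

# to_position_tags(["sir", "pirka"])
# -> (['s','i','r','p','i','r','k','a'],
#     ['B','I','E','B','I','I','I','E'])
```

A tagger trained on such sequences recovers a segmentation by cutting the character stream after every E or S tag.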
To address the problem of word segmentation in the Ainu language, Ptaszynski and Momouchi [44] proposed a segmenter based on the longest match method. Later, Ptaszynski et al. [45] investigated the possibility of improving its performance by expanding the dictionary base used in the process. Nowakowski et al. [46] developed a lexicon-based segmentation algorithm maximizing mean token length. Finally, Nowakowski et al. [47] proposed a segmenter searching for the minimal sequence of n-grams matching the input string, an early and less efficient version of the MiNgMatch algorithm presented in this paper.
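The following dynamic-programming sketch illustrates only the underlying idea of such minimal n-gram matching—covering the input with as few known n-grams as possible. It is a simplified toy, not the MiNgMatch implementation evaluated in this paper, and the lexicon interface (unspaced surface string mapped to its segmented form) is our assumption:

```python
def min_ngram_cover(text, ngrams):
    """Cover `text` with the fewest entries from `ngrams`, a dict mapping
    unspaced surface strings to segmented forms (e.g., 'sirpirka' ->
    'sir pirka'). Returns the segmentation using the minimal number of
    n-grams, or None if no full cover exists."""
    INF = float("inf")
    # best[i] = (n-grams used to cover text[:i], backpointer)
    best = [(0, None)] + [(INF, None)] * len(text)
    for i in range(len(text)):
        if best[i][0] == INF:
            continue
        for surface, segmented in ngrams.items():
            if text.startswith(surface, i):
                j = i + len(surface)
                if best[i][0] + 1 < best[j][0]:
                    best[j] = (best[i][0] + 1, (i, segmented))
    if best[len(text)][0] == INF:
        return None
    parts, j = [], len(text)
    while j > 0:                      # follow backpointers to the start
        i, segmented = best[j][1]
        parts.append(segmented)
        j = i
    return " ".join(reversed(parts))

# Toy example: one 3-gram plus one 1-gram beat four separate 1-grams.
# min_ngram_cover("sineantota",
#                 {"sineanto": "sine an to", "sine": "sine",
#                  "an": "an", "to": "to", "ta": "ta"})
# -> 'sine an to ta'
```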
6. Results and Discussion
The results of the evaluation experiments with our algorithm are presented in Table 5. The variant without a limit on the number of n-grams per input segment produces unbalanced results (especially on SYOS), with relatively low Precision. After setting the limit to 2, Precision improves at the cost of a drop in Recall. The F-score improves for SYOS, while on AKJ there is a very slight drop.
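For reference, a minimal sketch of one common way to compute segmentation Precision, Recall and F-score, comparing predicted and gold words as character spans; the exact scorer used in our experiments is not reproduced here, so treat this as an assumption about the protocol:

```python
def word_prf(gold_words, pred_words):
    """A predicted word counts as correct when both of its boundaries
    match the gold standard, i.e., its (start, end) character offsets
    coincide with a gold word's offsets."""
    def spans(words):
        out, i = set(), 0
        for w in words:
            out.add((i, i + len(w)))
            i += len(w)
        return out
    gold, pred = spans(gold_words), spans(pred_words)
    correct = len(gold & pred)
    p = correct / len(pred) if pred else 0.0
    r = correct / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# word_prf("sine an to ta".split(), "sineanto ta".split())
# -> (0.5, 0.25, 0.333...)
```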
Table 6 shows the results of the experiments with the Stupid Backoff model. When no backoff factor is applied, the results for both test sets are similar to those of the MiNgMatch Segmenter without the limit on n-grams per input segment. Setting the backoff factor to an appropriate value allows for a significant improvement in Precision and F-score (and, in some cases, small improvements in Recall). For the F-score, it is better to set a low backoff factor (e.g., 0.09) for 1-grams only than to set it to a fixed value for all backoff steps (e.g., 0.4, as Brants et al. [37] did). A backoff factor of 0.4 yields a significant improvement in Precision with higher-order n-gram models, but at the same time Recall drops drastically and overall performance deteriorates. For models with an n-gram order of 3 or higher, the backoff factor has a bigger impact on the results than further increasing the order of n-grams included in the model. A comparison with the results yielded by MiNgMatch shows that limiting the number of n-grams per input segment is more effective than Stupid Backoff as a method for improving the precision of the segmentation process, as it leads to a much smaller drop in Recall.
The results of the experiments with models employing modified Kneser-Ney smoothing are shown in Table 7. They achieve higher Precision than both other types of n-gram models. Nevertheless, due to very low Recall, their overall results are low.
The results obtained by the Universal Segmenter are presented in Table 8. The default model (regardless of the kind of character representations used—conventional character embeddings or concatenated n-gram vectors) learns from the training data that the first and last characters of a word (corresponding to the B, E and S tags) are always adjacent either to the boundary of a space-delimited segment or to a punctuation mark. As a result, the model separates punctuation from alphanumeric strings found in the input, but never applies further segmentation to them.
US-ISP models perform better, but still notably worse than the lexical n-gram models (especially on SYOS). Unlike with the default settings, the model trained on data without whitespace learns to predict word boundaries within strings of alphanumeric characters. However, when spaces are present in the test data, they impede the segmentation process rather than supporting it: as shown in Table 9, if we only take into account the word boundaries not already indicated in the raw test set, the model makes more correct predictions on data from which all whitespace has been removed.
Models with multi-word tokens achieve significantly better results. The Precision of the US-MWTs model is on par with the segmenter applying Kneser-Ney smoothing, while maintaining relatively high Recall. It yields lower Recall than the model with randomly generated multi-word tokens, but its F-score is higher due to better Precision.
With the exception of the US-ISP model on SYOS, all variants of the neural segmenter achieved their best performance with concatenated 9-gram vectors. This contrasts with the results reported by Shao et al. [30] for Chinese, where in most cases there was no further improvement beyond 3-grams. This behaviour is a consequence of differences between writing systems: words in Chinese are on average composed of fewer characters than in languages using alphabetic scripts. Due to a much larger character set, hanzi characters are also more informative for word segmentation [43], hence the better performance of models using shorter context.
Author Contributions
Conceptualization, M.P., F.M. and K.N.; methodology, M.P., F.M. and K.N.; software, K.N.; validation, F.M.; formal analysis, K.N.; investigation, K.N.; resources, F.M.; data curation, M.P. and K.N.; writing—original draft preparation, K.N.; writing—review and editing, M.P. and F.M.; visualization, K.N.; supervision, F.M. and M.P.; project administration, F.M. and M.P.
Funding
This research received no external funding.
Acknowledgments
The authors would like to thank the anonymous reviewers for their constructive feedback. We also thank Christopher Bozek for diligent proofreading of the manuscript. We are also grateful to Jagna Nieuważny and Ali Bakdur for useful discussions.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Shibatani, M. The Languages of Japan; Cambridge University Press: London, UK, 1990.
- Tamura, S. Ainugo Chitose Hōgen Jiten: The Ainu-Japanese Dictionary: Saru Dialect; Sōfūkan: Tokyo, Japan, 1998.
- Refsing, K. The Ainu Language. The Morphology and Syntax of the Shizunai Dialect; Aarhus University Press: Aarhus, Denmark, 1986.
- Hokkaidō Utari Kyōkai [Hokkaido Ainu Association]. Akor Itak [our language]; Hokkaidō Utari Kyōkai: Sapporo, Japan, 1994.
- Nakagawa, H. Ainugo Chitose Hōgen Jiten [dictionary of the Chitose dialect of Ainu]; Sōfūkan: Tokyo, Japan, 1995.
- National Institute for Japanese Language and Linguistics. A Topical Dictionary of Conversational Ainu. Available online: http://ainutopic.ninjal.ac.jp (accessed on 25 August 2017).
- Kirikae, H. Ainu ni yoru Ainugo hyōki [transcription of the Ainu language by Ainu people]. Koku-bungaku kaishaku to kanshō 1997, 62, 99–107.
- Nakagawa, H. Ainu-jin ni yoru Ainugo hyōki e no torikumi [efforts to transcribe the Ainu language by Ainu people]. In Hyōki no Shūkan no Nai Gengo no Hyōki = Writing Unwritten Languages; Shiohara, A., Kodama, S., Eds.; Tōkyō gaikoku-go daigaku, Ajia/Afurika gengo bunka kenkyūjo: Tokyo, Japan, 2006; pp. 1–44.
- Endō, S. Nabesawa Motozō ni yoru Ainugo no kana hyōki taikei: Kokuritsu Minzoku-gaku Hakubutsukan shozō hitsu-roku nōto kara [Ainu language notation method used by Motozo Nabesawa: from the written notes held by the National Museum of Ethnology]. Kokuritsu Minzoku-gaku Hakubutsukan chōsa hōkoku 2016, 134, 41–66.
- Sunasawa, K. Ku Sukup Oruspe [my life's story]; Miyama Shobō: Sapporo, Japan, 1983.
- Satō, T. Ainugo Bunpō no Kiso [basics of Ainu grammar]; Daigaku Shorin: Tokyo, Japan, 2008.
- The Foundation for Ainu Culture. Chūkyū Ainugo—Saru [intermediate Ainu course—Saru dialect]; The Foundation for Ainu Culture: Sapporo, Japan, 2014.
- Sapporo Gakuin Daigaku, Jinbungaku-bu [Sapporo Gakuin University, Faculty of Humanities]. Ainu Bunka ni Manabu [Learning from Ainu Culture]; Sapporo Gakuin Daigaku Seikatsu Kyōdō Kumiai: Ebetsu, Japan, 1990.
- Manning, C.; Surdeanu, M.; Bauer, J.; Finkel, J.; Bethard, S.; McClosky, D. The Stanford CoreNLP Natural Language Processing Toolkit. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Baltimore, MD, USA, 23–24 June 2014; pp. 55–60.
- Wu, Z.; Tseng, G. Chinese Text Segmentation for Text Retrieval: Achievements and Problems. J. Am. Soc. Inf. Sci. 1993, 44, 532–542.
- Nagata, M. A Self-Organizing Japanese Word Segmenter using Heuristic Word Identification and Re-estimation. In Proceedings of the Fifth Workshop on Very Large Corpora, Beijing and Hong Kong, China, 18–20 August 1997.
- Sassano, M. Deterministic Word Segmentation Using Maximum Matching with Fully Lexicalized Rules. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Volume 2: Short Papers, Gothenburg, Sweden, 3–7 April 2014; Association for Computational Linguistics: Gothenburg, Sweden, 2014; pp. 79–83.
- Papageorgiou, C.P. Japanese Word Segmentation by Hidden Markov Model. In Proceedings of the Workshop on Human Language Technology, HLT '94, 8–11 March 1994; Association for Computational Linguistics: Stroudsburg, PA, USA, 1994; pp. 283–288.
- Palmer, D.D. A Trainable Rule-Based Algorithm for Word Segmentation. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, Madrid, Spain, 11–12 July 1997; pp. 321–328.
- Peng, F.; Feng, F.; McCallum, A. Chinese Segmentation and New Word Detection Using Conditional Random Fields. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, Geneva, Switzerland, 23–27 August 2004; Association for Computational Linguistics: Stroudsburg, PA, USA, 2004.
- Kudo, T.; Yamamoto, K.; Matsumoto, Y. Applying Conditional Random Fields to Japanese Morphological Analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain, 25–26 July 2004; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 230–237.
- Neubig, G.; Nakata, Y.; Mori, S. Pointwise Prediction for Robust, Adaptable Japanese Morphological Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; Association for Computational Linguistics: Portland, OR, USA, 2011; pp. 529–533.
- Pei, W.; Ge, T.; Chang, B. Max-Margin Tensor Neural Network for Chinese Word Segmentation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, MD, USA, 22–27 June 2014; Association for Computational Linguistics: Baltimore, MD, USA, 2014; pp. 293–303.
- Cai, D.; Zhao, H. Neural Word Segmentation Learning for Chinese. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; pp. 409–420.
- Yang, J.; Zhang, Y.; Dong, F. Neural Word Segmentation with Rich Pretraining. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, BC, Canada, 30 July–4 August 2017; pp. 839–849.
- Yang, J.; Zhang, Y.; Liang, S. Subword Encoding in Lattice LSTM for Chinese Word Segmentation. arXiv 2018, arXiv:1810.12594.
- Ma, J.; Ganchev, K.; Weiss, D. State-of-the-art Chinese Word Segmentation with Bi-LSTMs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4902–4908.
- Zheng, X.; Chen, H.; Xu, T. Deep Learning for Chinese Word Segmentation and POS Tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA, 18–21 October 2013; pp. 647–657.
- Morita, H.; Kawahara, D.; Kurohashi, S. Morphological Analysis for Unsegmented Languages using Recurrent Neural Network Language Model. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, 17–21 September 2015; pp. 2292–2297.
- Shao, Y.; Hardmeier, C.; Tiedemann, J.; Nivre, J. Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF. In Proceedings of the Eighth International Joint Conference on Natural Language Processing, Taipei, Taiwan, 27 November–1 December 2017.
- Monroe, W.; Green, S.; Manning, C.D. Word Segmentation of Informal Arabic with Domain Adaptation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, MD, USA, 22–27 June 2014; Association for Computational Linguistics: Baltimore, MD, USA, 2014; pp. 206–211.
- Ma, J.; Henrich, V.; Hinrichs, E. Letter Sequence Labeling for Compound Splitting. In Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, Berlin, Germany, 11 August 2016; Association for Computational Linguistics: Berlin, Germany, 2016; pp. 76–81.
- Björkelund, A.; Falenska, A.; Yu, X.; Kuhn, J. IMS at the CoNLL 2017 UD Shared Task: CRFs and Perceptrons Meet Neural Networks. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, Vancouver, BC, Canada, 3–4 August 2017; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 40–51.
- Tuggener, D. Proceedings of the 3rd Swiss Text Analytics Conference—SwissText 2018, CEUR Workshop Proceedings, Winterthur, Switzerland, 12–13 June 2018; ZHAW Zurich University of Applied Sciences, Institute of Applied Information Technology (InIT): Winterthur, Switzerland, 2018; Volume 2226, pp. 42–49.
- Doval, Y.; Gómez-Rodríguez, C. Comparing Neural- and N-Gram-Based Language Models for Word Segmentation. J. Assoc. Inf. Sci. Technol. 2019, 70, 187–197.
- Wang, K.; Thrasher, C.; Hsu, B.J.P. Web Scale NLP: A Case Study on URL Word Breaking. In Proceedings of the 20th International Conference on World Wide Web, WWW '11, Hyderabad, India, 28 March–1 April 2011; ACM: New York, NY, USA, 2011; pp. 357–366.
- Brants, T.; Popat, A.C.; Xu, P.; Och, F.J.; Dean, J. Large Language Models in Machine Translation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, 28–30 June 2007; Association for Computational Linguistics: Prague, Czech Republic, 2007; pp. 858–867.
- Kneser, R.; Ney, H. Improved backing-off for M-gram language modeling. In Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing, Detroit, MI, USA, 9–12 May 1995; pp. 181–184.
- Chen, S.F.; Goodman, J. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics, ACL '96, 23–28 June 1996; Association for Computational Linguistics: Stroudsburg, PA, USA, 1996; pp. 310–318.
- Xue, N. Chinese Word Segmentation as Character Tagging. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan, 11–12 July 2003; pp. 29–48.
- Zhao, H.; Huang, C.N.; Li, M.; Lu, B.L. Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling. In Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, Wuhan, China, 1–3 November 2006; Huazhong Normal University: Wuhan, China, 2006; pp. 87–94.
- Huang, W.; Cheng, X.; Chen, K.; Wang, T.; Chu, W. Toward Fast and Accurate Neural Chinese Word Segmentation with Multi-Criteria Learning. arXiv 2019, arXiv:1903.04190.
- Shao, Y.; Hardmeier, C.; Nivre, J. Universal Word Segmentation: Implementation and Interpretation. Trans. Assoc. Comput. Linguist. 2018, 6, 421–435.
- Ptaszynski, M.; Momouchi, Y. Part-of-Speech Tagger for Ainu Language Based on Higher Order Hidden Markov Model. Expert Syst. Appl. 2012, 39, 11576–11582.
- Ptaszynski, M.; Ito, Y.; Nowakowski, K.; Honma, H.; Nakajima, Y.; Masui, F. Combining Multiple Dictionaries to Improve Tokenization of Ainu Language. In Proceedings of the 31st Annual Conference of the Japanese Society for Artificial Intelligence, Tsukuba, Japan, 13–15 November 2017.
- Nowakowski, K.; Ptaszynski, M.; Masui, F. Improving Tokenization, Transcription Normalization and Part-of-speech Tagging of Ainu Language through Merging Multiple Dictionaries. In Proceedings of the 8th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Workshop on Language Technology for Less Resourced Languages (LT-LRL), Poznań, Poland, 17–19 November 2017; pp. 317–321.
- Nowakowski, K.; Ptaszynski, M.; Masui, F. Word n-gram based tokenization for the Ainu language. In Proceedings of the International Workshop on Modern Science and Technology (IWMST 2018), Wuhan, China, 25–26 October 2018; pp. 58–69.
- Chiri, Y. Ainu Shin-yōshū [collection of Ainu mythic epics]; Kyōdo Kenkyūsha: Tokyo, Japan, 1923.
- Kirikae, H. Ainu Shin-yōshū Jiten: Tekisuto, Bumpō Kaisetsu Tsuki [lexicon to Yukie Chiri's Ainu Shin-yōshū with text and grammatical notes]; Daigaku Shorin: Tokyo, Japan, 2003.
- Bugaeva, A.; Endō, S.; Kurokawa, S.; Nathan, D. A Talking Dictionary of Ainu: A New Version of Kanazawa's Ainu Conversational Dictionary. Available online: http://lah.soas.ac.uk/projects/ainu/ (accessed on 25 November 2015).
- Jinbō, K.; Kanazawa, S. Ainugo Kaiwa Jiten [Ainu conversational dictionary]; Kinkōdō Shoseki: Tokyo, Japan, 1898.
- Nakagawa, H.; Bugaeva, A.; Kobayashi, M.; Kimura, K.; Akasegawa, S. Glossed Audio Corpus of Ainu Folklore. Available online: http://ainucorpus.ninjal.ac.jp (accessed on 10 December 2017).
- Momouchi, Y.; Kobayashi, R. Dictionaries and Analysis Tools for the Componential Analysis of Ainu Place Name (in Japanese). Eng. Res. Bull. Grad. Sch. Eng. Hokkai-Gakuen Univ. 2010, 10, 39–49.
- Chiba University Graduate School of Humanities and Social Sciences. Ainugo Mukawa Hōgen Nihongo—Ainugo Jiten [Japanese–Ainu dictionary for the Mukawa dialect of Ainu]. Available online: http://cas-chiba.net/Ainu-archives/index.html (accessed on 5 February 2017).
- Nibutani Ainu Culture Museum. Ainu Language & Ainu Oral Literature. Available online: http://www.town.biratori.hokkaido.jp/biratori/nibutani/culture/language/ (accessed on 11 December 2017).
- The Ainu Museum. Ainu-go Ākaibu [Ainu Language Archive]. Available online: http://ainugo.ainu-museum.or.jp/ (accessed on 15 August 2018).
- Chiri, M. Bunrui Ainu-go jiten. Dai-ikkan: shokubutsu-hen [dictionary of Ainu, vol. I: plants]; Nihon Jōmin Bunka Kenkyūsho: Tokyo, Japan, 1953.
- Kayano, S. Kayano Shigeru no Ainugo Jiten [Shigeru Kayano's Ainu dictionary]; Sanseidō: Tokyo, Japan, 1996.
- Bugaeva, A. Southern Hokkaido Ainu. In The Languages of Japan and Korea; Tranter, N., Ed.; Routledge: London, UK, 2012; pp. 461–509.
- Norvig, P. Natural language corpus data. In Beautiful Data; Segaran, T., Hammerbacher, J., Eds.; O'Reilly Media: Sebastopol, CA, USA, 2009; pp. 219–242.
- Katayama, T. Ainu shin-yōshū o yomitoku [studying the Ainu shin-yōshū]; Katayama Institute of Linguistic and Cultural Research: Tokyo, Japan, 2003.
Table 1. Statistics of Ainu text collections and dictionaries used as the training data.

| Text Collection | SYOStrain | TDOA | GACF | MOPL | MUKA | NIBU | AASI | AAJIhw | Overall |
|---|---|---|---|---|---|---|---|---|---|
| Characters (excluding spaces) | 36,780 | 54,545 | 82,436 | 26,872 | 317,099 | 459,253 | 803,945 | 131,449 | 1,912,379 |
| Segments | 8786 | 12,978 | 22,559 | 9246 | 71,232 | 122,314 | 218,069 | 16,107 | 481,291 |
| Avg. segment length | 4.186 | 4.203 | 3.654 | 2.906 | 4.452 | 3.755 | 3.687 | 8.161 | 3.973 |
Table 2. A fragment from the Talking Dictionary of Ainu (TDOA)/Ainugo kaiwa jiten (AKJ) dataset.

| Source | Text |
|---|---|
| Ainugo kaiwa jiten [51] | Nepka ayep an shiri he an? |
| TDOA [50] | nep ka a= ye p an siri an? |
| AKJ | nepka ayep an siri an? |
| Translation [50] | Is there something you (wanted) to say? |
Table 3. Statistics of the samples used for evaluation and their modern transcription equivalents.

| Data | Characters (Excluding Spaces) | Segments | Avg. Segment Length | OoV * Rate |
|---|---|---|---|---|
| SYOS | 2667 | 441 | 6.070 | 0.259 |
| Kirikae [49] | | 643 | 4.163 | 0.034 |
| AKJ | 4840 | 1143 | 4.234 | 0.044 |
| Bugaeva et al. [50] | | 1259 | 3.844 | 0.005 |
Table 4. Operations on training data for the Universal Segmenter.

Original text [48]:
Shineanto ta shirpirka kusu [...]
("One day, since the weather was nice [...]")

Modernized transcription [49]:
Sine an to ta sir pirka kusu [...]

Training data (in CoNLL-U format *) with multi-word tokens:
1-3 sineanto _ _ _ _ _ _ _ _
1 sine _ _ _ _ _ _ _ _
2 an _ _ _ _ _ _ _ _
3 to _ _ _ _ _ _ _ _
4 ta _ _ _ _ _ _ _ _
5-6 sirpirka _ _ _ _ _ _ _ _
5 sir _ _ _ _ _ _ _ _
6 pirka _ _ _ _ _ _ _ _
7 kusu _ _ _ _ _ _ _ _

Training data (in CoNLL-U format *) with multi-token words:
1 sine an to _ _ _ _ _ _ _ _
2 ta _ _ _ _ _ _ _ _
3 sir pirka _ _ _ _ _ _ _ _
4 kusu _ _ _ _ _ _ _ _
Table 5. Evaluation results—MiNgMatch Segmenter (best results in bold). Column headings give the maximum n-gram order and the test set.

| Max. n-grams per segment | Metric | 2 (SYOS) | 2 (AKJ) | 3 (SYOS) | 3 (AKJ) | 4 (SYOS) | 4 (AKJ) | 5 (SYOS) | 5 (AKJ) |
|---|---|---|---|---|---|---|---|---|---|
| Max. * | Precision | 0.918 | 0.968 | 0.923 | 0.969 | 0.925 | 0.969 | 0.923 | 0.969 |
| | Recall | 0.918 | 0.986 | 0.943 | 0.989 | 0.952 | 0.989 | 0.952 | 0.990 |
| | F-score | 0.918 | 0.977 | 0.933 | 0.979 | 0.938 | 0.979 | 0.938 | 0.980 |
| 2 | Precision | 0.952 | 0.972 | 0.955 | 0.972 | 0.957 | 0.972 | 0.956 | 0.973 |
| | Recall | 0.899 | 0.981 | 0.924 | 0.985 | 0.933 | 0.985 | 0.933 | 0.986 |
| | F-score | 0.925 | 0.976 | 0.939 | 0.978 | 0.945 | 0.979 | 0.944 | 0.979 |
Table 6. Evaluation results—Stupid Backoff model (best results in bold). Column headings give the maximum n-gram order and the test set.

| Backoff factor | Metric | 2 (SYOS) | 2 (AKJ) | 3 (SYOS) | 3 (AKJ) | 4 (SYOS) | 4 (AKJ) | 5 (SYOS) | 5 (AKJ) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Precision | 0.915 | 0.965 | 0.923 | 0.965 | 0.923 | 0.965 | 0.923 | 0.965 |
| | Recall | 0.916 | 0.986 | 0.958 | 0.989 | 0.961 | 0.988 | 0.961 | 0.988 |
| | F-score | 0.915 | 0.975 | 0.940 | 0.977 | 0.942 | 0.976 | 0.942 | 0.976 |
| 0.4 | Precision | 0.924 | 0.967 | 0.934 | 0.968 | 0.952 | 0.972 | 0.958 | 0.974 |
| | Recall | 0.921 | 0.988 | 0.944 | 0.986 | 0.913 | 0.979 | 0.875 | 0.966 |
| | F-score | 0.922 | 0.977 | 0.939 | 0.977 | 0.932 | 0.975 | 0.915 | 0.970 |
| n = 1: 0.09; n > 1: 1 | Precision | 0.934 | 0.968 | 0.937 | 0.968 | 0.937 | 0.965 | 0.937 | 0.965 |
| | Recall | 0.927 | 0.986 | 0.961 | 0.987 | 0.963 | 0.986 | 0.963 | 0.986 |
| | F-score | 0.930 | 0.977 | 0.949 | 0.977 | 0.950 | 0.975 | 0.950 | 0.975 |
Table 7. Evaluation results—model with Kneser-Ney smoothing (best results in bold). Column headings give the maximum n-gram order and the test set.

| Metric | 2 (SYOS) | 2 (AKJ) | 3 (SYOS) | 3 (AKJ) | 4 (SYOS) | 4 (AKJ) | 5 (SYOS) | 5 (AKJ) |
|---|---|---|---|---|---|---|---|---|
| Precision | 0.970 | 0.979 | 0.975 | 0.978 | 0.975 | 0.979 | 0.975 | 0.979 |
| Recall | 0.693 | 0.946 | 0.724 | 0.944 | 0.726 | 0.943 | 0.726 | 0.943 |
| F-score | 0.808 | 0.962 | 0.831 | 0.961 | 0.832 | 0.961 | 0.832 | 0.961 |
Table 8. Evaluation results—Universal Segmenter (best results in bold). Column headings give the model version and the test set; "–" in the first column denotes conventional character embeddings.

| Order of concatenated n-gram vectors | Metric | Default (SYOS) | Default (AKJ) | ISP (SYOS) | ISP (AKJ) | MWTs_rnd (SYOS) | MWTs_rnd (AKJ) | MWTs (SYOS) | MWTs (AKJ) |
|---|---|---|---|---|---|---|---|---|---|
| – | F-score | 0.809 | 0.931 | 0.874 | 0.950 | – | – | – | – |
| 3 | F-score | 0.813 | 0.931 | 0.894 | 0.946 | 0.939 | 0.971 | 0.938 | 0.974 |
| 5 | F-score | 0.812 | 0.931 | 0.888 | 0.958 | 0.945 | 0.973 | 0.948 | 0.978 |
| 7 | F-score | 0.812 | 0.931 | 0.897 | 0.955 | 0.944 | 0.973 | 0.948 | 0.977 |
| 9 | Precision | 0.998 | 0.981 | 0.910 | 0.952 | 0.950 | 0.972 | 0.976 | 0.980 |
| | Recall | 0.686 | 0.885 | 0.910 | 0.962 | 0.947 | 0.976 | 0.927 | 0.979 |
| | F-score | 0.813 | 0.931 | 0.910 | 0.957 | 0.948 | 0.974 | 0.951 | 0.979 |
| 11 | F-score | 0.812 | 0.931 | 0.914 | 0.954 | 0.946 | 0.974 | 0.950 | 0.975 |
Table 9. US-ISP model (with 9-gram vectors): F-score for word boundaries not indicated in the original transcription.

| Spaces in Test Data Retained | SYOS | AKJ |
|---|---|---|
| YES | 0.811 | 0.880 |
| NO | 0.844 | 0.930 |
Table 10. N-gram coverage.

| Test Data | n = 1 | n = 2 | n = 3 | n = 4 | n = 5 |
|---|---|---|---|---|---|
| SYOS | 0.966 | 0.627 | 0.338 | 0.188 | 0.128 |
| AKJ | 0.995 | 0.792 | 0.487 | 0.236 | 0.099 |
Table 11. Word-level Accuracy for OoV words (best models only).

| Test Data | MiNgMatch | SB-0.4 | SB-0.09 | mKN | US-MWTs |
|---|---|---|---|---|---|
| SYOS | 0.320 | 0.120 | 0.000 | 0.320 | 0.560 |
| AKJ | 0.000 | 0.000 | 0.000 | 0.125 | 0.625 |
Table 12. Error comparison (Jaccard index).

| Test set: SYOS | MiNgMatch | SB-0.09 | mKN |
|---|---|---|---|
| US-MWTs | 0.220 | 0.153 | 0.195 |
| mKN | 0.176 | 0.159 | |
| SB-0.09 | 0.505 | | |

| Test set: AKJ | MiNgMatch | SB-0.09 | mKN |
|---|---|---|---|
| US-MWTs | 0.414 | 0.390 | 0.474 |
| mKN | 0.369 | 0.355 | |
| SB-0.09 | 0.714 | | |
Table 13. Statistics of word segmentation errors.

| Test Data | MiNgMatch | SB-0.09 | mKN | US-MWTs |
|---|---|---|---|---|
| SYOS | 71 | 66 | 189 | 62 |
| AKJ | 50 | 58 | 91 | 49 |
Table 14. Katayama's transcription evaluated against Kirikae's transcription.

| Precision | Recall | F-score |
|---|---|---|
| 0.979 | 0.941 | 0.960 |
Table 15. SYOS: outputs of the best models re-evaluated against combined gold standard data (Kirikae [49] + Katayama [61]). Best results in bold.

| Metric | MiNgMatch | SB-0.4 | SB-0.09 | mKN | US-MWTs |
|---|---|---|---|---|---|
| Precision | 0.965 | 0.969 | 0.949 | 0.981 | 0.980 |
| Recall | 0.954 | 0.880 | 0.964 | 0.718 | 0.933 |
| F-score | 0.959 | 0.922 | 0.956 | 0.829 | 0.956 |
Table 16. Execution times in seconds.

| Model | MiNgMatch | SB-0.09 | mKN | US-MWTs |
|---|---|---|---|---|
| Time (s) | 0.812 | 5.837 | 4.785 | 7.717 |