MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Abstract: Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter, a fast word segmentation algorithm which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.


Introduction
One way to handle ambiguity, a major challenge in any Natural Language Processing task, is to consider the target text in context. A typical approach is to use an n-gram model, where the probability of a word depends on the n − 1 previous words. In this paper we argue that in the context of word segmentation, the problem can be reduced to finding the shortest sequence of n-grams matching the input text, with little or no drop in performance compared to state-of-the-art methodologies. In order to verify the usability of our approach in a scenario involving extremely sparse data, where its performance is expected to suffer the most, we tested it with a small corpus of text in Ainu, a critically endangered language isolate native to the island of Hokkaido in northern Japan.
Word segmentation is a part of the process of tokenization, a preprocessing stage present in a wide range of higher level Natural Language Processing tasks (such as part-of-speech tagging, entity recognition and machine translation), where the text is divided into basic meaningful units (referred to as tokens), such as words and punctuation marks. In the case of writing systems using explicit word delimiters (e.g., whitespaces), tokenization is considered a trivial task. However, sometimes the information about word boundaries is not encoded in the surface form (as in Chinese script), or orthographic words are too coarse-grained and need to be further analyzed, which is the case for many texts written in Ainu. In order to effectively process such texts, one needs to identify the implicit word boundaries.
The main contributions of this work are: (i) a fast n-gram model yielding results comparable to state-of-the-art systems in the task of word segmentation of the Ainu language; (ii) an open source implementation (https://github.com/karol-nowakowski/MiNgMatchSegmenter); (iii) a comparison of 4 different segmenters, including lexical n-gram models and a neural model performing word segmentation in the form of character sequence labelling.
The remainder of this paper is organized as follows. In Section 2 we describe the problem of word segmentation in the Ainu language. In Section 3 we review the related work. Section 4 explains the proposed approach to word segmentation. In Section 5 we introduce the Ainu language resources used in this research. This section also provides a description of word segmentation models applied in our experiments, as well as evaluation metrics. In Section 6 we analyze the experimental results. Finally, Section 7 contains conclusions and ideas for future improvements.

Word Segmentation in the Ainu Language
Ainu is an agglutinative language exhibiting some of the characteristics associated with polysynthesis, such as pronominal marking and noun incorporation (especially in the language of classical Ainu literature [1] (p. 5)). The following example demonstrates noun incorporation in Ainu:

kotan apapa ta a=eponciseanu [2] (p. 111)
kotan apa-pa ta a-e-pon-cise-anu
village entrance-mouth at we/people-for[someone]-small-house-lay
"We built a small hut for [her] at the entrance to the village."

Ainu verbs and nouns combine with a variety of affixes (marking reciprocity, causativity, plurality and other categories) as well as function words: adnouns, verb auxiliaries and various types of postpositions, among others (in her analysis of the Shizunai dialect of Ainu, Kirsten Refsing [3] refers to both groups of grammatical morphemes with a common term: "clitics").
Most written documents in the Ainu language are transcribed using the Latin alphabet, Japanese katakana script, or a combination of both (all textual data used in this research is written in Latin script). After two centuries of constant evolution, with multiple alternative notation methods being simultaneously in use, Ainu orthography has been, to a certain degree, standardized. One of the milestones in that process was a textbook compiled by the Hokkaido Utari Association (now Hokkaido Ainu Association) in cooperation with Ainu language scholars, published in 1994 under the title Akor itak ("our language") [4]. It was intended for use in Ainu language classes held throughout Hokkaido and included a set of orthographic rules for both Latin alphabet and katakana-based transcription. They are widely followed to this day, for example by Hiroshi Nakagawa [5], Suzuko Tamura [2] and the authors of the Topical Dictionary of Conversational Ainu [6]. For detailed analyses of notation methods employed by different authors and how they changed with time, please refer to Kirikae [7], Nakagawa [8] and Endō [9].
Concerning word segmentation, however, no standard guidelines have been established to date [10] (p. 198), [11] (p. 5), and polysynthetic verb morphology only adds to the confusion [5] (p. 5). Contemporary Ainu language experts, while taking different approaches to handling certain forms, such as compound nouns and lexicalized expressions, generally treat morphemes entering in syntactic relations with other words as distinct units, even if they are cliticized to a host word in the phonological realization. In dictionaries, study materials and written transcripts of Ainu oral tradition that were published in the last few decades [2,12,13], it is a popular practice to use katakana to reflect pronunciation, while parallel text in Latin characters represents the underlying forms. The problem is most noticeable in older documents and texts written by native speakers without a background in linguistics, who tended to divide text into phonological words or larger prosodic units (sometimes whole verses); see Sunasawa [10] (p. 196). As a consequence, orthographic words in their notation comprise, on average, more morphemes. This, in turn, leads to an increase in the proportion of items not to be found in the existing dictionaries, which makes the content of such texts difficult to comprehend for less experienced learners. Furthermore, in the context of Natural Language Processing it renders the already limited data even more sparse. In order to facilitate the analysis and processing of such documents, a mechanism for word boundary detection is necessary.

Related Work
Existing approaches to the problem of tokenization and word segmentation can be largely divided into rule-based and data-driven methods. Data-driven systems may be further subdivided into lexicon-based systems and those employing statistical language models or machine learning.
In space-delimited languages, rule-based tokenizers, such as the Stanford Tokenizer (https://nlp.stanford.edu/software/tokenizer.html; accessed on 26 September 2019) [14], are sufficient for most applications. On the other hand, in languages where word boundaries are not explicitly marked in text (such as Chinese and Japanese), word segmentation is a challenging task, receiving a great deal of attention from the research community. For such languages, a variety of data-driven word segmentation systems have been proposed. Among dictionary-based algorithms, one of the most popular approaches is the longest match method (also referred to as the maximum matching algorithm or MaxMatch) [15] and its variations [16,17]. In more recent work, however, statistical and machine learning methods prevail [18–22]. Furthermore, as in many other Natural Language Processing tasks, the past few years have witnessed an increasing interest in artificial neural networks among the researchers studying word segmentation, especially for Chinese. A substantial part of the advancements in this area stems from using large external resources, such as raw text corpora, for pretraining neural models [23–27]. Unfortunately, such large-scale data is not available for many lesser-studied languages, including Ainu. For Japanese and Chinese, word segmentation is sometimes modelled jointly with part-of-speech tagging, as the output of the latter task can provide useful information to the segmenter [21,28–30].
Outside of the East Asian context, word segmentation-related research is focused mainly on languages with complex morphology and/or extensive compounding (such as Finnish, Turkish, German, Arabic and Hebrew), where splitting coarse-grained surface forms into smaller units leads to a significant reduction in the vocabulary size and thus a lower proportion of out-of-vocabulary words [31–35]. Apart from that, even in languages normally using explicit word delimiters, there exist special types of text specific to the web domain, such as Uniform Resource Locators (URL) and hashtags, whose analysis requires the application of a word segmentation procedure [35,36].
In 2016 Grant Jenks released WordSegment, a Python module for word segmentation utilizing a Stupid Backoff model (http://www.grantjenks.com/docs/wordsegment/; accessed on 26 September 2019). Due to its relatively low computational cost, Stupid Backoff [37] is well suited to working with extremely large models, such as Google's trillion-word corpus (https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html; accessed on 26 September 2019) used as WordSegment's default training data. In terms of the model's accuracy, however, other language modelling methods, in particular the approach proposed by Kneser and Ney [38] and enhanced by Chen and Goodman [39], proved to perform better, especially with smaller amounts of data [37]. For that reason, in this research, apart from comparing our word segmentation algorithm to WordSegment, we carried out additional experiments with a segmentation algorithm based on an n-gram model with modified Kneser-Ney smoothing. In the context of word segmentation, Kneser-Ney smoothing has previously been used by Doval and Gómez-Rodríguez [35].
Apart from models concerned directly with words, a widely practised approach to word segmentation is to define it as a character sequence labelling task, where each character is assigned a tag representing its position in relation to word boundaries. While the early works belonging to this category relied on "traditional" classification techniques, such as maximum entropy models [40] and Conditional Random Fields [41], in recent studies neural architectures are being actively explored [23,27,28,30,42]. In 2018, Shao et al. [43] released a language-independent character sequence tagging model based on recurrent neural networks with a Conditional Random Fields interface, designed for performing word segmentation in the Universal Dependencies framework. It obtained state-of-the-art accuracies on a wide range of languages. One of the key components of their methodology (originally proposed in [30]) is the use of concatenated n-gram character representations, which offer a significant performance boost in comparison to conventional character embeddings, without resorting to external data sources. We used their implementation in the experiments described later in this paper, in order to verify how a character-based neural model performs under extremely low-resource conditions, such as those of the Ainu language, and how it compares with segmenters utilizing lexical n-grams, including ours.
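To make the character sequence labelling formulation concrete, the sketch below converts a segmented sentence into per-character tags and back. It assumes a common four-tag scheme (B = beginning, I = inside, E = end, S = single-character word); the function names are hypothetical and the tag inventory of any particular tagger may differ.

```python
def to_bies(sentence):
    """Turn a space-segmented sentence into one tag per non-space character."""
    tags = []
    for word in sentence.split():
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(word) - 2) + ["E"])
    return tags


def from_bies(chars, tags):
    """Rebuild the segmented sentence from characters and their tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends after this character
            words.append(current)
            current = ""
    if current:  # tolerate a tag sequence that ends mid-word
        words.append(current)
    return " ".join(words)


tags = to_bies("ci ki siri a")
print(tags)
print(from_bies("cikisiria", tags))
```

Under this formulation a segmenter only has to predict the tag sequence; the word boundaries follow from the positions of the E and S tags.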
To address the problem of word segmentation in the Ainu language, Ptaszynski and Momouchi [44] proposed a segmenter based on the longest match method. Later, Ptaszynski et al. [45] investigated the possibility of improving its performance by expanding the dictionary base used in the process. Nowakowski et al. [46] developed a lexicon-based segmentation algorithm maximizing mean token length. Finally, Nowakowski et al. [47] proposed a segmenter searching for the minimal sequence of n-grams matching the input string, an early and less efficient version of the MiNgMatch algorithm presented in this paper.

Description of the Proposed Approach
In the proposed method, we reduce the problem of word segmentation to that of finding the shortest sequence of lexical n-grams matching the input string. For each space-delimited segment in the input text, our algorithm finds a single n-gram or the shortest sequence of n-grams which, after concatenation and removal of all whitespaces, is equal to that input segment. In cases where multiple segmentation paths with the same number of n-grams are possible, the sequence with the highest score is selected. The scoring function, given a candidate sequence S, can be defined as below:

$$\mathrm{Score}(S) = \prod_{s \in S} \frac{\mathrm{Count}(s)}{N}$$

where Count(·) denotes the frequency of a particular n-gram in the training corpus and N is the total number of n-grams in that corpus not exceeding the maximum n-gram order specified for the model. If the model is unable to match any n-gram to the given string or its part, it is treated as an out-of-vocabulary item and returned without modification. Furthermore, the user may specify the maximum number of n-grams to be used in the segmentation of a single input segment. Strings for which the algorithm could not match a sequence of n-grams equal to or shorter than the limit are retained without modification. The only exception to that rule is punctuation marks: they are separated from alpha-numeric strings in a post-processing step.
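The search described above can be illustrated with a minimal dynamic programming sketch. It is not the released implementation: it assumes that the score of a sequence is the product of the relative frequencies Count(s)/N of its n-grams (stored as logarithms), it returns a segment unchanged whenever no complete match exists (partial matching and punctuation handling are left out), and the names `toy_model` and `segment` are hypothetical.

```python
import math


def toy_model(counts):
    """Map each unsegmented string to (segmented n-gram, log relative frequency)."""
    total = sum(counts.values())
    return {ngram.replace(" ", ""): (ngram, math.log(count / total))
            for ngram, count in counts.items()}


def segment(text, model, limit=None):
    """Shortest sequence of n-grams matching `text`; ties go to the higher score."""
    n = len(text)
    limit = n if limit is None else limit
    # best[i] = (n-grams used, negated log score, segmented pieces) for text[:i]
    best = [None] * (n + 1)
    best[0] = (0, 0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            prev, entry = best[start], model.get(text[start:end])
            if prev is None or entry is None:
                continue
            cand = (prev[0] + 1, prev[1] - entry[1], prev[2] + [entry[0]])
            # Fewer n-grams wins; among equals, the lower negated score wins.
            if best[end] is None or cand[:2] < best[end][:2]:
                best[end] = cand
    if best[n] is None or best[n][0] > limit:
        return text  # no match within the limit: keep the segment unchanged
    return " ".join(best[n][2])


# Counts for "ciki" and "ci ki siri" are taken from the text; the one for "siri" is made up.
model = toy_model({"ciki": 594, "siri": 10, "ci ki siri": 3})
print(segment("cikisiri", model))  # the single 3-gram is preferred over "ciki" + "siri"
```

Comparing the pair (number of n-grams, negated log score) lexicographically keeps the two criteria in the order stated above: sequence length first, score only as a tie-breaker.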

N-gram Data
Listing 1 shows a sample from the data used by our model. The first column contains unsegmented strings used by the matching algorithm, each of them corresponding to a lexical n-gram. In the rightmost column, we store precomputed scores, represented as logarithms.
Once the best sequence of n-grams has been selected, the indices of word boundaries for each n-gram (stored in the second column) are used to produce the final segmentation. For cases where multiple n-gram patterns recorded in the training corpus resulted in the same string after removing whitespaces from between the tokens, we only included the most frequent segmentation in the model. For instance, the 3-gram aynu mosir ka ("the land of Ainu also"), which appeared in the data 6 times, was pruned, as the bigram variant aynumosir ka was more frequent, with 63 occurrences. An alternative segmentation can still be returned by the segmenter if it is more frequent in a longer context. For example, although the preferred segmentation of the string ciki is ciki ("if"; 594 instances), rather than ci ki (first person pronominal marker ci attached to auxiliary verb ki) with 32 occurrences, in the case of a longer segment, cikisiri, the only segmentation attested in our data is ci ki siri (ci ki followed by the nominalizing evidential particle siri), appearing 3 times in the training corpus.
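The construction of such a table can be sketched as follows. The sketch assumes that boundary indices are character offsets into the unsegmented string and that scores are log relative frequencies; the exact file layout of the model is not reproduced here, and all function names are hypothetical.

```python
import math
from collections import Counter


def extract_ngrams(sentences, max_order=5):
    """Count all lexical n-grams up to `max_order` in space-segmented sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for order in range(1, max_order + 1):
            for i in range(len(tokens) - order + 1):
                counts[tuple(tokens[i:i + order])] += 1
    return counts


def boundary_indices(tokens):
    """Offsets in the unsegmented string at which a space has to be inserted."""
    indices, offset = [], 0
    for token in tokens[:-1]:
        offset += len(token)
        indices.append(offset)
    return indices


def build_table(counts):
    """Keep the most frequent segmentation per unsegmented string."""
    total = sum(counts.values())
    best = {}
    for tokens, count in counts.items():
        key = "".join(tokens)
        if key not in best or count > best[key][1]:
            best[key] = (tokens, count)
    return {key: (boundary_indices(tokens), math.log(count / total))
            for key, (tokens, count) in best.items()}


def apply_boundaries(string, indices):
    """Insert spaces into `string` at the stored offsets."""
    pieces, prev = [], 0
    for idx in indices:
        pieces.append(string[prev:idx])
        prev = idx
    pieces.append(string[prev:])
    return " ".join(pieces)


# The two counts below are the ones quoted in the text.
table = build_table(Counter({("aynumosir", "ka"): 63, ("aynu", "mosir", "ka"): 6}))
print(apply_boundaries("aynumosirka", table["aynumosirka"][0]))
```

Storing offsets rather than the segmented string keeps the lookup key and the segmentation in a single row, which is one plausible reason for the three-column layout described above.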

Computational Cost
The maximum number of candidate segmentations to be generated by our algorithm, given a string composed of n characters, can be calculated as follows:

where m stands for the smallest number of n-grams existing in the model needed to create a sequence matching the input string, and l represents the limit of n-grams per input string (specified by the user), such that l ≤ n. In practice it means that, apart from rare situations where only a sequence of single-character unigrams can be matched to the given string, our algorithm has a lower computational cost than a model which considers all the 2^(n−1) possible segmentations (obviously, a word segmentation algorithm evaluating each unique segmentation path would be highly impractical; a typical approach, also taken by us, is to reduce that number by applying dynamic programming and memoization techniques).
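A bound consistent with the definitions of n, m and l given above can be written down under one assumption: candidates are enumerated in order of increasing number of n-grams, and the search stops at the first length that yields a match or at the limit. This is an illustrative reading rather than a verbatim restatement of the expression used in the analysis.

```latex
% Assumed reading: a string of n characters can be split into exactly k pieces
% in \binom{n-1}{k-1} ways, and lengths k = 1, ..., min(m, l) are examined.
\[
  C_{\max}(n, m, l) \;=\; \sum_{k=1}^{\min(m,\,l)} \binom{n-1}{k-1}
  \;\le\; \sum_{k=1}^{n} \binom{n-1}{k-1} \;=\; 2^{\,n-1}
\]
% Worked example: n = 8 and m = 2 give \binom{7}{0} + \binom{7}{1} = 8 candidates,
% compared with 2^{7} = 128 unrestricted segmentations.
```

Equality with 2^(n−1) holds only when min(m, l) = n, that is, when nothing shorter than a sequence of single-character unigrams matches the string.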

Training Data
Language models applied in this research were trained on Ainu language textual data from eight different sources:
(A) Ainu Shin'yōshu [48] (SYOStrain): A collection of 13 mythic epics (kamuy yukar) compiled by Yukie Chiri. In the training of our models we used a version with modernized transcription, published by Hideo Kirikae [49]. We only included 11 epics in the training set, while the remaining 2 texts were used as test data (see the next section).
(B) A Talking Dictionary of Ainu: A New Version of Kanazawa's Ainu Conversational dictionary [50] (TDOA): An online dictionary based on the Ainugo kaiwa jiten [51], a dictionary compiled by Shōzaburō Kanazawa and Kotora Jinbō, and published in 1898. It contains 3,847 entries, each of them consisting of a single word, multiple related words, a phrase or a sentence. For training we used the modernized transcription produced by Bugaeva et al. [50]. The last 285 entries (roughly 10% of the dictionary, character-wise) were excluded from the training data, in order to use them as test data in evaluation experiments (see the next section).
(C) Glossed Audio Corpus of Ainu Folklore [52] (GACF): A digital collection of 10 Ainu folktales with glosses (morphological annotation) and translations into Japanese and English.
(D) Dictionary of Ainu place names [53] (MOPL): A dictionary of Ainu place names in the form of a database. It includes a total of 3,152 topological names, along with the analysis of their components and Japanese translations.
(E) Dictionary of the Mukawa dialect of Ainu [54]  An online lexicon consisting of digitized versions of three Ainu-Japanese dictionaries [2,57,58], comprising a total of 33,126 entries. We used only the headwords included in the dictionary. After removing duplicates (homographic entries), a total of 16,107 entries remained.
The following post-processing steps were applied to the training corpus:
(1) The data was cleaned, resulting in files containing only raw Ainu text in Latin alphabet.
(2) Accented vowels (á, é, í, ó, ú) used in some materials were replaced with their unaccented counterparts (a, e, i, o, u).
(3) Underscores (_) used in some materials to indicate phonological alternations were removed.
(4) Equality signs (=) used to denote personal markers were either removed, or replaced with whitespaces (if there was no whitespace in the original text). While their presence is an unambiguous indicator of a boundary between two tokens (although they are traditionally referred to as "affixes", we treat those morphemes as separate units, which is a common practice among present-day experts; for a detailed analysis of their morphological status, please refer to Bugaeva [59]), they were not used in older texts, which are going to be the main target of a word segmentation system; therefore, we decided to exclude them from the data. This resulted in a corpus of text comprising a total of 481,291 segments (space-delimited units of text). The statistics of all eight datasets after this step are shown in Table 1.
(5) Finally, punctuation marks were separated from words. However, non-alphanumeric characters used word-internally (e.g., hyphens indicating boundaries between the constituents of compound words and apostrophes representing glottal stop) were not modified.
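Steps (2) to (5) can be approximated with a few string operations, as in the sketch below. The set of characters treated as punctuation is an assumption made for illustration, and the function name `preprocess` is hypothetical; step (1), the removal of non-Ainu material, depends on each source and is not shown.

```python
import re

# Step (2): accented vowels and their unaccented counterparts.
ACCENTS = str.maketrans("áéíóúÁÉÍÓÚ", "aeiouAEIOU")

# Step (5): assumed punctuation set; hyphens and apostrophes are deliberately absent.
PUNCTUATION = re.compile(r'([.,;:!?"()\[\]])')


def preprocess(line):
    line = line.translate(ACCENTS)           # (2) strip accents
    line = line.replace("_", "")             # (3) drop alternation markers
    line = line.replace("=", " ")            # (4) personal-marker boundary becomes a space
    line = PUNCTUATION.sub(r" \1 ", line)    # (5) detach punctuation marks
    return " ".join(line.split())            # collapse repeated whitespace


print(preprocess("kór_ wa, a= kor"))
print(preprocess("a=eponciseanu."))
```

Replacing every equality sign with a space and collapsing whitespace afterwards covers both cases in step (4): where a space was already present the sign effectively disappears, and where it was not, a space takes its place.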
(B) Ainugo Kaiwa Jiten [51] (AKJ) The portion of the original dictionary corresponding to the entries from A Talking Dictionary of Ainu which we removed from the training data, was applied as the second evaluation dataset.
In the test data we retained the word segmentation of the original transcriptions (by Chiri [48] and Jinbō and Kanazawa [51]). However, in order to prevent differences in spelling from affecting the word segmentation algorithm's performance, the text was preprocessed by unifying its spelling with modern versions transcribed by Kirikae [49] and Bugaeva et al. [50] (the task of spelling modernization is out of the scope of this paper and will be addressed separately). In the case of the Ainugo kaiwa jiten and TDOA, there were also some differences in the usage of punctuation marks, as well as several words which appeared in the original text but which the authors of the modernized transcription decided to remove. In such cases the text was unified with the modern transcription, with the exception of equality signs attached to personal markers, which were omitted. A sample sentence from the Ainugo kaiwa jiten before and after this preprocessing step is shown in Table 2. Table 3 presents the statistics of both evaluation datasets, in comparison with the portions of modernized texts corresponding to them.

Experiment Setup
In our experiments we tested the following word segmentation systems: (1) a corpus-based word segmentation algorithm minimizing the number of n-grams needed to match the input string (MiNgMatch Segmenter); (2) a segmentation algorithm with Stupid Backoff language model (WordSegment with modifications); (3) a segmentation algorithm with a language model applying modified Kneser-Ney smoothing (later referred to as "mKN"); (4) a segmentation system based on character sequence labelling using a neural model [43] (later we will refer to it as "Universal Segmenter").

MiNgMatch Segmenter
Our algorithm was tested in two variants:
• with the limit of n-grams per input segment equal to the number of characters in the input string;
• with the limit of n-grams per input segment set to 2 (based on the observation that in most cases where a single input segment is divided into 3 or more n-grams, that segmentation is incorrect).
In experiments conducted by Nowakowski et al. [47] with an early version of the segmenter, the best results were in most cases yielded with the order of n-grams not exceeding 5-grams. Thus, for the n-gram models examined in the present paper we set the limit of n to 5.

WordSegment (Stupid Backoff Model)
WordSegment is an open source Python module for word segmentation developed by Grant Jenks, based on the work of Peter Norvig [60]. In our evaluation experiments we applied the system with two modifications:
(A) We added the option of using n-gram models with the order of n-grams higher than 2 (in original WordSegment, only unigrams and bigrams were used, whereas we wanted to test models with the order of up to 5).
(B) We added the possibility of manipulating the backoff factor. Although it was a part of the original formulation by Brants et al. [37], Peter Norvig and Grant Jenks omitted it from their implementations.
We examined three different values of the backoff factor:
• 1 (i.e., no backoff factor, as in original WordSegment);
• 0.4 (as suggested by Brants et al. [37]; later we will refer to this model as "SB-0.4");
• 0.09, only applied to 1-grams (this configuration achieved the best F-score in our preliminary experiments, at the cost of lower Precision).
Let $w_i$ denote a candidate word to be evaluated in the context of k previous words ($w_{i-k}^{i-1}$). The recursive scoring function employed in this variant (later referred to as "SB-0.09") can be defined as follows:

$$S(w_i \mid w_{i-k}^{i-1}) = \begin{cases} \dfrac{\mathrm{Count}(w_{i-k}^{i})}{\mathrm{Count}(w_{i-k}^{i-1})} & \text{if } \mathrm{Count}(w_{i-k}^{i}) > 0 \text{ and } k > 0, \\ \alpha \, \dfrac{\mathrm{Count}(w_i)}{N_1} & \text{if } k = 0, \\ S(w_i \mid w_{i-k+1}^{i-1}) & \text{otherwise,} \end{cases}$$

with α representing the backoff factor specified for unigrams, and $N_1$ being the total number of unigrams in the training corpus.
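The SB-0.09 scoring can be sketched in a few lines. The sketch assumes that the relative frequency is used whenever the n-gram was seen, that the context is otherwise shortened without penalty, and that α multiplies only the unigram estimate; unseen words simply receive a score of 0 here, and the counts and the function name are made up for illustration.

```python
def stupid_backoff(word, context, counts, n_unigrams, alpha=0.09):
    """Score `word` given `context` (a tuple of previous words).

    `counts` maps n-gram tuples of any order to their frequencies.
    """
    while context:
        full = context + (word,)
        if counts.get(full, 0) > 0:
            # Seen n-gram: relative frequency with respect to its context.
            return counts[full] / counts[context]
        context = context[1:]  # back off to a shorter context, no penalty
    # Unigram level: the only place where the backoff factor is applied.
    return alpha * counts.get((word,), 0) / n_unigrams


# Hypothetical counts and corpus size, for illustration only.
counts = {("ci",): 40, ("ki",): 50, ("ci", "ki"): 32}
print(stupid_backoff("ki", ("ci",), counts, n_unigrams=1000))    # seen bigram
print(stupid_backoff("ki", ("siri",), counts, n_unigrams=1000))  # falls back to unigram
```

Applying a factor at every backoff step instead, as in the SB-0.4 configuration, would amount to multiplying the result by that factor once per shortened context.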

Segmenter with Language Model Applying Modified Kneser-Ney Smoothing
In the next experiment, we tested a word segmentation system similar to WordSegment (the same dynamic programming algorithm is used to generate candidate segmentations), but employing a language model with modified Kneser-Ney smoothing for choosing the most probable segmentation path. The model was generated using the KenLM Language Model Toolkit (https://kheafield.com/code/kenlm/; accessed on 26 September 2019).
Analogously to the experiments with our system and WordSegment, we used language models with the maximum order of n-grams set to 5.

Universal Segmenter
Apart from segmenters utilizing language models based on lexical n-grams, we carried out a series of experiments using the character-level sequence labelling model developed by Yan Shao et al. [43]. We trained the model in three different variants:

Default Model (henceforth, "US-Default")
In this experiment, we applied the default training settings, designed to work with space-delimited languages. The same training data as in previous experiments was used, which means that tokens in the training set correspond to those in gold standard data.

With Spaces Ignored (henceforth, "US-ISP")
Here, we trained the model with the -isp argument. It results in the removal of space delimiters from the training data, which means that the text is effectively treated in a similar way to Chinese or Japanese script.

With Multi-Word Tokens (later referred to as "US-MWTs_rnd" and "US-MWTs")
When processing older Ainu texts, many space-delimited segments need to be split into multiple tokens. Consequently, the default model relying on whitespaces and trained on the data with modern segmentation is ineffective. Unlike in Chinese or Japanese, however, a large portion of word boundaries is already correctly indicated by whitespaces in the input text, so ignoring them altogether (as in US-ISP models) is not the optimal method either. In order to create a model better suited to our task, we used the concept of multi-word tokens (https://universaldependencies.org/u/overview/tokenization.html; accessed on 26 September 2019) existing in Universal Dependencies and also reflected in the Universal Segmenter.
Firstly, we converted the two datasets (SYOStrain and TDOA) for which both old and modernized transcriptions exist, to a format where boundaries between words grouped together as a single space-delimited string in the original transcription are treated as boundaries between components of a multi-word token. For the remaining six datasets, however, only a single transcription by contemporary experts is available. We therefore applied the following two methods to simulate sparser word segmentation of old texts by generating multi-word tokens artificially:
(A) As a baseline method, we created multi-word tokens in a random manner. Namely, we assigned each whitespace in the data a 50% chance of being removed and thus becoming a boundary between components of a multi-word token. This resulted in the generation of 105,663 multi-word tokens. Later we will refer to the models learned from this version of the training data as "US-MWTs_rnd".
(B) In the second approach, multi-word tokens were generated in a semi-supervised manner using the Universal Segmenter itself. To achieve that, we converted multi-word tokens previously identified in SYOStrain and TDOA to multi-token words (defined in the UD scheme as words consisting of multiple tokens, but treated as a single syntactic unit) and trained a word segmentation model on these two datasets. The resulting model was then used to process the rest of the training corpus. As a result, some tokens were grouped in multi-token words (a total of 70,373 such words were generated). In the final step, we converted the multi-token words to multi-word tokens. This variant of the data was used to train the group of models later referred to as "US-MWTs".
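The random baseline in (A) can be sketched as follows. The representation of a multi-word token as a list of its component words, the function name and the use of a seeded generator are all assumptions made for illustration; conversion to the training format expected by the Universal Segmenter is not shown.

```python
import random


def random_multiword_tokens(sentence, p=0.5, rng=None):
    """Group words into multi-word tokens by dropping each whitespace with probability p."""
    rng = rng or random.Random()
    tokens = sentence.split()
    groups = [[tokens[0]]] if tokens else []
    for token in tokens[1:]:
        if rng.random() < p:
            groups[-1].append(token)  # whitespace removed: same multi-word token
        else:
            groups.append([token])    # whitespace kept: a new token starts
    return groups


groups = random_multiword_tokens("kotan apapa ta a eponciseanu", rng=random.Random(0))
for group in groups:
    # Surface form without internal spaces, followed by its component words.
    print("".join(group), group)
```

Keeping the component words alongside the merged surface form preserves the gold segmentation, so the same data can serve both as simulated "old" input and as the training target.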
We illustrate the operations described above in Table 4, using a sample from the SYOStrain dataset.
Apart from simple character embeddings, the Universal Segmenter allows the usage of concatenated n-gram vectors encoding rich local information. We investigated the performance with 3-, 5-, 7-, 9- and 11-grams. Any parameters of the training process not mentioned above were set to default values.

Evaluation Method
In order to evaluate word segmentation performance, we employed three metrics: Precision (P), Recall (R) and balanced F-score ($F_1$). Precision is defined as the proportion of correct word boundaries (whitespaces) within all word boundaries returned by the system ($B_s$), whereas Recall is the proportion of word boundaries present in expert-annotated data ($B_e$) which were also correctly predicted by the segmenter. The balanced F-score is the harmonic mean of Precision and Recall.
In addition, we evaluated word-level Accuracy for OoV words, defined as the proportion of unseen tokens in expert-annotated data ($U_e$), correctly segmented by the system ($U_s$):

$$\mathrm{Accuracy}_{\mathrm{OoV}} = \frac{|U_e \cap U_s|}{|U_e|}$$
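The boundary-based metrics can be computed as in the sketch below, which assumes that a boundary is identified by its character offset in the text with spaces removed and that scores are computed for a single pair of strings; the function names are hypothetical.

```python
def boundaries(segmented):
    """Character offsets (in the text without spaces) at which a space occurs."""
    offsets, pos = set(), 0
    for token in segmented.split()[:-1]:
        pos += len(token)
        offsets.add(pos)
    return offsets


def evaluate(predicted, gold):
    """Return (Precision, Recall, F1) over word boundaries."""
    b_s, b_e = boundaries(predicted), boundaries(gold)
    correct = len(b_s & b_e)
    precision = correct / len(b_s) if b_s else 0.0
    recall = correct / len(b_e) if b_e else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# The system output has one spurious boundary (after "ci").
print(evaluate("ci ki siri", "ciki siri"))
```

In the example, one of the two predicted boundaries is correct and the single gold boundary is found, so over-segmentation lowers Precision while leaving Recall intact.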

Results and Discussion
The results of the evaluation experiments with our algorithm are presented in Table 5.The variant without the limit of n-grams per input segment produces unbalanced results (especially on SYOS), with relatively low Precision.After setting the limit to 2, Precision improves at the cost of a drop in Recall.The F-score is better for SYOS, while on AKJ there is a very slight drop.
Table 6 shows the results of experiments with the Stupid Backoff model.When no backoff factor is applied, results for both test sets are similar to those from the MiNgMatch Segmenter without the limit of n-grams per input segment.Setting the backoff factor to an appropriate value allows for significant improvement in Precision and F-score (and in some cases also small improvements in Recall).For the F-score, it is better to set a low backoff factor (e.g., 0.09) for 1-grams only, than to set it to a fixed value for all backoff steps (e.g., 0.4, as Brants et al. [37] did).A backoff factor of 0.4 gives significant improvement in Precision with higher order n-gram models, but at the same time Recall drops drastically and overall performance deteriorates.For models with an n-gram order of 3 or higher, the backoff factor has a bigger impact on the results than further increasing the order of n-grams included in the model.A comparison with the results yielded by MiNgMatch shows that setting the limit of n-grams per input segment is more effective than Stupid Backoff as a method for improving precision of the segmentation process-it leads to a much smaller drop in Recall.
The results of the experiment with models employing modified Kneser-Ney smoothing are shown in Table 7.They achieve higher Precision than both the other types of n-gram models.Nevertheless, due to very low Recall, the overall results are low.
The results obtained by the Universal Segmenter are presented in Table 8.The default model (regardless of what kind of character representations are used-conventional character embeddings or concatenated n-gram vectors) learns from the training data that the first and the last character of a word (corresponding to B, E and S tags) are always adjacent either to the boundary of a space-delimited segment or to a punctuation mark.As a result, the model separates punctuation from alpha-numeric strings found in the input, but never applies further segmentation to them.US-ISP models are better but still notably worse than lexical n-gram models (especially on SYOS).Unlike with default settings, the model trained on data without whitespaces learns to predict word boundaries within strings of alpha-numeric characters.However, when presented with test data including spaces, they impede the segmentation process rather than supporting it.As shown in Table 9, if we only take into account the word boundaries not already indicated in the raw test set, the model makes more correct predictions in data where the whitespaces have all been removed.Models with multi-word tokens achieve significantly higher results.Precision of the US-MWTs model is on par with the segmenter applying Kneser-Ney smoothing, while maintaining relatively high Recall.It yields lower Recall than the model with randomly generated multi-word tokens, but the F-score is higher due to better Precision.
With the exception of the US-ISP model on SYOS, all variants of the neural segmenter achieved the best performance with concatenated 9-gram vectors. This contrasts with the results reported by Shao et al. [30] for Chinese, where in most cases there was no further improvement beyond 3-grams. This behavior is a consequence of differences between writing systems: words in Chinese are on average composed of fewer characters than words in languages using alphabetic scripts. Due to a much bigger character set size, hanzi characters are also more informative for word segmentation [43], hence the better performance of models using shorter context.

General Observations
Due to data sparsity, n-gram coverage in the test set (the fraction of n-grams in the test data that can be found in the training set) is low (see Table 10). This means that many multi-word tokens from the test set are known to n-gram models as separate unigrams, but not in the form of a single n-gram. The Stupid Backoff model with a backoff factor for unigrams set to a moderate value (such as 0.09) is able to segment such strings correctly. However, it also erroneously segments some OoV single-word tokens whose surface forms happen to be interpretable as a sequence of concatenated in-vocabulary unigrams, resulting in lower Precision. On the other hand, models assigning low scores to unigrams (such as a 4- or 5-gram model with Stupid Backoff and the backoff factor set as suggested by Brants et al. [37], and in particular the model applying modified Kneser-Ney smoothing) are better at handling OoV words (see Table 11). However, as a result of probability multiplication, in many cases they score unseen multi-word segments higher than the sequence of unigrams into which the given segment should be divided, hence yielding lower Recall. The Universal Segmenter operates at the level of characters rather than words, which makes it more robust against unseen words. This, along with the ability of neural models to transform discrete, sparse inputs into continuous representations capturing similarities between them, such as morphological features [35], explains the fact that it is able to achieve high Precision while maintaining relatively high Recall.
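The coverage statistic defined above can be computed along the following lines. This is a sketch under one assumption: coverage is counted over n-gram occurrences (tokens) in the test data rather than over distinct n-gram types, a detail the definition leaves open. The function names and the toy input are hypothetical.

```python
def ngrams(words, n):
    """All contiguous n-grams of a tokenized sentence, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def ngram_coverage(train_sentences, test_sentences, n):
    """Fraction of test n-gram occurrences that also occur in the training data."""
    seen = {g for sent in train_sentences for g in ngrams(sent, n)}
    test = [g for sent in test_sentences for g in ngrams(sent, n)]
    if not test:
        return 0.0
    return sum(g in seen for g in test) / len(test)


# Placeholder tokens, for illustration only.
train = [["a", "b", "c"]]
test = [["a", "b", "d"]]
unigram_cov = ngram_coverage(train, test, 1)
bigram_cov = ngram_coverage(train, test, 2)
```

Coverage can only stay equal or fall as n grows, since every n-gram attested in the training data implies that its shorter sub-sequences are attested as well.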
In line with these observations, we found the Universal Segmenter to be the only segmenter in our experiments whose output includes tokens seen neither in the training data nor in the test set. For instance, it correctly segmented the input token ekampaktehi into ekampakte hi ("a promise"), whereas other systems either did not divide it at all, or segmented it into a sequence of in-vocabulary unigrams (e.g., ekampak te hi).

Error Comparison
Using the outputs of the best performing models, we measured how similar the errors made by different segmenters were. In particular, we calculated the Jaccard index between lists of errors found in each pair of outputs.
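As an illustration of this comparison, the sketch below computes the Jaccard index between the error sets of two outputs. How an individual error is represented is an assumption made here: each error is a pair of an error type (a spurious or a missed boundary) and a character offset in the space-free text. All function names are hypothetical.

```python
def boundaries(words):
    """Character offsets (in the space-free string) where a word boundary falls."""
    positions, offset = set(), 0
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def errors(predicted, gold):
    """Spurious (FP) and missed (FN) boundaries of a prediction against the gold standard."""
    p, g = boundaries(predicted), boundaries(gold)
    return {("FP", b) for b in p - g} | {("FN", b) for b in g - p}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


gold = ["wen", "puri", "enan", "tuykasi"]
output_1 = ["wen", "puri", "enan", "tuyka", "si"]  # one spurious boundary
output_2 = ["wenpuri", "enan", "tuyka", "si"]      # same error plus a missed boundary
similarity = jaccard(errors(output_1, gold), errors(output_2, gold))
```

In this toy case the two outputs share one error out of two distinct errors in total, so the index is 0.5.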
Results are presented in Table 12. The output of the model with modified Kneser-Ney smoothing is the least similar to most other models' outputs, which can be explained simply by the fact that it made the highest number of errors on both datasets (statistics are shown in Table 13). On the other hand, the Universal Segmenter's output, while containing numbers of errors comparable to those produced by the best performing n-gram models, also exhibits a low level of similarity to them. Indeed, qualitative analysis of segmentations generated by the neural model confirms that in some parts they are quite different from the predictions made by other models. For instance, the two segments wenpuri enantuykasi were correctly divided into wen puri enan tuykasi ("[her] face [took] the color of anger") only by the Universal Segmenter. All other models incorrectly split the word tuykasi (possessive form of the locative noun tuyka, meaning "on [the face]") into tuyka si, because the n-gram wen puri enan tuyka is attested (with 4 instances) in the training set. Conversely, there are also some errors made only by the Universal Segmenter. For instance, it was the only system to divide the in-vocabulary word ayapo (exclamation of pain) into two tokens, a and yapo, of which yapo does not appear in the training data. Another example is the phrase ki aineno ("eventually"), transcribed by Kirikae as ki a ine no (3 instances in the training set) and segmented in the same way by n-gram models, whereas the neural model treated the last two words as a single unit, ineno. This prediction, however, might arguably be considered correct, as there exists one instance of ineno in Kirikae's transcription, used in the same context (iki a ineno). Based on the observations described above, we believe that implementing an ensemble of an n-gram model and a character sequence labelling neural model would be an interesting avenue for future work.

Results on SYOS with Two Gold Standard Transcriptions
As mentioned in Section 2, there is a certain amount of inconsistency in word segmentation even between contemporary scholars of Ainu, which means that it is also present in our data. With that in mind, we decided to cross-check the results of our experiments against an additional gold standard transcription. For that purpose we used an alternative modernized transcription of SYOS by Katayama [61].
Firstly, we compared Katayama's transcription with the version edited by Kirikae [49], using the same evaluation metrics as in previous experiments with segmentation algorithms. The results are presented in Table 14. Our assumption is that, in spite of making different decisions as to whether to group certain morphemes together or to treat them as separate units, both experts produced correct transcriptions. In order to investigate the effect of this phenomenon on our experiments, we re-evaluated the outputs of the best performing segmentation models using a combination of both experts' transcriptions as the gold standard data. This time, Precision was defined as the proportion of word boundaries predicted by the model that can also be found in either of the gold standard transcriptions. Analogously, Recall was defined as the proportion of word boundaries found in both variants of the gold standard which were also correctly predicted by the model. Results are shown in Table 15. Apart from the model with Kneser-Ney smoothing, the results achieved by all models improved substantially. The highest gain was obtained for our algorithm: the result improved to such an extent that it ranked first in terms of F-score. A large share of that difference can be attributed to a single token, awa (a conjunction created by combining the perfective aspect marker a and a coordinative conjunctive particle wa), appearing a total of 17 times in the test set. The MiNgMatch algorithm, operating at the level of input segments, followed Katayama in not splitting awa, as it is more frequent in the training data, with 289 occurrences, than the 2-gram variant a wa (181 instances). Nevertheless, models considering a wider context preferred the latter option, which conforms with how Kirikae transcribed it.
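The two definitions above can be written down as a short sketch. With P denoting the set of boundaries predicted by the model and K and T the sets of boundaries in the two gold standard transcriptions, Precision is |P ∩ (K ∪ T)| / |P| and Recall is |P ∩ (K ∩ T)| / |K ∩ T|. The set notation, the function names and the representation of boundaries as character offsets are assumptions introduced for illustration. The toy input below is not a sentence from SYOS; it only reuses the tokens ki, a, wa and awa mentioned in the text.

```python
def boundaries(words):
    """Character offsets (in the space-free string) where a word boundary falls."""
    positions, offset = set(), 0
    for word in words[:-1]:
        offset += len(word)
        positions.add(offset)
    return positions


def combined_scores(predicted, gold_a, gold_b):
    """Precision against the union, Recall against the intersection of two gold standards."""
    p = boundaries(predicted)
    either = boundaries(gold_a) | boundaries(gold_b)
    both = boundaries(gold_a) & boundaries(gold_b)
    precision = len(p & either) / len(p) if p else 0.0
    recall = len(p & both) / len(both) if both else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy input: one transcription splits "a wa", the other keeps "awa".
variant_split = ["ki", "a", "wa"]
variant_joined = ["ki", "awa"]
scores_joined = combined_scores(["ki", "awa"], variant_split, variant_joined)
scores_split = combined_scores(["ki", "a", "wa"], variant_split, variant_joined)
```

Under this scheme, a boundary on which the two experts disagree is neither required (it does not count towards Recall) nor penalized when predicted (it does not lower Precision), so both analyses of awa receive full credit.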

Execution Speed
Table 16 compares the total time taken by each of the best performing models to process the two test sets. In the case of segmenters based on lexical n-grams, we used 5-gram models. The Universal Segmenter's speed was evaluated on the model trained with concatenated 9-gram vector representations. Experiments with n-gram models were carried out on a Windows machine with an Intel Core i7 running at 1.90 GHz and 16 GB of RAM. The Universal Segmenter was tested on an Ubuntu machine with four GPUs (NVIDIA GeForce GTX 1080 Ti) and 128 GB of RAM. Each value represents an average of five consecutive runs. The results indicate that our algorithm is unrivalled in terms of speed.
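For readers wishing to run a comparable measurement, averaging over consecutive runs can be done with a small harness such as the one below. This is a generic sketch and not the procedure used to produce Table 16. The function name `average_runtime` is hypothetical, and the callable being timed stands in for a complete segmentation run.

```python
import time


def average_runtime(fn, runs=5):
    """Mean wall-clock time in seconds of `runs` consecutive calls to `fn`."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


# Placeholder workload standing in for segmenting a test set.
mean_seconds = average_runtime(lambda: sum(range(100_000)), runs=5)
```

Note that the figures in Table 16 were obtained on different hardware for the n-gram models and the neural model, so timings from such a harness are comparable only within a single machine.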

Conclusions and Future Work
In this paper, we introduced the MiNgMatch Segmenter: a data-driven word segmentation algorithm finding the minimal sequence of n-grams needed to match the input text. We compared our algorithm with segmenters utilizing two state-of-the-art n-gram language modelling techniques (namely, the Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling.
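The core idea summarized above can be sketched as a small dynamic program. This is one possible reading of the approach and not the reference implementation: it assumes that each space-free surface string is mapped to its most frequent segmented n-gram, and that an input segment is left unchanged when no match is found or when the optional limit of n-grams per segment is exceeded. The names `build_table`, `min_ngram_match` and `max_ngrams` are hypothetical. The counts for awa and a wa are those reported in this paper, while the remaining counts are made up for illustration.

```python
def build_table(ngram_counts):
    """Map each space-free surface string to its most frequent segmented n-gram."""
    best = {}
    for ngram, freq in ngram_counts.items():
        key = ngram.replace(" ", "")
        if key not in best or freq > best[key][1]:
            best[key] = (ngram, freq)
    return {key: ngram for key, (ngram, _) in best.items()}


def min_ngram_match(segment, table, max_ngrams=None):
    """Cover `segment` with the fewest n-grams from `table`; return it segmented."""
    n = len(segment)
    # best[i] holds (number of n-grams, their segmented forms) covering segment[:i].
    best = [None] * (n + 1)
    best[0] = (0, [])
    for end in range(1, n + 1):
        for start in range(end):
            if best[start] is None:
                continue
            piece = segment[start:end]
            if piece in table:
                candidate = (best[start][0] + 1, best[start][1] + [table[piece]])
                if best[end] is None or candidate[0] < best[end][0]:
                    best[end] = candidate
    if best[n] is None:
        return segment  # no sequence of known n-grams matches the segment
    if max_ngrams is not None and best[n][0] > max_ngrams:
        return segment  # too many n-grams needed: leave the segment as it is
    return " ".join(best[n][1])


counts = {"awa": 289, "a wa": 181, "ki": 10, "a": 50, "wa": 40}
table = build_table(counts)
single = min_ngram_match("awa", table)
double = min_ngram_match("kiawa", table)
limited = min_ngram_match("kiawa", table, max_ngrams=1)
```

With these counts the segment awa is kept as a single token, because the 1-gram awa is more frequent than the 2-gram a wa and a single n-gram is preferred over the two 1-grams a and wa.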
The evaluation experiments revealed that the proposed approach is capable of achieving overall results comparable with the other best-performing models, especially when we take into account the variance in notation of certain lexical items by different contemporary experts. Given its low computational cost and competitive results, we believe that MiNgMatch could be applied to other languages, and possibly to other Natural Language Processing problems, such as speech recognition.
In terms of precision of the segmentation process and accuracy for out-of-vocabulary words, the sequence labelling neural model turned out to be the best option. In order to achieve that, however, it needs to be presented with training data tailored to the task, closely mimicking the intended target data.
To this end, we demonstrated that such data can be bootstrapped from a small amount of manually annotated text, using the Universal Segmenter itself.
Important tasks for the future include performing experiments with the proposed algorithm on other languages and implementing an ensemble segmenter combining an n-gram model (such as MiNgMatch) with a neural model performing word segmentation as character sequence labelling. Another area that requires improvement is the handling of OoV words. All lexical n-gram-based models applied in our experiments performed poorly in this respect, and our algorithm was no exception. One possible way to increase the MiNgMatch Segmenter's robustness against unseen forms might be to utilize character n-grams instead of word n-grams.

Listing 1. Sample from the n-gram data used by MiNgMatch Segmenter.

Table 1. Statistics of Ainu text collections and dictionaries used as the training data.

Table 3. Statistics of the samples used for evaluation and their modern transcription equivalents.

Table 4. Operations on training data for the Universal Segmenter.

Table 6. Evaluation results: Stupid Backoff model (best results in bold).

Table 7. Evaluation results: model with Kneser-Ney smoothing (best results in bold).

Table 8. Evaluation results: Universal Segmenter (best results in bold).

Table 9. US-ISP model (with 9-gram vectors): F-score for word boundaries not indicated in original transcription.

Table 11. Word-level Accuracy for OoV words (best models only).

Table 13. Statistics of word segmentation errors.

Table 16. Execution times in seconds.