Open Access Article

MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Department of Computer Science, Kitami Institute of Technology, 165 Koen-cho, Kitami, Hokkaido 090-8507, Japan
*
Author to whom correspondence should be addressed.
Information 2019, 10(10), 317; https://doi.org/10.3390/info10100317
Received: 12 September 2019 / Revised: 4 October 2019 / Accepted: 11 October 2019 / Published: 16 October 2019
(This article belongs to the Special Issue Computational Linguistics for Low-Resource Languages)
Word segmentation is an essential task in automatic language processing for languages that have no explicit word boundary markers, or whose space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter, a fast word segmentation algorithm that reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. To validate our method in a low-resource scenario involving extremely sparse data, we tested it on a small corpus of text in Ainu, the critically endangered language of the Ainu people living in the northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, the Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results demonstrate the high performance of our algorithm, comparable with that of the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
Keywords: word segmentation; tokenization; language modelling; n-gram models; Ainu language; endangered languages; under-resourced languages
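
The core idea stated in the abstract, reducing segmentation to finding the shortest sequence of lexical n-grams that matches the input, can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the function names `build_index` and `segment`, the toy English n-gram list, the tie-breaking rule, and the choice to return `None` for unmatched input are all assumptions made for illustration, and the abstract does not say how the actual system scores candidates or handles unknown strings.

```python
from typing import Dict, List, Optional, Tuple

NGram = Tuple[str, ...]


def build_index(ngrams: List[NGram]) -> Dict[str, NGram]:
    """Map the unspaced form of each n-gram to its word sequence.

    Hypothetical helper: if two n-grams share an unspaced form,
    the first one listed is kept.
    """
    index: Dict[str, NGram] = {}
    for ngram in ngrams:
        index.setdefault("".join(ngram), ngram)
    return index


def segment(text: str, index: Dict[str, NGram]) -> Optional[List[str]]:
    """Cover `text` with as few known n-grams as possible.

    Returns the resulting word list, or None when no full cover exists
    (an assumption; the source does not describe this case).
    """
    chars = text.replace(" ", "")
    n = len(chars)
    max_len = max((len(key) for key in index), default=0)

    # best[i] holds (n-grams used, words) for the prefix chars[:i].
    best: List[Optional[Tuple[int, List[str]]]] = [None] * (n + 1)
    best[0] = (0, [])

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev = best[start]
            if prev is None:
                continue
            ngram = index.get(chars[start:end])
            if ngram is None:
                continue
            count = prev[0] + 1
            current = best[end]
            # Strict comparison: on ties the earliest start wins.
            if current is None or count < current[0]:
                best[end] = (count, prev[1] + list(ngram))

    final = best[n]
    return final[1] if final is not None else None


# Toy n-gram inventory (illustrative only, not taken from the Ainu corpus).
toy_index = build_index([
    ("the",), ("cat",), ("sat",),
    ("the", "cat"), ("cat", "sat"),
    ("to",), ("day",), ("today",),
])

words = segment("thecatsat", toy_index)
```

Here "thecatsat" can be covered by three unigrams or by two n-grams, and the sketch prefers the two-n-gram cover. Similarly, "today" is kept as one word because a single matching n-gram beats the two-unigram cover "to" + "day".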
MDPI and ACS Style

Nowakowski, K.; Ptaszynski, M.; Masui, F. MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language. Information 2019, 10, 317.

Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.
